
Data Mining II PDF

62 Pages·2016·7.59 MB·English

Preview Data Mining II

2. Text Mining

D-BSSE, Karsten Borgwardt, Data Mining II Course, Basel, Spring Semester 2016 (slides 118-127 of 179)

Text Mining

Goals
- To learn key problems and techniques in mining one of the most common types of data
- To learn how to represent text numerically
- To learn how to make use of enormous amounts of unlabeled data
- To learn how to find co-occurring keywords in documents

2.1 Basics of Text Representation and Analysis

Based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13

What is text mining?

Definition
Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation
Most knowledge is stored in the form of text, both in industry and in academia. This alone makes text mining an integral part of knowledge discovery. Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

Why text mining?

Text data is growing in an unprecedented manner:
- Digital libraries
- Web and Web-enabled applications (e.g. social networks)
- Newswire services

Text mining terminology

Important definitions
- A set of features of text is also referred to as a lexicon.
- A document can be viewed either as a sequence or as a multidimensional record.
- A collection of documents is referred to as a corpus.

Special characteristics of text data
- Very sparse
- Diverse length
- Nonnegative statistics
- Side information is often available, e.g. hyperlinks, metadata
- Lots of unlabeled data

What is text mining?

Common tasks
- Information retrieval: find documents in a collection that are relevant to a user or to a query
  - Document ranking: rank all documents in the collection
  - Document selection: classify documents into relevant and irrelevant
- Information filtering: search newly created documents for information that is relevant to a user
- Document classification: assign a document to a category that describes its content
- Keyword co-occurrence: find groups of keywords that co-occur in many documents

Evaluation of text mining

Precision and Recall
Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.

The precision is the fraction of retrieved documents that are relevant to the query:

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \tag{1}$$

The recall is the fraction of relevant documents that were retrieved by the query:

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \tag{2}$$

Text representation

Tokenization
- Tokenization is the process of identifying keywords in a document.
- Not all words in a text are relevant.
- Text mining ignores stop words.
- Stop words form the stop list.
- Stop lists are context-dependent.
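The tokenization step and the precision and recall definitions in equations (1) and (2) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the slides: the tiny stop list, the four-document toy corpus, the relevance labels, and the helper names `tokenize`, `retrieve`, and `precision_recall`. The retrieval rule (keep any document that shares a token with the query) is a deliberately naive form of document selection.

```python
import re

# Hypothetical, deliberately tiny stop list; real stop lists are context-dependent.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "and", "are"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in stop_words]


def retrieve(query, corpus):
    """Toy document selection: keep documents sharing any token with the query."""
    query_tokens = set(tokenize(query))
    return {
        doc_id
        for doc_id, text in corpus.items()
        if query_tokens & set(tokenize(text))
    }


def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids, as in eqs. (1), (2)."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed toy corpus and ground-truth relevance labels for the query below.
corpus = {
    "d1": "Text mining is the analysis of text",
    "d2": "The stop list is context-dependent",
    "d3": "Mining of gold and silver",
    "d4": "An introduction to text retrieval",
}
relevant = {"d1", "d4"}

retrieved = retrieve("text mining", corpus)
precision, recall = precision_recall(relevant, retrieved)
print(sorted(retrieved), precision, recall)
```

In this toy setup the query also pulls in the document about gold mining, so every relevant document is retrieved but one retrieved document is irrelevant. This shows how recall can be perfect while precision is not.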

Description:
2.1 Basics of Text Representation and Analysis based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13. D-BSSE. Karsten Borgwardt. Data Mining II Course, Basel. Spring Semester 2016. 120 / 179