
Data Mining Methods for the Content Analyst: An Introduction to the Computational Analysis of Content PDF

121 Pages·2011·2.374 MB·English

Preview Data Mining Methods for the Content Analyst: An Introduction to the Computational Analysis of Content

RESEARCH METHODS IN HUMANITIES / SOCIAL SCIENCES

“This volume provides a rigorous yet readable introduction to digital textual analysis. Leetaru has provided a sorely needed and contemporary account of text analytics that makes it accessible to those not yet familiar with this increasingly important set of methods.”
–Marshall Scott Poole, Senior Research Scientist, NCSA, and Director of I-CHASS, University of Illinois Urbana-Champaign

“This is a much-needed book that fills a critical gap in offering a readily accessible and comprehensive overview of computer methods for the emerging field of digital humanities.”
–Orville Vernon Burton, Founding Director of the Institute of Computing in Humanities, Arts, and Social Sciences, and Director of Clemson University CyberInstitute

“Kalev Leetaru’s excellent book on data mining methods for the content analyst is a comprehensive study of the content analysis tools and techniques that have a profound impact not only on scientific research, but in humanities, finance, and public policy as well. Leetaru provides a wealth of detail and informed analysis on a subject of growing importance for researchers and those who serve them.”
–Bernard F. Reilly, Jr., President, The Center for Research Libraries

“This book is written by someone capable of conceptualizing the complexity of data mining and rendering the distinct approaches to it clear. It wisely avoids promoting specific software as it changes faster than a book can be updated.”
–Klaus Krippendorff, Professor of Communication at the Annenberg School for Communication, University of Pennsylvania

ISBN 978-0-415-89514-9
www.routledge.com
Cover image: © Getty Images

DATA MINING METHODS FOR THE CONTENT ANALYST

With continuous advancements and an increase in user popularity, data mining technologies serve as an invaluable resource for researchers across a wide range of disciplines in the humanities and social sciences. In this comprehensive guide, author and research scientist Kalev Leetaru introduces the approaches, strategies, and methodologies of current data mining techniques, offering insights for new and experienced users alike.

Designed as an instructive reference to computer-based analysis approaches, each chapter of this resource explains a set of core concepts and analytical data mining strategies, along with detailed examples and steps relating to current data mining practices. Every technique is considered with regard to context, theory of operation and methodological concerns, and focuses on the capabilities and strengths relating to these technologies. In addressing critical methodologies and approaches to automated analytical techniques, this work provides an essential overview to a broad innovative field.

Kalev Hannes Leetaru is Senior Research Scientist for Content Analysis at the University of Illinois Institute for Computing in Humanities, Arts, and Social Science, and Center Affiliate of the National Center for Supercomputing Applications. He leads a number of large initiatives centering on the application of high-performance computing to grand challenge problems using massive-scale document and data archives.

DATA MINING METHODS FOR THE CONTENT ANALYST
An Introduction to the Computational Analysis of Content

Kalev Hannes Leetaru

First published 2012 by Routledge
711 Third Avenue, New York, NY 10017

Simultaneously published in the UK by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2012 Taylor & Francis

The right of Kalev Hannes Leetaru to be identified as the author of this work has been asserted by him in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

First edition published by Lawrence Erlbaum Associates, Inc. 1997

Library of Congress Cataloging in Publication Data
Leetaru, Kalev.
Data mining methods for the content analyst : an introduction to the computational analysis of content / by Kalev Leetaru.
p. cm. (Routledge communication series)
Includes indexes.
1. Data mining. I. Title.
QA76.9.D343L45 2011
006.3′12-dc23

ISBN: 978-0-415-89513-2 (hbk)
ISBN: 978-0-415-89514-9 (pbk)
ISBN: 978-0-203-14938-6 (ebk)

Typeset in Bembo and Stone Sans by RefineCatch Limited, Bungay, Suffolk, UK

To my parents, Hannes and Marilyn.

CONTENTS

List of Tables and Figures
Acknowledgments

1 Introduction
  What Is Content Analysis?
  Why Use Computerized Analysis Techniques?
  Standalone Tools or Integrated Suites
  Transitioning from Theory to Practice
  Chapter in Summary

2 Obtaining and Preparing Data
  Collecting Data from Digital Text Repositories
  Are the Data Meaningful?
  Using Data in Unintended Ways
  Analytical Resolution
  Types of Data Sources
  Finding Sources
  Searching Text Collections
  Sources of Incompleteness
  Licensing Restrictions and Content Blackouts
  Measuring Viewership
  Accuracy and Convenience Samples
  Random Samples
  Multimedia Content
  Converting to Textual Format
  Prosody
  Example Data Sources
  Patterns in Historical War Coverage
  Competitive Intelligence
  Global News Coverage
  Downloading Content
  Digital Content
  Print Content
  Preparing Content
  Document Extraction
  Cleaning
  Post Filtering
  Reforming/Reshaping
  Content Proxy Extraction
  Chapter in Summary

3 Vocabulary Analysis
  The Basics
  Word Histograms
  Readability Indexes
  Normative Comparison
  Non-word Analysis
  Colloquialisms: Abbreviations and Slang
  Restricting the Analytical Window
  Vocabulary Comparison and Evolution/Chronemics
  Advanced Topics
  Syllables, Rhyming, and “Sounds Like”
  Gender and Language
  Authorship Attribution
  Word Morphology, Stemming, and Lemmatization
  Chapter in Summary

4 Correlation and Co-occurrence
  Understanding Correlation
  Computing Word Correlations
  Directionality
  Concordance
  Co-occurrence and Search
  Language Variation and Lexicons
  Non-co-occurrence
  Correlation with Metadata
  Chapter in Summary

5 Lexicons, Entity Extraction, and Geocoding
  Lexicons
  Lexicons and Categorization
  Lexical Correlation
  Lexicon Consistency Checks
  Thesauri and Vocabulary Expanders
  Named Entity Extraction
  Lexicons and Processing
  Applications
  Geocoding, Gazetteers, and Spatial Analysis
  Geocoding
  Gazetteers and the Geocoding Process
  Operating Under Uncertainty
  Spatial Analysis
  Chapter in Summary

6 Topic Extraction
  How Machines Process Text
  Unstructured Text
  Extracting Meaning from Text
  Applications of Topic Extraction
  Comparing/Clustering Documents
  Automatic Summarization
  Automatic Keyword Generation
  Multilingual Analysis: Topic Extraction with Multiple Languages
  Chapter in Summary

7 Sentiment Analysis
  Examining Emotions
  Evolution
  Evaluation
  Analytical Resolution: Documents versus Objects
  Hand-crafted versus Automatically Generated Lexicons
  Other Sentiment Scales
