Knowledge Discovery for Counterterrorism and Law Enforcement
David Skillicorn

© 2009 by Taylor & Francis Group, LLC

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions
David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: Advances in Algorithms, Theory, and Applications
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
David Skillicorn

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2009 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper

International Standard Book Number-13: 978-1-4200-7399-7 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Dedicated to the memory of those innocent people who lost their lives in the first Bali bombing, October 12th 2002.

Contents

Preface
List of Figures

1 Introduction
  1.1 What is ‘Knowledge Discovery’?
    1.1.1 Main Forms of Knowledge Discovery
    1.1.2 The Larger Process
  1.2 What is an Adversarial Setting?
  1.3 Algorithmic Knowledge Discovery
    1.3.1 What is Different about Adversarial Knowledge Discovery?
  1.4 State of the Art

2 Data
  2.1 Kinds of Data
    2.1.1 Data about Objects
    2.1.2 Low-Level Data
    2.1.3 Data about Connections
    2.1.4 Textual Data
    2.1.5 Spatial Data
  2.2 Data That Changes
    2.2.1 Slow Changes in the Underlying Situation
    2.2.2 Change is the Important Property
    2.2.3 Stream Data
  2.3 Fusion of Different Kinds of Data
  2.4 How is Data Collected?
    2.4.1 Transaction Endpoints
    2.4.2 Interaction Endpoints
    2.4.3 Observation Endpoints
    2.4.4 Human Data Collection
    2.4.5 Reasons for Data Collection
  2.5 Can Data Be Trusted?
    2.5.1 The Problem of Noise
  2.6 How Much Data?
    2.6.1 Data Interoperability
    2.6.2 Domain Knowledge

3 High-Level Principles
  3.1 What to Look for
  3.2 Subverting Knowledge Discovery
    3.2.1 Subverting the Data-Collection Phase
    3.2.2 Subverting the Analysis Phase
    3.2.3 Subverting the Decision-and-Action Phase
    3.2.4 The Difficulty of Fabricating Data
  3.3 Effects of Technology Properties
  3.4 Sensemaking and Situational Awareness
    3.4.1 Worldviews
  3.5 Taking Account of the Adversarial Setting over Time
  3.6 Does This Book Help Adversaries?
  3.7 What about Privacy?

4 Looking for Risk – Prediction and Anomaly Detection
  4.1 Goals
    4.1.1 Misconceptions
    4.1.2 The Problem of Human Variability
    4.1.3 The Problem of Computational Difficulty
    4.1.4 The Problem of Rarity
    4.1.5 The Problem of Justifiable Preemption
    4.1.6 The Problem of Hindsight Bias
    4.1.7 What are the Real Goals?
  4.2 Outline of Prediction Technology
    4.2.1 Building Predictors
    4.2.2 Attributes
    4.2.3 Missing Values
    4.2.4 Reasons for a Prediction
    4.2.5 Prediction Errors
    4.2.6 Reasons for Errors
    4.2.7 Ranking
    4.2.8 Prediction with an Associated Confidence
  4.3 Concealment Opportunities
  4.4 Technologies
    4.4.1 Decision Trees
    4.4.2 Ensembles of Predictors
    4.4.3 Random Forests
    4.4.4 Support Vector Machines
    4.4.5 Neural Networks
    4.4.6 Rules
    4.4.7 Attribute Selection
    4.4.8 Distributed Prediction
    4.4.9 Symbiotic Prediction
  4.5 Tactics and Process
  4.6 Extending the Process
  4.7 Special Case: Looking for Matches
  4.8 Special Case: Looking for Outliers
  4.9 Special Case: Frequency Ranking
    4.9.1 Frequent Records
    4.9.2 Records Seen Before
    4.9.3 Records Similar to Those Seen Before
    4.9.4 Records with Some Other Frequency
  4.10 Special Case: Discrepancy Detection

5 Looking for Similarity – Clustering
  5.1 Goals
  5.2 Outline of Clustering Technology
  5.3 Concealment Opportunities
  5.4 Technologies
    5.4.1 Distance-Based Clustering
    5.4.2 Density-Based Clustering
    5.4.3 Distribution-Based Clustering
    5.4.4 Decomposition-Based Clustering
    5.4.5 Hierarchical Clustering
    5.4.6 Biclustering
    5.4.7 Clusters and Prediction
    5.4.8 Symbiotic Clustering
  5.5 Tactics and Process
  5.6 Special Case – Looking for Outliers Revisited
Description: