Foundations of Data Science¹

John Hopcroft   Ravindran Kannan

Version 21/8/2014

These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete. However, the notes are in good enough shape to prepare lectures for a modern theoretical course in computer science. Please do not put solutions to exercises online as it is important for students to work out solutions for themselves rather than copy them from the internet. Thanks JEH

¹ Copyright 2011. All rights reserved.

Contents

1 Introduction
2 High-Dimensional Space
  2.1 Properties of High-Dimensional Space
  2.2 The Law of Large Numbers
  2.3 The High-Dimensional Sphere
    2.3.1 The Sphere and the Cube in High Dimensions
    2.3.2 Volume and Surface Area of the Unit Sphere
    2.3.3 The Volume is Near the Equator
    2.3.4 The Volume is in a Narrow Annulus
    2.3.5 The Surface Area is Near the Equator
  2.4 Volumes of Other Solids
  2.5 Generating Points Uniformly at Random on the Surface of a Sphere
  2.6 Gaussians in High Dimension
  2.7 Bounds on Tail Probability
  2.8 Applications of the tail bound
  2.9 Random Projection and Johnson-Lindenstrauss Theorem
  2.10 Bibliographic Notes
  2.11 Exercises
3 Best-Fit Subspaces and Singular Value Decomposition (SVD)
  3.1 Singular Vectors
  3.2 Singular Value Decomposition (SVD)
  3.3 Best Rank k Approximations
  3.4 Left Singular Vectors
  3.5 Power Method for Computing the Singular Value Decomposition
  3.6 Applications of Singular Value Decomposition
    3.6.1 Principal Component Analysis
    3.6.2 Clustering a Mixture of Spherical Gaussians
    3.6.3 Spectral Decomposition
    3.6.4 Singular Vectors and Ranking Documents
    3.6.5 An Application of SVD to a Discrete Optimization Problem
  3.7 Singular Vectors and Eigenvectors
  3.8 Bibliographic Notes
  3.9 Exercises
4 Random Graphs
  4.1 The G(n,p) Model
    4.1.1 Degree Distribution
    4.1.2 Existence of Triangles in G(n,d/n)
  4.2 Phase Transitions
  4.3 The Giant Component
  4.4 Branching Processes
  4.5 Cycles and Full Connectivity
    4.5.1 Emergence of Cycles
    4.5.2 Full Connectivity
    4.5.3 Threshold for O(ln n) Diameter
  4.6 Phase Transitions for Increasing Properties
  4.7 Phase Transitions for CNF-sat
  4.8 Nonuniform and Growth Models of Random Graphs
    4.8.1 Nonuniform Models
    4.8.2 Giant Component in Random Graphs with Given Degree Distribution
  4.9 Growth Models
    4.9.1 Growth Model Without Preferential Attachment
    4.9.2 Growth Model With Preferential Attachment
  4.10 Small World Graphs
  4.11 Bibliographic Notes
  4.12 Exercises
5 Random Walks and Markov Chains
  5.1 Stationary Distribution
  5.2 Electrical Networks and Random Walks
  5.3 Random Walks on Undirected Graphs with Unit Edge Weights
  5.4 Random Walks in Euclidean Space
  5.5 The Web as a Markov Chain
  5.6 Markov Chain Monte Carlo
    5.6.1 Metropolis-Hasting Algorithm
    5.6.2 Gibbs Sampling
  5.7 Areas and Volumes
  5.8 Convergence of Random Walks on Undirected Graphs
    5.8.1 Using Normalized Conductance to Prove Convergence
  5.9 Bibliographic Notes
  5.10 Exercises
6 Learning and VC-dimension
  6.1 Learning
  6.2 Linear Separators, the Perceptron Algorithm, and Margins
  6.3 Nonlinear Separators, Support Vector Machines, and Kernels
  6.4 Strong and Weak Learning - Boosting
  6.5 Number of Examples Needed for Prediction: VC-Dimension
  6.6 Vapnik-Chervonenkis or VC-Dimension
    6.6.1 Examples of Set Systems and Their VC-Dimension
    6.6.2 The Shatter Function
    6.6.3 Shatter Function for Set Systems of Bounded VC-Dimension
    6.6.4 Intersection Systems
  6.7 The VC Theorem
  6.8 Simple Learning
  6.9 Bibliographic Notes
  6.10 Exercises
7 Algorithms for Massive Data Problems
  7.1 Frequency Moments of Data Streams
    7.1.1 Number of Distinct Elements in a Data Stream
    7.1.2 Counting the Number of Occurrences of a Given Element
    7.1.3 Counting Frequent Elements
    7.1.4 The Second Moment
  7.2 Matrix Algorithms Using Sampling
    7.2.1 Matrix Multiplication Using Sampling
    7.2.2 Sketch of a Large Matrix
  7.3 Sketches of Documents
  7.4 Exercises
8 Clustering
  8.1 Some Clustering Examples
  8.2 A k-means Clustering Algorithm
  8.3 A Greedy Algorithm for k-Center Criterion Clustering
  8.4 Spectral Clustering
  8.5 Recursive Clustering Based on Sparse Cuts
  8.6 Kernel Methods
  8.7 Agglomerative Clustering
  8.8 Dense Submatrices and Communities
  8.9 Flow Methods
  8.10 Finding a Local Cluster Without Examining the Whole Graph
  8.11 Axioms for Clustering
    8.11.1 An Impossibility Result
    8.11.2 A Satisfiable Set of Axioms
  8.12 Exercises
9 Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
  9.1 Topic Models
  9.2 Hidden Markov Model
  9.3 Graphical Models, and Belief Propagation
  9.4 Bayesian or Belief Networks
  9.5 Markov Random Fields
  9.6 Factor Graphs
  9.7 Tree Algorithms
  9.8 Message Passing in general Graphs
  9.9 Graphs with a Single Cycle
  9.10 Belief Update in Networks with a Single Loop
  9.11 Maximum Weight Matching
  9.12 Warning Propagation
  9.13 Correlation Between Variables
  9.14 Exercises
10 Other Topics
  10.1 Rankings
  10.2 Hare System for Voting
  10.3 Compressed Sensing and Sparse Vectors
    10.3.1 Unique Reconstruction of a Sparse Vector
    10.3.2 The Exact Reconstruction Property
    10.3.3 Restricted Isometry Property
  10.4 Applications
    10.4.1 Sparse Vector in Some Coordinate Basis
    10.4.2 A Representation Cannot be Sparse in Both Time and Frequency Domains
    10.4.3 Biological
    10.4.4 Finding Overlapping Cliques or Communities
    10.4.5 Low Rank Matrices
  10.5 Gradient
  10.6 Linear Programming
    10.6.1 The Ellipsoid Algorithm
  10.7 Integer Optimization
  10.8 Semi-Definite Programming
  10.9 Exercises
11 Appendix
  11.1 Asymptotic Notation
  11.2 Useful relations
  11.3 Useful Inequalities
  11.4 Probability
    11.4.1 Sample Space, Events, Independence
    11.4.2 Linearity of Expectation
    11.4.3 Union Bound
    11.4.4 Indicator Variables
    11.4.5 Variance
    11.4.6 Variance of the Sum of Independent Random Variables
    11.4.7 Median
    11.4.8 The Central Limit Theorem
    11.4.9 Probability Distributions
    11.4.10 Bayes Rule and Estimators
    11.4.11 Tail Bounds and Chernoff inequalities
  11.5 Eigenvalues and Eigenvectors
    11.5.1 Eigenvalues and Eigenvectors
    11.5.2 Symmetric Matrices
    11.5.3 Relationship between SVD and Eigen Decomposition
    11.5.4 Extremal Properties of Eigenvalues
    11.5.5 Eigenvalues of the Sum of Two Symmetric Matrices
    11.5.6 Norms
    11.5.7 Important Norms and Their Properties
    11.5.8 Linear Algebra
    11.5.9 Distance between subspaces
  11.6 Generating Functions
    11.6.1 Generating Functions for Sequences Defined by Recurrence Relationships
    11.6.2 The Exponential Generating Function and the Moment Generating Function
  11.7 Miscellaneous
    11.7.1 Lagrange multipliers
    11.7.2 Finite Fields
    11.7.3 Hash Functions
    11.7.4 Application of Mean Value Theorem
    11.7.5 Sperner’s Lemma
    11.7.6 Prüfer
  11.8 Exercises
Index

Foundations of Data Science

John Hopcroft and Ravindran Kannan

21/8/2014

1 Introduction

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

Early drafts of the book have been used for both undergraduate and graduate courses. Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems.

This book starts with the treatment of high dimensional geometry. Modern data in diverse fields such as Information Processing, Search, Machine Learning, etc., is often
represented advantageously as vectors with a large number of components. This is so even in cases when the vector representation is not the natural first choice. Our intuition from two or three dimensional space can be surprisingly off the mark when it comes to high dimensional space. Chapter 2 works out the fundamentals needed to understand the differences. The emphasis of the chapter, as well as the book in general, is to get across the mathematical foundations rather than dwell on particular applications that are only briefly described.

The mathematical areas most relevant to dealing with high-dimensional data are matrix algebra and algorithms. We focus on singular value decomposition, a central tool in this area. Chapter 3 gives a from-first-principles description of this. Applications of singular value decomposition include principal component analysis, a widely used technique which we touch upon, as well as modern applications to statistical mixtures of probability densities, discrete optimization, etc., which are described in more detail.

Central to our understanding of large structures, like the web and social networks, is building models to capture essential properties of these structures. The simplest model is that of a random graph formulated by Erdős and Rényi, which we study in detail, proving that certain global phenomena, like a giant connected component, arise in such structures with only local choices. We also describe other models of random graphs.

One of the surprises of computer science over the last two decades is that some domain-independent methods have been immensely successful in tackling problems from diverse areas. Machine learning is a striking example. We describe the foundations of machine learning, both learning from given training examples, as well as the theory of Vapnik-Chervonenkis dimension, which tells us how many training examples suffice for learning. Another important domain-independent technique is based on Markov chains. The underlying mathematical theory, as well as the connections to electrical networks, forms the core of our chapter on Markov chains.

The field of algorithms has traditionally assumed that the input data to a problem is presented in random access memory, which the algorithm can repeatedly access. This is not feasible for modern problems. The streaming model and other models have been formulated to better reflect this. In this setting, sampling plays a crucial role and, indeed, we have to sample on the fly. In Chapter 7 we study how to draw good samples efficiently and how to estimate statistical, as well as linear algebra, quantities with such samples.

One of the most important tools in the modern toolkit is clustering, dividing data into groups of similar objects. After describing some of the basic methods for clustering, such as the k-means algorithm, we focus on modern developments in understanding these, as well as newer algorithms. The chapter ends with a study of clustering criteria.

This book also covers graphical models and belief propagation, ranking and voting, sparse vectors, and compressed sensing. The appendix includes a wealth of background material.

A word about notation in the book. To help the student, we have adopted certain notations, and with a few exceptions, adhered to them. We use lower case letters for scalar variables and functions, bold face lower case for vectors, and upper case letters for matrices.
Lower case letters near the beginning of the alphabet tend to be constants; letters in the middle of the alphabet, such as i, j, and k, are indices in summations; n and m are used for integer sizes; and x, y, and z for variables. Where the literature traditionally uses a symbol for a quantity, we also use that symbol, even if it means abandoning our convention. If we have a set of points in some vector space and work with a subspace, we use n for the number of points, d for the dimension of the space, and k for the dimension of the subspace. The term "almost surely" means with probability one. We use ln n for the natural logarithm and log n for the base two logarithm; if we want base ten, we will use log₁₀. To simplify notation and to make it easier to read, we use E²(1−x) for (E(1−x))² and E(1−x)² for E((1−x)²).

2 High-Dimensional Space

In many applications data is in the form of vectors. In other applications, data is not in the form of vectors, but could be usefully represented by vectors. The Vector Space Model [SWY75] is a good example. In the vector space model, a document is represented by a vector, each component of which corresponds to the number of occurrences of a particular term in the document. The English language has on the order of 25,000 words or terms, so each document is represented by a 25,000 dimensional vector. A collection of n documents is represented by a collection of n vectors, one vector per document. The vectors may be arranged as columns of a 25,000 × n matrix. See Figure 2.1. A query is also represented by a vector in the same space. The component of the vector corresponding to a term in the query specifies the importance of the term to the query. To find documents about cars that are not race cars, a query vector will have a large positive component for the word car and also for the words engine and perhaps door, and a negative component for the words race, betting, etc.

One needs a measure of relevance or similarity of a query to a document. The dot product or cosine of the angle between the two vectors is an often used measure of similarity. To respond to a query, one computes the dot product or the cosine of the angle between the query vector and each document vector and returns the documents with the highest values of these quantities. While it is by no means clear that this approach will do well for the information retrieval problem, many empirical studies have established the effectiveness of this general approach.
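As a concrete illustration of the model just described, the short Python sketch below builds a tiny term-document matrix and scores each document against a query by dot product and by the cosine of the angle. The five-word vocabulary, the counts, and the query weights are invented for illustration; they are not data from the text.

import numpy as np

# Tiny illustrative vocabulary; a real vector space model would use on the order of 25,000 terms.
vocabulary = ["car", "engine", "door", "race", "betting"]

# Columns are documents; entry (i, j) is the number of occurrences of term i in document j.
A = np.array([[3, 2, 0],    # car
              [1, 2, 0],    # engine
              [1, 0, 0],    # door
              [2, 0, 1],    # race
              [0, 0, 2]],   # betting
             dtype=float)

# Query for "cars that are not race cars": positive weights on car, engine, and door,
# negative weights on race and betting, as in the example above.
q = np.array([2.0, 1.0, 1.0, -2.0, -1.0])

dot_products = A.T @ q
cosines = dot_products / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Return documents in decreasing order of similarity to the query.
ranking = np.argsort(-cosines)
print("cosine similarities:", np.round(cosines, 3))
print("documents ranked by relevance:", ranking)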
The vector space model is useful in ranking or ordering a large collection of documents in decreasing order of importance. For large collections, an approach based on human understanding of each document is not feasible. Instead, an automated procedure is needed that is able to rank documents with those central to the collection ranked highest. Each document is represented as a vector, with the vectors forming the columns of a matrix A. The similarity of pairs of documents is defined by the dot product of the vectors. All pairwise similarities are contained in the matrix product AᵀA. If one assumes that the documents central to the collection are those with high similarity to other documents, then computing AᵀA enables one to create a ranking. Define the total similarity of document i to be the sum of the entries in the ith row of AᵀA and rank documents by their total similarity.

It turns out that with the vector representation on hand, a better way of ranking is to first find the best-fit direction: that is, the unit vector u for which the sum of squared perpendicular distances of all the vectors to the line through u is minimized. See Figure 2.2. Then one ranks the vectors according to their dot product with u. The best-fit direction is a well-studied notion in linear algebra. Elegant theory and efficient algorithms, presented in Chapter 3, facilitate the ranking as well as applications in many other domains. Both ranking schemes are illustrated in the short sketch below.

In the vector space representation of data, properties of vectors such as dot products,
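The following Python sketch implements the two ranking schemes described above: total similarity from the row sums of AᵀA, and ranking by dot product with the best-fit direction u, computed here from the singular value decomposition developed in Chapter 3. The small random matrix is an invented stand-in for a real term-document matrix, used only to make the sketch self-contained.

import numpy as np

# Illustrative stand-in for a term-document matrix: 50 "terms" (rows), 8 "documents" (columns).
rng = np.random.default_rng(0)
A = rng.random((50, 8))

# Ranking 1: total similarity. A^T A holds all pairwise dot products of documents;
# the total similarity of document i is the sum of the entries in row i of A^T A.
total_similarity = (A.T @ A).sum(axis=1)
rank_by_total = np.argsort(-total_similarity)

# Ranking 2: best-fit direction. The unit vector u minimizing the sum of squared
# perpendicular distances of the columns of A to the line through u is the first
# left singular vector of A.
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
u = U[:, 0]
if u.sum() < 0:      # the sign of a singular vector is arbitrary; flip for readability
    u = -u
rank_by_best_fit = np.argsort(-(A.T @ u))

print("rank by total similarity:  ", rank_by_total)
print("rank by best-fit direction:", rank_by_best_fit)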
