Programming Collective Intelligence: Building Smart Web 2.0 Applications PDF

360 Pages·2007·3.31 MB·English
Preview Programming Collective Intelligence: Building Smart Web 2.0 Applications

Praise for Programming Collective Intelligence

“I review a few books each year, and naturally, I read a fair number during the course of my work. And I have to admit that I have never had quite as much fun reading a preprint of a book as I have in reading this. Bravo! I cannot think of a better way for a developer to first learn these algorithms and methods, nor can I think of a better way for me (an old AI dog) to reinvigorate my knowledge of the details.”
—Dan Russell, Uber Tech Lead, Google

“Toby’s book does a great job of breaking down the complex subject matter of machine-learning algorithms into practical, easy-to-understand examples that can be used directly to analyze social interaction across the Web today. If I had this book two years ago, it would have saved me precious time going down some fruitless paths.”
—Tim Wolters, CTO, Collective Intellect

“Programming Collective Intelligence is a stellar achievement in providing a comprehensive collection of computational methods for relating vast amounts of data. Specifically, it applies these techniques in context of the Internet, finding value in otherwise isolated data islands. If you develop for the Internet, this book is a must-have.”
—Paul Tyma, Senior Software Engineer, Google

Programming Collective Intelligence

Other resources from O’Reilly

Related titles
Web 2.0 Report
AI for Game Developers
Learning Python
Mastering Algorithms with C
Mastering Algorithms with Perl

oreilly.com
oreilly.com is more than a complete catalog of O’Reilly books. You’ll also find links to news, events, articles, weblogs, sample chapters, and code examples. oreillynet.com is the essential portal for developers interested in open and emerging technologies, including new platforms, programming languages, and operating systems.

Conferences
O’Reilly brings diverse innovators together to nurture the ideas that spark revolutionary industries. We specialize in documenting the latest tools and systems, translating the innovator’s knowledge into useful skills for those in the trenches.
Visit conferences.oreilly.com for our upcoming events.

Safari Bookshelf (safari.oreilly.com) is the premier online reference library for programmers and IT professionals. Conduct searches across more than 1,000 books. Subscribers can zero in on answers to time-critical questions in a matter of seconds. Read the books on your Bookshelf from cover to cover or simply flip to the page you need. Try it today for free.

Programming Collective Intelligence
Building Smart Web 2.0 Applications
Toby Segaran

Beijing • Cambridge • Farnham • Köln • Paris • Sebastopol • Taipei • Tokyo

Programming Collective Intelligence
by Toby Segaran

Copyright © 2007 Toby Segaran. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 [email protected].

Editor: Mary Treseler O’Brien
Production Editor: Sarah Schneider
Copyeditor: Amy Thomson
Proofreader: Sarah Schneider
Indexer: Julie Hawks
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrators: Robert Romano and Jessamyn Read

Printing History:
August 2007: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Programming Collective Intelligence, the image of King penguins, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN-10: 0-596-52932-5
ISBN-13: 978-0-596-52932-1
[M]

Table of Contents

Preface

1. Introduction to Collective Intelligence
    What Is Collective Intelligence?
    What Is Machine Learning?
    Limits of Machine Learning
    Real-Life Examples
    Other Uses for Learning Algorithms

2. Making Recommendations
    Collaborative Filtering
    Collecting Preferences
    Finding Similar Users
    Recommending Items
    Matching Products
    Building a del.icio.us Link Recommender
    Item-Based Filtering
    Using the MovieLens Dataset
    User-Based or Item-Based Filtering?
    Exercises

3. Discovering Groups
    Supervised versus Unsupervised Learning
    Word Vectors
    Hierarchical Clustering
    Drawing the Dendrogram
    Column Clustering
    K-Means Clustering
    Clusters of Preferences
    Viewing Data in Two Dimensions
    Other Things to Cluster
    Exercises

4. Searching and Ranking
    What’s in a Search Engine?
    A Simple Crawler
    Building the Index
    Querying
    Content-Based Ranking
    Using Inbound Links
    Learning from Clicks
    Exercises

5. Optimization
    Group Travel
    Representing Solutions
    The Cost Function
    Random Searching
    Hill Climbing
    Simulated Annealing
    Genetic Algorithms
    Real Flight Searches
    Optimizing for Preferences
    Network Visualization
    Other Possibilities
    Exercises

6. Document Filtering
    Filtering Spam
    Documents and Words
    Training the Classifier
    Calculating Probabilities
    A Naïve Classifier
    The Fisher Method
    Persisting the Trained Classifiers
    Filtering Blog Feeds
    Improving Feature Detection
    Using Akismet
    Alternative Methods
    Exercises

7. Modeling with Decision Trees
    Predicting Signups
    Introducing Decision Trees
    Training the Tree
    Choosing the Best Split
    Recursive Tree Building
    Displaying the Tree
    Classifying New Observations
    Pruning the Tree
    Dealing with Missing Data
    Dealing with Numerical Outcomes
    Modeling Home Prices
    Modeling “Hotness”
    When to Use Decision Trees
    Exercises

8. Building Price Models
    Building a Sample Dataset
    k-Nearest Neighbors
    Weighted Neighbors
    Cross-Validation
    Heterogeneous Variables
    Optimizing the Scale
    Uneven Distributions
    Using Real Data—the eBay API
    When to Use k-Nearest Neighbors
    Exercises

9. Advanced Classification: Kernel Methods and SVMs
    Matchmaker Dataset
    Difficulties with the Data
    Basic Linear Classification
    Categorical Features
    Scaling the Data

Description:
Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in thi