MEAP Edition
Manning Early Access Program
Algorithms of the Intelligent Web, Second Edition
Version 9

Copyright 2016 Manning Publications

For more information on this and other Manning titles go to www.manning.com
https://forums.manning.com/forums/algorithms-of-the-intelligent-web-second-edition

welcome

Thank you for purchasing the MEAP edition of Algorithms of the Intelligent Web, Second Edition. Intelligent algorithms are a very hot topic right now, and some of this interest can be laid at the door of the giants of the web. Google, Facebook, LinkedIn, and many others have all openly discussed their research to better deal with the vast quantities of data generated by their users, and to use this data to better target messages toward you, the reader of this book!

While much of this research is open to the public, it can often seem difficult to get started in this complex area. The development of intelligent algorithms sits at the intersection of several disciplines, and the practitioner is often well versed in databases and data systems, statistics, and machine learning. Conscious that it is difficult to obtain this level of exposure without being immersed in the field, we decided to rewrite Algorithms of the Intelligent Web with this in mind.

In these chapters you will find many of the core algorithms used to make real-time decisions on the web. Concepts are introduced in plain English and illustrated with Python’s scikit-learn. We’ll navigate a host of algorithms together and provide you with the basic mathematics to illuminate them, without drowning you in Greek symbols.

We hope that you enjoy the second edition of Algorithms of the Intelligent Web and that it will occupy an important place on your digital (and physical!) bookshelf. We also encourage you to post any questions or comments you have about the content in the book’s forum. We appreciate knowing where we can make improvements and increase your understanding of the material.
Dr Douglas McIlwraith

©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.
https://forums.manning.com/forums/algorithms-of-the-intelligent-web-second-edition

brief contents

1 Building applications for the intelligent web
2 Extracting structure from data: clustering and transforming your data
3 Recommending relevant content
4 Classification: placing things where they belong
5 Case study: click prediction for online advertising
6 Deep learning and neural networks
7 Making the right choice
8 The future of the intelligent web

APPENDIXES
A Capturing Data on the Web

1 Building applications for the intelligent web

This chapter covers
- Recognizing intelligence on the web
- Types of intelligent algorithms
- Evaluating intelligent algorithms

The intelligent web means different things to different people. To some it represents the evolution of the web into a more responsive and useful entity that can learn from and react to its users. To others it represents the inclusion of the web into many more aspects of our lives. To me, far from being the first iteration of Skynet, in which computers take over in a dystopian future, the intelligent web is about designing and implementing more naturally responsive applications that make our online experiences better in some quantifiable way.

There’s a good chance that every reader has encountered machine intelligence on many separate occasions, and this chapter will highlight some examples so that you’ll be better equipped to recognize these in the future. This will, in turn, help you understand what’s really happening under the hood when you interact with intelligent applications again.

Now that you know this book isn’t about writing entities that will try to take over the world, we should perhaps discuss some other things that you won’t find within these pages!
First, this is very much a back-end book. In these pages, you won’t learn about beautiful interactive visualizations or platforms. For this, we refer you to excellent publications by Scott Murray,[1] David McCandless,[2] and Edward Tufte.[3] Suffice it to say that we don’t have space within these pages to do this topic justice along with what we’re about to cover. Also, this book won’t teach you statistics, but to gain the most from this book we’ll assume you have a 101 level of knowledge or higher, in that you should have at least taken a course in statistics at some point in the past.

This is also not a book about data science. A plethora of titles are available that will help the data science practitioner, and I do hope that this book will be useful to data scientists, but these chapters contain little detail about how to be a data scientist. For these topics we refer you to the texts by Joel Grus[4] and Foster Provost and Tom Fawcett.[5]

Nor is this a detailed book about algorithm design. We’ll often skim over the details of how algorithms are designed and provide more intuition than a deep dive into specifics. This will allow us to cover much more ground, perhaps at the cost of some rigor. Think of each chapter as a trail of breadcrumbs leading you through the important aspects of that approach and nudging you toward resources where you can learn more.

Although many of the examples within the pages of this book are written using scikit-learn,[6] this is not a book about scikit-learn! This is merely the tool by which we can demonstrate the approaches presented in this book. We’ll never provide an example without at least an intuitive introduction as to why the algorithm works.
In some cases we’ll go deeper, but in many cases you should continue your research outside the pages of this book.

So what then is this book about? Within the pages of this book we’ll cover the tools that provide an end-to-end view of intelligent algorithms as we see them today. We’ll talk about the information that’s collected about you, the average web user, and how that information can be channeled into useful streams so that it can be used to make predictions about your behavior, changing those predictions as your behavior changes. This means that we’ll often deviate from the standard “algorithm 101” book, in favor of giving you a flavor (!) of all the important aspects of intelligent algorithms.

We’ll even discuss (in appendix A) a publish/subscribe technology that allows large quantities of data to be organized during ingestion. Although this has no place in a book that’s strictly about data science or algorithms, we believe that it has a fundamental place in a book about the intelligent web. This doesn’t mean that we ignore data science or algorithms; quite the contrary! We’ll cover most of the important algorithms used by the majority of the leading players in the intelligent algorithm space. Where possible, we reference known examples of these in the wild, so that you can test your knowledge against the behavior of such systems, and no doubt impress your friends!

[1] Scott Murray, Interactive Data Visualization for the Web (O’Reilly, 2013).
[2] David McCandless, Information is Beautiful (HarperCollins, 2010).
[3] Edward Tufte, The Visual Display of Quantitative Information (Graphics Press USA, 2001).
[4] Joel Grus, Data Science From Scratch: First Principles with Python (O’Reilly, 2015).
[5] Foster Provost and Tom Fawcett, Data Science for Business (O’Reilly Media, 2013).
[6] http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_classifier.html#example-mixture-plot-gmm-classifier-py.

But we’re getting ahead of ourselves. In this chapter we’ll provide several examples of the application of intelligent algorithms that you should find immediately recognizable. We’ll talk more about what intelligent algorithms can’t do, before providing you with a taxonomy of the field that can be used to hang your newly learned concepts upon. Finally, we’ll present you with a number of methods to evaluate intelligent algorithms and impart to you some useful things to know.

We already hear you asking, “What is an intelligent algorithm?” For the purposes of this book, we’ll refer to any algorithm that uses data to modify its behavior as intelligent. Remember that when you interact with an algorithm, you’re merely interacting with a set of distinct rules. Intelligent algorithms differ in that they can change their behavior as they run, often resulting in a user experience that many would say is intelligent.

Figure 1.1 summarizes this behavior. Here you see an intelligent algorithm responding to events within the environment and making decisions. By ingesting data from the context in which it operates (which may include the event itself), the algorithm is evolving. It evolves in the sense that the decision is no longer deterministic given the event. The intelligent algorithm may make different decisions at different points, depending on the data it has ingested.

Figure 1.1 Overview of an intelligent algorithm. Such algorithms display intelligence because the decisions they make change depending on the data they’ve received.

1.1 An intelligent algorithm in action: Google Now

To demonstrate this concept we’ll try to deconstruct the operation of Google Now.
Note that the specific details of this project are proprietary to Google and so we’ll be relying on our own experience to illustrate how this system might work internally.

For those of you with Android devices, this product may be immediately recognizable, but for the iOS users among us, Google Now is Google’s answer to Siri and has the product tagline “The right information at just the right time.” This is essentially an application that can use various sources of information and alert you to nearby restaurants, events, traffic jams, and the like that it believes will be of interest to you. To demonstrate the concept of an intelligent algorithm, let’s take an even more concrete example from Google Now. When it detects a traffic jam on your normal route to work, it will show you some information before you set off for work. Neat! But how might this be achieved?

First, let’s try to understand exactly what might be happening here. The application knows about your location through its GPS and registered wireless stations, so at any given time the application will know where you are, down to a reasonable level of granularity. In the context of figure 1.1, this is one aspect of the data that’s being used to change the behavior of the algorithm. From here it’s a short step to determine home and work locations. This is performed through the use of prior knowledge, that is, knowledge that has somehow been distilled into the algorithm before it started learning from the data. In this case, the prior knowledge could take the form of the following rules:

- The location most usually occupied overnight is home.
- The location most usually occupied during the day is work.
- People, in general, travel to their workplace and then home again almost every day.
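To make the first two rules concrete, here is a minimal Python sketch of how such priors might be applied to a log of timestamped location observations. Everything in it is an assumption made for illustration: the function name `infer_home_and_work`, the hour ranges that define "overnight" and "during the day", and the location labels (which in practice might be clustered GPS coordinates). It is not a description of how Google Now works internally.

```python
from collections import Counter
from datetime import datetime

# Hypothetical hour ranges encoding the prior knowledge. These cut-offs are
# assumptions for illustration only.
NIGHT_HOURS = set(range(0, 6)) | {22, 23}   # 22:00 to 05:59
DAY_HOURS = set(range(9, 17))               # 09:00 to 16:59


def infer_home_and_work(observations):
    """Return (home, work) as the most frequent night-time and day-time locations.

    observations: iterable of (datetime, location_label) pairs.
    Returns None for a location if no observation falls in its time window.
    """
    night_counts = Counter()
    day_counts = Counter()
    for timestamp, location in observations:
        if timestamp.hour in NIGHT_HOURS:
            night_counts[location] += 1
        elif timestamp.hour in DAY_HOURS:
            day_counts[location] += 1
    home = night_counts.most_common(1)[0][0] if night_counts else None
    work = day_counts.most_common(1)[0][0] if day_counts else None
    return home, work


# Made-up observations for a single user.
observations = [
    (datetime(2016, 3, 7, 2, 0), "cell_A"),
    (datetime(2016, 3, 7, 11, 0), "cell_B"),
    (datetime(2016, 3, 7, 23, 30), "cell_A"),
    (datetime(2016, 3, 8, 10, 0), "cell_B"),
    (datetime(2016, 3, 8, 13, 0), "cell_C"),
    (datetime(2016, 3, 8, 19, 0), "cell_D"),  # falls outside both windows
]

home, work = infer_home_and_work(observations)
print("home:", home, "work:", work)
```

Because the counts are recomputed from whatever observations are supplied, feeding the function only recent data would let the inferred locations shift after a house move or job change. A probabilistic model would go further and attach a likelihood to each candidate location rather than returning a single winner.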
Although this is an imperfect example, it does illustrate my point well: that a concept of work, home, and commuting exists within our society, and that together with the data and a model, inference can be performed, that is, we can determine the likely home and work locations along with likely commuting routes. We qualify our information with the word likely because many models will allow us to encapsulate the notion of probability or likelihood within our inferences.

When a new phone is purchased or a new account is registered with Google, it will take Google Now some time to reach these conclusions. Similarly, if users move to a different home or change jobs, it will take time for Google Now to relearn these locations. The speed at which the model can respond to change is referred to as the learning rate.

In order to display relevant information regarding travel route plans (to make a decision based on an event), we still don’t have enough information. The final piece of the puzzle is predicting when a user is about to leave one location and travel to the other. As with the home and work locations, we could model leaving times and update this model over time to reflect changing patterns of behavior. For a given time in the future, it’s now possible to provide a likelihood that a user is in a given location and is about to leave that location in favor of another. If this likelihood exceeds a threshold value, Google Now can perform a traffic search and return the results to the user as a notification.

This specific part of Google Now is quite complex and probably has its own team devoted to it, but you can easily see that the framework through which it operates is an intelligent algorithm: it has used data about your movements to understand your routine and tailor personalized responses for you (decisions) based on your current location (event). Figure 1.2 provides a graphical overview of this process.

Figure 1.2 Graphical overview of one aspect of the Google Now project. In order for Google Now to make predictions about your future locations, it uses a model about past locations, along with your current position. Priors are a way to distill initially known information into the system.

One interesting thing to note is that the product Google Now is probably using a whole suite of intelligent algorithms in the background. Algorithms perform text searches of your Google Calendar, trying to make sense of your schedule, while interest models churn away to try to decide which web searches are relevant and if new content should be flagged for your interest.

As a developer in the intelligent algorithm space, you’ll be called on to use your skills in this area to develop new solutions from complex requirements, carefully identifying each subsection of work that can be tackled with an existing class of intelligent algorithm. Each solution that you create should be grounded in and built on work in the field, much of which you’ll find within the pages of this text. We’ve introduced several key terms here in italics, and we’ll refer to these in the coming chapters as we tackle individual algorithms in more depth.

1.2 The intelligent algorithm lifecycle

In the previous section we introduced you to the concept of an intelligent algorithm as a black box, taking data and making predictions from events. We also drew on a more concrete example from Google, looking at its Google Now project. You might wonder then how intelligent algorithm designers come up with their solutions. There is a general lifecycle, adapted from Ben Fry’s Computational Information Design,[7] that you could refer to when designing your own solutions, as shown in figure 1.3.

Figure 1.3 The intelligent algorithm lifecycle

When designing intelligent algorithms you first must acquire data (the focus of appendix A) and then parse and clean it, because it’s often not in the format you require. You must then understand that data, which you can achieve through data exploration and visualization. Subsequently, you can represent that data in more appropriate formats (the focus of chapter 2). At this point you’re ready to train a model and evaluate the predictive power of your solution. Chapters 3 through 7 cover various models that you might use. At the output of any stage you can return to an earlier stage; the most common return paths are highlighted using dotted lines.

[7] Ben Fry, Computational Information Design (Boston, MA: MIT, 2004).

1.3 Further examples of intelligent algorithms

Let’s review some more applications that have been leveraging algorithmic intelligence over the last decade. A turning point in the history of the web was the advent of search engines,