Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications
Andrew Kelleher and Adam Kelleher

Part I
Principles of Framing

Chapter 1, "The Role of the Data Scientist," provides background information about the field of data science. This should serve as a starting point to gain context for the role of data science in industry.

Chapter 2, "Project Workflow," describes project workflow, and how it relates to the principles of agile software development.

Chapter 1
The Role of the Data Scientist

1.1 INTRODUCTION

We want to set the context for this book by exposing the reader to the focus on products, rather than methods, early on. Data scientists often take shortcuts, use rules of thumb, and forego rigor. They do this in favor of speed, accepting a reasonable level of uncertainty in the decisions they make. The world moves fast, and businesses don't have time for you to write a dissertation on error bars when they need answers to hard questions.

We'll begin by describing how the sizes of companies put different demands on a data scientist. Then, we'll describe agile development: the framework for building products that keeps them responsive to the world outside of the office. We'll discuss ladders and career development. These are useful for both the data scientists and the companies they work for. They lay out the expectations companies have for their scientists, and help scientists see which traits the company has found useful. Finally, we'll describe what data scientists actually "do" with their time.

1.2 THE ROLE OF THE DATA SCIENTIST

The role of the data scientist is different depending on the context. It's worth having an in-depth understanding of some of the factors that influence your role so you can adapt as your role changes. A lot of this chapter is informed by working within a company that grew from around 150 people to almost 1,500 in a few short years. As the size changed, the roles, supporting structure, management, interdepartmental communications, infrastructure, and expectations of the role changed with it. Adam came in as a data scientist at 300 people, and Andrew as an engineer at 150 people. We both stayed on as the company grew over the years. Here is some of what we learned.

1.2.1 Company Size

When the company was smaller, we tended to be generalists. We didn't have the head count to have people work on very specific tasks, even though that might lead to deeper analyses, depth of perspective, and specialized knowledge about the products. As a data scientist at a small company, Adam did analyses across several products, spanning several departments. As the company grew, our team's roles tended to get more specialized, and data scientists tended to start working more on one product or a small number of related products. There's an obvious benefit: they can have deep knowledge of a very sophisticated product, and so they have fuller context and a nuanced understanding that they might not achieve if they were working on several different products. A popular team structure is for a product to be built and maintained by a small, mostly autonomous team. We'll go into detail on that in the next section.
When our company was smaller, team members often performed a much more general role, acting as the machine learning engineer, the data analyst, the quantitative researcher, and even the product manager and project manager. As the company grew, it hired more people to take on these roles, and so team members' roles became more specialized.

1.2.2 Team Context

Most of the context of this book will be for data scientists working in small, autonomous teams, roughly following the Agile Manifesto. This was largely developed in the context of software engineering, and so the focus is on producing code. It extends well to executing data science projects. The manifesto is as follows:

• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan

This list indicates where the priorities lie. The items on the right of each bullet are still very important, but the items on the left are the priorities. This means that team structure is flat, with more experienced people working alongside (rather than above) more junior people. They share skills through interactions like pair-coding and peer reviewing each other's code. A great benefit to this is that everyone learns quickly, from direct interactions with more senior teammates as peers. A drawback is that there can be a little friction when senior developers have their code reviewed by junior team members.

The team's overall goal is to produce working software quickly, and so it's okay to procrastinate on documentation. There is generally less focus on process, and more on getting things done. As long as the team knows what's going on, and they're capable of onboarding new members efficiently enough, they can focus on the work of shipping products. On the other side of this, the focus on moving fast causes teams to take shortcuts. This can lead systems to be more fragile. It can also create an ever-growing list of things to do more perfectly later. These tasks make up what is called technical debt. Much like debt in finance, it's a natural part of the process. Many argue, especially in smaller companies, that it's a necessary part of the process. You pay down enough of the debt, by writing documentation, making cleaner abstractions, and adding test coverage, to keep a sustainable pace of development and keep from introducing bugs.

Teams generally work directly with stakeholders, and data scientists often have a front-facing role in these interactions. There is constant feedback between teams and their stakeholders to make sure the project is still aligned with stakeholder priorities. This is opposed to "contract negotiation," where the requirements are laid out, and the team decouples from the stakeholders, delivering the product at a later date. In business, things move fast. Priorities change, and the team and product must adapt to those changes. Frequent feedback from stakeholders lets teams learn about changes quickly, and adapt to them before investing too much in the wrong product and features.

It's hard to predict the future. If you come up with a long- or moderate-term plan, priorities can shift, team structure can change, and the plan can fall apart. Planning is important, and trying to stick to a plan is important. You'll do all you can to make a plan for building an amazing product, but you'll often have to respond quickly and agilely to change. It can be hard to throw out your favorite plans as priorities shift, but it's a necessary part of the job.

The data scientist is an integral member of these teams. They help their team develop products, and help the product managers evaluate the product's performance. Throughout product development, there are critical decisions to make about its features. To that end, a data scientist works with product managers and engineers to formulate questions to answer. These can be as simple as "What unit on this page generates the most clicks?" or as complex as "How would the site perform if the recommender system didn't exist?" Data lets us answer these questions, and data scientists are the people who analyze and help interpret the data for making these decisions. They do this in the context of a dynamic team environment, and have to work quickly and effectively in response to change.
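To make the simpler question concrete, here is a minimal sketch of how a data scientist might answer it with pandas. The click-event log and its column names (page_unit, clicked) are hypothetical, made up for illustration:

    import pandas as pd

    # Hypothetical click-event log; in practice this would come from a
    # query against the company's event store, not an inline DataFrame.
    events = pd.DataFrame({
        "page_unit": ["hero", "sidebar", "hero", "footer", "sidebar", "hero"],
        "clicked":   [1,      0,         1,      0,        1,         0],
    })

    # Click-through rate per unit answers "which unit generates the
    # most clicks?" The highest-rate unit sorts to the top.
    ctr = events.groupby("page_unit")["clicked"].mean().sort_values(ascending=False)
    print(ctr)

The answer here takes a few lines; the harder part of the job is formulating the question with the product manager and engineers in the first place.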
1.2.3 Ladders and Career Development

Sometimes data scientists are contrasted against the data analyst. The data analyst has an overlapping skill set, which includes querying databases, making plots, doing statistics, and interpreting data. In addition to these, according to this view, a data scientist is someone who can build production machine-learning systems. If that were an apt view, then there might not be such a thing as a junior data scientist. It's not typical to start your career building production machine-learning systems.

Most companies have well-defined "ladders" for career advancement, with specific skills expected at each level. The team's goal is to build and ship products. There are many skills that are critically important for this that have nothing to do with data. Ladders go beyond technical skills to include communication skills, understanding project scope, and balancing long- and short-term goals.

Generally, companies will define an "individual contributor" track and a "management" track. Junior scientists will start in the same place, and shift onto a specific track as their skills develop. They generally start out being able to execute tasks on projects with guidance from more senior team members. They advance to being able to execute tasks more autonomously. Finally, they're the ones helping people execute tasks, and usually take more of a role in project planning. The shift often happens at this point, when they hit the "senior" level of their title.

1.2.4 Importance

The data scientist, like everyone on their team, has an important role. Analysis can lie on the "critical path" of a project's development. This means that the analysis might need to be finished before a project can proceed and be delivered. If a data scientist isn't skillful with their analysis, and delivers too slowly or incompletely, they might block progress. You don't want to be responsible for delaying the release of a product or feature!

Without data, decision-makers might lean more toward experience and intuition. While these aren't necessarily wrong, they're not the best way to make decisions. Adding data to the decision-making process moves business more toward science. The data scientist, then, has a critical role in making business decisions more rational.

1.2.5 The Work Breakdown

Anecdotally, probably 80 to 90 percent of the work a data scientist does is basic analysis and reporting on experimental and observational data.
Much of the data the scientist has to work with is observational, since experimental data takes time and resources to collect, while observational data is essentially "free" once you've implemented data collection. This makes observational data analysis methods important to be familiar with. We'll examine correlation and causation later in this book. We'll develop an understanding of observational data analysis methods by contrasting them with experimental data, and understanding why observational results are often biased.

Many data scientists work primarily with experimental data. We'll cover experiment design and analysis in some detail as well. Good experiment design is very hard. Web-scale experiments, while often providing very large samples, don't guarantee you'll actually be able to measure the experimental effects you're looking for, even when they're large! Randomized assignment doesn't even guarantee you'll have correct experimental results (due to selection bias). We'll cover all of this and more later in the book.

The other 10 or so percent of the work is the stuff you usually read about when you hear about data science in the news. It's the cool machine learning, artificial intelligence, and internet-of-things applications that are so exciting, and drive so many people toward the field of data science. In a very real sense, these applications are the future, but they're also the minority of the work data scientists do, unless they're the hybrid data scientist/machine learning engineer type. Those roles are relatively rare, and are generally for very senior data scientists.

This book is aimed at entry- to mid-level data scientists. We want to give you the skills to start developing your career in whichever direction you'd like, so you can find the data science role that is perfect for you.

1.3 CONCLUSION

Getting things right can be hard. Often, the need to move fast supersedes the need to get it right. Consider the case when you need to decide between two policies, A and B, which cost the same amount to implement. You must implement one, and time is a factor. If you can show that the effect of policy A, Y(A), is more positive than the effect of policy B, Y(B), it doesn't matter how much more positive it is. As long as Y(A) − Y(B) > 0, policy A is the right choice. As long as your measurement is good enough to be within 100 percent of the correct difference, you know enough to make the policy choice: if the error in your estimate is smaller than the true difference itself, the estimate can't change sign, so it always points to the right policy.
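To see why, here is a minimal simulation, with made-up effect sizes, of estimating the difference many times under measurement error of up to 99 percent of the true difference:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical true effects: policy A beats policy B by 0.05.
    true_diff = 0.30 - 0.25

    # Each estimate is off by up to 99 percent of the true difference.
    noise = rng.uniform(-0.99 * true_diff, 0.99 * true_diff, size=100_000)
    estimates = true_diff + noise

    # Magnitudes range from nearly zero to almost twice the truth,
    # but the sign never flips, so we always choose policy A.
    print("Share of estimates preferring A:", (estimates > 0).mean())  # 1.0

The individual estimates can be badly wrong in magnitude, yet their sign, which is all the decision requires, is always correct.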
At this point, you should have a better idea of what it means to be a data scientist. Now that you understand a little about the context, we can start exploring the product development process.

Chapter 2
Project Workflow
Andrew and Adam Kelleher

2.1 INTRODUCTION

This chapter focuses on the workflow of executing data science tasks as one-offs versus tasks that will eventually make up components in production systems. We look at a few diagrams of common workflows and propose combining two as a general approach. At the end of this chapter, the reader should understand where they fit in an organization that uses data-driven analyses to fuel innovation. We'll start by giving a little more context about team structure. Then, we break down the workflow into several steps: Planning, Design/Preprocessing, and Analysis/Modeling. These steps often blend together, and are usually not formalized. At the end, you'll have gone from the concept of a product, like a recommender system or a deep-dive analysis, to a working prototype or result.

At that stage, you're ready to start working with engineers to have the system implemented in production. That might mean bringing an algorithm into a production setting, automating a report, or something else. We should say that as you get to be a more senior data scientist, your workflow can evolve to look more like an engineer's workflow. Instead of prototyping in a Jupyter notebook on your computer, you might prototype a model as a component of a microservice. This section is really aimed at getting a data scientist oriented with the steps that start them toward building prototypes for models.
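As a sketch of what "a model as a component of a microservice" might look like, here is a minimal prototype. Flask, the /predict route, and the model.pkl artifact are all assumptions for illustration, not a prescription:

    from flask import Flask, request, jsonify
    import pickle

    app = Flask(__name__)

    # "model.pkl" is a hypothetical pre-trained model artifact with a
    # scikit-learn-style predict() method.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body like {"features": [1.0, 2.0, 3.0]}.
        features = request.get_json()["features"]
        prediction = model.predict([features])[0]
        return jsonify({"prediction": float(prediction)})

    if __name__ == "__main__":
        app.run(port=5000)

The point is that even as a prototype, the model already has the shape of a production component: a service with a defined request and response, rather than cells in a notebook.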
When you're prototyping data products, it's important to keep in mind the broader context of the organization. The focus should be more on testing value propositions than on perfect architecture, clean code, and crisp software abstractions. Those things take time, and the world changes quickly. With that in mind, we spend the remainder of this chapter talking about the agile methodology, and how data products should follow that methodology like any other piece of software.

2.2 THE DATA TEAM CONTEXT

When you're faced with a problem that you might solve with machine learning, there are usually many options available. You could make a fast, heuristic solution involving very little math, but that you could produce in a day before moving on to the next project. You could take a smarter approach and probably achieve better performance; the cost is your time, and the loss of opportunity to spend that time working on a different product. Finally, you could implement the state of the art. That usually means you'd have to research the best approach before even beginning coding, implement algorithms from scratch, and potentially solve unsolved problems with how to scale the implementation.

When you're working with limited resources, as you usually are in a data science context, the third option usually isn't the best choice. If you want a high-quality and competitive product, the first option might not be the best either. Where you fall along the spectrum between get-it-done and state-of-the-art depends on the problem, the context, and the resources available. If you're making a healthcare diagnosis system, the stakes are much higher than if you're building a recommendation system.

In order to understand why you'll use machine learning at all, you have to have a little context for where and how it's used. In this section, we'll try to give some understanding of how teams are structured, what some workflows might look like, and practical constraints on machine learning.

2.2.1 Embedding vs. Pooling Resources

In our experience, we've seen two models for data science teams. The first is a "pool of resources," where the team gets a request and someone on the team fulfills it. The second is for members of the team to "embed" with other teams in the organization to help them with their work.

In the first, "pool of resources" approach, each request of the team gets assigned and triaged like any project. Some member of the team executes it, and if they need help they lean on someone else. A common feature of this approach is that tasks aren't necessarily related, and it's not formally decided that a single member of the team executes all the tasks in a certain domain, or that a single member should handle all incoming requests from a particular person.

It makes sense to have the same person answer the questions for the same stakeholders so they can develop more familiarity with the products, and more rapport with the stakeholders. When teams are small, the