
Data Analysis and Machine Learning with Kaggle (2021) [9781801817479] PDF

531 Pages·10.671 MB·English


The Kaggle Book
Data analysis and machine learning for competitive data science
Konrad Banachewicz
Luca Massaron
BIRMINGHAM-MUMBAI
Packt and this book are not officially connected with Kaggle. This book is an effort from the Kaggle community of experts to help more developers.
The Kaggle Book
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Producer: Tushar Gupta
Acquisition Editor – Peer Reviews: Saby Dsilva
Project Editor: Parvathy Nair
Content Development Editor: Lucy Wan
Copy Editor: Safis Editing
Technical Editor: Karan Sonawane
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Presentation Designer: Pranit Padwal
First published: April 2022
Production reference: 2220422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-747-9
www.packt.com
Foreword
I had a background in econometrics but became interested in machine learning techniques, initially as an alternative approach to solving forecasting problems.
As I started discovering my interest, I found the field intimidating to enter: I didn’t know the techniques or the terminology, and I didn’t have the credentials that would allow me to break in. It was always my dream that Kaggle would allow people like me the opportunity to break into this powerful new field. Perhaps the thing I’m proudest of is the extent to which Kaggle has made data science and machine learning more accessible. We’ve had many Kagglers go from newbies to top machine learners, being hired at places like NVIDIA, Google, and OpenAI, and starting companies like DataRobot.
Luca and Konrad’s book helps make Kaggle even more accessible. It offers a guide both to how Kaggle works and to many of the key learnings that they have taken out of their time on the site. Collectively, they’ve been members of Kaggle for over 20 years, entered 330 competitions, made over 2,000 posts to Kaggle forums, and shared over 100 notebooks and 50 datasets. They are both top-ranked users and well-respected members of the Kaggle community. Those who complete this book should expect to be able to engage confidently on Kaggle – and engaging confidently on Kaggle has many rewards.
Firstly, it’s a powerful way to stay on top of the most pragmatic developments in machine learning. Machine learning is moving very quickly. In 2019, over 300 peer-reviewed machine learning papers were published per day. This volume of publishing makes it impossible to be on top of the literature. Kaggle ends up being a very valuable way to filter what developments matter on real-world problems – and Kaggle is useful for more than keeping up with the academic literature. Many of the tools that have become standard in the industry have spread via Kaggle. For example, XGBoost in 2014 and Keras in 2015 both spread through the community before making their way into industry.
Secondly, Kaggle offers users a way to “learn by doing.” I’ve heard active Kagglers talk about competing regularly as “weight training” for machine learning. The variety of use cases and problems they tackle on Kaggle makes them well prepared when they encounter similar problems in industry. And because of competition deadlines, Kaggle trains the muscle of iterating quickly. There’s probably no better way to learn than to attempt a problem and then see how top performers tackled the same problem (it’s typical for winners to share their approaches after the competition).
So, for those of you who are reading this book and are new to Kaggle, I hope it helps make Kaggle less intimidating. And for those who have been on Kaggle for a while and are looking to level up, I hope this book from two of Kaggle’s strongest and most respected members helps you get more out of your time on the site.
Anthony Goldbloom
Kaggle Founder and CEO
Contributors
About the authors
Konrad Banachewicz holds a PhD in statistics from Vrije Universiteit Amsterdam. During his period in academia, he focused on problems of extreme dependency modeling in credit risk. In addition to his research activities, Konrad was a tutor and supervised master’s students. Starting from classical statistics, he slowly moved toward data mining and machine learning (this was before the terms “data science” or “big data” became ubiquitous). In the decade after his PhD, Konrad worked in a variety of financial institutions on a wide array of quantitative data analysis problems. In the process, he became an expert on the entire lifetime of a data product cycle. He has visited different ends of the frequency spectrum in finance (from high-frequency trading to credit risk, and everything in between), predicted potato prices, and analyzed anomalies in the performance of large-scale industrial equipment. As a person who himself stood on the shoulders of giants, Konrad believes in sharing knowledge with others.
In his spare time, he competes on Kaggle (“the home of data science”).
I would like to thank my brother for being a fixed point in a chaotic world and continuing to provide inspiration and motivation. Dzięki, Braciszku.
Luca Massaron is a data scientist with more than a decade of experience in transforming data into smarter artifacts, solving real-world problems, and generating value for businesses and stakeholders. He is the author of bestselling books on AI, machine learning, and algorithms. Luca is also a Kaggle Grandmaster who reached no. 7 in the worldwide user rankings for his performance in data science competitions, and a Google Developer Expert (GDE) in machine learning.
My warmest thanks go to my family, Yukiko and Amelia, for their support and loving patience as I prepared this new book in a long series. My deepest thanks to Anthony Goldbloom for kindly writing the foreword for this book and to all the Kaggle Masters and Grandmasters who have so enthusiastically contributed to its making with their interviews, suggestions, and help. Finally, I would like to thank Tushar Gupta, Parvathy Nair, Lucy Wan, Karan Sonawane, and all of the Packt Publishing editorial and production staff for their support on this writing effort.
About the reviewer
Dr. Andrey Kostenko is a data science and machine learning professional with extensive experience across a variety of disciplines and industries, including hands-on coding in R and Python to build, train, and serve time series models for forecasting and other applications. He believes that lifelong learning and open-source software are both critical for innovation in advanced analytics and artificial intelligence. Andrey recently assumed the role of Lead Data Scientist at Hydroinformatics Institute (H2i.sg), a specialized consultancy and solution services provider for all aspects of water management. Prior to joining H2i, Andrey had worked as Senior Data Scientist at IAG InsurTech Innovation Hub for over 3 years.
Before moving to Singapore in 2018, he worked as Data Scientist at TrafficGuard.ai, an Australian AdTech start-up developing novel data-driven algorithms for mobile ad fraud detection. In 2013, Andrey received his doctorate degree in Mathematics and Statistics from Monash University, Australia. By then, he already had an MBA degree from the UK and his first university degree from Russia. In his spare time, Andrey is often found engaged in competitive data science projects, learning new tools across R and Python ecosystems, exploring the latest trends in web development, solving chess puzzles, or reading about the history of science and mathematics.
Dr. Firat Gonen is the Head of Data Science and Analytics at Getir. Gonen leads the data science and data analysis teams delivering innovative and cutting-edge Machine Learning projects. Before Getir, Dr. Gonen was managing Vodafone Turkey’s AI teams. Prior to Vodafone Turkey, he was the Principal Data Scientist at Dogus Group (one of Turkey’s largest conglomerates). Gonen holds extensive educational qualifications including a PhD degree in NeuroScience and Neural Networks from University of Houston and is an expert in Machine Learning, Deep Learning, Visual Attention, Decision-Making & Genetic Algorithms with more than 12 years in the field. He has authored several peer-reviewed journal papers. He’s also a Kaggle Triple GrandMaster and has more than 10 international data competition medals. He was also selected as the 2020 Z by HP Data Science Global Ambassador.
About the interviewees
We were fortunate enough to be able to collect interviews from 31 talented Kagglers across the Kaggle community, who we asked to reflect on their time on the platform. You will find their answers scattered across the book. They represent a broad range of perspectives, with many insightful responses that are as similar as they are different.
We read each one of their contributions with great interest and hope the same is true for you, the reader. We give thanks to all of them and list them in alphabetical order below.
Abhishek Thakur, who is currently building AutoNLP at Hugging Face.
Alberto Danese, Head of Data Science at Nexi.
Andrada Olteanu, Data Scientist at Endava, Dev Expert at Weights and Biases, and Z by HP Global Data Science Ambassador.
Andrew Maranhão, Senior Data Scientist at Hospital Albert Einstein in São Paulo.
Andrey Lukyanenko, Machine Learning Engineer and TechLead at MTS Group.
Bojan Tunguz, Machine Learning Modeler at NVIDIA.
Chris Deotte, Senior Data Scientist and Researcher at NVIDIA.
Dan Becker, VP Product, Decision Intelligence at DataRobot.
Dmitry Larko, Chief Data Scientist at H2O.ai.
Firat Gonen, Head of Data Science and Analytics at Getir and Z by HP Global Data Science Ambassador.
Gabriel Preda, Principal Data Scientist at Endava.
Gilberto Titericz, Senior Data Scientist at NVIDIA.
Giuliano Janson, Senior Applied Scientist for ML and NLP at Zillow Group.
Jean-François Puget, Distinguished Engineer, RAPIDS at NVIDIA, and the manager of the NVIDIA Kaggle Grandmaster team.
Jeong-Yoon Lee, Senior Research Scientist in the Rankers and Search Algorithm Engineering team at Netflix Research.
Kazuki Onodera, Senior Deep Learning Data Scientist at NVIDIA and member of the NVIDIA KGMON team.
Laura Fink, Head of Data Science at Micromata.
Martin Henze, PhD Astrophysicist and Data Scientist at Edison Software.
Mikel Bober-Irizar, Machine Learning Scientist at ForecomAI and Computer Science student at the University of Cambridge.
Osamu Akiyama, Medical Doctor at Osaka University.
Parul Pandey, Data Scientist at H2O.ai.
Paweł Jankiewicz, Chief Data Scientist & AI Engineer as well as Co-founder of LogicAI.
Rob Mulla, Senior Data Scientist at Biocore LLC.
Rohan Rao, Senior Data Scientist at H2O.ai.
Ruchi Bhatia, Data Scientist at OpenMined, Z by HP Global Data Science Ambassador, and graduate student at Carnegie Mellon University.
Ryan Chesler, Data Scientist at H2O.ai.
Shotaro Ishihara, Data Scientist and Researcher at a Japanese news media company.
Sudalai Rajkumar, an AI/ML advisor for start-up companies.
Xavier Conort, Founder and CEO at Data Mapping and Engineering.
Yifan Xie, Co-founder of Arion Ltd, a data science consultancy firm.
Yirun Zhang, final-year PhD student at King’s College London in applied machine learning.
