Feature Factory: A Collaborative, Crowd-Sourced Machine Learning System

by Alex Christopher Wang

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Masters of Science in Computer Science and Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 22, 2015
Certified by: Kalyan Veeramachaneni, Research Scientist, Thesis Supervisor
Accepted by: Albert Meyer, Chairman, Masters of Engineering Thesis Committee

Abstract

In this thesis, I designed, implemented, and tested Feature Factory, a machine learning system for crowd-sourcing feature discovery. Feature Factory provides a complete web-based platform for users to define, extract, and test features on any given machine learning problem. This project involved designing, implementing, and testing a proof-of-concept version of this platform. Creating the platform required developing both user-side and system-side infrastructure. The user-side infrastructure required careful design decisions to give users a clear and concise interface and workflow. The system-side infrastructure involved constructing an automated feature aggregation, extraction, and testing pipeline that can be executed with a few simple commands.
Testing was performed by presenting three different machine learning problems to test users through the user-side infrastructure of Feature Factory. Users were asked to write features for the three problems and to comment on the usability of the system. The system-side infrastructure was then used to analyze the effectiveness and performance of the features written by the users.

Thesis Supervisor: Kalyan Veeramachaneni
Title: Research Scientist

Acknowledgments

I would like to acknowledge Kalyan Veeramachaneni for all of his support, guidance, and help throughout the course of this thesis. I would also like to thank my parents for always loving me, supporting me, and being the reason I got to where I am today.

Contents

1 Introduction
    1.1 Objectives
2 Design of Feature Factory
    2.1 User-side Infrastructure
        2.1.1 Interactive IPython notebook
        2.1.2 Collaborative framework
        2.1.3 Machine Learning Service
    2.2 System-side Infrastructure
        2.2.1 MySQL Database
        2.2.2 Automatic Feature Extraction Infrastructure
        2.2.3 Leaderboard
3 Uploading New Machine Problems to Feature Factory
    3.1 Data
    3.2 Problem-Specific Code
    3.3 Template IPython Notebook
        3.3.1 Setup and Registration
        3.3.2 Sample Dataset and Example Feature
        3.3.3 Training and Testing
4 Feature Factory: First trials
    4.1 Experiment methodology
    4.2 KDDCup 2014
        4.2.1 Dataset
        4.2.2 Integration with Feature Factory
        4.2.3 Test-Train-Submission
        4.2.4 Results
    4.3 Route Prediction
        4.3.1 Data Preprocessing
        4.3.2 Problem
        4.3.3 Dataset
        4.3.4 Integration with Feature Factory
        4.3.5 Results
    4.4 IJCAI
        4.4.1 Dataset
        4.4.2 Integration with Feature Factory
        4.4.3 Results
    4.5 Lessons Learned from Trials
        4.5.1 Feature Quality
        4.5.2 Feedback on User Experience
5 Challenges Faced
    5.1 Example problem
    5.2 Generating a coherent data sample
    5.3 Debugging user submitted features
    5.4 Noisy Estimates of Feature Accuracy
    5.5 Compute challenges with feature extraction
    5.6 Takeaways
6 Conclusion and Future Work
    6.1 Future Work