Anytime Prediction: Efficient Ensemble Methods for Any Computational Budget

Alexander Grubb

January 21, 2014
CMU-CS-14-100

School of Computer Science
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
J. Andrew Bagnell, Chair
Avrim Blum
Martial Hebert
Alexander Smola
Hal Daumé III, University of Maryland

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

© 2014 Alexander Grubb

This work was supported by ONR MURI grant N00014-09-1-1052 and the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-2-0016.

Keywords: anytime prediction, budgeted prediction, functional gradient methods, boosting, greedy optimization, feature selection

Abstract

A modern practitioner of machine learning must often consider trade-offs between accuracy and complexity when selecting from available machine learning algorithms. Prediction tasks can range from requiring real-time performance to being largely unconstrained in their use of computational resources. In each setting, an ideal algorithm utilizes as much of the available computation as possible to provide the most accurate result.

This issue is further complicated by applications where the computational constraints are not fixed in advance. In many applications predictions are needed in time to allow for adaptive behaviors which respond to real-time events. Such constraints often depend on a number of factors known only at prediction time, making it difficult to select a fixed prediction algorithm a priori. In these situations, an ideal approach is to use an anytime prediction algorithm. Such an algorithm rapidly produces an initial prediction and then continues to refine the result as time allows, producing final results which dynamically improve to fit any computational budget.

Our approach uses a greedy, cost-aware extension of boosting which fuses the disparate areas of functional gradient descent and greedy sparse approximation algorithms. By using a cost-greedy selection procedure, our algorithms provide an intuitive and effective way to trade off computational cost and accuracy for any computational budget. This approach learns a sequence of predictors to apply as time progresses, using each new result to update and improve the current prediction as time allows. Furthermore, we present theoretical work in the different areas we have brought together, and show that our anytime approach is guaranteed to achieve near-optimal performance with respect to unknown prediction-time budgets. We also present the results of applying our algorithms to a number of problem domains, such as classification and object detection, which indicate that our approach to anytime prediction is more efficient than adapting a number of existing methods to the anytime prediction problem.

We also present a number of contributions in areas related to our primary focus. In the functional gradient descent domain, we present convergence results for smooth objectives, and show that for non-smooth objectives the widely used approach fails both in theory and in practice. To rectify this we present new algorithms and corresponding convergence results for this domain. We also present novel, time-based versions of a number of greedy feature selection algorithms and give corresponding approximation guarantees for the performance of these algorithms.
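The abstract above describes prediction as an interruptible process: a learned, ordered sequence of predictors is applied one at a time, and the running result can be returned whenever the budget runs out. The sketch below is a minimal illustration of that prediction-time loop only, not the SpeedBoost implementation developed in this thesis; the WeakPredictor container, its weight and predict fields, and the scalar additive output are illustrative assumptions rather than the thesis's actual interface.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class WeakPredictor:
    """Hypothetical weak learner: a scoring function plus the weight chosen at training time."""
    predict: Callable[[Sequence[float]], float]
    weight: float


def anytime_predict(predictors: Sequence[WeakPredictor],
                    x: Sequence[float],
                    budget_seconds: float) -> float:
    """Refine an additive prediction until the time budget is exhausted.

    The predictor sequence is assumed to already be ordered by a cost-greedy
    training procedure, so truncating it at any point yields a usable result.
    """
    deadline = time.monotonic() + budget_seconds
    prediction = 0.0  # cheap initial prediction, refined as time allows
    for h in predictors:
        if time.monotonic() >= deadline:
            break  # budget exhausted: return the best prediction so far
        prediction += h.weight * h.predict(x)
    return prediction
```

Under these assumptions, a caller with a 5 ms deadline and a caller with a 500 ms deadline share the same learned sequence; only budget_seconds differs, which is what makes the predictor "anytime."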
For Max.

Acknowledgments

This document would certainly not exist without the support and input of my advisor, Drew Bagnell. His willingness to pursue my own ideas and his ability to guide me when my own insight failed have been invaluable. I'm forever indebted to him for the immense amount of things I have learned from him over the years. My only hope for my research career is that I can approach future problems with the same eye for deep, insightful questions that Drew has brought to our work.

I would also like to thank the rest of my committee: Avrim Blum, Martial Hebert, Alex Smola, and Hal Daumé III. Their time and input have improved this work in so many ways. Though not on this list, Geoff Gordon has also provided many helpful insights and discussions and deserves an unofficial spot here.

I have been fortunate enough to have a number of other mentors and advisors that have helped me reach this point. I'm very grateful to those who helped introduce me to academic research and advised me throughout my undergraduate and early graduate career, particularly Dave Touretzky and Paul Rybski.

I'd also like to thank Steven Rudich. The Andrew's Leap program that he runs for Pittsburgh-area high school students fostered my interest in Computer Science very early on and ultimately led me to where I am today. Were it not for my involvement with this program so many years ago, I can't imagine how my career might have unfolded.

Throughout this work I've had the opportunity to collaborate with a number of great people: Elliot Cuzzillo, Felix Duvallet, Martial Hebert, Hanzhang Hu, and Dan Muñoz. Thank you for sharing your work with me and giving me the opportunity to share mine. I am also grateful to Dave Bradley and Nathan Ratliff. Their work on functional gradient methods helped greatly in shaping and focusing my early work. This work would also not have been possible without a number of informal collaborations. The conversations and insights gleaned from Debadeepta Dey, Stéphane Ross, Suvrit Sra, Kevin Waugh, Brian Ziebart, and Jiaji Zhou, as well as the entire LairLab, have been immensely helpful.

The community and atmosphere at Carnegie Mellon, especially the Computer Science program, have been wonderful throughout this process. My love of Pittsburgh is no secret, but the journey that is graduate school is notorious for being difficult wherever you are. That my own personal journey was as enjoyable as it was is a testament to the great people here. To name a few: Jim Cipar, John Dickerson, Mike Dinitz, Jason Franklin, Sam Ganzfried, Tony Gitter, Severin Hacker, Elie Krevat, Dan Muñoz, Abe Othman, Stéphane Ross, Jiří Šimša, Jenn Tam, Kevin Waugh, and Erik Zawadzki. Your friendship has been greatly uplifting.

Finally, I am most grateful to my family. I thank my parents, Suzan and Bob, for the unending support and opportunity that they have provided throughout my life. This work is as much a product of their effort as it is mine. As for my wife, Sarah, there is no amount of gratitude I can give that can compensate for the amount of love, support and patience she has shown me. Thank you.

Contents

1 Introduction
  1.1 Motivation
  1.2 Approach
  1.3 Related Work
  1.4 Contributions

I Functional Gradient Methods

2 Functional Gradient Methods
  2.1 Background
  2.2 Functional Gradient Descent
  2.3 Restricted Gradient Descent
  2.4 Smooth Convergence Results
  2.5 General Convex Convergence Results
  2.6 Experiments

3 Functional Gradient Extensions
  3.1 Structured Boosting
  3.2 Stacked Boosting

II Greedy Optimization

4 Budgeted Submodular Function Maximization
  4.1 Background
  4.2 Approximate Submodularity
  4.3 Approximate Greedy Maximization
  4.4 Bi-criteria Approximation Bounds for Arbitrary Budgets

5 Sparse Approximation
  5.1 Background
  5.2 Regularized Sparse Approximation
  5.3 Constrained Sparse Approximation
  5.4 Generalization to Smooth Losses
  5.5 Simultaneous Sparse Approximation
  5.6 Grouped Features
  5.7 Experimental Results

III Anytime Prediction

6 SPEEDBOOST: Anytime Prediction Algorithms
  6.1 Background
  6.2 Anytime Prediction Framework
  6.3 SPEEDBOOST
  6.4 Theoretical Guarantees
  6.5 Experimental Results

7 STRUCTURED SPEEDBOOST: Anytime Structured Prediction
  7.1 Background
  7.2 Anytime Structured Prediction
  7.3 Anytime Scene Understanding
  7.4 Experimental Analysis

8 Conclusion
  8.1 Future Directions

Bibliography