UNDERSTANDING VISUAL APPEARANCE ON THE WEB USING LARGE-SCALE CROWDSOURCING AND DEEP LEARNING

A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

by
Sean Cameron Bell
August 2016

© 2016 Sean Cameron Bell
ALL RIGHTS RESERVED

Sean Cameron Bell, Ph.D.
Cornell University 2016

Automatically understanding scenes is the holy grail of computer vision. Real-world scenes have a vast array of interesting objects, materials, textures, and surfaces. With scenes, people want to edit photographs, search by object and material properties, visualize changes to rooms and buildings, browse collections by visual similarity, and explain images to the visually impaired. However, the tools and data that we have for recognizing, editing, and exploring the application of scene properties for everyday problems are still quite limited. We cannot easily understand, search, and aggregate visual concepts in the billions of photos that are uploaded every day to the web.

Recently, large data collections combined with machine learning have opened new frontiers in scene understanding. In this thesis, we introduce new large-scale crowdsourced datasets for material and visual understanding in the wild. Using these new datasets, we develop new state-of-the-art algorithms for scene understanding of materials, objects, shapes, and style.

We propose multiple new large-scale, first-of-their-kind datasets in the wild. OpenSurfaces contains thousands of segmented surfaces annotated with material, texture, and contextual information. MINC (Materials in Context) includes millions of points annotated with materials. Both are at least an order of magnitude larger than prior datasets. The Intrinsic Images in the Wild (IIW) dataset includes millions of crowdsourced annotations of relative comparisons of material properties at pairs of points in each scene.
These datasets all require careful crowdsourcing and use the ability of humans to judge materials despite variations in illumination, viewpoint, imaging conditions, and context.

Using these large-scale datasets we demonstrate state-of-the-art algorithms for material recognition using OpenSurfaces and MINC, and intrinsic image decomposition using IIW. We also develop state-of-the-art algorithms for object detection (Inside-Outside Network, ION) and visual search for style similarity (ProductNet).

In this thesis we have demonstrated how the combination of crowdsourcing at scale and new deep learning architectures can create new tools to let consumers understand and edit images, scenes, materials and objects.

BIOGRAPHICAL SKETCH

Sean Bell was born in Toronto, Canada. When first asked “what do you want to be when you grow up?”, he would answer “computer programmer!”, not even knowing what they did; he just loved computers. In early high school, he would spend his spare time in the computer lab, working on making graphical effects and animations with Java Applets. From 2007 to 2011, he studied Engineering Science at the University of Toronto. In his second year, his team won first place in the AER201 Engineering Design Project, programming the micro-controller for a robot that dispenses an exact number of candies at high speed (a very useful device). In his last year, he built a ray-tracer that “borrowed” unused computers across campus to render his scenes, one scanline at a time.

During his undergrad, Sean spent five summers working at Hill & Schumacher, a patent law firm, and almost went into patent law as a possible career. However, he was more interested in writing software to help draft patents than the patents themselves. This turned into his undergraduate thesis, which was a real-time system to detect inconsistencies in patents as they were being drafted.

Since 2011, Sean has been studying for his doctorate degree in Computer Science at Cornell University. When he first arrived, he wasn’t sure whether to work on natural language processing, computer graphics, or machine learning.
He quickly found his place in the boundary between graphics and vision, and has since enjoyed five wonderful years studying at Cornell. Upon graduation in 2016, he is co-founding a deep learning company, GrokStyle, based on his work in visual search and style similarity.

This thesis is dedicated to my parents, for their love and support,
and to Stephanie Sang, for putting up with so much.

ACKNOWLEDGEMENTS

This thesis would not have been possible without the support, guidance, and mentorship of my advisor Prof. Kavita Bala. Kavita has always encouraged me to strive for the best possible version of anything that I work on, and has been central to learning how to do research. I would like to thank Prof. Noah Snavely, as my close collaborator and committee member; his ideas and feedback have been invaluable to research meetings. I would also like to thank collaborators Paul Upchurch, Larry Zitnick, and Ross Girshick, and my other PhD committee member, Prof. Charles Van Loan, as well as Profs. David Bindel and Serge Belongie for serving as proxies on my committee.

I thank my friends and colleagues in the Graphics and Vision Lab, for their camaraderie, willingness to discuss research ideas and read paper drafts, and for making it such a great place to work: Kevin Matzen, Tim Langlois, Balázs Kovács, and Paul Upchurch, as well as Albert Liu, Pramook Khungurn, Daniel Hauagge, Kyle Wilson, Nicolas Savva, Scott Wehrwein, Eston Schweikart, Jui-hsien Wang, Wenzel Jakob, and Steven An. I would also like to thank computer graphics professors Steve Marschner and Doug James, as well as the entire Cornell Computer Science department, for always attending my talks with great feedback.

At the University of Toronto, I would like to thank my colleagues and friends for making undergrad so enjoyable, and for cultivating friendly competition: Trevor Campbell, Konstantine Tsotsos, Jamie Liu, Rick Zhang, Manan Arya, Sanae Rosen, Catherine Chen, Amy Chen, Angela Yoo, Zoya Gavrilov, Mark Harfouche, Adam Pan, and Casey Scott-Songin. I thank Prof. Kyros Kutulakos for inspiring me to consider research in computer graphics.

I would like to thank the collaborators and colleagues that I meet at conferences and internships, all those who attended my talks and posters, and those who emailed me about my work, with exciting ideas, questions, and discussions about research, in particular Abhinav Shrivastava, Ishan Misra, Jon Barron, Andrej Karpathy, and Peter Gehler.

I owe everything to my family, including my grandparents, cousins, aunts, uncles, my siblings Ian and Robyn, and especially my parents Sally and Graydon, for creating such a wonderful and supportive environment, helping me at every stage in life, and for putting up with me being away for so long. I am grateful to my grandpa Scotty Bell for funding my undergraduate education.

To my girlfriend Stephanie Sang, who supported me through the crunch times, kept me company, listened to my research struggles, sent me hand-drawn comics, brought me home-cooked meals to the lab, and helped give my life meaning outside the lab. You put up with so much, and I am eternally grateful.

Finally, I would like to thank the Zabs sandwich from Collegetown Bagels, and the latte from Gimme Coffee, for being so delicious.

TABLE OF CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
2 OpenSurfaces: A Richly Annotated Catalog of Surface Appearance
  2.1 Introduction
  2.2 Related work
  2.3 Overview
    2.3.1 Community photo collections
    2.3.2 Human annotation
    2.3.3 OpenSurfaces data representation
    2.3.4 Annotation stages
  2.4 The OpenSurfaces annotation pipeline
    2.4.1 Stage 1: Filtering images by scene category
    2.4.2 Stage 2: Flag images with improper white balance
    2.4.3 Stage 3: Material segmentation
    2.4.4 Stages 4 and 5: Naming materials and objects
    2.4.5 Stage 6: Planarity voting
    2.4.6 Stage 7: Rectified textures
    2.4.7 Stage 8: Appearance matching
  2.5 Results
    2.5.1 OpenSurfaces statistics
    2.5.2 Task analytics
  2.6 Proof-of-Concept Applications
    2.6.1 Texturing
    2.6.2 Informed scene similarity
    2.6.3 Future applications
  2.7 Conclusions and future work
  2.8 Impact
3 Material Recognition in the Wild with the Materials in Context Database
  3.1 Introduction
  3.2 Prior Work
  3.3 The Materials in Context Database (MINC)
    3.3.1 Sources of data
    3.3.2 Segments, Clicks, and Patches
  3.4 Material recognition in real-world images
    3.4.1 Training procedure
    3.4.2 Full scene material classification
  3.5 Experiments and Results
    3.5.1 Patch material classification
    3.5.2 Full scene material segmentation
    3.5.3 Comparing MINC to FMD
    3.5.4 Comparing CNNs with prior methods
  3.6 Conclusion
  3.7 Impact
4 Intrinsic Images in the Wild
  4.1 Introduction
  4.2 Related work
  4.3 Database
    4.3.1 What judgements should we collect?
    4.3.2 Which images and which pairs of points?
    4.3.3 Annotation interface
    4.3.4 Data verification
    4.3.5 Error metric: WHDR
    4.3.6 Discussion and results
  4.4 Intrinsic Images Algorithm
    4.4.1 Initialization
    4.4.2 Stage 1: Optimize reflectance
    4.4.3 Stage 2: Optimize for shading
  4.5 Results
    4.5.1 Training
    4.5.2 Ranking
    4.5.3 MIT Intrinsic Images dataset
  4.6 Limitations and future work
  4.7 Conclusion
  4.8 Impact
5 Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks
  5.1 Introduction
  5.2 Prior work
  5.3 Architecture: Inside-Outside Net (ION)
    5.3.1 Pooling from multiple layers
    5.3.2 Context features with IRNNs
  5.4 Results
    5.4.1 Experimental setup
    5.4.2 PASCAL VOC 2007
    5.4.3 PASCAL VOC 2012