Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not?

Erjin Zhou, Zhimin Cao, Qi Yin
Face++, Megvii Inc.
[email protected]  [email protected]  [email protected]

arXiv:1501.04690v1 [cs.CV]

Abstract

Face recognition performance has improved rapidly with the recent development of deep learning techniques and the accumulation of large underlying training datasets. In this paper, we report our observations on how big data impacts recognition performance. Guided by these observations, we build our Megvii Face Recognition System, which achieves 99.50% accuracy on the LFW benchmark, outperforming the previous state of the art. Furthermore, we report its performance in a real-world security certification scenario, where a clear gap between machine recognition and human performance still remains. We summarize our experiments, present three challenges lying ahead for face recognition, and indicate several possible directions toward addressing them. We hope our work will stimulate the community's discussion of the difference between research benchmarks and real-world applications.

Figure 1. A data perspective on the LFW history. The plot shows LFW accuracy (roughly 84% to 100%) from 2009 to 2015 for Multiple LE + comp, Associate-Predict, Tom-vs-Pete, Bayesian Face Revisited, Joint Bayesian, High-dim LBP, TL Joint Bayesian, Hybrid Deep Learning, FR+FCN, DeepFace, DeepID, GaussianFace, DeepID2, and DeepID2+, with markers distinguishing training sets of about 10K, under 100K, and over 100K images. Large amounts of web-collected data arrived with the recent deep learning wave, and extreme performance improvements followed. How does big data impact face recognition?

1. INTRODUCTION

The LFW benchmark [8] is intended to test a recognition system's performance in unconstrained environments, which is considerably harder than many other, constrained datasets (e.g., YaleB [6] and MultiPIE [7]). It has become the de facto standard for evaluating face recognition in the wild in recent years, and extensive work has been done to push the accuracy limit on it [3, 16, 4, 1, 2, 5, 11, 10, 12, 14, 13, 17, 9].

Throughout the history of the LFW benchmark, surprising improvements have been obtained with recent deep learning techniques [17, 14, 13, 10, 12]. The main framework of these systems is multi-class classification [10, 12, 14, 13]. Meanwhile, many sophisticated methods have been developed and applied to recognition systems (e.g., joint Bayesian in [4, 2, 10, 12, 13], model ensembles in [10, 14], multi-stage features in [10, 12], and joint identification-verification learning in [10, 13]). Indeed, large amounts of outside labeled data are collected for learning deep networks. Unfortunately, there is little work investigating the relationship between big data and recognition performance. This motivates us to explore how big data impacts recognition performance.

Hence, we collect a large amount of labeled web data and build a convolutional network framework. Two critical observations are obtained. First, the data distribution and data size do influence recognition performance. Second, the performance gain from many existing sophisticated methods decreases as the total data size increases.

Following these observations, we build our Megvii Face Recognition System from simple, straightforward convolutional networks, without any sophisticated tuning tricks or smart architecture designs. Surprisingly, by utilizing a large web-collected labeled dataset, this naive deep learning system achieves state-of-the-art performance on LFW: 99.50% recognition accuracy, surpassing human level. Furthermore, we introduce a new benchmark, called the Chinese ID (CHID) benchmark, to explore a recognition system's generalization. The CHID benchmark is intended to test recognition systems in a real security certification environment, which is restricted to Chinese people and requires a very low false positive rate. Unfortunately, empirical results show that a generic method trained with web-collected data and reaching high LFW performance does not imply an acceptable result on such an application-driven benchmark. When we keep the false positive rate at 10^-5, the true positive rate is 66%, which does not meet our application's requirement.

Summarizing these experiments, we report three main challenges in face recognition: data bias, very low false positive criteria, and cross factors. Although we achieve very high accuracy on the LFW benchmark, these problems still exist and will be amplified in many specific real-world applications. Hence, from an industrial perspective, we discuss several ways to direct future research. Our central concern is around data: how to collect data and how to use data. We hope these discussions will contribute to further study of face recognition.

2. A DATA PERSPECTIVE ON FACE RECOGNITION

An interesting view of the LFW benchmark history (see Fig. 1) shows that an implicit data accumulation underlies the performance improvements. The amount of data expanded 100 times from 2010 to 2014 (e.g., from about 10 thousand training samples in Multiple LE [3] to 4 million images in DeepFace [14]). In particular, large amounts of web-collected data arrived with the recent deep learning wave, and huge performance improvements were gained.

We are interested in this phenomenon. How does big data, especially the large amount of web-collected data, impact recognition performance?

Figure 2. Data talks. (a) The distribution of the MFC database: all individuals are sorted by their number of instances, which falls from roughly 1,000 for the most frequent identities down to a handful in the long tail. (b) Performance under different amounts of training data: LFW accuracy rises linearly as data size increases; each sub-training set chooses individuals randomly from the MFC database. (c) Performance under different amounts of training data, where each sub-database instead chooses the individuals with the largest numbers of instances: a long-tail effect emerges when the number of individuals exceeds 10,000, and continuing to add individuals with only a few instances per person does not help to improve performance.

3. MEGVII FACE RECOGNITION SYSTEM

3.1. Megvii Face Classification Database

We collect and label a large number of celebrity images from the Internet, referred to as the Megvii Face Classification (MFC) database. It contains 5 million labeled faces of about 20,000 individuals. We manually remove every person who appears in LFW. Fig. 2 (a) shows the distribution of the MFC database, which exhibits a very important characteristic of web-collected data that we describe later.

3.2. Naive Deep Convolutional Neural Network

We develop a simple, straightforward deep network architecture and train it with multi-class classification on the MFC database. The network contains ten layers, and the last layer is a softmax layer used in the training phase for supervised learning. The hidden layer output before the softmax layer is taken as the feature of the input image. The final face representation is then passed through a PCA model for feature reduction. We measure the similarity between two images by a simple L2 norm.

Figure 3. Overview of the Megvii Face Recognition System. We design a simple ten-layer deep convolutional neural network for recognition. Four face regions are cropped for representation extraction. We train our networks on the MFC database under the traditional multi-class classification framework. In the testing phase, a PCA model is applied for feature reduction, and a simple L2 norm is used to compare a pair of testing faces.

4. CRITICAL OBSERVATIONS

We have conducted a series of experiments to explore how data impacts recognition performance. We first investigate how data size and data distribution influence system performance. We then report our observations on many sophisticated techniques from the previous literature when they meet a large training dataset. All of these experiments are set up with our ten-layer CNN applied to the whole face region.

4.1. Pros and Cons of Web-Collected Data

Web-collected data has a typical long-tail characteristic: a few "rich" individuals have many instances, while many "poor" individuals have only a few instances per person (see Fig. 2 (a)). In this section, we first explore how total data size influences final recognition performance, and we then discuss the long-tail effect in the recognition system.

Continued performance improvement. Large amounts of training data improve the system's performance considerably. We investigate this by training the same network with different numbers of individuals, from 4,000 to 16,000. The individuals are randomly sampled from the MFC database, so each sub-database keeps the original data distribution. Fig. 2 (b) presents each system's performance on the LFW benchmark. The performance improves linearly as the amount of data accumulates.

Long-tail effect. The long tail is a typical characteristic of web-collected data, and we want to know its impact on the system's performance. We first sort all individuals by their number of instances, in decreasing order. We then train the same network with different numbers of individuals, from 4,000 to 16,000. Fig. 2 (c) shows the performance of each system on the LFW benchmark. The long tail does influence performance: the best result occurs when we take the first 10,000 individuals with the most instances as the training dataset. In other words, adding individuals with only a few instances does not help to improve recognition performance; indeed, these individuals further harm the system's performance.

4.2. Traditional Tricks Fade as Data Increases

We have explored many sophisticated methods from the previous literature and observe that, as training data increases, little gain is obtained from these methods in our experiments. We have tried:

• Joint Bayesian: modeling the face representation with independent Gaussian variables [4, 2, 10, 12, 13];
• Multi-stage features: combining the outputs of the last several layers as the face representation [10, 12];
• Clustering: labeling each individual within a hierarchical structure and learning with both coarse and fine labels [15];
• Joint identification and verification: adding pairwise constraints on the hidden layer of the multi-class classification framework [10, 13].

All of these sophisticated methods introduce extra hyper-parameters to the system, which makes it harder to train. When we applied these methods to the MFC database by trial and error, little gain was obtained in our experiments compared with the simple CNN architecture and PCA reduction.

5. PERFORMANCE EVALUATION

In this section, we evaluate our system on the LFW benchmark and on a real-world security certification application. Based on our previous observations, we train the whole system with the 10,000 most "rich" individuals. We train the network on four face regions (i.e., centered at the eyebrows, eye centers, nose tip, and mouth corners, as located by a facial landmark detector). Fig. 3 presents an overview of the whole system. The final representation of a face is the concatenation of the four features, followed by PCA for feature reduction.

5.1. Results on the LFW Benchmark

We achieve 99.50% accuracy on the LFW benchmark, which is the best result to date and beyond human performance. Fig. 4 shows all the failed cases of our system. Except for a few pairs (referred to as "easy cases"), most cases are considerably hard to distinguish, even for a human. These "hard cases" suffer from several different cross factors, such as large pose variation, heavy make-up, glasses, or other occlusions. We note that without other priors (e.g., we have watched The Hours, so we know that the brown-haired "Virginia Woolf" is Nicole Kidman), it is very hard to correct most of the remaining pairs. Based on this, we think a reasonable upper limit on LFW is about 99.7%, reached if all the "easy cases" are solved.

Figure 4. 30 failed cases on the LFW benchmark, grouped into false positives and false negatives within two parts. (a) shows the failed cases regarded as "easy cases", which we believe can be solved with a better training system under the existing framework. (b) shows the "hard cases", with panels labeled pose, occlusion, make-up, and errata. These cases all present some special cross factors, and most of them are hard even for humans. Hence, we believe that without any other priors, it is hard for a computer to correct these cases.

5.2. Results on the Real-World Application

In order to investigate the recognition system's performance in a real-world environment, we introduce a new benchmark, referred to as the Chinese ID (CHID) benchmark. We collect the dataset offline, and it specializes in Chinese people. Unlike the LFW benchmark, the CHID benchmark is a domain-specific task, and we are interested in the true positive rate when the false positive rate is kept very low (e.g., FP = 10^-5). This benchmark is intended to mimic a real security certification environment and test recognition systems' performance there. When we apply our "99.50%" recognition system to the CHID benchmark, its performance does not meet the real application's requirements; the "beyond human" system does not really work as well as it seems. When we keep the false positive rate at 10^-5, the true positive rate is 66%. Fig. 5 shows some failed cases under the FP = 10^-5 criterion. Age variation, including intra-variation (i.e., the same person's face captured at different ages) and inter-variation (i.e., people of different ages), is a typical characteristic of the CHID benchmark. Unsurprisingly, the system suffers from this variation, because it is not captured in the web-collected MFC database. We ran a human test on all of our failed cases; averaging 10 independent results shows that 90% of the cases can be solved by humans, which means machine recognition performance is still far from human level in this scenario.

Figure 5. Some failed cases (false positives and false negatives) on the CHID benchmark. The recognition system suffers from the age variations in the CHID benchmark, including intra-variation (i.e., the same person's face captured at different ages) and inter-variation (i.e., people of different ages). Because little age variation is captured by the web-collected data, the system unsurprisingly cannot handle this variation well. Indeed, we ran a human test on all these failed cases: results show that 90% of them can be solved by humans. There still exists a big gap between machine recognition and human level.

6. CHALLENGES LYING AHEAD

Based on our evaluation on the two benchmarks, we summarize three main challenges for face recognition.

Data bias. The distribution of web-collected data is extremely unbalanced. Our experiments show that a large number of people with only a few instances per individual does not help in a simple multi-class classification framework. On the other hand, we realize that large-scale web-collected data can only provide a starting point; it is a baseline for face recognition. Most web-collected faces come from celebrities: smiling, made-up, young, and beautiful. They are far from the images captured in daily life. Despite the high accuracy on the LFW benchmark, such a system's performance still hardly meets the requirements of real-world applications.

Very low false positive rate. Real-world face recognition has much more diverse criteria than those treated in previous recognition benchmarks. As we stated before, in most security certification scenarios, customers care more about the true positive rate when the false positive rate is kept very low. Although we achieve very high accuracy on the LFW benchmark, our system is still far from human performance in these real-world settings.

Cross factors. Throughout the failed-case studies on the LFW and CHID benchmarks, pose, occlusion, and age variation are the most common factors that influence the system's performance. However, we still lack a sufficient investigation of these cross factors, and we also lack an efficient method to handle them clearly and comprehensively.

7. FUTURE WORKS

Large amounts of web-collected data help us achieve the state-of-the-art result on the LFW benchmark, surpassing human performance. But this is just a new starting point for face recognition. The significance of this result is to show that face recognition is able to go out of the laboratory and come into our daily life. When we face real-world applications instead of a single benchmark, there is still a lot of work to do.

Our experiments emphasize that data is an important factor in a recognition system, and we present the following issues as an industrial perspective on future research in face recognition.

On one hand, developing smarter and more efficient methods for mining domain-specific data is one important way to improve performance. For example, video is a data source that can provide tremendous amounts of spontaneous, weakly labeled faces, but we have not yet explored it completely or applied it to large-scale face recognition. On the other hand, data synthesis is another direction for generating more data. For example, it is very hard to collect data with intra-person age variation manually, so a reliable age-variation generator may help a lot. 3D face reconstruction is also a powerful tool to synthesize data, especially for modeling physical factors.

One of our observations is that the long-tail effect exists in the simple multi-class classification framework. How to use long-tail web-collected data effectively is an interesting issue for the future. Moreover, how to transfer a generic recognition system to a domain-specific application is still an open question.

This report provides our industrial view on face recognition, and we hope our experiments and observations will stimulate discussion in the community, both academic and industrial, and further improve face recognition techniques.

References

[1] T. Berg and P. N. Belhumeur. Tom-vs-Pete classifiers and identity-preserving alignment for face verification. In BMVC, volume 2, page 7. Citeseer, 2012.
[2] X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun. A practical transfer learning algorithm for face verification. In ICCV, pages 3208-3215. IEEE, 2013.
[3] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In CVPR, pages 2707-2714. IEEE, 2010.
[4] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In ECCV, pages 566-579. Springer, 2012.
[5] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, pages 3025-3032. IEEE, 2013.
[6] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643-660, 2001.
[7] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807-813, 2010.
[8] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[9] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. arXiv preprint arXiv:1404.3840, 2014.
[10] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988-1996, 2014.
[11] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In ICCV, pages 1489-1496. IEEE, 2013.
[12] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, pages 1891-1898. IEEE, 2014.
[13] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arXiv preprint arXiv:1412.1265, 2014.
[14] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, pages 1701-1708. IEEE, 2014.
[15] Z. Yan, V. Jagadeesh, D. DeCoste, W. Di, and R. Piramuthu. HD-CNN: Hierarchical deep convolutional neural network for image classification. arXiv preprint arXiv:1410.0736, 2014.
[16] Q. Yin, X. Tang, and J. Sun. An associate-predict model for face recognition. In CVPR, pages 497-504. IEEE, 2011.
[17] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recover canonical-view faces in the wild with deep neural networks. arXiv preprint arXiv:1404.3543, 2014.
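As a concrete illustration of the evaluation criterion used in Section 5.2, the true positive rate at a fixed false positive rate can be computed from pairwise verification scores. The following is a minimal sketch, not the system's actual implementation: the L2 distance matches the similarity measure of Section 3.2, while the function names and the synthetic distance values are illustrative assumptions of ours.

```python
import math

def l2_distance(a, b):
    # Similarity measure from Section 3.2: plain L2 (Euclidean) distance
    # between two face feature vectors; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tpr_at_fpr(genuine, impostor, target_fpr):
    # True positive rate when the acceptance threshold is chosen so that
    # at most `target_fpr` of impostor (different-person) pairs are accepted.
    # `genuine` and `impostor` are lists of pair distances; a pair is
    # declared "same person" when its distance is <= the threshold.
    impostor = sorted(impostor)
    allowed = int(target_fpr * len(impostor))  # impostor accepts permitted
    if allowed == 0:
        threshold = impostor[0] - 1e-9  # strictly below the closest impostor
    else:
        threshold = impostor[allowed - 1]
    return sum(d <= threshold for d in genuine) / len(genuine)

# Toy example with made-up distances: genuine pairs are mostly close,
# impostor pairs mostly far, with some overlap (the "hard cases").
genuine = [0.10, 0.20, 0.45, 0.90]
impostor = [0.40, 0.50, 0.60, 0.70, 0.80, 1.00, 1.10, 1.20, 1.30, 1.40]
print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(tpr_at_fpr(genuine, impostor, 0.1))   # 0.5
print(tpr_at_fpr(genuine, impostor, 0.5))   # 0.75
```

Fixing the threshold from the impostor distribution first and then reading off the genuine acceptance rate mirrors how a figure such as 66% TPR at FP = 10^-5 would be obtained; note that reliably setting a threshold at so small a rate requires on the order of 10^5 or more impostor pairs.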