Human centric visual analysis with deep learning PDF

160 Pages·2020·2.737 MB·English
by  Lin L

Preview Human centric visual analysis with deep learning

Liang Lin · Dongyu Zhang · Ping Luo · Wangmeng Zuo

Human Centric Visual Analysis with Deep Learning

Liang Lin, School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong, China
Dongyu Zhang, School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong, China
Ping Luo, School of Information Engineering, The Chinese University of Hong Kong, Hong Kong, Hong Kong
Wangmeng Zuo, School of Computer Science, Harbin Institute of Technology, Harbin, China

ISBN 978-981-13-2386-7
ISBN 978-981-13-2387-4 (eBook)
https://doi.org/10.1007/978-981-13-2387-4

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword

When Liang asked me to write the foreword to his new book, I was very happy and proud to see the success that he has achieved in recent years. I have known Liang since 2005, when he visited the Department of Statistics of UCLA as a Ph.D. student. Very soon, I was deeply impressed by his enthusiasm and potential in academic research during regular group meetings and his presentations. Since 2010, Liang has been building his own laboratory at Sun Yat-sen University, which is the best university in southern China. I visited him and his research team in the summer of 2010 and spent a wonderful week with them. Over these years, I have witnessed the fantastic success of Liang and his group, who set an extremely high standard. His work on deep structured learning for visual understanding has built his reputation as a well-established professor in computer vision and machine learning.

Specifically, Liang and his team have focused on improving feature representation learning with several interpretable and context-sensitive models and applied them to many computer vision tasks, which is also the focus of this book. On the other hand, he has a particular interest in developing new models, algorithms, and systems for intelligent human-centric analysis while continuing to focus on a series of classical research tasks such as face identification, pedestrian detection in surveillance, and human segmentation. The performance of human-centric analysis has been significantly improved by recently emerging techniques such as very deep neural networks, and new advances in learning and optimization. The research team led by Liang is one of the main contributors in this direction and has received increasing attention from both academia and industry. In sum, Liang and his colleagues did an excellent job with the book, which is the most up-to-date resource you can find and a great introduction to human-centric visual analysis with emerging deep structured learning.
In this book, you will find a wide range of research topics in human-centric visual analysis including both classical (e.g., face detection and alignment) and newly rising topics (e.g., fashion clothing parsing), and a series of state-of-the-art solutions addressing these problems. For example, a newly emerging task, human parsing, namely, decomposing a human image into semantic fashion/body regions, is deeply and comprehensively introduced in this book, and you will find not only the solutions to the real challenges of this problem but also new insights from which more general models or theories for related problems can be derived. To the best of our knowledge, to date, a published systematic tutorial or book targeting this subject is still lacking, and this book will fill that gap.

I believe this book will serve the research community in the following aspects: (1) It provides an overview of the current research in human-centric visual analysis and highlights the progress and difficulties. (2) It includes a tutorial in advanced techniques of deep learning, e.g., several types of neural network architectures, optimization methods, and techniques. (3) It systematically discusses the main human-centric analysis tasks on different levels, ranging from face/human detection and segmentation to parsing and other higher level understanding. (4) It provides effective methods and detailed experimental analysis for every task as well as sufficient references and extensive discussions.

Furthermore, although the substantial content of this book focuses on human-centric visual analysis, it is also enlightening regarding the development of detection, parsing, recognition, and high-level understanding methods for other AI applications such as robotic perception. Additionally, some new advances in deep learning are mentioned.
For example, Liang introduces the Kalman normalization method, which was invented by Liang and his students, for improving and accelerating the training of DNNs, particularly in the context of microbatches.

I believe this book will be very helpful and important to academic professors/students as well as industrial engineers working in the field of vision surveillance, biometrics, and human–computer interaction, where human-centric visual analysis is indispensable in analyzing human identity, pose, attributes, and behaviors. Briefly, this book will not only equip you with the skills to solve the application problems but will also give you a front-row seat to the development of artificial intelligence. Enjoy!

Alan Yuille
Bloomberg Distinguished Professor of Cognitive Science and Computer Science
Johns Hopkins University, Baltimore, Maryland, USA

Preface

Human-centric visual analysis is regarded as one of the most fundamental problems in computer vision, which augments human images in a variety of application fields. Developing solutions for comprehensive human-centric visual applications could have crucial impacts in many industrial application domains such as virtual reality, human–computer interaction, and advanced robotic perception. For example, clothing virtual try-on simulation systems that seamlessly fit various clothes to the human body shape have attracted much commercial interest. In addition, human motion synthesis and prediction can bridge virtual and real worlds, facilitating more intelligent robotic–human interactions by enabling causal inferences for human activities.

Research on human-centric visual analysis is quite challenging. Nevertheless, through the continuous efforts of academic and industrial researchers, continuous progress has been achieved in this field in recent decades.
Recently, deep learning methods have been widely applied to computer vision. The success of deep learning methods can be partly attributed to the emergence of big data, newly proposed network models, and optimization methods. With the development of deep learning, considerable progress has also been achieved in different subtasks of human-centric visual analysis. For example, in facial recognition, the accuracy of the deep model-based method has exceeded the accuracy of humans. Other accurate face detection methods are also based on deep learning models. This progress has spawned many interesting and practical applications, such as face ID in smartphones, which can identify individual users and detect fraudulent authentication based on faces.

In this book, we will provide an in-depth summary of recent progress in human-centric visual analysis based on deep learning methods. The book is organized into five parts. In the first part, Chap. 1 first provides the background of deep learning methods including a short review of the development of artificial neural networks and the backpropagation method to give the reader a better understanding of certain deep learning concepts. We also introduce a new technique for the training of deep neural networks. Subsequently, in Chap. 2, we provide an overview of the tasks and the current progress of human-centric visual analysis.

In the second part, we introduce tasks related to how to localize a person in an image. Specifically, we focus on face detection and pedestrian detection. In Chap. 3, we introduce the facial landmark localization method based on a cascaded fully convolutional network. The proposed method first generates low-resolution response maps to identify approximate landmark locations and then produces fine-grained response maps over local regions for more accurate landmark localization. We then introduce the attention-aware facial hallucination method, which generates a high-resolution facial image from a low-resolution image.
This method recurrently discovers facial parts and enhances them by fully exploiting the global interdependency of facial images. In Chap. 4, we introduce a deep learning model for pedestrian detection based on region proposal networks and boosted forests.

In the third part, several representative human parsing methods are described. In Chap. 5, we first introduce a new benchmark for the human parsing task, followed by a self-supervised structure-sensitive learning method for human parsing. In Chaps. 6–7, instance-level human parsing and video instance-level human parsing methods are introduced.

In the fourth part, person verification and face verification are introduced. In Chap. 8, we describe a cross-modal deep model for person verification. The model accepts different input modalities and produces prediction. In Chap. 9, we introduce a deep learning model for face recognition by exploiting unlabeled data based on active learning.

The last part describes a high-level task and discusses the progress of human activity recognition.

The book is based on our years of research on human-centric visual analysis. Since 2010, with grant support from the National Natural Science Foundation of China (NSFC), we have developed our research plan. Since then, an increasing number of studies have been conducted in this area. We would like to express our gratitude to our colleagues and Ph.D. students, i.e., Prof. Xiaodan Liang, Prof. Guanbin Li, Dr. Pengxu Wei, Dr. Keze Wang, Dr. Tianshui Chen, Dr. Qingxing Cao, Dr. Guangrun Wang, Dr. Lingbo Liu, and Dr. Ziliang Chen, for their contributions to the research achievements on this topic. It has been our great honor to work with them on this inspiring topic in recent years.

Guangzhou, China
Liang Lin

Contents

Part I Motivation and Overview

1 The Foundation and Advances of Deep Learning
  1.1 Neural Networks
    1.1.1 Perceptron
    1.1.2 Multilayer Perceptron
    1.1.3 Formulation of Neural Network
  1.2 New Techniques in Deep Learning
    1.2.1 Batch Normalization
    1.2.2 Batch Kalman Normalization
  References

2 Human-Centric Visual Analysis: Tasks and Progress
  2.1 Face Detection
  2.2 Facial Landmark Localization
    2.2.1 Conventional Approaches
    2.2.2 Deep-Learning-Based Models
  2.3 Pedestrian Detection
    2.3.1 Benchmarks for Pedestrian Detection
    2.3.2 Pedestrian Detection Methods
  2.4 Human Segmentation and Clothes Parsing
  References

Part II Localizing Persons in Images

3 Face Localization and Enhancement
  3.1 Facial Landmark Machines
  3.2 The Cascaded BB-FCN Architecture
    3.2.1 Backbone Network
    3.2.2 Branch Network
    3.2.3 Ground Truth Heat Map Generation
  3.3 Experimental Results
    3.3.1 Datasets
    3.3.2 Evaluation Metric
    3.3.3 Performance Evaluation for Unconstrained Settings
    3.3.4 Comparison with the State of the Art
  3.4 Attention-Aware Face Hallucination
    3.4.1 The Framework of Attention-Aware Face Hallucination
    3.4.2 Recurrent Policy Network
    3.4.3 Local Enhancement Network
    3.4.4 Deep Reinforcement Learning
    3.4.5 Experiments
  References

4 Pedestrian Detection with RPN and Boosted Forest
  4.1 Introduction
  4.2 Approach
    4.2.1 Region Proposal Network for Pedestrian Detection
    4.2.2 Feature Extraction
    4.2.3 Boosted Forest
  4.3 Experiments and Analysis
  References

Part III Parsing Person in Detail

5 Self-supervised Structure-Sensitive Learning for Human Parsing
  5.1 Introduction
  5.2 Look into Person Benchmark
  5.3 Self-supervised Structure-Sensitive Learning
    5.3.1 Self-supervised Structure-Sensitive Loss
    5.3.2 Experimental Result
  References

6 Instance-Level Human Parsing
  6.1 Introduction
  6.2 Related Work
  6.3 Crowd Instance-Level Human Parsing Dataset
    6.3.1 Image Annotation
    6.3.2 Dataset Statistics
  6.4 Part Grouping Network
    6.4.1 PGN Architecture
    6.4.2 Instance Partition Process
