DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017
Semi-supervised Learning for
Real-world Object Recognition
using Adversarial Autoencoders
SUDHANSHU MITTAL
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Master in Computer Science
Date: December 22, 2017
Supervisor: Prof. Thomas Brox (University of Freiburg), Prof.
Wolfram Burgard (University of Freiburg), Prof. Atsuto Maki (KTH)
Examiner: Prof. Danica Kragic
School of Computer Science and Communication
Abstract
For many real-world applications, labeled data can be costly to obtain.
Semi-supervised learning methods make use of abundantly available
unlabeled data along with a few labeled samples. Most of the latest work
on semi-supervised learning for image classification reports performance
on standard machine learning datasets like MNIST, SVHN, etc. In this
work, we propose a convolutional adversarial autoencoder architecture
for real-world data. We demonstrate the application of this architecture
for semi-supervised object recognition. We show that our approach
can learn from limited labeled data and outperform a fully-supervised
CNN baseline method by about 4% on real-world datasets. We also
achieve competitive performance on the MNIST dataset compared to
state-of-the-art semi-supervised learning techniques. To spur research
in this direction, we compiled two real-world datasets: the Internet (WIS)
dataset and the Real-world (RW) dataset, each consisting of more than 20K
labeled samples of small household objects belonging
to ten classes. We also show a possible application of this method for
online learning in robotics.
Sammanfattning (Summary)
In most real-world applications, it can be costly to obtain labeled
data. Semi-supervised learning methods typically make extensive use
of unlabeled data, supported by a small amount of labeled data. Much
of the recent work on semi-supervised learning methods for image
classification reports performance on standard machine learning
datasets such as MNIST, SVHN, and so on. In this work, we propose a
convolutional adversarial autoencoder architecture for real-world
data. We demonstrate the application of this architecture to
semi-supervised object recognition and show that our approach can
learn from a limited amount of labeled data. We thereby outperform
the fully supervised CNN baseline method by about 4% on real-world
datasets. We also achieve competitive performance on the MNIST
dataset compared to modern semi-supervised learning methods. To
stimulate research in this direction, we compiled two real-world
datasets, the Internet (WIS) and Real-world (RW) datasets, each
consisting of more than 20,000 labeled samples of small household
objects belonging to ten classes. We also show a possible application
of this method to online learning in robotics.
Acknowledgement
I would like to thank my supervisors at the University of Freiburg,
Prof. Thomas Brox and Prof. Wolfram Burgard for giving me this
opportunity to pursue my master thesis at their lab. I greatly appreciate
their constant support, feedback and guidance throughout the thesis
work. I would like to thank my supervisor at KTH, Prof. Atsuto Maki
for supporting this collaboration in all respects and for his meticulous
feedback on scientific writing. I would like to thank Prof. Danica
Kragic Jensfelt for examining the thesis and organizing the public
presentation at KTH. I owe a great debt of gratitude to Andreas Eitel and
Maxim Tatarchenko for being great mentors, for countless discussions,
motivation and guidance.
I had the privilege of discussing and learning from many
exceptional researchers at AIS. Special thanks to Gabriel Oliveira, Ayush
Dewan, Tayyab Naseer, Marcel Binz and Noha Radwan for numerous
interesting discussions. Many thanks to Andreas Eitel, Michael Keser
and Philipp Jund for their technical support. I would like to thank
Andreas Eitel and Prof. Wolfram Burgard for offering me a student job at
AIS which supported me financially throughout my stay in Germany. I
thank Anna Hellberg Gustafsson from KTH for providing me Erasmus+
scholarship for my stay in Germany.
I thank Andreas Eitel, Maxim Tatarchenko and Florian Kraemer for
proofreading the thesis report. This work would not have been possible
without the support of everyone at the AIS group. Special thanks to
Marcus Lundin, Gabriela Zarzar Gandler and Sebastian Zarzar Gandler
for helping me write the Swedish version of the abstract. I thank
everyone who helped me to collect the dataset: Tobias Paxian, Andreas Eitel,
V.K. Mittal, Shashi Kabdal, Himanshu Mittal, Shruti Kabdal, Shuchi
Kabdal, Hannah Rosa Nesswetter, David Czudnochowski, Anand Narayan,
Sophie Ninnemann, Gabriela Zarzar Gandler, Jingwei Zhang, Oier
Mees, Rendani Mbuvha, Ronak Shah, Vishakha Patel, Andy Wachaja
and Federico Boniardi.
Contents
1 Introduction
1.1 Motivation
1.1.1 Ethics, Societal Aspects and Sustainability
1.2 Contributions
1.3 Overview of the Thesis
2 Background
2.1 Artificial Neural Networks
2.1.1 Convolutional Neural Networks
2.2 Deep Generative Models
2.2.1 Autoencoders
2.2.2 Generative Adversarial Network
3 Related Work
3.1 Deep Generative Models
3.1.1 VAE-based Methods
3.1.2 GAN-based Methods
3.1.3 Hybrid Methods
3.1.4 Real-world Applications
4 Methodology
4.1 Adversarial Autoencoders
4.1.1 Motivation
4.1.2 Basic AAE Architecture
4.1.3 Learning Latent Distributions
4.1.4 Semi-supervised AAE
4.1.5 Convolutional Semi-supervised AAE Architecture
5 Experiments and Results
5.1 Datasets
5.1.1 MNIST Dataset
5.1.2 Internet Dataset
5.1.3 Real-world Dataset
5.1.4 Preprocessing
5.2 Learning of the Latent Distribution
5.3 Semi-supervised Classification
5.3.1 Implementation Details
5.3.2 Object Recognition Results
5.4 Online Learning with AAE
5.5 Discussions
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
A Datasets
A.1 Dataset Filtering
A.2 Real-world Dataset: Video Streams
B Architecture Details
B.1 Semi-supervised Convolutional AAE
B.1.1 Adversarial Network: Discriminator
B.1.2 Autoencoder Network
B.1.3 Classification/Adversarial Network: Generator
Chapter 1
Introduction
1.1 Motivation
The idea behind semi-supervised learning for object recognition comes
from the learning ability of human beings. A human child can learn
about objects like animals, toys, etc. from only a few examples. For
example, once a child is shown what a cat looks like, it can thereafter
recognize new types of cats in the world. Human beings do not require
thousands of labeled examples to learn the visual appearance of an
object, and they become better at recognition with subsequent exposure
to other variants of that object.
Image classification is one of the important tasks in the field of
computer vision. This task is highly relevant for various applications
like autonomous driving, service robotics, remote sensing and medical
diagnosis. Most of the latest image classification methods like Deep
Residual Networks [16] require a large collection of manually labeled
images to perform well. Collecting labeled samples can be difficult and
very expensive for specific real-world applications.
One way to tackle this challenge is by leveraging information from
unlabeled data in an unsupervised or semi-supervised manner.
Although image classification in a completely unsupervised manner is
not yet practical for complex distributions like natural images, recent
methods based on neural networks have shown promising results for
semi-supervised learning. In semi-supervised learning methods, we
can make use of unlabeled data for training, typically a small amount
of labeled data with a large amount of unlabeled data. Semi-supervised
methods make use of unlabeled data to better capture the shape of
the underlying data distribution and generalize better to new samples.
In fields like medical science and robotics, it is much easier to obtain
unlabeled data than labeled data. For example,
in robotics, a mobile robot can autonomously interact with the
environment and collect unlabeled data in abundance without any human
supervision. Therefore, semi-supervised learning is very well suited to
fields like robotics.
Several methods have been studied in the literature for
semi-supervised learning. In this work, we focus on techniques based
on generative models. Building scalable generative models to capture
rich distributions such as audio, images or video is one of the
important challenges in machine learning. Until recently, deep generative
models, such as Restricted Boltzmann Machines, Deep Belief Networks
and Deep Boltzmann Machines, were trained primarily by sampling
algorithms. In these sampling-based approaches, the methods become
more imprecise as training progresses. This happens because samples
from the procedures are unable to mix between modes fast enough.
In recent years, several deep generative models, namely the Variational
Autoencoder (VAE) and the Generative Adversarial Network (GAN), have
been developed that can be trained via direct back-propagation and
avoid the difficulties that come with sampling-based training.
Figure 1.1: Examples for each class from the Real-world (RW) dataset:
banana, bottle, bowl, calculator, can, cup, orange, scissors, soccer-ball
and watering-can.
In this work, we explore how well the latest methods based on
deep generative models can be used to recognize objects using
semi-supervised learning methods. We scale one such method called
Adversarial Autoencoders (AAE) for object recognition on real-world image
datasets. Figure 1.1 gives a glimpse of our real-world object dataset.
AAE is a hybrid approach which uses ideas from the Variational
Autoencoder (VAE) and the Generative Adversarial Network (GAN). AAE is a
probabilistic autoencoder that uses an adversarial framework for
variational inference. In a probabilistic autoencoder, the encoder
approximates a posterior distribution, and the decoder is used to stochastically
reconstruct the input data from the latent variables; the resulting model
captures the distribution over images. Latent variables are variables
that are not directly observed but are instead inferred, using a
mathematical model, from other observed variables.
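The encode, sample and decode steps of a probabilistic autoencoder can be illustrated with a minimal sketch. This is not the architecture used in this thesis: it assumes untrained linear layers with random weights, a diagonal Gaussian approximate posterior and Bernoulli outputs, and every name in it (ToyProbabilisticAutoencoder, random_layer, linear) is hypothetical and chosen only for illustration. It also omits training and the adversarial part of an AAE entirely.

```python
import math
import random


def linear(weights, bias, x):
    """Apply an affine map: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_layer(rng, n_out, n_in, scale=0.1):
    """Hypothetical helper: small random weights and zero biases (untrained)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


class ToyProbabilisticAutoencoder:
    """Illustrative only: assumes a Gaussian q(z|x) and a Bernoulli p(x|z)."""

    def __init__(self, n_input, n_latent, seed=0):
        self.rng = random.Random(seed)
        self.enc_mean = random_layer(self.rng, n_latent, n_input)
        self.enc_logvar = random_layer(self.rng, n_latent, n_input)
        self.dec = random_layer(self.rng, n_input, n_latent)

    def encode(self, x):
        """Return mean and log-variance of the approximate posterior q(z|x)."""
        return linear(*self.enc_mean, x), linear(*self.enc_logvar, x)

    def sample_latent(self, mean, logvar):
        """Draw z ~ N(mean, exp(logvar)), one latent coordinate at a time."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mean, logvar)]

    def decode(self, z):
        """Return per-pixel Bernoulli probabilities of p(x|z)."""
        return [sigmoid(a) for a in linear(*self.dec, z)]

    def reconstruct(self, x):
        """Stochastic reconstruction: encode, sample z, decode, sample pixels."""
        z = self.sample_latent(*self.encode(x))
        return [1 if self.rng.random() < p else 0 for p in self.decode(z)]


# Usage: reconstruct a tiny binary "image" of six pixels.
model = ToyProbabilisticAutoencoder(n_input=6, n_latent=2)
x = [1, 0, 1, 1, 0, 0]
print(model.reconstruct(x))
```

Because both the latent code and the pixels are sampled, two calls to reconstruct on the same input generally give different outputs. With untrained weights, the outputs are also unrelated to the input.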
Online learning is a related task which is highly relevant for robotics.
For example, in service robotics, every time a new mobile robot is set
up in a new environment, it needs to adapt to the environment and
learn the objects in that environment for an interactive application. The
traditional way is to annotate all the objects manually so that the robot
can recognize and interact with them. Additionally, the variety of objects
also changes dynamically in any given environment. To reduce these expenses, we
can deploy a robot with a semi-supervised learning approach. The
robot's learning model can be initially trained with only a few labeled
instances of the objects, and then the robot can adapt its model to increase
the classification performance over time by collecting more unlabeled
data. In this work, we also show how this semi-supervised learning
method may be used for online learning on real-world data. Since
our real-world data is similar to the data captured by robots, this
method can be readily applied to robotics.
1.1.1 Ethics, Societal Aspects and Sustainability
The contributions of this thesis are primarily technical and concern the
use of deep generative models for semi-supervised object recognition,
although there are many possible applications of object recognition in
general, for example autonomous driving, medical diagnosis, service
robotics, etc.
Some applications of semi-supervised classification can be highly
relevant for society, for example, cancer tumor detection in magnetic
resonance spectroscopic images. Since we all know that cancer is a fatal
disease and more than 10 million people are diagnosed with cancer
every year worldwide, it is one of the main challenges that our society