IEEE Transactions on Affective Computing, Volume 6, 2015

Automatic Group Happiness Intensity Analysis

Abhinav Dhall, Member, IEEE, Roland Goecke, Member, IEEE, and Tom Gedeon, Senior Member, IEEE

Abstract—The recent advancement of social media has given users a platform to socially engage and interact with a larger population. Millions of images and videos are being uploaded every day by users on the web from different events and social gatherings. There is an increasing interest in designing systems capable of understanding human manifestations of emotional attributes and affective displays. As images and videos from social events generally contain multiple subjects, it is an essential step to study these groups of people. In this paper, we study the problem of happiness intensity analysis of a group of people in an image using facial expression analysis. A user perception study is conducted to understand various attributes, which affect a person's perception of the happiness intensity of a group. We identify the challenges in developing an automatic mood analysis system and propose three models based on the attributes in the study. An 'in the wild' image-based database is collected. To validate the methods, both quantitative and qualitative experiments are performed and applied to the problem of shot selection, event summarisation and album creation. The experiments show that the global and local attributes defined in the paper provide useful information for theme expression analysis, with results close to human perception results.

Index Terms—Facial expression recognition, group mood, unconstrained conditions.

1 INTRODUCTION

Automatic facial expression analysis has seen much research in recent times. However, little attention has been given to the estimation of the overall expression theme conveyed by a group of people in an image. With the growing popularity of data sharing and broadcasting websites such as YouTube and Flickr, every day users are uploading millions of images and videos of social events such as a party, wedding or a graduation ceremony. Generally, these videos and images were recorded in different conditions and may contain one or more subjects. From a view of automatic emotion analysis, these diverse scenarios have received less attention in the affective computing community.

Consider an illustrative example of inferring the mood of a group of people posing for a group photograph at a school reunion. To scale the current emotion detection algorithms to work on this type of data in the wild, there are several challenges to overcome such as emotion modelling of groups of people, labelled data, and face analysis.

Analysing the theme expression conveyed by groups of people in images is an unexplored problem that has many real-world applications: image search, retrieval, representation and browsing; event summarisation and highlight creation; candid photo shot selection; expression apex detection in video; video thumbnail creation etc. A recent Forbes magazine article [1] discusses the lack of ability of current image search engines to use context. Information such as the mood of a group can be used to model the context. These problems, where group mood information can be utilised, are a motivation for exploring the various group mood models. One basic approach is to average the happiness intensities of all people in a group. However, the perception of the mood of a group is defined by attributes such as where people stand, how much of their face is visible etc. These social attributes play an important role in defining the overall happiness¹ an image conveys.
Expression analysis has been a long studied problem, focussing on inferring the emotional state of a single subject only. This paper discusses the problem of automatic mood analysis of a group of people. Here, we are interested in knowing an individual's intensity of happiness and its contribution to the overall mood of the scene. The contribution towards the theme expression can be affected by the social context. The context can constitute various global and local factors (such as the relative position of the person in the image, their distance from the camera and the level of face occlusion). We model this global and local information based on a group graph, embed these features in our method and pose the problem in a probabilistic graphical model framework based on a relatively weighted soft-assignment.

• A. Dhall and R. Goecke are with the University of Canberra. T. Gedeon is with the Australian National University. E-mail: {abhinav.dhall, tom.gedeon}@anu.edu.au, [email protected]

1. This paper uses the terms 'mood', 'expression' and 'happiness' interchangeably for describing the mood of a group of people in an image.

2 KEY CONTRIBUTIONS

1) An automatic framework for happiness intensity analysis of a group of people in images based on the social context.
2) A weighted model is presented, taking into consideration the global and local attributes that affect the perceived happiness intensity of a group.
3) A labelled 'in the wild' database containing images of groups of people is collected using a semi-automatic process and compared with existing databases.

The remainder of the paper is organised as follows: Section 3 discusses prior work on various aspects of the visual analysis of a group of people. Section 4 describes the problems and challenges involved in automatic group expression analysis. The details of a 149-user survey investigating attributes affecting the perception of mood are discussed in Section 5. An 'in the wild' database collection method is detailed in Section 6. Section 7 discusses a basic model based on averaging. Global context based social features are presented in Section 8. Local context based on occlusion intensity is described in Section 9. The global and local contexts are combined and applied to the averaging approach in Section 10. The manual attributes are combined with data-driven attributes in a supervised hierarchical Bayesian framework in Section 11. Section 12 discusses the results of the proposed frameworks, including both quantitative and qualitative experiments.

3 LITERATURE REVIEW

The analysis of a group of people in an image or a video has recently received much attention in computer vision. Methods can be broadly divided into two categories: a) Bottom-up methods: the subject's attributes are used to infer information at the group level [2], [3], [4]; b) Top-down methods: the group/sub-group information is used as a prior for inference of subject level attributes [5], [6].
3.1 Bottom-up Techniques

Tracking groups of people in a crowd has been of particular interest lately [2]. Based on trajectories constructed from the movement of people, [2] propose a hierarchical clustering algorithm which detects sub-groups in crowd video clips. In an interesting experiment, [3] installed cameras at four locations on the MIT campus and tried to estimate the mood of people looking into the camera and compute a mood map for the campus using the Shore framework [7] for face analysis, which detects multiple faces in a scene in real-time. The framework also generates attributes such as age, gender and pose. In [3], the scene level happiness averages the individual persons' smiles. However, in reality, group emotion is not an averaging model [8], [9]. There are attributes which affect the perception of a group's emotion and the emotion of the group itself. The literature in social psychology suggests that group emotion can be conceptualised in different ways and is best represented by pairing the top-down and bottom-up approaches [8], [9].

In another interesting bottom-up method, [10] proposed group classification for recognising urban tribes (a group of people taking part in a common activity). Low-level features, such as colour histograms, and high-level features, such as age, gender, hair and hat, were used as attributes (using the Face.com API) to learn a Bag-of-Words (BoW)-based classifier. To add the group context, a histogram describing the distance between two faces and the number of overlapping bounding boxes was computed. Fourteen classes depicting various groups, such as 'informal club', 'beach party' and 'hipsters', were used. The experiments showed that a combination of attributes can be used to describe a type of group. In 'Hipster Wars' [11], a framework based on clothes related features was proposed for classifying a group of people based on their social group type.

3.2 Top-down Techniques

In an interesting top-down approach, [5] proposed contextual features based on the group structure for computing the age and gender of individuals. The global attributes described here are similar to [5]'s contextual features of social context. However, the problem in [5] is the inverse of the problem of inferring the mood of a group of people in an image, which is discussed in this paper. Their experiments on images obtained from the web show an impressive increase in performance when the group context is used. In another top-down approach, [6] model the social relationship between people standing together in a group for aiding recognition. The social relationships are inferred in unseen images by learning them from weakly labelled images. A graphical model is built from social relationships, such as 'father-child' and 'mother-child', and social relationship features, such as relative height, height difference and face ratio. In [12], a face discovery method based on exploring social features in social event images is proposed. In object detection and recognition work by [13], scene context information and its relationship with the objects is described. Moreover, [14] acknowledges the benefit of using global spatial constraints for scene analysis. In face recognition [15], social context is employed to model the relationship between people, e.g. between friends on Facebook, using a Conditional Random Field (CRF) [16].

Recently, [17] proposed a framework for selecting candid shots from a video of a single person. A physiological study was conducted, where 150 subjects were shown images of a person. They were asked to rate the attractiveness of the images and mention attributes which influenced their decision. Professional photographers were also asked to label the images. Further, a regression model was learnt based on various attributes, such as eye blink, clarity of face and face pose. A limitation of this approach is that the samples contain a single subject only.

[18] proposed affect based video clip browsing by learning two regression models, predicting valence and arousal values, to describe the affect. The regression models were learnt on an ensemble of audio-video features, such as motion, shot switch rate, frame brightness, pitch, bandwidth, roll off, and spectral flux. However, expression information for individuals or groups in the scenes was not used.

The literature for analysing a single subject's happiness/smile is rich. One prominent approach by [19] proposed a new image-based database labelled for smiling and non-smiling images and evaluated several state-of-the-art methods for smile detection. However, in the existing literature, the faces are considered independent of each other.
For computing the contribution of each subject, two types of factors affect group level emotion analysis: (1) Local factors (individual subject level): age, gender, face visibility, face pose, eye blink etc. (2) Global factors: where do people stand, with whom do people stand etc. In this paper, the focus is on face visibility, smile intensity, relative face size and relative face distance. Labelled images containing groups of people are required, which we collect from Flickr.

4 CHALLENGES

The following subsections discuss the challenges in creating an automatic system for happiness intensity inference.

4.1 Attributes

Human perception of the mood of a group of people is very subjective. [9] argue that the mood of a group is composed of two broad categories of components: top-down and bottom-up. Top-down is the affective context, i.e. attributes such as group history, background, social event etc., which have an effect on the group members. For example, a group of people laughing at a party displays happiness in a different way than a group of people in an office meeting room. From an image perspective, this means that the scene/background information can be used as affective context. The bottom-up component deals with the subjects in the group in terms of attributes of individuals that affect the perception of the group's mood. It defines the contribution of individuals to the overall group mood.

From now on, the top-down component is referred to as 'global context' and the bottom-up component as 'local context'. There can be various attributes which define these two components. For example, global context contains, but is not limited to, scene information, social event information, who is standing with whom, and where people are standing in an image and w.r.t. the camera. Local context, i.e. subject specific attributes, covers an individual's mood/emotion, face visibility, face size w.r.t. neighbours, age, gender, head pose and eye blink. To further understand these attributes, a perception user study is performed. The study and its results are detailed in Section 5.

4.2 Data and Labelling

Data simulating 'in the wild' conditions is a major challenge for making emotion recognition methods work in real-world conditions. Generally, emotion analysis databases are lab-recorded and contain a single subject in an image or video. It is easy to ask people to pose in a laboratory, but acted expressions are very different from spontaneous ones. Anyone working in emotion analysis will attest to the difficulty of collecting spontaneous data in real-world conditions. For learning and testing an emotion analysis system, labelled data containing groups of people in different social scenarios is required. Once the data is available, the next task is labelling. According to [20], moods are low-intensity, diffuse affective states that usually do not have a clear antecedent. Mood can be positive/pleasant and negative/unpleasant. The type of labelling (discrete or continuous) required is problem dependent. The database proposed in this paper – HAPPEI – is labelled for neutral to pleasant mood with discrete levels.

4.3 Visual Analysis

Inferring the group mood involves classic computer vision tasks. As a pre-processing step, the first challenge is face and facial landmark detection. Ideally, for a facial dynamics analysis system, one will want a subject independent facial landmark detector [21]. Further, [21] argue that a subject dependent facial parts method, such as Active Appearance Models (AAM) [22], performs better than subject independent Constrained Local Models (CLM) [23]. However, if a proper descriptor is used on top of previously aligned faces from a subject independent detector, the alignment error can be compensated for. Moreover, [24] show the effectiveness of the Mixture of Pictorial Structures [24] over CLM and AAM for facial landmark localisation when there is much head movement. Motivated by these arguments, the parts based model of [25]² is used here (Section 12). The images in the new database HAPPEI were downloaded from Flickr, creating the challenge of real-world varied illumination scenarios and differences in face resolution. To overcome these, LPQ and PHOG are used, as LPQ is robust to varied illumination and PHOG is robust to scale [26].

2. [24] report that their method works better than [25]; however, due to its lower computational complexity, [25] is used here.
5 SURVEY

To understand the attributes (Section 4.1) affecting the perception of the group mood, a user study was conducted. Two sets of surveys were developed. In the first part (Figure 1), subjects were asked to compare two images for their apparent mood and rate the one with a higher positive mood. Further, they were asked various questions about the attributes/reasons which made them choose a specific image/group out of the two images/groups. A total of 149 subjects participated in this survey (94 males / 55 females). There are a total of three cases in the first survey. Figure 1 shows the questions in one of the cases. Cases 1, 2 and 3 in Figure 2 describe the analysis of the responses of the participants for the three cases in the survey. On the left of the figures, the two images to be compared are displayed.

Fig. 1. Screenshot of a case in the user survey (Section 5).

The images in the survey were chosen on the basis of two criteria: (1) to validate the hypothesis that adding an occlusion attribute to the model decreased the error (user perception vs. model output), which was noticed in the earlier experiments in [27]. Therefore, in one case, two images shot in succession were chosen, in which one of the subjects covered his face in the first shot (Figure 2 Case 1). It is interesting to note that a larger number of survey participants (69.0%) chose Image B in Figure 2 Case 1 as having a more positive mood score on the scale of neutral towards thrilled. Out of these 69.0%, 51.1% chose 'faces being less occluded' as one of the reasons. The other dominating attribute for their decision was the larger number of people smiling (54.6%). Both attributes are correlated; it is easier to infer the expression of a person when the face is clearly visible.

(2) In the other case (Figure 2 Case 2), two images were randomly chosen. 76.0% of participants ranked Image A higher than Image B. 46.0% of participants chose 'larger number of people smiling' as the most dominant attribute (49.1% for participants who selected Image A and 37.1% for participants who selected Image B). 46.4% of participants who chose Image A selected 'large smiles of people in the center of the group'. Also, 56.0% of the participants chose the presence of a pleasant background as the reason for their selection. In the question 'Any other reason for your choice?', participants pointed to the context, background, togetherness of the group, and party-like scenario in Image A (Figure 2 Case 2), which made them consider the group in Image A to be happier. Other considerable responses were body pose, people closer to each other in the group and spontaneous facial expressions (i.e. when subjects are not explicitly posing in front of the camera). For the questions 'Please define the scene in one word in Image A' and 'Please define the scene in one word in Image B', the majority of the participants defined the scene in the context of mood such as 'relaxed', 'pleasant', 'interesting', 'happening', or 'enjoyable'.
In Figure 2 Case 3, 41.0% of the survey participants chose 'the large size of face(s) with smiles in the image'. For the question 'What are the dominating attributes/characteristics of the leader(s) in the group that affect the group's mood?', the participants mentioned the large size of face(s) with smiles in the image, the centre location of the subject, large smiles, and people standing closer. Based on the user survey, the attributes which are discussed further in the following sections are the relative location of members of a group, relative face size to estimate if a person is in the front or back, face visibility/occlusion and the mood of a member. In a recent study by Jiang et al. [28], the authors conducted an eye tracking based study to locate the objects and attributes which are salient in group/crowd images. Their saliency related attributes are similar to our findings. Based on observations from the fixations, [28] proposed high-level features related to attributes such as face size, face occlusion and pose.

Fig. 2. Results of the analysis of some cases in the survey (Section 5): (a) Case 1, (b) Case 2, (c) Case 3.

6 DATABASE CREATION AND LABELLING

Popular facial expression databases, such as CK+ [29], FEEDTUM [30] and MultiPIE [31], are all 'individual' centric databases, i.e. they contain a single subject only in any particular image or video. For the problem in this work, there are various databases which are partially relevant [19], [4]. In an interesting work, [19] proposed the GENKI database. It was collected from Google images shot in unconstrained conditions containing a single subject per image, smiling or non-smiling. However, it does not fulfill our requirements as the intensity level of happiness is not labelled at both image and face level. [4] propose a dynamic facial expressions database – Acted Facial Expressions In The Wild (AFEW) – collected from movies. It contains both single and multiple subject videos. However, there are no intensity labels present and multiple subject video clips are few in number. Therefore, a new labelled database for image-based group mood analysis is required.

Fig. 3. The HAPPEI database: (a) a collage of sample images in HAPPEI; (b) sample face level happiness intensity labels in HAPPEI.

Databases such as GENKI and AFEW have been compiled using semi-automatic approaches. [19] used Google images based on a keyword search for finding relevant images. [4] used a recommender system based approach where a system suggested video clips to the labellers based on emotion related keywords in closed caption subtitles.
The LabelMe [32] based Bonn annotation tool This makes the process of database creation and labelling [33] was used for labelling. It is interesting to note that easier and less time consuming. Inspired by [4], [19], a ideally one would like to infer the mood of a group by the semi-automatic approach is followed. Web based photo means of self-rating along with the perception of mood of sharing websites such as Flickr and Facebook contain thegroup.Inthiswork,noself-ratingwasconductedasthe billionsofimages.Fromaresearchperspective,notonlyare data was collected from the internet. In this database, the these a huge repository of images but also come with rich labelsarebasedontheperceptionofthelabellers.Onecan associated labels, which contain very useful information see this work as a stepping stone to group mood analysis. describing the scene in the images. The aim of the models (Sections 7, 10 and 11.1) proposed We collected a labelled ‘in the wild’ database – called inthisworkistoinfertheperceivedgroupmoodasclosely HAPpyPEopleImages(HAPPEI)–fromFlickrcontaining as possible to human observers. 4886 images. A Matlab based program was developed to automaticallysearchanddownloadimages,whichhadkey- 7 GROUP EXPRESSION MODEL words associated with groups of people and events. A total Given an image I containing a group of people G of size of 40 keywords were used (e.g. ‘party + people’, ‘group s and their happiness intensity level I , a simple Group H + photo’, ‘graduation + ceremony’, ‘marriage’, ‘bar’, ‘re- Expression Model (GEM) can be formulated as an average union’, ‘function’, ‘convocation’). After downloading the of the happiness intensities of all faces in the group images, a Viola-Jones object detector trained on different (cid:80) I data(frontalandposemodelsinOpenCV)wasexecutedon GEM= i Hi . (1) the images. Only images containing more than one subject s werekept.Falsedetectionsweremanuallyremoved.Figure In this simple formulation, both global information, e.g. 3(a) shows a collage of images from the database. the relative position of people in the image, and local All images were annotated with a group level mood information, e.g. the level of occlusion of a face, are intensity (‘neutral’ to ‘thrilled’). Moreover, in the 4886 ignored. In order to add the bottom-up and top-down com- images, 8500 faces were manually annotated for face level ponents(Section4.1),itisproposedheretoaddthesesocial happiness intensity, occlusion intensity and pose by four context features as weights to the process of determining humanlabelers,whoannotateddifferentimages.Themood the happiness intensity of a group image. The experiments wasrepresentedbythehappinessintensitycorrespondingto (Section 12) on HAPPEI confirm the positive effect of six stages of happiness: Neutral, Small Smile, Large Smile, addingsocialfeatureweightstoGEM.Inthenextsection, SmallLaugh,LargeLaughandThrilled (Figure3(b)).Asa methods for computing the global context are discussed. referenceduringlabelling,whentheteethofamemberofa group were visible, the happiness intensity was labelled as 8 GLOBAL CONTEXT a laugh. When the mouth was open wide, a Thrilled label Barsade and Gibson [8] as well as Kelley and Barsade wasassigned.Afacewithaclosedmouthwasassignedthe [9] emphasise the contribution of the top-down component IEEE Transactions on Affective Computing Volume:6; Year2015 A A Fig.4. 
Fig. 4. Top left: image with mood intensity score = 70. Top right: min-span tree depicting the connections between faces. Bottom left: happiness intensity heat map generated using the GEM model (mood intensity score = 81). Bottom right: happiness intensity heat map with social context; the contribution of the faces with respect to their neighbours (F2 and F4) towards the overall intensity of the group is weighted (mood intensity score = 71.4). Adding the global context feature reduces the error.

8 GLOBAL CONTEXT

Barsade and Gibson [8] as well as Kelley and Barsade [9] emphasise the contribution of the top-down component to the perception of the happiness intensity of a group. Here, the top-down contribution represents the effect of the group on a subject. Furthermore, in the survey (Section 5), participants mentioned attributes such as the location and face size of subjects in a group. In this section, we formulate the weights describing the top-down component. The tip of the nose p_i is considered as the position of a face f_i in the image. To map the global structure of the group, a fully connected graph G = (V, E) is constructed. Here, a vertex V_i ∈ V represents a face in the group and each edge (V_i, V_m) ∈ E represents the link between two faces. The weight w(V_i, V_m) is the Euclidean distance between p_i and p_m. Prim's minimal spanning tree algorithm [34] is computed on G, which provides information about the relative position of people in the group with respect to their neighbours. In Figure 4, the min-span tree of the group graph is shown.

Once the location and minimally connected neighbours of a face are known, the relative size of a face f_i with respect to its neighbours is calculated. The size of a face is taken as the distance between the locations of the eyes (intraocular distance), d_i = ||l − r||. The relative face size θ_i of f_i is then given by

  θ_i = d_i / ( Σ_i d_i / n )   (2)

where the term Σ_i d_i / n is the mean face size in a region r around face f_i, with r containing a total of n faces including f_i. Generally speaking, the faces which have a larger size in a group photo belong to the people who are standing closer to the camera. Here, it is assumed that the expression intensity of their faces contributes more to the perceived group mood intensity than the faces of people standing in the back. Eichner and Ferrari [35] made a similar assumption to find if a person is standing in the foreground or at the back in a multiple people pose detection scenario.

Based on the centre locations p_i of all faces in a group G, the centroid c_g of G is computed. The relative distance δ_i of each face f_i is

  δ_i = ||p_i − c_g|| .   (3)

δ_i is further normalised based on the mean relative distance. Faces closer to the centroid are given a higher weighting than faces further away. Using Equations 2 and 3, a global weight is assigned to each face in the group:

  ψ_i = ||1 − α δ_i|| · θ_i / (2β − 1)   (4)

where the parameters α and β control the effect of these weight factors on the global weight. Figure 5 demonstrates the effect of the global context on the overall output of GEM.
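A minimal sketch of the global-context computation of Eqs. 2–4 is given below. It assumes the nose-tip positions and intraocular distances are already available from the landmark detector; for brevity, the whole group is used as the neighbourhood r when computing the mean face size (rather than the min-span-tree neighbours), and the parameter values follow Section 12.2.

```python
# Sketch of the global weights of Section 8 (Eqs. 2-4); not the authors' code.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def global_weights(positions, eye_dists, alpha=0.3, beta=1.1):
    """positions: (s, 2) nose-tip locations p_i; eye_dists: (s,) intraocular distances d_i."""
    positions = np.asarray(positions, dtype=float)
    eye_dists = np.asarray(eye_dists, dtype=float)

    # Fully connected group graph and its minimum spanning tree, mirroring the
    # group-graph construction (this sketch does not restrict the neighbourhood
    # to the MST neighbours when averaging the face size).
    mst = minimum_spanning_tree(squareform(pdist(positions)))

    # Eq. 2: relative face size theta_i = d_i / mean face size.
    theta = eye_dists / eye_dists.mean()

    # Eq. 3: relative distance to the group centroid, normalised by its mean.
    centroid = positions.mean(axis=0)
    delta = np.linalg.norm(positions - centroid, axis=1)
    delta = delta / delta.mean()

    # Eq. 4: psi_i = |1 - alpha * delta_i| * theta_i / (2*beta - 1).
    psi = np.abs(1.0 - alpha * delta) * theta / (2.0 * beta - 1.0)
    return psi, mst

psi, _ = global_weights([[100, 80], [220, 90], [160, 200]], [42.0, 55.0, 30.0])
print(psi)
```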
Fig. 5. Top left: image with happiness intensity score = 70. Top right: min-span tree showing the connections between faces. Bottom left: happiness intensity heat map. Bottom right: happiness intensity heat map with social context; the contribution of the occluded faces (F2 and F4) towards the overall intensity of the group is penalised.

9 LOCAL CONTEXT

In the previous section, global context features were defined which compute weights on the basis of two factors: (1) where people stand in a group and (2) how far they are away from the camera. The bottom-up component of the framework is now defined in this section. The local context is described in terms of an individual person's level of face visibility and happiness intensity.

Occlusion Intensity Estimate: Occlusion in faces, self-induced (e.g. sunglasses) or due to interaction between people in groups (e.g. one person standing partially in front of another and occluding the face), is a common problem. Lin and Tang [36] introduced an automatic occlusion detection and rectification method for faces via GraphCut-based detection and confidence sampling. They also proposed a face quality model based on global correlation and local patterns to derive occlusion detection and rectification. The presence of occlusion on a face reduces its visibility and, therefore, hampers the clear estimation of facial expressions. It also reduces the face's contribution to the overall expression intensity of a group portrayed in an image. Because of this, the happiness intensity level I_H of a face f_i in a group is penalised if it is (at least partially) occluded. Thus, along with an automatic method for occlusion detection, an estimate of the level of occlusion is required. Unlike [36], it is proposed to learn a mapping model F : X → Y, where X are the descriptors calculated on the faces and Y is the amount of occlusion.

The mapping function F is learnt using the Kernel Partial Least Squares (KPLS) [37] regression framework. The PLS family of methods has recently become very popular in computer vision [38], [39], [40]. In [39] and [40], PLS is used for dimensionality reduction as a prior step before classification. [38] use KPLS based regression for simultaneous dimensionality reduction and age estimation and show that KPLS works well for face analysis when the feature vector is high dimensional. For the occlusion intensity estimation problem, the training set X is a set of input samples x_i of dimension N. Y is the corresponding set of vectors y_i of dimension M. Then, the PLS framework defines a decomposition of the matrices X and Y as

  X = T P^T + E   (5)
  Y = U Q^T + F   (6)

where T and U are the n × p score matrices of the p extracted latent projections. The N × p matrix P and the M × p matrix Q denote the corresponding loading matrices, and the n × N matrix E and the n × M matrix F denote the residual matrices that account for the error made by the projection. The classic NIPALS method [41] is used to solve the optimisation criterion

  [cov(t, u)]^2 = [cov(Xw, Yc)]^2 = max_{|r|=|s|=1} [cov(Xr, Ys)]^2   (7)

where cov(t, u) = t^T u / n is the sample covariance between the score vectors t and u. The score vectors {t_i}_{i=1}^{p} are good predictors of Y, and the inner relation between the score vectors t and u is given by U = TA + H, where A is a p × p diagonal matrix and H is the residual matrix. To perform classification, the regression matrix B is calculated as

  B = X^T U (T^T X X^T U)^{−1} T^T Y .   (8)

For a given test sample matrix X_test, the estimated label matrix Ŷ is given by

  Ŷ = X_test B .   (9)

For a detailed description, readers may refer to [37]. Now, for a non-linear mapping, the kernel trick can be applied to the PLS method. X is substituted with Φ = [Φ(x_1), ..., Φ(x_n)]^T, which maps the input data to a higher-dimensional space. The kernel matrix is then defined by the Gram matrix K = ΦΦ^T, in which the kernel function defines each element K_{i,j} = k(x_i, x_j). Therefore, Equations 8 and 9 can be rewritten as

  Ŷ = K_test R   (10)
  R = U (T^T K U)^{−1} T^T Y   (11)

where K_test = Φ_test Φ^T is the kernel matrix for the test samples.

The input sample vector x_i is a normalised combination of Hue, Saturation and the PHOG [42] for each face. In the training set, X contains both occluded and non-occluded faces, and Y contains the labels identifying the amount of occlusion (where 0 signifies no occlusion). The labels were manually created during the database creation process (Section 6). The output label y_i is used to compute the local weight λ_i, which will penalise I_H for a face f_i in the presence of occlusion. It is defined as

  λ_i = ||1 − γ y_i||   (12)

where γ is the parameter which controls the effect of the local weight.

Happiness Intensity Computation: A regression based mapping function F is learnt using KPLS for regressing the happiness intensity of a subject's face. The input feature vector is the PHOG descriptor computed over aligned faces. As discussed earlier, the advantage of learning via KPLS is that it performs dimensionality reduction and prediction in one step. Moreover, KPLS based classification has been successfully applied to facial action units [43].
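The following sketch illustrates the occlusion-intensity mapping and the local weight of Eq. 12. It is not the authors' implementation: scikit-learn's linear PLSRegression is used here as a simple stand-in for the kernel PLS of [37], and random feature vectors stand in for the normalised Hue/Saturation/PHOG descriptors.

```python
# Sketch of the occlusion-intensity regressor and the local weight of Eq. 12.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def train_occlusion_regressor(X_train, y_train, n_components=18):
    """X_train: (n, N) face descriptors; y_train: (n,) occlusion labels (0 = no occlusion).
    n_components mirrors the 18 latent variables chosen in Section 12.2."""
    pls = PLSRegression(n_components=n_components)  # linear stand-in for KPLS
    pls.fit(X_train, y_train.reshape(-1, 1))
    return pls

def local_weight(occlusion_estimate, gamma=0.1):
    """Eq. 12: lambda_i = |1 - gamma * y_i| penalises occluded faces."""
    return abs(1.0 - gamma * float(occlusion_estimate))

# Toy usage with random descriptors standing in for the real face features.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 50)), rng.integers(0, 3, size=200).astype(float)
model = train_occlusion_regressor(X, y, n_components=10)
y_hat = model.predict(X[:1])[0, 0]
print(local_weight(y_hat))
```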
10 WEIGHTED GROUP EXPRESSION MODEL

The global and local contexts defined in Eqs. 4 and 12 are used to formulate the relative weight for each face f_i as

  π_i = λ_i ψ_i .   (13)

This relative weight is applied to the I_H of each face in the group G and, based on Eqs. 1 and 13, the new weighted GEM is defined as

  GEM_w = ( Σ_i I_{H_i} π_i ) / s .   (14)

This formulation takes into consideration the structure of the group and the local context of the faces in it. The contribution of each face f_i's I_{H_i} towards the overall perception of the group mood is weighted relatively; here, I_{H_i} is the happiness intensity of f_i.
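A compact sketch of Eqs. 13–14, assuming the per-face intensities and the global and local weights have already been computed as described above:

```python
# Sketch of the weighted Group Expression Model (Eqs. 13-14).
import numpy as np

def gem_weighted(intensities, global_w, local_w):
    """intensities, global_w, local_w: arrays of length s (one entry per face)."""
    intensities = np.asarray(intensities, dtype=float)
    pi = np.asarray(global_w, dtype=float) * np.asarray(local_w, dtype=float)  # Eq. 13
    return float((intensities * pi).sum() / len(intensities))                  # Eq. 14

# Example: the partially occluded third face contributes less to the group score.
print(gem_weighted([3.0, 4.0, 2.0], [1.1, 0.9, 1.0], [1.0, 1.0, 0.7]))
```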
11 SOCIAL CONTEXT AS ATTRIBUTES

The social features described above can also be viewed as manually defined attributes. From the survey (Section 5), it is evident that, along with the social context features, there are many others such as age, attractiveness, and gender. The assumptions in GEM_w do not hold true for some scenarios, for example, when a baby is in the lap of the mother. In this case, the system will give a higher weight to the mother and a lower weight to the baby, assuming that the mother is in the front and the baby is in the background. In order to implicitly add the effect of other attributes, a feature augmentation approach is presented next.

Lately, attributes have been very popular in the computer vision community (e.g. [44]). Attributes are defined as high-level, semantically meaningful representations. They have been, for example, successfully applied to object recognition [44], scene analysis [45] and face analysis [46].

Based on the regressed happiness intensities, the attributes are defined as Neutral, Small Smile, Large Smile, Small Laugh, Large Laugh and Thrilled, and for occlusion as Face Visible, Partial Occlusion and High Occlusion. These attributes are computed for each face in the group. Attributes based on global context are Relative Distance and Relative Size. Figure 6 describes the manual attributes for faces in a group.

Fig. 6. Manually defined attributes.

Defining attributes manually is a subjective task, which can result in many important discriminative attributes being ignored. Inspired by [47], low-level feature based attributes are computed. They propose the use of manually defined attributes along with data-driven attributes. Their experiments show a leap in performance for human action recognition based on a combination of manual and data-driven attributes. A weighted bag of visual words based on extracting low-level features is computed. Furthermore, a topic model is learnt using Latent Dirichlet allocation (LDA) [48]. The manually defined and weighted data-driven attributes are combined to form a single feature.

11.1 Augmented Group Expression Model

Topic models, though originally developed for document analysis, have been successfully applied to computer vision problems. One very popular topic modelling technique is LDA [48], a hierarchical Bayesian model, where topic proportions for a document are drawn from a Dirichlet distribution and words in the document are repeatedly sampled from a topic, which itself is drawn from those topic proportions.

[46] introduced people-LDA, where topics were modelled around faces in images along with titles from news. The work of [46] has some similarities to the method proposed here, but the single biggest difference is that [46] create topics around single people rather than around a group of people. The proposed group model creates topics around happiness intensities for a group of people. For learning the topic model, a dictionary is learnt first.

Weighted Soft Assignment: K-means is applied to the image features for defining the visual words. For creating a histogram, each word of a document is assigned to one or more visual words. If the assignment is limited to one word, it is called hard assignment; if multiple words are considered, it is called soft assignment. The drawback of hard assignment is that, if a patch (a face in a group G) in an image is similar to more than one visual word, the multiple association information is lost. Therefore, [49] defined a weighted soft assignment to weight the significance of each visual word towards a patch. For a visual vocabulary of K visual words, a K-dimensional vector T = [t_1 ... t_K], with each component t_k representing the weight of a visual word k in a group G, is defined as

  t_k = Σ_{i}^{N} Σ_{j}^{M_i} (1 / 2^{i−1}) · sim(j, k),   (15)

where M_i represents the number of faces f_j whose i-th nearest neighbour is the visual word k. The measure sim(j, k) represents the similarity between face f_j and the visual word k. It is worth noting that the contribution of each word is dependent on its similarity to a visual word, weighted by the factor 1 / 2^{i−1}.

Relatively weighted soft-assignment: Along with the contribution of each visual word to a group G, it is interesting to add the global attributes as weights here. It is intuitive to note that the weights affect the frequency component of the words, which here represent faces in a group. Therefore, it is similar to applying weights in GEM_w and looking at the contribution based on a neighbourhood analysis of a particular subject under consideration. As the final goal is to understand the contribution of each face f_i towards the happiness intensity of its group G, the relative weight formulated in Eq. 13 is used to define a 'relatively weighted' soft-assignment. Eq. 15 can then be modified as

  t_k = Σ_{i}^{N} Σ_{j}^{M_i} (ψ_j / 2^{i−1}) · sim(j, k) .   (16)

Now, along with weights for each nearest visual word for a patch, another weight term is introduced, which represents the contribution of the patch to the group. These data-driven visual words are appended with the manual attributes. Note that the histogram computed here is influenced by the global attributes of the faces in the group.

The default LDA formulation is an unsupervised Bayesian method. In their recent work, [50] proposed Supervised LDA (SLDA) by adding a response variable for each document. It was shown to perform better for regression and classification tasks. Using a supervised topic model is a natural choice for HAPPEI, as human annotated labels for the happiness intensities at the image level are present. The document corpus is the set of groups G. The word here represents each face in G. The Max Entropy Discriminant LDA (MedLDA) [51] is computed for topic model creation and test label inference. The LDA formulation for groups is referred to as GEM_LDA. In the results section, the average model GEM is compared with the weighted average model GEM_w and the feature augmented topic model GEM_LDA.
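The sketch below illustrates the (relatively) weighted soft-assignment of Eqs. 15–16. The similarity function sim(j, k) is not specified in the text above, so an inverse-distance similarity is assumed here; setting all ψ_j = 1 recovers the plain weighted soft assignment of Eq. 15.

```python
# Sketch of the relatively weighted soft-assignment histogram of Eq. 16.
# Each face descriptor votes for its N nearest visual words; the vote is the
# face-to-word similarity, discounted by 1 / 2^(i-1) for the i-th neighbour
# and scaled by the face's global weight psi_j.
import numpy as np

def weighted_soft_assignment(face_descs, vocab, psi, n_neighbours=3):
    """face_descs: (F, D) descriptors; vocab: (K, D) visual words; psi: (F,) weights."""
    hist = np.zeros(vocab.shape[0])
    for desc, w in zip(face_descs, psi):
        dists = np.linalg.norm(vocab - desc, axis=1)
        sims = 1.0 / (1.0 + dists)                  # one possible choice of sim(j, k)
        order = np.argsort(dists)[:n_neighbours]    # i-th nearest word for this face
        for i, k in enumerate(order, start=1):
            hist[k] += w * sims[k] / 2.0 ** (i - 1)
    return hist

rng = np.random.default_rng(1)
faces, vocab = rng.normal(size=(4, 16)), rng.normal(size=(60, 16))
print(weighted_soft_assignment(faces, vocab, psi=np.ones(4)).shape)  # (60,)
```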
12 EXPERIMENTS

12.1 Face Processing Pipeline

Given an image, Viola-Jones (VJ) object detector [52] models trained on frontal and profile faces are applied to the image. For extracting the fiducial points, the part-based point detector of [25] is applied. The resulting nine points describe the locations of the left and right corners of both eyes, the centre point of the nose, the left and right corners of the nostrils, and the left and right corners of the mouth. For aligning the faces, an affine transform is applied.

As the images were collected from Flickr, containing different scenarios and complex backgrounds, classic face detectors, such as the VJ object detector, result in a fairly high false positive rate (13.6%). To minimise this error, a non-linear binary SVM [53] is trained. The training set contains samples of faces and non-faces. For the face samples, all true positives from the output of the VJ detector applied to 1300 images from the HAPPEI database are selected. For the non-faces, the samples are manually selected from the same VJ output. To create a large number of false positives from real world data, an image set containing monuments, mountains and water scenes (but no persons facing the camera) is constructed. To learn the parameters of the SVM, five-fold cross validation is performed.
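A minimal sketch of the detection-and-cropping part of this pipeline, using OpenCV's bundled Haar-cascade (Viola-Jones) detectors, is shown below; the part-based landmark localisation [25], the affine alignment and the SVM false-positive filter described above are omitted.

```python
# Sketch of the face pre-processing stage of Section 12.1 (detection + cropping only).
import cv2

FRONTAL = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
PROFILE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

def detect_and_crop_faces(image_path, size=(70, 70)):
    """Return a list of 70x70 grayscale face crops found by the VJ detectors."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    boxes = list(FRONTAL.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    boxes += list(PROFILE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    return [cv2.resize(gray[y:y + h, x:x + w], size) for (x, y, w, h) in boxes]

# Usage (hypothetical path):
# crops = detect_and_crop_faces("group_photo.jpg")
```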
12.2 Implementation Details

Given a test image I containing a group G, the faces in the group are first detected and aligned, then cropped to a size of 70 × 70 pixels. For the happiness intensity detection, PHOG features are extracted from the face. Here, the pyramid level is L = 3, the angle range is [0−360] and the bin count is 16. The number of latent variables is chosen as 18 after empirical validation. PHOG is scale invariant. The use of PHOG is motivated by [54], where PHOG performed well for facial expression analysis.

The parameters for MedLDA are α = 0.1 and k = 25 and, for the SVM, fold = 5. 1500 documents are used for training and 500 for testing. The range of labels is the group mood intensity range [0-100] with a step size of 10. For learning the dictionary, the number of words k is empirically set to 60. In Eqs. 4 and 12, the parameters are set to α = 0.3, β = 1.1 and γ = 0.1, which control the effect of the global and local weights.