ebook img

Group Visual Sentiment Analysis PDF

6.4 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Group Visual Sentiment Analysis

Group Visual Sentiment Analysis Zeshan Hussain, Tariq Patanam and Hardie Cate June 6, 2016 7 1 0 Abstract 2 Related Work 2 n a In this paper, we introduce a framework for classi- 2.1 Summary of Previous Work J fying images according to high-level sentiment. We Individualsentimentanalysishasbeenalongstudied 7 subdivide the task into three primary problems: emo- problem. There exists a hosts of methods to extract tion classification on faces, human pose estimation, ] emotion features. Notable among them is the Half- V and 3D estimation and clustering of groups of peo- Octave Multi-Scale Gaussian derivative technique to C ple. Weintroducenovelalgorithmsformatchingbody quickly extract . Employing this method, Jain and . parts to a common individual and clustering people Crowley were able to accurately detect over 90% s in images based on physical location and orientation. c smiles accurately [8] [ Our results outperform several baseline approaches. Discovering groups in images itself is a significant 1 problem. Detecting people around occlusions and v viewpoint changes, and then grouping according to 5 orientations and the poses of the people for multiple 8 1 Introduction groups of people is a highly complex problem that 8 1 hasonlyrecentlybeensolved. Specifically,Choietal. 0 A key problem in computer vision is sentiment anal- describe a model that learns an ensemble of discrim- 1. ysis. Uses include identifying sentiment for social inative interaction patterns to encode the relation- 0 networkenhancementandemotionunderstandingby ships between people in 3D [4]. Their model learns 7 robotic systems. Identifying individual people and a minimization potential function to efficiently map 1 faces has many uses but has been explored exten- out groups of people in images as well as their activ- : v sively. Group sentiment analysis, on the other hand, ity and pose (i.e. standing, facing each other). Prior i has been much less researched. Given an image of a workhasfocusedontheactivityofasinglepersonor X group of people, we may wish to describe the over- pairs of people [9] [6] [11] [10] as well as pose estima- r a all sentiment of the scene. For example, if we have tionofgroups,albeitoperatingundertheassumption an image of students in a classroom, we may wish to that there is only one group in the image [7]. determine the level of human interaction, happiness Therehasalsobeensomeworkdoneanalyzingthe of the students, their degree of focus, and other soft sentiment of groups in images. Dhall et al., for ex- scene characteristics that together describe the over- ample, measure happiness intensity levels in groups all sentiment. To tackle this problem, we propose a of people by utilizing a model that takes advantage multi-label classification system that outputs labels of the global and local attributes of the image [5]. for each of our scene characteristics. To perform this Borth et al. take a more general approach to senti- classification, we localize dense features from faces ment analysis by training several concept classifiers and poses of people as well as spatial relations of onmillionsofimagestakenfromFlickr. Aprediction people in the image. for an arbitrary image is an adjective-noun pair that 1 defines the scene as well as its sentiment [1]. 3.2 Technical Background and Finally, work has been done analyzing crowds of Datasets people by taking a more top-down approach. Using Ourprimarydatasetconsistsof597images,eachcor- behavior priors, Rodriguez et al. track the motion respondingtooneofsixsceneswhichare”busstop,” of crowds. Other methods include taking a bottom- ”cafeteria,””classroom,””conference,””library,”and up approach, where collective motion patterns are ”park.” It is the same dataset used by [4]. We de- learned and used to constrain the motion of individ- fine four scene characteristics, namely level of hu- uals within crowds in a certain test scene [12]. maninteraction, physicalactivity, happiness, andfo- cus, that describe each image. In order to provide 2.2 Improvements on Previous Work a ground truth labeling for image characteristics, we go through all of our images and manually annotate We contribute a novel multi-label group sentiment themonascaleof1to4,with1astheleastand4as classification system that is built by using several the most. For instance, a classroom scene with stu- componentsfrompreviouspapers,includingemotion dents working at their desks would most likely cor- andposeclassificationaswellasgroupstructureand respond to a focus score of 4, whereas an image of orientation prediction. Our method of combining people staring off into space while waiting for a bus thesefeaturestoclassifygroupsentimentisanewap- would likely have a score of 1. proach to sentiment analysis. Additionally, we pro- pose a simple, novel clustering approach to predict 3.3 Technical Solution group structure and orientation. Specifically, we use a heuristic for 3D estimation of the people in the To get a better sense for the problem, we begin with image and then use a variant of k-means to cluster several (naive) baseline approaches that we improve them. upon with more advanced techniques. First, we cre- ate a color histogram for each image that we use as inputfeaturesforanSVM.Eachpixelconsistsofval- 3 Technical Approach ues for 3 color channels and we divide the possible values for channel into 8 bins, so each pixel votes for 3.1 Summary one of 8×8×8 = 512 bins according to the values in its color channels. We then use this histogram of For each input image, I, we perform the following 512 values as input for an SVM. This model ends up feature extractions: overfittingonoursmalldatasetbutperformsslightly better than random on a validation set. 1. Emotion Extraction Asasecondbaseline, weincorporatefeaturesfrom bounding boxes that define the locations of people 2. Poselet Estimation in the images. These bounding boxes come with the dataset which was used in the paper by Choi et 3. Group Structure and Orientation Estimation al. discussed above. We use the coordinates of the bounding boxes directly as features for input to an Supposethateachfeatureextractionresultsinthe SVM. We limit the number of bounding boxes per feature vectors, f , f , f , respectively. Our extrac- image to 15 (since the number of bounding boxes in 1 2 3 tion is done in such a way that for a fixed i=1,2,3, most images were below this threshold) by randomly the number of entries in f is the same for all im- selectingboxesiftherearemorepresentintheimage i ages. This allows us to simply concatenate features and padding our flattened vector of coordinates with from each of the three approaches for use in an SVM zeros if there are fewer boxes in the image. Similar classification. to the above approach, this technique only slightly 2 outperforms a random selection process, with more a large number of torso contained in them, so these detailed results discussed below. boundingboxesprovidetighterrestraints. Wechoose thelargesttorsocontainedinP becausewefindthat thereareoftenalargenumberofsmallerroneoustor- 3.4 Emotion Classification soswhilelargertorso(thatalsofitwithinabounding Our emotion classification pipeline involves several box) are more likely to correspond to actual human stages. torsos. 3.4.1 Face Detection First, we detect faces by leveraging the Poselet de- tector [2] and the Haar Cascade classifier built into OpenCV[3].Foragivenimage,wegetallface-related poselets (effectively bounding boxes within the im- age)fromthePoseletdetectorandthenruntheHaar Cascadeclassifiertodetectfaceswithinthesebound- ingboxes. Weremoveduplicatefacesbycheckingfor overlaps among the face bounding boxes. Unfortunately, the dataset we use contains many peoplewhoarefarawayfromthecameraandsotheir faces are often small and therefore difficult to detect with traditional face detection methods. To address thisissue,wetakeadvantageofthefactthatwehave bounding boxes from people from [4] and that we can identify torsos of people using poselets. This ad- ditional information makes it possible to adjust the settings of the Haar classifier to be lenient and then prune erroneous faces using a novel matching algo- rithm that we describe below. SupposewehaveanimageI. Supposefurtherthat we have sets of bounding boxes in the image for peo- ple, faces, and torsos which we denote P, F, and T, respectively. First,weloopoverF andassigntoeach Figure1: FaceextractionwhendoneonlywithaHaar face f ∈ F a person p ∈ P such that f is contained Cascade face detector (top) vs. first looking for head in p and p is the bounding box of the ones remaining poselets and then detecting faces within them that minimizes the distance between the centers of the top edges of the bounding boxes. This distance is a useful heuristic because faces should be centered At the end of the algorithm, each bounding box and near the top within the bounding box to which corresponding to a person is assigned at most one they belong. face and one torso. Any unassigned bounding box Next, we loop over P from smallest to largest and for a person, face, or torso is discarded. This allows assign to each person p ∈ P a torso t ∈ T such us to concentrate on the most important people in that t is contained in p and t is the largest such the image (i.e., people nearest to and likely facing torso from among the remaining unassigned torsos. the camera). Torsos are potentially useful later on We proceed through P from smallest to largest be- forposeestimationusingsilhouettes. Facesareuseful cause smaller bounding boxes are less likely to have bothforemotionclassificationandfor3Destimation. 3 3.4.2 Emotion Classification Method binningprocessstandardizesthelengthofthefeature vector for each image so that we can run an SVM on Once the faces are detected, we classify them with these features for each of our final sentiment classes. emotion labels. In order to do so, we generate fea- turesforeachfaceusingaMulti-scaleGaussianpyra- 3.5 Pose Estimation midmodeledaftertheapproachfoundin[8].Givena faceimageasinput,weconverttheimagetograyscale As mentioned earlier, the Poselet detector provides anditerativelyperformGaussianconvolutionsonthe a wealth of information relating to people in images. image at varying scales. At each iteration, we com- There are 150 types of poselet descriptors, and each pute first- and second-order image gradients at each one acts as a detector for different body parts (e.g., pixel, then aggregate the means and standard devi- left leg, right arm, head, shoulders, etc.). ations of these gradients in 4x4 windows across the Givenaninputimage,wegetallposeletsasbound- entire image. Each 4x4 window at a given scale pro- ing boxes along with an i.d. and score for each duces 10 features, 5 from the means of the five dif- poselet. The i.d. identifies the type of poselet and ferent gradient types and 5 from the standard de- the score indicates the accuracy with which the im- viations. At each scale, we aggregate features three agewithintheboundingboxmatchesthatparticular timesusingthehalf-octavemethod,andwebuildour poselettype. Wefilteroutallposeletsbelowacertain pyramidtoadepthofthreescales. Atscalei,weper- score threshold. Empirically, we find that a score of form the following convolution pipeline to an input 0.9 provides a reasonable threshold. After filtering image. out these erroneous poselets, we produce a 150-class feature vector for the image by binning poselets ac- cording to their i.d. This gives us a count of the Given: I (x,y,i) 0 number of occurrences of each poselet type. I (x,y,i)=I (x,y,i)∗g(x,y;σ) 1 0 AsdiscussedintheExperimentssection,thesefea- √ I (x,y,i)=I (x,y,i)∗g(x,y; 2σ) ture vectors provide a reasonable estimation of the 2 1 different human poses throughout the image. Pose- I (x,y,i+1)=I (x,y,i) 0 2 lets not only identify body parts, but they also give We use the aggregate of I , I , I at each stage as indications of the shapes of these parts (e.g., elbow 0 1 2 our features. bent vs. elbow straight). These feature vectors from Thoughthistechniqueisoriginallyusedstrictlyfor poselets ultimately prove to have significant predic- detecting smiles, we extend the technique to classify tive power. otheremotionssincethesamefeaturesshouldberel- evant. Using the feature vectors produced from the 3.6 Group Structure and Orientation Multi-scale Gaussian pyramid, we train an SVM on Estimation these features to classify various emotions. We train onadatasetthatcontainsonlyfacesandisannotated Wenowdescribeourapproachforpredictingorienta- with emotions including happy, sad, surprised, fear, tion and group structures in images. First, we com- etc. A more complete discussion can be found in the pare two methods for prediction of orientation. We Experiments section, but ultimately we use a binary then propose a 3D estimation algorithm to estimate happy/neutralclassificationwhentrainingourSVM. the 3D coordinates of people. Then, using both ori- Once we have a prediction for each face, we divide entations and 3D coordinates, we run a variant of the image into 16 evenly spaced windows. Each win- the k-means algorithm to cluster people into groups. dow acts as 2-class histogram to produce a count of Below,wedetailourtwomethodsfororientationpre- the number of smiles and neutral faces for each por- diction. tionoftheimage. Wethenusethis32-elementvector In the first method, we use grab cut to get the as our feature vector for our emotion pipeline. This silhouettes of the people in each image. To estimate 4 relative foreground and background of each person, restriction on singleton clusters because we want to we utilize the matched torso and face per bounding avoidthesituationinwhichallcentersarecoincident box. Specifically, we mark a stripe of pixels in the with people since we do not know the value of k be- face of the person and a rectangular area of pixels forehand. Notethatwedonotusemeanshiftbecause in the torso of a person to be the foreground and wealsodonotknowtheappropriatewindowsizebe- all the pixels outside of the bounding box to be the forehand, and mean shift generally does not perform background. We then extract SIFT features on the as well on small, sparse datasets. We define our po- resultant silhouettes and train an SVM using these tentialfunctionbelowwhichwewanttomaximizeto features to predict orientation. achieve the optimal k. Our second method is slightly simpler. Instead of extractingSIFTfeaturesfromsilhouettes,weextract n n (cid:88) (cid:88) HOG features from the bounding box. The benefit (cid:126)o·(cid:126)c−k∗ d of this second method is that we obtain a denser fea- i i ture representation of the image without losing the where ”global” nature of the image. Because HOG looks at the image more globally, it does better with occlu- k =constant factor sions. For our orientation prediction, we move for- o=orientation vector ward with this second method due to its simplicity, representational efficiency, and time constraints. c=direction vector to cluster center Toestimatethedepthofeachperson,weutilizethe d=distance to cluster center following relationship between depth and size of the face,d= k,wheredisdepth,f isthesizeoftheface This allows us to avoid over-invidualizing the f groupings while only including persons within the a in the image, and k is a constant. Intuitively, as the reasonable distance. face size decreases, the depth of the face increases. Itisimportanttonotethatduetotimeconstraints, Finally, with each person mapped to a 3D coordi- wewereunabletofullyimplementthe3Destimation nate,wecanrunavariantofk-meansconstrainedby andorientationaspectsofourpipeline. Wedid,how- theorientationsofpeopletofindclustersofgroupsin ever, implement most of the pipeline and were able theimage. Ourvariationofk-meansusesanorienta- to produce a very accurate classifier of person ori- tion coefficient which linearly weights the traditional entations. We believe the approach outlined above Euclidean distance between a point and its corre- would identify groups of people adequately enough sponding cluster centroid. For each person with unit to provide additional predictive power on the overall orientation vector θ, we compute the vector θ −φ, problem of sentiment classification. where φ is the unit vector along the direction be- tween the person and the cluster center. The orien- tation coefficient c is computed as a linear function 4 Experiments of θ−φ, with c=0.5 when θ =φ and c=1.5 when θ = −φ. This orientation heuristic favors clusters in As mentioned, getting a grasp of the group senti- which more people in a cluster are facing the clus- ment requires several individualized parts to come ter center, mimicking situations in which people are together. Therefore, as to bring together these in- facing each other during a conversation. dividualized parts, we conducted a series of experi- With this modified distance function in place, we ments. runk-meanswithvaryingvaluesofk. Wethenchoose thevalueofkthatminimizesthesumofmodifieddis- 4.1 Baselines tances between points and their cluster centers while placing a restriction on the number of singleton clus- We start by establishing two baselines. Our first ters (clusters with only one person). We place this baseline is built by extracting color histograms for 5 each images then modeling an SVM using those fea- larger than those from the Kaggle set. The Kag- ture vectors. We achieve 28.3% or about 3% better gle set meanwhile often has pictures with hands than random. Our second baseline is built by build- covering parts of the face. Occlusions, then, are ing a feature vector from the bounding boxes. The particularly challenging to the Multi-Scale Gaussian idea behind this was to form a crude estimation of method. Second, the Gaussian Pyramid algorithm thegroupings. Itperformedrelativelypoorly,achiev- seemsparticularlywell-adjustedtotheintricaciesofa ing an average accuracy of 30%, only 5% better than smile. Whilenotperformingaswellwhenidentifying randomly binning into one of four intensities for a ”happy”verus”neutral”intheKagglesetasitdidon sentiment class. the GENKI set, the algorithm did better within the Kagglesetonthebinaryclassificationof”happy”ver- sus the binary classification of ”sad.” This could be 4.2 Smile Detection because the sadness expressed in the Kaggle dataset A key component of group sentiment is determining was usually visually less expressive than the happi- individualsentiment. Webeginbytestingourimple- ness,andthereforelessdistinctfromtheneutralstate mentation of the Half-Octave Multi-Scale Guassian Pyramidalgorithm[8]ontheGENKI4KDataset[13]. 4.4 Orientation Classification We successfully achieve an accuracy of 83.5% when classifying in binary fashion for smile or neutral, on The next stage in overall group sentiment analysis is par with current methods for smile detection. incorporating the groupings between the individual people. In order to do that, we built a relatively Crowley Us robust SVM model using HOG feature extraction Training error N/A 0.0 trained on the TUD Multiview Pedestrian Dataset, Testing error 0.082 0.165 successfully achieving 90.3% accuracy when classify- ing orientations into one of eight cardinal directions. Table 1: Error comparisons with Crowley paper Suchahighaccuracyrateislikelyduetothefactthat gradientorientationscaneasilydefinetheoverallori- Admittedly, our accuracy is slightly below that of entationofpeople,andthatgradientorientationsare Jain & Crowley as we do not fine-tune the param- the basis of HOG extracted features. eters of the soft-margin SVM as meticulously. Yet as we will see this individualized emotion extraction algorithm plays a key roll in improving overall group sentiment analysis prediction. 4.3 Sad and Happy Detection with the Kaggle Dataset UsingthesameGaussianPyramidalgorithm,wefur- Figure6: AcomparisonofaGENKIhappy,aneutral ther test our ability to extract features and detect Kaggle, and a sad Kaggle image. Note there is little emotions on the Kaggle Facial Expressions Recogni- visual difference between the neutral and sad Kaggle tionChallengedataset. Thedetectionsystemworked images less robustly on this set achieving 71.5% accuracy in classifyingbetweenhappinessandneutralityand56% when classifying between sadness and neutrality. 4.5 Group Sentiment Anlaysis Therearetwoimportantobservationstonotehere. First, the images in the GENKI set are more re- We can then featurize our (detailed in the our ap- fined, featuring almost always only faces that are proach)distinctfeaturesintoanoverallscenefeature 6 vector. Their performance is detailed below. randomselectionwhilewithanintensityclassifierwe achieve an 80.4% improvement over random classi- 4.5.1 Individual Emotion Extraction fication. Thus there is not a strong improvement in usingabinaryclassifieroverthefourdegreeintensity When using an overall facial emotion feature vector classifierwhichmeanstheweaknessesmorelikely,not by, we achieve strong results for happiness and ac- inthegradientsoflabelingbutinthepreviouslymen- tivity. This is expected as our Multi-Scale Gaussian tioned techniques of feature extraction itself. algorithm proves strongest in being able to detect smiles which correlate heavily with the overall hap- 4.6 Final Notes piness of a scene as well as, in the case of the Group Discovery Dataset, with activity. In contrast, over- Unfortunatelywedidnothaveenoughtimetouseour all group ”focus” is probably more correlated with robust orientation detection model to test group dis- posture and ”interaction” more correlated with the covery. Anorderingofourthoughts,however,arede- number and size of the groupings. tailed in the technical approach. In order to achieve evenbetterresultsinthefuture,wewouldlikelyneed 4.5.2 Poselets to incorporate more complex scene information such as objects in the background and scene type. When using only the most identifying poses visible in the scene, weighing them by their score from the poselet detector, we achieve better results only for 5 Conclusion ”activity.” While we expect better results for focus as well, this could very well be an indication that In closing, we summarize a couple of important re- ”focus” is not easily measurable from one frame of a sults. Forourfeatureextractionresults,wehavehigh scene. performance detecting emotion (namely, smiles) as well as on predicting orientation. Using these fea- 4.5.3 Individual Emotion Extraction and tures, we perform reasonably on happiness and ac- Poselets tivity, We see a strong correlation between our ex- tracted emotions with happiness level as well as be- Our final experiment isour true goal: testing ourco- tween poselets and activity level in the image, which hesive method for classifying overall scene sentiment isanintuitiveresult. Inourfurtherwork,wehopeto analysis. For each sentiment we bin into four inten- complete our group estimation pipeline and combine sities and achieve accuracies as follows: these features into our final feature vectors. Adding the group features will most likely increase the ac- 4.5.4 A Binary Classification of Intensities curacy of our system in predicting interaction and focus. Weachieve45%accuracyinclassifyingthehappiness Inthispaper,wecontributeamulti-labelclassifica- of scenes into their right intensity bucket. However, tionsystemthatperformssignificantlyaboverandom classifying other sentiment values were less accurate. chancewithourfeatureextractionmethods. Ourap- To a great degree, this could be because of the fick- proach effectively utilizes previous feature extraction leness of human labeling. It is often hard, even for a methods as well as novel methods. human, to distinguish between a focus intensity of 3 Finally here is a link to our Github repo: and a focus intensity of 4. As such, we ran one final https://github.com/zeshanmh/VisualSentiment experiment,labelingallintensityvaluesof3or4as1 and all intensity values of 1 or 2 as 0. The following are the results for this modified binary classifier. Taking happiness as an example, as it is our most accurateresult,weachievea53.4%improvementover 7 Combined Feature Extraction Results Metric Happiness Activity Interaction Focus Training Er- 0.306 0.285 0.310 0.266 ror Testing 0.558 0.617 0.642 0.667 Error Table 2: Errors for the SVM trained by extracting emotions and poselets Combined Feature Extraction Results under a Binary Classifier Metric Happiness Activity Interaction Focus Training Er- 0.189 0.163 0.237 0.277 ror Testing 0.233 0.242 0.4 0.4167 Error Table 3: Errors for the SVM trained by extracting emotions and poselets but using a binary classifier 6 References IEEE International Workshop on. IEEE. 2005, pp. 65–72. [1] Damian Borth et al. “Large-scale visual sen- [7] Marcin Eichner and Vittorio Ferrari. “We timent ontology and detectors using adjective are family: Joint pose estimation of multiple noun pairs”. In: Proceedings of the 21st ACM persons”. In: Computer Vision–ECCV 2010. international conference on Multimedia. ACM. Springer, 2010, pp. 228–242. 2013, pp. 223–232. [8] Varun Jain and James Crowley. “Smile detec- [2] Lubomir Bourdev and Jitendra Malik. “Pose- tionusingmulti-scalegaussianderivatives”.In: lets: Body part detectors trained using 3d hu- 12th WSEAS International Conference on Sig- man pose annotations”. In: Computer Vision, nalProcessing,RoboticsandAutomation.2013. 2009 IEEE 12th International Conference on. IEEE. 2009, pp. 1365–1372. [9] Ivan Laptev. “On space-time interest points”. In: International Journal of Computer Vision [3] G. Bradski. “opencv”. In: Dr. Dobb’s Journal 64.2-3 (2005), pp. 107–123. of Software Tools (2000). [10] Jingen Liu, Jiebo Luo, and Mubarak Shah. [4] WongunChoietal.“Discoveringgroupsofpeo- “Recognizing realistic actions from videos “in ple in images”. In: Computer Vision–ECCV the wild””. In: Computer Vision and Pattern 2014. Springer, 2014, pp. 417–433. Recognition, 2009. CVPR 2009. IEEE Confer- [5] Abhinav Dhall, Roland Goecke, and Tom ence on. IEEE. 2009, pp. 1996–2003. Gedeon. “Automatic group happiness inten- [11] Juan Carlos Niebles, Hongcheng Wang, and Li sity analysis”. In: Affective Computing, IEEE Fei-Fei. “Unsupervised learning of human ac- Transactions on 6.1 (2015), pp. 13–26. tion categories using spatial-temporal words”. [6] Piotr Doll´ar et al. “Behavior recognition via In: International journal of computer vision sparse spatio-temporal features”. In: Visual 79.3 (2008), pp. 299–318. Surveillance and Performance Evaluation of Tracking and Surveillance, 2005. 2nd Joint 8 [12] Mikel Rodriguez et al. “Data-driven crowd analysis in videos”. In: Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE. 2011, pp. 1235–1242. [13] http://mplab.ucsd.edu.The MPLab GENKI Database, GENKI-4K Subset. 9 Figure 2: Final confusion Figure 3: Final confusion Figure 4: Final confusion Figure 5: Final confusion matrix of interaction pre- matrix of focus prediction matrix of happiness pre- matrix of activity predic- diction diction tion 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.