Combining Language and Vision with a Multimodal Skip-gram Model

Angeliki Lazaridou   Nghia The Pham   Marco Baroni
Center for Mind/Brain Sciences, University of Trento
{angeliki.lazaridou|thenghia.pham|marco.baroni}@unitn.it

Abstract

We extend the SKIP-GRAM model of Mikolov et al. (2013a) by taking visual information into account. Like SKIP-GRAM, our multimodal models (MMSKIP-GRAM) build vector-based word representations by learning to predict linguistic contexts in text corpora. However, for a restricted set of words, the models are also exposed to visual representations of the objects they denote (extracted from natural images), and must predict linguistic and visual features jointly. The MMSKIP-GRAM models achieve good performance on a variety of semantic benchmarks. Moreover, since they propagate visual information to all words, we use them to improve image labeling and retrieval in the zero-shot setup, where the test concepts are never seen during model training. Finally, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.

1 Introduction

Distributional semantic models (DSMs) derive vector-based representations of meaning from patterns of word co-occurrence in corpora. DSMs have been very effectively applied to a variety of semantic tasks (Clark, 2015; Mikolov et al., 2013b; Turney and Pantel, 2010). However, compared to human semantic knowledge, these purely textual models, just like traditional symbolic AI systems (Harnad, 1990; Searle, 1984), are severely impoverished, suffering from a lack of grounding in extra-linguistic modalities (Glenberg and Robertson, 2000). This observation has led to the development of multimodal distributional semantic models (MDSMs) (Bruni et al., 2014; Feng and Lapata, 2010; Silberer and Lapata, 2014), which enrich linguistic vectors with perceptual information, most often in the form of visual features automatically induced from image collections.

MDSMs outperform state-of-the-art text-based approaches, not only in tasks that directly require access to visual knowledge (Bruni et al., 2012), but also on general semantic benchmarks (Bruni et al., 2014; Silberer and Lapata, 2014). However, current MDSMs still have a number of drawbacks. First, they are generally constructed by first separately building linguistic and visual representations of the same concepts, and then merging them. This is obviously very different from how humans learn about concepts, by hearing words in a situated perceptual context. Second, MDSMs assume that both linguistic and visual information is available for all words, with no generalization of knowledge across modalities. Third, because of this latter assumption of full linguistic and visual coverage, current MDSMs, paradoxically, cannot be applied to computer vision tasks such as image labeling or retrieval, since they do not generalize to images or words beyond their training set.

We introduce the multimodal skip-gram models, two new MDSMs that address all the issues above. The models build upon the very effective skip-gram approach of Mikolov et al. (2013a), which constructs vector representations by learning, incrementally, to predict the linguistic contexts in which target words occur in a corpus. In our extension, for a subset of the target words, relevant visual evidence from natural images is presented together with the corpus contexts (just like humans hear words accompanied by concurrent perceptual stimuli). The model must learn to predict these visual representations jointly with the linguistic features. The joint objective encourages the propagation of visual information to representations of words for which no direct visual evidence was available in training. The resulting multimodally-enhanced vectors achieve remarkably good performance both on traditional semantic benchmarks, and in their new application to the "zero-shot" image labeling and retrieval scenario. Very interestingly, indirect visual evidence also affects the representation of abstract words, paving the way to ground-breaking cognitive studies and novel applications in computer vision.
2 Related Work

There is by now a large literature on multimodal distributional semantic models. We focus here on a few representative systems. Bruni et al. (2014) propose a straightforward approach to MDSM induction, where text- and image-based vectors for the same words are constructed independently, and then "mixed" by applying the Singular Value Decomposition to their concatenation. An empirically superior model has been proposed by Silberer and Lapata (2014), who use more advanced visual representations relying on images annotated with high-level "visual attributes", and a multimodal fusion strategy based on stacked autoencoders. Kiela and Bottou (2014) adopt instead a simple concatenation strategy, but obtain empirical improvements by using state-of-the-art convolutional neural networks to extract visual features, and the skip-gram model for text. These and related systems take a two-stage approach to derive multimodal spaces (unimodal induction followed by fusion), and they are only tested on concepts for which both textual and visual labeled training data are available (the pioneering model of Feng and Lapata (2010) did learn from text and images jointly using Topic Models, but was shown to be empirically weak by Bruni et al. (2014)).

Howell et al. (2005) propose an incremental multimodal model based on simple recurrent networks (Elman, 1990), focusing on grounding propagation from early-acquired concrete words to a larger vocabulary. However, they use subject-generated features as a surrogate for realistic perceptual information, and only test the model in small-scale simulations of word learning. Hill and Korhonen (2014), whose evaluation focuses on how perceptual information affects different word classes more or less effectively, similarly to Howell et al., integrate perceptual information in the form of subject-generated features and text from image annotations into a skip-gram model. They inject perceptual information by merging words expressing perceptual features with corpus contexts, which amounts to linguistic-context re-weighting, thus making it impossible to separate the linguistic and perceptual aspects of the induced representation, and to extend the model with non-linguistic features. We use instead authentic image analysis as a proxy for perceptual information, and we design a robust way to incorporate it, easily extendible to other signals, such as feature norm or brain signal vectors (Fyshe et al., 2014).
The recent work on so-called zero-shot learning to address the annotation bottleneck in image labeling (Frome et al., 2013; Lazaridou et al., 2014; Socher et al., 2013) looks at image- and text-based vectors from a different perspective. Instead of combining visual and linguistic information in a common space, it aims at learning a mapping from image- to text-based vectors. The mapping, induced from annotated data, is then used to project images of objects that were not seen during training onto linguistic space, in order to retrieve the nearest word vectors as labels. Multimodal word vectors should be better suited than purely text-based vectors for the task, as their similarity structure should be closer to that of images. However, traditional MDSMs cannot be used in this setting, because they do not cover words for which no manually annotated training images are available, thus defeating the generalizing purpose of zero-shot learning. We will show below that our multimodal vectors, which are not hampered by this restriction, do indeed bring a significant improvement over purely text-based linguistic representations in the zero-shot setup.

Multimodal language-vision spaces have also been developed with the goal of improving caption generation/retrieval and caption-based image retrieval (Karpathy et al., 2014; Kiros et al., 2014; Mao et al., 2014; Socher et al., 2014). These methods rely on necessarily limited collections of captioned images as sources of multimodal evidence, whereas we automatically enrich a very large corpus with images to induce general-purpose multimodal word representations, which could be used as input embeddings in systems specifically tuned to caption processing. Thus, our work is complementary to this line of research.

3 Multimodal Skip-gram Architecture

[Figure 1 (diagram omitted): the target word cat in the context "the cute little cat sat on the mat" is trained both to maximize context prediction and, after being mapped to visual space, to maximize similarity to the visual representation of the denoted object.]
Figure 1: "Cartoon" of MMSKIP-GRAM-B. Linguistic context vectors are actually associated to classes of words in a tree, not single words. SKIP-GRAM is obtained by ignoring the visual objective, MMSKIP-GRAM-A by fixing M^{u→v} to the identity matrix.

3.1 Skip-gram Model

We start by reviewing the standard SKIP-GRAM model of Mikolov et al. (2013a), in the version we use. Given a text corpus, SKIP-GRAM aims at inducing word representations that are good at predicting the context words surrounding a target word. Mathematically, it maximizes the objective function:

    \frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)    (1)

where w_1, w_2, ..., w_T are words in the training corpus and c is the size of the window around target w_t, determining the set of context words to be predicted by the induced representation of w_t. Following Mikolov et al., we implement a subsampling option randomly discarding context words as an inverse function of their frequency, controlled by hyperparameter t. The probability p(w_{t+j}|w_t), the core part of the objective in Equation 1, is given by the softmax:

    p(w_{t+j} \mid w_t) = \frac{e^{u'^{\top}_{w_{t+j}} u_{w_t}}}{\sum_{w'=1}^{W} e^{u'^{\top}_{w'} u_{w_t}}}    (2)

where u_w and u'_w are the context and target vector representations of word w respectively, and W is the size of the vocabulary. Due to the normalization term, Equation 2 requires O(|W|) time complexity. A considerable speedup to O(log |W|) is achieved by using the hierarchical version of Equation 2 (Morin and Bengio, 2005), adopted here.
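As a concrete illustration of Equations 1 and 2, the sketch below computes the skip-gram log-likelihood of a toy corpus with a full softmax; the hierarchical softmax and frequency subsampling used in the actual experiments are omitted, and all identifiers (U_tgt, U_ctx, window, corpus) are ours, not the authors' implementation.

```python
# Minimal sketch of the skip-gram objective (Equations 1-2) with a full softmax.
# U_tgt holds the vectors applied to target words and U_ctx the vectors used to
# score context words, following the roles they play in Equation 2.
import numpy as np

corpus = "the cute little cat sat on the mat".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
dim, window = 300, 5

rng = np.random.default_rng(0)
U_tgt = rng.normal(scale=0.1, size=(len(vocab), dim))
U_ctx = rng.normal(scale=0.1, size=(len(vocab), dim))

def log_p(context_id, target_id):
    """log p(w_{t+j} | w_t) via the softmax of Equation 2."""
    scores = U_ctx @ U_tgt[target_id]      # one score per vocabulary word
    scores -= scores.max()                 # numerical stability
    return scores[context_id] - np.log(np.exp(scores).sum())

def skipgram_objective(tokens):
    """Average log-likelihood over the corpus (Equation 1)."""
    total = 0.0
    for t, w in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                total += log_p(w2i[tokens[t + j]], w2i[w])
    return total / len(tokens)

print(skipgram_objective(corpus))
```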
3.2 Injecting visual knowledge

We now assume that word learning takes place in a situated context, in which, for a subset of the target words, the corpus contexts are accompanied by a visual representation of the concepts they denote (just like in a conversation, where a linguistic utterance will often be produced in a visual scene including some of the word referents). The visual representation is also encoded in a vector (we describe in Section 4 below how we construct it). We thus make the skip-gram "multimodal" by adding a second, visual term to the original linguistic objective, that is, we extend Equation 1 as follows:

    \frac{1}{T}\sum_{t=1}^{T} \left( L_{ling}(w_t) + L_{vision}(w_t) \right)    (3)

where L_{ling}(w_t) is the text-based skip-gram objective \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t), whereas the L_{vision}(w_t) term forces word representations to take visual information into account. Note that if a word w_t is not associated to visual information, as is systematically the case, e.g., for determiners and non-imageable nouns, but also more generally for any word for which no visual data are available, L_{vision}(w_t) is set to 0.

We now propose two variants of the visual objective, resulting in two distinct multi-modal versions of the skip-gram model.

3.3 Multi-modal Skip-gram Model A

One way to force word embeddings to take visual representations into account is to try to directly increase the similarity (expressed, for example, by the cosine) between linguistic and visual representations, thus aligning the dimensions of the linguistic vector with those of the visual one (recall that we are inducing the first, while the second is fixed), and making the linguistic representation of a concept "move" closer to its visual representation. We maximize similarity through a max-margin framework commonly used in models connecting language and vision (Weston et al., 2010; Frome et al., 2013). More precisely, we formulate the visual objective L_{vision}(w_t) as:

    -\sum_{w' \sim P_n(w)} \max\left(0,\ \gamma - \cos(u_{w_t}, v_{w_t}) + \cos(u_{w_t}, v_{w'})\right)    (4)

where the minus sign turns the max-margin loss into a term that can be maximized together with Equation 3, γ is the margin, u_{w_t} is the target multimodally-enhanced word representation we aim to learn, v_{w_t} is the corresponding visual vector (fixed in advance) and v_{w'} ranges over visual representations of words (featured in our image dictionary) randomly sampled from distribution P_n(w_t). These random visual representations act as "negative" samples, encouraging u_{w_t} to be more similar to its own visual representation than to that of other words. The sampling distribution is currently set to uniform, and the number of negative samples is controlled by hyperparameter k.

3.4 Multi-modal Skip-gram Model B

The visual objective in MMSKIP-GRAM-A has the drawback of assuming a direct comparison of linguistic and visual representations, constraining them to be of equal size. MMSKIP-GRAM-B lifts this constraint by including an extra layer mediating between linguistic and visual representations (see Figure 1 for a sketch of MMSKIP-GRAM-B). Learning this layer is equivalent to estimating a cross-modal mapping matrix from linguistic onto visual representations, jointly induced with the linguistic word embeddings. The extension is straightforwardly implemented by substituting, into Equation 4, the word representation u_{w_t} with z_{w_t} = M^{u→v} u_{w_t}, where M^{u→v} is the cross-modal mapping matrix to be induced. To avoid overfitting, we also add an L2 regularization term for M^{u→v} to the overall objective (Equation 3), with its relative importance controlled by hyperparameter λ.
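The following sketch shows how the visual term of Equations 3-4 could be evaluated for a single training word, for both model variants; the gradient updates, the negative-sampling distribution and the interaction with the linguistic objective are simplified, and all identifiers are ours rather than the authors' code.

```python
# Sketch of the visual objective (Equation 4), assuming fixed visual vectors and
# uniform negative sampling. MMSKIP-GRAM-A compares u_w and v_w directly;
# MMSKIP-GRAM-B first maps u_w into visual space with M (z = M u_w).
import numpy as np

rng = np.random.default_rng(0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def l_vision(u_w, v_w, visual_dict, gamma=0.5, k=5, M=None, lam=0.0):
    """Negated max-margin visual objective for one word.

    u_w: linguistic vector being learned; v_w: its fixed visual vector;
    visual_dict: matrix of visual vectors to draw negative samples from;
    M: optional cross-modal map (MMSKIP-GRAM-B); lam: L2 weight for M.
    """
    z = u_w if M is None else M @ u_w                 # identity map = model A
    negatives = visual_dict[rng.choice(len(visual_dict), size=k)]
    hinge = sum(max(0.0, gamma - cos(z, v_w) + cos(z, v_neg))
                for v_neg in negatives)
    reg = lam * np.sum(M ** 2) if M is not None else 0.0
    return -(hinge + reg)                             # maximized jointly with L_ling

# Toy usage: 300-d linguistic space; 300-d (A) or 4096-d (B) visual space.
u = rng.normal(size=300)
vis_A = rng.normal(size=(5100, 300)); vis_B = rng.normal(size=(5100, 4096))
print(l_vision(u, vis_A[0], vis_A, gamma=0.5, k=20))                  # model A
M = rng.normal(scale=0.01, size=(4096, 300))
print(l_vision(u, vis_B[0], vis_B, gamma=0.5, k=5, M=M, lam=1e-4))    # model B
```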
4 Experimental Setup

The parameters of all models are estimated by backpropagation of error via stochastic gradient descent. Our text corpus is a Wikipedia 2009 dump comprising approximately 800M tokens.1 To train the multimodal models, we add visual information for 5,100 words that have an entry in ImageNet (Deng et al., 2009), occur at least 500 times in the corpus and have concreteness score ≥ 0.5 according to Turney et al. (2011). On average, about 5% of the tokens in the text corpus are associated with a visual representation. To construct the visual representation of a word, we sample 100 pictures from its ImageNet entry, and extract a 4096-dimensional vector from each picture using the Caffe toolkit (Jia et al., 2014), together with the pre-trained convolutional neural network of Krizhevsky et al. (2012). The vector corresponds to activation in the top (FC7) layer of the network. Finally, we average the vectors of the 100 pictures associated to each word, deriving 5,100 aggregated visual representations.

1 http://wacky.sslmit.unibo.it

Hyperparameters  For both SKIP-GRAM and the MMSKIP-GRAM models, we fix the hidden layer size to 300. To facilitate comparison between MMSKIP-GRAM-A and MMSKIP-GRAM-B, and since the former requires equal linguistic and visual dimensionality, we keep the first 300 dimensions of the visual vectors. For the linguistic objective, we use hierarchical softmax with a Huffman frequency-based encoding tree, setting the frequency subsampling option to t=0.001 and the window size to c=5, without tuning. The following hyperparameters were tuned on the text9 corpus:2 MMSKIP-GRAM-A: k=20, γ=0.5; MMSKIP-GRAM-B: k=5, γ=0.5, λ=0.0001.

2 http://mattmahoney.net/dc/textdata.html
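A rough sketch of the aggregation step just described follows; the FC7 feature extractor and the ImageNet picture sampler are placeholders (the actual pipeline uses Caffe with the Krizhevsky et al. (2012) network), and the function names are ours.

```python
# Sketch of Section 4's visual-vector construction: average up to 100 FC7
# vectors per word into one 4096-d representation, then keep the first 300
# dimensions for MMSKIP-GRAM-A. `extract_fc7` and `pictures_of` are assumed
# callables standing in for the CNN feature extractor and ImageNet sampler.
from typing import Callable, Dict, Iterable, List
import numpy as np

def aggregate_visual_vectors(
    words: Iterable[str],
    pictures_of: Callable[[str], List[str]],    # word -> sampled image paths
    extract_fc7: Callable[[str], np.ndarray],   # image path -> 4096-d vector
    keep_dims: int = 300,
) -> Dict[str, np.ndarray]:
    aggregated = {}
    for word in words:
        feats = np.stack([extract_fc7(p) for p in pictures_of(word)])
        vec = feats.mean(axis=0)                # average over the pictures
        aggregated[word] = vec[:keep_dims]      # truncation used for model A
    return aggregated
```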
5 Experiments

5.1 Approximating human judgments

Benchmarks  A widely adopted way to test DSMs and their multimodal extensions is to measure how well model-generated scores approximate human similarity judgments about pairs of words. We put together various benchmarks covering diverse aspects of meaning, to gain insights on the effect of perceptual information on different similarity facets. Specifically, we test on general relatedness (MEN, Bruni et al. (2014), 3K pairs), e.g., pickles are related to hamburgers, semantic (≈ taxonomic) similarity (SimLex-999, Hill et al. (2014), 1K pairs; SemSim, Silberer and Lapata (2014), 7.5K pairs), e.g., pickles are similar to onions, as well as visual similarity (VisSim, Silberer and Lapata (2014), same pairs as SemSim with different human ratings), e.g., pickles look like zucchinis.

Alternative Multimodal Models  We compare our models against several recent alternatives. We test the vectors made available by Kiela and Bottou (2014). Similarly to us, they derive textual features with the skip-gram model (from a portion of Wikipedia and the British National Corpus) and use visual representations extracted from the ESP data set (von Ahn and Dabbish, 2004) through a convolutional neural network (Oquab et al., 2014). They concatenate textual and visual features after normalizing to unit length and centering to zero mean. We also test the vectors that performed best in the evaluation of Bruni et al. (2014), based on textual features extracted from a 3B-token corpus and SIFT-based Bag-of-Visual-Words visual features (Sivic and Zisserman, 2003) extracted from the ESP collection. Bruni and colleagues fuse a weighted concatenation of the two components through SVD. We further re-implement both methods with our own textual and visual embeddings as CONCATENATION and SVD (with target dimensionality 300, picked without tuning). Finally, we present for comparison the results on SemSim and VisSim reported by Silberer and Lapata (2014), obtained with a stacked-autoencoders architecture run on textual features extracted from Wikipedia with the Strudel algorithm (Baroni et al., 2010) and attribute-based visual features (Farhadi et al., 2009) extracted from ImageNet.

All benchmarks contain a fair amount of words for which we did not use direct visual evidence. We are interested in assessing the models both in terms of how they fuse linguistic and visual evidence when both are available, and in terms of their robustness in the absence of full visual coverage. We thus evaluate them in two settings. The visual-coverage columns of Table 1 (those on the right) report results on the subsets for which all compared models have access to direct visual information for both words. We further report results on the full sets ("100%" columns of Table 1) for models that can propagate visual information and that, consequently, can meaningfully be tested on words without direct visual representations.

Results  The state-of-the-art visual CNN FEATURES alone perform remarkably well, outperforming the purely textual model (SKIP-GRAM) in two tasks, and achieving the best absolute performance on the visual-coverage subset of SimLex-999. Regarding multimodal fusion (that is, focusing on the visual-coverage subsets), both MMSKIP-GRAM models perform very well, at the top or just below it on all tasks, with comparable results for the two variants. Their performance is also good on the full data sets, where they consistently outperform SKIP-GRAM and SVD (which is much more strongly affected by the lack of complete visual information). They are just a few points below the state-of-the-art MEN correlation (0.8), achieved by Baroni et al. (2014) with a corpus about 3 times larger than ours and extensive tuning. MMSKIP-GRAM-B is close to the state of the art for SimLex-999, reported by the resource creators to be at 0.41 (Hill et al., 2014). Most impressively, MMSKIP-GRAM-A reaches the performance level of the Silberer and Lapata (2014) model on their SemSim and VisSim data sets, despite the fact that the latter has full visual-data coverage and uses attribute-based image representations, requiring supervised learning of attribute classifiers, which achieve performance in the semantic tasks comparable to or higher than that of our CNN features (see Table 3 in Silberer and Lapata (2014)). Finally, while the multimodal models (unsurprisingly) bring about a large performance gain over the purely linguistic model on visual similarity, the improvement is consistently large also for the other benchmarks, confirming that multimodality leads to better semantic models in general, which can help in capturing different types of similarity (general relatedness, strictly taxonomic, perceptual).
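For concreteness, the evaluation behind Table 1 below can be reproduced in a few lines: compute the cosine similarity of each model pair and correlate it with the human ratings via Spearman's ρ. The sketch assumes a word-to-vector dictionary and a list of (word1, word2, rating) triples; the names are ours.

```python
# Sketch of the similarity-benchmark evaluation of Table 1: Spearman correlation
# between model cosines and human ratings, restricted to the pairs the model covers.
from typing import Dict, List, Tuple
import numpy as np
from scipy.stats import spearmanr

def evaluate(vectors: Dict[str, np.ndarray],
             pairs: List[Tuple[str, str, float]]) -> Tuple[float, float]:
    """Return (Spearman rho, coverage) of a model on one benchmark."""
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:
            a, b = vectors[w1], vectors[w2]
            cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            model_scores.append(cosine)
            human_scores.append(rating)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho, len(model_scores) / len(pairs)
```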
Rightcolumns reportcorrelationonvisual-coveragesubsets(percentageoforiginalbenchmarkcoveredbysubsetsonfirst row of respective columns). First block reports results for out-of-the-box models; second block for visual andtextualrepresentationsalone;thirdblockforourimplementationofmultimodalmodels. Target SKIP-GRAM MMSKIP-GRAM-A MMSKIP-GRAM-B donut fridge,diner,candy pizza,sushi,sandwich pizza,sushi,sandwich owl pheasant,woodpecker,squirrel eagle,woodpecker,falcon eagle,falcon,hawk mural sculpture,painting,portrait painting,portrait,sculpture painting,portrait,sculpture tobacco coffee,cigarette,corn cigarette,cigar,corn cigarette,cigar,smoking depth size,bottom,meter sea,underwater,level sea,size,underwater chaos anarchy,despair,demon demon,anarchy,destruction demon,anarchy,shadow Table2: Orderedtop3neighboursofexamplewordsinpurelytextualandmultimodalspaces. Onlydonut andowlweretrainedwithdirectvisualinformation. models pick taxonomically closer neighbours of over the more abstract measurement sense picked concrete objects, since often closely related things by the MMSKIP-GRAM neighbours. For chaos, also look similar (Bruni et al., 2014). In particular, theyrankademon,thatis,aconcreteagentofchaos both multimodal models get rid of squirrels and at the top, and replace the more abstract notion of offer other birds of prey as nearest neighbours. despair with equally gloomy but more imageable No direct visual evidence was used to induce the shadows and destruction (more on abstract words embeddingsoftheremainingwordsinthetable,that below). are thus influenced by vision only by propagation. The subtler but systematic changes we observe in 5.2 Zero-shotimagelabelingandretrieval such cases suggest that this indirect propagation The multimodal representations induced by our is not only non-damaging with respect to purely models should be better suited than purely text- linguistic representations, but actually beneficial. basedvectorstolabelorretrieveimages. Inparticu- For the concrete mural concept, both multimodal lar,giventhatthequantitativeandqualitativeresults models rank paintings and portraits above less collected so far suggest that the models propagate closely related sculptures (they are not a form of visual information across words, we apply them to painting). For tobacco, both models rank cigarettes imagelabelingandretrievalinthechallengingzero- and cigar over coffee, and MMSKIP-GRAM-B shotsetup(seeSection2above).3 avoids the arguably less common “crop” sense cued by corn. The last two examples show how the 3We will refer here, for conciseness’ sake, to image label- multimodal models turn up the embodiment level ing/retrieval,but,asourvisualvectorsareaggregatedrepresen- in their representation of abstract words. For depth, tationsofimages,thetaskswe’remodelingconsist,morepre- their neighbours suggest a concrete marine setup cisely,inlabelingasetofpicturesdenotingthesameobjectand retrievingthecorrespondingsetgiventhenameoftheobject. Setup We take out as test set 25% of the 5.1K P@1 P@2 P@10 P@20 P@50 words we have visual vectors for. The multimodal SKIP-GRAM 1.5 2.6 14.2 23.5 36.1 MMSKIP-GRAM-A 2.1 3.7 16.7 24.6 37.6 models are re-trained without visual vectors for MMSKIP-GRAM-B 2.2 5.1 20.2 28.5 43.5 these words, using the same hyperparameters as above. For both tasks, the search for the correct Table3: Percentageprecision@kresultsinthezero- word label/image is conducted on the whole set of shotimagelabelingtask. 5.1Kword/visualvectors. 
In the image labeling task, given a visual vector representing an image, we map it onto word space, and label the image with the word corresponding to the nearest vector. To perform the vision-to-language mapping, we train a Ridge regression by 5-fold cross-validation on the test set (for SKIP-GRAM only, we also add the remaining 75% of word-image vector pairs used in estimating the multimodal models to the Ridge training data).4

4 We use one fold to tune the Ridge λ, three to estimate the mapping matrix, and test in the last fold. To enforce strict zero-shot conditions, we exclude from the test fold labels occurring in the LSVRC2012 set that was employed to train the CNN of Krizhevsky et al. (2012), which we use to extract visual features.

In the image retrieval task, given a linguistic/multimodal vector, we map it onto visual space, and retrieve the nearest image. For SKIP-GRAM, we use Ridge regression with the same training regime as for the labeling task. For the multimodal models, since maximizing similarity to visual representations is already part of their training objective, we do not fit an extra mapping function. For MMSKIP-GRAM-A, we directly look for nearest neighbours of the learned embeddings in visual space. For MMSKIP-GRAM-B, we use the M^{u→v} mapping function induced while learning word embeddings.

                 P@1   P@2   P@10   P@20   P@50
SKIP-GRAM        1.5   2.6   14.2   23.5   36.1
MMSKIP-GRAM-A    2.1   3.7   16.7   24.6   37.6
MMSKIP-GRAM-B    2.2   5.1   20.2   28.5   43.5

Table 3: Percentage precision@k results in the zero-shot image labeling task.

                 P@1   P@2   P@10   P@20   P@50
SKIP-GRAM        1.9   3.3   11.5   18.5   30.4
MMSKIP-GRAM-A    1.9   3.2   13.9   20.2   33.6
MMSKIP-GRAM-B    1.9   3.8   13.2   22.5   38.3

Table 4: Percentage precision@k results in the zero-shot image retrieval task.

Results  In image labeling (Table 3), SKIP-GRAM is outperformed by both multimodal models, confirming that these models produce vectors that are directly applicable to vision tasks thanks to visual propagation. The most interesting results, however, are achieved in image retrieval (Table 4), which is essentially the task the multimodal models have been implicitly optimized for, so that they could be applied to it without any specific training. The strategy of directly querying for the nearest visual vectors of the MMSKIP-GRAM-A word embeddings works remarkably well, outperforming on the higher ranks SKIP-GRAM, which requires an ad-hoc mapping function. This suggests that the multimodal embeddings we are inducing, while general enough to achieve good performance in the semantic tasks discussed above, encode sufficient visual information for direct application to image analysis tasks. This is especially remarkable because the word vectors we are testing were not matched with visual representations at model training time, and are thus multimodal only by propagation. The best performance is achieved by MMSKIP-GRAM-B, confirming our claim that its M^{u→v} matrix acts as a multimodal mapping function.
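As an illustration of the labeling pipeline just described, the sketch below fits a ridge mapping from visual to word space and scores precision@k over the full candidate set; the cross-validation folds, the LSVRC2012 exclusion and the retrieval direction are omitted, and all names are ours.

```python
# Sketch of zero-shot image labeling (Section 5.2): map held-out visual vectors
# into word space with ridge regression, rank all candidate word vectors by
# cosine, and check whether the gold word is in the top k (as in Table 3).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

def precision_at_k(V_train, U_train, V_test, gold_ids, U_candidates,
                   k=10, alpha=1.0):
    """V_*: visual vectors; U_train: word vectors of the training words;
    U_candidates: word vectors of the full 5.1K search space;
    gold_ids: row index in U_candidates of each test item's correct label."""
    mapper = Ridge(alpha=alpha).fit(V_train, U_train)   # vision -> language map
    predicted = mapper.predict(V_test)                  # mapped test images
    sims = cosine_similarity(predicted, U_candidates)   # rank candidate words
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [gold in row for gold, row in zip(gold_ids, topk)]
    return 100.0 * np.mean(hits)                        # percentage P@k
```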
5.3 Abstract words

We have already seen, through the depth and chaos examples of Table 2, that the indirect influence of visual information has interesting effects on the representation of abstract terms. The latter have received little attention in multimodal semantics, with Hill and Korhonen (2014) concluding that abstract nouns, in particular, do not benefit from propagated perceptual information, and that their representation is even harmed when such information is forced on them (see Figure 4 of their paper). Still, embodied theories of cognition have provided considerable evidence that abstract concepts are also grounded in the senses (Barsalou, 2008; Lakoff and Johnson, 1999). Since the word representations produced by MMSKIP-GRAM-A, including those pertaining to abstract concepts, can be directly used to search for near images in visual space, we decided to verify, experimentally, whether these near images (of concrete things) are relevant not only for concrete words, as expected, but also for abstract ones, as predicted by embodied views of meaning.

            global   |words|   unseen   |words|
all         48%      198       30%      127
concrete    73%      99        53%      30
abstract    23%      99        23%      97

Table 5: Subjects' preference for the nearest visual neighbour of words in Kiela et al. (2014) vs. random pictures. The figure of merit is the percentage proportion of significant results in favor of the nearest neighbour across words. Results are reported for the whole set, as well as for words above (concrete) and below (abstract) the concreteness rating median. The unseen column reports results when words exposed to direct visual evidence during training are discarded. The |words| columns report set cardinality.

[Figure 2 (images omitted). Panel words, left to right and top to bottom: freedom, theory, wrong; god, together, place.]
Figure 2: Examples of nearest visual neighbours of some abstract words: on the left, cases where subjects preferred the neighbour to the random foil; on the right, cases where they did not.

More precisely, we focused on the set of 200 words that were sampled across the USF norms concreteness spectrum by Kiela et al. (2014) (2 words had to be excluded for technical reasons). This set includes not only concrete (meat) and abstract (thought) nouns, but also adjectives (boring), verbs (teach), and even grammatical terms (how). Some words in the set have relatively high concreteness ratings, but are not particularly imageable, e.g.: hot, smell, pain, sweet. For each word in the set, we extracted the nearest neighbour picture of its MMSKIP-GRAM-A representation, and matched it with a random picture. The pictures were selected from a set of 5,100, all labeled with distinct words (the picture set includes, for each of the words associated to visual information as described in Section 4, the nearest picture to its aggregated visual representation). Since it is much more common for concrete than abstract words to be directly represented by an image in the picture set, when searching for the nearest neighbour we excluded the picture labeled with the word of interest, if present (e.g., we excluded the picture labeled tree when picking the nearest neighbour of the word tree). We ran a CrowdFlower5 survey in which we presented each test word with the two associated images (randomizing the presentation order of nearest and random picture), and asked subjects which of the two pictures they found more closely related to the word. We collected minimally 20 judgments per word. Subjects showed large agreement (median proportion of majority choice at 90%), confirming that they understood the task and behaved consistently.

5 http://www.crowdflower.com

We quantify performance in terms of the proportion of words for which the number of votes for the nearest neighbour picture is significantly above chance according to a two-tailed binomial test. We set significance at p<0.05 after adjusting all p-values with the Holm correction for running 198 statistical tests.
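The per-word significance analysis just described could be sketched as follows, assuming a list of (votes for nearest neighbour, total votes) counts per word; scipy's binomial test and statsmodels' Holm correction are our stand-ins, since the paper does not specify the tooling.

```python
# Sketch of the significance analysis behind Table 5: a two-tailed binomial test
# per word on the votes for the model-predicted picture, Holm-corrected over
# the 198 tests. `counts` is an assumed input format.
from scipy.stats import binomtest
from statsmodels.stats.multitest import multipletests

def preference_proportion(counts, alpha=0.05):
    """counts: list of (votes_for_nearest, total_votes) pairs, one per word.
    Returns the percentage of words with a significant preference for the
    nearest-neighbour picture (votes above chance, Holm-adjusted)."""
    pvals = [binomtest(k, n, p=0.5, alternative="two-sided").pvalue
             for k, n in counts]
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    above_chance = [k / n > 0.5 for k, n in counts]
    significant = [r and a for r, a in zip(reject, above_chance)]
    return 100.0 * sum(significant) / len(counts)
```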
The results in Table 5 indicate that, in about half the cases, the nearest picture to a word's MMSKIP-GRAM-A representation is meaningfully related to the word. As expected, this is more often the case for concrete than abstract words. Still, we also observe a significant preference for the model-predicted nearest picture for about one fourth of the abstract terms. Whether a word was exposed to direct visual evidence during training of course makes a big difference, and this factor interacts with concreteness, as only two abstract words were matched with images during training.6 When we limit evaluation to word representations that were not exposed to pictures during training, the difference between concrete and abstract terms, while still large, becomes less dramatic than if all words are considered.

6 In both cases, the images actually depict concrete senses of the words: a memory board for memory and a stop sign for stop.

Figure 2 shows four cases in which subjects expressed a strong preference for the nearest visual neighbour of a word. Freedom, god and theory are strikingly in agreement with the view, from embodied theories, that abstract words are grounded in relevant concrete scenes and situations. The together example illustrates how visual data might ground abstract notions in surprising ways. For all these cases, we can borrow what Howell et al. (2005) say about visual propagation to abstract words (p. 260):

    Intuitively, this is something like trying to explain an abstract concept like love to a child by using concrete examples of scenes or situations that are associated with love. The abstract concept is never fully grounded in external reality, but it does inherit some meaning from the more concrete concepts to which it is related.

Of course, not all examples are good: the last column of Figure 2 shows cases with no obvious relation between words and visual neighbours (subjects preferred the random images by a large margin).

The multimodal vectors we induce also display an interesting intrinsic property related to the hypothesis that grounded representations of abstract words are more complex than those of concrete ones, since abstract concepts relate to varied and composite situations (Barsalou and Wiemer-Hastings, 2005). A natural corollary of this idea is that visually-grounded representations of abstract concepts should be more diverse: if you think of dogs, very similar images of specific dogs will come to mind. You can also imagine the abstract notion of freedom, but the nature of the related imagery will be much more varied. Recently, Kiela et al. (2014) have proposed to measure abstractness by exploiting this very same intuition. However, they rely on manual annotation of pictures via Google Images and define an ad-hoc measure of image dispersion. We conjecture that the representations naturally induced by our models display a similar property. In particular, the entropy of our multimodal vectors, being an expression of how varied the information they encode is, should correlate with the degree of abstractness of the corresponding words. As Figure 3(a) shows, there is indeed a difference in entropy between the most concrete (meat) and most abstract (hope) words in the Kiela et al. set.

To test the hypothesis quantitatively, we measure the correlation of entropy and concreteness on the 200 words in the Kiela et al. (2014) set.7

7 Since the vector dimensions range over the real number line, we calculate entropy on vectors that are unit-normed after adding a small constant ensuring all values are positive.
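The entropy measure of footnote 7 and its correlation with concreteness (Figure 3(b)) might be computed roughly as follows; the normalization treats each shifted vector as a distribution, which slightly simplifies the paper's "unit-normed" description, and the epsilon shift and input format are our assumptions.

```python
# Sketch of the entropy-vs-concreteness analysis of Figure 3(b): shift each
# vector to be positive, renormalize it to sum to one, take its Shannon entropy,
# and correlate entropies with concreteness ratings via Spearman's rho.
import numpy as np
from scipy.stats import spearmanr, entropy

def vector_entropy(v, eps=1e-6):
    shifted = v - v.min() + eps           # make all components positive
    p = shifted / shifted.sum()           # treat the vector as a distribution
    return entropy(p)                     # Shannon entropy

def entropy_concreteness_correlation(vectors, concreteness):
    """vectors, concreteness: dicts keyed by the Kiela et al. (2014) words."""
    words = sorted(set(vectors) & set(concreteness))
    ents = [vector_entropy(vectors[w]) for w in words]
    ratings = [concreteness[w] for w in words]
    rho, _ = spearmanr(ents, ratings)
    return rho                            # negative rho: higher entropy for abstract words
```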
    Model             ρ
    WORD FREQUENCY     0.22
    KIELA ET AL.      -0.65
    SKIP-GRAM          0.05
    MMSKIP-GRAM-B      0.04
    MMSKIP-GRAM-A     -0.75
    MMSKIP-GRAM-B*    -0.71

[Figure 3: panel (a), a plot, is omitted; panel (b) is reproduced as the table above.]
Figure 3: (a) Distribution of MMSKIP-GRAM-A vector activation for meat (blue) and hope (red). (b) Spearman ρ between concreteness and various measures on the Kiela et al. (2014) set.

Figure 3(b) shows that the entropies of both the MMSKIP-GRAM-A representations and those generated by mapping MMSKIP-GRAM-B vectors onto visual space (MMSKIP-GRAM-B*) achieve very high correlation (but, interestingly, not MMSKIP-GRAM-B). This is further evidence that multimodal learning is grounding the representations of both concrete and abstract words in meaningful ways.

6 Conclusion

We introduced two multimodal extensions of SKIP-GRAM. MMSKIP-GRAM-A is trained by directly optimizing the similarity of words with their visual representations, thus forcing maximum interaction between the two modalities. MMSKIP-GRAM-B includes an extra mediating layer, acting as a cross-modal mapping component. The ability of the models to integrate and propagate visual information resulted in word representations that performed well in both semantic and vision tasks, and that could be used as input in systems benefiting from prior visual knowledge (e.g., caption generation). Our results with abstract words suggest the models might also help in tasks such as metaphor detection, or even retrieving/generating pictures of abstract concepts. Their incremental nature makes them well-suited for cognitive simulations of grounded language acquisition, an avenue of research we plan to explore further.

Acknowledgments

We thank Adam Liska, Tomas Mikolov, the reviewers and the NIPS 2014 Learning Semantics audience. We were supported by ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).
References

Marco Baroni, Eduard Barbu, Brian Murphy, and Massimo Poesio. 2010. Strudel: A distributional semantic model based on properties and types. Cognitive Science, 34(2):222–254.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, pages 238–247, Baltimore, MD.

Lawrence Barsalou and Katja Wiemer-Hastings. 2005. Situating abstract concepts. In D. Pecher and R. Zwaan, editors, Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thought, pages 129–163. Cambridge University Press, Cambridge, UK.

Lawrence Barsalou. 2008. Grounded cognition. Annual Review of Psychology, 59:617–645.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in Technicolor. In Proceedings of ACL, pages 136–145, Jeju Island, Korea.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Stephen Clark. 2015. Vector space models of lexical meaning. In Shalom Lappin and Chris Fox, editors, Handbook of Contemporary Semantics, 2nd ed. Blackwell, Malden, MA. In press; http://www.cl.cam.ac.uk/~sc609/pubs/sem_handbook.pdf.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of CVPR, pages 248–255, Miami Beach, FL.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Proceedings of CVPR, pages 1778–1785, Miami Beach, FL.

Yansong Feng and Mirella Lapata. 2010. Visual information in semantic representation. In Proceedings of HLT-NAACL, pages 91–99, Los Angeles, CA.

Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Proceedings of NIPS, pages 2121–2129, Lake Tahoe, NV.

Alona Fyshe, Partha P. Talukdar, Brian Murphy, and Tom M. Mitchell. 2014. Interpretable semantic vectors from a joint model of brain- and text-based meaning. In Proceedings of ACL, pages 489–499.

Arthur Glenberg and David Robertson. 2000. Symbol grounding and meaning: A comparison of high-dimensional and embodied theories of meaning. Journal of Memory and Language, 3(43):379–401.

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.

Felix Hill and Anna Korhonen. 2014. Learning abstract concept embeddings from multi-modal data: Since you probably can't see what I mean. In Proceedings of EMNLP, pages 255–265, Doha, Qatar.

Felix Hill, Roi Reichart, and Anna Korhonen. 2014. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. http://arxiv.org/abs/1408.3456.

Steve Howell, Damian Jankowicz, and Suzanna Becker. 2005. A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning. Journal of Memory and Language, 53:258–276.

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

Andrej Karpathy, Armand Joulin, and Li Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Proceedings of NIPS, pages 1097–1105, Montreal, Canada.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of EMNLP, pages 36–45, Doha, Qatar.

Douwe Kiela, Felix Hill, Anna Korhonen, and Stephen Clark. 2014. Improving multi-modal representations using image dispersion: Why less is sometimes more. In Proceedings of ACL, pages 835–841, Baltimore, MD.

Ryan Kiros, Ruslan Salakhutdinov, and Richard Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop, Montreal, Canada. Published online: http://www.dlworkshop.org/accepted-papers.