Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities YadollahYaghoobzadehandHinrichSchu¨tze CenterforInformationandLanguageProcessing LMUMunich,Germany [email protected] Abstract YAGO(Suchaneketal.,2007). Forautomaticup- dating and completing the entity knowledge, text Entities are essential elements of natu- resourcessuchasnews,userforums,textbooksor ral language. In this paper, we present 7 any other data in the form of text are important 1 methodsforlearningmulti-levelrepresen- sources. Therefore, information extraction meth- 0 tations of entities on three complemen- ods have been introduced to extract knowledge 2 tary levels: character (character patterns aboutentitiesfromtext. Inthispaper,wefocuson n in entity names extracted, e.g., by neural a the extraction of entity types, i.e., assigning types J networks), word (embeddings of words to–ortyping–entities. Typeinformationcanhelp 6 in entity names) and entity (entity em- extraction of relations by applying constraints on 1 beddings). We investigate state-of-the- relationarguments. art learning methods on each level and ] We address a problem setting in which the fol- L findlargedifferences,e.g.,fordeeplearn- lowing are given: a KB with a set of entities C ingmodels,traditionalngramfeaturesand E, a set of types T and a membership function . s the subword model of fasttext (Bo- m : E × T (cid:55)→ {0,1} such that m(e,t) = 1 iff c janowski et al., 2016) on the character [ entityehastypet; andalargecorpusC inwhich level; for word2vec (Mikolov et al., 2 mentions of E are annotated. In this setting, we 2013) on the word level; and for the v address the task of fine-grained entity typing: we 5 order-aware model wang2vec (Ling et want to learn a probability function S(e,t) for a 2 al.,2015a)ontheentitylevel. 0 pairofentityeandtypetandbasedonS(e,t)in- 2 Weconfirmexperimentallythateachlevel ferwhetherm(e,t) = 1holds,i.e.,whetherentity 0 of representation contributes complemen- eisamemberoftypet. . 1 taryinformationandajointrepresentation We address this problem by learning a multi- 0 7 of all three levels improves the existing level representation for an entity that contains the 1 embeddingbasedbaselineforfine-grained informationnecessaryfortypingit. Oneimportant : v entity typing by a large margin. Addi- source is the contexts in which the entity is used. Xi tionally, we show that adding information We can take the standard method of learning em- r from entity descriptions further improves beddings for words and extend it to learning em- a multi-levelrepresentationsofentities. beddings for entities. This requires the use of an entity linker and can be implemented by replac- 1 Introduction ing all occurrences of the entity by a unique to- Knowledge about entities is essential for under- ken. Werefertoentityembeddingsasentity-level standing human language. This knowledge can representations. Previously, entity embeddings beattributional(e.g.,canFly,isEdible),type-based havebeenlearnedmostlyusingbag-of-wordmod- (e.g., isFood, isPolitician, isDisease) or relational els like word2vec (e.g., by Wang et al. (2014) (e.g,marriedTo,bornIn). Knowledgebases(KBs) andYaghoobzadehandSchu¨tze(2015)). Weshow are designed to store this information in a struc- below that order information is critical for high- tured way, so that it can be queried easily. Exam- qualityentityembeddings. ples of such KBs are Freebase (Bollacker et al., Entity-level representations are often uninfor- 2008), Wikipedia, Google knowledge graph and mative for rare entities, so that using only entity embeddings is likely to produce poor results. In 2 RelatedWork this paper, we use entity names as a source of in- Entity representation. Two main sources of in- formationthatiscomplementarytoentityembed- formation used for learning entity representation dings. Wedefineanentitynameasanounphrase are: (i)linksanddescriptionsinKB,(ii)nameand thatisusedtorefertoanentity. Welearncharacter contexts in corpora. We focus on name and con- andwordlevelrepresentationsofentitynames. texts in corpora, but we also include (Wikipedia) Forthecharacter-levelrepresentation,weadopt descriptions. Werepresententitiesonthreelevels: different character-level neural network architec- entity, wordandcharacter. Ourentity-levelrepre- tures. Ourintuitionisthatthereissub/crossword sentation is similar to work on relation extraction information, e.g., orthographic patterns, that is (Wang et al., 2014; Wang and Li, 2016), entity helpful to get better entity representations, espe- linking (Yamada et al., 2016; Fang et al., 2016), cially for rare entities. A simple example is that and entity typing (Yaghoobzadeh and Schu¨tze, a three-token sequence containing an initial like 2015). Our word-level representation with distri- “P.”surroundedbytwocapitalizedwords(“Rolph butionalwordembeddingsissimilarlyusedtorep- P.Kugl”)islikelytorefertoaperson. resent entities for entity linking (Sun et al., 2015) We compute the word-level representation as andrelationextraction(Socheretal.,2013; Wang thesumoftheembeddingsofthewordsthatmake etal., 2014). Novelentityrepresentationmethods up the entity name. The sum of the embeddings weintroduceinthispaperarerepresentationbased accumulates evidence for a type/property over all on fasttext (Bojanowski et al., 2016) sub- constituents, e.g., a name containing “stadium”, word embeddings, several character-level repre- “lake”or“cemetery”islikelytorefertoalocation. sentations,“order-aware”entity-levelembeddings In this paper, we compute our word level repre- andthecombinationofseveraldifferentrepresen- sentation with two types of word embeddings: (i) tationsintoonemulti-levelrepresentation. using onlycontextual information ofwords in the corpus,e.g.,byword2vec(Mikolovetal.,2013) Character-subword level neural networks. and (ii) using subword as well as contextual in- Character-level convolutional neural networks formation of words, e.g., by Facebook’s recently (CNNs) are applied by dos Santos and Zadrozny releasedfasttext(Bojanowskietal.,2016). (2014) to part of speech (POS) tagging, by dos In this paper, we integrate character-level and Santos and Guimara˜es (2015), Ma and Hovy word-levelwithentity-levelrepresentationstoim- (2016), and Chiu and Nichols (2016) to named provetheresultsofpreviousworkonfine-grained entity recognition (NER), by Zhang et al. (2015) typingofKBentities. Wealsoshowhowdescrip- and Zhang and LeCun (2015) to sentiment anal- tions of entities in a KB can be a complementary ysis and text categorization, and by Kim et al. source of information to our multi-level represen- (2016) to language modeling (LM). Character- tationtoimprovetheresultsofentitytyping,espe- level LSTM is applied by Ling et al. (2015b) to ciallyforrareentities. LM and POS tagging, by Lample et al. (2016) to Ourmaincontributionsinthispaperare: NER, by Ballesteros et al. (2015) to parsing mor- phologically rich languages, and by Cao and Rei • We propose new methods for learning en- (2016)tolearningwordembeddings. Bojanowski tityrepresentationsonthreelevels: character- et al. (2016) learn word embeddings by repre- level,word-levelandentity-level. senting words with the average of their character • Weshowthattheselevelsarecomplementary ngrams (subwords) embeddings. Similarly, Chen andajointmodelthatusesallthreelevelsim- etal.(2015)extendsword2vecforChinesewith provesthestateoftheartonthetaskoffine- jointmodelingwithcharacters. grainedentitytypingbyalargemargin. Fine-grained entity typing. Our task is to in- ferfine-grainedtypesofKBentities. KBcomple- • Weexperimentallyshowthatanorderdepen- tion is an application of this task. Yaghoobzadeh dent embedding is more informative than its andSchu¨tze(2015)’sFIGMENTsystemaddresses bag-of-word counterpart for entity represen- this task with only contextual information; they tation. donotusecharacter-levelandword-levelfeatures We release our dataset and source codes: of entity names. Neelakantan and Chang (2015) cistern.cis.lmu.de/figment2/. and Xie et al. (2016) also address a similar task, (cid:126)v(e) ∈ Rd to the hidden layer with size h. f is Output Layer (type probabilities) therectifierfunction. W ∈ R|T|×h istheweight out Hidden Layer matrix from hidden layer to output layer of size |T|. σ is the sigmoid function. Our objective is Entity Representation binarycrossentropysummedovertypes: Figure 1: Schematic diagram of our architecture (cid:88) (cid:16) (cid:17) − m logp +(1−m )log(1−p ) t t t t for entity classification. “Entity Representation” t ((cid:126)v(e)) is the (one-level or multi-level) vector rep- resentationofentity. Sizeofoutputlayeris|T|. wherem isthetruthandp theprediction. t t The key difficulty when trying to compute P(t|e)isinlearningagoodrepresentationforen- buttheyrelyonentitydescriptionsinKBs,which tity e. We make use of contexts and name of e to in many settings are not available. The problem represent its feature vector on the three levels of of Fine-grained mention typing (FGMT) (Yosef entity,wordandcharacter. et al., 2012; Ling and Weld, 2012; Yogatama et al., 2015; Del Corro et al., 2015; Shimaoka et 3.1 Entity-levelrepresentation al., 2016; Ren et al., 2016) is related to our task. Distributional representations or embeddings are FGMT classifies single mentions of named enti- commonly used for words. The underlying hy- ties to their context dependent types whereas we pothesis is that words with similar meanings tend attempt to identify all types of a KB entity from to occur in similar contexts (Harris, 1954) and the aggregation of all its mentions. FGMT can therefore cooccurwith similarcontext words. We still be evaluated in our task by aggregating the can extend the distributional hypothesis to enti- mentionleveldecisionsbutaswewillshowinour ties (cf. Wang et al. (2014), Yaghoobzadeh and experimentsforonesystem,i.e.,FIGER(Lingand Schu¨tze (2015)): entities with similar meanings Weld, 2012), our entity embedding based models tend to have similar contexts. Thus, we can learn arebetterinentitytyping. a d dimensional embedding(cid:126)v(e) of entity e from 3 Fine-grainedentitytyping a corpus in which all mentions of the entity have been replaced by a special identifier. We refer to Given(i)aKBwithasetofentitiesE,(ii)asetof these entity vectors as the entity level representa- types T, and (iii) a large corpus C in which men- tion(ELR). tions of E are linked, we address the task of fine- In previous work, order information of context grainedentitytyping(YaghoobzadehandSchu¨tze, words (relative position of words in the contexts) 2015): predict whether entity e is a member of wasgenerallyignoredandobjectivessimilartothe type t or not. To do so, we use a set of training SkipGram (henceforth: SKIP) model were used examples to learn P(t|e): the probability that en- to learn (cid:126)v(e). However, the bag-of-word context tity e has type t. These probabilities can be used is difficult to distinguish for pairs of types like to assign new types to entities covered in the KB (restaurant,food)and(author,book). Thissuggests aswellastypingunknownentities. that using order aware embedding models is im- WelearnP(t|e)withageneralarchitecture;see portant for entities. Therefore, we apply Ling et Figure 1. The output layer has size |T|. Unit t of al.(2015a)’sextendedversionofSKIP,Structured this layer outputs the probability for type t. “En- SKIP(SSKIP).Itincorporatestheorderofcontext tity Representation” ((cid:126)v(e)) is the vector represen- wordsintotheobjective. WecompareitwithSKIP tation of entity e – we will describe in detail in embeddingsinourexperiments. the rest of this section what forms(cid:126)v(e) takes. We model P(t|e) as a multi-label classification, and 3.2 Word-levelrepresentation trainamultilayerperceptron(MLP)withonehid- Words inside entity names are important sources denlayer: of information for typing entities. We define the word-levelrepresentation(WLR)astheaverageof (cid:16) (cid:17) (cid:2) (cid:3) (cid:0) (cid:1) P(t1|e)...P(tT|e) = σ Woutf Win(cid:126)v(e) the embeddings of the words that the entity name (1) contains (cid:126)v(e) = 1/n(cid:80)n (cid:126)v(w ) where (cid:126)v(w ) is i=1 i i where W ∈ Rh×d is the weight matrix from the embedding of the ith word of an entity name in of length n. We opt for simple averaging since PERSON names usually consist of two words, or entity names often consist of a small number of MUSIC namesareusuallylong,containingseveral words with clear semantics. Thus, averaging is a words. promising way of combining the information that Thefirstlayerofthecharacter-levelmodelsisa eachwordcontributes. lookup table that maps each character to an em- The word embedding, w(cid:126), itself can be learned bedding of size d . These embeddings capture c frommodelswithdifferentgranularitylevels. Em- similarities between characters, e.g., similarity in bedding models that consider words as atomic type of phoneme encoded (consonant/vowel) or units in the corpus, e.g., SKIP and SSKIP, are similarity in case (lower/upper). The output of word-level. On the other hand, embedding the lookup layer for an entity name is a matrix models that represent words with their charac- C ∈ Rl×dc where l is the maximum length of a ter ngrams, e.g., fasttext (Bojanowski et al., name and all names are padded to length l. This 2016), are subword-level. Based on this, we con- length l includes special start/end characters that sider and evaluate word-level WLR (WWLR) brackettheentityname. andsubword-levelWLR(SWLR)inthispaper.1 We experiment with four architectures to pro- duce character-level representations in this paper: Character-level Representation FORWARD (direct forwarding of character em- beddings), CNNs, LSTMs and BiLSTMs. The Max Pooling outputofeacharchitecturethentakestheplaceof theentityrepresentation(cid:126)v(e)inFigure1. FORWARD simply concatenates all rows of Convolution layer matrixC;thus,(cid:126)v(e) ∈ Rdc∗l. The CNN uses k filters of different window widths w to narrowly convolve C. For each fil- terH ∈ Rdc×w,theresultoftheconvolutionofH Lookup table layer overmatrixC isfeaturemapf ∈ Rl−w+1: f[i] = rectifier(C (cid:12)H +b) [:,i:i+w−1] Lipofen where rectifier is the activation function, b is the bias,C arethecolumnsitoi+w−1of [:,i:i+w−1] Figure 2: Example architecture for the character- C, 1 ≤ w ≤ 10 are the window widths we con- level CNN with max pooling. The input is sider and (cid:12) is the sum of element-wise multipli- “Lipofen”. Character embedding size is three. cation. Max pooling then gives us one feature for There are three filters of width 2 and four filters eachfilter. Theconcatenationofallthesefeatures ofwidth4. is our representation: (cid:126)v(e) ∈ Rk. An example CNNarchitectureisshowinFigure2. 3.3 Character-levelrepresentation The input to the LSTM is the character se- For computing the character level representation quence in matrix C, i.e., x1,...,xl ∈ Rdc. It (CLR),wedesignmodelsthattrytotypeanentity generates the state sequence h1,...,hl+1 and the based on the sequence of characters of its name. outputisthelaststate(cid:126)v(e) ∈ Rdh.2 Our hypothesis is that names of entities of a spe- The BiLSTM consists of two LSTMs, one go- cific type often have similar character patterns. ingforward,onegoingbackward. Thefirststateof Entities of type ETHNICITY often end in “ish” thebackwardLSTMisinitializedashl+1,thelast and“ian”,e.g.,“Spanish”and“Russian”. Entities state of the forward LSTM. The BiLSTM entity of type MEDICINE often end in “en”: “Lipofen”, representationistheconcatenationoflaststatesof “acetaminophen”. Also, some types tend to have forwardandbackwardLSTMs,i.e.,(cid:126)v(e) ∈ R2∗dh. specific cross-word shapes in their entities, e.g., 3.4 Multi-levelrepresentations 1Subwordmodelshavepropertiesofbothcharacter-level models (subwords are character ngrams) and of word-level Our different levels of representations can give models(theydonotcrossboundariesbetweenwords). They complementaryinformationaboutentities. probablycouldbeputineithercategory,butinourcontextfit theword-levelcategorybetterbecauseweseethegranularity levelwithrespecttotheentitiesandnotwords. 2WeuseBlocks(vanMerrie¨nboeretal.,2015). training. Thus,itcanfocusoncomplementaryin- Entity Representation formation related to the task that is not already present in other levels. The schematic diagram of our multi-level representation is shown in Fig- Character-level Word-level Entity-level Representation Representation Representation ure3. Figure3: Multi-levelrepresentation 4 Experimentalsetupandresults 4.1 Setup WLR and CLR. Both WLR models, SWLR Entity datasets and corpus. We address andWWLR,donothaveaccesstothecross-word the task of fine-grained entity typing and use characterngramsofentitynameswhileCLRmod- Yaghoobzadeh and Schu¨tze (2015)’s FIGMENT els do. Also, CLR is task specific by training on dataset3 for evaluation. The FIGMENT corpus theentitytypingdatasetwhileWLRisgeneric. On is part of a version of ClueWeb in which Free- the other hand, WWLR and SWLR models have base entities are annotated using FACC1 (URL, access to information that CLR ignores: the tok- 2016b; Gabrilovich et al., 2013). The FIGMENT enization of entity names into words and embed- entity datasets contain 200,000 Freebase entities dingsofthesewords. Itisclearthatwordsarepar- that were mapped to 102 FIGER types (Ling and ticularlyimportantcharactersequencessincethey Weld, 2012). We use the same train (50%), dev often correspond to linguistic units with clearly (20%) and test (30%) partitions as Yaghoobzadeh identifiablesemantics–whichisnottrueformost and Schu¨tze (2015) and extract the names from charactersequences. Formanyentities,thewords mentionsofdatasetentitiesinthecorpus. Wetake they contain are a better basis for typing than the most frequent name for dev and test entities the character sequence. For example, even if andthreemostfrequentnamesfortrain(eachone “nectarine” and “compote” did not occur in any taggedwithentitytypes). names in the training corpus, we can still learn Adding parent types to refine entity dataset. good word embeddings from their non-entity oc- FIGMENTignoresthatFIGERisaproperhierar- currences. Thisthenallowsustocorrectlytypethe chyoftypes;e.g.,whileHOSPITALisasubtypeof entity“AuntMary’sNectarineCompote”asFOOD BUILDING according to FIGER, there are entities basedonthesumofthewordembeddings. inFIGMENTthatarehospitals,butnotbuildings.4 WLR/CLR and ELR. Representations from Therefore, wemodifiedthe FIGMENTdatasetby entity names, i.e., WLR and CLR, by themselves addingforeachassignedtype(e.g.,HOSPITAL)its arelimitedbecausemanyclassesofnamescanbe parents (e.g., BUILDING). Thismakes FIGMENT used for different types of entities; e.g., person moreconsistentandeliminatesspuriousfalseneg- names do not contain hints as to whether they are atives(BUILDINGintheexample). referringtoapoliticianorathlete. Incontrast, the We now describe our baselines: (i) BOW ELR embedding is based on an entity’s contexts, & NSL: hand-crafted features, (ii) FIGMENT which are often informative for each entity and (Yaghoobzadeh and Schu¨tze, 2015) and (iii) can distinguish politicians from athletes. On the adaptedversionofFIGER(LingandWeld,2012). other hand, not all entities have sufficiently many We implement the following two feature sets informative contexts in the corpus. For these en- from the literature as a hand-crafted baseline for tities, their name can be a complementary source ourcharacterandwordlevelmodels. (i)BOW:in- of information and character/word level represen- dividualwordsofentityname(bothas-isandlow- tationscanincreasetypingaccuracy. ercased); (ii) NSL (ngram-shape-length): shape and length of the entity name (cf. Ling and Weld Thus, we introduce joint models that use com- (2012)),charactern-grams,1 ≤ n ≤ n ,n = binations of the three levels. Each multi-level max max 5 (we also tried n = 7, but results were worse model concatenates several levels. We train the max ondev)andnormalizedcharactern-grams: lower- constituent embeddings as follows. WLR and cased,digitsreplacedby“7”,punctuationreplaced ELRarecomputedasdescribedaboveandarenot by “.”. These features are represented as a sparse changed during training. CLR – produced by one of the character-level networks described above 3cistern.cis.lmu.de/figment/ – is initialized randomly and then tuned during 4Seegithub.com/xiaoling/figerforFIGER binary vector(cid:126)v(e) that is input to the architecture embeddings from the ClueWeb/FACC1 corpus. inFigure1. We use similar settings as our WWLR SKIP and FIGMENT is the model for entity typing pre- SSkip embeddings and keep the defaults of other sented by Yaghoobzadeh and Schu¨tze (2015). hyperparameters. Since the trained model of The authors only use entity-level representations fasttext is applicable for new words, we ap- for entities trained by SkipGram, so the FIG- ply the model to get embeddings for the filtered MENTbaselinecorrespondstotheentity-levelre- rarewordsaswell. sultshownasELR(SKIP)inthetables. Thethirdbaselineisusinganexistingmention- model hyperparameters levelentitytypingsystem,FIGER(LingandWeld, CLR(FF) dc=15,hmlp=600 CLR(LSTM) dc=70,dh=70,hmlp=300 2012). FIGER uses a wide variety of features CLR(BiLSTM) dc=50,dh=50,hmlp=200 CLR(CNN) dc=10,w=[1,..,8] on different levels (including parsing-based fea- n=100,hmlp=800 tures) from contexts of entity mentions as well as CLR(NSL) hmlp=800 BOW hmlp=200 the mentions themselves and returns a score for BOW+CLR(NSL) hmlp=300 WWLR hmlp=400 eachmention-typeinstanceinthecorpus. Wepro- SWLR hmlp=400 vide the ClueWeb/FACC1 segmentation of enti- WWLR+CLR(CNN) w=[1,...,7] dc=10,n=50,hmlp=700 ties, so FIGER does not need to recognize enti- SWLR+CLR(CNN) w=[1,...,7] ties.5 We use the trained model provided by the ELR(SKIP) dc=10,hnm=lp5=0,4h0m0lp=700 authors and normalize FIGER scores using soft- ELR(SSKIP) hmlp=400 ELR+CLR dc=10,w=[1,...,7] max to make them comparable for aggregation. n=100,hmlp=700 Weexperimentedwithdifferentaggregationfunc- ELR+WWLR hmlp=600 ELR+SWLR hmlp=600 tions(includingmaximumandk-largest-scoresfor ELR+WWLR+CLR dc=10,w=[1,...,7] n=50,hmlp=700 a type), but we use the average of scores since it ELR+SWLR+CLR dc=10,w=[1,...,7] gaveusthebestresultondev. Wecallthisbaseline n=50,hmlp=700 ELR+WWLR+CNN+TC dc=10,w=[1,...,7] AGG-FIGER. n=50,hmlp=900 ELR+SWLR+CNN+TC(MuLR) dc=10,w=[1,...,7] Distributional embeddings. For WWLR and n=50,hmlp=900 ELR,weuseSkipGrammodelinword2vecand AVG-DES hmlp=400 MuLR+AVG-DES dc=10,w=[1,...,7] SSkipmodelinwang2vec(Lingetal.,2015a)to n=50,hmlp=1000 learnembeddingsforwords,entitiesandtypes. To Table 1: Hyperparameters of different models. w obtainembeddingsforallthreeinthesamespace, is the filter size. n is the number of CNN feature weprocessClueWeb/FACC1asfollows. Foreach maps for each filter size. d is the character em- c sentence s, we add three copies: s itself, a copy bedding size. d is the LSTM hidden state size. h ofsinwhicheachentityisreplacedwithitsFree- h isthenumberofhiddenunitsintheMLP. mlp base identifier (MID) and a copy in which each entity(nottestentitiesthough)isreplacedwithan ID indicating its notable type. The resulting cor- Our hyperparameter values are given in Ta- pus contains around 4 billion tokens and 1.5 bil- ble 1. The values are optimized on dev. We use liontypes. AdaGradandminibatchtraining. Foreachexperi- We run SKIP and SSkip with the same setup ment,weselectthebestmodelondev. (200 dimensions, 10 negative samples, window We use these evaluation measures: (i) accu- size 5, word frequency threshold of 100)6 on this racy: an entity is correct if all its types and no corpustolearnembeddingsforwords,entitiesand incorrect types are assigned to it; (ii) micro aver- FIGER types. Having entities and types in the ageF : F ofalltype-entityassignmentdecisions; same vector space, we can add another feature 1 1 (iii)entitymacroaverageF : F oftypesassigned vector(cid:126)v(e) ∈ R|T| (referred to as TC below): for 1 1 toanentity,averagedoverentities;(iv)typemacro eachentity,wecomputecosinesimilarityofitsen- averageF : F ofentitiesassignedtoatype,aver- tityvectorwithalltypevectors. 1 1 agedovertypes. For SWLR, we use fasttext7 to learn word The assignment decision is based on thresh- 5MentiontypingisseparatedfromrecognitioninFIGER olding the probability function P(t|e). For each model.Soitcanuseoursegmentationofentities. 6ThethresholddoesnotapplyforMIDs. modelandtype,weselectthethresholdthatmax- 7github.com/facebookresearch/fastText imizesF1 ofentitiesassignedtothetypeondev. allentities headentities tailentities types: all head tail acc mic mac acc mic mac acc mic mac AGG-FIGER .566 .702 .438 1 MFT .000 .041 .041 .000 .044 .044 .000 .038 .038 ELR .621 .784 .480 2 CLR(FORWARD) .066 .379 .352 .067 .342 .369 .061 .374 .350 MuLR .669 .811 .541 3 CLR(LSTM) .121 .425 .396 .122 .433 .390 .116 .408 .391 4 CLR(BiLSTM) .133 .440 .404 .129 .443 .394 .135 .428 .404 Table 3: Type macro aver- 5 CLR(NSL) .164 .484 .464 .157 .470 .443 .173 .483 .472 age F on test for all, head 6 CLR(CNN) .177 .494 .468 .171 .484 .450 .187 .489 .474 1 7 BOW .113 .346 .379 .109 .323 .353 .120 .356 .396 and tail types. MuLR = 8 WWLR(SKIP) .214 .581 .531 .293 .660 .634 .173 .528 .478 ELR+SWLR+CLR+TC 9 WWLR(SSKIP) .223 .584 .543 .306 .667 .642 .183 .533 .494 all known? 10 SWLR .236 .590 .554 .301 .665 .632 .209 .551 .522 yes no 11 BOW+CLR(NSL) .156 .487 .464 .157 .480 .452 .159 .485 .469 CLR(NSL) .484.521.341 12 WWLR+CLR(CNN) .257 .603 .568 .317 .668 .637 .235 .567 .538 CLR(CNN) .494.524.374 13 SWLR+CLR(CNN) .241 .594 .561 .295 .659 .628 .227 .560 .536 BOW .346.435.065 14 ELR(SKIP) .488 .774 .741 .551 .834 .815 .337 .621 .560 SWLR .590.612.499 15 ELR(SSKIP) .515 .796 .763 .560 .839 .819 .394 .677 .619 BOW+NSL .497.535.358 16 AGG-FIGER .320 .694 .660 .396 .762 .724 .220 .593 .568 SWLR+CLR(CNN) .594.616.508 17 ELR+CLR .554 .816 .788 .580 .844 .825 .467 .733 .690 18 ELR+WWLR .557 .819 .793 .582 .846 .827 .480 .749 .708 Table 4: Micro F on test of 19 ELR+SWLR .558 .820 .796 .584 .846 .829 .480 .751 .714 1 20 ELR+WWLR+CLR .568 .823 .798 .590 .847 .829 .491 .755 .716 character, word level models 21 ELR+SWLR+CLR .569 .824 .801 .590 .849 .831 .497 .760 .724 for all, known (“known? yes”) 22 ELR+WWLR+CLR+TC .572 .824 .801 .594 .849 .831 .499 .759 .722 and unknown (“known? no”) 23 ELR+SWLR+CLR+TC .575 .826 .802 .597 .851 .831 .508 .762 .727 entities. Table 2: Accuracy (acc), micro (mic) and macro (mac) F ontestforall,headandtailentities. 1 4.2 Results – is more robust than hand engineered feature basedNSL.WeshowmoredetailedresultsinSec- Table 2 gives results on the test entities for all tion4.3. (about 60,000 entities), head (frequency > 100; about 12,200) and tail (frequency < 5; about Word-level models are on lines 7-10. BOW 10,000). MFT (line 1) is the most frequent type performs worse than WWLR because it cannot baseline that ranks types according to their fre- dealwellwithsparseness. SSKIPuseswordorder quencyinthetrainentities. Eachlevelofrepresen- information in WWLR and performs better than tationisseparatedwithdashedlines,and–unless SKIP. SWLR uses subword information and per- noted otherwise – the best of each level is joined forms better than WWLR, especially for tail en- inmultilevelrepresentations.8 tities. Integrating subword information improves Character-level models are on lines 2-6. The thequalityofembeddingsforrarewordsandmit- order of systems is: CNN > NSL > BiLSTM igatestheproblemofunknownwords. > LSTM > FORWARD. The results show that Joint word-character level models are complex neural networks are more effective than on lines 11-13. WWLR+CLR(CNN) and simple forwarding. BiLSTM works better than SWLR+CLR(CNN) beat the component models. LSTM, confirming other related work. CNNs This confirms our underlying assumption in probably work better than LSTMs because there designing the complementary multi-level models. arefewcomplexnon-localdependenciesinthese- BOW problem with rare words does not allow quence,butmanyimportantlocalfeatures. CNNs its joint model with NSL to work better than withmaxpoolingcanmorestraightforwardlycap- NSL. WWLR+CLR(CNN) works better than turelocalandposition-independentfeatures. CNN BOW+CLR(NSL) by 10% micro F , again due 1 also beats NSL baseline; a possible reason is that to the limits of BOW compared to WWLR. CNN – an automatic method of feature learning Interestingly WWLR+CLR works better than SWLR+CLR and this suggests that WWLR is 8Foraccuracymeasure: inthefollowingorderedlistsof indeedricherthanSWLRwhenCLRmitigatesits sets,A<Bmeansthatallmembers(rownumbersinTable2) ofAaresignificantlyworsethanallmembersofB: {1}< problemwithrare/unknownwords {2}<{3,...,11}<{12,13}<{14,15,16}<{17,...,23}. Entity-level models are on lines 14–15 and Testofequalproportions,α <0.05. SeeTable6intheap- pendixformoredetails. they are better than all previous models on lines CLR, and SWLR is slightly better than WWLR. Joint models of WWLR/SWLR with ELR+CLR gives more improvements, and SWLR is again slightly better than WWLR. ELR+WWLR+CLR and ELR+SWLR+CLR, are better than their two- levelcounterparts,againconfirmingthattheselev- elsarecomplementary. We get a further boost, especially for tail en- tities, by also including TC (type cosine) in the combinations(lines22-23). Thisdemonstratesthe potentialadvantageofhavingacommonrepresen- tationspaceforentitiesandtypes. Ourbestmodel, ELR+SWLR+CLR+TC (line 22), which we refer to as MuLR in the other tables, beats our initial Figure 4: t-SNE result of entity-level representa- baselines (ELR and AGG-FIGER) by large mar- tions gins, e.g., in tail entities improvements are more than8%inmicroF1. Table 3 shows type macro F for MuLR 1–13. Thisshowsthepowerofentity-levelembed- 1 (ELR+SWLR+CLR+TC) and two baselines. dings. In Figure 4, a t-SNE (Van der Maaten and There are 11 head types (those with ≥3000 train Hinton,2008)visualizationofELR(SKIP)embed- entities) and 36 tail types (those with <200 train dingsusingdifferentcolorsforentitytypesshows entities). These results again confirm the superi- that entities of the same type are clustered to- orityofourmulti-levelmodelsoverthebaselines: gether. SSKIPworksmarginallybetterthanSKIP AGG-FIGER and ELR, the best single-level forELR,especiallyfortailentities,confirmingour modelbaseline. hypothesis that order information is important for a good distributional entity representation. This 4.3 Analysis is also confirming the results of Yaghoobzadeh andSchu¨tze(2016),wheretheyalsogetbetteren- Unknown vs. known entities. To analyze the tity typing results with SSKIP compared to SKIP. complementarity of character and word level rep- They propose to use entity typing as an extrinsic resentations, as well as more fine-grained com- evaluationforembeddingmodels. parison of our models and the baselines, we di- Joint entity, word, and character level mod- videtestentitiesintoknownentities–atleastone elsareonlines16-23. TheAGG-FIGERbaseline wordoftheentity’snameappearsinatrainentity works better than the systems on lines 1-13, but – and unknown entities (the complement). There worsethanELRs. Thisisprobablyduetothefact are 45,000 (resp. 15,000) known (resp. unknown) thatAGG-FIGERisoptimizedformentiontyping testentities. anditistrainedusingdistantsupervisionassump- Table4showsthattheCNNworksonlyslightly tion. Parallel to our work, Yaghoobzadeh et al. better (by 0.3%) than NSL on known entities, but (2017) optimize a mention typing model for our worksmuchbetteronunknownentities(by3.3%), entity typing task by introducing multi instance justifying our preference for deep learning CLR learning algorithms, resulting comparable perfor- models. As expected, BOW works relatively well mance to ELR(SKIP). We will investigate their for known entities and really poorly for unknown methodinfuture. entities. SWLR beats CLR models as well as Joining CLR with ELR (line 17) results in BOW. The reason is that in our setup, word em- large improvements, especially for tail entities beddings are induced on the entire corpus using (5% micro F ). This demonstrates that for rare an unsupervised algorithm. Thus, even for many 1 entities, contextual information is often not suf- words that did not occur in train, SWLR has ac- ficient for an informative representation, hence cess to informative representations of words. The name features are important. This is also true joint model, SWLR+CLR(CNN), is significantly for the joint models of WWLR/SWLR and ELR better than BOW+CLR(NSL) again due to limits (lines 18-19). Joining WWLR works better than ofBOW.SWLR+CLR(CNN)isbetterthanSWLR inunknownentities. entities: all head tail AVG-DES .773 .791 .745 Case study of LIVING-THING. To understand MuLR .825 .846 .757 theinterplayofdifferentlevelsbetter,weperform MuLR+AVG-DES .873 .877 .852 a case study of the type LIVING-THING. Living Table 5: Micro average F results of MuLR and 1 beingsthatarenothumansbelongtothistype. descriptionbasedmodelandtheirjoint. WLRs incorrectly assign “Walter Leaf” (PERSON) and “Along Came A Spider” (MUSIC) to LIVING-THING because these names contain a level model as well as the joint of the two on this word referring to a LIVING-THING (“leaf”, “spi- smallerdataset. Sincethedescriptionsarecoming der”),buttheentityitselfisnota LIVING-THING. from Wikipedia, we use 300-dimensional Glove In these cases, the averaging of embeddings that (URL, 2016a) embeddings pretrained on Wikip- WLR performs is misleading. The CLR(CNN) dia+Gigawordtogetmorecoverageofwords. For types these two entities correctly because their MuLR,westillusetheembeddingswetrainedbe- names contain character ngram/shape patterns fore. thatareindicativeofPERSONandMUSIC. Results are shown in Table 5. While for head ELR incorrectly assigns “Zumpango” (CITY) entities,MuLRworksmarginallybetter,thediffer- and“LakeKasumigaura”(LOCATION)toLIVING- enceisverysmallintailentities. Thejointmodel THING because these entities are rare and words ofthetwo(byconcatenationofvectors)improves associated with living things (e.g., “wildlife”) themicroF1,withclearboostfortailentities. This dominate in their contexts. However, CLR(CNN) suggests that for tail entities, the contextual and and WLR enable the joint model to type the two nameinformationisnotenoughbyitselfandsome entites correctly: “Zumpango” because of the in- keywords from descriptions can be really helpful. formative suffix “-go” and “Lake Kasumigaura” Integrating more complex description-based em- becauseoftheinformativeword“Lake”. beddings, e.g., by using CNN (Xie et al., 2016), Whilesomeoftheremainingerrorsofourbest may improve the results further. We leave it for systemMuLRareduetotheinherentdifficultyof futurework. entitytyping(e.g.,itisdifficulttocorrectlytypea one-wordentitythatoccursonceandwhosename 5 Conclusion is not informative), many other errors are due to artifacts of our setup. First, ClueWeb/FACC1 is In this paper, we have introduced representations the result of an automatic entity linking system of entities on different levels: character, word andanyentitylinkingerrorspropagatetoourmod- and entity. The character level representation is els. Second,duetotheincompletenessofFreebase learnedfromtheentityname. Thewordlevelrep- (YaghoobzadehandSchu¨tze,2015),manyentities resentation is computed from the embeddings of in the FIGMENT dataset are incompletely anno- thewordsw intheentitynamewheretheembed- i tated, resulting in correctly typed entities being ding of w is derived from the corpus contexts of i evaluatedasincorrect. w . The entity level representation of entity e is i i Adding another source: description-based derivedfromthecorpuscontextsofe . Ourexper- i embeddings. Whileinthispaper,wefocusonthe iments show that each of these levels contributes contexts and names of entities, there is a textual complementary information for the task of fine- sourceofinformationaboutentitiesinKBswhich grained typing of entities. The joint model of all we can also make use of: descriptions of entities. three levels beats the state-of-the-art baseline by We extract Wikipedia descriptions of FIGMENT large margins. We further showed that extracting entities filtering out the entities (∼ 40,000 out of some keywords from Wikipedia descriptions of ∼200,000)withoutdescription. entities,whenavailable,canconsiderablyimprove Wethenbuildasimpleentityrepresentationby entity representations, especially for rare entities. averagingtheembeddingsofthetopk words(wrt We believe that our findings can be transferred to tf-idf)ofthedescription(henceforth,AVG-DES).9 othertaskswhereentityrepresentationmatters. This representation is used as input in Figure 1 Acknowledgments. This work was supported to train the MLP. We also train our best multi- byDFG(SCHU2246/8-2). 9k=20givesthebestresultsondev. References [Fangetal.2016] Wei Fang, Jianwen Zhang, Dilin Wang,ZhengChen,andMingLi. 2016. Entitydis- [Ballesterosetal.2015] MiguelBallesteros,ChrisDyer, ambiguation by knowledge and text jointly embed- and Noah A. Smith. 2015. Improved transition- ding. In Proceedings of The 20th SIGNLL Confer- based parsing by modeling characters instead of enceonComputationalNaturalLanguageLearning, words with lstms. In Proceedings of the 2015 pages260–269, Berlin, Germany, August.Associa- Conference on Empirical Methods in Natural Lan- tionforComputationalLinguistics. guageProcessing,pages349–359,Lisbon,Portugal, September.AssociationforComputationalLinguis- [Gabrilovichetal.2013] Evgeniy Gabrilovich, Michael tics. Ringgaard, and Amarnag Subramanya. 2013. Facc1: Freebaseannotationofcluewebcorpora. [Bojanowskietal.2016] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. [Harris1954] Zellig S. Harris. 1954. Distributional Enriching word vectors with subword information. structure. Word,10:146–162. CoRR,abs/1607.04606. [Kimetal.2016] YoonKim,YacineJernite,DavidSon- [Bollackeretal.2008] Kurt D. Bollacker, Colin Evans, tag, and Alexander M. Rush. 2016. Character- Praveen Paritosh, Tim Sturge, and Jamie Taylor. aware neural language models. In Proceedings of 2008. Freebase: a collaboratively created graph the Thirtieth AAAI Conference on Artificial Intel- databaseforstructuringhumanknowledge. InPro- ligence, February 12-17, 2016, Phoenix, Arizona, ceedings of the ACM SIGMOD International Con- USA.,pages2741–2749. ference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages [Lampleetal.2016] GuillaumeLample,MiguelBalles- 1247–1250. teros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for [CaoandRei2016] Kris Cao and Marek Rei. 2016. A named entity recognition. In Proceedings of the jointmodelforwordembeddingandwordmorphol- 2016ConferenceoftheNorthAmericanChapterof ogy. In Proceedings of the 1st Workshop on Rep- theAssociationforComputationalLinguistics: Hu- resentationLearningforNLP,pages18–26,Berlin, man Language Technologies, pages 260–270, San Germany, August. Association for Computational Diego, California, June. Association for Computa- Linguistics. tionalLinguistics. [Chenetal.2015] XinxiongChen,LeiXu,ZhiyuanLiu, [LingandWeld2012] Xiao Ling and Daniel S. Weld. Maosong Sun, and Huan-Bo Luan. 2015. Joint 2012. Fine-grainedentityrecognition. InProceed- learningofcharacterandwordembeddings. InPro- ings of the Twenty-Sixth AAAI Conference on Arti- ceedings of the Twenty-Fourth International Joint ficial Intelligence, July 22-26, 2012, Toronto, On- Conference on Artificial Intelligence, IJCAI 2015, tario,Canada. Buenos Aires, Argentina, July 25-31, 2015, pages 1236–1242. [Lingetal.2015a] Wang Ling, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015a. Two/too sim- [ChiuandNichols2016] Jason Chiu and Eric Nichols. pleadaptationsofword2vecforsyntaxproblems. In 2016. Named entity recognition with bidirectional Proceedings of the 2015 Conference of the North lstm-cnns. TransactionsoftheAssociationforCom- American Chapter of the Association for Compu- putationalLinguistics,4:357–370. tational Linguistics: Human Language Technolo- gies, pages 1299–1304, Denver, Colorado, May– [DelCorroetal.2015] Luciano Del Corro, Abdalghani June.AssociationforComputationalLinguistics. Abujabal, Rainer Gemulla, and Gerhard Weikum. 2015. Finet: Context-awarefine-grainednameden- [Lingetal.2015b] Wang Ling, Chris Dyer, Alan W titytyping. InProceedingsofthe2015Conference Black, Isabel Trancoso, Ramon Fermandez, Silvio on Empirical Methods in Natural Language Pro- Amir,LuisMarujo,andTiagoLuis. 2015b. Finding cessing, pages 868–878, Lisbon, Portugal, Septem- function in form: Compositional character models ber.AssociationforComputationalLinguistics. for open vocabulary word representation. In Pro- ceedingsofthe2015ConferenceonEmpiricalMeth- [dosSantosandGuimara˜es2015] C´ıcero Nogueira dos ods in Natural Language Processing, pages 1520– Santos and Victor Guimara˜es. 2015. Boosting 1530, Lisbon, Portugal, September.Associationfor named entity recognition with neural character em- ComputationalLinguistics. beddings. CoRR,abs/1505.05008. [MaandHovy2016] Xuezhe Ma and Eduard Hovy. [dosSantosandZadrozny2014] C´ıcero Nogueira dos 2016. End-to-end sequence labeling via bi- Santos and Bianca Zadrozny. 2014. Learning directionallstm-cnns-crf. InProceedingsofthe54th character-level representations for part-of-speech Annual Meeting of the Association for Computa- tagging. In Proceedings of the 31th International tional Linguistics (Volume 1: Long Papers), pages ConferenceonMachineLearning,ICML2014,Bei- 1064–1074, Berlin, Germany, August. Association jing,China,21-26June2014,pages1818–1826. forComputationalLinguistics.