COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Lecture 10: Capturing semantics: Word Embeddings
Sanjeev Arora, Elad Hazan
(Borrows from slides of D. Jurafsky, Stanford U.)

Last time: N-gram language models, Pr[w2 | w1]. (Assign a probability to each sentence; trained using an unlabeled corpus.) Unsupervised learning.

Today: Unsupervised learning for semantics (meaning).

Semantics
Study of the meaning of linguistic expressions. Mastering and fluently operating with it seems a precursor to AI (Turing test, etc.). But every bit of partial progress can immediately be used to improve technology:
• Improving web search
• Siri, etc.
• Machine translation
• Information retrieval, …

Let's talk about meaning
What does the sentence "You like green cream" mean? What about "Come up and see me some time."? How did words arrive at their meanings? Your answers seem to involve a mix of
• Grammar/syntax
• How words map to physical experience (physical sensations "sweet", "breathing", "pain", etc. and mental states "happy", "regretful" appear to be experienced roughly the same way by everybody, which helps anchor some meanings)
• …
• (At some level, this becomes philosophy.)

What are simple tests for understanding meaning?
Which of the following means the same as pulchritude: (a) Truth (b) Disgust (c) Anger (d) Beauty?
Analogy questions: Man : Woman :: King : ??
"How can we teach a computer the notion of word similarity?"
Test: think of a word that co-occurs with: cow, drink, babies, calcium…

Distributional hypothesis of meaning [Harris '54], [Firth '57]
The meaning of a word is determined by the words it co-occurs with. "Tiger" and "Lion" are similar because they co-occur with similar words ("jungle", "hunt", "predator", "claws", etc.). A computer could learn similarity by simply reading a text corpus; no need to wait for full AI!

How can we quantify the "distribution of nearby words"?
  A bottle of tesgüino is on the table.
  Everybody likes tesgüino.
  Tesgüino makes you drunk.
  We make tesgüino out of corn.
Recall bigram frequency: P(beer, drunk) = count(beer drunk) / count(beer).
Redefine count(beer, drunk) = # of times beer and drunk appear within 5 words of each other.

Matrix of all such bigram counts (# of columns = V = 10^6):

            study   alcohol   drink   bottle   lion
Beer          35      552      872     677       2
Milk          45       20      935     250       0
Jungle        12       35      110      28     931
Tesgüino       2       67       75      57       0

Each row is the vector representation of its word (e.g. of "beer"): a semantic embedding [Deerwester et al. 1990]. (Ignore bigrams with frequent words like "and", "for".)
Which words look most similar to each other?
"Distributional hypothesis" ⇒ word similarity boils down to vector similarity.
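A minimal sketch of how such a co-occurrence matrix could be built (not from the lecture; the function name, the default window of 5, and the tiny stopword set are illustrative assumptions):

from collections import defaultdict

def cooccurrence_counts(tokens, window=5, stopwords=frozenset({"and", "for"})):
    # counts[w][c] = number of times word c appears within `window` words of w,
    # skipping very frequent words such as "and", "for"
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        if w in stopwords:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            c = tokens[j]
            if j != i and c not in stopwords:
                counts[w][c] += 1
    return counts

corpus = ("a bottle of tesguino is on the table everybody likes tesguino "
          "tesguino makes you drunk we make tesguino out of corn").split()
counts = cooccurrence_counts(corpus)
print(counts["tesguino"]["drunk"])   # prints 3: "drunk" falls within 5 words of "tesguino" three times

On a large corpus, each row of the resulting table of counts plays the role of the word vectors shown above.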
19.2 Sparse vector models: Positive Pointwise Mutual Information (excerpt)

             computer   data   pinch   result   sugar
apricot          0        0     0.56      0      0.56
pineapple        0        0     0.56      0      0.56
digital        0.62       0      0        0       0
information      0      0.58     0      0.37      0

Figure 19.6  The Add-2 Laplace smoothed PPMI matrix from the add-2 smoothing counts in Fig. 17.5.

The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

dot-product(v, w) = v · w = Σ_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + … + v_N w_N    (19.10)

Intuitively, the dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will be very dissimilar, with a dot product of 0.

Cosine similarity
• High when two vectors have large values in the same dimensions.
• Low (in fact 0) for orthogonal vectors, with zeros in complementary distribution.

This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as

|v| = sqrt( Σ_{i=1}^{N} v_i^2 )    (19.11)

The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. The raw dot product thus will be higher for frequent words. But this is a problem; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.

The simplest way to modify the dot product to normalize for the vector length is to divide the dot product by the lengths of each of the two vectors ("rescale vectors to unit length, compute dot product"). This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors a and b:

a · b = |a| |b| cos θ
(a · b) / (|a| |b|) = cos θ    (19.12)

The cosine similarity metric between two vectors v and w thus can be computed as:

cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1}^{N} v_i w_i / ( sqrt(Σ_{i=1}^{N} v_i^2) · sqrt(Σ_{i=1}^{N} w_i^2) )    (19.13)

(Other measures of similarity have been defined as well…)

For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1. Thus we could compute a unit vector from a by dividing it by its length |a|.

Reminder about properties of cosine
• −1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal
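To make this concrete, here is a minimal sketch (not part of the original slides) that applies Eq. (19.13) to the count vectors from the beer/milk/jungle/tesgüino table earlier in the lecture; the numbers are the slide's illustrative counts, not real corpus statistics.

import math

def cosine(v, w):
    # cosine(v, w) = (v . w) / (|v| |w|), Eq. (19.13); returns 0.0 if either vector is all zeros
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0

# columns: study, alcohol, drink, bottle, lion (counts from the lecture's table)
vectors = {
    "beer":     [35, 552, 872, 677,   2],
    "milk":     [45,  20, 935, 250,   0],
    "jungle":   [12,  35, 110,  28, 931],
    "tesguino": [ 2,  67,  75,  57,   0],
}

for word in ("beer", "milk", "jungle"):
    print(word, round(cosine(vectors["tesguino"], vectors[word]), 3))
# tesguino comes out closest to beer, as the distributional hypothesis predicts

Pre-normalizing each vector to unit length first (dividing it by its norm) would reduce the comparison to a plain dot product, as noted above.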