COS 402 – Machine Learning and Artificial Intelligence
Fall 2016
Lecture 10: Capturing semantics: Word Embeddings
Sanjeev Arora, Elad Hazan
(Borrows from slides of D. Jurafsky, Stanford U.)
Last time
N-gram language models: Pr[w_2 | w_1]
(Assign a probability to each sentence; trained using an unlabeled corpus.)
Unsupervised learning.
Today: Unsupervised learning for semantics (meaning)
Semantics
Study of meaning of linguistic expressions.
Mastering it, and operating with it fluently, seems a prerequisite for AI (Turing test, etc.)
But every bit of partial progress can immediately be used to improve technology:
• Improving web search.
• Siri, etc.
• Machine translation
• Information retrieval, …
Let’s talk about meaning
What does the sentence “You like green cream” mean?
”Come up and see me some time.”
How did words arrive at their meanings?
Your answers seem to involve a mix of
• Grammar/syntax
• How words map to physical experience
• Physical sensations (“sweet”, “breathing”, “pain”, etc.) and mental states (“happy”, “regretful”) appear to be experienced roughly the same way by everybody, which helps anchor some meanings.
• (at some level, becomes philosophy).
What are simple tests for understanding meaning?
Which of the following means the same as pulchritude?
(a) Truth (b) Disgust (c) Anger (d) Beauty
Analogy questions:
Man : Woman :: King : ??
“How can we teach a computer the notion of
word similarity?”
Test: Think of a word that co-occurs with:
Cow, drink, babies, calcium…
Distributional hypothesis of meaning, [Harris’54], [Firth’57]
Meaning of a word is determined by words it co-occurs with.
“Tiger” and “Lion” are similar because they co-occur with similar words (“jungle”, “hunt”, “predator”, “claws”, etc.)
A computer could learn similarity by simply reading
a text corpus; no need to wait for full AI!
How can we quantify “distribution of nearby words”?
A bottle of tesgüino is on the table
Everybody likes tesgüino
Tesgüino makes you drunk
We make tesgüino out of corn.
!"#$%(’((),+)#$,)
Recall bigram frequency P(beer, drunk) =
.")/#0 023(
Redefine: count(beer, drunk) = # of times “beer” and “drunk” appear within 5 words of each other.
(Ignore bigrams with frequent words like “and”, “for”.)

Matrix of all such bigram counts (# of columns = V ≈ 10^6; see the counting sketch after this slide):

            study   alcohol   drink   bottle   lion
Beer          35       552      872      677      2
Milk          45        20      935      250      0
Jungle        12        35      110       28    931
Tesgüino       2        67       75       57      0

Each row is a vector representation of its word (e.g. the row for “beer”): a semantic embedding [Deerwester et al. 1990].
Which words look most similar to each other?
“Distributional hypothesis” ⇒ word similarity boils down to vector similarity.
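A minimal sketch of how such windowed co-occurrence counts could be computed from a raw token stream, in plain Python. The helper name cooccurrence_counts and the stopword set are illustrative assumptions, not part of the lecture:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=5, stopwords=frozenset({"and", "for"})):
    """Count how often each unordered word pair occurs within `window` words of each other."""
    counts = defaultdict(int)  # (word_a, word_b) -> count
    for i, w in enumerate(tokens):
        if w in stopwords:
            continue
        # Look ahead only, so each unordered pair is counted once per occurrence.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tokens[j] in stopwords:
                continue
            a, b = sorted((w, tokens[j]))
            counts[(a, b)] += 1
    return counts

# Toy corpus built from the slide's example sentences
corpus = "everybody likes tesgüino tesgüino makes you drunk".split()
print(cooccurrence_counts(corpus)[("drunk", "tesgüino")])  # 2
```

Running this over a large corpus (and keeping only content words) yields exactly the kind of count matrix shown above.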
Dot product and cosine similarity
(Excerpted from Jurafsky & Martin, Section 19.2, “Sparse vector models: positive pointwise mutual information”.)

Figure 19.6: The Add-2 Laplace smoothed PPMI matrix from the add-2 smoothing counts in Fig. 17.5.

              computer   data   pinch   result   sugar
apricot          0        0      0.56     0       0.56
pineapple        0        0      0.56     0       0.56
digital          0.62     0      0        0       0
information      0        0.58   0        0.37    0

The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

    dot-product(v, w) = v · w = Σ_{i=1}^N v_i w_i = v_1 w_1 + v_2 w_2 + … + v_N w_N        (19.10)

Intuitively, the dot product acts as a similarity metric because it will tend to be:
• High just when the two vectors have large values in the same dimensions.
• Low (in fact 0) for orthogonal vectors, i.e. vectors with zeros in complementary dimensions; such vectors are very dissimilar, with a dot product of 0.

This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as

    |v| = √( Σ_{i=1}^N v_i^2 )        (19.11)

The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. Raw dot product thus will be higher for frequent words. But this is a problem; we’d like a similarity metric that tells us how similar two words are regardless of their frequency.

Cosine similarity: “Rescale vectors to unit length, compute dot product.”

The simplest way to modify the dot product to normalize for vector length is to divide the dot product by the lengths of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, which follows from the definition of the dot product between two vectors a and b:

    a · b = |a| |b| cos θ
    (a · b) / (|a| |b|) = cos θ        (19.12)

The cosine similarity metric between two vectors v and w can thus be computed as:

    cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1}^N v_i w_i / ( √(Σ_{i=1}^N v_i^2) · √(Σ_{i=1}^N w_i^2) )        (19.13)

For some applications we pre-normalize each vector, dividing it by its length to create a unit vector of length 1; the cosine is then just the dot product of the unit vectors.

Other measures of similarity have been defined as well.
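As a concrete check of Eq. (19.13), here is a short sketch in plain Python (no external libraries; the helper name cosine is ours) that applies the formula to rows of the PPMI matrix in Figure 19.6:

```python
import math

def cosine(v, w):
    """Eq. (19.13): dot product divided by the product of the vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    len_v = math.sqrt(sum(vi * vi for vi in v))
    len_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (len_v * len_w)

# Rows of Figure 19.6; dimensions are (computer, data, pinch, result, sugar)
apricot     = [0.00, 0.00, 0.56, 0.00, 0.56]
pineapple   = [0.00, 0.00, 0.56, 0.00, 0.56]
digital     = [0.62, 0.00, 0.00, 0.00, 0.00]
information = [0.00, 0.58, 0.00, 0.37, 0.00]

print(cosine(apricot, pineapple))    # ~1.0: identical directions
print(cosine(apricot, digital))      #  0.0: no shared dimensions (orthogonal)
print(cosine(digital, information))  #  0.0
```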
Reminder about properties of cosine:
• −1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal
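Since word-count vectors are nonnegative, their cosines fall in [0, 1]; the full range only appears with general vectors. A tiny illustration, reusing the cosine() helper sketched above:

```python
u = [1.0, 2.0]
print(cosine(u, [2.0, 4.0]))    # +1.0: same direction
print(cosine(u, [-1.0, -2.0]))  # -1.0: opposite directions
print(cosine(u, [2.0, -1.0]))   #  0.0: orthogonal
```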