
Polylingual Topic Models - University of Massachusetts, Amherst

23 Pages·2009·0.54 MB·English


Polylingual Topic Models
Hanna M. Wallach
University of Massachusetts Amherst
[email protected]
August 7, 2009
Joint work with D. Mimno, J. Naradowsky, D.A. Smith and A. McCallum

Statistical Topic Models
- Useful for analyzing large, unstructured text collections

  Example topics (one topic per column):

  bounds     units    policy         data        neurons
  bound      hidden   action         space       neuron
  loss       network  reinforcement  clustering  spike
  functions  layer    learning       points      synaptic
  error      unit     actions        distance    firing

- Topic-based search interfaces (http://rexa.info)
- Analysis of scientific progress over time (Blei & Lafferty, ’07)
- Information retrieval (Wei & Croft, ’06)

Automated Analysis of Text
- Previously: analyzing trends in text collections (Hall et al., ’08)
- Monolingual models often work well: collections in English only
- Multilingual text collections are increasingly common
- Automated tools are most important for multilingual collections:
  - Don’t know the language → cannot eyeball the data
  - New documents will appear in other languages
  - People typically only know a few languages
- Simultaneously analyze document content in many languages

Multiple Languages
- Why model multiple languages explicitly?
- Most statistical topic models are language-agnostic

  Example topics (one topic per column):

  graph     problem       rendering  algebra   und  la
  graphs    problems      graphics   algebras  von  des
  edge      optimization  image      ring      die  le
  vertices  algorithm     texture    rings     der  du
  edges     programming   scene      modules   im   les

- Hodgepodge of English, German, French topics
- Imbalanced corpus: maybe only one or two French topics

Parallel vs. Comparable Corpora
- A set of aligned documents is a “document tuple”
- Fully parallel corpora: documents are direct translations
- Corpora with a few parallel “glue” document tuples
- Comparable corpora: documents have similar semantic content

Polylingual Topic Model
- Generates a document tuple $\mathbf{w} = (w^1, \ldots, w^L)$ by drawing...
- For real-world data, only the word tokens are observed

Key Characteristics
- Learning a model of all languages simultaneously
- A topic is a set of distributions over words, e.g., $\phi_t = (\phi_t^1, \ldots, \phi_t^L)$
- Works on tuples of aligned documents, rather than documents, but each tuple can consist of only a subset of languages
- Tuple-specific topic distributions ensure cross-language consistency: e.g., topic 13 in French is semantically similar to topic 13 in English
- Simple Gibbs sampling inference algorithm
- Inference is linear in # of languages, not # of language pairs

EuroParl: Example Topics (T = 400)
[Three slides of example topics; the topic figures are not preserved in this text.]
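The generative story on the "Polylingual Topic Model" slide is elided in this text ("by drawing..."), so the following is only a minimal sketch of how such a process might look, under stated assumptions. It assumes an LDA-style setup: a single topic distribution is drawn per document tuple from a Dirichlet prior, and each language draws its tokens from that shared distribution using its own per-topic word distributions. The names `sample_dirichlet`, `sample_categorical`, `generate_tuple`, `alpha`, `phi`, and `lengths`, as well as the use of `None` for a language missing from a tuple, are hypothetical choices made for illustration and do not come from the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_tuple(phi, alpha, lengths, rng):
    """Generate one document tuple (illustrative, not the authors' code).

    phi[l][t] is the word distribution of topic t in language l.
    alpha is an ASSUMED Dirichlet prior over topics.
    lengths[l] is the token count for language l, or None if that
    language is absent from this tuple.
    """
    # One topic distribution shared by every language in the tuple.
    theta = sample_dirichlet(alpha, rng)
    docs = []
    for lang, n_tokens in enumerate(lengths):
        if n_tokens is None:
            docs.append(None)  # tuple covers only a subset of languages
            continue
        words = []
        for _ in range(n_tokens):
            z = sample_categorical(theta, rng)         # topic from shared theta
            w = sample_categorical(phi[lang][z], rng)  # word from language-specific topic
            words.append(w)
        docs.append(words)
    return theta, docs


if __name__ == "__main__":
    rng = random.Random(0)
    n_langs, n_topics, vocab_size = 2, 3, 4
    # Toy per-language topics, themselves drawn from a symmetric Dirichlet.
    phi = [[sample_dirichlet([1.0] * vocab_size, rng) for _ in range(n_topics)]
           for _ in range(n_langs)]
    theta, docs = generate_tuple(phi, [1.0] * n_topics, [6, 6], rng)
    print(theta)
    print(docs)
```

Because every language in a tuple samples topics from the same `theta`, topic index t is tied across languages, which is one way to read the slide's claim that tuple-specific topic distributions keep, say, topic 13 consistent between French and English. The per-language loop also reflects why the cost grows with the number of languages rather than with the number of language pairs.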

