ACL’05 Tutorial
University of Michigan - Ann Arbor
June 25, 2005
(Last updated July 3rd, 2005)

Introduction to Arabic Natural Language Processing
Nizar Habash
Columbia University, Center for Computational Learning Systems

• Focus of this tutorial
  – Phenomena
  – Concepts
  – Approaches & Resources
• What is ‘Arabic’?
  – Arabic Script
  – Arabic Language
    • Modern Standard Arabic (MSA)
    • Arabic Dialects

Road Map
• Introduction
• Orthography
• Morphology
• Syntax
• Machine Translation Issues
• Dialects

Road Map
• Introduction
• Orthography
  – Arabic Script
  – MSA Phonology and Spelling
  – Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/…
  – Encoding Issues
• Morphology
• Syntax
• Machine Translation Issues
• Dialects

Arabic Script

Arabic script is an alphabet with allographic variants, optional zero-width diacritics and common ligatures.

الخَطُّ العَرَبِيّ

Arabic script is used to write many languages: Arabic, Persian, Kurdish, Urdu, Pashto, etc.

Arabic Script: Alphabet
• letter forms
• letter marks
• Arabic only
• Other languages
  – Persian, Kurdish, Urdu, Pashto, etc.
• OCR output ambiguity

Arabic Script: Alphabet (MSA)
• letters (form + mark)
• Distinctive
  – ب /b/
  – ت /t/
  – ث /θ/
  – س /s/
  – ش /ʃ/
• Non-distinctive
  – ا أ إ ء ئ ؤ /ʔ/ (glottal stop, aka hamza)

Arabic Script: Letter Shapes
• No distinction between print and handwriting
• No capitalization
• Right-to-left
• Ambiguous shapes
• Connective letters
  – stand-alone: ن ب ك م ش غ
  – initial: نـ بـ كـ مـ شـ غـ
  – medial: ـنـ ـبـ ـكـ ـمـ ـشـ ـغـ
  – final: ـن ـب ـك ـم ـش ـغ
• Disconnective letters
  – stand-alone: ز د ا
  – final: ـز ـد ـا

Arabic Script: Letter Shaping
• ك + ت + ب (k t b) = كتب /katab/ ‘to write’
• ك + ت + ا + ب (k t ā b) = كتاب /kitāb/ ‘book’
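
The letter-shaping examples above can be sketched in code. Unicode text normally stores the base letters, and the renderer picks the stand-alone, initial, medial or final glyph. Unicode also has legacy "presentation form" code points for the individual shaped glyphs, and these sometimes leak into extracted text. The sketch below is only an illustration under assumptions: the shaped strings are hand-picked presentation-form code points for /katab/ and /kitāb/, and the helper names `shape_kind` and `unshape` are hypothetical, not part of any library. It uses Python's standard `unicodedata` module to read the positional tag of each shaped glyph and to fold shaped text back to base letters with NFKC normalization.

```python
import unicodedata

# Base letters in logical (reading) order.
KATAB = "\u0643\u062A\u0628"          # kaf, teh, beh          -> /katab/ "to write"
KITAB = "\u0643\u062A\u0627\u0628"    # kaf, teh, alef, beh    -> /kitab/ "book"

# Hand-picked Arabic Presentation Forms-B code points (an assumption made
# for illustration; real text usually stores the base letters instead).
SHAPED_KATAB = "\uFEDB\uFE98\uFE90"        # initial kaf, medial teh, final beh
SHAPED_KITAB = "\uFEDB\uFE98\uFE8E\uFE8F"  # initial kaf, medial teh, final alef,
                                           # stand-alone beh (alef is disconnective)


def shape_kind(ch):
    """Return the positional tag ('initial', 'medial', 'final', 'isolated')
    recorded in the Unicode decomposition of a presentation-form character,
    or None when the character carries no such tag (e.g. a base letter)."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


def unshape(text):
    """Fold presentation forms back to base letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


for ch in SHAPED_KITAB:
    # Print ASCII-only information so the output does not depend on the console.
    print(shape_kind(ch), "-", unicodedata.name(ch))
```

In this sketch the beh that follows alef in /kitāb/ is tagged as isolated rather than final, because alef does not connect to the letter after it. Calling `unshape` on either shaped string gives back the plain base-letter sequence.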