Theory and Applications of Natural Language Processing
Edited volumes

Imed Zitouni, Editor
Natural Language Processing of Semitic Languages

Series Editors: Graeme Hirst (Textbooks), Eduard Hovy (Edited volumes), Mark Johnson (Monographs)

Aims and Scope

The field of Natural Language Processing (NLP) has expanded explosively over the past decade: growing bodies of available data, novel fields of applications, emerging areas and new connections to neighboring fields have all led to increasing output and to diversification of research.

“Theory and Applications of Natural Language Processing” is a series of volumes dedicated to selected topics in NLP and Language Technology. It focuses on the most recent advances in all areas of the computational modeling and processing of speech and text across languages and domains. Due to the rapid pace of development, the diversity of approaches and application scenarios are scattered in an ever-growing mass of conference proceedings, making entry into the field difficult for both students and potential users. Volumes in the series facilitate this first step and can be used as a teaching aid, advanced-level information resource or a point of reference.

The series encourages the submission of research monographs, contributed volumes and surveys, lecture notes and textbooks covering research frontiers on all relevant topics, offering a platform for the rapid publication of cutting-edge research as well as for comprehensive monographs that cover the full range of research on specific problem areas.

The topics include applications of NLP techniques to gain insights into the use and functioning of language, as well as the use of language technology in applications that enable communication, knowledge management and discovery such as natural language generation, information retrieval, question-answering, machine translation, localization and related fields.
The books are available in printed and electronic (e-book) form:
* Downloadable on your PC, e-reader or iPad
* Enhanced by Electronic Supplementary Material, such as algorithms, demonstrations, software, images and videos
* Available online within an extensive network of academic and corporate R&D libraries worldwide
* Never out of print thanks to innovative print-on-demand services
* Competitively priced print editions for eBook customers thanks to MyCopy service http://www.springer.com/librarians/e-content/mycopy

For other titles published in this series, go to www.springer.com/series/8899

Editor
Imed Zitouni
Microsoft
Redmond, WA
USA

ISSN 2192-032X
ISSN 2192-0338 (electronic)
ISBN 978-3-642-45357-1
ISBN 978-3-642-45358-8 (eBook)
DOI 10.1007/978-3-642-45358-8
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2014936994
© Springer-Verlag Berlin Heidelberg 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Modern communication technologies, such as the television and the Internet, have made readily available massive amounts of information in many languages. More such data is being generated in real time, 24 h a day and 7 days a week, aided by social networking sites such as Facebook and Twitter. This information explosion is in the form of multilingual audio, video, and Web content. The task of processing this large amount of information demands effective, scalable, multilingual media processing, monitoring, indexing, and search solutions. Natural Language Processing (NLP) technologies have long been used to address this task, and researchers have developed a number of technical solutions for it. In the last two decades, NLP researchers have developed exciting algorithms for processing large amounts of text in many different languages. Today, English holds the lion’s share of both the available resources and the NLP solutions that have been developed. In this book, we address another group of interesting and challenging languages for NLP research, that is, the Semitic languages. The Semitic languages have existed in written form since a very early date, with texts written in a script adapted from Sumerian cuneiform. Most scripts used to write Semitic languages are abjads, a type of alphabetic script that omits some or all of the vowels.
This is feasible for these languages because the consonants in the Semitic languages are the primary carriers of meaning. Semitic languages have interesting morphology, where word roots are not themselves syllables or words, but isolated sets of consonants (usually three characters). Words are composed out of roots by adding vowels to the root consonants (although prefixes and suffixes are often added as well). For example, in Arabic, the root meaning “write” has the form k-t-b. From this root, words are formed by filling in the vowels, e.g., kitAb “book,” kutub “books,” kAtib “writer,” kuttAb “writers,” kataba “he wrote,” yaktubu “he writes,” etc. Semitic languages, as stated in Wikipedia, are spoken by more than 270 million people. The most widely spoken Semitic languages today are Arabic (206 million native speakers), Amharic (27 million), Hebrew (7 million), Tigrinya (6.7 million), Syriac (1 million), and Maltese (419 thousand). NLP research applied to Semitic languages has been the focus of attention of many researchers for more than a decade, and several technical solutions have been proposed, especially for Arabic NLP, where we find a very large amount of accomplished research. This will be reflected in this book, where Arabic will take the lion’s share. Hebrew has also been the center of attention of several NLP research works, but to a smaller degree when compared to Arabic. Most of the key published research works in Hebrew NLP will be discussed in this book. For Amharic, Maltese, and Syriac, because of the very limited amount of NLP research publicly available, we did not limit ourselves to presenting key techniques, but also proposed solutions inspired by Arabic and Hebrew. Our aim for this book is to provide a “one-stop shop” for all the requisite background and practical advice when building NLP applications for Semitic languages. While this is quite a tall order, we hope that, at a minimum, you find this book a useful resource.
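The root-and-pattern word formation described above for the root k-t-b can be sketched in a few lines of Python. This is only an illustrative toy, not a method taken from the book: the function name `interdigitate` and the pattern notation (digits 1, 2, 3 marking the slots for the first, second, and third root consonant) are assumptions introduced here, and the patterns are simply read off the example words quoted in the text.

```python
def interdigitate(root, pattern):
    """Fill the numbered consonant slots of a pattern with root consonants.

    root    -- hyphen-separated consonants, e.g. "k-t-b"
    pattern -- template in which the digits 1, 2, 3 stand for the first,
               second, and third root consonant (hypothetical notation);
               every other character is copied unchanged
    """
    consonants = root.split("-")
    return "".join(
        consonants[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )


ROOT = "k-t-b"

# Patterns derived from the example words in the text (illustrative only).
PATTERNS = {
    "1i2A3": "book",
    "1u2u3": "books",
    "1A2i3": "writer",
    "1u22A3": "writers",
    "1a2a3a": "he wrote",
    "ya12u3u": "he writes",
}

for pattern, gloss in PATTERNS.items():
    print(f"{interdigitate(ROOT, pattern):10s} {gloss}")
```

The sketch covers only the filling of consonant slots. Real morphological analyzers for Semitic languages must also handle affixes, orthographic variation, and the missing vowels of abjad scripts, none of which is modeled here.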

As with English, the dominant approach in NLP for Semitic languages has been to build a statistical model that can learn from examples. In this way, a model can be robust to changes in the type of text and even the language of text on which it operates. With the right design choices, the same model can be trained to work in a new domain simply by providing new examples in that domain. This approach also obviates the need for researchers to lay out, in a painstaking fashion, all the rules that govern the problem at hand and the manner in which those rules must be combined. A statistical system typically allows researchers to provide an abstract expression of possible features of the input, where the relative importance of those features can be learned during the training phase and can be applied to new text during the decoding, or inference, phase. While this book will devote some attention to cutting-edge algorithms and techniques, the primary purpose will be a thorough explication of best practices in the field. Furthermore, every chapter describes how the techniques discussed apply to Semitic languages.

This book is divided into two parts. Part I includes the first five chapters and lays out several core NLP problems and algorithms to attack those problems. The first chapter introduces some basic linguistic facts about Semitic languages, covering orthography, morphology, and syntax. It also shows a contrastive analysis of some of these linguistic phenomena across the various languages. The second chapter introduces the important concept of morphology, the study of the structure of words, and ways to process the diverse array of morphologies present in Semitic languages. Chapter 3 investigates the various methods of uncovering a sentence’s internal structure, or syntax. Syntax has long been a dominant area of research in NLP.
This dominance is explained in part by the fact that the structure of a sentence is related to the sentence’s meaning, and so uncovering syntactic structure can serve as a first step toward “understanding” a sentence. One step beyond syntactic parsing toward understanding a sentence is to perform semantic parsing, which consists in finding a structured meaning representation for a sentence or a snippet of text. This is the focus of Chap. 4, which also covers a related subproblem known as semantic role labeling, which attempts to find the syntactic phrases that constitute the arguments to some verb or predicate. By identifying and classifying a verb’s arguments, we come closer to producing a logical form for a sentence, which is one way to represent a sentence’s meaning in such a way as to be readily processed by machine. In several NLP applications, one simply wants to accurately estimate the likelihood of a word (or word sequence) in a phrase or sentence, without the need to analyze syntactic or semantic structure. The history, or context, that is used to make that estimation might be long or short, knowledge rich, or knowledge poor. The problem of producing a likelihood or probability estimate for a word is known as language modeling, and is the subject of Chap. 5.

Part II takes the various core areas of NLP described in Part I and explains how they are applied to real-world NLP applications available nowadays for several Semitic languages. Chapters in this part of the book explore several tradeoffs in making various algorithmic and design choices when building a robust NLP application for Semitic languages, mainly Arabic and Hebrew. Chapter 6 describes one of the oldest problems in the field, namely, machine translation. Automatically translating from one language to another has long been a holy grail of NLP research, and in recent years the community has developed techniques that make machine translation a practical reality, reaping rewards after decades of effort. This chapter discusses recent efforts and techniques in translating Semitic languages such as Arabic and Hebrew to and from English.
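As a small illustration of the language modeling problem mentioned above for Chap. 5, the following Python sketch estimates the probability of a word given a one-word history. It is a toy under stated assumptions, not the approach of any chapter: it uses a bigram model with add-one (Laplace) smoothing, the function names `train_bigram` and `prob` are hypothetical, and the three training sentences are invented English glosses chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect history and bigram counts from whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}  # the end-of-sentence marker can be predicted too
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (history, word) pairs
    return histories, bigrams, vocab


def prob(word, prev, model):
    """Add-one smoothed estimate of P(word | prev)."""
    histories, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


# Invented toy corpus (an assumption for illustration, not data from the book).
corpus = ["he wrote a book", "he wrote books", "he writes books"]
model = train_bigram(corpus)

print(prob("wrote", "he", model))   # seen twice after "he"
print(prob("book", "he", model))    # never seen after "he", still non-zero
```

Here the history is a single preceding word, which corresponds to the "short, knowledge-poor" end of the range of contexts described in the text. Smoothing is what keeps unseen word pairs from receiving zero probability.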

The following three chapters focus on the core parts of a larger application area known as information extraction. Chapter 7 describes ways to identify and classify named entities in text. Chapter 8 discusses the linguistic relation that holds between two textual entities when one of them (the anaphor) refers to another that usually occurs before it in the text (the antecedent). The last chapter of this trilogy, Chap. 9, continues the information extraction discussion, exploring techniques for finding out how two entities are related to each other, known as relation extraction.

Finding the few documents, or parts of documents, that are relevant to a search query is clearly an important NLP problem, as the popularity of search engines such as Bing and Google shows. This problem is known as information retrieval and is the subject of Chap. 10. Another way in which we might tackle the sheer quantity of text is by automatically summarizing it. This is the content of Chap. 11. This problem either involves finding the sentences, or bits of sentences, that contribute toward providing a relevant summary, or else ingesting the text, summarizing its meaning in some internal representation, and then generating the text that constitutes a summary. Often, humans would like machines to process text automatically because they have questions they seek to answer. These questions can range from simple, factoid-like questions, such as “what is the family of languages to which Hebrew belongs?” to more complex questions such as “what political events succeeded the Tunisian revolution?” Chapter 12 discusses recent techniques to build systems to answer these types of questions automatically.

In several cases, we would like our speech to be automatically transcribed so that we can interact more naturally with machines. The process of converting speech into text, known as Automatic Speech Recognition, is the subject of Chap. 13. Many advances have been made in recent years in Arabic speech recognition. Current systems for Arabic and Hebrew achieve very reasonable performance and can be used in real NLP applications.

As much as we hope this book is self-contained and covers most research work in NLP for Semitic languages, we also hope that for you, the reader, it serves as the beginning and not an end. Each chapter has a long list of relevant work upon which it is based, allowing you to explore any subtopic in great detail. The large community of NLP researchers is growing throughout the world, and we hope you join us in our exciting efforts to process language automatically and that you interact with us at universities, at research labs, at conferences, on social networks, and elsewhere. NLP systems of the future and their application to Semitic languages are going to be even more exciting than the ones we have now, and we look forward to all your contributions!

Bellevue, WA, USA
Imed Zitouni
March 2014

Acknowledgments

This book is the result of a highly collaborative effort. I am immensely grateful for the encouraging support obtained from Springer, especially from Olga Chiarcos, who helped this project get off the ground, and from Federica Corradi Dell’Acqua for her support during the different stages of this project until completion. I am also grateful to Lyn Imeson for her efficiency in copyediting this book.

A book of this kind would also not have been possible without the effort and technical acumen of my fellow chapter authors, so I owe huge thanks to Ray Fabri, Michael Gasser, Nizar Habash, George Kiraz, Shuly Wintner, Mouna Diab, Yuval Marton, Ilana Heintz, Hany Hassan, Kareem Darwish, Behrang Mohit, Khadiga Mahmoud Seddik, Ali Farghaly, Vittorio Castelli, Yassine Benajiba, Paolo Rosso, Lahsen Abouenour, Omar Trigui, Karim Bouzoubaa, Lamia Belguith, Marem Ellouze, Mohamed H. Maaloul, Maher Jaoua, Fatma K.
Jaoua, Philippe Blache, Hagen Soltau, George Saon, Lidia Mangu, Hong-Kwang Kuo, Brian Kingsbury, Stephen Chu, and Fadi Biadsy.

Finally, I am very grateful to the members of the technical review committee for their generous time, effort, and valuable feedback, which contributed significantly to improving the technical quality of the chapters. Huge thanks to Chafik Aloulou, Mohammed Afify, Yassine Benajiba, Joseph Dichy, Yi Fang, Kavita Ganesan, Spence Green, Dilek Hakkani-Tur, Xiaoqiang Luo, Siddharth Patwardhan, Eric Ringger, Murat Saraclar, Ruhi Sarikaya, Lucy Vanderwende, and Bing Zhao.

Imed Zitouni