Advanced Information and Knowledge Processing

Series Editors
Professor Lakhmi Jain
Professor Xindong Wu

Also in this series

Gregoris Mentzas, Dimitris Apostolou, Andreas Abecker and Ron Young
Knowledge Asset Management 1-85233-583-1

K.C. Tan, E.F. Khor and T.H. Lee
Multiobjective Evolutionary Algorithms and Applications 1-85233-836-9

Michalis Vazirgiannis, Maria Halkidi and Dimitrios Gunopulos
Uncertainty Handling and Quality Assessment in Data Mining 1-85233-655-2

Nikhil R. Pal and Lakhmi Jain (Eds)
Advanced Techniques in Knowledge Discovery and Data Mining 1-85233-867-9

Asunción Gómez-Pérez, Mariano Fernández-López and Oscar Corcho
Ontological Engineering 1-85233-551-3

Amit Konar and Lakhmi Jain
Cognitive Engineering 1-85233-975-6

Arno Scharl (Ed.)
Environmental Online Communication 1-85233-783-4

Miroslav Kárný (Ed.)
Optimized Bayesian Dynamic Advising 1-85233-928-4

Shichao Zhang, Chengqi Zhang and Xindong Wu
Knowledge Discovery in Multiple Databases 1-85233-703-6

Yannis Manolopoulos, Alexandros Nanopoulos, Apostolos N. Papadopoulos and Yannis Theodoridis
R-trees: Theory and Applications 1-85233-977-2

Jason T.L. Wang, Mohammed J. Zaki, Hannu T.T. Toivonen and Dennis Shasha (Eds)
Data Mining in Bioinformatics 1-85233-671-4

Sanghamitra Bandyopadhyay, Ujjwal Maulik, Lawrence B. Holder and Diane J. Cook (Eds)
Advanced Methods for Knowledge Discovery from Complex Data 1-85233-989-6

C.C. Ko, Ben M. Chen and Jianping Chen
Creating Web-based Laboratories 1-85233-837-7

Marcus A. Maloof (Ed.)
Machine Learning and Data Mining for Computer Security 1-84628-029-X

Manuel Graña, Richard Duro, Alicia d’Anjou and Paul P. Wang (Eds)
Information Processing with Evolutionary Algorithms 1-85233-886-0

Sifeng Liu and Yi Lin
Grey Information 1-85233-995-0

Colin Fyfe
Hebbian Learning and Negative Feedback Networks 1-85233-883-0

Vasile Palade, Cosmin Danut Bocaniala and Lakhmi Jain (Eds)
Computational Intelligence in Fault Diagnosis 1-84628-343-4

Yun-Heh Chen-Burger and Dave Robertson
Automating Business Modelling 1-85233-835-0

Mitra Basu and Tin Kam Ho (Eds)
Data Complexity in Pattern Recognition 1-84628-171-7

Dirk Husmeier, Richard Dybowski and Stephen Roberts (Eds)
Probabilistic Modeling in Bioinformatics and Medical Informatics 1-85233-778-8

Samuel Pierre (Ed.)
E-learning Networked Environments and Architectures 1-84628-351-5

Ajith Abraham, Lakhmi Jain and Robert Goldberg (Eds)
Evolutionary Multiobjective Optimization 1-85233-787-7

Arno Scharl and Klaus Tochtermann (Eds)
The Geospatial Web 1-84628-826-5

Ngoc Thanh Nguyen
Advanced Methods for Inconsistent Knowledge Management 1-84628-888-3

Amnon Meisels
Search by Constrained Agents 978-1-84800-039-1

Francesco Camastra and Alessandro Vinciarelli
Machine Learning for Image, Video and Audio Analysis 978-1-84800-006-3

Mikhail Prokopenko (Ed.)
Advances in Applied Self-organizing Systems 978-1-84628-981-1

András Kornai

Mathematical Linguistics

András Kornai
MetaCarta Inc.
350 Massachusetts Ave.
Cambridge, MA 02139
USA

ISBN: 978-1-84628-985-9
e-ISBN: 978-1-84628-986-6
DOI: 10.1007/978-1-84628-986-6

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2007940401

© Springer-Verlag London Limited 2008

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Design and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

Springer Science+Business Media
Springer.com

To my family

Preface

Mathematical linguistics is rooted both in Euclid’s (circa 325–265 BCE) axiomatic method and in Pāṇini’s (circa 520–460 BCE) method of grammatical description. To be sure, both Euclid and Pāṇini built upon a considerable body of knowledge amassed by their precursors, but the systematicity, thoroughness, and sheer scope of the Elements and the Aṣṭādhyāyī would place them among the greatest landmarks of all intellectual history even if we disregarded the key methodological advance they made.
As we shall see, the two methods are fundamentally very similar: the axiomatic method starts with a set of statements assumed to be true and transfers truth from the axioms to other statements by means of a fixed set of logical rules, while the method of grammar is to start with a set of expressions assumed to be grammatical both in form and meaning and to transfer grammaticality to other expressions by means of a fixed set of grammatical rules.

Perhaps because our subject matter has attracted the efforts of some of the most powerful minds (of whom we single out A. A. Markov here) from antiquity to the present day, there is no single easily accessible introductory text in mathematical linguistics. Indeed, to the mathematician the whole field of linguistics may appear to be hopelessly mired in controversy, and neither the formidable body of empirical knowledge about languages nor the standards of linguistic argumentation offer an easy entry point.

Those with a more postmodern bent may even go as far as to doubt the existence of a solid core of mathematical knowledge, often pointing at the false theorems and incomplete or downright wrong proofs that slip through the peer review process at a perhaps alarming rate. Rather than attempting to drown such doubts in rivers of philosophical ink, the present volume will simply proceed more geometrico in exhibiting this solid core of knowledge. In Chapters 3–6, a mathematical overview of the traditional main branches of linguistics, phonology, morphology, syntax, and semantics, is presented.

Who should read this book?

The book is accessible to anyone with sufficient general mathematical maturity (graduate or advanced undergraduate). No prior knowledge of linguistics or languages is assumed on the part of the reader.
The book offers a single entry point to the central methods and concepts of linguistics that are made largely inaccessible to the mathematician, computer scientist, or engineer by the surprisingly adversarial style of argumentation (see Section 1.2), the apparent lack of adequate definitions (see Section 1.3), and the proliferation of unmotivated notation and formalism (see Section 1.4) all too often encountered in research papers and monographs in the humanities. Those interested in linguistics can learn a great deal more about the subject here than what is covered in introductory courses just from reading through the book and consulting the references cited. Those who plan to approach linguistics through this book should be warned in advance that many branches of linguistics, in particular psycholinguistics, child language acquisition, and the study of language pathology, are largely ignored here – not because they are viewed as inferior to other branches but simply because they do not offer enough grist for the mathematician’s mill. Much of what the linguistically naive reader may find interesting about language turns out to be more pertinent to cognitive science, the philosophy of language, and sociolinguistics, than to linguistics proper, and the Introduction gives these issues the shortest possible shrift, discussing them only to the extent necessary for disentangling mathematical linguistics from other concerns.

Conversely, issues that linguists sometimes view as peripheral to their enterprise will get more discussion here simply because they offer such a rich variety of mathematical techniques and problems that no book on mathematical linguistics that ignored them could be considered complete. After a brief review of information theory in Chapter 7, we will devote Chapters 8 and 9 to phonetics, speech recognition, the recognition of handwriting and machine print, and in general to issues of linguistic signal processing and pattern matching, including information extraction, information retrieval, and statistical natural language processing. Our treatment assumes a bit more mathematical maturity than the excellent textbooks by Jelinek (1997) and Manning and Schütze (1999) and intends to complement them. Kracht (2003) conveniently summarizes and extends much of the discrete (algebraic and combinatorial) work on mathematical linguistics. It is only because of the timely appearance of this excellent reference work that the first six chapters could be kept to a manageable size and we could devote more space to the continuous (analytic and probabilistic) aspects of the subject. In particular, expository simplicity would often dictate that we keep the underlying parameter space discrete, but in the later chapters we will be concentrating more on the case of continuous parameters, and discuss the issue of quantization losses explicitly.

In the early days of computers, there was a great deal of overlap between the concerns of mathematical linguistics and computer science, and a surprising amount of work that began in one field ended up in the other, sometimes explicitly as part of computational linguistics, but often as general theory with its roots in linguistics largely forgotten.
In particular, the basic techniques of syntactic analysis are now firmly embedded in the computer science curriculum, and the student can already choose from a large variety of textbooks that cover parsing, automata, and formal language theory. Here we single out the classic monograph by Salomaa (1973), which shows the connection to formal syntax in a way readily accessible to the mathematically minded reader. We will selectively cover only those aspects of this field that address specifically linguistic concerns, and again our guiding principle will be mathematical content, as opposed to algorithmic detail. Readers interested in the algorithms should consult the many excellent natural language processing textbooks now available, of which we single out Jurafsky and Martin (2000, with a new edition planned in 2008).

How is the book organized?

To the extent feasible we follow the structure of the standard introductory courses to linguistics, but the emphasis will often be on points only covered in more advanced courses. The book contains many exercises. These are, for the most part, rather hard (over level 30 in the system of Knuth 1971) but extremely rewarding. Especially in the later chapters, the exercises are often based on classical and still widely cited theorems, so the solutions can usually be found on the web quite easily simply by consulting the references cited in the text. However, readers are strongly advised not to follow this route before spending at least a few days attacking the problem. Unsolved problems presented as exercises are marked by an asterisk, a symbol that we also use when presenting examples and counterexamples that native speakers would generally consider wrong (ungrammatical): Scorsese is a great director is a positive (grammatical) example while *Scorsese a great director is is a negative (ungrammatical) example.
Some exercises, marked by a dagger †, require the ability to manipulate sizeable data sets, but no in-depth knowledge of programming, data structures, or algorithms is presumed. Readers who write code effortlessly will find these exercises easy, as they rarely require more than a few simple scripts. Those who find such exercises problematic can omit them entirely. They may fail to gain direct appreciation of some empirical properties of language that drive much of the research in mathematical linguistics, but the research itself remains perfectly understandable even if the motivation is taken on faith. A few exercises are marked by a raised M – these are major research projects the reader is not expected to see to completion, but spending a few days on them is still valuable.

Because from time to time it will be necessary to give examples from languages that are unlikely to be familiar to the average undergraduate or graduate student of mathematics, we decided, somewhat arbitrarily, to split languages into two groups. Major languages are those that have a chapter in Comrie’s (1990) The World’s Major Languages – these will be familiar to most people and are left unspecified in the text. Minor languages usually require some documentation, both because language names are subject to a great deal of spelling variation and because different groups of people may use very different names for one and the same language. Minor languages are therefore identified here by their three-letter Ethnologue code (15th edition, 2005) given in square brackets [].

Each chapter ends with a section on further reading. We have endeavored to make the central ideas of linguistics accessible to those new to the field, but the discussion offered in the book is often skeletal, and readers are urged to probe further.
Generally, we recommend those papers and books that presented the idea for the first time, not just to give proper credit but also because these often provide perspective and insight that later discussions take for granted. Readers who industriously follow the recommendations made here should do so for the benefit of learning the basic vocabulary of the field rather than in the belief that such reading will immediately place them at the forefront of research.

The best way to read this book is to start at the beginning and to progress linearly to the end, but the reader who is interested only in a particular area should not find it too hard to jump in at the start of any chapter. To facilitate skimming and alternative reading plans, a generous amount of forward and backward pointers are provided – in a hypertext edition these would be replaced by clickable links. The material is suitable for an aggressively paced one-semester course or a more leisurely paced two-semester course.

Acknowledgments

Many typos and stylistic infelicities were caught, and excellent references were suggested, by Márton Makrai (Budapest Institute of Technology), Dániel Margócsy (Harvard), Doug Merritt (San Jose), Reinhard Muskens (Tilburg), Almerindo Ojeda (University of California, Davis), Bálint Sass (Budapest), Madeleine Thompson (University of Toronto), and Gabriel Wyler (San Jose). The painstaking work of the Springer editors and proofreaders, Catherine Brett, Frank Ganz, Hal Henglein, and Jeffrey Taub, is gratefully acknowledged.

The comments of Tibor Beke (University of Massachusetts, Lowell), Michael Bukatin (MetaCarta), Anssi Yli-Jyrä (University of Helsinki), Péter Gács (Boston University), Marcus Kracht (UCLA), András Serény (CEU), Péter Siptár (Hungarian Academy of Sciences), Anna Szabolcsi (NYU), Péter Vámos (Budapest Institute of Technology), Károly Varasdi (Hungarian Academy of Sciences), and Dániel Varga (Budapest Institute of Technology) resulted in substantive improvements.

Writing this book would not have been possible without the generous support of MetaCarta Inc.
(Cambridge, MA), the MOKK Media Research center at the Budapest Institute of Technology Department of Sociology, the Farkas Heller Foundation, and the Hungarian Telekom Foundation for Higher Education – their help and care is gratefully acknowledged.

Contents

Preface
1 Introduction
    1.1 The subject matter
    1.2 Cumulative knowledge
    1.3 Definitions
    1.4 Formalization
    1.5 Foundations
    1.6 Mesoscopy
    1.7 Further reading
2 The elements
    2.1 Generation
    2.2 Axioms, rules, and constraints
    2.3 String rewriting
    2.4 Further reading
3 Phonology
    3.1 Phonemes
    3.2 Natural classes and distinctive features
    3.3 Suprasegmentals and autosegments
    3.4 Phonological computation
    3.5 Further reading
4 Morphology
    4.1 The prosodic hierarchy
        4.1.1 Syllables
        4.1.2 Moras
        4.1.3 Feet and cola
        4.1.4 Words and stress typology
    4.2 Word formation