Lecture Notes in Artificial Intelligence 7614
Subseries of Lecture Notes in Computer Science

LNAI Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Yuzuru Tanaka, Hokkaido University, Sapporo, Japan
Wolfgang Wahlster, DFKI and Saarland University, Saarbrücken, Germany

LNAI Founding Series Editor
Joerg Siekmann, DFKI and Saarland University, Saarbrücken, Germany

Hitoshi Isahara, Kyoko Kanzaki (Eds.)

Advances in Natural Language Processing
8th International Conference on NLP, JapTAL 2012
Kanazawa, Japan, October 22-24, 2012
Proceedings

Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors
Hitoshi Isahara
Toyohashi University of Technology
Information and Media Center
1-1 Hibarigaoka, Tenpakucho
Toyohashi 441-8580, Japan
E-mail: [email protected]

Kyoko Kanzaki
Toyohashi University of Technology
Information and Media Center
1-1 Hibarigaoka, Tenpakucho
Toyohashi 441-8580, Japan
E-mail: [email protected]

ISSN 0302-9743, e-ISSN 1611-3349
ISBN 978-3-642-33982-0, e-ISBN 978-3-642-33983-7
DOI 10.1007/978-3-642-33983-7
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012948107
CR Subject Classification (1998): I.2, H.3, H.4, H.5, H.2, J.1, I.5
LNCS Sublibrary: SL 7 – Artificial Intelligence

© Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The 8th International Conference on Natural Language Processing (JapTAL 2012) took place in the city of Kanazawa, a leading tourist destination that 7 million tourists visit every year. It was a great castle town ruled by an influential leader from the seventeenth century to the second half of the nineteenth century. Kanazawa has not suffered from any war devastation or big natural disaster up to now. Therefore, it maintains rows of historical houses and inherits traditional handicrafts and traditional performing arts.

JapTAL was the eighth in the series of the TAL conferences, following IceTAL 2010 (Reykjavik, Iceland), GoTAL 2008 (Gothenburg, Sweden), FinTAL 2006 (Turku, Finland), EsTAL 2004 (Alicante, Spain), PorTAL 2002 (Faro, Portugal), VexTAL 1999 (Venice, Italy), and FracTAL 1997 (Besançon, France). This was the first time that the TAL conference was held outside Europe.

The main purpose of the TAL conference series is to bring together scientists representing linguistics, computer science, and related fields, sharing a common interest in the advancement of computational linguistics and natural language processing.

The conference consists of invited talks, oral and poster presentations, and special sessions on the applications and theory of natural language processing and related areas. It provides excellent opportunities for the presentation of interesting new research results and discussions about them, leading to knowledge transfer and the generation of new ideas.

In the reviewing process of the main conference track, we received 42 submissions.
Among them, we selected 27 submissions as long papers and five submissions as short papers. Therefore, the acceptance ratio for the long papers is 64% and the total acceptance ratio including short papers is 76%.

We had two special sessions, i.e., “Game and NLP” and “Student/Young Researcher Session.” Papers submitted for these sessions were treated separately, and accepted papers were presented in each session, but they are not included in this volume.

As conference organizers of JapTAL 2012, we would like to thank all PC members who reviewed submissions on a very tight schedule and all local staff who helped us during the preparation of JapTAL. We would like to thank the International Exchange Program of the National Institute of Information and Communications Technology (NICT) for its support of JapTAL 2012.

Hitoshi Isahara
Kyoko Kanzaki

Organization

Program Chairs
Hitoshi Isahara
Kyoko Kanzaki

Program Committee
Takako Aikawa, Microsoft Research, USA
Johan Bos, University of Groningen, The Netherlands
Pierrette Bouillon, Geneva University, Switzerland
Caroline Brun, Xerox Corporation, France
Sylviane Cardey, University of Franche-Comté, France
Key-Sun Choi, KAIST, Korea
Christiane Fellbaum, Princeton University, USA
Filip Ginter, University of Turku, Finland
Peter Greenfield, University of Franche-Comté, France
Philippe de Groote, INRIA, France
Yurie Iribe, Toyohashi University of Technology, Japan
Katsunori Kotani, Kansai Gaidai University, Japan
Krister Lindén, University of Helsinki, Finland
Hrafn Loftsson, Reykjavik University, Iceland
Qing Ma, Ryukoku University, Japan
Bente Maegaard, University of Copenhagen, Denmark
Mathieu Morey, Nanyang Technological University, Singapore
Masayuki Okabe, Toyohashi University of Technology, Japan
Guy Perrier, INRIA, France
Kiyoaki Shirai, JAIST, Japan
Virach Sornlertlamvanich, NECTEC, Thailand
Koichi Takeuchi, Okayama University, Japan
Midori Tatsumi, Toyohashi University of Technology, Japan
Izabella Thomas, University of Franche-Comté, France
Noriko Tomuro, DePaul University, USA
Masatoshi Tsuchiya,
Toyohashi University of Technology, Japan
Jose Luis Vicedo, University of Alicante, Spain
Simo Vihjanen, Lingsoft Ltd., Finland
Xiaohong Wu, Qinghai University for Nationalities, China
Kazuhide Yamamoto, Nagaoka University of Technology, Japan
Yujie Zhang, Beijing Jiaotong University, China
Tiejun Zhao, Harbin Institute of Technology, China

Table of Contents

Machine Translation

The Impact of Crowdsourcing Post-editing with the Collaborative Translation Framework
    Takako Aikawa, Kentaro Yamamoto, and Hitoshi Isahara
Translation of Quantifiers in Japanese-Chinese Machine Translation
    Shaoyu Chen and Tadahiro Matsumoto
Toward Practical Use of Machine Translation
    Hitoshi Isahara
Phrase-Level Pattern-Based Machine Translation Based on Analogical Mapping Method
    Jun Sakata, Masato Tokuhisa, and Jin’ichi Murakami

Multilingual Issues

Parallel Texts Extraction from Multimodal Comparable Corpora
    Haithem Afli, Loïc Barrault, and Holger Schwenk
A Reliable Communication System to Maximize the Communication Quality
    Gan Jin and Natallia Khatseyeva
DAnIEL: Language Independent Character-Based News Surveillance
    Gaël Lejeune, Romain Brixtel, Antoine Doucet, and Nadine Lucas
OOV Term Translation, Context Information and Definition Extraction Based on OOV Term Type Prediction
    Jian Qu, Akira Shimazu, and Le Minh Nguyen
Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology
    Fatiha Sadat
Segmenting Long Sentence Pairs to Improve Word Alignment in English-Hindi Parallel Corpora
    Jyoti Srivastava and Sudip Sanyal
Shallow Syntactic Preprocessing for Statistical Machine Translation
    Hoai-Thu Vuong, Dao Ngoc Tu, Minh Le Nguyen, and Vinh Van Nguyen

Resources

Linguistic Rules Based Approach for Automatic Restoration of Accents on French Texts
    Paul Brillant Feuto Njonko, Sylviane Cardey-Greenfield, and Peter Greenfield
Word Clustering for Persian Statistical Parsing
    Masood Ghayoomi
Building a Lexically and Semantically-Rich Resource for Paraphrase Processing
    Wannachai Kampeera and Sylviane Cardey-Greenfield
Tagset Conversion with Decision Trees
    Bartosz Zaborowski and Adam Przepiórkowski
Fitting a Round Peg in a Square Hole: Japanese Resource Grammar in GF
    Elizaveta Zimina

Semantic Analysis

Arabic Language Analyzer with Lemma Extraction and Rich Tagset
    Ahmed H. Aliwy
Tracking Researcher Mobility on the Web Using Snippet Semantic Analysis
    Jorge J. García Flores, Pierre Zweigenbaum, Zhao Yue, and William Turner
Semantic Role Labelling without Deep Syntactic Parsing
    Konrad Gołuchowski and Adam Przepiórkowski
Temporal Information Extraction with Cross-Language Projected Data
    Przemysław Jarzębowski and Adam Przepiórkowski
Word Sense Disambiguation Based on Example Sentences in Dictionary and Automatically Acquired from Parallel Corpus
    Pulkit Kathuria and Kiyoaki Shirai
A Study on Hierarchical Table of Indexes for Multi-documents
    Tho Thi Ngoc Le, Minh Le Nguyen, and Akira Shimazu
Finding Good Initial Cluster Center by Using Maximum Average Distance
    Samuel Sangkon Lee and Chia Y.
Han
Applying a Burst Model to Detect Bursty Topics in a Topic Model
    Yusuke Takahashi, Takehito Utsuro, Masaharu Yoshioka, Noriko Kando, Tomohiro Fukuhara, Hiroshi Nakagawa, and Yoji Kiyota
UDRST: A Novel System for Unlabeled Discourse Parsing in the RST Framework
    Ngo Xuan Bach, Nguyen Le Minh, and Akira Shimazu

Sentiment Analysis

Long-Term Goal Discovery in the Twitter Posts through the Word-Pair LDA Model
    Dandan Zhu, Yusuke Fukazawa, Eleftherios Karapetsas, and Jun Ota
Finding Social Relationships by Extracting Polite Language in Micro-blog Exchanges
    Norinobu Hatamoto, Yoshiaki Kurosawa, Shogo Hamada, Kazuya Mera, and Toshiyuki Takezawa
Twitter Sentiment Analysis Based on Writing Style
    Hiroshi Maeda, Kazutaka Shimada, and Tsutomu Endo
Extraction of User Opinions by Adjective-Context Co-clustering for Game Review Texts
    Kevin Raison, Noriko Tomuro, Steve Lytinen, and Jose P. Zagal

Speech and Generation

Automatic Phone Alignment: A Comparison between Speaker-Independent Models and Models Trained on the Corpus to Align
    Sandrine Brognaux, Sophie Roekhaut, Thomas Drugman, and Richard Beaufort
A Story Generation System Based on Propp Theory: As a Mechanism in an Integrated Narrative Generation System
    Shohei Imabuchi and Takashi Ogata
Automatic Utterance Generation by Keeping Track of the Conversation’s Focus within the Utterance Window
    Yusuke Nishio and Dongli Han

Author Index
The Impact of Crowdsourcing Post-editing with the Collaborative Translation Framework

Takako Aikawa¹, Kentaro Yamamoto², and Hitoshi Isahara²

¹ Microsoft Research, Machine Translation Team
[email protected]
² Toyohashi University of Technology
[email protected], [email protected]

Abstract. This paper presents a preliminary report on the impact of crowdsourcing post-editing through the so-called “Collaborative Translation Framework” (CTF) developed by the Machine Translation team at Microsoft Research. We first provide a high-level overview of CTF and explain the basic functionalities available from CTF. Next, we provide the motivation and design of our crowdsourcing post-editing project using CTF. Last, we present the results from the project and our observations. Crowdsourcing translation is an increasingly popular trend in the MT community, and we hope that our paper can shed new light on the research into crowdsourcing translation.

Keywords: Crowdsourcing post-editing, Collaborative Translation Framework.

1 Introduction

The output of machine translation (MT) can be used either as-is (i.e., raw-MT) or for post-editing (i.e., MT for post-editing). Although the advancement of MT technology is making raw-MT use more pervasive, reservations about raw-MT persist, especially among users who must ensure the accuracy of the translated content (e.g., government organizations, educational institutions, NPOs/NGOs, and enterprises). Professional human translation from scratch, however, is simply too expensive. To reduce the cost of translation while achieving high translation quality, many organizations use MT for post-editing; that is, they use MT output as an initial draft of the translation and let human translators post-edit it.
Many researchers (both from academia and industry) have been investigating how to optimize the post-editing process and developing tools that can achieve high productivity gains via MT for post-editing.¹

Recently, another approach to reducing the cost of translation has surfaced, namely, crowdsourcing translation. Crowdsourcing translation started as a method to create training/evaluation data for statistical machine translation (SMT). For instance, with Amazon’s Mechanical Turk, one can create a large bilingual corpus to build a new SMT system relatively cheaply and quickly (Ambati et al. (2010)[7], Zaidan and Callison-Burch (2011)[8], Ambati and Vogel (2011)[9]).²

¹ See Allen (2003, 2005)[1][2], O’Brien (2005)[3], Guerberof (2009a/b)[4][5], and Koehn and Haddow (2009)[6], for instance.

H. Isahara and K. Kanzaki (Eds.): JapTAL 2012, LNAI 7614, pp. 1–10, 2012.
© Springer-Verlag Berlin Heidelberg 2012

This paper introduces a new way of crowdsourcing translation. Our approach is unique in that it focuses on post-editing and uses a different platform, namely, the Collaborative Translation Framework (CTF) developed by the Machine Translation team at Microsoft Research. For our project, foreign students at Toyohashi University of Technology served as editors: we asked them to post-edit, using the CTF functionalities, the output produced by Microsoft Translator (http://www.microsofttranslator.com) when translating the university’s English website (http://www.tut.ac.jp/english/introduction/) into their own languages. This paper is a preliminary report on the results from this project.

The organization of the paper is as follows: Section 2 provides a high-level overview of CTF while describing various functionalities associated with CTF. Section 3 presents the design of our crowdsourcing project using Toyohashi University of Technology websites.
Section 4 presents a preliminary report on the results from the project, and Section 5 provides our concluding remarks.

2 Collaborative Translation Framework (CTF)

As mentioned at the outset of the paper, CTF has been developed by the Machine Translation team at Microsoft Research. CTF aims to create an environment where MT and humans can help each other to improve translation quality in an effective way. One of the prominent functionalities of CTF is to allow users to modify or edit the MT output from Microsoft Translator. Thus, with CTF, we can utilize the power of crowdsourcing to post-edit MT output. There are other types of functionalities associated with CTF, and in the following subsections, we describe these in more detail.

2.1 Basic Functionalities of CTF

CTF functionalities have been fully integrated into Microsoft Translator’s Widget (http://www.microsofttranslator.com/widget), and one can experience how CTF works by visiting any website with this Widget.³ For instance, let us look at Figure 1, which is a snapshot of the Widget on the English homepage of Toyohashi University of Technology (http://www.tut.ac.jp/english/introduction/). With this Widget, users (or visitors of this website) can translate the entire website automatically into their own languages; select their target languages (in Figure 1, Japanese is being

² Amazon’s Mechanical Turk has also been used to create other types of data. For instance, see Callison-Burch (2009)[10] and Higgins et al. (2010)[11].
³ CTF functionalities can also be called via Microsoft Translator’s public APIs. For more details on Microsoft Translator’s APIs, visit http://msdn.microsoft.com/en-us/library/dd576287