WSSANLP 2016
6th Workshop on South and Southeast Asian Natural Language Processing
Proceedings of the Conference
December 11-16, 2016
Osaka, Japan

Copyright of each paper stays with the respective authors (or their employers).
ISBN 978-4-87974-705-1

Preface

Welcome to the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP-2016), a collocated event at the 26th International Conference on Computational Linguistics (COLING 2016), December 11-16, 2016 at Osaka International Convention Center, Osaka, Japan.

South Asia comprises the countries Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan and Sri Lanka. Southeast Asia, on the other hand, consists of Brunei, Burma, Cambodia, East Timor, Indonesia, Laos, Malaysia, Philippines, Singapore, Thailand and Vietnam. This area is home to thousands of languages that belong to different language families such as Indo-Aryan, Indo-Iranian, Dravidian, Sino-Tibetan, Austro-Asiatic, Kradai, Hmong-Mien, etc. In terms of population, South Asia and Southeast Asia represent 35 percent of the total population of the world, which means as many as 2.5 billion speakers. Some of the languages of these regions have a large number of native speakers: Hindi (5th largest by number of native speakers), Bengali (6th), Punjabi (12th), Tamil (18th), and Urdu (20th).

As the internet and electronic devices, including PCs and hand-held devices such as mobile phones, have spread far and wide in the region, it has become imperative to develop language technology for these languages. It is important for economic development as well as for social and individual progress.

A characteristic of these languages is that they are under-resourced. The words of these languages show rich variations in morphology. Moreover, they are often heavily agglutinated and synthetic, making segmentation an important issue. The intellectual motivation for this workshop comes from the need to explore ways of harnessing the morphology of these languages for higher level processing.
The task of morphology, however, in South and Southeast Asian languages is intimately linked with segmentation for these languages.

The goals of WSSANLP are:
• Providing a platform to linguistic and NLP communities for sharing and discussing ideas and work on South and Southeast Asian languages and combining efforts.
• Development of useful and high quality computational resources for under-resourced South and Southeast Asian languages.

We are delighted to present to you this volume of proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing. We received a total of 37 submissions in the categories of long paper and short paper. On the basis of our review process, we have competitively selected 18 full papers and 3 short papers.

We look forward to an invigorating workshop.

Dekai Wu (Chair WSSANLP-2016), Hong Kong University of Science and Technology, Hong Kong
Pushpak Bhattacharyya (Co-Chair WSSANLP-2016), Indian Institute of Technology Patna, India

Workshop Chair
Dekai Wu, Hong Kong University of Science and Technology, Hong Kong

Workshop Co-Chair
Pushpak Bhattacharyya, Indian Institute of Technology Patna, India

Keynote Speaker
Alain Désoulières, INALCO-CERLOM, France

Organisers
M. G. Abbas Malik, Auckland University of Technology, Auckland, New Zealand (chair)
Sadaf Abdul Rauf, Fatima Jinnah Women University, Islamabad, Pakistan
Mahsa Mohaghegh, Unitec Institute of Technology, Auckland, New Zealand

Programme Committee
Sadaf Abdul Rauf, Fatima Jinnah Women University, Pakistan
Naveed Afzal, Cardiovascular Biomarkers Laboratory, Mayo Clinic, USA
Tafseer Ahmed, DHA Suffa University, Pakistan
Aasim Ali, University of the Punjab, Pakistan
Jalal S. Alowibdi, University of Jeddah, Saudi Arabia
Saleh Alshomrani, University of Jeddah, Saudi Arabia
Amer Alzaidi, University of Jeddah, Saudi Arabia
M. Waqas Anwar, COMSATS Institute of Technology Abbottabad, Pakistan
Bal Krishna Bal, Kathmandu University, Nepal
Sivaji Bandyopadhyay, Jadavpur University, India
Vincent Berment, GETALP-LIG and INALCO, France
Laurent Besacier, University of Grenoble, France
Pushpak Bhattacharyya, Indian Institute of Technology Patna, India
Hervé Blanchon, University of Grenoble, France
Christian Boitet, University of Grenoble, France
Miriam Butt, University of Konstanz, Germany
Eric Castelli, International Research Center MICA, Vietnam
Amitava Das, Indian Institute of Information Technology, Sri City, India
Alain Désoulières, INALCO-CERLOM, France
Alexander Gelbukh, Center for Computing Research, CIC, Mexico
Choochart Haruechaiyasak, National Electronics and Computer Technology Center (NECTEC), Thailand
Sarmad Hussain, University of Engineering and Technology Lahore, Pakistan
Aravind K. Joshi, University of Pennsylvania, USA
Amba Kulkarni, University of Hyderabad, India
Gurpreet Singh Lehal, Punjabi University, Patiala, India
Haizhou Li, Institute for Infocomm Research, Singapore
M. G. Abbas Malik, Auckland University of Technology, New Zealand
Mahsa Mohaghegh, Unitec Institute of Technology, New Zealand
Ajit Narayanan, Auckland University of Technology, New Zealand
K. V. S. Prasad, Chalmers University of Technology, Sweden
Bali Ranaivo-Malançon, University of Malaysia Sarawak, Malaysia
Paolo Rosso, Universitat Politècnica de València, Spain
Huda Sarfraz, Beaconhouse National University, Pakistan
Hossein Sarrafzadeh, High Technology Transdisciplinary Research Network, Unitec Auckland, New Zealand
L. Sobha, AU-KBC Research Centre, India
Virach Sornlertlamvanich, TCL, National Institute of Information and Communication Technology, Thailand
Ruvan Weerasinghe, University of Colombo School of Computing, Sri Lanka

Table of Contents

Full Papers

Compound Type Identification in Sanskrit: What Roles do the Corpus and Grammar Play?
Amrith Krishna, Pavankumar Satuluri, Shubham Sharma, Apurv Kumar and Pawan Goyal

Comparison of Grapheme-to-Phoneme Conversion Methods on a Myanmar Pronunciation Dictionary
Ye Kyaw Thu, Win Pa Pa, Yoshinori Sagisaka and Naoto Iwahashi

Character-Aware Neural Networks for Arabic Named Entity Recognition for Social Media
Mourad Gridach

Development of a Bengali parser by cross-lingual transfer from Hindi
Ayan Das, Agnivo Saha and Sudeshna Sarkar

Sinhala Short Sentence Similarity Calculation using Corpus-Based and Knowledge-Based Similarity Measures
Jcs Kadupitiya, Surangika Ranathunga and Gihan Dias

Enriching Source for English-to-Urdu Machine Translation
Bushra Jawaid, Amir Kamran and Ondřej Bojar

The IMAGACT4ALL Ontology of Animated Images: Implications for Theoretical and Machine Translation of Action Verbs from English-Indian Languages
Pitambar Behera, Sharmin Muzaffar, Atul Kr. Ojha and Girish Jha

Crowdsourcing-based Annotation of Emotions in Filipino and English Tweets
Fermin Roberto Lapitan, Riza Theresa Batista-Navarro and Eliezer Albacea

Sentiment Analysis of Tweets in Three Indian Languages
Shanta Phani, Shibamouli Lahiri and Arindam Biswas

Dealing with Linguistic Divergences in English-Bhojpuri Machine Translation
Pitambar Behera, Neha Mourya and Vandana Pandey

The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese
Miki Nishioka and Shiro Akasegawa

Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus
Riyafa Abdul Hameed, Nadeeshani Pathirennehelage, Anusha Ihalapathirana, Maryam Ziyad Mohamed, Surangika Ranathunga, Sanath Jayasena, Gihan Dias and Sandareka Fernando

Clustering-based Phonetic Projection in Mismatched Crowdsourcing Channels for Low-resourced ASR
Wenda Chen, Mark Hasegawa-Johnson, Nancy Chen, Preethi Jyothi and Lav Varshney

Improving the Morphological Analysis of Classical Sanskrit
Oliver Hellwig

Query Translation for Cross-Language Information Retrieval using Multilingual Word Clusters
Paheli Bhattacharya, Pawan Goyal and Sudeshna Sarkar

A study of attention-based neural machine translation model on Indian languages
Ayan Das, Pranay Yerra, Ken Kumar and Sudeshna Sarkar

Comprehensive Part-Of-Speech Tag Set and SVM based POS Tagger for Sinhala
Sandareka Fernando, Surangika Ranathunga, Sanath Jayasena and Gihan Dias

Short Papers

Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries
Priyam Bakliwal, Devadath V V and C V Jawahar

Learning Indonesian-Chinese Lexicon with Bilingual Word Embedding Models and Monolingual Signals
Xinying Qiu and Gangqin Zhu

Creating rich online dictionaries for the Lao–French language pair, reusable for Machine Translation
Vincent Berment

Conference Program

Sunday, December 11, 2016

WSSANLP 2016 Opening

9:00–9:10 Opening Remarks
9:10–10:00 Keynote by Alain Désoulières, INALCO, CERLOM, France
10:00–10:20 Coffee and Tea Break

10:20–12:00 WSSANLP Session 1: Oral Presentations
Session Chair: Hervé Blanchon

10:20–10:40 Compound Type Identification in Sanskrit: What Roles do the Corpus and Grammar Play?
Amrith Krishna, Pavankumar Satuluri, Shubham Sharma, Apurv Kumar and Pawan Goyal

10:40–11:00 Comparison of Grapheme-to-Phoneme Conversion Methods on a Myanmar Pronunciation Dictionary
Ye Kyaw Thu, Win Pa Pa, Yoshinori Sagisaka and Naoto Iwahashi

11:00–11:20 Character-Aware Neural Networks for Arabic Named Entity Recognition for Social Media
Mourad Gridach

11:20–11:40 Development of a Bengali parser by cross-lingual transfer from Hindi
Ayan Das, Agnivo Saha and Sudeshna Sarkar

11:40–12:00 Sinhala Short Sentence Similarity Calculation using Corpus-Based and Knowledge-Based Similarity Measures
Jcs Kadupitiya, Surangika Ranathunga and Gihan Dias

12:00–13:30 Lunch Break

13:30–14:55 WSSANLP Session 2: Poster Presentations
Session Chair: K. V. S. Prasad

Full Papers

Enriching Source for English-to-Urdu Machine Translation
Bushra Jawaid, Amir Kamran and Ondřej Bojar

The IMAGACT4ALL Ontology of Animated Images: Implications for Theoretical and Machine Translation of Action Verbs from English-Indian Languages
Pitambar Behera, Sharmin Muzaffar, Atul Kr.
Ojha and Girish Jha

Crowdsourcing-based Annotation of Emotions in Filipino and English Tweets
Fermin Roberto Lapitan, Riza Theresa Batista-Navarro and Eliezer Albacea

Sentiment Analysis of Tweets in Three Indian Languages
Shanta Phani, Shibamouli Lahiri and Arindam Biswas

Dealing with Linguistic Divergences in English-Bhojpuri Machine Translation
Pitambar Behera, Neha Mourya and Vandana Pandey

The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese
Miki Nishioka and Shiro Akasegawa

Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus
Riyafa Abdul Hameed, Nadeeshani Pathirennehelage, Anusha Ihalapathirana, Maryam Ziyad Mohamed, Surangika Ranathunga, Sanath Jayasena, Gihan Dias and Sandareka Fernando

Short Papers

Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries
Priyam Bakliwal, Devadath V V and C V Jawahar

Learning Indonesian-Chinese Lexicon with Bilingual Word Embedding Models and Monolingual Signals
Xinying Qiu and Gangqin Zhu

Creating rich online dictionaries for the Lao–French language pair, reusable for Machine Translation
Vincent Berment