Creation of Bilingual Lexicons for Statistical Machine Translation
(Criação de Léxicos Bilingues para Tradução Automática Estatística)

Luís Carlos Amado Magalhães Carvalho

Dissertation submitted to obtain the Master's Degree in Information Systems and Computer Engineering

Jury
President: Dr. Maria dos Remédios Vaz Pereira Lopes Cravo
Supervisor: Dr. Maria Luísa Torres Ribeiro Marques da Silva Coheur
Co-supervisor: Dr. Isabel Maria Martins Trancoso
Member: Dr. Bruno Emanuel da Graça Martins

November 2010

Acknowledgements

To my beloved girlfriend Ana. Without her, I would never have finished this course. Her precious advice made me carry on to this important goal in my life.

To my mother, who always tried to pass her serenity on to me. She always cooked my favorite dish when I went to have dinner with her: HER chicken curry :).

To my father in Brazil, who never put extra pressure on me and only worried about my welfare.

To my TV star sister Rita in the Netherlands and to my brother Marco, who always wished me luck on every single exam. I miss her.

To Rosarinha, my girlfriend's mother, who never stopped encouraging me. She is also a great cook, and I'm looking forward to her next home-cooked meal :).

To her other dear daughter Joana, who never stopped believing in me. She is also a great Poker player, mostly because she gives me every chip she earns :).

To my professor Luísa, who was always on top of everything. She is a great professor, fun to be with, very strict on time schedules, very demanding, but at the same time easygoing. Since the first interview, my impression was the best and I was not mistaken. She always pointed out the right path to me, and by following it, I managed to be successful. Otherwise, I think I might never have been able to finish this course. I learned a lot from her. Many people told me that I got very lucky with my supervisor. I agree. If I were beginning my thesis, I would definitely like to have a supervisor like Luísa.

To my good friend Ricardo and his wife Ana, who made this summer, despite the hard work, one of the best. I think the sea in Costa de Caparica misses us even more than we miss it :).

To my friend Luis and his girlfriend Ines, who every single Saturday made me forget how hard it was to accomplish this task by riding on his bike at 240 km/h :). They planted the bike syndrome in me, and now I have to get one of those fast babies too :).
The sea also misses them a lot.

To my colleague Tiago Luís, who helped me a lot in L2F. If he is not stopped immediately, with his selflessness, he may finish your whole work, and we do not want that :).

To the outstanding performance of my football club SL Benfica in the previous season. It gave me many joyful moments, especially the one with 300 thousand people celebrating the title in Marquês de Pombal.

To James Hetfield and Metallica for performing live for me the last 4 times in Portugal. It is never enough.

To Virgem Suta in the car CD.

To the Colonel who inexplicably failed me in my last flight as an airplane pilot in AFA. I could not be more grateful to him.

Lisboa, November 22, 2010
Luís Carlos Carvalho

To my beloved girlfriend Ana and to my family and friends.

It is better to reign in Hell than to be a slave in Heaven.

Resumo

The research carried out in the context of this work resulted in the development of a framework for detecting cognate words between different languages. The framework centers on word similarity measures and transliteration rules. Cognate detection was done in two phases: preprocessing and classification. The preprocessing phase used only a subset of the similarity measures in order to discard word pairs that shared no resemblance. The measures were Word Length, Lcsm, Lcsr, Jaro Winkler and Sequence Letters. The resulting pairs were then used for the first step of classification: training. This step generated a model based on the similarity measures. This model is used to predict whether words are cognates. Of all the similarity measures, only three are used: Lcsm, Levenshtein and Dice. With these measures, the cognate module reached an F-measure of 66.93%. After the framework was built, it was used to detect translations of named entities. This second module used three named entity recognizers: Stanford NER for names written in English, and XIP NER and an adaptive method for names in Portuguese. Two methods were used: the first combined Stanford NER with XIP NER; the second combined Stanford NER with the adaptive method. The first reached an F-measure of 62.65%, while the second performed better, reaching an F-measure of 73.91%.

Abstract

The research performed in the context of this thesis resulted in the development of a framework for the detection of cognates across texts of different languages. The framework is centered on word similarity measures and transliteration rules. Cognate detection was accomplished in two phases: preprocessing and classification. The preprocessing phase used only a subset of the whole set of similarity measures in order to discard pairs of words that did not share any resemblance. The measures used were Word Length, Lcsm, Lcsr, Jaro Winkler and Sequence Letters. The resulting pairs were then used in the first step of classification: training. Training produced a model based on similarity measures. This model is then used to predict whether words are cognates. From the whole set of similarity measures, the model used only three: Lcsm, Dice and Levenshtein. With these measures, the cognate module achieved an F-measure of 66.93%. After the framework was built, it was used to detect translations of named entities. This module used three named entity recognizers: Stanford NER for English names, and XIP NER and an Adaptive Method to acquire Portuguese named entities. Two approaches were used: the first combined Stanford NER with XIP NER; the second combined Stanford NER with the Adaptive Method. The first approach had an F-measure of 62.65%, whilst the second performed better, with an F-measure of 73.91%.
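To make the measures named in the abstract more concrete, the following is a minimal Python sketch of textbook versions of Levenshtein distance, the Dice coefficient over character bigrams, and the longest common subsequence ratio (Lcsr), together with the usual balanced F-measure. The function names, the bigram choice for Dice, and the normalization by the longer word in Lcsr are assumptions made here for illustration; the exact definitions used in the thesis (including Lcsm, which is not sketched) may differ.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word (assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, counted as multisets (assumed variant)."""
    x = Counter(a[i:i + 2] for i in range(len(a) - 1))
    y = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2 * sum((x & y).values()) / total


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In a cognate classifier of the kind the abstract describes, scores like these would typically be computed for each candidate word pair and passed as features to the trained model; how the thesis combines them is not stated in this section.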