
Criação de Léxicos Bilingues para Tradução Autom - INESC-ID PDF

64 Pages·2010·0.64 MB·English

Preview Criação de Léxicos Bilingues para Tradução Autom - INESC-ID

Criação de Léxicos Bilingues para Tradução Automática Estatística

Luís Carlos Amado Magalhães Carvalho

Dissertation submitted to obtain the Master's Degree in Informatics and Computer Engineering

Jury
President: Dr. Maria dos Remédios Vaz Pereira Lopes Cravo
Supervisor: Dr. Maria Luísa Torres Ribeiro Marques da Silva Coheur
Co-supervisor: Dr. Isabel Maria Martins Trancoso
Member: Dr. Bruno Emanuel da Graça Martins

November 2010

Acknowledgements

To my beloved girlfriend Ana. Without her, I would never have finished this course. Her precious advice kept me going toward this important goal in my life.

To my mother, who always tried to pass her serenity on to me. She always cooked my favorite dish when I went to have dinner with her: HER chicken curry :).

To my father in Brazil, who never put extra pressure on me and only worried about my welfare.

To my TV star sister Rita in the Netherlands and to my brother Marco, who always wished me luck on every single exam. I miss her.

To Rosarinha, my girlfriend's mother, who never stopped encouraging me. She is also a great cook, and I'm looking forward to her next meal :).

To her other dear daughter Joana, who never stopped believing in me. She is also a great poker player, mostly because she gives me every chip she wins :).

To my professor Luísa, who was always on top of everything. She is a great professor, fun to be with, very strict about schedules, very demanding, but at the same time easygoing. From the first interview, my impression was the best, and I was not mistaken. She always pointed out the right path to me, and by following it, I managed to be successful. Otherwise, I think I might never have been able to finish this course. I learned a lot from her. Many people told me that I got very lucky with my supervisor. I agree. If I were beginning my thesis, I would definitely like to have a supervisor like Luísa.

To my good friend Ricardo and his wife Ana, who made this summer, despite the hard work, one of the best. I think the sea at Costa de Caparica misses us even more than we miss it :).

To my friend Luis and his girlfriend Ines, who every single Saturday made me forget how hard this task was by riding on his bike at 240 km/h :). They gave me the bike syndrome, and now I have to get one of those fast babies too :).
The sea also misses them a lot.

To my colleague Tiago Luís, who helped me a lot in L2F. If he is not stopped immediately, with his selflessness, he may finish your whole work, and we do not want that :).

To the outstanding performance of my football club SL Benfica in the previous season. It gave me many joyful moments, especially the one with 300 thousand people celebrating the title in Marquês de Pombal.

To James Hetfield and Metallica for performing live for me the last 4 times in Portugal. It is never enough.

To Virgem Suta in the car CD.

To the Colonel who inexplicably failed me in my last flight as an airplane pilot in AFA. I could not be more grateful to him.

Lisboa, November 22, 2010
Luís Carlos Carvalho

To my beloved girlfriend Ana and to my family and friends.

It is better to reign in Hell than to be slave in Heaven.

Resumo

The research carried out in the context of this work resulted in the development of a framework for detecting cognate words across different languages. The framework centers on word similarity measures and transliteration rules. Cognate detection was done in two phases: preprocessing and classification. The preprocessing phase used only a subset of the similarity measures, in order to discard word pairs that shared no resemblance. The measures were Word Length, Lcsm, Lcsr, Jaro Winkler and Sequence Letters. The resulting pairs were then used in the first phase of classification: training. This phase produced a model based on the similarity measures. This model is used to predict whether words are cognates. Of all the similarity measures, only three are used: Lcsm, Levenshtein and Dice. With these measures, the cognate module reached an F-measure of 66.93%. After the framework was built, it was used to detect translations of named entities. This second module used three named entity recognizers: Stanford NER for names written in English, XIP NER and an adaptive method for names in Portuguese.
Two methods were used: the first used Stanford NER with XIP NER; the second used Stanford NER plus the adaptive method. The first reached an F-measure of 62.65%, while the second performed better, reaching an F-measure of 73.91%.

Abstract

The research performed in the context of this thesis resulted in the development of a framework for the detection of cognates across texts of different languages. The framework is centered on word similarity measures and transliteration rules. Cognate detection was accomplished in two phases: preprocessing and classification. The preprocessing phase used only a subset of the whole set of similarity measures in order to discard pairs of words that did not share any resemblance. The measures used were Word Length, Lcsm, Lcsr, Jaro Winkler and Sequence Letters. The resulting pairs were then used in the first step of classification: training. Training generated a model based on similarity measures. This model is then used to predict whether words are cognates. From the whole set of similarity measures, the model used only three: Lcsm, Dice and Levenshtein. With these measures, the cognate module achieved an F-measure of 66.93%. After the framework was built, it was used to detect translations of named entities. This module used three named entity recognizers: Stanford NER for English names, XIP NER and an Adaptive Method to acquire Portuguese named entities. Two approaches were used: the first combined Stanford NER with XIP NER; the second paired Stanford NER with the Adaptive Method. The first approach had an F-measure of 62.65%, while the second performed better, with an F-measure of 73.91%.
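
To make the measures named in the abstract concrete, here is a minimal Python sketch of three of them (Lcsr, Levenshtein and Dice), together with a preprocessing-style filter and the F-measure formula. These are the common textbook definitions and may differ from the ones the thesis implements. The function names, the bigram-set variant of Dice and the 0.5 threshold are assumptions made for illustration. Lcsm is left out because the preview does not define it.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def dice(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (one common variant)."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    total = len(bigrams_a) + len(bigrams_b)
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / total


def is_candidate(a: str, b: str, min_lcsr: float = 0.5) -> bool:
    """Hypothetical prefilter: drop pairs with little surface resemblance.

    The threshold value is an assumption, not a figure from the thesis.
    """
    return lcsr(a.lower(), b.lower()) >= min_lcsr


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for pair in [("university", "universidade"), ("night", "nacht")]:
        print(pair, lcsr(*pair), levenshtein(*pair), dice(*pair), is_candidate(*pair))
```

In a two-phase setup like the one the abstract describes, a cheap filter such as `is_candidate` would run first, and the surviving pairs would have their measure values passed to a trained classifier as features.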

