Series ISSN: 1947-4040
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES
Morgan & Claypool Publishers
Series Editor: Graeme Hirst, University of Toronto
Bitext Alignment
Jörg Tiedemann, Uppsala University
This book provides an overview of various techniques for the alignment of bitexts. It describes general
concepts and strategies that can be applied to map corresponding parts in parallel documents on various
levels of granularity. Bitexts are valuable linguistic resources for many different research fields and practical
applications. The most predominant application is machine translation, in particular, statistical machine
translation. However, there are various other threads that can be followed which may be supported by
the rich linguistic knowledge implicitly stored in parallel resources. Bitexts have been explored in
lexicography, word sense disambiguation, terminology extraction, computer-aided language learning and
translation studies to name just a few.
The book covers the essential tasks that have to be carried out when building parallel corpora starting
from the collection of translated documents up to sub-sentential alignments. In particular, it describes
various approaches to document alignment, sentence alignment, word alignment and tree structure
alignment. It also includes a list of resources and a comprehensive review of the literature on alignment
techniques.
About SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis
Digital Library of Engineering and Computer Science. Synthesis Lectures
provide concise, original presentations of important research and development
topics, published quickly, in digital and print formats. For more information
visit www.morganclaypool.com
ISBN: 978-1-60845-510-2
Morgan & Claypool Publishers
Synthesis Lectures on Human
Language Technologies
Editor
Graeme Hirst, University of Toronto
The series consists of 50- to 150-page monographs on topics relating to natural language processing,
computational linguistics, information retrieval, and spoken language understanding. Emphasis is on
important new techniques, on new applications, and on topics that combine two or more HLT subfields.
Bitext Alignment
Jörg Tiedemann
2011
Linguistic Structure Prediction
Noah A. Smith
2011
Learning to Rank for Information Retrieval and Natural Language Processing
Hang Li
2011
Computational Modeling of Human Language Acquisition
Afra Alishahi
2010
Introduction to Arabic Natural Language Processing
Nizar Y. Habash
2010
Cross-Language Information Retrieval
Jian-Yun Nie
2010
Automated Grammatical Error Detection for Language Learners
Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault
2010
Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
2010
Semantic Role Labeling
Martha Palmer, Daniel Gildea, and Nianwen Xue
2010
Spoken Dialogue Systems
Kristiina Jokinen and Michael McTear
2009
Introduction to Chinese Natural Language Processing
Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang
2009
Introduction to Linguistic Annotation and Text Analytics
Graham Wilcock
2009
Dependency Parsing
Sandra Kübler, Ryan McDonald, and Joakim Nivre
2009
Statistical Language Models for Information Retrieval
ChengXiang Zhai
2008
Copyright © 2011 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in
printed reviews, without the prior permission of the publisher.
Bitext Alignment
Jörg Tiedemann
www.morganclaypool.com
ISBN: 9781608455102 paperback
ISBN: 9781608455119 ebook
DOI 10.2200/S00367ED1V01Y201106HLT014
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES
Lecture #14
Series Editor: Graeme Hirst, University of Toronto
Series ISSN
Synthesis Lectures on Human Language Technologies
Print 1947-4040 Electronic 1947-4059
Bitext Alignment
Jörg Tiedemann
Uppsala University
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #14
Morgan & Claypool Publishers
ABSTRACT
This book provides an overview of various techniques for the alignment of bitexts. It describes general
concepts and strategies that can be applied to map corresponding parts in parallel documents on
various levels of granularity. Bitexts are valuable linguistic resources for many different research fields
and practical applications. The most predominant application is machine translation, in particular,
statistical machine translation. However, there are various other threads that can be followed which
may be supported by the rich linguistic knowledge implicitly stored in parallel resources. Bitexts have
been explored in lexicography, word sense disambiguation, terminology extraction, computer-aided
language learning and translation studies to name just a few.
The book covers the essential tasks that have to be carried out when building parallel corpora
starting from the collection of translated documents up to sub-sentential alignments. In particular, it
describes various approaches to document alignment, sentence alignment, word alignment and tree
structure alignment. It also includes a list of resources and a comprehensive review of the literature
on alignment techniques.
KEYWORDS
alignment, bitexts, parallel corpora, sentence alignment, word alignment, tree alignment,
statistical machine translation, transduction grammars, text mining, lexicon induction
Contents
Preface
Acknowledgments
1 Introduction
1.1 Applications
1.2 Further Readings
2 Basic Concepts and Terminology
2.1 Bitext and Alignment
2.2 Alignment and Segmentation
2.3 Alignment Spaces and Constraints
2.4 Correlations and Cues
2.5 Alignment Models and Search Algorithms
2.6 Evaluation of Bitext Alignment
2.7 Summary and Further Reading
3 Building Parallel Corpora
3.1 Document Alignment
3.2 Mining the Web
3.3 Extracting Parallel Data from Comparable Corpora
3.4 Summary and Further Reading
4 Sentence Alignment
4.1 Length-based Approaches
4.2 Lexical Matching Approaches
4.3 Combined and Resource-Specific Techniques
4.4 Summary and Further Reading
5 Word Alignment
5.1 Generative Alignment Models
5.2 Constraints and Heuristics
5.3 Discriminative Alignment Models
5.4 Translation Spotting and Bilingual Lexicon Induction
5.5 Summary and Further Reading
6 Phrase and Tree Alignment
6.1 Parallel Treebanks and Tree Alignment
6.2 Hierarchical Alignment and Transduction Grammars
6.3 Summary and Further Reading
7 Concluding Remarks
7.1 Final Recommendations
A Resources & Tools
Bibliography
Author’s Biography