Bidirectional American Sign Language to English Translator

Hardie Cate                Zeshan Hussain
[email protected]        [email protected]

December 11, 2015

arXiv:1701.07195v1 [cs.CL] 10 Jan 2017

1 Introduction

In the US alone, there are approximately 900,000 hearing-impaired people whose primary mode of conversation is sign language. For these people, communication with non-signers is a daily struggle, and they are often disadvantaged when it comes to finding a job, accessing health care, etc. There are a few emerging technologies designed to translate sign language to English in real time, but most of the current research attempts to convert raw signs into English words. This aspect of the translation is certainly necessary, but it does not take into account the grammatical differences between signed languages and spoken languages. In this paper, we outline our bidirectional translation system that converts sentences from American Sign Language (ASL) to English, and vice versa.

To perform machine translation between ASL and English, we utilize a generative approach. Specifically, we employ an adjustment to the IBM word-alignment model 1 (IBM WAM1) [14] in which we define language models for English and ASL, as well as a translation model, and attempt to generate a translation that maximizes the posterior distribution defined by these models. Using these models, we are able to quantify the concepts of fluency and faithfulness of a translation between languages.

2 Related Work

In examining papers and projects related to this topic, we have discovered that translation between signed and spoken languages is still largely an open problem. The few emerging technologies that attempt to tackle this issue are only capable of translating single words or short phrases of ASL into English. In fact, most approaches focus on analyzing videos of signs and converting these into words, which a number of recent CS229 projects have done. However, several teams at the University of Pennsylvania have taken a more grammatical approach to the problem of translation in the other direction, English to ASL [15]. There has also been an Italian team that has designed a tentative translation system for Italian Sign Language [7], but there are very few parallels with respect to grammatical structures that we can draw for this project. In short, there has been relatively little emphasis in the literature on the conversion between the grammars of the languages, which is what we examine in this project.

3 Task Definition

Let us formalize the notation for this task. For the sake of simplicity, we will use ASL as our "source language" and English as our "target language." In general, S will be used to refer to the sentence in ASL, where S is a sequence of signs S_1, ..., S_m and m is the length of the sentence. Each sign S_i is represented by a word in uppercase letters. We also include special "gesture tokens," which represent gestures or hand movements (e.g., [point], [head shake]) that are not signs in and of themselves but rather modify other signs in the sentence. As a special case, which we address later, commas are also treated as individual sign tokens. Similarly, we will use E to refer to the English sentence, where E is a sequence of words E_1, ..., E_l and l is the length of the sentence. An example in the ASL-to-English direction:

    S_input:  WRISTWATCH [point], WHO GIVE-YOU?
    E_output: Who gave you that wristwatch?

In the backward direction, the input is a grammatically correct English sentence and the output is a sequence of signs represented by words. An example of this input is:

    E_input:  He fell down.
    S_output: FELL.

To evaluate translations generated by our system as well as the baselines in both directions, we use the BLEU-2 score [10], which takes in a predicted sentence and a reference sentence, both in the target language. Let p_n be the ratio of the number of shared n-grams to the total number of n-grams in the predicted sentence. Then for a predicted sentence p and reference sentence r, the BLEU-N score is given by

    BLEU-N(p, r) = e^{(1 - |r|/|p|)} \prod_{i=1}^{N} p_i.

Thus BLEU-2 is essentially the ratio of shared unigrams and bigrams between the predicted translation and the gold-standard translation to the number that appear in the predicted translation. The exponential factor at the beginning of the above expression is called the brevity penalty. Because the rest of the BLEU expression measures the precision of the prediction, it does not penalize the prediction for leaving out words; the role of the brevity penalty is to penalize predicted sentences that are shorter than their reference counterparts. As a concrete example, consider translating "YOU LIKE COLOR BROWN?" to the English "You like brown color?" with the correct reference "Do you like the color brown?". The BLEU-2 score of this prediction is e^{1 - 6/4} · (4/4) · (1/3) ≈ 0.202.

The BLEU-2 metric is used primarily to rate translations at a corpus level, so it does not perform as well when used to evaluate translations of single sentences. Despite that, the BLEU-2 score should be a sufficient evaluation metric for the purposes of this project, since it is still the most common machine translation evaluator and is relatively simple to interpret.
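To make the metric concrete, the following is a minimal sketch of a sentence-level BLEU-2 computation following the definition above. The tokenization (lowercasing, dropping punctuation) and the clipping of shared n-gram counts against the reference are our own simplifying assumptions, not details taken from the report:

    import math
    import re
    from collections import Counter

    def tokenize(sentence):
        """Lowercase and keep word characters only (a simplification for this sketch)."""
        return re.findall(r"[\w'-]+", sentence.lower())

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu2(predicted, reference):
        """Sentence-level BLEU-2: brevity penalty times unigram and bigram precision."""
        pred, ref = tokenize(predicted), tokenize(reference)
        score = 1.0
        for n in (1, 2):
            pred_counts, ref_counts = Counter(ngrams(pred, n)), Counter(ngrams(ref, n))
            shared = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
            score *= shared / max(len(pred) - n + 1, 1)
        if len(pred) < len(ref):
            # brevity penalty e^(1 - |r|/|p|); only short predictions are penalized
            score *= math.exp(1 - len(ref) / len(pred))
        return score

    # Worked example from the text; prints approximately 0.202.
    print(bleu2("You like brown color?", "Do you like the color brown?"))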
4 Data

Our data consists of 579 ASL/English translation pairs scraped from a repository on lifeprint.com [12], a website dedicated to ASL education. Each pair is a single accurate translation of an ASL sentence or question into English. Many of the ASL sentences contain gesture tokens. These pairs are not specifically designed to be translations from English to ASL, and in general, translation between languages is not symmetric. However, we make the simplifying assumption that this translation is symmetric, primarily because we do not have a thorough enough understanding of ASL to translate from English to ASL.

To test our algorithms, we split the data into two main sets: a training set (80% of the original data) and a testing set (the remaining 20%). We also have a third set, called the development set, that we use to tune our hyperparameters. This set is constructed by pulling out 10 examples from the training set and treating them as our development set. A more standard convention is to use a 70%-30% split for training and testing data, but we decided to use a larger training set size because we use these signs as our entire corpus for our ASL language model. We also use small subsets of the data to run tests that assess the performance of specific aspects of our system. For example, we construct a test set from the original test set to judge the performance of the "comma-trigram" feature of our ASL language model. We construct another test set to handle gesture tokens. To assess these tests, we use the BLEU-2 metric described in the task definition.

5 Technical Approach

Below, we detail the baselines, oracles, and the IBM word alignment method for this problem.

5.1 Baseline and Oracle

Because of the bidirectional nature of the problem, we have two baselines. For the baseline in the English-to-ASL direction, we use a unigram cost function over English to detect the "most important" words in the sentence and directly translate each word into ASL, preserving the order of these words. To pick the most important words, our baseline selects the highest-cost words above some threshold based on our unigram cost function. In this case, a higher cost implies a lower frequency in the English language, suggesting higher importance. For our baseline in the other direction, we directly translate each sign into English and insert small helper words (such as "a," "the," "and," etc.), treating this as a search problem that tries to minimize the cost of the English sentence using a bigram cost function. To test these baselines, we use the same BLEU score as we do for the results generated by our IBM word alignment system, which allows us to compare the baseline results directly with the IBM WAM results reported in the Results section.

For our oracles in both directions, we have someone familiar with both languages translate each sentence into the other language. These translations are adequate oracles because they make the most sense from a human standpoint. Furthermore, the oracles serve as appropriate gold-standard translations that can be used in the BLEU score calculation.
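As an illustration of the English-to-ASL baseline described above, a minimal sketch might look like the following. The unigram cost table, the threshold value, and the word-to-sign lookup are hypothetical placeholders; the report does not specify these components in detail:

    def english_to_asl_baseline(sentence, unigram_cost, threshold, word_to_sign):
        """Keep only 'important' (high-cost, i.e. rare) words and gloss them in order.

        unigram_cost: dict mapping a lowercase English word to its cost
                      (higher cost = rarer word); unseen words are treated as rare.
        threshold:    minimum cost for a word to be kept.
        word_to_sign: dict mapping an English word to an uppercase sign gloss.
        """
        signs = []
        for word in sentence.lower().rstrip("?.!").split():
            cost = unigram_cost.get(word, float("inf"))
            if cost >= threshold:
                signs.append(word_to_sign.get(word, word.upper()))
        return " ".join(signs)

    # Hypothetical toy example: common words fall below the threshold and are dropped.
    costs = {"do": 0.5, "you": 0.7, "like": 1.2, "the": 0.3, "color": 2.4, "brown": 3.1}
    print(english_to_asl_baseline("Do you like the color brown?", costs,
                                  threshold=2.0,
                                  word_to_sign={"color": "COLOR", "brown": "BROWN"}))
    # -> COLOR BROWN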
5.2 Custom WAM

The IBM WAM1 is a common machine translation model that we customize to take advantage of common constructions in ASL and to allow us to modify how the translation model and language models are constructed. The custom model is meant to give us flexibility in our translations. We now describe the general workflow of our model in the ASL-to-English direction (the workflow applies to the other direction as well, with E and S switched):

1) For each potential translation, we calculate p(E).
2) Then, given this channel input, we calculate p(S|E) in the "noisy channel."
3) Finally, we choose the E that gives the highest posterior probability given these two quantities (prior and likelihood).

The p(E) is represented by the language model, while the p(S|E) is represented by the translation model. We derive this formulation below.

5.2.1 Modeling

For concreteness, we will use a running translation example in the descriptions of our models and algorithms. Consider the following gold-standard translation from ASL to English:

    S_input:  YOU LIKE LEARN SIGN?
    E_output: Do you like learning sign language?

To derive the models that we will use in the translation, we consider the following optimization problem (note that once again we write this derivation in the ASL-to-English direction, but the same derivation applies to the other direction). Given a sentence S in ASL, we want to find an English sentence E such that

    E = argmax_{E in English} P(E|S)
      = argmax_{E in English} P(S|E) P(E).

The probability P(E) represents the fluency of E in English, i.e., how much the translation makes sense to a native speaker, while P(S|E) represents the faithfulness of the translation from S to E, i.e., whether the translation reflects the actual meaning of what is being said. We determine P(E) using a language model for English and P(S|E) using a translation model from English to ASL. (Note that in our overall translation from ASL to English, we model the probability of a given translation from English to ASL.)

5.2.2 Language Model

To compute P(E) for a given English sentence E, we use an n-gram cost function. For instance, for our example sentence E above in the case where n = 3,

    p(E) = p_3(null, null, "do") · p_3(null, "do", "you") · ... · p_3("learning", "sign", "language?"),

where p_3 is a 3-gram cost function. We have imported n-gram cost functions from http://www.ngrams.info/ [8] for n = 2, 3, 4, 5.

There are a couple of caveats with the above approach when trying to create a language model for ASL, which is necessary for translation in the other direction (i.e., to calculate P(S)). The first is that our ASL language model is basically a unigram model with additional components. In addition to each sign having an associated probability, which is a simple unigram model, we add a comma trigram construction. This construction incorporates into the language model the following grammatical structure that is commonly found in ASL:

    ([NOUN] | [NOUN PHRASE]) [,] [PHRASE]

In other words, an infrequent (uncommon) noun will be followed by a comma, which is followed by a phrase that generally describes or refers to the noun. The phrase after the comma usually starts with a more frequent (common) word. The comma trigram encapsulates this by increasing the unigram probability of the second word and decreasing the unigram probability of the first word. This effectively favors a big difference between the word before the comma and the word after the comma, which in turn favors an uncommon word being translated before the comma and a common word after the comma. The second caveat is that we incorporate gesture tokens, which are contained in brackets, into our corpus. Our language model does not, however, treat these gesture tokens any differently than regular signs.
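A minimal sketch of how the English language model probability p(E), built from such n-gram counts, might be evaluated is shown below. The padding token, the smoothing floor, and the data layout are our own assumptions rather than details given in the report; the ASL unigram model with the comma trigram adjustment would follow the same pattern but is not shown:

    import math

    def log_p_english(words, ngram_counts, context_counts, n=3, floor=1e-6):
        """Log-probability of an English word sequence under an n-gram model.

        ngram_counts:   dict mapping an n-gram tuple to its corpus count.
        context_counts: dict mapping an (n-1)-gram tuple to its corpus count.
        floor:          small probability used for unseen n-grams (crude smoothing).
        """
        padded = ["<null>"] * (n - 1) + [w.lower() for w in words]
        log_prob = 0.0
        for i in range(n - 1, len(padded)):
            gram = tuple(padded[i - n + 1:i + 1])
            count = ngram_counts.get(gram, 0)
            context = context_counts.get(gram[:-1], 0)
            prob = count / context if count and context else floor
            log_prob += math.log(prob)
        return log_prob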
5.2.3 Translation Model

Now, we wish to estimate the probability P(S|E) given a trained translation model. We represent this model as a mapping from sign-English pairs to probabilities (of those pairs) using a dictionary in Python. Training this model requires us to introduce a set A = a_1, ..., a_J of alignment variables, where a_j represents the index of the word in the English sentence that translates to the jth word s_j in S. We also allow these variables to take on a value of 0, which represents a "null" word in the English sentence. This null word allows for the possibility that there is no word in the English sentence that directly translates to some sign s_j in S. In the case of our example above, if A specifies that "learning" in E translates to "LEARN" in S, then a_3 = 4, since "LEARN" is the third word in S and "learning" is in the fourth position of E. Thus, A can be thought of as a many-to-one mapping from words in E to signs in S. (In general, this mapping would need to be many-to-many, since we sometimes require phrase-to-phrase translation, but for our purposes the many-to-one assumption is reasonable because, generally speaking, multiple English words map to only one sign.) Then, to compute P(S|E), we marginalize out the alignment variables:

    P(S|E) = \sum_A P(S, A|E) = \sum_A P(S|A, E) P(A|E).

Now, let I be the number of words in E. For fixed I, we assume that each possible alignment for each length is equally likely, which gives us P(A|E) = \epsilon / (1 + I)^J. This probability can be thought of as a normalization factor for P(S|E). The term P(S|A, E) is given by

    P(S|A, E) = \prod_{j=1}^{J} t(s_j, e_{a_j}),

where t(s, e) is the probability of translating e as s. Therefore, this gives us the following expression for P(S|E):

    P(S|E) = \frac{\epsilon}{(1 + I)^J} \sum_A \prod_{j=1}^{J} t(s_j, e_{a_j}).

After training on our dataset, constructing a language model for each language using the corpora we have processed, and generating a parameterized translation model using the EM algorithm, we can use our decoder to find the optimal translation. Both the EM and decoding algorithms are described in detail in the next section.
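Because each alignment variable a_j is chosen independently, the sum over alignments in IBM Model 1 factorizes, so P(S|E) can be computed without enumerating alignments: P(S|E) = \epsilon / (1 + I)^J \prod_j \sum_{i=0}^{I} t(s_j, e_i). Below is a minimal sketch of this computation, with t stored as a Python dictionary keyed by (sign, English word) pairs as described above; the NULL token name and the default of 0.0 for unseen pairs are our own choices:

    def p_sign_given_english(signs, english, t, epsilon=1.0):
        """IBM Model 1 likelihood of a sign sentence given an English sentence.

        signs:   list of sign tokens s_1, ..., s_J.
        english: list of English words e_1, ..., e_I.
        t:       dict mapping (sign, english_word) pairs to probabilities t(s, e).
        """
        words = ["<null>"] + english                     # position 0 is the NULL word
        prob = epsilon / (1 + len(english)) ** len(signs)
        for s in signs:
            # The sum over alignments factorizes into one sum per sign position.
            prob *= sum(t.get((s, e), 0.0) for e in words)
        return prob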
5.2.4 Algorithms

The two primary algorithms used in our implementation are an EM algorithm to train our translation model, which defines P(S|E), and our decoding algorithm. The decoding algorithm is a modified version of the decoding in IBM WAM1, which uses a variant of beam search to find

    argmax_E P(S|E) P(E).

First we describe the EM algorithm. Our translation model maintains a mapping t from pairs (s, e), where s is a sign and e is an English word, to probabilities t(s, e) = p(s|e). Given our training data containing a list of m translation pairs (S, E), we can rewrite our estimate of a single translation probability as

    p(s|e) = \sum_{i=1}^{m} \sum_{A_i} p(S_i, A_i | E_i)
           = \sum_{i=1}^{m} \sum_{A_i} p(A_i | E_i) \prod_{j=1}^{J} t(s_j, e_{a_j}),

where A_i is the alignment vector for training example i. To estimate t(s, e) given the data, we need to find the maximum likelihood of this probability. However, with the introduction of these latent variables, it is not possible to find this likelihood in closed form. Therefore, we use the EM algorithm to estimate it by computing the probability of each possible alignment given the current distribution t(s, e) for each s and e in the E-step, and then adjusting this distribution based on these alignment probability estimates in the M-step. The algorithm runs as follows:

    EM for ASL-to-English translation model
    Set t(s, e) uniform, including t(s, NULL)
    while not converged do
        Set counts of e translating to s to C(s, e) = 0 for all s, e
        for (S, E) in training set do
            for sign position i = 1, ..., l do
                for each English word position j (including NULL) do
                    P(a_i = j | S, E) = t(s_i, e_j) / sum_{j'} t(s_i, e_{j'})
                    C(s_i, e_j) += P(a_i = j | S, E)
                end for
            end for
        end for
        for all s, e do
            t(s, e) = C(s, e) / sum_{s'} C(s', e)
        end for
    end while
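A minimal Python sketch of the EM procedure above, with t stored as a dictionary keyed by (sign, English word) pairs; the fixed iteration count in place of a convergence test and the NULL token handling are our own simplifications:

    from collections import defaultdict

    def train_ibm1(pairs, iterations=10):
        """Estimate t(s, e) = p(s|e) with IBM Model 1 EM.

        pairs: list of (signs, english_words) training pairs, each a list of tokens.
        Returns a dict mapping (sign, english_word) pairs to probabilities.
        """
        # Initialization: t(s, e) uniform over the sign vocabulary, including t(s, NULL).
        sign_vocab = {s for signs, _ in pairs for s in signs}
        t = defaultdict(lambda: 1.0 / len(sign_vocab))

        for _ in range(iterations):                # stand-in for a convergence test
            counts = defaultdict(float)            # expected counts C(s, e)
            totals = defaultdict(float)            # expected counts summed over s
            for signs, english in pairs:
                words = ["<null>"] + english
                for s in signs:
                    # E-step: posterior P(a_i = j | S, E) over English positions j.
                    norm = sum(t[(s, e)] for e in words)
                    for e in words:
                        posterior = t[(s, e)] / norm
                        counts[(s, e)] += posterior
                        totals[e] += posterior
            # M-step: t(s, e) = C(s, e) / sum_s' C(s', e).
            for (s, e), c in counts.items():
                t[(s, e)] = c / totals[e]
        return dict(t)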
Our decoding algorithm is a modification of the conventional decoding algorithm used in IBM WAM1. We frame the problem of finding the best English translation as a type of search problem. Let l be the length of S. The algorithm is a variant of beam search that maintains a list of priority queues q_0, q_1, ..., q_l of hypotheses. Each hypothesis consists of a list of English words e, each with a corresponding sign s from S. Initially, all priority queues are empty, with the exception of q_0, which contains the empty hypothesis. Then we iterate through the priority queues, and for the top hypothesis h in the current queue, we generate some number of new hypotheses h_1, ..., h_k by adding a single English word as the next word in h. We place each new hypothesis h' in q_i, where i is the number of signs in S that have been translated to English words in h'. Each hypothesis h is prioritized according to

    \sum_{j=1}^{|E|} \log p(s_{e_j} | e_j) + W \log p(E),

where E = [e_1 e_2 ... e_{|E|}] is the English sentence contained in h and W is the language model weight, whose purpose is explained below. The aspect related to beam search is that we only consider the first k hypotheses in each q_i before moving on to q_{i+1}. Thus, we effectively prune hypotheses that do not appear among the first k for a given number of translated signs.

    Modified decoding algorithm
    Set q_0 to a PQ with the empty hypothesis
    Set each q_1, ..., q_l to an empty PQ
    for i = 0 to l - 1 do
        for j = 0 to MAX_QUEUE_SIZE do
            hyp = q_i.pop()
            for k = 1 to l do
                newhyp = max over (s, e) of newhyp(s, e)
                insert newhyp into the appropriate queue
            end for
        end for
    end for
    Return q_l.pop()

Here newhyp(s, e) is the new hypothesis formed by translating s in S to e, and newhyp is the maximum according to the priority measure. Once the algorithm reaches the final priority queue, all l signs have been translated. This implies that the top hypothesis in that priority queue is the hypothesis that translates all signs in S to English (possibly to the NULL word) and maximizes p(S|E) p(E), i.e., the hypothesis with English sentence

    E = argmax_{e in q_l} p(e|S).
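A condensed sketch of the decoding loop described above is given below. Python's heapq stands in for the priority queues, each sign is translated in order to exactly one candidate English word, and the refinement that lets a single sign generate several English words (by re-entering the same queue) is omitted; the candidate lists and the language model callback are hypothetical inputs:

    import heapq
    import math

    def decode(signs, t, log_p_english, candidates, weight=0.3, beam=20):
        """Beam-search decoding: queue i holds hypotheses covering i signs."""

        def priority(words, covered):
            translation = sum(math.log(t.get(pair, 1e-12)) for pair in covered)
            return translation + weight * log_p_english(words)

        # queues[i] stores (-priority, english_words, covered_pairs) tuples.
        queues = [[] for _ in range(len(signs) + 1)]
        queues[0].append((0.0, [], []))

        for i, sign in enumerate(signs):
            # Only the top `beam` hypotheses in queue i are expanded.
            for _, words, covered in heapq.nsmallest(beam, queues[i]):
                for word in candidates.get(sign, ["<null>"]):
                    new_words = words + [word]
                    new_covered = covered + [(sign, word)]
                    prio = priority(new_words, new_covered)
                    heapq.heappush(queues[i + 1], (-prio, new_words, new_covered))

        best = heapq.nsmallest(1, queues[len(signs)])
        return best[0][1] if best else []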
5.2.5 Algorithms Commentary

One of the most important aspects of the algorithm is that a hypothesis h can generate a new hypothesis h' that is placed in the same queue as h if it is generated by translating a sign already translated in h. This is important because it allows us to translate a single sign to multiple English words. In particular, this allows the translated English sentence to be longer than the input sign sentence, which is the case with most English translations.

Another important note concerns the language model weight. If this weight is set to 1, the priority is simply p(E|S), i.e., the value we are trying to maximize. However, we find that if we try to maximize this value directly, the language model probabilities "outweigh" the probabilities of the translation model. In effect, the sign sentence is likely to be translated as a set of common English n-grams unrelated to the original sentence. For instance, when we attempt to translate YOUR SISTER SINGLE? as "Is your sister single?", our top result is "in the first". However, when we introduce the language model weight with W = 0.3, we produce the translation "Your sister single?". Although this translation might not be as common in English as "in the first", it certainly preserves the meaning of the original sentence much better.

Just as with beam search, our decoding algorithm is not guaranteed to converge to the optimal result (in fact, this problem is NP-complete), but in many cases it provides good results.

6 Results

As mentioned in the Data section, we split the dataset into three separate sets: a training set, a development set, and a test set. For both directions, we train our models on the training set, tune the hyperparameters (i.e., the language model weight, the queue size k, and the type of n-gram model used for the language model [unigram, bigram, trigram]) on the development set, and finally compute the average BLEU-2 score on the test set. Furthermore, our metric for measuring the accuracy of translation on the test set or the development set is the average of the individual BLEU-2 scores of the translations. Thus, any BLEU-2 score reported in the tables or figures is the mean BLEU-2 score.

6.1 Hyperparameter Tuning Results

To find the optimal hyperparameters, we ran several experiments. The initial set of experiments was run to find the optimal hyperparameters for translation from ASL to English. First, we found the BLEU-2 score on the development set by keeping the queue size constant and adjusting the language model weight using the bigram English model. Next, we performed the same experiment using the trigram English model. Results of these experiments are reported in Tables 2 and 3. We find that the optimal language model weight is 0.1, the optimal queue size for ASL-to-English translation is 20, and the optimal language model type is a trigram model. Using this combination gives us a BLEU-2 score of 0.276 (see Table 3), which is the highest BLEU-2 score that we obtained while tuning hyperparameters.

    [Figure 1: ASL-to-ENG: Varying Language Model Weight]
    [Figure 2: ASL-to-ENG: Varying Queue Size]

To provide some intuition on the choice of experiments, we reasoned that we could converge on the optimal hyperparameter combination using a method similar to coordinate ascent, where we optimize a single parameter while keeping the other parameters constant. Then, after finding the optimal value for that parameter, we repeat the process with the other parameters while still maintaining the optimal values for the parameters that have already been processed. Using the (20, 0.1, trigram) combination, we ran our system on the test set; the BLEU-2 score, reported in Table 1, is 0.1202.

    Table 1: Overall BLEU-2 Scores on Test Sets

                                 BLEU-2 Score
    ASL-to-English               0.1202
    English-to-ASL               0.1802
    Comma Trigram Test           0.1201
    Gestures Test                0.1232
    Baseline ASL-to-English      0.1012
    Baseline English-to-ASL      0.1139

The subsequent experiments involved finding the optimal hyperparameters for translation from English to ASL. Using a methodology similar to the one used in the other translation direction, we ran one set of experiments in which the queue size was kept constant while the language model weight was adjusted. Because we only use a unigram ASL language model when translating in this direction, running only this set of experiments is sufficient to find the optimal hyperparameter pair. Results of the experiment are reported in Table 4. We find that the optimal language model weight is 0.1 and the optimal queue size is 20. Using the (20, 0.1) combination and running the system on the test set, we obtain a BLEU-2 score, reported in Table 1, of 0.1802.

    Table 2: ASL-to-ENG: Bigrams

    Queue Size    Language Model Weight    BLEU-2 Score
    8 (bigram)    0.1                      0.2360
                  0.2                      0.1690
                  0.3                      0.1600
    10 (bigram)   0.1                      0.1990
                  0.2                      0.1690
                  0.3                      0.0700
    20 (bigram)   0.1                      0.1990
                  0.2                      0.2190
                  0.3                      0.2080

    Table 3: ASL-to-ENG: Trigrams

    Queue Size    Language Model Weight    BLEU-2 Score
    8 (trigram)   0.1                      0.0950
                  0.2                      0.1610
                  0.3                      0.1610
    10 (trigram)  0.1                      0.1270
                  0.2                      0.1440
                  0.3                      0.1160
    20 (trigram)  0.1                      0.2760
                  0.2                      0.1690
                  0.3                      0.0505

    Table 4: ENG-to-ASL: Language Model Weights

    Queue Size    Language Model Weight    BLEU-2 Score
    20            0.1                      0.1790
                  0.2                      0.1140
                  0.3                      0.1150
    15            0.1                      0.1004
                  0.2                      0.1000
                  0.3                      0.0680
    13            0.1                      0.0964
                  0.2                      0.1480
                  0.3                      0.0860

After running our tuning experiments on the hyperparameters, we noticed a few interesting trends. First, instances in which we use a trigram model for English tend to produce higher BLEU-2 scores than the corresponding bigram English models, particularly when the maximum queue size in the decoding algorithm is large. This makes sense because the trigram model inherently captures more information than the bigram model. Furthermore, we see from the two figures above that for large maximum queue sizes, instances with small language model weights perform better. We can explain this as follows: for large maximum queue sizes, after the first few iterations of the decoding algorithm, we will see hypotheses with common words. If we do not sufficiently dampen the influence of the language model, we will begin to see hypotheses with high priorities resulting from common n-grams, and these will steer the sentence generation away from the original meaning of the sign sentence. Therefore, we conclude that models with high maximum queue sizes and low language model weights generally perform best.
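The coordinate-ascent-style tuning described above might be sketched as follows; the evaluate callback (which would train and decode on the development set and return the mean BLEU-2 score, here called run_dev_bleu2) and the parameter grids are illustrative placeholders:

    def tune(evaluate, grids, initial):
        """Coordinate-ascent-style search: vary one hyperparameter at a time.

        evaluate: maps a dict of hyperparameters to a mean dev-set BLEU-2 score.
        grids:    dict mapping each hyperparameter name to its candidate values.
        initial:  dict giving a starting value for every hyperparameter.
        """
        best = dict(initial)
        best_score = evaluate(best)
        for name, values in grids.items():
            for value in values:
                trial = dict(best, **{name: value})
                score = evaluate(trial)
                if score > best_score:
                    best, best_score = trial, score
        return best, best_score

    # Illustrative grids matching the experiments reported above:
    # grids = {"queue_size": [8, 10, 20], "lm_weight": [0.1, 0.2, 0.3],
    #          "lm_order": ["bigram", "trigram"]}
    # best, score = tune(run_dev_bleu2, grids,
    #                    {"queue_size": 8, "lm_weight": 0.1, "lm_order": "bigram"})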
6.2 Experiments for ASL Constructions

The last two sets of experiments test how our system performs when translating specific ASL constructions. First, we perform an experiment on translating only the common comma construction described in the language model discussion above, by filtering for these examples in the test set and using the optimal ASL-to-English hyperparameters when running on the filtered test set. The BLEU-2 score for this test is 0.1201 (see Table 1).

Finally, the last experiment tests how our system translates sign sequences with gesture tokens. We create a test set for this experiment in a similar fashion to what we did for comma constructions, by filtering the original test set to include only sign sequences with gesture tokens. The BLEU-2 score for this test is 0.1232 (see Table 1).

7 Analysis

As a sanity check on our results, it is important to note that our translation system outperforms the baseline implementations in both directions. We also observe a difference in performance between the two translation directions. The ASL-to-English direction produces an average BLEU-2 score of 0.12, while the English-to-ASL direction gives a higher score of 0.18. This behavior is consistent with our expectations going into the project, since in going from English to ASL we essentially filter out unnecessary information, whereas the other direction in effect requires generating information.

Although it is useful to look at how changing these hyperparameters changes the model, it is perhaps even more important to recognize the shortcomings of our current model. The first issue is with our corpus, which by most measures is far too small to perform accurate machine translation. We also lack large databases containing sign language as text, which limited our language model for ASL. Unfortunately, our ASL language model was far too crude to produce accurate results. We oversimplified the model in several ways, in particular with our comma trigram structure and our treatment of gestures as equivalent to other signs. Because we are not fluent in ASL, let alone fully understand its underlying structure, we had difficulty designing its language model. In fact, to our knowledge, no one else has designed a text-based language model for any sign language. Finally, machine translation of single sentences is inherently a difficult task, since the sentences lack context. Despite these many challenges and difficulties, our system still translates many sentences effectively between the two languages in both directions.

8 Conclusion & Future Work

While we would like our translations to be as accurate as possible, it is important to note that it is virtually impossible, even for an oracle, to come up with a perfect translation. In any language, words and sentences carry contextual meaning that might be impossible to express exactly in another language. The problem of translating between sign languages and natural languages is extremely difficult, and we are likely the first to address it exactly this way. Although we faced numerous challenges, we have shown that this approach is a reasonably effective improvement over our baseline algorithms. With a more thorough understanding of ASL and its grammatical structure, along with a larger training corpus, our approach has the potential to be an effective system of translation between ASL and English.
9 References

[1]  American Sign Language Dictionary. http://www.signasl.org. Accessed: 2015-12-11.
[2]  American Sign Language Grammar. http://www.lifeprint.com/asl101/pages-layout/grammar.htm. Accessed: 2015-12-11.
[3]  Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.
[4]  Introduction to Statistical MT. https://spark-public.s3.amazonaws.com/cs124/slides/mt2.pdf. Accessed: 2015-12-11.
[5]  Machine Translation: Decoding. http://www.statmt.org/book/slides/06-decoding.pdf. Accessed: 2015-12-11.
[6]  Daniel Ortiz Martínez, Ismael García Varea, and Francisco Casacuberta Nolla. "Generalized stack decoding algorithms for statistical machine translation". In: Proceedings of the Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2006, pp. 64-71.
[7]  Alessandro Mazzei et al. "Deep Natural Language Processing for Italian Sign Language Translation". In: AI*IA 2013: Advances in Artificial Intelligence. Springer, 2013, pp. 193-204.
[8]  N-grams data. http://www.ngrams.info/. Accessed: 2015-12-11.
[9]  Natural Language Processing: Machine Translation. http://web.stanford.edu/class/cs224n/handouts/cs224n-lecture3-MT.pdf. Accessed: 2015-12-11.
[10] Kishore Papineni et al. "BLEU: a method for automatic evaluation of machine translation". In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311-318.
[11] Becky Sue Parton. "Sign language recognition and translation: A multidisciplined approach from the field of artificial intelligence". In: Journal of Deaf Studies and Deaf Education 11.1 (2006), pp. 94-101.
[12] Sign Language Phrases. http://www.lifeprint.com/asl101/index/sign-language-phrases.htm. Accessed: 2015-12-11.
[13] Signing Savvy. https://www.signingsavvy.com/. Accessed: 2015-12-11.
[14] Statistical Machine Translation: IBM Model 1 and 2. http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/ibm12.pdf. Accessed: 2015-12-11.
[15] Liwei Zhao et al. "A machine translation system from English to American Sign Language". In: Envisioning Machine Translation in the Information Future. Springer, 2000, pp. 54-67.
