ebook img

Decoding Anagrammed Texts Written in an Unknown Language PDF

12 Pages·2016·0.3 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Decoding Anagrammed Texts Written in an Unknown Language

Decoding Anagrammed Texts Written in an Unknown Language and Script BradleyHauer and GrzegorzKondrak DepartmentofComputingScience UniversityofAlberta Edmonton,Canada bmhauer,gkondrak @ualberta.ca { } Abstract phering the manuscript is the lack of knowledge of whatlanguageitrepresents. Algorithmic decipherment is a prime exam- Identificationoftheunderlyinglanguagehasbeen pleofatrulyunsupervisedproblem. Thefirst crucial for the decipherment of ancient scripts, in- step in the decipherment process is the iden- cluding Egyptian hieroglyphics (Coptic), Linear B tification of the encrypted language. We pro- (Greek),andMayanglyphs(Ch’olti’). Ontheother posethreemethodsfordeterminingthesource hand, the languages of many undeciphered scripts, language of a document enciphered with a monoalphabetic substitution cipher. The best such as Linear A, the Indus script, and the Phaistos method achieves 97% accuracy on 380 lan- Disc, remain unknown (Robinson, 2002). Even the guages. We then present an approach to de- order of characters within text may be in doubt; in coding anagrammed substitution ciphers, in Egyptianhieroglyphicinscriptions,forinstance,the which the letters within words have been ar- symbols were sometimes rearranged within a word bitrarilytransposed. Itobtainstheaveragede- inordertocreateamoreelegantinscription(Singh, cryptionwordaccuracyof93%onasetof50 2011). Another complicating factor is the omission ciphertextsin5languages. Finally, wereport theresultsontheVoynichmanuscript, anun- ofvowelsinsomewritingsystems. solvedfifteenthcenturycipher,whichsuggest Applicationsofciphertextlanguageidentification Hebrewasthelanguageofthedocument. extend beyond secret ciphers and ancient scripts. Nagy et al. (1987) frame optical character recogni- tion as a decipherment task. Knight et al. (2006) 1 Introduction note that for some languages, such as Hindi, there The Voynich manuscript is a medieval codex1 con- exist many different and incompatible encoding sistingof240pageswritteninauniquescript,which schemes for digital storage of text; the task of an- has been referred to as the world’s most important alyzing such an arbitrary encoding scheme can be unsolved cipher (Schmeh, 2013). The type of ci- viewedasadeciphermentofasubstitutioncipherin pher that was used to generate the text is unknown; an unknown language. Similarly, the unsupervised a number of theories have been proposed, includ- derivation of transliteration mappings between dif- ing substitution and transposition ciphers, an abjad ferent writing scripts lends itself to a cipher formu- (a writing system in which vowels are not written), lation(RaviandKnight,2009). steganography, semi-random schemes, and an elab- TheVoynichmanuscriptiswritteninanunknown orate hoax. However, the biggest obstacle to deci- script that encodes an unknown language, which is the most challenging type of a decipherment prob- 1The manuscript was radiocarbon dated to 1404-1438 lem (Robinson, 2002, p. 46). Inspired by the mys- AD in the Arizona Accelerator Mass Spectrometry Labo- tery of both the Voynich manuscript and the un- ratory (http://www.arizona.edu/crack-voynich-code, accessed Nov.20,2015). deciphered ancient scripts, we develop a series of 75 TransactionsoftheAssociationforComputationalLinguistics,vol.4,pp.75–86,2016.ActionEditor:ReginaBarzilay. Submissionbatch:12/2015;Published4/2016. c2016AssociationforComputationalLinguistics.DistributedunderaCC-BY4.0license. (cid:13) algorithms for the purpose of decrypting unknown alphabetic scripts representing unknown languages. Weassumethatsymbolsinscriptswhichcontainno more than a few dozen unique characters roughly correspond to phonemes of a language, and model them as monoalphabetic substitution ciphers. We Figure1: AsamplefromtheVoynichmanuscript. furtherallowthatanunknowntranspositionscheme could have been applied to the enciphered text, resulting in arbitrary scrambling of letters within Numerous languages have been proposed to un- words(anagramming). Finally,weconsiderthepos- derlietheVMS.Thepropertiesandthedatingofthe sibility that the underlying script is an abjad, in manuscriptimplyLatinandItalianaspotentialcan- whichonlyconsonantsareexplicitlyrepresented. didates. Onthebasisoftheanalysisofthecharacter Ourdecryptionsystemiscomposedofthreesteps. frequency distribution, Jaskiewicz (2011) identifies Thefirsttaskistoidentifythelanguageofacipher- five most probable languages, which include Mol- text,bycomparingittosamplesrepresentingknown davianandThai. ReddyandKnight(2011)discover languages. The second task is to map each symbol an excellent match between the VMS and Quranic of the ciphertext to the corresponding letter in the Arabicinthedistributionofwordlengths,aswellas identified language. The third task is to decode the asimilaritytoChinesePinyininthepredictabilityof resultinganagramsintoreadabletext,whichmayin- lettersgiventheprecedingletter. volvetherecoveryofunwrittenvowels. Thepaperisstructuredasfollows. Wediscussre- It has been suggested previously that some ana- lated work in Section 2. In Section 3, we propose gramming scheme may alter the sequence order of threemethodsforthesourcelanguageidentification characterswithinwordsintheVMS.Tiltman(1968) oftextsencipheredwithamonoalphabeticsubstitu- observes that each symbol behaves as if it had its tioncipher. InSection4,wepresentandevaluateour ownplaceinan“orderofprecedence”withinwords. approachtothedecryptionoftextscomposedofen- Rugg (2004) notes the apparent similarity of the ciphered anagrams. In Section 5, we apply our new VMStoatextinwhicheachwordhasbeenreplaced techniques to the Voynich manuscript. Section 6 by an alphabetically ordered anagram (alphagram). concludesthepaper. Reddy and Knight (2011) show that the letter se- quences are generally more predictable than in nat- 2 RelatedWork urallanguages. Inthissection,wereviewparticularlyrelevantprior SomeresearchershavearguedthattheVMSmay workontheVoynichmanuscript,andonalgorithmic be an elaborate hoax created to only appear as a deciphermentingeneral. meaningful text. Rugg (2004) suggests a tabular method, similar to the sixteenth century technique 2.1 VoynichManuscript of the Cardan grille, although recent dating of the Since the discovery of the Voynich manuscript manuscript to the fifteenth century provides evi- (henceforthreferredtoastheVMS),therehavebeen dence to the contrary. Schinner (2007) uses analy- a number of decipherments claims. Newbold and sis of random walk techniques and textual statistics Kent(1928)proposedaninterpretationbasedonmi- to support the hoax hypothesis. On the other hand, croscopicdetailsinthetext,whichwassubsequently Landini (2001) identifies in the VMS language-like refuted by Manly (1931). Other claimed decipher- statisticalproperties,suchasZipf’slaw,whichwere ments by Feely (1943) and Strong (1945) have also onlydiscoveredinthelastcentury. Similarly, Mon- been refuted (Tiltman, 1968). A detailed study of temurro and Zanette (2013) use information theo- the manuscript by d’Imperio (1978) details various retic techniques to find long-range relationships be- other proposed solutions and the arguments against tweenwordsandsectionsofthemanuscript,aswell them. asbetweenthetextandthefiguresintheVMS. 76 2.2 AlgorithmicDecipherment which are represented by short sample texts. The methodsarebasedon: A monoalphabetic substitution cipher is a well- known method of enciphering a plaintext by con- 1. relativecharacterfrequencies, verting it into a ciphertext of the same length using a 1-to-1 mapping of symbols. Knight et al. (2006) 2. patternsofrepeatedsymbolswithinwords, propose a method for deciphering substitution ci- pherswhichisbasedonViterbidecodingwithmap- 3. theoutcomeofatrialdecipherment. ping probabilities computed with the expectation- 3.1 CharacterFrequency maximization (EM) algorithm. The method cor- rectly deciphers 90% of symbols in a 400-letter ci- An intuitive way of guessing the source language phertext when a trigram character language model of a ciphertext is by character frequency analysis. is used. They apply their method to ciphertext The key observation is that the relative frequencies language identification using 80 different language ofsymbolsinthetextareunchangedafterencipher- samples,andreportsuccessfuloutcomesonthreeci- mentwitha1-to-1substitutioncipher. Theideaisto phersthatrepresentEnglish,Spanish,andaSpanish order the ciphertext symbols by frequency, normal- abjad,respectively. ize these frequencies to create a probability distri- Ravi and Knight (2008) present a more complex bution,andchoosetheclosestmatchingdistribution but slower method for solving substitution ciphers, fromthesetofcandidatelanguages. whichincorporatesconstraintsthatmodelthe1-to-1 More formally, let P be a discrete probability T property of the key. The objective function is again distribution where P (i) is the probability of a ran- T theprobabilityofthedeciphermentrelativetoann- domlyselectedsymbolinatextT beingtheithmost gramcharacterlanguagemodel. Asolutionisfound frequent symbol. We define the distance between byoptimallysolvinganintegerlinearprogram. two texts U and V to be the Bhattacharyya (1943) Knight et al. (2011) describe a successful deci- distance between the probability distributions P U pherment of an eighteenth century text known as andP : V theCopialeCipher. Languageidentificationwasthe first step of the process. The EM-based method of d(U,V) = ln PU(i) PV(i) − · Knight et al. (2006) identified German as the most i Xp likely candidate among over 40 candidate charac- The advantages of this distance metric include its ter language models. The more accurate method symmetry, and the ability to account for events that of Ravi and Knight (2008) was presumably either haveazeroprobability(inthiscase,duetodifferent too slow or too brittle for this purpose. The cipher alphabet sizes). The language of the closest sample waseventuallybrokenusingacombinationofman- text to the ciphertext is considered to be the most ualandalgorithmictechniques. likelysourcelanguage. Thismethodisnotonlyfast Hauer et al. (2014) present an approach to solv- butalsorobustagainstletterreorderingandthelack ing monoalphabetic substitution ciphers which is ofwordboundaries. more accurate than other algorithms proposed for this task, including Knight et al. (2006), Ravi and 3.2 DecompositionPatternFrequency Knight (2008), and Norvig (2009). We provide a Our second method expands on the character fre- detaileddescriptionofthemethodinSection4.1. quency method by incorporating the notion of de- compositionpatterns. Thismethodusesmultipleoc- 3 SourceLanguageIdentification currences of individual symbols within a word as a Inthissection,weproposeandevaluatethreemeth- cluetothelanguageoftheciphertext. Forexample, ods for determining the source language of a docu- thewordseemscontainstwoinstancesof‘s’and‘e’, mentencipheredwithamonoalphabeticsubstitution andoneinstanceof‘m’. Weareinterestedincaptur- cipher. Weframeitasaclassificationtask, withthe ing the relative frequency of such patterns in texts, classes corresponding to the candidate languages, independentofthesymbolsused. 77 Formally, we define a function f that maps a 1: kmax InitialKey ← word to an ordered n-tuple (t ,t ,...t ), where 2: formiterationsdo 1 2 n t t if i < j. Each t is the number of oc- 3: k kmax i j i ← cur≥rences of the ith most frequent character in the 4: bestswapsfork S ← word. For example, f(seems) = (2,2,1), while 5: foreach c1,c2 do { } ∈ S f(beams) = (1,1,1,1,1). Werefertotheresulting 6: k0 k(c1 c2) ← ↔ tupleasthedecompositionpatternoftheword. The 7: ifp(k0) > p(kmax)thenkmax k0 ← decomposition pattern is unaffected by monoalpha- 8: ifkmax = k thenreturnkmax beticlettersubstitutionoranagramming. Aswiththe 9: returnkmax character frequency method, we define the distance betweentwotextsastheBhattacharyyadistancebe- Figure2: Greedy-swapdeciphermentalgorithm. tweentheirdecompositionpatterndistributions,and classifythelanguageofaciphertextasthelanguage it is incorporated in the current key; otherwise, the ofthenearestsampletext. algorithmterminates. Thetotalnumberofiterations It is worth noting that this method requires word is bounded by m, which is set to 5 times the size separators to be preserved in the ciphertext. In fact, of the alphabet. After the initial run, the algorithm the effectiveness of the method comes partly from isrestarted20timeswitharandomlygeneratedini- capturing the distribution of word lengths in a text. tialkey,whichoftenresultsinabetterdecipherment. On the other hand, the decomposition patterns are All parameters were established on a development independent of the ordering of characters within set. words. We will take advantage of this property in Section4. 3.4 Evaluation 3.3 TrialDecipherment We now directly evaluate the three methods de- scribed above by applying them to a set of cipher- Thefinalmethodthatwepresentinvolvesdecipher- texts from different languages. We adapted the ing the document in question into each candidate dataset created by Emerson et al. (2014) from the language. The decipherment is performed with a text of the Universal Declaration of Human Rights fast greedy-swap algorithm, which is related to the (UDHR) in 380 languages.2 The average length of algorithms of Ravi and Knight (2008) and Norvig the texts is 1710 words and 11073 characters. We (2009). Itattemptstofindthekeythatmaximizesthe divided the text in each language into 66% train- probability of the decipherment according to a bi- ing, 17% development, and 17% test. The training gramcharacterlanguagemodelderivedfromasam- partwasusedtoderivecharacterbigrammodelsfor ple document in a given language. The decipher- eachlanguage. Thedevelopmentandtestpartswere mentwiththehighestprobabilityindicatesthemost separatelyencipheredwitharandomsubstitutionci- likelyplaintextlanguageofthedocument. pher. Thegreedy-swapalgorithmisshowninFigure2. Table1showstheresultsofthelanguageidentifi- The initial key is created by pairing the ciphertext cationmethodsonboththedevelopmentandthetest andplaintextsymbolsintheorderofdecreasingfre- set. Wereporttheaveragetop-1accuracyonthetask quency, with null symbols appended to the shorter ofidentifyingthesourcelanguageof380enciphered of the two alphabets. The algorithm repeatedly at- test samples. The differences between methods are tempts to improve the current key k by considering statisticallysignificantaccordingtoMcNemar’stest the “best” swaps of ciphertext symbol pairs within with p < 0.0001. The random baseline of 0.3% in- the key (if the key is viewed as a permutation of dicates the difficulty of the task. The “oracle” de- the alphabet, such a swap is a transposition). The cipherment assumes a perfect decipherment of the best swaps are defined as those that involve a sym- text, which effectively reduces the task to standard bol occurring among the 10 least common bigrams in the decipherment induced by the current key. If 2Eight languages from the original set were excluded be- anysuchswapyieldsamoreprobabledecipherment, causeofformattingissues. 78 Method Dev Test 4 AnagramDecryption RandomSelection 0.3 0.3 In this section, we address the challenging task of Jaskiewicz(2011) 54.2 47.6 deciphering a text in an unknown language written CharacterFrequency 72.4 67.9 using an unknown script, and in which the letters DecompositionPattern 90.5 85.5 within words have been randomly scrambled. The TrialDecipherment 94.2 97.1 task is designed to emulate the decipherment prob- OracleDecipherment 98.2 98.4 lem posed by the VMS, with the assumption that Table1: Languageidentificationaccuracy(in%correct) its unusual ordering of characters within words re- onciphersrepresenting380languages. flects some kind of a transposition cipher. We re- strict the source language to be one of the candi- date languages for which we have sample texts; we languageidentification. modelanunknownscriptwithasubstitutioncipher; All three of our methods perform well, with the andweimposenoconstraintsonthelettertransposi- accuracy gains reflecting their increasing complex- tionmethod. Theenciphermentprocessisillustrated ity. Between the two character frequency methods, inFigure3. Thegoalinthisinstanceistorecoverthe our approach based on Bhattacharyya distance is plaintextin(a)giventheciphertextin(c)withoutthe significantlymoreaccuratethanthemethodofJask- knowledge of the plaintext language. We also con- iewicz (2011), which uses a specially-designed dis- sider an additional encipherment step that removes tribution distance function. The decomposition pat- allvowelsfromtheplaintext. ternmethodmakesmanyfewererrors,withthecor- Our solution is composed of a sequence of three rectlanguagerankedsecondinroughlyhalfofthose modules that address the following tasks: language cases. Trial decipherment yields the best results, identification,scriptdecipherment,andanagramde- whichareclosetotheupperboundforthecharacter coding. For the first task we use the decomposition bigram probability approach to language identifica- pattern frequency method described in Section 3.2, tion. The average decipherment error rate into the which is applicable to anagrammed ciphers. After correctlanguageisonly2.5%. In4outof11identi- identifyingtheplaintextlanguage,weproceedtore- fication errors made on the test set, the error rate is versethesubstitutioncipherusingaheuristicsearch abovetheaverage;theother7errorsinvolveclosely algorithm guided by a combination of word and relatedlanguages,suchasSerbianandBosnian. character language models. Finally, we unscramble the anagrammed words into readable text by fram- The trial decipherment approach is much slower ing the decoding as a tagging task, which is effi- than the frequency distribution methods, requiring ciently solved with a Viterbi decoder. Our modular roughly one hour of CPU time in order to classify approachmakesiteasytoperformdifferentlevelsof each ciphertext. More complex decipherment algo- analysisonunsolvedciphers. rithmsareevenslower,whichprecludestheirappli- cation to this test set. Our re-implementations of 4.1 ScriptDecipherment thedynamicprogrammingalgorithmofKnightetal. (2006),andtheintegerprogrammingsolverofRavi Forthedeciphermentstep,weadaptthestate-of-the- and Knight (2008) average 53 and 7000 seconds of art solver of Hauer et al. (2014). In this section, we CPUtime,respectively,tosolveasingle256charac- describe the three main components of the solver: tercipher,comparedto2.6secondswithourgreedy- key scoring, key mutation, and tree search. This is swapmethod. Thedynamicprogrammingalgorithm followedbythesummaryofmodificationsthatmake improves decipherment accuracy over our method themethodworkonanagrams. byonly4%onabenchmarksetof50ciphersof256 The scoring component evaluates the fitness of characters. We conclude that our greedy-swap al- each key by computing the smoothed probability gorithm strikes the right balance between accuracy of the resulting decipherment with both character- and speed required for the task of cipher language level and word-level language models. The word- identification. level models promote decipherments that contain 79 (a) organized compositions through improvisational music into genres (b) fyovicstu dfnrfecpcfie pbyfzob cnryfgcevpcfivm nzecd cipf otiyte (c) otvfusyci cpifenfercfd bopbfzy fgyiemcpfcvrcnv nczed fpic etotyi (d) adegiknor ciimnooopsst ghhortu aaiiilmnooprstv cimsu inot eegnrs (e) adegiknor compositions through aaiiilmnooprstv music into greens Figure3: Anexampleoftheencryptionanddecryptionprocess: (a)plaintext;(b)afterapplyingasubstitutioncipher; (c) ciphertext after random anagramming; (d) after substitution decipherment (in the alphagram representation); (e) finaldeciphermentafteranagramdecoding(errorsareunderlined). in-vocabulary words and high-probability word n- because all these words map to the (2,1,1,1) pat- grams, while the character level models allow for tern. Internally, we represent all words as alpha- theincorporationofout-of-vocabularywords. grams,inwhichlettersarereshuffledintothealpha- The key mutation component crucially depends betical order (Figure 3d). In order to handle the in- on the notion of pattern equivalence between char- creasedambiguity,weusealetter-frequencyheuris- acter strings. Two strings are pattern-equivalent if tictoselectthemostlikelymappingofletterswithin they share the same pattern of repeated letters. For an n-gram. The trigram language models over both example, MZXCX is pattern-equivalent with there wordsandcharactersarederivedbyconvertingeach and bases. but not with otter. For each word uni- word in the training corpus into its alphagram. On gram,bigram,andtrigramintheciphertext,alistof abenchmarksetof50ciphersoflength256,theav- the most frequent pattern equivalent n-grams from erageerrorrateofthemodifiedsolveris2.6%,with the training corpus is compiled. The solver repeat- onlyasmallincreaseintimeandspaceusage. edly attempts to improve the current key through 4.2 AnagramDecoder a series of transpositions, so that a given cipher n- gram maps to a pattern-equivalent n-gram from the Theoutputofthescriptdeciphermentstepisgener- provided language sample. The number of substi- ally unreadable (see Figure 3d). The words might tutions for a given n-gram is limited to the k most be composed of the right letters but their order is promising candidates, where k is a parameter opti- unlikely to be correct. We proceed to decode the mizedonadevelopmentset. sequenceofanagramsbyframingitasasimplehid- The key mutation procedure generates a tree denMarkovmodel,inwhichthehiddenstatescorre- structure,whichissearchedforthebest-scoringde- spondtoplaintextwords,andtheobservedsequence ciphermentusingaversionofbeamsearch. Theroot iscomposedoftheiranagrams. Withoutlossofgen- of the tree contains the initial key, which is gener- erality, we convert anagrams into alphagrams, so atedaccordingtosimplefrequencyanalysis(i.e.,by that the emission probabilities are always equal to mappingthen-thmostcommonciphertextcharacter 1. Anyalphagramsthatcorrespondtounseenwords to the n-th most common character in the corpus). arereplacedwithasingle‘unknown’type. Wethen Newtreeleavesarespawnedbymodifyingthekeys useamodifiedViterbidecodertodeterminethemost of current leaves, while ensuring that each node in likely word sequence according to a word trigram the tree has a unique key. At the end of computa- language model, which is derived from the train- tion,thekeywiththehighestscoreisreturnedasthe ing corpus, and smoothed using deleted interpola- solution. tion(JelinekandMercer,1980). Inouranagramadaptation,werelaxthedefinition 4.3 VowelRecovery of pattern equivalence to include strings that have the same decomposition pattern, as defined in Sec- Many writing systems, including Arabic and He- tion 3.2. Under the new definition, the order of the brew,areabjadsthatdonotexplicitlyrepresentvow- letterswithinawordhasnoeffectonpatternequiv- els. ReddyandKnight(2011)provideevidencethat alence. Forexample,MZXCX isequivalentnotonly the VMS may encode an abjad. The removal of with there and bases, but also with three and otter, vowels represents a substantial loss of information, 80 andappearstodramaticallyincreasethedifficultyof Step2 Step3 Both Ceiling solvingacipher. English 99.5 98.2 97.7 98.7 In order to apply our system to abjads, we re- Bulgarian 97.0 94.7 91.9 95.3 move all vowels in the corpora prior to deriving the German 97.3 90.6 88.7 91.8 language models used by the script decipherment Greek 95.7 96.6 92.7 97.2 step. Weassumetheabilitytopartitiontheplaintext Spanish 99.1 98.0 97.1 99.0 symbolsintodisjointsetsofvowelsandconsonants Average 97.7 95.7 93.8 96.5 for each candidate language. The anagram decoder Table2: Wordaccuracyontheanagramdecryptiontask. is trained to recover complete in-vocabulary words fromsequencesofanagramscontainingonlyconso- nants. At test time, we remove the vowels from the TheresultsinTable2showthatoursystemisable input to the decipherment step of the pipeline. In to effectively break the anagrammed ciphers in all contrast with Knight et al. (2006), our approach is fivelanguages. ForStep2(scriptdecipherment),we able not only to attack abjad ciphers, but also to re- countascorrectallwordtokensthatcontaintheright storethevowels,producingfullyreadabletext. characters, disregarding their order. Step 3 (ana- 4.4 Evaluation gram decoding) is evaluated under the assumption thatithasreceivedaperfectdeciphermentfromStep In order to test our anagram decryption pipeline on 2. On average, the accuracy of each individual step out-of-domain ciphertexts, the corpora for deriving exceeds 95%. The values in the column denoted as language models need to be much larger than the Botharetheactualresultsofthepipelinecomposed UDHR samples used in the previous section. We of Steps 2 and 3. Our system correctly recovers selected five diverse European languages from Eu- 93.8% of word tokens, which corresponds to over roparl (Koehn, 2005): English, Bulgarian, German, 97%ofthein-vocabularywordswithinthetestfiles, Greek, and Spanish. The corresponding corpora The percentage of the in-vocabulary words, which containabout50millionwordseach,withtheexcep- are shown in the Ceiling column, constitute the ef- tion of Bulgarian which has only 9 million words. fectiveaccuracylimitsforeachlanguage. Weremovepunctuationandnumbers,andlowercase Theerrorsfallintothreecategories, asillustrated alltext. inFigure3e. Step2introducesdeciphermenterrors WetestontextsextractedfromWikipediaarticles (e.g., deciphering ‘s’ as ‘k’ instead of ‘z’ in “orga- onart,Earth,Europe,film,history,language,music, nized”),whichtypicallyprecludethewordfrombe- science, technology, and Wikipedia. The texts are ing recovered in the next step. A decoding error in firstencipheredusingasubstitutioncipher,andthen Step 3 may occur when an alphagram corresponds anagrammed (Figure 3a-c). Each of the five lan- to multiple words (e.g. “greens” instead of “gen- guages is represented by 10 ciphertexts, which are res”), although most such ambiguities are resolved decrypted independently. In order to keep the run- correctly. However,themajorityoferrorsarecaused ningtimereasonable,thelengthoftheciphertextsis by out-of-vocabulary (OOV) words in the plaintext setto500characters. (e.g.,“improvisational”). Sincethedecodercanonly The first step is language identification. Our de- producewordsfoundinthetrainingcorpus,anOOV composition pattern method, which is resistant to wordalmostalwaysresultsinanerror. TheGerman both anagramming and substitution, correctly iden- ciphersstandoutashavingthelargestpercentageof tifies the source language of 49 out of 50 cipher- OOVwords(8.2%),whichmaybeattributedtofre- texts. The lone exception is the German article on quentcompounding. technology, for which German is the second ranked languageafterGreek. Thiserrorcouldbeeasilyde- Table3showstheresultsoftheanalogousexper- tectedbynoticingthatmostoftheGreekwords“de- iments on abjads (Section 4.3). Surprisingly, the ciphered” by the subsequent steps are out of vocab- removal of vowels from the plaintext actually im- ulary. We proceed to evaluate the following steps proves the average decipherment step accuracy to assumingthatthesourcelanguageisknown. 99%. Thisisduenotonlytothereducednumberof 81 Step2 Step3 Both Ceiling Language Text Words Characters English 99.9 84.6 84.5 98.7 English Bible 804,875 4,097,508 Bulgarian 99.1 71.1 70.4 95.0 Italian Bible 758,854 4,246,663 German 98.5 73.7 72.9 91.8 Latin Bible 650,232 4,150,533 Greek 97.7 65.4 63.6 97.0 Hebrew Tanach 309,934 1,562,591 Spanish 99.8 73.7 73.3 99.0 Arabic Quran 78,245 411,082 Average 99.0 73.8 73.1 96.4 Table4: Languagecorpora. Table3: Wordaccuracyontheabjadanagramdecryption task. 5.2 SourceLanguage In this section, we present the results of our cipher- distinctsymbols,butalsotothefewerpossibleana- textlanguageidentificationmethodsfromSection3 grammingpermutationsintheshortenedwords. On ontheVMStext. theotherhand,thelossofvowelinformationmakes The closest language according to the letter fre- the anagram decoding step much harder. However, quency method is Mazatec, a native American lan- morethanthreequartersofin-vocabularytokensare guage from southern Mexico. Since the VMS was stillcorrectlyrecovered,includingtheoriginalvow- els.3 Ingeneral,thisissufficientforahumanreader created before the voyage of Columbus, a New World language is an unlikely candidate. The top tounderstandthemeaningofthedocument,andde- tenlanguagesalsoincludeMozarabic(3),Italian(8), ducetheremainingwords. andLadino(10), allofwhichareplausibleguesses. 5 VoynichExperiments However, the experiments in Section 3.4 demon- strate that the frequency analysis is much less reli- In this section, we present the results of our experi- ablethantheothertwomethods. mentsontheVMS.Weattempttoidentifythesource The top-ranking languages according to the de- language with the methods described in Section 3; composition pattern method are Hebrew, Malay (in we quantify the similarity of the Voynich words to Arabic script), Standard Arabic, and Amharic, in alphagrams; and we apply our anagram decryption this order. We note that three of these belong to algorithmfromSection4tothetext. theSemiticfamily. Thesimilarityofdecomposition 5.1 Data patterns between Hebrew and the VMS is striking. The Bhattacharyya distance between the respective Unless otherwise noted, the VMS text used in distributions is 0.020, compared to 0.048 for the our experiments corresponds to 43 pages of the second-ranking Malay. The histogram in Figure 4 manuscript in the “type B” handwriting (VMS-B), showsHebrewasasingleoutlierintheleftmostbin. investigated by Reddy and Knight (2011), which In fact, Hebrew is closer to a sample of the VMS we obtained directly from the authors. It con- ofasimilarlengththantoanyoftheremaining379 tains 17,597 words and 95,465 characters, tran- UDHRsamples. scribed into 35 characters of the Currier alphabet The ranking produced by the the trial decipher- (d’Imperio,1978). mentmethodissensitivetoparameterchanges;how- For the comparison experiments, we selected ever,thetwolanguagesthatconsistentlyappearnear five languages shown in Table 4, which have the top of the list are Hebrew and Esperanto. The been suggested in the past as the language of the highrankofHebrewcorroboratestheoutcomeofthe VMS (Kennedy and Churchill, 2006). Consider- decomposition pattern method. Being a relatively ing the age of the manuscript, we attempt to use recent creation, Esperanto itself can be excluded as corpora that correspond to older versions of the theciphertextlanguage,butitshighscoreisremark- languages, including King James Bible, Bibbia di ableinviewofthewell-knowntheorythattheVMS Gerusalemme,andVulgate. textrepresentsaconstructedlanguage.4 Wehypoth- 3The differences in the Ceiling numbers between Tables 2 and3areduetowordsthatarecomposedentirelyofvowels. 4The theory was first presented in the form of an ana- 82 In order to quantify how strongly the words in a language resemble alphagrams, we first need to identify the order of the alphabet that minimizes the total alphagram distance of a representative text sample. ThedecisionversionofthisproblemisNP- complete, which can be demonstrated by a reduc- tion from the path variant of the traveling salesman problem. Instead, we find an approximate solution Figure4: HistogramofdistancesbetweentheVMSand withthefollowinggreedysearchalgorithm. Starting samplesof380otherlanguages,asdeterminedbythede- from an initial order in which the letters first occur compositionpatternmethod.Thesingleoutlierontheleft in the text, we repeatedly consider all possible new isHebrew. positions for a letter within the current order, and choosetheonethatyieldsthelowesttotalalphagram esize that the extreme morphological regularity of distanceofthetext. Thisprocessisrepeateduntilno Esperanto (e.g., all plural nouns contain the bigram betterorderisfoundfor10iterations,with100ran- ‘oj’) yields an unusual bigram character language domrestarts. model which fits the repetitive nature of the VMS Whenappliedtoarandomsampleof10,000word words. tokens from the VMS, our algorithm yields the or- In summary, while there is no complete agree- der 4BZOVPEFSXQYWC28ARUTIJ3*GHK69MDLN5, ment between the three methods about the most which corresponds to the average alphagram dis- likely underlying source language, there appears to tanceof0.996(i.e.,slightlylessthanonepairoflet- be a strong statistical support for Hebrew from the tersperword). ThecorrespondingresultonEnglish two most accurate methods, one of which is robust is jzbqwxcpathofvurimslkengdy, with an against anagramming. In addition, the language is average alphagram distance of 2.454. Note that the a plausible candidate on historical grounds, being lettersatthebeginningofthesequencetendtohave widely-usedforwritingintheMiddleAges. Infact, low frequency, while the ones at the end occur in a number of cipher techniques, including anagram- popular morphological suffixes, such as ed and ming, can be traced to the Jewish Cabala (Kennedy − ly. For example, the beginning of the first arti- andChurchill,2006). − cle of the UDHR with the letters transposed to fol- lowthisorderbecomes: “Allahumnbisengareborn 5.3 Alphagrams freeandqauleintiingdyandthrisg.” In this section, we quantify the peculiarity of the To estimate how close the solution produced by VMSlexiconbymodelingthewordsasalphagrams. our greedy algorithm is to the actual optimal solu- We introduce the notion of the alphagram distance, tion, we also calculate a lower bound for the total andcomputeitfortheVMSandfornaturallanguage alphagram distance with any character order. The samples. lowerboundis min(b ,b ),whereb isthe x,y xy yx xy We define a word’s alphagram distance with re- numberoftimescharacterxoccursbeforecharacter specttoanorderingofthealphabetasthenumberof y withinwordsPinthetext. letterpairsthatareinthewrongorder. Forexample, Figure 5 shows the average alphagram distances with respect to the QWERTY keyboard order, the for the VMS and five comparison languages, each word rye has an alphagram distance of 2 because it representedbyarandomsampleof10,000wordto- containstwoletterpairsthatviolatetheorder: (r,e) kens which exclude single-letter words. The Ex- and(y,e). Awordisanalphagramifandonlyifits pected values correspond to a completely random alphagram distance is zero. The maximum alpha- intra-word letter order. The Lexicographic values gram distance for a word of length n is equal to the correspond to the standard alphabetic order in each numberofitsdistinctletterpairs. language. The actual minimum alphagram distance isbetweentheLowerBoundandtheComputedMin- gram(FriedmanandFriedman,1959). Seealsoamorerecent proposalbyBalandinandAveryanov(2014). imumobtainedbyourgreedyalgorithm. 83 tically correct or semantically consistent. This is expected because our system is designed for pure monoalphabetic substitution ciphers. If the VMS indeed represents one of the five languages, the amountofnoiseinherentintheorthographyandthe transcription would prevent the system from pro- ducing a correct decipherment. For example, in a hypothetical non-standard orthography of Hebrew, some prepositions or determiners could be written as separate one-letter words, or a single phoneme could have two different representations. In addi- tion, because of the age of the manuscript and the variety of its hand-writing styles, any transcription requires a great deal of guesswork regarding the Figure5: Averagewordalphagramdistances. separationofindividualwordsintodistinctsymbols (Figure 1). Finally, the decipherments necessarily The results in Figure 5 show that while the ex- reflectthecorporathatunderliethelanguagemodel, pectedalphagramdistancefortheVMSfallswithin whichmaycorrespondtoadifferentdomainandhis- the range exhibited by natural languages, its mini- toricalperiod. mum alphagram distance is exceptionally low. In Nevertheless,itisinterestingtotakeacloserlook absolute terms, the VMS minimum is less than half at specific examples of the system output. The first the corresponding number for Hebrew. In relative line of the VMS (VAS92 9FAE AR APAM ZOE terms, the ratio of the expected distance to the min- ZOR9 QOR92 9 FOR ZOE89)isdecipheredinto imum distance is below 2 for any of the five lan- Hebrewasוישנא ילע ו וחיבל וילא שיא Nהכה הל השעו guages,butabove4fortheVMS.Theseresultssug- תוצמה.5 According to a native speaker of the lan- gest that, if the VMS encodes a natural language guage, this is not quite a coherent sentence. How- text, the letters within the words may have been re- ever, after making a couple of spelling corrections, orderedduringtheencryptionprocess. Google Translate is able to convert it into passable English: “Shemaderecommendationstothepriest, 5.4 DeciphermentExperiments manofthehouseandmeandpeople.” 6 Inthissection,wediscusstheresultsofapplyingour Even though the input ciphertext is certainly too anagramdecryptionsystemdescribedinSection4to noisy to result in a fluent output, the system might theVMStext. still manage to correctly decrypt individual words We decipher each of the first 10 pages of the in a longer passage. In order to limit the influence VMS-Busingthefivelanguagemodelsderivedfrom ofcontextinthedecipherment,werestricttheword thecorporadescribedinSection5.1. Thepagescon- language model to unigrams, and apply our sys- tainbetween292and556words,3726intotal. Fig- tem to the first 72 words (241 characters)7 from the ure6showstheaveragepercentageofin-vocabulary “Herbal”sectionoftheVMS,whichcontainsdraw- words in the 10 decipherments. The percentage is ings of plants. An inspection of the output reveals significantly higher for Hebrew than for the other severalwordsthatwouldnotbeoutofplaceiname- languages, which suggests a better match with the dieval herbal, such as רצה ‘narrow’, רכיא ‘farmer’, VMS. Although the abjad versions of English, Ital- רוא‘light’,ריוא‘air’,שׁא‘fire’. ian, and Latin yield similar levels of in-vocabulary The results presented in this section could be in- words,theirdistancestotheVMSlanguageaccord- terpreted either as tantalizing clues for Hebrew as ing to the decomposition pattern method are 0.159, 5Hebrewiswrittenfromrighttoleft. 0.176, and 0.245 respectively, well above Hebrew’s 6https://translate.google.com/(accessedNov.20,2015). 0.020. 7Thelengthofthepassagewaschosentomatchthenumber None of the decipherments appear to be syntac- ofsymbolsinthePhaistosDiscinscription. 84

Description:
method achieves 97% accuracy on 380 lan- guages. cryption word accuracy of 93% on a set of 50 ciphertexts in 5 such as Linear A, the Indus script, and the Phaistos .. ning time reasonable, the length of the ciphertexts is.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.