Table Of ContentDecoding Anagrammed Texts Written in an Unknown Language and Script
BradleyHauer and GrzegorzKondrak
DepartmentofComputingScience
UniversityofAlberta
Edmonton,Canada
bmhauer,gkondrak @ualberta.ca
{ }
Abstract phering the manuscript is the lack of knowledge of
whatlanguageitrepresents.
Algorithmic decipherment is a prime exam- Identificationoftheunderlyinglanguagehasbeen
pleofatrulyunsupervisedproblem. Thefirst
crucial for the decipherment of ancient scripts, in-
step in the decipherment process is the iden-
cluding Egyptian hieroglyphics (Coptic), Linear B
tification of the encrypted language. We pro-
(Greek),andMayanglyphs(Ch’olti’). Ontheother
posethreemethodsfordeterminingthesource
hand, the languages of many undeciphered scripts,
language of a document enciphered with a
monoalphabetic substitution cipher. The best such as Linear A, the Indus script, and the Phaistos
method achieves 97% accuracy on 380 lan- Disc, remain unknown (Robinson, 2002). Even the
guages. We then present an approach to de- order of characters within text may be in doubt; in
coding anagrammed substitution ciphers, in Egyptianhieroglyphicinscriptions,forinstance,the
which the letters within words have been ar-
symbols were sometimes rearranged within a word
bitrarilytransposed. Itobtainstheaveragede-
inordertocreateamoreelegantinscription(Singh,
cryptionwordaccuracyof93%onasetof50
2011). Another complicating factor is the omission
ciphertextsin5languages. Finally, wereport
theresultsontheVoynichmanuscript, anun- ofvowelsinsomewritingsystems.
solvedfifteenthcenturycipher,whichsuggest Applicationsofciphertextlanguageidentification
Hebrewasthelanguageofthedocument. extend beyond secret ciphers and ancient scripts.
Nagy et al. (1987) frame optical character recogni-
tion as a decipherment task. Knight et al. (2006)
1 Introduction
note that for some languages, such as Hindi, there
The Voynich manuscript is a medieval codex1 con- exist many different and incompatible encoding
sistingof240pageswritteninauniquescript,which schemes for digital storage of text; the task of an-
has been referred to as the world’s most important alyzing such an arbitrary encoding scheme can be
unsolved cipher (Schmeh, 2013). The type of ci- viewedasadeciphermentofasubstitutioncipherin
pher that was used to generate the text is unknown; an unknown language. Similarly, the unsupervised
a number of theories have been proposed, includ- derivation of transliteration mappings between dif-
ing substitution and transposition ciphers, an abjad ferent writing scripts lends itself to a cipher formu-
(a writing system in which vowels are not written), lation(RaviandKnight,2009).
steganography, semi-random schemes, and an elab- TheVoynichmanuscriptiswritteninanunknown
orate hoax. However, the biggest obstacle to deci- script that encodes an unknown language, which is
the most challenging type of a decipherment prob-
1The manuscript was radiocarbon dated to 1404-1438
lem (Robinson, 2002, p. 46). Inspired by the mys-
AD in the Arizona Accelerator Mass Spectrometry Labo-
tery of both the Voynich manuscript and the un-
ratory (http://www.arizona.edu/crack-voynich-code, accessed
Nov.20,2015). deciphered ancient scripts, we develop a series of
75
TransactionsoftheAssociationforComputationalLinguistics,vol.4,pp.75–86,2016.ActionEditor:ReginaBarzilay.
Submissionbatch:12/2015;Published4/2016.
c2016AssociationforComputationalLinguistics.DistributedunderaCC-BY4.0license.
(cid:13)
algorithms for the purpose of decrypting unknown
alphabetic scripts representing unknown languages.
Weassumethatsymbolsinscriptswhichcontainno
more than a few dozen unique characters roughly
correspond to phonemes of a language, and model
them as monoalphabetic substitution ciphers. We Figure1: AsamplefromtheVoynichmanuscript.
furtherallowthatanunknowntranspositionscheme
could have been applied to the enciphered text,
resulting in arbitrary scrambling of letters within
Numerous languages have been proposed to un-
words(anagramming). Finally,weconsiderthepos-
derlietheVMS.Thepropertiesandthedatingofthe
sibility that the underlying script is an abjad, in
manuscriptimplyLatinandItalianaspotentialcan-
whichonlyconsonantsareexplicitlyrepresented.
didates. Onthebasisoftheanalysisofthecharacter
Ourdecryptionsystemiscomposedofthreesteps.
frequency distribution, Jaskiewicz (2011) identifies
Thefirsttaskistoidentifythelanguageofacipher-
five most probable languages, which include Mol-
text,bycomparingittosamplesrepresentingknown
davianandThai. ReddyandKnight(2011)discover
languages. The second task is to map each symbol
an excellent match between the VMS and Quranic
of the ciphertext to the corresponding letter in the
Arabicinthedistributionofwordlengths,aswellas
identified language. The third task is to decode the
asimilaritytoChinesePinyininthepredictabilityof
resultinganagramsintoreadabletext,whichmayin-
lettersgiventheprecedingletter.
volvetherecoveryofunwrittenvowels.
Thepaperisstructuredasfollows. Wediscussre-
It has been suggested previously that some ana-
lated work in Section 2. In Section 3, we propose
gramming scheme may alter the sequence order of
threemethodsforthesourcelanguageidentification
characterswithinwordsintheVMS.Tiltman(1968)
oftextsencipheredwithamonoalphabeticsubstitu-
observes that each symbol behaves as if it had its
tioncipher. InSection4,wepresentandevaluateour
ownplaceinan“orderofprecedence”withinwords.
approachtothedecryptionoftextscomposedofen-
Rugg (2004) notes the apparent similarity of the
ciphered anagrams. In Section 5, we apply our new
VMStoatextinwhicheachwordhasbeenreplaced
techniques to the Voynich manuscript. Section 6
by an alphabetically ordered anagram (alphagram).
concludesthepaper.
Reddy and Knight (2011) show that the letter se-
quences are generally more predictable than in nat-
2 RelatedWork
urallanguages.
Inthissection,wereviewparticularlyrelevantprior
SomeresearchershavearguedthattheVMSmay
workontheVoynichmanuscript,andonalgorithmic
be an elaborate hoax created to only appear as a
deciphermentingeneral.
meaningful text. Rugg (2004) suggests a tabular
method, similar to the sixteenth century technique
2.1 VoynichManuscript
of the Cardan grille, although recent dating of the
Since the discovery of the Voynich manuscript manuscript to the fifteenth century provides evi-
(henceforthreferredtoastheVMS),therehavebeen dence to the contrary. Schinner (2007) uses analy-
a number of decipherments claims. Newbold and sis of random walk techniques and textual statistics
Kent(1928)proposedaninterpretationbasedonmi- to support the hoax hypothesis. On the other hand,
croscopicdetailsinthetext,whichwassubsequently Landini (2001) identifies in the VMS language-like
refuted by Manly (1931). Other claimed decipher- statisticalproperties,suchasZipf’slaw,whichwere
ments by Feely (1943) and Strong (1945) have also onlydiscoveredinthelastcentury. Similarly, Mon-
been refuted (Tiltman, 1968). A detailed study of temurro and Zanette (2013) use information theo-
the manuscript by d’Imperio (1978) details various retic techniques to find long-range relationships be-
other proposed solutions and the arguments against tweenwordsandsectionsofthemanuscript,aswell
them. asbetweenthetextandthefiguresintheVMS.
76
2.2 AlgorithmicDecipherment which are represented by short sample texts. The
methodsarebasedon:
A monoalphabetic substitution cipher is a well-
known method of enciphering a plaintext by con-
1. relativecharacterfrequencies,
verting it into a ciphertext of the same length using
a 1-to-1 mapping of symbols. Knight et al. (2006) 2. patternsofrepeatedsymbolswithinwords,
propose a method for deciphering substitution ci-
pherswhichisbasedonViterbidecodingwithmap- 3. theoutcomeofatrialdecipherment.
ping probabilities computed with the expectation-
3.1 CharacterFrequency
maximization (EM) algorithm. The method cor-
rectly deciphers 90% of symbols in a 400-letter ci- An intuitive way of guessing the source language
phertext when a trigram character language model of a ciphertext is by character frequency analysis.
is used. They apply their method to ciphertext The key observation is that the relative frequencies
language identification using 80 different language ofsymbolsinthetextareunchangedafterencipher-
samples,andreportsuccessfuloutcomesonthreeci- mentwitha1-to-1substitutioncipher. Theideaisto
phersthatrepresentEnglish,Spanish,andaSpanish order the ciphertext symbols by frequency, normal-
abjad,respectively. ize these frequencies to create a probability distri-
Ravi and Knight (2008) present a more complex bution,andchoosetheclosestmatchingdistribution
but slower method for solving substitution ciphers, fromthesetofcandidatelanguages.
whichincorporatesconstraintsthatmodelthe1-to-1 More formally, let P be a discrete probability
T
property of the key. The objective function is again distribution where P (i) is the probability of a ran-
T
theprobabilityofthedeciphermentrelativetoann- domlyselectedsymbolinatextT beingtheithmost
gramcharacterlanguagemodel. Asolutionisfound frequent symbol. We define the distance between
byoptimallysolvinganintegerlinearprogram. two texts U and V to be the Bhattacharyya (1943)
Knight et al. (2011) describe a successful deci- distance between the probability distributions P
U
pherment of an eighteenth century text known as andP :
V
theCopialeCipher. Languageidentificationwasthe
first step of the process. The EM-based method of d(U,V) = ln PU(i) PV(i)
− ·
Knight et al. (2006) identified German as the most i
Xp
likely candidate among over 40 candidate charac-
The advantages of this distance metric include its
ter language models. The more accurate method
symmetry, and the ability to account for events that
of Ravi and Knight (2008) was presumably either
haveazeroprobability(inthiscase,duetodifferent
too slow or too brittle for this purpose. The cipher
alphabet sizes). The language of the closest sample
waseventuallybrokenusingacombinationofman-
text to the ciphertext is considered to be the most
ualandalgorithmictechniques.
likelysourcelanguage. Thismethodisnotonlyfast
Hauer et al. (2014) present an approach to solv-
butalsorobustagainstletterreorderingandthelack
ing monoalphabetic substitution ciphers which is
ofwordboundaries.
more accurate than other algorithms proposed for
this task, including Knight et al. (2006), Ravi and
3.2 DecompositionPatternFrequency
Knight (2008), and Norvig (2009). We provide a
Our second method expands on the character fre-
detaileddescriptionofthemethodinSection4.1.
quency method by incorporating the notion of de-
compositionpatterns. Thismethodusesmultipleoc-
3 SourceLanguageIdentification
currences of individual symbols within a word as a
Inthissection,weproposeandevaluatethreemeth- cluetothelanguageoftheciphertext. Forexample,
ods for determining the source language of a docu- thewordseemscontainstwoinstancesof‘s’and‘e’,
mentencipheredwithamonoalphabeticsubstitution andoneinstanceof‘m’. Weareinterestedincaptur-
cipher. Weframeitasaclassificationtask, withthe ing the relative frequency of such patterns in texts,
classes corresponding to the candidate languages, independentofthesymbolsused.
77
Formally, we define a function f that maps a 1: kmax InitialKey
←
word to an ordered n-tuple (t ,t ,...t ), where 2: formiterationsdo
1 2 n
t t if i < j. Each t is the number of oc- 3: k kmax
i j i ←
cur≥rences of the ith most frequent character in the 4: bestswapsfork
S ←
word. For example, f(seems) = (2,2,1), while 5: foreach c1,c2 do
{ } ∈ S
f(beams) = (1,1,1,1,1). Werefertotheresulting 6: k0 k(c1 c2)
← ↔
tupleasthedecompositionpatternoftheword. The 7: ifp(k0) > p(kmax)thenkmax k0
←
decomposition pattern is unaffected by monoalpha- 8: ifkmax = k thenreturnkmax
beticlettersubstitutionoranagramming. Aswiththe 9: returnkmax
character frequency method, we define the distance
betweentwotextsastheBhattacharyyadistancebe- Figure2: Greedy-swapdeciphermentalgorithm.
tweentheirdecompositionpatterndistributions,and
classifythelanguageofaciphertextasthelanguage
it is incorporated in the current key; otherwise, the
ofthenearestsampletext.
algorithmterminates. Thetotalnumberofiterations
It is worth noting that this method requires word
is bounded by m, which is set to 5 times the size
separators to be preserved in the ciphertext. In fact,
of the alphabet. After the initial run, the algorithm
the effectiveness of the method comes partly from
isrestarted20timeswitharandomlygeneratedini-
capturing the distribution of word lengths in a text.
tialkey,whichoftenresultsinabetterdecipherment.
On the other hand, the decomposition patterns are
All parameters were established on a development
independent of the ordering of characters within
set.
words. We will take advantage of this property in
Section4.
3.4 Evaluation
3.3 TrialDecipherment We now directly evaluate the three methods de-
scribed above by applying them to a set of cipher-
Thefinalmethodthatwepresentinvolvesdecipher-
texts from different languages. We adapted the
ing the document in question into each candidate
dataset created by Emerson et al. (2014) from the
language. The decipherment is performed with a
text of the Universal Declaration of Human Rights
fast greedy-swap algorithm, which is related to the
(UDHR) in 380 languages.2 The average length of
algorithms of Ravi and Knight (2008) and Norvig
the texts is 1710 words and 11073 characters. We
(2009). Itattemptstofindthekeythatmaximizesthe
divided the text in each language into 66% train-
probability of the decipherment according to a bi-
ing, 17% development, and 17% test. The training
gramcharacterlanguagemodelderivedfromasam-
partwasusedtoderivecharacterbigrammodelsfor
ple document in a given language. The decipher-
eachlanguage. Thedevelopmentandtestpartswere
mentwiththehighestprobabilityindicatesthemost
separatelyencipheredwitharandomsubstitutionci-
likelyplaintextlanguageofthedocument.
pher.
Thegreedy-swapalgorithmisshowninFigure2.
Table1showstheresultsofthelanguageidentifi-
The initial key is created by pairing the ciphertext
cationmethodsonboththedevelopmentandthetest
andplaintextsymbolsintheorderofdecreasingfre-
set. Wereporttheaveragetop-1accuracyonthetask
quency, with null symbols appended to the shorter
ofidentifyingthesourcelanguageof380enciphered
of the two alphabets. The algorithm repeatedly at-
test samples. The differences between methods are
tempts to improve the current key k by considering
statisticallysignificantaccordingtoMcNemar’stest
the “best” swaps of ciphertext symbol pairs within
with p < 0.0001. The random baseline of 0.3% in-
the key (if the key is viewed as a permutation of
dicates the difficulty of the task. The “oracle” de-
the alphabet, such a swap is a transposition). The
cipherment assumes a perfect decipherment of the
best swaps are defined as those that involve a sym-
text, which effectively reduces the task to standard
bol occurring among the 10 least common bigrams
in the decipherment induced by the current key. If 2Eight languages from the original set were excluded be-
anysuchswapyieldsamoreprobabledecipherment, causeofformattingissues.
78
Method Dev Test 4 AnagramDecryption
RandomSelection 0.3 0.3
In this section, we address the challenging task of
Jaskiewicz(2011) 54.2 47.6
deciphering a text in an unknown language written
CharacterFrequency 72.4 67.9
using an unknown script, and in which the letters
DecompositionPattern 90.5 85.5
within words have been randomly scrambled. The
TrialDecipherment 94.2 97.1
task is designed to emulate the decipherment prob-
OracleDecipherment 98.2 98.4
lem posed by the VMS, with the assumption that
Table1: Languageidentificationaccuracy(in%correct) its unusual ordering of characters within words re-
onciphersrepresenting380languages. flects some kind of a transposition cipher. We re-
strict the source language to be one of the candi-
date languages for which we have sample texts; we
languageidentification. modelanunknownscriptwithasubstitutioncipher;
All three of our methods perform well, with the andweimposenoconstraintsonthelettertransposi-
accuracy gains reflecting their increasing complex- tionmethod. Theenciphermentprocessisillustrated
ity. Between the two character frequency methods, inFigure3. Thegoalinthisinstanceistorecoverthe
our approach based on Bhattacharyya distance is plaintextin(a)giventheciphertextin(c)withoutthe
significantlymoreaccuratethanthemethodofJask- knowledge of the plaintext language. We also con-
iewicz (2011), which uses a specially-designed dis- sider an additional encipherment step that removes
tribution distance function. The decomposition pat- allvowelsfromtheplaintext.
ternmethodmakesmanyfewererrors,withthecor- Our solution is composed of a sequence of three
rectlanguagerankedsecondinroughlyhalfofthose modules that address the following tasks: language
cases. Trial decipherment yields the best results, identification,scriptdecipherment,andanagramde-
whichareclosetotheupperboundforthecharacter coding. For the first task we use the decomposition
bigram probability approach to language identifica- pattern frequency method described in Section 3.2,
tion. The average decipherment error rate into the which is applicable to anagrammed ciphers. After
correctlanguageisonly2.5%. In4outof11identi- identifyingtheplaintextlanguage,weproceedtore-
fication errors made on the test set, the error rate is versethesubstitutioncipherusingaheuristicsearch
abovetheaverage;theother7errorsinvolveclosely algorithm guided by a combination of word and
relatedlanguages,suchasSerbianandBosnian. character language models. Finally, we unscramble
the anagrammed words into readable text by fram-
The trial decipherment approach is much slower
ing the decoding as a tagging task, which is effi-
than the frequency distribution methods, requiring
ciently solved with a Viterbi decoder. Our modular
roughly one hour of CPU time in order to classify
approachmakesiteasytoperformdifferentlevelsof
each ciphertext. More complex decipherment algo-
analysisonunsolvedciphers.
rithmsareevenslower,whichprecludestheirappli-
cation to this test set. Our re-implementations of
4.1 ScriptDecipherment
thedynamicprogrammingalgorithmofKnightetal.
(2006),andtheintegerprogrammingsolverofRavi Forthedeciphermentstep,weadaptthestate-of-the-
and Knight (2008) average 53 and 7000 seconds of art solver of Hauer et al. (2014). In this section, we
CPUtime,respectively,tosolveasingle256charac- describe the three main components of the solver:
tercipher,comparedto2.6secondswithourgreedy- key scoring, key mutation, and tree search. This is
swapmethod. Thedynamicprogrammingalgorithm followedbythesummaryofmodificationsthatmake
improves decipherment accuracy over our method themethodworkonanagrams.
byonly4%onabenchmarksetof50ciphersof256 The scoring component evaluates the fitness of
characters. We conclude that our greedy-swap al- each key by computing the smoothed probability
gorithm strikes the right balance between accuracy of the resulting decipherment with both character-
and speed required for the task of cipher language level and word-level language models. The word-
identification. level models promote decipherments that contain
79
(a) organized compositions through improvisational music into genres
(b) fyovicstu dfnrfecpcfie pbyfzob cnryfgcevpcfivm nzecd cipf otiyte
(c) otvfusyci cpifenfercfd bopbfzy fgyiemcpfcvrcnv nczed fpic etotyi
(d) adegiknor ciimnooopsst ghhortu aaiiilmnooprstv cimsu inot eegnrs
(e) adegiknor compositions through aaiiilmnooprstv music into greens
Figure3: Anexampleoftheencryptionanddecryptionprocess: (a)plaintext;(b)afterapplyingasubstitutioncipher;
(c) ciphertext after random anagramming; (d) after substitution decipherment (in the alphagram representation); (e)
finaldeciphermentafteranagramdecoding(errorsareunderlined).
in-vocabulary words and high-probability word n- because all these words map to the (2,1,1,1) pat-
grams, while the character level models allow for tern. Internally, we represent all words as alpha-
theincorporationofout-of-vocabularywords. grams,inwhichlettersarereshuffledintothealpha-
The key mutation component crucially depends betical order (Figure 3d). In order to handle the in-
on the notion of pattern equivalence between char- creasedambiguity,weusealetter-frequencyheuris-
acter strings. Two strings are pattern-equivalent if tictoselectthemostlikelymappingofletterswithin
they share the same pattern of repeated letters. For an n-gram. The trigram language models over both
example, MZXCX is pattern-equivalent with there wordsandcharactersarederivedbyconvertingeach
and bases. but not with otter. For each word uni- word in the training corpus into its alphagram. On
gram,bigram,andtrigramintheciphertext,alistof abenchmarksetof50ciphersoflength256,theav-
the most frequent pattern equivalent n-grams from erageerrorrateofthemodifiedsolveris2.6%,with
the training corpus is compiled. The solver repeat- onlyasmallincreaseintimeandspaceusage.
edly attempts to improve the current key through
4.2 AnagramDecoder
a series of transpositions, so that a given cipher n-
gram maps to a pattern-equivalent n-gram from the Theoutputofthescriptdeciphermentstepisgener-
provided language sample. The number of substi- ally unreadable (see Figure 3d). The words might
tutions for a given n-gram is limited to the k most be composed of the right letters but their order is
promising candidates, where k is a parameter opti- unlikely to be correct. We proceed to decode the
mizedonadevelopmentset. sequenceofanagramsbyframingitasasimplehid-
The key mutation procedure generates a tree denMarkovmodel,inwhichthehiddenstatescorre-
structure,whichissearchedforthebest-scoringde- spondtoplaintextwords,andtheobservedsequence
ciphermentusingaversionofbeamsearch. Theroot iscomposedoftheiranagrams. Withoutlossofgen-
of the tree contains the initial key, which is gener- erality, we convert anagrams into alphagrams, so
atedaccordingtosimplefrequencyanalysis(i.e.,by that the emission probabilities are always equal to
mappingthen-thmostcommonciphertextcharacter 1. Anyalphagramsthatcorrespondtounseenwords
to the n-th most common character in the corpus). arereplacedwithasingle‘unknown’type. Wethen
Newtreeleavesarespawnedbymodifyingthekeys useamodifiedViterbidecodertodeterminethemost
of current leaves, while ensuring that each node in likely word sequence according to a word trigram
the tree has a unique key. At the end of computa- language model, which is derived from the train-
tion,thekeywiththehighestscoreisreturnedasthe ing corpus, and smoothed using deleted interpola-
solution. tion(JelinekandMercer,1980).
Inouranagramadaptation,werelaxthedefinition
4.3 VowelRecovery
of pattern equivalence to include strings that have
the same decomposition pattern, as defined in Sec- Many writing systems, including Arabic and He-
tion 3.2. Under the new definition, the order of the brew,areabjadsthatdonotexplicitlyrepresentvow-
letterswithinawordhasnoeffectonpatternequiv- els. ReddyandKnight(2011)provideevidencethat
alence. Forexample,MZXCX isequivalentnotonly the VMS may encode an abjad. The removal of
with there and bases, but also with three and otter, vowels represents a substantial loss of information,
80
andappearstodramaticallyincreasethedifficultyof Step2 Step3 Both Ceiling
solvingacipher. English 99.5 98.2 97.7 98.7
In order to apply our system to abjads, we re- Bulgarian 97.0 94.7 91.9 95.3
move all vowels in the corpora prior to deriving the German 97.3 90.6 88.7 91.8
language models used by the script decipherment Greek 95.7 96.6 92.7 97.2
step. Weassumetheabilitytopartitiontheplaintext Spanish 99.1 98.0 97.1 99.0
symbolsintodisjointsetsofvowelsandconsonants Average 97.7 95.7 93.8 96.5
for each candidate language. The anagram decoder
Table2: Wordaccuracyontheanagramdecryptiontask.
is trained to recover complete in-vocabulary words
fromsequencesofanagramscontainingonlyconso-
nants. At test time, we remove the vowels from the
TheresultsinTable2showthatoursystemisable
input to the decipherment step of the pipeline. In
to effectively break the anagrammed ciphers in all
contrast with Knight et al. (2006), our approach is
fivelanguages. ForStep2(scriptdecipherment),we
able not only to attack abjad ciphers, but also to re-
countascorrectallwordtokensthatcontaintheright
storethevowels,producingfullyreadabletext.
characters, disregarding their order. Step 3 (ana-
4.4 Evaluation gram decoding) is evaluated under the assumption
thatithasreceivedaperfectdeciphermentfromStep
In order to test our anagram decryption pipeline on
2. On average, the accuracy of each individual step
out-of-domain ciphertexts, the corpora for deriving
exceeds 95%. The values in the column denoted as
language models need to be much larger than the
Botharetheactualresultsofthepipelinecomposed
UDHR samples used in the previous section. We
of Steps 2 and 3. Our system correctly recovers
selected five diverse European languages from Eu-
93.8% of word tokens, which corresponds to over
roparl (Koehn, 2005): English, Bulgarian, German,
97%ofthein-vocabularywordswithinthetestfiles,
Greek, and Spanish. The corresponding corpora
The percentage of the in-vocabulary words, which
containabout50millionwordseach,withtheexcep-
are shown in the Ceiling column, constitute the ef-
tion of Bulgarian which has only 9 million words.
fectiveaccuracylimitsforeachlanguage.
Weremovepunctuationandnumbers,andlowercase
Theerrorsfallintothreecategories, asillustrated
alltext.
inFigure3e. Step2introducesdeciphermenterrors
WetestontextsextractedfromWikipediaarticles
(e.g., deciphering ‘s’ as ‘k’ instead of ‘z’ in “orga-
onart,Earth,Europe,film,history,language,music,
nized”),whichtypicallyprecludethewordfrombe-
science, technology, and Wikipedia. The texts are
ing recovered in the next step. A decoding error in
firstencipheredusingasubstitutioncipher,andthen
Step 3 may occur when an alphagram corresponds
anagrammed (Figure 3a-c). Each of the five lan-
to multiple words (e.g. “greens” instead of “gen-
guages is represented by 10 ciphertexts, which are
res”), although most such ambiguities are resolved
decrypted independently. In order to keep the run-
correctly. However,themajorityoferrorsarecaused
ningtimereasonable,thelengthoftheciphertextsis
by out-of-vocabulary (OOV) words in the plaintext
setto500characters.
(e.g.,“improvisational”). Sincethedecodercanonly
The first step is language identification. Our de-
producewordsfoundinthetrainingcorpus,anOOV
composition pattern method, which is resistant to
wordalmostalwaysresultsinanerror. TheGerman
both anagramming and substitution, correctly iden-
ciphersstandoutashavingthelargestpercentageof
tifies the source language of 49 out of 50 cipher-
OOVwords(8.2%),whichmaybeattributedtofre-
texts. The lone exception is the German article on
quentcompounding.
technology, for which German is the second ranked
languageafterGreek. Thiserrorcouldbeeasilyde- Table3showstheresultsoftheanalogousexper-
tectedbynoticingthatmostoftheGreekwords“de- iments on abjads (Section 4.3). Surprisingly, the
ciphered” by the subsequent steps are out of vocab- removal of vowels from the plaintext actually im-
ulary. We proceed to evaluate the following steps proves the average decipherment step accuracy to
assumingthatthesourcelanguageisknown. 99%. Thisisduenotonlytothereducednumberof
81
Step2 Step3 Both Ceiling Language Text Words Characters
English 99.9 84.6 84.5 98.7 English Bible 804,875 4,097,508
Bulgarian 99.1 71.1 70.4 95.0 Italian Bible 758,854 4,246,663
German 98.5 73.7 72.9 91.8 Latin Bible 650,232 4,150,533
Greek 97.7 65.4 63.6 97.0 Hebrew Tanach 309,934 1,562,591
Spanish 99.8 73.7 73.3 99.0 Arabic Quran 78,245 411,082
Average 99.0 73.8 73.1 96.4
Table4: Languagecorpora.
Table3: Wordaccuracyontheabjadanagramdecryption
task.
5.2 SourceLanguage
In this section, we present the results of our cipher-
distinctsymbols,butalsotothefewerpossibleana-
textlanguageidentificationmethodsfromSection3
grammingpermutationsintheshortenedwords. On
ontheVMStext.
theotherhand,thelossofvowelinformationmakes
The closest language according to the letter fre-
the anagram decoding step much harder. However,
quency method is Mazatec, a native American lan-
morethanthreequartersofin-vocabularytokensare
guage from southern Mexico. Since the VMS was
stillcorrectlyrecovered,includingtheoriginalvow-
els.3 Ingeneral,thisissufficientforahumanreader created before the voyage of Columbus, a New
World language is an unlikely candidate. The top
tounderstandthemeaningofthedocument,andde-
tenlanguagesalsoincludeMozarabic(3),Italian(8),
ducetheremainingwords.
andLadino(10), allofwhichareplausibleguesses.
5 VoynichExperiments However, the experiments in Section 3.4 demon-
strate that the frequency analysis is much less reli-
In this section, we present the results of our experi-
ablethantheothertwomethods.
mentsontheVMS.Weattempttoidentifythesource
The top-ranking languages according to the de-
language with the methods described in Section 3;
composition pattern method are Hebrew, Malay (in
we quantify the similarity of the Voynich words to
Arabic script), Standard Arabic, and Amharic, in
alphagrams; and we apply our anagram decryption
this order. We note that three of these belong to
algorithmfromSection4tothetext.
theSemiticfamily. Thesimilarityofdecomposition
5.1 Data patterns between Hebrew and the VMS is striking.
The Bhattacharyya distance between the respective
Unless otherwise noted, the VMS text used in
distributions is 0.020, compared to 0.048 for the
our experiments corresponds to 43 pages of the
second-ranking Malay. The histogram in Figure 4
manuscript in the “type B” handwriting (VMS-B),
showsHebrewasasingleoutlierintheleftmostbin.
investigated by Reddy and Knight (2011), which
In fact, Hebrew is closer to a sample of the VMS
we obtained directly from the authors. It con-
ofasimilarlengththantoanyoftheremaining379
tains 17,597 words and 95,465 characters, tran-
UDHRsamples.
scribed into 35 characters of the Currier alphabet
The ranking produced by the the trial decipher-
(d’Imperio,1978).
mentmethodissensitivetoparameterchanges;how-
For the comparison experiments, we selected
ever,thetwolanguagesthatconsistentlyappearnear
five languages shown in Table 4, which have
the top of the list are Hebrew and Esperanto. The
been suggested in the past as the language of the
highrankofHebrewcorroboratestheoutcomeofthe
VMS (Kennedy and Churchill, 2006). Consider-
decomposition pattern method. Being a relatively
ing the age of the manuscript, we attempt to use
recent creation, Esperanto itself can be excluded as
corpora that correspond to older versions of the
theciphertextlanguage,butitshighscoreisremark-
languages, including King James Bible, Bibbia di
ableinviewofthewell-knowntheorythattheVMS
Gerusalemme,andVulgate.
textrepresentsaconstructedlanguage.4 Wehypoth-
3The differences in the Ceiling numbers between Tables 2
and3areduetowordsthatarecomposedentirelyofvowels. 4The theory was first presented in the form of an ana-
82
In order to quantify how strongly the words in
a language resemble alphagrams, we first need to
identify the order of the alphabet that minimizes
the total alphagram distance of a representative text
sample. ThedecisionversionofthisproblemisNP-
complete, which can be demonstrated by a reduc-
tion from the path variant of the traveling salesman
problem. Instead, we find an approximate solution
Figure4: HistogramofdistancesbetweentheVMSand
withthefollowinggreedysearchalgorithm. Starting
samplesof380otherlanguages,asdeterminedbythede-
from an initial order in which the letters first occur
compositionpatternmethod.Thesingleoutlierontheleft
in the text, we repeatedly consider all possible new
isHebrew.
positions for a letter within the current order, and
choosetheonethatyieldsthelowesttotalalphagram
esize that the extreme morphological regularity of
distanceofthetext. Thisprocessisrepeateduntilno
Esperanto (e.g., all plural nouns contain the bigram
betterorderisfoundfor10iterations,with100ran-
‘oj’) yields an unusual bigram character language
domrestarts.
model which fits the repetitive nature of the VMS
Whenappliedtoarandomsampleof10,000word
words.
tokens from the VMS, our algorithm yields the or-
In summary, while there is no complete agree-
der 4BZOVPEFSXQYWC28ARUTIJ3*GHK69MDLN5,
ment between the three methods about the most
which corresponds to the average alphagram dis-
likely underlying source language, there appears to
tanceof0.996(i.e.,slightlylessthanonepairoflet-
be a strong statistical support for Hebrew from the
tersperword). ThecorrespondingresultonEnglish
two most accurate methods, one of which is robust
is jzbqwxcpathofvurimslkengdy, with an
against anagramming. In addition, the language is
average alphagram distance of 2.454. Note that the
a plausible candidate on historical grounds, being
lettersatthebeginningofthesequencetendtohave
widely-usedforwritingintheMiddleAges. Infact,
low frequency, while the ones at the end occur in
a number of cipher techniques, including anagram-
popular morphological suffixes, such as ed and
ming, can be traced to the Jewish Cabala (Kennedy −
ly. For example, the beginning of the first arti-
andChurchill,2006). −
cle of the UDHR with the letters transposed to fol-
lowthisorderbecomes: “Allahumnbisengareborn
5.3 Alphagrams
freeandqauleintiingdyandthrisg.”
In this section, we quantify the peculiarity of the
To estimate how close the solution produced by
VMSlexiconbymodelingthewordsasalphagrams.
our greedy algorithm is to the actual optimal solu-
We introduce the notion of the alphagram distance,
tion, we also calculate a lower bound for the total
andcomputeitfortheVMSandfornaturallanguage
alphagram distance with any character order. The
samples.
lowerboundis min(b ,b ),whereb isthe
x,y xy yx xy
We define a word’s alphagram distance with re- numberoftimescharacterxoccursbeforecharacter
specttoanorderingofthealphabetasthenumberof y withinwordsPinthetext.
letterpairsthatareinthewrongorder. Forexample,
Figure 5 shows the average alphagram distances
with respect to the QWERTY keyboard order, the
for the VMS and five comparison languages, each
word rye has an alphagram distance of 2 because it
representedbyarandomsampleof10,000wordto-
containstwoletterpairsthatviolatetheorder: (r,e)
kens which exclude single-letter words. The Ex-
and(y,e). Awordisanalphagramifandonlyifits
pected values correspond to a completely random
alphagram distance is zero. The maximum alpha-
intra-word letter order. The Lexicographic values
gram distance for a word of length n is equal to the
correspond to the standard alphabetic order in each
numberofitsdistinctletterpairs.
language. The actual minimum alphagram distance
isbetweentheLowerBoundandtheComputedMin-
gram(FriedmanandFriedman,1959). Seealsoamorerecent
proposalbyBalandinandAveryanov(2014). imumobtainedbyourgreedyalgorithm.
83
tically correct or semantically consistent. This is
expected because our system is designed for pure
monoalphabetic substitution ciphers. If the VMS
indeed represents one of the five languages, the
amountofnoiseinherentintheorthographyandthe
transcription would prevent the system from pro-
ducing a correct decipherment. For example, in a
hypothetical non-standard orthography of Hebrew,
some prepositions or determiners could be written
as separate one-letter words, or a single phoneme
could have two different representations. In addi-
tion, because of the age of the manuscript and the
variety of its hand-writing styles, any transcription
requires a great deal of guesswork regarding the
Figure5: Averagewordalphagramdistances.
separationofindividualwordsintodistinctsymbols
(Figure 1). Finally, the decipherments necessarily
The results in Figure 5 show that while the ex- reflectthecorporathatunderliethelanguagemodel,
pectedalphagramdistancefortheVMSfallswithin whichmaycorrespondtoadifferentdomainandhis-
the range exhibited by natural languages, its mini- toricalperiod.
mum alphagram distance is exceptionally low. In Nevertheless,itisinterestingtotakeacloserlook
absolute terms, the VMS minimum is less than half at specific examples of the system output. The first
the corresponding number for Hebrew. In relative line of the VMS (VAS92 9FAE AR APAM ZOE
terms, the ratio of the expected distance to the min- ZOR9 QOR92 9 FOR ZOE89)isdecipheredinto
imum distance is below 2 for any of the five lan- Hebrewasוישנא ילע ו וחיבל וילא שיא Nהכה הל השעו
guages,butabove4fortheVMS.Theseresultssug- תוצמה.5 According to a native speaker of the lan-
gest that, if the VMS encodes a natural language guage, this is not quite a coherent sentence. How-
text, the letters within the words may have been re- ever, after making a couple of spelling corrections,
orderedduringtheencryptionprocess. Google Translate is able to convert it into passable
English: “Shemaderecommendationstothepriest,
5.4 DeciphermentExperiments manofthehouseandmeandpeople.” 6
Inthissection,wediscusstheresultsofapplyingour Even though the input ciphertext is certainly too
anagramdecryptionsystemdescribedinSection4to noisy to result in a fluent output, the system might
theVMStext. still manage to correctly decrypt individual words
We decipher each of the first 10 pages of the in a longer passage. In order to limit the influence
VMS-Busingthefivelanguagemodelsderivedfrom ofcontextinthedecipherment,werestricttheword
thecorporadescribedinSection5.1. Thepagescon- language model to unigrams, and apply our sys-
tainbetween292and556words,3726intotal. Fig- tem to the first 72 words (241 characters)7 from the
ure6showstheaveragepercentageofin-vocabulary “Herbal”sectionoftheVMS,whichcontainsdraw-
words in the 10 decipherments. The percentage is ings of plants. An inspection of the output reveals
significantly higher for Hebrew than for the other severalwordsthatwouldnotbeoutofplaceiname-
languages, which suggests a better match with the dieval herbal, such as רצה ‘narrow’, רכיא ‘farmer’,
VMS. Although the abjad versions of English, Ital- רוא‘light’,ריוא‘air’,שׁא‘fire’.
ian, and Latin yield similar levels of in-vocabulary The results presented in this section could be in-
words,theirdistancestotheVMSlanguageaccord- terpreted either as tantalizing clues for Hebrew as
ing to the decomposition pattern method are 0.159,
5Hebrewiswrittenfromrighttoleft.
0.176, and 0.245 respectively, well above Hebrew’s
6https://translate.google.com/(accessedNov.20,2015).
0.020. 7Thelengthofthepassagewaschosentomatchthenumber
None of the decipherments appear to be syntac- ofsymbolsinthePhaistosDiscinscription.
84
Description:method achieves 97% accuracy on 380 lan- guages. cryption word accuracy of 93% on a set of 50 ciphertexts in 5 such as Linear A, the Indus script, and the Phaistos .. ning time reasonable, the length of the ciphertexts is.