Downloaded on: 25 October 2010
Access details: Free Access
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Cryptologia
Publication details, including instructions for authors and subscription information: http://www.informaworld.com/smpp/title~content=t725304178

The Voynich Manuscript: Evidence of the Hoax Hypothesis
Andreas Schinner

To cite this Article: Schinner, Andreas (2007) 'The Voynich Manuscript: Evidence of the Hoax Hypothesis', Cryptologia, 31: 2, 95-107
To link to this Article: DOI: 10.1080/01611190601133539
URL: http://dx.doi.org/10.1080/01611190601133539

Full terms and conditions of use: http://www.informaworld.com/terms-and-conditions-of-access.pdf

This article may be used for research, teaching and private study purposes. Any substantial or systematic reproduction, re-distribution, re-selling, loan or sub-licensing, systematic supply or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.



Cryptologia, 31:95-107, 2007. Copyright Taylor & Francis Group, LLC. ISSN: 0161-1194 print. DOI: 10.1080/01611190601133539

The Voynich Manuscript: Evidence of the Hoax Hypothesis

ANDREAS SCHINNER

Abstract. In this article, I analyze the Voynich manuscript, using random walk mapping and token/syllable repetition statistics. The results significantly tighten the boundaries for possible interpretations; they suggest that the text has been generated by a stochastic process rather than by encoding or encryption of language. In particular, the so-called Chinese theory now appears less convincing.

Keywords: hoax hypothesis, statistical analysis, stochastic process, Voynich manuscript

Downloaded At: 14:22 25 October 2010

Address correspondence to Dr. Andreas Schinner, Institut für Experimentalphysik, Abteilung für Atom- und Oberflächenphysik, Johannes Kepler Universität, Altenberger Straße 69, 4040 Linz, Austria. E-mail: [email protected]

Introduction

The Voynich manuscript (the VMS) is a handwritten codex of about 250 pages, ink on vellum, appearing on stylistic grounds to date from around 1500. It contains illustrations of mostly unidentifiable plants, astronomical or astrological diagrams, and naked nymphs, bathing in strange arrangements of pools or tubs connected by complex systems of pipes. The most striking feature, however, is the text, written in an elegant unique script that has so far defied any commonly accepted translation. Information about the VMS, its possible history, as well as attempts at explanation can be found in various places [4, 7, 15, 14]. Only a brief summary will be given here.

Interpretations of the VMS can roughly be divided into three classes:

• Cipher text hypothesis. The VMS contains natural language text (given the origin of the manuscript, most probably Latin or German) that has been encrypted.
• Plain text hypothesis. The VMS text is plain text in a natural, not yet identified language that either did not possess an alphabet of its own in the early 16th century or whose writing system appeared too complex to a medieval scholar. The word length statistics make East Asian languages, in particular Chinese, the most promising candidates for this (Chinese theory). Alternatively, the script could also have been invented together with an artificial language.
• Hoax hypothesis. The VMS contains no meaningful text at all. In this context, the word hoax should be associated with a broad spectrum of possibilities, ranging from intentional forgery for monetary gain to the work of an idiot savant, interpreted by medieval scholars as revelation of arcane lore.

These three classes are not completely distinct. For example, the VMS could contain a message hidden steganographically in a set of otherwise meaningless strings. This theory is especially difficult to prove or disprove; the best argument against it known so far is a psychological one: the basic principle of steganography is to hide the mere existence of a message, and the worst place to hide a genuine secret is an apparently mysterious book.

It is one of the most striking features of the VMS that even modern computer aided analysis has so far been unable to rule out any one of these interpretations definitively. Instead, arguments for and against all three viewpoints can be given: since statistical properties characteristic of natural languages are also present in the VMS text, the encryption method used (if any) should not be too complex; additionally, around 1500, cryptology was still in its early beginnings. Despite these facts, all attempts at decipherment by modern cryptanalysts have failed. On the other hand, the text shows several exotic linguistic features like the frequent word repetitions, or the preferred positions for certain letters within a line; this appears to be incompatible with the plain text hypothesis, even in the artificial language version. Consequently, there are attractions in the hoax hypothesis. However, the VMS text obviously is not composed of simple random strings, and it shows rich linguistic-like structure. It seemed unlikely that a medieval hoaxer (or even an early 20th century forger) could create such a convincing facsimile language within reasonable time. The work by Gordon Rugg [11] has shown that this need not necessarily be true: an algorithm feasible even with medieval technology (the table-and-grille method) makes it possible for a single person to generate a text as long and complex as the VMS within approximately three months. This, however, is just a possibility and far from a proof of the hoax hypothesis. Furthermore, the table-and-grille method as investigated so far does not explain all of the statistical text properties of the VMS.

The three competing explanation classes are thus still of roughly equal relevance. In this article, statistical investigations of the VMS are presented that provide additional restrictions on possible solutions. Mapping the text to a random walk uncovers characteristic long-range correlations not present in normal human writings; they fit a stochastic process with memory effects better than a sequence of tokens chosen according to linguistic rules. Furthermore, the distribution of gaps between two similar or selected tokens, respectively, also differs qualitatively from normal texts; its mathematical properties indicate the presence of very unusual random effects. Possible implications of these results for the interpretation of the VMS are discussed in the conclusions section.

Throughout this article, the following usual conventions are used: the term token denotes any string of characters separated by spaces or line start or end; a word is a type of token regardless of its frequency in the text. For characters or tokens from the VMS script the European Voynich Alphabet (EVA) is used [15]; the letters (or sequences of letters) are written in italics and put in angle brackets (for example, the notorious most frequent VMS token will be transcribed as <daiin>). Finally, the analysis presented in this article is based on the various text samples listed in Table 1.


Random Walk Model

According to Kokol, Podgorelec, Zorman, Kokol, and Njivar [8], long-range power law correlations are present in a wide variety of information encoding systems, ranging from human writings (natural languages and computer programs) to DNA sequences. To some extent, they characterize the information content and complexity of communication. A useful method to study correlations in character strings is based on mapping the symbol sequence to a stochastic process that, especially in linguistic literature, frequently is called Brownian walk. This terminology is somewhat misleading, since Brownian motion can be described as the scaling limit of a so-called random walk: in the theory of stochastic processes [2] it is characterized by independent steps that all have the same probability distribution, i.e., are uncorrelated. On the other hand, in statistical physics, for example, the expression random walk with memory is sometimes used to describe a situation in which the stochastic process generating the steps is of Markovian or even non-Markovian type. In the following, random walk should be understood with respect to this generalized meaning.

Table 1. Text sources used in this article

Text                  Text part   Language    Number of tokens   Number of words
Voynich manuscript¹   All²        Unknown     36,000             7000
Vulgate Bible         5%³         Latin       25,000             6000
Luther Bible          5%³         German      35,000             4000
Alice in Wonderland   All         English     26,000             3000
Chinese Bible         Genesis     Mandarin⁴   34,000             2000

¹ Majority vote version of interlinear EVA transcription 1.6e6 [15]; ² or particular sections of it, see Table 2; ³ percentages are counted from top of document; ⁴ in pin-yin romanization with all tones removed.

As a first step it is necessary to encode the characters of the texts under investigation as bit sequences. It has been shown that the actual definition of this code table has negligible influence on the quantities of interest, as long as all (or at least almost all) possible bit patterns are used [12]. Since the VMS contains no punctuation marks, they are removed from the other texts too; upper case characters are converted to lower case. Thus the remaining character set consists of the letters a-z, the German umlauts ä, ö, ü, and the German sz ligature ß; empty spaces are ignored. These 30 characters can be represented by a 5-bit code. The bits of the resulting binary string then define the steps $\pm 1$ of a random walk. Let

\[ \Delta y(l, l_0) = y(l + l_0) - y(l_0) \qquad (1) \]

be the walk displacement between step numbers $l_0$ and $l + l_0$. Then

\[ F^2(l) = \langle \Delta y^2 \rangle - \langle \Delta y \rangle^2 \qquad (2) \]

describes the variance of the mean displacement. The angle brackets denote averaging over all $l_0$. For pure (uncorrelated) random walks of infinite length, where the steps are Bernoulli trials with probability $p$, one easily obtains:

\[ F^2(l) = 4p(1 - p)\, l \qquad (3) \]

In general, $F(l)$ will behave asymptotically as $F(l) \propto l^{\alpha}$, where an exponent $\alpha \neq 0.5$ indicates the presence of long-range correlations.
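The mapping and Eq. (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the article: the code table below (alphabet index written as five bits) is an arbitrary choice, and the names `ALPHABET`, `text_to_steps` and `fluctuation` are hypothetical.

```python
import string

# Assumed code table: a-z plus the four German characters, 30 symbols,
# each mapped to the 5-bit binary form of its index (an arbitrary choice).
ALPHABET = string.ascii_lowercase + "äöüß"
CODE = {ch: format(i, "05b") for i, ch in enumerate(ALPHABET)}


def text_to_steps(text):
    """Map text to +1/-1 steps; spaces, punctuation and other symbols are dropped."""
    bits = "".join(CODE[ch] for ch in text.lower() if ch in CODE)
    return [1 if b == "1" else -1 for b in bits]


def fluctuation(steps, l):
    """F(l) of Eq. (2): standard deviation of the displacement over all l0."""
    # Prefix sums give the walk position y(k) after k steps.
    y = [0]
    for s in steps:
        y.append(y[-1] + s)
    d = [y[l0 + l] - y[l0] for l0 in range(len(steps) - l + 1)]
    mean = sum(d) / len(d)
    mean_sq = sum(v * v for v in d) / len(d)
    # Guard against tiny negative values caused by floating point rounding.
    return max(0.0, mean_sq - mean * mean) ** 0.5
```

Evaluating `fluctuation` for a range of `l` values and plotting the result on a log-log scale would give the kind of curve from which the exponent is read off; the direct loop here costs time proportional to the walk length for every `l`.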


Figure 1. Root mean square fluctuation of the random walk displacement for the VMS and normal language texts. Inset: VMS curve (full line) with low and high l asymptotic behavior, respectively (dashed lines).

Particular care has to be taken in evaluating Eq. (2) for a walk of finite length $N$ to avoid finite size effects: as $l \to N - 1$ the sample size available for calculating the averages (i.e., the number of possible $l_0$ values) tends to 1; consequently, $F(l) \to 0$. In the calculations presented here $l$ is limited to a maximal value of $N/10$. The resulting $F(l)$ on applying this method to the VMS and other texts is shown in Figure 1. Previous investigations by Kokol et al. [8] of various human writings have demonstrated that for natural language texts (almost independent of the language used) the asymptotic exponent $\alpha$ of $F(l)$ does not notably differ from 0.5, while for computer program source codes significant deviations are observed. As far as the normal language samples are concerned, the present results confirm this. Most interestingly, the VMS text shows completely different behavior: a crossover point exists where the random-process exponent $\alpha = 0.5$ turns into an asymptotic exponent $\alpha \approx 0.85$, indicating the presence of memory effects in the underlying stochastic process. The principal structure of $F(l)$ remains the same for single sections of the VMS, as presented in Table 2: the asymptotic exponents for parts of the VMS are somewhat lower (between 0.7 and 0.8) than for the whole text; the difference is mainly due to the relatively high sensitivity of $\alpha$ to a reduction of the walk length. Two facts are especially noteworthy: (i) the crossover point $l_{co} \approx 360 = 72$ characters $\times$ 5 bits of the whole text fits the average line length well; (ii) this value approximately also holds for sections that are associated with Currier's language A [3], while for sections written in language B $l_{co}$ is significantly higher (by approximately a factor of 3). It appears that in the VMS significant correlations exist between tokens spaced more than an average text line apart, while within a line the text behaves randomly (like ordinary human writings).
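As a quick consistency check, the crossover values reported in Table 2 can be recomputed from the fitted prefactor $a$ and exponent $\alpha$. The sketch assumes that the crossover is defined by equating the random-walk branch $l^{0.5}$ with the asymptotic branch $a\,l^{\alpha}$, which is how the table footnote reads; the names `TABLE_2` and `crossover` are hypothetical.

```python
# (a, alpha, reported l_co) per VMS section, copied from Table 2.
TABLE_2 = {
    "All": (0.131, 0.846, 356),
    "Herbal": (0.243, 0.768, 196),
    "Astrological": (0.396, 0.659, 339),
    "Biological": (0.161, 0.762, 1065),
    "Pharmaceutical": (0.314, 0.706, 277),
    "Recipes": (0.182, 0.738, 1285),
}


def crossover(a, alpha):
    """Solve l**0.5 == a * l**alpha for l, i.e. l = a**(-1 / (alpha - 0.5))."""
    return a ** (-1.0 / (alpha - 0.5))


for section, (a, alpha, reported) in TABLE_2.items():
    print(f"{section:15s} computed {crossover(a, alpha):7.1f}  reported {reported}")
```

Under this assumption the computed values should agree with the tabulated ones to within the rounding of $a$ and $\alpha$, which also illustrates how sensitive $l_{co}$ is to small changes in the fitted exponent.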
Table 2. Random walk asymptotic displacement variance

VMS section      Folios      Walk length   a¹      α¹      l_co²   Script³
All              1r-116v     954456        0.131   0.846   356     AB
Herbal           1r-66v      272896        0.243   0.768   196     A
Astrological     67r-73v     74721         0.396   0.659   339     ?
Biological       75r-84v     172096        0.161   0.762   1065    B
Pharmaceutical   87r-102v    99176         0.314   0.706   277     A
Recipes          103r-116v   282536        0.182   0.738   1285    B

¹ $F(l) \to a\,l^{\alpha}$ for $l \gg 1$; see Eq. (2) and text; ² crossover $l$-value: $l_{co}^{0.5} = a\,l_{co}^{\alpha}$; ³ Currier language [3] that is dominant in this section.

To inspect these correlations more closely, the step (or bit) autocorrelation function

\[ C(l) = \langle n(l + l_0)\, n(l_0) \rangle - \langle n(l_0) \rangle^2 \qquad (4) \]

and its corresponding cumulative distribution function

\[ C_c(l) = \frac{1}{l} \sum_{k=1}^{l} C(k) \qquad (5) \]

are useful quantities. $n(k)$ denotes the value (0 or 1) of the bit at position $k$ in the binary string generating the random walk. As demonstrated in Figure 2, positive correlations build up in the VMS within approximately $l < 400$ that are an order of magnitude stronger than in ordinary text. These correlations decay after some thousand steps. Such positive correlations are typical of a stochastic process in which the probability of a particular random event is increased by previous occurrences of this event.
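A direct, unoptimized way to evaluate the two quantities is sketched below. It assumes that Eq. (5) is a running average of $C(k)$ over $k = 1, \dots, l$ and that the averages in Eq. (4) run over all start positions for which both bits exist; the function names are hypothetical.

```python
def autocorrelation(bits, l):
    """C(l) = <n(l0 + l) n(l0)> - <n(l0)>^2, averaged over all valid l0."""
    m = len(bits) - l  # number of start positions l0 with l0 + l in range
    mean = sum(bits[:m]) / m
    pair = sum(bits[i] * bits[i + l] for i in range(m)) / m
    return pair - mean * mean


def cumulative_autocorrelation(bits, l_max):
    """C_c(l) = (1 / l) * sum of C(k) for k = 1..l, returned for l = 1..l_max."""
    out, running = [], 0.0
    for l in range(1, l_max + 1):
        running += autocorrelation(bits, l)
        out.append(running / l)
    return out
```

Each call to `autocorrelation` scans the whole bit string, so for walks of several hundred thousand steps an FFT-based estimate would be a more practical choice; the loop form is kept here because it mirrors the definitions term by term.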

Figure 2. Cumulative step autocorrelation function Cc(l), cf., Eq. (5), (smoothed by 100 points adjacent averaging); full line: VMS, dashed line: Vulgate Bible. Inset: autocorrelation function C(l), cf., Eq. (4), for l between 1000 and 1030; full line: VMS, gray shaded area: Vulgate Bible.


A classical model for such a system, often applied to cascade processes like particle induced electron emission [1], is the so-called Pólya process. It is based on the Pólya urn scheme, where on drawing a ball of a particular color from an urn a specific number of balls of the same color are put into the urn, increasing the probability of drawing this color again [5] (spurious contagion). In the scaling limit of large step numbers $l$ the resulting distribution is the so-called Pólya distribution, also known as negative binomial distribution

\[ P_n = \binom{-1/b}{n} (-b\mu)^n (1 + b\mu)^{-1/b - n} \qquad (6) \]

In the present context $P_n$ is the probability that in a walk of length $l \to \infty$ the number of up-steps is equal to $n$. Mean and variance of $P_n$ are given by

\[ \langle n \rangle = \mu \qquad (7) \]

\[ \sigma^2 = \mu (1 + b\mu) \qquad (8) \]

The parameter $b$ describes the cascading strength of the process: for $b = 0$ the random steps are uncorrelated and Eq. (6) turns into a Poisson distribution, while for $b = 1$ the so-called Yule-Furry process (also known as simple birth process) is recovered [2]. Since $\mu \propto l$, it follows from Eq. (8) that an underlying Pólya process results in the asymptotic behavior $F(l) \propto \sqrt{b}\, l^{1}$ of the random walk model. In order to reproduce the observed $\alpha \approx 0.85$ from Figure 1, an $l$-dependence of $b$ is necessary. Strictly speaking, the underlying process then is no longer a pure Pólya process, since with non-constant $b$ Eq. (6) no longer satisfies the Kolmogorov equations exactly. Due to the rather weak variation of $b \propto l^{-0.3}$, however, it still remains a useful approximation.

The actual representation of the random walk in the form of the VMS text can be used to estimate the true distribution $P_n(l)$. Unfortunately, in particular for large $l$ (which represents the interesting case) the sample size is too small to identify the distribution with compelling evidence (mainly because $b$ is small). The data, however, do not contradict the hypothesis Eq. (6). The unusual shape of $F(l)$ for the VMS has major impact on possible interpretations. In particular, the Chinese hypothesis appears not to be compatible with it. The impression that a non-Markovian stochastic process, where the step probability depends on the long-term history, may play a key role in the interpretation of the VMS will be reinforced in the following sections.
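The mean and variance quoted for the Pólya distribution can be checked numerically. The sketch below assumes the standard negative binomial parametrization with mean $\mu$ and variance $\mu(1 + b\mu)$, using the identity $\binom{-r}{n}(-1)^n = \Gamma(n + r) / (\Gamma(r)\, n!)$ with $r = 1/b$ to avoid negative binomial coefficients; `polya_pmf` and `moments` are hypothetical names.

```python
from math import exp, lgamma, log


def polya_pmf(n, mu, b):
    """Negative binomial probability P_n with mean mu; needs mu > 0 and b > 0."""
    r = 1.0 / b
    log_p = (lgamma(n + r) - lgamma(r) - lgamma(n + 1)
             + n * log(b * mu) - (r + n) * log(1.0 + b * mu))
    return exp(log_p)


def moments(mu, b, n_max=5000):
    """Total probability, mean and variance of P_n, truncating the sum at n_max."""
    p = [polya_pmf(n, mu, b) for n in range(n_max)]
    mean = sum(n * q for n, q in enumerate(p))
    var = sum(n * n * q for n, q in enumerate(p)) - mean * mean
    return sum(p), mean, var


total, mean, var = moments(mu=10.0, b=0.5)
print(total, mean, var)  # compare with 1, mu and mu * (1 + b * mu)
```

For large $\mu$ the variance is dominated by the $b\mu^2$ term, which is where the linear growth of the fluctuation $F(l)$ with $l$ comes from when $\mu \propto l$ and $b$ is constant.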


Similar Tokens Repetition Distance Distribution

In a previous work by G. Landini [9] the repetition distance distributions of the most frequent tokens in the VMS (<daiin>), Alice in Wonderland (the), and the Vulgate Bible (et), respectively, were investigated, i.e., the probability distribution of the number of other tokens between two occurrences of the particular one (iso-word gap). The result did not show a characteristic difference between the VMS and the normal texts, apart from the well-known enigmatic VMS feature that common words, in particular <daiin>, quite frequently appear in sequences and consequently have non-vanishing probability for zero repetition distance.

As will be demonstrated in this section, it is more instructive to investigate the repetition distance of two similar rather than exactly matching tokens. Of the many well-known string distance metrics, the rather straightforward Levenshtein distance [6] will be used here. More sophisticated methods of calculating string distances tend to be optimized for human writings, which appears problematic in the VMS context of unknown language and meaning (if any). The Levenshtein distance of two character strings is an integer ranging from 0 (exact match) to the maximum of the two string lengths (no similarity), denoting the number of elementary edit operations necessary to make both strings equal. Mapping this number to the interval [0,100] yields a percentage of dissimilarity for two tokens.

In Figure 3, the similar token repetition distance distribution $P_n$ for the VMS compared with normal texts is presented. Here $n$ denotes the number of other tokens between two similar ones, i.e., $n = 0$ corresponds to the situation of two alike tokens in immediate vicinity. Two words are considered similar if their dissimilarity as defined above is less than or equal to 30%; it turns out that the precise value ($\pm 10\%$) of this threshold changes $P_n$ only quantitatively, not qualitatively. The most striking feature is the almost mathematically perfect smooth shape of the VMS curve for $n \to 0$, while the other text sample data display the expected irregular behavior and tend to zero (or at least small values). As noted previously, this simply expresses the effect that writers normally try to avoid word repetitions. It is especially noteworthy that even the Chinese text lies closer to the European languages than to the VMS, although the higher tendency of common-word repetition sequences in Asian languages is a frequent argument in favor of the Chinese theory. The remaining text samples listed in Table 1 have been omitted from Figure 3 just to avoid confusion by too many markers; their behavior is comparable to that of the Vulgate Bible.

Let us consider an infinite random text consisting of $N$ words occurring with probabilities $\lambda_k$, $k = 1, \dots, N$.
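The similarity criterion and the gap statistic behind Figure 3 can be sketched as follows. Two details are assumptions, since the text does not spell them out: the Levenshtein distance is mapped to a percentage by dividing by the longer string length, and the gap is counted from each token to the next similar token that follows it. All function names are hypothetical.

```python
from collections import Counter


def levenshtein(s, t):
    """Number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


def dissimilarity(s, t):
    """Levenshtein distance scaled to 0..100 by the longer string length."""
    longest = max(len(s), len(t))
    return 100.0 * levenshtein(s, t) / longest if longest else 0.0


def repetition_distances(tokens, threshold=30.0):
    """Histogram of the gap n from each token to the next similar token."""
    gaps = Counter()
    for i, tok in enumerate(tokens):
        for j in range(i + 1, len(tokens)):
            if dissimilarity(tok, tokens[j]) <= threshold:
                gaps[j - i - 1] += 1  # n = number of tokens in between
                break
    return gaps
```

For example, `repetition_distances("daiin daiin chol dain".split())` counts one gap of 0 (the repeated token) and one gap of 1 (the second `daiin` followed, one token later, by the similar `dain`). Dividing the counts by their total turns the histogram into an estimate of $P_n$; the inner loop can be capped at a maximum gap to keep the run time manageable on long texts.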
The chance for a particular word $k$ to reappear for the next time exactly after $n$ other tokens follows a geometric distribution $\lambda_k (1 - \lambda_k)^n$. The total token repetition distance distribution is then given by

\[ P_n = \sum_{k=1}^{N} \lambda_k^2 (1 - \lambda_k)^n \qquad (9) \]

Figure 3. Similar tokens repetition distance distribution (maximal dissimilarity 30%) of the VMS, compared with Vulgate Bible and the pin-yin text. Inset: VMS result and fit using Eq. (12) (a = 3.5618, b = 0.1534, q = 0.9885).

The geometric distribution has its maximum at $n = 0$ and decreases monotonically; this behavior also holds for the VMS data in Figure 3. The fact that normal texts as well as the VMS obey Zipf's first law [10] suggests the approximation $\lambda_k \propto 1/k$. As a rough estimate for small $n$ the discrete index $k$ may be replaced by a continuous variable $\kappa$, turning the sum in Eq. (9) into an integral. Setting $\lambda(\kappa) \approx c/\kappa$ with an upper cutoff $\kappa_m$ to ensure convergence of the $\lambda(\kappa)$-norm, and under the reasonable assumption c