Some evidence concerning the genesis of Shannon’s information theory

Samuel W. Thomsen
Department of History and Philosophy of Science, University of Pittsburgh, 1017 Cathedral of Learning, Pittsburgh, PA 15260, USA

Article info

Article history: Received 29 January 2007; received in revised form 17 February 2008

Keywords: History of science; History of technology; Claude Shannon; Information theory; Ergodic theory; Electrical engineering

doi:10.1016/j.shpsa.2008.12.011
E-mail address: [email protected]

1 The only mention I have found of this typescript in the literature is in the bibliography of Shannon's Collected papers (Shannon, 1993d, p. xxxvii).
2 As far as I can tell, there are no published full length biographies of Claude Shannon, much less any books to make use of the Shannon archives at the Library of Congress. Furthermore, I have found no articles making use of this resource in any of the following journals: IEEE Annals of the History of Computing, Isis, Osiris, Studies in History and Philosophy of Science, Archive for History of the Exact Sciences, British Journal for the History of Science, History and Computing, History and Technology, or History of Science.
3 In information theory logs are usually taken base-2.

Abstract

A typescript by Claude Shannon, 'Theorems on statistical sequences' (donated to the Library of Congress in 2001 and apparently unscrutinized by historians to date), is examined to shed light on the development of information theory. In particular, it appears that Shannon was still working out the mathematical details of his theory in the spring of 1948, just before he published 'A mathematical theory of communication'. This is contrasted with evidence from a declassified cryptography report that Shannon's theory was intuitively worked out in its essentials by the time he filed the report in 1945. Previous interviews with Shannon, and a recent interview with a colleague of his, Brockway McMillan, confirm this picture.

© 2008 Elsevier Ltd. All rights reserved.

When citing this paper, please use the full journal title Studies in History and Philosophy of Science

1. Introduction

In this paper, I examine an unpublished typescript, 'Theorems on statistical sequences',1 written by Claude Shannon in the spring of 1948 (Shannon, 1948a), just before the publication of his groundbreaking paper, 'A mathematical theory of communication' (Shannon, 1948b). Shannon has claimed (Ellersick, 1984) that his theory was developed in its essentials between 1940 and 1945. But had he worked out the mathematics in detail, and simply delayed putting it into article form—or were his results primarily intuitive? I argue that the truth is closer to the latter. The unpublished typescript I examine shows Shannon still working out the mathematical details of his theory in the spring of 1948. In particular, he was working to prove the theorems underlying his 'fundamental coding theorems', which in turn set theoretical limits on how fast information can be transmitted over a specified channel. Very little or no in-depth history has been done so far concerning the genesis of the fundamental coding theorems,2 although articles on the development of the concept of information are relatively common (for example Cherry, 1952; Aspray, 1985; Leff & Rex, 1990; Segal, 2003).

Despite all the attention it has received from historians, Shannon did not consider his information measure to be anything all that difficult to develop (see Ellersick interview, 1984, p. 123). In the simplest case, when one is sending one of n messages, each with equal probability, the Shannon information of that message is H = log n.3 One way to grasp the significance of this quantity is to imagine trying to specify one of n messages using a signal composed of 1s and 0s. If you send H characters, you can specify up to 2^H possible messages. So to send n possible messages you need to send at least H = log2 n characters. The formula for entropy is, more precisely,

H = −\sum_i p_i \log p_i

where p_i is the probability of the ith message. The real value of this, according to Shannon (1948b, p. 393), is in its application—most importantly, the fundamental coding theorems, which have inspired generations of engineers to find ever-better coding schemes for transmitting information (see for example Calhoun, 2003).
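As a quick illustration of the formula (a sketch of my own, not anything from Shannon or the typescript), the following Python snippet computes H for an arbitrary finite distribution and confirms that it reduces to log2 n in the equiprobable case:

```python
import math

def shannon_entropy(probs):
    """Entropy H = -sum p_i * log2(p_i), in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Equiprobable case: one of n = 8 messages -> H = log2(8) = 3 bits.
n = 8
print(shannon_entropy([1 / n] * n))                 # 3.0

# A skewed source carries less information per message.
print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75
```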

At the time Shannon's theory was published, the canonical engineering technique for increasing the rate of information transmission across a channel was to increase the power, thus broadening the bandwidth. However, for long distance signal transmission this was proving ineffective—because periodic amplification leads to a buildup of amplified noise. To deal with such problems, Shannon introduced what is generally viewed as an entirely novel framework. Rather than working with the transmission of continuous signals, Shannon shifts his emphasis to the transmission of discrete symbols (a view defended in detail in 'The Philosophy of PCM', which Shannon coauthored with Oliver and Pierce [Shannon et al., 1948]). This, in turn, allows him to place theoretical limits on how well one can overcome noise in a channel. In particular, Shannon is able to show that rather than increasing the power of a signal one can make use of signal redundancy to overcome noise. (For example, one might replace each letter with three copies of that letter, so that if one is lost, it can still be correctly guessed what letter was sent.) Most surprisingly, Shannon also shows that for any noisy channel there exist even more sophisticated codes (though he doesn't show how to find them) with which one can make the probability of error arbitrarily close to zero at a finite rate of transmission—called the 'channel capacity', which is only a function of the amount of noise in the channel. In short: with the right codes, more noise just means slower transmission, not higher chance of error. To show all this, Shannon's measure of information was only a first step. It appears that a much more difficult step was to make use of this measure to prove the fundamental coding theorems. In the typescript I examine Shannon demonstrates a result now known as the asymptotic equipartition property (AEP), which provides the basic link between the entropy of a message and how efficiently it can be coded. This result, which appears in the 1948 paper as 'Theorem 3', plays a key role in Shannon's proof of the coding theorems and has even had an impact on pure mathematics (McMillan, 1953; Breiman, 1957; Shannon, 1993d, p. 464; McMillan, 1997). Comparing the typescript with his 1948 paper, I claim that Shannon was still working out some of the mathematics surrounding the AEP right up to 1948.
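The letter-tripling idea mentioned parenthetically above can be sketched in a few lines (a minimal illustration of my own, assuming a memoryless channel that occasionally replaces a transmitted letter at random); it shows redundancy buying reliability at the cost of a threefold drop in rate:

```python
import random
from collections import Counter

def encode(message, copies=3):
    """Triple each symbol: 'AB' -> 'AAABBB'."""
    return ''.join(ch * copies for ch in message)

def noisy_channel(signal, flip_prob=0.1, alphabet='ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    """With probability flip_prob, replace a symbol with a random letter (possibly the same one)."""
    return ''.join(random.choice(alphabet) if random.random() < flip_prob else ch
                   for ch in signal)

def decode(received, copies=3):
    """Majority vote over each block of repeated symbols."""
    blocks = (received[i:i + copies] for i in range(0, len(received), copies))
    return ''.join(Counter(b).most_common(1)[0][0] for b in blocks)

msg = 'THISISAMESSAGE'
print(decode(noisy_channel(encode(msg))))   # usually recovers 'THISISAMESSAGE'
```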

I set the stage with a short discussion of what is already known about the genesis of Shannon's theory, and a description and summary of the typescript and its importance. Next I give an overview of the structure of his important 1948 paper with an eye to the role of the AEP. Then, after discussing the differences between the typescript and the 1948 paper, I summarize my claims about what it can tell us concerning the development of information theory.

2. The current picture of the genesis of information theory

It is often claimed that the only real precursors to Shannon's information theory were two papers published in the 1920s, Nyquist's (1924) 'Certain factors affecting telegraph speed', and Hartley's (1928) 'Transmission of information'. Nonetheless, information and communication were thriving areas of research up to the 1940s (see Cherry, 1952, pp. 645–652). Shannon spent this decade (and half of the next) working at Bell Laboratories (Brooks, 2003, 'Biographical Note') with important thinkers in the field such as J. R. Pierce and Brockway McMillan, publishing research on topics in communications engineering (such as pulse code modulation) and doing confidential military research—most notably in cryptography, which uses some of the same basic mathematical concepts as information theory. In statistics, R. A. Fisher had already introduced a logarithmic measure of how much 'information' a given random variable X provides concerning a fixed but unknown parameter θ in the distribution function for X (Fisher, 1925). But this rather special and limited notion of information was meant for application in statistics and not for the problem of communication over a channel. Later, Norbert Wiener introduced a logarithmic measure of information for continuous distribution functions (Wiener, 1948) which is equivalent to the one Shannon uses in the second half of his 1948 paper concerning continuous information transmission.4

Exactly when and how Shannon developed his theory is still somewhat uncertain.5 In a 1939 letter to Vannevar Bush,6 a mentor of his and an adviser on his dissertation committee, Shannon expresses interest in the problem of communication over a channel with noise, though he only mentions the continuous case (where the message to be transmitted is a continuous function of time). According to Shannon, he started working in earnest on his theory of communication in the year 1940 (Ellersick, 1984, p. 123). There is apparent evidence that, in 1940, he was already far along on the project—in his paper, 'Communication in the presence of noise' (Shannon, 1949), which contains a proof and brief discussion of the noiseless coding theorem and says at the bottom of the front page, 'Original manuscript received by the Institute, July 23, 1940'. However, Shannon has confirmed that the early date '1940' is a typo.7

This important point has been missed by several historians including Hodges (1983, p. 250 n. BP 8), Aspray (1985, pp. 119 n. 6, 120 timeline), and Sloane & Wyner (Shannon, 1993, p. xxxvii).

So, on his own admission, Shannon was not that far along with information theory in 1940. However, he had already taken an important step forward that year, as he states in an interview (Ellersick, 1984, p. 123): 'I would say that it was in 1940 that I first started modeling information as a stochastic process or a probabilistic process'. The earliest evidence of this is Shannon's 1943 confidential report, 'Analogue of the Vernam system for continuous time series' (Shannon, 1993b, pp. 144–147).

The type of cryptographic problem Shannon considers in his wartime work is the development of so-called 'true' secrecy systems (as opposed to 'concealment' systems, such as invisible ink or a 'privacy' system utilizing a special device to recover the message) where the idea is to distribute a 'key' or cipher among one's allies which allows for the public transmission of an encoded message which, ideally, can only be deciphered by those with a copy of the key.

In his 1943 memorandum, Shannon uses a probabilistic account of message selection to demonstrate, for the first time, that the Vernam system is a 'perfect' secrecy system. Theoretically 'perfect' secrecy systems have the disadvantage of being usable only once.

4 According to Wiener (1948), p. 76, his definition of information for the continuous case 'is not the one given by R. A. Fisher for statistical problems, although it is a statistical definition, and can be used to replace Fisher's definition in the technique of statistics'.
5 Not even Pierce or McMillan, who worked closely with Shannon at Bell Laboratories, appear to know when and how the theory was developed—see Pierce (1993) and the interview in the appendix of this paper.
6 Shannon (1993a). Reprinted in Shannon (1993d), pp. 455–456.
7 As noted in Ellersick under reference 10: 'The following information was obtained from C. E. Shannon on March 3, 1984: ". . . While results for coding statistical sources into noiseless channels using the p log(p) measure were obtained in 1940–1941 (at the Institute for Advanced Study in Princeton), first submission of ['Communication in the presence of noise'] for formal publication occurred soon after World War II"' (Ellersick, 1984, p. 126). An 'Editorial note' (1984) in the journal that published the article—Proceedings of the IEEE—expresses puzzlement about the date and cites the same passage in Ellersick.


As an example of such a system, one might represent each letter of the alphabet with the numbers 1 through 26, and have a 'key' consisting in a random series of numbers (0 through 25). To see how this works, take a message: THISISAMESSAGE; put it in number format: 20, 8, 9, 19, 9, 19, 1, 13, 5, 19, 19, 1, 7, 5; randomly generate a key, such as: 7, 5, 14, 6, 2, 14, 16, 22, 24, 10, 10, 13, 3, 15; and then encode the message by adding its first letter to the first letter of the key modulo 26 (20 + 7 = 1), the second to the second (8 + 5 = 13), and so on. The encoded message becomes: AMWYKGQICCCNJT. This particular method is known as the 'one-time pad', adapted by Joseph Mauborgne in 1917 from the Vernam system, a machine patented in 1919 that encodes binary teletype signals in essentially the same way. Ideally, such encryption systems are impossible to crack because each letter in the encoded sequence is perfectly random (that is, a given character is equally likely to be encoded as any letter). For perfect secrecy, however, once such a key is used it must be discarded. Otherwise, if the key is n letters long, the enemy could theoretically find patterns in the statistics of every nth letter of the message. Shannon is credited as being the first to prove that the Vernam system is unbreakable, which he does in the aforementioned 1943 confidential Bell Laboratories memo, 'Analogue of the Vernam system for continuous time series', using probability arguments, though no special information-theoretical vocabulary. This document provides hard evidence that Shannon was thinking about messages probabilistically by 1943.
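A short Python sketch of the scheme just described may help (my reconstruction, not Shannon's; the A = 1, ..., Z = 26 numbering follows the article, while the handling of sums that land on an exact multiple of 26 is my own assumption, since the example above never exercises that case):

```python
import random
import string

A2N = {ch: i + 1 for i, ch in enumerate(string.ascii_uppercase)}   # A=1 ... Z=26
N2A = {i + 1: ch for i, ch in enumerate(string.ascii_uppercase)}

def encrypt(message, key):
    """Add each message number to the corresponding key number, modulo 26.
    A sum that is an exact multiple of 26 is taken as 26 (i.e., Z) -- an assumed convention."""
    return ''.join(N2A[(A2N[m] + k - 1) % 26 + 1] for m, k in zip(message, key))

def decrypt(ciphertext, key):
    return ''.join(N2A[(A2N[c] - k - 1) % 26 + 1] for c, k in zip(ciphertext, key))

msg = 'THISISAMESSAGE'
key = [7, 5, 14, 6, 2, 14, 16, 22, 24, 10, 10, 13, 3, 15]   # the key used in the article's example
print(encrypt(msg, key))                 # AMWYKGQICCCNJT
print(decrypt(encrypt(msg, key), key))   # THISISAMESSAGE

# In practice a fresh random key would be generated for every message and then discarded:
fresh_key = [random.randrange(26) for _ in msg]
```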

Shannon's first employment of the AEP occurs in a 1945 confidential memo, 'A mathematical theory of cryptography', in the context of 'ideal' secrecy systems. Such systems use a different principle—the elimination of redundancy—and involve reusable keys, but are far more difficult to develop. To illustrate this, first consider the 'simple substitution' cipher, whose key is just a list of correspondences: A → Q, B → W, C → E, D → R, and so on. In this example, the word BAD would be sent as WQR. The major problem with such ciphers, however, is that for long enough messages decryption becomes relatively easy. For instance, it might be noticed after long enough that the most common letter was Q—which would be a tip off that it was standing in for a vowel such as A or E. Or, one might easily discover the code letters for T, H, and E by looking for the most common three-letter word. But theoretically, if one were able to compress all the redundancies out of written English, the resulting message, though understandable to someone knowing the compression code, would be mathematically random. One might then merely apply a simple substitution cipher, and the resulting system could never be cracked, even if the enemy knew the compression codes. In practice however, it is virtually impossible to eliminate redundancies completely. To quantify the effects of redundancy on the enemy's ability to crack such systems, Shannon employs the AEP (see Shannon, 1945, pp. 74–76).

Shannon's work in cryptography (Shannon, 1993b, 1945, 1993e) utilizes many of the same mathematical tools as information theory, including the entropy of a message, a precise idea of 'equivocation' or imperfect decoding of a message, the characterization of message generation as a Markov process, and, most notably, the asymptotic equipartition property, or AEP. Despite all these common themes, the nature and direction of the influence between cryptography and information theory has remained somewhat mysterious. When Robert Price asks him about it in an interview he says,

C. S.: . . . That cryptography report [Shannon, 1945] is a funny thing because it contains a lot of information theory that I had worked out before, during the five years between 1940 and 1945. Much of that work I did at home. . . .

R. P.: Was it an answer looking for a problem? . . .

C. S.: In part. I might say that cryptography was there and it seemed to me that this cryptography problem was very closely related to the communications problem. The other thing was that I was not yet ready to write up information theory. For cryptography you could write up anything in any shape, which I did. (Ellersick, 1984, p. 124)

A good example of something Shannon wrote up 'in any shape' appears to be his rough statement of the AEP—not proven, but given as an assumption (Shannon, 1945, p. 74, see Sect. 6 of this paper for excerpt). This is important because the presence and use of the AEP in a 1945 document by Shannon confirms his claim that the essentials of information theory were worked out pre-1945.8

8 For additional confirmation of this picture, refer to the Brockway McMillan interview in the Appendix.

Whatever the case, there is indeed a deep connection, via the AEP, between information theory and cryptography, closely related to the notion of redundancy. Shannon uses the AEP to demonstrate the theoretical limits on how advantageous redundancy can be in allowing (1) codes of sufficient length (unicity distance) to be broken (cryptography, Shannon, 1993e), (2) messages to be compressed and transmitted more efficiently (information theory—noiseless coding theorem, Shannon, 1948b), and (3) messages to withstand the effects of noise (information theory—noisy coding theorem, ibid.). The noiseless coding theorem, in particular, follows almost directly from the AEP, which indicates that Shannon had the essentials of information theory intuitively worked out prior to his 1945 cryptography report.

Nevertheless, there is evidence—in the form of an unpublished typescript from the spring of 1948—that Shannon was still working out the rigorous mathematical foundations of information theory, specifically the AEP and its subordinate theorems, all the way up to 1948.

3. The typescript, ‘Theorems on statistical sequences’

The typescript (Shannon, 1948a) is an eight-page loose-leaf carbon copy, on thin 8.5 × 11 inch paper. It can be found in Folder 1, Box 9 of the collection 'Papers of Claude Elwood Shannon' in the Manuscript Division of the Library of Congress. The typescript is classified under 'Speeches and writings: Articles and scientific papers: 1948–1955'. At the end of the paper (also in typescript) is the name 'C. E. Shannon' and the date 'April 26, 1948'. His paper, 'A mathematical theory of communication' (Shannon, 1948b), came out in two parts in the Bell System Technical Journal in July and October of 1948. It is clear from the rough state of the typescript that it is an early draft of results in Appendix III of the 1948 paper. The material in the typescript reappears in condensed and amended form, with several mathematical mistakes and typos corrected.

The typescript has no introduction or conclusion, and is not divided into sections. It proves seven different theorems, with little commentary, and begins at the very start to show the first one: 'If it is possible to go from any state with P > 0 to any other along a path of probability p > 0. . .'. The first theorem in the typescript is Theorem 3 in the 1948 paper and what is now known as the AEP for finitary ergodic sources. The other six theorems in the typescript are subalternate to this one—either restatements or consequences. The second theorem in the typescript (Shannon, 1948a, p. 3, line 4) corresponds to Theorem 5 in the 1948 paper, and the third (ibid., line 10) corresponds to Theorem 6. The fourth and sixth theorems (ibid., p. 5, line 9; p. 7, line 11) attempt to state together what Theorem 4 does in the 1948 paper. The fifth theorem is an auxiliary to the proof of the sixth, while the seventh corresponds to a theorem proved tangentially in Appendix III of the 1948 paper. Several of the proofs in the typescript are completely different from their corresponding proofs in the 1948 paper. I will discuss in more detail these theorems and the relationship between the manuscript and the paper in Section 5.

It is on the basis of errors still present in this typescript, along with the fact that the AEP, its main focus, is central to information theory, that I make the claim that Shannon's theory was not worked out in mathematical detail until just before it was published. In the following sections, I will discuss the role of the AEP in information theory (Section 4), compare the material in the typescript with the material in the 1948 paper (Section 5), and summarize the argument for my conclusions (Section 6). Excerpts from my phone interview with a colleague of Shannon's, Brockway McMillan, who provided information theory with a still more rigorous basis (McMillan, 1953), appear in the Appendix.

4. Information theory and the AEP

A major innovation in 'A mathematical theory of communication' is Shannon's extension of his entropy measure to finitary ergodic sources (that is, ergodic Markov processes, Fig. 1). This allows his theory to deal with messages where each letter may depend statistically on previous letters, such as messages in written English, where, for example, Hs are more likely after Ts than after Ds. The AEP applies to such sources and plays two roles in this regard. The first is in proving his fundamental theorems, which Shannon wants to hold good for all finitary ergodic sources. The second is by showing, via Theorems 5 and 6 in the paper, that the entropy of an ergodic source can be measured by a series of approximations, measuring the statistical influence on each letter from each previous letter back to its nth predecessor.

Fig. 1. Graphs representing three different ergodic Markov processes. (The numbers are probabilities, the points are states, and the lines are transitions outputting the given letter.) The top graph corresponds to a five-letter i.i.d. source. The middle graph corresponds to a source where each letter has a statistical influence on the next letter only. The bottom graph represents a source with sixteen possible words composed of the letters A, B, C, D, and E (see Shannon, 1948b, p. 387). S denotes a space between words. (Image taken from Shannon, 1948b, p. 390. © 1948 American Telephone and Telegraph Co.)

Shannon (1948b, p. 392) defines the entropy of an ergodic Markov process as an average over the entropy for each state. If the equilibrium probability of state i is P_i and the probability of letter j from state i is p_i(j), Shannon defines entropy as:

H = \sum_i P_i H_i = −\sum_{i,j} P_i p_i(j) \log p_i(j).

(Note that Shannon neglects to mention that the P_i's are equilibrium probabilities, but P_i is the symbol he introduces for such on p. 392.) For i.i.d. sources, which are always at equilibrium, this reduces to the usual H = −Σ p_i log p_i.
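For readers who want to see the definition in action, here is a small Python sketch of my own (not Shannon's notation): it finds the equilibrium probabilities P_i of a two-state source and averages the per-state entropies H_i.

```python
import numpy as np

def entropy_rate(transition):
    """H = sum_i P_i H_i for an ergodic Markov chain with the given row-stochastic
    transition matrix, where P is the equilibrium (stationary) distribution."""
    n = transition.shape[0]
    # Equilibrium distribution: solve P (T - I) = 0 together with sum(P) = 1.
    a = np.vstack([transition.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    P, *_ = np.linalg.lstsq(a, b, rcond=None)
    with np.errstate(divide='ignore', invalid='ignore'):
        logs = np.where(transition > 0, np.log2(transition), 0.0)
    H_i = -(transition * logs).sum(axis=1)   # entropy of each state's letter distribution
    return float(P @ H_i)

# Two states; from each state the next letter (and hence the next state) is chosen with these probabilities.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(entropy_rate(T))   # about 0.56 bits per symbol
```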

After presenting this definition of entropy for an ergodic source, Shannon presents Theorem 3, which later becomes known as the asymptotic equipartition property (AEP). This theorem is proven using the law of large numbers and states, roughly, that the probability of an ergodic sequence from a source of entropy H is almost always about 2^{−HN}, where N is the length of the sequence, and is large. This can easily be seen to be true for i.i.d. sources, since the law of large numbers guarantees that in all probability the symbol i will occur p_i N times in a sequence of length N, which means the probability of a typical sequence will be about

p = \prod_i p_i^{p_i N}.

Taking the logarithm of this expression and identifying the H term yields the desired result. Shannon presents the theorem more precisely for ergodic Markov sources in general:

Theorem 3: Given any ε > 0 and δ > 0, we can find an N_0 such that the sequences of any length N ≥ N_0 fall into two classes:

1. A set whose total probability is less than ε.
2. The remainder, all of whose members have probabilities satisfying the inequality

|(\log p^{−1})/N − H| < δ

In other words we are almost certain to have (log p^{−1})/N very close to H when N is large. (Ibid., p. 397)

The importance of this theorem can be more easily seen in its equivalent form, Theorem 4, which I discuss below. Roughly speaking, it allows one to ignore the vast majority of messages for the purposes of coding—most falling into the set whose probability vanishes.
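A simple numerical experiment (my own, restricted to the i.i.d. case for which the theorem is easiest to see) illustrates what Theorem 3 asserts: the per-symbol value (log p^{−1})/N concentrates around H as N grows.

```python
import math
import random

probs = {'A': 0.5, 'B': 0.25, 'C': 0.25}
H = -sum(p * math.log2(p) for p in probs.values())   # 1.5 bits per symbol

def sample_log_prob_rate(N):
    """Draw one length-N i.i.d. sequence and return (log2 p^-1)/N for it."""
    seq = random.choices(list(probs), weights=list(probs.values()), k=N)
    log_p = sum(math.log2(probs[s]) for s in seq)
    return -log_p / N

random.seed(0)
for N in (10, 100, 1000, 10000):
    rates = [sample_log_prob_rate(N) for _ in range(200)]
    spread = max(abs(r - H) for r in rates)
    print(f"N={N:>5}  max |(log p^-1)/N - H| over 200 samples = {spread:.3f}")
# The deviation shrinks toward 0 as N grows, while H stays fixed at 1.5.
```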

Shannon proves Theorem 3 in Appendix III, 'Theorems on ergodic sources':

If it is possible to go from any state with P > 0 to any other along a path of probability p > 0, the system is ergodic and the strong law of large numbers can be applied. Thus the number of times a given path p_{ij} in the network is traversed in a long sequence of length N is about proportional to the probability of being at i and then choosing this path, P_i p_{ij} N. If N is large enough the probability of percentage error ±δ in this is less than ε so that for all but a set of small probability the actual numbers lie within the limits

(P_i p_{ij} ± δ)N

Hence nearly all sequences have a probability p given by

p = \prod p_{ij}^{(P_i p_{ij} ± δ)N}

and (log p)/N is limited by

(\log p)/N = \sum (P_i p_{ij} ± δ) \log p_{ij}

or

|(\log p)/N − \sum P_i p_{ij} \log p_{ij}| < η.

This proves theorem 3. (Ibid., pp. 420–421)

There appear to be two minor omissions in this proof. They are also missing from the corresponding proof in the unpublished typescript which I discuss in the next section.

First, he implicitly assumes equilibrium, since he uses equilibrium P_i's. However, this is an innocent assumption, since even starting out of equilibrium the formula (P_i p_{ij} ± δ)N for the number of times the path ij is traversed will hold for large N. This omission may have been an oversight, or may have been deemed trivial.

Second, he does not take into account the fact that a given symbol may have multiple corresponding paths, which means that for some Markov processes a given sequence of symbols may also have multiple corresponding sequences of paths. However, since fixing the initial state fixes the path, and there are only a finite number of possible initial states for a Markov process, this does not make a significant difference. It would, at worst, modify the probability p by a multiplicative constant, which would only be an additive constant for log p, and therefore negligible. Nevertheless, the omission of this fact is certainly an error and not a conscious decision, for two reasons. (1) Since each p_{ij} gives the probability of a transition and not a symbol, the formula for the probability of a sequence of symbols is incorrect, even if it comes out to the same limit in the end. We know that Shannon is talking about sequences of symbols because Shannon explicitly refers to them as such in his discussion of Theorems 5 and 6. (2) Shannon makes a similar mistake in proving the theorems corresponding to Theorems 5 and 6 in the typescript.

In McMillan's (1953) more rigorous proof of the AEP, these minor problems aren't noted because the restriction to finitary Markov sources is dropped, and a more general version of the AEP is proven—good for all stationary (that is, already at equilibrium), ergodic sources.

When the AEP is actually used to prove the fundamental theorems, it is seen in a slightly different form—taken from Theorem 4, which reads:

Consider again the sequences of length N and let them be arranged in order of decreasing probability. We define n(q) to be the number we must take from this set starting with the most probable one in order to accumulate a total probability q for those taken.

Theorem 4:

\lim_{N → ∞} (\log n(q))/N = H

when q does not equal 0 or 1.

We may interpret log n(q) as the number of bits required to specify the sequence when we consider only the most probable sequences with a total probability q . . . . The rate of growth of the logarithm of the number of reasonably probable sequences is given by H, regardless of our interpretation of 'reasonably probable'. Due to these results, which are proved in Appendix III, it is possible for most purposes to treat the long sequences as though there were just 2^{HN} of them, each with probability 2^{−HN}. (Shannon, 1948b, p. 397)

Theorem 4 can be used to tell us how many high-probability (i.e. q = 1 − ε) messages we must count—2^{HN}—which, in turn, lets us determine how many bits, for example, need to be sent for a typical message—about log2(2^{HN}) = HN. (In fact, what I just stated is a rough sketch of Shannon's proof of the noiseless coding theorem.) In the appendix, he omits a detailed proof of Theorem 4:

Theorem 4 follows immediately from . . . calculating upper and lower bounds for n(q) based on the possible range of values of p in Theorem 3. (Ibid., p. 421)

The essentials of a rigorous proof can be found in the unpublished typescript, as I will discuss later.
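The quantity n(q) can also be computed directly for a toy source, which makes Theorem 4 concrete. The sketch below (my own, for a binary i.i.d. source—not Shannon's calculation) sorts length-N sequences by probability, counts how many are needed to accumulate probability q, and shows (log2 n(q))/N drifting toward H:

```python
import math

p = 0.9                                                   # Bernoulli source: P(1) = 0.9, P(0) = 0.1
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))      # about 0.469 bits per symbol

def n_of_q(N, q):
    """Number of most-probable length-N sequences needed to accumulate probability q."""
    # Group sequences by their count k of 1s: each has probability p^k (1-p)^(N-k),
    # and there are C(N, k) of them.  Sort groups from most to least probable.
    groups = sorted(((p ** k * (1 - p) ** (N - k), math.comb(N, k)) for k in range(N + 1)),
                    reverse=True)
    total, count = 0.0, 0
    for prob, size in groups:
        if total + prob * size >= q:
            count += math.ceil((q - total) / prob)   # only part of this group is needed
            return count
        total += prob * size
        count += size
    return count

for N in (10, 20, 40, 80):
    print(N, round(math.log2(n_of_q(N, 0.5)) / N, 3), "vs H =", round(H, 3))
# (log2 n(q))/N creeps toward H as N grows, for any fixed 0 < q < 1.
```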

The first thing Shannon does with Theorem 3 is show that the entropy of a source can be discovered by observation—'without reference to the states and transition probabilities between states' (ibid., p. 398). This is proven in the form of Theorems 5 and 6. 'They show that a series of approximations of H can be obtained by considering only the statistical structure of the sequence extending over 1, 2, . . . N symbols' (ibid.). This is essential to Shannon's theory of communication because it means that the entropy of an ergodic Markov source is a quantity that can be measured empirically from its output alone, without reference to internal states. For example, we can—in principle—measure the entropy of English text using its statistics, without having to know anything about where it is coming from.

Next, Shannon applies the AEP to noiseless signal transmission. In the noiseless case, the challenge, in short, is to encode the message (for example a running English text) in such a way as to (1) transmit it over a channel with a particular alphabet (say, 1s and 0s) and (2) do so as efficiently as possible (by, as a simple example, omitting Us after Qs). Shannon's treatment culminates in the noiseless coding theorem—which guarantees the possibility of transmitting information arbitrarily close to the ideal rate (the channel capacity) with the right encoding (which the theorem provides) and for long enough messages. Shannon states it precisely in the following way:

Theorem 9: Let a source have entropy H (bits per symbol) and a channel have capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate C/H − ε symbols per second over the channel, where ε is arbitrarily small. It is not possible to transmit at an average rate greater than C/H. (Ibid., p. 401)

The second part of the theorem is trivial to show. Shannon demonstrates the first part of the theorem in two ways—both depending on the AEP. I will only sketch the first way, which is significantly more elegant (ibid., pp. 401–402).

Using Theorem 4, we know that we can divide long messages into two groups, a high probability group with little more than 2^{HN} members and a low probability group with a lot more—almost 2^{RN}, or the total number of messages (where 2^R is the number of different symbols). By the definition of capacity, the approximate number of messages one can send over the channel of duration T is 2^{CT}. Now (and this is the crucial step) we choose T large enough so that we have almost a one-to-one mapping between the number of messages of length N from the source and the number of messages we can send over the channel of duration T. Some leftover symbol is used to signal the switch to a less efficient code in rare cases when a low probability message needs to be transmitted. We can arrange things so that the resulting average rate of transmission, N/T, approaches C/H from below, even allowing for the occasional low probability message requiring extra-lengthy encoding.

Though the noisy coding theorem takes more work to prove than the noiseless one, it uses Theorem 4 in essentially the same way, that is, as a tool for eliminating a large number of low-probability messages from consideration—but for four different cases: (1) possible messages from the source (before coding), (2) possible transmitted messages, (3) possible received messages, and (4) possible transmitted messages given a received message (that is, equivocation, due to noise, in what may have been sent). The information for (1), (2), (3), and (4) can be measured using entropy, so they are all susceptible to the AEP. Since the proof of the noisy coding theorem (ibid., pp. 410–413) is somewhat complex, I will not give a detailed treatment here.

It is important to note that there is a significant difference in how the noiseless and the noisy coding theorems use Theorem 4. The noiseless coding theorem only needs an upper bound on the number of high probability sequences. However, the noisy coding theorem also needs a lower bound on the number of messages in case (2), since these are used to code for messages from (1). This turns out to be important for interpreting the typescript, because Shannon's first attempt to prove Theorem 4 fails to provide him with a lower bound, which may be why there are two theorems in the typescript which do not appear in his 1948 paper, as I discuss below.

Now that we've discussed the role of the AEP in Shannon's 1948 paper, we are ready to examine the contents of the typescript, and see how they relate to the material in Appendix III.

5. The relationship between the typescript and the 1948 paper

As we discussed in the previous section, Theorem 3 plays two important roles in Shannon's 1948 paper. Via Theorem 4 it is used to show the fundamental coding theorems, and via Theorems 5 and 6 it shows that the entropy of a source can be determined from its output. The focus of the unpublished typescript is, in fact, a set of five theorems which are almost identical or jointly equivalent to Theorems 3, 4, 5 and 6.

'Theorems on statistical sequences' begins with a proof of the AEP for ergodic Markov sources which is identical with the one from Appendix III of the 1948 paper (quoted in the previous section), except that the clause in the typescript,

Hence the probability that nearly all sequences lie within limits ±δ is given by

p = \prod p_{ij}^{(P_i p_{ij} ± δ)N}.

(Shannon, 1948a, p. 1)

is corrected to read in the 1948 paper

Hence nearly all sequences have a probability p given by

p = \prod p_{ij}^{(P_i p_{ij} ± δ)N}.

(Shannon, 1948b, p. 421)

The theorem in the typescript is also stated less precisely and in the form of a limit. In all other respects the treatment of the theorem is the same. However, its auxiliary theorems, which are the only means by which Theorem 3 is actually put to use in the 1948 paper, are not in such a finished form in the typescript.

It is interesting that Shannon still appears to be working out Theorem 4 at the time he wrote the typescript. He does not prove Theorem 4 as it is in his paper, but instead a pair of theorems similar to Theorem 4 (which I'll refer to as Theorem A and Theorem B, in order of their treatment in the typescript).

First, let's look more closely at Theorem A, which is stated at the end of its proof (original reprinted in Fig. 2, errors noted below, italics added for clarity):

We have shown that apart from a set of small probability, the probabilities of blocks of length L lie within the limits

2^{−(H−δ)N} < p < 2^{−(H+δ)N}

where δ can be made small by taking N large enough. Let the maximum number of blocks of length N when we delete a set of measure ε be G_ε(N). Then:

\sum_{remaining set} p = (1 − ε) ≤ G_ε(N) p_max = G_ε(N) 2^{−(H+δ)N}

\log G_ε(N) > (H + δ)N + \log(1 − ε)

Hence

\lim_{N → ∞} (\log G_ε(N))/N = φ(ε) ≥ S

Similarly

1 > \sum p > G_ε(N) p_min = G_ε(N) 2^{−(H−δ)N}

from which we obtain

(\log G_ε(N))/N > H − δ

and

φ(ε) ≥ H

Hence we have

Theorem: φ(ε) = H for ε ≠ 0, 1.

(Shannon, 1948a, pp. 4–5)

Fig. 2. Pages 4 and 5 from Shannon's unpublished typescript, 'Theorems on statistical sequences', dated 26 April 1948 (Shannon, 1948a). On these pages we see Shannon's attempt at a proof of what I call 'Theorem A', which corresponds to Theorem 4 in his important 1948 paper, 'A mathematical theory of communication'.

Essentially, this is doing in detail what the 1948 paper suggests: 'Theorem 4 follows immediately . . . on calculating upper and lower bounds for n(q) based on the possible range of values of p in Theorem 3'. However, Theorem A and Theorem 4 are different because G_ε(N), the correlate of n(q), is not defined as 'the number we must take from [the set of possible sequences] starting with the most probable one in order to accumulate a total probability q for those taken' (Shannon, 1948b, p. 397), but as 'the maximum number of blocks of N when we delete a set of measure ε' (Shannon, 1948a, p. 4). Theorem A does not specify what 'set of measure ε' is being deleted; it may be a set of high-probability sequences—which means that Theorem A, as it is stated, is strictly speaking false (since without removing the low-probability set far more than 2^{HN} sequences will be left).9

9 However, it is clear that Shannon was only thinking about the case when the low-probability set is removed, and in some other way made the error of calling G_ε(N) a maximum.

Shannon's attempted proof in the typescript can be made into a proof of Theorem 4 with the following alterations.

a) Replace ε with q and G_ε(N) with n(q). (Note that in the typescript he sometimes uses the symbol S for H.)
b) The initial set of bounds on p are incorrect and should be reversed, and the appropriate substitutions made in the proof for the limiting values of p. (This means switching 'H + δ' and 'H − δ'.)
c) The two last inequalities in the proof should be reversed, because this allows us to correctly make the inference that φ(ε) = H (since we know it is neither greater nor less than it).

Shannon knows that he needs to prove more than is stated in Theorem A, which only gives an upper bound on the number of high probability sequences. To show the noiseless coding theorem, it is enough to have an upper bound on the number of messages for which an efficient code is needed. However, recall that for the noisy coding theorem, a lower bound is also needed on the number of high probability input sequences. This would provide sufficient motivation for Shannon to push ahead and attempt to prove Theorem B. In the typescript, which is mathematical in tone, Shannon gives a different reason—to put tighter bounds on the probability range (ibid., pp. 5–6). Theorem B states the following:

Theorem: Given δ > 0 there exists a set of M blocks of length N (when N is sufficiently large) such that

M > 2^{(S−δ)N}

and each block has the same probability, and starts and ends in the same state, which can be chosen arbitrarily. (Ibid., p. 7)

Shannon shows this by construction—putting together blocks in the right composition to get equiprobable sequences, the number of which grows at a sufficient rate with N. However, since equiprobability is not a property needed in his 1948 paper, and Theorem A can be modified to provide both upper and lower bounds, Theorem B is eventually dropped completely.

To sum up our discussion of Theorems A and B, we can see the following progress occurring between the typescript and the 1948 paper: (i) Theorem A is restated in a stronger and more correct fashion as Theorem 4, (ii) its error-ridden proof is eliminated in favor of a brief synopsis, and (iii) Theorem B is discarded as superfluous.

Now let's examine the theorems in the typescript which correspond to Theorems 5 and 6, which I'll refer to as Theorems C and D, respectively (both stated on page 3). Theorem C and Theorem 5 are almost the same, except for one small but important difference. Theorem C reads:

Theorem:

\lim_{N → ∞} (1/N) \sum p(B_i) \log p(B_i) = H

where p(B_i) is the probability of block B_i of length L, and the sum is over all possible blocks. (Ibid., p. 3)

Shannon adds the needed minus sign in the 1948 paper and specifies that these are blocks of symbols—and not states, as he calls them in the typescript.

Theorem 5: Let p(B_i) be the probability of a sequence B_i of symbols from the source. Let

G_N = −(1/N) \sum_i p(B_i) \log p(B_i)

where the sum is over all sequences B_i containing N symbols. Then G_N is a monotonic decreasing function of N and

\lim_{N → ∞} G_N = H.

(Shannon, 1948b, p. 398)

In Appendix III, Shannon simply says that Theorem 5 is true 'By using theorem 3' (Shannon, 1948b, p. 421). In the typescript, however, Shannon provides an explicit proof (Shannon, 1948a, pp. 2–3). First, he establishes lower and upper bounds on the contribution made by the high probability blocks—which turns out to be H in the limit. Then, he places a vanishing upper bound on the contribution made by the low probability blocks, which gives the desired result. The added specification that these are blocks of symbols is important because in the 1948 paper Shannon states that the purpose of Theorem 5 is to provide a way to determine the entropy of a source 'directly from the statistics of the message sequences, without reference to the states and transition probabilities between states' (Shannon, 1948b, p. 398). Theorem C is ambiguous on whether it refers to symbols or state transitions.
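Theorem 5's practical content—that H can be estimated from block statistics of the output alone—can be checked with a short simulation (my own sketch; the two-state source and its entropy rate of roughly 0.56 bits per symbol are the same toy example used in the earlier snippet):

```python
import math
import random
from collections import Counter

# Two-state source emitting symbols 'a' and 'b'; here each symbol coincides with a state.
T = {'a': {'a': 0.9, 'b': 0.1}, 'b': {'a': 0.5, 'b': 0.5}}

def sample(length, seed=1):
    """Generate a long output sequence from the source."""
    random.seed(seed)
    out, state = [], 'a'
    for _ in range(length):
        state = random.choices(list(T[state]), weights=list(T[state].values()))[0]
        out.append(state)
    return ''.join(out)

def G(N, text):
    """Block-entropy estimate G_N = -(1/N) sum p(B) log2 p(B) over observed N-blocks."""
    blocks = Counter(text[i:i + N] for i in range(len(text) - N + 1))
    total = sum(blocks.values())
    return -sum((c / total) * math.log2(c / total) for c in blocks.values()) / N

text = sample(200_000)
for N in (1, 2, 4, 8):
    print(N, round(G(N, text), 3))
# G_N decreases toward the true entropy rate (about 0.56) as N grows, sampling error aside.
```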

In the 1948 paper, Shannon proves Theorem 6 (corresponding to Theorem D) by showing that the following limit is equal to the one above, and is therefore also H:

\lim_{N → ∞} −\sum_{i,j} p(B_i, S_j) \log p_{B_i}(S_j)

where p(B_i, S_j) is 'the probability of sequence B_i followed by symbol S_j' and p_{B_i}(S_j) is 'the conditional probability of S_j after B_i' (ibid.). This provides a better and in many cases more practical approximation to H than Theorem 5, by measuring the probability of each letter given its N predecessors. If there are no statistical influences extending more than M letters, we can use the Mth order approximation to calculate the entropy of a source exactly.

In the typescript, there are several differences. The theorem has two parts, the second part almost equivalent to Theorem 6. The first part is not a limit and assumes equilibrium (to avoid confusion I've put corrected expressions in brackets):

Theorem

H = −\sum p(B_i|S_j) \log p_{B_i}(S_i)   [−\sum p(B_i S_j) \log p_{B_i}(S_i)]
  = \lim_{L(B) → ∞} −\sum q(B_i S_j) \log q_B(S_j)   [\lim_{L(B) → ∞} −\sum q(B_i S_j) \log q_{B_i}(S_j)]

where p(B_i, S_j) is the probability of block B_i followed by S_j and p_{B_i}(S_j) is the conditional probability of S_j after the block B_i is known to occur. q(B_i, S_j) in [is] the probability when B_j is computed on the basis of any initial state probabilities, not necessarily the proper ones and q_{B_i}(S_j) the corresponding conditional probabilities. (Shannon, 1948a, p. 3)

The reason the first equality does not have to be a limit is that it assumes equilibrium. Shannon notes that, on this assumption, the formula for entropy can be obtained by summing over i, giving the desired result. However, Shannon does not appear to realize, once again, that there is an implicit assumption at work that each symbol corresponds to a unique state transition—which is not in general true for Markov sources. At the same time, he does appear to realize that the first equality is trivial because it works for blocks of any length, which indicates that it fails to draw the desired connection between the measured statistics of a sequence and its entropy. Shannon mistakenly assumes that the problem is just his assumption of equilibrium, and offers the second equality which holds good even starting out of equilibrium. He proves it in the following way:

If the q's are used [that is, looking at the second equality], the q_{B_i}(S_i) are still p_{kj} where k is the state in which B_i ends.

\sum_{B_i → k} q(B_i, S_j) = p_{kj} \sum_{B_i → k} p(B_i) → p_{kj} P_k

since any initial distribution tends toward equilibrium. (Ibid., p. 4)

Hence, in the limit of large block size, −Σ q(B_i S_j) log q_{B_i}(S_j) also reduces to the expression for entropy, −Σ P_k p_{kj} log p_{kj}. On closer inspection, however, this does not resolve the difficulty, since we still haven't addressed the difference between symbols and state transitions. Shannon dodges these issues in his paper by proving Theorem 6 in a completely different way—showing that the expression (which Shannon names F_N) has the same limit as the expression in Theorem 5 (called G_N). Luckily for Shannon, once the minor corrections are made to Theorem 3 that I discussed in the previous section, Theorems 5 and 6 become valid for sequences of symbols and not just state transitions.

To summarize the progress Shannon makes between Theorems C and D and Theorems 5 and 6:

1) Shannon adds the specification that Theorems 5 and 6 refer to blocks of symbols.
2) Shannon omits a detailed proof of Theorem 5/C in his paper, in favor of a brief synopsis—perhaps to spare the reader some algebraic tedium. Alternatively, he may only have worked through the proof in the typescript to check his work.
3) Shannon omits the first part of Theorem D, which is invalid when applied to sequences of symbols.
4) Shannon keeps the second part of Theorem D as Theorem 6, and proves it in a more elegant fashion using Theorem 5—avoiding difficulties involving the distinction between states and symbols.

Before making some concluding remarks on what the typescript can tell us about the genesis of Shannon's information theory, let's briefly consider the two additional theorems proven in the manuscript which I have not discussed.

The first one (ibid., p. 6) is merely an auxiliary theorem needed to prove Theorem B, which Shannon discards. The second one (ibid., p. 8) corresponds to an unnumbered theorem appearing in Appendix III of the 1948 paper. This theorem is an analogue of Theorem 4 for the mixed ergodic case. (The difference is that n(q) becomes a step function.) After Shannon proves this theorem, he ends the typescript with a definition of entropy for a mixed source: 'H = Σ H_i c_i'. The idea of treating mixed sources is more or less abandoned in the 1948 paper, except for a brief statement and proof of the theorem mentioned above in Appendix III.


6. Near the end of the road to information theory

Shannon's information theory introduced a number of innovations to communications engineering, most importantly (a) measures for information and channel capacity good for any discrete ergodic source and discrete channel with noise, (b) two fundamental theorems showing that channel capacity is in principle attainable, all helping to illustrate that (c) redundancy is the key to efficient coding. The AEP is essential to these innovations. It is good for discrete finitary ergodic sources, making it applicable to messages with statistical influence between symbols, such as written language—and it makes the connection between redundancy and efficient coding, by allowing us to think of entropy in terms of the number of high probability sequences.

We can see the mathematics underneath these innovations still being worked out in the typescript. It appears that Shannon knew he had to think in terms of the number of high probability messages, but the proof that this can be done is rough, and the theorem is stated too weakly to be used to prove the noisy coding theorem. This is what perhaps compelled him to prove a parallel theorem in the typescript that did not appear in the paper. In the typescript, Shannon attempted to show that the entropy of a source can be measured from the statistics of its sequences, but failed to draw the requisite distinction between symbols and state transitions. He corrects this problem in the paper, despite a seeming failure to trace its origin back to an ambiguity in Theorem 3. Finally, there was a lot of work done in the typescript—detailed algebra, and an attempt to treat mixed sources as well as ergodic ones—which Shannon dropped in favor of elements more clear and to-the-point. (And apparently not in vain—engineers still praise his work for its clarity and richness, for example Calhoun, 2003, p. 24.)

The typescript does not, however, suggest that Shannon was in any way puzzled about the significance of Theorem 3—and we wouldn't expect him to be puzzled in something written so close in time to his 1948 paper. In fact, it demonstrates that he knows exactly what its significance is. The typescript is a very concise treatment of statistical sequences, with most of its space devoted to the most important theorems for information theory. Subsequently omitted material consists mostly of detailed algebra. Still, the typescript supports the idea that Shannon's intuitions concerning his theory's mathematical core, which, according to Shannon, were already well developed by 1944 or 1945, had not been expressed in mathematical detail until just before the publication of his 1948 paper. This picture is further supported by his own reports about how he worked:

Shannon: I don't think I was ever motivated by the notion of winning prizes . . . I was more motivated by curiosity . . . I just wondered how things were put together. Or what laws or rules govern a situation, or if there are theorems about what one can't or can do. Mainly because I wanted to know myself. After I had found the answers it was always painful to write them up or to publish them (which is how you get acclaim). There are many things I have done and never written up at all. Too lazy, I guess. I have a file upstairs of unfinished papers. (Shannon, 1993d, p. xxiv)

This suggests another way to think about the purpose of the typescript 'Theorems on statistical sequences', apart from its being a draft of Appendix III. In particular, it may have been a rough sketch for a more developed mathematical paper that Shannon never got around to writing. (This may be why its title, 'Theorems on statistical sequences', is not the same as the corresponding appendix title, 'Theorems on ergodic sources', where 'source' refers to a specifically information-theoretical object.) Either way, it is clearly a stepping stone on the way to the results published in his 1948 paper.

In an attempt to verify this picture, I interviewed Shannon's colleague at Bell Laboratories, mathematician Brockway McMillan, who introduced Shannon's ideas on ergodic sequences to the mathematical community with his paper, 'The basic theorems of information theory' (McMillan, 1953). He confirms the idea that Shannon had information theory more or less intuitively worked out before putting it on a sounder basis in 1948, though he does not specifically recall an eight-page typescript (see Appendix for relevant excerpts).

A latest date for Shannon's intuitive development of the AEP can be inferred from the declassified cryptography report (Shannon, 1945). It gives the following statement, without proof:

We suppose that the possible messages of length N can be divided into two groups, one group of high and fairly uniform probability, while the total probability in the second group is small. This is usually possible in information theory if the messages have any reasonable length. Let the total number of messages be

H = 2^{R_0 N}

where R_0 is the maximum rate and N the number of letters. The high probability group will contain about

S = 2^{RN}

where R is the statistical rate. (Ibid., p. 74)

This is one way of stating the core theorem of information theory, the AEP.
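
To make the quoted statement concrete, here is a small numerical sketch. (The illustration is mine, not Shannon's; the memoryless binary source, its bias, and the message length are invented for the purpose.) For a binary source the maximum rate R_0 is one bit per letter, so there are 2^{R_0 N} = 2^N possible messages of length N; the sketch counts how much probability is carried by the roughly 2^{HN} most probable ones, H being the entropy per letter, which plays the part of the 'statistical rate' R in the quotation:

import math
from itertools import product

p = 0.9   # probability that the hypothetical source emits a '1'
N = 16    # message length, kept small enough to enumerate all 2^N messages

# Entropy per letter of the source, in bits.
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Probability of every possible message, sorted from most to least probable.
probs = sorted(
    (p ** sum(m) * (1 - p) ** (N - sum(m)) for m in product((0, 1), repeat=N)),
    reverse=True,
)

S = round(2 ** (H * N))   # size of the 'high probability group'
mass = sum(probs[:S])     # probability carried by those S messages

print(f"possible messages 2^N          : {2 ** N}")
print(f"high probability group ~ 2^(HN): {S}")
print(f"probability mass in that group : {mass:.2f}")

Even at this small N, the group of roughly 2^{HN} most probable messages is a tiny fraction of all the possibilities (fewer than 200 of the 65,536 messages) yet already carries most of the probability, which is the behaviour Shannon's statement describes.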

Acknowledgements

Thanks to John Norton, Or Neeman, Paolo Palmieri, and Ron Kline for helpful comments on earlier drafts. And special thanks to Brockway McMillan for an interesting and enjoyable interview.

Appendix: An interview with Brockway McMillan, 6 February 2008

The mathematician Brockway McMillan was a friend of Claude Shannon's and worked with him at Bell Laboratories while he was developing his theory of information. The following are excerpts from a phone interview, during which McMillan related facts and stories about Shannon's work and his own.

Thomsen: It looks like he had an intuitive version of the [AEP] principle, that he used in his cryptography report . . . three or four years earlier.

McMillan: Yes, it did start with cryptography, there's no question about that. He made that pretty clear . . .

T: . . . so this principle, it seemed like he had it intuitively, somehow he had an intuition of this principle, and then later, before he published the theory, then he started to work rigorously to get the foundations. Was that your impression? . . .

M: Yeah, he, I can remember him you know sort of stating numerous times that these were essentially equally probable and [I wasn't sure] about it, and finally I wrote something that made it clear to myself . . . I wrote a paper that was published in the Annals of Mathematical Statistics [McMillan, 1953] explaining Shannon's theory to statisticians . . . Well that's sort of . . . all I know about it.

T: . . . 'Theorems on statistical sequences', it was just a little typescript, like eight pages long, where he kind of gives a rough proof. Do you remember him circulating—?

M: Well he probably did, I don't remember it specifically. I remember something, didn't it have some graphs, some Markov processes, some diagrams of Markov processes?

T: Those diagrams showed up in the cryptography report and in his paper, not in this particular typescript though . . . I noticed that they were in the original cryptography report from 1945, that he was already thinking in these terms.

M: 1945. Yeah, well I came to Bell Labs in 1946, and so his theory was already pretty well along. Though, what I'm trying to remember is whether he wrote an internal memorandum that is in [the] Bell Laboratories archive . . . what was finally published in the Bell System Technical Journal [Shannon, 1948b]. I think he did, because I must have read that before the paper in the BSTJ came out. I can remember discussing it with him early in the game, how he kept muttering that these things were equally probable and sort of not understanding and not believing what he said, until I sat down [to work it out myself], until I understood it well enough that I published a paper sort of proving the equipartition property. I sent that paper off to the Annals of Mathematical Statistics and the editor sent it back saying, 'Hey, you gotta explain this more thoroughly than what you've done'. So, I really wrote an expository paper trying to explain what Shannon really [meant] and it made a difference. The statisticians considered it respectable mathematics after that, and began to polish it up.10

References

Aspray, W. (1985). Information: A survey. Annals of the History of Computing, 7, 117–140.

Breiman, L. (1957). The individual ergodic theorems of information theory. The Annals of Mathematical Statistics, 28, 809–811.

Brooks, J. (2003). Claude Elwood Shannon: A register of his papers in the Library of Congress. http://memory.loc.gov/cgi-bin/query/r?faid/faid:@field(DOCID+ms003071). (Accessed June 2006).

Calhoun, G. (2003). Third generation wireless systems, Vol. 1. Boston, MA: Artech House.

Cherry, E. (1952). The communication of information (an historical review). American Scientist, 40, 640–664.

Editorial Note on 'Communication in the presence of noise' by Claude E. Shannon. (1984). Proceedings of the IEEE, 72, 1713.

10 According to McMillan (private correspondence), Ornstein (1971) 'proved Shannon's Coding Theorem in full generality' and 'finally closed the book on Shannon's theory'. For more on the scientific impact of Shannon's work, see McMillan (1997).

Ellersick, F. (1984). A conversation with Claude Shannon. IEEE Communications Magazine, 22, 123–126.

Fisher, R. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22, 700–725.

Hartley, R. (1928). Transmission of information. Bell System Technical Journal, 7, 535.

Hodges, A. (1983). Alan Turing: The enigma. New York: Simon & Schuster.

Leff, H., & Rex, A. (Eds.). (1990). Maxwell's demon: Entropy, information, and computing. Bristol: A. Hilger.

McMillan, B. (1953). The basic theorems of information theory. The Annals of Mathematical Statistics, 24, 196–219.

McMillan, B. (1997). Scientific impact of the work of C. E. Shannon. In V. Mandrekar, & P. R. Masani (Eds.), Proceedings of the Norbert Wiener Centenary Congress, 1994: Michigan State University, November 27–December 3, 1994 (pp. 513–520). Proceedings of Symposia in Applied Mathematics, 52. Providence, RI: American Mathematical Society.

Nyquist, H. (1924). Certain factors affecting telegraph speed. Bell System Technical Journal, 3, 324–346.

Ornstein, D. (1971). Two Bernoulli shifts with infinite entropy are isomorphic. Advances in Mathematics, 5, 339–348.

Pierce, J. (1993). Looking back: Claude Elwood Shannon. IEEE Potentials, 12, 38–40.

Segal, J. (2003). Le zéro et le un: Histoire de la notion scientifique d'information au 20e siècle. Paris: Syllepse.

Shannon, C. (1945). A mathematical theory of cryptography. (Photocopy of declassified confidential report.) 114 pp. text & 21 pp. diagrams, 1 September 1945. Institute Archives, MIT.

Shannon, C. (1948a). Theorems on statistical sequences. (Carbon copy of typescript.) 8 pp. 26 April 1948. In Folder 1, Box 9, Papers of Claude Elwood Shannon, Manuscript Division, Library of Congress.

Shannon, C. (1948b). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.

Shannon, C. (1949). Communication in the presence of noise. Proceedings of the Institute of Radio Engineers, 37, 10–21. (Claims to have been received in 1940, but Shannon denies submitting it before the end of World War II. See n. 6 above.)

Shannon, C. (1993a). Letter to Vannevar Bush. In idem, Collected papers (N. Sloane, & A. Wyner, Eds.) (pp. 455–456). New York: IEEE Press.

Shannon, C. (1993b). Analogue of the Vernam system for continuous time series. In idem, Collected papers (N. Sloane, & A. Wyner, Eds.) (pp. 144–147). New York: IEEE Press. (Memorandum MM 43-110-44, Bell Laboratories. 4 pp. 10 May 1943)

Shannon, C. (1993c). The best detection of pulses. In idem, Collected papers (N. Sloane, & A. Wyner, Eds.) (pp. 148–150). New York: IEEE Press. (Memorandum MM 44-110-28, Bell Laboratories. 3 pp. 22 June 1944)

Shannon, C. (1993d). Collected papers (N. Sloane, & A. Wyner, Eds.). New York: IEEE Press.

Shannon, C. (1993e). Communication theory of secrecy systems. In idem, Collected papers (N. Sloane, & A. Wyner, Eds.) (pp. 84–143). New York: IEEE Press. (First published in Bell System Technical Journal, 28 (1949), 656–715).

Shannon, C. (1993f). Review of N. Wiener, Cybernetics, or control and communication in the animal and the machine. In idem, Collected papers (N. Sloane, & A. Wyner, Eds.) (pp. 872–873). New York: IEEE Press. (First published in Proceedings of the Institute of Radio Engineers, 37 (1949), 1305).

Shannon, C., Oliver, B., & Pierce, J. (1948). The philosophy of PCM. Proceedings of the Institute of Radio Engineers, 36, 1324–1331.

Wiener, N. (1948). Cybernetics, or control and communication in the animal and the machine. Cambridge, MA: MIT Press.
