Travesty in Byte


Post on 27-Oct-2014






<p>NOVEMBER 1984 VOL. 9, NO. 12. $3.50 IN UNITED STATES / $4.25 IN CANADA / £2.10 IN U.K. A MCGRAW-HILL PUBLICATION 0360-5280</p> <p>THE SMALL SYSTEMS JOURNAL</p> <p>NOVEMBER 1984 BYTE 129</p> <p>Nonsense imitation can be disconcertingly recognizable</p> <p>ILLUSTRATED BY JOAN HALL, WITH APOLOGIES TO JAMES JOYCE AND HENRY JAMES</p> <p>Hugh Kenner and Joseph O'Rourke teach English and computer science, respectively, at The Johns Hopkins University, Baltimore, MD 21218.</p> <p>English letter-combination frequencies can be used to generate random text that mimics the frequencies found in a sample. Though nonsensical, these pseudo-texts have a haunting plausibility, preserving as they do many recognizable mannerisms of the texts from which they are derived. For example, the following text was generated by the first sentences of this article:</p> <p>English letter-combination frequencies from text was generived. For example. Though nonsentencies from text was the text was generated to generisms of that mimics the first sentencies from text the texts have a have a sample, they article:</p> <p>The nature of such texts has been little explored, in part because it's been difficult to get samples. Claude Shannon generated "approximations to English" by hand in 1948, but the laborious calculation it involved prevented extensive study. This is clearly a task for a computer, but programs have been hampered by the need for impractical amounts of memory.</p> <p>We offer a Pascal program, Travesty, to fabricate pseudo-text quickly from any input text. Students of style and linguistics will see possibilities. So may programmers, since Travesty contains a feature that can greatly speed up general pattern-matching procedures. We add a special-case version that is even speedier. 
To make clear what Travesty does, we'll first discuss language statistics and what they imply.</p> <p>LANGUAGE STATISTICS</p> <p>Finish typing a page of English prose, and the key you hit most often will have been the space bar. Either "e" or "t" will rank second. You did not make those decisions; the language did. In fact, the language makes three-quarters of your writing decisions for you. Not only do the letters observe preferred frequencies, they keep preferred company. A familiar example: write "Q", and (unless you are drafting a QANTAS ad or some comments on Iraq) the next character is almost sure to be "u". If probability coerces the successor to a single letter, what follows a letter pair is even more tightly bound. Write "th", and the probability is very high that what follows will be "e". If it is, then the character after "e" is most likely to be either a space or an "r". Pairs like "th" are called digrams; triplets like "the" are trigrams. They have frequencies, like letters. The most common English digram is "he"; you will find it three times in the sentence you are reading now, 15 times in this paragraph. And you will guess correctly that as we move up from single letters to digrams and trigrams, the probabilities that govern the next character grow ever more rigorous. By the time we've reached, say, pentagrams, has the author any choice at all?</p> <p>Yes, he has; otherwise Henry James could have had no way to be Henry James, or James Joyce to be James Joyce. At a fairly low level, the statistics of English would have taken over from both of them, and neither would have been distinguishable from The New York Times. But that is not what happens. True, even with a James or a Joyce holding the pen, the statistics do not lie dormant. However, they no longer derive from the undifferentiated language, i.e., from a large sample of everything we can find. The significant statistics derive from the personal habits of James, or Joyce, or Jack London, or J. D. Salinger. 
Each of these writers, amazingly, had his own way with trigrams, tetragrams, pentagrams, matters to which he surely gave no thought.</p> <p>This line of reasoning brings us to the unexpected fact that essentially random nonsense can preserve many "personal" characteristics of a source text. Travesty (listing 1), a program suitable for small systems, will scan a sample text and generate, from the sample's n-gram statistics, a "nonsense" imitation through which the original text, and even its authorship, is disconcertingly recognizable.</p> <p>For example, we provided Travesty with 29 names of towns taken from a gazetteer of England and called for third-order (trigram) analysis. It promptly churned out a couple thousand characters. These letter groups included (1) many input words regurgitated; (2) some uninteresting letter strings that we agreed to call "garbage" (on the principle that a weed is a flower you don't want); and (3) some wondrously plausible names for English towns that don't exist but ought to. They included Bambudge, Nettlewett, Gidge, Hample, Bognorton, Chire, Clop, Tootinton, Bleweth, and Eastle. (If any of these is a real name, that's by accident; none was on our input list.) And fancy being Mayor of Clop!</p> <p>The connection of the output to the source can be stated exactly: for an order-n scan, every n-character sequence in the output occurs somewhere in the input, and at about the same frequency. That is all, yet it is enough to account for an eerie similarity. Every string of three letters in our pseudo-place-names, "ttl" or "dge", for instance, was lifted out of a string of characters and spaces that consisted simply of the 29 input words typed one after another with one space after each.</p> <p>Figure 1 shows one of the thousands of machine-generated derivations Travesty can extract from a 75-word sample of James Joyce's Ulysses. 
This passage is an order-4 scan; every four-character sequence in the output comes from somewhere in the input.</p> <p>There is a lot of fun to be had here. There is also much for the student of language and literature to investigate. To what degree can personal "style" be described as a manifestation of letter frequencies? Such a question, though not new, was merely tantalizing before the modern computer; even more so before procedures were discovered, quite recently, that didn't demand impossible amounts of machine memory.</p> <p>FREQUENCY ARRAYS</p> <p>Brian P. Hayes, associate editor of (continued)</p> <p>REMARKS ON THE TRAVESTY LISTING</p> <p>Pascal input/output (I/O) conventions are, to say the least, poorly standardized. We have three Pascal systems available: Turbo Pascal for CP/M and MS-DOS, Lucidata Pascal for CP/M and HDOS, and Berkeley Pascal running under UNIX, and we haven't been able to write a version of Travesty that will run on all three unmodified. Judging that Turbo is the rising young comer, we list the Turbo Pascal version, with notes on such problem areas as we know about. This version might run on UCSD Pascal too, but we've not been able to try it. Since it avoids features unique to Turbo and UCSD, it ought to be transportable to any decent Pascal system at the cost of a little attention to input and output.</p> <p>Line numbers are, of course, for reference only: don't type them into your Pascal listing.</p> <p>23 This value is safe and may even be increased, but remember that you'll have two arrays this size. How big you can make ArraySize depends on your system's memory requirements. Turbo Pascal, when compiled to disk to get the compiler itself out of the way, permitted ArraySize = 14,000 on a 64K-byte CP/M system. That's about 2300 words of input text. On an MS-DOS system with 196K bytes, maximum ArraySize increased to 21,000, or 3500 words of text, independent of whether compilation was to memory or to disk.</p> <p>33 If your Pascal doesn't know about the TEXT type, change this line to f : FILE OF CHAR.</p> <p>40 If your Pascal system has a RANDOM function, you can drop lines 40 to 44 altogether. Then change line 239 to read toss := random(total) + 1;. You should also delete lines 38, 52, and 53.</p> <p>49 Many versions of Pascal don't recognize STRING types unless they have been declared: TYPE STRING = PACKED ARRAY [1..12] OF CHAR; Then change line 49 to InFile : STRING.</p> <p>62 Some Pascals will require you to declare a variable i and say, FOR i := 1 TO 12 DO READ (InFile[i]);.</p> <p>63 Berkeley Pascal doesn't use the ASSIGN command. You'd omit this line and change line 64 to reset (f, infile);.</p> <p>Also, you will probably want output to a disk file, and you'll have to set that up yourself. Add a second TEXT variable, g, to line 33 and a second STRING variable, OutFile, to line 49. Then insert after line 64 a request for the name of the OutFile, and ASSIGN it to g in whatever way your system provides. And if your system requires files to be explicitly closed, add a statement line, CLOSE (g), just before the final END. (Don't forget the semicolon at the end of the line above it.)</p> <p>NOTES ON HELLBAT</p> <p>To change Travesty into Hellbat, procedures InitSkip and Match are replaced by the versions given in listing 2, and numerous lines are deleted as shown below. Note that WriteCharacter now receives its characters from Match and has only formatting duties to perform. If your Pascal has its own RANDOM function, make the deletions listed in the section on Travesty for line 40; and the major change, applied above to the WriteCharacter procedure, should instead be made to the line in the new Match procedure that invokes Random. Lines to delete for Hellbat include 28, 72 to 80, 269, 273 (all references to FreqArray), and 232 to 245 (process for getting a character).</p> <p>Listing 1: Travesty, a program for generating pseudo-text. The program will scan a sample text and generate a "nonsense" imitation. For an order-n scan, every n-character sequence in the output occurs somewhere in the input.</p> <p>3 PROGRAM travesty (input, output); { Kenner / O'Rourke, 5/9/84 }
(* This is based on Brian Hayes's article in Scientific American, *)
(* November 1983. It scans a text and generates an n-order *)
(* simulation of its letter combinations. For order n, the relation *)
(* of output to input is exactly: "Any pattern n characters long in *)
(* the output has occurred somewhere in the input, and at about the *)
(* same frequency." Input should be ready on disk. Program asks how *)
(* many characters of output you want. It next asks for the "Order", *)
(* i.e., how long a string of characters will be cloned to output *)
(* when found. You are asked for the name of the input file, and *)
(* offered a "Verse" option. If you select this, and if the input *)
(* has a "|" character at the end of each line, words that end *)
(* lines in the original will terminate output lines. Otherwise, *)
(* output lines will average 50 characters in length. *)
22 CONST
23   ArraySize = 3000; { maximum number of text chars }
24   MaxPat = 9; { maximum Pattern length }
26 VAR
27   BigArray : PACKED ARRAY [1..ArraySize] of CHAR;
28   FreqArray, StartSkip : ARRAY [' '..'|'] of INTEGER;
29   Pattern : PACKED ARRAY [1..MaxPat] of CHAR;
30   SkipArray : ARRAY [1..ArraySize] of INTEGER;
31   OutChars : INTEGER; { number of characters to be output }
32   PatLength : INTEGER;
33   f : TEXT;
34   CharCount : INTEGER; { characters so far output }
35   Verse, NearEnd : BOOLEAN;
36   NewChar : CHAR;
37   TotalChars : INTEGER; { total chars input, + wraparound }
38   Seed : INTEGER;
40 FUNCTION Random (VAR RandInt : INTEGER) : REAL;
41 BEGIN
42   Random := RandInt / 1009;
43   RandInt := (31 * RandInt + 11) MOD 1009
44 END;
46 PROCEDURE InParams;
47 (* Obtains user's instructions *)
48 VAR
49   InFile : STRING [12];
50   Response : CHAR;
51 BEGIN
52   WRITELN ('Enter a Seed (1..1000) for the randomizer');
53   READLN (Seed);
54   WRITELN ('Number of characters to be output?');
55   READLN (OutChars);
56   REPEAT
57     WRITELN ('What order? ');
58     READLN (PatLength)
59   UNTIL (PatLength IN [2..MaxPat]);
60   PatLength := PatLength - 1;
61   WRITELN ('Name of input file?');
62   READLN (InFile);
63   ASSIGN (f, InFile);
64   RESET (f);
65   WRITELN ('Prose or Verse? ');
66   READLN (Response);
67   IF (Response = 'V') OR (Response = 'v') THEN
68     Verse := true
69   ELSE Verse := false
70 END; { Procedure InParams }
72 PROCEDURE ClearFreq;
73 (* FreqArray is indexed by 93 probable ASCII characters, *)
74 (* from ' ' to '|'. Its elements are all set to zero. *)
75 VAR
76   ch : CHAR;
77 BEGIN
78   FOR ch := ' ' TO '|' DO
79     FreqArray[ch] := 0
80 END; { Procedure ClearFreq }
82 PROCEDURE NullArrays;
83 (* Fill BigArray and Pattern with nulls *)
84 VAR
85   j : INTEGER;
86 BEGIN
87   FOR j := 1 TO ArraySize DO
88     BigArray [j] := CHR(0);
89   FOR j := 1 TO MaxPat DO
90     Pattern [j] := CHR(0)
91 END; { Procedure NullArrays }
93 PROCEDURE FillArray;
94 (* Moves textfile from disk into BigArray, cleaning it *)
95 (* up and reducing any run of blanks to one blank. *)
96 (* Then copies to end of array a string of its opening *)
97 (* characters as long as the Pattern, in effect wrapping *)
98 (* the end to the beginning. *)
99 VAR
100   Blank : BOOLEAN;
101   ch : CHAR;
102   j : INTEGER;
104 PROCEDURE Cleanup;
105 (* Clears Carriage Returns, Linefeeds, and Tabs out of *)
106 (* input stream. All are changed to blanks. *)
107 BEGIN
108   IF ((ch = CHR(13)) { CR }
109   OR (ch = CHR(10)) { LF }
110   OR (ch = CHR(9))) { TAB }
111   THEN ch</p>
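<p>The relation stated in the header comment of Listing 1 (any pattern n characters long in the output has occurred somewhere in the input, and at about the same frequency) can be illustrated with a much smaller program. What follows is a minimal sketch, not the authors' listing: it rescans the whole sample for every output character and makes no attempt at the speed-up the article attributes to Travesty. The names Buf and NextCh, the constants Order and OutChars, and the built-in sample (the article's opening sentence) are assumptions chosen for illustration, and the sketch relies on the compiler's own Randomize and Random(n), as found in Free Pascal and later Turbo Pascal versions, rather than the listing's seeded Random function.</p>

```pascal
program TravestySketch;
{ Illustrative sketch only; this is NOT the article's Listing 1. }
{ All names and constants below are assumptions for the example.  }

const
  Order    = 4;    { assumed order n; the matched pattern has n - 1 chars }
  OutChars = 200;  { assumed number of characters to generate }
  Sample   = 'English letter-combination frequencies can be used to ' +
             'generate random text that mimics the frequencies found ' +
             'in a sample. ';

var
  Buf, Pattern : string;
  Freq : array [char] of integer;
  i, j, n, total, toss : integer;
  ch, NextCh : char;

begin
  Randomize;
  n := Order - 1;
  { Wrap the end of the sample to its beginning, so that every }
  { pattern found in the sample has at least one successor.    }
  Buf := Sample + Copy(Sample, 1, n);
  Pattern := Copy(Sample, 1, n);
  Write(Pattern);
  for i := 1 to OutChars do
  begin
    { Tally the character that follows each occurrence of Pattern. }
    for ch := Chr(0) to Chr(255) do
      Freq[ch] := 0;
    total := 0;
    for j := 1 to Length(Sample) do
      if Copy(Buf, j, n) = Pattern then
      begin
        Freq[Buf[j + n]] := Freq[Buf[j + n]] + 1;
        total := total + 1
      end;
    { Pick a successor with probability proportional to its tally. }
    toss := Random(total) + 1;
    NextCh := ' ';
    for ch := Chr(0) to Chr(255) do
    begin
      if (toss > 0) and (toss <= Freq[ch]) then
        NextCh := ch;
      toss := toss - Freq[ch]
    end;
    Write(NextCh);
    { Slide the pattern window forward by one character. }
    Pattern := Copy(Pattern, 2, n - 1) + NextCh
  end;
  WriteLn
end.
```

<p>Because each successor is drawn in proportion to how often it follows the current pattern in the sample, every Order-character sequence in the output is one that occurs in the (wrapped) sample. On a sample this short, an order-4 run will probably regurgitate long stretches of the input; lowering Order to 2 or 3 should give noticeably looser output.</p>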