

Art. 8 Linguistics and Genetics: Systematic Parallels

Wolfgang Raible, University of Freiburg i.Br.

Has appeared in: Haspelmath, Martin & König, Ekkehard & Oesterreicher, Wulf & Raible, Wolfgang (eds.). 2001. Language Typology and Language Universals - Sprachtypologie und sprachliche Universalien - La Typologie des langues et les universaux linguistiques. An International Handbook - Ein internationales Handbuch - Manuel international. Berlin & New York: de Gruyter (Handbücher zur Sprach- und Kommunikationswissenschaft vol 20.1). 103–23. This text is slightly modified with respect to the 2001 version.

Contents

1 Basics
2 Language as a Metaphor in Molecular Biology
3 Linguistic Vocabulary in Microbiology
4 Structural Similarities
5 Program or Encyclopedia?
6 The Awareness of Biologists
7 A Relation between DNA and Language Types?
8 Language Genes?
9 References

“Life depends on the interaction of tens of thousands of genes and their protein products, orchestrated by the regulatory logic of each genome. If we are to comprehend this logic, we must hope that it can be dissected into a series of interlinked modules or networks, each of which can be studied in relative isolation. But even then the complexity of a single module can be daunting. As our knowledge increases, diagrams of gene regulatory networks look increasingly like explosions in a spaghetti factory. We need fresh methods to explore the behaviour of such networks.” (Dearden & Akam 2000: 131.)

1 Basics

In order to understand the following comparison, some facts have to be recalled first.

Any cell – whether it is part of a complex organism or functions as a one-cell organism – contains the so-called “genetic information” necessary for the reproduction and formation of the whole organism. This information is part of the genome embodied in the double helix of deoxyribonucleic acid (DNA). The double helix or “duplex” has two intertwined strands of DNA. Each one is a long polymer of subunits called nucleotides with the four bases adenine, thymine, guanine, and cytosine, abbreviated as A, T, G, and C. The nucleotide T always pairs with A, C with G.
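The pairing rule just stated (T with A, C with G) can be illustrated with a minimal Python sketch. The function name `complement` and the sample strand are hypothetical, chosen only for illustration; the sketch pairs bases position by position and deliberately ignores strand orientation.

```python
# Pairing rule from the text: T pairs with A, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner base for every position of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


if __name__ == "__main__":
    sample = "GATTACA"  # hypothetical sample strand
    print(sample)
    print(complement(sample))
```

Because the rule is symmetric, applying `complement` twice gives back the original strand, which is one way of seeing why either strand of the duplex carries the full information.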

Functionally, nucleotides come as triplets, that is to say that three of them form a codon. Most codons code for one of the 20-odd (21 in the case of man) amino acids, with four of the 64 possible combinations of three nucleotides having the additional function of a start or a stop codon.
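The figure of 64 follows from simple counting: four bases in each of three positions give 4 × 4 × 4 combinations. The short Python sketch below enumerates them. The particular start and stop triplets are not named in the text; they are taken here from the standard genetic code (written in DNA letters) and should be read as an added assumption.

```python
from itertools import product

BASES = "ATGC"

# Every ordered triplet of the four bases: 4 ** 3 = 64 codons.
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Assumption (standard genetic code, not stated in the text):
# one start codon and three stop codons.
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

SIGNAL_CODONS = {c for c in CODONS if c in START_CODONS | STOP_CODONS}

if __name__ == "__main__":
    print(len(CODONS))         # number of possible triplets
    print(sorted(SIGNAL_CODONS))
```

With only 20-odd amino acids to encode, 64 triplets leave considerable room for several codons to stand for the same amino acid.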

The genome – for instance the human one – contains some thirty thousand genes or functional subunits of the DNA-strands. Most of them are responsible for the production of specific chains of amino acids, the so-called polypeptides (proteins), which fold into very specific forms fulfilling crucial functions in living cells.

When a gene coded in a DNA-sequence is transformed into a protein, in a first step an enzyme called RNA-polymerase (RNA = ribonucleic acid) makes a replica of the strand of nucleotides forming the gene in question. This primary replica consisting of RNA is usually subject to further modification: in eukaryotes certain stretches of RNA (termed “introns”) are removed, leaving the remaining “exons” (a Greek term which means ‘going out’) to be fused and finally transported out of the nucleus. Besides that, the ends of the RNA strand are modified, and sometimes specific nucleotides are substituted for others.

Once the RNA molecule has been worked over, it is used by the cell to produce polypeptides. To that end, the RNA molecule – then referred to as “messenger RNA” (mRNA) – is transported to one of the ribosomes in the cytoplasm, i.e. to another enzyme or macromolecule. Using the information contained in the sequence of codons represented by the strand of mRNA, a ribosome produces the corresponding chain of amino acids which folds up into a very specifically shaped protein.
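The three steps described so far (copying DNA into RNA, removing introns, reading codons into amino acids) can be sketched as a toy pipeline in Python. Everything concrete in it is an assumption made for illustration: the toy gene, the intron position, the function names, and the small excerpt of the standard codon table. It is a sketch of the idea, not a model of the cellular machinery.

```python
# Excerpt of the standard genetic code in the mRNA alphabet; only the
# codons used by the toy gene below are listed. None marks a stop codon.
CODON_TABLE = {
    "AUG": "Met", "GGU": "Gly", "CAA": "Gln",
    "ACU": "Thr", "AAA": "Lys", "UAA": None,
}


def transcribe(coding_strand):
    """Copy a DNA coding strand into RNA (U takes the place of T)."""
    return coding_strand.replace("T", "U")


def splice(pre_mrna, introns):
    """Cut out introns given as (start, end) index pairs and fuse the exons."""
    exons, position = [], 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


def translate(mrna):
    """Read the mRNA codon by codon until a stop codon is reached."""
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue is None:  # stop codon: the chain is complete
            break
        chain.append(residue)
    return chain


# Hypothetical toy gene: exon, invented intron, exon ending in a stop codon.
TOY_GENE = "ATGGGT" + "GTAAGT" + "CAAACTAAATAA"

if __name__ == "__main__":
    pre_mrna = transcribe(TOY_GENE)
    mrna = splice(pre_mrna, [(6, 12)])
    print(mrna)
    print("-".join(translate(mrna)))
```

The sketch leaves out the folding step entirely: the list of residues it returns is still a linear sequence, which is exactly the point taken up in § 4.1.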

This basic knowledge should be complemented by contextual knowledge condensed into the following five points (Raible 1993, largely based on Monod 1970):

1. Our cells have two kinds of basic substances: in the domain of the double helix, the substances are the NUCLEOTIDES embodying genetic information. For the rest of the cell, the basic substances are AMINO ACIDS, which are chained into large macromolecules doing the “work” that has to be done in the cell.

2. As regards the interior organization of cells, these two classes of macromolecules belong to two separated domains, with the nucleus harboring the DNA. However, the nucleus is only a feature of EUKARYOTIC cells, the components all multicellular organisms consist of. The cells called PROKARYOTES, for instance bacteria such as the omnipresent Escherichia coli, do not have such a compartmented nucleus. In this case, the genome lies in the cytoplasm.

3. Chemically, the macromolecules in both domains are formed by two different classes of bonds. (1) When their respective units are put together in order to build larger structures, this is achieved by COVALENT BONDS. A covalent bond is a relatively stable, strictly speaking chemical unit where two or more atoms share common electrons. Covalent bonds create chemical configurations. (2) The overwhelming number of bonds which give the molecules their specific shape or conformation are of NON-COVALENT nature: weak interactions, hydrophobic interactions, hydrogen bonds. They result in conformational interactions (as opposed to configurations). The advantage of conformations is that neither their formation nor their dissolution needs much energy. At the same time, the respective processes are very fast.

4. Most of the cytoplasm consists of water. Amino acids are partly hydrophilic, partly hydrophobic. Now when, according to the instructions contained in the chain of mRNA, chains of amino acids are formed by the ribosomes, the whole compound folds up into a three-dimensional macromolecule according to the degree of hydrophilicity or hydrophobicity of its components. Often, so-called scaffolding proteins (chaperones) are used in addition in order to guarantee a specific three-dimensional (steric) structure.

The special kind of folding, mostly stabilized by hydrogen bonds and sometimes also by specific covalent bonds that account for the overall shape of the macromolecule, is the basis of an infinity of specific three-dimensional structures in the domain of complex proteins.

5. The resulting steric form is crucial for the functioning of processes in cells: a very large part of such processes is based on the “recognition” – this is a frequently used metaphor – of molecules by each other: molecule surfaces fit like a positive into a negative, protruding forms fit into caved ones, convex into concave forms, and so on. Here a further common metaphor is that of the key and the lock.

Above all, the low-energy non-covalent bonds (see above § 1.3) and the “recognition” of molecules by fitting forms are responsible for the fact that cells in our bodies are able to perform processes of synthesis and catalysis with an incredible speed, consuming the least possible amount of energy. All these processes are energetically optimized by evolution, thus leaving little room for further economy.

2 Language as a metaphor in molecular biology

Since its beginnings in the middle of the 19th century, molecular biology has been associated with the metaphor of language, especially language in its form of an alphabetic script (Raible 1993: 8–10).

This use of written language as a metaphor has a long history. It starts with the early atomists Leucippus and Democritus (5th century B.C.). Their basic idea was that the whole complex, manifold, beautiful, fragrant and colored world surrounding us is nothing but appearance, whereas in reality all was thought to consist of atoms and the void between them. According to what Aristotle’s Metaphysics tells us about the doctrine of these two Presocratic philosophers, their visible model for the invisible structure of matter was alphabetic script. The variety of the visible world would be due to the fact that the atoms are differently shaped – just as an A differs in shape from an N; that their order may be different – as the sequence AN is different from NA; finally their relative position in space may differ: a rotation of 90 degrees makes an N out of a Z (Aristotle, Metaphysics A4, 985b15ff.).

The central idea behind this conception is the reduction of immense variety to a restricted set of elements (here the 20-odd letters of the Greek alphabet, which account for a possibly infinite variety of written texts).

In 1869, Friedrich Miescher discovered the existence of nucleic acid in the center of living cells. In 1893, just before his death, he put forward the idea that the relation holding between the letters of our alphabet and the enormous number of words their combination results in could explain the relationship between the information contained in the nuclei of our cells and the variety of life forms (“daß aller Reichtum und alle Mannigfaltigkeit erblicher Übertragungen ebenso gut darin ihren Ausdruck finden können, als die Worte und Begriffe aller Sprachen in den 24 bis 30 Buchstaben des Alphabets”; roughly: “that all the wealth and all the variety of hereditary transmission can find expression in it just as well as the words and concepts of all languages can in the 24 to 30 letters of the alphabet”).

It was not until 1943 that the same idea was put forward again, this time by the Austrian physicist Erwin Schrödinger, in a series of lectures given in Dublin under the heading “What is life?” He suggested a genetic alphabet similar to the Morse code. Mutations would be due to mistakes in the process of reading and copying the code. However, it took another decade before the chemical nature of Schrödinger’s alphabet was understood. First, Oswald Avery proved in 1944 that the nucleic acids – and not the proteins – contained the genetic information necessary to unfold the functions of a pneumonic bacterium. When reading Avery’s contribution, Erwin Chargaff even added the idea of a “grammar of biology” (Blumenberg 1981, ch. 22).

Both Miescher and Schrödinger were confirmed in 1953 by the discovery of Francis Crick and James Watson, who showed that the long strands of DNA (Schrödinger’s punched Morse tapes) have the structure of a double helix, and by the series of momentous discoveries that followed this breakthrough. The metaphor of language in the form of alphabetic script has been omnipresent in molecular biology since 1953.

3 Some of the linguistic vocabulary used in the texts of microbiology

In the U.S., the National Center for Biotechnology Information (NCBI) runs a large data base called Medline. Until 1997, this base had a subset in molecular genetics comprising some 700,000 full-text documents starting from about 1966. In 1998, this subset was merged with the general biological data base, the resulting whole now containing some nine million full-text documents. Since this data base permits searching for any expression, it is easy to demonstrate the presence of the language and the script metaphor in the whole range of texts in molecular biology.

Right from the beginning in 1953, the four nucleotide bases abbreviated by A, T, G, and C were called the “letters of the genetic alphabet”. RNA-polymerase is reading (found in ca. 44,500 documents as of 2000; the numbers always cover a period of ten years) DNA-sequences with their reading-frame/s (27,700 docs.). This process is called transcription (81,000 docs. in 1997; 148,100 in 2000 and a total of 212,300 for the family transcr* in 2000), and this happens thanks to transcription factors (92,300 docs. in 2000), a topic needing further comment in a later section (see below § 4.3). Associated with transcription is an immediate process of proofreading or proof reading (700 docs. as of 2000).

The result is called a copy (20,000 docs. in 2000) subject to further editing (2,100) or copy editing (52). The resulting string of mRNA will be translated (20,000 docs. in 1997, 75,400 for translat* in 2000) into a polypeptide. This is made possible because the triplets of nucleotides encode or are coding for amino acids (130,000 docs. for code*/coding in 1997, 253,000 in 2000). The whole process is called gene expression (245,400 docs. as of 2000).

The use of the metaphor does not end here, though. The genomes of many species are currently being deciphered (830 docs. for decipher* as of 2000). The result is stored in large data bases modelling the sequences of nucleotides as sequences of the letters A, T, G, and C. The same is true for protein data bases symbolizing one amino acid by one letter (the sequence mgqtgkk... for instance stands for methionine-glycine-glutamine-threonine-glycine-lysine-lysine...). This is tantamount to saying that sequences of nucleotides or amino acids corresponding to triplets of nucleotides “materialize” – in a somewhat hybrid way – in data bases as sequences of letters. As a consequence, there are 7,100 documents containing database(s) in 2000.
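The one-letter notation used in protein data bases amounts to a simple lookup table. The Python sketch below expands the example sequence from the text; the table contains only the five letters occurring in that example, and the function name `spell_out` is hypothetical.

```python
# One-letter amino acid symbols, restricted to those in the example above.
ONE_LETTER = {
    "m": "methionine",
    "g": "glycine",
    "q": "glutamine",
    "t": "threonine",
    "k": "lysine",
}


def spell_out(sequence):
    """Expand a string of one-letter symbols into full amino acid names."""
    return "-".join(ONE_LETTER[symbol] for symbol in sequence.lower())


if __name__ == "__main__":
    print(spell_out("mgqtgkk"))
```

A complete table would cover all amino acid symbols; the excerpt is enough to show how a protein becomes, in the data base, a string of letters.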

When the molecular section of Medline was still separated from the rest of biology, there were already 27,000 documents containing (gene-)library, libraries; there are 47,000 in 2000 (with 9,400 instances of gene library/ies, 6,950 of genomic library/ies). Recurrent sequences of nucleotide “letters” as well as recurrent sequences of amino acids in proteins are called motif, motifs (12,000 documents in 1997, 29,900 in 2000). Recently a new metaphor has come into increasing use: the genome is like an encyclop(a)edia (54 docs. in 1997, 225 as of 2000). Dictionary is still relatively rare in genetic contexts, though the number of instances is increasing rapidly (913 docs. by 2000). Parts of this encyclopedia may be formatted in a different way, like bold or italic text (see § 4.8).

Large data bases of necessity use classification criteria. In 1999, Medline used about 19,000 hierarchically ordered description terms or “main headings” – even there the linguistic metaphor is most evident: reading frame (classifying 9,073 docs. as of 1999), transcription (68,440), transcription factor (79,704), gene expression (138,773), library/libraries: gene library; genomic library (4,715).

Written language and processes related to writing account for most of the metaphors. Apart from the concept of translation, which is ambiguous in this respect, vocal language is used relatively seldom in this context, the most conspicuous case being the family silenc* with 4,300 occurrences as of 2000: esp. gene silencing (500); a relatively new topic is posttranscriptional gene silencing (33), a phenomenon whose basic mechanisms remain to be fully understood. (Here the usual metaphor is ‘knocking out’ a gene: 9,700 cases of knockout, 1,300 of knock out as of 2000.)

A last metaphor that should be noted is genomic imprinting (3,200 docs. by 2000): it serves to characterize regions in the genome that are marked by methylation: imprinted regions are observed to be more methylated and less transcriptionally active.

4 Structural similarities as the deeper reason behind the language metaphor

Under the above heading ‘basics’ (see above § 1), the present author succeeded in avoiding language as a metaphor (apart from the terms “to code for” and “information”, though), thus showing that it should – at least in principle – be possible to describe some of the fundamental processes in molecular biology without borrowing from language and alphabetic script.

4.1 Linearity as a fundamental problem

Nevertheless, language is not only a metaphor biologists “live by” (without being very aware of the fact that they are using it, though: see below § 6). There is a deeper relationship between the way the “grammar of biology” and the grammar of natural languages work.

In both cases, information is coded into a linear sequence of basic elements, nucleotides in one, phonemes or letters in the other case. Since this coded sequence of elements is of necessity linear, both systems have to cope with the same fundamental problem: how can a one-dimensional medium transmit the information required for the construction of three- (or even more-) dimensional entities?

The complexity of the information to be transmitted becomes evident when we imagine that all of us originated from one single egg-cell whose genome was replicated billions of times by successive divisions of cells, so-called mitoses. Nevertheless, the result was not an amorphous heap of identical cells, i.e. something similar to yeast. Instead, what took shape in the process of embryonic development was a well-structured three-dimensional body. Since this is a process, the outcome of the reading of linear code-sequences was even four-dimensional.

This is tantamount to saying that the code contained in any cell of our emerging body had to be read very selectively such that, by differentiation of cells occurring in the right place and at the right time, a highly complicated organism with highly specialized compounds of cells took shape. The entire information necessary to carry out this process came from our genome (and the specific composition of the egg plasma of the fertilized oocyte, see below § 4.3).

The same kind of process takes place in human speech. What we want to utter are complex ideas and representations – like the present article. Those ideas are broken down into sections encompassing a series of subsections made out of paragraphs, the paragraphs themselves consisting of linked sentences and propositions. Sentences are again broken down into entities called clauses, phrases, words, and the still linear sequence of words consists of letters. This will be the input for readers whose task is to reconstruct, out of a linear sequence of phonemes or letters, the complex representation the speaker or writer started from.

Although the first part of this process – the breaking down of the complex representation into successive hierarchical layers of pieces resulting eventually in a linearized sequence of basic elements – does not have its counterpart in genetics, the subsequent reconstruction by the reader (or hearer) does indeed. In order to achieve this, genetic processes rely on principles strikingly similar to those holding in language.

Whereas the result of this process of reconstruction tends to be rather volatile and elusive in language (readers or hearers may come up with a reconstruction that is somewhat far from the ideas of the speaker or writer), in the genetic counterpart the result of the cellular reading processes is mostly material, visible and palpable: it is a living body. Genetic reading processes tend to be highly reliable thanks, among other things, to a considerable amount of redundancy in genetic information.

4.2 The basic principles in both systems

In both systems, the principles allowing the reconstruction of multi-dimensional wholes from linear sequences of basic elements are identical:

• double articulation,

• different classes of ‘signs’,

• hierarchy,

• combinatorial rules on the different levels of hierarchy, and

• linking the principles of hierarchy and combinatorial rules: wholes are always more than the sum of their parts.


Double articulation means that in language we make words out of sounds (in alphabetic script: out of letters) which, by themselves, are – in principle – meaningless. Only groups of sounds are meaningful, just as only triplets of nucleotides “signify” (in the wording of molecular biology: code for) an amino acid. On the level of the double articulation which comprises hierarchy and functional classes, in genetics amino acids are combined into multiform and multifold proteins showing the specific steric and other properties needed for the functioning of specific cells, with cells themselves combining into functional units called for instance organs, such as the liver or an eye. They are again parts in a larger whole – with the respective whole always exceeding the sum of its smaller components.

In the domain of Life this principle, clearly put forward in the seminal thesis of Émile Boutroux (1st ed. 1875, 2nd ed. 1991), at the same time bridges the seemingly antinomic Cartesian gap between Mind and Matter by introducing the idea of a continuum.

4.3 The discovery of regulating proteins in genetics

Originally, microbiologists thought that there was a clear distinction between the world of nucleotides and the world of proteins. The idea of this distinction was even reinforced by the fact that in eukaryotic cells – as opposed to prokaryotic ones (see above § 1.2) – the nucleus with its genome is separated from the rest of the cytoplasm (see above § 1.1). Since the genome contains genes – i.e. “meaningful” subunits of the long strands of DNA – it was thought that all these genes were responsible for the proteins functioning in the cytoplasm, and giving the cell its specific characteristics. Very soon it became clear, though, that proteins, too, play a most important part when a gene is read, transcribed into a strand of mRNA and translated into a protein.

Firstly, all those enzymes making up the “machinery” of transcribing, proof-reading, editing, and finally translating, are – or mainly consist of – proteins, too. It should not come as a surprise that an egg-cell does not have to start from scratch (or ab ovo ...), but is already equipped with the necessary proteins (like RNA-polymerase, ribosomes, other specific proteins). But it should also be clear that such proteins are themselves products of specific genes lined up in the genome.

Secondly, the machinery reading and transcribing genes has to “know” which genes are to be transcribed – the human genome is supposed to contain more than 30,000 of them. It is beyond any doubt that the specific properties of specific cells and compounds of cells are the result of the expression of very specific genes – the right genes in the right place and at the right time. But from the observation of relatively simple organisms like Escherichia coli, a prokaryote, it became clear that in order to be transcribed, genes have to be marked beforehand by proteins that occupy nearby stretches of DNA, recognizing specific sequences of nucleotides on the DNA, so-called binding sites. These proteins, themselves products of other genes, are called activators or repressors, according to the effect they have on the transcription of the gene in question.

The discovery of such regulatory systems was so novel that it was immediately honored by a Nobel prize (given to Jacques Monod and François Jacob). Eventually it led to the discovery of lots of so-called transcription factors.

Activators and repressors do not only act locally. As we know from eukaryotic gene studies, there are classes of activators and repressors that can also act at a distance of some 1,000 base pairs “upstream” or “downstream” of the respective genes. The corresponding binding sites on the DNA are called enhancers and silencers because the proteins binding to them can multiply the effect of classical activators or repressors, respectively. With one of their ends enhancer or silencer proteins occupy their binding-site on the DNA, with the other one they attach themselves to so-called coactivators or adapter molecules which themselves are linked, by so-called basal factors, to the RNA-polymerase. (It will become clear later why enhancer or repressor proteins can bridge a distance of several thousand base pairs on the DNA: see below § 4.8.)

In the complex of coactivators there is still another protein called TATA binding-protein. It “recognizes” another functional binding-site upstream of the coding sequence of the gene and, by doing this, brings the RNA-polymerase into a position that allows it to be near to or exactly on the start codon signalling the coding region of the gene which is going to be read or transcribed. A protein occupying the binding site of a silencer on the DNA prevents the expression of one or more genes.

The simplest forms of transcriptional control were found in prokaryotes (see above § 1.2) where binding sites for activators and repressors often overlap, thus creating an “on/off” switch regulating gene expression in response to regulatory stimuli. In addition, the genes of prokaryotic cells tend to be “switched on” by default. The presence of an operator then switches off the gene; the exchange of a subunit of RNA-polymerase, the so-called sigma-factor, enables RNA-polymerase to switch to a different subset of genes, for instance under the influence of a heat shock.
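The “on/off” logic of this paragraph can be caricatured in a few lines of Python. This is a deliberately crude sketch under stated assumptions: an overlapping binding site is modelled as a slot that holds at most one protein, the protein that switches a gene off is modelled as a repressor, and the gene names and sigma-factor labels are invented placeholders.

```python
from enum import Enum


class Occupant(Enum):
    """What currently sits on an overlapping binding site (toy model)."""
    NONE = "none"
    ACTIVATOR = "activator"
    REPRESSOR = "repressor"


def is_transcribed(occupant, default_on=True):
    """Overlapping sites hold one protein at a time: a simple switch."""
    if occupant is Occupant.REPRESSOR:
        return False
    if occupant is Occupant.ACTIVATOR:
        return True
    return default_on  # prokaryotic genes tend to be "on" by default


# Hypothetical gene sets readable with a given sigma-factor.
SIGMA_PROGRAMS = {
    "standard": {"gene_a", "gene_b"},
    "heat_shock": {"gene_c"},
}


def readable_genes(sigma_factor):
    """Exchanging the sigma-factor selects a different subset of genes."""
    return SIGMA_PROGRAMS[sigma_factor]


if __name__ == "__main__":
    print(is_transcribed(Occupant.NONE))
    print(is_transcribed(Occupant.REPRESSOR))
    print(sorted(readable_genes("heat_shock")))
```

The eukaryotic situation described next does not reduce to such a single switch, since several binding sites and co-factors act together.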

Apart from marking a certain gene by binding for instance to a specific binding site on the (flexible) DNA, the task of activators is to bind at the same time to a specific protein engaged in the process of reading DNA-sequences, giving it – by allosteric modification – the shape it needs in order to fulfill its specific function, for instance the function of an RNA-polymerase. Often such transcription factors have a third binding site for co-factors (which may be, e.g., hormones coming from adjacent cells). This makes regulation processes in eukaryotes much more complicated and hence much more selective.

As was recalled at the beginning, in these processes nearly everything is brought about by the fact that molecules have or get the appropriate shape in order to fit into certain parts of other molecules (see above § 1.5).

4.4 As have signs in language, DNA and proteins have two different functions

For the genetic code, i.e. the DNA-sequences, this means that they have at least two different functions:

• one of them, represented by the coding sequence of genes, is coding for proteins.

• The other – totally different one – represented by binding-sites, is to passively make possible a functional marking by other proteins whose task is to activate (or to block) the process of reading specific genes.

At the same time this implies that there must be two totally different kinds of genes and, as a consequence, of proteins translated from these genes:

• proteins that give the cell its specific shape and its typical metabolic functions,

• and proteins whose task is to regulate the reading of other genes.

What corresponds to the combinatorial rules, i.e. to the “grammar of biology”, are different classes of such regulatory proteins on different levels of hierarchy corresponding to different classes of binding-sites. Such binding-sites are characterized by specific sequences of nucleotides, the above-mentioned “motifs” discovered on the DNA (see above § 3; below §§ 4.6; 6.1).

4.5 The two categories of signs in language

Linguists know that the same kind of regulatory processes are necessary for the functioning of language. If we take the – relatively simple – example of units we call propositions, we know that they consist of different classes of signs, for instance ‘verbs’ and ‘nouns’. A verb like ‘to give’ has three positions open for signs belonging to the class of nouns: somebody who is giving, something that is given, and one instance receiving what is being given. In English, one of those places may be coded as a subject place, one as an object place, one as the place of an indirect object.

The respective signs fit into these positions because of specific “regulatory” information, for instance case markers (or position markers). There might be other signs such as ‘adjectives’, and there might be such “regulatory mechanisms” as case agreement which could link the right adjective to the appropriate noun, welding the two of them into a higher unit. In this way one can translate all the regulatory information which is present in a chain of linguistic signs into specific positive forms fitting into specific negative forms of other signs.

If we take as an example the first line of the tale Horace is telling us about the town mouse and the country mouse, we have a linear sequence

“[olim] rusticus urbanum murem mus paupere fertur accepisse cavo” (It is said that [once upon a time] the country mouse received the town mouse in its poor hole.)

which, translated into a two-dimensional scheme with appropriate forms, could be represented as in Figure 8.1.

As soon as we cut off the regulatory information (in language, position in a sequence of signs may play an important part in this context), we won’t be able to integrate the parts into a whole any more:

accipe- cavu- ferr- mure- mure- paupere- rusticu- urbanu-.
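The role of the case markers in the Horace line can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word-class and case tags are supplied by hand (they are not the output of any real parser), and the function names `link_by_case` and `strip_markers` are hypothetical.

```python
# Hand-annotated forms from the Horace line (tags are assumptions made
# for this sketch, not produced by a morphological analyser).
WORDS = [
    ("rusticus", "adj", "nom"),
    ("urbanum", "adj", "acc"),
    ("murem", "noun", "acc"),
    ("mus", "noun", "nom"),
    ("paupere", "adj", "abl"),
    ("cavo", "noun", "abl"),
]

def link_by_case(words):
    """Pair each adjective with the noun carrying the same case marker."""
    nouns = {case: form for form, pos, case in words if pos == "noun" and case}
    return {
        form: (nouns.get(case) if case else None)
        for form, pos, case in words
        if pos == "adj"
    }

def strip_markers(words):
    """Remove the case information, keeping only form and word class."""
    return [(form, pos, None) for form, pos, _ in words]

with_markers = link_by_case(WORDS)
without_markers = link_by_case(strip_markers(WORDS))
print(with_markers)
print(without_markers)
```

With the markers, each adjective finds its noun despite the scrambled linear order; without them, no adjective can be attached to anything.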

In language, there are lots of signals below and above the sentence level fulfilling such “regulatory functions”: a conjunction putting two propositions into a relationship, an ‘if’ corresponding to a following ‘then’, an ‘either’ with its subsequent ‘or’; a ‘firstly’ makes us wait for the ‘secondly’, etc. As in the case of the tale told by Horace, a text might start with ‘once upon a time’ (olim in Latin), signalling to the hearer or reader that, with respect to truth functions, what follows has to be understood in a way which is very different from the kind of understanding appropriate for a text starting with “The monthly meeting of the board of IBM took place on Thursday . . . at . . . ”. This is, among other things, why most written texts have at their beginning the name of a textual genre: it automatically favors certain readings while ruling out or blocking other ones.

4.6 Hierarchy: morphogenesis and highest ranking genes

A considerable difference between linguistic texts and genetic ones lies in the fact that we are free to invent new textual genres with their respective rules, whereas “nature” always starts from the same genre we could term, for instance, “morphogenesis”.

All forms of life go back to organisms consisting of one cell. All of them developed out of such a single cell by successive differentiation brought about by mutations and by adaptation to the world inside and outside the respective newly emerged species.

This means that the further we go back in phylogenesis, i.e. to the origins of life, the more we will find similar or even identical solutions. Since in their embryogenesis all living beings repeat the ways and the detours their species has taken during the process of evolution, similar regulatory tasks will be fulfilled by similar means.

Now one of the first of these tasks is the orientation of an emerging body according to an anterior-posterior axis, and this means: the compartmentalization or segmentation of the originating body. A worm, a frog, a mouse, a human being start from one single egg-cell which successively divides into two, four, eight, sixteen, and so on cells, forming, in a first phase, something like a tennis ball. (Instead of tennis ball, biologists have a technical term, blastula, for this state.) Then a phase called gastrulation follows in which the ball takes the shape the tennis ball would have if we pressed our thumb into it, thus giving the former blastula an ectoderm, a mesoderm, and an entoderm, as well as an anterior-posterior axis. This identical topological task has to be fulfilled in all four cases mentioned before.

Biologists have long known that in vertebrates the ectoderm will develop into the skin and the nervous system, that muscles, bones, and blood originate in the mesoderm, and that one of the results of the entoderm will be the formation of the pathway food takes from intake to the final excretion of what the organism cannot use.

Figure 8.1 (diagram not reproduced: the words fertur, accepisse, rusticus mus, urbanum murem, and paupere cavo drawn as interlocking shapes). This kind of representation translates instructions given by grammar into complementary forms.

Around 1983, a momentous discovery was made by microbiologists: the discovery of a series of highest-ranking genes. In the nucleotide sequences of those genes, a shared motif was found, termed the homeobox (geneticists tend to draw boxes around a specific linear sequence of letters which, in their data bases, correspond to nucleotides; in French the expression is clearer: homéoséquence). The corresponding class of genes is called homeobox genes or – abbreviated – hox genes.

Now these hox genes belong to a class of genes whose task is a kind of regulation which brings about a gross compartmentalization of the originating body (called “parasegments” in the case of the fruit fly). Whereas the suppression or, in the language of geneticists, the “knocking-out” of hierarchically inferior genes normally has few, if any, consequences for the phenotype, i.e. the developed body, blocking or suppressing hox genes has major, often lethal consequences.

The bizarre outcomes of compartmentalization defects are illustrated by the names given to some of the hox genes of the fruit fly Drosophila melanogaster, one of the favourite research objects in genetics: nanos (Greek for ‘dwarf’), hunchback, bithorax (a phenotype which, as a consequence, also has two pairs of wings instead of one), trithorax, krüppel (cripple), hedgehog, antennapedia (a fly that has legs instead of antennae), sine oculis (Latin for ‘without eyes’), fushi tarazu (Japanese for ‘one (segment) is lacking’, a lethal mutation), Polycomb (a fly that shows sex combs not only on the first, but also on the other pairs of legs). (The names reflect the international character of research in genetics and/or the education of the researchers.)

What defines the class of hox genes on the level of DNA is the above-mentioned shared motif of 180 nucleotides. The chain of amino acids which corresponds to this homeobox in the translated protein folds into four main so-called alpha-helices. The third of these helices, the “recognition helix”, fits into a specific segment of the DNA, and several amino acids on one face of this helix contact the respective nucleotide bases on the DNA, thus forming one of the above-mentioned transcription factors (see above § 4.3).

The contacting amino acids vary from hox gene to hox gene and hence determine which binding-site on the DNA will be chosen by the individual homeodomain protein. The task of these proteins is to mark certain sections of the DNA and either to trigger or to totally prevent the reading of genes specific for a certain segment of the body (see below § 4.8).

4.7 Linear iconicity of selector genes and cellular memory

It came as a big surprise when geneticists – working on a subset of hox genes involved in the patterning of the anterior-posterior axis of the fruit fly – found that the ordering of these so-called selector genes on the DNA sequence corresponds to the ordering of the parasegments, whose emergence and development they trigger, on the anterior-posterior axis of the body. What was even more exciting is the fact that the genes coding for these homeodomain proteins are found – in the same order – in the genome not only of Drosophila melanogaster, but also of worms (Caenorhabditis elegans), frogs (Xenopus laevis), mice, and humans. (This was the occasion for some more Nobel prizes.) No less surprising was the fact that the DNA sequences – and, above all, the protein sequences resulting from the DNA – were largely identical in the same range of creatures.

This means that the basic signalling system which turned out to be successful in early stages of evolution, resulting in the anterior-posterior orientation and the segmentation of the emerging body, was conserved and is still functioning in the compartmentalization even in species belonging to a much later stage of development. Hence, the selector genes of, say, the fruit fly have their homologues in humans.

The genes belonging to this class are called selector genes because their function is to select and to trigger other, hierarchically lower genes, making possible differentiation in the formation of regions in the emerging body. It is therefore crucial that these genes are activated where they are wanted and switched off where they are not wanted. In this context, at least two equally important tasks have to be fulfilled.

1. Since the regulatory proteins translated from selector genes define regions in an early embryonic state of the originating body, and since these regions are more and more differentiated by successive division of cells, the organism has to ensure that the key regulators retain their transcriptional state throughout cell proliferation. There is good evidence that general cellular mechanisms exist that “freeze” important regulators in their transcriptional state, giving the cell a kind of “memory” (Paro & Hogness 1991) of its identity within the whole. (The consequences of a loss of identity can be seen in cancer cells.)

Although in principle any cell of our body contains all the information needed to reconstruct the whole body, this cellular memory with its enhancing or its blocking activity explains why it is so difficult to clone a body from any cell whatever, and this is why geneticists (and physicians) are so interested in the still undifferentiated, so-called stem-cells of embryos (cf. e.g. Thomson 1998).

2. In addition, the successful gene regulatory system established by selector genes has been adopted for other patterning processes, e.g. in the compartmentalization of the limbs. This means that there may be a certain kind of recursivity in regulatory processes.

Generally speaking, it is evident that a successful genetic program may be used again and again in other parts of the body – witness polarity genes like Sonic hedgehog (first discovered in Drosophila; see above § 4.6), effective both in the polarization of our limbs and our brain, the program providing us with five fingers and five toes, or the program giving our fingers and our toes rounded tips (cf. e.g. Shubin & al. 1997).

Hence, what seems to emerge is, on the one hand, a regulatory mechanism activating, according to different segments of the emerging body, ever more specific genetic subprograms while blocking at the same time other ones; and, on the other hand, a mechanism that makes possible the use of one and the same program in different parts of the body. In case the same program is used in another part of the emerging body, the necessary differences may be brought about, among other things, by the process of mRNA-editing (see above §§ 1 & 3), which corresponds to the context-sensitive processing of linguistic signs.

4.8 Where language and script could still learn from genetics

Up to now, the fact has been stressed that the strands of DNA are linear. This is true as long as we speak for instance of reading and copying a strand of DNA into a strand of mRNA by RNA-polymerase. In this case what happens is a transcription nucleotide by nucleotide (or “letter by letter”).

Nevertheless, it is undeniable that, apart from its functional one-dimensionality, even the strands of DNA, i.e. the double-helix, have steric properties. Whereas the genes, that is the parts of DNA coding for proteins, contain a linear message, other parts, for instance the binding-sites upstream of the coding sequence of the gene, are “recognized” by regulating proteins by virtue of their steric properties.

Apart from the double-sided character of DNA – with some stretches encoding proteins and others functioning as binding-sites for regulatory proteins, which corresponds to a similar bipartition in linguistic signs (see above §§ 1.5 & 4.4) – on closer examination it looks like a necessity that the genetic code should exploit its three-dimensional qualities.

The human genome consists of nearly three billion genetic letters, i.e. base-pairs in the double-helix of DNA. If this duplex were a string only, any one of our cells would contain a thread about 200 cm long with a diameter of 2 nanometers. It should be clear that this thread does not exist in a linear form, though. Instead, there are proteins serving as architectural elements: the double helix winds roughly twice around a complex of histones, forming a nucleosome. (A nucleosome even contains equal masses of DNA and histone proteins.) The resulting series of nucleosomes is a second-order thread much thicker, much more compact and, naturally, considerably shorter than the first-order one. This second-order thread again takes the shape of a helix, making a third-order thread out of it which possibly forms another helix resulting in a fourth-order thread. In this coiled state the DNA is called chromatin, and strands of chromatin are called chromosomes. Thus, the chromosomes visible at mitosis correspond to a highly folded and twisted, densely packed form of DNA.
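The figure of about 200 cm can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: a rise of roughly 0.34 nm per base pair (the usual textbook value for B-form DNA) and two copies of the genome per (diploid) cell.

```python
# Rough length of the DNA thread in one cell.
BASE_PAIRS_PER_GENOME = 3_000_000_000  # "nearly three billion" (from the text)
RISE_PER_BASE_PAIR_NM = 0.34           # assumption: B-form DNA
GENOME_COPIES_PER_CELL = 2             # assumption: diploid cell

def thread_length_cm(base_pairs, rise_nm, copies):
    """Total length of the stretched-out double helix in centimetres."""
    length_nm = base_pairs * rise_nm * copies
    return length_nm * 1e-7  # 1 nm = 1e-7 cm

length_cm = thread_length_cm(
    BASE_PAIRS_PER_GENOME, RISE_PER_BASE_PAIR_NM, GENOME_COPIES_PER_CELL
)
print(f"{length_cm:.0f} cm")
```

Under these assumptions the result is on the order of two metres, in line with the estimate given above.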

The different degrees of compacted DNA first explain why enhancer or silencer proteins, although binding on the DNA at a distance of several thousand base pairs, can short-cut the long way to the gene to be expressed or repressed: in its chromatin state, there is a high probability for distant regions to get into each other’s vicinity (see above § 4.3).

At the same time, the compact state of the DNA is important for the general understanding of regulation processes described above (see §§ 4.6 & 4.7) because the DNA of genes – which has to be transcribed into strands of mRNA – is not very well accessible in the compact and coiled state of chromatin. In order to be accessible, parts of the chromatin structure have to be unwound.

The genetic system profits from these different states of DNA. It proves advantageous to tag parts of the DNA already in its compact chromatin state as “readable” or “not readable”. This makes the access to the strands of DNA which are relevant for specific cells or cell groups very economic. At the same time, it facilitates the phenomenon called “cellular memory” by making inheritable the transcriptional status of the cell in the process of division (see above § 4.7).

This is exactly what can be observed for instance in Drosophila melanogaster (e.g. Cavalli & Paro 1998). Here at least two different sets of regulatory genes belonging to the trithorax and to the Polycomb group participate in this high order tagging of chromatin. The Polycomb group of genes – as an example – is responsible for HERITABLE SILENCING throughout development (cf. e.g. Paro & al. 1998; Pirrotta 1998; 1997; Shao & al. 1999). This means the maintenance of the repressed (“not readable”) state of selector and other homeotic genes.

Indeed, it was found that the Polycomb group proteins are compacting repressed genes in the form of a second or third-order thread of DNA. A different subset of regulatory proteins, the trithorax group, seem to be able to KEEP ACTIVE regions of DNA in an open, “readable” form, thereby helping to maintain a “memory” for the activated key regulators in a region of the body.
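The idea of tagging chromatin as “readable” or “not readable” and passing these tags on at cell division can be illustrated with a deliberately crude toy model. Everything in it is an assumption made for illustration: the class `Cell`, its methods, and the gene names are hypothetical, and real chromatin regulation is far more complex.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    # Chromatin tags: gene name -> True ("readable") or False ("not readable").
    tags: dict = field(default_factory=dict)

    def silence(self, gene):
        """Polycomb-like role in this toy model: mark a gene as not readable."""
        self.tags[gene] = False

    def keep_active(self, gene):
        """trithorax-like role in this toy model: mark a gene as readable."""
        self.tags[gene] = True

    def readable(self, gene):
        return self.tags.get(gene, False)

    def divide(self):
        """Both daughter cells inherit a copy of the parent's tags."""
        return Cell(dict(self.tags)), Cell(dict(self.tags))

founder = Cell()
founder.keep_active("selector_A")   # hypothetical gene names
founder.silence("selector_B")

daughter, _ = founder.divide()
granddaughter, _ = daughter.divide()
print(granddaughter.readable("selector_A"), granddaughter.readable("selector_B"))
```

The point of the sketch is only that the transcriptional status survives successive divisions without being re-established each time, which is what the “memory” metaphor expresses.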

In short, this adds another hierarchic level to the regulatory system of development: while we have already seen that proteins can bind to regions of DNA in order to activate or to repress the transcription of a gene, this class of regulators can “freeze” the transcriptional state of genes on a structural basis. They presuppose a molecular machinery operating not on the level of linearized DNA strands, but on the higher level of chromatin, i.e. a ‘chromatin’ or ‘nucleosome remodeling machine’. Biologists have termed it ‘chromatin accessibility complex’, abbreviated as CHRAC (e.g. Varga-Weisz & al. 1997; Varga-Weisz & Becker 1998).

For geneticists this is at the same time tantamount to saying that there is a considerable difference between the processes observed in vitro and the in vivo processes taking place in living cells. This is why those working on chromatin remodeling try to artificially create an in vivo environment in their in vitro experiments. At the same time, signalling on the chromatin level would explain the phenomenon of linear iconicity observed on the level of high ranking selector genes (see above § 4.7): there would be no advantage if all kinds of genes were scattered in a random way along the DNA. Instead, it should prove advantageous if groups of subordinate genes were relatively close to the regulating higher ranking genes.

In this context, it is not without interest to have a look at the evolution of writing. Whereas spoken language of necessity is one-dimensional, written language comes in two dimensions. In the history of alphabetic script, we can observe an increasing tendency to exploit what is made possible by the existence of a second dimension. Western alphabetic script starts as scriptio continua, that is to say as a single thread of letters without spaces. This translates the continuous stream of speech into a continuous thread of letters.

But there is evolution in alphabetic script. In the first centuries A.D. we observe the first attempts at punctuation. Starting from the 8th century, spaces between words are generalized, making words visible. Both achievements enormously facilitate the process of reading. A true revolution takes place in the epoch of scholasticism. By 1200, scholastic manuscripts exhibit all the achievements we are inclined to attribute to the invention of printing in the 15th century. Writers use different colors and different fonts. They mark paragraphs (alineas) and generalize punctuation. Chapters get titles which are enhanced and which reappear in a table of contents, at the beginning or at the end of the text. Alphabetically ordered registers are invented. At the top of the pages we have running column titles; in the margins we find summaries of the steps of argumentation in tiny script.

The result of this new kind of layout is a quick and easy access to information. Readers are able to get an idea of the structure of even very long texts instead of deciphering long and monotonous strands of letters (a process obligatorily linked with reading out loud). A reader opening a book showing this kind of layout knows at every moment where he or she is in his or her reading process which, since 1200, has become more and more silent (Raible 1991; 1994).

In comparing this evolution in alphabetic script with the genetic information system, we see that to a certain extent the layout of texts follows principles already realized in the densely packed and coiled “genetic texts”. As may be expected (see above § 3), geneticists themselves draw once again on the comparison with written text: “The idea of a combinatorial ‘code’ of histone modifications has been proposed to complement the information stored in DNA sequences, in much the same way that highlighting written words in bold or italics complements the information that they carry” (Paro 2000: 579). Others speak of “the language of histone modifications” or of a “histone code” (Strahl & Allis 2000).

At the same time it is evident that, in this respect, the possibilities of written texts are limited since written texts are two-dimensional whereas chromatin has three dimensions. Nevertheless, it could be that there is still some surprise left in the further evolution and the layout of written texts. (Up to now, hypertext only implements techniques known since 1200, making them much more efficient, though.)

4.9 What is similar and what is different in both systems?

In spite of all structural similarities mentioned (for the basic principles: see above § 4.2), there are some major differences between the linguistic and the genetic system, too.

1. Sender and receiver in genetics? The point that has been made up to now is a striking and systematic similarity due to the fact that both the linguistic and the genetic system have to cope with the same basic problem: to construct polydimensionality out of a linearized code. One should not overstress this similarity, though, by asking for instance for the sender or the receiver in genetics. There is nothing that corresponds to the two autonomous psycho-physical systems acting as speaker and hearer in human communication.

2. A different role of economy. Above it was said that cellular processes are energetically optimized by evolution (see above § 1, in fine). Since human language comes from a sender and goes to a hearer, human communication always implies human beings, i.e. two psycho-physical systems. Hence, economy is a major factor in human communication and one of the inevitable activators of language change. In genetics, the potential for economically induced change in an already economically optimized system – lacking the liabilities of psycho-physical systems – is minimal. This means that change in genetic systems is brought about by different factors.

At the same time this means that authors are mistaken when explicitly denying the status of a language to DNA by arguing that the law of Zipf – based on economy: frequent signs tend to be shortened – does not apply to genetics (e.g. Tsonis & al. 1997).

3. Synonymy and homophony vs. polysemy. Both the genetic system and the systems of natural languages have double articulation. On the second level, the genetic system makes use of the 20-odd amino acids where the natural languages use their word-signs. There is a decisive difference between the two systems on this level, though: in the genetic system, there exist $4^3 = 64$ possible triplets of the four “letters” A, T, G, and C to which correspond only 20-odd amino acids. This means that there is “synonymy” in the genetic system, whereas true synonymy is extremely rare in natural languages. What is normal instead is polysemy and even homonymy: the number of necessary signs is so high that, as a means of economy, signs tend to have more than one meaning (in the case of homonymy the meanings are even totally different). Witness such examples as English ‘lies’ or ‘fly’. Natural languages heavily rely upon the syntactic and/or semantic context for disambiguation.

4. Volatile vs. palpable sense. As was already mentioned (see above § 4.1), the result of gene transcription and expression is something material and palpable as in the case of our bodies, whereas the result of the reading process for instance in the case of a novel remains something highly immaterial and volatile, subject to interpretation by different readers. This depends, however, on the text genre: reading a patent specification should lead to something rather concrete. On the other hand genetically inherited properties (like behaviour) may be “volatile” as well.

5. Selective vs. linear reading (see above § 4.1). In the domain of genetics, the same “text” exists up to billions of times in one and the same body, and it is read very selectively in any cell. Whilst the regulating elements in texts of natural language facilitate the reconstruction of a sense out of linearly scattered elements, needing a mostly linear reading of the text, regulatory proteins are designed to make possible a specific and selective reading of selected genes in one cell – such that, nevertheless, the aggregation of individual cells forms an ordered and functioning whole which is far more than the sum of its parts (see above § 4.2). Nevertheless, printed texts increasingly enable (depending on the genre, though) selective reading techniques, too (see above § 4.8).

6. Creativity vs. replication. The primary aim of genetics is high fidelity replication. Texts in natural languages may express whatever comes into our minds – facts as well as fiction, description of what is as well as description of what will never come into existence, and so on. Nevertheless, evolution shows us that there is another kind of potential creativity in the genetic code: it allows for mutations due largely to flaws in the reading process. The fate of such mutations may be determined in terms of trial, success, and error.
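The “synonymy” noted under point 3 can be made concrete by counting how many triplets map onto each amino acid. The sketch below assumes the standard genetic code in its conventional compact notation (one letter per triplet, bases ordered T, C, A, G, with `*` marking a stop signal); the variable names are chosen for this illustration only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one amino-acid letter per triplet in TCAG order;
# "*" marks a stop signal rather than an amino acid.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
synonym_counts = Counter(CODON_TABLE.values())
n_amino_acids = len(synonym_counts) - 1  # leave out the stop signal

print(len(CODON_TABLE), n_amino_acids)
print(synonym_counts["L"], synonym_counts["W"])
```

In this table leucine (L) is written by six different triplets, whereas tryptophan (W) has exactly one: many “words” for one “meaning”, but never one triplet with two meanings.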


5 The genome as a program or as an encyclopedia

If the genome is seen as a text and the body of a eukaryote as what corresponds to its “sense”, we might ask ourselves what the nature or genre of DNA is as a text.

In describing genetic processes, the word ‘program’ was already used (e.g. above § 4.7.2). This is one possibility. Computer programs are texts, too (Raible 1999: 19ff.); they consist of command lines, mostly so-called conditioned instructions (“if X and not Y or R, then (do) F”). The activation of RNA-polymerase on a certain gene follows exactly this pattern: “IF activator_a AND activator_b AND NOT repressor_c AND . . . AND . . . THEN READ gene_f”. “If anything, the language of the genes is much more like a programming language whose constraints we do not know (or whose programs we do not know)” (Berwick 1996: 295).
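The conditioned instruction quoted above can be written out as a small sketch. It is meant purely as an illustration of the pattern: the rule format, the function `should_read`, and the factor names are hypothetical and do not model any particular gene.

```python
def should_read(gene_rule, bound_factors):
    """True if every required activator is bound and no repressor is bound."""
    activators_present = gene_rule["activators"] <= bound_factors
    repressor_present = bool(gene_rule["repressors"] & bound_factors)
    return activators_present and not repressor_present

# Hypothetical rule mirroring the quoted pattern:
# IF activator_a AND activator_b AND NOT repressor_c THEN READ gene_f
GENE_F = {
    "activators": {"activator_a", "activator_b"},
    "repressors": {"repressor_c"},
}

print(should_read(GENE_F, {"activator_a", "activator_b"}))
print(should_read(GENE_F, {"activator_a", "activator_b", "repressor_c"}))
print(should_read(GENE_F, {"activator_a"}))
```

Only the first call satisfies the condition; in the second a repressor blocks the reading, and in the third an activator is missing.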

The regulatory signals in the above example of Horace (see above § 4.5) might be interpreted in the same way, i.e. as instructions in a program – this is congruent with the interpretation of grammatical information as instructions given to the reader or hearer (e.g. Weinrich 1993: 17 etc.).

Whilst the metaphor of a program may apply to both the “grammar of biology” and the grammar of human language, another type of text, a non-algorithmic one this time, might be used and is used as well: the metaphor of an encyclopedia (see above § 3, in fine). Indeed, the encyclopedia shares with the program the feature that linear reading is not necessary: thanks to the fact that hierarchy is a characteristic of programming languages, too, the procedures of a computer program may be written in any order whatever. It is necessary however to link (for instance by the instruction GOSUB) those procedures to instructions on a higher level.

This is tantamount to saying that, if we choose the metaphor of an encyclopedia, the genome is a very specific encyclopedia: one that comprises articles (genes) giving instructions as to what other articles (genes) are to be read, at what time this should happen, and under which specific conditions. Like procedures in a program that are not linked any more to instructions on a higher level, in such an encyclopedia there might be articles no one will read any longer since they are not linked any more to reading instructions coming from a hierarchically higher level. This is why deciphering all the genes of a genome can only be a first step in a still very lengthy investigation: in any event one has to determine whether there is at least one superordinate gene triggering the reading process of this specific gene. If this is not the case, the gene is like an unused procedure in a program: apart from making the program text longer, it does no harm at all; at best, it reflects an anterior state in the process of building such a program.
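The comparison with unused procedures suggests a simple check: starting from the highest-ranking genes, follow the reading instructions and see which genes are never reached. The following sketch does this with a breadth-first search over an invented regulatory network; the gene names and the function `unread_genes` are assumptions made for illustration.

```python
from collections import deque

# Hypothetical reading instructions: regulator -> genes whose reading it triggers.
TRIGGERS = {
    "master": ["selector_1", "selector_2"],
    "selector_1": ["structural_a"],
    "selector_2": ["structural_b"],
    "orphan_regulator": ["structural_c"],
}

def unread_genes(triggers, top_level):
    """Genes that no chain of reading instructions from the top level reaches."""
    all_genes = set(triggers) | {g for targets in triggers.values() for g in targets}
    seen, queue = set(top_level), deque(top_level)
    while queue:
        for target in triggers.get(queue.popleft(), []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return all_genes - seen

unused = sorted(unread_genes(TRIGGERS, ["master"]))
print(unused)
```

Here `orphan_regulator` and the gene it would trigger are present in the “text” but are never read, like articles in an encyclopedia to which no other article refers.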

6 Are biologists aware of the structural similarities?

The vast majority of scientists doing research in microbiology are familiar with the metaphors outlined above in sections 2 & 3, and they use them whenever they speak of the basic processes sketched in section 1. Seldom are they aware, though, of the basic problem posed by the linearity of DNA-code as opposed to the three or even four dimensions of the resulting phenotype (see above § 4.1), one of the rare exceptions being e.g. Walter Gehring (1985: 137).

Nevertheless, some biologists tackle the problems of DNA-coding with linguistic means. They fall into two groups, one of them being more speculative, the other one aiming at practical and directly viable results.

6.1 Detecting regularities in DNA by linguistic methods

If we take the example of deciphering the human genome, one of the most prominent aims of microbiology is detecting coding sequences (genes) and binding sites in the interminable strands of DNA: at most 5% (others speak of 10%) of it is thought to represent genes; the rest has different functions. In this case, the idea of writing e.g. an algorithm for the detection of protein coding sequences in the DNA suggests itself. “The usefulness of a grammar for the representation of biological knowledge is now amply acknowledged” (Bentolila 1996: 336) – the number of laboratories working on this topic seems to be rather small, though.

Let us take the following five rewriting rules where s stands for a non-terminal symbol, A, C, G, T for the nucleotides, and ε for an empty string (Searls 1997: 334f.; in a most explicit way Searls 2002. – The symbol s can be replaced by the symbols to the right of the arrow):

s → GsC    s → CsG    s → AsT    s → TsA    s → ε

The repeated application of these rules leads to strings like:

s ⇒ AsT ⇒ AAsTT ⇒ AACsGTT ⇒ AACTsAGTT ⇒ AACTCsGAGTT, and by application of the empty element rule, eventually the string AACTCGAGTT.
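The derivation above can be reproduced mechanically. The following sketch applies one rule of the form s → XsY per chosen letter and then the empty-string rule; it also checks whether a given string could have been produced by the five rules, which is the case exactly when the string equals its own reverse complement. The function names are hypothetical.

```python
# Each rule s -> XsY pairs a nucleotide X with its complement Y.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def derive(left_letters):
    """Apply s -> XsY once per letter, then s -> ε; return all intermediate strings."""
    steps, left, right = ["s"], "", ""
    for x in left_letters:
        left, right = left + x, PAIR[x] + right
        steps.append(left + "s" + right)
    steps.append(left + right)  # empty-string rule removes s
    return steps

def generated_by_grammar(seq):
    """True if seq is derivable from s, i.e. equals its own reverse complement."""
    if len(seq) % 2:
        return False
    return all(PAIR.get(a) == b for a, b in zip(seq, reversed(seq)))

steps = derive("AACTC")
print(" => ".join(steps))
print(generated_by_grammar("AACTCGAGTT"), generated_by_grammar("AACTCGAGTA"))
```

The nesting of the pairs (first with last, second with second-to-last, and so on) is what makes the dependency hierarchical rather than a matter of adjacent letters.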

Let us assume that after an intermediate string of, say, 30 nucleotides the motif AACTCGAGTT recurs, and let us further assume that this entire string with the repeated motif is transcribed into a string of RNA. In all probability it will then take the shape of a loop or a leaf: the two repeated sequences – being dyad symmetric thanks to the underlying rewriting rules – match perfectly in the reverse order (the final T of the first one binding with the first A of the second one, and so on) and hence will form the stalk, the intermediate 30 nucleotides making the contours of the leaf.
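The pairing described in this paragraph can be traced position by position. The sketch below is a simplification under stated assumptions: the RNA is obtained by merely replacing T with U, the 30 intermediate nucleotides are an arbitrary filler, and only perfect Watson-Crick pairs are considered. The function `stem_pairs` is hypothetical.

```python
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def stem_pairs(rna, motif_len, loop_len):
    """Index pairs (i, j) forming the stalk, or [] if the two copies cannot pair."""
    second_start = motif_len + loop_len
    if len(rna) < second_start + motif_len:
        return []
    pairs = []
    for k in range(motif_len):
        i = motif_len - 1 - k   # first copy, read from its last base backwards
        j = second_start + k    # second copy, read from its first base forwards
        if RNA_PAIR.get(rna[i]) != rna[j]:
            return []
        pairs.append((i, j))
    return pairs

motif = "AACTCGAGTT".replace("T", "U")  # motif from the text, written as RNA
loop = "A" * 30                         # arbitrary filler (assumption)
pairs = stem_pairs(motif + loop + motif, len(motif), len(loop))
print(len(pairs), pairs[0])
```

The first pair joins the last base of the first copy to the first base of the second copy, and the remaining pairs follow outwards, so that the 30 intermediate nucleotides are left as the unpaired loop.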

Such dyad symmetric repeated motifs are real and extremely important: they account for the typical cloverleaf-structure of a kind of RNA – termed transfer RNA or tRNA – specialized in “recognizing” amino acids and transporting them to the ribosomes where they are chained to polypeptides according to the information given by a string of mRNA (see above § 1). The same kind or other types of symmetrical motifs often characterize binding sites for proteins on the DNA.

Since one can make overt by such methods “hidden” regularities, lots of approaches aim at finding such properties – e.g. motifs in the DNA, “parsing” of genes (yielding for instance exons and introns; e.g. Dong & Searls 1994; Jiménez Montaño 1994; Ratner & Amikishiev 1996; Asai & al. 1998); the recognition of regulatory regions (e.g. Rosenblueth & al. 1996), even the entire regulation process of a gene (Bentolila 1996; Collado-Vides 1996 & 1996a; Pérez Rueda & Collado-Vides 2000; van Helden & al. 2000).

So-called hidden Markov models (HMMs), algorithms shaped according to the models of context-free grammar, turned out to be powerful tools for the detection of so-called homologs, viz. chains of amino acids sharing common function and evolutionary ancestry without being entirely identical, since the function of divergent proteins may be conserved through evolution even though sequence elements are free to change in some areas. The concept of HMMs, based on the similarity of protein families (or of the underlying DNA), is used to statistically describe the so-called consensus sequence of a protein family and to detect new members belonging to the same family.
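How an HMM assigns a probability to a sequence can be shown with a toy example. The sketch below implements the standard forward algorithm for a two-state model; the states and all probabilities are invented for illustration and have nothing to do with real profile HMMs such as those used for protein families. Sequences are assumed to be non-empty.

```python
# Toy two-state HMM over nucleotides; every number here is an invented example.
STATES = ("conserved", "variable")
START = {"conserved": 0.5, "variable": 0.5}
TRANS = {
    "conserved": {"conserved": 0.9, "variable": 0.1},
    "variable": {"conserved": 0.2, "variable": 0.8},
}
EMIT = {
    "conserved": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "variable": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def sequence_probability(seq):
    """Forward algorithm: probability of seq summed over all hidden state paths."""
    forward = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        forward = {
            s: EMIT[s][symbol] * sum(forward[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(forward.values())

print(sequence_probability("AAA") > sequence_probability("CGT"))
```

Each step of the loop looks only at the transition from the immediately preceding position, which is the sense in which such models work on a local, linear basis.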

Nevertheless, HMMs operate, by definition, on the probability of linear transitions between elements (nucleotides, amino acids, letters, sounds, words). This makes them very useful on a purely local basis, e.g. in the recognition of words from sound patterns in speech recognition programs. An HMM based on sound patterns cannot detect relations holding between words, though, to say nothing of relations between non-adjacent words. In the same way, it is beyond the scope of an HMM detecting consensus sequences of proteins to tell us anything about why there will be, in the folded state of the respective polypeptide chain, a hydrogen bond between, say, amino acids 14 and 53, or why the zinc atom in amino acid 55 will create a covalent bond with elements of e.g. amino acids 32, 111, 145 and 199 (for these kinds of bonds, see above § 1.3).

This is why context-sensitive grammars are more adequate in genetics proper. Admitting, among other things, recursivity, they are being applied not only in DNA research; they prove helpful in the analysis of protein sequences, too. All of them operate on DNA or polypeptides represented as strings of letters (see above § 3). The more known results (frequent DNA or protein motifs, binding sites, genes with their location, their reading frames and sequence structure, DNA or protein consensus sequences, etc.) are integrated into the “dictionary” of such algorithms, the more powerful they become. A good example is Simone Bentolila (1996) or, generally speaking, analyses making use of large data bases, e.g. a data base of protein consensus sequences (e.g. the Protein families database Pfam, containing 2,290 families as of June 2000) or data bases of regulatory sequences (e.g. Pérez Rueda & Collado-Vides 2000 for E. coli; van Helden & al. 2000 for yeast).

All these approaches indirectly show that the genetic system has the properties of a true language with its “grammar” or grammatical regularities. The type of rewriting rules leading to dyad symmetry in the above example was even used by Noam Chomsky in 1957 in order to show that such strings cannot be generated by linear processing and that, instead, hierarchy (genes that function on specific levels), categories (in genetics: ‘activator’/‘promoter’, ‘enhancer’, ‘silencer’/‘repressor’, ‘operator’, ‘inductor’, ‘binding site’, etc.) and transformation rules are necessary.
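The kind of rewriting rule at issue can be sketched as follows. The grammar assumed here is S → x S x′ | ε, where x′ is the base complementary to x; it is a reconstruction for illustration, not a quotation of the rules given earlier in the article or of Chomsky's own notation. The recursive function mirrors the rule directly: each call strips one matching pair from the outside and hands the nested remainder to the next call, so that the first and the last letter of the string depend on each other however long the material between them is.

```python
# Illustrative sketch; the grammar and the function name are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def derives(s: str) -> bool:
    """True if s can be derived from S with the rules S -> x S x' | empty."""
    if s == "":
        return True   # S -> empty
    if len(s) == 1:
        return False  # a single letter cannot form a pair
    # S -> x S x': the outer letters must pair, the inside must again be an S
    return COMPLEMENT.get(s[0]) == s[-1] and derives(s[1:-1])

print(derives("GGACGTCC"))  # nested pairs G-C, G-C, A-T, C-G
print(derives("GGACGTCA"))  # outermost pair G-A does not match
```

A device that only inspects neighbouring letters would have to remember an unbounded amount of left context to make the same decision.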

Nevertheless, one basic problem persists: these algorithms are essentially bottom-up models holding for cells (e.g. Escherichia coli), at the most for cell assemblies (yeast) or cell compounds. The complementary top-down component necessary to understand the processes of embryogenesis – with its hierarchical levels and corresponding categories – still remains in the dark or, at best, in twilight. Now that the human genome has been successfully ‘deciphered’, the true complexity of the ‘grammar of biology’ will become visible.

6.2 General aspects of a “grammar of biology”

Macroscopic views on the matter tend to remain somewhat more general and more speculative.

When Lucien Tesnière developed his concept of a dependency grammar, one of the central metaphors he used – valency – came from chemistry. Signs belonging to the class of verbs have up to three free valencies for signs belonging to the class of nouns. According to the kind of valency, they may play different parts – subject, direct object, and so on. The scheme in figure 8.1 (see above § 4.5) re-translated this into appropriate forms.

As linguists discovered the usefulness of chemically inspired concepts for their own domain, chemists discovered language as a metaphor – not only in the domain of genetics (see above § 2), but in chemistry in general. Witness for instance the approaches made by Pierre Laszlo (1993; cf. 1995; 1986) or by Claude Kordon (1993) in biochemistry. In this kind of consideration the division of compounds into conformations and configurations (see above § 1.3) may play an important part. Conformations are, for instance, interpreted as one of the two articulations in chemical language, configurations as the other one.

Such approaches suggest e.g. a “molecular grammar” consisting of the rules that govern the assembling of molecular units into messengers and their higher-order structures such as hormones, DNA-binding proteins or transcription complexes (Ratner 1993; Ji 1997: 21); the authors are able to discover in the genetic system nearly all the features attributed to human language by Charles Hockett (1960; cf. e.g. Ji 1997: 21, 23–8).

Speculative as they are, the authors of such contributions at least tackle a problem normally not mentioned by the first group of authors. If at most 5 to 10% of the DNA contained in a genome is represented by genes, what is the function of the rest? Answers must take into consideration the difference between in vivo and in vitro analysis of the genome, i.e. the above-mentioned fact that the normal state of the genome is chromatin, not a first-order thread of DNA. Many of the regulatory proteins produced by selector genes of the homeobox class bind to the DNA in its chromatin state, “freezing” it and hence either keeping the genes contained in long sections of DNA open for further reading in this part of the body, or preventing them from being expressed (see above § 4.8).

Answers could be given taking into account the topology of chromatin and the physical properties the DNA needs both for bending in a certain way (e.g. around histones, see above § 4.8) and for bringing classes of genes and classes of binding sites into a good position in the second-, third- and perhaps even fourth-order thread the DNA represents in its chromatin state. This conforms to the fact that – in terms of base pairs on the first-order thread of DNA – enhancers or silencers often are at a very great distance from the gene or group of genes whose expression they activate or repress (see above § 4.3). Research done in the domain of ‘chromatin remodeling machines’ (see above § 4.8) or the so-called ‘matrix- or scaffold-associating regions’ of chromatin (dubbed MARs and SARs) gives hints in this direction: the architecture of chromatin in its different states is turning out to be crucial for transcription processes (e.g. Maric & al. 1998, Girard & al. 1998).

In this context, partisans of a theory of cell language make far-reaching hypotheses, assuming for instance that 90 to 95% of DNA could incorporate “spatiotemporal genes”, with their function being “the control of the folding patterns (or conformational states) of DNA and the topology of chromosomes” (Ji 1997: 32). The non-coding parts of DNA in eukaryotic genomes are thought to encode “a language which programs organismal growth and development” (Bodnar & al. 1997).

7 A mediated link between kinds of DNA and types of human languages

Strictly speaking, sections 7 and 8 do not belong to the topic of this article in its narrow sense. Nevertheless, they have been added because the issue they deal with is not only related to both the present volume and the present article, but also of more general interest for linguistics.

Starting from about 1965, Luigi Luca Cavalli-Sforza published a lot of biometric studies which were increasingly based on DNA analysis. They rely upon the fact that there is always variation in the genome of the same species. A basic example: since we have synonymy in the genetic code (see above § 4.9.3), the same amino acid may be encoded by two or more different triplets of nucleotides. On a higher level, genes may have alleles at chromosomal loci which often lead to more or less visible differences in the phenotype (e.g. diseases). Another method assesses the variation among microsatellites which – in eukaryotes – occur either as repeated codons in genes or as highly repetitive non-coding sequences of 10 to 50 groups consisting e.g. of the nucleotides AC or ACCC which are scattered over the genome (Moxon & Wills 1999).
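What "variation among microsatellites" means in practice can be shown with a short sketch. It searches a sequence for tandem runs of a repeat unit and reports how often the unit is repeated; two individuals may then differ in nothing but that number. The repeat units AC and ACCC and the range of 10 to 50 repeats come from the description above, whereas the function name, the flanking bases and the two toy alleles are hypothetical.

```python
import re

def find_microsatellites(seq, units=("AC", "ACCC"), min_repeats=10, max_repeats=50):
    """Return (unit, start, repeat_count) for each non-overlapping match of a
    unit repeated in tandem between min_repeats and max_repeats times."""
    hits = []
    for unit in units:
        pattern = re.compile("(?:%s){%d,%d}" % (re.escape(unit), min_repeats, max_repeats))
        for match in pattern.finditer(seq):
            repeats = (match.end() - match.start()) // len(unit)
            hits.append((unit, match.start(), repeats))
    return hits

# Two hypothetical alleles of the same locus, differing only in repeat number.
allele_a = "GGT" + "AC" * 12 + "TTG"
allele_b = "GGT" + "AC" * 15 + "TTG"
print(find_microsatellites(allele_a))
print(find_microsatellites(allele_b))
```

Counting such repeat numbers across many loci and many individuals yields the kind of data on which the statistical comparisons of populations are based.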

Most eukaryotic cells have their “power plants”, the mitochondria. They are organelles in the cytoplasm and must once have been independent cells, since they have their own genome, the mitochondrial DNA (mtDNA). There exist “dialects” in mtDNA, and since we always inherit it from our mothers, studying mtDNA in large samples which are then statistically processed leads to further hypotheses as to the diachrony of genetic variety.

Now since differences in the phenotype often – albeit never of necessity – correspond to differences in language, one can try to map genetic variety (mtDNA, alleles, microsatellites) onto linguistic variety. In nearly every case this leads to macroscopic results, suggesting for instance that Amerinds must have come in three immigration waves from Asia across the Bering Strait – then a land passage – into America. There is for instance a T-allele (a T instead of a C) at a certain locus on the (male) Y-chromosome that occurs only in the Western Hemisphere, i.e. the Americas. Some 90% of South America’s indigenous people and 50% of those in North America share that genetic marker due to a common ancestor (Underhill & al. 1996).

On a less macroscopic level, the mtDNA method is not very selective because of a simple social fact. Roman Jakobson stated that there are three kinds of communication: through language, through goods and services – and, an insight based on Claude Lévi-Strauss, through women. Given both the normal, peaceful exchange of women and the frequent abduction (witness the rape of the Sabine women as one of the innumerable cases), mtDNA analysis cannot but lead to coarse-grained results. Specializing on alleles, e.g. on the (male) Y chromosome, and on microsatellites, instead, shows more precise results suggesting for instance that men were far more sedentary throughout history.

Linking – in an indirect way – the interpretation of variation in the human genome with linguistic diversity suggests that the common ancestors of mankind lived in Africa about 200,000 years ago, that the spreading of the human species must have started from there about 150,000 years ago, and that West Asia was first settled around 100,000 years ago; that Oceania was occupied first from Africa, more or less at the same time as East Asia, and that from East Asia both Europe and America were settled.

Prehistoric human colonization in the Pacific seems to have happened in two phases, the second one in an express-train-like manner starting only 6,000 years ago (Gray & Jordan 2000, based on the comparison of 5,185 lexical items in 77 languages projected onto the results of mtDNA analysis; Cann 2000).

As to Europe (which, genetically speaking, is relatively homogeneous), the observable gradients in the analysis of DNA variance suggest the spread of agriculture from the Middle East in the period 10,000–6,000, a migration to the north (Uralic languages), and a migration from the region below the Urals and above the Caucasus to most of Europe (Indo-European languages). At the same time it becomes clear that, in the meantime, e.g. Lapps and Finns genetically became rather europeanized (Jin & al. 1999; Cavalli-Sforza 1997: 7719f., Cavalli-Sforza & al. 1994; 1993).

It should be clear, however, that there are elements of uncertainty. One of them is the dependence of the molecular clock on assumptions such as the calibration date and mutation rate. Another is the reliance on the particular genetic subsystem under study instead of an overall picture mediated by a plurality of features. A third factor is recent fossil data that could also be interpreted as speaking in favour of a multiregional evolution and subsequent interbreeding of humans (Thorne & al. 1999).
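The sensitivity of such dates to the assumed mutation rate can be shown with a line of arithmetic. Under the simplest, strict-clock assumption, two lineages that split t years ago and accumulate mutations at a rate μ per site per year differ by d = 2μt, so that t = d / (2μ). The sketch below is an illustration under that assumption only; the distance and the two rates are hypothetical numbers, not values from the studies cited.

```python
def divergence_time(distance: float, rate: float) -> float:
    """Years since two lineages split under a strict molecular clock.

    distance: observed differences per site between the two lineages
    rate:     assumed mutations per site per year along one lineage
    Both lineages accumulate changes, hence the factor 2 (d = 2 * rate * t).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)

# Hypothetical values: the same observed distance dated with two assumed rates.
observed = 0.004
slow = divergence_time(observed, 1.0e-8)
fast = divergence_time(observed, 2.0e-8)
print(round(slow), round(fast))
```

Doubling the assumed rate halves the estimated date, which is why the choice of calibration matters so much for the absolute chronology.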

At any rate, this method of genetic analysis proves helpful for glottochronology, above all in cases much under discussion (African, Amerindian languages) where e.g. analyses given by Joseph H. Greenberg on the basis of a certain kind of frequent vocabulary are confirmed by the analysis of genetic closeness or distance (Ruhlen 1994). Nevertheless, this method – especially by extending the assessment of genetic distance or closeness to larger regions of human DNA – could lead to different and possibly more blurred results according to the number and the kind of alleles examined.

8 Are there direct links between the genome and human language capacity?

It stands to reason that our language capacity is genetically determined. Disagreement only concerns the extent of this determination. Some think there is a bioprogram leading from poor care-giver input to perfectly structured and constructed human languages as in the case of CREOLES evolving from PIDGIN input (→ art. 7); others think that the amount and quality of care-giver input is far more important, the evolution of our language capacity being an experience-expectant process (Greenough & al. 1987 established the useful distinction between ‘experience-expectant’ and ‘experience-dependent’ brain plasticity).

In 1991, Myrna Gopnik and Martha B. Crago described a condition currently termed ‘specific language impairment’ (SLI): as could be shown by a series of experiments, certain members of an ENGLISH-speaking family apparently were not able to use nouns and verbs in vocal speech with the appropriate grammatical endings. Typical phenomena were the omission of the -s of the third person singular of verbs, of the plural -s of nouns, and of the -ed of past forms. The understanding of syntax on the sentence level did not seem to be impaired, though (Gopnik & Crago 1991).

Although the condition had been known for some decades, this contribution became notorious thanks to Steven Pinker: “The K family, three generations of SLI sufferers, whose members say things like Carol is cry in the church and cannot deduce the plural of wug, is currently one of the most dramatic demonstrations that defects in grammatical abilities might be inherited” (1994: 323).

Since the condition of SLI would point to the existence and innateness of a putative grammatical subsystem in the brain, the issue became extremely controversial, setting off about 500 contributions on the topic between 1991 and 1999.

What remained more or less uncontroversial is a link to (autosomal, since both sexes are concerned) heredity. However, in the meantime four points became evident: (1) SLI is not a rare condition. It is said to concern between 3 and 6% of otherwise unimpaired children. (2) It is linked to the production and the reception of vocal speech (most salient is slow and bad articulation), not to the conceptual side of language production or understanding. (3) The condition is accompanied by a general delay in language acquisition manifesting itself, among other things, in poor vocabulary. (4) The overwhelming majority of the studies were carried out (and continue to be carried out) with ENGLISH-speaking children and controls.

However, the rare studies made with SLI-affected children speaking different languages, e.g. FRENCH, HEBREW, GERMAN, ITALIAN, SPANISH, MODERN GREEK, FINNISH, JAPANESE, do not as a rule exhibit an identical pattern of grammatical symptoms. Italian children with SLI closely resemble the controls in their production of noun plural inflections (bambino/bambini; donna/donne), third person copula forms (sono – sei – è, ero – eri – era), first person singular and plural verb inflections (amo – amiamo), and third person singular verb inflection (amo – ami – ama). What was produced at a much lower rate were the (unstressed) articles and clitics (il/i, la/le, lo/gli, the l’ in l’abbiamo sentito) and third person plural verb inflections (mando – mandi – manda – mandiamo – mandate – mandano) with their three syllables and the unstressed -no ending. (Le Normand & al. 1993; Bortolini & al. 1997; Leonard & Bortolini 1998; Leonard 1998: part II, ch. 4 gives a detailed overview of the phenomena observed in a series of languages.)

Given the different nature of symptoms in different languages, a definition of SLI by purely linguistic criteria seems to be impossible.

SLI-affected children speaking FRENCH and ITALIAN did not only show symptoms different from those of ENGLISH-speaking children. Starting as early as the 1970s, studies carried out by Paula Tallal and her colleagues gave hints as to additional factors in the possible etiology: the nature, the duration and the context of phonemes and grammatical morphemes – and auditory stimuli in general – proved to be a most important factor. Furthermore, by repetition tasks it became evident that children with SLI had a diminished capacity of the phonological working memory (e.g. Montgomery 1995; ITALIAN children perform e.g. much better with disyllabic third person plural forms such as fanno or stanno than with the trisyllabic type vedono). Psychophysical tests employing simple tones and noises and imaging techniques with high temporal and spatial resolution (especially magneto-encephalography) show that persons with SLI have severe auditory perceptual deficits for brief – but not long – tones in particular sound contexts. Similar phenomena can be observed with poor readers, thus linking to a certain extent SLI with dyslexia (cf. Wright & al. 1997; Montgomery & Leonard 1998; Nagarajan & al. 1999).

The basis is a phenomenon well known in audiology, so-called masking: the term refers to a natural limitation in the human ability to detect any particular sound that is presented simultaneously – or within a small fraction of a second – with other, masking sounds. In normal individuals this masking of particular speech sounds by preceding or following sounds is not sufficient to impair speech processing – in SLI children it is. They require hundreds of milliseconds between acoustic events to discriminate between them, while children of the same age and intelligence level only need tens of milliseconds. Since auditory feedback is defective, the speech production and articulation of such children are worse than in unimpaired children.

This etiology explains differences between the phenomena observed in various languages. In a language like HEBREW, where inflection is to a large extent brought about by changing vowels – i.e. not consonants – between the three consonants of the verbal root, morphology is much less likely to be impaired by such a limitation in the capacity of decoding auditory input. In ITALIAN, the unstressed article il is always followed by a consonant (il rigore, il mercato); before vowels it appears in a reduced form (l’amico).

A specific training program for auditory discrimination (incidentally a computer game) is said to advance such children within four weeks in a way that normally would have taken two years, thus suggesting that SLI is not related to higher cognitive or even grammatical functions as was supposed by the ‘grammar gene’ hypothesis (Tallal & al. 1996).

In the meantime, biologists tried to localize the genetic basis the condition undoubtedly has. In the family investigated by Gopnik & Crago (1991), there seemed to be an anomaly on chromosome 7 (Fisher & al. 1998). Bioinformatic analyses made it possible to circumscribe the relevant sector further (Lai & al. 2000). Eventually, in 2001, a culprit was discovered: not the putative language gene, but a gene encoding a forkhead-domain transcription factor, located on chromosome 7 (Lai & al. 2001).

This etiology would explain the observation that the effects of the anomaly under discussion are rather unspecific: they concern auditory discrimination, velocity of speech processing, and a speech production that is always accompanied by articulation difficulties (affected members of the family are said to suffer from a severe orofacial dyspraxia making their speech more or less incomprehensible to normal listeners). One could even add, in many cases, dyslexia. This speaks in favor of the impairment of a more fundamental network (the transcription factor is also expressed, e.g., in mice) involved in cognitive as well as muscular activities. It should perhaps not come as a surprise if the cerebellum – a part of our brain that orchestrates complex activities – were involved (Silveri & Misciagna 2000).

Thanks to Steven Pinker (1994: 64f.), George Pullum’s unmasking of “The great Eskimo vocabulary hoax” became widely known among linguists and a large reading public. Unfortunately, it might well be that for a rather long period the putative ‘language gene’ or ‘grammar gene’ responsible for SLI could play a role similar to the ESKIMO vocabulary hoax in linguistic textbooks. Anyway, the numerous SLI studies continue to be published as if nothing had happened.

What remains is the general claim of a genetically determined disposition to learn languages. In an indirect way, the existence of a corresponding genetic apparatus is shown by children suffering from Williams syndrome. Here the cause of the syndrome is known: one copy of the aforementioned chromosome 7 lacks a tiny section which may contain up to 15 genes, some of them already identified. The brain of such children is smaller, their mean IQ score is 60. Nevertheless, their language capacity seems to be nearly unimpaired on the level of sentences and sentence chaining (the overall coherence of the texts they produce being quite peculiar, though). This gives them a remarkable linguistic ability. Some of them also display a strong musical talent (Lehnhoff & al. 1997).

Human language is a complex system. It is built on principles whose effectiveness has been confirmed by genesis and morphogenesis (see above § 4.2), adding to them an apparatus for speech production and reception. Given the complexity of the genetic system whose “linguistic” aspects have been outlined above, it was unlikely that one single gene – or even two of them – should be responsible for language capacity or even for one of its major components. As is shown by SLI, the impairment or disruption of any of a number of these components can impair language development, which, as it stands, is not as autonomous as is sometimes presumed. At the same time, the example makes clear that it may be dangerous to jump to conclusions on the basis of data coming from one single language.

9 References

Asai, K. & Itou, K. & Ueno, Y. & Yada, T. 1998. “Recognition of human genes by stochastic parsing”. Pacific Symposium on Biocomputing: 228–39.


Bentolila, Simone. 1996. “A grammar describing ‘biological binding operators’ to model gene regulation”. Biochimie 78.5: 335–50.

Berwick, Robert C. 1996. “The language of the genes”. In: Collado-Vides, Julio & al. (eds.), 281–96.

Blumenberg, Hans. 1981. Die Lesbarkeit der Welt. Frankfurt: Suhrkamp.

Bodnar, J.W. & Killian, J. & Nagle, M. & Ramchandani, S. 1997. “Deciphering the language of the genome”. Journal of Theoretical Biology 189.2: 183–93.

Bortolini, Umberta & Caselli, M.C. & Leonard, Laurence B. 1997. “Grammatical deficits in Italian-speaking children with specific language impairment”. Journal of Speech, Language, and Hearing Research 40.4: 809–20.

Boutroux, Émile. ²1991 ¹1875. De la contingence des Lois de la Nature. (Collection Dito.) Paris: Presses Universitaires de France.

Cann, Rebecca L. 2000. “Talking trees tell tales”. Nature 405.6790: 1008–09.

Cavalli, G. & Paro, Renato. 1998. “The Drosophila Fab-7 chromosomal element conveys epigenetic inheritance during mitosis and meiosis”. Cell 93.4: 505–18.

Cavalli-Sforza, Luigi Luca & Piazza, Alberto. 1993. “Human genomic diversity in Europe: a summary of recent research and prospects for the future”. European Journal of Human Genetics 1.1: 3–18.

Cavalli-Sforza, Luigi Luca & Menozzi, Paolo & Piazza, Alberto. 1994. The history and geography of human genes. Princeton, NJ: Princeton University Press.

Cavalli-Sforza, Luigi Luca. 1997. “Genes, peoples, and languages”. Proceedings of the National Academy of Sciences of the U.S.A. 94.15: 7719–24.

Collado-Vides, Julio. 1996. “Towards a unified grammatical model of sigma 70 and sigma 54 bacterial promoters”. Biochimie 78.5: 351–63.

Collado-Vides, Julio. 1996a. “Integrative representations of the regulation of gene expression”. In: Collado-Vides, Julio & al. (eds.), 179–203.

Collado-Vides, Julio & Magasanik, Boris & Smith, Temple F. (eds.). 1996. Integrative approaches to molecular biology. Cambridge/MA: MIT Press.

Dearden, Peter & Akam, Michael. 2000. “Segmentation in silico”. Nature 406.6792: 131–32. [The authors are highlighting the importance that should be attributed to the contribution of von Dassow & al. 2000.]

Dong, Shan & Searls, David B. 1994. “Gene structure prediction by linguistic methods”. Genomics 23.3: 540–51.

Fisher, Simon E. & Vargha-Khadem, Faraneh & Watkins, Kate E. & Monaco, Anthony P. & Pembrey, Marcus E. 1998. “Localisation of a gene implicated in a severe speech and language disorder”. Nature Genetics 18.2: 168–70.

Gehring, Walter. 1985. “The molecular basis of development”. Scientific American 253.4: 136–46.

Girard, Franck & Bello, Bruno & Laemmli, Ulrich K. & Gehring, Walter J. 1998. “In vivo analysis of scaffold-associated regions in Drosophila: a synthetic high-affinity SAR binding protein suppresses position effect variegation”. EMBO Journal 17.7: 2079–85.

Gopnik, Myrna & Crago, Martha B. 1991. “Familial aggregation of a developmental language disorder”. Cognition 39.1: 1–50.

Gray, Russell D. & Jordan, Fiona M. 2000. “Language trees support the express-train sequence of Austronesian expansion”. Nature 405.6790: 1052–55.

Greenough, William T. & Black, J.E. & Wallace, C.S. 1987. “Experience and brain development”. Child Development 58.3: 539–59.

Hockett, Charles F. 1960. “The origin of speech”. Scientific American 203.3: 88–96.

Ji, Sungchul. 1997. “Isomorphism between cell and human languages: molecular biological, bioinformatic and linguistic implications”. Biosystems 44.1: 17–39.

Jiménez Montaño, Miguel Angel. 1994. “On the syntactic structure and redundancy distribution of the genetic code”. Biosystems 31: 11–23.


Jin, L. & Underhill, Peter A. & Doctor, V. & Davis, R.W. & Shen, P. & Cavalli-Sforza, Luigi Luca & Oefner, P.J. 1999. “Distribution of haplotypes from a chromosome 21 region distinguishes multiple prehistoric human migrations”. Proceedings of the National Academy of Sciences of the U.S.A. 96.7: 3796–800.

Kordon, Claude. 1993. The language of the cell (translated from Langage des cellules). (McGraw-Hill horizons of science series.) New York: McGraw-Hill.

Lai, Cecilia S.L. & Fisher, Simon E. & Hurst, Jane A. & Levy, Elaine R. & Hodgson, Shirley & Fox, Margaret & Jeremiah, Stephen & Povey, Susan & Jamison, D. Curtis & Green, Eric D. & Vargha-Khadem, Faraneh & Monaco, Anthony P. 2000. “The SPCH1 Region on Human 7q31: Genomic Characterization of the Critical Interval and Localization of Translocations Associated with Speech and Language Disorder”. American Journal of Human Genetics 67: 357–68.

Lai, Cecilia S.L. & Fisher, Simon E. & Hurst, Jane A. & Vargha-Khadem, Faraneh & Monaco, Anthony P. 2001. “A forkhead-domain gene is mutated in a severe speech and language disorder”. Nature 413: 519–23.

Laszlo, Pierre. 1986. Molecular correlates of biological concepts. (Comprehensive biochemistry, 34A, Section 6, A History of biochemistry.) Amsterdam: Elsevier.

Laszlo, Pierre. 1993. La parole des choses ou le langage de la biologie. (Collection savoir: Sciences.) Paris: Hermann.

Laszlo, Pierre. 1995. Organic reactions. Simplicity and logic. Chichester: Wiley.

Le Normand, M.T. & Leonard, Laurence B. & McGregor, K.K. 1993. “A cross-linguistic study of article use by children with specific language impairment”. European Journal of Disorders of Communication 28.2: 153–63.

Lehnhoff, Howard M. & Wang, Paul P. & Greenberg, Frank & Bellugi, Ursula. 1997. “Williams syndrome and the brain”. Scientific American 277.6: 68–73.

Leonard, Laurence B. ²2000. Children with specific language impairment. (Language, speech and communication.) Cambridge/MA & London: MIT Press.

Leonard, Laurence B. & Bortolini, Umberta. 1998. “Grammatical morphology and the role of weak syllables in the speech of Italian-speaking children with specific language impairment”. Journal of Speech, Language, and Hearing Research 41.6: 1363–74.

Maniatis, Tom & Reed, Robin. 2002. “An extensive network of coupling among gene expression machines”. Nature 416: 499–506.

Maric, Chrystelle & Hyrien, Olivier. 1998. “Remodeling of chromatin loops does not account for specification of replication origins during Xenopus development”. Chromosoma 107.3: 155–65.

Monod, Jacques. 1970. Le hasard et la nécessité. Essai sur la philosophie naturelle de la biologie moderne. Paris: Éditions du Seuil.

Montgomery, James W. 1995. “Sentence comprehension in children with specific language impairment: the role of phonological working memory”. Journal of Speech and Hearing Research 38.1: 187–99.

Montgomery, James W. & Leonard, Laurence B. 1998. “Real-time inflectional processing by children with specific language impairment: effects of phonetic substance”. Journal of Speech and Hearing Research 41.6: 1432–43.

Moxon, Richard E. & Wills, Christopher. 1999. “DNA Microsatellites: Agents of Evolution?” Scientific American 280.1: 72–77.

Nagarajan, Srikantan & Mahncke, Henry & Salz, Talya & Tallal, Paula & Roberts, Timothy & Merzenich, Michael M. 1999. “Cortical auditory signal processing in poor readers”. Proceedings of the National Academy of Sciences of the U.S.A. 96: 6483–88.

Palacios, O.A. & Stephens, Christopher R. & Waelbroeck, Henri. 1998. “Emergence of algorithmic language in genetic systems”. Biosystems 47.3: 129–47.

Paro, Renato & Hogness, D.S. 1991. “The Polycomb protein shares a homologous domain with a heterochromatin-associated protein of Drosophila”. Proceedings of the National Academy of Sciences of the U.S.A. 88: 263–67.


Paro, Renato & Strutt, H. & Cavalli, G. 1998.“Heritable chromatin states induced by thePolycomb and trithorax group genes”.No-vartis Foundation Symposium214: 51–61;discussion 61-6: 104–13.

Paro, Renato. 2000. “Formatting genetictext”. Nature406.6796: 579–80.

Pérez Rueda, Ernesto & Collado-Vides, Ju-lio. 2000. “The repertoire of DNA-bindingtranscriptional regulators in Escherichia coliK-12”. Nucleic Acids Research28.8 :1838–47.

Pinker, Steven. 1994.The language instinct.How the mind creates language. New York:William Morrow & Company.

Pirrotta, Vincenzo. 1997. “Chromatin-silencing mechanisms in Drosophilamaintain patterns of gene expression”.Trends in Genetics.13.8: 314–18.

Pirrotta, Vincenzo 1998. “Polycombing thegenome: PcG, trxG, and chromatin silen-cing”. Cell. 93.3: 333–36.

Popov, O. & Segal, D.M. & Trifonov, E.N.1996. “Linguistic complexity of protein se-quences as compared to texts of human lan-guages”.Biosystems38: 65–74.

Raible, Wolfgang. 1991.Die Semiotik derTextgestalt. Erscheinungsformen und Fol-gen eines kulturellen Evolutionsprozesses.(Abhandlungen der Heidelberger Akademieder Wissenschaften, phil.-hist. Klasse,1991.1.) Heidelberg: Winter.

Raible, Wolfgang. 1993.Sprachliche Texte– Genetische Texte. Sprachwissenschaftund molekulare Genetik. (Sitzungsberichteder Heidelberger Akademie der Wis-senschaften, phil.-hist. Klasse, 1993.1.)Heidelberg: Winter.

Raible, Wolfgang. 1994. “Orality and Liter-acy”. In: Günther, Hartmut & Ludwig, Otto(eds.). Schrift und Schriftlichkeit. Writingand Its Use.An Interdisciplinary Handbookof International Research. Berlin & NewYork: De Gruyter. 1–17.

Raible, Wolfgang. 1999.Kognitive Aspektedes Schreibens. (Schriften der Philoso-phisch-historischen Klasse der HeidelbergerAkademie der Wissenschaften, 14.) Heidel-berg: C. Winter.

Ratner, V.A. 1993. “Comparative hierarchic structure of the genetic language” (in Russian). Genetika 29: 720–39.

Ratner, V.A. 1993. “The genetic language: grammar, semantics, evolution” (in Russian). Genetika 29: 709–19.

Ratner, V.A. & Amikishiev, V.G. 1996. “Analysis of motifs of functional MDG2 sites in assuring its possible molecular functions” (in Russian). Genetika 32.7: 902–13.

Rosenblueth, David A. & Thieffry, Denis & Huerta Moreno, Araceli & Salgado Osorio, Heladia & Collado-Vides, Julio. 1996. “Syntactic recognition of regulatory regions in Escherichia coli”. Computer Applications in the Biosciences 12.5: 415–22.

Ruhlen, Merritt. 1994. On the origin of languages. Studies in linguistic taxonomy. Stanford, Calif.: Stanford University Press.

Searls, David B. 1997. “Linguistic approaches to biological sequences”. Computer Applications in the Biosciences 13: 333–44.

Searls, David B. 2002. “The language of genes”. Nature 420: 211–17.

Shao, Zhaohui & Raible, Florian & Mollaaghababa, Ramin & Guyon, Jeffrey R. & Wu, Chao-ting & Bender, Welcome & Kingston, Robert E. 1999. “Stabilization of chromatin structure by PRC1, a Polycomb complex”. Cell 98.1: 37–46.

Shubin, Neil & Tabin, Clifford J. & Carroll, Sean. 1997. “Fossils, genes and the evolution of animal limbs”. Nature 388.6643: 639–48.

Silveri, Maria Caterina & Misciagna, Sandro. 2000. “Language, memory, and the cerebellum”. Journal of Neurolinguistics 13: 129–43.

Strahl, Brian D. & Allis, C. David. 2000. “The language of covalent histone modifications”. Nature 403.6765: 41–45.

Tallal, Paula. 1976. “Rapid auditory processing in normal and disordered language development”. Journal of Speech and Hearing Research 37: 561–71.

Tallal, Paula & Miller, Steve L. & Bedi, Gail & Byma, Gary & Wang, Xiaoqin & Nagarajan, Srikantan S. & Schreiner, Christoph & Jenkins, William M. & Merzenich, Michael M. 1996. “Language comprehension in language-learning impaired children improved with acoustically modified speech”. Science 271: 81–84.

Thomson, James A. & Itskovitz-Eldor, Joseph & Shapiro, Sander S. & Waknitz, Michelle A. & Swiergiel, Jennifer J. & Marshall, Vivienne S. & Jones, Jeffrey M. 1998. “Embryonic stem cell lines derived from human blastocysts”. Science 282.5391: 1145–47.

Thorne, Alan & Grün, Rainer & Mortimer, Graham & Spooner, Nigel A. & Simpson, John J. & McCulloch, Malcolm & Taylor, Lois & Curnoe, Darren. 1999. “Australia’s oldest human remains: age of the Lake Mungo 3 skeleton”. Journal of Human Evolution 36.6: 591–612.

Tsonis, Anastasios A. & Elsner, J.B. & Tsonis, Panagiotis A. 1997. “Is DNA a language?” Journal of Theoretical Biology 184: 25–29.

Underhill, Peter A. & Jin, L. & Zemans, R. & Oefner, P.J. & Cavalli-Sforza, Luigi Luca. 1996. “A pre-Columbian Y chromosome-specific transition and its implications for human evolutionary history”. Proceedings of the National Academy of Sciences of the U.S.A. 93.1: 196–200.

van Helden, Jacques & Rios, A.F. & Collado-Vides, Julio. 2000. “Discovering regulatory elements in non-coding sequences by analysis of spaced dyads”. Nucleic Acids Research 28.8: 1808–18.

von Dassow, George & Meir, Eli & Munro, Edwin M. & Odell, Garrett M. 2000. “The segment polarity network is a robust developmental module”. Nature 406.6792: 188–92.

Varga-Weisz, Patrick D. & Wilm, Matthias & Bonte, Edgar & Dumas, Katia & Mann, Matthias & Becker, Peter B. 1997. “Chromatin-remodelling factor CHRAC contains the ATPases ISWI and topoisomerase II”. Nature 388.6642: 598–602. (With an erratum published in Nature 389.6654: 1003.)

Varga-Weisz, Patrick D. & Becker, Peter B. 1998. “Chromatin-remodeling factors: machines that regulate?” Current Opinion in Cell Biology 10.3: 346–53.

Weinrich, Harald. 1993. Textgrammatik der deutschen Sprache. Mannheim etc.: Dudenverlag.

Wright, Beverly A. & Lombardino, Linda J. & King, Wayne M. & Puranik, Cynthia S. & Leonard, Christiana M. & Merzenich, Michael M. 1997. “Deficits in auditory temporal and spectral resolution in language-impaired children”. Nature 387.6629: 176–78.

Wolfgang Raible, University of Freiburg i.Br., Germany
