
0306-4573/84 $3.00 + .00 Pergamon Press Ltd.

SOME VARIETIES OF INFORMATION

JAAKKO HINTIKKA Department of Philosophy, Florida State University, Tallahassee, FL 32306, U.S.A.

Abstract-Several different kinds of measures of information are distinguished from each other. The differences between them show that our pretheoretical concept of information is multiply ambiguous. Attempts to think of the scientific enterprise in terms of information maximization lead to several legitimately different rules of scientific decision-making, depending on which kind of information one is interested in. The contrast between high information and high probability postulated by some philosophers is spurious for the same reason. It is even possible to define interesting measures of information such that nontrivial deductive reasoning can increase information so defined.

The concept of information is to a logically-minded philosopher like myself both a godsend and a puzzling problem. It is a godsend because, in the immortal words of J. L. Austin, it is not a deep concept. A philosopher is likely to approach old sacrosanct concepts like KNOWLEDGE, TRUTH and MEANING with a special reverence, usually to the detriment of the quality of the discussion. In contrast, to paraphrase Austin again, in terms of information it is much easier not only to say what one means but to mean what one says. The situation would be healthier in philosophy, it seems to me, if philosophers wrote, for every ten papers flaunting the terms “knowledge” and “meaning” in their titles, just a single paper dealing seriously with the concept of information and its properties. For instance, it is much easier to turn information into a quantitative concept (amount of information), as I shall do consistently in the present paper, than to think of quantitative measures of knowledge.

But what do we mean by information? It is here that the puzzle sets in[1]. It is perhaps not surprising that we find it hard to give a satisfactory definition of information which would not presuppose other tricky notions, such as probability. However, this is no reason to worry. Concepts of this generality can seldom be defined in a presuppositionless manner.

Instead, we can follow the time-honored strategy which generations of theoreticians have followed in trying to capture mathematically or logically some important general concept. We can try to find its structural properties, hopefully expressible in the form of logical and mathematical axioms. All we need are suitable clues; the precise axioms can often be teased out by their means quite easily. An excellent example is offered by the usual axioms of probability theory. If they are not intuitive enough in their own right, they can all be given a further justification in terms of the requirement that no Dutch Book be possible against one who uses probabilities as his or her betting ratios.

Can something similar be done in the case of information? This question boils down to asking: What clues do we have to the logical and mathematical behavior of the concept of information (in the sense of the amount of information)? Here are some eminently plausible candidates:

Inf(A) + Inf(B) = Inf(A&B) (1)

if A and B are probabilistically independent.

Inf(A) + Inf(A ⊃ B) = Inf(A&B). (2)

Here A, B, . . . are arbitrary propositions or, as a probability theorist would say in his misleading jargon, events.


Both of these equations appear to be (and are) perfectly obvious. Surely information is additive for propositions which are independent of each other. And surely the information which is needed to reach the joint information of A and B from A's information is that of a proposition which says that B is true if A is. (A ⊃ B) is after all the weakest proposition which, when conjoined with A, yields (A&B) as a logical consequence.

Indeed, both (1) and (2) lead, together with a few supplementary assumptions of a partly normalizing character, to a natural definition of information in terms of probability. (For a mathematician this of course is but one more example of the versatility of functional equations.) So what's new here? The subtlety of the conceptual situation is illustrated by the fact that the definitions that (1) and (2) give rise to are different. From (1) we obtain, in conjunction with certain less crucial assumptions, the definition

Inf(A) = - log P(A) (3)

where P(A) is the probability of A (that probability measure we were relying on in speaking of probabilistic independence in (1))[2]. In contrast, (2) leads to the definition

Inf(A) = 1 - P(A) (4)

in the sense that if I introduce a notion P(A) by the equation

P(A) = 1 - Inf(A) (5)

it obeys all the laws of probability calculus (with the possible exception of countable additivity), assuming (2) plus a certain amount of normalization[3].
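
To make the difference concrete, here is a minimal numerical check (a Python sketch; the probabilities and helper names are my own, not the paper's) that -log P satisfies (1) but not (2) on a toy example, while 1 - P satisfies (2) but not (1).

    # A sketch, not from the paper: toy probabilities for two propositions A and B
    # that are assumed to be probabilistically independent.
    from math import log, isclose

    p_a, p_b = 0.25, 0.4              # P(A), P(B)
    p_ab = p_a * p_b                  # P(A & B), by independence
    p_a_implies_b = 1 - p_a + p_ab    # P(A ⊃ B) = P(~A or B)

    def inf(p):                        # surprise value, definition (3)
        return -log(p)

    def cont(p):                       # information content, definition (4)
        return 1 - p

    # (1) holds for Inf but fails for Cont on these numbers:
    assert isclose(inf(p_a) + inf(p_b), inf(p_ab))
    print(cont(p_a) + cont(p_b), cont(p_ab))              # 1.35 vs 0.9

    # (2) holds for Cont but fails for Inf on these numbers:
    assert isclose(cont(p_a) + cont(p_a_implies_b), cont(p_ab))
    print(inf(p_a) + inf(p_a_implies_b), inf(p_ab))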

I will not present the arguments here, but examine the consequences of these observations. What we have found means that there are two different conceptions of information at work in our ordinary thinking about information. In order to avoid fallacies and confusions, it is therefore most important to keep them apart and to understand their difference. For the latter purpose, the good old model-theoretical viewpoint may be helpful. What is measured in (4) is, roughly speaking, the weighted number of the possibilities (“sample-space points”) which A excludes. (The weighting is of course effected by the probability distribution P.) This is something like the substantial information or information content of A. In what follows, I shall use for it the symbol Cont. In contrast, (3) expresses a kind of surprise value of A. I shall use for it in the following the original symbol Inf.

This contrast is further explained by noting some of the consequences of the definitions (3)-(4), in particular what relativized versions they admit of.

First, what is commonly called information in the so-called information theory is the expected (average) value of (3) or

- Σ_i P(A_i) log P(A_i) (6)

for some set of pairwise incompatible and collectively exhaustive propositions A_i. This sense of information thus receives a natural place in my taxonomy.
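
For instance (a sketch with an invented distribution; base-2 logarithms give the result in the familiar unit of bits):

    # Expected surprise value (6) for a hypothetical partition A_1, ..., A_4.
    from math import log

    p = [0.5, 0.25, 0.125, 0.125]                    # P(A_i), summing to 1
    expected_inf = -sum(p_i * log(p_i, 2) for p_i in p)
    print(expected_inf)                              # 1.75 bits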

Second, we have to distinguish from each other different kinds of relative information. On the one hand, there is the change in the informational status of a proposition brought about by another one. I will call this incremental information. It can be defined in the obvious way:

Inf_inc(A/B) = Inf(A&B) - Inf(B) (7)

Cont_inc(A/B) = Cont(A&B) - Cont(B). (8)

On the other hand, there is what might be called conditional information, that is, the informational status of a proposition A on the assumption that we know another one, say B:

Inf_cond(A/B) = -log P(A/B) (9)

Cont_cond(A/B) = 1 - P(A/B). (10)

The difference between information content Cont and surprise value Inf can be illustrated by the following observations:

Inf_inc(A/B) = Inf_cond(A/B). (11)

Hence we can drop the subscript from relative surprise value. However, no similar equation holds for Cont_inc(A/B) and Cont_cond(A/B).
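
A toy calculation makes the asymmetry vivid; the joint probabilities below are invented for illustration, and the function names are mine.

    # Check of (11) on a toy example: the incremental and conditional surprise
    # values coincide, but the corresponding content measures come apart.
    from math import log, isclose

    p_b, p_ab = 0.5, 0.2              # P(B), P(A & B)
    p_a_given_b = p_ab / p_b          # P(A/B) = 0.4

    def inf(p):
        return -log(p)

    def cont(p):
        return 1 - p

    inf_inc = inf(p_ab) - inf(p_b)      # definition (7)
    inf_cond = inf(p_a_given_b)         # definition (9)
    assert isclose(inf_inc, inf_cond)   # equation (11)

    cont_inc = cont(p_ab) - cont(p_b)   # definition (8): 0.3
    cont_cond = cont(p_a_given_b)       # definition (10): 0.6
    print(cont_inc, cont_cond)          # no analogue of (11) for Cont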

The contrast between incremental and conditional information does not exhaust the senses of relative information. For one thing, there is the information conveyed by one proposition A concerning the subject matter of another one, say B. What this amounts to can be seen as follows, using the basic idea of information as reduction of uncertainty: The reduction of uncertainty when one comes to know A is Inf(A), while Inf(A/B) is the information A adds to that of B, that is, the uncertainty concerning A which remains even after we have learned that B is true. Hence

Inf(A) - Inf(A/B) (12)

measures the reduction in one’s uncertainty concerning A which takes place when one comes to know, not A, but B. In other words, it measures the information B conveys concerning A. Similar remarks apply of course to

Cont(A) - Cont_inc(A/B). (13)

We shall call these senses of information transmitted information and abbreviate them by Inf_trans(B/A) and Cont_trans(B/A), respectively.

The explanation of this terminology is seen by comparison with the traditional Shannon type “information theory” which of course ought to be called a theory of information transmission. If we think of A as the proposition that a certain message was sent and B the proposition that it was received, then (12) is intuitively speaking the information conveyed by the reception of the message concerning whether it was sent or not. This is of course precisely what is usually called transmitted information.

It is interesting to note that we have

Cont_trans(B/A) = Cont(A ∨ B) = 1 - P(A ∨ B) (14)

which clearly expresses the information overlap (in the sense of information content) of A and B. Clearly, (14) is always nonnegative.

In contrast, we have

Inf_trans(B/A) = log [P(A/B)/P(A)] (15)

which can be positive or negative. This makes of course good sense in terms of my intuitive explanation of the difference between information content and surprise value. For in suitable circumstances a message concerning A may make it more surprising, not less so, that A should be the case.
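
A small sketch (with invented numbers) brings out the contrast: B is chosen so that it lowers the probability of A, which makes Inf_trans(B/A) negative, while Cont_trans(B/A) stays nonnegative.

    # Transmitted information on a toy example in which B counts against A.
    from math import log, isclose

    p_a, p_b, p_ab = 0.5, 0.5, 0.1
    p_a_given_b = p_ab / p_b               # 0.2: learning B makes A more surprising
    p_a_or_b = p_a + p_b - p_ab            # P(A v B)

    def cont(p):
        return 1 - p

    cont_inc = cont(p_ab) - cont(p_b)      # Cont_inc(A/B), definition (8)
    cont_trans = cont(p_a) - cont_inc      # (13)
    assert isclose(cont_trans, cont(p_a_or_b))   # (14): the content overlap
    assert cont_trans >= 0                       # always nonnegative

    inf_trans = log(p_a_given_b / p_a)     # (15)
    print(inf_trans)                       # negative in this example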

Another variant of information is the expected information which results from the adoption of a hypothesis H on evidence E. It is clearly

P(H/E) Cont(H) - P(~H/E) Cont(~H) (16)

which simplifies to

P(H/E) - P(H). (17)

(Explanation: Cont(H) is the “utility” we gain when we rightly accept the hypothesis H. Cont(~H) is the information we miss if we falsely adopt H. In (16), these are simply weighted with the corresponding probabilities on the evidence. Furthermore, clearly we are here interested in information content and not in a mere surprise value.)

Not entirely surprisingly, we obtain the same result if we consider the expected value of Cont_trans(H/E) rather than that of Cont(H).
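
Both computations are easy to check numerically. In the sketch below the probabilities are invented, and the expected value of Cont_trans(H/E) is formed in the same way as in (16), using (14) to evaluate the transmitted contents.

    # Expected content (16) reduces to P(H/E) - P(H), i.e. (17), and the expected
    # value of Cont_trans(H/E) comes to the same thing on this example.
    from math import isclose

    p_h, p_e, p_he = 0.3, 0.5, 0.2         # P(H), P(E), P(H & E)
    p_h_given_e = p_he / p_e               # 0.4
    p_noth_given_e = 1 - p_h_given_e

    def cont(p):
        return 1 - p

    expected_cont = p_h_given_e * cont(p_h) - p_noth_given_e * cont(1 - p_h)  # (16)
    assert isclose(expected_cont, p_h_given_e - p_h)                          # (17)

    # By (14), Cont_trans(H/E) = Cont(E v H) and Cont_trans(~H/E) = Cont(E v ~H):
    p_e_or_h = p_e + p_h - p_he
    p_e_or_noth = p_e + (1 - p_h) - (p_e - p_he)
    expected_trans = p_h_given_e * cont(p_e_or_h) - p_noth_given_e * cont(p_e_or_noth)
    assert isclose(expected_trans, p_h_given_e - p_h)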

What is the moral of this story? What do all these distinctions show? They show that the concept of information, as we naively employ it, is multiply ambiguous. It is not at all clear which of the multiple senses of information we are presupposing on different occasions on which we are thinking and talking about information. What has been established above illustrates the differences between the different variants of information I have defined. These differences demonstrate that my distinction between those different senses of information really matters, in spite of the fact that they are of course systematically connected with each other.

This moral is important for anyone who is concerned with the concept of information in any size, shape, or form. For a philosopher like myself, the systematic ambiguity of the concept of information is especially important because of the role of this concept in recent discussions of the scientific method and scientific explanation. Several philosophers, most insistently perhaps Sir Karl Popper, have suggested that high information is the true aim of scientific theorizing[4]. I agree that this is an extremely useful vantage point in the philosophy of science. I would go as far as to claim that the idea of thinking of a scientist’s enterprise in terms of information maximization is potentially the most interesting new approach in the philosophy of science in the last couple of decades[5]. It enables us to relate philosophical discussions about the scientific method to statistical decision theory by considering information as a utility (“epistemic utility”) to be maximized by scientific decisions. However, this idea is virtually vacuous until it is specified what kind of information a scientist is supposed to maximize. In other words, distinctions of the sort I have outlined here are a prerequisite to any serious study of the scientific method as an attempt to maximize information.

One can say more than this, however. In several cases we encountered above, the different senses of information are clearly all justified, at least given suitable circumstances. There is for instance no reason to expect that some “transcendental deduction” can justify the use of Cont at the expense of Inf (or vice versa), or the use of Cont_inc rather than Cont_cond. Which of these is to be preferred depends on the circumstances of a scientist, including his or her ends, and cannot be decided once and for all.

To illustrate this, let us notice for instance that, in suitable circumstances, a scientist might simply want to explain some particular data E when he or she chooses a hypothesis H. The idea is that this E is all that the scientist is at the moment interested in. We may call his or her enterprise local explanation.

In such a situation, the scientist is clearly trying to maximize the information H carries with respect to E in the sense of surprise value, i.e. to maximize Inf_trans(H/E). This reduces to

log [P(E/H)/P(E)] (18)

More generally, the relative merits of two competing hypotheses H1 and H2 will in such circumstances depend on the ratio P(E/H1)/P(E/H2).

What this means is that we obtain the well-known Fisherian principles of maximum likelihood and likelihood ratio comparison as the preferred rules of scientific inference in one type of case. This illustrates at one and the same time the strengths and the weaknesses of Fisherian methods. They clearly are the preferred ones in certain types of situations, but it does not follow that they should be followed in others.
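
As a small sketch of this point (the likelihood values and hypothesis labels below are invented): since log P(E) is the same for every candidate hypothesis, maximizing (18) is the same as maximizing the likelihood P(E/H), and comparing two hypotheses by (18) comes down to their likelihood ratio.

    # Maximizing the transmitted surprise value (18) over candidate hypotheses
    # amounts to maximizing the likelihood P(E/H).
    from math import log

    p_e = 0.4                                   # P(E), fixed by the data
    likelihoods = {"H1": 0.6, "H2": 0.35}       # P(E/H) for two rival hypotheses

    inf_trans = {h: log(p_e_h / p_e) for h, p_e_h in likelihoods.items()}
    best = max(inf_trans, key=inf_trans.get)
    print(best)                                   # "H1": the maximum-likelihood choice
    print(likelihoods["H1"] / likelihoods["H2"])  # the Fisherian likelihood ratio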


For instance, if a scientist is looking for an overall theory whose informativeness is his or her main aim and is using the data E only to judge whether a theory is likely to be true or not, other methods are to be preferred. One candidate for this role is the expected (absolute) information of H, i.e. (16), possibly suitably normalized. I have shown elsewhere that this naturally leads to several of the measures of explanatory power philosophers and statisticians have in fact suggested. What we have seen shows that there is no irreconcilable conflict between them and likelihood methods. Each of them is the right procedure in the right circumstances.

Another observation along similar lines seems to me to be timely in discussions of scientific methods today. In many such discussions, especially by philosophers, it is assumed that there is in some sense an irreducible conflict or tension between a scientist’s quest for high information and his or her quest for a safe theory, i.e. a theory H whose probability on evidence P(H/E) is high. This alleged conflict is the cornerstone of Sir Karl Popper’s famous (or notorious) philosophy of science. It is assumed by so many others, however, that I am tempted to speak of a dogma of contemporary philosophy of science. One consequence of our observations here, however simple and even simple-minded they may be, is that this alleged contrast is entirely spurious. The dogma I mentioned of course derives from the fact that, as we can see from (3)-(4), absolute information and absolute (prior) probability are inversely related. However, no one, with the dubious exception of Sir Karl Popper, has seriously maintained that all we are looking for in science is (absolute) information, nothing but information, and the whole information; in other words, that the dominating desideratum of a scientist is high absolute information. Popper has had the courage of his belief and maintained that, since (absolute) information and (prior) probability are inversely related, the principal rule of a scientist should be: Be bold! Search for the most unlikely (improbable) theories possible! However, for many of us outside the closed society of Popperians this looks more like a reductio ad absurdum of his views than serious advice to practicing scientists.

In any case, as soon as we realize the multiplicity of legitimate variants (senses) of information, Popper’s contrast loses all plausibility. For instance, by any reasonable criterion expected (absolute) information is a much likelier candidate for the quantity to be maximized than absolute information simpliciter. But if so, we can see from (17) that there is no conflict between high (posterior) probability and high (expected) information.

Indeed, the general idea of thinking of information as a utility to be maximized by a scientist’s decisions helps us to understand in what circumstances the Popperian advice to a scientist has some limited validity. We know from the work of Dubins and Savage on inequalities for stochastic processes that, roughly speaking, when the odds are long against a gambler, boldness is the best strategy[6]. Hence Popper’s emphasis on boldness as a guide of a scientist’s life is in reality predicated on a romantic image of a brilliant scientist making his discoveries against all odds. In the sober daylight of the actual history of science, I am afraid, such an image is bound to appear as unrepresentative of the principles of scientific investigation as the charge of the Light Brigade is of the principles of twentieth-century warfare: it is magnificent, but it is not science.

Thus, in spite of their apparent simplicity, my distinctions between different senses of information prove to be extremely interesting in contemporary philosophy of science. The multiplicity of the different variants of information does not deprive this concept of its fruitfulness but on the contrary enhances it. It is my belief that the same is true of the uses of the concept of information in other fields. However, it is your task to show this, not mine, for my first-hand field is philosophy and not computer science or statistics.

There is, however, one more field of applications of informational concepts which is only now opening up and which is not yet generally known. It is therefore worth outlining here. It means in effect defining one more sense of information. Now many of the methodological uses of the concepts of information and probability take place in the realm of mathematics and logic. We speak of one theorem of logic as being more informative than another, and of its being more or less likely that a mathematical conjecture could be proved. Hence, as the great theoretical statistician L. J. Savage urged as early as the sixties, it would be of considerable interest to have a notion of information such that logical proofs could (sometimes) increase it[7]. A measure of information of this kind could be called deductive information. The need for such a sense of information or, rather, for its mirror-image notion of probability, is felt especially keenly by those theorists who, like L. J. Savage, are subjectivists, for it is clearly true that we for instance associate different subjective probabilities with different logical truths, even though they are all equivalent and hence have the same probability. However, the desirability of being able to define such measures of probability and information is not dependent on one's subjectivist leanings in probability theory.

How can one characterize such measures of deductive information? Obviously we can proceed by way of their probabilistic cousins. How do we have to modify the laws of probability calculus so as to allow for definitions of deductive information? The first part of an answer is clear. We must give up or, more accurately speaking, restrict one of the main assumptions of probability calculus, viz. the interchangeability of logically equivalent propositions with each other. For it is this assumption that entails that a consequence of a logical argument can never have more information than its premises, that all truths of logic, however subtle, are informationally empty tautologies, and so on[8].

It turns out that everything else can be left alone in the usual calculi of probability.

But precisely how is the invariance with respect to logical equivalence to be restricted? There are two extremely interesting answers which turn out to be equivalent. One of them is proof-theoretical. It says that we should restrict invariance with respect to logical equivalence to those equivalences which can in a certain sense be proved without introducing any new “auxiliary” entities into the argument[9]. This sense is a direct generalization of the time-honored distinction between those arguments of elementary geometry for which we don't need auxiliary constructions and those for which we do need them. It turns out that this distinction can be generalized so as to become completely independent of the use of figures in elementary geometry. It is a purely logical distinction.

It turns out that the same restriction on admissible equivalences is also reached model-theoretically by generalizing logicians’ customary concept of model. Clearly, the meaning of logical quantifiers can be thought of in terms of draws from a certain urn, called the model. An existential quantifier says that at least one possible draw yields such-and-such an individual; a universal quantifier says that every ball (i.e. individual) one can draw is such-and-such; nested quantifiers (they are a logician’s bread and butter) describe successive draws; and so on. Now we obtain a generalization of the usual concept of model simply by allowing the set of balls available to be drawn to change between successive draws. The result is what Rantala has called an urn model[10]. Classical models are a special case of urn models: they are the invariant urn models.

Now if I am given such an urn as a black box, I can experiment with ramified sequences of draws from it in order to see if the balls (individuals) come from the same set of balls or different ones. Often, I can decide whether I am dealing with a noninvariant (proper) urn model or not. But if the length of the sequences of my successive draws is restricted, it turns out that there is a class of noninvariant urn models I cannot tell from constant (classical) ones. They may be called almost invariant urn models. If they are admitted as models over and above the classical ones, we obtain a model theory in which those and only those equivalences are logically valid which according to the proof-theoretical suggestion guarantee the sameness of their informational status. In other words, almost invariant urn models provide a model theory to go together with my proof theory in which only some logical equivalences admit substitutivity.
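
The following sketch is only an illustration of the idea, not of Rantala's actual construction: an urn is represented here simply by the sets of balls available at each successive draw, and two urns are compared on the draw sequences of a given length that they can produce. The class name and the concrete urns are my own.

    # A toy urn model: stages[k] is the set of balls available at the (k+1)-th draw.
    from itertools import product

    class UrnModel:
        def __init__(self, stages):
            self.stages = stages

        def sequences(self, length):
            """All draw sequences of the given length that the urn can produce."""
            return set(product(*self.stages[:length]))

    classical = UrnModel([{1, 2, 3}] * 3)                  # invariant: same balls every time
    changing = UrnModel([{1, 2, 3}, {1, 2, 3}, {1, 2}])    # ball 3 vanishes at the third draw

    # With draw sequences of length at most 2 the two urns cannot be told apart ...
    print(classical.sequences(2) == changing.sequences(2))   # True
    # ... but a third draw reveals the difference:
    print(classical.sequences(3) == changing.sequences(3))   # False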

There are many details here that have to be filled in. Moreover, the resulting new sense of information has not yet been utilized on a large scale in applications, apart from certain philosophical uses. In this direction, too, my final message is to recommend to you what looks like an excellent conceptual tool for future applications of informational concepts.

REFERENCES

[1] I am returning in this paper to some of the themes of J. HINTIKKA, The varieties of information and scientific explanation. Logic, Methodology, and Philosophy of Science III (Edited by B. VAN ROOTSELAAR and J. F. STAAL), pp. 151-171. North-Holland, Amsterdam (1968). Of the related literature, a glimpse is given by D. B. OSTEYEE and I. J. GOOD, Information, Weight of Evidence: The Singularity between Probability Measures and Signal Detection, Lecture Notes in Mathematics, Vol. 376. Springer-Verlag, Berlin (1974).

[2] R. T. COX, The Algebra of Probable Inference, pp. 37-38. Johns Hopkins Press, Baltimore (1961).

[3] J. HINTIKKA, On defining information. Ajatus 1971, 33, 271-273.

[4] K. R. POPPER, The Logic of Scientific Discovery. Hutchinson, London (1959).

[5] For different versions of this general idea, see, e.g. several papers in J. HINTIKKA and P. SUPPES, Aspects of Inductive Logic. North-Holland, Amsterdam (1966) and ISAAC LEVI, Gambling With Truth. Knopf, New York (1967).

[6] L. E. DUBINS and L. J. SAVAGE, How To Gamble If You Must: Inequalities for Stochastic Processes. McGraw-Hill, New York (1965).

[7] See L. J. SAVAGE, Difficulties in the theory of personal probability. Phil. Sci. 1967, 34, 305-310.

[8] J. HINTIKKA, Surface information and depth information. Information and Inference (Edited by Jaakko Hintikka and Patrick Suppes), pp. 263-297. Reidel, Dordrecht (1970).

[9] J. HINTIKKA, Logic, Language-Games, and Information. Clarendon Press, Oxford (1973); J. HINTIKKA, Knowledge, belief and logical consequence. Logic and Philosophy for Linguists (Edited by J. M. E. MORAVCSIK), pp. 165-176. Mouton, The Hague (1974).

[10] V. RANTALA, Urn models. J. Phil. Logic 1975, 4, 455-474; J. HINTIKKA, Impossible possible worlds vindicated. J. Phil. Logic 1975, 4, 475-484. Both are reprinted in Game-Theoretical Semantics (Edited by E. SAARINEN). Reidel, Dordrecht (1979).