
Available online at www.sciencedirect.com

Fuzzy Sets and Systems 143 (2004) 335–353
www.elsevier.com/locate/fss

An axiomatic derivation of the coding-theoretic possibilistic entropy

Andrea Sgarro∗,¹

Department of Mathematical Sciences (DSM), University of Trieste, 34100 Trieste, Italy

Received 4 November 2002; accepted 6 March 2003

Abstract

We re-take the possibilistic (strictly non-probabilistic) model for information sources and information coding put forward in (Fuzzy Sets and Systems 132–1 (2002) 11–32); the coding-theoretic possibilistic entropy is defined there as the asymptotic rate of compression codes, which are optimal with respect to a possibilistic (not probabilistic) criterion. By proving a uniqueness theorem, in this paper we provide also an axiomatic derivation for such a possibilistic entropy, and so are able to support its use as an adequate measure of non-specificity, or rather of "possibilistic ignorance", as we shall prefer to say. We compare our possibilistic entropy with two well-known measures of non-specificity: Hartley measure as found in set theory and U-uncertainty as found in possibility theory. The comparison allows us to show that the latter possesses also a coding-theoretic meaning.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Measures of information; Possibility theory; Possibilistic entropy; Non-specificity; Shannon entropy; Hartley measure; U-uncertainty; Fuzzy sets

1. Introduction

The most successful information measure is certainly Shannon entropy ℋ(P), which is interpreted as the uncertainty contents of a random experiment ruled by the probability distribution P. The "universe of discourse", i.e. the set of the possible outcomes of the experiment, called also the sample space, or the alphabet, is assumed to be finite: 𝒜 = {a_1, a_2, ..., a_k}, k ≥ 2. Actually, one can approach Shannon entropy in two different ways. The first approach, which is preferred by coding theorists, is the coding-theoretic, or pragmatic, or operational approach: under this approach, Shannon entropy is the solution to a coding-theoretic problem, and represents the ideal

∗ Tel.: +39-040-6762623; fax: +39-040-6762636.
E-mail address: [email protected] (A. Sgarro).
¹ Partially supported by MIUR.

0165-0114/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0165-0114(03)00123-4


rate of a compression code, meant to "strip" an information source of all its redundancy; cf. Section 4. The second, the axiomatic approach, is more relevant to the point of view of information measures in the strict sense. In this case one lists properties, or axioms, which an adequate information measure should possess; then one tries to understand which are the functional solutions that verify those properties. A list of axioms for measures of probabilistic (statistical, "selective") uncertainty has been proposed by Hinčin in the 1950s and has led to a uniqueness theorem, which shows that Shannon entropy is the unique solution² to those axioms; another well-known list of axioms uniquely leading to Shannon entropy is due to Fadeev.

Nowadays, probability theory is no longer the only tool which is used to deal with the management of partial knowledge; an alternative and successful tool is, among others, possibility theory; cf. Section 2. Several information measures have been proposed in possibility theory, for example the so-called U-uncertainty, which is seen as a measure of non-specificity.³ An indisputable merit of U-uncertainty is that its use can be supported by a uniqueness theorem, as is the case for Shannon entropy; cf. [4,7,8]. However, the approach to information measures taken in possibility theory has invariably been the axiomatic approach, as if one took for granted that the operational or coding-theoretic approach is necessarily confined to probability theory, while it is of no use in possibility theory, or also in evidence theory, in the theory of imprecise probabilities, etc. (as for these theories dealing with partial knowledge cf. e.g. [4,7,8]). In particular, up to now U-uncertainty did not appear to possess any coding-theoretic meaning.

After some preliminary work (cf. [11]), in [12] this author has put forward a systematic possibilistic framework for coding-theoretic problems, both in source coding (information compression) and in channel coding (information protection from transmission noise); this possibilistic approach to coding has been extended and deepened in subsequent work (cf. [13,9]; in the latter an application to the design of error-proof keyboards is investigated). As commented in [12], possibilistic coding is a "soft" alternative to probabilistic coding, and is apt to properly deal with ad hoc coding algorithms, such as have been long used by coding practitioners, even if they are somehow overlooked by coding theorists. We are confident that our operational approach gives a more balanced view of possibilistic information theory, which up to now appeared to be somehow "defective" in comparison with probabilistic information theory.

In particular, in [12,13] a possibilistic functional H_ε(π), called the possibilistic entropy, has been introduced and investigated, which is the analogue of the probabilistic entropy, as obtained operationally when the coding-theoretic problem of data compression is tackled. Here π is a possibility distribution over alphabet 𝒜, and takes the place of the probability distribution P (cf. Sections 2 and 4).

Notice, however, that the coding-theoretic possibilistic entropy H_ε(π) has two arguments: the first is the possibility distribution π, but there is also a second argument ε, whose meaning we have now to try to make clear, even if we leave details to Section 4 below. When one constructs a source code to compress data, one has to adopt a reliability criterion: usually, one fixes an allowed error

² To this writer, Shannon entropy, apart from information measures and compression rates, has an even more "elementary" meaning, which explains its success in so many different contexts: Shannon entropy is a counting tool which serves to estimate the size of "large sets of long sequences". To this point of view we shall come back in Appendix A.

³ Actually, rather than saying measures of non-specificity, as is usual in the literature, we shall prefer to say measures of possibilistic ignorance, and keep the term "non-specificity" for sets; cf. below Section 2.


probability ε (0 ≤ ε < 1), and imposes that the probability of a decoding error should be at most ε. In probability theory, the "ideal rate" of an optimal source code turns out to be Shannon entropy independently of ε, provided however that ε ≠ 0; if instead the allowed error probability is zero, the value of the ideal rate is greater, and is equal to the so-called Hartley measure; cf. Section 4. In other words, the probabilistic operational entropy H_ε(P) (note the dependency on ε) is equal to Shannon entropy for ε > 0, and to Hartley measure for ε = 0:

H_ε(P) = Shannon entropy for ε > 0;   H_0(P) = Hartley measure.

In this sense the operational (coding-theoretic) entropy H_ε(P) of a probabilistic source is a non-increasing stepwise function of ε, even if the step of the function is "almost invisible". If the information source has a possibilistic nature, the reliability criterion is suitably modified: one requires that the possibility of a decoding error should be at most ε. The "ideal rate" one obtains in this case is precisely the possibilistic entropy H_ε(π); its value has been computed in [12]; cf. also [13]. As will be shown in Section 4, the stepwise behaviour of the coding-theoretic possibilistic entropy H_ε(π) is quite evident, and so the dependency on ε cannot be disposed of as readily as in the probabilistic case. From the point of view of information measures, the meaning of ε will be that of a sensitivity threshold: one forgets about events whose possibility is ε or less, as if they were "practically impossible".

Sections 5 and 6 are the core of this contribution. In Section 5 we list properties verified by the possibilistic entropy H_ε(π). Similarities with analogue properties verified by Shannon entropy will turn out to be quite striking; in particular, a "dual" role is played by the uniform probability distribution versus the vacuous possibility distribution, and by probabilistic independence versus possibilistic non-interactivity. In Section 6, after choosing some of these properties as axioms, we prove a uniqueness theorem and so provide an axiomatic justification which supports the use of the possibilistic entropy as an adequate measure of possibilistic ignorance; the set of axioms is shown to be minimal, in the sense that by omitting any single axiom one introduces spurious solutions. In Section 7 we show how the possibilistic entropy relates to U-uncertainty: the latter is obtained from H_ε(π) by a simple averaging operation with respect to ε. This way, we are also able to show that U-uncertainty has itself a coding-theoretic or operational meaning. Sections 2 and 3 contain preliminaries on possibility distributions and Hartley measures, respectively; in Section 3 we have extended Hartley measures from crisp (usual) sets to fuzzy sets: this enables us to enhance the tight relationship⁴ between the possibilistic entropy and the Hartley measure of a set. Two appendices conclude the paper; the first is basically a reply to Guiaşu [6], and is meant to stress the non-probabilistic nature of our possibilistic approach. In the second appendix, we somewhat deepen the discussion on the coding-theoretic nature of the possibilistic entropy, for readers who wish to have a more comprehensive view of the operational approach.

We are confident that the present work will contribute to take a fresh view of information measures outside probability theory. In particular, we fill in a gap left open in [12], by showing that the approach taken there is relevant also from the axiomatic point of view: the possibilistic entropy

⁴ The "form" of the possibilistic entropy soon reminds one of Hartley measures, which are usually associated to sets, rather than possibility distributions: this does not come as a surprise, since the kinship between possibility distributions and sets (fuzzy sets, actually) is quite tight, as is well-known. The set-theoretic interpretation of possibilities, and the ensuing set-theoretic view of possibilistic entropies, will be discussed in Section 3.


of [12] pertains not only to coding theory, but also to the theory of information measures in the strict sense.

2. Possibility distributions: preliminaries

A possibility distribution π over the finite alphabet 𝒜 = {a_1, ..., a_k}, k ≥ 2, is defined by giving a possibility vector π = (π_1, π_2, ..., π_k) whose components π_i are the possibilities π(a_i) of the k alphabet letters a_i, 1 ≤ i ≤ k:

π_i = π(a_i),  0 ≤ π_i ≤ 1,  max_{1≤i≤k} π_i = 1.

The possibility⁵ of each subset A ⊆ 𝒜 is the maximum of the possibilities of its elements:

π(A) = max_{a_i ∈ A} π_i.

In particular π(∅) = 0, π(𝒜) = 1.

Example. Take 𝒜 = {a, b, c, d}, π = (1, 1, 1/2, 1/2). Then π{a, b} = π{b, c} = π{b} = 1 > π{c, d} = 1/2.
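To make the maxitivity of possibility concrete, here is a minimal Python sketch (ours, not the paper's; the dictionary and function names are purely illustrative) which reproduces the numbers of the example above.

```python
# Minimal sketch: the possibility of a subset is the maximum possibility of its elements.
poss = {"a": 1.0, "b": 1.0, "c": 0.5, "d": 0.5}  # the example distribution pi = (1, 1, 1/2, 1/2)

def set_possibility(subset, poss):
    """Possibility of a crisp subset: max over its letters; the empty set gets 0."""
    return max((poss[x] for x in subset), default=0.0)

print(set_possibility({"a", "b"}, poss))  # 1.0
print(set_possibility({"b", "c"}, poss))  # 1.0
print(set_possibility({"c", "d"}, poss))  # 0.5
print(set_possibility(set(), poss))       # 0.0, i.e. the possibility of the empty set is 0
```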

We shortly recall that probability distributions are defined through a probability vector P = (p_1, p_2, ..., p_k):

p_i = P(a_i),  0 ≤ p_i ≤ 1,  Σ_{1≤i≤k} p_i = 1

and have an additive nature:

P(A) = Σ_{a_i ∈ A} p_i.

We also shortly recall that Shannon entropy is a probabilistic functional defined as:

ℋ(P) = −Σ_{1≤i≤k} p_i log_2 p_i.

We use the script symbol ℋ to distinguish Shannon entropy from the operational entropies; as for its basic properties cf. e.g. [1,2] or [7]. Here we just notice that ℋ(P) spans the interval from 0, obtained when P is deterministic (minimum uncertainty: one letter has probability 1, and so it surely occurs), to log k, obtained when P is uniform (maximum uncertainty: all the k letters are equiprobable). As usual, by a continuity convention we set 0 log 0 = 0. Logarithms will be all to the base 2; choosing 2 as the base of logarithms amounts to measuring uncertainties in bits; the uncertainty 1 is associated to the toss of a fair coin, P = (1/2, 1/2).
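For comparison with the maxitive possibility of a set, the following short Python sketch (again ours, with illustrative names) computes the additive probability of a set and Shannon entropy ℋ(P) in bits; the fair coin gives exactly 1 bit.

```python
from math import log2

def prob_of_set(subset, prob):
    """Probability of a subset: additive, unlike the maxitive possibility above."""
    return sum(prob[x] for x in subset)

def shannon_entropy(P):
    """Shannon entropy in bits, with the continuity convention 0 log 0 = 0."""
    return sum(-p * log2(p) for p in P if p > 0)

print(shannon_entropy([0.5, 0.5]))        # 1.0 bit: the fair coin
print(shannon_entropy([1.0, 0.0, 0.0]))   # 0.0: deterministic, minimum uncertainty
print(shannon_entropy([0.25] * 4))        # 2.0 = log2(4): uniform, maximum uncertainty
```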

⁵ The fact that the symbol π is used both for vectors and for distributions will cause no confusion; similar conventions will be tacitly adopted also in the case of probability vectors P. Similarly, no confusion will arise from the fact that we are using the same symbol H for both operational (coding-theoretic) entropies, probabilistic and possibilistic, since the argument, P or π, is enough to distinguish between them.


Unlike probability vectors, the space of possibility vectors is endowed with a natural partial ordering, by setting for two possibility vectors π and ρ = (ρ_1, ρ_2, ..., ρ_k):

π ≤ ρ when π_i ≤ ρ_i, 1 ≤ i ≤ k.

The extreme elements in this ordering are at the start the k deterministic possibility vectors, for which all the components are 0, save a single 1, and at the end the vacuous possibility vector, for which all the components are equal to 1. The deterministic possibility vectors describe a situation of complete knowledge, while the vacuous possibility vector describes a situation of complete ignorance. The capability to describe complete ignorance, as opposed to complete uncertainty, is one of the main arguments used to support the claim that probability theory is unable to deal with all the multifarious facets of partial knowledge; cf. e.g. [5] or [7]. Below we systematically associate the term "ignorance" to possibilities and the term "uncertainty" to probabilities.

In probability theory, one speaks also of random variables, and not only of their probability distributions. Similarly, rather than saying that we assign a possibilistic distribution π, sometimes we shall find it convenient to say that we are assigning a possibilistic variable X, which is ruled by π, and which takes its outcomes in the sample space, or alphabet 𝒜. In this case we shall write Poss{X ∈ A} instead of π(A), and H_ε(X) instead of H_ε(π). All this is basically a matter of notation, which turns out to be quite convenient to deal with bi-dimensional possibility distributions, as we do now.

Below π denotes a joint possibility matrix on the Cartesian product 𝒜×ℬ, 𝒜 = {a_1, a_2, ..., a_k}, ℬ = {b_1, b_2, ..., b_h}; each of its entries π_{i,j} is a joint possibility, namely the joint possibility of the couple of letters (a_i, b_j), 1 ≤ i ≤ k, 1 ≤ j ≤ h (k, h ≥ 2). In the following we shall think of 𝒜 as heading the columns of the matrix, while ℬ heads its rows.

π_{i,j} = π(a_i, b_j) = Poss{X = a_i, Y = b_j},  0 ≤ π_{i,j} ≤ 1,  max_{i,j} π_{i,j} = 1.

By XY we have denoted the bi-dimensional possibilistic variable, which corresponds to π; the first co-ordinate X belongs to alphabet 𝒜, while the second co-ordinate Y belongs to alphabet ℬ. In possibility theory (cf. e.g. [7]) the most common and "natural" way to marginalise a bi-dimensional possibility distribution is to resort to maximisation and set:

Poss{X = a_i} = max_{1≤j≤h} π_{i,j},  1 ≤ i ≤ k   (1)

for the marginal possibilistic variable X over alphabet 𝒜, and analogously for the marginal possibilistic variable Y over alphabet ℬ.

Example. Let Poss{X = a_i, Y = b_j} ≡ π_{i,j} = 1 if j = 1, else Poss{X = a_i, Y = b_j} = 1/2. Then Poss{X = a_i} = Poss{Y = b_1} = 1, Poss{Y = b_j} = 1/2 for j ≥ 2.

We recall that the joint possibility distribution π is non-interactive, or, correspondingly, that the two possibilistic variables X and Y are non-interactive, when each joint possibility π_{i,j} is obtained as the minimum of the two corresponding marginal possibilities, as defined in (1):

π_{i,j} = Poss{X = a_i, Y = b_j} = min[Poss{X = a_i}, Poss{Y = b_j}].


This is e.g. the case in the preceding example, and so there the two possibilistic variables X and Y are non-interactive. We stress that non-interactivity is often seen as the "natural" counterpart to probabilistic independence; cf. e.g. [7].
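The next sketch (our own illustration; the matrix layout and the variable names are assumptions, with rows indexed by X for simplicity) computes the max-based marginals of definition (1) and tests the non-interactivity condition on the joint distribution of the preceding example.

```python
# Sketch: marginalisation by maximisation and the non-interactivity test.
# joint[i][j] = Poss{X = a_i, Y = b_j}; here rows are indexed by i (letters of X), columns by j.
joint = [[1.0, 0.5, 0.5],
         [1.0, 0.5, 0.5]]   # the example: possibility 1 if j = 1, else 1/2

marg_x = [max(row) for row in joint]          # Poss{X = a_i}
marg_y = [max(col) for col in zip(*joint)]    # Poss{Y = b_j}

non_interactive = all(
    joint[i][j] == min(marg_x[i], marg_y[j])
    for i in range(len(marg_x)) for j in range(len(marg_y))
)
print(marg_x, marg_y)    # [1.0, 1.0] [1.0, 0.5, 0.5]
print(non_interactive)   # True: every joint possibility is the minimum of the two marginals
```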

3. The Hartley measure of a fuzzy set

The deterministic distributions and the vacuous distribution are all special cases of unifocal possibility distributions, i.e. of possibility distributions such that the only components allowed in π are either 0 or 1. Specifying a unifocal possibility distribution π is the same as specifying the corresponding focal set A = {a_i: π_i = 1}, which is an ordinary "crisp" set: one has just to "read" the vector π as the characteristic vector of A: π_i = 1 if and only if a_i ∈ A. Actually, any possibility vector π can be viewed as the characteristic vector of a set F, but in this case the set is fuzzy:

π_i = degree of membership of a_i to F.

Reference texts for fuzzy sets are e.g. [5] or [7]. As is well-known, if the finite set A is crisp, its Hartley measure N(A) is defined as

N(A) = log |A| for |A| > 0

and is undefined when A is empty (bars as in |A| denote size). Hartley measure has been long interpreted as a measure of the non-specificity of the set A, which explains our use of the letter N to denote it; cf. e.g. [7]. We want to extend Hartley measure also to fuzzy sets F assigned through their characteristic vector:

F = (μ_1, μ_2, ..., μ_k),  0 ≤ μ_i ≤ 1.

Here μ_i is the degree of membership of letter a_i to fuzzy set F; crisp sets are re-found as special cases of fuzzy sets when the numbers μ_i are either zero or one. Observe that, unlike in the case of possibility vectors π = (π_1, π_2, ..., π_k), the constraint max_i μ_i = 1 is not required. One defines the height h(F) of a fuzzy set F as the maximum of its degrees of membership; if the height h(F) is 1, the fuzzy set F is called normal. The claim that the obvious formal similarity between possibility distributions⁶ and normal fuzzy sets is not just formal has been the object of an extensive debate in the literature; cf. e.g. [14,5].

One common way to "defuzzify" a fuzzy set is to fix a threshold ε and to consider only the elements a_i whose degree of membership to F is greater than ε. This way one obtains a crisp set called the ε-cut of F:

F_ε = {a_i: μ_i > ε},  0 ≤ ε < 1.

More precisely, we have just defined the strong ε-cut of F; if the strong inequalities μ_i > ε are replaced by weak inequalities, one obtains the weak ε-cut {a_i: μ_i ≥ ε}, 0 < ε ≤ 1.

If the threshold ε at which the cutting has taken place is strictly less than the height h(F) of the fuzzy set F, the ε-cut is not void, and one can consider its (crisp) Hartley measure N(F_ε). We find

⁶ And more generally between "incomplete" possibility distributions and arbitrary fuzzy sets; cf. e.g. [10].


it convenient to define the fuzzy Hartley measure N_ε(F) of the fuzzy set F as a function of ε, rather than as a single number:

Definition 1. The fuzzy Hartley measure N_ε(F) of a fuzzy set F is defined as:

N_ε(F) ≡ N(F_ε) = log |{a_i: μ_i > ε}|,  0 ≤ ε < h(F),  h(F) ≠ 0.

We are assuming h(F) ≠ 0, else F is empty and its Hartley measure, whether crisp or fuzzy, is undefined. As we shall soon see, the relationship between the possibilistic entropy H_ε(π) and this fuzzy Hartley measure N_ε(F) perfectly mirrors the relationship between possibility distributions π and normal fuzzy sets F. It will be enough to "interchange" the cognate notions of ε-cut and ε-support, as given below:

Definition 2. The ε-support S_ε(π) of a possibility distribution π is the crisp set made up of the letters whose possibility is larger than ε: S_ε(π) = {a_i: π_i > ε}, 0 ≤ ε < 1.
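As a small illustration of Definitions 1 and 2 (a sketch of ours, with made-up function names), the strong ε-cut, the fuzzy Hartley measure N_ε(F) and the ε-support can be computed as follows; for a possibility vector, whose height is 1, the ε-support coincides with the strong ε-cut.

```python
from math import log2

def strong_cut(membership, eps):
    """Strong eps-cut: indices whose degree of membership strictly exceeds eps."""
    return [i for i, m in enumerate(membership) if m > eps]

def fuzzy_hartley(membership, eps):
    """Fuzzy Hartley measure N_eps(F) = log2 of the size of the strong eps-cut (eps < height)."""
    cut = strong_cut(membership, eps)
    if not cut:
        raise ValueError("undefined: eps must be smaller than the height h(F)")
    return log2(len(cut))

pi = [1.0, 1.0, 0.5, 0.5]          # a possibility vector, read as a normal fuzzy set
print(strong_cut(pi, 0.0))         # [0, 1, 2, 3]: the 0-support, i.e. the usual support
print(strong_cut(pi, 0.5))         # [0, 1]: the 1/2-support
print(fuzzy_hartley(pi, 0.5))      # 1.0 = log2(2)
```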

The support of π as usually defined in the literature is our 0-support⁷ as defined above. After recalling that a rectangle is a (crisp) Cartesian product, i.e. a bi-dimensional set of the form A×B ⊆ 𝒜×ℬ, A and B crisp subsets of 𝒜 and ℬ, respectively, we digress to state a lemma which we shall need in Section 5 (for the easy proof cf. e.g. [13]):

Lemma 1. The joint possibility matrix π is non-interactive if and only if the ε-supports {(a_i, b_j): π_{i,j} > ε} are rectangles for all ε in [0, 1[.

4. Possibilistic entropies: the coding-theoretic point of view

We start by recalling that the elements of the Cartesian power 𝒜^n are the k^n sequences of length n built over the alphabet 𝒜: each such sequence can be interpreted as the information which is output in n time instants by an information source, be it a probabilistic or a possibilistic source. Unfortunately, the recommendations of possibility theory as to how to compute the possibility of such a sequence x starting from the possibilities of its n letters x_1, x_2, ..., x_n are not so conclusive. The most obvious choice is presumably the one we have made in [12], and which is inspired by the possibility-theoretic notion of non-interactivity, i.e. by the "natural" analogue of probabilistic independence: the possibility of a sequence is the minimum possibility of its letters; in other words, for a sequence to be declared possible above "threshold" ε, all of its letters must have a possibility which is ε or more. Correspondingly, in [12] we have introduced the notion of a stationary and non-interactive information source, or SNI source, to be compared with stationary and memoryless sources, or SML sources, which are standard in the probabilistic approach. Note that the behaviour of a possibilistic SNI source⁸ is completely specified by giving the k possibilities π_i of the source letters a_i, precisely

⁷ We stress once more that supports and cuts are crisp sets. In the rest of the paper when we mention sets we refer by default to crisp objects.

⁸ A further type of possibilistic source, called a stationary and mean-arithmetic interactive source, is investigated in [13] and its practical relevance is commented upon. In [13] the operational entropy of such interactive sources is computed and some of its formal properties are investigated.


as the behaviour of a probabilistic SML source is completely specified by giving the k probabilities p_i = Prob{a_i}.

Definition 3. A stationary and memoryless probabilistic information source is defined by setting for each sequence x ∈ 𝒜^n:

P^n(x) = ∏_{1≤r≤n} P(x_r).

Definition 4. A stationary and non-interactive possibilistic information source is defined by setting for each sequence x ∈ 𝒜^n:

π^[n](x) = min_{1≤r≤n} π(x_r).

Example. Take again 𝒜 = {a, b, c, d}, π = (1, 1, 1/2, 1/2). Then π^[3](aba) = π^[5](ababa) = 1 > π^[5](abada) = 1/2. The "power vector" π^[n] is a possibility distribution over the Cartesian power 𝒜^n. One has e.g. π^[5]({ababa, abada}) = max[1, 1/2] = 1; recall that possibilities are maxitive: the possibility of a set is the maximum of the possibilities of its objects.
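A minimal sketch (ours; the function names are illustrative) of Definition 4 and of the maxitive extension of π^[n] to sets of sequences, reproducing the example's values.

```python
def seq_possibility(seq, poss):
    """Possibility of a sequence under a stationary non-interactive source: min over its letters."""
    return min(poss[x] for x in seq)

def set_of_seqs_possibility(seqs, poss):
    """Possibility of a set of sequences: maxitivity again, max over the sequences."""
    return max(seq_possibility(s, poss) for s in seqs)

poss = {"a": 1.0, "b": 1.0, "c": 0.5, "d": 0.5}
print(seq_possibility("aba", poss))                       # 1.0
print(seq_possibility("abada", poss))                     # 0.5
print(set_of_seqs_possibility(["ababa", "abada"], poss))  # 1.0 = max(1, 1/2)
```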

The basic optimisation problem of source coding consists in devising code constructions which achieve substantial data compression, but which are reliable, in the sense that relevant information does not get lost. As will be made precise in Appendix B, two basic and conflicting parameters are used to evaluate the performance of a source code, meant to compress data:

(i) the code rate, which should be as low as possible, in order to achieve substantial data compression;
(ii) the error probability, or the error possibility, respectively, which should also be as low as possible, in order to achieve high reliability.

In Shannon theory, and also in [12], what one does is the following: one fixes the length n of the messages to be encoded, and one fixes the allowed decoding error probability ε, or the allowed decoding error possibility ε:

Poss{decoding error} ≤ ε   (2)

with 0 ≤ ε < 1. Then, one tries to optimise, i.e. to minimise, the code rate subject to such constraints. The asymptotic "ideal" value of optimal code rates which is obtained as n goes to infinity is called the entropy of the corresponding information source (cf. Appendix B). Expressions for the two operational entropies H_ε(P) and H_ε(π) are given by the following Theorems 1 and 2:

Theorem 1. For a stationary and memoryless probabilistic source one has:

H_ε(P) = log |{a_i: p_i > 0}|,  ε = 0;
H_ε(P) = ℋ(P),  0 < ε < 1.


Theorem 2. For a stationary and non-interactive possibilistic source one has:

H_ε(π) = log |{a_i: π_i > ε}|,  0 ≤ ε < 1.

Theorem 1 is a basic result of information theory⁹, often called Shannon's first theorem. It states that Shannon entropy is the asymptotic value of optimal rates for SML sources, at least for 0 < ε < 1; since this asymptotic value does not depend on the tolerated error probability ε > 0, but only the speed of convergence is affected, in the literature the mention of ε is often omitted. However, for ε = 0 the value of H_ε(P) is different, being equal to the Hartley measure. Theorem 2 has been proved in [12]; the possibilistic entropy of SNI sources has the form of a crisp Hartley measure for whatever value of ε. Actually, it coincides with the fuzzy Hartley measure N_ε(F) of the fuzzy set F with characteristic vector equal to π, as defined in Section 3.

Example. For π = (1, 1, 1/2, 1/2), one has H_ε(π) = log 4 = 2 for 0 ≤ ε < 1/2, H_ε(π) = log 2 = 1 for 1/2 ≤ ε < 1. As for code constructions, cf. Appendix B.
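The step behaviour of H_ε(π) in the example can be checked with a few lines of Python (ours; the function name is illustrative):

```python
from math import log2

def poss_entropy(pi, eps):
    """Weak-constraint possibilistic entropy of an SNI source: log2 |{i : pi_i > eps}|."""
    return log2(sum(1 for p in pi if p > eps))

pi = (1.0, 1.0, 0.5, 0.5)
for eps in (0.0, 0.25, 0.49, 0.5, 0.75):
    print(eps, poss_entropy(pi, eps))
# eps in [0, 1/2)  -> 2.0 = log2(4)
# eps in [1/2, 1)  -> 1.0 = log2(2)
```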

The function H_ε(π) as in Theorem 2 will be also called the weak-constraint possibilistic entropy. The reason for using the attribute "weak" is now given. Let us think of coding, take ε ∈ ]0, 1], and replace the weak reliability constraint Poss{decoding error} ≤ ε as in (2) by the strong reliability constraint

Poss{decoding error} < ε.

As easily proved (cf. [12]), this gives rise to a slightly different¹⁰ coding-theoretic possibilistic entropy, namely:

H⁺_ε(π) = log |{a_i: π_i ≥ ε}|,  0 < ε ≤ 1

which we shall call the strong-constraint possibilistic entropy. The two functions H⁺_ε(π) and H_ε(π) are undefined for ε = 0 and ε = 1, respectively; from now on, however, we find it convenient to prolong them by continuity over the entire closed interval [0, 1] and set

H_0(π) = H⁺_0(π) = log |{a_i: π_i ≠ 0}|,   H_1(π) = H⁺_1(π) = log |{a_i: π_i = 1}|

even if H⁺_0(π) and H_1(π) are devoid of any coding-theoretic meaning. In practice, H⁺_ε(π) is the same as H_ε(π), only the "steps" of the entropy seen as a function of ε are closed at the right end, rather than being closed at the left end. In technical terms the two entropies are equal almost everywhere; more precisely, they are different only over the finite set of points J(π) made up of the distinct components which appear in π and are different from 0 and 1. The set J(π) is empty if and only if π is unifocal.
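The following sketch (ours) makes the comparison concrete: the weak- and strong-constraint entropies are computed side by side, together with the set J(π) on which, and only on which, they disagree.

```python
from math import log2

def weak_entropy(pi, eps):
    """H_eps(pi) = log2 |{i : pi_i > eps}| (steps closed at the left end)."""
    return log2(sum(1 for p in pi if p > eps))

def strong_entropy(pi, eps):
    """H+_eps(pi) = log2 |{i : pi_i >= eps}| (steps closed at the right end; the paper defines it for 0 < eps <= 1)."""
    return log2(sum(1 for p in pi if p >= eps))

def jump_set(pi):
    """J(pi): the distinct components of pi other than 0 and 1."""
    return sorted({p for p in pi if 0.0 < p < 1.0})

pi = (1.0, 1.0, 0.5, 0.5)
print(jump_set(pi))                                      # [0.5]
print(weak_entropy(pi, 0.5), strong_entropy(pi, 0.5))    # 1.0 vs 2.0: they disagree on J(pi)
print(weak_entropy(pi, 0.3), strong_entropy(pi, 0.3))    # 2.0 vs 2.0: equal off J(pi)
```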

⁹ In the case of probabilistic SML sources, one can exhibit a sequence of reliable codes whose error probability goes to zero with the message length n. So, in a way, Shannon entropy is equal to the source entropy also for an admissible error which is "positive but infinitesimal". As commented in [12] this point of view does not make sense in the case of possibilistic sources.

¹⁰ Choosing a weak or a strong constraint gives rise to no difference whatsoever either in the case of Shannon entropy or in the case of the possibilistic interactive (mean-arithmetic) entropy studied in [13].


One might even devise criteria which are "intermediate" between Poss{decoding error} ≤ ε and Poss{decoding error} < ε, for example stochastic criteria, and so find a value of the entropy intermediate between H_ε(π) and H⁺_ε(π) for ε ∈ J(π). All of this suggests to introduce the more flexible notion below, which is independent of the fastidious specification of a weak or strong inequality in the reliability criterion, and yet captures all the essential elements of the notion of possibilistic entropy.

Definition 5. The abstract possibilistic entropy H*_ε(π) is defined as

H*_ε(π) = H_ε(π) = H⁺_ε(π),  ε ∈ [0, 1] − J(π).

In practice, we take a short way out, and leave the abstract entropy undefined¹¹ over J(π).

5. Properties of the possibilistic entropy

The debate on the meaning and the use of possibilities (and more specifically on the meaning of non-interactivity) is an ample and long-standing one; with respect to probabilities, an empirical interpretation of possibilities is definitely less clear. The reader is referred to standard texts on possibility theory, e.g. [3,5] or [7], which give an extensive bibliography. In this paper, possibilities π_i will be seen basically as linguistic labels; in this sense, what matters is their ordering, rather than the actual numeric values.

Definition 6. A re-scaling λ(x) is a strictly increasing function, such that λ(0) = 0, λ(1) = 1.

A re-scaling λ(x) maps [0, 1] into itself; we can re-scale a possibility distribution π to a new possibility distribution λ(π) by re-scaling all its components π_i to λ(π_i). Invariance under re-scaling is verified by our possibilistic entropy, and is quite a natural property if one adopts a linguistic interpretation of possibilities. Of course also the threshold ε has to be re-scaled to λ(ε); cf. property 9 in the list to follow.
We list properties of the possibilistic entropy, which may support its use as a measure of

possibilistic ignorance; out of these properties we shall select our axioms in the next section. Even if our notation in the list below refers to the weak-constraint possibilistic entropy, we stress that only property 12 does not extend to abstract entropies and strong-constraint entropies. As for the latter entropy, one has just to replace continuity to the right with continuity to the left in 12; as for abstract entropies, property 12 does not make sense, while the specification "for ε ∉ J(π)" is redundant in 11, and so can be omitted.
Comments about adequacy are given at the end of the list; many of the properties below had already been listed in [13]. In this section and in the following one we find it convenient to add the size of the alphabet in the notation for entropy, and so we write H_{k,ε}(π), or H_{kh,ε}(XY) for a bi-dimensional entropy. We stress that the equalities involving ε-entropies are uniform in the sense

¹¹ An alternative point of view might be to define the abstract entropy as any function which is intermediate between the weak-constraint and the strong-constraint entropy. In terms of fuzzy sets one might introduce abstract Hartley entropies, and not bother to specify whether the cuts are weak or strong; cf. Section 3.


that they hold for all values of ε (cf. [13] for pointwise equalities and pointwise equality criteria). In the same spirit, when we write a strict inequality involving ε-entropies, what we mean is that the weak inequality holds for all values of ε, but there is at least one value of ε for which the inequality is strict. In property 7*, by saying that Y is a deterministic function of X we mean that for each letter a of positive possibility, Poss{X = a} > 0, there is exactly one letter b such that the joint possibility Poss{X = a, Y = b} is positive. Observe that, by definition (1) of marginal possibility, there must be letters b such that Poss{X = a, Y = b} = Poss{X = a}, and so Poss{X = a} > 0 implies that there is at least one letter b whose joint possibility with a is positive.

Properties

0. H_{2,ε}(1, 1) = 1.
1. H_{k,ε}(π) ≥ 0.
1*. H_{k,ε}(π) = 0 if and only if π is deterministic.
2. H_{k,ε}(π) ≤ log k.
2*. H_{k,ε}(π) = log k if and only if π is vacuous.
3. H_{k,ε}(π) is permutation-invariant, i.e. does not change if one permutes the components of π.
4. H_{k,ε}(π) is insensitive to components ≤ ε, in particular to zeroes.
5. H_{k,ε}(π) is strictly increasing in π.
6. H_{k,ε}(π) is weakly decreasing in ε.
7. H_{k,ε}(X) ≤ H_{kh,ε}(XY).
7*. H_{k,ε}(X) = H_{kh,ε}(XY) if and only if Y is a deterministic function of X.
8. H_{kh,ε}(XY) ≤ H_{k,ε}(X) + H_{h,ε}(Y).
8*. H_{kh,ε}(XY) = H_{k,ε}(X) + H_{h,ε}(Y) if and only if X and Y are non-interactive.
9. H_{k,ε}(π) = H_{k,λ(ε)}(λ(π)) for any re-scaling λ.
10. If ε ∉ J(π) is fixed, H_{k,ε}(π) is continuous as a function of π.
11. If π is fixed, H_{k,ε}(π) is continuous for ε ∉ J(π).
12. If π is fixed, H_{k,ε}(π) is continuous to the right for ε ∈ J(π).

Proof. It will be enough to prove points 7, 7*, 8 and 8*, the rest being pretty obvious. To prove 7 and 7* first fix ε; by definition (1) of marginal possibilities and by Theorem 2, inequality 7 can also be written as

|{a: max_b Poss{X = a, Y = b} > ε}| ≤ |{(a, b): Poss{X = a, Y = b} > ε}|

and so it clearly holds true; it holds with equality when for each letter a such that Poss{X = a} > ε there is exactly one letter b such that Poss{X = a, Y = b} > ε. One soon obtains 7 and 7* by taking a "uniform" point of view, i.e. by requiring that the inequality and the equality hold for whatever value of ε. Similarly, to see why inequality 8 holds true for fixed ε, use again definition (1), Theorem 2 and an elementary property of logarithms to write it in the form:

|{(a, b): Poss{X = a, Y = b} > ε}| ≤ |{a: max_b Poss{X = a, Y = b} > ε}| × |{b: max_a Poss{X = a, Y = b} > ε}|.


It is enough to think of the rectangle formed by the couples (a, b) for which both marginal possibilities are > ε to see why the inequality is true: outside that rectangle all the joint possibilities are ≤ ε, because of the way marginal possibilities are defined. As for the equality criterion, just use Lemma 1 at the end of Section 3.
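As a numeric spot-check of inequalities 7 and 8 and of the equality criterion 8* (a sketch of ours, run on the non-interactive joint matrix of the example in Section 2; the ε grid and the tolerances are arbitrary):

```python
from math import log2

def h_joint(joint, eps):
    """H_eps of a joint possibility matrix: log2 of the number of pairs with possibility > eps."""
    return log2(sum(1 for row in joint for p in row if p > eps))

def h_marginal(joint, eps, axis):
    """H_eps of a max-marginal (axis=0: X over the rows, axis=1: Y over the columns)."""
    marg = [max(row) for row in joint] if axis == 0 else [max(col) for col in zip(*joint)]
    return log2(sum(1 for p in marg if p > eps))

joint = [[1.0, 0.5, 0.5],      # a non-interactive joint: rows = X, columns = Y
         [1.0, 0.5, 0.5]]

for eps in (0.0, 0.3, 0.7):
    hxy, hx, hy = h_joint(joint, eps), h_marginal(joint, eps, 0), h_marginal(joint, eps, 1)
    assert hx - 1e-12 <= hxy <= hx + hy + 1e-12    # properties 7 and 8
    assert abs(hxy - (hx + hy)) < 1e-12            # property 8*: equality under non-interactivity
print("properties 7, 8 and 8* hold on the tested grid")
```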

Comments

0. This property amounts to choosing a unit. The ignorance of one bit corresponds to a coin, about whose behaviour one is totally ignorant: one does not commit oneself in any way whatever as to whether the coin is fair or biased.
1. This inequality and the equality criterion 1* perfectly match analogue properties of Shannon entropy. The ignorance is zero only when the "partial" state of knowledge π is in fact complete.
2. This inequality and the equality criterion 2* perfectly match analogue properties of Shannon entropy, after replacing the uniform probability vector (maximum uncertainty) by the vacuous possibility vector (maximum ignorance).
3. This property, which is verified also by Shannon entropy, reflects the fact that ignorance cannot depend on the "names" a_i of the outcomes.
4. More explicitly, this property can be written as follows: if ρ is obtained from π by adding r components all ≤ ε, ρ = (π_1, ..., π_k, π_{k+1}, π_{k+2}, ..., π_{k+r}), then H_{k,ε}(π) = H_{k+r,ε}(ρ). In the case of Shannon entropy one has only insensitivity to zeroes; insensitivity to "small possibilities" is typical of its possibilistic counterpart. Recall the meaning of the "sensitivity threshold" ε: events whose possibility does not exceed ε (i.e. events which are made up only of letters whose possibility does not exceed ε) are viewed "as if they were impossible".
5. The implication is: π < ρ ⇒ H_{k,ε}(π) < H_{k,ε}(ρ); we recall that π < ρ means: ∀i: π_i ≤ ρ_i, ∃i: π_i < ρ_i. In technical terms, entropy seen as a mapping from possibility vectors to the real numbers of the interval [0, log k] is an order homomorphism; if π < ρ the state of knowledge ρ is marred by more ignorance (by less specificity) than the state of knowledge π.
6. The implication is: ε < ε′ ⇒ H_{k,ε}(π) ≥ H_{k,ε′}(π). Pragmatically, the level of ignorance depends also on the sensitivity threshold ε required. The less we demand, the less our ignorance matters.
7. This inequality and the equality criterion 7* perfectly match analogue properties of Shannon entropy. Unless Y is already "implicit" in X, the ignorance about the couple strictly exceeds the ignorance about the single component.
8. This inequality and the equality criterion 8* perfectly match analogue properties of Shannon entropy, after replacing independence by non-interactivity.
9. This property (invariance with respect to re-scaling) reflects our linguistic interpretation of possibilities.
10. This property perfectly matches an analogue and obvious property verified by Shannon entropy. In loose terms: if the possibility vector ρ is "similar" to the possibility vector π, in the sense that the Euclidean distance between ρ and π is "small", then the two entropies H_{k,ε}(π) and H_{k,ε}(ρ) are themselves "similar", at least when ε ∉ J(π). We recall that the Euclidean distance is "small" if and only if all the k differences π_i − ρ_i are so; therefore, if π and ρ are near enough, ε ∉ J(π) implies ε ∉ J(ρ). Note that also in the case of the operational probabilistic entropy H_{k,ε}(P) continuity in P fails for ε = 0, where Shannon entropy is replaced by Hartley measure.


11, 12. No special comment is needed to justify continuity where it holds, and this happens "almost everywhere", in the sense of measure theory. The bad news is that neither the weak-constraint nor the strong-constraint possibilistic entropies are continuous functions of ε over the whole interval [0, 1], unless π is unifocal; recall that J(π) denotes the set of components which appear in vector π, other than 0 and 1. A small change in the sensitivity threshold can bring about a great change in ignorance; cf. also our comment to property 4. Instead, the abstract entropy is continuous in ε, but this is only due to the "trick" of leaving it undefined over J(π); as already noticed, property 12 does not make sense for abstract entropies.

6. Axiomatic derivation of the possibilistic entropy: the uniqueness theorem

We proceed to choose an axiom system, and to prove a uniqueness theorem; solutions are looked for over the entire interval [0, 1]. A warning: here and below we are using the same symbol for the functional unknown which appears in the axiom system, and for one of the solutions found, i.e. for the weak-constraint possibilistic entropy; the resulting ambiguity causes no problems, and allows us to lighten our notation.

We begin by a uniqueness theorem limited to unifocal distributions π: each π_i is either 0 or 1. Recall that such a distribution is identified by a crisp non-void subset A ⊆ 𝒜, A ≠ ∅. To stress this fact we shall write π_A; one has π_i = 1 if and only if a_i ∈ A, else π_i = 0. Below, we basically impose additivity under non-interactivity, and so can avail ourselves of the following well-known lemma (cf. e.g. [7], which gives the easy proof); n and m are integers ≥ 1:

Lemma 2. The only solution to the functional system in the unknown f(n)

f(mn) = f(m) + f(n),  f(n) weakly increasing,  f(2) = 1

is f(n) = log_2 n.

The uniqueness theorem for unifocal possibilities, as the reader conversant with the theory of information measures will readily recognise, is simply a re-visitation of the axiomatic derivation of Hartley measure, as given e.g. in [7]. The properties involved are numbers 0, 3, 4, 5, 8*; we re-phrase them to the special case of unifocal distributions. Needless to say, in the case of unifocal distributions, the distinction among weak-constraint, strong-constraint and abstract entropies is void. Observe that the unique solution found does not depend on ε.

Axioms for unifocal distributions.

(a) H_{2,ε}(1, 1) = 1 (cf. property 0).
(b) H_{k,ε}(π_A) is permutation-invariant (cf. property 3).
(c) H_{k,ε}(π_A) is insensitive to zeroes (cf. property 4).
(d) H_{k,ε}(π_A) is weakly increasing with respect to set inclusion ⊆ (cf. property 5, after observing that π_A ≤ π_B if and only if A ⊆ B; actually, we are requiring less than what is stated in property 5, since we are not imposing that the increase must be strict).


(e) if A = A_𝒜 × A_ℬ is a rectangle, then H_{kh,ε}(π_A) = H_{k,ε}(π_{A_𝒜}) + H_{h,ε}(π_{A_ℬ}) (special case of property 8*).

Theorem 3 (Uniqueness theorem for unifocal distributions). If π = π_A is constrained to be a unifocal possibility distribution, the only solution to axioms (a), (b), (c), (d), (e) is

H_{k,ε}(π_A) = log_2 |A|,  0 ≤ ε ≤ 1.

Proof. Fix ε. Because of axioms (b) and (c) the solution H_{k,ε}(π_A) depends only on the number of ones in π_A, i.e. on the size of A, and so we can write H_{k,ε}(π_A) = f(|A|). The rest soon follows from Lemma 2.

We now go to general distributions π, and add properties 9, 10 and 12 as axioms. We re-state these properties, or rather the special cases we need; as for axiom (f), its meaning is made clear by Theorem 3. Axioms (i1), (i2) and (i3) are just "variants" of property 12, solely meant to discriminate among weak-constraint, strong-constraint and abstract entropy; they are inactive outside J(π).

Axioms for general distributions.

(f) If π = π_A is a unifocal possibility distribution, then H_{k,ε}(π) = log_2 |A| (cf. Theorem 3).
(g) if λ(ε) = ε, then H_{k,ε}(π) = H_{k,ε}(λ(π)) (special case of property 9: one requires invariance under re-scaling only for those re-scalings which leave the "sensitivity threshold" unvaried).
(h) Outside J(π), H_{k,ε}(π) is continuous as a function of π (cf. property 10).
(i1) H_{k,ε}(π) is undefined for ε ∈ J(π).
(i2) H_{k,ε}(π) is continuous to the right for ε ∈ J(π) (cf. property 12).
(i3) H_{k,ε}(π) is continuous to the left for ε ∈ J(π).

Outside J(π) axioms (i1), (i2) and (i3) will not be used, and the solution to the remaining axioms will turn out to be a continuous function of ε; continuity over the whole interval [0, 1] cannot be required if one insists on keeping axioms (f) to (h).

Theorem 4 (uniqueness theorem). The only solution to axioms (f), (g), (h), (i1) is the abstract possibilistic entropy as defined in Section 4. The only solution to axioms (f), (g), (h), (i2) is the weak-constraint possibilistic entropy. The only solution to axioms (f), (g), (h), (i3) is the strong-constraint possibilistic entropy.

Proof. Assume that π is not unifocal; fix ε outside J(π). After setting A = {a_i: π_i > ε}, or A = {a_i: π_i = 1} when ε = 1, we need to show that H_{k,ε}(π) = log |A|. Choose any δ such that 0 < δ < min[ε, 1 − ε]. Take a re-scaling λ(x) with λ(ε) = ε such that the components π_i < ε are transformed to λ(π_i) < δ, those > ε to λ(π_i) > 1 − δ. Clearly, ε = λ(ε) ∉ J(λ(π)). Now, if δ is "small", the re-scaled possibility vector λ(π) and the unifocal possibility vector π_A have a "small" Euclidean distance, and so, by axiom (h) which requires continuity in π, the difference between the two entropies H_{k,ε}(λ(π)) and H_{k,ε}(π_A) = log |A| is itself "small". After recalling that δ is arbitrary, use axiom (g), which states that H_{k,ε}(π) = H_{k,ε}(λ(π)), to prove the theorem for each ε ∉ J(π). Over J(π) use right continuity, or left continuity, respectively.


One cannot omit any one of the axioms (f)–(h) without introducing unwanted solutions for ε ∉ J(π): in this sense our axiom system is minimal. Let us prove this claim. If one omits axiom (f), then f(ε, π) equal to any constant independent of ε and π is a solution. If one omits axiom (g), then f(ε, π) = log(π_1 + π_2 + ... + π_k) is a solution. If one omits axiom (h), then f(ε, π) = log |{a_i: π_i ≠ 0}| is a solution.

7. Average entropies

In this section we assume that one has fixed a specific numeric scale for possibilities, to which one intends to stick without availing oneself of any "linguistic" re-scaling λ(x). If this is the case, one may wish to "get rid" of the particular reliability criterion ε chosen by resorting to the average values

∫_0^1 H_ε(P) dε,   ∫_0^1 H_ε(π) dε

with respect to ε, not only in the case of the probabilistic coding-theoretic entropy H_ε(P) but also in the case of the possibilistic coding-theoretic entropy H_ε(π). In the probabilistic case, the "Hartley step" has length zero and so does not influence the value of the integral; one simply re-obtains Shannon entropy:

∫_0^1 H_ε(P) dε = ℋ(P).

Let us proceed to the possibilistic case. Two functions which are equal "almost everywhere" yield the same integral, and so, from the point of view of averaging, it does not matter whether our reliability constraints are weak or strong, or whether we are referring to abstract entropies. Below we need the distinct ε-supports of π. They are as many as are the distinct non-zero components which appear in π, s = s(π), say: π*_1 = 1 > π*_2 > ... > π*_s ≠ 0 (1 ≤ s ≤ k). After setting π*_{s+1} = 0, independent of whether 0 is or is not a component of π, the distinct ε-supports are the s sets S_j = {a_i: π_i > π*_{j+1}}, 1 ≤ j ≤ s. By averaging the possibilistic ε-entropy with respect to ε, one obtains:

∫_0^1 H_ε(π) dε = Σ_{1≤j≤s} (π*_j − π*_{j+1}) log_2 |S_j|.

This is a well-known measure of non-specificity called U-uncertainty; cf. e.g. [7]. A nicer expression is obtained if the k components of π are assumed to be ordered: π_1 = 1 ≥ π_2 ≥ ... ≥ π_k ≥ 0. After setting π_{k+1} = 0, one obtains a form of the U-uncertainty which is often found in the literature:

∫_0^1 H_ε(π) dε = Σ_{1≤i≤k} (π_i − π_{i+1}) log_2 i.
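The averaging identity can be checked numerically; the sketch below (ours; the step size of the crude Riemann sum is arbitrary) computes U-uncertainty both through the ordered-components formula and by integrating H_ε(π) over ε.

```python
from math import log2

def poss_entropy(pi, eps):
    """Weak-constraint possibilistic entropy H_eps(pi) = log2 |{i : pi_i > eps}|."""
    return log2(sum(1 for p in pi if p > eps))

def u_uncertainty(pi):
    """U-uncertainty via the closed form: sum_i (pi_i - pi_{i+1}) log2(i), components sorted."""
    ordered = sorted(pi, reverse=True) + [0.0]   # pi_1 = 1 >= ... >= pi_k, and pi_{k+1} = 0
    return sum((ordered[i] - ordered[i + 1]) * log2(i + 1) for i in range(len(pi)))

pi = (1.0, 1.0, 0.5, 0.5)
closed_form = u_uncertainty(pi)
steps = 10_000                                   # crude Riemann sum of the integral of H_eps
integral = sum(poss_entropy(pi, j / steps) / steps for j in range(steps))
print(round(closed_form, 4), round(integral, 4)) # both close to 1.5 for this pi
```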

Since U-uncertainty is obtained by an averaging operation with respect to ε, it verifies all the properties verified by the entropy H_ε(π) which do not involve ε. In practice, all the properties from 1 to 10 are verified by U-uncertainty, save properties 6 and 9; in property 4 one must omit any mention of ε > 0. An axiomatic derivation of U-uncertainty is to be found in [7], and so we shall


not insist on it here. In [7] the axioms chosen are expansibility, subadditivity and additivity under non-interactivity, continuity in π, monotonicity, minimum and maximum, and also branching, i.e., apart from branching, our properties 4 with ε = 0, 8 and 8*, 10, 5, 1 and 1*, 2 and 2*. As for branching, which is in a way the less "palatable" axiom, cf. [7]. Of course, averaging has no form of invariance with respect to re-scaling, and so U-uncertainty, unlike the possibilistic entropy, takes it for granted that a numeric scale has been agreed upon¹² for linguistic labels, and no re-scaling will be performed.

Our approach enables us to give a coding-theoretic meaning also to U-uncertainty: U-uncertainty turns out to be the average asymptotic rate of optimal possibilistic codes, when the error threshold ε is chosen (totally) at random out of the interval [0, 1[. In a way, U-uncertainty is the typical value of the coding-theoretic possibilistic entropy. The fact that U-uncertainty can be seen as an average of Hartley measures had already been pointed out, e.g. in [7], but no genuine operational interpretation of U-uncertainty had been given so far. The operational theory which made this interpretation possible is the coding-theoretic approach to possibilistic information theory as expounded in [12].

Appendix A. Shannon entropy in the combinatorics of large numbers

This appendix is a comment on the comment [6] due to S. Guiaşu. He observed that both possibilistic entropies discussed in [13] can be obtained as objective values of non-linear programs where one maximises Shannon entropy subject to linear equalities and inequalities. For example the entropy axiomatised in this paper can be written as:

max_{p_1, p_2, ..., p_k} ℋ(p_1, p_2, ..., p_k)  subject to  Σ_i p_i = 1,  p_i ≥ 0,  p_i = 0 if π_i ≤ ε.

The occurrence of Shannon entropy is attributed in [6] to the fact that Shannon entropy is the only adequate measure of (probabilistic) uncertainty. The implication might be that the two possibilistic entropies of [13] are not as strictly non-probabilistic as one would like them to be. To foil this objection, we shall now argue that Shannon entropy has also a non-probabilistic meaning, which is even more "elementary" than information measures, and so in a way even more basic.

Let us consider a set of sequences A_n ⊆ 𝒜^n defined through a constraint x ∈ V_n which is assumed to be symmetric. By this we mean that if a sequence x verifies the constraint and so belongs to A_n, then any of its permutations will also satisfy the constraint, and so will also belong to A_n. In other words, the constraint does not check the sequence x as such, but checks only its composition (n_1, n_2, ..., n_k), n_i being the number of occurrences of letter a_i in sequence x. Equivalently, the constraint checks only the k relative frequencies f_i = n_i/n; the vector P_x = (f_1, f_2, ..., f_k) is a probability vector which is derived from sequence x, or rather from its composition. At this point, we may re-write the symmetric constraint x ∈ V_n in the convenient form P_x ∈ V_n. In the following, we assume that the length n of the sequences is "very large", and so, rather than the size of A_n, which is "too big" (it is easy to prove that there are exponentially many sequences with the same composition), it will be wise to be contented with its rate R_n = n⁻¹ log |A_n|. The asymptotic result below is part of

¹² Possibility numbers may be interpreted in more "objective" ways than as linguistic labels, for example as upper probabilities. As for the possible meanings of possibilities we refer once more to [3,5].


a powerful method expounded in [2], and widely used in information theory to prove asymptotic coding theorems:

R_n = max_{P_x ∈ V_n} ℋ(P_x) + δ_n,  δ_n → 0.

So, up to a "negligible" correction δ_n, the rate of the set A_n is equal¹³ to a suitable Shannon entropy. Roughly, if one wants to talk about sizes of sets of long sequences, one ends up talking about Shannon entropies. In spite of the occurrence of probability vectors, this is not uncertainty and randomness, this is just counting.

In particular, there is nothing surprising if probability vectors and Shannon entropies show up in a context like ours, where we count large symmetric sets of sequences (messages) output by an information source, and which is and remains strictly possibilistic.
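The counting role of Shannon entropy can be illustrated numerically: the sketch below (ours; the compositions chosen are arbitrary) compares the per-letter rate of a composition class, i.e. (1/n) log₂ of the number of sequences with a given composition, with the Shannon entropy of the corresponding relative frequencies; the gap is of the order (log n)/n.

```python
from math import comb, log2

def composition_class_size(counts):
    """Number of length-n sequences having exactly the composition (n_1, ..., n_k)."""
    remaining, size = sum(counts), 1
    for c in counts:
        size *= comb(remaining, c)
        remaining -= c
    return size

def empirical_entropy(counts):
    """Shannon entropy (bits) of the relative frequencies f_i = n_i / n."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

for counts in [(60, 30, 10), (600, 300, 100)]:
    n = sum(counts)
    rate = log2(composition_class_size(counts)) / n
    print(n, round(rate, 4), round(empirical_entropy(counts), 4))
# n = 100: rate ~ 1.23 vs entropy ~ 1.30; n = 1000: rate ~ 1.29 vs ~ 1.30 - the gap vanishes as n grows
```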

Appendix B. Compression codes for non-interactive possibilistic sources

In this appendix we deepen the coding-theoretic approach to entropy, so as to convince the reader that the axiomatic approach taken here has also a solid coding-theoretic basis.

We start by giving a streamlined description of what a source code is; we refer also to [1] or [2]. A code is made up of two elements, an encoder and a decoder: after choosing a length n, the encoder maps the n-sequences of 𝒜^n, i.e. the messages of n letters output by the information source, to binary strings of a fixed length l, which are called codewords; the decoder observes the received codeword and tries to guess the message. The basic element of a code is a subset C_n ⊆ 𝒜^n of messages, called the codebook. Out of the k^n messages output by the information source, only those belonging to the codebook C_n are given separate binary codewords and are properly recognised by the decoder; should the source output a message which is outside C_n, an alarm signal is triggered, and decoding is refused. So, an error (a detected error, actually) occurs if and only if the sequence output by the source does not belong to the codebook.

A good code should trade off two conflicting demands: the binary codewords should be short, so as to ensure compression of data, while the error probability, or the error possibility, as the case is, should be small so as to ensure reliability. In practice, one chooses a tolerated error probability or error possibility ε, 0 ≤ ε < 1, and then constructs a set C_n as small as possible with the constraint that the error probability or the error possibility is at most ε. The number n⁻¹⌈log |C_n|⌉ is called the code rate; ⌈log |C_n|⌉ is equal to the length l of the binary codewords which encode source messages, and so the rate is measured in number of bits per source letter (the notation ⌈x⌉ stands for the upper integer part of x). The fundamental optimisation problem of source coding boils down to finding a suitable codebook C_n as small as possible, so as to minimise the code rate, subject to a constraint like the following:

π^[n](C̄_n) ≤ ε

¹³ The reader who is more mathematically oriented will be happy to learn that the infinitesimal δ_n can be uniformly bounded, independently of V_n. Actually, the result above is more general than it would appear from the form in which we have stated it, because a set of sequences often possesses some "structure" which is enough to approximate it by symmetric constraints, better and better as n increases. The reader is again referred to [2].


or to an analogue one in the probabilistic case. Here the overbar denotes negation, or set complementation.

Actually, the possibilistic criterion has a very clear meaning, due to the fact that possibility is

maxitive. The optimal error region C̄_n, which has to contain as many sequences as possible, will contain all sequences whose possibility is at most ε, and no sequence whose possibility strictly exceeds ε; in practice, what is required is that all the sequences whose possibility is strictly greater than ε be given their own codeword (all "valuable" information must be properly recovered when decoding, while "junk" information¹⁴ is entirely lost).

de.ned as the limit of the rates of the optimal codebooks Cn=Cn(�) obtained as the length n of theencoded messages goes to in.nity:

H�(S) = limn→+∞

log |Cn(�)|n

; 06 � ¡ 1:

We stress that this de.nition is operational, i.e. it is given 15 in terms of a coding-theoretic problem.In information theory, asymptotic rates, i.e. entropies, are seen as “ideal” rates, obtainable indepen-dently of real-world time and complexity constraints (the reader who wants to deepen this importantpoint is again referred to standard texts like [1] or [2]). Should an �-entropy be equal to log k, nosubstantial data compression would be feasible at the required level of reliability, since log k is therate of the codebook formed by all the sequences of An; in this case we shall say that the sourceis incompressible. When a source S is possibilistic, stationary and non-interactive, rather than usingthe generic symbol H�(S) one can write H�(�), as we did in the body of the paper.In the case of a stationary and non-interactive source, for .xed message-length n the optimal

codebook is just made up by all the sequences formed by using letters whose possibility is strictlygreater than �. So, the optimal 16 codebook Cn is simply the Cartesian power {ai; �i¿�}n and thissoon gives the value of the possibilistic entropy as in [12].

Example. Assume that the set {ai; �i¿�} contains two letters, a and b, say. Each sequence in thecodebook is encoded by itself, just writing 0 instead of a, say, and 1 instead of b; the value of therate is exactly 1 bit. As soon as the encoder encounters a low possibility letter the alarm is trigerred;if the position of the low-possibility letter in message was the sth, the chunk of the remaining n− sletters is discarded, and the encoder proceeds directly to encoding the subsequent message. In a way,

¹⁴ This is a harsh cut, indeed. In [13] a more flexible point of view is adopted, and also messages with "few" low-possibility letters are encoded and properly decoded. The interactive entropy one obtains has a continuous behaviour, unlike the non-interactive entropy dealt with in this paper; its properties have been investigated in [13]. By and large, the non-interactive entropy dealt with here appears to be a less sophisticated tool than the entropy of [13] as far as coding is concerned, but in spite of this, or perhaps exactly because of this, it performs better (verifies nicer properties) when viewed as a possibilistic information measure: after all, possibility theory, or at least its linguistic-label interpretation, belongs to soft mathematics, and soft mathematics works well precisely in those situations when "unsophisticated" tools turn out to be the most appropriate.

¹⁵ If the limit does not exist, the corresponding source possesses no entropy. The integer ceiling has been dropped, since it has no influence on the limit.

¹⁶ The optimal code construction is so simple as to be disappointing. The reader should not be deluded into believing that this is always the case: as observed in [12], some possibilistic code constructions are equivalent to famous unsolved problems of combinatorics.


in this lucky case the encoder just has to transcribe integers written to the base 2 to themselves (messages and codewords are "blocks" of the same length; correspondingly, "small" binary integers are written with a run of zeroes as prefix). If the set {a_i: π_i > ε} contains h letters, say, the encoder will have to transcribe integers written to the base h to integers written to the base 2; one can avail oneself of the existing algorithms. The blocklength of the binary codewords is ⌈n log h⌉ bits.
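A toy version of this encoder (entirely ours; the function name and the choice of returning None for the alarm are arbitrary) transcribes a message over the letters of possibility > ε into a fixed-length binary codeword by base conversion, and triggers the alarm on any low-possibility letter.

```python
from math import ceil, log2

def encode(message, poss, eps):
    """Toy encoder for a non-interactive source: returns a fixed-length binary codeword,
    or None (the 'alarm') if the message contains a letter with possibility <= eps."""
    good = sorted(a for a, p in poss.items() if p > eps)   # the letters spanning the codebook
    h, n = len(good), len(message)
    length = ceil(n * log2(h))                             # blocklength of the binary codewords
    if any(poss[x] <= eps for x in message):
        return None                                        # detected error: message outside the codebook
    index = 0                                              # read the message as an integer in base h
    for x in message:
        index = index * h + good.index(x)
    return format(index, "b").zfill(length)                # small integers get a run of zeroes as prefix

poss = {"a": 1.0, "b": 1.0, "c": 0.5, "d": 0.5}
print(encode("ababa", poss, 0.5))   # '01010': here h = 2 and the rate is 1 bit per letter
print(encode("abada", poss, 0.5))   # None: 'd' has possibility 1/2 <= eps, the alarm is triggered
```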

References

[1] Th.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[2] I. Csiszár, J. Körner, Information Theory, Academic Press, New York, 1981.
[3] G. De Cooman, Possibility theory, Internat. J. Gen. Systems 25.4 (1997) 291–371.
[4] D. Dubois, H. Prade, Properties of measures of information in evidence and possibility theories, Fuzzy Sets and Systems 24 (1987) 161–182.
[5] D. Dubois, H.T. Nguyen, H. Prade, Possibility theory, probability and fuzzy sets: misunderstandings, bridges and gaps, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 343–438.
[6] S. Guiaşu, Comments on the paper on possibilistic entropies by A. Sgarro and L.P. Dinu, Internat. J. Uncertainty, Fuzziness Knowledge-Based Systems 10.6 (2002) 655–657.
[7] G.J. Klir, T.A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice-Hall, London, 1988.
[8] G.J. Klir, Measures of uncertainty and information, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 439–457.
[9] F.L. Luccio, A. Sgarro, Fuzzy graphs and error-proof keyboards, Proc. IPMU-2002, Annecy, France, July 2002, pp. 1503–1508.
[10] A. Sgarro, An open-frame theory of incomplete interval probabilities, Internat. J. Uncertainty, Fuzziness Knowledge-Based Systems 6.6 (1998) 551–562.
[11] A. Sgarro, The capacity of a possibilistic channel, in: S. Benferhat, Ph. Besnard (Eds.), Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Lecture Notes in Artificial Intelligence, vol. 2143, Springer, Berlin, 2001, pp. 398–409.
[12] A. Sgarro, Possibilistic information theory: a coding-theoretic approach, Fuzzy Sets and Systems 132–1 (2002) 11–32.
[13] A. Sgarro, L.P. Dinu, Possibilistic entropies and the compression of possibilistic data, Internat. J. Uncertainty, Fuzziness and Knowledge-Based Systems 10.6 (2002) 635–653.
[14] L. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1978) 3–28.