
Fuzzy Sets and Systems 132 (2002) 11–32, www.elsevier.com/locate/fss

Possibilistic information theory: a coding theoretic approach

Andrea Sgarro ∗

Department of Mathematical Sciences (DSM), University of Trieste, 34100 Trieste, Italy

Received 20 April 2001; accepted 19 November 2001

Abstract

We define information measures which pertain to possibility theory and which have a coding-theoretic meaning. We put forward a model for information sources and transmission channels which is possibilistic rather than probabilistic. In the case of source coding without distortion we define a notion of possibilistic entropy, which is connected to the so-called Hartley's measure; we also tackle the case of source coding with distortion. In the case of channel coding we define a notion of possibilistic capacity, which is connected to a combinatorial notion called graph capacity. In the probabilistic case Hartley's measure and graph capacity are relevant quantities only when the allowed decoding error probability is strictly equal to zero, while in the possibilistic case they are relevant quantities for whatever value of the allowed decoding error possibility; as the allowed error possibility becomes larger the possibilistic entropy decreases (one can reliably compress data to smaller sizes), while the possibilistic capacity increases (one can reliably transmit data at a higher rate). We put forward an interpretation of possibilistic coding, which is based on distortion measures. We discuss an application, where possibilities are used to cope with uncertainty as induced by a "vague" linguistic description of the transmission channel.
© 2001 Elsevier Science B.V. All rights reserved.

Keywords: Measures of information; Possibility theory; Possibilistic sources; Possibilistic entropy; Possibilistic channels; Possibilistic capacity; Zero-error information theory; Graph capacity; Distortion measures

1. Introduction

When one speaks of possibilistic information theory, usually one thinks of possibilistic information measures, like U-uncertainty, say, and of their use in uncertainty management; the approach which one takes is axiomatic, in the spirit of the validation of Shannon's entropy which is obtained by using Hinčin's axioms; cf. e.g. [8,12–14].

Partially supported by MURST and GNIM-CNR. Part of this paper, based mainly on Section 5, has been submitted for presentation at Ecsqaru-2001, to be held in September 2001 in Toulouse, France.

∗ Corresponding author. Tel.: +40-6762623; fax: +40-6762636. E-mail address: [email protected] (A. Sgarro).

In this paper we take a different approach: we define information measures which pertain to possibility theory and which have a coding-theoretic meaning. This kind of operational approach to information measures was first taken by Shannon when he laid down the foundations of information theory in his seminal paper of 1948 [18], and has proved to be quite successful; it has led to such important probabilistic functionals as are source entropy or channel capacity. Below we shall adopt a model for information sources and transmission channels which is possibilistic rather than probabilistic (based on logic rather than statistics); this will lead us to define a notion of possibilistic entropy and a notion of possibilistic capacity in much the same way as one arrives at the corresponding probabilistic notions. An interpretation of possibilistic coding is discussed, which is based on distortion measures, a notion which is currently used in probabilistic coding.


We are confident that our operational approach may be a contribution to enlighten, if not to disentangle, the vexed question of defining adequate information measures in possibility theory.

We recall that both the entropy of a probabilistic source and the capacity of a probabilistic channel are asymptotic parameters; more precisely, they are limit values for the rates of optimal codes, compression codes in the case of sources, and error-correction codes in the case of channels; the codes one considers are constrained to satisfy a reliability criterion of the type: the decoding-error probability of the code should be at most equal to a tolerated value ε, 0 ≤ ε < 1. A streamlined description of source codes and channel codes will be given below in Sections 4 and 5; even from our fleeting hints it is however apparent that, at least a priori, both the entropy of a source and the capacity of a channel depend on the value ε which has been chosen to specify the reliability criterion. If in the probabilistic models the mention of ε is usually omitted, the reason is that the asymptotic values for the optimal rates are the same whatever the value of ε, provided however that ε is strictly positive.¹ Zero-error reliability criteria lead instead to quite different quantities, zero-error entropy and zero-error capacity. Now, the problem of compressing information sources at zero error is so trivial that the term zero-error entropy is seldom used, if ever.²

¹ The entropy and the capacity relative to a positive error probability ε allow one to construct sequences of codes whose probability of a decoding error is actually infinitesimal; it will be argued below that this point of view does not make much sense for possibilistic coding; cf. Remark 4.3.

² No error-free data compression is feasible for probabilistic sources if one insists, as we do below, on using block-codes, i.e., codes whose codewords all have the same length; this is why one has to resort to variable-length codes, e.g., to Huffman codes. As for variable-length coding, the possibilistic theory appears to lack a counterpart for the notion of average length; one would have to choose one of the various aggregation operators which have been proposed in the literature (for the very broad notion of aggregation operators, and of "averaging" aggregations in particular, cf., e.g., [12] or [16]). Even if one insists on using block-codes, the problem of data compression at zero error is far from trivial when a distortion measure is introduced; cf. Appendix B. In this paper we deal only with the basics of Shannon's theory, but extensions are feasible to more involved notions, compound channels, say, or multi-user communication (as for these information-theoretic notions cf., e.g., [3] or [4]).

Instead, the zero-error problem of data protection in noisy channels is devilishly difficult, and has led to a new and fascinating branch of coding theory, and more generally of information theory and combinatorics, called zero-error information theory, which has been pretty recently overviewed and extensively referenced in [15]. In particular, the zero-error capacity of a probabilistic channel is expressed in terms of a remarkable combinatorial notion called Shannon's graph capacity (graph-theoretic preliminaries are described in Appendix A).

So, to be fastidious, even in the case of probabilistic entropy and probabilistic capacity one deals with two step-functions of ε, which can assume only two distinct values, one for ε = 0 and the other for ε > 0. We shall adopt a model of the source and a model of the channel which are possibilistic rather than probabilistic, and shall choose a reliability criterion of the type: the decoding-error possibility should be at most equal to ε, 0 ≤ ε < 1. As shown below, the possibilistic analogues of entropy and capacity exhibit quite a perspicuous step-wise behaviour as functions of ε, and so the mention of ε cannot be disposed of. As for the "form" of the functionals one obtains, it is of the same type as in the case of the zero-error probabilistic measures, even if the tolerated error possibility is strictly positive. In particular, the capacities of possibilistic channels are always expressed in terms of graph capacities; in the possibilistic case, however, as one loosens the reliability criterion by allowing a larger error possibility, the relevant graph changes and the capacity of the possibilistic channel increases.

We describe the contents of the paper. In Section 2, after some preliminaries on possibility theory, possibilistic sources and possibilistic channels are introduced. Section 3 contains two simple lemmas, Lemmas 3.1 and 3.2, which are handy tools apt to "translate" probabilistic zero-error results into the framework of possibility theory. Section 4 is devoted to possibilistic entropy and source coding; we have decided to deal in Section 4 only with the problem of source coding without distortion, and to relegate the more taxing case of source coding with distortion to an appendix (Appendix B); this way we are able to make many of our points in an extremely simple way. In Section 5, after giving a streamlined description of channel coding, possibilistic capacity is defined and a coding theorem is provided.


Section 6 explores the consequences of changing the reliability criterion used in Section 5: one requires that the average error possibility should be small, rather than the maximal error possibility.³ Up to Section 6, our point of view is rather abstract: the goal is simply to understand what happens when one replaces probabilities by possibilities in the standard models for data transmission. A discussion of the practical meaning of our proposal is instead deferred to Section 7: we put forward an interpretation of the possibilistic model which is based on distortion measures. We discuss an application to the design of error-correcting telephone keyboards; in the spirit of "soft" mathematics possibilities are seen as numeric counterparts for linguistic labels, and are used to cope with uncertainty as induced by "vague" linguistic information.

Section 7 also points to future work, which does not simply aim at a possibilistic translation and generalization of the probabilistic approach. Open problems are mentioned, which might prove to be stimulating also from a strictly mathematical viewpoint. In this paper we take the asymptotic point of view which is typical of Shannon theory, but one might prefer to take the constructive point of view of algebraic coding, and try to provide finite-length code constructions, such as those hinted at in Section 7. We deem that the need for a solid theoretical foundation of "soft" coding, as possibilistic coding basically is, is proved by the fact that several ad hoc coding algorithms are already successfully used in practice, e.g., those for compressing images, which are not based on probabilistic descriptions of the source or of the channel (an exhaustive list of source coding algorithms is to be found in [21]). Probabilistic descriptions, which are derived from statistical estimates, are often too costly to obtain, or even unfeasible, and at the same time they are uselessly detailed.

The paper aims at a minimum level of self-containment, and so we have briefly re-described certain notions of information theory which are quite standard.

³ The new possibilistic frame includes the traditional zero-error probabilistic frame, as argued in Section 3: it is enough to take possibilities which are equal to zero when the probability is zero, and equal to one when the probability is positive, whatever its value. However, the consideration of possibility values which are intermediate between zero and one does enlarge the frame; cf. Theorem 6.1 in Section 6, and the short comment made there just before giving its proof.

For more details we refer the reader, e.g., to [3] or [4]. As for possibility theory, and in particular for a clarification of the elusive notion of non-interactivity, which is often seen as the natural possibilistic analogue of probabilistic independence (cf. Section 2), we mention [5,6,9,11,12,16,23].

2. Possibilistic sources and possibilistic channels

We recall that a possibility distribution Π over a finite set 𝒜 = {a₁, …, a_k}, called the alphabet, is defined by giving a possibility vector Π = (π₁, π₂, …, π_k) whose components π_i are the possibilities Π(a_i) of the k singletons a_i (1 ≤ i ≤ k, k ≥ 2):
\[
\Pi(a_i) = \pi_i, \qquad 0 \le \pi_i \le 1, \qquad \max_{1 \le i \le k} \pi_i = 1.
\]
The possibility⁴ of each subset A ⊆ 𝒜 is the maximum of the possibilities of its elements:
\[
\Pi(A) = \max_{a_i \in A} \pi_i. \tag{2.1}
\]

In particular Π(∅) = 0 and Π(𝒜) = 1. In logical terms taking a maximum means that event A is Π-possible when at least one of its elements is so, in the sense of a logical disjunction.

Instead, probability distributions are defined through a probability vector P = (p₁, p₂, …, p_k), with P(a_i) = p_i, 0 ≤ p_i ≤ 1 and Σ_{1≤i≤k} p_i = 1, and have an additive nature:
\[
P(A) = \sum_{a_i \in A} p_i.
\]
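To make the maxitive/additive contrast concrete, here is a minimal Python sketch (not part of the paper; the alphabet and the numeric values are illustrative assumptions) that evaluates the possibility Π(A) as a maximum and the probability P(A) as a sum.

```python
# Minimal sketch: the possibility of an event is a maximum, its probability a sum.
# The alphabet and the numeric values below are illustrative assumptions.

def poss_of_event(poss, event):
    """Pi(A) = max_{a in A} Pi(a); Pi of the empty set is 0."""
    return max((poss[a] for a in event), default=0.0)

def prob_of_event(prob, event):
    """P(A) = sum_{a in A} P(a)."""
    return sum(prob[a] for a in event)

Pi = {"a1": 1.0, "a2": 0.5, "a3": 0.5, "a4": 0.0}    # possibility vector, max = 1
P  = {"a1": 0.5, "a2": 0.25, "a3": 0.25, "a4": 0.0}  # probability vector, sums to 1

A = {"a2", "a3", "a4"}
print(poss_of_event(Pi, A))  # 0.5 (maxitive)
print(prob_of_event(P, A))   # 0.5 (additive)
```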

With respect to probabilities, an empirical interpretation of possibilities is less clear. The debate on the meaning and the use of possibilities is an ample and long-standing one; the reader is referred to standard texts on possibility theory, e.g., those quoted at the end of Section 1; cf. also Section 7, where the applicability of our model to real-world data transmission is discussed.

⁴ The fact that the symbol Π is used both for vectors and for distributions will cause no confusion; below the same symbol will also be used to denote a stationary and non-interactive source, since the behaviour of the latter is entirely specified by the vector Π. Similar conventions will be tacitly adopted also in the case of probabilistic sources, and of probabilistic and possibilistic channels.


The probability distribution P over 𝒜 can be extended in a stationary and memoryless way to a probability distribution P^n over the Cartesian power 𝒜^n by setting for each sequence x = x₁x₂…x_n ∈ 𝒜^n:
\[
P^n(x) = \prod_{1 \le i \le n} P(x_i).
\]

We recall that the elements of 𝒜^n are the k^n sequences of length n built over the alphabet 𝒜. Each such sequence can be interpreted as the information which is output in n time instants by a stationary and memoryless source, or SML source. The memoryless nature of the source is expressed by the fact that the n probabilities P(x_i) are multiplied. Similarly, we shall extend the possibility distribution Π in a stationary and non-interactive way to a possibility distribution Π^[n] over the Cartesian power 𝒜^n:

Definition 2.1. A stationary and non-interactive information source over the alphabet 𝒜 is defined by setting for each sequence x ∈ 𝒜^n:
\[
\Pi^{[n]}(x) = \min_{1 \le i \le n} \Pi(x_i).
\]

In logical terms, this means that the occurrence of sequence x = x₁x₂…x_n is declared Π-possible when this is so for all of the letters x_i, in the sense of a logical conjunction. An interpretation of non-interactivity in our models of sources and channels is discussed in Section 7.
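As a reading aid (not part of the paper), the following Python sketch contrasts the stationary memoryless extension P^n, a product, with the stationary non-interactive extension Π^[n], a minimum; the single-letter values P and Pi are made-up placeholders.

```python
from math import prod

# Illustrative single-letter descriptions (assumed values, not from the paper).
P  = {"a": 0.5, "b": 0.5}   # probability vector
Pi = {"a": 1.0, "b": 0.5}   # possibility vector (max = 1)

def P_n(x):
    """SML extension: P^n(x) = prod_i P(x_i)."""
    return prod(P[letter] for letter in x)

def Pi_n(x):
    """SNI extension: Pi^[n](x) = min_i Pi(x_i)."""
    return min(Pi[letter] for letter in x)

x = "abb"
print(P_n(x))   # 0.125: probabilities multiply
print(Pi_n(x))  # 0.5:   possibilities are combined by a minimum (conjunction)
```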

Let 𝒜 = {a₁, …, a_k} and ℬ = {b₁, …, b_h} be two alphabets, called in this context the input alphabet and the output alphabet, respectively. Probabilistic channels are usually described by giving a stochastic matrix W whose rows are headed to the input alphabet 𝒜 and whose columns are headed to the output alphabet ℬ. We recall that the k rows of such a stochastic matrix are probability vectors over the output alphabet ℬ; each entry W(b|a) is interpreted as the transition probability from the input letter a ∈ 𝒜 to the output letter b ∈ ℬ. A stationary and memoryless channel W^n, or SML channel, extends W to n-tuples, and is defined by setting for each x ∈ 𝒜^n and each y = y₁y₂…y_n ∈ ℬ^n:
\[
W^n(y \mid x) = W^n(y_1 y_2 \dots y_n \mid x_1 x_2 \dots x_n) = \prod_{i=1}^{n} W(y_i \mid x_i). \tag{2.2}
\]

Note that W^n is itself a stochastic matrix whose rows are headed to the sequences in 𝒜^n, and whose columns are headed to the sequences in ℬ^n. The memoryless nature of the channel is expressed by the fact that the n transition probabilities W(y_i|x_i) are multiplied.

We now define the possibilistic analogue of stochastic (probabilistic) matrices. The k rows of a possibilistic matrix Π with h columns are possibility vectors over the output alphabet ℬ. Each entry Π(b|a) will be interpreted as the transition possibility⁵ from the input letter a ∈ 𝒜 to the output letter b ∈ ℬ; cf. the example given below. In Definition 2.2, Π is such a possibilistic matrix.

Definition 2.2. A stationary and non-interactive channel, or SNI channel, Π^[n], extends Π to n-tuples and is defined as follows:
\[
\Pi^{[n]}(y \mid x) = \Pi^{[n]}(y_1 y_2 \dots y_n \mid x_1 x_2 \dots x_n) = \min_{1 \le i \le n} \Pi(y_i \mid x_i). \tag{2.3}
\]

Products as in (2.2) are replaced in (2.3) by a minimum operation; this expresses the non-interactive nature of the extension. Note that Π^[n] is itself a possibilistic matrix whose rows are headed to the sequences in 𝒜^n, and whose columns are headed to the sequences in ℬ^n.

⁵ Of course transition probabilities and transition possibilities are conditional probabilities and conditional possibilities, respectively, as made clear by our notation, which uses a conditioning bar. We have avoided mentioning explicitly the notion of conditional possibilities because they are the object of a debate which is far from being closed (cf. e.g., Part II of [5]); actually, the worst problems are met when one starts by assigning a joint distribution and wants to compute the marginal and conditional ones. In our case it is instead conditional possibilities that are the starting point: as argued in [2], "prior" conditional possibilities are not problematic, or rather they are no more problematic than possibilities in themselves.


Taking the minimum of the n transition possibilities Π(y_i|x_i) can be interpreted as a logical conjunction: only when all the transitions are Π-possible is it Π-possible to obtain output y from input x; cf. also Section 7. If B is a subset of ℬ^n, one has, in accordance with (2.1):
\[
\Pi^{[n]}(B \mid x) = \max_{y \in B} \Pi^{[n]}(y \mid x).
\]

Example 2.1. For 𝒜 = ℬ = {a, b} we show a possibilistic matrix Π and its "square" Π^[2], which specifies the transition possibilities from input couples to output couples. The possibility that a is received when b is sent is μ; this is also the possibility that aa is received when ab is sent, say; 0 ≤ μ ≤ 1. Take B = {aa, bb}; then Π^[2](B|ab) = max[μ, 0] = μ. In Section 6 this example will be used assuming μ ≠ 0, μ ≠ 1.

  Π  |  a   b          Π^[2] |  aa  ab  ba  bb
  ---+--------         ------+-----------------
  a  |  1   0            aa  |  1   0   0   0
  b  |  μ   1            ab  |  μ   1   0   0
                         ba  |  μ   0   1   0
                         bb  |  μ   μ   μ   1
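The square Π^[2] can be generated mechanically from Π by taking minima, as in (2.3). The short Python check below is only an illustration, with μ instantiated to an arbitrary placeholder value of 0.3; it reproduces the table above together with Π^[2](B|ab) = μ.

```python
from itertools import product

mu = 0.3   # any value with 0 <= mu <= 1; 0.3 is an arbitrary placeholder
Pi = {("a", "a"): 1.0, ("b", "a"): 0.0,   # Pi(output | input), Example 2.1
      ("a", "b"): mu,  ("b", "b"): 1.0}

def Pi_n(y, x):
    """Pi^[n](y|x) = min_i Pi(y_i|x_i), the SNI channel extension (2.3)."""
    return min(Pi[(yi, xi)] for yi, xi in zip(y, x))

couples = ["".join(t) for t in product("ab", repeat=2)]
for x in couples:
    print(x, [Pi_n(y, x) for y in couples])
# Rows come out as (1,0,0,0), (mu,1,0,0), (mu,0,1,0), (mu,mu,mu,1).

B = ["aa", "bb"]
print(max(Pi_n(y, "ab") for y in B))   # = mu, as claimed in Example 2.1
```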

3. A few lemmas

Sometimes the actual value of a probability does not matter; what matters is only whether that probability is zero or non-zero, i.e., whether the corresponding event E is "impossible" or "possible". The canonical transformation maps probabilities to binary (zero-one) possibilities by setting Poss{E} = 0 if and only if Prob{E} = 0, and Poss{E} = 1 otherwise; this transformation can be applied to the components of a probability vector P or to the components of a stochastic matrix W to obtain a possibility vector Π or a possibilistic matrix Π, respectively. Below we shall introduce an equivalence relation called ε-equivalence which in a way extends the notion of a canonical transformation; here and in the sequel ε is a real number such that 0 ≤ ε < 1. It will appear that a vector Π or a matrix Π obtained canonically from P or from W are ε-equivalent to P or to W, respectively, for whatever value of ε.

Definition 3.1. A probability vector P and a possibility vector Π over the alphabet 𝒜 are said to be ε-equivalent when the following double implication holds ∀a ∈ 𝒜:
\[
P(a) = 0 \iff \Pi(a) \le \varepsilon.
\]

The following lemma shows that ε-equivalence, rather than a relation between letters, is a relation pertaining to the extended distributions P^n and Π^[n], seen as set-functions over 𝒜^n:

Lemma 3.1. Fix n ≥ 1. The probability vector P and the possibility vector Π are ε-equivalent if and only if the following double implication holds ∀A ⊆ 𝒜^n:
\[
P^n(A) = 0 \iff \Pi^{[n]}(A) \le \varepsilon.
\]

Proof. To prove that the double implication implies ε-equivalence, just take A = {aa…a} for each letter a ∈ 𝒜. Now we prove that if P and Π are ε-equivalent then the double implication in Lemma 3.1 holds true. First assume A is a singleton, containing only the sequence x. The following chain of double implications holds:
\[
\begin{aligned}
P^n(x) = 0 &\iff \exists i: P(x_i) = 0 \iff \exists i: \Pi(x_i) \le \varepsilon\\
&\iff \min_i \Pi(x_i) \le \varepsilon \iff \Pi^{[n]}(x) \le \varepsilon.
\end{aligned}
\]
This means that, if the two vectors P and Π are ε-equivalent, so are also P^n and Π^[n], seen as vectors with k^n components. Then the following chain holds too, whatever the size of A:
\[
\begin{aligned}
P^n(A) = 0 &\iff \forall x \in A: P^n(x) = 0 \iff \forall x \in A: \Pi^{[n]}(x) \le \varepsilon\\
&\iff \max_{x \in A} \Pi^{[n]}(x) \le \varepsilon \iff \Pi^{[n]}(A) \le \varepsilon.
\end{aligned}
\]

However simple, Lemma 3.1 and its straightforward generalization to channels, Lemma 3.2 below, are the basic tools used to convert probabilistic zero-error coding theorems into possibilistic ones.
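Definition 3.1 is easy to test mechanically; the sketch below (an illustration with assumed numeric vectors, not part of the paper) checks ε-equivalence letter by letter, which by Lemma 3.1 is all that is needed for the extensions to n-tuples as well.

```python
def eps_equivalent(P, Pi, eps):
    """P(a) = 0  <=>  Pi(a) <= eps, for every letter a (Definition 3.1)."""
    return all((P[a] == 0) == (Pi[a] <= eps) for a in P)

# Assumed example vectors: P vanishes exactly where Pi does not exceed eps = 0.2.
P  = {"a": 0.7, "b": 0.3, "c": 0.0}
Pi = {"a": 1.0, "b": 0.6, "c": 0.2}
print(eps_equivalent(P, Pi, 0.2))   # True
print(eps_equivalent(P, Pi, 0.6))   # False: Pi(b) <= 0.6, yet P(b) > 0
```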


Definition 3.2. A stochastic matrix W and a possibilistic matrix Π are said to be ε-equivalent when the following double implication holds ∀a ∈ 𝒜, ∀b ∈ ℬ:
\[
W(b \mid a) = 0 \iff \Pi(b \mid a) \le \varepsilon.
\]

Lemma 3.1 soon generalizes as follows:

Lemma 3.2. Fix n ≥ 1. The stochastic matrix W and the possibilistic matrix Π are ε-equivalent if and only if the following double implication holds ∀x ∈ 𝒜^n, ∀B ⊆ ℬ^n:
\[
W^n(B \mid x) = 0 \iff \Pi^{[n]}(B \mid x) \le \varepsilon.
\]

In Sections 5 and 6 on channel coding we shall need the following notion of confoundability between letters: two input letters a and a′ are confoundable for the probabilistic matrix W if and only if there exists at least one output letter b such that the transition probabilities W(b|a) and W(b|a′) are both strictly positive. Given the matrix W, one can construct a confoundability graph G(W) whose vertices are the letters of 𝒜 by joining two letters by an edge if and only if they are confoundable (graph-theoretic notions are described in Appendix A).

We now define a similar notion for possibilistic matrices; to this end we introduce a proximity index δ between possibility vectors Π = (π₁, π₂, …) and Π′ = (π′₁, π′₂, …), which in our case will be possibility vectors over the output set ℬ:
\[
\delta(\Pi, \Pi') = \max_{1 \le i \le h} [\pi_i \wedge \pi'_i].
\]

Above the wedge symbol ∧ stands for a minimum and is used only to improve readability. The index δ is symmetric: δ(Π, Π′) = δ(Π′, Π). One has 0 ≤ δ(Π, Π′) ≤ 1, with δ(Π, Π′) = 0 if and only if Π and Π′ have disjoint supports, and δ(Π, Π′) = 1 if and only if there is at least one letter a for which Π(a) = Π′(a) = 1; in particular, this happens when Π = Π′ (we recall that the support of a possibility vector is made up of those letters whose possibility is strictly positive).

The proximity index δ will be extended to input letters a and a′ by taking the corresponding rows in the possibilistic matrix Π:
\[
\delta_\Pi(a, a') = \delta\bigl(\Pi(\cdot \mid a), \Pi(\cdot \mid a')\bigr) = \max_{b \in \mathcal{B}} [\Pi(b \mid a) \wedge \Pi(b \mid a')].
\]

Example 3.1. We re-take Example 2.1 above. One has: δ_Π(a, a) = δ_Π(b, b) = 1, δ_Π(a, b) = μ. With respect to Π^[2], the proximity of two letter couples x and x′ is either 1 or μ, according to whether x = x′ or x ≠ x′ (recall that Π^[2] can be viewed as a possibilistic matrix over the "alphabet" of letter couples). Cf. also Examples 5.1, 5.2 and the example worked out in Section 7.

Definition 3.3. Once a possibilistic matrix Π and a number ε are given (0 ≤ ε < 1), two input letters a and a′ are said to be ε-confoundable if and only if their proximity exceeds ε:
\[
\delta_\Pi(a, a') > \varepsilon.
\]

Given Π and ε, one constructs the ε-confoundability graph G_ε(Π), whose vertices are the letters of 𝒜, by joining two letters by an edge if and only if they are ε-confoundable for Π.

Lemma 3.3. If the stochastic matrix W and the possibilistic matrix Π are ε-equivalent, the two confoundability graphs G(W) and G_ε(Π) coincide.

Proof. We have to prove that, under the assumption of ε-equivalence, any two input letters a and a′ are confoundable for the stochastic matrix W if and only if they are ε-confoundable for the possibilistic matrix Π. The following chain of double implications holds:
\[
\begin{aligned}
a \text{ and } a' \text{ are confoundable for } W &\iff \exists b: W(b \mid a) > 0,\ W(b \mid a') > 0\\
&\iff \exists b: \Pi(b \mid a) > \varepsilon,\ \Pi(b \mid a') > \varepsilon\\
&\iff \max_b\,[\Pi(b \mid a) \wedge \Pi(b \mid a')] > \varepsilon\\
&\iff \delta_\Pi(a, a') > \varepsilon\\
&\iff a \text{ and } a' \text{ are } \varepsilon\text{-confoundable for } \Pi.
\end{aligned}
\]
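The proximity index and the ε-confoundability graph of Definition 3.3 translate directly into code; the following sketch (illustrative, not part of the paper) lists the edges of G_ε(Π) for a possibilistic matrix given row by row, using the matrix of Example 2.1 with a placeholder value for μ.

```python
def proximity(row1, row2):
    """delta(Pi, Pi') = max_b [Pi(b) ^ Pi'(b)], where ^ is the minimum."""
    return max(min(p, q) for p, q in zip(row1, row2))

def confoundability_graph(Pi_rows, eps):
    """Edges of G_eps(Pi): pairs of input letters with proximity > eps."""
    letters = sorted(Pi_rows)
    return [(a, b) for i, a in enumerate(letters) for b in letters[i + 1:]
            if proximity(Pi_rows[a], Pi_rows[b]) > eps]

# Rows of the matrix Pi of Example 2.1, with mu = 0.3 as a placeholder value.
Pi_rows = {"a": [1.0, 0.0], "b": [0.3, 1.0]}
print(confoundability_graph(Pi_rows, 0.0))   # [('a', 'b')]: the graph is complete
print(confoundability_graph(Pi_rows, 0.3))   # []: the graph is edge-free
```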


Remark 3.1. The index δ_Π(a, a′) establishes a fuzzy relation between input letters, which may be represented by means of a fuzzy graph G(Π) with vertex set equal to the input alphabet 𝒜: each edge (a, a′) belongs to the edge set of G(Π) with a degree of membership equal to δ_Π(a, a′). Then the crisp graphs G_ε(Π) are obtained as (strong) ε-cuts of the fuzzy graph G(Π). By the way, we observe that δ_Π(a, a′) is a proximity relation in the technical sense of fuzzy set theory [17]. For basic notions in fuzzy set theory cf., e.g., [7,12] or [16].

4. The entropy of a possibilistic source

We start with the following general observation, which applies both to source and to channel coding. The elements which define a code f, i.e., the encoder f⁺ and the decoder f⁻ (cf. below), do not require a probabilistic or a possibilistic description of the source, or of the channel, respectively. One must simply choose the alphabets (or at least the alphabet sizes): a primary alphabet 𝒜, which is the source alphabet in the case of sources and the input alphabet in the case of channels, and a secondary alphabet ℬ, which is the reproduction alphabet in the case of sources and the output alphabet in the case of channels.⁶ One must also specify a length n, which is the length of the messages which are encoded in the case of sources, and the length of the codewords which are sent through the channel in the case of channels, respectively. Once these elements, 𝒜, ℬ and n, have been chosen, one can construct a code f, i.e., an encoder/decoder couple. Then one can study the performance of f by varying the "behaviour" of the source (of the channel, respectively): for example one can first assume that this behaviour has a probabilistic nature, while later one changes to a less committal possibilistic description.

The first coding problem which we tackle is data compression without distortion.

⁶ Actually, in source coding without distortion the primary alphabet and the reproduction alphabet coincide, and so the latter is not explicitly mentioned in Section 4.

The results of this section, or at least their asymptotic counterparts, inclusive of the notion of ε-entropy, might have been obtained as a very special case of data compression with distortion, as explained in Appendix B. The reason for confining the general case with distortion to an appendix is just ease of readability: actually, data compression without distortion as covered in this section offers no real problem from a mathematical point of view, but at the same time it is very typical of the novelties which our possibilistic approach to coding presents with respect to the standard probabilistic approach.

We give a streamlined description of what a source code f is; for more details we refer to [3] or [4]. A code f is made up of two elements, an encoder f⁺ and a decoder f⁻; the encoder maps the n-sequences of 𝒜^n, i.e., the messages output by the information source, to binary strings of a fixed length l called codewords; the decoder maps codewords back to messages in a way which should be "reliable", as specified below. In practice (and without loss of generality), the basic element of a code is a subset C ⊆ 𝒜^n of messages, called the codebook. The idea is that, out of the k^n messages output by the information source, only those belonging to the codebook C are given separate binary codewords and are properly recognized by the decoder; should the source output a message which does not belong to C, then the encoder will use any of the binary codewords meant for the messages in C, and so a decoding error will be committed. Thinking of a source which is modelled probabilistically, as is standard in information theory, a good code should trade off two conflicting demands: the binary codewords should be short, so as to ensure compression of data, while the error probability should be small, so as to ensure reliability. In practice, one chooses a tolerated error probability ε, 0 ≤ ε < 1, and then constructs a set C as small as possible with the constraint that its probability be at least as great as 1 − ε. The number (1/n) log |C| is called the code rate; log |C| is interpreted as the (not necessarily integer) length of the binary sequences which encode source sequences,⁷ and so the rate is measured as a number of bits per source letter.

⁷ In Shannon theory one often incurs the slight but convenient inaccuracy of allowing non-integer "lengths". By the way, the logarithms here and below are all to the base 2, and so the unit we choose for information measures is the bit. Bars as in |C| denote size, i.e., number of elements. Notice that, so as not to overcharge our notation, the mention of the length n is not made explicit in the symbols which denote coding functions f and codebooks C.


Consequently, the fundamental optimization problem of probabilistic source coding boils down to finding a suitable codebook C:

  Minimize the code rate (1/n) log |C| with the constraint Prob{¬C} ≤ ε    (4.1)

(the symbol ¬ denotes negation, or set complementation). As is usual, we shall consider only Bernoullian (i.e., stationary and memoryless, or SML) sources, which are completely described by the probability vector P over the alphabet letters of 𝒜; then in (4.1) the generic indication of probability can be replaced by the more specific symbol P^n.

Given the SML source P, its ε-entropy H_ε(P) is defined as the limit of the rates R_n(P, ε) of optimal codes which solve the optimization problem (4.1), obtained as the length n of the encoded messages goes to infinity:
\[
H_\varepsilon(P) = \lim_{n \to +\infty} R_n(P, \varepsilon).
\]

In the probabilistic theory there is a dramatic difference between the case ε ≠ 0 and the case ε = 0. Usually one tackles the case ε ≠ 0, the only one which allows actual data compression, as can be proved. It is well known that the rates of optimal codes tend to the Shannon entropy ℋ(P) as n goes to infinity:
\[
H_\varepsilon(P) = \mathcal{H}(P) = -\sum_{1 \le i \le k} p_i \log p_i, \qquad 0 < \varepsilon < 1
\]

(we use the script symbol ℋ to distinguish Shannon entropy from the operational entropy H). So Shannon entropy is the asymptotic value of optimal rates. Note that this asymptotic value does not depend on the tolerated error probability ε > 0; only the speed of convergence is affected; this is why the mention of ε is in most cases altogether omitted. Instead, we find it convenient to explicitly mention ε, and say that the (probabilistic) ε-entropy H_ε(P) of the SML source ruled by the probability vector P is equal to the Shannon entropy ℋ(P) for whatever ε > 0.

Let us go to the case ε = 0. In this case the structure of optimal codebooks is extremely simple: each sequence of positive probability must be given its own codeword, and so the optimal codebook is
\[
C = \{a: P(a) > 0\}^n \tag{4.2}
\]

whose rate log |{a: P(a) > 0}| is the same whatever the given length n. Consequently, this is also the value of the zero-error entropy H₀(P) of the SML source:
\[
H_0(P) = \log |\{a: P(a) > 0\}|.
\]

For ε strictly positive one has, as is well known, H_ε(P) = ℋ(P) ≤ H₀(P), the inequality being strict unless P is uniform over its support. Note that the ε-entropy is a step-function of ε; however, the function's step is obtained only if one keeps the value ε = 0, which is rather uninteresting because it corresponds to a situation where no data compression is feasible, but only data transcription into binary; this happens, say, when one uses ASCII. The zero-error entropy H₀(P) is sometimes called Hartley's measure (cf. [12]); in the present context it might rather be called Hartley's entropy (ε = 0), to be set against Shannon's entropy (ε > 0).

Example 4.1. Take P = (1/2, 1/4, 1/4, 0) over an alphabet 𝒜 of four letters. For ε = 0 one has H₀(P) = log 3 ≈ 1.585, while H_ε(P) = ℋ(P) = 1.5 whenever 0 < ε < 1.
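For the record, the two values in Example 4.1 can be recomputed as follows (a plain numerical check, nothing more):

```python
from math import log2

P = [1/2, 1/4, 1/4, 0]

H0 = log2(sum(1 for p in P if p > 0))        # Hartley / zero-error entropy
H  = -sum(p * log2(p) for p in P if p > 0)   # Shannon entropy

print(round(H0, 3), round(H, 3))   # 1.585 1.5
```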

We now go to a stationary and non-interactive source, or SNI source, over the alphabet 𝒜, which is entirely described by the possibilistic vector Π over alphabet letters. The source coding optimization problem (4.1) will be replaced by (4.3), where one bounds the decoding error possibility rather than the decoding error probability:

  Minimize the code rate (1/n) log |C| with the constraint Π^[n](¬C) ≤ ε    (4.3)

Now we shall define the possibilistic entropy; as in the probabilistic case, the definition is operational, i.e., it is given in terms of a coding problem.

Definition 4.1. Given the stationary and non-interactive source Π, its possibilistic ε-entropy H_ε(Π) is defined as the limit of the rates R_n(Π, ε) of optimal codes which solve the optimization problem (4.3), obtained as the length n goes to infinity:
\[
H_\varepsilon(\Pi) = \lim_{n \to +\infty} R_n(\Pi, \varepsilon).
\]


Because of Lemma 3.1, the constraint in (4.3) can be rewritten as P^n{¬C} = 0 for whatever P that is ε-equivalent with Π. This means that solving the minimization problem (4.3) is the same as solving the minimization problem (4.1) at zero error for whatever P that is ε-equivalent with Π. So, the following lemma holds:

Lemma 4.1. If P and Π are ε-equivalent, the very same code f = (f⁺, f⁻) which is optimal for criterion (4.1) at zero error, with P^n(¬C) = 0, is also optimal for criterion (4.3) at ε-error, with Π^[n](¬C) ≤ ε, and conversely.

A comparison with (4.2) shows that an optimal codebook C for (4.3) is formed by all the sequences of length n which are built over the sub-alphabet of those letters whose possibility exceeds ε:
\[
C = \{a: \Pi(a) > \varepsilon\}^n.
\]

Consequently:

Theorem 4.1. The possibilistic ε-entropy H_ε(Π) is given by:
\[
H_\varepsilon(\Pi) = \log |\{a: \Pi(a) > \varepsilon\}|, \qquad 0 \le \varepsilon < 1.
\]

The fact that the possibilistic entropy is obtained as the limit of a constant sequence of optimal rates is certainly disappointing; however, asymptotic optimal rates are not always so trivially found, as will appear when we discuss channel coding (Section 5) or source coding with distortion (Appendix B); we shall comment there that reaching an optimal asymptotic value "too soon" (for n = 1) corresponds to a situation where one is obliged to use trivial code constructions. In a way, we have simply proved that in possibilistic source coding without distortion trivial code constructions are unavoidable.

Below we stress explicitly the obvious fact that the possibilistic entropy H_ε(Π) is a stepwise non-increasing function of ε, 0 ≤ ε < 1. The steps of the function H_ε(Π) begin in correspondence with the distinct possibility components π_i < 1 which appear in the vector Π, inclusive of the value π₀ = 0 even if 0 is not a component of Π; below the term "consecutive" refers to an ordering of the numbers π_i.

Proposition 4.1. If 0 ≤ ε < ε′ < 1, then H_ε(Π) ≥ H_{ε′}(Π). If π_i < π_{i+1} are two consecutive entries in Π, then H_ε(Π) is constant for π_i ≤ ε < π_{i+1}.

Example 4.2. Take Π = (1, 1, 1/2, 1/2, 1/4, 0) over an alphabet 𝒜 of six letters. Then H_ε(Π) = log 5 ≈ 2.322 when 0 ≤ ε < 1/4, H_ε(Π) = log 4 = 2 when 1/4 ≤ ε < 1/2, and H_ε(Π) = log 2 = 1 when 1/2 ≤ ε < 1.
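Theorem 4.1 makes the possibilistic entropy a one-line computation; the check below reproduces the step function of Example 4.2 (a simple illustration, not part of the paper).

```python
from math import log2

def poss_entropy(Pi, eps):
    """H_eps(Pi) = log |{a : Pi(a) > eps}|  (Theorem 4.1)."""
    return log2(sum(1 for p in Pi if p > eps))

Pi = [1, 1, 1/2, 1/2, 1/4, 0]
for eps in (0.0, 0.25, 0.5):
    print(eps, round(poss_entropy(Pi, eps), 3))
# 0.0  -> 2.322 (log 5)
# 0.25 -> 2.0   (log 4)
# 0.5  -> 1.0   (log 2)
```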

Remark 4.1. In the probabilistic case the constraint Prob{¬C} ≤ ε can be rewritten in terms of the probability of correct decoding as Prob{C} ≥ 1 − ε, because Prob{C} + Prob{¬C} = 1. Instead, the sum Poss{C} + Poss{¬C} can be strictly larger than 1, and so Poss{C} ≥ 1 − ε is a different constraint. This constraint, however, would be quite loose and quite uninteresting, since the possibility Poss{C} of correct decoding and the error possibility Poss{¬C} can both be equal to 1 at the same time.

Remark 4.2. Unlike in the probabilistic case, in the possibilistic case replacing the "weak" reliability constraint Π^[n]{¬C} ≤ ε by a strict inequality, Π^[n]{¬C} < ε, does make a difference even asymptotically. In this case Definition 3.1 of ε-equivalence should be modified by requiring P(a) = 0 if and only if Π(a) < ε, 0 < ε ≤ 1. The "strict" possibilistic entropy one would obtain is however the same step-function as H_ε(Π) above, only that the "steps" of the new function would be closed on the right rather than being closed on the left.

Remark 4.3. In the probabilistic case one can produce a sequence of source codes whose rate tends to Shannon entropy and whose error probability goes to zero; in other terms, Shannon entropy allows one to code not only with a decoding error probability bounded by any ε > 0, but even with an "infinitesimal" (however, positive) error probability. In our possibilistic case, however, requiring that the error possibility goes to zero is the same as requiring that it is zero for n high enough, as is easily perceived by considering that the possibilistic entropy is constant in a right neighbourhood of ε = 0.

Remarks 4.1, 4.2 and 4.3, suitably reformulated, would apply also in the case of source coding with distortion as in Appendix B and channel coding as in Section 5, but we shall not insist on them any further.


5. The capacity of a possibilistic channel

Let 𝒜 = {a₁, …, a_k} and ℬ = {b₁, …, b_h} be two alphabets, called in this context the input alphabet and the output alphabet, respectively. We give a streamlined description of what a channel code is; for more details we refer to [3,4], and also to [15], which is specifically devoted to zero-error information theory. The basic elements of a code f are the encoder f⁺ and the decoder f⁻. The encoder f⁺ is an injective (invertible) mapping which takes uncoded messages onto a set of codewords C ⊆ 𝒜^n; the set M of uncoded messages is left unspecified, since its "structure" is irrelevant. Codewords are sent as input sequences through a noisy medium, or noisy channel. They are received at the other end of the channel as output sequences which belong to ℬ^n. The decoder f⁻ takes output sequences back to the codewords of C, and so to the corresponding uncoded messages. This gives rise to a partition of ℬ^n into decoding sets, one for each codeword c ∈ C. Namely, the decoding set D_c for codeword c is D_c = {y: f⁻(y) = c} ⊆ ℬ^n.

The most important feature of a code f = (f⁺, f⁻) is its codebook C ⊆ 𝒜^n of size |C|. The decoder f⁻, and so the decoding sets D_c, are often chosen by use of some statistical principle, e.g., maximum likelihood, but we shall not need any special assumption (possibilistic decoding strategies are described in [1,10]). The encoder f⁺ will never be used in the sequel, and so its specification is irrelevant. The rate R_n of a code f with codebook C is defined as
\[
R_n = \frac{1}{n} \log |C|.
\]

The number log |C| can be seen as the (not necessarily integer) binary length of the uncoded messages, the ones which carry information; then the rate R_n is interpreted as a transmission speed, which is measured in information bits (bit fractions, rather) per transmitted bit. The idea is to design codes which are fast and reliable at the same time. Once a reliability criterion has been chosen, one tries to find the optimal code for each pre-assigned codeword length n, i.e., a code with highest rate among those which meet the criterion.

Let us consider a stationary and memoryless channel W^n, or SML channel, as defined in (2.2).

To declare a code f reliable, one requires that the probability that the output sequence does not belong to the correct decoding set is acceptably low, i.e., below a pre-assigned threshold ε, 0 ≤ ε < 1. If one wants to play safe, one has to insist that the decoding error probability should be low for each codeword c ∈ C which might have been transmitted (a looser criterion will be examined in Section 6). The reliability criterion which a code f must meet is thus:
\[
\max_{c \in C} W^n(\neg D_c \mid c) \le \varepsilon. \tag{5.1}
\]

We recall that the symbol ¬ denotes negation, or set-complementation; of course the inequality sign in (5.1) can be replaced by an equality sign whenever ε = 0. Once the length n and the threshold ε are chosen, one can try to determine the rate R_n = R_n(W, ε) of an optimal code which solves the optimization problem:

  Maximize the code rate R_n so as to satisfy the constraint (5.1).

The job can be quite tough, however, and so one often has to be content with the asymptotic value of the optimal rates R_n, which is obtained when the codeword length n goes to infinity. This asymptotic value is called the ε-capacity of channel W. For 0 < ε < 1 the capacity C_ε is always the same; only the speed of convergence of the optimal rates to C_ε is affected by the choice of ε. When one says "capacity" one refers by default to this positive ε-capacity; cf. [3] or [4]. Instead, when ε = 0 there is a dramatic change.

In this case one uses the confoundability graph G(W) associated with channel W; we recall that in G(W) two vertices, i.e., two input letters a and a′, are adjacent if and only if they are confoundable; cf. Section 3. If W^n is seen as a stochastic matrix with k^n rows headed to 𝒜^n and h^n columns headed to ℬ^n, one can consider also the confoundability graph G(W^n) for the k^n input sequences of length n; two input sequences are confoundable, and so adjacent in the graph, when there is an output sequence which can be reached from either of the two with positive probability. If C is a maximal independent set in G(W^n), the limit of (1/n) log |C| when n goes to infinity is by definition the graph capacity C(G(W)) of the confoundability graph G(W).⁸


As easily checked, the codebook C ⊆ 𝒜^n of an optimal code is precisely a maximal independent set of G(W^n). Consequently, the zero-error capacity C₀(W) of channel W is equal to the capacity of the corresponding confoundability graph G(W):
\[
C_0(W) = C(G(W)).
\]

The paper [19] which Shannon published in 1956, and which contains these results, inaugurated zero-error information theory. Observe however that the last equality gives no real solution to the problem of assessing the zero-error capacity of the channel, but simply re-phrases it in a neat combinatorial language; actually, a single-letter expression of the zero-error capacity is so far unknown, at least in general ("single-letter" means that one is able to calculate the limit so as to get rid of the codeword length n). This unpleasant observation applies also to Theorem 5.1 below.

We now pass to a stationary and non-interactive channel Π^[n], or SNI channel, as defined in (2.3). The reliability criterion (5.1) is correspondingly replaced by:
\[
\max_{c \in C} \Pi^{[n]}(\neg D_c \mid c) \le \varepsilon. \tag{5.2}
\]

The optimization problem is now:

  Maximize the code rate R_n so as to satisfy the constraint (5.2).

The number ε is now the error possibility which we are ready to accept. Again, the inequality sign in (5.2) is to be replaced by the equality sign when ε = 0. A looser criterion based on the average error possibility rather than the maximal error possibility will be examined in Section 6.

Definition 5.1. The ε-capacity of the channel Π is the limit of the optimal code rates R_n(Π, ε), obtained as the codeword length n goes to infinity.

⁸ We recall that an independent set in a graph, also called a stable set, is a set of vertices no two of which are adjacent, and so in our case the vertices of an independent set are never confoundable; all these graph-theoretic notions, inclusive of graph capacity, are explained more diffusely in Appendix A.

The following lemma is soon obtained from Lemmas 3.2 and 3.3, and in its turn soon implies Theorem 5.1; it states that possibilistic coding and zero-error probabilistic coding are different formulations of the same mathematical problem.

Lemma 5.1. Let the SML channel W and the SNI channel Π be ε-equivalent. Then a code f = (f⁺, f⁻) satisfies the reliability criterion (5.1) at zero error for the probabilistic channel W if and only if it satisfies the reliability criterion (5.2) at ε-error for the possibilistic channel Π.

Theorem 5.1. The codebook C ⊆ 𝒜^n of an optimal code for criterion (5.2) is a maximal independent set of G_ε(Π^[n]). Consequently, the ε-capacity of the possibilistic channel Π is equal to the capacity of the corresponding ε-confoundability graph G_ε(Π):
\[
C_\varepsilon(\Pi) = C(G_\varepsilon(\Pi)).
\]

Observe that the specification of the decoding sets D_c of an optimal code (and so of the decoding strategy) is obvious: one decodes y to the unique codeword c for which Π^[n](y|c) > ε; there cannot be two codewords with this property, because they would be ε-confoundable, and this would violate independence. If Π^[n](y|c) ≤ ε for all c ∈ C, then y can be assigned to any decoding set, this choice being irrelevant from the point of view of criterion (5.2).
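The decoding rule just described can be sketched in a few lines of Python; this is an illustration under assumed inputs, where codebook is an independent set of G_ε(Π^[n]) and Pi_n(y, c) computes the extended transition possibility of (2.3).

```python
def decode(y, codebook, Pi_n, eps):
    """Decode y to the unique codeword c with Pi^[n](y|c) > eps.

    Independence of the codebook in G_eps(Pi^[n]) guarantees that at most one
    such codeword exists; if none does, y may be assigned arbitrarily (here to
    the first codeword), a choice that is irrelevant for criterion (5.2).
    """
    for c in codebook:
        if Pi_n(y, c) > eps:
            return c
    return codebook[0]
```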

Below we stress explicitly the obvious fact that the graph capacity C_ε(Π) = C(G_ε(Π)) is a stepwise non-decreasing function of ε, 0 ≤ ε < 1; the term "consecutive" refers to an ordering of the distinct components π_i which appear in Π (π_i can be zero even if zero does not appear as an entry in Π):

Proposition 5.1. If 0 ≤ ε < ε′ < 1, then C_ε(Π) ≤ C_{ε′}(Π). If π_i < π_{i+1} are two consecutive entries in Π, then C_ε(Π) is constant for π_i ≤ ε < π_{i+1}.

Example 5.1. Binary possibilistic channels. The input alphabet is binary, 𝒜 = {0, 1}; the output alphabet is either the same ("doubly" binary channel), or is augmented by an erasure symbol 2 (binary erasure channel); the corresponding possibilistic matrices Π₁ and Π₂ are given below:

  Π₁ |  0  1        Π₂ |  0  2  1
  ---+-------       ---+----------
  0  |  1  β         0 |  1  β  0
  1  |  γ  1         1 |  0  γ  1

with 0 < γ ≤ β < 1. As is soon checked, one has for the proximities between input letters: δ_{Π₁}(0, 1) = β in the case of the doubly binary channel and δ_{Π₂}(0, 1) = γ ≤ β in the case of the erasure channel. The relevant confoundability graphs are G₀(Π₁) = G₀(Π₂), where the input letters 0 and 1 are adjacent, and G_β(Π₁) = G_γ(Π₂), where they are not. One has C₀(Π₁) = C₀(Π₂) = 0 and C_β(Π₁) = C_γ(Π₂) = 1, and so C_ε(Π₁) = C_ε(Π₂) = 0 for 0 ≤ ε < γ, C_ε(Π₁) = 0 < C_ε(Π₂) = 1 for γ ≤ ε < β, else C_ε(Π₁) = C_ε(Π₂) = 1. Some of these intervals may vanish when the transition possibilities γ and β are allowed to be equal and to take on also the values 0 and 1. Data transmission is feasible when the corresponding capacity is positive. In this case, however, the capacity is "too high" to be interesting, since a capacity equal to 1 in the binary case means that the reliability criterion is so loose that no data protection is required: for fixed codeword length n, the optimal codebook is simply C = 𝒜^n. Actually, whenever the input alphabet is binary, one is necessarily confronted with two limit situations which are both uninteresting: either the confoundability graph is complete and the capacity is zero (i.e., the reliability criterion is so demanding that reliable transmission of data is hopeless), or the graph is edge-free and the capacity is maximal (the reliability criterion is so undemanding that data protection is not needed). In Section 7 we shall hint at interactive models for possibilistic channels which might prove to be interesting also in the binary case; cf. Remark 7.1.

Example 5.2. A "rotating" channel. Take k = 5; the quinary input and output alphabet is the same; the possibilistic matrix Π "rotates" the row-vector (1, β, γ, 0, 0), in which 0 < γ < β < 1:

  Π   |  a₁  a₂  a₃  a₄  a₅
  ----+---------------------
  a₁  |  1   β   γ   0   0
  a₂  |  0   1   β   γ   0
  a₃  |  0   0   1   β   γ
  a₄  |  γ   0   0   1   β
  a₅  |  β   γ   0   0   1

After setting by circularity a₆ = a₁ and a₇ = a₂, one has: δ_Π(a_i, a_i) = 1 > δ_Π(a_i, a_{i+1}) = β > δ_Π(a_i, a_{i+2}) = γ, 1 ≤ i ≤ 5. Capacities can be computed as explained in Appendix A: C₀(Π) = 0 (the corresponding graph is complete), C_γ(Π) = log √5 (the pentagon graph pops up), C_β(Π) = log 5 (the corresponding graph is edge-free). So C_ε(Π) = 0 for 0 ≤ ε < γ, C_ε(Π) = log √5 for γ ≤ ε < β, else C_ε(Π) = log 5.
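The step behaviour of Example 5.2 can be inspected numerically; the sketch below (illustrative placeholder values β = 0.8, γ = 0.4) builds G_ε(Π) in the three regimes of ε and reports the number of edges together with the independence number for n = 1, which only lower-bounds the graph capacity (for the pentagon, the value log √5 is attained only in the limit).

```python
from itertools import combinations

beta, gamma = 0.8, 0.4                        # placeholder values, 0 < gamma < beta < 1
row = [1.0, beta, gamma, 0.0, 0.0]
Pi = [row[-i:] + row[:-i] for i in range(5)]  # rotate the row vector, as in Example 5.2

def proximity(r1, r2):
    return max(min(p, q) for p, q in zip(r1, r2))

def edges(eps):
    return [(i, j) for i, j in combinations(range(5), 2)
            if proximity(Pi[i], Pi[j]) > eps]

def independence_number(eps):
    E = set(edges(eps))
    for k in range(5, 0, -1):
        for S in combinations(range(5), k):
            if all((i, j) not in E for i, j in combinations(S, 2)):
                return k
    return 0

for eps in (0.0, gamma, beta):
    print(eps, len(edges(eps)), independence_number(eps))
# eps = 0.0   : complete graph (10 edges), independence number 1 -> capacity 0
# eps = gamma : the pentagon   (5 edges),  independence number 2 -> capacity log sqrt(5)
# eps = beta  : edge-free      (0 edges),  independence number 5 -> capacity log 5
```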

6. Average-error capacity versus maximal-error capacity

Before discussing an interpretation and an application of the possibilistic approach, we indulge in one more "technical" section. In the standard theory of probabilistic coding the reliability criterion (5.1) is often replaced by the looser criterion:
\[
\frac{1}{|C|} \sum_{c \in C} W^n(\neg D_c \mid c) \le \varepsilon,
\]

which requires that the average probability of error, rather than the maximal probability, be at most equal to ε so as to be declared acceptable. Roughly speaking, one no longer requires that all codewords perform well, but is content whenever "most" codewords do so, and so resorts to an arithmetic mean rather than to a maximum operator. The new criterion being looser for ε > 0, higher rates can be achieved; however, one proves that the gain evaporates asymptotically (cf., e.g., [4]). So, the average-error ε-capacity and the maximal-error ε-capacity (the only one we have considered so far) are in fact identical.


We shall pursue a similar approach also in the case of possibilistic channels, and adopt the reliability criterion:
\[
\frac{1}{|C|} \sum_{c \in C} \Pi^{[n]}(\neg D_c \mid c) \le \varepsilon \tag{6.1}
\]

rather than (5.2). The corresponding optimization problem is:

  Maximize the code rate R_n so as to satisfy the constraint (6.1).

For ε > 0 one can achieve better rates than in the case of maximal error, as the following example shows.

Example 6.1. We re-take the 2×2 matrix Π of Example 2.1, which is basically the matrix Π₁ of Example 5.1 when γ = 0. We choose n > 1 and adopt the reliability criterion (5.2), which involves the maximal decoding error. For 0 ≤ ε < μ the graph G_ε(Π) is complete and so is also G_ε(Π^[n]). Maximal independent sets are made up of just one sequence: the optimal rate is as low as 0; in practice this means that no information is transmittable at that level of reliability. Let us pass instead to the reliability criterion (6.1), which involves the average decoding error. Let us take a codebook whose codewords are all the 2^n sequences in 𝒜^n; each output sequence is decoded to itself. The rate of this code is as high as 1. It is easy to check that the decoding error possibility for each transmitted sequence c is always equal to μ, except when c = aa…a is sent, in which case the error possibility is zero. This means that with an error possibility ε such that
\[
\frac{2^n - 1}{2^n}\,\mu \le \varepsilon < \mu
\]

the optimal rate is 0 for criterion (5.2) while it is 1 for criterion (6.1). Observe that the interval where the two optimal rates differ evaporates as n increases.
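The gap exhibited in Example 6.1 is easy to verify numerically; the sketch below (illustrative, with placeholder values μ = 0.5 and n = 3) computes the maximal and the average error possibility of the identity-decoded code whose codebook is all of 𝒜^n.

```python
from itertools import product

mu, n = 0.5, 3                            # placeholder values
Pi = {("a", "a"): 1.0, ("b", "a"): 0.0,   # Pi(output | input), Example 2.1
      ("a", "b"): mu,  ("b", "b"): 1.0}

def Pi_n(y, x):
    return min(Pi[(yi, xi)] for yi, xi in zip(y, x))

codebook = ["".join(t) for t in product("ab", repeat=n)]

def error_poss(c):
    """Possibility of the complement of the decoding set D_c = {c}."""
    return max(Pi_n(y, c) for y in codebook if y != c)

max_err = max(error_poss(c) for c in codebook)
avg_err = sum(error_poss(c) for c in codebook) / len(codebook)
print(max_err, avg_err)   # 0.5 and (2**n - 1) / 2**n * 0.5 = 0.4375
```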

In analogy to the maximal-error ε-capacity C_ε(Π), the average-error ε-capacity is defined as follows:

Definition 6.1. The average-error ε-capacity C̄_ε(Π) of the channel Π is the limit of the code rates R̄_n(Π, ε) optimal with respect to criterion (6.1), obtained as the codeword length n goes to infinity.

(Our result below will make it clear that such a limit does exist.) We shall prove below that also in the possibilistic case the maximal-error capacity and the average-error capacity coincide for all ε. We stress that, unlike Theorem 5.1, Theorem 6.1 is not solved by simply re-cycling a result already available in the probabilistic framework (even if the "expurgation" technique used below is a standard tool of Shannon theory). This shows that the possibilistic framework is strictly larger than the zero-error probabilistic framework, as soon as one allows possibility values which are intermediate between zero and one. From now on we shall assume ε ≠ 0, else (5.2) and (6.1) become one and the same criterion, and there is nothing new to say. Clearly, (6.1) being a looser criterion, the average error possibility of any pre-assigned code cannot be larger than the maximal error possibility, and so the average-error ε-capacity of the channel cannot be smaller than the maximal-error ε-capacity: C̄_ε(Π) ≥ C_ε(Π). The theorem below will be proven by showing that the inverse inequality also holds true.

Theorem 6.1. The average-error ε-capacity C̄_ε(Π) and the maximal-error ε-capacity C_ε(Π) of the SNI possibilistic channel Π coincide for whatever admissible error possibility ε, 0 ≤ ε < 1:
\[
\bar{C}_\varepsilon(\Pi) = C_\varepsilon(\Pi).
\]

Proof. Let us consider an optimal code which satisfies the reliability criterion (6.1) for fixed codeword length n and fixed tolerated error possibility ε > 0; since the code is optimal, its codebook C has maximal size |C|. Let
\[
\pi_1 = 0 < \pi_2 < \cdots < \pi_r = 1 \tag{6.2}
\]
be the distinct components which appear as entries in the possibilistic matrix Π which specifies the transition possibilities, and so specifies the possibilistic behaviour of the channel we are using; r > 1. Fix a codeword c: we observe that the error possibility for c, i.e., the possibility that c is incorrectly decoded, is necessarily one of the values which appear in (6.2), as it is derived from those values by using maximum and minimum operators (we add π₁ = 0 even if 0 is not to be found in Π). This allows us to partition the codebook C into r classes C_i, 1 ≤ i ≤ r, by putting into the same class C_i those codewords c whose error possibility is equal precisely to π_i (some of the classes C_i may be void).


The reliability criterion (6.1) satisfied by our code can be rewritten as:
\[
\sum_{1 \le i \le r} \frac{|C_i|}{|C|}\,\pi_i \le \varepsilon. \tag{6.3}
\]

We can now think of a non-negative random variable X which takes on the values π_i, each with probability |C_i|/|C|; to this random variable X we shall apply the well-known Markov inequality (cf., e.g., [4]), which is written as:
\[
\mathrm{Prob}\{X \ge \theta \bar{X}\} \le \frac{1}{\theta},
\]

where QX is the expectation of X , i.e., the *rst sideof (6.3), while # is any positive number. Because of(6.3), which can be written also as QX6�, one has afortiori:

Prob{X ¿ #�}6 1#

or, equivalently:

i: i¿#�

|Ci|6 |C|#

:

Now we choose # and set it equal to:

# =�+ j

2�;

where λ_j is the smallest value in (6.2) which is strictly greater than ε. With this choice one has θ > 1; we stress that θ is a constant once ε and Π are chosen; in particular θ does not depend on n. The last summation can now be taken over those values of i for which

  λ_i ≥ (ε + λ_j)/2,

i.e., since there is no λ_i left between ε and λ_j, the inequality can be re-written as:

  Σ_{i: λ_i > ε} |C_i| ≤ |C|/θ.

The r classes C_i are disjoint and give a partition of C; so, if one considers those classes C_i for which the error possibility λ_i is at most ε, one can equivalently write:

  Σ_{i: λ_i ≤ ε} |C_i| ≥ ((θ − 1)/θ) |C|.   (6.4)

Now, the union of the classes on the left-hand side of (6.4) can be used as the codebook of a new code with maximal error possibility ≤ ε. It will be enough to modify the decoder by enlarging in whatever way the decoding sets D_c with error possibility Π^[n](¬D_c|c) = λ_i ≤ ε, so as to cover B^n; by doing so the error possibility cannot become larger. Of course the new code need not be optimal in the class of all codes which satisfy criterion (5.2) for fixed n and ε; so for its rate R*_n one has

  R*_n = (1/n) log Σ_{i: λ_i ≤ ε} |C_i| ≤ R_n,   (6.5)

where R_n is the optimal rate with respect to criterion (5.2), relative to the maximal error. On the other hand, in terms of the rate R̄_n = n⁻¹ log |C| optimal with respect to criterion (6.1), relative to the average error, (6.4) can be re-written as:

  2^{n R*_n} ≥ ((θ − 1)/θ) 2^{n R̄_n}.   (6.6)

In (6.6) the term (θ − 1)/θ is a positive constant which belongs to the open interval ]0, 1[, and so its logarithm is negative. Comparing (6.5) and (6.6), and recalling that R_n ≤ R̄_n:

  R̄_n + (1/n) log((θ − 1)/θ) ≤ R*_n ≤ R_n ≤ R̄_n.

One obtains the theorem by going to the limit.
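The following numeric sketch (our own illustration, with made-up class sizes and entries) traces the expurgation step: Markov's inequality guarantees that the sub-codebook of codewords with error possibility at most ε keeps a constant fraction (θ − 1)/θ of the original codebook.

    # Sketch (illustration only, made-up numbers): the expurgation step of Theorem 6.1.
    lam = [0.0, 1/3, 2/3, 1.0]     # distinct entries lam_1 < ... < lam_r as in (6.2)
    sizes = [40, 30, 20, 10]       # class sizes |C_i|; the codebook size |C| is their sum
    eps = 0.4                      # tolerated average error possibility

    C = sum(sizes)
    average = sum(s * l for s, l in zip(sizes, lam)) / C
    assert average <= eps          # the code satisfies criterion (6.3)

    lam_j = min(l for l in lam if l > eps)    # smallest entry strictly above eps
    theta = (eps + lam_j) / (2 * eps)         # the constant chosen in the proof; theta > 1
    kept = sum(s for s, l in zip(sizes, lam) if l <= eps)

    # Inequality (6.4): the expurgated codebook is at least a fraction (theta-1)/theta of |C|.
    print(kept, (theta - 1) / theta * C)      # here: 70 >= 25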

7. An interpretation of the possibilistic model based on distortion measures

We have examined a possibilistic model of data transmission and coding which is inspired by the standard probabilistic model: what we did was simply to replace probabilities by possibilities and independence by non-interactivity, a notion which is often seen as the “right” analogue of probabilistic independence in possibility theory. In this section we shall try to give an interpretation of our possibilistic approach. The example of an application to the design of telephone keyboards will be given. We concentrate on noisy channels and codes for correcting transmission errors; we shall consider sources at the end of the section. The idea is that in some cases statistical likelihood may be effectively


replaced by what one might call “structural resemblance”. Suppose that a “grapheme” is sent through a noisy channel which we are unable to describe in all its statistical details. A distorted grapheme will be received at the other end of the channel; the repertoires of input graphemes and of output graphemes are supposed to be both finite. We assume that it is plausible⁹ that the grapheme which has been received has a small distortion, or even no distortion at all, from the grapheme which has been sent over the channel; large distortions are instead implausible. Without real loss of generality we shall “norm” the distortions to the interval [0, 1], so that the occurrence of distortion one is seen as quite implausible. Correspondingly, the one-complement of the distortion can be seen as an index of “structural resemblance” between the input symbol and the output symbol; with high plausibility this index will have a high value. We shall assign a numeric value to the plausibility by setting it equal precisely to the value of the resemblance index; in other words, we assume the “equality”:

  plausibility = structural resemblance.   (7.1)

Long sequences of graphemes will be sent through the channel. The distortion between the input sequence x and the output sequence y will depend on the distortions between the single graphemes x_i and y_i which make up the sequences; to specify how this happens, we shall take inspiration from rate-distortion theory, which is shortly reviewed in Appendix B; cf. also [3] or [4]. We recall here how distortion measures are defined. One is given two alphabets, the primary alphabet A and the secondary alphabet B, which in our case will be the alphabet of possible input graphemes and the alphabet of possible output graphemes, respectively. A distortion measure d is given which specifies the distortions d(a, b) between each primary letter a ∈ A and each secondary letter b ∈ B; for each primary letter a there is at least one secondary letter b such that d(a, b) = 0, which perfectly reproduces a. Distortions d(a, b) are always non-negative, but in our case they are also constrained not to exceed 1.

⁹ The term plausibility is a technical term of evidence theory; actually, possibilities can be seen as very special plausibilities; cf., e.g., [12]. So, the adoption of a term which is akin to “possibility” is more committal than it may seem at first sight.

The distortion between letters a and b can be extended to a distortion between sequences x ∈ A^n and y ∈ B^n in several ways; one resorts, e.g., to the peak distortion:

  d*_n(x, y) = max_{1≤i≤n} d(x_i, y_i)   (7.2)

or, more commonly and less demandingly, to the average distortion:

  d_n(x, y) = (1/n) Σ_{1≤i≤n} d(x_i, y_i).   (7.3)

Let us be very demanding and adopt the peak distortion: structurally, two sequences resemble each other only when they do so in each position. Following the philosophy of the equality (7.1) above, where the term “plausibility” is replaced by the more specific term “transition possibility” and where the resemblance is interpreted as the one-complement of the distortion, we set:

  Π(b|a) = 1 − d(a, b),
  Π^[n](y|x) = 1 − d*_n(x, y) = min_{1≤i≤n} Π(y_i|x_i).   (7.4)

This corresponds precisely to a stationary and non-interactive channel Π.
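As a minimal sketch (the toy distortion values and the Python notation are ours, not the author's), the identities in (7.4) can be checked mechanically: the channel obtained from the one-complement of a normed letter distortion, extended by the minimum rule, is exactly the one-complement of the peak distortion between sequences.

    # Sketch (toy numbers of our own): from a normed letter distortion d to an SNI channel, as in (7.4).
    d = {("a", "a"): 0.0, ("a", "b"): 1/3, ("a", "c"): 1.0,
         ("b", "a"): 1/3, ("b", "b"): 0.0, ("b", "c"): 2/3,
         ("c", "a"): 1.0, ("c", "b"): 2/3, ("c", "c"): 0.0}

    def Pi(b, a):
        return 1 - d[(a, b)]                              # Pi(b|a) = 1 - d(a,b)

    def peak_distortion(x, y):
        return max(d[(a, b)] for a, b in zip(x, y))       # d*_n(x,y), formula (7.2)

    def Pi_n(y, x):
        return min(Pi(b, a) for a, b in zip(x, y))        # Pi[n](y|x), the minimum rule

    x, y = "abca", "abcb"
    print(Pi_n(y, x), 1 - peak_distortion(x, y))          # both equal 2/3: the two sides of (7.4) agree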

To make our point we now examine a small-scale example. We assume that sequences of circled graphemes out of the alphabet A = {⊕, ⊗, ⊖, ⊘, ◦} are sent through a channel. Because of noise, some of the bars inside the circle can be erased during transmission; instead, in our model the channel cannot add any bars, and so the repertoire of the output graphemes is a superset of A: B = A ∪ {⦶, ⦸}. We do not have any statistical information about the behaviour of the channel; we shall be contented with the following “linguistic judgements”:

  It is quite plausible that a grapheme is received as it has been sent.
  It is pretty plausible that a single bar has been erased.
  It is pretty implausible that two bars have been erased.
  Everything else is quite implausible.

We shall “numerize” our judgements by assigning the numeric values 1, 2/3, 1/3, 0 to the corresponding possibilities. This is the same as setting the distortions d(a, b) proportional to the number of bars which have been deleted during transmission. Our choice is


enough to specify a possibilistic channel Π, whose matrix is given below; zeroes have not been written, to help readability. Underneath Π we have written the matrix σ which specifies the proximities between the input graphemes; since σ is symmetric, i.e., σ(a, a′) = σ(a′, a), we have written only the upper triangle; cf. the definition of σ in Section 3.

  Π  | ⊕   | ⊗   | ⊖   | ⦶   | ⊘   | ⦸   | ◦
  ⊕  | 1   |     | 2/3 | 2/3 |     |     | 1/3
  ⊗  |     | 1   |     |     | 2/3 | 2/3 | 1/3
  ⊖  |     |     | 1   |     |     |     | 2/3
  ⊘  |     |     |     |     | 1   |     | 2/3
  ◦  |     |     |     |     |     |     | 1

  σ  | ⊕   | ⊗   | ⊖   | ⊘   | ◦
  ⊕  | 1   | 1/3 | 2/3 | 1/3 | 1/3
  ⊗  |     | 1   | 1/3 | 2/3 | 1/3
  ⊖  |     |     | 1   | 2/3 | 2/3
  ⊘  |     |     |     | 1   | 2/3
  ◦  |     |     |     |     | 1

Both the components of Π and those of σ are in their own way “resemblance indices”. However, those in Π specify the structural resemblance between an input grapheme a and an output grapheme b: this resemblance is 1 when b equals a, is 2/3 when b can be obtained from a by deletion of a single bar, is 1/3 when b can be obtained from a by deleting two bars, and is 0 when b cannot be obtained from a in any of these ways. Instead, the components of σ specify how easy it is to confound input graphemes at the other end of the channel: σ(a, a′) = 1 means a = a′; σ(a, a′) = 2/3 means that a and a′ are different, but there is at least one output grapheme which can be obtained by deletion of a single bar in a, or in a′, or in both; σ(a, a′) = 1/3 means that one has to delete at least two bars from one of the input graphemes, or from both, to reach a common output grapheme. Assuming that the channel Π is stationary and non-interactive means that we are adopting a very strict criterion to evaluate the “structural resemblance” between input sequence and output sequence; this criterion corresponds to peak distortion, as explained above.
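As a check (our own code; we assume the proximity is the max–min index σ(a, a′) = max_b min(Π(b|a), Π(b|a′)), which reproduces the values tabulated above), the proximity matrix can be recomputed mechanically from Π:

    # Sketch (our own notation): recovering the proximities from the channel matrix,
    # assuming sigma(a, a') = max over outputs b of min(Pi(b|a), Pi(b|a')).
    A = ["o+", "ox", "o-", "o/", "o"]          # graphemes of A: circle with +, x, -, /, and bare circle
    B = A + ["o|", "o\\"]                      # plus the two extra output graphemes (single | or \ bar)
    Pi = {a: {b: 0.0 for b in B} for a in A}
    Pi["o+"].update({"o+": 1, "o-": 2/3, "o|": 2/3, "o": 1/3})
    Pi["ox"].update({"ox": 1, "o/": 2/3, "o\\": 2/3, "o": 1/3})
    Pi["o-"].update({"o-": 1, "o": 2/3})
    Pi["o/"].update({"o/": 1, "o": 2/3})
    Pi["o"].update({"o": 1})

    def sigma(a1, a2):
        return max(min(Pi[a1][b], Pi[a2][b]) for b in B)

    print(sigma("o+", "o-"))    # 2/3: deleting one bar from the circled + gives the circled -
    print(sigma("o+", "ox"))    # 1/3: they can only be confounded through the bare circle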

Let us consider coding. We use the proximity matrix σ to construct the ε-confoundability graph G_ε(Π), which was defined in Section 3. If 0 ≤ ε < 1/3 the ε-confoundability graph is complete and the ε-capacity of the channel is 0: this means that the reliability criterion (5.2) is so strict that no data transmission is feasible. For ε ≥ 2/3 the graph is edge-free and so the ε-capacity is log 5: this means that the reliability criterion (5.2) is so loose that the channel Π behaves essentially as a noise-free one. Let us proceed to the more interesting case 1/3 ≤ ε < 2/3; actually, one can take ε = 1/3 (cf. Proposition 5.1). A maximal independent set I in G_{1/3}(Π) is made up by the three “vertices” ⊕, ⊗ and ◦, as is soon checked. Using the notions explained in the appendix, and in particular the inequalities (A.1), one soon shows that the 3^n sequences of I^n give a maximal independent set in G_{1/3}(Π^[n]). Fix the codeword length n; as stated by Theorem 5.2, an optimal codebook is C = I^n and so the optimal code rate is log 3, which is also the value of the capacity C_{1/3}(Π). When one uses such a code, a decoding error occurs only when at least one of the n graphemes sent over the channel loses at least two bars, an event which has been judged to be pretty implausible.
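A brute-force check of the single-letter graph (our own sketch; vertex names are ASCII stand-ins for the circled graphemes) confirms that the independence number of G_{1/3}(Π) is 3 and that {⊕, ⊗, ◦} is an independent set of maximum size:

    # Sketch (brute force, our own code): the 1/3-confoundability graph on the five input
    # graphemes; two vertices are joined when their proximity sigma exceeds eps = 1/3.
    from itertools import combinations

    A = ["o+", "ox", "o-", "o/", "o"]
    sigma = {("o+", "ox"): 1/3, ("o+", "o-"): 2/3, ("o+", "o/"): 1/3, ("o+", "o"): 1/3,
             ("ox", "o-"): 1/3, ("ox", "o/"): 2/3, ("ox", "o"): 1/3,
             ("o-", "o/"): 2/3, ("o-", "o"): 2/3, ("o/", "o"): 2/3}

    eps = 1/3
    edges = {pair for pair, s in sigma.items() if s > eps}

    def independent(subset):
        return all((u, v) not in edges and (v, u) not in edges for u, v in combinations(subset, 2))

    best = max((s for k in range(1, 6) for s in combinations(A, k) if independent(s)), key=len)
    print(best)    # ('o+', 'ox', 'o'): independence number 3, hence capacity log 3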

We give the example of an application. Think of the keys of a digital keyboard, such as the one of the author's telephone, say, in which the digits from 1 to 9 are arranged on a 3 × 3 grid, left to right, top row to bottom row, while the digit 0 is positioned below the digit 8. It may happen that, when a telephone number is keyed in, the wrong key is pressed (because of “channel noise”). We assume the following model of the “noisy channel”, in which possibilities are seen as numeric labels for vague linguistic judgements:

(i) it is quite plausible that the correct key is pressed (possibility 1);

(ii) it is pretty plausible that one inadvertently presses a “neighbour” of the correct key, i.e., a key which is positioned on the same row or on the same column and is contiguous to the correct key (possibility 2/3);


(iii) it is pretty implausible that the key one presses is contiguous to the correct key, but is positioned on the same diagonal (possibility 1/3);

(iv) everything else is quite implausible (possibility 0).

When the wrong key is pressed, we shall say that a cross-over of type (ii), of type (iii), or of type (iv) has taken place, according to whether its possibility is 2/3, 1/3, or 0. Using these values¹⁰ one can construct a possibilistic matrix Π with the input and the output alphabet both equal to the set of the 10 keys. One has, for example: Π(a|1) = 2/3 for a ∈ {2, 4}, Π(a|1) = 1/3 for a = 5, Π(a|1) = 0 for a ∈ {3, 6, 7, 8, 9, 0}. As for the proximity σ(a, b), it is equal to 2/3 whenever either the keys a and b are neighbours as in (ii), or there is a third key c which is a common neighbour of both. One has, for example: σ(1, a) = 2/3 for a ∈ {2, 3, 4, 5, 7}; instead, σ(1, a) = 1/3 for a ∈ {6, 8, 9} and σ(1, a) = 0 for a = 0. A codebook is a bunch of admissible telephone numbers of length n; since a phone number is wrong whenever there is a collision with another phone number in a single digit, it is natural to assume that the “noisy channel” Π is non-interactive. This example had been suggested to us by J. Körner; however, at least in principle, in the standard probabilistic setting one would have to specify three stochastic matrices such as to be 0-, 1/3- and 2/3-equivalent with Π. In these matrices only the opposition zero/non-zero would count; their entries would have no empirical meaning, and no significant relation with the stochastic matrix of the probabilities with which errors are actually committed by the hand of the operator.

¹⁰ Adopting a different “numerization” for the transition possibilities (or, equivalently, for the distortions) does not make any real difference from the point of view of criterion (5.2), provided the order is preserved and the values 0 and 1 are kept fixed. Instead, arithmetic averages as in criterion (6.1) have no such insensitivity to order-preserving transformations; criterion (6.1) might prove to be appropriate in a situation where one interprets possibilities in some other way (recall that possibilities can be viewed as a special case of plausibilities, which in their turn can be viewed as a special case of upper probabilities; cf., e.g., [22]). In (iv) we might have chosen a “negligible” positive value, rather than 0: again, this would have made no serious difference, save adding a negligible initial interval where the channel capacity would have been zero.

So, the adoption of a “hard” probabilistic model is in this case pretty unnatural. Instead, in a “soft” possibilistic approach one specifies just one possibilistic matrix Π, which contains precisely the information which is needed and nothing more.

Unfortunately, the author's telephone is not especially promising. Let us adopt criterion (5.2). If the allowed error possibility of the code is 2/3 (or more), the confoundability graph is edge-free and no error protection is required. If we choose the error possibility ε = 1/3, we have C_{1/3}(Π) = log α(G_{1/3}(Π)) = log 3; in other words, the 1/3-capacity, which is an asymptotic¹¹ parameter, is reached already for n = 1. To see this, use the inequalities (A.1) of Appendix A: the independence number of G_{1/3}(Π) is 3, and a maximal independent set of keys, which are far enough from each other so as not to be confoundable, is {0, 1, 6}, as is easily checked; moreover, one checks that 3 is also the chromatic number of the complementary graph. In practice, this means that an optimal codebook as in Theorem 5.1 may be constructed by juxtaposition of the input “letters” 0, 1, 6; the code is disappointing, since everything boils down to allowing only phone numbers which use the keys 0, 1 and 6. As for decoding, the output sequence y is decoded to the single codeword c for which Π^[n](y|c) > 1/3; so, error correction is certainly successful if there have been no cross-overs of type (iii) and (iv). If, for example, one keys in the number 2244 rather than 1111, a successful error correction takes place; actually, Π^[4](2244|c) > 1/3 only for c = 1111. If instead one is so clumsy as to key in the “pretty implausible” number 2225, this is incorrectly decoded to 1116. Take instead the more demanding threshold ε = 0: the 0-capacity, as is easily checked, goes down to log 2; the 0-error code remains as disappointing as the 1/3-error code, being obtained by allowing only phone numbers made up of “far-away” digits such as 0 and 1, say.

¹¹ When the value of an asymptotic functional (channel capacity, say, or source entropy, or the rate-distortion function as in Appendix B) is reached already for n = 1, its computation is easy, but, unfortunately, this is so because the situation is so hopeless that one is obliged to use trivial code constructions. By the way, this is always the case when one tries to compress possibilistic sources without distortion, as in Section 4. The interesting situations correspond instead to cases when the computation of the asymptotic functional is difficult, as for the pentagon, or even unfeasible, as for the heptagon (cf. Appendix A).


The design of convenient keyboards such that their possibilistic capacity is not attained already for n = 1 is a graph-theoretic problem which may be of relevant practical interest in those situations where keying in an incorrect number may cause serious inconveniences. More generally, exhibiting useful finite-length code constructions would bear a relation to the material of this paper which is similar to the relation of coding theory (algebraic coding theory, say) to the asymptotic theory of coding (Shannon theory).
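The keyboard channel itself is easy to reproduce programmatically; the sketch below (our own code, with the max–min proximity assumed as before) rebuilds Π from the grid geometry and recovers the values Π(·|1) and σ(1, ·) quoted above, together with the non-confoundable triple {0, 1, 6}.

    # Sketch (our own reconstruction): the keyboard channel of judgements (i)-(iv).
    from itertools import combinations

    pos = {str(d): divmod(d - 1, 3) for d in range(1, 10)}   # digits 1..9 on a 3x3 grid, (row, col)
    pos["0"] = (3, 1)                                        # 0 sits below 8
    keys = sorted(pos)

    def Pi(b, a):
        (r1, c1), (r2, c2) = pos[a], pos[b]
        dr, dc = abs(r1 - r2), abs(c1 - c2)
        if (dr, dc) == (0, 0): return 1.0                    # correct key, judgement (i)
        if (dr, dc) in ((0, 1), (1, 0)): return 2/3          # row/column neighbour, judgement (ii)
        if (dr, dc) == (1, 1): return 1/3                    # diagonal neighbour, judgement (iii)
        return 0.0                                           # everything else, judgement (iv)

    def sigma(a, b):
        return max(min(Pi(y, a), Pi(y, b)) for y in keys)    # assumed max-min proximity

    print(Pi("2", "1"), Pi("5", "1"), Pi("9", "1"))          # 2/3, 1/3, 0.0
    print(sigma("1", "5"), sigma("1", "6"), sigma("1", "0")) # 2/3, 1/3, 0.0
    # triples of keys which are pairwise non-confoundable at level eps = 1/3:
    good = [t for t in combinations(keys, 3)
            if all(sigma(a, b) <= 1/3 for a, b in combinations(t, 2))]
    print(("0", "1", "6") in good)                           # True: the codebook alphabet of the text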

Remark 7.1. Rather than the peak distortion, in (7.4) one might use the average distortion. This would give rise to a stationary but definitely interactive channel for which:

  Π_n(y|x) = 1 − d_n(x, y) = (1/n) Σ_{1≤i≤n} Π(y_i|x_i).

We leave open the problem of studying such a channel and of ascertaining its meaning for real-world data transmission. Actually, one might even define new distortions between sequences based on a different way of averaging single-letter distortions, in the general sense of aggregation operators (the very broad notion of aggregation operators and averaging operators is covered, e.g., in [12] or [16]).

Now we pass to source coding and data compression. We shall pursue an interpretation of possibilistic SNI sources and of possibilistic source coding which fits in with a meaning of the word “possible” to be found in the Oxford Dictionary of the English Language: possible = tolerable to deal with, i.e., acceptable, because it possesses all the qualities which are required.¹² Assume that certain items are accepted only if they pass n quality controls; each control i is given a numeric mark, from 0 (totally unacceptable) to 1 (faultless); the marks which one can assign are chosen from a finite subset of [0, 1], made up of numbers which just stand for linguistic judgements. The quality control as a whole is passed only when all the n controls have been passed. As an example, let us take the source alphabet B equal to the alphabet of the seven graphemes output by the possibilistic channel

¹² An interpretation which may be worth pursuing is: degree of possibility = level of grammaticality. This may be interesting also in channel coding, in those situations in which decoding errors are less serious when the encoded message has a low level of grammatical correctness.

Π which has been considered above. The “items” will be sequences y of n graphemes, and the i-th control will be made on the i-th grapheme y_i. The possibility vector π over the seven graphemes of B, in the order in which they are listed, will be:

  π = (1, 1, 1, 2/3, 1, 2/3, 1) over B = {⊕, ⊗, ⊖, ⦶, ⊘, ⦸, ◦}.

A possibility smaller than 1 has been assigned to the two output graphemes which are not also input graphemes; in practice, the vector π has been obtained by taking the maximum of the entries in the columns of the possibilistic matrix Π which describes the channel. When π(b) = 1 it is possible that the grapheme b has been received at the end of the channel exactly as it has been transmitted; when π(b) = 2/3 the grapheme b which has been received is necessarily distorted with respect to the input grapheme, in the sense that at least one bar has been erased during transmission.¹³

Let us fix a value ε, 0 ≤ ε < 1, and rule out all the items whose possibility is ≤ ε. Then the accepted items can be encoded by means of a possibilistic source code as in Section 4: each acceptable item is given a binary number whose length is nH_ε(π), or rather ⌈nH_ε(π)⌉, by rounding to the integer ceiling, i.e., to the smallest integer which is at least as large as n times the ε-entropy of π. In our case, when ε ≥ 2/3 only the sequences which do not contain the graphemes ⦶ and ⦸ are given a codeword, and so H_ε(π) = log 5; instead, when ε < 2/3 all the sequences have their own codeword, and so H_ε(π) = log 7.

¹³ As a matter of fact, we have been using a formula proposed in the literature in order to compute the marginal output possibilities π(b), when the marginal input possibilities π(a) and the conditional possibilities Π(b|a) are given, namely

  π(b) = max_{a∈A} [π(a) ∧ Π(b|a)],

the maximum being taken over all letters a ∈ A. In our case we have set all the input possibilities π(a) equal to 1. The possibilistic formula is inspired by the corresponding probabilistic one, just replacing sums and products by maxima and minima, as is usual when one passes from probabilities to possibilities; cf., e.g., [5] or [11].
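For concreteness, the sketch below (our own code, with ASCII names for the graphemes) recomputes the output possibility vector by the max–min formula of footnote 13 and evaluates the possibilistic entropy H_ε(π) = log |{b : π(b) > ε}| on either side of the threshold 2/3.

    # Sketch (our own code): output possibilities via the max-min formula, and H_eps(pi).
    import math

    A = ["o+", "ox", "o-", "o/", "o"]
    B = A + ["o|", "o\\"]
    Pi = {a: {b: 0.0 for b in B} for a in A}
    Pi["o+"].update({"o+": 1, "o-": 2/3, "o|": 2/3, "o": 1/3})
    Pi["ox"].update({"ox": 1, "o/": 2/3, "o\\": 2/3, "o": 1/3})
    Pi["o-"].update({"o-": 1, "o": 2/3})
    Pi["o/"].update({"o/": 1, "o": 2/3})
    Pi["o"].update({"o": 1})

    pi_in = {a: 1.0 for a in A}                                        # all input possibilities equal 1
    pi = {b: max(min(pi_in[a], Pi[a][b]) for a in A) for b in B}       # pi(b) = max_a [pi(a) ^ Pi(b|a)]

    def H(eps):
        return math.log2(sum(1 for b in B if pi[b] > eps))             # possibilistic eps-entropy

    print(pi)                 # 1 on the five graphemes of A, 2/3 on the two extra ones
    print(H(1/2), H(2/3))     # log2(7) and log2(5), as in the text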


Acknowledgements

We gladly acknowledge helpful discussions with F. Fabris on the relationship between possibilistic channels and distortion measures as used in probabilistic source coding.

Appendix A. Graph capacity

We consider only simple graphs, i.e., graphs without multiple edges and without loops; we recall that a graph is assigned by giving its vertices and its edges; each edge connects two (distinct) vertices, which are then said to be adjacent. If G is a graph, its complementary graph Ḡ has the same set of vertices, but two vertices are adjacent in Ḡ if and only if they are not adjacent in G. By α(G) and χ(G) we denote the independence number and the chromatic number of G, respectively. We recall that the independence number of a graph is the maximum size of a set of vertices none of which are adjacent (of an independent set, also called a stable set); the chromatic number of a graph is the minimum number of colours which can be assigned to its vertices in such a way that no two adjacent vertices have the same colour. From a graph G with k vertices one may wish to construct a “power graph” G^n whose k^n “vertices” are the vertex sequences of length n. Many such powers are described in the literature; of these we need the following one, sometimes called the strong power: two vertices x = x₁x₂…x_n and u = u₁u₂…u_n are adjacent in G^n if and only if for each component i either x_i and u_i are adjacent in G or x_i = u_i, 1 ≤ i ≤ n. The reason for choosing this type of power becomes clear when one thinks of the confoundability graphs G(W) and of the ε-confoundability graphs G_ε(Π) as defined in Section 3. Actually, one has:

  G(W^n) = [G(W)]^n,   G_ε(Π^[n]) = [G_ε(Π)]^n.

The first equality is obvious; the second is implied by the first and by Lemma 3.3: just take any stochastic matrix W which is ε-equivalent to Π. If G is a simple graph, its graph capacity C(G), also called Shannon's graph capacity, is defined as

  C(G) = lim_n (1/n) log α(G^n).

The limit always exists, as can be shown. It is rather easy to prove that

  log α(G) ≤ (1/n) log α(G^n) ≤ log χ(Ḡ)   (A.1)

and so, whenever α(G) = χ(Ḡ), the graph capacity is very simply C(G) = log α(G). Giving a single-letter characterization of graph capacity can however be a very tough problem, which is still unsolved in its generality [15]. We observe that the minimum value of the graph capacity is zero, and is reached whenever the graph is complete, i.e., has all the k(k − 1)/2 possible edges; the maximum value of the capacity of a graph with k vertices is log k, and is obtained when the graph is edge-free (has no edges at all). We also observe that “pure” combinatorialists prefer to define graph capacity as the limit of the n-th root of α(G^n), i.e., as 2^{C(G)}.

Example A.1. Let us take the case of a polygon P_k with k vertices. For k = 3 we have a triangle P₃; then α(P₃) = χ(P̄₃) = 1 and the capacity C(P₃) is zero. Let us go to the quadrangle P₄; then α(P₄) = χ(P̄₄) = 2 and so C(P₄) = 1. In the case of the pentagon, however, α(P₅) = 2 < χ(P̄₅) = 3. It was quite an achievement of Lovász to prove in 1979 that C(P₅) = log √5, as long conjectured; the conjecture had resisted proof for more than twenty years. The capacity of the heptagon P₇ is still unknown.
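A quick computation makes the pentagon's special status tangible; the sketch below (our own code, assuming the networkx library is available) builds the strong square of P₅ by brute force: α(P₅) = 2, yet α(P₅²) = 5, so already for n = 2 the normalized quantity (1/2) log α(P₅²) equals log √5, Lovász's value for C(P₅).

    # Sketch (assumes networkx): the strong square of the pentagon and its independence number.
    import math
    import networkx as nx

    P5 = nx.cycle_graph(5)                       # the pentagon
    P5_2 = nx.strong_product(P5, P5)             # strong power with n = 2

    def independence_number(G):
        # a maximum independent set of G is a maximum clique of the complementary graph
        return max(len(c) for c in nx.find_cliques(nx.complement(G)))

    print(independence_number(P5))                           # 2
    print(independence_number(P5_2))                         # 5
    print(0.5 * math.log2(independence_number(P5_2)))        # log2(sqrt(5)), i.e. C(P5) in bits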

Appendix B. The possibilistic rate-distortion function

This appendix generalizes source coding as dealt with in Section 4 and is rather more technical than the body of the paper. The reader is referred to Section 4 for a description of the problem of source coding. In the case of source coding with distortion, beside the primary source alphabet A one has a secondary alphabet B, also called the reproduction alphabet, which is used to reproduce primary sequences. A distortion matrix d is given which specifies the distortions d(a, b) between each primary letter a ∈ A and each secondary letter b ∈ B; the numbers d(a, b) are non-negative and for each primary letter a there is at least one secondary letter b such that d(a, b) = 0, i.e., such as to perfectly reproduce a. We recall that distortion measures have already been


used in Section 7; unlike in Section 7, here we do not require d(a, b) ≤ 1. The distortion between letters a and b is extended to the average distortion between sequences x ∈ A^n and y ∈ B^n as we did in (7.3), or to the peak distortion, also called maximal distortion, as in (7.2). Unlike in the case without distortion, here the decoder f⁻ maps the binary codeword f⁺(x) to a secondary sequence y ∈ B^n which should have an acceptably small distortion from the encoded primary sequence x. Let us denote by g the composition of encoder and decoder, g(x) = f⁻(f⁺(x)); the set of secondary sequences C = g(A^n) ⊆ B^n which are used to reproduce the primary sequences is called the codebook of the code f = (f⁺, f⁻). In practice the secondary sequence y = g(x) is usually misinterpreted as if it were the codeword for the primary sequence x, and correspondingly the mapping g is called the encoder (this is slightly abusive, but the specification of f⁺ and f⁻ turns out to be irrelevant once g is chosen). The rate of the code is the number

  R_n = (log |C|)/n.

The numerator can be interpreted as the (not necessarily integer) length of the binary codewords output by the encoder stricto sensu f⁺ and fed to the decoder f⁻, and so the rate is the number of bits per primary letter. From now on we shall forget about f⁺; the term “encoder” will refer solely to the mapping g which outputs secondary sequences y ∈ B^n. Let us begin with the average distortion d_n, as is common in the probabilistic approach. One fixes a threshold δ ≥ 0 and a tolerated error probability ε ≥ 0, and requires that the following reliability criterion be satisfied:

  P^n{x : d_n(x, g(x)) > δ} ≤ ε.   (B.1)

The encoder g should be constructed in such a way that the codebook C ⊆ B^n is as small as possible, under the constraint that the chosen reliability criterion is satisfied; for fixed n one can equivalently minimize the code rate:

  minimize the code rate (1/n) log |C| so as to satisfy constraint (B.1).

For fixed δ ≥ 0 and ε ≥ 0, one is interested in the asymptotic value of the optimal rates. For ε > 0 one proves that the solution, i.e., the asymptotic value of the optimal code rates, is given by the rate-distortion function

  R_ε(P, δ) = min_{XY: Ed(X,Y) ≤ δ} I(X ∧ Y),   ε > 0,   (B.2)

in whose expression ε does not explicitly appear on the right. Above, I(X ∧ Y) is the mutual information¹⁴ of the random couple XY, X being a random primary letter output by the source according to the probability distribution P. The second random component Y of the random couple XY belongs to the secondary alphabet B, and so is a random secondary letter. The minimum is taken with respect to all random couples XY which are constrained to have an expected distortion Ed(X, Y) which does not exceed the threshold δ. The rate-distortion function does not look especially friendly; luckily, the problem of its computation has been deeply investigated from a numeric viewpoint [4]. Observe however that, even if the computation of the rate-distortion function involves a minimization, there is no trace of n left, and so its expression is single-letter, unlike in the case of graph capacity.
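As a pointer to that numeric side (a bare-bones sketch of our own, not taken from [4]; the function name, the toy source and the convergence settings are assumptions), the classical Blahut–Arimoto iteration traces the curve of (B.2) parametrically in a slope parameter β ≥ 0:

    # Sketch (our own, illustrative only): Blahut-Arimoto iteration for the rate-distortion
    # function of (B.2); beta >= 0 sweeps the distortion-rate trade-off.
    import numpy as np

    def blahut_arimoto(p, d, beta, iters=500):
        """p: source distribution over A; d: |A| x |B| distortion matrix; returns (E d, I in bits)."""
        q = np.full(d.shape[1], 1.0 / d.shape[1])           # output marginal, start uniform
        for _ in range(iters):
            W = q * np.exp(-beta * d)                       # unnormalized test channel Q(y|x)
            W /= W.sum(axis=1, keepdims=True)
            q = p @ W                                       # updated output marginal
        D = float(np.sum(p[:, None] * W * d))               # expected distortion E d(X, Y)
        R = float(np.sum(p[:, None] * W * np.log2(W / q)))  # mutual information I(X ^ Y)
        return D, R

    # Uniform binary source with Hamming distortion: the curve is R = 1 - h(delta).
    p = np.array([0.5, 0.5])
    d = np.array([[0.0, 1.0], [1.0, 0.0]])
    for beta in (1.0, 2.0, 4.0):
        print(blahut_arimoto(p, d, beta))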

Let us proceed to zero-error coding with distortion. The problem of finding a single-letter expression for the asymptotic value of the optimal rates is not at all trivial, but it has been solved; not surprisingly, this value turns out to depend only on the support of P, i.e., on whether the probabilities P(a) of the source letters a are zero or non-zero. More precisely, for ε = 0 the asymptotic value is given by the zero-error rate-distortion function:

  R₀(P, δ) = max_{X: P(a)=0 ⇒ P_X(a)=0}  min_{XY: Ed(X,Y) ≤ δ} I(X ∧ Y).   (B.3)

Here the maximum is taken with respect to all random variables X whose support is (possibly strictly) included in the support of P, i.e., in the subset of letters a whose probability P(a) is strictly positive; P_X is the probability distribution of the random variable X.

¹⁴ The mutual information can be expressed in terms of Shannon entropies as I(X ∧ Y) = H(X) + H(Y) − H(XY); it is seen as an index of dependence between the random variables X and Y, and assumes its lowest value, i.e., 0, if and only if X and Y are independent.


The minimum in (B.3) is to be compared with R_ε(P, δ) as in (B.2). In practice, one considers all the rate-distortion functions over the support of P, and then selects the largest value which has been obtained; as for the numeric techniques which are available, cf. [4]. If one chooses the peak distortion d*_n rather than the average distortion d_n, the reliability criterion (B.1) and the definition of the rate-distortion function should be modified accordingly; in particular, the new reliability criterion is

  P^n{x : d*_n(x, g(x)) > δ} ≤ ε.   (B.4)

The asymptotic optimal rate for peak distortion, R*_ε(P, δ), has in general a higher value¹⁵ than R_ε(P, δ), since (B.4) is more demanding than (B.1). The expression of R*_ε(P, δ) turns out to be the same as in (B.2) and (B.3), only replacing the constraint Ed(X, Y) ≤ δ which defines the minimization set by the more severe constraint d(X, Y) ≤ δ (cf. [4]; by writing d(X, Y) ≤ δ we mean that the event d(X, Y) ≤ δ has probability 1, i.e., that the support of the random couple XY is made up only of couples (a, b) for which d(a, b) ≤ δ).

If the source is an SNI source described by giving the possibility vector π over the primary letters, one can consider the same codes as before, but judge their reliability by referring to the new reliability criteria:

  π^n{x : d_n(x, g(x)) > δ} ≤ ε   (B.5)

¹⁵ We recall that coding with peak distortion can be brought back to coding with average distortion at a threshold equal to zero; this is true no matter whether the source is probabilistic or possibilistic. Actually, if one sets

  d′(a, b) = 0 if d(a, b) ≤ δ, and d′(a, b) = d(a, b) otherwise,

the inequality d*_n(x, y) ≤ δ is clearly equivalent to the equality d′_n(x, y) = 0. So, coding at distortion level δ with the peak distortion is the same as coding at distortion level zero with the average distortion, after replacing the old distortion measure d by the new distortion measure d′. The case of average distortion with δ = 0 and the general case of peak distortion with any δ ≥ 0 can both be couched in the inspiring mould of graph theory; then the rate-distortion function is rather called the hypergraph entropy, or the graph entropy in the special case when the two alphabets A and B coincide and when the distortion is the Hamming distortion, as defined at the end of this appendix; cf. [4,20]. We recall that graph capacity and hypergraph entropy are the two basic functionals of the zero-error theory; both of them originated in a coding-theoretic context, but both have found unexpected and deep applications elsewhere; cf. [15].

or, in the case of peak distortion:

  π^n{x : d*_n(x, g(x)) > δ} ≤ ε,   (B.6)

to be compared with (B.1) and (B.4). The corresponding minimization problems are:

  minimize the code rate (1/n) log |C| so as to satisfy constraint (B.5) or (B.6), respectively.

Definition B.1. The possibilistic rate-distortion function R_ε(π, δ) for average distortion and the possibilistic rate-distortion function R*_ε(π, δ) for peak distortion are the limits of the rates R_n of codes which are optimal for criterion (B.5) or (B.6), respectively, as the length n goes to infinity; 0 ≤ ε < 1.

Lemma 3.1 soon gives the following lemma:

Lemma B.1. If P and π are ε-equivalent, a code f = (f⁺, f⁻) is optimal for criterion (B.1) at zero error if and only if it is optimal for criterion (B.5) at ε-error; it is optimal for criterion (B.4) at zero error if and only if it is optimal for criterion (B.6) at ε-error.

The following theorem is obtained from Lemma B.1 after a comparison with the expressions of R₀(P, δ) and R*₀(P, δ):

Theorem B.1. The possibilistic rate-distortion functions R_ε(π, δ) and R*_ε(π, δ), 0 ≤ ε < 1, are given by:

  R_ε(π, δ) = R₀(P, δ),   R*_ε(π, δ) = R*₀(P, δ)

for whatever P such as to be ε-equivalent to π; more explicitly:

  R_ε(π, δ) = max_{X: π(a)≤ε ⇒ P_X(a)=0}  min_{XY: Ed(X,Y) ≤ δ} I(X ∧ Y),

  R*_ε(π, δ) = max_{X: π(a)≤ε ⇒ P_X(a)=0}  min_{XY: d(X,Y) ≤ δ} I(X ∧ Y).


Observe that the possibilistic rate-distortion functions R_ε(π, δ) and R*_ε(π, δ) are both non-increasing step-functions of ε. Actually, if λ_i < λ_{i+1} are two consecutive entries of π, as in Proposition 4.1, the relation of ε-equivalence is always the same for whatever ε such that λ_i ≤ ε < λ_{i+1}. Unlike R_ε(π, δ), R*_ε(π, δ) is also a step-function of δ. Actually, if the distinct entries of the matrix d are arranged in increasing order, and if d_i < d_{i+1} are two consecutive entries, the constraint (B.6) is the same for whatever δ such that d_i ≤ δ < d_{i+1}.

In some simple cases the minima and the maxima which appear in the expressions of the various rate-distortion functions can be made explicit; the reader is referred once more to [4]; the results given there are soon adapted to the possibilistic case. We shall just mention one such special case: the two alphabets coincide, A = B, the distortion matrix d is the Hamming distortion, i.e., d(a, b) is equal to 0 or to 1 according to whether a = b or a ≠ b, respectively, and δ = 0. As a matter of fact, one soon realizes that this is just a different formulation of the problem of coding without distortion as in Section 4. A simple computation gives:

  R_ε(π, 0) = R*_ε(π, 0) = log |{a : π(a) > ε}|,

in accordance with the expression of the possibilistic entropy given in Theorem 4.1. A slight generalization of this case is obtained for arbitrary δ ≥ 0 when the inequality d(a, b) ≤ δ is an equivalence relation which partitions the primary alphabet A into equivalence classes E. Then

  R*_ε(π, δ) = log |{E : π(E) > ε}|.

In practice, optimal codes are constructed by taking a letter a_E for each class E whose possibility exceeds ε; each primary letter in E is then reproduced by using precisely a_E. This way the asymptotic optimal rate R*_ε(π, δ) is achieved already for n = 1, as in the case of coding without distortion.

This is bad news, since it means that optimal code constructions are bound to be trivial; cf. footnote 11.
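The last two formulas are easy to evaluate mechanically; in the sketch below (our own code, with a made-up alphabet, possibility vector and partition into classes) the possibility of a class is taken to be the maximum possibility of its elements:

    # Sketch (made-up data): R*_eps(pi, delta) = log |{E : pi(E) > eps}| when d(a,b) <= delta
    # is an equivalence relation partitioning the primary alphabet into classes E.
    import math

    classes = {"E1": ["a", "b"], "E2": ["c"], "E3": ["d", "e"]}        # the classes of d(.,.) <= delta
    pi = {"a": 1.0, "b": 2/3, "c": 1/3, "d": 1.0, "e": 1.0}            # possibility vector over letters

    def class_possibility(letters):
        return max(pi[a] for a in letters)                             # possibility of a set = max over its elements

    def R_star(eps):
        return math.log2(sum(1 for E in classes.values() if class_possibility(E) > eps))

    print(R_star(0.5))    # log2(2): only E1 and E3 survive at eps = 0.5
    print(R_star(0.0))    # log2(3): every class has positive possibility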

References

[1] M. Borelli, A. Sgarro, A possibilistic distance for sequences of equal and unequal length, in: C. Călude, Gh. Păun (Eds.), Finite VS Infinite, Discrete Mathematics and Theoretical Computer Science, Springer, London, 2000, pp. 27–38.
[2] B. Bouchon-Meunier, G. Coletti, C. Marsala, Possibilistic conditional events, IPMU 2000, Madrid, July 3–7, 2000, Proceedings, pp. 1561–1566.
[3] Th.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[4] I. Csiszár, J. Körner, Information Theory, Academic Press, New York, 1981.
[5] G. De Cooman, Possibility theory, Internat. J. General Systems 25 (4) (1997) 291–371.
[6] D. Dubois, H.T. Nguyen, H. Prade, Possibility theory, probability and fuzzy sets: misunderstandings, bridges and gaps, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 343–438.
[7] D. Dubois, W. Ostasiewicz, H. Prade, Fuzzy sets: history and basic notions, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 21–290.
[8] D. Dubois, H. Prade, Properties of measures of information in evidence and possibility theories, Fuzzy Sets and Systems 24 (1987) 161–182.
[9] D. Dubois, H. Prade, Fuzzy sets in approximate reasoning: inference with possibility distributions, Fuzzy Sets and Systems 40 (1991) 143–202.
[10] F. Fabris, A. Sgarro, Possibilistic data transmission and fuzzy integral decoding, IPMU 2000, Madrid, July 3–7, 2000, Proceedings, pp. 1153–1158.
[11] E. Hisdal, Conditional possibilities, independence and non-interaction, Fuzzy Sets and Systems 1 (1978) 283–297.
[12] G.J. Klir, T.A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice-Hall, London, 1988.
[13] G.J. Klir, M.J. Wierman, Uncertainty-Based Information: Elements of Generalized Information Theory, Physica-Verlag/Springer-Verlag, Heidelberg and New York, 1998.
[14] G.J. Klir, Measures of uncertainty and information, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 439–457.
[15] J. Körner, A. Orlitsky, Zero-error information theory, IEEE Trans. Inform. Theory 44 (6) (1998) 2207–2229.
[16] H.T. Nguyen, E.A. Walker, A First Course in Fuzzy Logic, 2nd Edition, Chapman & Hall, London, 2000.
[17] S. Ovchinnikov, An introduction to fuzzy relations, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 233–259.
[18] C.E. Shannon, A mathematical theory of communication, Bell System Technical J. 27 (3&4) (1948) 379–423, 623–656.
[19] C.E. Shannon, The zero-error capacity of a noisy channel, IRE Trans. Inform. Theory IT-2 (1956) 8–19.
[20] G. Simonyi, Graph entropy: a survey, in: W. Cook, L. Lovász, P. Seymour (Eds.), Combinatorial Optimization, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 20, AMS, Providence, RI, 1995, pp. 399–441.
[21] D. Salomon, Data Compression, Springer, New York, 1998.
[22] P. Walley, Statistical Reasoning with Imprecise Probabilities, Chapman & Hall, London, 1991.
[23] L. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1978) 3–28.