

A visual language for explaining probabilistic reasoning$

Martin Erwig *, Eric Walkingshaw

School of EECS, Oregon State University, Corvallis, OR 97331, USA

Article info

Article history:
Received 7 September 2011
Received in revised form 12 November 2012
Accepted 14 January 2013

Keywords:
Explanation
Probability
Story telling
Explanation transformation
Language semantics

Abstract

We present an explanation-oriented, domain-specific, visual language for explaining probabilistic reasoning. Explanation-oriented programming is a new paradigm that shifts the focus of programming from the computation of results to explanations of how those results were computed. Programs in this language therefore describe explanations of probabilistic reasoning problems. The language relies on a story-telling metaphor of explanation, where the reader is guided through a series of well-understood steps from some initial state to the final result. Programs can also be manipulated according to a set of laws to automatically generate equivalent explanations from one explanation instance. This increases the explanatory value of the language by allowing readers to cheaply derive alternative explanations if they do not understand the first. The language is composed of two parts: a formal textual notation for specifying explanation-producing programs and the more elaborate visual notation for presenting those explanations. We formally define the abstract syntax of explanations and define the semantics of the textual notation in terms of the explanations that are produced.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Probabilistic reasoning is often difficult to understand, especially for people with little or no corresponding educational background. Even simple questions about conditional probabilities can have counterintuitive solutions, causing confusion and even disbelief among laypeople despite elaborate justifications. Consider, for example, the following conditional probability problem: Three coins are flipped. Given that two of the coins show heads, what is the probability that the third coin shows tails? Many people respond that the probability is 50%, but it is, in fact, 75%.
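The counterintuitive answer can be checked by brute-force enumeration of the sample space. The following sketch is our illustration, not part of the paper; it conditions the eight equally likely outcomes on at least two heads:

```python
from itertools import product

# All 8 equally likely outcomes of flipping three fair coins.
outcomes = list(product("HT", repeat=3))

# Condition: two of the coins show heads (i.e., at least two heads).
given = [o for o in outcomes if o.count("H") >= 2]

# Event: the remaining coin shows tails (exactly one tail among the three).
event = [o for o in given if o.count("T") == 1]

print(len(event) / len(given))  # 0.75
```

Of the four conditioned outcomes (HHH, HHT, HTH, THH), three contain a tail, giving 3/4 = 75%.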

If you do not understand the solution to the above problem, an explanation will be provided shortly. If you do understand, pretend for a moment that you do not—what would you do? Your best recourse might be to simply ask someone who does understand. A good personal explanation is ideal because the explainer can rephrase the explanation, answer questions, clarify assumptions, and provide related examples as further illustration. Unfortunately, good personal explanations are a comparatively scarce resource; they are not always available, and cannot be easily shared or reused.

If you cannot get a personal explanation, you might refer to a probability textbook or seek other explanatory material on the web. These explanations have much higher availability and reusability; a web-based explanation, in particular, can be accessed any number of times from almost anywhere in the world. The trade-off is that these explanations lack the flexibility and adaptability of a personal explanation. They are rarely presented in terms of the specific problem at hand and cannot respond to the questions and confusions of a particular person.

In this paper we bridge this gap with a domain-specific, visual language called Probula¹ for explaining


$ This paper has been recommended for acceptance by Shi-Kuo Chang.
* Corresponding author. Tel.: +1 541 737 8893; fax: +1 541 737 3014.
E-mail addresses: [email protected] (M. Erwig), [email protected] (E. Walkingshaw).
¹ A portmanteau of probability and fabula, the Latin word for story.


Please cite this article as: M. Erwig, E. Walkingshaw, A visual language for explaining probabilistic reasoning, Journal of Visual Languages and Computing (2013), http://dx.doi.org/10.1016/j.jvlc.2013.01.001


probabilistic reasoning problems like the three coins problem above. Using Probula, we can produce visual explanations for a wide range of conditional probability problems, combining the flexibility and accessibility benefits of personal and electronic explanations. We also provide a set of laws for transforming explanations created in Probula into alternative, equivalent explanations. This adds some of the adaptability of personal explanations, allowing users who do not understand an initial explanation to view the problem from different perspectives, to reduce understood parts of the explanation to focus on more difficult parts, and to add and remove abstraction. However, our goal is not to replace personal or web-based explanations, but to complement them. For example, a web page might provide a textual explanation along with a Probula explanation as a supplementary resource, or a teacher might use a Probula explanation as a visual aid for a verbal explanation, employing automatically generated alternatives to illustrate different points.

Probula is an example of an explanation-oriented language—a language that supports explanation-oriented programming (EOP). EOP is a new paradigm where the focus of programming is shifted from describing the computation of values to providing explanations of how and why those values are produced. A case for EOP is made in Section 2, along with a brief account of our previous work in this area.

A Probula explanation of the three coins problem is given in Fig. 1. The explanation is read from top to bottom, like a story describing how an initial probability distribution is transformed into the solution of the problem. This story-telling model is the central metaphor on which the language is built. The idea is that each step in the story is small and easy to understand, and that by following each of these well-understood steps, the reader can see how the ultimate conclusion is reached. In Section 3 we motivate the choice of the story-telling model for explaining probabilistic reasoning and describe its impact on the design of Probula.

We begin the explanation in Fig. 1 with an empty probability distribution, that is, a distribution with a 100% probability of no event. The first three steps of the explanation are all generator steps, which introduce new events into a distribution. In this case, each generator represents a coin flip. The filter step encodes our conditional constraint that we consider only cases where two of the three coins show heads. Finally, we group the possibilities in the resulting distribution into two cases according to the question posed by the problem—whether or not the remaining coin shows tails.
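The steps just described can be read as a pipeline of distribution transformers. The sketch below mirrors them in Python; the helper names (generate, filter_dist, group_dist) are our own stand-ins for Probula's operations, not the paper's implementation:

```python
# A distribution maps values (tuples of coin results) to probabilities.
# The empty distribution assigns 100% probability to "no event".
empty = {(): 1.0}

def generate(dist, choices):
    """Generator step: extend every value with each possible new event."""
    return {v + (c,): p * q for v, p in dist.items() for c, q in choices.items()}

def filter_dist(dist, pred):
    """Filter step: keep values satisfying pred, renormalizing probabilities."""
    kept = {v: p for v, p in dist.items() if pred(v)}
    total = sum(kept.values())
    return {v: p / total for v, p in kept.items()}

def group_dist(dist, key):
    """Group step: consolidate blocks into groups by a grouping function."""
    groups = {}
    for v, p in dist.items():
        groups[key(v)] = groups.get(key(v), 0.0) + p
    return groups

coin = {"H": 0.5, "T": 0.5}
d = empty
for _ in range(3):                                 # three generator steps
    d = generate(d, coin)
d = filter_dist(d, lambda v: v.count("H") >= 2)    # conditional constraint
answer = group_dist(d, lambda v: "T" in v)         # remaining coin tails?
print(answer[True], answer[False])  # 0.75 0.25
```

Each assignment to `d` corresponds to one point in the story; the explanandum is the final grouped distribution.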

The Probula language consists of two levels, each represented by a distinct notation. The visual notation shown in Fig. 1 represents the story-level and is the more important and interesting of the two. An object at this level is a story—a view on a transformable explanation that is intended to be read by explanation consumers. The second level is the plot-level, represented by a simple textual notation. An object in this language is a plot, also called an explanation program, which is a formal specification of a story written by the explanation creator. A plot can be instantiated or executed with an initial distribution to produce a story.

In Section 4 we formally define the notations of stories and plots and relate the two through a third notation, a mathematical representation of distribution graphs. A distribution graph is essentially a sequence of probabilistic distributions, where each distribution is a group of nodes (one for each value in the distribution), and where the nodes in each adjacent distribution are possibly connected by edges. Specifically, in Section 4, we equate distribution graphs to the abstract syntax [9] of the visual story notation, then define the denotational semantics of plots also in terms of distribution graphs. In other words, a distribution graph represents a story, and the meaning of a plot is the story it produces. We also discuss and motivate many aspects of the concrete syntax of the visual notation, beyond the overarching story-telling metaphor.

Each step of a story is represented at the plot-level by an operation (such as generate, filter, or group) that, given a preceding distribution D_i, extends the distribution graph by a new distribution D_{i+1} and new edges connecting nodes in D_i and D_{i+1}. In Section 5 we define the semantics of all six operations provided by Probula in terms of distribution graphs, as described above.

Finally, a story represents only a single instance from a field of potential explanations for the underlying probabilistic reasoning problem that it explains. Through the use of the above-mentioned transformation laws, we can automatically transform the underlying plot for this story into several "equivalent" alternatives. In Section 6 we define this notion of plot equivalence, then enumerate and formally describe all such transformations. We also discuss the trade-offs that each of these transformations presents, both informally and through the development and application of a simple measure of vertical vs. horizontal distribution graph complexity.

Fig. 1. Explanation of the three-coins problem.




The rest of the paper consists of a discussion of related work in Section 7, followed by conclusions and directions for future work in Section 8.

This work is an extension and consolidation of two earlier papers on explaining probabilistic reasoning [12,13]. Prior to these papers, we presented a domain-specific embedded language (DSEL) in Haskell for creating and manipulating (but not explaining) probabilistic values [10]. In [12] we reused this DSEL in the design of a second Haskell DSEL for explaining probabilistic reasoning that became the basis for Probula. That paper also provides the first examples of Probula's visual notation along with an extended discussion of philosophical research on explanations and a lengthier motivation for the story-telling metaphor than is provided here. We formalized a subset of the visual notation of Probula in [13] and also provided a formal semantics for a subset of the operations used for creating Probula explanations. That work also contained the first set of laws for transforming a Probula explanation into equivalent alternatives.

In this paper, we improve and complete the formalization of the visual notation begun in [13] and provide a formal semantics for all of the supported operations. In particular, we formalize for the first time the branching operation and the selection of representative examples, and improve the treatment of grouping operations. We also define a formal specification language for generating these explanations and clarify the relationship of this formal notation to the visual notation. Finally, we provide a more complete and detailed description of the explanation-generating transformation laws.

2. Explanation-oriented programming

EOP is fundamentally motivated by two simple observations. The first is that programs often produce unexpected results. The second is that programs have value not only as tools for instructing computers, but also as a medium of communication between people.

When a program produces an unexpected result, a user is presented with several questions. First, is the result correct? If so, what is the user's misunderstanding? If not, where is the bug and how can it be fixed? In these situations an explanation of how the result was generated or why it is correct would be very helpful. Unfortunately, such explanations are difficult to come by; in many cases a user's only recourse is to go through a long and exhausting debugging process. For less critical problems, the user may simply give up. With a large and growing class of end-user programmers [33], the ability to provide good explanations is more important than ever.

Although many tools for explaining programs exist, such as debuggers and software visualization tools, these often suffer from a fundamental disconnect from the languages they are meant to explain. If a language provides explanation tools at all, they are typically designed post hoc, as an add-on to the language itself. This forces these tools to reflect a notion of computation that was not designed for explainability (rather, usually to maximize other properties like efficiency and expressiveness), leading to explanations that are difficult to produce and have low explanatory value.

In an explanation-oriented language, the focus on explaining programs is shifted to an earlier stage in the language and tool development process, making it a primary design goal of the language itself. In devising syntax and semantics, language designers consider not only the production of values, but also explanations of how those values are obtained and why they are correct.

This idea has a huge potential to innovate language design. Besides the obvious applications to debugging and program analysis, it also suggests an entirely new class of domain-specific languages where the explanation itself, rather than a final value, is the primary output of a program. This represents a stronger sense of explanation orientation and reflects the second observation above that formal languages are an effective medium for communicating ideas with other people. Probula is an example of such a language, for producing explanations in the domain of probabilistic reasoning.

Using a strongly explanation-oriented language like Probula, an explanation designer, who is an expert in the application domain, can create and distribute explanation artifacts for specific problems in the domain. These artifacts can be customized and explored by non-expert explanation consumers who do not understand the problems or who are trying to gain a greater understanding of the domain itself. There are many possible application domains for such languages, including the explanation of medical procedures, electrical systems, computer algorithms, and all kinds of scientific phenomena. In this paper we address the application domain of probabilistic reasoning, something which is generally not well understood but of great scientific and practical importance, and can thus benefit greatly from a language for creating explanations.

3. The story-telling model of explanations

The visual notation of Probula relies on several interrelated metaphors. The most important of these likens the process of explaining a phenomenon to that of telling a story. This story-telling metaphor fundamentally directed the design of Probula. In this section, we motivate this design decision and demonstrate its impact on the design.

In our previous work we have conducted a brief survey of the study of explanations and their representation [12]. Our criteria for selecting a model for explaining probabilistic reasoning were that the model should be (1) simple and intuitive, since the language is intended to be used not just by philosophers, mathematicians, or scientists, but by laypeople who likely have no background in explanation theory or formal modeling; and (2) constructive, in the sense that the model should identify specific components of an explanation and their relationships, so that these can in turn be realized by specific language constructs. These criteria rule out several potential models of explanation, such as advanced statistical models [27,31], models based on physical laws [32], and those based on process theory [6], all of which are too complicated for our purposes. Also ruled out are unificationist models




[24], which do not provide a constructive approach (and are also quite complicated).

We believe that the most promising explanation models are those based on causation; that is, to explain a phenomenon means to identify what caused it. The idea that explanations reveal causes for why things have happened seems intuitive and goes back at least to Plato [30]. Popular models in the philosophical study of causation are based on causal graphs [16,37]. In these models, explanations are represented by directed graphs with variables as nodes (typically representing events or states), where an edge leads from X to Y if X has a direct effect on Y. In our previous work we have worked extensively with one such representation called neuron diagrams [27], formalizing and extending the language in [14] and designing a Haskell DSEL for creating and analyzing neuron diagrams in [36].

While causal graphs are a useful and expressive explanation model, they also place a navigational burden on the user. That is, a user must decide which nodes, edges, or paths to look at and in which order. A more linear representation is simpler and easier to understand but also less expressive.

Our explanation-oriented focus and the selection criteria stated above lead us to choose a more linear view of causation and ultimately to the story-telling model of explanation. Specifically, we represent an explanation as a sequence of points in a story, where each point corresponds to some state. We move from point to point through the story by an interleaved sequence of state-modifying steps that guide the reader from the initial state to the explanandum.² In designing domain-specific languages, we often embrace limited computational generality to achieve a close fit with the target domain. Likewise for explanation-oriented DSLs, we embrace a less general explanation model if it fits the needs of our domain, leading to simpler and thus more understandable explanations. The intuitiveness of this model is supported by empirical evidence that suggests that presenting facts and explanations in the form of a story makes them more convincing and understandable [28].

The story-telling model is also constructive in that it suggests a path for realization within a particular domain. Specifically, we must identify (1) some notion of state within the domain that will be manipulated throughout the story, and (2) a set of composable operations for defining the transitions at each step. Tying this back to the linearized notion of causal graphs, we must essentially realize the representation of the nodes (which correspond to states in the story-telling model) and the edges (which correspond to operations). The explanation of the three-coins problem presented in Section 1 demonstrates our adaptation of the story-telling model to the domain of probabilistic reasoning. The state, shown at each point in the explanation, is a probabilistic distribution, while operations correspond to the annotated steps between states.

The story-telling model also suggests a distinction between two different levels of notation. An explanation is sufficiently defined by its initial state and the sequence of operations that eventually transform it into the explanandum, which we call the plot. For example, the plot of an explanation of how an omelette is made might list the ingredients and directions for how to make one. However, the presentation of an explanation, or the story, must also include intermediate states generated by each step. For example, a story explaining the creation of an omelette might show a sequence of illustrations demonstrating each preparation and cooking step that transforms the ingredients into the finished omelette. This distinction between definition and presentation, or plot and story, is reflected in the two notations of Probula.

4. A language for explaining probabilistic reasoning

In this section we describe how each of the language features motivated by the story-telling model—state, state-transforming operations, and a distinction between story and plot—are represented both visually and formally. We begin with the representation of state as probability distributions in Section 4.1 and of distribution-modifying operations in Section 4.2, which together lead to a notion of distribution graphs that serve as the abstract syntax of stories in Probula. In Section 4.3 we introduce a limited form of branching to Probula stories, enabling the exploration of choices between probabilistic outcomes in explanations, something which is common in probabilistic reasoning. Finally, in Section 4.4 we bring everything together by formalizing the abstract syntax of stories as distribution graphs, presenting a simple textual notation for representing plots, and a semantics function for generating a story (distribution graph) from a plot and an initial distribution. Throughout this section, we also discuss the concrete syntax of the visual notation, pointing out and motivating the important design decisions and visual metaphors used.

4.1. Probability distributions

Each point in a Probula story is represented by a typed, discrete probability distribution. A distribution D over a discrete random variable of type A is a set of pairs (x, p) where x : A and Σ_{p ∈ rng(D)} p = 1.³ We write the type of D as ⟨A⟩. For readability, when we write distributions explicitly we render each pair (x, p) as x_p where p is expressed as a percentage, and list all such pairs in D between angle brackets. For example, lcoin = ⟨H_60, T_40⟩ is the distribution of a loaded coin that lands on heads with 60% probability. In this example, lcoin has type ⟨C⟩ where type C has values {H, T}.
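Concretely, such a distribution can be mimicked by a finite map from values to probabilities; this small check of the defining property (probabilities summing to 1) is our illustration, not the paper's notation:

```python
def is_distribution(d, tol=1e-9):
    """Check the defining property of a discrete distribution:
    non-negative probabilities that sum to 1."""
    return all(p >= 0 for p in d.values()) and abs(sum(d.values()) - 1.0) <= tol

# The loaded coin lcoin = <H60, T40> from the text: 60% heads, 40% tails.
lcoin = {"H": 0.60, "T": 0.40}
print(is_distribution(lcoin))          # True
print(is_distribution({"H": 0.60}))    # False: probabilities sum to 0.6
```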

Visually, we represent a distribution using the common metaphor of spatial partitioning. A horizontal area is partitioned into blocks, where each block corresponds to,

² The thing that is to be explained.

³ This representation is isomorphic to a probability mass function, A → [0..1]. Also note that we access the domain and range of a set of pairs with dom(·) and rng(·), respectively.




and is labeled by, a different value-probability pair in the distribution (therefore, we often refer to these pairs in the formal notation also as blocks) and where the area of each block is proportional to its probability. Spatial partitioning is a good metaphor for representing distributions since it directly captures many abstract probabilistic axioms. For example, the sum of the areas (probabilities) of all blocks equals the area of the entire distribution (which represents 100% probability), and as the area of one block increases, the area of other blocks must necessarily decrease. From the perspective of operations that will transform this state, we can view the probability space as a resource that operations can split, merge, and redistribute amongst values.

In addition to standard distributions, we also introduce a notion of grouped distributions, an example of which can be seen as the explanandum (final state) of the explanation in Fig. 1. A grouped distribution is a distribution whose blocks have been consolidated into one or more groups. In the visual notation, we represent grouped distributions by drawing thick borders around each group and by eliminating the thinner lines between blocks in the same group.

We indicate a grouped distribution D in the formal notation with boldface. An ungrouped distribution D : ⟨A⟩ can be grouped (by the group operation, see Section 5.2) according to a grouping function g : A → T that maps values in D onto some type T whose elements can be checked for equality. All values that map to the same t : T will be put in the same group. For example, the grouping function used to produce the explanandum in Fig. 1 is a predicate from sequences of coin values to a boolean value indicating whether or not one of the coins is tails, which we write as C* → Bool. We represent a group by a pair (t, X) where X = {(x, p) | (x, p) ∈ D, g(x) = t}. A grouped distribution D is then a set of groups. We write the type of D, a distribution of type ⟨A⟩ grouped according to a type T, as ⟨A⟩_T. For example, the type of the final grouped distribution in the example is ⟨C*⟩_Bool.
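A grouping function g : A → T and the resulting set of groups can be sketched directly. The helpers below are our own hypothetical illustration; ungroup, which recovers the plain distribution as the union of the groups' blocks, anticipates the ungroup operation discussed later:

```python
def group(dist, g):
    """Group a plain distribution {x: p} by g, yielding {t: {x: p, ...}}."""
    grouped = {}
    for x, p in dist.items():
        grouped.setdefault(g(x), {})[x] = p
    return grouped

def ungroup(grouped):
    """Recover the plain distribution as the union of the groups' blocks."""
    return {x: p for blocks in grouped.values() for x, p in blocks.items()}

# The grouping predicate from Fig. 1: does one of the coins show tails?
d = {"HHH": 0.25, "HHT": 0.25, "HTH": 0.25, "THH": 0.25}
gd = group(d, lambda v: "T" in v)
print(gd[True])          # the three outcomes containing a tail
print(ungroup(gd) == d)  # True: ungrouping recovers the original
```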

In the visual notation, a group is labeled by the set of the values it contains and the sum of their probabilities. While the latter is not represented explicitly in the formal notation, it can be easily derived. Note also that the grouping value t is not shown explicitly in the visual notation.

Grouped distributions give us a way to visually organize or categorize the blocks in a distribution for the purpose of enhancing an explanation, but they do not significantly affect the underlying distribution from a mathematical perspective. A grouped distribution D is just a lightweight view imposed on the original distribution D, and we can easily recover D by simply removing this view, D = ∪ rng(D); this is the purpose of the ungroup operation presented in Section 5.5. Alternatively, we can consider plain distributions as a special case of grouped distributions, grouped by the identity function. That is, D : ⟨A⟩ is similar to a grouped distribution D : ⟨A⟩_A, where D = {(x, {(x, p)}) | (x, p) ∈ D}. Because of these similarities and since it is convenient, we often refer to both plain and grouped distributions as simply "distributions" (and range over these with D), unless the distinction is important. We always distinguish between the types of plain and grouped distributions, however, so despite the similarity above ⟨A⟩ ≠ ⟨A⟩_A.

4.2. Operations

While each point in a story is represented by a distribution, the structure and progression of the story (its plot) is fundamentally determined by its operations, which describe the steps from one point in the story to the next. An operation can therefore be viewed as a function from one distribution to another. In the following we refer to the ith operation in the plot as o_i and the function implementing this operation as f_i; we use D_{i−1} for the (plain or grouped) distribution at the preceding point in the story and D_i for the subsequent distribution, generated by applying f_i(D_{i−1}).

At the non-visual plot level, f_i is a sufficient description of an operation. In the visual story notation, however, operations are a bit more intricate. In addition to the name of the operation (generate, filter, group, etc.) and a brief annotation describing the transformation it performs, the visual representation of a step from one point to the next includes edges between the blocks in the distributions D_{i-1} and D_i. This reflects the convention of data flow diagrams, making explicit the flow of values (possibly modified by the intermediate operation) from one distribution to the next. The basic story-telling model implies a representation of state and a set of state-manipulating operations, with the idea that each operation will be small and simple enough to understand in isolation. In Probula, we extend that basic model with an explicit representation of the state transformations produced by the operations. In other words, instead of just showing each point in the story (as we did in our textual DSEL [12]), edges in Probula describe how each step produces its subsequent story point, allowing readers to direct their effort toward the more important aspects of the explanation.

In addition to the subsequent distribution D_i, each operation o_i generates a set of edges E_i, where each edge is a pair (v, w) with v ∈ dom(D_{i-1}) and w ∈ dom(D_i) ∪ {⊥}. An edge (v, ⊥) represents a terminating edge, an edge from D_{i-1} that does not connect to a block in D_i but ends instead with a horizontal bar. This indicates a value in D_{i-1} that does not flow to the next distribution, for example, because it was filtered out, as seen in the filter step of the explanation in Fig. 1. Note that since values in a distribution are unique, we can uniquely identify a block in a plain distribution by its value. For grouped distributions, we connect edges to groups rather than blocks, and each group can be uniquely identified by its representative value t.

The meaning of an operation o_i from the perspective of story generation is therefore not just f_i, but a function from D_{i-1} to the pair (E_i, D_i). We call this function the story semantics of the operation, written ⟦o_i⟧. In Section 5 we will define the story semantics of all of the operations in Probula.

Note that although each operation has an associated annotation, we do not capture this in the formal representation above. While it would be trivial to add an annotation value to the representation and story semantics of each operation, this would clutter the notation without adding any real benefit. More significantly, however, we do not consider the automatic transformation of annotations in Section 6. Instead, annotations are considered a purely secondary notation [15], that is, a way to convey information outside the constraints of the formal notation.

We call a sequence of distributions and block/group-connecting edges a distribution graph, and these form the abstract syntax of Probula's visual notation. A simplified view of distribution graphs follows directly from the definitions of distributions and operations above. A plot can be viewed as a sequence of operations o_1, …, o_k, from which a distribution graph (D_0, E_1, D_1, …, E_k, D_k) can be generated by mapping the story semantics across the operations and reducing the list with function application and the initial distribution D_0 (saving the intermediate edges and distributions). Now we add another wrinkle to Probula stories, however, in order to represent a new class of probabilistic reasoning problems.

4.3. Story branching

Many probabilistic reasoning problems involve a question of choice—a scenario is established and the reader is asked which course of action has the best expected benefit to the actor in the scenario. One of the most famous (and famously misunderstood) examples is the so-called "Monty Hall Problem" [10,17,26]. In this scenario, a game show host presents a contestant (the actor) with three doors, one of which hides a prize. The contestant selects one of the doors, then the host opens one of the remaining two doors that does not contain a prize. The contestant is then presented with a choice: stay with the originally selected door or switch to the other closed door? Most people believe that switching doors makes no difference, but this is incorrect—switching doors doubles the contestant's chance of winning from 33⅓% to 66⅔%.

A Probula explanation for the Monty Hall Problem is presented in Fig. 2. It introduces two new operations and the concept of story branching. The story reads as follows: Initially, there are three possibilities to consider—either the prize is hidden behind the first, second, or third door—and each case is equally likely. This is represented by an initial distribution where each value is a triple with elements corresponding to doors; $ indicates the door containing the prize (unknown by the contestant), and 0 indicates a door without a prize. In the first step, we select one of these cases as a representative example of the problem since, for the purposes of probabilistic analysis, it does not matter which door the prize is behind, just that one of the three doors contains a prize. In other words, the three cases are isomorphic.4 The next two generator steps represent the contestant initially selecting a door and the host opening a prizeless door. The selected door is represented visually by underlining, while the opened door is outlined in a small rectangle. Note that when the contestant has chosen the door with the prize, the host can open either one of the other two doors, while he has no such choice in the other two cases (he must open the remaining door not containing the prize). The generator will therefore split the block in half in the first case, and leave the probability unchanged in the other two cases. Next, we group the results according to whether or not the contestant is currently unknowingly winning the prize. This brings us to the choice point, represented in Probula by a branch in the story. In the left branch, the contestant does not switch doors, while in the right branch, the contestant switches. Finally, we map the potential outcomes in each branch onto the values "Win" and "Lose", indicating whether or not the contestant won the prize. By comparing the explananda at the end of each branch, we can see that switching doors leads to a 66.7% chance of winning, while not switching doors leads to only a 33.3% chance of winning.
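The explananda of this story can be checked independently of Probula by brute-force enumeration. The following Python sketch is our own illustrative check, not part of the language; it enumerates the equally likely combinations of prize location, contestant pick, and host choice using exact rational arithmetic.

```python
from fractions import Fraction

# Brute-force check (ours, independent of Probula) of the Monty Hall
# probabilities: enumerate prize location, contestant pick, host choice.
third = Fraction(1, 3)

win_stay = win_switch = Fraction(0)
for prize in range(3):
    for pick in range(3):
        # Doors the host may open: not picked and not hiding the prize.
        hosts = [d for d in range(3) if d != pick and d != prize]
        for opened in hosts:
            p = third * third * Fraction(1, len(hosts))
            if pick == prize:
                win_stay += p       # staying wins only if the pick was right
            else:
                win_switch += p     # switching from a wrong pick always wins

assert win_stay == Fraction(1, 3)    # 33.3% chance when not switching
assert win_switch == Fraction(2, 3)  # 66.7% chance when switching
```

The inner loop mirrors the generator split in the story: when the contestant has picked the prize door, the host's choice splits the probability in half; otherwise the host's move is forced.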

Visually, we represent story branching by showing both possible paths for the story to take, separated by a dotted line. One branch is always preferred, in that the edges from the last unbranched distribution lead to the first distribution in only one of the two branches, and step annotations are shown only for the step to the preferred branch. This is simply to reduce noise in the visual notation. An interactive implementation would allow readers to change the preferred branch. Additional annotations (in parentheses) are provided to label each alternative in the branch, to help the reader understand the choice that the branch represents.

Branching fundamentally extends the expressiveness of Probula by allowing us to explain a class of real-world-applicable problems concerned with making the best choices in the face of probabilistic information. However, the verbose notation clearly will not scale beyond a very small number of branches. This is an intentional design decision, consistent with our emphasis on explainability over other qualities. By adopting the mostly linear story-telling model, we favored explainability over expressiveness. With branching, we give some of that expressiveness back, but our particular choice of notation sacrifices scalability to avoid compromising the simplicity and explicitness of the explanations. These sorts of design trade-offs are expected when designing languages for usability [15].

Having demonstrated branching in the visual notation at the story level, in the next subsection we extend the formal representations of plots and distribution graphs to also incorporate branching. We also tie this entire section together with a semantics function for producing stories from a plot and an initial distribution.

4.4. Distribution graphs, plots, and stories

As described in Section 1, the visual notation of Probula tells a story intended to be read by an explanation consumer—someone who presumably does not understand the problem being explained. For this end-user, the visual notation is the only aspect of Probula they need ever see. However, we have made a point of distinguishing the visual notation of stories in Probula from the formal/textual notations of plots and distribution graphs, which have different intended audiences. Plots are created by explanation producers—domain experts who understand the problem to be explained and want to explain it to others. A plot is essentially an explanation-producing program, while a story is the result of executing that program. Distribution graphs are a formal representation of the abstract syntax of stories, and are used mainly by us, as language creators, to define the language of Probula, formally describe the relationship between plots and stories, and discuss the transformation of stories and plots. In this subsection we will complete the definitions of plots and distribution graphs, and relate the two by extending the story semantics of operations (see Section 4.2) to plots.

4 Using Theorem 11, users can generate alternative explanations if they are not convinced of this. For an example, see Fig. 10 in Section 6.7.

In Section 4.2 we said that the plot of a story (its structure and progression) is fundamentally determined by its operations. In Section 5 we catalog and define these operations; here, we consider their organization. Primarily, we compose operations into plots via a right-associative sequencing operator ⊳, terminated by the end-of-plot symbol E. To produce branched stories like the explanation of the Monty Hall Problem, a plot branching operator ∧ is provided. A simplified grammar for plots is given below, using o to range over operations:

P ::= E          End of Plot
    | o ⊳ P      Plot Sequence
    | P ∧ P      Plot Branch

The grammar is simplified in that it does not encode the syntactic constraints that only some operations can be applied to grouped distributions—while these constraints could be encoded in the grammar, we prefer to enforce them separately to keep the grammar clear and simple. Note also that the syntax allows for empty plots (E) that produce trivial, single-distribution explanations, and that it does not restrict branching in any way, despite the lack of scalability in corresponding stories. It is the explanation creators' responsibility to use branching judiciously, to keep explanations understandable.

The formal notation of distribution graphs follows directly from the definition of plots, and the notations are almost structurally identical. We reuse the sequence and branch symbols from plots directly since their meaning is similar here and since it is always clear from the context which language we refer to:

G ::= D            Explanandum
    | (D, E) ⊳ G   Story Sequence
    | G ∧ G        Story Branch

Graphs terminate in an explanandum, and each step in a story sequence is represented by a pair of a preceding distribution and a set of edges connecting it to the next distribution in the sequence. Note that this representation introduces a small amount of redundancy when representing branches, since the distribution immediately preceding a branch in the story will be represented multiple times (once at the beginning of each branch) in the graph. This representation is chosen because it leads to a more straightforward semantics function.

Fig. 2. Explanation of the Monty Hall Problem.

We defined the story semantics of an operation o_i in Section 4.2 to be a function ⟦o_i⟧ : D → (E, D) from a preceding distribution D_{i-1} to a pair containing the subsequent distribution D_i and a set of edges E_i connecting the two distributions. We use this definition to extend the notion of story semantics to plots. The story semantics of a plot P is a function from an initial distribution to the distribution graph representing the abstract syntax of the generated story, ⟦P⟧ : D → G. In the following, ⟦·⟧(D) means to apply the function returned by ⟦·⟧ to D:

⟦E⟧ = λD. D
⟦o ⊳ P⟧ = λD. (D, E) ⊳ ⟦P⟧(D′)   where (E, D′) = ⟦o⟧(D)
⟦P ∧ P′⟧ = λD. ⟦P⟧(D) ∧ ⟦P′⟧(D)

For the base case, we simply return the argument distribution as the explanandum when the end of the plot is reached. For branches, we recursively compute the story semantics of each branch and compose the results with a corresponding branch in the graph. For plot sequences, we compute the story semantics for the current operation and apply that function to the preceding distribution, yielding the next distribution and a set of connecting edges. We use these to create a story sequence in the graph and then recursively compute the semantics of the rest of the plot. Using this semantics function we can generate a story given a plot and an initial distribution.
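This recursion can be sketched as a small interpreter. The following Python fragment is our own illustration, not Probula's concrete syntax: the tuple encoding of plots and graphs and the names run and double are assumptions; an operation is any function from a distribution to a pair of edges and a new distribution.

```python
# A sketch (ours) of the story semantics [[P]] : D -> G. Plots are nested
# tuples: ('end',), ('seq', op, rest), or ('branch', left, right).

def run(plot, dist):
    """Produce a distribution graph from a plot and an initial distribution."""
    if plot[0] == 'end':            # [[E]] = \D. D: the explanandum
        return ('explanandum', dist)
    if plot[0] == 'seq':            # [[o |> P]]: step, then recurse on D'
        _, op, rest = plot
        edges, next_dist = op(dist)
        return ('step', dist, edges, run(rest, next_dist))
    _, left, right = plot           # both branches start from the same D
    return ('branch', run(left, dist), run(right, dist))

# A toy one-to-one map step, just to exercise the interpreter:
def double(d):
    return {(x, 2 * x) for x in d}, {2 * x: p for x, p in d.items()}

graph = run(('seq', double, ('end',)), {1: 0.5, 2: 0.5})
assert graph == ('step', {1: 0.5, 2: 0.5}, {(1, 2), (2, 4)},
                 ('explanandum', {2: 0.5, 4: 0.5}))
```

Note how the branch case passes the same incoming distribution to both sub-plots, matching the redundancy in the graph representation discussed above.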

Finally, when describing transformations in Section 6, we will need to talk about whether or not a transformation preserves the meaning of the affected part of the distribution graph. This poses two requirements: (1) a notion of a distribution subgraph and (2) a definition of the semantics (meaning) of a subgraph. The first is straightforward; given a distribution graph (D_0, E_1) ⊳ … ⊳ (D_{i-1}, E_i) ⊳ … ⊳ D_n, we can obtain the subgraph corresponding to steps j through k ≤ n as (D_{j-1}, E_j) ⊳ … ⊳ D_k. For the second requirement, there are many possibilities. We define the semantics of a (sub)graph G to be its limits, written lim(G), defined as the pair of G's initial distribution and a sequence D* of the explananda at the end of every branch in G. For the simple and common case where G is a k-step story without branching, the limits of the graph will therefore be the pair (D_0, D_k). Thus, when we say that the meaning of a subgraph is preserved over some transformation, we mean that the subgraph begins and ends in the same place, although it will presumably change in the middle. This is important for ensuring the locality of transformations.

We conclude this section with a summary of the semantics of the major constructs introduced throughout this section, given in Fig. 3.

5. Catalog of operations

At the heart of the story-telling model is the idea that a story can be told as a sequence of steps where each step can be easily understood on its own, allowing the reader to focus on the progression of the explanation from the beginning to the explanandum. For explaining the domain of probabilistic reasoning, we identified six types of these steps, implemented as distribution-manipulating (and distribution-graph-generating) operations. In this section we will discuss each of these operations in turn, defining them formally and describing their representation in the visual notation. We will also describe the syntactic constraints on each operation (mentioned in Section 4.4), since some operations can only be applied to either plain or grouped distributions, but not both.

Probula's six operations are listed in the syntax definition in Fig. 4, in the order that they will appear in this section. Operations are distinguished in the formal notation by a prefixed symbol and may include an argument, usually a function, defining the behavior of the operation. In the syntax, the metavariables e, g, f, and p all range over functions, but each has different constraints on its type. These constraints will be discussed in the context of their associated operations and are also summarized later in Section 5.7. The metavariable x ranges over distribution values.

Recall from Section 4.2 and Fig. 3 that the story semantics of an operation o_i is a function from the preceding distribution D_{i-1} to a pair of the subsequent distribution D_i and the connecting edges E_i. In this section, we will define the story semantics of each operation explicitly. When discussing the interaction of types in an operation, however, we will usually refer to just the distribution transformation—the function from D_{i-1} to D_i, omitting the edges, which carry no interesting type information. In Section 5.7 we will summarize the types of the distribution transformations of all of the operations presented in this section.

5.1. Generate

A generator ▹e introduces new probabilistic events into a plain distribution. The event-generating function e : A → ⟨B⟩ maps the values in the preceding distribution D : ⟨A⟩ to distributions of type ⟨B⟩, where B is usually derived in some way from A. The distributions produced by e are concatenated and scaled according to the probabilities of the original values in D, in order to produce the resulting distribution D′ : ⟨B⟩.

As an example, consider the definition of the second generator in the explanation of the Monty Hall Problem in Fig. 2, in which the host opens a door that was not chosen and does not contain the prize. This operation can be defined as ▹openDoor with the function openDoor defined

Syntactic Category   Semantics Domain   Description of Semantics Function
D                    A → [0..1]         probability mass function
o                    D → (E, D)         story semantics of an operation
P                    D → G              story semantics of a plot
G                    (D, D*)            limits of a distribution graph

Fig. 3. Summary of denotations of constructs in the formal notation.

Fig. 4. Probula operations.


below (using pattern-matching):

openDoor($00) = ⟨$00_50, $00_50⟩    (contestant selected the prize door)
openDoor($00) = ⟨$00_100⟩           (contestant selected the second door)
openDoor($00) = ⟨$00_100⟩           (contestant selected the third door)

(The contestant's selected door is marked by underlining and the host's opened door by a small rectangle in Fig. 2; these visual marks distinguish the three patterns and results above.)

If the contestant has selected the door with the prize, there is a 50% chance of the host opening either remaining door. However, if the contestant has chosen a door without the prize, then there is a 100% chance of the host opening the other prizeless door. The effect of applying ▹openDoor to its preceding distribution can be seen in Fig. 2.

Often a generator simply extends a distribution of sequences A* with a new probabilistic A value. For example, each of the three generators in the three coins problem in Fig. 1 adds a new coin flip to the distribution. We can represent a coin flip with the following event-generating function flipCoin : C* → ⟨C*⟩ that appends heads or tails to the sequences in the existing distribution with 50% probability each:

flipCoin(x) = ⟨xH_50, xT_50⟩

For these simple sequence-extending cases, we can automatically derive an event-generating function of type A* → ⟨A*⟩ given just the distribution of the new value to append, D : ⟨A⟩. Since this case is very common, we introduce as syntactic sugar the notation ▹D for ▹e where e = λx. {(xy, p) | (y, p) ∈ D}. Using this, we can define the first three steps of the plot for the three coins problem as ▹coin ⊳ ▹coin ⊳ ▹coin, where coin : ⟨C⟩ is the distribution ⟨H_50, T_50⟩.

The story semantics of a generate operation ▹e is given explicitly below. Edges are defined between every value x in D and every y produced by e(x). The probabilities in the resulting distribution D′ are scaled by multiplying the original probabilities in D with those in the distributions generated by e. We also have to add the probabilities for identical y values that are produced by different invocations of the generator. This is done in the definition of D″:

⟦▹e⟧ = λD. (E, D″)
where E  = {(x, y) | x ∈ dom(D), y ∈ dom(e(x))}
      D′ = {((x, y), pq) | (x, p) ∈ D, (y, q) ∈ e(x)}
      D″ = {(y, Σ_{((x,z),p) ∈ D′, z = y} p) | y ∈ rng(dom(D′))}

Visually, we represent edges in generate steps by collapsing the common tails of edges from the same source x, emphasizing the idea that generators partition or split blocks in the probability space when a value in D leads to more than one value in D″.
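The scaling and merging in this definition can be made concrete. The following Python sketch is our own encoding, not Probula's implementation; the names generate and flip_coin are assumptions, with plain distributions again encoded as dicts from values to probabilities.

```python
# A sketch (ours) of the generate operation's story semantics: lift an
# event generator e : A -> <B> into a step from D to (edges, D'').

def generate(e):
    def step(dist):
        edges = {(x, y) for x in dist for y in e(x)}
        out = {}
        for x, p in dist.items():       # scale each generated distribution
            for y, q in e(x).items():   # by p, then sum identical y values
                out[y] = out.get(y, 0) + p * q
        return edges, out
    return step

def flip_coin(x):
    """Append one fair coin flip to an outcome sequence (cf. flipCoin)."""
    return {x + 'H': 0.5, x + 'T': 0.5}

# Two flips starting from the certain empty sequence:
_, one_coin = generate(flip_coin)({'': 1.0})
_, two_coins = generate(flip_coin)(one_coin)
assert two_coins == {'HH': 0.25, 'HT': 0.25, 'TH': 0.25, 'TT': 0.25}
```

Applying the step a third time reproduces the uniform eight-sequence distribution of the three coins problem.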

5.2. Group

A group operation ⊙g transforms a plain distribution D into a grouped distribution D′. As described in Section 4.1, grouping introduces a simplified view of a distribution, hiding some of its underlying structure by visually merging sets of blocks into groups in order to emphasize some aspect of the explanation. Unlike generators (and most other operations), grouping does not fundamentally modify the underlying probability distribution. The operation is provided purely to support the explanation of probabilistic situations rather than their computation.

The operation's argument is a grouping function g : A → T that maps values in D : ⟨A⟩ onto some type T whose elements can be checked for equality. When several values in D map to the same t : T, their blocks will be placed in the same group (represented by t) in the resulting grouped distribution D′ : ⟨A⟩_T. These blocks will then be shown merged in the visual representation, and labeled by the sum of their probabilities. For example, we can represent the final step of the explanation of the three coins problem in Fig. 1 by the operation ⊙hasTails, where the predicate hasTails : C* → Bool is true if its argument coin sequence contains at least one tails value and false otherwise. In the visual notation, the blocks of the three sequences that satisfy hasTails are grouped together in the explanandum, and the probability of the group is the sum of its component blocks. The remaining block (that does not satisfy the predicate) is isolated in its own group. We also distinguish grouped distributions visually by drawing the distribution with thick borders, indicating that the view imposed by the group overrides the standard representation of the distribution.

Recall from Section 4.1 that a grouped distribution is represented by a set of groups. Each group is a pair (t, X), where t is the grouping value produced by g (true or false in the hasTails example above), and X is the set of blocks contained in the group. We build a grouped distribution D′ from a preceding plain distribution D and the grouping function g in the story semantics of group operations below:

⟦⊙g⟧ = λD. (E, D′)
where E  = {(x, g(x)) | x ∈ dom(D)}
      D′ = {(t, {(x, p) | (x, p) ∈ D, g(x) = t}) | t ∈ {g(x) | x ∈ dom(D)}}

We connect edges to blocks in plain distributions and to groups in grouped distributions, using values to address blocks and grouping values to address groups (see Section 4.2). This is demonstrated in the definition of E, where we produce an edge from each block in D to its corresponding group in D′.
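This semantics can be sketched directly. The Python fragment below is our own illustration, not part of Probula; group_op and has_tails are assumed names, and the example reproduces the final grouping step of the three coins problem.

```python
# A sketch (ours) of the group operation's story semantics: each block
# gets an edge to its group, and blocks are gathered by grouping value.

def group_op(g):
    def step(dist):
        edges = {(x, g(x)) for x in dist}
        out = {}
        for x, p in dist.items():
            out.setdefault(g(x), {})[x] = p
        return edges, out
    return step

def has_tails(seq):
    return 'T' in seq

three_coins = {s: 0.125 for s in
               ('HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT')}
edges, grouped = group_op(has_tails)(three_coins)
assert ('HHH', False) in edges and ('TTT', True) in edges
assert sum(grouped[True].values()) == 0.875   # the 87.5% "has tails" group
assert grouped[False] == {'HHH': 0.125}
```

Note that the underlying blocks survive inside each group, which is what makes the later ungroup operation possible.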

There are some differences between the structural representation of grouped distributions in the abstract syntax (distribution graphs) and their rendering in the concrete syntax (visual notation). One is that a group's representative value t is not reflected in the visual notation. We could imagine, for example, replacing the values displayed in a group by their common grouping value t. This would introduce a mechanism for abstraction that might sometimes be useful in representing more complex problems. We choose not to do this for a couple of reasons: (1) similar (but irreversible) functionality is provided already by the map operation, discussed in Section 5.3, and (2) there is a trade-off between introducing abstraction and mapping closely to our domain of probability distributions [15]. For a language focused on explainability, we again prefer simplicity, concreteness, and a high closeness of mapping over generality and scalability. Grouping values are needed in the abstract syntax, however, for edge addressing, as seen above, and for the filtering of grouped distributions, discussed in Section 5.4.

Another difference between the abstract and concrete representations of grouped distributions is that in the visual notation we display the probability of a group as the sum of the probabilities of the blocks it contains. This reflects that the primary purpose of grouped distributions, from an explanation standpoint, is to reduce the need for readers to perform hard mental operations [15] by organizing distributions into the relevant cases explicitly. We do not represent this sum explicitly in the abstract syntax since it can be easily derived.

5.3. Map

While the generate and group operations may only be applied to plain distributions, the map operation (and filter in the next section) can be applied to either plain or grouped distributions. When applied to a plain distribution, a map operation ∗f transforms D : ⟨A⟩ by applying the function f : A → B to each value in the distribution, producing a new distribution D′ : ⟨B⟩. When applied to a grouped distribution D : ⟨A⟩_T, map behaves similarly, applying f to every value within every group, preserving the grouping and producing a distribution of type D′ : ⟨B⟩_T. We will consider the plain distribution case first and adapt this to grouped distributions afterward.

If f is one-to-one, the structure of D is unchanged by a map operation. That is, D′ will have the same number of blocks as D and each will have the same probability—only the values in D will be changed. This is equivalent to the traditional functional programming view of mapping a function over a list or other data structure.

If f is not one-to-one, however, the blocks of A values that map to the same B value will be merged; that is, the probability of a value y : B in D′ will be the sum of the probabilities of all the x values in D for which f(x) = y. For example, consider the following distribution of coin sequences, produced by applying the generator sequence ▹coin ⊳ ▹coin to an empty initial distribution:

twoCoins = ⟨HH_25, HT_25, TH_25, TT_25⟩ : ⟨C*⟩

Suppose we also have a function countTails : C* → Int that returns the number of tails values in a sequence of coin flips. Applying ∗countTails to twoCoins yields the following plain distribution of integers:

numTails = ⟨0_25, 1_50, 2_25⟩ : ⟨Int⟩

In this way, a map can function similarly to a group operation, except that it actually changes the values and structure of the underlying distribution, so the original cannot easily be recovered. If instead we apply ⊙countTails to twoCoins, we get a distribution of coin sequences, grouped by integers:

⟨(0, {HH_25}), (1, {HT_25, TH_25}), (2, {TT_25})⟩ : ⟨C*⟩_Int

This overlap in functionality leads to a design decision for explanation creators of whether to employ a map or a group operation when combining events.

When choosing between map and group, if subsequent steps in the story fundamentally rely on the combined result (that is, if the combined value is needed as input in a later generate or map step), then a map operation must be used. On the other hand, if the combined result is needed only for explanatory purposes, to categorize the events into equivalence classes, then a group may be better, since it directly shows the values that make up the cases instead of abstracting these away in the combined value. To demonstrate this, consider again the final step of the explanation of the three coins problem in Fig. 1. If we had applied ∗hasTails instead of ⊙hasTails at this step, we would have produced the less useful plain distribution ⟨false_25, true_75⟩ instead of the grouped distribution shown.

To define the story semantics of the map operation, we first introduce a helper function map : (A → B) × ⟨A⟩ → ⟨B⟩ that maps a function over a plain distribution, merging blocks that map to the same values by summing their probabilities, as described above:

map(f, D) = {(y, Σ_{(x,p) ∈ D, f(x) = y} p) | y ∈ {f(x) | x ∈ dom(D)}}

With the heavy lifting off-loaded to map, the definition of the story semantics for ∗f when applied to a plain distribution is mostly trivial:

⟦∗f⟧ = λD. (E, D′)
where E  = {(x, f(x)) | x ∈ dom(D)}
      D′ = map(f, D)

The edges for a map operation connect each block in D to its corresponding block in D′. Thus, every block in D will have one departing edge, while blocks in D′ could have potentially many arriving edges, if f is not one-to-one.

When applying ∗f to a grouped distribution D, we just apply our helper function map to the subdistribution X contained in every group (t, X), thereby mapping f first over the groups in D, then over the values in those groups. Note that each X is not properly a distribution since its probabilities do not sum to 1, but this use of map is still valid since X is structurally identical to a distribution and since map does not rely on its argument representing a complete sample space:

⟦∗f⟧ = λD. (E, D′)
where E  = {(t, t) | (t, X) ∈ D}
      D′ = {(t, map(f, X)) | (t, X) ∈ D}

A subtle feature of this implementation is that maps act locally with regard to groups. That is, if two values in D map to the same value through f, they will only be merged in D′ if they were already in the same group. If they were in different groups in D, however, they will remain in different groups in D′ and thereafter, until a subsequent ungroup operation (see Section 5.5).
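Both map cases can be sketched together. The Python fragment below is our own encoding, not Probula's implementation; dist_map and map_grouped are assumed names, and the countTails example from above is reused.

```python
# A sketch (ours) of the map helper on plain distributions and its
# extension to grouped distributions, which merges only within groups.

def dist_map(f, d):
    """Apply f to every value, summing probabilities of colliding values."""
    out = {}
    for x, p in d.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

def map_grouped(f, grouped):
    """Apply f inside each group; the groups themselves are preserved."""
    return {t: dist_map(f, block) for t, block in grouped.items()}

two_coins = {'HH': 0.25, 'HT': 0.25, 'TH': 0.25, 'TT': 0.25}
count_tails = lambda s: s.count('T')

# Mapping countTails over the plain distribution merges HT and TH:
assert dist_map(count_tails, two_coins) == {0: 0.25, 1: 0.5, 2: 0.25}

# On a grouped distribution, collisions merge only inside each group:
grouped = {0: {'HH': 0.25}, 1: {'HT': 0.25, 'TH': 0.25}, 2: {'TT': 0.25}}
assert map_grouped(count_tails, grouped) == {0: {0: 0.25},
                                             1: {1: 0.5},
                                             2: {2: 0.25}}
```

The second assertion illustrates the locality described above: HT and TH merge because they share a group, while values in different groups could never merge.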


Concerning edges for maps applied to grouped distributions, we draw only a single edge between corresponding groups in D and D′, regardless of the effect of f on the values contained within. For example, see the last two steps of the Monty Hall explanation in Fig. 2. This reduces visual noise in the explanation and provides a way for explanation creators to indicate that all values in a group are affected in a similar way by the map operation.

5.4. Filter

Like map, the filter operation can be applied to either plain or grouped distributions, and has a different story semantics for each case. Again, we consider the case for plain distributions first, then adapt it to grouped distributions.

A filter operation ?p transforms a plain distribution D : ⟨A⟩ by removing blocks from D according to the predicate p : A → Bool. The area of the removed blocks is redistributed proportionally among the remaining blocks in the resulting distribution D′ : ⟨A⟩. For example, recall the distribution numTails = ⟨0@25%, 1@50%, 2@25%⟩ from the previous subsection. Applying the filter ?isPos to numTails, where isPos : Int → Bool is true if its argument is greater than zero, yields the distribution ⟨1@66.7%, 2@33.3%⟩. Note that the ratio of the probabilities of the values that pass through the filter is preserved in the resulting distribution.
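A minimal sketch of this renormalizing filter (our own illustration, with distributions as value-to-probability dictionaries) reproduces the numTails example:

```python
def filter_dist(p, dist):
    """Keep the values satisfying p and redistribute the removed mass
    proportionally, so the kept probabilities again sum to 1."""
    kept = {x: q for x, q in dist.items() if p(x)}
    c = sum(kept.values())  # proportion of the space passing the filter
    return {x: q / c for x, q in kept.items()}

num_tails = {0: 0.25, 1: 0.50, 2: 0.25}
is_pos = lambda n: n > 0
result = filter_dist(is_pos, num_tails)
# result is {1: 0.666..., 2: 0.333...}; the 2:1 ratio of the surviving
# probabilities (50% vs. 25%) is preserved.
```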

The story semantics of the filter operation when applied to a plain distribution is defined below. In the definition, the constant c represents the proportion of the probability space that is not filtered out by the predicate. This is used to scale the probabilities of the remaining values:

⟦?p⟧ = λD.(E, D′)

where c = Σ_{(x,q) ∈ D, p(x)} q

E = {(x, x) | x ∈ dom(D), p(x)} ∪ {(x, ⊥) | x ∈ dom(D), ¬p(x)}

D′ = {(x, q/c) | (x, q) ∈ D, p(x)}

As described in Section 4.2, edges from blocks eliminated by a filter end with terminating bars, while edges from blocks whose values pass the predicate are connected to their corresponding blocks in the subsequent distribution.

For grouped distributions, filters work slightly differently, filtering out whole groups of values at once. Thus, the type constraints on a filter operation ?p applied to a grouped distribution D : ⟨A⟩_T are slightly different. The predicate p must have type T → Bool instead of A → Bool, since the predicate is on the grouping values rather than the values in the underlying distribution. Often, a filter is performed immediately after a group, where the grouping function is the predicate and the argument to the filter operation is just λt.t. This accentuates the effect of the filter, grouping together all of the elements that will and won't pass the filter, and presenting the sum of each group's probabilities. Section 6.5 provides an example of this explanation strategy.

The story semantics of filtering a grouped distribution is given below. Again, c accumulates the proportion of the probability space that passes the filter and is used to scale the probabilities of values in the resulting distribution:

⟦?p⟧ = λD.(E, D′)

where c = Σ_{(t,X) ∈ D, p(t), q ∈ rng(X)} q

E = {(t, t) | t ∈ dom(D), p(t)} ∪ {(t, ⊥) | t ∈ dom(D), ¬p(t)}

D′ = {(t, {(x, q/c) | (x, q) ∈ X}) | (t, X) ∈ D, p(t)}

Edges for the grouped distribution case are similar to the plain distribution case, except that they connect groups rather than blocks.
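The grouped case can be sketched the same way (again our own illustration, not the paper's implementation): whole groups are dropped, and the inner probabilities of the surviving groups are rescaled by the surviving mass:

```python
def filter_grouped(p, grouped):
    """Drop groups whose tag fails p; rescale the remaining inner blocks
    so that the total probability mass is again 1."""
    kept = {t: sub for t, sub in grouped.items() if p(t)}
    c = sum(q for sub in kept.values() for q in sub.values())
    return {t: {x: q / c for x, q in sub.items()} for t, sub in kept.items()}

# numTails grouped by isPos, then filtered with the identity predicate
# on the Boolean group tags: only the True group survives.
g = {False: {0: 0.25}, True: {1: 0.50, 2: 0.25}}
result = filter_grouped(lambda t: t, g)
# result is {True: {1: 0.666..., 2: 0.333...}}
```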

5.5. Ungroup

An ungroup operation removes the view imposed by a previous group operation by transforming a grouped distribution back into a plain distribution. That is, it defines the distribution transformation ⟨A⟩_T → ⟨A⟩. This is almost, but not quite, as simple as taking the union of the subdistributions contained in each group, as originally described in Section 4.1. The hitch is that sometimes a map operation can map values in different groups to the same value; we must merge these during the flattening process by summing their probabilities, as if the map had been applied to a plain distribution. To do this, we can reuse the map helper function defined in Section 5.3.

In the following story semantics for the ungroup operation, we build an intermediate plain distribution of type ⟨T × A⟩ in which each value x is prefixed by the value t of its group in the previous distribution. This allows us to flatten the distribution while ensuring that every value is unique. We can then attain the desired distribution of type ⟨A⟩ by invoking the helper function map with the function snd that returns the second value of a pair. This will combine values that were the same but in different groups in the original grouped distribution, accumulating their probabilities:

⟦ungroup⟧ = λD.(E, D′)

where E = {(t, x) | (t, X) ∈ D, x ∈ dom(X)}

D′ = map(snd, {((t, x), p) | (t, X) ∈ D, (x, p) ∈ X})

We produce an edge leading to each block in the ungrouped distribution from its containing group in the grouped distribution.

An example use of the ungroup operation will be provided in Section 6.5.
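The tag-then-flatten trick reads naturally in code. In this sketch (our illustration, reusing the dictionary model and a map_dist helper like the one from Section 5.3), equal values from different groups are merged only after the group tags are dropped:

```python
def map_dist(f, dist):
    """Map f over a distribution, merging values that collide under f."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

def ungroup(grouped):
    # Prefix each value with its group tag so every key in the
    # flattened distribution is unique...
    flat = {(t, x): p for t, sub in grouped.items() for x, p in sub.items()}
    # ...then drop the tag with snd, merging duplicates across groups.
    return map_dist(lambda tx: tx[1], flat)

# The value 1 occurs in both groups; ungrouping sums its probabilities.
g = {False: {0: 0.25, 1: 0.25}, True: {1: 0.25, 2: 0.25}}
print(ungroup(g))  # {0: 0.25, 1: 0.5, 2: 0.25}
```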

5.6. Representative example selection

Like the group and ungroup operations, the final operation in Probula is not strictly needed for computation with probability distributions (unlike the other three operations), but rather supports the creation of effective explanations. The operation ◉x selects the value x as a representative example from the previous plain distribution. This operation is used when all of the cases in a distribution are isomorphic and equally probable, simplifying subsequent distributions throughout the explanation by assuming a single case rather than considering all cases simultaneously.

A typical example of this operation's use is provided in the first step of the explanation of the Monty Hall Problem in Fig. 2. In the initial distribution, we represent that the prize could be hidden behind any one of the three doors, each with equal probability. The important observations are that (1) it does not really matter which door hides the prize and (2) the probabilities of winning or losing are the same for any one case as for the problem as a whole. We could confirm this by removing the second step of the plot and observing that the number of values in subsequent distributions would be multiplied by three (and their probabilities correspondingly divided by three), but that the final probabilities in the explananda would remain unchanged.

Given the above description, it should be obvious that there are significant constraints on when we can employ the representative example operation. Fortunately, these constraints are easy to check and enforce. Given a preceding distribution D and a plot ◉x ▹ P where (x, p) ∈ D, we require that ∀(y, q) ∈ D: p = q, and that the limits of the distribution graphs ⟦P⟧(⟨x@100%⟩) and ⟦P⟧(⟨y@100%⟩) are the same. The second of these constraints reveals the story semantics of the operation, given below, in which we simply create a new distribution where the value x has 100% probability:

⟦◉x⟧ = λD.(E, D′)

where E = {(y, x) | y ∈ dom(D)}

D′ = ⟨x@100%⟩

The edges generated by a representative example step are also trivial, but we represent them in the visual concrete syntax by merging their heads and making the line from the chosen example block bold.

The representative example operation requires a somewhat deeper understanding of, or intuition for, probabilistic reasoning to accept in isolation. In Section 6.7 we present a simple transformation for changing the particular example chosen as a representative. This can help to convince readers of an explanation who do not immediately see the isomorphism between the cases.

5.7. Summary

The operations presented in this section can be separated into two classes of three operations each, according to their primary role in the creation of explanations of probabilistic reasoning. The generate, map, and filter operations represent probabilistic computations. These operations manipulate the values and probabilities in distributions directly; that is, they are primarily concerned with the computation of results essential to the explanation. The group, ungroup, and example operations are concerned instead with the presentation of explanations. Group and ungroup manipulate a grouping view overlaid on a distribution, while the example operation reduces a set of equivalent cases to a single representative case. These operations organize the computed results in order to make the important points clear and the explanation understandable.

In Fig. 5 we summarize the type information for each operation, including the types of the distribution transformations (from input to output) and of the arguments to each operation. Whether an operation is applied to a plain or grouped distribution can be determined from the type of the input distribution (⟨A⟩ is plain, ⟨A⟩_T is grouped). Note that there are two entries for the map and filter operations, since they can be applied to either plain or grouped distributions, and their implementations, constraints, and distribution transformations differ depending on this context.

In the next section we develop theorems for rearranging, merging, and introducing these operations to automatically generate alternative but equivalent explanations.

6. Generating alternative explanations

One of the most important features of Probula is the ability to algorithmically transform a single explanation, defined by an explanation creator, into many alternative, equivalent explanations of the same probabilistic reasoning problem. This allows us to bridge the gap between the two qualitatively different kinds of explanations discussed in Section 1. On one end we have explanations given directly from one person to another. Personal explanations are extremely adaptable. The explainer can answer questions, rephrase points, clarify assumptions, and provide alternative examples. The drawbacks of personal explanations are that they are inconsistent, time-consuming to share, and often difficult to come by in the first place. At the other end are what might broadly be called static explanations, those given in text or pictures, provided in a book or on the web. These explanations are much less adaptable but make up for it with higher consistency, availability, and shareability. In the middle, and intended to complement both, are explanation stories like those provided by Probula.

Probula explanations can be presented as simple illustrations, as throughout this paper, conferring the benefits of static explanations. But they are also highly adaptable, able to be transformed through a set of laws into alternative explanations that tell the same basic story in a different way, emphasizing one point or another. In this section we will present these laws, define what it means for an explanation to tell the same basic story, and define some simple metrics for discussing the differences between tellings.

Fig. 5. Summary of operations and their types.

6.1. A story and its telling—fabula and sujet

Narrative is one of the most important and fundamental ways of sharing knowledge and ideas. It is the basis not only of arts like literature and cinema, but also of history, journalism, and the entire institution of academic authorship. It rests on the idea that information can be conveyed as a story, that the same essential story can be told in different ways, and that how that story is told impacts how it is perceived and understood.

The basic advantage of personal explanations can be boiled down to the fact that a personal explanation's narrative is flexible; it can be modified on the fly to suit the particular needs of the situation. If the same person gave an explanation of the Monty Hall Problem to three different people, she would likely tell three different stories. The important point of this section, however, is that while these stories might differ in their telling, they must also, as explanations of the same problem, share some common essence. In narratology, this point is sometimes captured in the terms fabula, meaning the basic facts or "raw material" of a story, and sujet, meaning how the story is organized and told [5]. In our scenario, the fabula of the three personal explanations is constant while the sujet is likely to change between tellings, depending on the individual needs of the listeners and other factors.

In Probula, a plot and the story it generates correspond to a particular sujet: one telling or view of a larger class of related stories with a common fabula. The laws presented in this section represent transformations from one sujet in this class to another. Through the combination and repeated application of these laws, we can view a Probula explanation not just as a single story, but as a navigable space of different stories that all explain the same basic problem.

The laws are defined as transformations of a part of a Probula plot (a subplot), usually translating one sequence of operations into another. To help ensure that a law preserves the underlying fabula of generated stories, we compare the limits of the distribution subgraph (see Section 4.4) generated by the affected subplot, before and after the transformation. If the limits are the same, we say that the transformation is limit preserving. While limit preservation does not capture the preservation of fabula directly, it does so indirectly by guaranteeing that a story transformation is strictly local. Specifically, a limit-preserving transformation affects only the intermediate distributions produced by a subplot and nothing else in the story.

Transformation locality is important for several reasons. First, it enables the composition of transformation laws: new Probula explanations can be generated not only by applying the laws described in this section directly, but by composing these laws into larger transformations, broadening the space of potential explanations. Second, it allows us to consider the effect of a transformation independent of its context. In the next subsection we define a couple of simple metrics to support the isolated analysis of a transformation and to frame its effects on the perception and understandability of its generated stories.

6.2. Measures of story complexity

In order to quantify the effect of an explanation transformation, we introduce the notions of the horizontal and vertical complexity of a distribution (sub)graph. These metrics are not intended to measure the effectiveness or understandability of an explanation directly, but to provide a way to quantify the effect that a transformation has on an explanation, to support the comparison of transformations and other discussion.

The vertical complexity of a (sequential) distribution subgraph is simply the number of steps it contains. The horizontal complexity of a subgraph is the total number of edges it contains divided by the number of steps. For example, consider the subgraph corresponding to the first three generate steps in the explanation of the three coins problem in Fig. 1. The vertical complexity of this subgraph is 3, while the horizontal complexity is (2+4+8)/3 ≈ 4.67. Note that the merging of the heads and tails of edges in the concrete syntax has no effect on the computation of the horizontal complexity of a subgraph.
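These two metrics are simple enough to compute mechanically. The sketch below (our own illustration) represents a subgraph by the list of edge counts of its steps:

```python
def vertical_complexity(edge_counts):
    """Number of steps in the (sequential) subgraph."""
    return len(edge_counts)

def horizontal_complexity(edge_counts):
    """Total number of edges divided by the number of steps."""
    return sum(edge_counts) / len(edge_counts)

# The three generate steps of the three-coins explanation in Fig. 1
# produce 2, 4, and 8 edges respectively.
steps = [2, 4, 8]
print(vertical_complexity(steps))              # 3
print(round(horizontal_complexity(steps), 2))  # 4.67
```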

Vertical complexity intends to capture the intuition that the length of an explanation affects its understandability. A longer explanation presents more individual steps that must be examined and understood, and increases the cognitive load when considering the explanation as a whole. Horizontal complexity represents instead the average difficulty of understanding each step in the explanation. The reasoning is that steps with more edges require more effort for readers to untangle and accept. Again, however, neither of these metrics is intended to absolutely quantify understandability; we cannot take a graph with a vertical complexity of 4 and conclude that it is easier to understand than an unrelated graph with a vertical complexity of 5 (even if their horizontal complexities are identical). Instead we use complexity to discuss the relative understandability of graphs related through a transformation law.

As we will see in the rest of this section, most interesting transformations present a trade-off between the vertical and horizontal complexity of an explanation. This means that there is usually not a unique "best" explanation to be achieved; different explanations fill different roles and have different benefits and drawbacks. Using the transformation laws presented below, we can move between these different explanations. This not only enables users to get the particular static explanations that work best for them, but also to explore the explanation space and view many alternative explanations for the same problem, increasing the value of Probula explanations further.

6.3. Operation fusion

The first class of transformations we will consider follows from the observation that adjacent operations of the same kind can often be merged by combining the effects of their argument functions. We call this operation fusion. For example, consider again the first three generate steps in the example in Fig. 1. In Fig. 6 we show how these can be fused into a single generate step that represents the flipping of all three coins at once. Adjacent generators, maps, and filters can all be merged. In this subsection we will consider each of these in turn.

Recall that a plot transformation is considered valid if it preserves the limits of the distribution subgraph generated by the affected subplot. When fusing two adjacent generators ≫e1 and ≫e2, we must therefore identify an event generating function e3 such that if ⟦≫e1 ▹ ≫e2 ▹ P⟧(D0) = (D0, E1) ▹ (D1, E2) ▹ (D2, E3) ▹ G, then ⟦≫e3 ▹ P⟧(D0) = (D0, E2′) ▹ (D2, E3) ▹ G. Such a function is defined in the following theorem for fusing adjacent generate steps. Note that, throughout this section, we use the symbol ⇝ to indicate a directed, limit-preserving plot transformation, and we omit the common subsequent plot P from the notation. That is, we write a plot transformation o1 ▹ ⋯ ▹ oj ▹ P ⇝ o1′ ▹ ⋯ ▹ ok′ ▹ P more concisely as o1 ▹ ⋯ ▹ oj ⇝ o1′ ▹ ⋯ ▹ ok′.

Theorem 1. Generator fusion:

≫e1 ▹ ≫e2 ⇝ ≫λx.{(z, q·r) | (y, q) ∈ e1(x), (z, r) ∈ e2(y)}

Any number of adjacent generators can be fused by repeatedly applying this theorem.
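Theorem 1 can be checked on the coin example. The sketch below (our own reconstruction of the generate semantics over dictionary-modeled distributions) confirms that two flips applied in sequence and one fused two-coin generator yield the same distribution:

```python
def generate(e, dist):
    """Each value x spawns the weighted outcomes e(x); probabilities multiply."""
    out = {}
    for x, p in dist.items():
        for y, q in e(x):
            out[y] = out.get(y, 0) + p * q
    return out

def fuse(e1, e2):
    """Fused event generator from Theorem 1."""
    return lambda x: [(z, q * r) for (y, q) in e1(x) for (z, r) in e2(y)]

flip = lambda s: [(s + 'H', 0.5), (s + 'T', 0.5)]  # append one coin flip
d0 = {'': 1.0}

lhs = generate(flip, generate(flip, d0))  # two generate steps
rhs = generate(fuse(flip, flip), d0)      # one fused step
assert lhs == rhs  # both: HH, HT, TH, TT at 25% each
```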

That generator fusion is limit preserving follows directly from the story semantics of plots from Section 4.4 and of generate operations in Section 5.1. The initial distribution D0 is trivially preserved. The preservation of D2 can be confirmed by computing and comparing the semantics of the LHS and RHS of the transformation. For the LHS, we first compute the intermediate distribution D1 and use that to compute the semantics of D2, as follows:

D1 = ⟦≫e1⟧(D0) = {(y, p·q) | (x, p) ∈ D0, (y, q) ∈ e1(x)}

D2 = ⟦≫e2⟧(D1) = {(z, p·q·r) | (y, p·q) ∈ D1, (z, r) ∈ e2(y)}

For the RHS we compute D2 directly:

D2 = ⟦≫e3⟧(D0) = {(z, p·q·r) | (x, p) ∈ D0, (z, q·r) ∈ e3(x)}

where e3 = λx.{(z, q·r) | (y, q) ∈ e1(x), (z, r) ∈ e2(y)}

Since the two resulting distributions are also identical, the transformation is limit preserving. We will not work through the limit-preservation proofs of all the transformations as verbosely, but they can all be confirmed likewise by computing and comparing the story semantics.

By fusing two generate steps into one, generator fusion trivially reduces the vertical complexity of a story by one. To compute the change in horizontal complexity, we first observe that the number of edges at any one generate step is equal to the number of values in the produced distribution (D2 above). For example, in Fig. 1, the sizes of the first three distributions after the initial distribution are 2, 4, and 8, and these are also the numbers of edges in each of the three generate steps. Therefore, the horizontal complexity of the story produced by the LHS of Theorem 1 is (|D1| + |D2|)/2, while the horizontal complexity of the RHS is |D2|. Additionally, we know that |D2| ≥ |D1|, since generators can only add values to a distribution. Therefore, the horizontal complexity of the RHS is greater than (or equal to) that of the LHS by (|D2| − |D1|)/2.

In qualitative terms, the trade-off between vertical and horizontal complexity with generator fusion is a trade-off between many small steps that incrementally build up a complex distribution and fewer, larger steps that do the same. We expect that the explanatory value of each view is not constant throughout the process of reading and understanding an explanation. When an explanation is first presented, it might be helpful to see how a complex distribution is produced from smaller steps. Once this aspect is understood, however, the additional generators become visual noise that distracts from the more subtle and interesting parts of an explanation. Through generator fusion, the user can take advantage of both views, seeing the more verbose explanation initially and fusing generators together as they are understood.

Like generators, adjacent maps and filters can also be fused. For maps, we use the composition of the two mapped functions as the argument to the new map operation.

Theorem 2. Map fusion:

↑f1 ▹ ↑f2 ⇝ ↑(f2 ∘ f1)

We fuse adjacent filters by filtering with the conjunction of their predicates.

Theorem 3. Filter fusion:

?p1 ▹ ?p2 ⇝ ?(λx. p1(x) ∧ p2(x))

That these transformations are limit preserving can be confirmed in the same way as generator fusion, by expanding the story semantics of both sides of the transformations and comparing. As with generator fusion, these transformations trivially reduce the vertical complexity of a story by one and increase (though again, not strictly) the horizontal complexity of the explanation.
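Filter fusion is easy to spot-check. In this sketch (our own illustration, using a renormalizing filter over dictionary distributions), filtering a uniform die roll with two predicates in sequence matches filtering once with their conjunction:

```python
def filter_dist(p, dist):
    """Keep values satisfying p and renormalize the surviving mass."""
    kept = {x: q for x, q in dist.items() if p(x)}
    c = sum(kept.values())  # surviving probability mass
    return {x: q / c for x, q in kept.items()}

die = {n: 1 / 6 for n in range(1, 7)}  # uniform six-sided die
p1 = lambda n: n > 2
p2 = lambda n: n % 2 == 0

seq = filter_dist(p2, filter_dist(p1, die))          # ?p1 then ?p2
fused = filter_dist(lambda n: p1(n) and p2(n), die)  # fused filter
assert seq.keys() == fused.keys() == {4, 6}
assert all(abs(seq[n] - fused[n]) < 1e-9 for n in seq)
```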

The number of edges in a filter or map step is determined not by the size of the produced distribution, as with generators, but rather by the size of the input distribution. Assume a similar D0 and D2 as above, representing the identical limits of the generated distribution graphs before and after map or filter fusion, and a similar D1 representing the intermediate distribution produced by the plot on the LHS only. Then the horizontal complexity of the graph before each transformation is (|D0| + |D1|)/2, while the horizontal complexity after each transformation is |D0|. In this case, we have the inequality |D0| ≥ |D1|, since maps and filters can only remove values from distributions. So the horizontal complexity after each transformation is greater than before by (|D0| − |D1|)/2. These differences present a similar trade-off for explanation consumers as generator fusion.

Fig. 6. The three generators from Fig. 1 fused into one.

Note that given a suitable representation, it is also possible to split previously fused generators, maps, or filters. However, such transformations are not possible in general (that is, on operations that were not previously fused) unless we impose further restrictions on these operations, making their arguments somehow decomposable.

6.4. Operation commutation

In addition to fusing adjacent operations of the same kind, many operations can be commuted with each other. These transformations present less clear explainability trade-offs than operation fusion (or the transformations we will look at in the next subsection), but they are often useful as preparatory steps to set up other transformations. Therefore, we present some commutation transformations here, relatively briefly.

Since filters do not affect the types of the distributions they modify, they can be freely commuted with each other. Note that we use ⇌ to indicate a limit-preserving, undirected plot transformation.

Theorem 4. Filter-filter swap:

?p1 ▹ ?p2 ⇌ ?p2 ▹ ?p1

That this transformation preserves the limits of the corresponding distribution subgraph is obvious: in either case, the resulting distribution contains only the values that pass both predicates. Observe that this theorem can also be viewed as a corollary of Theorem 3 (filter fusion), since applying that transformation to both ?p1 ▹ ?p2 and ?p2 ▹ ?p1 will result in the same fused filter operation.

Like all commutation transformations, filter swapping has no effect on the vertical complexity of the produced graph; no operations are added or removed, they are simply rearranged. The transformation's effect on horizontal complexity depends on the specific predicates p1 and p2. If fewer values in D0 pass p1 than p2, the graph produced by the LHS will have a lower horizontal complexity, while the opposite will be true if p2 filters more values out of D0 than p1. Although the transformation impacts horizontal complexity inconsistently, swapping the order of two filters might still help a user to understand the exact effect that each filter has on the resulting distribution. Once these are understood, the filters might be fused with Theorem 3.

Unlike filters, maps do affect the types of their argument distributions, meaning that maps cannot be as freely commuted as filters. However, we can often push a map ↑f below another operation o by composing f with the function associated with o. In other words, we can perform the mapping described by f implicitly within o, in order to delay the actual map step. These transformations should be used judiciously since they can obscure the original meaning of o, but they are safely limit preserving and may be useful to set up other transformations.

The following theorem provides a limit-preserving transformation that pushes a map operation below a filter operation.

Theorem 5. Map-filter swap:

↑f ▹ ?p ⇝ ?(p ∘ f) ▹ ↑f

Assuming that the input distribution to this subplot is D0 : ⟨A⟩, then f : A → B and p : B → Bool. By composing p and f, we change the type of the predicate passed to the filter to A → Bool, allowing it to be applied directly to D0. The intermediate distribution of the subgraph produced by the LHS will have type ⟨B⟩, while on the RHS it will have type ⟨A⟩. The final distribution graphs will be the same, however, and be of type ⟨B⟩.

To illustrate map-filter swapping, consider the following plot posTails. Recall from Sections 5.3 and 5.4 that the function countTails : Cⁿ → Int returns the number of tails in a sequence of coin values, and the function isPos : Int → Bool returns whether its argument is positive:

posTails = ↑countTails ▹ ?isPos ▹ E

This plot, given an initial distribution of coin sequences, produces an explanation of the number of tails in each sequence that contains at least one tails. For example, given the initial distribution ⟨HH@25%, HT@25%, TH@25%, TT@25%⟩, the map step produces the intermediate distribution ⟨0@25%, 1@50%, 2@25%⟩, and the filter step produces the final distribution ⟨1@66.7%, 2@33.3%⟩. This is exactly the running example from Sections 5.3 and 5.4.

Applying the map-filter swap transformation results in the following alternative plot posTails′:

posTails′ = ?(isPos ∘ countTails) ▹ ↑countTails ▹ E

Given the same initial distribution, the filter step eliminates the HH sequence, producing the intermediate distribution ⟨HT@33.3%, TH@33.3%, TT@33.3%⟩, and the map step produces the same final distribution ⟨1@66.7%, 2@33.3%⟩.
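We can confirm the posTails/posTails′ equivalence mechanically. This sketch (our own illustration, with map and filter modeled over dictionary distributions as in the previous sections) runs both plots on the same initial distribution:

```python
def map_dist(f, dist):
    """Map f over a distribution, merging values that collide under f."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

def filter_dist(p, dist):
    """Keep values satisfying p and renormalize the surviving mass."""
    kept = {x: q for x, q in dist.items() if p(x)}
    c = sum(kept.values())
    return {x: q / c for x, q in kept.items()}

count_tails = lambda s: s.count('T')
is_pos = lambda n: n > 0
d0 = {'HH': 0.25, 'HT': 0.25, 'TH': 0.25, 'TT': 0.25}

# posTails:  map countTails, then filter isPos
pos_tails = filter_dist(is_pos, map_dist(count_tails, d0))
# posTails': filter (isPos . countTails), then map countTails
pos_tails2 = map_dist(count_tails, filter_dist(lambda s: is_pos(count_tails(s)), d0))

# both plots end in {1: 2/3, 2: 1/3}
assert set(pos_tails) == set(pos_tails2) == {1, 2}
assert all(abs(pos_tails[n] - pos_tails2[n]) < 1e-9 for n in (1, 2))
```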

Map-filter swapping decreases the horizontal complexity of the produced graph, since moving the filter up can only decrease the number of edges produced by the map step. However, this is a case where horizontal complexity does not tell the whole story, since the predicate itself is now more complicated.

Using the same strategy as Theorem 5, we can also push maps below group operations. This transformation is captured in the following theorem.

Theorem 6. Map-group swap:

↑f ▹ ⊙g ⇝ ⊙(g ∘ f) ▹ ↑f

As with map-filter swapping, this transformation will produce a graph with a dubious decrease in horizontal complexity.

To illustrate map-group swapping, consider the following variant of our previous example, where we use the isPos function to group, rather than filter, the intermediate distribution:

posTailsGrp = ↑countTails ▹ ⊙isPos ▹ E

Given the initial distribution ⟨HH@25%, HT@25%, TH@25%, TT@25%⟩, the map step produces the intermediate distribution ⟨0@25%, 1@50%, 2@25%⟩, and the group step produces the grouped distribution {(false, {0@25%}), (true, {1@50%, 2@25%})}, which has type ⟨Int⟩_Bool.

Applying the map-group swap transformation to the posTailsGrp plot yields the following equivalent plot:

posTailsGrp′ = ⊙(isPos ∘ countTails) ▹ ↑countTails ▹ E

If we apply this to the same initial distribution, we get the intermediate distribution {(false, {HH@25%}), (true, {HT@25%, TH@25%, TT@25%})}, which will map to the same resulting grouped distribution as we obtained with posTailsGrp.
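The same kind of spot-check works for Theorem 6. In this sketch (our own illustration, adding a simple group-by over dictionary distributions), both posTailsGrp and its swapped variant produce the same grouped distribution:

```python
def map_dist(f, dist):
    """Map f over a (sub)distribution, merging values that collide under f."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

def map_grouped(f, grouped):
    """Apply map_dist within each group."""
    return {t: map_dist(f, sub) for t, sub in grouped.items()}

def group_by(g, dist):
    """Overlay a grouping view: tag each value with g(x)."""
    out = {}
    for x, p in dist.items():
        out.setdefault(g(x), {})[x] = p
    return out

count_tails = lambda s: s.count('T')
is_pos = lambda n: n > 0
d0 = {'HH': 0.25, 'HT': 0.25, 'TH': 0.25, 'TT': 0.25}

# posTailsGrp:  map countTails, then group by isPos
grp = group_by(is_pos, map_dist(count_tails, d0))
# posTailsGrp': group by (isPos . countTails), then map within groups
grp2 = map_grouped(count_tails, group_by(lambda s: is_pos(count_tails(s)), d0))

assert grp == grp2 == {False: {0: 0.25}, True: {1: 0.5, 2: 0.25}}
```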

Not all pairs of operations containing a map can be swapped, however. Recall from Section 5.3 that the map operation merges values only locally within groups. For example, if we apply the operation ↑countTails to the grouped distribution {(false, {HH@25%, TH@25%}), (true, {HT@25%, TT@25%})}, we get the grouped distribution {(false, {0@25%, 1@25%}), (true, {1@25%, 2@25%})}, in which the two blocks that map to 1 are not merged, since they are in different groups. This means that although we can lift a group above a map by simulating the map in the group operation, as in Theorem 6, we cannot lift an arbitrary map above a group, since it could result in blocks being merged in the intermediate distribution. For the same reason, we cannot swap map and ungroup operations.

Finally, since filters on grouped distributions affect groups rather than their contained values (see Section 5.4), we also cannot commute filter and ungroup operations.

6.5. Advanced transformations

In this subsection we move from the simple combining and rearranging of operations to more subtle and complex transformations. Like the fusion operations, the laws presented below pose interesting trade-offs for explainability.

The first transformation we will consider involves a process called filter lifting and is demonstrated by an alternative explanation for the three-coins problem shown in Fig. 7. The basic idea is that if all of the descendants of some block in a distribution Di (anywhere in a distribution graph) are eliminated by a downstream filter, then we can introduce a new filter immediately below Di that filters that block out. In the three-coins example, referring to the original explanation in Fig. 1, we can see that all of the descendants of the TT block in the D2 distribution are eliminated by the filter two steps below. In the transformed explanation in Fig. 7, we introduce a new filter immediately after this value is generated, filtering it and its descendants out two steps early (note that we have also fused the first two generators). We capture this transformation in the following theorem.

Fig. 7. Example of filter lifting. (The steps shown: generate two coin flips; filter "Disregard cases that cannot have two heads"; generate the third coin flip; filter "Consider only cases where two heads have been flipped"; group by whether or not tails has been flipped.)

Theorem 7 (Filter Lifting). Given preceding distribution D, if ¬p(y) for every y descended from x ∈ dom(⟦e⟧(D)), then

e ▹ ⋯ ▹ ?p  ⇛  e ▹ ?(λz. z ≠ x) ▹ ⋯ ▹ ?p

Filter lifting increases the vertical complexity of a distribution graph by one, but offers the potential to significantly reduce the horizontal complexity. That filter lifting always decreases horizontal complexity by some amount is easy to see: the number of edges at any step in a story is a function of the size of one of its adjacent distributions, and every distribution between the generator e and the filter ?p will be strictly smaller after filter lifting. Exactly how much the horizontal complexity decreases depends on how many steps are between the generator and the filter and how many new values are generated from the eliminated value x. For the three-coins example, filter lifting reduces the horizontal complexity of the relevant subgraph of the explanation from (8+8)/2 = 8 to (4+6+6)/3 = 5⅓. This reduction in horizontal complexity reflects the observation, encoded in the transformed explanation, that values that will eventually be eliminated are in some sense irrelevant. By introducing an additional step to eliminate these values immediately after they arise, we can invest a little explanatory effort early to avoid the effort of considering their descendants over and over, at each step.

Note that we can also preemptively eliminate several such irrelevant values from a single distribution with just one filter by alternating applications of Theorem 7 (filter lifting) and Theorem 3 (filter fusion).
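The essence of filter lifting can be sketched on the three-coins example itself. In the model below, `extend` and `pfilter` are hypothetical stand-ins for Probula's generate and filter operations (our own names, not the paper's API); the sketch checks that filtering TT out early, as licensed by Theorem 7, yields the same limit as filtering late:

```python
# Sketch of filter lifting: a distribution is a dict from value to
# probability; exact rationals avoid floating-point noise.
from fractions import Fraction

def extend(dist):
    """Generate a further coin flip from each value (a stand-in generator)."""
    return {v + c: p / 2 for v, p in dist.items() for c in 'HT'}

def pfilter(p, dist):
    """Keep values satisfying predicate p and renormalize."""
    kept = {v: pr for v, pr in dist.items() if p(v)}
    total = sum(kept.values())
    return {v: pr / total for v, pr in kept.items()}

two_heads = lambda v: v.count('H') >= 2
two_coins = {v: Fraction(1, 4) for v in ('HH', 'HT', 'TH', 'TT')}

# Original plot: generate the third coin, then filter.
late = pfilter(two_heads, extend(two_coins))

# Lifted plot: every descendant of TT fails the downstream filter,
# so TT may be filtered out immediately after it is generated.
early = pfilter(two_heads, extend(pfilter(lambda v: v != 'TT', two_coins)))

assert late == early  # both limits are 25% for each of HHH, HHT, HTH, THH
```

The intermediate distributions of the lifted plot are strictly smaller, which is the source of the horizontal-complexity reduction computed above.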

The next two transformations we present can be used to emphasize a particular step in an explanation's story, drawing attention to it and making its effects more explicit. We do this by introducing a group-ungroup pair around the step we wish to emphasize, using a process called group bracketing.

First we consider the bracketing of a filter operation. This is demonstrated in Fig. 8, where we have bracketed the filter step in the three coins explanation from Fig. 1. When bracketing a filter, the predicate of the filter is used as the grouping function, producing a distribution divided into two groups: those that will pass the filter and those that will fail. The filter step is then greatly simplified, eliminating one group and passing the other on as the subsequent distribution, which is then ungrouped.

Fig. 8. Example of group bracketing. [Figure: the eight three-coin outcomes at 12.5% each are grouped by whether there are two heads (two groups at 50% each); the filter keeps only the two-heads group, which is then ungrouped into its components and regrouped by whether or not tails has been flipped.]

Group bracketing of a filter operation is captured formally in the following theorem. Note that we cannot bracket a filter if we are already in the context of a group operation, since groups cannot be nested. Also, recall from Section 5.4 that filters on grouped distributions are defined in terms of the grouping value rather than the underlying elements, so we change the filter predicate from p to λt. t = true (equivalently, the identity function), since each group will be represented by the result of p applied to each of its contained elements.

Theorem 8 (Group Bracketing of a Filter). If not already in the context of a group operation, then

?p  ⇛  ⊙p ▹ ?(λt. t) ▹ ⊚

Group bracketing increases the vertical complexity of a story by two, adding a group and ungroup operation. It also decreases the horizontal complexity. If D and D′ represent their common limits, with |D| ≥ |D′|, the horizontal complexity of the unbracketed filter is |D|; for the bracketed version, there are |D| edges for the group step, 2 edges for the filter step, and |D′| edges for the ungroup step, for a horizontal complexity of (|D| + 2 + |D′|)/3, which is clearly less than |D|.⁵ The qualitative trade-off is that we accentuate and simplify the filter step at the cost of the additional complexity introduced by the new group and ungroup steps. Through bracketing, a user can see all of the values that either pass or fail a filter, together and at a glance, rather than having to scan the outgoing edges of all values in the unbracketed filter step.
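Theorem 8 can be checked on a small model. The sketch below uses an assumed dict-based representation (our own helper names, not the paper's implementation): bracketing the two-heads filter with a group-ungroup pair produces the same result distribution as filtering directly:

```python
# Sketch of group bracketing of a filter (Theorem 8).
from fractions import Fraction

def pfilter(p, dist):
    kept = {v: pr for v, pr in dist.items() if p(v)}
    total = sum(kept.values())
    return {v: pr / total for v, pr in kept.items()}

def group(g, dist):
    grouped = {}
    for v, pr in dist.items():
        grouped.setdefault(g(v), {})[v] = pr
    return grouped

def ungroup(grouped):
    return {v: pr for blocks in grouped.values() for v, pr in blocks.items()}

def renormalize(dist):
    total = sum(dist.values())
    return {v: pr / total for v, pr in dist.items()}

three_coins = {a + b + c: Fraction(1, 8)
               for a in 'HT' for b in 'HT' for c in 'HT'}
two_heads = lambda v: v.count('H') >= 2

# Unbracketed: filter directly.
direct = pfilter(two_heads, three_coins)

# Bracketed: group by the predicate, keep the passing group, ungroup.
passing = group(two_heads, three_coins)[True]
bracketed = renormalize(ungroup({True: passing}))

assert direct == bracketed
```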

Similarly, we can also bracket a map operation in a group-ungroup pair. In this case, we use the mapped function first as a grouping function (assuming the function maps to a type that can be checked for equality), then apply the map, then ungroup.

Theorem 9 (Group Bracketing of a Map). Given f : A → B, if B can be checked for equality, and if not already in the context of a group operation, then

∗f  ⇛  ⊙f ▹ ∗f ▹ ⊚

This operation offers little benefit if f is one-to-one, increasing the vertical complexity by two without changing the horizontal complexity. However, if f maps several values in D to the same value in D′, the benefit is similar to bracketing filters. The horizontal complexity will be decreased, and values in D that map to the same value in D′ will be put in the same group, making it easier to see which values in D will be merged in D′.
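A similar sketch (same assumed representation, hypothetical helper names) checks Theorem 9: grouping by f, mapping within groups, and ungrouping agrees with mapping directly, here with a many-to-one f that merges blocks:

```python
# Sketch of group bracketing of a map (Theorem 9).
from fractions import Fraction

def dmap(f, dist):
    """Map f over a plain distribution, merging equal results."""
    out = {}
    for v, pr in dist.items():
        out[f(v)] = out.get(f(v), 0) + pr
    return out

def group(g, dist):
    grouped = {}
    for v, pr in dist.items():
        grouped.setdefault(g(v), {})[v] = pr
    return grouped

def ungroup(grouped):
    return {v: pr for blocks in grouped.values() for v, pr in blocks.items()}

dist = {v: Fraction(1, 4) for v in ('HH', 'HT', 'TH', 'TT')}
heads = lambda v: v.count('H')   # many-to-one: HT and TH both map to 1

# Unbracketed: map directly (HT and TH merge into one block for 1).
direct = dmap(heads, dist)

# Bracketed: group by f, map within each group, then ungroup. Since all
# values in a group map to the same result, ungrouping cannot clash.
within = {g: dmap(heads, blocks) for g, blocks in group(heads, dist).items()}
bracketed = ungroup(within)

assert direct == bracketed  # {2: 1/4, 1: 1/2, 0: 1/4}
```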

6.6. Transforming branched plots

Throughout this section we have considered only the transformation of sequential plots. Since transformations are local, these trivially apply also to branched plots as long as the transformed subplot is entirely before the branch point or contained within a particular branch. This view is overly strict, however, ruling out many seemingly valid transformations that span branch points. For example, the subplot ∗f₁ ▹ (∗f₂ ∧ ∗f₃) is clearly equivalent (in a limit-preserving sense) to the subplot ∗(f₂ ∘ f₁) ∧ ∗(f₃ ∘ f₁). In the second we have fused ∗f₁ with ∗f₂ in the left branch and ∗f₁ with ∗f₃ in the right branch. We cannot do this by applying Theorem 2 directly, however, because neither pair of map operations is in direct sequence: the branch point gets in the way.

Rather than extend each transformation law to account for branching, we instead introduce a new law that can be composed with existing laws like map fusion to achieve the desired effect. We call the new transformation branch unzipping, since it visually resembles the unzipping of a zipper, and in Fig. 9 we demonstrate its use as a preparatory step for fusing the maps in the above example. In this figure we show an abstracted view of three stories, with the relevant steps annotated by their operations, and the distributions in the relevant subgraphs colored a lighter gray (the darker distributions are provided to give a sense of context). The leftmost story corresponds to the story generated by our initial subplot, ∗f₁ ▹ (∗f₂ ∧ ∗f₃). In the second story we unzip one level of the story, pushing ∗f₁ out of the trunk and duplicating it in each branch. Finally, in the third story, we have applied map fusion twice, once to each branch, corresponding to our final subplot ∗(f₂ ∘ f₁) ∧ ∗(f₃ ∘ f₁).

The dual of branch unzipping is branch zipping, which allows us to lift a duplicated operation at the top of each branch into the trunk of an explanation. We capture both of these transformations in the following theorem.

Theorem 10 (Branch Unzipping/Zipping).

o ▹ (Pₗ ∧ Pᵣ)  ⇔  (o ▹ Pₗ) ∧ (o ▹ Pᵣ)

Recall from Section 4.4 that the limits of a distribution subgraph G containing branches are a pair of the initial distribution in G and a sequence of the final distribution in each branch. That branch (un)zipping preserves the limits of the generated subgraph follows directly from an application of the story semantics of plots (also Section 4.4) to the LHS and RHS of the law above.

To analyze the complexity impact of these transformations, we must extend our definitions of vertical and horizontal complexity to branched stories. Branching represents complexity mostly on the horizontal axis. Therefore, we consider the vertical complexity of a branched story to be simply the average of the number of steps along each path in the story, while for horizontal complexity we sum the edges across all steps, in all branches, and divide by the (averaged) vertical complexity.
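That branch (un)zipping preserves limits can also be checked on plain distributions. In this sketch (assumed representation; `dmap` is a hypothetical map over distributions, and a pair of distributions stands in for the branch limits), the zipped and unzipped-then-fused plots agree:

```python
# Sketch of branch unzipping/zipping (Theorem 10) on plain distributions.
from fractions import Fraction

def dmap(f, dist):
    out = {}
    for v, pr in dist.items():
        out[f(v)] = out.get(f(v), 0) + pr
    return out

f1 = lambda n: n + 1
f2 = lambda n: n * 2
f3 = lambda n: n * 3

d = {1: Fraction(1, 2), 2: Fraction(1, 2)}

# Zipped: apply *f1 once in the trunk, then branch into *f2 and *f3.
trunk = dmap(f1, d)
zipped = (dmap(f2, trunk), dmap(f3, trunk))

# Unzipped and fused: *(f2 . f1) in the left branch, *(f3 . f1) in the right.
unzipped = (dmap(lambda n: f2(f1(n)), d), dmap(lambda n: f3(f1(n)), d))

assert zipped == unzipped  # identical branch limits
```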

Using these definitions, we see that neither transformation has any effect on the vertical complexity of a story. Meanwhile, zipping decreases horizontal complexity proportional to the number of edges generated by o, while unzipping increases horizontal complexity by the same amount. As the example above demonstrates, however, the complexity introduced by unzipping may be temporary when used as a preparatory step for other transformations.

⁵ Actually there are two degenerate cases where this is not true. If |D| = |D′| = 2 or if |D| = |D′| = 1, then group bracketing preserves or increases the horizontal complexity, respectively.

Finally, note that although we applied map fusion to both branches after unzipping the above example, there is no requirement that we do so. For example, we can unzip the subplot ∗f₁ ▹ (∗f₂ ∧ o₃) (where o₃ is presumably not a map) to (∗f₁ ▹ ∗f₂) ∧ (∗f₁ ▹ o₃), then apply map fusion only to the left branch to produce ∗(f₂ ∘ f₁) ∧ (∗f₁ ▹ o₃). In this case we have decreased vertical complexity by only half a step, instead of the full-step decrease in the original example, while also increasing horizontal complexity.

6.7. Changing the representative example selection

The final law we consider follows from the discussion in Section 5.6. Recall that the operation !x indicates that x is a representative example of the preceding distribution D, and that we can therefore replace D with the distribution ⟨x₁₀₀⟩. A representative value x is one that is both isomorphic to, and equally probable as, every other value y in D. This operation is used as the first step of the Monty Hall Problem shown in Fig. 2, in which we choose the case where the prize is hidden behind the left door as a representative example.

Since x is isomorphic to all such values y, we can freely replace !x with !y without changing the meaning of the explanation. For example, in the Monty Hall Problem we can equally well choose the case where the prize is behind the center door or the right door. Fig. 10 shows an alternative explanation of the Monty Hall Problem in which we have chosen a different representative example. This transformation is captured generally in the following law.

Fig. 9. Example of unzipping a branching story to enable operation fusion.

Fig. 10. Monty Hall problem explained with a different representative example. [Figure: the third door is selected as the representative; the candidate selects a closed door, the host opens a non-chosen no-win door, the outcomes are grouped by currently winning or losing, and mapping "switch doors" yields Win 66.7% and Lose 33.3%.]

Theorem 11 (Select Representative Example). Given x ∈ dom(D′) and y ∈ dom(D′), then

!x  ⇛  !y

Since all values must be isomorphic as a precondition to the use of the representative example operation, the limits are trivially preserved by this transformation up to a similar isomorphism in the result distribution.

While changing the representative example has no effect on the complexity of the story, it is an important transformation for supporting the understanding of explanations that use this rather subtle operation. It enables explanation readers to confirm for themselves that the particular representative example chosen does not determine the solution, and to see how different examples produce slightly different distributions but an overall similar structure.
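That the choice of representative example does not determine the solution can be verified by enumeration. The sketch below (our own model, not Probula code) computes the always-switch outcome distribution of the Monty Hall problem for each possible choice of prize door:

```python
# Enumerating the Monty Hall problem for each candidate representative
# example: the result distribution is the same for every prize door.
from fractions import Fraction

def switch_outcomes(prize):
    """Win/Lose distribution for the always-switch strategy, fixing the
    prize behind door `prize`; the candidate picks uniformly."""
    doors = {0, 1, 2}
    dist = {}
    for pick in doors:
        # The host opens a non-chosen, non-winning door; switching moves
        # the pick to the remaining closed door.
        host = min(doors - {pick, prize})
        final = (doors - {pick, host}).pop()
        outcome = 'Win' if final == prize else 'Lose'
        dist[outcome] = dist.get(outcome, Fraction(0)) + Fraction(1, 3)
    return dist

# Any of the three doors serves equally well as the representative example.
assert switch_outcomes(0) == switch_outcomes(1) == switch_outcomes(2)
assert switch_outcomes(0)['Win'] == Fraction(2, 3)
```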

Effective explanations must necessarily be given from a particular perspective, using particular examples, and telling a particular story. When constructing a static explanation, the explanation creator must fix these aspects once and for all. A personal explainer can be more flexible, changing their explanation during creation, as needed by a particular explanation consumer. The transformation laws provided in this section help Probula span this gap, enabling an initially fixed static explanation to be automatically adapted to tell new stories about the same problem, with different examples and from different perspectives.

7. Related work

Much of the most pertinent related work has been discussed already in Section 2, on the explanation-oriented programming paradigm, and Section 3, on the story-telling model of explanation. In this section we recall this discussion, filling in the gaps, and discuss several other areas of related work as well.

The story-telling model of explanation follows in the tradition of more general ideas from the philosophy of science that emphasize the importance of causality in explanations [7,35,19]. The philosophical study of causation is in turn a large area of research, which we have discussed briefly in Section 3, and at greater length in our previous work [12,14]. Of particular interest are the various visual notations that have been used in this research for representing graphs of causally related events [16,27,37]. These causal graph notations are usually used as informal diagramming tools, to explain causal situations and support discussion, but are rarely defined as formal visual languages. Probula differs from these languages in two important ways. First, it is mainly linear, a design decision that we discussed in Section 3. Second, it is typed and much more structured. Nodes in causal graphs sometimes represent events and sometimes states, and this ambiguity often leads to misunderstandings and obscures subtle distinctions in relationships between nodes [18]. In contrast, points in a Probula story are always distributions of typed values, and relationships are described in terms of a small set of well-defined operations with explicit type constraints. This makes the meaning of each point and each step in the story much clearer than in less structured representations.

Throughout the paper we have given different explanations in Probula of the Monty Hall problem and the three coins problem. The fallacies induced by these two problems are part of a large class of fallacies that arise from a misunderstanding of statistical independence and conditional probability. The gambler's fallacy and the hot hand fallacy are two contradictory fallacies related to independence [2]. Given a sequence of heads results when flipping a fair coin, the gambler's fallacy expects a higher probability of the next flip being tails, while the hot hand fallacy expects the opposite. The inverse fallacy is the false assumption that P(A|B) is approximately equal to P(B|A) [34]. Relatedly, the base-rate fallacy is the failure to consider the underlying probability of an event when given some conditional evidence [3]. For example, consider a population in which 1 in 10 people have a virus, and a test for that virus that is accurate 90% of the time. If a random person tests positive for the virus, the fallacy is to conclude that they are 90% likely to have the virus. This fails to take into account the lower probability of having the virus in the first place, and thus that there will be more false positives than false negatives. In fact, the probability of the person having the virus, given that they tested positive, is 50%. These fallacies are widespread not only among laypeople, but also in professions where an expert understanding of conditional probability is critical, such as among clinicians [8]. Probula explanations can help to debunk these fallacies by providing explanations that show how the correct (though perhaps counterintuitive) answer is derived.
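The 50% figure in the base-rate example follows directly from Bayes' theorem, as the following short check shows:

```python
# Worked check of the base-rate example: P(virus | positive test).
from fractions import Fraction

base_rate = Fraction(1, 10)   # 1 in 10 people have the virus
accuracy = Fraction(9, 10)    # the test is right 90% of the time

true_pos = base_rate * accuracy               # infected and tests positive
false_pos = (1 - base_rate) * (1 - accuracy)  # healthy but tests positive

p_virus_given_pos = true_pos / (true_pos + false_pos)
assert p_virus_given_pos == Fraction(1, 2)    # 50%, not 90%
```

The true-positive and false-positive masses happen to be equal (9% of the population each), which is why the posterior lands at exactly one half.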

In Section 2 we discussed our previous work on computing with probability distributions [10] and supporting and explaining probabilistic reasoning [12,13]. Although there are other languages that support computation with probabilistic values, such as IBAL [29] and the OCaml DSEL designed by Kiselyov and Shan [23], there does not seem to be any other work on explaining these computations. Domain-specific explanation support is not completely new, however, and can be found in the field of algorithm animation [22], where many ideas and approaches have been developed to illustrate the working of algorithms through custom-made or (semi-)automatically generated animations [21]. The work of Blumenkrants et al. is particularly relevant, where the use of the story-telling metaphor was demonstrated to increase the explanatory power of their algorithm animations [4].

Code debuggers represent a class of widely used explanation systems. Very generally, the goal while debugging is to obtain an explanation of some program behavior, usually as part of an effort to fix a bug. While debuggers help users find this information in the code (often after much time and effort), most operate at such a low level that the output and effects they produce could scarcely be considered an explanation. The WHYLINE system [25] inverts the debugging process, allowing users to ask questions about program behavior and responding by pointing to parts of the code responsible for the outcomes. Although this system improves the process significantly, it can still only point to places in the program, limiting its explanatory power. We have extended the basic idea of the WHYLINE to the domain of spreadsheets by allowing users to express expectations about the outcomes of cells, then generating change suggestions that would produce the desired results [1]. From a philosophical perspective, the generated change suggestions represent counterfactual statements about the state of the spreadsheet [27]. Counterfactual reasoning as a basis for explanation is closely related to the story-telling metaphor [12].

Probula, and the story-telling model in general, are also related to the idea of dataflow languages, in which data is incrementally modified by passing through a directed graph of operations [20]. Our language could be viewed as a dataflow language with the single, parametrically polymorphic data type of probability distributions (plain or grouped), with the operations described in Section 5 and an additional operation to support branching.

Finally, the idea of elevating explainability as a design criterion for languages was first proposed in [11], where we presented a visual language for expressing strategies in game theory. The guiding principle of the design of this language was the concept of traceability. This is the idea that language designers should consider not only how to represent programs, but also how to represent the execution of programs (program traces), and how to relate programs to their traces in order to support program understanding. This idea developed into the notion of explanation-oriented programming as described in this paper, and Probula reflects these original design ideas well. Probula's visual notation is an integrated representation of the execution of a Probula program, as the distributions that it generates, combined with a representation of the operations that produce that execution.

8. Conclusions and future work

We have presented Probula, a domain-specific, visual language for explaining probabilistic reasoning problems. This language continues our research in the new paradigm of explanation-oriented programming, where the focus of programming is not on the computation of results, but on explaining how and why those results were computed. We believe that this shift of focus reveals a new, fruitful, and under-explored area for language design. Domain-specific and visual languages are especially well-suited for explanation orientation for several reasons.

Domain-specific languages allow us to express explanations in terms and structures appropriate to the domain; for example, in Probula, we use probability distributions as a basic construct in explanations of probabilistic reasoning problems. This makes explanations in the language more understandable by making them more concrete and by taking advantage of users' existing domain knowledge. It also makes the explanations more useful and externally applicable, since they can be used not only to understand the specific problems presented, but as tools to better understand the domain itself. This is one of the primary design goals of Probula.

Visual languages are a powerful medium for explanation since we have many more potential dimensions for encoding information than are available in textual languages (position, color, length, shape, etc.). This not only gives us more flexibility and expressiveness in the design of explanations, but also makes it easier to directly reuse existing notations from the domain, when applicable. Furthermore, it gives us access to a rich language of visual metaphors and symbols. We have presented a few such examples in Probula, such as the use of spatial partitioning to encode probability, and connection to represent the flow of data.

A significant contribution of this work is the formulation of laws which can be used to transform an explanation into many equivalent explanations of the same problem. In this way we can combine the accessibility benefits typical of static explanations with the adaptability benefits of personal explanations. Through the composition of explanation transformations, we can derive from a single explanation a large, explorable space of related explanations.

In order to fully realize the adaptability benefits of personal explanations, however, we must also consider how users interact with this explanation space. This is an important topic for future work. In the most straightforward view, users might directly manipulate an explanation within some tool in order to generate related alternatives: manually fusing generators, for example, or changing the representative example selection. In the context of personal explanations, direct manipulation corresponds to responding to a user's question. But personal explainers also adapt automatically to a user's needs, when they sense that the listener is confused, for example. Therefore, we might also consider developing heuristics to determine when a particular transformation will be most useful for a particular user, which the tool can suggest or apply automatically.

References

[1] R. Abraham, M. Erwig, GoalDebug: a spreadsheet debugger for end users, in: Twenty-ninth IEEE International Conference on Software Engineering, 2007, pp. 251–260.

[2] P. Ayton, I. Fischer, The hot hand fallacy and the gambler's fallacy: two faces of subjective randomness, Memory & Cognition 32 (2004) 1369–1378.

[3] M. Bar-Hillel, The base-rate fallacy in probability judgments, Acta Psychologica 44 (3) (1980) 211–233.

[4] M. Blumenkrants, H. Starovisky, A. Shamir, Narrative algorithm visualization, in: ACM Symposium on Software Visualization, 2006, pp. 17–26.

[5] P. Cobley, Narratology, in: M. Groden, M. Kreiswirth, I. Szeman (Eds.), The Johns Hopkins Guide to Literary Theory and Criticism, 2nd ed., Johns Hopkins University Press, London, 2005.

[6] P. Dowe, Physical Causation, Cambridge University Press, Cambridge, UK, 2000.


[7] M. Dummett, Bringing about the past, Philosophical Review 73 (1964) 338–359.

[8] D.M. Eddy, Probabilistic reasoning in clinical medicine: problems and opportunities, in: Judgment Under Uncertainty: Heuristics and Biases, 1982, pp. 249–267.

[9] M. Erwig, Abstract syntax and semantics of visual languages, Journal of Visual Languages and Computing 9 (5) (1998) 461–483.

[10] M. Erwig, S. Kollmansberger, Probabilistic functional programming in Haskell, Journal of Functional Programming 16 (1) (2006) 21–34.

[11] M. Erwig, E. Walkingshaw, A visual language for representing and explaining strategies in game theory, in: IEEE International Symposium on Visual Languages and Human-Centric Computing, 2008, pp. 101–108.

[12] M. Erwig, E. Walkingshaw, A DSL for explaining probabilistic reasoning, in: IFIP Working Conference on Domain-Specific Languages, Lecture Notes in Computer Science, vol. 5658, 2009, pp. 335–359.

[13] M. Erwig, E. Walkingshaw, Visual explanations of probabilistic reasoning, in: IEEE International Symposium on Visual Languages and Human-Centric Computing, 2009, pp. 23–27.

[14] M. Erwig, E. Walkingshaw, Causal reasoning with neuron diagrams, in: IEEE International Symposium on Visual Languages and Human-Centric Computing, 2010, pp. 101–108.

[15] T.R.G. Green, M. Petre, Usability analysis of visual programming environments: a cognitive dimensions framework, Journal of Visual Languages and Computing 7 (2) (1996) 131–174.

[16] J. Halpern, J. Pearl, Causes and explanations: a structural-model approach. Part I: causes, British Journal for the Philosophy of Science 56 (4) (2005) 843–887.

[17] E.C.R. Hehner, Probabilistic predicative programming, in: D. Kozen (Ed.), Mathematics of Program Construction, Lecture Notes in Computer Science, vol. 3125, Springer, Berlin/Heidelberg, 2004, pp. 169–185.

[18] C. Hitchcock, What's wrong with neuron diagrams, in: J.K. Campbell, M. O'Rourke, H. Silverstein (Eds.), Causation and Explanation, MIT Press, Cambridge, MA, 2007, pp. 69–92.

[19] P. Humphreys, The Chances of Explanation, Princeton University Press, Princeton, NJ, 1989.

[20] W.M. Johnston, J.R.P. Hanna, R.J. Millar, Advances in dataflow programming languages, ACM Computing Surveys 36 (1) (2004) 1–34.

[21] V. Karavirta, A. Korhonen, L. Malmi, Taxonomy of algorithm animation languages, in: ACM Symposium on Software Visualization, 2006, pp. 77–85.

[22] A. Kerren, J.T. Stasko, Algorithm animation – introduction, in: S. Diehl (Ed.), Revised Lectures on Software Visualization, Lecture Notes in Computer Science, vol. 2269, 2001, pp. 1–15.

[23] O. Kiselyov, C. Shan, Embedded probabilistic programming, in: IFIP Working Conference on Domain-Specific Languages, Lecture Notes in Computer Science, vol. 5658, 2009, pp. 360–384.

[24] P. Kitcher, Explanatory unification and the causal structure of the world, in: P. Kitcher, W. Salmon (Eds.), Scientific Explanation, University of Minnesota Press, Minneapolis, MN, 1989, pp. 410–505.

[25] A.J. Ko, B.A. Myers, Debugging reinvented: asking and answering why and why not questions about program behavior, in: IEEE International Conference on Software Engineering, 2008, pp. 301–310.

[26] C. Morgan, A. McIver, K. Seidel, Probabilistic predicate transformers, ACM Transactions on Programming Languages and Systems 18 (3) (1996) 325–353.

[27] J. Pearl, Causality: Models, Reasoning and Inference, 2nd ed., Cambridge University Press, Cambridge, UK, 2009.

[28] N. Pennington, R. Hastie, Reasoning in explanation-based decision making, Cognition 49 (1993) 123–163.

[29] N. Ramsey, A. Pfeffer, Stochastic lambda calculus and monads of probability distributions, in: Twenty-ninth Symposium on Principles of Programming Languages, 2002, pp. 154–165.

[30] D. Ruben, Explaining Explanation, Routledge, London, UK, 1990.

[31] W. Salmon, Scientific Explanation and the Causal Structure of the World, Princeton University Press, Princeton, NJ, 1984.

[32] W. Salmon, Causality without counterfactuals, Philosophy of Science 61 (1994) 297–312.

[33] C. Scaffidi, M. Shaw, B. Myers, Estimating the numbers of end users and end user programmers, in: IEEE Symposium on Visual Languages and Human-Centric Computing, 2005, pp. 207–214.

[34] G. Villejoubert, D. Mandel, The inverse fallacy: an account of deviations from Bayes' theorem and the additivity principle, Memory & Cognition 30 (2002) 171–178.

[35] G. von Wright, Explanation and Understanding, Cornell University Press, Ithaca, NY, 1971.

[36] E. Walkingshaw, M. Erwig, A DSEL for studying and explaining causation, in: Proceedings of the IFIP Working Conference on Domain-Specific Languages, EPTCS 66 (2011) 143–167. http://dx.doi.org/10.4204/EPTCS.66.7.

[37] J. Woodward, Making Things Happen, Oxford University Press, New York, NY, 2003.
