Basic principles of graphing data.PDF

Post on 13-Nov-2015

<ul><li><p>Sci. Agric. (Piracicaba, Braz.), v.67, n.4, p.483-494, July/August 2010</p>
<p>Point of View</p>
<p>Basic principles of graphing data</p>
<p>Marcin Kozak</p>
<p>Warsaw University of Life Sciences, Dept. of Experimental Design and Bioinformatics, Nowoursynowska 159, 02-776 Warsaw, Poland. E-mail </p>
<p>ABSTRACT: Data visualization is a very important aspect of data analysis and of presentation. Focusing on the latter, this paper discusses various elements of constructing graphs for publications. Bad and good graphs are compared, and a checklist of graphical elements to consider while creating graphs is proposed.</p>
<p>Key words: visualization, graphs, statistics</p>
<p>Princípios básicos na confecção de gráficos</p>
<p>RESUMO: A visualização de dados é um aspecto importante da análise de dados e de sua apresentação. Nesse ponto de vista enfocam-se vários elementos da construção de gráficos para publicações. Gráficos ruins e bons são comparados, e uma lista de checagem de elementos de composição gráfica é proposta para ser utilizada durante a confecção de gráficos.</p>
<p>Palavras-chave: visualização, gráficos, estatística</p>
<p>Introduction</p>
<p>Agricultural scientists produce abundant results. Some of them are important, others less important, and still others negligible. The way one presents the results therefore matters, for several reasons. First, one needs to emphasize the important ones. Second, for most data, interpretation is easiest with graphs, and it is far easier with a good graph than with a bad one. Just as statistical analysis can lead to false conclusions (Huff, 1954; Kozak, 2009a), poor graphs can misinform, sometimes leading to serious misinterpretations.</p>
<p>Huff (1954) was probably the first to point out that graphs of statistical data can be a tool of misinformation. Some of his notions later met counter-arguments, yet the main ideas still hold: a graph must be correct and must convey true information and support a true interpretation. 
Scientific graphs are not meant to be beautiful; they are meant to be informative. Of course, ugly graphs will not convey any interesting message, so elegance should not be disregarded (Tufte, 1991; 1997; 2001; 2006).</p>
<p>Tufte (1991; 1997; 2001; 2006), Cleveland (1993; 1994), Jacoby (1997; 1998) and Wilkinson (2005), among others, discuss graphing data in great detail. Harris (1999) is a useful reference for information graphics. Yet the scientific literature of the 21st century is full of poor graphs, some of which are incomprehensible while others even unintentionally falsify information. Data visualization is a developing research area, and what was considered a good graph 30 years ago is not necessarily good today. This paper offers basic information about graphing data, with the help of which authors should be able to construct sufficiently good scientific graphs. The focus is mainly on graphical rather than statistical aspects, so remember that when constructing a graph you must take every care to ensure it delivers the message you intend, from both the scientific and the statistical points of view. Note also that visualizing data for the purpose of analysis and interpretation at the computer screen does not follow exactly the same rules as those described here: this paper deals mainly with graphing for others, that is, for publications.</p>
<p>You will see good and bad graphs, what can go wrong when visualizing data, and why it is wrong. At the same time you will see how each problem can be overcome and what to do to ensure the graph is sound. The next section, General principles of graphing data, presents a general view of graphing scientific data. Then the section Specific principles of graphing data presents more detailed rules to follow to ensure that a graph is efficient. 
Do not forget, however, that these are really basic principles, as the title of the paper says; readers who want to learn more about this topic should refer to the sources cited in this paper. The last section offers a short conclusion, while the Appendix lists the data sets used in this paper and gives information on the software employed to analyze and graph these data. Note that the figure captions give more information than is normally needed, for example the type of graph. Each figure caption states whether the figure (or its panels) represents a good or a rather poor style.</p>
<p>General principles of graphing data</p>
<p>The general and most important principle of graphing data is to construct a graph so that it conveys the intended message, and conveys it in the most efficient way. Thus, all elements of the graph should help rather than distract, and important aspects should be emphasized rather than hidden. This is the most general rule, and all the following rules derive from it. Remember that all details count: viewers of a graph will follow your ideas if you help them. So if you do not feel you are an expert in visualization, think twice, or rather thrice, before deciding not to follow any of the rules.</p>
</li><li>
<p>Table 1. Elements to check while constructing a graph. Refer to the text for details.</p>
<table>
<tr><td>Data points</td><td>symbol, size, color, overlap</td></tr>
<tr><td>Lines</td><td>type, width, color, overlap</td></tr>
<tr><td>Color</td><td>check whether needed at all; check whether all elements are easily distinguished</td></tr>
<tr><td>Box and axes</td><td>type of box, aspect, minimum and maximum value, data rectangle and scale-line rectangle, tick marks (number of, location, direction, length and width), tick marks' labels (font type, length, rotation, numbering style, abbreviations of text labels)</td></tr>
<tr><td>Legend</td><td>check whether needed, position, box around, size, elements within (see above for "data points" and "lines", and "labels" within "box and axes")</td></tr>
<tr><td>Background</td><td>of the graph; of bars, cross-hatching, shading, color</td></tr>
<tr><td>Text inside the graph</td><td>check whether needed; font style, size and color</td></tr>
<tr><td>Captions</td><td>check whether informative, explaining everything that is needed to understand the graph and the message it delivers</td></tr>
<tr><td>Dimensions</td><td>check for over-dimensionality</td></tr>
<tr><td>Reference line</td><td>check whether needed, line type, width and color</td></tr>
<tr><td>Grid lines</td><td>check whether needed, line type, width and color</td></tr>
<tr><td>Error bars</td><td>check if the information is given on what they represent; line type, width and color, type of ending</td></tr>
<tr><td>Trellis display</td><td>all the above elements; layout (rows, columns, pages), choice of panel variables and conditioning variable(s), panel order</td></tr>
</table>
<p>Before making a graph, think over what message it is to convey and whether it is needed at all. In general, graphing three numbers is not advisable; surprising as it may seem, one can find quite a few such graphs, especially pie charts and barplots. In fact, even two numbers are sometimes graphed. 
Two numbers are most efficiently presented within a sentence, and a few numbers within a text-table (a simple table inserted within the text; Tufte, 2001; Kozak, 2009b) or a regular table. Tufte (2001) argues that even a dozen or so numbers are better represented by a table than by a graph, but this is arguable: for about 10 numbers, or even fewer, that are to be compared, a graph can be preferable, although this depends on various things, including what the numbers represent.</p>
<p>Make graphs as simple or as complex as needed to deliver the message: do not be afraid of a complex graph, but do not make it more complex than it should be. Remember that the human eye can work efficiently with very composite images, detecting a large amount of information in small spaces (Tufte, 2001). However, always be careful to make the graph readable and understandable: avoid clutter (see, e.g., Reynolds et al., 2009; Silva, 2009c).</p>
<p>An efficient way to proceed while constructing a graph is to:</p>
<p>1) figure out the context and the message</p>
<p>2) figure out the way of presenting it, that is, the type, layout and style of graph to be used</p>
<p>3) construct the graph</p>
<p>4) check Tables 1 and 2 and revise the graph accordingly</p>
<p>5) check whether the revised version conveys the message you want it to convey</p>
<p>6) check Tables 1 and 2 and revise the graph accordingly</p>
<p>7) and so on</p>
<p>Table 1, mentioned before, lists basic elements to check while constructing a graph; they are discussed in the following section. Not all of them apply to every type of graph, but it is best to check everything listed there and decide what does and what does not apply to your graph. 
Table 2 cites Cleveland's (1994) very useful principles of graphing data.</p>
<p>Note how similar the process of graph construction is to the process of writing: figure out the message and style, construct, revise, revise, revise, revise... Revising cannot be ignored, and exactly as in writing, it can be a source of inspiration, leading to results and conclusions one could not even imagine before. The above algorithm of graph construction, together with Tables 1 and 2, will be especially useful for inexperienced graph constructors, because experienced ones follow it intuitively.</p>
<p>Table 2. Clear vision principles given by Cleveland (1994).</p>
<p>1. Make the data stand out. Avoid superfluity.</p>
<p>2. Use visually prominent graphical elements to show the data.</p>
<p>3. Use a pair of scale lines for each variable. Make the data rectangle slightly smaller than the scale-line rectangle. Tick marks should point outward.</p>
<p>4. Do not clutter the interior of the scale-line rectangle.</p>
<p>5. Do not overdo the number of tick marks.</p>
<p>6. Use a reference line when there is an important value that must be seen across the entire graph, but do not let the line interfere with the data.</p>
<p>7. Do not allow data labels in the interior of the scale-line rectangle to interfere with the quantitative data or to clutter the graph.</p>
<p>8. Avoid putting notes and keys inside the scale-line rectangle. Put a key outside, and put notes in the caption or in the text.</p>
<p>9. Overlapping plotting symbols must be visually distinguishable.</p>
<p>10. Superposed data sets must be readily visually assembled.</p>
<p>11. Visual clarity must be preserved under reduction and reproduction.</p>
<p>12. Put major conclusions into graphical form. Make captions comprehensive and informative.</p>
</li><li>
<p>Specific principles of graphing data</p>
<p>Figure 1. A scatterplot with a regression line superimposed, representing tree volume against girth, both after a logarithmic transformation (data set trees). The graph is full of drawbacks: filled squares as plotting symbols are difficult to distinguish; the box consists of two scale lines only; the data rectangle is too large, so the points are clustered within a small area; the tick mark labels on the y-axis are vertical instead of horizontal even though they are just one-digit numbers; the tick marks point inward. Note that the regression line extrapolates beyond the ranges observed in the experiment, which can be dangerous.</p>
<p>Figure 2. A scatterplot with a regression line superimposed, based on the same data as in Figure 1, but with all the drawbacks mentioned there overcome. It is considered a GOOD GRAPH.</p>
<p>Data points are very important elements of many graphs, especially scatterplots and line plots with lines superimposed on data points. The symbols representing them, together with their size and color, should be chosen carefully so that the points can be easily seen and the patterns present in the data (or their absence) can be easily noticed. Figure 1, which plots the logarithm of the volume of cherry trees against the logarithm of girth (data set trees), is poor. 
The choice of filled squares as symbols is bad because it makes the points difficult to distinguish. (By the way, a filled rhombus, which is the default symbol for a scatterplot in Microsoft Excel, is an equally poor choice.) Compare it with Figure 2, whose readability is greatly enhanced thanks to the open circles used as symbols; in fact, open circles should be considered the best symbol when there is some overlap in scatterplots of non-grouped data. (Note that for dotplots, in which no overlap occurs, closed circles are usually used as plotting symbols; see Figures 3 and 4.) Sometimes, however, overlap can be a serious problem; see Figure 5.</p>
<p>When data are grouped, the choice of plotting symbols matters too: c, z, }, should work well for little overlap, and c, +, </p>
</li><li>
<p>Figure 3. A vertical dotplot representing mean area under the disease progress curve (AUDPC) of 16 clones in one environment (Hastings, FL; data set haynes). The clones are ordered by decreasing yield. A drawback of this graph is the legend: because the clones are presented on the x-axis, their names are too long to be shown without rotation or abbreviation. Abbreviated names are used here, which calls for a legend to explain the abbreviations.</p>
<p>Figure 4. A horizontal dotplot representing the same data as in Figure 3. The drawback of Figure 3 is overcome here: the horizontal dotplot allows sufficiently long labels, so no legend is needed. Note that this figure is smaller than the previous one, yet it is more readable. It is considered a GOOD GRAPH.</p>
<p>(Carr et al., 1987), a combination of these two (Dupont and Plummer, 2003), or graphing regions of bivariate kernel density (Hyndman, 1996). 
Yet another idea is to use the trellis display (Cleveland, 1993) for grouped data, in which case the groups are plotted in different panels, alleviating the overlap of points from different groups; this type of display will be discussed later.</p>
<p>Figure 5. Scatterplots representing replicated data on the number of insects caught after applying various insecticides (data set InsectsSprays). A drawback of the left panel is the overlap of points. This is overcome in the right panel, where jittering is employed (observe the GOOD GRAPH IN THE RIGHT PANEL).</p>
<p>Similar issues arise for lines: they should be easily distinguished, for which varying line type (e.g., solid, dashed, dotted), width and color can be used. Lines can be plotted together with data points, in which case efficient combinations of the two should be chosen. Figure 6 presents three options: in the top panel, each of the five lines representing the growth of a chicken is plotted with the same line style and plotting symbol. Note that up to about 10 days the lines are difficult to distinguish, and it is practically impossible to notice what can be seen in the middle and bottom panels: that the chicken that was lightest at the end of the experiment was one of the heaviest during the first days. Color in the bottom panel seems to work better than varying line type and plotting symbol in the middle panel, but both succeed in conveying the most important information. Note that not all the information can be extracted from these plots. In this particular situation this does not seem to be a problem, but a simple solution might be to plot the logarithm of body weight on the y-axis; see Figure 7, where the lines are slightly easier to distinguish for the first days of the experiment (although in many situations the gain in readability will be much larger).</p>
</li><li>
<p>Color can be powerful, but do not overuse it. Save color differentiation for situations in which it helps, and do not use it when it adds nothing to the graph. For example, would it make sense to use different colors f...</p></li></ul>
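Two of the remedies discussed above, jittering overlapping points (Figure 5) and plotting a logarithm on the y-axis (Figure 7), can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the numbers are made up rather than taken from the paper's data sets, the helper names `jitter` and `gap_share` are hypothetical, and the choice to jitter the group axis instead of the measured values is an assumption, since the text does not say which axis was jittered.

```python
import math
import random


def jitter(values, amount=0.2, seed=0):
    """Return a copy of `values` with uniform noise from [-amount, amount] added."""
    rng = random.Random(seed)  # fixed seed so the same picture is drawn each time
    return [v + rng.uniform(-amount, amount) for v in values]


def gap_share(a, b, lowest, highest):
    """Fraction of the axis range taken up by the gap between a and b."""
    return abs(a - b) / (highest - lowest)


# Hypothetical replicated counts for two treatments coded 1 and 2
# (illustrative values only, not the paper's data).
group = [1, 1, 1, 2, 2, 2]
count = [7, 7, 7, 12, 12, 14]

# Without jittering, identical (group, count) pairs fall on the same spot.
raw_points = set(zip(group, count))
# Jitter only the group positions, so the measured counts stay exact.
jittered_points = set(zip(jitter(group), count))

# Hypothetical body weights (g) of two animals, early and late in a trial.
early = (40.0, 50.0)
late = (200.0, 300.0)
lowest, highest = min(early), max(late)

# Share of the y-axis occupied by the early gap, on a linear and a log10 scale.
linear_share = gap_share(early[0], early[1], lowest, highest)
log_share = gap_share(math.log10(early[0]), math.log10(early[1]),
                      math.log10(lowest), math.log10(highest))

print(len(raw_points), len(jittered_points))
print(round(linear_share, 3), round(log_share, 3))
```

The first printed pair compares how many distinct plotting positions exist before and after jittering; the second shows that the early gap between the two lines occupies a larger share of the axis after the log transformation, which is why the lines become easier to tell apart. The jitter amount is a judgment call: it should be large enough to separate the points but small enough that groups do not run into each other.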