basic statistical operations - uhgabriel/courses/cosc6397_s14/bda... · values for two variables...


Upload: hathuy

Post on 14-Mar-2018


COSC 6397

Big Data Analytics

Fundamental Analytics

Edgar Gabriel

Spring 2014

Basic statistical operations

• Calculating minimum, maximum, mean, median, standard deviation

• Data is typically multi-dimensional -> analytics can be based on one or more dimensions of the data

Image source: Hadoop MapReduce Cookbook, chapter 5.


Group-by operations

• Calculate basic operations by group

– Allows more than one reducer to be used

– Grouping is based on the key emitted by the mapper step
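
The group-by pattern can be sketched in plain Python, without Hadoop. This is a minimal illustration only: the record format ("key value"), the function names and the in-memory shuffle are assumptions made for the example, not part of the lecture or the cookbook code.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def map_record(record):
    """Mapper: emit (group key, value). Assumes records look like 'key value'."""
    key, value = record.split()
    return key, float(value)


def reduce_group(key, values):
    """Reducer: compute the basic statistics for one group."""
    return key, {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stddev": pstdev(values),  # population standard deviation
    }


def run_job(records):
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        key, value = map_record(record)
        groups[key].append(value)
    return dict(reduce_group(key, values) for key, values in groups.items())


records = ["a 1", "b 10", "a 3", "b 20", "a 5"]  # hypothetical sample data
stats = run_job(records)
```

Because each key is reduced independently, a real job can spread the groups over several reducers.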


Frequency distributions

• Arrangement of the values that one or more variables take in a sample

• Each entry in the table contains the frequency or count of the occurrences of values within a particular group

• The table summarizes the distribution of values in the sample

• Example:

– Analyze the log file of a web server

– Sort the number of hits received by each URL in ascending order

– Input example: 205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985


Frequency distributions

• First MapReduce job counts the number of occurrences of each URL

– Result of the MapReduce job: a file containing a list of <URL> <no. of occurrences>

• Second MapReduce job

– Uses the output of the first MapReduce job as input

– Mapper: use <no. of occurrences> as key and <URL> as value

– Reducer: emit the <no. of occurrences> to the output file (ignoring URL)

– Sorting is done implicitly by the MapReduce framework
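
The two jobs above can be sketched in plain Python as follows. This is an in-memory illustration under stated assumptions: the function names are made up, the input is assumed to be in Common Log Format, and the sort call stands in for the sorting that the framework performs on the mapper keys. Unlike the reducer described above, the sketch keeps the URL next to each count so the result is easier to read.

```python
from collections import Counter


def parse_url(log_line):
    """Extract the requested URL from a Common Log Format line."""
    request = log_line.split('"')[1]  # e.g. 'GET /index.html HTTP/1.0'
    return request.split()[1]


def count_job(log_lines):
    """Job 1: the mapper emits (URL, 1); the reducer sums the ones per URL."""
    return Counter(parse_url(line) for line in log_lines)


def sort_job(url_counts):
    """Job 2: swap to (count, URL); sorting by key mimics the shuffle phase."""
    swapped = [(count, url) for url, count in url_counts.items()]
    return sorted(swapped)


# Hypothetical sample input; only the first line comes from the slide.
log_lines = [
    '205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985',
    'host2 - - [01/Jul/1995:00:00:13 -0400] "GET /index.html HTTP/1.0" 200 100',
    'host3 - - [01/Jul/1995:01:00:14 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985',
]
frequency_table = sort_job(count_job(log_lines))
```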

[Figure: example output. Image source: Hadoop MapReduce Cookbook, chapter 5.]


Histograms

• Graphical representation of the distribution of data

• Estimate of the probability distribution of a continuous variable

• Representation of tabulated frequencies, shown as adjacent rectangles, erected over discrete intervals

– area proportional to the frequency of the observations in the interval

• Example:

– Determine the number of accesses to the web server per hour


Histograms

• Map step uses the hour as the key and ‘one’ as the value

• Reducer sums up the number of occurrences for each hour
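
A minimal Python sketch of this map and reduce step is given below. It assumes Common Log Format timestamps such as [01/Jul/1995:00:00:12 -0400]; the function names and sample lines are hypothetical.

```python
from collections import Counter


def hour_of(log_line):
    """Mapper key: the hour field of the bracketed timestamp."""
    timestamp = log_line.split("[")[1].split("]")[0]
    return int(timestamp.split(":")[1])  # day/month/year : hour : minute : second


def hits_per_hour(log_lines):
    """Reducer: sum the ones emitted for each hour."""
    return Counter(hour_of(line) for line in log_lines)


# Hypothetical sample input
sample = [
    'h1 - - [01/Jul/1995:00:00:12 -0400] "GET /a.html HTTP/1.0" 200 3985',
    'h2 - - [01/Jul/1995:00:59:59 -0400] "GET /b.html HTTP/1.0" 200 100',
    'h3 - - [01/Jul/1995:13:05:00 -0400] "GET /a.html HTTP/1.0" 200 3985',
]
histogram = hits_per_hour(sample)
```

Written out as two whitespace-separated columns (hour, count), the result can be plotted as a bar chart with a plotting tool.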


Scatter Plots

• A scatter plot uses Cartesian coordinates to display values for two variables for a set of data

• Typically used when one variable is under the control of the experimenter

– A parameter that is systematically incremented and/or decremented is called the control parameter or independent variable

– It is typically plotted along the horizontal axis

– The measured or dependent variable is customarily plotted along the vertical axis


Scatter Plots

• Example: analyze the data to find the relationship between the size of a web page and the number of hits it receives
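
One way to produce the points for such a plot is sketched below in Python. The approach is an assumption made for illustration: the response size (last field of a Common Log Format line) is rounded down to a bucket of 1024 bytes and used as the key, and the hits per bucket are counted. Function names and sample lines are hypothetical.

```python
from collections import Counter


def size_bucket(log_line, bucket_bytes=1024):
    """Mapper key: response size rounded down to a bucket; None if no size is logged."""
    size_field = log_line.rsplit(None, 1)[-1]
    if not size_field.isdigit():  # e.g. '-' when the server sent no body
        return None
    return int(size_field) // bucket_bytes


def size_vs_hits(log_lines):
    """Reducer: count hits per size bucket and return sorted (bucket, hits) points."""
    buckets = (size_bucket(line) for line in log_lines)
    counts = Counter(bucket for bucket in buckets if bucket is not None)
    return sorted(counts.items())


# Hypothetical sample input
sample_log = [
    'h1 - - [01/Jul/1995:00:00:12 -0400] "GET /a.html HTTP/1.0" 200 3985',
    'h2 - - [01/Jul/1995:00:00:13 -0400] "GET /b.html HTTP/1.0" 200 100',
    'h3 - - [01/Jul/1995:00:00:14 -0400] "GET /a.html HTTP/1.0" 200 3985',
    'h4 - - [01/Jul/1995:00:00:15 -0400] "GET /c.html HTTP/1.0" 304 -',
]
points = size_vs_hits(sample_log)
```

Each (bucket, hits) pair becomes one point, with the size on the horizontal axis and the hit count on the vertical axis.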


Joining Two Datasets

[Figures: joining two datasets. Image source: Hadoop MapReduce Cookbook, chapter 5.]


Data Visualization

• Goal: produce high-quality figures

– to be easily incorporated into LaTeX (or Word) documents

– scalable and not too big in size

• Spreadsheet software packages can be used to make figures, but there are limitations:

– scalability

– readability

– size

– conformity

Slides based on a Tutorial given by Peggy Lindner.

Figures, Formats and the right Tools

• Illustrations or diagrams (vector images):

– Adobe Portable Document Format (PDF)

– PostScript (PS)

– Encapsulated PostScript (EPS)

• Photography or microscopy (raster images):

– Tagged Image File Format (TIFF)

– EPS

– PS

– PDF

Important: set the resolution to the desired DPI (dots per inch) value before you begin your editing.

Tools:

– Gimp (http://www.gimp.org/)

– Adobe Photoshop

– Adobe Fireworks

– GNUPLOT

– Matlab

– Tecplot


Resources

GNUPLOT homepage http://www.gnuplot.info/

Introduction to GNUPLOT and Not so FAQ and Solutions

http://www.ualberta.ca/~xz10/gnuplot/index-e.html

http://www.gnuplotting.org/

Software

• Linux from source or packaged in distribution

• MacOS – http://sites.google.com/site/imaximaimath/download-and-install/easy-install-on-mac-os-x#TOC-Gnuplot

• Windows – http://sourceforge.net/projects/gnuplot/files/gnuplot/4.2.4/gp424win32.zip/download

Book

• Philipp K. Janert. 2009. Gnuplot in Action: Understanding Data with Graphs. Manning Publications Co., Greenwich, CT, USA.


Basics

• GNUPLOT is a freely distributed, command-line based interactive plotting program

• Can be used in different modes:

– Interactive console

– Scripts

– Interactive GUI


More Basics - Syntax

• GNUPLOT can display the manuals for its commands, e.g. type help plot to get information on the plot command

• commands can be shortened, e.g. rep instead of replot or p instead of plot

• reset restores the defaults

• several GNUPLOT commands on one line have to be separated by ;

• GNUPLOT comments start with #

• shell commands (e.g. vi) in GNUPLOT start with !

• file names have to be enclosed in single or double quotes


Output Formats

• GNUPLOT uses different "terminals", e.g. latex, tikz, eps, png ...

• Example:

set terminal png size X,Y # X, Y: width and height in pixels

set output "plot.png"

set term postscript enhanced eps color "Helvetica,26" \
linewidth 4 rounded

set output "ErrorDistributions.eps"


Simple Example

• To plot a sine curve open GNUPLOT and type:

f(x) = sin(x) # define a function

plot f(x) # plot this function

replot f(2*x) # plot another function

• Customizations:

set ytics 0.5; set mytics 5

rep # update plot

set xrange [-pi:pi] # x range

set xtics ("-pi" -pi, "-pi/2" -pi/2, 0, "pi/2" pi/2, "pi" pi)


Scripting

• For your scientific work you will prefer scripts

• Store the commands in a text file

• Load the script in GNUPLOT by typing either:

gnuplot SinExample.plt (command-line)

load 'SinExample.plt' (in GNUPLOT)


Plotting Data From Files

GNUPLOT can read data from files

• data columns are separated by whitespace or tabs

• lines that begin with "#" are ignored

• data columns and formats can be specified with the using option of the plot command

• pairs of blank lines in data files separate individual blocks of data, which can be selected with the index option
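
A data file in this layout can be generated with a few lines of Python. The sketch below is an illustration only; the function name, header text and sample values are assumptions. It writes a comment header, whitespace-separated columns, and two consecutive blank lines between blocks, which is the separator that gnuplot's index option looks for.

```python
def format_blocks(blocks, header="x y"):
    """Format a list of blocks, each a list of (x, y) rows, as gnuplot data text."""
    lines = ["# " + header]  # comment line, ignored by gnuplot
    for number, block in enumerate(blocks):
        if number > 0:
            lines += ["", ""]  # two blank lines start a new block for `index`
        lines += [f"{x} {y}" for x, y in block]
    return "\n".join(lines) + "\n"


# Hypothetical sample data: two blocks, addressable as index 0 and index 1
text = format_blocks([[(1, 2), (2, 4)], [(1, 3)]])
```

Writing the returned text to a file such as data.txt makes the first block available as index 0 and the second as index 1.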


Simple Example:

plot 'data.txt' using 1:2 with lines

• place or hide key

set key top right
set nokey

• set a title

set title "Subject V001" font "Helvetica-Bold,18"

• define axis labels

set xlabel "Time [s]" font "Helvetica-Bold,16"

• change the number format

set format y "%1.1f"

• select zoom

– manually select range of axis

set xrange [ 0 : 245 ]
set yrange [ 0 : 1.2 ]

– set yrange [*:*] ... select zoom of y-axis automatically

• color, width and shape of lines/points (linetype / lt, pointtype / pt, linewidth / lw, pointsize / ps)

set style line 1 lt 1 lw 4 lc rgb "red"
set style line 2 lt 2 lw 4 lc rgb "black"
set style line 3 lt 1 lw 4 lc rgb "black"

• plot multiple data series separated by commas

plot 'data.txt' using 1:2 title "TIMP" with lines ls 1, \
'' using 1:4 title "TIMF" with lines ls 2, \
'' using 1:3 title "GSR" with lines ls 3


Scatter Plots – Fitting data

• Simple Example:

plot 'data.txt' using 1:2 with points

• Data fitting

– define the model function (here a straight line)

f(x) = m*x + b

– fit your data

fit f(x) 'AttemptsData.txt' using 1:2 via m,b

– clip the fit to the fitted area

– add to plot

plot 'AttemptsData.txt' using 1:2 with points lt -1 pt 7 notitle

• define a label

set label "r = 0.371" at graph 0.6, graph 0.43
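
For a straight-line model, the result of the fit can be cross-checked with the closed-form least-squares solution. The Python sketch below is an illustration under stated assumptions: the function name and sample data are made up, and reading the "r" in the label as Pearson's correlation coefficient is a guess. GNUPLOT's fit command uses an iterative nonlinear least-squares algorithm, so its values for m and b should agree with the closed form only up to the convergence tolerance.

```python
from math import sqrt


def linear_fit(xs, ys):
    """Least-squares fit of y = m*x + b; also returns Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx               # slope
    b = mean_y - m * mean_x     # intercept
    r = sxy / sqrt(sxx * syy)   # correlation coefficient
    return m, b, r


# Hypothetical data lying exactly on y = 2x + 1
m, b, r = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
```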

Vertical Bar Graphs & Boxplots

# clustered bar graph
set boxwidth 1.0 absolute

set style line 1 lt -1 lw 1

set style histogram cluster gap 1

set style data histogram

plot 'HistogramData.txt' index 0 using 2 fs solid 0.5 ls 1 title "Novices", \
'' index 1 using 2 fs solid 0.25 ls 1 title "Experienced"

# boxplot drawn with candlesticks
set boxwidth 0.3 absolute

set style fill solid border -1

set key left top

plot 'MPE.txt' index 0 using 1:3:2:6:5 with candlesticks lt -1 lw 2 ti "Scenario 1" whiskerbars fs solid 0.5, \
'' index 0 using 1:4:4:4:4 with candlesticks lt -1 lw 2 notitle


Other Useful Commands

• Logscale

[un]set logscale [xy]

• do math on data columns

plot './data.txt' u ($0*10):($2*10**($1)) w lp

• short format

p 'data.txt' u 1:2 w p pt 1 lt 2 lw 2

• Missing data:

set datafile missing 'NaN'

• Set grid

# Line style for axes

set style line 80 lt 1

set style line 80 lt rgb "#808080"

# Line style for grid

set style line 81 lt 0 # dashed

set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81

set border 3 back linestyle 80

set ytics nomirror

Multiple Graphs

• stack several plot commands

set multiplot

• scale the plot

set size 1.0,0.9

• place the plot

set origin 0.0,0.9

• leave multiplot mode

unset multiplot


Optimizing Your Plots

1. Focus on the purpose of the figure: What do you want to show?

2. Keep it simple and efficient: Choose good units for the axes. Scale the axes to make good use of the figure’s area.

3. Explain what you plot: Label the axes, find a good title, add a key to symbols and write a clear and complete caption (not just re-stating what’s on the axes). The plot should be self-explanatory: the intended audience should understand it even without reading the text of your paper.

4. Show your figure to a colleague, ideally somebody not directly working with you, to check if it is clear.

[1] http://www.usm.lmu.de/CAST/talks/gnuplot.pdf

Checklist

• enough information in well chosen title / caption ?

• content of labels (e.g. units)?

• content of key of symbols?

• too much information for a single graph?

• plot type suited for purpose? scatter vs. line graph; error bars; fits

• is x the independent variable and y the dependent variable?

• large enough font size of labels?

• sufficient line width? (e.g. a few pixels for presentations with data projectors)

• optimized number of tick marks, minor tick marks?

• plot looks "empty"? zoom ... make good use of the plot area

• plot format (eps, png, tikz) suited for the purpose?
