COSC 6397
Big Data Analytics
Fundamental Analytics
Edgar Gabriel
Spring 2014
Basic statistical operations
• Calculating minimum, maximum, mean, median,
standard deviation
• Data typically multi-dimensional -> analytics can be
based on one or more dimensions of the data
Image source: Hadoop MapReduce Cookbook, chapter 5.
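The basic operations listed above can be sketched in plain Python. This is a minimal, single-machine illustration and not the course's MapReduce implementation; the records and the helper name summarize are hypothetical.

```python
import statistics

def summarize(values):
    """Return min, max, mean, median and sample standard deviation."""
    values = list(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

# Hypothetical multi-dimensional records: (response size in bytes, time in ms)
records = [(3985, 12), (1204, 7), (7074, 31), (3985, 11), (512, 5)]

# The analysis is based on one dimension at a time: select a column first.
size_summary = summarize(size for size, _ in records)
time_summary = summarize(time for _, time in records)
print(size_summary)
print(time_summary)
```

In a MapReduce setting, a computation like this over the whole data set would typically end up in a single reducer, which is one motivation for the group-by variant.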
Group-by operations
• Calculate basic operations by group
– Allows more than one reducer to be used
– Grouping based on key of the mapper step
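A group-by computation can be sketched as a mapper that emits a key per record and a reducer that is called once per key. The sketch below simulates the shuffle phase with a dictionary; the choice of the HTTP status code as the group key and all function names are assumptions made for illustration.

```python
from collections import defaultdict
import statistics

def map_record(record):
    """Mapper: emit (group key, value). Here the key is the HTTP status."""
    status, size = record
    yield status, size

def reduce_group(key, values):
    """Reducer: compute the basic operations for one group."""
    values = list(values)
    return key, {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }

def run(records):
    groups = defaultdict(list)  # stands in for the framework's shuffle phase
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    # Each key could be handled by a different reducer.
    return dict(reduce_group(k, v) for k, v in sorted(groups.items()))

# Hypothetical (status, response size) records
records = [(200, 3985), (200, 1204), (404, 0), (200, 7074), (304, 0), (404, 0)]
stats_by_status = run(records)
print(stats_by_status)
```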
Frequency distributions
• An arrangement of the values that one or more variables
take in a sample
• Each entry in the table contains the frequency or count
of the occurrences of values within a particular group
• The table summarizes the distribution of values in the
sample
• Example:
– Analyze the log file of a web server
– Sort the number of hits received by each URL in
ascending order
– Input Example: 205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET
/shuttle/countdown/countdown.html HTTP/1.0" 200 3985
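Before any counting can happen, each log line has to be split into fields. The following sketch parses a line in the layout of the input example with a regular expression; the pattern and the field names are assumptions, and real logs may contain lines that need more lenient handling.

```python
import re

# Assumed layout: host, two ignored fields, [timestamp], "request", status, size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Some servers log "-" when no body was sent; treat that as 0 bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = ('205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] '
          '"GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985')
entry = parse_line(sample)
print(entry["url"], entry["status"], entry["size"])
```

Returning None for unparsable lines lets a mapper skip them instead of failing the whole job.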
Frequency distributions
• First MapReduce job counts the number of occurrences
of a URL
– Result of the MapReduce job: a file containing the list of
<URL> <no. of occurrences>
• Second MapReduce job
– Use the output of first MapReduce job as input
– Mapper: use <no. of occurrences> as key and <URL> as
value
– Reducer: emit the <no. of occurrences> to the output file
(ignoring the URL)
– Sorting done implicitly by the MapReduce framework
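The two chained jobs can be sketched as two small functions. This is a single-process illustration under stated assumptions: the explicit sort stands in for the sorting that the framework performs on the keys, and the sketch keeps the URL next to each count so the result is easier to read. The function names and sample URLs are hypothetical.

```python
from collections import defaultdict

def job1_count(urls):
    """First job: mapper emits (url, 1), reducer sums the ones per URL."""
    counts = defaultdict(int)
    for url in urls:
        counts[url] += 1
    return list(counts.items())

def job2_sort(url_counts):
    """Second job: mapper swaps each pair to (count, url).

    Sorting by key mimics what the framework does between map and reduce.
    """
    swapped = [(count, url) for url, count in url_counts]
    swapped.sort(key=lambda pair: pair[0])
    return swapped

urls = ["/a.html", "/b.html", "/a.html", "/c.html", "/a.html", "/b.html"]
sorted_hits = job2_sort(job1_count(urls))
print(sorted_hits)  # [(1, '/c.html'), (2, '/b.html'), (3, '/a.html')]
```

Note that Hadoop sorts keys within each reducer's partition, so a globally sorted output file requires either a single reducer or a partitioner that preserves the total order.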
Histograms
• graphical representation of the distribution of data.
• estimate of the probability distribution of a continuous
variable
• representation of tabulated frequencies, shown as
adjacent rectangles, erected over discrete intervals
– area proportional to the frequency of the observations in
the interval
• Example:
– Determine the number of accesses to the web server per
hour
Histograms
• Map step uses the hour as the key and ‘one’ as the
value
• Reducer sums up the number of occurrences for each
hour
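The map and reduce steps for the hourly histogram can be sketched as follows. The sketch assumes timestamps in the day/month/year:hour:minute:second layout of the web server log and extracts the hour by splitting on ":"; the sample timestamps are made up.

```python
from collections import Counter

def map_hour(timestamp):
    """Mapper: emit (hour, 1) for one timestamp such as
    '01/Jul/1995:13:45:00 -0400'."""
    hour = int(timestamp.split(":")[1])
    return hour, 1

def reduce_hours(pairs):
    """Reducer: sum up the number of occurrences for each hour."""
    hits = Counter()
    for hour, one in pairs:
        hits[hour] += one
    return hits

timestamps = [
    "01/Jul/1995:00:00:12 -0400",
    "01/Jul/1995:00:14:03 -0400",
    "01/Jul/1995:01:02:51 -0400",
    "01/Jul/1995:13:45:00 -0400",
    "02/Jul/1995:13:05:10 -0400",
]
hits_per_hour = reduce_hours(map_hour(t) for t in timestamps)

# One "hour count" line per bin, a layout a plotting tool can read directly.
for hour in sorted(hits_per_hour):
    print(hour, hits_per_hour[hour])
```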
Scatter Plots
• A scatter plot uses Cartesian coordinates to display
values for two variables for a set of data
• Typically used when one variable is under the
control of the experimenter
– A parameter that is systematically incremented and/or
decremented
• is called the control parameter or independent
variable
• is typically plotted along the horizontal axis
– The measured or dependent variable is customarily
plotted along the vertical axis
Scatter Plots
• Example: analyze the data to find the relationship
between the size of the web pages and the number of
hits received by each page
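One way to prepare the points for such a scatter plot is to count hits per page-size bucket. The sketch below is an assumption about how the analysis could be structured, not the cookbook's code: the mapper emits the size in KB as key and 1 as value, and the reducer sums the hits per size. The sample sizes are made up.

```python
from collections import defaultdict

def map_entry(size_bytes):
    """Mapper: emit (page size in KB, 1). Bucketing by KB is an assumption."""
    return size_bytes // 1024, 1

def reduce_sizes(pairs):
    """Reducer: sum the hits for each size bucket, sorted by size."""
    hits = defaultdict(int)
    for size_kb, one in pairs:
        hits[size_kb] += one
    return sorted(hits.items())

sizes = [3985, 3100, 512, 7074, 900, 3500]
points = reduce_sizes(map_entry(s) for s in sizes)

# Each output line is one (x, y) point: size on the x-axis, hits on the y-axis.
for size_kb, hits in points:
    print(size_kb, hits)
```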
Joining Two Datasets
Image source: Hadoop MapReduce Cookbook, chapter 5.
Data Visualization
• Goal: produce high quality figures
– to be easily incorporated into LaTeX (or Word) documents
– scalable and not too big in size
• Spreadsheet software packages can be used to make
figures, but there are limitations:
– scalability
– readability
– size
– conformity
Slides based on a Tutorial given by Peggy Lindner.
Figures, Formats and the right Tools
• Illustrations or diagrams (vector images):
– Adobe Portable Document Format (PDF)
– PostScript (PS)
– Encapsulated PostScript (EPS)
• Photography or microscopy (raster images):
– Tagged Image File Format (TIFF)
– EPS
– PS
Important: set the resolution to the desired DPI (dots per inch) value before you begin your editing.
Gimp (http://www.gimp.org/)
Adobe Photoshop
Adobe Fireworks
GNUPLOT
Matlab
Tecplot
Resources
GNUPLOT homepage http://www.gnuplot.info/
Introduction to GNUPLOT and Not so FAQ and Solutions
http://www.ualberta.ca/~xz10/gnuplot/index-e.html
http://www.gnuplotting.org/
Software
• Linux from source or packaged in distribution
• MacOS – http://sites.google.com/site/imaximaimath/download-and-install/easy-install-on-mac-os-x#TOC-Gnuplot
• Windows – http://sourceforge.net/projects/gnuplot/files/gnuplot/4.2.4/gp424win32.zip/download
Book
• Philipp K. Janert. 2009. Gnuplot in Action: Understanding Data with Graphs. Manning Publications Co., Greenwich, CT, USA.
Basics
• GNUPLOT is a freely distributed command-line based
interactive plotting program
• Can be used in different modes:
– Interactive console
– Scripts
– Interactive GUI
More Basics - Syntax
• GNUPLOT can display the manuals for its commands e.g. type help plot to get information on the plot
command
• commands can be shortened e.g. rep instead of replot
or p instead of plot
• reset restores the defaults
• several GNUPLOT commands in one line have to be separated by ;
• GNUPLOT comments start with #
• shell commands (e.g. vi) in GNUPLOT start with !
• file names have to be enclosed in single or double
quotes
Output Formats
• GNUPLOT uses different “terminals” e.g. latex, tikz,
eps, png ...
• Example:
set terminal png size X,Y # older gnuplot releases: picsize X Y
set output "plot.png"
set term postscript enhanced eps color "Helvetica,26" linewidth 4 rounded
set output "ErrorDistributions.eps"
Simple Example
• To plot a sine curve open GNUPLOT and type:
f(x) = sin(x) # define a function
plot f(x) # plot this function
replot f(2*x) # plot another function
• Customizations:
set ytics 0.5; set mytics 5
rep # update plot
set xrange [-pi:pi] # x range
set xtics ("-pi" -pi, "-pi/2" -pi/2, 0, "pi/2" pi/2, "pi" pi)
Scripting
• For your scientific work you will prefer scripts
• Store the commands in a text file
• Load the script in GNUPLOT by typing either:
gnuplot SinExample.plt (command-line)
load 'SinExample.plt' (in GNUPLOT)
Plotting Data From Files
GNUPLOT can read data from files
• data columns are separated by whitespace or tabs
• lines that begin with "#" are ignored
• columns and data formats can be selected with the using option of the plot command
• two consecutive blank lines in a data file separate individual data sets,
which can be selected with the index option
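A data file in this layout is easy to produce from an analysis script. The sketch below formats several data sets as whitespace-separated columns with a "#" header line, assuming that two consecutive blank lines separate the data sets addressed by index; the function name and sample values are hypothetical.

```python
def format_datasets(datasets, header="x y"):
    """Format data sets as text for a whitespace-separated data file.

    Each data set is a list of rows; rows become space-separated columns.
    Data sets are separated by two blank lines.
    """
    blocks = []
    for rows in datasets:
        lines = [" ".join(str(value) for value in row) for row in rows]
        blocks.append("\n".join(lines))
    return "# " + header + "\n" + "\n\n\n".join(blocks) + "\n"

datasets = [
    [(0, 0.0), (1, 0.5)],   # would be selected with: index 0
    [(0, 1.0), (1, 1.5)],   # would be selected with: index 1
]
text = format_datasets(datasets)
print(text, end="")

# To store the result for plotting, write it to a file, for example:
# with open("data.txt", "w") as handle:
#     handle.write(text)
```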
Simple Example:
plot 'data.txt' using 1:2 with lines
• place or hide key
set key top right
unset key # older releases: set nokey
• set a title
set title "Subject V001" font "Helvetica-Bold,18"
• define axis labels
set xlabel "Time [s]" font "Helvetica-Bold,16"
• change the number format
set format y "%1.1f"
• select zoom
– manually select range of axis
set xrange [ 0 : 245 ]
set yrange [ 0 : 1.2 ]
– set yrange [*:*] selects the zoom of the y-axis automatically
• color, width and shape of lines/points (linetype / lt, pointtype / pt, linewidth / lw,
pointsize / ps)
set style line 1 lt 1 lw 4 lc rgb "red"
set style line 2 lt 2 lw 4 lc rgb "black"
set style line 3 lt 1 lw 4 lc rgb "black"
• plot multiple data series separated by commas
plot 'data.txt' using 1:2 title "TIMP" with lines ls 1, \
'' using 1:4 title "TIMF" with lines ls 2, \
'' using 1:3 title "GSR" with lines ls 3
Scatter Plots – Fitting data
• Simple Example:
plot 'data.txt' using 1:2 with points
• Data fitting
– define the function to fit (here a straight line)
– fit your data
– clip the fit to the fitted area
– add to plot
• define a label
set label "r = 0.371" at graph 0.6, graph 0.43
f(x) = m*x + b
fit f(x) 'AttemptsData.txt' using 1:2 via m,b
plot 'AttemptsData.txt' using 1:2 with points lt -1 pt 7 notitle
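For a straight-line model, the fitted parameters and a correlation coefficient like the one shown in the label can be cross-checked with the closed-form least-squares formulas. The sketch below is an independent illustration: gnuplot's fit command uses an iterative nonlinear least-squares algorithm, and the sample data and function name here are made up.

```python
import math

def linear_fit(xs, ys):
    """Least-squares fit of y = m*x + b; also returns Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx                   # slope
    b = mean_y - m * mean_x         # intercept
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    return m, b, r

# Hypothetical measurements
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b, r = linear_fit(xs, ys)
print(f"m = {m:.3f}, b = {b:.3f}, r = {r:.3f}")
```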
Vertical Bar Graphs & Boxplots
# clustered bar graph
set boxwidth 1.0 absolute
set style line 1 lt -1 lw 1
set style histogram cluster gap 1
set style data histogram
plot 'HistogramData.txt' index 0 using 2 fs solid 0.5 ls 1 title "Novices", \
'' index 1 using 2 fs solid 0.25 ls 1 title "Experienced"
# boxplot drawn with candlesticks; the second plot entry draws the median line
set boxwidth 0.3 absolute
set style fill solid border -1
set key left top
plot 'MPE.txt' index 0 using 1:3:2:6:5 with candlesticks lt -1 lw 2 ti "Scenario 1" whiskerbars fs solid 0.5, \
'' index 0 using 1:4:4:4:4 with candlesticks lt -1 lw 2 notitle
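The candlesticks plot above needs one row per box with the quartiles already computed. Judging from the column order in the using clause, the data file appears to hold x, minimum, first quartile, median, third quartile and maximum in columns 1 to 6; that layout is an inference, and the helper names and sample values below are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, first quartile, median, third quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def candlestick_row(x, values):
    """Format one data-file row: x min q1 median q3 max."""
    low, q1, median, q3, high = five_number_summary(values)
    return f"{x} {low} {q1} {median} {q3} {high}"

# Hypothetical measurements for two boxes
samples = {1: [5, 1, 3, 2, 4], 2: [10, 20, 30, 40, 50]}
rows = [candlestick_row(x, values) for x, values in sorted(samples.items())]
print("\n".join(rows))
```

Other quartile conventions exist; the "inclusive" method is only one reasonable choice, and a different method changes the box edges slightly.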
Other Useful Commands
• Logscale
[un]set logscale [xy]
• do math on data columns
plot './data.txt' u ($0*10):($2*10**($1)) w lp
• short format
p 'data.txt' u 1:2 w p pt 1 lt 2 lw 2
• Missing data:
set datafile missing 'NaN'
• Set grid
# Line style for axes
set style line 80 lt 1
set style line 80 lt rgb "#808080"
# Line style for grid
set style line 81 lt 0 # dashed
set style line 81 lt rgb "#808080" # grey
set grid back linestyle 81
set border 3 back linestyle 80
set ytics nomirror
Multiple Graphs
• stack several plot commands
set multiplot
• scale the plot
set size 1.0,0.9
• place the plot
set origin 0.0,0.9
• leave multi plot mode
unset multiplot
Optimizing Your Plots
1. Focus on the purpose of the figure: What do you want to
show?
2. Keep it simple and efficient: Choose good units for the
axes. Scale the axes to make good use of the figure's area.
3. Explain what you plot: Label the axes, find a good title, add
a key to symbols and write a clear and complete caption (not
just re-stating what's on the axes). The plot should be
self-explanatory: the intended audience should understand it
even without reading the text of your paper.
4. Show your figure to a colleague, ideally somebody not
directly working with you, to check if it is clear.
[1] http://www.usm.lmu.de/CAST/talks/gnuplot.pdf
Checklist
• enough information in well chosen title / caption ?
• content of labels (e.g. units)?
• content of key of symbols?
• too much information for a single graph?
• plot type suited for purpose? scatter vs. line graph; error bars; fits
• is x the independent variable and y the dependent variable?
• large enough font size of labels?
• sufficient line width? (e.g. a few pixels for presentations with data projectors)
• optimized number of tick marks, minor tick marks?
• plot looks "empty"? zoom ... make good use of the plot area
• plot format (eps, png, tikz) suited for the purpose?