
Page 1

Machine Learning for Language Technology 2015

Preliminaries: Understanding and Preprocessing Data

Marina Santini
[email protected]

Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden

Autumn 2015

Page 2

Acknowledgements

• Weka Slides (teaching material*), Wikipedia, MathIsFun and other websites.

* http://www.cs.waikato.ac.nz/ml/weka/book.html

Page 3

Outline

– Raw Data and Feature Representation:

• Concepts, instances, attributes

– Digression 1: Pills of Statistics

• Sampling, mean, variance, standard deviation, normalization, standardization, etc.

– Digression 2: Data Visualization

• how to read a histogram, scatter plot, etc.

Page 4

DATA, CONCEPTS, INSTANCES, ATTRIBUTES, FEATURES

Raw Data and Data Representation

Page 5

What is data?

• Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of things.

• Data can be qualitative or quantitative.

– Qualitative data is descriptive information (it describes something)

– Quantitative data is numeric information (numbers).

Page 6

Singular or Plural?

• The singular form of "data" is "datum".
  – Ex: "that datum is very high"
• The plural form of "datum" is "data"; "data" is plural when it refers to many individual items.
  – Ex: "the data are available"
• But "data" can also refer to a collection of facts. In this case it is uncountable and takes a singular verb.
  – Ex: "the data is available"

http://www.theguardian.com/news/datablog/2010/jul/16/data-plural-singular

Page 7

Qualitative Data

• Categorical values

– Nominal (ex: eye colour)

– Ordinal (ex: street numbers)

Page 8

Quantitative Data

• Quantitative data can also be discrete or continuous.
• Discrete data is counted; continuous data is measured.
  – Discrete data can only take certain values (like whole numbers)
  – Continuous data can take any value (within a range)

Page 9


Concepts, Instances, and Attributes

Components of the input:

Concepts: kinds of things that can be learned

Instances: the individual, independent examples of a concept

Attributes: measuring aspects of an instance

Page 10

The importance of feature selection and representation


Binary data is a special type of categorical data. Binary data takes only two values.

Page 11

GETTING TO KNOW YOUR DATA

Page 12


Missing Data/Values

Types: unknown, unrecorded, irrelevant, etc.

Reasons:

collation of different datasets

measurement not possible

etc.

Missing data may have significance in itself (e.g. a missing test in a medical examination)

Most ML schemes assume that missing data have no special significance. So… be careful and make your own decisions.

Page 13


Inaccurate values

Typographical errors in nominal attributes: values need to be checked for consistency

Typographical and measurement errors in numeric attributes: outliers need to be identified

Page 14

Noise

• Noise is any unwanted anomaly in the data.

• In ML the presence of noise may cause difficulties in learning the classes and produce unreliable classifiers.

• Noise can be caused by:

– imprecisions in recording input attributes

– errors in labelling

– etc.

Page 15


Getting to know the data

Simple visualization tools are very useful

Nominal attributes: histograms

Numeric attributes: graphs

Too much data to inspect? Take a sample!

Page 16

ARFF FORMAT
Weka (Waikato Environment for Knowledge Analysis)

Page 17

Weka Software Package
http://www.cs.waikato.ac.nz/ml/weka/

Weka (Waikato Environment for Knowledge Analysis) is developed at the University of Waikato in New Zealand.

A collection of state-of-the-art machine learning algorithms and data preprocessing tools.

It is open source. It is written in Java.

Contains implementations of learning algorithms that you can apply to your datasets.

Page 18

Weka input data formats

• General formats:
• Weka:
  – ARFF: Attribute-Relation File Format
  – It is an ASCII file that describes a list of instances sharing a set of attributes.

Page 19

The ARFF format

Page 20


Sparse data

In some applications most attribute values in a dataset are zero
  E.g.: word counts in a text categorization problem

ARFF supports sparse data

This also works for nominal attributes (where the first value corresponds to "zero")

0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A”

0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”

{1 26, 6 63, 10 “class A”}

{3 42, 10 “class B”}
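The instance lines above come from a file shaped like the following minimal sketch (the relation and attribute names are invented for illustration; only the data lines are from the slide):

```arff
% Hypothetical ARFF file: ten numeric word-count attributes plus a class
@relation word_counts

@attribute w0 numeric
@attribute w1 numeric
@attribute w2 numeric
@attribute w3 numeric
@attribute w4 numeric
@attribute w5 numeric
@attribute w6 numeric
@attribute w7 numeric
@attribute w8 numeric
@attribute w9 numeric
@attribute class {"class A","class B"}

@data
% Dense format: every value listed, zeros included
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
% Sparse format: {index value, ...} with zero-based indices, zeros omitted
{1 26, 6 63, 10 "class A"}
```

Note how the sparse instance only names the non-zero positions: attribute 1 holds 26, attribute 6 holds 63, and attribute 10 is the class.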

Page 21

SAMPLING
NORMAL DISTRIBUTION
MEASURES OF CENTRAL TENDENCY

Digression: Pills of Statistics

Page 22

Population and Sample

• Population: the whole group of "things" we want to study
  – Ex: all students born between 1980 and 2000
• Sample: a selection taken from a larger group (the "population") so that you can examine it to find out something about the larger group.
  – Ex: 100 randomly chosen students born between 1980 and 2000

In other words: the 'population' is the entire pool from which a statistical sample is drawn. The information obtained from the sample allows statisticians to develop hypotheses about the larger population. Researchers gather information from a sample because of the difficulty of studying the entire population.

Page 23

Sampling

• Sampling is a science in itself and there are different methods to sample a population

– Ex: random sampling, stratified sampling, multi-stage sampling, quota sampling, etc.

• The main concern: the sample should be representative of the population.
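A simple random sample, the most basic of these methods, can be sketched with Python's standard library (the population of student IDs below is invented for illustration):

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical population: IDs for 1000 students
population = list(range(1, 1001))

# Simple random sampling without replacement: every individual
# has an equal chance of being selected, and no one is picked twice
sample = random.sample(population, k=100)

print(len(sample))       # 100
print(len(set(sample)))  # 100
```

Stratified or quota sampling would instead partition the population into groups first and sample from each group separately.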

Page 24

Distributions

Page 25

Normal Distribution

• A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme.

Page 26

Skewness


• When data is "skewed", it shows a long tail on one side or the other:

Page 27

Outliers

• An outlier is an observation point that is distant from other observations.

Page 28

Measures of Central Tendency

• In a normal distribution, the mean, mode and median are all the same.

Page 29

Right Skewed Distribution

Page 30

Negatively Skewed Distribution

Page 31

Mean

• The mean is the average of the numbers: a calculated "central" value of a set of numbers. To calculate it: just add up all the numbers, then divide by how many numbers there are.

Ex: what is the mean of 2, 7 and 9?
• Add the numbers: 2 + 7 + 9 = 18
• Divide by how many numbers we added (3): 18 ÷ 3 = 6
• The mean is 6
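The same arithmetic can be checked with Python's standard library:

```python
from statistics import mean

values = [2, 7, 9]

# Add up all the numbers, then divide by how many there are
print(sum(values) / len(values))  # 6.0

# The stdlib function gives the same result
assert mean(values) == 6
```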

Page 32

Median

• The median is the middle number in a sorted list of numbers. To find the median, place the numbers you are given in value order and find the middle number. (If there are two middle numbers, you average them.)

• Find the Median of {13, 23, 11, 16, 15, 10, 26}.

• Put them in order: {10, 11, 13, 15, 16, 23, 26}

• The middle number is 15, so the median is 15.
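The same example in Python:

```python
from statistics import median

values = [13, 23, 11, 16, 15, 10, 26]

# sorted(values) is [10, 11, 13, 15, 16, 23, 26]; the middle value is 15
print(median(values))  # 15
```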

Page 33

Mode

• The mode is the number which appears most often in a set of numbers.

• In {6, 3, 9, 6, 6, 5, 9, 3} the Mode is 6 (it occurs most often).
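The same example in Python:

```python
from statistics import mode

values = [6, 3, 9, 6, 6, 5, 9, 3]

# 6 occurs three times, more often than any other number
print(mode(values))  # 6
```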

Page 34

Frequency Table

• Ex of a frequency table:

Page 35

The mean of a frequency table

• In a frequency table, the mean is calculated by:

– multiply each score by its frequency, add up the products, and divide by the sum of the frequencies
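A sketch of this in Python; the scores and frequencies below are invented, since the slide's own table is not reproduced here:

```python
# Hypothetical frequency table: score -> frequency
freq_table = {1: 2, 2: 5, 3: 4, 4: 2, 5: 1}

total = sum(score * f for score, f in freq_table.items())  # sum of f*x: 37
n = sum(freq_table.values())                               # sum of f:   14
mean = total / n

print(round(mean, 4))  # 2.6429
```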

Page 36

Mean: Formula

x̄ = Σfx / Σf

• The x with the bar on top means ”mean of x”

• Σ (sigma) means ”sum up”

• Σ fx means ”sum up all the frequencies times the matching scores”

• Σ f means ”sum up all the frequencies”

Page 37

Quiz: The mean of a frequency table

• Calculate the mean of the following frequency table using the mean formula:

Answers (only one is correct)
• 2.05
• 5.2
• 3.7

Page 38

MEASURES OF DISPERSION

Digression: Pills of Statistics

Page 39

Measures of Dispersion

• Dispersion is a general term for different statistics that describe how values are distributed around the centre

Page 40

Measures of Dispersion

• range

• quartiles

• interquartile range

• percentiles

• mean deviation

• variance

• standard deviation

• etc.

Page 41

Range

• The range is the difference between the lowest and highest values.

– Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9. So the range is 9-3 = 6.
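The same example in Python:

```python
values = [4, 6, 9, 3, 7]

# Range = highest value minus lowest value
print(max(values) - min(values))  # 6
```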

Page 42

Quartiles

• Quartiles are the values that divide a list of numbers into quarters.
  – First put the list of numbers in order
  – Then cut the list into four equal parts
  – The quartiles are at the "cuts"

• Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8 (the numbers must be in order)

• Cut the list into quarters. Taking the median of each half, the result is:
• Quartile 1 (Q1) = 3
• Quartile 2 (Q2), which is also the median = 5.5
• Quartile 3 (Q3) = 7
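Quartile values depend on the convention used; a common one, which takes the median of the lower and upper halves of the sorted data, can be sketched as:

```python
from statistics import median

values = sorted([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])

q2 = median(values)                      # median of the whole list
lower = values[:len(values) // 2]        # lower half (excludes the median if n is odd)
upper = values[(len(values) + 1) // 2:]  # upper half
q1, q3 = median(lower), median(upper)

print(q1, q2, q3)  # 3 5.5 7
```

Other conventions (e.g. the interpolation-based ones behind `statistics.quantiles`) can give slightly different Q1 and Q3 values for the same data.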

Page 43

Interquartile Range

• The "Interquartile Range" is from Q1 to Q3.

• To calculate it, just subtract Quartile 1 from Quartile 3: IQR = Q3 − Q1
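Given the two quartiles (the values below are illustrative), the calculation is a single subtraction:

```python
q1, q3 = 3, 7   # illustrative quartile values
iqr = q3 - q1
print(iqr)  # 4
```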

Page 44

Percentiles

• Percentile is the value below which a percentage of data falls (The data needs to be in order)

• Example: You are the 4th tallest person in a group of 20; 80% of people are shorter than you: That means you are at the 80th percentile.

• That is, if your height is 1.85m then "1.85m" is the 80th percentile height in that group.
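The height example can be sketched as follows; the individual heights are invented, and only the "4th tallest in a group of 20" setup comes from the slide:

```python
# Hypothetical group of 20 distinct heights in metres, sorted ascending
heights = sorted(round(1.55 + 0.02 * i, 2) for i in range(20))

your_height = heights[16]  # the 4th tallest person (only indices 17-19 are taller)

# Percentile rank: the percentage of people shorter than you
rank = 100 * sum(h < your_height for h in heights) / len(heights)
print(rank)  # 80.0
```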

Page 45

Mean Deviation

• It is the mean of the distances of each value from their mean.

• Three steps:

– 1. Find the mean of all values

– 2. Find the distance of each value from that mean (subtract the mean from each value and take the absolute value, ignoring minus signs)

– 3. Then find the mean of those distances
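The three steps in Python, with an invented data set:

```python
values = [3, 6, 6, 7, 8, 11, 15, 16]   # illustrative data

# Step 1: find the mean of all values
m = sum(values) / len(values)          # 9.0

# Steps 2 and 3: absolute distances from that mean, then their mean
mean_dev = sum(abs(x - m) for x in values) / len(values)
print(mean_dev)  # 3.75
```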

Page 46

Variance: σ²

• The variance is the average of the squared differences from the mean.

• To calculate the variance, follow these steps:
  – Work out the mean.
  – Then for each number: subtract the mean and square the result (the squared difference).
  – Then work out the average of those squared differences.

Page 47

Example: Compute the Variance

For the following dataset find the variance: {600, 470, 170, 430, 300}.

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

For each number subtract the mean:

600 − 394 = 206; 470 − 394 = 76; 170 − 394 = −224; 430 − 394 = 36; 300 − 394 = −94

Take each difference, square it, and then average the results: (206² + 76² + 224² + 36² + 94²) / 5 = 108,520 / 5 = 21,704. The variance is 21,704.
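The same computation in Python; the standard library's `statistics.pvariance` implements exactly this population variance:

```python
from statistics import pvariance

values = [600, 470, 170, 430, 300]

m = sum(values) / len(values)                               # 394.0
variance = sum((x - m) ** 2 for x in values) / len(values)  # average squared difference
print(variance)  # 21704.0

assert variance == pvariance(values)
```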

Page 48

Standard Deviation: σ

• The standard deviation is one of the most reliable measures of how spread out numbers are.

• The formula is easy: it is the square root of the variance.
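Continuing the worked variance example (variance 21,704):

```python
import math

variance = 21704            # from the previous example
std_dev = math.sqrt(variance)
print(round(std_dev, 2))  # 147.32
```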

Page 49

Standard Deviation Formula (population)

σ = √( (1/N) Σ (xᵢ − μ)² ), with the sum running from i = 1 to N

• μ = the mean
• xᵢ = an individual value of the dataset
• (xᵢ − μ)² = for each value, subtract the mean and square the result
• N = the total number of values in the dataset
• i = 1 = start at this value (here, the first number of the dataset)
• Σ = add up all the values
• 1/N = divide by the total number of values in the dataset
• √ = take the square root of the whole calculation

Page 50

Standard Deviation Formula (sample)

s = √( (1/(n − 1)) Σ (xᵢ − x̄)² ), with the sum running from i = 1 to n

The sample formula divides by n − 1 rather than by the sample size n.

Page 51

Standard Deviation is the most reliable measure of dispersion

• Depending on the situation, not all measures of dispersion are equally reliable.

• For example, the range can sometimes be misleading when there are extremely high or low values.
  – Example: In {8, 11, 5, 9, 7, 6, 3616} the lowest value is 5, and the highest is 3616. So the range is 3616 − 5 = 3611.

• However: the single value of 3616 makes the range large, but most values are around 10.

• So we may be better off using other measures, such as the standard deviation = 1262.65

Page 52

Normal Distribution and Standard Deviation

Page 53

Standard Deviation vs Variance

• A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.

• In other words: the standard deviation is expressed in the same units as the mean, whereas the variance is expressed in square units. So the standard deviation is more intuitive…

• Note that a normal distribution with mean = 10 and standard deviation = 3 is exactly the same thing as a normal distribution with mean = 10 and variance = 9.

• Watch out and be clear about which one you are using!

Page 54

Quiz: Standard Deviation

68% of the frequency values of the word “and” in a corpus of email (assume emails have equal length) are between 51 and 64. Assuming this data is normally distributed, what are the mean and standard deviation?

1. Mean = 57; S.D. = 6.5

2. Mean = 57.5 ; S.D. = 6.5

3. Mean = 57.5; S.D. = 13

Page 55

These notions will be taken up again later...

• … when dealing with statistical inference and other statistical methods.

• Standard Deviation Calculator: http://www.mathsisfun.com/data/standard-deviation-calculator.html

Page 56

NORMALIZATION AND STANDARDIZATION

Digression: Pills of Statistics

Page 57

Normalization

• To normalize data means to fit the data within unity, so all the data will take on a value between 0 and 1. Many formulas are available:

• Ex: min-max scaling: x′ = (x − min) / (max − min)
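A sketch of min-max normalization in Python (assuming min-max scaling is the formula intended here; as the slide notes, other normalization formulas exist):

```python
def min_max_normalize(values):
    """Linearly rescale values so they all fall between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

result = min_max_normalize([2, 5, 8, 11])
print(result[0], result[-1])  # 0.0 1.0
```

The smallest value always maps to 0 and the largest to 1; everything else lands proportionally in between.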

Page 58

Standardization

• Standardization converts all variables to a common scale and reflects how many standard deviations from the mean a data point falls.

• The number of standard deviations from the mean is also called the "Standard Score", "sigma" or "z-score".

Page 59

How to standardize

z = (x − μ) / σ

• z is the "z-score" (Standard Score)

• x is the value to be standardised

• μ is the mean

• σ is the standard deviation
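A sketch of the z-score calculation, reusing mean = 10 and standard deviation = 3 from the earlier normal-distribution slide:

```python
def z_score(x, mu, sigma):
    # How many standard deviations x falls from the mean
    return (x - mu) / sigma

print(z_score(13, 10, 3))  # 1.0  (one standard deviation above the mean)
print(z_score(4, 10, 3))   # -2.0 (two standard deviations below the mean)
```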

Page 60

Why Standardize?

• It can help us make decisions about our data.

Page 61

CHARTS AND GRAPHS
Data Visualization

Page 62

Weka: Data Visualization

Page 63

Outline

• Bar chart

• Histogram

• Pie chart

• Line chart

• Scatter plot

• Dot plot

• Box plot

Page 64

Axes and Coordinates

• The left-right (horizontal) direction is commonly called X, or the abscissa.
• The up-down (vertical) direction is commonly called Y, or the ordinate.

• The coordinates are always written in a certain order: the horizontal distance first, then the vertical distance.


Repetition: read this web page carefully: https://www.mathsisfun.com/data/cartesian-coordinates.html

Page 65

Bar Chart

• A bar chart (also called bar graph) is a graphical display of data using bars of different heights.

• Bar charts are used to graph categorical data. Example:

Page 66

Histogram

• With continuous data, histograms are used.

• Histograms are similar to bar charts, but a histogram groups numbers into ranges.

Page 67

Pie Chart

• A pie chart is a special chart that uses "slices" to show relative sizes of data.

• Pie charts have been criticized.

Page 68

Line Chart

• A line chart is a graph that shows information that is connected in some way (such as change over time).

Page 69

Scatter plot

• A scatter plot has points that show the relationship between two sets of data.

• Example: each dot shows one person's weight versus their height.

Page 70

Line of best fit

• Draw a "Line of Best Fit" (also called a "Trend Line") on the scatter plot to predict values that might not be on the plot.

Page 71

Correlations

• Scatter plots are useful to detect correlations between the sets of data.
  – Correlation is positive when the values increase together

– Correlation is Negative when one value decreases as the other increases

More on scatter plots: https://www.mathsisfun.com/data/scatter-xy-plots.html
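Pearson's correlation coefficient quantifies what the scatter plot shows: values near +1 indicate a strong positive correlation, values near −1 a strong negative one. A sketch with invented paired data:

```python
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]   # roughly doubles as x increases

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)

# Pearson r: covariance divided by the product of the spreads
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
r = num / den
print(round(r, 3))  # close to +1: a high positive correlation
```

Python 3.10+ also provides `statistics.correlation(xs, ys)` for the same coefficient.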

Page 72

Quiz: Scatter Plot

• The correlation seen in the graph at the right would be best described as:

1. high positive correlation

2. low positive correlation

3. high negative correlation

4. low negative correlation

Page 73

Dot Plot

• A dot plot is a graphical display of data using dots.

• It is an alternative to the bar chart, in which dots are used to depict the quantitative values (e.g. counts) associated with categorical variables.

Page 74

Box Plot

• Box plots are useful to highlight outliers, the median, and the interquartile range.

• aka box-and-whisker plots

Page 75

The End
