01 machine learning introduction

Machine Learning for Data MiningIntroduction

Andres Mendez-Vazquez

May 13, 2015

1 / 56

Outline1 Why are we interested in Analyzing Data?

Intuitive Definition: The 3V’sComplexityData Everywhere

2 Machine LearningMachine Learning ProcessFeaturesClassificationClustering Analysis

3 Data MiningDefinitionApplicationsExample: Frequent Itemsets

4 Hardware SupportASICSGPU’s

5 ProjectsWhat projects can you do?

2 / 56

3 / 56

Intuitive Definition: Volume

When looking at the Volumes of Information, we have:Volumes of it: Terabyte(1012), Petabyte(1015) and UP!!!

Examples of these Volumes are1 Records2 Transactions3 Web Searches4 etc

4 / 56

However

Something NotableWhat constitutes truly “high” volume varies by industry and evengeography!!!

Simply look at the DNA data for a cellular cycle.

Example

5 / 56

HoweverSomething NotableWhat constitutes truly “high” volume varies by industry and evengeography!!!

Simply look at the DNA data for a cellular cycle.

Example

5 / 56

Intuitive Definition: Variety

When looking at the Structure of the Information, we have:Variety like there is not tomorrow:

I It is structured, semi-structured, unstructured

SoDo you have some examples of structures in Information?

6 / 56

When Looking at the Velocity of this Information?Data in Motion!!!Velocity:

I Dynamic GenerationI Real Time Generation

Problems with that: LatencyLag time between capture or generation and when it is available!!!

7 / 56

For example

Imagine that I have a stream of m = 1025 integers with Ranges from[a1, ..., an] with n = 10, 000, 000Now, somebody ask you to find the most frequent item!!!

A naive algorithm1 Take hash table with a counter.2 Then, put numbers in the hash table.

ProblemsWhich problems we have?

8 / 56

For example

8 / 56

For example

8 / 56

However

There is theCount-Min Sketch Algorithm

Invented byCharikar, Chen and Farch-Colton in 2004

With PropertiesSpace Used Error Probability Error

1ε log

)· (logm + log n)

)δ ε

9 / 56

However

1ε log

)· (logm + log n)

)δ ε

9 / 56

However

1ε log

)· (logm + log n)

)δ ε

9 / 56

10 / 56

Complexity

Given all these thingsIt is necessary to correlate and share data across entities.It is necessary to link, match and transform data across businessentities and systems.

With this...Complexity goes through the roof!!!

11 / 56

Complexity

11 / 56

Complexity

11 / 56

And it is through the roof!!! Linking open-data communityproject

12 / 56

Cautionary Tale

Something NotableIn 1880 the USA made a Census of the Population in different aspects:

I PopulationI MortalityI AgricultureI Manufacturing

HoweverOnce data was collected it took 7 years to say something!!!

13 / 56

Cautionary Tale

13 / 56

Cautionary Tale

13 / 56

Cautionary Tale

13 / 56

Cautionary Tale

13 / 56

Cautionary Tale

13 / 56

Cautionary Tale

13 / 56

Ahhh...

Thus, Hollering came with the following machine (Circa 1890)!!!

14 / 56

Hollering Tabulating Machine

It was basically a sorter and counterUsing punching cards as memories.And Mercury Sensors.

Example

15 / 56

Hollering Tabulating Machine

It was basically a sorter and counterUsing punching cards as memories.And Mercury Sensors.

Example

15 / 56

It was FAST!!!

It took only!!!

2 years!!!Nevertheless in 1837Babbage’s Difference engine was

The First General Computer!!!Turing-complete!!!Way more complex than the tabulator!!! 53 years earlier!!!

16 / 56

It was FAST!!!

It took only!!!

16 / 56

It was FAST!!!

It took only!!!

16 / 56

It was FAST!!!

It took only!!!

16 / 56

It was FAST!!!

It took only!!!

16 / 56

Funny!!!

17 / 56

The Problem

Actually, it never reached completion because

Babbage was actually a yucky project manager!!!

18 / 56

The Problem

Actually, it never reached completion because

Babbage was actually a yucky project manager!!!

18 / 56

19 / 56

Data is Everywhere!

Lots of data is being collected and warehousedWeb data, e-commercePurchases at department/ grocery storesBank/Credit Card transactionsSocial Network

Many Places

20 / 56

Data is Everywhere!

Many Places

20 / 56

Data is Everywhere!

Many Places

20 / 56

Data is Everywhere!

Many Places

20 / 56

Data is Everywhere!

Many Places

20 / 56

The Staggering Numbers

A Ocean of DataHow many data in the world?

I 800 Terabytes, 2000I 160 Exabytes, 2006I 500 Exabytes (Internet), 2009I 2.7 Zettabytes, 2012I 35 Zettabytes by 2020

GenerationHow many data generated ONEday?

I 7 TB, TwitterI 10 TB, Facebook

Source: “Big data: The next frontier for innovation, competition, and pro-ductivity”McKinsey Global Institute 2011 21 / 56

Type of Data

ThusRelational Data (Tables/Transaction/Legacy Data)Text Data (Web)Semi-structured Data (XML)

And more...Graph Data

I Social Network, Semantic Web (RDF), . . .

Streaming DataI You can only scan the data once

22 / 56

Type of Data

22 / 56

Type of Data

22 / 56

Type of Data

22 / 56

Type of Data

22 / 56

Type of Data

22 / 56

Type of Data

22 / 56

The Ever Growing Landscape

23 / 56

Machine LearningDefinitionAlgorithms or techniques that enable computer (machine) to “learn” fromdata. Related with many areas such as data mining, statistics, informationtheory, etc.

Algorithm Types:Unsupervised LearningSupervised learningReinforcement learning

ExamplesArtificial Neural Network (ANN)Support Vector Machine (SVM)Expectation-Maximization (EM)Deterministic Annealing (DA)

24 / 56

25 / 56

Machine Learning Process

Process1 Feature Extraction/Feature Generation2 Clustering ≈ Class Identification ≈ Unsupervised Learning3 Classification ≈ Supervised Learning

Then...We start thinking: We need to process a lot of data...

Or...LARGE SCALE MACHINE LEARNING

26 / 56

27 / 56

Feature Generation/Dimensionality Reduction

Feature GenerationGiven a set of measurements, the goal is to discover compact andinformative representations of the obtained data.

Examples1 The Karhunen–Loève transform ≈ Principal Component Analysis

1 Popular for feature generation and Dimensionality Reduction2 The Singular Value Decomposition

1 Used for Dimensionality Reduction

28 / 56

Dimension Reduction/Feature Extraction

DefinitionProcess to transform high-dimensional data into low-dimensional ones forimproving accuracy, understanding, or removing noises.

Why?Curse of dimensionality: Complexity grows exponentially in volume byadding extra dimensions.

29 / 56

Dimension Reduction/Feature Extraction

DefinitionProcess to transform high-dimensional data into low-dimensional ones forimproving accuracy, understanding, or removing noises.

Why?Curse of dimensionality: Complexity grows exponentially in volume byadding extra dimensions.

29 / 56

Feature SelectionFeature SelectionWhich features should be used for the classifier?

Why? The Curse of Dimensionality!!!

Hypothesis Testing to discriminate good features

30 / 56

Feature SelectionFeature SelectionWhich features should be used for the classifier?

Why? The Curse of Dimensionality!!!

Hypothesis Testing to discriminate good features30 / 56

What can be done?

Measures for Class SeparabilityExample: Between-class scatter matrix:

Sb =M∑

i=1Pi (µi − µ0) (µi − µ0)T (1)

Where:µ0 is the global mean vector, µ0 =

∑Mi=1 Piµi .

µi the median of class ωi .Pi ∼= ni

31 / 56

What can be done?

Sb =M∑

i=1Pi (µi − µ0) (µi − µ0)T (1)

∑Mi=1 Piµi .

31 / 56

What can be done?

Sb =M∑

i=1Pi (µi − µ0) (µi − µ0)T (1)

∑Mi=1 Piµi .

31 / 56

What can be done?

Sb =M∑

i=1Pi (µi − µ0) (µi − µ0)T (1)

∑Mi=1 Piµi .

31 / 56

What can be done?

Feature Subset SelectionExamples:

I Filter ApproachF All combinations of features are used together with a separability

measure.I Wrapper Approach:

F Use the decided classifier itself to find the best set.

32 / 56

What can be done?

32 / 56

What can be done?

32 / 56

What can be done?

32 / 56

What can be done?

32 / 56

33 / 56

Classification

DefinitionA procedure dividing data into the given set of categories based on thetraining set in a supervised way.

What we want from classification?Generalization Vs. Specification

I Hard to achieve both

Avoid - overfitting/overtrainingI Early stoppingI Holdout validationI K-fold cross validationI Leave-one-out cross-validation

34 / 56

Classification

34 / 56

Classification

34 / 56

Classification

34 / 56

Classification

34 / 56

Classification

34 / 56

Classification

34 / 56

Classification

34 / 56

Avoid - overfitting/overtraining

Validation and Training ErrorUnderfitting Overfitting

Validation Error

Training Error

35 / 56

Examples of Classification Algorithms

Many Possible AlgorithmsLinear Classifiers: PerceptronProbability Classifiers: Naive BayesKernel Methods Classifiers : Support Vector MachinesNon-Linear Classifiers: Artificial Neural NetworksGraph Model Classifiers:. . .

36 / 56

37 / 56

Clustering Analysis

DefinitionGrouping unlabeled data into clusters, for the purpose of inference ofhidden structures or information.

Using, for exampleDissimilarity measurement

Angle : Inner product, . . .Non-metric : Rank, Intensity, . . .Distance : Euclidean (l2), Manhattan(l1), . . .

38 / 56

Clustering Analysis

38 / 56

Clustering Analysis

38 / 56

Clustering Analysis

38 / 56

Clustering Analysis

38 / 56

Example

39 / 56

Examples of Clustering Algorithms

Clustering1 Basic Clustering Algorithms

1 K-means2 Clustering Based in Cost Functions

1 Fuzzy C-means2 Possibilistic

3 Hierarchical Clustering1 Entropy based

4 Clustering Based in Graph Theory

40 / 56

41 / 56

What Is Data Mining?

Data mining (knowledge discovery in databases):Extraction of interesting information or patterns from data in largedatabases.

Alternative names and their “inside stories”:Knowledge discovery(mining) in databases (KDD)Knowledge extractionData/pattern analysisData archeologyBusiness intelligenceetc.

42 / 56

Examples: What is (not) Data Mining?

What is not Data Mining?1 Look up phone number in phone directory2 Query a Web search engine for information about “Amazon”

What is Data Mining?1 Certain names are more prevalent in certain US locations (O’Brien,

O’Rurke, O’Reilly. . . in Boston area)2 Group together similar documents returned by search engine

according to their context (e.g. Amazon rainforest, Amazon.com)

43 / 56

44 / 56

Data mining Applications

ApplicationsMining the Web for Structured DataNear Neighbor Search in High Dimensional Data.

I Frequent itemsets and Association Rules

Structure of the webgraphI PageRankI Link Analysis

Proximity on GraphsMining data streams.Large scale supervised machine learning techniques.

45 / 56

46 / 56

Example: Frequent Itemsets

Based in the Market-Basket Model1 On the one hand, we have items.2 On the other we have baskets, sometimes called “transactions.”

1 Each basket consists of a set of items (an itemset)2 They are small.

Examples1 {Cat, and, dog, bites}2 {Yahoo, news, claims, cat, dog, and, produced, viable, offspring}3 {Cat, killer, likely, is, a, big, dog}4 {Professional, free, advice, on, dog, training, puppy}

47 / 56

Then, we do the followingTransaction ID Cat Dog and a mated

1 1 1 1 0 02 1 1 1 1 13 1 1 0 1 04 0 1 0 0 0

48 / 56

Combinatorial Problem

ProblemHow many subsets we have?

But we can do the followingGiven the itemset x in a database D and a set of transactions {ti}i∈I

supp(x ,D) = |{ti ∈ D|x ∈ ti}| (2)

Then, setting a threshold εHow many frequent (supp(x ,D) > ε) itemsets?

49 / 56

supp(x ,D) = |{ti ∈ D|x ∈ ti}| (2)

49 / 56

supp(x ,D) = |{ti ∈ D|x ∈ ti}| (2)

49 / 56

50 / 56

Hardware Solutions: ASICS

Application-Specific Integrated Circuit (ASIC)An ASIC is an integrated circuit customized for a particular use, ratherthan intended for general-purpose use.

It allows for1 Lower Power Consumption.2 Better Colling Approaches.

Example: From Microsoft Research

51 / 56

52 / 56

Hardware Solutions: GPU’sIDEAS

Based on CUDA parallel computing architecture from NvidiaEmphasis on executing many concurrent LIGHT threads instead ofone HEAVY thread as in CPUs

Hardware for 8800

53 / 56

AdvantagesI Massively parallelI Hundreds of cores, millions of threadsI High throughput

LimitationsI May not be applicable for all tasksI Generic hardware (CPUs) closing the gap

54 / 56

55 / 56

Projects

Possible topic are:Oil exploration detection.Association Rule Preprocessing Project.Neural Network-Based Financial Market Forecasting Project.Page Ranking - Improving over the Google MatrixInfluence Maximization in Social Networks.Web Word Relevance Measures.Recommendation Systems.

There are more possibilities at https://www.kaggle.com/competitions

56 / 56

Projects

56 / 56

Projects

56 / 56

Projects

56 / 56

Projects

56 / 56

Projects

56 / 56

Projects

56 / 56

Projects

56 / 56

01 machine learning introduction

Engineering