Intro to machine learning with scikit-learn
1
Yossi Cohen
Machine Learning with Scikit-learn
2
INTRO TO ML PROGRAMMING
3
ML Programming
1. Get data (and labels, for supervised learning)
2. Create a classifier
3. Train the classifier
4. Predict test data
5. Evaluate predictor accuracy
* Configure and improve by repeating steps 2-5
4
The ML Process
[Diagram: process stages labeled Filter Outliers, Partition, Model (Regression / Classify), Validate, Configure]
5
Get Data & Labels
• Sources
–Open data sources
–Collect on your own
• Verify data validity and correctness
• Wrangle data
–Make it readable by computer
–Filter it
• Remove outliers
The pandas Python library can assist with pre-processing and data manipulation before ML: http://pandas.pydata.org/
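As a rough illustration of the wrangling steps above, here is a minimal pandas sketch. The table, its column names, and the 0-120 age range used to flag outliers are all assumptions made up for this example, not part of the original material.

```python
import pandas as pd

# Hypothetical raw survey data; column names and values are made up for illustration.
raw = pd.DataFrame({
    "age": [3, 7, 76, 11, None, 37, 56, 2, 430],
    "loves_nutella": ["T", "T", "F", "T", "F", "F", "F", "T", "T"],
})

# Verify validity: drop rows with missing values.
clean = raw.dropna()

# Remove outliers: keep only plausible ages (the 0-120 range is an assumption).
clean = clean[clean["age"].between(0, 120)]

# Make it readable by computer: turn the "T"/"F" strings into booleans.
clean = clean.assign(loves_nutella=clean["loves_nutella"] == "T")

data = clean["age"].tolist()
labels = clean["loves_nutella"].tolist()
print(data)
print(labels)
```

Real data usually needs more careful outlier rules than a fixed range, so treat the threshold as a placeholder.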
6
Pre-Processing
• Change formatting
• Remove redundant data
• Filter data (take partial data)
• Remove outliers
• Label
• Split for testing (10/90, 20/80)
7
Data Partitioning
• Data and labels
–{[data], [labels]}
–{[3, 7, 76, 11, 22, 37, 56, 2], [T, T, F, T, F, F, F, T]}
–Data: age; labels: "Do you love Nutella?"
• Partitioning will create
–{[train data], [train labels], [test data], [test labels]}
–We usually split the data at a ratio of 9:1 (train:test)
–There is a tradeoff between the effectiveness of the test and the amount of learning we can provide to the classifier
• We will look at a partitioning function later
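One possible partitioning function is scikit-learn's `train_test_split`. The sketch below applies it to the age / Nutella example. The 25% test share and the `random_state` value are assumptions chosen so that this tiny 8-sample set yields a 6/2 split. On a larger data set, `test_size=0.1` would give the 9:1 ratio.

```python
from sklearn.model_selection import train_test_split

# Each sample is a list of features; here there is a single feature (age).
data = [[3], [7], [76], [11], [22], [37], [56], [2]]
labels = [True, True, False, True, False, False, False, True]

# Note the return order: train data, test data, train labels, test labels.
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.25, random_state=0)

print(len(train_data), len(test_data))
```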
8
Learn (The “Smart Part”)
Classification: if the output is discrete, limited to a fixed number of classes (groups)
Regression: if the output is continuous
9
Learn Programming
10
Create Classifier
For most SUPERVISED LEARNING algorithms this would be
C = ClassifyAlg(Params)
It's up to us (ML guys) to set the best params.
How?
1. We could develop a hunch for it
2. Perform an exhaustive search
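The exhaustive search option could look like the sketch below, which uses scikit-learn's `GridSearchCV`. The choice of `KNeighborsClassifier`, the candidate values for `n_neighbors`, the 5-fold cross-validation, and the use of the bundled Iris data are all assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Candidate parameter values to try; every combination is evaluated.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# 5-fold cross-validation scores each candidate on held-out folds.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(iris.data, iris.target)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

The grid grows multiplicatively with every extra parameter, which is why a good hunch for sensible ranges still matters.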
11
Train the classifier
We assigned
C = ClassifyAlg(Params)
This is a general algorithm with some initialization and configuration.
In this stage we train it using:
C.fit(Data, Labels)
12
Predict
Once we have a trained classifier C:
Predicted_Labels = C.predict(Data)
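To make the create / fit / predict pattern concrete, here is a small sketch on the age / Nutella example. `KNeighborsClassifier` with `n_neighbors=3` stands in for the generic `ClassifyAlg(Params)`; that choice, and the two new ages being predicted, are assumptions for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Training data: one feature (age) per sample, with a True/False label.
Data = [[3], [7], [76], [11], [22], [37], [56], [2]]
Labels = [True, True, False, True, False, False, False, True]

# Create the classifier, then train it.
C = KNeighborsClassifier(n_neighbors=3)
C.fit(Data, Labels)

# Predict labels for two unseen ages.
Predicted_Labels = C.predict([[5], [60]])
print(Predicted_Labels)
```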
13
Predictor Evaluation
We are not done yet.
There is a need to evaluate the predictor's accuracy in comparison to other predictors and to the system requirements.
We will learn several methods for this
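One simple evaluation method is the fraction of correct predictions on held-out test data. The sketch below uses scikit-learn's `accuracy_score` for this. The classifier, the 20/80 split, the stratification, and the Iris data are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out 20% of the samples; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0,
    stratify=iris.target)

C = KNeighborsClassifier(n_neighbors=5)
C.fit(X_train, y_train)

# Accuracy = correct predictions / total predictions, on data not seen in training.
accuracy = accuracy_score(y_test, C.predict(X_test))
print(accuracy)
```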
14
ENVIRONMENT
15
The Environment
• There are many existing environments and tools we could use
–Matlab with Machine learning toolbox
–Apache Mahout
–Python with Scikit-learn
• Additional tools
–Hadoop / Map-Reduce to accelerate and parallelize large data set processing
–Amazon ML tools
–NVIDIA Tools
16
Scikit-learn
• Installation instructions: http://scikit-learn.org/stable/install.html#install-official-release
• Depends on two other libraries: numpy and scipy
• Easiest way to install on Windows: install WinPython
http://sourceforge.net/projects/winpython/files/WinPython_2.7/2.7.9.4/
–Let's install this together
• For Linux / Mac computers, just install the 3 libs separately using pip
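After installing, a quick way to check that the three libraries are importable is a snippet like the one below. It only prints whatever versions happen to be installed and makes no assumption about which ones those are.

```python
import numpy
import scipy
import sklearn

# Collect the installed version of each library.
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(name, version)
```

An `ImportError` here would mean the corresponding library is missing from the active Python environment.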
17
THE DATA
18
Data sets
There are many data sets to work on.
One of them is the Iris data set: classification of iris flowers into three groups. It has an interesting story you could google later.
We'll work on the Iris data
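As a first look at the data, this short sketch loads the copy of Iris bundled with scikit-learn and prints its dimensions, feature names, and class names.

```python
from sklearn import datasets

iris = datasets.load_iris()

# 150 samples, each with 4 measurements.
print(iris.data.shape)

# Names of the 4 measured features (sepal and petal length and width).
print(iris.feature_names)

# Names of the three groups the samples are classified into.
class_names = list(iris.target_names)
print(class_names)
```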
19
Lab A – Plot the Iris data
Plot sepal length vs sepal width with labels ONLY
How? Google Iris data and the scikit-learn environment
Try to understand the second part of the program with the PCA
20
Iris Data
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
21
Plot Iris Data
plt.figure(2, figsize=(8, 6))
plt.clf()
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
22
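Add PCA for better classification
from sklearn.decomposition import PCA  # PCA is used below, so import it

fig = plt.figure(1, figsize=(8, 6))
# Recent matplotlib no longer attaches Axes3D(fig, ...) to the figure, so use add_subplot
ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)
# Project the four iris features onto the first three principal components
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
           cmap=plt.cm.Paired)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])  # w_xaxis was removed in recent matplotlib
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
plt.show()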
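To help understand what the PCA step does, the sketch below fits the same three-component PCA and prints how much of the data's variance each principal direction captures. Inspecting `explained_variance_ratio_` is an addition for illustration and is not part of the lab program.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Fit PCA on all four features and keep the first three directions.
pca = PCA(n_components=3).fit(iris.data)

# Fraction of the total variance captured by each direction, largest first.
ratios = pca.explained_variance_ratio_
print(ratios)
```

A direction with a large ratio carries most of the spread in the data, which is why the first axis of the 3D plot separates the groups so visibly.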
23
Iris Data Classified
24
25
Thank you!
More about me:
Yossi Cohen
+972-545-313092
Video compression and computer vision enthusiast & lecturer
Surfer