TRANSCRIPT
EECS 6890 – Topics in Information Processing, Spring 2014, Columbia University
http://rogerioferis.com/VisualRecognitionAndSearch2014
Jun Wang, Jan 30
Visual Recognition and Search
Visual Recognition And Search, Columbia University, Spring 2014
Brief Introduction
• About Me
– PhD from the EE Dept., Columbia Univ., 2011
“Semi-Supervised Learning for Scalable and Robust Visual Search”
– Research Staff Member (2010 - Present)
Business Analytics and Mathematical Sciences,
IBM T. J. Watson Research Center
• About You - Background
– Machine learning
– Linear Algebra
– Optimization
– Probability and Statistics
Lecture 2: Machine Learning Fundamentals
• Definition
– A branch of artificial intelligence that concerns the construction
and study of systems that can learn from data (Wikipedia)
• Related Columbia Courses
– Machine Learning COMS 4771
http://www.cs.columbia.edu/~jebara/4771/
• Book
C. M. Bishop, Pattern Recognition and
Machine Learning, Springer, 2006
Overview
• Machine learning and data mining
• Representative machine learning problems
– Classification, clustering analysis, regression,
dimensionality reduction, metric learning, feature learning,
matrix completion, graph learning, ensemble learning,
kernel learning
• Major learning paradigms
– Supervised learning
– Unsupervised learning
– Semi-supervised learning
Outline
• Regression and Classification
• Clustering
• Semi-supervised learning
• Dimensionality Reduction
• Metric Learning
Linear Regression
• Linear Regression
– Training data
– Linear model
• Least Squares
– Squared error
– Optimal solution
Demo!
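The regression formulas on this slide appear to have been rendered as images and did not survive transcription. As a minimal sketch (not the slide's own demo, and with hypothetical toy data), a 1-D least-squares fit of a linear model y ≈ w·x + b has a closed-form solution:

```python
def least_squares_1d(xs, ys):
    # closed-form minimizer of the squared error sum_i (y_i - (w*x_i + b))^2
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # w = cov(x, y) / var(x); b = mean(y) - w * mean(x)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

# hypothetical training data lying exactly on y = 2x + 1
w, b = least_squares_1d([0, 1, 2, 3], [1, 3, 5, 7])
# w = 2.0, b = 1.0
```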
Logistic Regression
• Background
– 1936: Fisher method (linear discriminant analysis)
– 1940s: logistic regression
• Settings
– Input/observation: continuous variables
– Output/response: binary variable
• Example
– X = [0.0 0.2 0.7 1.0 1.1 1.4 1.5 1.7 2.1 2.5]';
– Y= [0 0 0 0 0 1 1 1 1 1]';
Logistic Regression
• Logistic Sigmoid Function
– S-shaped curve with outputs in (0, 1)
– Derivative
– Regression function (generalized linear models)
• Maximum Likelihood Estimation
– Logistic loss
– Iterative process to estimate the parameters (demo)
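The sigmoid and loss formulas here were slide images; as a sketch of the iterative estimation (plain gradient ascent, which is an assumption — the slide does not say which iterative method it uses), fitted on the X/Y example from the previous slide:

```python
import math

def sigmoid(z):
    # numerically safe logistic sigmoid, range (0, 1)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, lr=0.1, iters=3000):
    # maximize the log-likelihood by gradient ascent on (w, b)
    w = b = 0.0
    for _ in range(iters):
        gw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
        gb = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys))
        w += lr * gw
        b += lr * gb
    return w, b

# the example data from the previous slide
X = [0.0, 0.2, 0.7, 1.0, 1.1, 1.4, 1.5, 1.7, 2.1, 2.5]
Y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
w, b = fit_logistic(X, Y)
# the fitted decision rule sigmoid(w*x + b) > 0.5 separates the two label groups
```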
Linear Classification
• Linear Classifier
– Training data
– Linear classification function
• Hinge Loss
– maximum-margin classification
– Classification score
Support Vector Machine (SVM)
• Definitions
– Classification hyperplane
– Positive margin hyperplane
– Negative margin hyperplane
– Margin between the positive and negative margin hyperplanes
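The hyperplane equations on this slide were figures lost in transcription; the standard definitions, assuming a linear classifier with weight vector w and bias b (symbols not taken from the slide), are:

```latex
% classification hyperplane
w^\top x + b = 0
% positive / negative margin hyperplanes
w^\top x + b = +1, \qquad w^\top x + b = -1
% margin between the two hyperplanes
\text{margin} = \frac{2}{\|w\|}
```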
SVM Objective: Maximum-Margin
• Equivalent to minimizing
• Recall we have the training data
• Recall hinge loss
• Final objective
• Quadratic programming (quadprog function in Matlab)
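The final objective on this slide was an image; the standard maximum-margin objective with hinge loss (notation assumed, consistent with the hyperplane definitions above) is:

```latex
% maximizing the margin 2/||w|| equals minimizing ||w||^2 / 2
\min_{w,\,b}\ \frac{1}{2}\|w\|^2
  + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr)
```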
SVM: Sketch Derivation of the Dual Form
• Primal problem
• Lagrange method
• SVM dual problem
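The derivation steps were slide images; the standard sketch for the separable case (multipliers α assumed, not from the slide) is:

```latex
% primal (separable case)
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
  \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1
% Lagrangian with multipliers \alpha_i \ge 0
\mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_i \alpha_i \bigl[ y_i\,(w^\top x_i + b) - 1 \bigr]
% stationarity: w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0
% dual problem
\max_{\alpha \ge 0}\ \sum_i \alpha_i
  - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
  \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0
```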
SVM: Primal and Dual Problems
• The SVM linear classifier is learned by solving the following
optimization problem
• SVM dual form: learn a linear classifier by solving an
optimization problem over the dual variables
SVM: Primal and Dual Problems
• Primal problem: solve for the primal variable
• Dual problem: solve for the dual variable
• The learned dual variable is often sparse, with few non-zero
elements
• Non-zero elements correspond to support vectors
• A sparse solution gives an efficient classification process
• A sparse solution also indicates better generalization
Non-Separable SVM
• The SVM above handles separable cases
• Data are often not linearly separable
• Relax the hard constraints with slack variables
• Penalize the slack
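The slack formulation was a slide image; the standard soft-margin objective (slack variables ξ assumed) is:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,
\qquad \xi_i \ge 0
```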
Nonlinear SVM: Kernelization
• Data are often not linearly separable
• The power of kernelization
– Mapping the data to a higher-dimensional space
– Quadratic polynomial
• Learn a linear classifier with a feature map
Nonlinear SVM: Dual Form
• Recall the dual form
• Nonlinear SVM with feature map
• Nonlinear SVM dual problem
• Observation: inner product of feature map
Nonlinear SVM: Kernel Trick
• Quadratic polynomial
• Kernel trick
– No need to compute the feature map explicitly
– Explicitly computing the feature map is often infeasible
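The quadratic-polynomial example was a slide image; the standard identity it illustrates (for 2-D inputs, feature map φ assumed) is:

```latex
% for x, z \in \mathbb{R}^2, the feature map
\phi(x) = \bigl( x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1 \bigr)
% never needs to be computed explicitly, because
\langle \phi(x), \phi(z) \rangle = (x^\top z + 1)^2 = k(x, z)
```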
Nonlinear SVM: Exemplar Kernels
• Linear kernel
• Polynomial kernel
– All polynomial terms up to degree
• Gaussian kernel (Radial Basis Function)
– Infinite dimensional feature map
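The kernel formulas were slide images; the standard forms (degree d and bandwidth σ assumed) are:

```latex
k_{\text{lin}}(x, z) = x^\top z
k_{\text{poly}}(x, z) = (x^\top z + 1)^d
  \quad \text{(all polynomial terms up to degree } d\text{)}
k_{\text{rbf}}(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)
```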
SVM with RBF Kernel
• Classification function
• RBF SVM
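The classification function was a slide image; in the dual variables, the standard RBF-SVM decision rule (α, b assumed from the dual solution) is:

```latex
f(x) = \operatorname{sign}\Bigl( \sum_{i} \alpha_i\, y_i\, k(x_i, x) + b \Bigr),
\qquad k(x_i, x) = \exp\!\left( -\frac{\|x_i - x\|^2}{2\sigma^2} \right)
```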
SVM: Resources
• SVM video demo
http://www.youtube.com/watch?v=3liCbRZPrZA
• Steve Gunn’s SVM package
http://www.isis.ecs.soton.ac.uk/resources/svminfo/
• LibSVM
a comprehensive SVM package
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Summary: Loss Functions
• Quadratic loss
• Hinge loss
• 0-1 loss
– Logical indicator (1 if true, 0 if false)
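The loss formulas were slide images; the standard forms, for label y ∈ {−1, +1} and classifier score f(x) (symbols assumed), are:

```latex
\ell_{\text{quad}}(y, f(x)) = \bigl( y - f(x) \bigr)^2
\ell_{\text{hinge}}(y, f(x)) = \max\bigl( 0,\ 1 - y\,f(x) \bigr)
\ell_{0\text{-}1}(y, f(x)) = \mathbb{1}\bigl[\, y\,f(x) < 0 \,\bigr]
```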
Outline
• Regression and Classification
• Clustering
• Semi-supervised learning
• Dimensionality Reduction
• Metric Learning
Clustering – Unsupervised Learning
• Definition (Wikipedia)
– Clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar to each
other than to those in other groups
• A popular tool for exploratory data mining in various
applications
– Client/customer grouping for better marketing
– User/product grouping for better recommendation
– Patient population grouping for improving healthcare service delivery
– Social analytics: crime, education
– Science: genotype assignment, chemical compound grouping,
climatology
– …
K-Means Clustering
• A well-known and simple method for clustering data
• An iterative process
a) Estimate the cluster centers (locations of the clusters)
b) Calculate each data point's cluster membership
c) Repeat a) and b) until nothing changes
• Matlab: IDX = kmeans(X, k)
http://en.wikipedia.org/wiki/K-means_clustering
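The iterative process above can be sketched in plain Python (a minimal stand-in for Matlab's `kmeans`, with hypothetical toy points; a deterministic "first k points" initialization is assumed, whereas real implementations randomize):

```python
def kmeans(points, k, iters=100):
    # a) initialize cluster centers with the first k points
    centers = list(points[:k])
    for _ in range(iters):
        # b) assign each point to its nearest center (squared Euclidean distance)
        labels = [
            min(range(k), key=lambda j: sum((p - c) ** 2
                                            for p, c in zip(pt, centers[j])))
            for pt in points
        ]
        # a) re-estimate each center as the mean of its member points
        new_centers = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            mean = (tuple(sum(coord) / len(members) for coord in zip(*members))
                    if members else centers[j])
            new_centers.append(mean)
        # c) repeat until nothing changes
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# two well-separated toy blobs
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, 2)
# the first three and the last three points end up in different clusters
```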
K-Means Application: Bag-of-Visual-Words
• Employ K-means to extract visual key words
From Y.-G. Jiang’s slides
Hierarchical Clustering
• Clustering data points and building a hierarchy of
clusters
– Rely on a distance/similarity measure
– Agglomerative: bottom-up approach
– Divisive: top-down approach
• Example
Maximum Likelihood Estimation
• Given data, estimate the underlying distributions
• Gaussian distribution
• Parameter estimation
Mixture of Gaussians
• Data are generated from a mixture of Gaussian models
• Expectation Maximization (EM)
– E-step
– M-step
(hidden variables encode each sample's cluster membership)
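The E-step/M-step formulas were slide images; a standard sketch for a K-component Gaussian mixture (responsibilities γ and weights π assumed) is:

```latex
% mixture density with weights \pi_k
p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)
% E-step: responsibilities (posterior of the hidden cluster variable)
\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
% M-step: re-estimate the parameters
N_k = \sum_i \gamma_{ik}, \qquad
\mu_k = \frac{1}{N_k} \sum_i \gamma_{ik}\, x_i, \qquad
\pi_k = \frac{N_k}{n}
```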
Clustering on a Nonlinear Data Manifold
• Beyond linearly separable data
• Graphs
• Similarity matrix
Graph Partition
• Transform the data into a similarity graph
– Each graph node is a data point
– Each graph edge measures pair-wise similarity
• Clustering can be viewed as partitioning the similarity graph
Code: http://www.cis.upenn.edu/~jshi/software/
Spectral Clustering
• Spectral graph theory
– Graph Laplacian
– The eigenvalues and eigenvectors of the graph Laplacian
provide structure and connectivity information about the graph
• Algorithm sketch
– Construct a graph to obtain a similarity matrix
– Compute the eigenvalues and eigenvectors of the Laplacian matrix
– Perform K-means clustering using the leading eigenvectors as data
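The algorithm sketch above can be written out in a few lines of NumPy (a minimal sketch with a hypothetical two-clique similarity matrix; the unnormalized Laplacian and a deterministic farthest-point K-means initialization are assumptions, not details from the slide):

```python
import numpy as np

def spectral_clustering(W, k, iters=100):
    # unnormalized graph Laplacian L = D - W
    d = W.sum(axis=1)
    L = np.diag(d) - W
    # eigenvectors for the k smallest eigenvalues carry the connectivity structure
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    # plain K-means on the embedded rows, deterministic farthest-point init
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1),
                           axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# similarity graph made of two disconnected cliques
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
labels = spectral_clustering(W, 2)
# the two cliques are recovered as the two clusters
```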
Break
Outline
• Regression and Classification
• Clustering
• Semi-supervised learning
• Dimensionality Reduction
• Metric Learning
Semi-Supervised Learning (SSL) Overview
• Motivation
– Data is abundant
– Labels are expensive to obtain
– Can unlabeled data help classification?
• Key assumptions of SSL
– Smoothness: yields a preference for
decision boundaries in low-density regions
– Cluster/manifold assumption: data tend to form discrete
clusters or lie on a low-dimensional manifold
SVM with Unlabeled Data
• Recall the standard SVM
– Training data
– Learn a linear classifier
– SVM primal
• Transductive SVM
– Training data
Graph-Based SSL: Graph Propagation
• Label propagation with graphs
[Figure: input samples with sparse labels → label propagation on a graph → label inference results (unlabeled / positive / negative)]
Graph Propagation: Notation and Example
• Weight matrix
• Node degree matrix
• Label matrix (samples × classes)
[Figure: a fraction of the constructed graph]
Regularization with Graph Laplacian
• Graph Laplacian
• Normalized graph Laplacian
• An operator measuring the smoothness of a function
over the graph (Chung, Spectral Graph Theory, 1997)
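The Laplacian definitions were slide images; the standard forms, for weight matrix W and degree matrix D, are:

```latex
L = D - W, \qquad D_{ii} = \sum_j W_{ij}
\mathcal{L} = D^{-1/2} L\, D^{-1/2} = I - D^{-1/2} W D^{-1/2}
% smoothness of a function f over the graph
f^\top L f = \frac{1}{2} \sum_{i,j} W_{ij}\, (f_i - f_j)^2
```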
SSL with Graph Regularization
• The prediction function is estimated by optimizing a cost
function that combines an empirical loss with a function-smoothness term
• Two representative methods
– Gaussian Fields and Harmonic Functions (Zhu et al., ICML 2003)
– Local and Global Consistency (LGC) (Zhou et al., NIPS 2004)
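The cost function was a slide image; as a sketch in the style of LGC (prediction matrix F, label matrix Y, and trade-off μ assumed), the two terms look like:

```latex
\min_{F}\
\underbrace{\frac{1}{2} \sum_{i,j} W_{ij}
  \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2}
  _{\text{function smoothness}}
\; + \;
\mu \underbrace{\sum_{i} \left\| F_i - Y_i \right\|^2}_{\text{empirical loss}}
```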
Graph-Based Ranking
• A graph is more accurate for relevance ranking
• Ranking by geodesic distance (Zhou, Weston, et al., NIPS 2004)
Visual Search Using Graph Propagation
• Application: interactive visual search (Wang et al., ICML 2008)
Outline
• Regression and Classification
• Clustering
• Semi-supervised learning
• Dimensionality Reduction
• Metric Learning
Dimensionality Reduction - Embedding
• Objective: reduce the number of random variables
– Input
– Output
– Principle: minimize reconstruction error, preserve locality
• Two general categories of methods
– Linear DR: PCA and LDA
– Nonlinear DR: LLE and ISOMAP
• Three learning paradigms
– Unsupervised: PCA
– Supervised: LDA
– Semi-supervised
Locally Linear Embedding
• Objective: Preserve local linear structure
• Algorithm
– Step 1: find the k nearest neighbors of each data point
– Step 2: find the weight matrix with minimum reconstruction error
– Step 3: find the embedding with minimum reconstruction error
http://www.cs.nyu.edu/~roweis/lle/
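The reconstruction-error objectives were slide images; the standard LLE forms, with N(i) the k-nearest-neighbor set of point i (notation assumed), are:

```latex
% Step 2: reconstruction weights over the k nearest neighbors N(i)
\min_{W}\ \sum_i \Bigl\| x_i - \sum_{j \in N(i)} W_{ij}\, x_j \Bigr\|^2
\quad \text{s.t.}\quad \sum_{j} W_{ij} = 1
% Step 3: low-dimensional embedding with the weights fixed
\min_{Y}\ \sum_i \Bigl\| y_i - \sum_{j \in N(i)} W_{ij}\, y_j \Bigr\|^2
```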
ISOMAP
• Objective: find an embedding that preserves the
global nonlinear geometry of the data
– Calculate the geodesic inter-point distances on the manifold
– Perform multidimensional scaling (MDS) to derive an embedding
that preserves the geodesic distances
http://isomap.stanford.edu/
Outline
• Regression and Classification
• Clustering
• Semi-supervised learning
• Dimensionality Reduction
• Metric Learning
Distance Metric Learning
• Motivation
– Semantic gap: the semantic description often differs from the
feature representation
• Applications
– Nearest neighbor search
– Clustering (K-means)
– Graph learning
– Classification (SVM)
Mahalanobis Distances
• Squared Euclidean distance
• Mahalanobis distances
• Many metric learning approaches use the above
form.
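The distance formulas were slide images; the standard forms, with a positive semi-definite matrix M (symbol assumed), are:

```latex
d_{\text{euc}}^2(x_i, x_j) = (x_i - x_j)^\top (x_i - x_j)
d_{M}^2(x_i, x_j) = (x_i - x_j)^\top M\, (x_i - x_j),
\qquad M \succeq 0
```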
Metric Learning and Linear Projection
• Cholesky decomposition
• Rewrite the Mahalanobis distance as
• The Mahalanobis distance can be viewed as a squared
Euclidean distance after a linear projection
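This equivalence can be checked numerically (a sketch with a hypothetical 2×2 metric matrix; note NumPy's Cholesky returns the lower-triangular factor L with M = L·Lᵀ, so the projection is z → Lᵀz):

```python
import numpy as np

# a hypothetical positive-definite metric matrix M and two sample points
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

# Mahalanobis distance in quadratic form: (x - y)^T M (x - y)
diff = x - y
d_quad = float(diff @ M @ diff)

# Cholesky factor M = L L^T, so the same distance is a squared
# Euclidean distance after the linear projection z -> L^T z
L = np.linalg.cholesky(M)
d_proj = float(np.sum((L.T @ diff) ** 2))
# d_quad == d_proj
```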
Large Margin Metric Learning
• Problem setting and formulation
– Given similar sample pairs
– Minimize the distance between similar sample pairs
– Satisfy relative distance constraints
• Objective function
Large Margin Metric Learning
• LMML objective
• Formulation with slack variables
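The objective was a slide image; as a sketch of a common large-margin metric-learning formulation (in the style of LMNN; the set S of similar pairs, triples (i, j, k) for the relative constraints, and slack ξ are all assumed notation):

```latex
\min_{M \succeq 0,\ \xi \ge 0}\
\sum_{(i,j) \in \mathcal{S}} d_M^2(x_i, x_j)
  + C \sum_{(i,j,k)} \xi_{ijk}
\quad \text{s.t.}\quad
d_M^2(x_i, x_k) - d_M^2(x_i, x_j) \ge 1 - \xi_{ijk}
```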
Announcement
• Please form groups of two students and send us via
email (one email per group):
– The names of the members in the group, and
– Three preferred presentation topics, as soon as possible (no
later than Feb 06)
– Each group will have to prepare presentations in class and
work together on a project
• List of paper presentation topics and information
about length of presentations have been posted
(check the presentations page)
• Required reading for next class (check the schedule
page)