Transfer Learning: Fundamentals and Applications
Sinno Jialin Pan (Ph.D.) Nanyang Assistant Professor,
School of Computer Science and Engineering, NTU, Singapore Homepage: http://www.ntu.edu.sg/home/sinnopan
Outline
• What is transfer learning?
• Transfer learning methodologies
– Instance-based, Feature-based, Parameter-based, Relational
• Transfer learning applications
– Sentiment analysis, WiFi localization, Software defect prediction
• Future directions
Transfer of Learning: Psychological point of view
• The study of the dependency of human conduct, learning, or performance on prior experience.
– [Thorndike and Woodworth, 1901] explored how individuals transfer learning from one context to another context that shares similar characteristics.
Transfer Learning Machine learning community
• Inspired by the human ability to transfer learning
• The ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks that share some commonality
A Motivating Example
• Goal: train a robot to accomplish Task 𝑻1 in an indoor environment 𝑬1 using machine learning techniques
• Sufficient training data is required: sensor readings that measure the environment, as well as human supervision, i.e., labels
• A predictive model can be learned and used in the same environment
Task 𝑻1 in environment 𝑬1
Background Review
(Figure: a timestamped reading from five sensors, e.g., -30dBm, -12dBm, -50dBm, -44dBm, -99dBm)
Input feature vector: 𝒙_i ∈ ℝ^5
Corresponding label: 𝑦_i
A labeled instance: (𝒙_i, 𝑦_i)
A set of labeled instances: {(𝒙_i, 𝑦_i) | 𝑖 = 1, …, 𝑁}
A predictive model: 𝑓: 𝒙 → 𝑦 OR 𝑃(𝑦|𝒙)
A Motivating Example (cont.)
• Limitations of traditional machine learning techniques:
1) Performance relies heavily on whether sufficient labeled data is available to build a predictive model
2) When the environment changes (e.g., a new domain or a new task), the learned predictive model performs poorly
• A new predictive model has to be rebuilt from scratch
(Figure: the environment changes to 𝑬2, the task changes to 𝑻2, or a new robot is deployed)
To train the robot from scratch? Expensive & time consuming!
(Figure: a source task with model 𝒇𝑺 and a target task with model 𝒇𝑻)
Transfer Learning (cont.)
• Assumption: training and test data are assumed to be
– Represented in the same feature space, AND
– Drawn from the same data distribution
• In practice: training and test data often come from different domains
– Represented in different feature spaces, OR
– Drawn from different data distributions
(Figure: directly applying the source model 𝒇𝑺 to the target task fails; a machine with transfer learning ability can adaptively transfer 𝒇𝑺 to obtain 𝒇𝑻)
What if machines have transfer learning ability?
Transfer Learning (cont.)
(Figure: in traditional machine learning, a separate model is learned for each domain, and training and test data must come from the same domain; in transfer learning, knowledge learned in training domains, e.g., domains A and B, is transferred to help learning in a different test domain, e.g., domain C)
Transfer Learning (cont.)
• Given a target domain/task, transfer learning aims to
1) identify the commonality between the target domain/task and previous domains/tasks, and
2) transfer knowledge from the previous domains/tasks to the target one, such that human supervision on the target domain/task can be dramatically reduced.
(Figure: sufficient labeled source domain/task data, together with target domain/task data that is unlabeled or only sparsely labeled, is fed into transfer learning algorithms to produce predictive models, which are then tested on target domain/task data)
Other Motivating Examples
(Figure: WiFi received-signal-strength readings, e.g., -30dBm, -70dBm, -40dBm)
Other Motivating Examples (cont.)
• WiFi localization: signal strength changes a lot over different time periods, or across different mobile devices.
(Figure: contours of signal-strength values differ between time periods A and B, and between devices A and B)
Average Error Distance:
– localization model trained and tested in the same setting: ~1.5 meters
– localization model w/o transfer learning: ~10 meters
– localization model w/ transfer learning: ~3-4 meters
Other Motivating Examples (cont.)
• Sentiment analysis: users may use different sentiment words across different domains.
Classification Accuracy:
– sentiment classifier trained and tested in the same domain: ~82%
– sentiment classifier w/o transfer learning: ~70%
– sentiment classifier w/ transfer learning: ~77%
Product reviews on different domains
Electronics:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never buy HP again.
Video Games:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.
TL for Different ML Problems
• Transfer learning for reinforcement learning
• Transfer learning for classification/regression
[Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
[Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
[Pan, Transfer Learning, Chapter 21, Data Classification: Algorithms and Applications 2014]
Transfer Learning Settings
Transfer learning settings, organized by feature space and label availability:
• Heterogeneous feature space → Heterogeneous Transfer Learning
• Homogeneous feature space → Homogeneous Transfer Learning, which further divides into:
– Unsupervised Transfer Learning
– Semi-Supervised Transfer Learning
– Supervised Transfer Learning
Transfer Learning Approaches
Instance-based Approaches
Feature-based Approaches
Parameter-based Approaches
Relational Approaches
Instance-based TL Approaches
General Assumption: the source and target domains have a lot of overlapping features, i.e., the domains share the same or similar support (𝓧𝑺 ≈ 𝓧𝑻)
Instance-based Approaches: Case I
Given a target task, start from the learning framework of empirical risk minimization: 𝜃* = argmin_𝜃 𝔼_{(𝑥,𝑦)∼𝑃𝑇}[ℓ(𝑥, 𝑦, 𝜃)], which can be rewritten as an expectation over the source distribution weighted by 𝑃𝑇(𝑥, 𝑦)/𝑃𝑆(𝑥, 𝑦)
Instance-based Approaches: Case I (cont.)
Correcting Sample Selection Bias / Covariate Shift [Quionero-Candela etal., Dataset Shift in Machine Learning, MIT Press 2009]
Assumption: sample selection bias is caused by the data generation process
Instance-based Approaches: Correcting Sample Selection Bias
• Imagine a rejection sampling process, and view the source domain as samples from the target domain
Instance-based Approaches: Correcting Sample Selection Bias (cont.) • The distribution of the selector variable maps the target onto
the source distribution
– Label instances from the source domain with label 1
– Label instances from the target domain with label 0
– Train a binary classifier
[Zadrozny, ICML-04]
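The three steps above can be sketched with a plain logistic model (a minimal NumPy sketch; the learning rate, iteration count, and the choice of logistic regression as the domain classifier are illustrative assumptions, not part of the original method's specification):

```python
import numpy as np

def domain_classifier_weights(X_src, X_tgt, lr=0.1, n_iter=2000):
    """Estimate importance weights w(x) ~ P(target|x) / P(source|x)
    by training a logistic classifier to separate source (label 1,
    as on the slide) from target (label 0) instances."""
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([np.ones(len(X_src)), np.zeros(len(X_tgt))])
    X1 = np.hstack([X, np.ones((len(X), 1))])  # append bias term
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))      # P(source | x)
        w -= lr * X1.T @ (p - y) / len(y)      # gradient descent step
    Xs1 = np.hstack([X_src, np.ones((len(X_src), 1))])
    p_src = 1.0 / (1.0 + np.exp(-Xs1 @ w))
    # weight for each source instance: P(target|x) / P(source|x)
    return (1.0 - p_src) / np.clip(p_src, 1e-6, None)
```

Source instances that look more like target data receive larger weights, which is exactly the re-weighting the bias-correction framework calls for.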
Instance-based Approaches: Kernel mean matching (KMM)
Maximum Mean Discrepancy (MMD)
[Alex Smola, Arthur Gretton and Kenji Fukumizu, ICML-08 tutorial]
Representing Distributions in RKHS
Suppose 𝑋 ~ 𝑃(𝑥). Can we use a vector 𝜇𝑋 to represent the distribution?
𝜇𝑋 = [𝔼[𝑋]], or 𝜇𝑋 = [𝔼[𝑋], 𝔼[𝑋²]], or 𝜇𝑋 = [𝔼[𝑋], 𝔼[𝑋²], 𝔼[𝑋³]], …
Using all moments, 𝜇𝑋 = [𝔼[𝑋], 𝔼[𝑋²], 𝔼[𝑋³], …] characterizes the distribution, but it is infinite and cannot be explicitly computed!
What about 𝜇𝑋 ∈ ℱ (an RKHS)? The kernel trick can be applied!
Mean Map in RKHS
Suppose 𝑋 ~ 𝑃(𝑥), and denote 𝑘(𝑥,∙) = 𝜙_𝑥
Mean map: 𝜇_𝑃 = 𝔼_{𝑥~𝑃(𝑥)}[𝜙_𝑥], i.e., 𝜇_𝑃: 𝑃 → ∫ 𝜙_𝑥 d𝑃(𝑥), an element of the RKHS ℋ
Empirical mean map for {𝑥_𝑖}_{𝑖=1}^{𝑛}: 𝜇̂_𝑃 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝜙_{𝑥_𝑖} = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑘(𝑥_𝑖,∙)
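The kernel trick makes the mean-map distance computable: expanding the squared RKHS norm of the difference of the two empirical mean maps gives only sums of kernel evaluations. A minimal NumPy sketch (the RBF kernel and its bandwidth are illustrative choices):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Squared MMD: ||mu_P_hat - mu_Q_hat||_H^2 expanded via the kernel
    trick into mean(Kxx) + mean(Kyy) - 2 * mean(Kxy)."""
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```

The value is (near) zero when the two samples come from the same distribution and grows as the distributions drift apart.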
Mean Map in RKHS (cont.)
(Figure: source samples 𝑥_{𝑆1}, …, 𝑥_{𝑆4} ~ 𝑃_𝑆(𝑥) and target samples 𝑥_{𝑇1}, …, 𝑥_{𝑇4} ~ 𝑃_𝑇(𝑥) are mapped into the RKHS via 𝜙; the distance Dist(𝑋_𝑆, 𝑋_𝑇) is measured between the mean embeddings 𝜇_{𝑃_𝑆} and 𝜇_{𝑃_𝑇})
Instance-based Approaches: Kernel mean matching (KMM) (cont.)
[Huang etal., NIPS-06]
The resultant optimization is a simple QP problem
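A minimal sketch of the KMM idea (not the full QP of Huang et al.: the constraint keeping the weights' sum near 𝑛_𝑆 is omitted, and the remaining box-constrained problem is solved by projected gradient descent; both are simplifying assumptions):

```python
import numpy as np

def kmm_weights(X_src, X_tgt, gamma=1.0, B=10.0, n_iter=2000):
    """Kernel mean matching sketch: minimize 0.5 b'Kb - kappa'b subject
    to 0 <= b_i <= B, by projected gradient descent. A full solver would
    also enforce |sum(b) - n_S| <= n_S * eps."""
    def k(A, Bm):
        sq = (A**2).sum(1)[:, None] + (Bm**2).sum(1)[None, :] - 2 * A @ Bm.T
        return np.exp(-gamma * sq)
    n_s, n_t = len(X_src), len(X_tgt)
    K = k(X_src, X_src)
    kappa = (n_s / n_t) * k(X_src, X_tgt).sum(axis=1)
    beta = np.ones(n_s)
    step = 1.0 / np.linalg.eigvalsh(K).max()   # Lipschitz step size
    for _ in range(n_iter):
        grad = K @ beta - kappa
        beta = np.clip(beta - step * grad, 0.0, B)  # project onto the box
    return beta
```

The resulting weights approximate the density ratio 𝑃_𝑇(𝑥)/𝑃_𝑆(𝑥) at the source points, matching the source mean embedding to the target one.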
Instance-based Approaches: Direct density ratio estimation
KL divergence loss Least squared loss
[Sugiyama etal., NIPS-07] [Kanamori etal., JMLR-09]
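The least-squares variant admits a closed-form solution. A sketch in the spirit of uLSIF [Kanamori etal., JMLR-09] (the choice of kernel centers, bandwidth, and regularization strength are illustrative assumptions):

```python
import numpy as np

def ulsif_ratio(X_src, X_tgt, gamma=1.0, lam=0.1):
    """Least-squares density-ratio estimation sketch: model
    r(x) = sum_l theta_l * k(x, c_l) with target points as centers;
    theta = (H_hat + lam * I)^{-1} h_hat in closed form."""
    def k(A, B):
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    C = X_tgt                       # kernel centers
    Phi_s = k(X_src, C)             # n_s x L design matrix on source data
    Phi_t = k(X_tgt, C)
    H = Phi_s.T @ Phi_s / len(X_src)   # second-moment matrix under P_S
    h = Phi_t.mean(axis=0)             # mean basis response under P_T
    theta = np.linalg.solve(H + lam * np.eye(len(C)), h)
    # estimated ratio p_T(x) / p_S(x) at the source points, clipped at 0
    return np.maximum(Phi_s @ theta, 0.0)
```

Unlike the KL-based estimator, no iterative optimization is needed, which is the main practical appeal of the least-squared loss.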
Instance-based Approaches: Case II
• Recall: the joint distributions differ across domains, 𝑃𝑆(𝑥, 𝑦) ≠ 𝑃𝑇(𝑥, 𝑦)
• Intuition: part of the labeled data in the source domain can be reused in the target domain after re-weighting, based on each instance's contribution to the classification accuracy of the learning problem in the target domain
Instance-based Approaches: Case II (cont.)
TrAdaBoost [Dai etal., ICML-07]: for each boosting iteration,
– Use the same strategy as AdaBoost to update the weights of target domain data
– Use a new mechanism to decrease the weights of misclassified source domain data
TrAdaBoost
(Figure: classifiers are trained on the re-weighted union of source and target labeled data, then applied to target domain unlabeled data)
• For misclassified instances:
– increase the weights of the misclassified target data
– decrease the weights of the misclassified source data
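The two opposite weight updates can be sketched in isolation (a minimal NumPy sketch of one TrAdaBoost iteration's re-weighting; the clipping of the target error into (0, 0.5) is an implementation assumption to keep the update well defined):

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost weight update (sketch). err_* are 0/1 arrays that
    mark misclassified instances. Source mistakes are down-weighted with
    a fixed beta; target mistakes are up-weighted AdaBoost-style."""
    n_s = len(w_src)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))
    eps_t = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)  # weighted target error
    eps_t = min(max(eps_t, 1e-10), 0.499)            # keep beta_t in (0, 1)
    beta_t = eps_t / (1.0 - eps_t)
    new_src = w_src * beta ** err_src       # decrease misclassified source
    new_tgt = w_tgt * beta_t ** (-err_tgt)  # increase misclassified target
    return new_src, new_tgt
```

Over iterations, source instances that keep disagreeing with the target concept fade out, while hard target instances gain influence.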
Feature-based TL Approaches
When source and target domains only have some overlapping features. (lots of features only have support in either the source or the target domain)
Feature-based TL Approaches (cont.)
How to learn 𝜑 ? • Solution 1: Encode application-specific knowledge to
learn the transformation. • Solution 2: General approaches to learning the
transformation.
TL for Sentiment Classification
Electronics:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.
Video Games:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.
TL for Sentiment Classification (cont.)
Training (Electronics), bag-of-words features:
            compact  sharp  blurry  hooked  realistic  boring
review (1):    1       1      0       0        0         0
review (3):    0       1      0       0        0         0
review (5):    0       0      1       0        0         0
Learned classifier: 𝑦 = 𝑓(𝑥) = sgn(𝑤^𝑇𝑥), 𝑤 = [1, 1, −1, 0, 0, 0]^𝑇
Prediction (Video Games):
            compact  sharp  blurry  hooked  realistic  boring
review (2):    0       0      0       1        0         0
review (4):    0       0      0       1        1         0
review (6):    0       0      0       0        0         1
TL for Sentiment Classification (cont.)
Electronics Video Games (1) Compact; easy to operate; very good picture quality; looks sharp!
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.
(6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.
TL for Sentiment Classification (cont.)
• Three different types of features
– Source domain (Electronics) specific features, e.g., compact, sharp, blurry
– Target domain (Video Games) specific features, e.g., hooked, realistic, boring
– Domain independent features (pivot features), e.g., good, excited, nice, never_buy
TL for Sentiment Classification (cont.)
• How to select pivot features?
– Term frequency on both source and target domain data
– Mutual information between features and source domain labels
– Mutual information between features and domains
• How to make use of the pivot features to align features across domains?
– Structural Correspondence Learning (SCL) [Blitzer etal., 2006]
– Spectral Feature Alignment (SFA) [Pan etal., 2010]
How to Align Domain Specific Features?
• Intuition
– Use pivot features as a bridge to connect domain-specific features
– Model correlations between pivot features and domain-specific features
– Discover new shared features by exploiting the feature correlations
Spectral Feature Alignment (SFA)
If two domain-specific words have connections to more common pivot words in the graph, they tend to be aligned or clustered together with a higher probability. If two pivot words have connections to more common domain-specific words in the graph, they tend to be aligned together with a higher probability.
(Figure: a bipartite graph between pivot features, e.g., exciting, good, never_buy, and domain-specific features, e.g., sharp, compact, blurry from Electronics and hooked, realistic, boring from Video Games; edges are weighted by co-occurrence counts)
High level idea: apply spectral clustering to the bipartite graph to derive new features that align domain-specific words across Electronics and Video Games, e.g., sharp/hooked, compact/realistic, blurry/boring.
SFA: Derive new features
Training (Electronics), with the new aligned features:
            sharp/hooked  compact/realistic  blurry/boring
review (1):      1               1                0
review (3):      1               0                0
review (5):      0               0                1
Learned classifier: 𝑦 = 𝑓(𝑥) = sgn(𝑤^𝑇𝑥), 𝑤 = [1, 1, −1]^𝑇
Prediction (Video Games):
            sharp/hooked  compact/realistic  blurry/boring
review (2):      1               0                0
review (4):      1               1                0
review (6):      0               0                1
SFA: Algorithm
1. Identify pivot features, and treat the rest as domain-specific features.
2. Construct a bipartite graph between the pivot and domain-specific features.
3. Apply spectral clustering on the graph.
4. Use the feature clusters together with the original features to represent the source and target domain data.
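Step 3 can be sketched as a standard spectral embedding of the bipartite graph (a minimal NumPy sketch; using the symmetrically normalized affinity matrix and its top-k eigenvectors follows common spectral-clustering practice, and details differ from the full SFA algorithm):

```python
import numpy as np

def sfa_embeddings(M, k=2):
    """Spectral embedding of the bipartite pivot / domain-specific word
    graph (sketch of SFA step 3). M[i, j] is the co-occurrence count
    between domain-specific word i and pivot word j."""
    n_ds, n_p = M.shape
    # full affinity matrix over all (domain-specific + pivot) nodes
    A = np.zeros((n_ds + n_p, n_ds + n_p))
    A[:n_ds, n_ds:] = M
    A[n_ds:, :n_ds] = M.T
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    Lsym = Dinv @ A @ Dinv                    # normalized affinity
    vals, vecs = np.linalg.eigh(Lsym)
    U = vecs[:, np.argsort(vals)[::-1][:k]]   # top-k eigenvectors
    return U[:n_ds]   # embeddings of the domain-specific words
```

Domain-specific words connected to the same pivot words land close together in the embedding, which is exactly the alignment effect (e.g., sharp/hooked) described above.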
Experiments: Datasets & Tasks
Dataset: RevDat [Blitzer et al. ACL-07], crawled from Amazon; SentDat, collected by MSRA, crawled from Amazon, Yelp and Citysearch
Tasks: RevDat: B→D, E→D, K→D, D→B, E→B, K→B, D→E, B→E, K→E, D→K, B→K, E→K; SentDat: V→E, S→E, H→E, E→V, S→V, H→V, E→S, V→S, H→S, E→H, V→H, S→H
Overall Performance
(Figure: classification accuracy comparing our method, Blitzer's method, the source domain classifier used without adaptation, and an in-domain classifier trained on labeled target domain data)
Overall Performance (cont.)
(Figure: classification accuracy results, continued; same comparison as the previous slide)
General Feature-based TL Approaches (cont.)
• Solution 2: General approaches to learning the transformation – Learning features by minimizing distance between
distributions – Learning features via multi-task learning – Learning features via self-taught learning
Transfer Component Analysis (TCA) [Pan etal., IJCAI-09, TNN-11]
Motivation
(Figure: source and target WiFi data are generated by latent factors: temperature, signal properties, building structure, power of APs)
TCA (cont.)
(Figure: some of the latent factors cause the data distributions between domains to differ)
TCA (cont.)
Learning 𝜑 by only minimizing distance between distributions may map the data onto noisy factors.
TCA (cont.)
• Main idea: the learned 𝜑 should map the source and target domain data to a latent space spanned by factors that reduce the domain distance while preserving the data structure
• High level optimization problem:
min_𝜑 Dist(𝜑(𝑋𝑆), 𝜑(𝑋𝑇)) + 𝜆Ω(𝜑)
s.t. constraints on 𝜑(𝑋𝑆) and 𝜑(𝑋𝑇)
Maximum Mean Discrepancy (MMD)
Distance Measure via MMD
Dist(𝜑(𝑋𝑆), 𝜑(𝑋𝑇)) = ‖𝔼_{𝑥∼𝑃𝑇(𝑥)}[𝜙(𝜑(𝑥))] − 𝔼_{𝑥∼𝑃𝑆(𝑥)}[𝜙(𝜑(𝑥))]‖_ℋ
≈ ‖(1/𝑛𝑇) ∑_{𝑖=1}^{𝑛𝑇} 𝜙(𝜑(𝑥_{𝑇𝑖})) − (1/𝑛𝑆) ∑_{𝑖=1}^{𝑛𝑆} 𝜙(𝜑(𝑥_{𝑆𝑖}))‖_ℋ
Assume 𝜓 = 𝜙 ∘ 𝜑 induces an RKHS with kernel 𝑘(𝑥𝑖, 𝑥𝑗) = 𝜓(𝑥𝑖)^𝑇𝜓(𝑥𝑗). Then
Dist(𝜑(𝑋𝑆), 𝜑(𝑋𝑇)) = tr(𝐾𝐿), where 𝐾 = [𝐾_{𝑆,𝑆}, 𝐾_{𝑆,𝑇}; 𝐾_{𝑇,𝑆}, 𝐾_{𝑇,𝑇}] and
𝐿_{𝑖𝑗} = 1/𝑛𝑆² if 𝑥𝑖, 𝑥𝑗 ∈ 𝑋𝑆; 1/𝑛𝑇² if 𝑥𝑖, 𝑥𝑗 ∈ 𝑋𝑇; −1/(𝑛𝑆𝑛𝑇) otherwise
TCA (cont.)
min_𝜑 Dist(𝜑(𝑋𝑆), 𝜑(𝑋𝑇)) + 𝜆Ω(𝜑)  s.t. constraints on 𝜑(𝑋𝑆) and 𝜑(𝑋𝑇)
⇒ min_𝜑 tr(𝐾𝐿) + 𝜆Ω(𝜑)  s.t. constraints on 𝜑(𝑋𝑆) and 𝜑(𝑋𝑇)
• In general, the kernel function 𝑘(𝜑(𝑥𝑖), 𝜑(𝑥𝑗)) can be a highly nonlinear function of 𝜑, which is unknown
• Direct minimization of this quantity w.r.t. 𝜑 may get stuck in poor local minima
Solution I [Pan etal., AAAI-08]
Learning 𝜑 ⇔ 1) learning 𝐾, then 2) low-dimensional reconstructions of 𝑋𝑆 and 𝑋𝑇 based on 𝐾:
min_{𝐾≽0} tr(𝐾𝐿) − 𝜆tr(𝐾)
s.t. 𝐾_{𝑖𝑖} + 𝐾_{𝑗𝑗} − 2𝐾_{𝑖𝑗} = 𝑑_{𝑖𝑗}², ∀(𝑖, 𝑗) ∈ 𝒩, 𝐾𝟏 = 0
where the first term minimizes the distance between domains, the second term maximizes data variance, and the constraints preserve the local geometric structure. Then perform PCA on 𝐾.
• It is an SDP problem, which is expensive!
• It is transductive, so it cannot generalize to unseen instances!
• PCA is post-processed on the learned kernel matrix, which may discard useful information
Solution II [Pan etal., IJCAI-09, IEEE TNN-11]
Assume 𝐾 is low-rank: 𝐾 = 𝐾̃𝑊𝑊^𝑇𝐾̃, where 𝑊 ∈ ℝ^{(𝑛𝑆+𝑛𝑇)×𝑚} and 𝑚 ≪ 𝑛𝑆 + 𝑛𝑇
Learning 𝐾 ⇔ learning a low-rank matrix 𝑊:
min_𝑊 tr(𝑊^𝑇𝐾̃𝐿𝐾̃𝑊) + 𝜆tr(𝑊^𝑇𝑊)
s.t. 𝑊^𝑇𝐾̃𝐻𝐾̃𝑊 = 𝐼
where the first term minimizes the distance between domains, the second is a regularization term on 𝑊, and the constraint maximizes data variance (𝐻 is the centering matrix).
Closed form solution for 𝑊*: the 𝑚 leading eigenvectors of (𝐾̃𝐿𝐾̃ + 𝜆𝐼)^{−1}𝐾̃𝐻𝐾̃
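Solution II can be sketched directly from the closed form (a minimal NumPy sketch; the RBF kernel, its bandwidth, and the regularization value are illustrative assumptions, the full kernel matrix is used in place of 𝐾̃, and no out-of-sample extension is included):

```python
import numpy as np

def tca(X_src, X_tgt, m=2, mu=1.0, gamma=1.0):
    """Transfer Component Analysis sketch: W* is given by the m leading
    eigenvectors of (K L K + mu I)^{-1} K H K; embeddings are K @ W."""
    X = np.vstack([X_src, X_tgt])
    n_s, n_t = len(X_src), len(X_tgt)
    n = n_s + n_t
    sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq)                    # RBF kernel matrix
    # MMD coefficient matrix L
    L = np.full((n, n), -1.0 / (n_s * n_t))
    L[:n_s, :n_s] = 1.0 / n_s**2
    L[n_s:, n_s:] = 1.0 / n_t**2
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    W = np.real(vecs[:, np.argsort(-np.real(vals))[:m]])
    Z = K @ W                                  # transfer components
    return Z[:n_s], Z[n_s:]
```

A downstream classifier is then trained on the source rows of Z and applied to the target rows, exactly the pipeline described in the slides.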
Experiments: Datasets & Setup
• Data Set: 2007 IEEE ICDM Contest Task 2 [Yang et al., 2008]. WiFi data are collected in time period A (the source domain) and time period B (the target domain) in an indoor building of about 145.5 × 37.5 m².
• Baselines: Non-transfer, KPCA, other transfer learning approaches.
• Regression model: Regularized Least Square Regression (RLSR).
• Evaluation Criterion: Average Error Distance (AED)
Experiments: Overall Results
(Figures: AED of TCA and SSTCA vs. baselines when varying the number of dimensions of the transfer learning methods, and when varying the amount of unlabeled target domain data used for training)
Multi-task Feature Learning
Assumption: If tasks are related, they should share some good common features.
Goal: Learn a low-dimensional representation shared across related tasks.
General Multi-task Learning Setting
• Setting 1 (multi-task learning): learn models 𝒇𝑻𝟏, 𝒇𝑻𝟐, …, 𝒇𝑻𝑵 for target tasks 1, …, N jointly, exploiting the commonality among tasks
• Setting 2 (transfer learning): adaptively transfer a source model 𝒇𝑺 to a target model 𝒇𝑻
These are the two main settings in transfer learning.
Self-taught Feature Learning
• Intuition:
– There exist some higher-level features that can help the target learning task even when only a few labeled data are given.
• Steps:
– Learn higher-level features from a lot of unlabeled data
– Use the learned higher-level features to represent the data of the target task
– Train models on the new representations of the target task with the corresponding labels
Self-taught Feature Learning (cont.)
• How to learn higher-level features – Sparse Coding [Raina etal., 2007] – Autoencoder [Glorot etal., 2011]
(Figure: sparse coding represents an input as a sparse weighted sum of basis elements, e.g., ≈ 0.6 × basis 1 + 0.3 × basis 2 + 0.1 × basis 3; an autoencoder learns features by reconstructing its input)
Heterogeneous Transfer Learning
• Source and target domains do not have any overlapping features, i.e., no feature has support in both the source and target domains (𝓧𝑺 ≠ 𝓧𝑻)
• Some labeled data is available in the target domain to construct correspondence between domains, or
• Some parallel data (i.e., instance correspondences) between domains is available
Heterogeneous TL: Application
• Sentiment analysis across different languages [Zhou, Pan, etal., AAAI-14; Zhou, Tsang, Pan, etal., AISTATS-14]
• Suppose: given a sufficient set of annotated product reviews in English (i.e., the source domain)
• Goal: learn a sentiment classifier for product reviews in German (i.e., the target domain)
Parameter-based Approaches
• Motivation: a well-trained source model 𝜃𝑆 has captured a lot of structure from the data. If two tasks are related, this structure can be transferred to learn a more precise target model 𝜃𝑇 with only a few labeled target-domain instances.
Parameter-based TL Approaches (cont.)
Assumption: if tasks are related, they may share similar parameter vectors. For example, in [Evgeniou and Pontil, KDD-04] each task's parameter vector decomposes into a common part shared by all tasks and a specific part for the individual task.
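The shared/specific decomposition 𝑤_𝑡 = 𝑤₀ + 𝑣_𝑡 can be sketched as a joint least-squares objective (a minimal NumPy sketch, fit by gradient descent; the squared loss, regularization weights, and optimizer settings are illustrative assumptions, not the exact formulation of Evgeniou and Pontil):

```python
import numpy as np

def mtl_shared_specific(Xs, ys, lam0=0.1, lam1=1.0, lr=0.01, n_iter=3000):
    """Fit w_t = w0 + v_t for T related regression tasks by gradient
    descent on sum_t mean||X_t (w0 + v_t) - y_t||^2
                + lam0 ||w0||^2 + lam1 sum_t ||v_t||^2."""
    d = Xs[0].shape[1]
    T = len(Xs)
    w0 = np.zeros(d)           # common part, shared across tasks
    V = np.zeros((T, d))       # specific part, one row per task
    for _ in range(n_iter):
        g0 = 2 * lam0 * w0
        for t in range(T):
            r = Xs[t] @ (w0 + V[t]) - ys[t]
            g = 2 * Xs[t].T @ r / len(ys[t])
            g0 += g                                # data gradient hits w0 too
            V[t] -= lr * (g + 2 * lam1 * V[t])
        w0 -= lr * g0
    return w0, V
```

The stronger the regularization on the specific parts 𝑣_𝑡, the more the tasks are pulled toward the shared parameter vector 𝑤₀.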
Parameter-based TL Approaches (cont.)
A general framework:
[Zhang and Yeung, UAI-10] [Agarwal etal, NIPS-10]
Relational TL Approaches
• Motivation: If two relational domains (non-i.i.d) are related, they may share some similar relations among objects. These relations can be used for knowledge transfer across domains
• Approaches: – Second-order Markov Logic [Davis and Domingos, ICML-09]
– Relational Adaptive bootstraPping (RAP) [Li etal., ACL-12]
Second-order Markov Logic
Academic domain (source): Student (B) is AdvisedBy Professor (A); both have Publication links to Paper (T)
Movie domain (target): Actor (A) WorkedFor Director (B); both have MovieMember links to Movie (M)
Source rule: AdvisedBy(B, A) ⋀ Publication(B, T) => Publication(A, T)
Target rule: WorkedFor(A, B) ⋀ MovieMember(A, M) => MovieMember(B, M)
Shared second-order form: P1(𝑥, 𝑦) ⋀ P2(𝑥, 𝑧) => P2(𝑦, 𝑧)
RAP
• RAP is designed for cross-domain fine-grained sentiment analysis
Example: "I liked the service and the staff, but the food is disappointed"
Aspect terms: service, staff, food; Opinion terms: liked (positive), disappointed (negative)
RAP (cont.)
Movie reviews: "This movie has good script, great casting, excellent acting." "This movie is so boring." "The Godfather was the most amazing movie." "The movie is excellent."
Camera reviews: "The camera is great. It is a very amazing product. I highly recommend this camera. It takes excellent photos. Photos had some artifacts and noise."
RAP (cont.)
• Bridge between cross-domain sentiment words – Domain independent (general) sentiment words
• Bridge between cross-domain topic words
RAP (cont.)
• Bridge between cross-domain topic words
– Syntactic structure between topic and sentiment words
– Common syntactic pattern: "topic word" – nsubj – "sentiment word"
Summary
• Transfer learning settings: Heterogeneous vs. Homogeneous; within homogeneous, Unsupervised / Semi-Supervised / Supervised
• Transfer learning approaches: Instance-based and Feature-based (at the data level); Parameter-based (at the model level); Relational
Real-world Applications
(Figure: fundamental research on transfer learning connects to applications in text mining, social media analytics, image classification, WiFi localization, sentiment analysis, software engineering, recommender systems, and bioinformatics)
Defect Prediction
For a particular project: programs with defect information are used to train a predictive model (machine learning), which predicts future defects.
Cross-Project Defect Prediction
“Training data is often not available, either because a company is too small or it is the first
release of a product”
[Zimmerman et al., Cross-project Defect Prediction, FSE-09]
“For many new projects we may not have enough historical data to train prediction models.”
[Rahman, Posnett, and Devanbu, Recalling the “Imprecision” of Cross-project Defect Prediction, ICSE-12]
Cross-Project Defect Prediction
Programs with defect information from another project are used to train the predictive model, which then predicts future defects in the new project.
Cross-Project Defect Prediction
• [Zimmerman etal., FSE-09]: "We ran 622 cross-project predictions and found only 3.4% actually worked."
(Figure: pie chart; worked: 3.4%, not worked: 96.6%)
Cross-Project Defect Prediction
• [Rahman, Posnett, and Devanbu, ICSE-12]
(Figure: bar chart comparing the average F-measure of cross-project vs. within-project prediction)
Difference between Projects
• Development processes can be very different [Zimmermann etal., FSE-09] – The way the systems are being developed – Operating systems and environments – Tools and IDEs used for development – Coding styles – etc.
Defect Prediction Process
Both projects use the same set of metrics (features): an instance 𝑖 is a vector of metric values [Metric 1, Metric 2, Metric 3, …, Metric 𝑚], e.g., [2, 3, 40, …, 19], with a label Y or N. However, the distributions of feature values (the data distributions) differ between projects.
Data Normalization
• N1: min-max normalization to [0, 1]
• N2: 𝑧-score normalization (mean = 0, std = 1)
• N3: 𝑧-score normalization using only the source mean and standard deviation
• N4: 𝑧-score normalization using only the target mean and standard deviation
• NoN: no normalization
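The five options can be sketched as follows (a minimal NumPy sketch; the small epsilon guards against zero ranges are implementation details, not part of the definitions):

```python
import numpy as np

def normalize(src, tgt, option):
    """The five normalization options from the slide (sketch).
    Returns normalized (source, target) feature matrices."""
    if option == "NoN":
        return src, tgt
    if option == "N1":  # min-max to [0, 1], each dataset on its own
        scale = lambda X: (X - X.min(0)) / np.maximum(X.max(0) - X.min(0), 1e-12)
        return scale(src), scale(tgt)
    if option == "N2":  # z-score, each dataset with its own mean/std
        z = lambda X: (X - X.mean(0)) / np.maximum(X.std(0), 1e-12)
        return z(src), z(tgt)
    if option == "N3":  # z-score using source statistics for both
        m, s = src.mean(0), np.maximum(src.std(0), 1e-12)
        return (src - m) / s, (tgt - m) / s
    if option == "N4":  # z-score using target statistics for both
        m, s = tgt.mean(0), np.maximum(tgt.std(0), 1e-12)
        return (src - m) / s, (tgt - m) / s
    raise ValueError(option)
```

N3 and N4 differ from N2 only in whose statistics are used: one project's mean and standard deviation are applied to both datasets.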
(Pipeline: data normalization → TCA)
Preliminary Results using TCA
(Figure: F-measure of cross-project prediction with TCA under different normalization options, i.e., Baseline, NoN, N1, N2, N3, N4, for Safe→Apache and Apache→Safe)
The prediction performance of TCA varies according to the normalization strategy!
Baseline: cross-project defect prediction without TCA
TCA+
• TCA with decision rules to choose an appropriate data normalization strategy
• Steps:
– #1: Characterize the datasets of different projects
– #2: Decide rules based on comparing the characteristics of the two projects
(Pipeline: chosen normalization → TCA)
#1: Characterize Datasets
(Figure: for each project's dataset, e.g., Project A as source and Project B as target, compute the pairwise Euclidean distances 𝑑_{𝑖𝑗} between all instances)
DIST = {𝑑_{𝑖𝑗} : ∀𝑖, 𝑗, 1 ≤ 𝑖 ≤ 𝑛, 1 ≤ 𝑗 ≤ 𝑛, 𝑖 < 𝑗}
#1: Characterize Datasets (cont.)
• Minimum (min) and maximum (max) value of DIST
• Mean (mean) and standard deviation (std) of DIST
• The number of instances (#ins)
These five statistics form a characteristic vector 𝒄𝑆 for the source project and 𝒄𝑇 for the target project.
#2: Decision Rules
• Rule #1: (mean𝑆 ≈ mean𝑇) and (std𝑆 ≈ std𝑇) → NoN
• Rule #2: (min𝑆 ≠ min𝑇) and (max𝑆 ≠ max𝑇) → N1
• Rules #3, #4: (std𝑆 ≠ std𝑇) and (#ins𝑆 ≠ #ins𝑇) → N3 or N4
• Rule #5: default → N2
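The characterization and the rules can be sketched together (a minimal NumPy sketch; the "similar vs. different" tolerance and the N3-vs-N4 tie-break are illustrative assumptions, not the paper's exact cutoffs):

```python
import numpy as np

def dataset_characteristics(X):
    """DIST-based characteristics of one project's data (step #1)."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    dist = d[np.triu_indices(n, k=1)]          # pairwise distances, i < j
    return dict(min=dist.min(), max=dist.max(),
                mean=dist.mean(), std=dist.std(), n=n)

def choose_normalization(cs, ct, tol=0.1):
    """TCA+ decision rules (step #2, sketch): pick NoN / N1 / N2 / N3 / N4
    by comparing source and target characteristic vectors."""
    close = lambda a, b: abs(a - b) <= tol * max(abs(a), abs(b), 1e-12)
    if close(cs["mean"], ct["mean"]) and close(cs["std"], ct["std"]):
        return "NoN"                                   # Rule 1
    if not close(cs["min"], ct["min"]) and not close(cs["max"], ct["max"]):
        return "N1"                                    # Rule 2
    if not close(cs["std"], ct["std"]) and not close(cs["n"], ct["n"]):
        return "N3" if cs["n"] > ct["n"] else "N4"     # Rules 3-4 (tie-break assumed)
    return "N2"                                        # Rule 5 (default)
```

When the two projects already look alike, no normalization is applied; the more their scales diverge, the more aggressive the chosen normalization.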
Experimental Setup
• 8 software projects from two datasets
• Classification algorithm: logistic regression
ReLink [Wu etal., FSE-11]: Apache, Safe, ZXing; 26 metrics (features)
AEEEM [D'Ambros etal., MSR-10]: Apache Lucene (LC), Equinox (EQ), Eclipse JDT, Eclipse PDE UI, Mylyn (ML); 61 metrics (features)
Experimental Design
• Cross-project defect prediction: train on a source project, test on a target project
• Transfer defect learning: the same source/target setup, with TCA/TCA+ applied
ReLink Results
(Figure: F-measure bars for Baseline, TCA, TCA+, and within-project prediction on Safe→Apache, Apache→Safe, and Apache→ZXing)
Baseline: cross-project defect prediction without TCA/TCA+
ReLink Results (cont.)
Cross (Source → Target)   Baseline   TCA (N4)   TCA+   Within (target → target)
Safe → Apache               0.52       0.64     0.64        0.64
ZXing → Apache              0.69       0.64     0.72        0.64
Apache → Safe               0.49       0.72     0.72        0.62
ZXing → Safe                0.59       0.70     0.64        0.62
Apache → ZXing              0.46       0.45     0.49        0.33
Safe → ZXing                0.10       0.42     0.53        0.33
Average                     0.49       0.59     0.61        0.53
Baseline: cross-project defect prediction without TCA/TCA+
AEEEM Results
(Figure: F-measure bars for Baseline, TCA, TCA+, and within-project prediction on JDT→EQ, PDE→LC, and PDE→ML)
Baseline: cross-project defect prediction without TCA/TCA+
AEEEM Results (cont.)
Cross (Source → Target)   Baseline   TCA (N2)   TCA+   Within (target → target)
JDT → EQ                    0.31       0.59     0.60        0.58
LC → EQ                     0.50       0.62     0.62        0.58
ML → EQ                     0.24       0.56     0.56        0.58
…                            …          …        …           …
PDE → LC                    0.33       0.27     0.33        0.27
EQ → ML                     0.19       0.62     0.62        0.30
JDT → ML                    0.27       0.56     0.56        0.30
LC → ML                     0.20       0.58     0.60        0.30
PDE → ML                    0.27       0.48     0.54        0.30
…                            …          …        …           …
Average                     0.32       0.41     0.41        0.42
Future Direction
• Theoretical study beyond generalization error bounds
– Given a source domain and a target domain, determine whether transfer learning should be performed
– For a specific transfer learning method, given a source and a target domain, determine whether the method should be used for knowledge transfer
Future Direction (cont.)
• Transfer learning for deep reinforcement learning
(Figure: in traditional RL, the agent observes the environment through a hand-crafted feature extractor, maps hand-crafted features (states) [𝒇𝟏, …, 𝒇𝒏] to a predicted final score for each of 18 joystick actions, and receives rewards; in deep RL [Google DeepMind 2015], the agent's policy maps raw screen pixels directly to predictions of the final score for each of the 18 joystick actions)
Future Direction (cont.) • Transfer learning for deep reinforcement learning
(Figure: extending deep RL with transfer learning across tasks)
Reference
• [Thorndike and Woodworth, The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions, 1901]
• [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
• [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
• [Pan, Transfer Learning, Chapter 21, Data Classification: Algorithms and
Applications 2014]
• [Quionero-Candela etal., Dataset Shift in Machine Learning, MIT Press 2009]
• [Zadrozny, Learning and Evaluating Classifiers under Sample Selection
Bias, ICML 2004] • [Huang etal., Correcting Sample Selection Bias by Unlabeled Data, NIPS
2006]
Reference (cont.)
• [Sugiyama etal., Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation, NIPS 2007]
• [Kanamori etal., A Least-squares Approach to Direct Importance Estimation, JMLR 2009]
• [Dai etal., Boosting for Transfer Learning, ICML 2007]
• [Blitzer etal., Domain Adaptation with Structural Correspondence Learning,
EMNLP 2006] • [Pan etal., Cross-Domain Sentiment Classification via Spectral Feature
Alignment, WWW 2010] • [Pan etal., Transfer Learning via Dimensionality Reduction, AAAI 2008] • [Pan etal., Domain Adaptation via Transfer Component Analysis, IJCAI
2009, TNN 2011]
Reference (cont.)
• [Argyriou etal., Multi-Task Feature Learning, NIPS 2007] • [Ando and Zhang, A Framework for Learning Predictive Structures from
Multiple Tasks and Unlabeled Data, JMLR 2005] • [Ji etal, Extracting Shared Subspace for Multi-label Classification, KDD
2008] • [Raina etal., Self-taught Learning: Transfer Learning from Unlabeled Data,
ICML 2007] • [Glorot etal., Domain Adaptation for Large-Scale Sentiment Classification:
A Deep Learning Approach, ICML 2011] • [Zhou, etal., Hybrid Heterogeneous Transfer Learning through Deep
Learning, AAAI 2014] • [Zhou, etal., Heterogeneous Domain Adaptation for Multiple Classes,
AISTATS-14]
Reference (cont.)
• [Evgeniou and Pontil, Regularized Multi-Task Learning, KDD 2004] • [Zhang and Yeung, A Convex Formulation for Learning Task
Relationships in Multi-Task Learning, UAI 2010] • [Agarwal etal, Learning Multiple Tasks using Manifold Regularization,
NIPS 2010]
• [Davis and Domingos, Deep Transfer via Second-order Markov Logic,
ICML 2009] • [Li etal., Cross-Domain Co-Extraction of Sentiment and Topic Lexicons,
ACL 2012] • [Zimmerman etal., Cross-project Defect Prediction, FSE 2009] • [Rahman etal., Recalling the “Imprecision” of Cross-project Defect
Prediction, ICSE 2012] • [Nam etal., Transfer Defect Learning, ICSE 2013]