handling missing attributes using matrix factorization 

Handling Missing Attributes using Matrix Factorization. Övünç Bozcan, Software Research Lab, Dept. of Computer Engineering, Boğaziçi University, Istanbul, Turkey, [email protected]; Ayşe Başar Bener, Data Science Lab, Mechanical and Industrial Engineering, Ryerson University, Toronto, Canada, [email protected]

Upload: cs-ncstate

Post on 06-May-2015


DESCRIPTION

Övünç Bozcan and Ayşe Başar Bener, RAISE'13

TRANSCRIPT

Page 1: Handling Missing Attributes using Matrix Factorization 

Handling Missing Attributes using Matrix Factorization

Övünç Bozcan

Software Research Lab Dept. of Computer Engineering

Boğaziçi University Istanbul, Turkey

[email protected]

Ayşe Başar Bener

Data Science Lab Mechanical and Industrial Engineering

Ryerson University Toronto, Canada

[email protected]

Page 2: Handling Missing Attributes using Matrix Factorization 

Outline
•  Introduction
•  Related Work
•  Matrix Factorization
•  Experiment
•  Results
•  Conclusion

Page 3: Handling Missing Attributes using Matrix Factorization 

Introduction
Ø  Software defect prediction models reveal defect-prone parts of the software to guide managers in allocating testing resources efficiently
Ø  Popular studies
    Ø  Estimate the number of defects remaining in software systems
    Ø  Discover defect associations
    Ø  Classify the defect-proneness of software components into two classes, defect-prone and not defect-prone
Ø  Metrics
    Ø  Static code
    Ø  History
    Ø  Social

Page 4: Handling Missing Attributes using Matrix Factorization 

Introduction
Ø  Numerous defect prediction studies over the last 40 years
    Ø  Statistical techniques combined with machine learning algorithms are adopted
    Ø  Nagappan et al., Ostrand et al., Zimmermann et al., Fenton et al., Khoshgoftaar et al.
Ø  Benchmarking studies
    Ø  Lessmann et al. and Menzies et al.
Ø  Systematic literature surveys
    Ø  Hall et al.
Ø  Industrial case studies
    Ø  Tosun et al.

Page 5: Handling Missing Attributes using Matrix Factorization 

Introduction
Ø  Major challenges in building defect prediction models:
    Ø  High dimensionality of software defect data
        Ø  The number of available software metrics is too large for a classifier to work
    Ø  Skewed, imbalanced data sets
        Ø  The proportion of one class is much larger than the proportion of the other class
    Ø  Performance limitations
        Ø  Limited information content
        Ø  Performance ceiling effect
    Ø  Incomplete datasets
        Ø  Features of the training set may differ from the features of the test set
            Ø  Some of the test set attributes may be missing
            Ø  There may be extra attributes in test sets
        Ø  Building a model with several datasets
            Ø  Different datasets may have different attribute sets

Page 6: Handling Missing Attributes using Matrix Factorization 

Introduction
Ø  Missing value patterns may take different forms:
    Ø  Data may be missing at individual points
        Ø  Some attribute values may be considered as outliers.
    Ø  Data may be missing in chunks
        Ø  You may want to build your model with several datasets, and the attributes of these datasets may differ.
        Ø  When these datasets are concatenated, there will probably be missing chunks.
Ø  Solution might be:
    Ø  To use the largest common attribute set, OR
    Ø  To apply imputation to the missing attributes
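
The two options on this slide can be illustrated with a minimal sketch. It assumes each dataset is a list of dictionaries keyed by attribute name; the attribute names (`loc`, `churn`, `commits`) and the values are hypothetical and are not taken from the study's data.

```python
def largest_common_attribute_set(datasets):
    """Attribute names shared by every dataset (each dataset is a list of dicts)."""
    common = set(datasets[0][0])
    for rows in datasets[1:]:
        common &= set(rows[0])
    return sorted(common)


def concatenate_with_gaps(datasets):
    """Stack all rows over the union of attributes; absent values become None."""
    all_names = sorted(set().union(*(rows[0].keys() for rows in datasets)))
    stacked = [[row.get(name) for name in all_names]
               for rows in datasets for row in rows]
    return all_names, stacked


# Hypothetical projects with partly different attribute sets.
project_a = [{"loc": 120, "churn": 14}, {"loc": 45, "churn": 3}]
project_b = [{"loc": 300, "commits": 7}]

print(largest_common_attribute_set([project_a, project_b]))
names, table = concatenate_with_gaps([project_a, project_b])
print(names)
for row in table:
    print(row)
```

Keeping only the common set discards `churn` and `commits` entirely, while concatenation keeps them at the cost of whole blocks of `None` values, which is the "missing in chunks" pattern described above.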

Page 7: Handling Missing Attributes using Matrix Factorization 

Proposed Solution

Matrix factorization is a solution to the data scarcity problem in recommendation systems

Page 8: Handling Missing Attributes using Matrix Factorization 

Related Work
•  Recommendation systems
    o  Netflix Prize competition
        •  Koren, Bell, and Volinsky
    o  Collective Matrix Factorization
        •  Singh et al. and Lippert et al.

Page 9: Handling Missing Attributes using Matrix Factorization 

Matrix Factorization
•  Netflix competition
    o  Matrix factorization models are superior to classical nearest-neighbor techniques, as they allow additional information to be incorporated and offer scalable predictive accuracy (Bell et al.)
•  Matrix factorization factorizes a large matrix into two smaller matrices called factors.
•  The factors are multiplied to reconstruct (approximately) the original matrix.
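
As a rough illustration of the idea, the sketch below factorizes a small matrix with missing entries into two factors by stochastic gradient descent on the observed cells only, then multiplies the factors to fill the gaps. It is a generic textbook-style sketch, not the authors' implementation; the function names, the rank, the learning rate, the regularization weight, and the example values are all assumptions.

```python
import random


def factorize(matrix, rank=2, steps=5000, lr=0.01, reg=0.02, seed=0):
    """Approximate `matrix` (rows of numbers, None = missing) as P times Q^T."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    P = [[rng.random() for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.random() for _ in range(rank)] for _ in range(n_cols)]
    # Only observed cells contribute to the loss.
    observed = [(i, j, matrix[i][j])
                for i in range(n_rows) for j in range(n_cols)
                if matrix[i][j] is not None]
    for _ in range(steps):
        for i, j, value in observed:
            err = value - sum(P[i][k] * Q[j][k] for k in range(rank))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def reconstruct(P, Q):
    """Multiply the factors back into a full matrix with no missing cells."""
    return [[sum(p * q for p, q in zip(row, col)) for col in Q] for row in P]


# Illustrative values only: rows are instances, columns are attributes.
observed_matrix = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
P, Q = factorize(observed_matrix)
completed = reconstruct(P, Q)
```

The cells that were `None` in `observed_matrix` now hold predicted values in `completed`, which is how the factorization stands in for explicit imputation.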

Page 10: Handling Missing Attributes using Matrix Factorization 

Matrix Factorization
•  Nonnegative MF algorithms (Berry et al.)
    o  Multiplicative update algorithms
    o  Gradient descent algorithms
        •  Easiest to implement and to scale
    o  Alternating least squares algorithms
•  Multi-Relational Matrix Factorization by Lippert et al.
    o  Low-norm matrix factorization based on a gradient descent algorithm
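
For comparison with the gradient descent family, here is a minimal sketch of the multiplicative update rule for nonnegative matrix factorization on a fully observed matrix, in the style of the widely known Lee and Seung updates. It is not taken from the slides; the helper names, the small `eps` guard against division by zero, and the example matrix are assumptions.

```python
import random


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf_multiplicative(V, rank, steps=200, eps=1e-9, seed=0):
    """Factor nonnegative V into nonnegative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(steps):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Illustrative nonnegative matrix.
V = [[1, 2, 3], [2, 4, 6], [1, 1, 1], [3, 5, 7]]
W, H = nmf_multiplicative(V, rank=2)
print(frobenius_error(V, W, H))
```

Because every update multiplies by a nonnegative ratio, the factors stay nonnegative without any projection step, which is the main appeal of this family; gradient descent needs an explicit step size but extends more easily to matrices with missing cells.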

Page 11: Handling Missing Attributes using Matrix Factorization 

Experiment

Datasets      Static Code Metrics  Churn Metrics  Social Metrics  Instances  Defective %
Android       106                  15             25              12981      6.4
Linux Kernel  106                  15             25              14801      5.5
Perl          106                  15             25              125        61.6
VLC           106                  15             25              936        39.2

Datasets
•  Android
    o  Open source operating system designed for mobile devices
•  Linux Kernel
    o  Open source operating system
•  Perl
    o  Stable, cross-platform, open source interpreted language
•  VLC
    o  Open source multimedia player

Page 12: Handling Missing Attributes using Matrix Factorization 

Experiment
Performance Measures
o  Pd
o  Pf
o  Balance

Learning Algorithms
o  Naive Bayes
o  Matrix Factorization
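
The slide lists the performance measures without defining them. The sketch below assumes the definitions commonly used in the defect prediction literature: pd is the probability of detection (recall on the defective class), pf is the probability of false alarm, and balance is one minus the normalized Euclidean distance from the ideal point (pf = 0, pd = 1). The function name and the label encoding (1 = defect-prone, 0 = not defect-prone) are assumptions.

```python
import math


def pd_pf_balance(actual, predicted):
    """Return (pd, pf, balance) for binary labels, where 1 means defect-prone."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0  # probability of detection
    pf = fp / (fp + tn) if fp + tn else 0.0  # probability of false alarm
    # Distance from the ideal point (pf=0, pd=1), scaled into [0, 1].
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, pf, balance


print(pd_pf_balance([1, 1, 0, 0], [1, 0, 0, 1]))
```

A higher balance means the classifier sits closer to the ideal corner of the ROC space, so it rewards high pd and low pf at the same time.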

Page 13: Handling Missing Attributes using Matrix Factorization 

Experiment
Experiment 1
•  The performance of the Naive Bayes algorithm is explored
•  10-fold cross-validation is run 10 times while gradually removing attributes from the datasets
•  Attributes are removed according to their correlation with the class attribute
    •  Pearson correlation is used
•  4 (datasets) × 10 (removal steps) × 10 (repetitions) × 10 (folds) = 4000 Naive Bayes prediction models are built
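
The attribute ranking step in Experiment 1 might look like the sketch below. The slide does not say whether the most or the least correlated attributes are removed first; this sketch assumes the weakest absolute Pearson correlation goes first. The function names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_attributes(rows, labels):
    """Column indices ordered from weakest to strongest |correlation| with the class."""
    columns = list(zip(*rows))
    scores = [abs(pearson(col, labels)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j])


def remove_attributes(rows, order, fraction):
    """Drop the first `fraction` of the attributes in `order` from every row."""
    dropped = set(order[:int(len(order) * fraction)])
    return [[v for j, v in enumerate(row) if j not in dropped] for row in rows]


# Toy data: three attributes, four instances, binary class labels.
rows = [[1, 5, 0], [2, 3, 0], [3, 5, 1], [4, 3, 1]]
labels = [0, 0, 1, 1]
order = rank_attributes(rows, labels)
reduced = remove_attributes(rows, order, 0.5)
```

Calling `remove_attributes` with fractions 0.1, 0.2, and so on would give the gradual removal steps, with a cross-validated Naive Bayes model trained on each reduced dataset.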

Experiment 2
•  The performances of Naive Bayes with imputation and Matrix Factorization are compared
•  Attributes are chosen according to their correlation with the class attribute
    o  Pearson correlation is used
•  The imputation or removal procedure is applied to the chosen attributes in increasing proportions
•  4 (datasets) × 10 (attribute selection steps) × 10 (imputation steps) × 10 (folds) = 4000 Naive Bayes and Matrix Factorization models are built
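
The benchmark side of Experiment 2 can be sketched as follows, assuming mean-value imputation (the technique named later in the deck) and a random choice of which instances lose the chosen attribute. The function names, the seed handling, and the example values are assumptions; how the study selected the affected instances is not stated on the slides.

```python
import random


def mask_attribute(rows, column, proportion, seed=0):
    """Hide `proportion` of the values in one column by setting them to None."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(rows)), int(len(rows) * proportion)))
    return [[None if (i in hidden and j == column) else v
             for j, v in enumerate(row)]
            for i, row in enumerate(rows)]


def mean_impute(rows):
    """Replace each None with the mean of the known values in its column."""
    means = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        means.append(sum(known) / len(known) if known else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Illustrative values only.
data = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0], [7.0, 40.0]]
with_gaps = mask_attribute(data, column=1, proportion=0.5)
filled = mean_impute(with_gaps)
```

In the Naive Bayes arm the classifier would be trained on `filled`, whereas the matrix factorization arm would work directly on `with_gaps` and learn only from the observed cells.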

Page 14: Handling Missing Attributes using Matrix Factorization 

Results (Exp. 1)

[Figure: Balance values of Naive Bayes with respect to feature reduction percentage. Panels: Android, Kernel, Perl, VLC]

Page 15: Handling Missing Attributes using Matrix Factorization 

Results (Exp. 2)

[Figure: Balance values of MF with respect to the missing Churn and Social attribute data, and of NB with imputation on the Churn and Social attributes. Panels: Android, Kernel, Perl, VLC]

Page 16: Handling Missing Attributes using Matrix Factorization 

Threats to Validity
•  Internal validity
    o  Naive Bayes, mean-value imputation, and matrix factorization are widely used in previous studies.
    o  The performance measures used for evaluation have also been adopted by several researchers in the past.
    o  Studies discussing static code, history, and social metrics are abundant.
    o  The datasets are extracted from open source project repositories and have also been used in previous studies.
•  External validity
    o  Four different datasets were extracted from open source project repositories.
    o  Nevertheless, our results are limited to the analyzed data and context.

Page 17: Handling Missing Attributes using Matrix Factorization 

Conclusion
•  Collective matrix factorization from recommender systems is applied to the missing data problem in defect prediction
•  Two experiments conducted
    o  The performance of NB with feature reduction
    o  The performance of NB with mean-value imputation vs. the performance of MF with missing data
•  NB performance decreases as the number of features is reduced.
•  Matrix Factorization performs better on datasets with missing data than the benchmark model with imputation
•  Future Work
    o  Support the findings using more complex imputation techniques
    o  Different missing data scenarios may be adopted

Page 18: Handling Missing Attributes using Matrix Factorization 

Thank You