semi-supervised single-label text categorization using...

53
Semi-supervised Single-label Text Categorization using Centroid-based Classifiers Ana Cardoso-Cachopo Arlindo Oliveira Instituto Superior T´ ecnico — Technical University of Lisbon / INESC-ID SAC-IAR 2007, March 12th (IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 1 / 19

Upload: others

Post on 18-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Semi-supervised Single-label Text Categorizationusing Centroid-based Classifiers

Ana Cardoso-Cachopo Arlindo Oliveira

Instituto Superior Tecnico — Technical University of Lisbon / INESC-ID

SAC-IAR 2007, March 12th

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 1 / 19

Page 2: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 3: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 4: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 5: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 6: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 7: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 8: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 9: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Outline

1 Problem Description

2 Characteristics of the Datasets

3 Why use Centroid-based Methods

4 Why use Unlabeled Data

5 Incorporate Unlabeled Data using EM

6 Incrementally Incorporate Unlabeled Data

7 Experimental Results

8 Conclusions and Future Work(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 2 / 19

Page 10: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Problem Description

Text Categorization

Single-label

DatasetsI Reuters 21578 - R8I 20 Newsgroups - 20NgI Web Knowledge Base - Web4I Cade - Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 3 / 19

Page 11: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Problem Description

Text Categorization

Single-label

DatasetsI Reuters 21578 - R8I 20 Newsgroups - 20NgI Web Knowledge Base - Web4I Cade - Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 3 / 19

Page 12: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Problem Description

Text Categorization

Single-label

DatasetsI Reuters 21578 - R8I 20 Newsgroups - 20NgI Web Knowledge Base - Web4I Cade - Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 3 / 19

Page 13: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Problem Description

Text Categorization

Single-label

DatasetsI Reuters 21578 - R8I 20 Newsgroups - 20NgI Web Knowledge Base - Web4I Cade - Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 3 / 19

Page 14: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Problem Description

Text Categorization

Single-label

DatasetsI Reuters 21578 - R8I 20 Newsgroups - 20NgI Web Knowledge Base - Web4I Cade - Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 3 / 19

Page 15: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Problem Description

Text Categorization

Single-label

DatasetsI Reuters 21578 - R8I 20 Newsgroups - 20NgI Web Knowledge Base - Web4I Cade - Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 3 / 19

Page 16: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Problem Description

Text Categorization

Single-label

DatasetsI Reuters 21578 - R8I 20 Newsgroups - 20NgI Web Knowledge Base - Web4I Cade - Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 3 / 19

Page 17: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Characteristics of the Datasets

Train Test Total Smallest LargestDocs Docs Docs Class Class

R8 5485 2189 7674 51 3923

20Ng 11293 7528 18821 628 999

Web4 2803 1396 4199 504 1641

Cade12 27322 13661 40983 625 8473

Numbers of documents for the datasets: number of training documents,number of test documents, total number of documents, number ofdocuments in the smallest class, and number of documents in the largestclass.

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 4 / 19

Page 18: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Centroid-based Methods

Very fast

Good Accuracy

0.0

0.2

0.4

0.6

0.8

1.0Centroid

SVM

k-NN

LSI

Vector

Dumb

R8 20Ng Web4 Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 5 / 19

Page 19: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Centroid-based Methods

Very fast

Good Accuracy

0.0

0.2

0.4

0.6

0.8

1.0Centroid

SVM

k-NN

LSI

Vector

Dumb

R8 20Ng Web4 Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 5 / 19

Page 20: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Centroid-based Methods

Very fast

Good Accuracy

0.0

0.2

0.4

0.6

0.8

1.0Centroid

SVM

k-NN

LSI

Vector

Dumb

R8 20Ng Web4 Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 5 / 19

Page 21: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Centroid-based Methods

Very fast

Good Accuracy

0.0

0.2

0.4

0.6

0.8

1.0Centroid

SVM

k-NN

LSI

Vector

Dumb

R8 20Ng Web4 Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 5 / 19

Page 22: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Centroid-based Methods

Very fast

Good Accuracy

0.0

0.2

0.4

0.6

0.8

1.0Centroid

SVM

k-NN

LSI

Vector

Dumb

R8 20Ng Web4 Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 5 / 19

Page 23: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Centroid-based Methods

Very fast

Good Accuracy

0.0

0.2

0.4

0.6

0.8

1.0Centroid

SVM

k-NN

LSI

Vector

Dumb

R8 20Ng Web4 Cade12

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 5 / 19

Page 24: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Unlabeled Data

Small amounts of labeled data available

Large amounts of unlabeled data available

Hard or expensive to label new data

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 6 / 19

Page 25: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Unlabeled Data

Small amounts of labeled data available

Large amounts of unlabeled data available

Hard or expensive to label new data

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 6 / 19

Page 26: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Why use Unlabeled Data

Small amounts of labeled data available

Large amounts of unlabeled data available

Hard or expensive to label new data

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 6 / 19

Page 27: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incorporate Unlabeled Data using EM

If the entire dataset is available from the start, like in a library.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Estimation step: For each unlabeled document dj ∈ U, classify itaccording to the available centroids.Maximization step: For each class cj , update its centroid −−→cjnew ,considering the labeled documents and the labels for the unlabeleddocuments obtained in the previous step.Iterate: Until the centroids do not change in two consecutive iterations.Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 7 / 19

Page 28: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incorporate Unlabeled Data using EM

If the entire dataset is available from the start, like in a library.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Estimation step: For each unlabeled document dj ∈ U, classify itaccording to the available centroids.Maximization step: For each class cj , update its centroid −−→cjnew ,considering the labeled documents and the labels for the unlabeleddocuments obtained in the previous step.Iterate: Until the centroids do not change in two consecutive iterations.Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 7 / 19

Page 29: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incorporate Unlabeled Data using EM

If the entire dataset is available from the start, like in a library.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Estimation step: For each unlabeled document dj ∈ U, classify itaccording to the available centroids.Maximization step: For each class cj , update its centroid −−→cjnew ,considering the labeled documents and the labels for the unlabeleddocuments obtained in the previous step.Iterate: Until the centroids do not change in two consecutive iterations.Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 7 / 19

Page 30: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incorporate Unlabeled Data using EM

If the entire dataset is available from the start, like in a library.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Estimation step: For each unlabeled document dj ∈ U, classify itaccording to the available centroids.Maximization step: For each class cj , update its centroid −−→cjnew ,considering the labeled documents and the labels for the unlabeleddocuments obtained in the previous step.Iterate: Until the centroids do not change in two consecutive iterations.Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 7 / 19

Page 31: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incorporate Unlabeled Data using EM

If the entire dataset is available from the start, like in a library.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Estimation step: For each unlabeled document dj ∈ U, classify itaccording to the available centroids.Maximization step: For each class cj , update its centroid −−→cjnew ,considering the labeled documents and the labels for the unlabeleddocuments obtained in the previous step.Iterate: Until the centroids do not change in two consecutive iterations.Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 7 / 19

Page 32: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incorporate Unlabeled Data using EM

If the entire dataset is available from the start, like in a library.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Estimation step: For each unlabeled document dj ∈ U, classify itaccording to the available centroids.Maximization step: For each class cj , update its centroid −−→cjnew ,considering the labeled documents and the labels for the unlabeleddocuments obtained in the previous step.Iterate: Until the centroids do not change in two consecutive iterations.Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 7 / 19

Page 33: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incorporate Unlabeled Data using EM

If the entire dataset is available from the start, like in a library.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Estimation step: For each unlabeled document dj ∈ U, classify itaccording to the available centroids.Maximization step: For each class cj , update its centroid −−→cjnew ,considering the labeled documents and the labels for the unlabeleddocuments obtained in the previous step.Iterate: Until the centroids do not change in two consecutive iterations.Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 7 / 19

Page 34: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incrementally Incorporate Unlabeled Data

If the dataset changes over time, like a news feed or the web.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Iterate: For each unlabeled document dj ∈ U:

Classify dj according to its similarity to each of the centroids.

Update the centroids with the new document dj classified in theprevious step.

Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 8 / 19

Page 35: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incrementally Incorporate Unlabeled Data

If the dataset changes over time, like a news feed or the web.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Iterate: For each unlabeled document dj ∈ U:

Classify dj according to its similarity to each of the centroids.

Update the centroids with the new document dj classified in theprevious step.

Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 8 / 19

Page 36: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incrementally Incorporate Unlabeled Data

If the dataset changes over time, like a news feed or the web.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Iterate: For each unlabeled document dj ∈ U:

Classify dj according to its similarity to each of the centroids.

Update the centroids with the new document dj classified in theprevious step.

Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 8 / 19

Page 37: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incrementally Incorporate Unlabeled Data

If the dataset changes over time, like a news feed or the web.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Iterate: For each unlabeled document dj ∈ U:

Classify dj according to its similarity to each of the centroids.

Update the centroids with the new document dj classified in theprevious step.

Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 8 / 19

Page 38: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incrementally Incorporate Unlabeled Data

If the dataset changes over time, like a news feed or the web.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Iterate: For each unlabeled document dj ∈ U:

Classify dj according to its similarity to each of the centroids.

Update the centroids with the new document dj classified in theprevious step.

Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 8 / 19

Page 39: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incrementally Incorporate Unlabeled Data

If the dataset changes over time, like a news feed or the web.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Iterate: For each unlabeled document dj ∈ U:

Classify dj according to its similarity to each of the centroids.

Update the centroids with the new document dj classified in theprevious step.

Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 8 / 19

Page 40: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Incrementally Incorporate Unlabeled Data

If the dataset changes over time, like a news feed or the web.

Inputs: A set of labeled document vectors, L, and a set of unlabeleddocument vectors U.Initialization step: For each class cj appearing in L, determine the class´scentroid −→cj , using one of the formulas for the centroids and consideringonly the labeled documents.Iterate: For each unlabeled document dj ∈ U:

Classify dj according to its similarity to each of the centroids.

Update the centroids with the new document dj classified in theprevious step.

Outputs: For each class cj , the centroid −→cj .

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 8 / 19

Page 41: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Synthetic Dataset

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

-5 -4 -3 -2 -1 0 1 2 3 4 5

Gaussian Distributionsµ1 = 1.0, σ1 = 1.0

µ2 = −1.0, σ2 = 1.0

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 9 / 19

Page 42: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Synthetic Dataset

0.72

0.74

0.76

0.78

0.8

0.82

0.84

0.86

0 5 10 15 20

Acc

ura

cy

Labeled documents per class

µ1 = 1.0, σ1 = 1.0, µ2 = −1.0, σ2 = 1.0

CentroidEMInc

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

-5 -4 -3 -2 -1 0 1 2 3 4 5

Gaussian Distributionsµ1 = 1.0, σ1 = 1.0

µ2 = −1.0, σ2 = 1.0

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 10 / 19

Page 43: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Synthetic Dataset

0.58

0.6

0.62

0.64

0.66

0.68

0.7

0 5 10 15 20

Acc

ura

cy

Labeled documents per class

µ1 = 1.0, σ1 = 2.0, µ2 = −1.0, σ2 = 2.0

CentroidEMInc

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

-5 -4 -3 -2 -1 0 1 2 3 4 5

Gaussian Distributionsµ1 = 1.0, σ1 = 2.0

µ2 = −1.0, σ2 = 2.0

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 11 / 19

Page 44: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Synthetic Dataset

0.52

0.53

0.54

0.55

0.56

0.57

0.58

0.59

0 5 10 15 20

Acc

ura

cy

Labeled documents per class

µ1 = 1.0, σ1 = 4.0, µ2 = −1.0, σ2 = 4.0

CentroidEMInc

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

-5 -4 -3 -2 -1 0 1 2 3 4 5

Gaussian Distributionsµ1 = 1.0, σ1 = 4.0

µ2 = −1.0, σ2 = 4.0

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 12 / 19

Page 45: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Real World Datasets

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

0 5 10 15 20 25 30 35 40

Acc

ura

cy

Labeled documents per class

R8

CentroidEMInc

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 13 / 19

Page 46: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Real World Datasets

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0 10 20 30 40 50 60 70

Acc

ura

cy

Labeled documents per class

20Ng

CentroidEMInc

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 14 / 19

Page 47: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Real World Datasets

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0 10 20 30 40 50 60 70

Acc

ura

cy

Labeled documents per class

Web4

CentroidEMInc

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 15 / 19

Page 48: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Real World Datasets

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0 10 20 30 40 50 60 70

Acc

ura

cy

Labeled documents per class

Cade12

CentroidEMInc

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 16 / 19

Page 49: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Experimental Results - Real World Datasets

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

0 5 10 15 20 25 30 35 40

Acc

ura

cy

Labeled documents per class

R8

CentroidEMInc

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0 10 20 30 40 50 60 70

Acc

ura

cy

Labeled documents per class

20Ng

CentroidEMInc

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0 10 20 30 40 50 60 70

Acc

ura

cy

Labeled documents per class

Web4

CentroidEMInc

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0 10 20 30 40 50 60 70

Acc

ura

cy

Labeled documents per class

Cade12

CentroidEMInc

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 17 / 19

Page 50: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Conclusions and Future Work

If the initial model of the data is sufficiently precise, using unlabeleddata improves performance.

Using unlabeled data degrades performance if the initial model is notprecise enough.

As future work, we plan to extend this approach to multi-labeldatasets.

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 18 / 19

Page 51: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Conclusions and Future Work

If the initial model of the data is sufficiently precise, using unlabeleddata improves performance.

Using unlabeled data degrades performance if the initial model is notprecise enough.

As future work, we plan to extend this approach to multi-labeldatasets.

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 18 / 19

Page 52: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Conclusions and Future Work

If the initial model of the data is sufficiently precise, using unlabeleddata improves performance.

Using unlabeled data degrades performance if the initial model is notprecise enough.

As future work, we plan to extend this approach to multi-labeldatasets.

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 18 / 19

Page 53: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization

Thank You.

Any Questions?

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 19 / 19