
Web Page Classification by Academic Fields

Richard Wang

February 15, 2006

Introduction

Objective
- Train a classifier that classifies web pages by academic field using a semi-supervised method
  - Identify interests/affiliations of people
  - Filter web pages for field-specific applications (e.g. an N.E.R. trained on C.S. web pages)

Assumptions
- Academic fields correspond to academic departments
- All web pages under an academic departmental website are related to the academic field that the department corresponds to

Academic Fields

We pre-define six academic fields (each shown with an example departmental URL):
- Biological Sciences (e.g. web.mit.edu/biology/www)
- Computer Science (e.g. www.cs.cmu.edu)
- Economics (e.g. www.econ.gatech.edu)
- History (e.g. www.nyu.edu/gsas/dept/history)
- Law (e.g. www.law.miami.edu)
- Music (e.g. www.pitt.edu/~musicdpt)

System Architecture

[Figure: system architecture diagram. Labeled components: Academic Field Queries; Google; Candidate Dept. URLs (Field?, URLs); Simple URL Classifier; True Dept. URLs (Field, URLs); Web Crawler; True Dept. Pages (Field, Pages); Candidate Dept. Pages (Field?, URLs, Pages); Web Page Classifier; If Match; External Module (Optional)]
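
The transcript keeps only the diagram's box labels, not its arrows, so the following is one possible reading rather than the author's actual procedure: pages with a trusted field label train the classifier, and a candidate page is promoted to the trusted set when the predicted field matches its tentative field ("If Match"). The function name `self_train`, the `train` callable, and the default of 20 iterations are all assumptions made for illustration.

```python
def self_train(true_pages, candidate_pages, train, iterations=20):
    """Hypothetical self-training loop (a sketch, not the talk's code).

    true_pages:      list of (field, text) pairs with trusted labels.
    candidate_pages: list of (tentative_field, text) pairs.
    train:           callable that takes a list of (field, text) pairs and
                     returns a predict(text) -> field function.
    Returns the final predict function and the grown labeled set.
    """
    labeled = list(true_pages)
    pending = list(candidate_pages)
    for _ in range(iterations):
        predict = train(labeled)
        # "If Match": keep candidates whose tentative field is confirmed.
        matched = [(f, t) for f, t in pending if predict(t) == f]
        if not matched:
            break  # nothing new was confirmed, so further rounds change nothing
        labeled.extend(matched)
        pending = [(f, t) for f, t in pending if predict(t) != f]
    return train(labeled), labeled
```

Passing the learner in as a callable keeps the loop independent of the classifier, so any multi-class learner can be swapped in without touching the loop.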

Candidate Dept. URLs

- Manually devised Google queries for extracting candidate departmental URLs:

    allintitle: "Biological Sciences" OR Biology School OR Department OR Institute site:edu
    allintitle: "Computer Science" -Mathematics School OR Department OR Institute site:edu
    allintitle: Economics School OR Department OR Institute site:edu
    allintitle: History -Art School OR Department OR Institute site:edu
    allintitle: Law School OR Department OR Institute site:edu
    allintitle: Music School OR Department OR Institute site:edu

- The extracted URLs are then sent to
  - A simple URL classifier
  - The web crawler for crawling
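
The six queries share one template: field-specific title terms, followed by the same "School OR Department OR Institute" clause and the site:edu restriction. A minimal sketch of that template is shown here. The slides say the queries were devised manually, so the names `FIELD_TERMS`, `UNIT_TERMS`, and `build_query` are hypothetical and only reproduce the strings listed above.

```python
# Field-specific title terms, copied from the queries on the slide.
FIELD_TERMS = {
    "Biological Sciences": '"Biological Sciences" OR Biology',
    "Computer Science": '"Computer Science" -Mathematics',
    "Economics": "Economics",
    "History": "History -Art",
    "Law": "Law",
    "Music": "Music",
}

# The clause that every query on the slide has in common.
UNIT_TERMS = "School OR Department OR Institute"


def build_query(field):
    """Return the allintitle: query string for one academic field."""
    return f"allintitle: {FIELD_TERMS[field]} {UNIT_TERMS} site:edu"


if __name__ == "__main__":
    for field in FIELD_TERMS:
        print(build_query(field))
```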

Simple URL Classifier

- Learns URL tokens from the candidate dept. URLs by keeping count of their term frequencies
- The classifier determines the academic field of a URL by searching for those top URL tokens

Academic Field         Top Common Tokens in URL
Biological Sciences    biology (64%), bio (10%), biol (5%)
Computer Science       cs (69%), csc (3%), compsci (3%), cse (3%)
Economics              econ (44%), economics (38%), economic (4%)
History                history (80%), hist (4%)
Law                    law (71%)
Music                  music (86%), mus (2%)
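
The token counting and lookup described above can be sketched as follows. The slides do not specify how URLs are tokenized, which tokens are ignored, or how ties between fields are resolved, so the splitting rule, the stop list, and the "exactly one matching field" rule below are assumptions; `TOP_TOKENS` simply copies the table from the slide.

```python
import re
from collections import Counter, defaultdict

# Top tokens per field, copied from the table on the slide.
TOP_TOKENS = {
    "Biological Sciences": ["biology", "bio", "biol"],
    "Computer Science": ["cs", "csc", "compsci", "cse"],
    "Economics": ["econ", "economics", "economic"],
    "History": ["history", "hist"],
    "Law": ["law"],
    "Music": ["music", "mus"],
}


def url_tokens(url):
    """Lowercase a URL and split it on every non-alphanumeric character."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def learn_top_tokens(candidates, top_n=3, stop=("www", "edu", "http", "https")):
    """Count URL tokens per field; candidates is an iterable of (field, url)."""
    counts = defaultdict(Counter)
    for field, url in candidates:
        # Count each token once per URL (an assumption about the percentages).
        counts[field].update(set(url_tokens(url)) - set(stop))
    return {f: [t for t, _ in c.most_common(top_n)] for f, c in counts.items()}


def classify_url(url, top_tokens):
    """Return the one field whose top tokens occur in the URL, else None."""
    tokens = set(url_tokens(url))
    hits = [f for f, tops in top_tokens.items() if tokens & set(tops)]
    return hits[0] if len(hits) == 1 else None
```

With this exact-token rule, `classify_url("www.cs.cmu.edu", TOP_TOKENS)` returns "Computer Science", while `www.pitt.edu/~musicdpt` is left unclassified because `musicdpt` is a single token; a substring search would behave differently.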

Web Page Classifier

- Since learning is iterative, we need a fast non-binary classifier:
  - KNN is fast during training but extremely slow during testing
  - A One vs. All learner that uses a simple inner learner can be very fast during training and testing
- We decided to use One vs. All with Naïve Bayes as the inner learner and a simple set of features: bag-of-words
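
A minimal sketch of One vs. All with a Naive Bayes inner learner over bag-of-words features is given here, using only the standard library. The talk does not say which Naive Bayes variant, tokenizer, or smoothing was used; the multinomial model, whitespace tokenization, and add-one smoothing below are assumptions, and the class names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Bag-of-words features: lowercase word counts."""
    return Counter(text.lower().split())


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for a single 'this field vs. the rest' decision."""

    def fit(self, bags, labels, vocab):
        self.vocab = vocab
        self.log_prior, self.log_like = {}, {}
        for y in (True, False):
            docs = [b for b, lab in zip(bags, labels) if lab == y]
            # Add-one smoothing on the class prior and on the word counts.
            self.log_prior[y] = math.log((len(docs) + 1) / (len(bags) + 2))
            counts = Counter()
            for b in docs:
                counts.update(b)
            total = sum(counts.values()) + len(vocab)
            self.log_like[y] = {w: math.log((counts[w] + 1) / total) for w in vocab}
        return self

    def score(self, bag):
        """Log-odds that the page belongs to the positive class."""
        s = self.log_prior[True] - self.log_prior[False]
        for w, n in bag.items():
            if w in self.vocab:  # words never seen in training are ignored
                s += n * (self.log_like[True][w] - self.log_like[False][w])
        return s


class OneVsAll:
    """Train one binary model per field; predict the field with the top score."""

    def fit(self, texts, fields):
        bags = [bag_of_words(t) for t in texts]
        vocab = set().union(*bags)
        self.models = {
            f: BinaryNaiveBayes().fit(bags, [g == f for g in fields], vocab)
            for f in sorted(set(fields))
        }
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        return max(self.models, key=lambda f: self.models[f].score(bag))
```

Training is a single counting pass per field and prediction is one sum per field, which is consistent with the speed argument on the slide.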

Experimental Setting

Initial training set (seed)
- One entire website for each academic field
- Manually verified that those websites are indeed departmental websites
- A total of 15880 web pages (18MB)

Test set
- Same setting as the initial training set but with different websites
- A total of 1824 web pages (2MB)

Experimental Results

[Figure: Web Page Classification Performance. Error rates (y-axis, ticks from 0.16 to 0.26) against iterations (x-axis, 0 to 20) for two series: Fixed Sequence and Random Sequence]

Confusion Matrix
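
The matrix itself did not survive in the transcript. As a reminder of what such a table contains, here is a small sketch that counts (true field, predicted field) pairs and derives the error rate from the same two lists. The function names are hypothetical, and no values from the talk are reproduced.

```python
from collections import Counter


def confusion_matrix(true_fields, predicted_fields):
    """Count how often each (true field, predicted field) pair occurs."""
    return Counter(zip(true_fields, predicted_fields))


def error_rate(true_fields, predicted_fields):
    """Fraction of pages whose predicted field differs from the true field."""
    wrong = sum(t != p for t, p in zip(true_fields, predicted_fields))
    return wrong / len(true_fields)
```

Off-diagonal entries, such as the count for ("Law", "History"), show which pairs of fields the classifier confuses.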

Classifier Analysis (1): Biological Sciences, Computer Science

Classifier Analysis (2): Economics, History

Classifier Analysis (3): Law, Music

Conclusion & Future Work

- Classification performance can be improved by using unlabeled data
- Try more iterations in the experiments
- Try to learn/classify more academic fields
- Try other multi-class classifiers

Thank You

Questions?