Web Page Classification by Academic Fields
Richard Wang
February 15, 2006
Introduction
Objective
- Train a classifier that classifies web pages by academic field using a semi-supervised method
- Identify interests/affiliations of people
- Filter web pages for field-specific applications (e.g. an N.E.R. trained on C.S. web pages)
Assumptions
- Academic fields correspond to academic departments
- All web pages under an academic departmental website are related to the academic field that the department corresponds to
Academic Fields
We pre-define six academic fields (each shown with an example academic departmental URL):
- Biological Sciences (e.g. web.mit.edu/biology/www)
- Computer Science (e.g. www.cs.cmu.edu)
- Economics (e.g. www.econ.gatech.edu)
- History (e.g. www.nyu.edu/gsas/dept/history)
- Law (e.g. www.law.miami.edu)
- Music (e.g. www.pitt.edu/~musicdpt)
System Architecture
[Figure: system architecture diagram. Labeled components:
- Academic Field Queries
- Candidate Dept. URLs (Field?, URLs)
- Simple URL Classifier
- True Dept. URLs (Field, URLs)
- Web Crawler (label appears twice)
- True Dept. Pages (Field, Pages)
- Candidate Dept. Pages (Field?, URLs, Pages)
- Web Page Classifier, with an "If Match" decision
- External Module (Optional)]
Candidate Dept. URLs
Manually devised Google queries for extracting candidate departmental URLs:
allintitle: "Biological Sciences" OR Biology School OR Department OR Institute site:edu
allintitle: "Computer Science" -Mathematics School OR Department OR Institute site:edu
allintitle: Economics School OR Department OR Institute site:edu
allintitle: History -Art School OR Department OR Institute site:edu
allintitle: Law School OR Department OR Institute site:edu
allintitle: Music School OR Department OR Institute site:edu
The extracted URLs are then sent to:
- A simple URL classifier
- The web crawler for crawling
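The six queries share one pattern: field-specific title terms followed by a common suffix. A minimal sketch of that pattern follows. The names `FIELD_TERMS`, `COMMON_SUFFIX`, and `build_query` are hypothetical and not from the slides; only the query strings themselves are taken from the list above.

```python
# Field-specific title terms, copied from the queries on this slide.
FIELD_TERMS = {
    "Biological Sciences": '"Biological Sciences" OR Biology',
    "Computer Science": '"Computer Science" -Mathematics',
    "Economics": "Economics",
    "History": "History -Art",
    "Law": "Law",
    "Music": "Music",
}

# Suffix shared by all six queries.
COMMON_SUFFIX = "School OR Department OR Institute site:edu"


def build_query(field: str) -> str:
    """Return the search query string for one academic field."""
    return f"allintitle: {FIELD_TERMS[field]} {COMMON_SUFFIX}"


if __name__ == "__main__":
    for field in FIELD_TERMS:
        print(build_query(field))
```

Keeping the field terms in one table would make it straightforward to add further fields later, which the future-work slide mentions as a goal.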
Simple URL Classifier
- Learns URL tokens from candidate dept. URLs by keeping count of their term frequencies
- The classifier determines the academic field of a URL by searching it for the top URL tokens of each field

Academic Field        Top Common Tokens in URL
Biological Sciences   biology (64%), bio (10%), biol (5%)
Computer Science      cs (69%), csc (3%), compsci (3%), cse (3%)
Economics             econ (44%), economics (38%), economic (4%)
History               history (80%), hist (4%)
Law                   law (71%)
Music                 music (86%), mus (2%)
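The token counting and lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the tokenizer, the stop-token set, exact-token matching, and all function names are hypothetical. The `TOP_TOKENS` table is copied from the slide.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Assumed stop tokens that appear in nearly every URL (not from the slides).
STOP_TOKENS = frozenset({"http", "https", "www", "web", "edu"})

# Top tokens per field, copied from the table on this slide.
TOP_TOKENS = {
    "Biological Sciences": ["biology", "bio", "biol"],
    "Computer Science": ["cs", "csc", "compsci", "cse"],
    "Economics": ["econ", "economics", "economic"],
    "History": ["history", "hist"],
    "Law": ["law"],
    "Music": ["music", "mus"],
}


def tokenize_url(url: str) -> List[str]:
    """Split a URL into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", url.lower())


def learn_top_tokens(urls_by_field: Dict[str, List[str]], k: int = 3) -> Dict[str, List[str]]:
    """For each field, count how many URLs contain each token; keep the top k."""
    top = {}
    for field, urls in urls_by_field.items():
        counts: Counter = Counter()
        for url in urls:
            counts.update(set(tokenize_url(url)) - STOP_TOKENS)
        top[field] = [token for token, _ in counts.most_common(k)]
    return top


def classify_url(url: str, top_tokens: Dict[str, List[str]]) -> Optional[str]:
    """Return the first field whose top tokens include a token of the URL, else None.

    Matching is on whole tokens (an assumption), so a fused token such as
    'musicdpt' would not match 'music'.
    """
    tokens = set(tokenize_url(url))
    for field, tops in top_tokens.items():
        if tokens & set(tops):
            return field
    return None


if __name__ == "__main__":
    print(classify_url("www.cs.cmu.edu", TOP_TOKENS))
```

Whole-token matching avoids false hits such as "cs" inside "physics", at the cost of missing fused tokens; whether the original system matched tokens or substrings is not stated on the slide.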
Web Page Classifier
Since learning is iterative, we need a fast non-binary (multi-class) classifier:
- KNN is fast during training but extremely slow during testing
- A One vs. All learner that uses a simple inner learner can be very fast during both training and testing
- We decided to use One vs. All with Naïve Bayes as the inner learner and a simple set of features: bag-of-words
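A minimal sketch of One vs. All with a Naïve Bayes inner learner over bag-of-words features follows. It is an illustration only: the slides do not specify the Naïve Bayes variant or smoothing, so multinomial Naïve Bayes with add-one smoothing is an assumption, and the class names `BinaryNaiveBayes` and `OneVsAll` are hypothetical. Pages are represented as lists of word tokens.

```python
import math
from collections import Counter
from typing import Dict, List


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one 'field vs. rest' decision (add-one smoothing)."""

    def fit(self, docs: List[List[str]], labels: List[bool]) -> "BinaryNaiveBayes":
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_joint(self, words: List[str], label: bool) -> float:
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed class prior.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for word in words:
            if word in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def log_odds(self, words: List[str]) -> float:
        """Log-odds that the page belongs to the positive class."""
        return self._log_joint(words, True) - self._log_joint(words, False)


class OneVsAll:
    """Train one binary learner per field; predict the field with the highest score."""

    def fit(self, docs: List[List[str]], fields: List[str]) -> "OneVsAll":
        self.models: Dict[str, BinaryNaiveBayes] = {
            field: BinaryNaiveBayes().fit(docs, [f == field for f in fields])
            for field in set(fields)
        }
        return self

    def predict(self, words: List[str]) -> str:
        return max(self.models, key=lambda field: self.models[field].log_odds(words))


if __name__ == "__main__":
    docs = [
        ["algorithm", "compiler", "software"], ["algorithm", "network", "software"],
        ["court", "statute", "legal"], ["court", "attorney", "legal"],
        ["orchestra", "piano", "concert"], ["piano", "choir", "concert"],
    ]
    fields = ["Computer Science"] * 2 + ["Law"] * 2 + ["Music"] * 2
    model = OneVsAll().fit(docs, fields)
    print(model.predict(["compiler", "algorithm"]))
```

Training is a single counting pass per field and prediction is one sum per field, which is consistent with the speed argument made on the slide for an iterative setting.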
Experimental Setting
Initial training set (seed)
- One entire website for each academic field
- Manually verified that those websites are indeed departmental websites
- A total of 15880 web pages (18MB)
Test set
- Same setting as the initial training set but with different websites
- A total of 1824 web pages (2MB)
Experimental Results
Web Page Classification Performance
[Figure: error rates (y-axis, ticks from 0.16 to 0.26) versus iterations (x-axis, 0 to 20), with two curves: Fixed Sequence and Random Sequence]
Confusion Matrix
Classifier Analysis (1): Biological Sciences, Computer Science
Classifier Analysis (2): Economics, History
Classifier Analysis (3): Law, Music
Conclusion & Future Work
Classification performance can be improved by using unlabeled data
- Try more iterations in the experiments
- Try to learn/classify more academic fields
- Try other multi-class classifiers
Thank You
Questions?