WEB PAGE CLASSIFICATION
Features and Algorithms
Paper by: XIAOGUANG QI and BRIAN D. DAVISON
Presentation by: Jason Bender
Outline
Introduction to Classification
Background
○ Classification Types
○ Classification Methods
Applications
Features
Algorithms
Evolution of Websites
What is web page classification?
The process of assigning a web page to one or more predefined category labels (e.g., news, sports, business…).
Classification is generally posed as a supervised learning problem.
A set of labeled data is used to train a classifier, which is then applied to label future examples.
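The train-then-label loop described above can be sketched as follows. This is a minimal illustration only: the slides do not name a learning algorithm, so a bag-of-words Naive Bayes classifier with Laplace smoothing is swapped in as an assumption, and the function names `train` and `classify` and the toy training pages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """Build per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training pages
    vocab = set()
    for text, label in labeled_pages:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest Naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total_pages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_pages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labeled data (assumed for illustration only)
training_pages = [
    ("goal match team score", "sports"),
    ("team wins the match", "sports"),
    ("stock market shares profit", "business"),
    ("market profit falls", "business"),
]
model = train(training_pages)
print(classify(model, "team score"))
```

The labeled pages play the role of the training set, and `classify` is what gets applied to future, unlabeled pages.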
Background - Classification Types
The supervised learning problem can be broken into sub-problems:
○ Subject Classification
○ Functional Classification
○ Sentiment Classification
○ Other types of Classification
Subject Classification
Concerned with the subject or topic of the web page.
Judging whether a page is about arts, business, sports, etc.
Functional Classification
Concerned with the role that the page is playing.
Deciding whether a page is a personal homepage, course page, admissions page, etc.
Sentiment Classification
Focuses on the opinion that is presented in a web page
Other types of Classification
Such as genre classification and search engine spam classification
Background - Classification Methods
○ Binary vs. Multiclass
○ Single-Label vs. Multi-Label
○ Soft vs. Hard
○ Flat vs. Hierarchical
Binary vs. Multiclass Classification
Single-Label vs. Multi-Label Classification
Soft vs. Hard Classification
Flat vs. Hierarchical Classification
Applications
Why is classification important and how can we use it efficiently?
Constructing, maintaining, or expanding web directories
Web directories provide an efficient way to browse for information within a predefined set of categories
Example: Open Directory Project (ODP)
Currently constructed by human effort: 78,940 ODP editors
Improving the quality of search results
A big problem with search results is search ambiguity.
Helping question answering systems
Classification systems can be used to help improve the quality of answers.
Example: Wolfram Alpha
Other applications
Contextual advertising
Features
What features can we extract from a web page to use to help classify it?
Features - Introduction
Because of features such as the hyperlink <a> … </a>, web page classification is vastly different from other forms of classification, such as plain-text classification.
Features are organized into two groups:
○ On-page features – directly located on the page
○ Neighbor features – found on related pages
On Page Features
Textual Contents & Tags
Bag-of-words
○ N-gram feature
Rather than analyzing individual words, group them into sequences of n consecutive words.
 - Ex: New York vs. new ….. ….. York
Yahoo! has used a 5-gram feature.
HTML tags – title, heading, metadata, main text
URL
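The on-page textual features listed above can be sketched with the Python standard library. This is an illustrative sketch under assumptions, not the extraction pipeline from the paper: `word_ngrams` and `TagTextCollector` are hypothetical names, and only the title, headings, and remaining text are separated.

```python
from html.parser import HTMLParser

def word_ngrams(text, n):
    """Return all sequences of n consecutive words in text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

class TagTextCollector(HTMLParser):
    """Collect text separately for the title, headings, and other text."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "heading": [], "main": []}
        self._current = "main"

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in self.HEADINGS:
            self._current = "heading"

    def handle_endtag(self, tag):
        if tag == "title" or tag in self.HEADINGS:
            self._current = "main"

    def handle_data(self, data):
        # Skip whitespace-only runs between tags
        if data.strip():
            self.fields[self._current].append(data.strip())

html = ("<html><head><title>New York News</title></head>"
        "<body><h1>Sports</h1><p>The team won.</p></body></html>")
collector = TagTextCollector()
collector.feed(html)
print(collector.fields)
print(word_ngrams("New York News", 2))
```

Keeping the fields separate lets a classifier weight title or heading words differently from main text, and the bigram "new york" stays distinct from pages that merely contain "new" and "york" far apart.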
On Page Features
Visual Analysis
Each page has two representations:
○ Text via HTML
○ Visual via the browser
Each page can be represented as a visual adjacency multigraph
Features of Neighbors
What happens when a page’s features are missing or are unrecognizable?
Features of Neighbors
Assumptions
If page1 is in the neighborhood of many “sports” pages, then there is an increased probability that page1 is also a “sports” page.
Linked pages are more likely to have terms in common
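The first assumption can be illustrated by turning the labels of neighboring pages into a rough label distribution for the target page. This is a sketch under assumptions: the function name `neighbor_label_distribution` is hypothetical, and equal weighting of all neighbors is a simplification not stated in the slides.

```python
from collections import Counter

def neighbor_label_distribution(neighbor_labels):
    """Return the fraction of neighbors carrying each label."""
    counts = Counter(neighbor_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Labels of pages in the neighborhood of page1 (toy data)
labels = ["sports", "sports", "news", "sports"]
print(neighbor_label_distribution(labels))
```

A page surrounded mostly by "sports" neighbors gets a high "sports" fraction, which could then be used as an extra feature alongside the page's own content.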
Features of Neighbors
Neighbor Selection
Focus on pages within 2 steps of the target.
6 types: parent, child, sibling, spouse, grandparent, and grandchild
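The six neighbor types can be derived from a link graph as sketched below. The definitions used here are assumptions spelled out in the comments (a parent links to the target, a sibling shares a parent with it, a spouse shares a child with it), and the function name `neighbors` and the toy graph are hypothetical.

```python
def neighbors(links, target):
    """links maps each page to the set of pages it links to."""
    def children(page):
        return links.get(page, set())

    def parents(page):
        return {src for src, outs in links.items() if page in outs}

    parent = parents(target)
    child = set(children(target))
    return {
        "parent": parent,                      # pages linking to the target
        "child": child,                        # pages the target links to
        # siblings share at least one parent with the target
        "sibling": {s for p in parent for s in children(p)} - {target},
        # spouses share at least one child with the target
        "spouse": {s for c in child for s in parents(c)} - {target},
        "grandparent": {g for p in parent for g in parents(p)},
        "grandchild": {g for c in child for g in children(c)},
    }

# Toy link graph: G -> P -> {T, S}, T -> C, X -> C, C -> GC
links = {"G": {"P"}, "P": {"T", "S"}, "T": {"C"}, "X": {"C"}, "C": {"GC"}}
print(neighbors(links, "T"))
```

All six sets are reachable within two link steps of the target, which matches the 2-step limit above.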
Features of Neighbors
Labels
Anchor Text
Surrounding Anchor Text
By using the anchor text, surrounding text, and page title of a parent page in combination with text from the target page, classification can be improved.
Features of Neighbors
Implicit Links
Connections between pages that appear in the results of the same query and are both clicked by users.
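Implicit links of this kind could be derived from a search click log roughly as follows. This is a sketch under assumptions: the log format of (query, clicked URL) pairs, the function name `implicit_links`, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Return unordered page pairs that were clicked for the same query."""
    clicked = defaultdict(set)  # query -> set of clicked URLs
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        # Sort so each pair is stored once in a canonical order
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links

log = [
    ("jaguar speed", "a.com"),
    ("jaguar speed", "b.com"),
    ("python", "c.com"),
    ("python", "a.com"),
]
print(implicit_links(log))
```

Pages connected this way need not link to each other at all, which is what makes the links implicit.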
Algorithms
What are the algorithmic approaches to web page classification?
○ Dimension reduction
○ Relational learning
○ Hierarchical classification
○ Information combination
Dimension Reduction
Boost classification by emphasizing certain features that are more useful in classification.
Feature Weighting
○ Reduces the dimensions of feature space
○ Reduces computational complexity
○ Classification more accurate as a result of reduced space
Dimension Reduction
Methods
Use first fragment
K-nearest neighbor algorithm
○ Weighted features
○ Weighted HTML Tags
○ Metrics
 - Expected mutual information
 - Mutual information
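As an illustration of the expected mutual information metric, the sketch below scores how informative a term's presence is about membership in one category, using the standard definition over the four presence/absence and in-class/out-of-class combinations. The function name, the document representation as (set of terms, label) pairs, and the toy data are assumptions.

```python
import math

def expected_mutual_information(docs, term, label):
    """Expected mutual information (in bits) between term presence and class.

    docs is a list of (set_of_terms, label) pairs.
    """
    n = len(docs)
    total = 0.0
    for term_present in (True, False):
        for in_class in (True, False):
            joint = sum(
                1 for terms, lab in docs
                if (term in terms) == term_present and (lab == label) == in_class
            )
            if joint == 0:
                continue  # a zero-probability cell contributes nothing
            p_joint = joint / n
            p_term = sum(1 for terms, _ in docs
                         if (term in terms) == term_present) / n
            p_class = sum(1 for _, lab in docs
                          if (lab == label) == in_class) / n
            total += p_joint * math.log2(p_joint / (p_term * p_class))
    return total

docs = [
    ({"goal", "team", "page"}, "sports"),
    ({"goal", "match"}, "sports"),
    ({"stock", "page"}, "business"),
    ({"market", "the"}, "business"),
]
print(expected_mutual_information(docs, "goal", "sports"))
print(expected_mutual_information(docs, "page", "sports"))
```

Terms with low scores carry little information about the category and are candidates for removal, which is how such a metric reduces the dimension of the feature space.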
Relational Learning
Relaxation Labeling
Hierarchical Classification
Based on “divide and conquer”
Classification problems are split into a hierarchical set of sub-problems.
Error Minimization
When a lower-level category is uncertain whether a page belongs to it, shift the assignment one level up.
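The top-down descent with the "shift one level up" rule can be sketched as below. Everything concrete here is an assumption for illustration: the keyword-overlap scorer stands in for a trained per-category classifier, the 0.6 confidence threshold is arbitrary, and the names `keyword_scorer`, `make_node`, and `classify_hierarchically` are hypothetical.

```python
def keyword_scorer(keywords):
    """Hypothetical stand-in for a trained per-category classifier."""
    def score(page):
        words = set(page.lower().split())
        return len(keywords & words) / len(keywords)
    return score

def make_node(label, keywords=None, children=()):
    return {
        "label": label,
        "score": keyword_scorer(keywords) if keywords else None,
        "children": list(children),
    }

def classify_hierarchically(page, root, threshold=0.6):
    """Descend the category tree while the best child is confident enough."""
    current = root
    while current["children"]:
        best = max(current["children"], key=lambda child: child["score"](page))
        if best["score"](page) < threshold:
            break  # uncertain: keep the assignment one level up
        current = best
    return current["label"]

taxonomy = make_node("Top", children=[
    make_node("Sports", {"team", "match"}, children=[
        make_node("Soccer", {"goal", "penalty"}),
        make_node("Tennis", {"serve", "racket"}),
    ]),
    make_node("Business", {"stock", "market"}),
])

print(classify_hierarchically("team match goal penalty", taxonomy))
print(classify_hierarchically("team match today", taxonomy))
```

Each level only has to separate a handful of sibling categories, and a page that is clearly about sports but matches no specific sport stays at the "Sports" level rather than being forced into a leaf.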
Information Combination
Combine several methods into one.
Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
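One simple way to let several classifiers reach a collective decision is majority voting, sketched below. Voting is an assumption chosen for illustration, since the slides do not say how the classifiers are combined, and the three rule-based classifiers and the page dictionary layout are hypothetical stand-ins for classifiers trained on different information sources.

```python
from collections import Counter

def majority_vote(classifiers, page):
    """Return the label predicted by the most classifiers."""
    votes = Counter(clf(page) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical classifiers, each using a different source of information
def by_text(page):
    return "sports" if "team" in page["text"] else "news"

def by_url(page):
    return "news" if "/news/" in page["url"] else "sports"

def by_anchor(page):
    return "sports" if "sports" in page["anchor"] else "news"

page = {
    "text": "the team won",
    "url": "example.com/news/1",
    "anchor": "sports results",
}
print(majority_vote([by_text, by_url, by_anchor], page))
```

Using an odd number of classifiers over two labels avoids ties in this toy setup; with more labels a tie-breaking rule would be needed.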
Conclusion
Web page classification is a type of supervised learning problem that aims to categorize a web page into a predefined set of categories.
In the future, efforts will most likely be focused on effectively combining content and link information to build a more accurate classifier.
Evolution of Websites
Apple in 1998
Apple in 2008
Nike in 2000
Nike in 2008
Yahoo in 1996
Yahoo in 2008
Microsoft in 1998
Microsoft in 2008
MTV in 1998
MTV in 2008
Sources
Web Page Classification: Features and Algorithms, by Xiaoguang Qi and Brian D. Davison
Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification, by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic
The Evolution of Websites: http://www.wakeuplater.com/website-building/evolution-of-websites-10-popular-websites.aspx