
Probe, Count, and Classify: Categorizing Hidden Web Databases

Panagiotis G. Ipeirotis, Luis Gravano (Columbia University)
Mehran Sahami (E.piphany Inc.)

Surface Web vs. Hidden Web

Surface Web: link structure; crawlable

Hidden Web: no link structure; documents “hidden” behind search forms

[Slide graphic: a search form with a Keywords box and SUBMIT / CLEAR buttons]

Do We Need the Hidden Web?

Example: PubMed/MEDLINE

PubMed (www.ncbi.nlm.nih.gov/PubMed), search “cancer”: 1,341,586 matches
AltaVista, “cancer site:www.ncbi.nlm.nih.gov”: 21,830 matches

Estimated size: Surface Web ~2 billion pages; Hidden Web ~500 billion pages (?)

Interacting With Searchable Text Databases

Searching: Metasearchers

Browsing: Yahoo!-like web directories, e.g., InvisibleWeb.com, SearchEngineGuide.com

Example from InvisibleWeb.com

Health > Publications > PubMED

Created Manually!

Classifying Text Databases Automatically: Outline

Definition of classification

Classification through query probing

Experiments

Database Classification: Two Definitions

Coverage-based classification: the database contains many documents about a category. Coverage: number of documents about the category.

Specificity-based classification: the database contains mainly documents about a category. Specificity: number of documents about the category divided by |DB|.
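In symbols (my notation for the slide’s two definitions, where D is the database and C the category):

```latex
\mathrm{Coverage}(D, C) = \bigl|\{\, d \in D : d \text{ is about } C \,\}\bigr|,
\qquad
\mathrm{Specificity}(D, C) = \frac{\mathrm{Coverage}(D, C)}{|D|}
```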

Database Classification: An Example

Category: Basketball

Coverage-based classification: ESPN.com, NBA.com, but not KnicksTerritory.com

Specificity-based classification: NBA.com, KnicksTerritory.com, but not ESPN.com

Database Classification: More Details

Thresholds for coverage and specificity:
Tc: coverage threshold (e.g., 100)
Ts: specificity threshold (e.g., 0.5)

Tc and Ts are “editorial” choices.

[Slide figure: example hierarchy. Root has children Sports (C=800, S=0.8) and Health (C=200, S=0.2); Sports has children Baseball (S=0.5) and Basketball (S=0.5).]

Ideal(D)

Ideal(D): the set of classes for database D. Class C is in Ideal(D) if:
D has “enough” coverage and specificity (Tc, Ts) for C and all of C’s ancestors, and
D fails to have both “enough” coverage and specificity for each child of C.
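Spelled out as a formula (my phrasing of the slide’s condition, with T_c and T_s the coverage and specificity thresholds):

```latex
C \in \mathrm{Ideal}(D) \iff
\begin{cases}
\mathrm{Coverage}(D, C') \ge T_c \,\wedge\, \mathrm{Specificity}(D, C') \ge T_s
  & \text{for } C' = C \text{ and every ancestor of } C, \\
\mathrm{Coverage}(D, C'') < T_c \,\vee\, \mathrm{Specificity}(D, C'') < T_s
  & \text{for every child } C'' \text{ of } C.
\end{cases}
```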

From Document to Database Classification

If we know the categories of all documents inside the database, we are done!

We do not have direct access to the documents; databases do not export such data!

How can we extract this information?

Our Approach: Query Probing

1. Train a rule-based document classifier.

2. Transform classifier rules into queries.

3. Adaptively send queries to databases.

4. Categorize the databases based on adjusted number of query matches.

Training a Rule-based Document Classifier

Feature Selection: Zipf’s law pruning, followed by information-theoretic feature selection [Koller & Sahami’96]

Classifier Learning: AT&T’s RIPPER [Cohen 1995]
Input: a set of pre-classified, labeled documents
Output: a set of classification rules

IF linux THEN Computers
IF jordan AND bulls THEN Sports
IF lung AND cancer THEN Health

Constructing Query Probes

Transform each rule into a query:
IF lung AND cancer THEN Health → +lung +cancer
IF linux THEN Computers → +linux

Send the queries to the database.

Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule).

These documents would have been classified by the rule under its associated category!
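A minimal sketch of the rule-to-query transformation, assuming rules arrive as (terms, category) pairs and that the target database accepts the common “+term” AND syntax; the data structures are illustrative, not RIPPER’s actual output format:

```python
# Rules of the form "IF term1 AND term2 THEN category", represented here
# as (list-of-terms, category) pairs for illustration.
rules = [
    (["lung", "cancer"], "Health"),
    (["linux"], "Computers"),
    (["jordan", "bulls"], "Sports"),
]

def rule_to_probe(terms):
    # Conjunction of the rule's terms, expressed with "+term" AND syntax.
    return " ".join("+" + t for t in terms)

# Map each query probe to the category its rule predicts.
probes = {rule_to_probe(terms): category for terms, category in rules}
# {'+lung +cancer': 'Health', '+linux': 'Computers', '+jordan +bulls': 'Sports'}
```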

Adjusting Query Results

Classifiers are not perfect!
Queries do not “retrieve” all the documents in a category.
Queries for one category “match” documents not in this category.

From the classifier’s training phase we know its “confusion matrix”

Confusion Matrix

Confusion matrix M (rows: class the documents are classified into; columns: correct class):

            comp   sports  health
comp        0.70   0.10    0.00
sports      0.18   0.65    0.04
health      0.02   0.05    0.86

DB-real (actual contents of D): comp 1000, sports 5000, health 50

Probing results (ECoverage):
comp:   700 + 500 + 0  = 1200
sports: 180 + 3250 + 2 = 3432
health: 20 + 250 + 43  = 313

M · Coverage(D) ≈ ECoverage(D)

Example: 10% of “Sports” documents are classified as “Computers”, so 10% of the 5000 “Sports” docs end up counted under “Computers”.

Confusion Matrix Adjustment: Compensating for the Classifier’s Errors

Using the same confusion matrix M, the same real contents (comp 1000, sports 5000, health 50), and the same probing results (comp 1200, sports 3432, health 313):

Coverage(D) ≈ M⁻¹ · ECoverage(D)

M is diagonally dominant, hence invertible

Multiplying the probe counts by M⁻¹ better approximates the real per-category coverage.
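A quick numerical check of the adjustment, using the confusion matrix and probe counts from the slides (numpy is used here only for illustration):

```python
import numpy as np

# M[i][j] = fraction of documents whose correct class is j that the
# classifier assigns to class i. Order: Computers, Sports, Health.
M = np.array([
    [0.70, 0.10, 0.00],
    [0.18, 0.65, 0.04],
    [0.02, 0.05, 0.86],
])

ecoverage = np.array([1200.0, 3432.0, 313.0])  # raw probe match counts

# Coverage(D) ~ M^-1 . ECoverage(D): undo the classifier's errors.
coverage = np.linalg.solve(M, ecoverage)
print(coverage)  # ~ [1000. 5000.   50.], the real per-category counts
```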

Classifying a Database

1. Send the query probes for the top-level categories.
2. Get the number of matches for each probe.
3. Calculate Specificity and Coverage for each category.
4. “Push” the database to the qualifying categories (with Specificity > Ts and Coverage > Tc).
5. Repeat for each of the qualifying categories.
6. Return the classes that satisfy the coverage/specificity conditions.

The result is an approximation of the Ideal classification.
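A minimal sketch of this recursive step, assuming a dict-based hierarchy and precomputed, confusion-matrix-adjusted match counts; the helper structures and the fixed database size are illustrative (in practice |D| would itself have to be estimated, e.g., from the top-level adjusted counts):

```python
# Toy hierarchy: each category maps to its children (illustrative only).
HIERARCHY = {
    "Root": ["Computers", "Health", "Sports"],
    "Computers": ["Software", "Hardware", "Programming"],
    "Health": [], "Sports": [],
    "Software": [], "Hardware": [], "Programming": [],
}

def classify(category, coverage, db_size, Tc, Ts):
    """Push the database into qualifying subcategories, recursively."""
    qualifying = [c for c in HIERARCHY[category]
                  if coverage.get(c, 0) >= Tc
                  and coverage.get(c, 0) / db_size >= Ts]
    if not qualifying:            # no child qualifies: classify under this node
        return [category]
    result = []
    for child in qualifying:      # otherwise, probe each qualifying child
        result.extend(classify(child, coverage, db_size, Tc, Ts))
    return result

# Made-up adjusted match counts for a computer-oriented database:
counts = {"Computers": 9000, "Software": 2000, "Hardware": 2700, "Programming": 1000}
print(classify("Root", counts, db_size=10000, Tc=100, Ts=0.3))  # ['Computers']
```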

Real Example: ACM Digital Library

(Tc=100, Ts=0.5)

[Slide figure: classification hierarchy annotated with (Coverage, Specificity). Root → Arts (0, 0), Sports (22, 0.008), Science (430, 0.042), Health (0, 0), Computers (9919, 0.95). Computers → Programming (1042, 0.18), Hardware (2709, 0.465), Software (2060, 0.355). Programming → C/C++, Java, Visual Basic, Perl.]

Experiments: Data

72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)

500,000 Usenet articles (April-May 2000):
Newsgroups assigned by hand to hierarchy nodes
RIPPER trained with 54,000 articles (1,000 articles per leaf)
27,000 articles used to construct estimates of the confusion matrices
Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes

Comparison With Alternatives

DS: Random sampling of documents via query probes
Callan et al., SIGMOD’99; different task (gathering vocabulary statistics); we adapted it for database classification

TQ: Title-based probing
Yu et al., WISE 2000; query probes are simply the category names

Experiments: Metrics

Accuracy of classification results:
Expanded(N) = N and all of its descendants
Correct = Expanded(Ideal(D))
Classified = Expanded(Approximate(D))

Precision = |Correct ∩ Classified| / |Classified|
Recall = |Correct ∩ Classified| / |Correct|
F-measure = 2 · Precision · Recall / (Precision + Recall)

Cost of classification: number of queries sent to the database

[Slide figure: a node N and its subtree Expanded(N)]
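A small sketch of these Expanded-set metrics, assuming the hierarchy is given as a parent-to-children dict (an illustrative structure, not from the paper):

```python
def expanded(nodes, hierarchy):
    """Expanded(N) for a set of nodes: the nodes plus all their descendants."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(hierarchy.get(n, []))
    return result

def f_measure(ideal, approximate, hierarchy):
    correct = expanded(ideal, hierarchy)           # Expanded(Ideal(D))
    classified = expanded(approximate, hierarchy)  # Expanded(Approximate(D))
    overlap = correct & classified
    precision = len(overlap) / len(classified) if classified else 0.0
    recall = len(overlap) / len(correct) if correct else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```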

Average F-measure, Controlled Databases

[Figure: average F-measure over the Controlled Databases as a function of Ts from 0 to 1 (Tc = 8), comparing PnC (Probe & Count), DS (Document Sampling) [Callan et al.], and TQ (Title-based Probing) [Yu et al.].]

Experimental Results: Controlled Databases

Feature selection helps.
Confusion-matrix adjustment helps.
F-measure above 0.8 for most <Tc, Ts> combinations.
Results degrade gracefully with hierarchy depth.
Relatively small number of probes needed for most <Tc, Ts> combinations tried. Also, probes are short: 1.5 words on average, 4 words maximum.

Both better performance and lower cost than DS [Callan et al. adaptation] and TQ [Yu et al.]

Web Databases

130 real databases classified from InvisibleWeb™.
Used InvisibleWeb’s categorization as correct.
Simple “wrappers” for querying (only the number of matches is needed).

The Ts, Tc thresholds are not known (unlike with the Controlled databases) but are implicit in the InvisibleWeb categorization.

We can learn/validate the thresholds (tricky but easy!). More details in the paper!

Web Databases: Learning Thresholds

[Figure: F-measure (0 to 0.8) over the Web databases for different <Tec, Tes> threshold combinations, with Tec on a log scale from 1 to 262,144 and Tes from 0 to 0.9.]

Experimental Results: Web Databases

130 Real Web Databases.

F-measure above 0.7 for best <Tc, Ts> combination learned.

185 query probes per database on average needed for classification.

Also, probes are short: 1.5 words on average; 4 words maximum.

Conclusions

Accurate classification using only a small number of short queries

No need for document retrieval; only need a result like “X matches found”.

No need for any cooperation or special metadata from databases

Current and Future Work

• Build “wrappers” automatically
• Extend to non-topical categories
• Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked)
• Extend to other classifiers (e.g., SVMs or Bayesian models)
• Integrate with searching (connection with database selection?)

Questions?

Contributions

Easy, inexpensive method for database classification
Uses results from document classification
“Indirect” classification of the documents in a database: does not inspect documents, only the number of matches
Adjustment of results according to the classifier’s performance
Easy wrapper construction
No need for any metadata from the database

Related Work

Callan et al., SIGMOD 1999
Gauch et al., ProFusion
Dolin et al., Pharos
Yu et al., WISE 2000
Raghavan and Garcia-Molina, VLDB 2001

Controlled Databases

500 databases built using 419,000 newsgroup articles

One label per document
350 databases with a single (not necessarily leaf) category
150 databases with varying category mixes
Database size ranges from 25 to 25,000 articles
Indexed and queried using SMART

F-measure for Different Hierarchy Depths

[Figure: F-measure (0.50 to 1.00) at hierarchy levels 0-3, with Tc = 8 and Ts = 0.3, comparing PnC (Probe & Count), DS (Document Sampling) [Callan et al.], and TQ (Title-based Probing) [Yu et al.].]

Query Probes Per Controlled Database

[Figure: number of interactions with each Controlled Database for PnC, DS, and TQ: (a) as a function of Tes from 0 to 1; (b) as a function of Tec from 1 to 8192 (log scale).]

Web Databases: Number of Query Probes

[Figure: number of query probes per Web database (0 to 600) for different <Tec, Tes> combinations, with Tec on a log scale from 1 to 262,144 and Tes from 0 to 0.9.]

3-fold Cross-validation

[Figure: F-measure for each fold (F-1, F-2, F-3) of the 3-fold cross-validation, over threshold values from 0 to 1.]

Real Confusion Matrix for Top Node of Hierarchy

            Health   Sports   Science  Computers  Arts
Health      0.753    0.018    0.124    0.021      0.017
Sports      0.006    0.171    0.021    0.016      0.064
Science     0.016    0.024    0.255    0.047      0.018
Computers   0.004    0.042    0.080    0.610      0.031
Arts        0.004    0.024    0.027    0.031      0.298