Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment
Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou
Research & Development, Thomson Legal & Regulatory – West Group
St. Paul, Minnesota 55123 USA
{Jack.Conrad,Peter.Jackson}@WestGroup.com
28th International VLDB '02 — J. Conrad, 20-23 Aug. 2002
Growth of Online Databases: Westlaw and Westnews

[Chart: number of databases over time for Westnews and Westlaw; y-axis: No. of Databases, 0–16,000]
Outline

Terminology
Overview
Research Contributions (Novelty of Investigation)
Corpora Statistics
Experimental Set-up
  Phase 1: Actual Physical Resources
  Phase 2: Acquired Logical Resources
Performance Evaluation
Conclusions
Future Work

Terminology gives background to the vocabulary, including that used in the title. The Overview describes our operational environment and overall problem space. The Contributions cover aspects of the problem that haven't been explored before, especially with respect to scale and production systems. Corpora Statistics presents the data sets used, namely those listed for the Experimental Set-up. The Performance Evaluation compares the effectiveness of each approach on each data set. Finally, I'll share what conclusions we're able to draw and discuss new directions this work may be taking.
Terminology

Database Selection
• Given O(10K) DBs composed of textual documents, we need to effectively and efficiently help users narrow their information search and home in on the most relevant materials available in the system

Actual Physical Resources
• There exist O(1K) underlying physical DBs that can be leveraged to reduce the dimensionality of the problem; these are organized around internal criteria such as publication year, hardware system, etc.
• We have access to the complete term distributions associated with these DBs; we wanted to convince ourselves that we could first get reasonable results at this level

Acquired Logical Resources
• We can re-architect the underlying DBs along domain- and user-centric content types (e.g., Region, Topic, Doc-type, etc.)
• We can then characterize ("profile") these logical DBs using different sampling techniques, either random or query-based
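The query-based sampling mentioned above can be sketched roughly as follows. This is a minimal illustration only: the probe-query strategy, the stopping criterion, and the `search_db` interface are assumptions for the sketch, not the production profiling pipeline.

```python
import random

def query_based_sample(search_db, seed_terms, target_docs=500, per_query=4):
    """Build a collection profile by probing a DB with single-term queries.

    search_db  - callable: term -> list of documents (each a list of tokens)
    seed_terms - initial probe vocabulary (an assumption of this sketch)
    Probe terms are drawn from documents already retrieved, so the
    profile's vocabulary grows toward the DB's actual language.
    """
    profile, pool, seen = [], list(seed_terms), set()
    while len(profile) < target_docs and pool:
        term = pool.pop(random.randrange(len(pool)))  # pick a probe term
        for doc in search_db(term)[:per_query]:
            key = tuple(doc)
            if key not in seen:          # keep each document once
                seen.add(key)
                profile.append(doc)
                pool.extend(doc)         # learned terms become future probes
    return profile[:target_docs]
```

Random sampling, by contrast, simply draws documents uniformly from the collection, which is feasible here because Westlaw is a cooperative environment with direct access to its own data.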
Overview: Operational Environment

• Over 15,000 databases, each consisting of 1,000s of documents
• Over one million U.S. attorneys; thousands of other users in the UK, Canada, Australia, …
• O(100K) queries submitted to the Westlaw system each day
• Users are currently required to submit a DB ID with their searches

Motivations for Re-architecting the System
• Showcasing 1000s of DBs has typically been a competitive advantage, but a segment of today's users prefers global search environments
• Simplified activity of narrowing scope for online research
  • User- and domain-centric rather than hardware- or maintenance-centric
  • Primarily concentrating on areas of law and business
• Toolkit approach to DBs and DB selection tools
  • Diverse mechanisms for focusing on relevant information, each optimized for a particular level of granularity
Contributions of Research

• Represent O(10,000) DBs; DBs can contain O(100,000) documents
• Collection sizes vary by several orders of magnitude
• Documents can appear in more than one DB
• DBs are cumulatively in the TB, not GB, range (the work reported here involves between 2 and 3 TB)
• Documents represent a real, not simulated, domain
• Implemented in an actual production environment
Westlaw Architectural Issues: Physical vs. "Logical" Databases

[Diagram: O(1000) physical databases (Case_Law 2002/03, Statutes 2002/04, WestNews 2002/05, Analytical 2002/06, Regulatory 2002/07) re-architected into O(100) logical collections segmented by Jurisdiction (Fed., State, Local, Int'l), Legal Practice Area, and Doc-Type — an order-of-magnitude difference]

Traditionally, data for the Westlaw system were physically stored in silos dictated by internal considerations, that is, those that facilitated storage and maintenance (publication year, aggregate content type, or source), rather than by categories of data that made sense to system users in the legal domain, such as legal jurisdiction (region), legal practice area, or document type (e.g., congressional legislation, treatises, jury verdicts, etc.). Re-architecting the Westlaw repository along these user-centric lines to obtain our logical data sets was our primary objective. The three columns labeled in red represent the three primary bases for segmentation; the rows labeled in blue are the residual sub-groupings resulting from this strategy.
Corpora Statistics

Collection Information       Physical Databases (Phase 1)   Logical Databases (Phase 2)
Number of Collections        1000                           128
Collections Profiled         100                            128
Standard Docs / Profile      All                            500 / 1000
Average Docs / Collection    298,935                        378,468
Average Tokens / Profile     97,299                         22,296 / 47,450

In Phase 1, each document participates in its collection's profile, so a profile is essentially the entire dictionary; the profiled physical collections represent roughly 40% of Westlaw. In Phase 2, profiles are built via sampling (Callan found 300 documents sufficed); the 500- and 1000-document profiles capture roughly 25% and 50% of the complete dictionary, respectively, and the logical collections represent roughly 90% of Westlaw.
Alternative Scoring Models

Scoring: CORI 1-2-3
• tf-idf based, representing df-icf
• absent terms given a default belief probability
Engine: WIN Bayesian Inference Network
Data: Collection Profiles
• Complete term distributions (Phase 1)
• Random & query-based sampled term distributions (Phase 2)

Scoring: Language Model
• occurrence based, via df + cf
• smoothing techniques used on absent terms
Engine: Statistical term/concept probabilities
Data: Collection Profiles
• Complete term distributions (Phase 1)
• Random & query-based sampled term distributions (Phase 2)
tf * idf Scoring — Cori_Net3

    p(w_i | c_j) = d_b + (1 − d_b) · T · I

    T = df / (df + K),  K = k · ((1 − b) + b · cw / avg_cw)
    I = log( (|C| + 0.5) / cf ) / log( |C| + 1.0 )

• Similar to Cori_Net2, but normalized without layered variables

The belief p(w_i | c_j) in collection c_j due to observing term w_i is determined by d_b + (1 − d_b) · T · I, where d_b is the minimum belief component when term w_i occurs in collection c_j. The component I is the collection-retrieval equivalent of normalized inverse document frequency (idf). The tf-type expression T is typically normalized by df_max, but here we introduce K, which has been inspired by experiments in document retrieval. Our K is different from anything Callan or others have used; they have a set of parameters that are successively wrapped around each other.
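The belief computation above can be sketched in a few lines. The constants `d_b`, `k`, and `b` below are illustrative defaults, not the paper's tuned settings, and `score_collection` assumes a simple sum of per-term beliefs as the collection rank score.

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_collections,
                d_b=0.4, k=200, b=0.75):
    """Belief p(w|c) that collection c satisfies a query term w.

    df     - documents in c containing the term
    cf     - number of collections containing the term
    cw     - total words in collection c; avg_cw - mean over collections
    d_b, k, b - default belief and normalization constants (illustrative)
    """
    K = k * ((1 - b) + b * cw / avg_cw)          # df normalization
    T = df / (df + K)                            # tf-like component
    I = (math.log((num_collections + 0.5) / cf)
         / math.log(num_collections + 1.0))      # icf, the idf analogue
    return d_b + (1 - d_b) * T * I

def score_collection(query_terms, term_stats, **kw):
    """Rank score for one collection: sum of per-term beliefs.

    term_stats - map from term to (df, cf, cw, avg_cw, num_collections).
    """
    return sum(cori_belief(*term_stats[t], **kw) for t in query_terms)
```

Note that a term absent from the collection (df = 0) still contributes the default belief d_b, matching the "absent terms given default belief probability" behavior described earlier.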
Language Modeling

• Weighted Sum Approach (Additive Model):

    P_sum(w | d) = λ · P_doc(w | d) + (1 − λ) · P_db(w)

• Query Treated as a Sequence of Terms (Independent Events):

    P_sequence(Q | d) = ∏_{i=1..m} P(w_i | d)

λ is of course between 0 and 1. A language model based only on a profile document may face sparse-data problems when the probability of a word w given a profile "document" is 0 (an unobserved event), so it may be useful to extend the original document model with a db model. An additive model can help by leveraging extra evidence from the complete collection of profiles: by summing in the contribution of a word at the db level, we can mitigate the uncertainty associated with sparse data in the non-additive model. Treating the query as a sequence of terms, with each term viewed as a separate event and the query representing the joint event, permits duplicate terms and phrasal expressions.
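A minimal sketch of this mixture scoring, assuming maximum-likelihood term probabilities and an illustrative mixture weight; the paper's smoothing settings are not specified here.

```python
import math

def lm_score(query_terms, profile_tf, profile_len, db_tf, db_len, lam=0.5):
    """Log-probability of a query under a smoothed profile language model.

    profile_tf / profile_len - term counts and token count of one profile
    db_tf / db_len           - counts pooled over all profiles (the db model)
    lam                      - mixture weight λ in (0, 1), illustrative
    """
    score = 0.0
    for w in query_terms:                 # query as a sequence of terms:
        p_doc = profile_tf.get(w, 0) / profile_len
        p_db = db_tf.get(w, 0) / db_len   # db-level evidence for smoothing
        p = lam * p_doc + (1 - lam) * p_db
        if p == 0.0:                      # term unseen everywhere; floor it
            p = 1e-12
        score += math.log(p)              # independent events multiply
    return score
```

Collections would then be ranked by descending `lm_score`; the db-level term counterbalances the zero probability a sampled profile assigns to unobserved words.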
Test Queries and Relevance Judgments

Actual user submissions to the DBS application
• Phase 1 (Physical Collections): 250 queries, mean length 8.0 terms
• Phase 2 (Logical Collections): 100 queries, mean length 8.8 terms

Complete Relevance Judgments
• Provided by domain experts before the experiments were run
• Followed training exercises to establish consistency
• Mean positive relevance judgments per query:
  Phase 1 (Physical Collections): 17.0
  Phase 2 (Logical Collections): 9.1

Why did we use a different query set for Phase 2? We wanted queries that were less general and more specific, with fewer positive relevance judgments per query.
Retrieval Experiments

Database-level Test Parameters:
• 100 physical DBs vs. 128 logical DBs
• For logical DB profiles:
  query-based vs. random sampling
  phrasal concepts vs. terms only
  stemming vs. no stemming
  scaling vs. none (i.e., global frequency reduction)
  minimum term frequency thresholds

Performance Metrics:
• Standard precision at 11-point recall
• Precision at N-database cut-offs

It's important to point out that our initial experiments were at the database level. Some of the variables we examined are indicated here: queries with phrasal concepts versus terms only, stemmed versus unstemmed terms, and scaling, which was inspired by noise reduction in speech recognition experiments. We'll see some examples of these next.
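The 11-point precision metric used in the plots that follow can be computed as below (a standard-formula sketch; the per-query averaging and the baseline curves are outside this snippet).

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked   - list of DB ids in scored order
    relevant - set of DB ids judged relevant for the query
    Averaging the returned 11-element lists over all queries yields
    precision-recall curves like those on the following slides.
    """
    if not relevant:
        return [0.0] * 11
    precisions, hits = [], 0
    for i, db in enumerate(ranked, start=1):
        if db in relevant:
            hits += 1
            precisions.append((hits / len(relevant), hits / i))  # (recall, prec)
    points = []
    for level in [r / 10 for r in range(11)]:
        # interpolated precision: best precision at any recall >= level
        ps = [p for r, p in precisions if r >= level]
        points.append(max(ps) if ps else 0.0)
    return points
```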
DBS — Phase 1: 100 Physical Collections, Cori_Net2 vs. LM (250 Queries)

[Plot: 11-point precision-recall curves for Cori2_all and LM_all; axes: Recall (Percent) and Precision (Percent), 0–100]

This essentially represents the best from both methods for this phase, with performance averaged over 250 queries. We see that LM clearly outperforms CORI by more than 10% at the first recall points, a result consistent with recent results in the document retrieval domain.
DBS — Phase 2: 128 Logical Collections, Cori_Net2 vs. LM (100 Queries)

[Plot: 11-point precision-recall curves for Baseline_11pt, Cori2_0.6_300_stem, and LM_Rand_500_1; axes: Recall (Percent) and Precision (Percent), 0–100]

When we move to the logical collections, we see a reversal in this relative performance. We include the baseline in this case because it is relatively closer to that of the two techniques. The average precision of the two shown may be similar, but CORI is significantly better than the other LM results here (Rand_1000 and QBS 500+1000).
DBS — Phase 2: 128 Logical Collections, Enhanced Cori_Net3 (100 Queries)

[Plot: 11-point precision-recall curves for baseline_11pt, Cori3_400_1.0, Cori3_400_1.0_Lex, and Cori3_400_1.0_Lex+; axes: Recall (Percent) and Precision (Percent), 0–100]

In this final plot, we explore a special post-process lexical analysis of queries for jurisdictionally relevant content; when no such context is found, jurisdictionally biased collections are down-weighted. For results marked Lex, the process is applied only to queries with no jurisdictional clues. For results marked Lex+, the reranking is applied to all queries, but the DBs that match the lexical clues are left in their original ranks.
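The Lex variant of this post-process can be sketched as follows. The clue list, the substring matching, and the down-weighting factor are illustrative assumptions of this sketch, not the production system's actual lexical resources.

```python
# Hypothetical jurisdiction vocabulary; the real system's clue
# inventory is not specified in this presentation.
JURISDICTION_CLUES = {"california", "federal", "texas", "ninth circuit"}

def lex_rerank(query, ranked_dbs, penalty=0.5):
    """Down-weight jurisdictionally biased DBs when the query
    supplies no jurisdictional context (the 'Lex' behavior).

    ranked_dbs - list of (db_id, score, is_jurisdictional) tuples,
                 already sorted by score
    penalty    - illustrative down-weighting factor
    """
    text = query.lower()
    if any(clue in text for clue in JURISDICTION_CLUES):
        return ranked_dbs            # query has context; leave ranks alone
    adjusted = [(db, score * penalty if juris else score, juris)
                for db, score, juris in ranked_dbs]
    return sorted(adjusted, key=lambda t: t[1], reverse=True)
```

The Lex+ variant would apply the reranking to every query while pinning clue-matching DBs at their original ranks.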
Performance Evaluation

WIN using CORI scoring
• Works better for logical collections than physical collections
• Best results from randomly sampled DBs

Language Modeling with basic smoothing
• Performs best for physical collections; less well for logical
• Top results from randomly sampled DBs

Jurisdictional Lexical Analysis contributes > 10% to average precision

Precision at initial recall point:
        Physical DBs    Logical DBs
CORI    70%             85+%
LM      80%             70%

These results don't agree with Callan's, but he was operating in a non-cooperating environment. And as we saw, adding our post-process lexical analysis increased precision by over 10% at the top recall points.
Document-level Relevance

Relevance Category     Quantity    Percentage    % Relevant (Cumulative)
On Point               1415        56.60%        56.60%
Relevant               439         17.56%        74.16%
Marginally Relevant    199         7.96%         82.12%
Not Relevant           447         17.88%        ———
Combined               2500        100.00%       82.12%

We took 25% of our Phase 2 queries and ran them against the top 5 CORI-ranked DBs, then evaluated the top 20 documents (2,500 docs total); this is what resulted. The "On Point" category surpasses the next three categories combined.
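The table's percentage columns follow directly from the raw counts; a small sketch that recomputes them (the ordering convention, with the non-relevant category last, is an assumption of this snippet):

```python
def relevance_summary(counts):
    """Recompute the percentage and cumulative-relevance columns.

    counts - ordered (category, quantity) pairs; the final category is
    treated as non-relevant and excluded from the cumulative column.
    """
    total = sum(q for _, q in counts)
    rows, cum = [], 0.0
    for i, (cat, q) in enumerate(counts):
        pct = 100.0 * q / total
        relevant = i < len(counts) - 1          # all but the last category
        if relevant:
            cum += pct
        rows.append((cat, q, round(pct, 2),
                     round(cum, 2) if relevant else None))
    return rows, total

rows, total = relevance_summary(
    [("On Point", 1415), ("Relevant", 439),
     ("Marginally Relevant", 199), ("Not Relevant", 447)])
```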
Conclusions
WIN using CORI scoring more effective than current LM for environments that harness database profiling via sampling
Language Modeling more sensitive to sparse data issues
Post-process Lexical Analysis contributes significantly to performance
Random-sampling Profile Creation outperforms Query-based sampling in the WL environment
Future Work

Document Clustering
• Basis for new categories of databases; may show promise for domains in which we know much less about the pre-existing document structure

Language Modeling
• Harness robust smoothing techniques (simple, linear, smallest binomial, finite element, b-spline)
• Measure contribution to logical DB performance (competing with high performance thanks to CORI)

Actual Document-level Relevance
• Expand the set of relevance judgments
• Assess document scores based on both DB and document beliefs

Bi-modal User Analysis
• Complete automation vs. user interaction in DBS
Related Work

L. Gravano, et al., Stanford (VLDB 1995)
• Presented the GlOSS system to assist in the DB selection task
• Used 'Goodness' as a measure of effectiveness

J. French, et al., U. Virginia (SIGIR 1998)
• Came up with metrics to evaluate DB selection systems
• Began to compare the effectiveness of different methods

J. Callan, et al., UMass (SIGIR 95+99, CIKM 2000)
• Developed the Collection Retrieval Inference Net (CORI)
• Showed CORI was more effective than GlOSS, CVV, and others
Background
Exponential growth of data sets on Web and in commercial enterprises
Limited means of narrowing scope of searches to relevant databases
Application challenges in large domain-specific operational environments
Need effective approaches that scale and deliver in focused production systems