![Page 1: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/1.jpg)
1
TAR 3.0 and Training of Predictive Coding Systems
Bill Dimm
December 10, 2015
![Page 2: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/2.jpg)
2
Topics
● How machines learn – pattern detection
● Tips on reviewing docs for training
● Performance measures: precision & recall
● TAR 1.0: a baseline
– How to use control sets
– Random vs. non-random training
● TAR 2.0: better efficiency, especially for low prevalence
● TAR 3.0: good if you don't need to review everything you'll produce
![Page 3: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/3.jpg)
3
Identifying Patterns
![Page 4: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/4.jpg)
4
Sorting by Relevance Score
![Page 5: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/5.jpg)
5
Benefits of Predictive Coding
● Reduce Cost
– Less human doc review
● See Relevant Docs Earlier
– Decide to settle before spending too much on e-discovery
● Quality Assurance
– Detect inconsistent tagging
![Page 6: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/6.jpg)
6
Doc Review for Training
● Think about how the computer will interpret the tags you apply
● Don't tag doc as non-relevant just because it is duplicative
● Emails with attachments
– Want to produce the whole family but only part may be relevant
![Page 7: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/7.jpg)
7
Garbage In, Garbage Out?
● A low threshold for relevance may be better than a very precise view
● System may identify possible mistakes
● Large amount of low-quality training data may be better than a small amount of high-quality
![Page 8: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/8.jpg)
8
Terminology
● Prevalence (a.k.a. Richness)
– Percentage of all docs that are relevant
● Recall
– Percentage of relevant docs found
– Important for defensibility
● Precision
– Percentage of docs predicted to be relevant that actually are relevant
– Important for cost – reduce review
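The three definitions above can be made concrete with a small worked example. This is only an illustrative sketch: the function names and the ten-document toy data are assumptions, not part of the presentation.

```python
def prevalence(labels):
    """Fraction of all docs that are relevant (a.k.a. richness)."""
    return sum(labels) / len(labels)


def precision_recall(labels, predicted):
    """Return (precision, recall) for 0/1 truth labels and 0/1 predictions."""
    true_pos = sum(1 for y, p in zip(labels, predicted) if y and p)
    predicted_pos = sum(predicted)   # docs predicted to be relevant
    actual_pos = sum(labels)         # docs that really are relevant
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical toy population: 10 docs, 4 of them relevant.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions: 5 docs flagged as relevant, 3 of them correctly.
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

p, r = precision_recall(labels, predicted)
print(f"prevalence={prevalence(labels):.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy case, prevalence is 4/10, precision is 3/5, and recall is 3/4.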
![Page 9: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/9.jpg)
9
Precision-Recall Curve
![Page 10: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/10.jpg)
10
TAR 1.0
1) Review training docs
2) Use control set to determine whether training should end
Back to (1) if additional training is worthwhile
3) Sort remaining docs by relevance score for review/production
4) Sample/test to ensure sufficient recall
● Training could be with random or non-random docs
– These options are very different!
![Page 11: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/11.jpg)
11
Control Set
![Page 12: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/12.jpg)
12
How Much Training Data?
● Training set size should never involve phrases like:
– 95% confidence
– +/- 2%
– Statistically significant sample (this isn't even a thing!)
● Those phrases are about determining how many docs are relevant.
● Training is about which docs are relevant.
![Page 13: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/13.jpg)
13
Boxes of Gold/Lead: All Same
● 1 million identical boxes. Some contain an ounce of gold, others an ounce of lead.
● Sample 400 random boxes, find that 80 contain gold
– 20% +/- 5% contain gold with 95% confidence (really 16% to 24%)
● Sample 1,600 random boxes, 320 contain gold
– 20% +/- 2.5% contain gold with 95% confidence
● Which boxes contain gold? No idea!
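The margins quoted above appear consistent with the usual normal-approximation formula for a proportion, z·sqrt(p(1−p)/n). The sketch below assumes that formula (the slides do not say which method was used), and the function name is illustrative. The worst case uses p = 0.5; plugging in the observed 20% gives the narrower interval.

```python
import math

Z_95 = 1.96  # z-value for 95% confidence under the normal approximation


def margin_of_error(n, p=0.5, z=Z_95):
    """Half-width of the confidence interval for a proportion.

    Assumption: simple normal (Wald) approximation; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)


for n, gold in [(400, 80), (1600, 320)]:
    p_hat = gold / n
    worst = margin_of_error(n)            # margin assuming p = 0.5
    observed = margin_of_error(n, p_hat)  # margin at the observed proportion
    print(f"n={n}: p={p_hat:.0%}, worst-case +/- {worst:.1%}, "
          f"at observed p +/- {observed:.1%}")
```

Quadrupling the sample halves the margin, but in neither case does the sample say anything about which unopened boxes hold gold.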
![Page 14: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/14.jpg)
14
Boxes of Gold/Lead: Colored
● Boxes come in different colors. All boxes with the same color contain the same metal.
● How many boxes do we sample to find the gold?
– Depends on how many colors there are
![Page 15: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/15.jpg)
15
Amount of Training Depends on Task
![Page 16: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/16.jpg)
16
Optimal Training
● More training gives better predictions, so fewer non-relevant docs to review (higher precision)
● Benefit of additional training diminishes until not worthwhile
● n = ρNR/P
– n = number of docs to review
– ρ = prevalence
– N = number of non-training docs
– R = desired recall
– P = precision at desired recall
● Do not use F1 to measure training progress. Use precision at desired recall.
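The formula follows from the definitions: ρN docs are relevant, ρNR of them must be found to reach recall R, and at precision P only a fraction P of the reviewed docs are relevant, so n = ρNR/P. A small sketch with made-up numbers (the population size, prevalence, and precision values below are assumptions chosen only for illustration):

```python
def docs_to_review(prevalence, n_docs, recall, precision):
    """n = rho * N * R / P: docs that must be reviewed to hit the desired recall."""
    return prevalence * n_docs * recall / precision


# Hypothetical case: 1,000,000 non-training docs, 10% prevalence, 75% recall target.
for precision in (0.50, 0.75):
    n = docs_to_review(prevalence=0.10, n_docs=1_000_000,
                       recall=0.75, precision=precision)
    print(f"precision at 75% recall = {precision:.0%} -> review about {n:,.0f} docs")
```

With these numbers, raising precision at the desired recall from 50% to 75% cuts the review from about 150,000 docs to about 100,000. That saving is what has to be weighed against the cost of the extra training.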
![Page 17: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/17.jpg)
17
Optimal Training
![Page 18: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/18.jpg)
18
Static Control Set
● Random set of documents reviewed at beginning
● Fails to account for shifting understanding of relevance as review progresses
![Page 19: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/19.jpg)
19
Rolling Control Set
● If training with random docs, use most recent docs as control set
● Fixes the shifting understanding of relevance problem
![Page 20: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/20.jpg)
20
How to Select Training Docs
● Representative / Random
● Judgmental
– Human chooses docs considered to be good examples, e.g., keyword search
● Active Learning
– Computer chooses docs based on a strategy intended to aid learning
![Page 21: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/21.jpg)
21
Active Learning Approaches
● Common Outside of E-Discovery
– Docs estimated to have 50% chance of relevance
– Details may be specific to classification algorithm
– Remember the control set animation – points close to the separating boundary had most impact
● Continuous Active Learning (TAR 2.0)
– Docs most likely to be relevant
– You were (probably) going to review them anyway
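The two selection strategies differ only in how the next docs are picked from the current relevance scores. A minimal sketch, assuming the scores are estimated probabilities of relevance; the function names and the toy scores are hypothetical:

```python
def pick_uncertain(scores, k):
    """Uncertainty-style selection: docs whose estimated probability is closest to 50%."""
    return sorted(scores, key=lambda doc: abs(scores[doc] - 0.5))[:k]


def pick_top(scores, k):
    """CAL-style selection: docs most likely to be relevant."""
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)[:k]


# Hypothetical relevance scores for five unreviewed docs.
scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10, "doc_d": 0.45, "doc_e": 0.80}

print("uncertainty sampling:", pick_uncertain(scores, 2))
print("CAL:", pick_top(scores, 2))
```

On this toy data the first strategy picks `doc_b` and `doc_d` (near the boundary), while the CAL-style strategy picks `doc_a` and `doc_e`, which would likely be reviewed before production anyway.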
![Page 22: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/22.jpg)
22
Random vs Non-Random Training
● Random
– Theoretically sound (sort of) – training docs look like the population
● Judgmental / Active
– System sees larger number of relevant docs, which is good
– Bias
– Probabilities are distorted – relevance scores hard to interpret
● Thought experiment - teaching a child to recognize dogs
– Do they need to see 9 birds for every dog?
– How about 99 birds for every dog?
– Plausibly more efficient: Lots of dogs, a few birds, and some wolves and foxes
![Page 23: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/23.jpg)
23
Catch & Release Control Set
● Static control set is flawed for non-random training
– Docs in control set aren't similar to the unreviewed docs
● Identify random control set docs (from full population), then put them back into the review
![Page 24: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/24.jpg)
24
Higher Dimensions
![Page 25: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/25.jpg)
25
TAR 1.0 with Random Training
![Page 26: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/26.jpg)
26
TAR 2.0 (CAL)
1) Review very small set of training docs (single relevant doc is enough)
2) Update predictions and sort remaining docs by relevance score
3) Review small number of docs with highest relevance scores
Back to (2) unless not many relevant docs left
4) Sample/test to ensure sufficient recall
● System continues to learn throughout.
● No separation between training and review.
● Huge number of relevant training docs.
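Steps 1 to 3 of the loop above can be sketched in a few lines. This is a toy illustration, not the system described in the presentation: the word-count "model", the stopping rule (`min_hits`), the `oracle` callback standing in for the human reviewer, and the eight sample docs are all assumptions. Step 4 (sampling to confirm recall) is not shown.

```python
from collections import Counter


def train(reviewed):
    """Toy model: each word's weight is (# relevant docs with it) - (# non-relevant)."""
    weights = Counter()
    for words, is_relevant in reviewed:
        for word in words:
            weights[word] += 1 if is_relevant else -1
    return weights


def score(weights, words):
    """Relevance score of a doc: sum of its word weights (unseen words count 0)."""
    return sum(weights[word] for word in words)


def cal_review(docs, oracle, seed_id, batch_size=2, min_hits=1):
    """Review top-scored docs in batches until a batch yields too few relevant docs."""
    reviewed = {seed_id: oracle(seed_id)}            # step 1: tiny training set
    while True:
        weights = train([(docs[d], rel) for d, rel in reviewed.items()])
        remaining = [d for d in docs if d not in reviewed]
        if not remaining:
            break
        # Step 2: update predictions and sort remaining docs by score.
        remaining.sort(key=lambda d: score(weights, docs[d]), reverse=True)
        # Step 3: review the top of the list; every reviewed doc becomes training data.
        hits = 0
        for d in remaining[:batch_size]:
            reviewed[d] = oracle(d)
            hits += reviewed[d]
        if hits < min_hits:                          # not many relevant docs left
            break
    return reviewed


# Hypothetical docs (as word sets) and hidden truth used by the simulated reviewer.
docs = {
    "d1": {"merger", "price", "board"}, "d2": {"merger", "board", "vote"},
    "d3": {"price", "merger", "offer"}, "d4": {"lunch", "friday"},
    "d5": {"golf", "friday", "lunch"},  "d6": {"offer", "vote", "deal"},
    "d7": {"golf", "weather"},          "d8": {"lunch", "board"},
}
truth = {"d1": True, "d2": True, "d3": True, "d4": False,
         "d5": False, "d6": True, "d7": False, "d8": False}

reviewed = cal_review(docs, oracle=truth.get, seed_id="d1")
print("reviewed:", list(reviewed), "relevant found:", sum(reviewed.values()))
```

Note that `d6` shares no words with the seed doc; it only rises to the top after `d2` and `d3` have been reviewed and added to the training data, which is the "system continues to learn throughout" point.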
![Page 27: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/27.jpg)
27
TAR 2.0
![Page 28: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/28.jpg)
28
TAR 1.0 v 2.0 Review to Get R=75%
![Page 29: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/29.jpg)
29
TAR 3.0 (CAL on Cluster Centers)
1) Form conceptual clusters (narrow focus, fixed radius, agglomerative)
2) Review very small set of training docs (single relevant doc is enough)
3) Update predictions for cluster centers and sort by relevance score
4) Review small number of cluster centers with highest relevance scores
Back to (3) unless not many relevant cluster centers left
5) Generate predictions for full population, then you have a choice:
A) Produce docs without review (unless potentially privileged)
B) Produce docs with high relevance scores without review, and perform standard CAL on remainder (review top docs, update predictions, and iterate)
C) Review all docs that are candidates for production using standard CAL
6) Sample/test to ensure sufficient recall
● Training and review are separate, like TAR 1.0
● No control set needed
● Option to produce docs without review if desired
● Free prevalence estimate
![Page 30: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/30.jpg)
30
TAR 3.0
![Page 31: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/31.jpg)
31
Prevalence Estimation with TAR 3.0
● Count documents in clusters where the center is relevant
● This is stratified sampling with two wild assumptions
– Center of cluster has same probability of being relevant as other docs in cluster
– Clusters that aren't hit have negligible number of relevant docs
● Not statistically justifiable, but works well if clustering is good
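The counting rule above amounts to a one-line estimate. A minimal sketch, assuming cluster sizes and the set of clusters whose reviewed center was tagged relevant are already known; the function name, cluster ids, and numbers are hypothetical:

```python
def estimate_prevalence(cluster_sizes, relevant_centers, total_docs):
    """Estimate prevalence as (docs in clusters with a relevant center) / (all docs).

    Relies on the two assumptions listed above: a center is as likely to be
    relevant as the other docs in its cluster, and clusters that were never
    hit contain a negligible number of relevant docs.
    """
    docs_in_hit_clusters = sum(cluster_sizes[c] for c in relevant_centers)
    return docs_in_hit_clusters / total_docs


# Hypothetical reviewed clusters (id -> number of docs) in a 2,000-doc population.
cluster_sizes = {"c1": 120, "c2": 40, "c3": 300, "c4": 90}
relevant_centers = {"c1", "c4"}   # centers the reviewer tagged as relevant

estimate = estimate_prevalence(cluster_sizes, relevant_centers, total_docs=2000)
print(f"estimated prevalence: {estimate:.1%}")
```

Here 210 of 2,000 docs sit in clusters with a relevant center, giving an estimate of 10.5%.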
![Page 32: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/32.jpg)
32
TAR 3.0 Prevalence vs. Random
![Page 33: TAR 3.0 and Training of Predictive Coding Systems · TAR 2.0 (CAL) 1) Review very small set of training docs (single relevant doc is enough) 2) Update predictions and sort remaining](https://reader035.vdocuments.mx/reader035/viewer/2022063000/5f0ef0007e708231d441acf7/html5/thumbnails/33.jpg)
33
Workflow Comparison

| | TAR 1.0 | TAR 2.0 | TAR 3.0 |
|---|---|---|---|
| Works for Low Prevalence | No | Yes | Yes |
| Avoids Control Set | No | Yes | Yes |
| Early Prevalence Estimate | Yes? | No | Yes |
| Produce Without Review | Yes | No? | Yes |
| See Relevant Docs Early | No? | Yes | Yes |
| Diverse Early View of Relevance | Yes | No? | Yes |
| Add Docs to Population Later | Hard | Easy | Easy |
| Efficiency - Review Pos. Pred. | Low | High | Medium |
| Efficiency - Don't Rev. Pos. Pred. | Medium | ? | High |