catalyst webinar 6.25.2015 best practices_multi-language review
TRANSCRIPT
877.557.4273
catalystsecure.com
WEBINAR
Best Practices in Planning and Conducting a Multi-Language Review
Chihiro Suzuki Jonathan Hiroshi Rossi Mark Noel
Presenters
Jonathan founded The CJK Group as an Asian language resource for law firms and corporate legal departments handling Asian-
language discovery. He’s experienced in a variety of large-scale matters and managing multi-language review teams.
Jonathan Hiroshi Rossi
Mark is a former IP litigator, co-founder of an e-discovery software startup, and research scientist who specializes in helping
clients use technology-assisted review, advanced analytics, and custom workflows to deal with large-scale matters.
Mark Noel
Chihiro is an invaluable bridge between Catalyst’s clients in Asia, their U.S. counsel, and internal teams in Japan and the United
States. Chihiro helps optimize workflows for clients in Asia who are dealing with multiple languages in their legal matters.
Chihiro Suzuki
Speakers
Managing Director, Professional Services, Catalyst
Founder and CEO, The CJK Group
International Operations Liaison, Catalyst
Preparation
Collections
Processing
Review and Review Workflows
Technology-Assisted Review
Discussion Topics
Data Collection - Sources▪ Whose Data to Collect: Custodian Interview.
Priority Custodians?
▪ What to Collect: PCs, smartphones, tables, external drives, backups, servers, etc.?
▪ Check Encryption: Many Asian companies use encryption. Check in advance and plan decryption prior to Data Collection.
▪ Check Primary Language: Learn the primary language used by Custodians. Be aware of different languages for the data sources.
▪ How to Collect: Forensic collection – Using a forensic tool is best to avoid potential problems such as garbled characters. It is also a defensible measure.
▪ Legal Issues: Be aware of any local data privacy laws. Any data should be kept out of the U.S. jurisdiction?
TAR systems don’t actually understand the words they index and analyze. Rather, they employ mathematical algorithms to analyze their patterns.
Handling Multi-Language Data Properly is Key to Multi-Language ReviewProcessing & Language Identification
1. Language Identification
▪ Not perfect! But helpful to plan multi-language review. ▪ May need to set default for languages that share characters
2. Tokenization
▪ Chinese, Japanese, and other languages that do not use spaces between words need to be “tokenized”
▪ Good tokenization is important – some tools systematically take two or three characters at a time without taking the meaning of words into consideration.
▪ Tokenized words are used for searches and ranking for TAR
3. Machine Translation
▪ Human translation of large document collections is expensive. ▪ Simple machine translation can be inaccurate though it can be used to get a
sense of what the documents are about. ▪ Enhanced Machine Translation (EMT) – build a glossary to improve MT
quality ▪ How can MT be incorporated into TAR workflow?
What Makes CJK So Problematic?
My hovercraft is full of eels – English
私のホバークラフトは鰻でいっぱいで – Japanese
我隻氣墊船裝滿晒鱔 – Chinese
Tokenization / Word Segmentation
1) Pattern Based > Overlapping Bigram
私の のホ ホバ バー
ーク クラ ラフ フト
トは は鰻 鰻で でい
いっ っぱ ぱい いで
my yh ho ov ve
er rc cr ra af
ft ti is sf fu
ul ll lo of fe
ee el ls
私のホバークラフトは鰻でいっぱいで
Tokenization / Word Segmentation
私のホバークラフトは鰻でいっぱいで
“私” “の” “ホバークラフト” “は” “鰻” “で” “いっぱい”
“My” “hovercraft” “is” “full” “of” “eels”
2) Semantic (e.g., Dictionary Driven)
TAR systems don’t actually understand the words they index and analyze. Rather, they employ mathematical algorithms to analyze their patterns.
Handling Multi-Language Data Properly is Key to Multi-Language ReviewProcessing & Language Identification
1. Language Identification
▪ Not perfect! But helpful to plan multi-language review. ▪ May need to set default for languages that share characters
2. Tokenization
▪ Chinese, Japanese, and other languages that do not use spaces between words need to be “tokenized”
▪ Good tokenization is important – some tools systematically take two or three characters at a time without taking the meaning of words into consideration.
▪ Tokenized words are used for searches and ranking for TAR
3. Machine Translation
▪ Human translation of large document collections is expensive. ▪ Simple machine translation can be inaccurate though it can be used to get a
sense of what the documents are about. ▪ Enhanced Machine Translation (EMT) – build a glossary to improve MT
quality ▪ How can MT be incorporated into TAR workflow?
Japanese language documents, mostly email
Short review timeline (about four weeks)
Rolling collections
Very low richness
“Fossa”Case Study:
International Patent Litigation: 15.6 Million Japanese Docs
Deployed Predict after counsel had already reviewed about 10,000 documents found via search terms
Used de-duplication and junk removal to reduce population moved into Predict
Rankable population started at 2.1 million with a richness of about 0.6%.
“Fossa”Case Study:
International Patent Litigation: 15.6 Million Japanese Docs
Predict richness after initial 10,000 judgmental seeds was 37% — an increase of more than a factor of 60.
Within measurement error (97% confidence, 3.5% margin of error), all responsive documents had been sorted into the top 17% of the ranked population
“Fossa”Case Study:
International Patent Litigation: 15.6 Million Japanese Docs
Workflow combined Predict review and targeted search, which was then fed back into Predict.
Through rolling collections over the course of several weeks, the Predict population grew to 3.6 million unique, rankable documents out of a master population of over 15 million.
Final Predict performance alone reached defensible recall in the top 20% of the population
“Fossa”Case Study:
International Patent Litigation: 15.6 Million Japanese Docs
Defensible recall reached with a total review effort fewer than half a million documents — about 1/30 of the reviewable population.
Predict is now being used for review of opposing party productions and depo prep in this same case.
International Patent Litigation: 15.6 Million Japanese Docs“Fossa”Case Study:
1. What are the anticipated languages in the data set?
2. What is the anticipated size of the corpus per language?
3. What is the availability of licensed, available, capable, and bilingual attorneys in each of the major cities for each of the required languages?
4. Will a bilingual review manager be necessary?
5. Regarding cost control & scarcity of resources, can we leverage non-attorney translator reviewers managed by an attorney project manager?
6. Is a secure, remote attorney review a viable option?
7. Will this review take place on-site at the client’s location or will it be managed by a service provider in their facility?
8. Will there be a need for in-country review? (for example, in Japan)
9. As a law firm or corporation, are we (or the service provider) sufficiently prepared to evaluate the resource talent pool to plan accordingly?
10. Do our vendors meet the criteria to adequately handle certain languages effectively?
11. Do our vendors speak the languages they staff?
12. To what extent can our vendors handle foreign language review?
Checklist to Prepare for Multi-Language Review
Multi-Language Workflows: Parallel Projects
Japanese
English
Ranking the same population
Separate ranking projects for each language
Multi-language training documents are shared
Prioritized review started off with judgmental seeds
Collection was 10% rich. Both English and Spanish ranking instances fed review team 40% richness
Final production: 91% recall with 33% of population reviewed
Shareholder Litigation: 66,000 Bilingual Documents“Chinchilla”Case Study:
Prioritized review started off with judgmental seeds
Collection was 10% rich. Both English and Spanish ranking instances fed review team 40% richness
Final production: 91% recall with 33% of population reviewed
Stopping Point
Order of review
Shareholder Litigation: 66,000 Bilingual Documents“Chinchilla”Case Study:
Prioritized review started off with judgmental seeds
Collection was 10% rich. Both English and Spanish ranking instances fed review team 40% richness
Final production: 91% recall with 33% of population reviewed
Stopping Point
Order of review
Shareholder Litigation: 66,000 Bilingual Documents“Chinchilla”Case Study:
English
Japanese
Ranking the same population
One project for both languages
Primary language filters route top-ranked docs to appropriate team
Multi-Language Workflows: Shared Project
Jonathan founded The CJK Group as an Asian language resource for law firms and corporate legal departments handling Asian-
language discovery. He’s experienced in a variety of large-scale matters and managing multi-language review teams.
Jonathan Hiroshi Rossi [email protected]
Mark Noel is a former IP litigator, co-founder of an e-discovery software startup, and research scientist who specializes in
helping clients use technology-assisted review, advanced analytics, and custom workflows to deal with large-scale matters.
Mark Noel [email protected]
Chihiro is an invaluable bridge between Catalyst’s clients in Asia, their U.S. counsel, and internal teams in Japan and the United
States. Chihiro helps optimize workflows for clients in Asia who are dealing with multiple languages in their legal matters.
Chihiro Suzuki [email protected]
Thank You!We Hope You Enjoyed Our Presentation