catalyst webinar 6.25.2015 best practices_multi-language review

877.557.4273

catalystsecure.com

WEBINAR

Best Practices in Planning and Conducting a Multi-Language Review

Chihiro Suzuki Jonathan Hiroshi Rossi Mark Noel

Presenters

Jonathan founded The CJK Group as an Asian language resource for law firms and corporate legal departments handling Asian-

language discovery. He’s experienced in a variety of large-scale matters and managing multi-language review teams.

Jonathan Hiroshi Rossi

Mark is a former IP litigator, co-founder of an e-discovery software startup, and research scientist who specializes in helping

clients use technology-assisted review, advanced analytics, and custom workflows to deal with large-scale matters.

Mark Noel

Chihiro is an invaluable bridge between Catalyst’s clients in Asia, their U.S. counsel, and internal teams in Japan and the United

States. Chihiro helps optimize workflows for clients in Asia who are dealing with multiple languages in their legal matters.

Chihiro Suzuki

Speakers

Managing Director, Professional Services, Catalyst

Founder and CEO, The CJK Group

International Operations Liaison, Catalyst

Preparation

Collections

Processing

Review and Review Workflows

Technology-Assisted Review

Discussion Topics

Data Collection - Sources▪ Whose Data to Collect: Custodian Interview.

Priority Custodians?

▪ What to Collect: PCs, smartphones, tables, external drives, backups, servers, etc.?

▪ Check Encryption: Many Asian companies use encryption. Check in advance and plan decryption prior to Data Collection.

▪ Check Primary Language: Learn the primary language used by Custodians. Be aware of different languages for the data sources.

▪ How to Collect: Forensic collection – Using a forensic tool is best to avoid potential problems such as garbled characters. It is also a defensible measure.

▪ Legal Issues: Be aware of any local data privacy laws. Any data should be kept out of the U.S. jurisdiction?

TAR systems don’t actually understand the words they index and analyze. Rather, they employ mathematical algorithms to analyze their patterns.

Handling Multi-Language Data Properly is Key to Multi-Language ReviewProcessing & Language Identification

1. Language Identification

▪ Not perfect! But helpful to plan multi-language review. ▪ May need to set default for languages that share characters

2. Tokenization

▪ Chinese, Japanese, and other languages that do not use spaces between words need to be “tokenized”

▪ Good tokenization is important – some tools systematically take two or three characters at a time without taking the meaning of words into consideration.

▪ Tokenized words are used for searches and ranking for TAR

3. Machine Translation

▪ Human translation of large document collections is expensive. ▪ Simple machine translation can be inaccurate though it can be used to get a

sense of what the documents are about. ▪ Enhanced Machine Translation (EMT) – build a glossary to improve MT

quality ▪ How can MT be incorporated into TAR workflow?

What Makes CJK So Problematic?

My hovercraft is full of eels – English

私のホバークラフトは鰻でいっぱいで – Japanese

我隻氣墊船裝滿晒鱔 – Chinese

Tokenization / Word Segmentation

1) Pattern Based > Overlapping Bigram

私ののホホババー

ーククララフフト

トはは鰻鰻ででい

いっっぱぱいいで

my yh ho ov ve

er rc cr ra af

ft ti is sf fu

ul ll lo of fe

ee el ls

私のホバークラフトは鰻でいっぱいで

Tokenization / Word Segmentation

私のホバークラフトは鰻でいっぱいで

“私” “の” “ホバークラフト” “は” “鰻” “で” “いっぱい”

“My” “hovercraft” “is” “full” “of” “eels”

2) Semantic (e.g., Dictionary Driven)

Start immediately with whatever and whomever; review = trainingTAR 2.0

TAR systems don’t actually understand the words they index and analyze. Rather, they employ mathematical algorithms to analyze their patterns.

Handling Multi-Language Data Properly is Key to Multi-Language ReviewProcessing & Language Identification

1. Language Identification

▪ Not perfect! But helpful to plan multi-language review. ▪ May need to set default for languages that share characters

2. Tokenization

▪ Chinese, Japanese, and other languages that do not use spaces between words need to be “tokenized”

▪ Good tokenization is important – some tools systematically take two or three characters at a time without taking the meaning of words into consideration.

▪ Tokenized words are used for searches and ranking for TAR

3. Machine Translation

▪ Human translation of large document collections is expensive. ▪ Simple machine translation can be inaccurate though it can be used to get a

sense of what the documents are about. ▪ Enhanced Machine Translation (EMT) – build a glossary to improve MT

quality ▪ How can MT be incorporated into TAR workflow?

Japanese language documents, mostly email

Short review timeline (about four weeks)

Rolling collections

Very low richness

“Fossa”Case Study:

International Patent Litigation: 15.6 Million Japanese Docs

Deployed Predict after counsel had already reviewed about 10,000 documents found via search terms

Used de-duplication and junk removal to reduce population moved into Predict

Rankable population started at 2.1 million with a richness of about 0.6%.



Predict richness after initial 10,000 judgmental seeds was 37% — an increase of more than a factor of 60.

Within measurement error (97% confidence, 3.5% margin of error), all responsive documents had been sorted into the top 17% of the ranked population



Workflow combined Predict review and targeted search, which was then fed back into Predict.

Through rolling collections over the course of several weeks, the Predict population grew to 3.6 million unique, rankable documents out of a master population of over 15 million.

Final Predict performance alone reached defensible recall in the top 20% of the population



Defensible recall reached with a total review effort fewer than half a million documents — about 1/30 of the reviewable population.

Predict is now being used for review of opposing party productions and depo prep in this same case.

International Patent Litigation: 15.6 Million Japanese Docs“Fossa”Case Study:

1. What are the anticipated languages in the data set?

2. What is the anticipated size of the corpus per language?

3. What is the availability of licensed, available, capable, and bilingual attorneys in each of the major cities for each of the required languages?

4. Will a bilingual review manager be necessary?

5. Regarding cost control & scarcity of resources, can we leverage non-attorney translator reviewers managed by an attorney project manager?

6. Is a secure, remote attorney review a viable option?

7. Will this review take place on-site at the client’s location or will it be managed by a service provider in their facility?

8. Will there be a need for in-country review? (for example, in Japan)

9. As a law firm or corporation, are we (or the service provider) sufficiently prepared to evaluate the resource talent pool to plan accordingly?

10. Do our vendors meet the criteria to adequately handle certain languages effectively?

11. Do our vendors speak the languages they staff?

12. To what extent can our vendors handle foreign language review?

Checklist to Prepare for Multi-Language Review

Multi-Language Workflows: Parallel Projects

Japanese

English

Ranking the same population

Separate ranking projects for each language

Multi-language training documents are shared

Prioritized review started off with judgmental seeds

Collection was 10% rich. Both English and Spanish ranking instances fed review team 40% richness

Final production: 91% recall with 33% of population reviewed

Shareholder Litigation: 66,000 Bilingual Documents“Chinchilla”Case Study:

Prioritized review started off with judgmental seeds

Collection was 10% rich. Both English and Spanish ranking instances fed review team 40% richness

Final production: 91% recall with 33% of population reviewed

Stopping Point

Order of review

Shareholder Litigation: 66,000 Bilingual Documents“Chinchilla”Case Study:

English

Japanese

Ranking the same population

One project for both languages

Primary language filters route top-ranked docs to appropriate team

Multi-Language Workflows: Shared Project

Jonathan founded The CJK Group as an Asian language resource for law firms and corporate legal departments handling Asian-

language discovery. He’s experienced in a variety of large-scale matters and managing multi-language review teams.

Jonathan Hiroshi Rossi [email protected]

Mark Noel is a former IP litigator, co-founder of an e-discovery software startup, and research scientist who specializes in

helping clients use technology-assisted review, advanced analytics, and custom workflows to deal with large-scale matters.

Mark Noel [email protected]

Chihiro is an invaluable bridge between Catalyst’s clients in Asia, their U.S. counsel, and internal teams in Japan and the United

States. Chihiro helps optimize workflows for clients in Asia who are dealing with multiple languages in their legal matters.

Chihiro Suzuki [email protected]

Thank You!We Hope You Enjoyed Our Presentation

mailto:[email protected]



catalyst webinar 6.25.2015 best practices_multi-language review

Law

multilanguage data

multilanguage review

primary language

asian language discovery

asian language resource

data sources

review workflows technology

data collection sources