

Making the Old New Again – Modern Technology Provides Access to Historical Chemical Information 253rd ACS National Meeting & Exposition April 4, 2017

CAS is a division of the American Chemical Society. Copyright 2017 American Chemical Society. All rights reserved.

Sabine Kuhn, ChemZent Product Manager, CAS
John Tinsley, CEO, Iconic Translation Machines Ltd.

Unique collaboration between CAS and Iconic leads to a new solution and opportunities

•  CAS is the leading global source of chemical information for scientific and patent research

–  We organize, analyze and share information that drives scientific discoveries
–  We facilitate your research to fuel tomorrow's innovation

•  Iconic Translation Machines Ltd. is a leading provider of state-of-the-art translation and language technology

–  We help the world's largest translation companies, information providers and government organizations to break down the language barrier by adopting specialist natural language processing solutions of superior quality, tailored with expertise


Together, we will do great things.

Background – ChemZent™ is the CAS implementation of Chemisches Zentralblatt

Chemisches Zentralblatt is
•  A set of books that covers over three million abstracts from patents and journals
•  The first and oldest abstracts journal in chemistry
•  The only comprehensive abstract journal prior to 1907
•  An extensive source of detailed abstracts covering the chemical literature from 1830 to 1969
•  Until now, only available in German

Why search through old musty books when you can just use SciFinder®…


…to access ChemZent for a window into the published literature of early influencers of chemistry


ChemZent: A historical, archival chemistry collection now searchable in SciFinder

ChemZent
•  Allows users to access the entire Chemisches Zentralblatt collection in one place, searchable in English with indexing for substances and concepts
•  Takes advantage of the inherent, easy-to-use functionality of SciFinder
•  Provides searchability across 140 years of Chemisches Zentralblatt abstracts without the need to know the year of interest
•  Covers many authors who formulated the theories and hypotheses that shaped the foundation of today’s chemistry

ChemZent provides immeasurable value to researchers and institutions around the world

•  Libraries pursuing completeness for historical record

•  Intellectual property professionals interested in early patentability information

•  History of chemistry researchers interested in the birth of chemistry as a science

•  Professors looking to expand their curriculum

•  Process chemists and scientists exploring simpler chemistry methodologies


Finding ChemZent records in SciFinder is easy


Reviewing the answers, we can see that Rosalind Franklin did a lot more than help with the discovery of the structure of DNA

Eugène-Melchior Péligot – isolation of uranium, 1841

Henri Becquerel – radioactivity of uranium

In another leap forward, Enrico Fermi conducted significant research on radiation and radioactivity


Iconic Translation Machines made it possible to bring ChemZent to market within a year of idea inception

•  Image extraction
•  Optical character recognition
•  Abstract identification
•  Machine translation
•  Positioning, indexing and packaging

[Figure labels: existing technology, innovation, expertise, collaboration]

Project overview

Large-scale digitization and translation of 140 years' worth of German chemical information (journals and patents) for indexing and search. Image-based PDF documents were digitized (via OCR), extracted into individual records, separated into fields, translated and packaged according to XML specifications.

Project statistics
•  803,734 PDFs processed
•  3,000,000+ records extracted and translated

Ambitious timeline

[Figure: project timeline. Milestones shown: lead-in time (development + analysis); biweekly deliveries (small batches); 1st major delivery (1 million records); 2nd major delivery (1 million records); 3rd major delivery (all remaining records); product launch. Dates marked on the timeline: Aug 2015, Oct 2015, Dec 2015, Mar 2016, May 2016, June 2016.]

[Figure: workflow diagram. Steps shown: scan books to PDF; filter relevant pages; OCR to machine-readable format; analysis and consistency check; OCR post-processing and cleaning; record extraction; machine translation; XML packaging to SciFinder standards; final output combining digital, searchable PDFs with location information. Three QA checkpoints (QA 1, QA 2, QA 3) are marked along the workflow.]

Constant communication
•  Weekly reporting
•  Biweekly calls
•  Bimonthly deliveries

Analysis and consistency check

What is a record?

Sample pages from 1834, 1907 and 1968 show that the order of the fields within a record was not fixed. Field orders shown:

•  Title, Author(s), Abstract, Bibliographic data
•  Author(s), Title, Bibliographic data, Abstract
•  Title, Author(s), Bibliographic data, Abstract
•  Author(s), Title, Abstract, Bibliographic data

Analysis and consistency check

[Figure: record patterns marked within a record]

Analysis and consistency check (cont.)

[Figure: bibliographic patterns marked within the bibliographic data of a record]

Record extraction
•  Machine learning-based extraction algorithms work on cleaned OCR output, using patterns, keywords and formatting information
•  This separation of records was the most challenging aspect of the project

[Figure: scanned page with location information, X: 23.11, Y: 498.67]

Raw source text

<source-text>2832 Mitsubishi Reyon K.K., Tokyo (Erfinder: Kenichi Murotani und Hiroshi Sugimoto, Nagoya), Japan, Mit basischen und kationischen Farbstoffen gut färbbare Polypropylenfasern. Man verspinnt Polyproylen, das eine wasserunlösl. Verb, enthält, die durch Rk. von Aminen oder Metallsalzen mit komplexen Heteropolysäuren aus Phosphomolybdänsäure, Phosphowolfram-molybdänsäure u. Phosphowolframsäure erhalten wird.—Zu 100 (g) H3P04 u. 660 Na2W04•HaO in 71 W. gibt man 1430 C18H37N(CH3)3•CI in 31 W.; der abzentrifugierte Nd. wird durch Dialyse gereinigt u. mit 100 nichtion. Aktivierungsmittel versetzt. Zu Polypropylen gibt man 2% des Prod. sowie Füllstoffe u. Stabilisierungs-mittel u. verspinnt. (Jap. A. S. 12986/1965 vom 15/2. 1963, ausg. 24/6. 1965.)—3 S.M. Pankow 7122</source-text>

Extracted title and abstract

<title><language><language-code>de</language-code><language-name>German</language-name></language><machine-translation-flag>false</machine-translation-flag><title-text>Mit basischen und kationischen Farbstoffen gut färbbare Polypropylenfasern</title-text></title>

<abstract><language><language-code>de</language-code><language-name>German</language-name></language><machine-translation-flag>false</machine-translation-flag><abstract-text><p>Mit basischen und kationischen Farbstoffen gut färbbare Polypropylenfasern. Man verspinnt Polyproylen, das eine wasserunlösl. Verb, enthält, die durch Rk. von Aminen oder Metallsalzen mit komplexen Heteropolysäuren aus Phosphomolybdänsäure, Phosphowolfram-molybdänsäure u. Phosphowolframsäure erhalten wird.—Zu 100 (g) H3P04 u. 660 Na2W04•HaO in 71 W. gibt man 1430 C18H37N(CH3)3•CI in 31 W.; der abzentrifugierte Nd. wird durch Dialyse gereinigt u. mit 100 nichtion. Aktivierungsmittel versetzt. Zu Polypropylen gibt man 2% des Prod. sowie Füllstoffe u. Stabilisierungs-mittel u. verspinnt.</p></abstract-text></abstract>
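The packaging of an extracted title into XML could look roughly like the following. This is a minimal sketch, not the project's actual code: the element names are copied from the sample record, while the function name `build_title` and the use of Python's standard `xml.etree.ElementTree` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_title(text: str, lang_code: str, lang_name: str,
                machine_translated: bool) -> ET.Element:
    """Build a <title> element shaped like the one in the sample record."""
    title = ET.Element("title")
    language = ET.SubElement(title, "language")
    ET.SubElement(language, "language-code").text = lang_code
    ET.SubElement(language, "language-name").text = lang_name
    # The sample only shows the value "false"; writing "true" for
    # translated text is an assumption.
    flag = ET.SubElement(title, "machine-translation-flag")
    flag.text = str(machine_translated).lower()
    ET.SubElement(title, "title-text").text = text
    return title


title = build_title(
    "Mit basischen und kationischen Farbstoffen gut färbbare Polypropylenfasern",
    "de", "German", False,
)
# encoding="unicode" returns a str and keeps characters such as "ä" as-is.
xml_string = ET.tostring(title, encoding="unicode")
print(xml_string)
```

Building the element tree rather than concatenating strings means special characters in OCR text (for example "&" or "<") are escaped automatically.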

Fielding

•  Identification of the reference data and fielding were treated separately

•  Once identified, references were fielded according to patterns and key terms

•  Challenges were presented by relative inconsistency in layout
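Pattern-based fielding of an already identified reference might be sketched as follows. The input is the patent reference from the sample record, "(Jap.A.S.12986/1965vom15/2.1963,ausg.24/6.1965.)". The regular expression, the `OFFICE_CODES` lookup and the function name are hypothetical; reading "vom" as the application date and "ausg." as the grant date follows the fielded XML output for this record.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical mapping from the abbreviation in the abstract to an office code.
OFFICE_CODES = {"Jap": "JP"}

# Pattern for references shaped like the sample; whitespace is stripped first,
# so spaced and unspaced OCR output are handled alike.
PATENT_REF = re.compile(
    r"\((?P<office>[A-Za-z]+)\.A\.S\.(?P<number>\d+)/\d{4}"
    r"vom(?P<ad>\d{1,2})/(?P<am>\d{1,2})\.(?P<ay>\d{4}),"
    r"ausg\.(?P<gd>\d{1,2})/(?P<gm>\d{1,2})\.(?P<gy>\d{4})\.\)"
)


def field_patent_reference(text: str) -> Optional[dict]:
    """Return the fielded reference, or None when the pattern does not match."""
    compact = re.sub(r"\s+", "", text)
    m = PATENT_REF.search(compact)
    if m is None:
        return None
    return {
        "patent_office": OFFICE_CODES.get(m["office"], m["office"]),
        "patent_number": m["number"],
        "application_date": date(int(m["ay"]), int(m["am"]), int(m["ad"])).isoformat(),
        "grant_date": date(int(m["gy"]), int(m["gm"]), int(m["gd"])).isoformat(),
    }


fields = field_patent_reference("(Jap.A.S.12986/1965vom15/2.1963,ausg.24/6.1965.)")
print(fields)
```

A single pattern like this covers only one layout. Given the inconsistency noted above, a real system would need many such patterns plus a fallback for references that match none of them.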

<patent-reference>
  <patent-number>12986</patent-number>
  <inventor-list>
    <person-group><first-name EmptyYN="N">Kenichi</first-name><last-name>Murotani</last-name><affiliation-components><address-line>Nagoya), Japan</address-line></affiliation-components></person-group>
    <person-group><first-name EmptyYN="N">Hiroshi</first-name><last-name>Sugimoto</last-name><affiliation-components><address-line>Nagoya), Japan</address-line></affiliation-components></person-group>
  </inventor-list>
  <assignee-list>
    <person-group><collective-name>Mitsubishi Reyon K.K.</collective-name><affiliation-components><address-line>Tokyo</address-line></affiliation-components></person-group>
  </assignee-list>
  <patent-application>
    <patent-office>JP</patent-office><publication-date><date>1963-02-15</date></publication-date>
  </patent-application>
  <patent-grant>
    <patent-office>JP</patent-office><publication-date><date>1965-06-24</date></publication-date>
  </patent-grant>
</patent-reference>

State-of-the-art proprietary technology – Iconic “Ensemble Architecture”

Translation Separation
•  Separate translation processes for processing titles and abstracts
•  Less “reordering” required given telegraphic style in titles

Dictionaries & Feedback
•  Incorporation of fixes and technical glossaries provided during pilot evaluations and ongoing quality audits by CAS chemists

Entity Identification
•  Identification of chemical names, e.g., 1,2-DICHLOROCYCLOHEXANE, to ensure processing as a single entity
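One common way to make a translation system treat a chemical name as a single entity is to swap it for a placeholder before translation and restore it afterwards. The sketch below illustrates that idea only; it is not a description of Iconic's proprietary implementation. The regular expression is a deliberately narrow assumption (a locant prefix such as "1,2-" followed by letters), and the German example sentence is made up.

```python
import re

# Illustrative pattern only: digits with optional commas, a hyphen, then letters.
CHEMICAL_NAME = re.compile(r"\b\d+(?:,\d+)*-[A-Za-z]+(?:-[A-Za-z]+)*")


def protect_entities(text):
    """Replace each matched chemical name with a numbered placeholder."""
    entities = []

    def stash(match):
        entities.append(match.group(0))
        return f"__ENT{len(entities) - 1}__"

    return CHEMICAL_NAME.sub(stash, text), entities


def restore_entities(text, entities):
    """Put the original names back in place of their placeholders."""
    for index, entity in enumerate(entities):
        text = text.replace(f"__ENT{index}__", entity)
    return text


source = "Darstellung von 1,2-DICHLOROCYCLOHEXANE aus Cyclohexen"
protected, names = protect_entities(source)
print(protected)   # the name is now one opaque token
print(restore_entities(protected, names))
```

The translation step would run on `protected`, so the name cannot be split or reordered internally. Whether the name itself also needs translating is a separate decision.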

German text prior to machine translation

<title><language><language-code>de</language-code><language-name>German</language-name></language><machine-translation-flag>false</machine-translation-flag><title-text>Mit basischen und kationischen Farbstoffen gut färbbare Polypropylenfasern</title-text></title>

<abstract><language><language-code>de</language-code><language-name>German</language-name></language><machine-translation-flag>false</machine-translation-flag><abstract-text><p>Mit basischen und kationischen Farbstoffen gut färbbare Polypropylenfasern. Man verspinnt Polyproylen, das eine wasserunlösl. Verb, enthält, die durch Rk. von Aminen oder Metallsalzen mit komplexen Heteropolysäuren aus Phosphomolybdänsäure, Phosphowolfram-molybdänsäure u. Phosphowolframsäure erhalten wird.—Zu 100 (g) H3P04 u. 660 Na2W04•HaO in 71 W. gibt man 1430 C18H37N(CH3)3•CI in 31 W.; der abzentrifugierte Nd. wird durch Dialyse gereinigt u. mit 100 nichtion. Aktivierungsmittel versetzt. Zu Polypropylen gibt man 2% des Prod. sowie Füllstoffe u. Stabilisierungs-mittel u. verspinnt.</p></abstract-text></abstract>

German to English MT for TITLES; German to English MT for ABSTRACTS

<title><language><language-code>de</language-code><language-name>German</language-name></language><machine-translation-flag>false</machine-translation-flag><title-text>With basic and cationic colouring substances easily dyeable polypropylene fibres</title-text></title>

<abstract><language><language-code>de</language-code><language-name>German</language-name></language><machine-translation-flag>false</machine-translation-flag><abstract-text><p>With basic and cationic colouring substances easily dyeable polypropylene fibres. Is spun polypropylene having a wasserunlösl. Compound by reaction of amine or metal salts with complex hetero poly acids from phosphomolybdenic acid, phosphotungstate-molybdic acid and phospho tungsten acid.-To 100 (g) and 660 Na2W04H3P04•H2O in 71 Waters is 1430 C18H37N(CH3)3•CI in 31 W.; The precipitation is centrifuged by dialysis and cleaned with 100 nonion. Activating agent are added. To poly propylene is 2% of the product as well as fillers and stabilizing agents and spun.</p></abstract-text></abstract>

Machine translation to English


A real challenge

Acronyms, (spurious) abbreviations, misspellings, language evolution – the collective bane of translation automation

•  Über vs. Ueber
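Spelling variants such as these can be normalized before translation with an explicit lookup. The sketch below assumes a hand-made table, and only the Ueber/Über pair comes from the slide. A blanket "ue" to "ü" replacement is deliberately avoided because it would corrupt words such as "neue".

```python
import re

# Hypothetical variant table; a real one would be built from corpus analysis.
SPELLING_VARIANTS = {"Ueber": "Über", "ueber": "über"}

_VARIANT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in SPELLING_VARIANTS) + r")\b"
)


def normalize_spelling(text):
    """Replace whole-word older spellings with their modern form."""
    return _VARIANT_PATTERN.sub(lambda m: SPELLING_VARIANTS[m.group(1)], text)


print(normalize_spelling("Ueber die Einwirkung von Chlor"))
print(normalize_spelling("neue Versuche"))  # unchanged: no whole-word match
```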


Abbreviations found in the first 10 years' worth of documents

Abbreviation   Full form              Translation            But…!
F.             Schmelzpunkt           Melting point          F = fluorine
S.             Säure                  Acid                   S = sulfur
u.             und                    and                    with exceptions like…
u. Mk.         unter dem Mikroskop    under the microscope
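The interplay between "u." and "u. Mk." suggests expanding the longest matching abbreviation first. The following is a minimal sketch of that idea using the four entries from the table. The function name and the matching rules are assumptions, and the approach does not resolve every ambiguity (a sentence that ends in the element symbol "F." would still be expanded incorrectly).

```python
import re

# German abbreviations and full forms taken from the table above.
ABBREVIATIONS = {
    "u. Mk.": "unter dem Mikroskop",
    "u.": "und",
    "F.": "Schmelzpunkt",
    "S.": "Säure",
}

# Longest entries first, so "u. Mk." is tried before "u.".
_ordered = sorted(ABBREVIATIONS, key=len, reverse=True)
# The lookbehind stops matches inside longer tokens such as "HF." or "Prod.".
_ABBREV_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(a) for a in _ordered) + r")"
)


def expand_abbreviations(text):
    """Expand known abbreviations, preferring the longest match."""
    return _ABBREV_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


print(expand_abbreviations("Krystalle u. Mk. betrachtet"))
print(expand_abbreviations("Füllstoffe u. Stabilisierungsmittel"))
print(expand_abbreviations("F. 120°"))
```

Expanding to the full German form before translation keeps the abbreviation table independent of the translation engine. Mapping straight to English would be an equally valid design.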


Summary and conclusion

•  Successful collaboration
–  New product and sales approach for CAS
   •  Machine translation and one-time purchase rather than subscription
–  Iconic applied translation systems and proprietary ensemble architecture to a complex subject
•  ChemZent exceeded sales expectations for year one
–  Academic and commercial customers
•  Iconic and CAS are continuing collaboration on other projects

Contact information

Dr. John Tinsley
CEO & Co-Founder
Iconic Translation Machines Ltd.
Web: http://www.IconicTranslation.com
Twitter: @iconictrans

Dr. Sabine Kuhn
Manager, Content Projects
CAS, a division of the American Chemical Society
Tel: 614-447-3600, ext. 3028
Web: http://www.cas.org/
Twitter: @CASChemistry