

Tapta4IPC: helping translation of IPC definitions

Bruno Pouliquen ([email protected])

25 Feb 2013, IPC workshop

Translation assistant for patent titles and abstracts in PATENTSCOPE - potential use in translating IPC definitions collaboration

Statistical Machine Translation: bottom-up approach

no rules, no grammar, no dictionary, no terminology, only the parallel texts (bitexts)

We use an open-source system: Moses
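To illustrate what "only the parallel texts" can mean in practice, here is a minimal sketch of IBM Model 1, a classic word-alignment model of the family used by the alignment tools that Moses training relies on. It is not the Tapta code: the function name `ibm_model1` and the toy bitext (built from the example pairs shown later in these slides) are assumptions made for illustration only.

```python
from collections import defaultdict


def ibm_model1(bitext, iterations=10):
    """Estimate P(target word | source word) from sentence pairs with EM.

    bitext: list of (source_tokens, target_tokens) pairs.
    Returns a dict-like table keyed by (source_word, target_word).
    """
    target_vocab = {t for _, tgt in bitext for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in bitext:
            for t in tgt:
                # Share the evidence for t among all source words in the pair.
                norm = sum(prob[(s, t)] for s in src)
                for s in src:
                    frac = prob[(s, t)] / norm
                    count[(s, t)] += frac
                    total[s] += frac
        # Re-estimate the table from the expected counts.
        prob = defaultdict(
            float, {pair: c / total[pair[0]] for pair, c in count.items()}
        )
    return prob


# Hypothetical toy bitext: lower-cased, whitespace-tokenized.
TOY_BITEXT = [
    ("wheels".split(), "roues".split()),
    ("wheel guards".split(), "couvre-roues".split()),
    ("tyre for vehicle wheels".split(),
     "pneumatique pour roues de véhicule".split()),
]

model = ibm_model1(TOY_BITEXT)
best = max(
    (t for (s, t) in model if s == "wheels"),
    key=lambda t: model[("wheels", t)],
)
print("wheels ->", best)
```

No dictionary is supplied: the association between "wheels" and "roues" emerges only because the two words co-occur across sentence pairs. A real system trains on millions of pairs and goes on to extract multi-word phrases.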

Tapta: Translation of Patent Titles and Abstracts
• Originally built to translate patent applications
• Adapted to various applications

Introduction

data

system

Our system prepares the data for Moses, applies some post-processing (filtering, pruning, binarization, optimization…) and offers a Web interface for translation

Tapta framework

[Diagram: gather/convert data into bitexts (source language, target language); then clean, re-clean, train-model, post-filter, prune, binarize, optimize, publish]

Introduction: Tapta

At WIPO, as part of PATENTSCOPE (English, French, German, Chinese, Japanese)

e.g. http://patentscope.wipo.int/translate/simpleTranslate.jsf?id=JP75694586&langpair=jaen

Automatic translation of a patent application only available in Japanese…

At the United Nations (English from/into Arabic, French, Spanish, Russian and Chinese)

Technical workflow

[Diagram: technical workflow]

Data preparation: bitexts (source language, target language); sentence-split; sentence-align; filter wrong language; filter alignment; tokenization; score alignment. Result: bitexts aligned at sentence level.

Models: Moses' training; phrase table; reordering model; language model.

Serving: several Moses decoders; translation server; translation client.

Example sentence pairs (En / Es):

En: Strengthening of forum for human dignity: legal aid
Es: Fortalecimiento del foro para la dignidad humana – asistencia jurídica

En: must respect all aspects of human dignity
Es: debe respetar todos los aspectos de la dignidad humana

En: should fully respect human dignity
Es: se deben respetar plenamente la dignidad humana
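The "filter wrong language" and "filter alignment" steps of the workflow can be pictured with a minimal sketch. The slides do not say which criteria Tapta applies, so everything here is an assumption: the stop-word lists, the `max_ratio` threshold, and the function names `looks_like` and `keep_pair` are hypothetical.

```python
# Tiny, hypothetical stop-word lists; a real filter would use a proper
# language identifier trained on much more data.
EN_STOPWORDS = {"the", "of", "and", "for", "in", "to", "a", "all", "must", "should"}
ES_STOPWORDS = {"el", "la", "de", "del", "los", "las", "para", "se", "y", "todos"}


def looks_like(tokens, stopwords, other_stopwords):
    """Crude language check: at least as many hits in the expected language."""
    hits = sum(tok in stopwords for tok in tokens)
    other_hits = sum(tok in other_stopwords for tok in tokens)
    return hits >= other_hits


def keep_pair(src, tgt, max_ratio=2.5):
    """Return True if an English/Spanish sentence pair looks usable for training."""
    src_tokens = src.lower().split()
    tgt_tokens = tgt.lower().split()
    if not src_tokens or not tgt_tokens:
        return False  # one side is empty
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer / shorter > max_ratio:
        return False  # lengths too different: probably misaligned
    return (looks_like(src_tokens, EN_STOPWORDS, ES_STOPWORDS)
            and looks_like(tgt_tokens, ES_STOPWORDS, EN_STOPWORDS))


pairs = [
    ("must respect all aspects of human dignity",
     "debe respetar todos los aspectos de la dignidad humana"),
    # Wrong language on the source side.
    ("debe respetar todos los aspectos de la dignidad humana",
     "debe respetar todos los aspectos de la dignidad humana"),
    # Suspicious length ratio.
    ("human dignity",
     "debe respetar todos los aspectos de la dignidad humana"),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept), "of", len(pairs), "pairs kept")
```

Only the first pair survives. Filters of this kind trade recall for precision: discarding a few good pairs is usually cheaper than training on misaligned ones.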

IPC context

• Gather data:
  – Get existing definitions
  – Add IPC schema (XML on WIPO website)
  – Add “few” texts from patents
• “Learn” translation model
• Translate new texts

Get existing data, build parallel texts

Existing definitions…

IPC schema…

<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN"><textBody> <title> <titlePart> <text>Wheel guards</text></titlePart></title></textBody></ipcEntry>

<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR"><textBody> <title><titlePart> <text>Couvre-roues</text> </titlePart></title></textBody></ipcEntry>

Patent texts…

WO/2013/014517
(EN) TYRE FOR VEHICLE WHEELS
(FR) PNEUMATIQUE POUR ROUES DE VÉHICULE

Bitext: training material…

Wheels | roues
Wheel guards | Couvre-roues
Tyre for vehicle wheels | Pneumatique pour roues de véhicule
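Pairing IPC schema entries into bitext lines can be sketched as follows. This is an illustration under assumptions, not the Tapta implementation: the two fragments are the ones shown on the slide, whereas the published IPC scheme files are larger, nested, and may use XML namespaces. The helper names `titles_by_symbol` and `build_bitext` are hypothetical.

```python
import xml.etree.ElementTree as ET

EN_ENTRY = (
    '<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">'
    "<textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>"
    "</ipcEntry>"
)
FR_ENTRY = (
    '<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">'
    "<textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>"
    "</ipcEntry>"
)


def titles_by_symbol(xml_fragments):
    """Map each IPC symbol to its title text (title parts joined with '; ')."""
    titles = {}
    for fragment in xml_fragments:
        entry = ET.fromstring(fragment)
        parts = [t.text.strip() for t in entry.iter("text") if t.text]
        titles[entry.get("symbol")] = "; ".join(parts)
    return titles


def build_bitext(en_fragments, fr_fragments):
    """Pair English and French titles that share the same IPC symbol."""
    en_titles = titles_by_symbol(en_fragments)
    fr_titles = titles_by_symbol(fr_fragments)
    shared = sorted(en_titles.keys() & fr_titles.keys())
    return [(en_titles[symbol], fr_titles[symbol]) for symbol in shared]


bitext = build_bitext([EN_ENTRY], [FR_ENTRY])
print(bitext)  # [('Wheel guards', 'Couvre-roues')]
```

The symbol attribute acts as the alignment key, so no sentence aligner is needed for schema titles, unlike for running patent text.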

How well does it work?

Automatic evaluation: BLEU score

Principle: similarity of n-grams between the evaluated and reference sentences

On IPC definitions, English-French: BLEU = 48%

(without patent data: 44%)

Good quality

needs human post-editing
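The n-gram principle behind BLEU can be made concrete with a simplified, sentence-level sketch: clipped n-gram precision for n = 1 to 4, a geometric mean, and a brevity penalty. This is not the script behind the 48% figure; reported scores are normally computed at corpus level with a standard tool. The function names and the candidate sentence are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate against one reference (no smoothing)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "debe respetar todos los aspectos de la dignidad humana"
candidate = "debe respetar los aspectos de la dignidad humana"  # hypothetical output
print(round(sentence_bleu(candidate, reference), 3))
```

An identical candidate scores 1.0 and a candidate sharing no word with the reference scores 0.0. A single dropped word lowers the score noticeably because it breaks every n-gram that spans it.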

Tapta4IPC prototype (1)

Live demo using: http://patentscope.wipo.int/translateUN/translateIPC.jsf

http://fulty3.wipo.int:8080/Wtapta/translateIPC.jsf

Tapta4IPC prototype (2)

Conclusion / future work

This is a prototype, but the quality already looks acceptable

Human evaluation?

Better integrate the tool

In PCA6TRANSDEF?

Other languages?

Tapta4IPC in various languages

Tapta4IPC should work reasonably well on the following languages (we have built some language-specific tools and we have patent corpora):

• German
• Japanese
• Korean
• Spanish
• Dutch
• Portuguese
• Chinese
• Russian

More challenging:
• Czech, Slovak, Polish (many word forms, training corpus?)
• Estonian (even more word forms, would in theory require a larger training corpus)

Other languages: Arabic, Italian, Danish, Swedish, etc.

Thank you for your attention

شكرا لكم على اهتمامكم
Merci pour votre attention!
感谢您的关注
Grazie per la vostra attenzione!
¡Gracias por su atención!
Vielen Dank für Ihre Aufmerksamkeit!
Obrigado pela vossa atenção!
Dziękuję bardzo za Państwa uwagę!
Děkujeme za Vaši pozornost!
Ďakujem ti veľmi pekne za tvoju pozornosť
Tänan tähelepanu eest!
Благодарим за Вашето внимание!
Tak for Jeres opmærksomhed!
Thank you for your attention!