KantanMT Advanced Features Presentation

KantanMT.com - Advanced features
No Hardware. No Software. No Hassle MT.

Upload: kantanmt

Post on 04-Dec-2014

DESCRIPTION

www.kantanmt.com Learn about some of KantanMT's more advanced features, including KantanWatch, GENTRY and pre-processing.

TRANSCRIPT

Page 1: KantanMT Advanced Features Presentation

No Hardware. No Software. No Hassle MT.

Page 2: KantanMT Advanced Features Presentation

Advanced Features of KantanMT.com

- A Path to Real Results

Page 3: KantanMT Advanced Features Presentation

What do we aim to cover today?

• KantanWatch™ – how the automatic quality metrics are calculated and what we can deduce from them
• GENTRY Parsing – how to develop customised parsers for your client data files
• GENTRY Regex – how to use regular expressions to build GENTRY and PEX rules
• GENTRY PEX – automating the post-editing of repetitive errors using PEX rules
• GENTRY Pre-Processor – how to use the GENTRY Pre-Processor to improve your training data
• KantanISR™ – adding training data to engines without a full engine rebuild
• Questions & Answers

Page 4: KantanMT Advanced Features Presentation

KantanWatch™: Quality Metrics

Three different methods are available in KantanMT: BLEU, F-Measure and TER.

Common characteristic
• Each computes the similarity of MT-generated text to hand-crafted reference text: the MT output on one hand, the equivalent human translation on the other
• The smaller the difference, the better the quality!
• Each quality metric is based on quite simple mathematics

Page 5: KantanMT Advanced Features Presentation

F-Measure: Automated F-Measure

Recall & Precision Metric

Flaw: no penalty for reordering

• Measures how accurately an MT system recalls (finds) words in its phrase tables when generating a target translation, and how precisely it places them in that translation

Page 6: KantanMT Advanced Features Presentation

F-Measure: Automated F-Measure

Worked example, for a reference translation of 5 words and an MT output of 6 words, 4 of which are correct:

Precision = correct / MT-Len = 4/6 ≈ 66%
Recall = correct / Ref-Len = 4/5 = 80%
F-Measure = (Precision * Recall) / ((Precision + Recall) / 2) = (80 * 66) / ((80 + 66) / 2) ≈ 73% (computed from the unrounded 4/6 and 4/5)
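The arithmetic on this slide can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function name `f_measure`, the whitespace tokenisation and the sample sentences are hypothetical, and KantanWatch's actual implementation is not described in the slides.

```python
from collections import Counter


def f_measure(mt_output, reference):
    """Return (precision, recall, f_measure) using the slide's formulas."""
    mt_tokens = mt_output.split()    # assumption: whitespace tokenisation
    ref_tokens = reference.split()
    # Words found in both texts, clipped by how often each occurs in either
    overlap = Counter(mt_tokens) & Counter(ref_tokens)
    correct = sum(overlap.values())
    precision = correct / len(mt_tokens) if mt_tokens else 0.0
    recall = correct / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Slide formula: (Precision * Recall) / ((Precision + Recall) / 2)
    score = (precision * recall) / ((precision + recall) / 2)
    return precision, recall, score


# Hypothetical sentences chosen to match the slide: 6 MT words, 5 reference
# words, 4 words in common.
p, r, f = f_measure("a cat sat down on quietly", "the cat sat down quietly")
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Because words are counted without regard to position, shuffling the MT output leaves all three numbers unchanged, which is the "no penalty for reordering" flaw noted on the slide.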

Page 7: KantanMT Advanced Features Presentation

TER: Automated TER (Translation Error Rate)

• Minimum number of edits needed to transform the MT output into the reference
• A WER / Levenshtein distance measure
• General indicator of post-editing effort

TER = (substitutions + insertions + deletions) / reference length

Assumption: the fewer substitutions, insertions and deletions, the fewer post-edits.
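As a rough illustration of the formula on this slide, the following Python sketch computes a word-level Levenshtein distance and divides it by the reference length. The function names and sample sentences are hypothetical, and the sketch counts only substitutions, insertions and deletions as the slide does; it is not KantanWatch's own implementation.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two word lists (dynamic programming)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def ter(mt_output, reference):
    """Edits needed to turn the MT output into the reference, per reference word."""
    hyp, ref = mt_output.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


# One substitution ("a" -> "the") and one deletion ("on"): 2 edits / 5 words
print(ter("a cat sat down on quietly", "the cat sat down quietly"))
```

A lower score means fewer edits, so unlike BLEU and F-Measure, smaller is better here.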

Page 8: KantanMT Advanced Features Presentation

BLEU: Automated BLEU Score

• Measures how many words overlap, giving higher scores to sequential words
• High correlation between BLEU and human judgement of translation quality

(Figure: reference translation and MT output, captioned "Translation is more fluent")
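To make "higher scores to sequential words" concrete, here is a simplified, sentence-level BLEU-style sketch in Python. It is an assumption-laden illustration: it uses a single reference, n-grams only up to length 2 and no smoothing, and the name `simple_bleu` is hypothetical. The slides do not say how KantanWatch computes its BLEU score.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(mt_output, reference, max_n=2):
    hyp, ref = mt_output.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_ngrams.values())
        # Clipped matches: an n-gram counts at most as often as in the reference
        matched = sum((hyp_ngrams & ref_ngrams).values())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages outputs shorter than the reference
    penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # same words, same order
print(simple_bleu("mat the on sat cat the", reference))  # same words, scrambled
```

The scrambled sentence contains every reference word but none of its word pairs, so it scores far lower than the correctly ordered one.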

Page 9: KantanMT Advanced Features Presentation

KantanWatch™

Taking the three metrics together provides a good indication of the quality of a KantanMT engine.

* KantanWatch Reports

Page 10: KantanMT Advanced Features Presentation

KantanWatch™

The ‘Client Profile Report’ can be used to track and monitor automated metrics.

* KantanWatch Reports

Page 11: KantanMT Advanced Features Presentation

KantanWatch™

Time-graphs offer a good overview of the maturing of a KantanMT engine.

* KantanWatch Reports

Page 12: KantanMT Advanced Features Presentation

KantanWatch™

• Valuable information for developers of MT engines
• Automated BLEU, F-Measure and TER are very useful and practical
• No individual measurement has absolute meaning, but taking the three metrics together provides a good indication of the quality of a KantanMT engine

Page 13: KantanMT Advanced Features Presentation

What is GENTRY™?

• GENTRY is a parsing framework for KantanMT.com
• Data Cleansing – designed to clean both training and client files
• Data Tokenisation/Detokenisation – maximises the performance of KantanMT engines
• File Parsing – all training data and client files are parsed using GENTRY Rule files
• PEX – Post-Editing automation for client files
• Preprocessor – GENTRY provides preprocessor capabilities for all training data formats

Page 14: KantanMT Advanced Features Presentation

File Parsing…

• GENTRY is script driven and uses Rule files (*.rul) to instruct your KantanMT engine what to translate
• For standard files, standard Rule files run in the background and determine which sections of a file require translation
• Some file formats for translation are not standard
• Rule files for these are easy to create with a simple text editor in a matter of minutes

Page 15: KantanMT Advanced Features Presentation

File Parsing - The challenge? What to translate?

XML file

Rule file (*.rul)

Page 16: KantanMT Advanced Features Presentation

In a more complicated scenario…

XML file

Rule file (*.rul)

Page 17: KantanMT Advanced Features Presentation

What is a RULE File?

Definition
• Defines the elements that are to be translated by KantanMT.com

Attributes
• Roots – defines all the root elements in your XML that you want to translate using your KantanMT engine
• Regex – defines regular expressions to run on root elements: extraction, output and re-insertion

(Screenshot: Defaultxml.rul, created in a text editor and encoded in UTF-8)

Page 18: KantanMT Advanced Features Presentation

Regex Section – regular expressions

The Regex Section holds the extraction, output and re-insertion rules:

• <gextractrule> – defines a matching regex for each root element. Only root elements that match this rule are processed
• <gextractoutputrule> – defines how each matching root element is presented to KantanMT
• <ginsertrule> – defines the rule used to re-insert the translated segment back into the original client file

Page 19: KantanMT Advanced Features Presentation

GENTRY– A demonstration

Page 20: KantanMT Advanced Features Presentation

Click here to ‘Translate’

Successful File Parsing

Page 21: KantanMT Advanced Features Presentation

GENTRY – Regular Expression

• Regular Expressions are at the core of how GENTRY operates
• Commonly referred to as Regex
• KantanMT supports standard LINUX Regex
• Used in PEX, Rule and Preprocessor script files

(Table: Regex Basics)

Page 22: KantanMT Advanced Features Presentation

GENTRY – Regular Expression

More advanced Regex
• Groups
• Quantified digit patterns such as \d{2,4} and [0-9]{2,4}

KantanMT Regex Reference: http://www.kantanmt.com/help_regex.php
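The two patterns shown on this slide can be tried out with a minimal Python sketch. Python's `re` module is used here only as a stand-in, on the assumption that GENTRY's Regex dialect treats `\d` and `{2,4}` the same way; the variable names and test strings are hypothetical.

```python
import re

# Assumption: Python's `re` behaves like GENTRY Regex for these constructs.
two_to_four_digits = re.compile(r"\d{2,4}")      # \d = any digit
ascii_equivalent = re.compile(r"[0-9]{2,4}")     # explicit character class

for candidate in ["7", "17", "1294", "12345"]:
    # fullmatch succeeds only when the whole string is 2 to 4 digits long
    print(candidate,
          bool(two_to_four_digits.fullmatch(candidate)),
          bool(ascii_equivalent.fullmatch(candidate)))
```

For plain ASCII digits both patterns accept exactly the same strings; in Python 3, `\d` additionally matches non-ASCII digit characters, which `[0-9]` does not.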

Page 23: KantanMT Advanced Features Presentation

What is PEX?

PEX is Post-Editing automation
• Based on GENTRY REGEX
• Define advanced Search & Replace rules for translation

Features
• Easy to use – based on GENTRY REGEX constructs
• Very powerful – excellent at finding patterns in translated outputs
• Automatically applied to all translated files processed by KantanMT.com

Benefits
• Dramatically cuts down on post-editing effort, time and cost

Page 24: KantanMT Advanced Features Presentation

Working with PEX

For example, suppose your KantanMT engine generates the following translation, in which the product code SP17 comes out in lower case:

Source Text: Modèle SP17 pour ordinateur 17"
Target Text: Modelo sp17 para computador 17”

A single PEX construct can be used to automate the post-editing.

(Screenshot: PEX rule, created in a text editor)

Page 25: KantanMT Advanced Features Presentation

Working with PEX

Now suppose the codes are not always 17 and could be any two-digit number:

Source Text: Modèle SP22 pour ordinateur 22"
Target Text: Modelo sp22 para computador 22”

We can use \d to represent any digit, so \d\d represents two digits!

Page 26: KantanMT Advanced Features Presentation

Working with PEX

Now suppose the product numbers are made up of SP followed by a number of two to four digits:

Source Text: Modèle SP1294 pour ordinateur
Target Text: Modelo sp1294 para computador

\d{2,4} is a more generic way to define a set of 2 to 4 digits! That’s the power of GENTRY Regex!
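The search-and-replace idea behind these three PEX slides can be sketched in Python. This is not PEX syntax, which the transcript does not show: the pattern, the function name `fix_product_code` and the use of Python's `re` module are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a PEX rule: find a lower-cased product code
# ("sp" followed by 2 to 4 digits) and restore the upper-case prefix.
PRODUCT_CODE = re.compile(r"\bsp(\d{2,4})\b")


def fix_product_code(target_text):
    # \1 re-inserts the captured digits unchanged after the corrected prefix
    return PRODUCT_CODE.sub(r"SP\1", target_text)


for target in ["Modelo sp17 para computador 17”",
               "Modelo sp22 para computador 22”",
               "Modelo sp1294 para computador"]:
    print(fix_product_code(target))
```

One rule with `\d{2,4}` covers all three examples from the slides, where a literal rule for "sp17" would only have fixed the first.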

Page 27: KantanMT Advanced Features Presentation

What does a PEX file look like?

• PEX files are created in a simple text editor
• Must be saved with a .PEX file extension
• Save as UTF-8 to ensure support for accented characters and DBCS languages

(Screenshot: PEX file created in a text editor, with the callout "Modify this section")

Page 28: KantanMT Advanced Features Presentation

PEX – A demonstration

MS Excel file (.xlsx)

Click here to ‘Translate’

Page 29: KantanMT Advanced Features Presentation

PEX – A demonstration

Click here to download

Incorrect casing in output

English Source File

Translated Target File

Page 30: KantanMT Advanced Features Presentation

PEX – A demonstration

PEX file

Click here to ‘Translate’

Page 31: KantanMT Advanced Features Presentation

PEX – A demonstration

Click here to Download

Correct casing in output!!!

English Source File

Translated Target File

Successful Automatic Post-Editing

Page 32: KantanMT Advanced Features Presentation

What is the GENTRY Preprocessor?

What is a Preprocessor?
• Search & Replace script for training material
• Allows quick modifications to training material

Features
• Easy to use – based on GENTRY REGEX constructs
• Very powerful – excellent at finding patterns in training material

Benefits
• Quickest way of updating existing training data sets

Page 33: KantanMT Advanced Features Presentation

What is the GENTRY Preprocessor?

Usage Scenarios
• Product name changes – write a PPX file to change the product name in training material, then rebuild the engine
• Anonymising training data sets – removing company names from training material
• Changing terminology throughout training data sets – substitute a word or phrase with new terminology
• Cleansing training data – removing bad data from training data sets

Page 34: KantanMT Advanced Features Presentation

Working with PPX files

PPX files are created using a text editor. Typical uses:
• Changing terminology
• Cleansing training data
• Reformatting numbers

Rule:

<search>$(\d{,3})\.(\d{,2})</search><replace>€$1</replace>

Example:

Will replace $145.13 with €145, or $24.9 with €24
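The effect of this rule can be approximated in Python as a sketch. Two things here are assumptions rather than PPX behaviour: the leading dollar sign is escaped as `\$` because Python's `re` would otherwise treat `$` as an end-of-line anchor, and `\1` is Python's spelling of the `$1` back-reference. The function name `dollars_to_euros` is hypothetical.

```python
import re

# Same quantifiers as the slide's rule: up to 3 digits, a dot, up to 2 digits.
# Assumption: "\$" is needed in Python to match a literal dollar sign.
CURRENCY = re.compile(r"\$(\d{,3})\.(\d{,2})")


def dollars_to_euros(segment):
    # Keep only the first captured group (the whole-number part) after "€"
    return CURRENCY.sub(r"€\1", segment)


print(dollars_to_euros("$145.13"))
print(dollars_to_euros("Price: $24.9"))
```

The second group captures the decimals only so that they are consumed by the match and dropped from the output, which is why $145.13 becomes €145 rather than €145.13.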

Page 35: KantanMT Advanced Features Presentation

How do we implement PPX?

• Create a source.ppx and a target.ppx
• Upload them in the ‘Training Data’ tab alongside your training data
• Select ‘Build’
• Your entire training data will be pre-processed

Click here to ‘Build’

Page 36: KantanMT Advanced Features Presentation

KantanISR™ (Instant Segment Retrainer)

• Perform instant segment retraining using a pop-up editor
• Add training data to engines quickly and easily without having to fully rebuild an engine
• API Version 2.0 includes full support for the KantanISR™ feature

(Screenshot: Training Data tab, Instant Segment Retrainer)

Page 37: KantanMT Advanced Features Presentation

KantanISR™...a Demonstration

Training Data tab

Click here to access ISR

Click here to save

Page 38: KantanMT Advanced Features Presentation

Summary

• KantanWatch™ – how the automatic quality metrics are calculated and what we can deduce from them
• GENTRY Parsing – how to develop customised parsers for all your file formats
• GENTRY Regex – how to use regular expressions to build GENTRY and PEX rules
• GENTRY PEX – automating the post-editing of repetitive errors using PEX rules
• GENTRY Pre-Processor – how to use the GENTRY Pre-Processor to improve your training data
• KantanISR™ – adding training data to engines without a full re-build

Page 39: KantanMT Advanced Features Presentation

Questions & Answers

Thank you!

Page 40: KantanMT Advanced Features Presentation

Additional information

Sign up for a FREE evaluation at KantanMT.com

Click here to ‘Signup’

Fill in details

Page 41: KantanMT Advanced Features Presentation

Additional information

For additional information please visit:
• GENTRY Parsing: http://www.kantanmt.com/help_gentry.php
• PEX (Automatic Post-editing): http://www.kantanmt.com/help_pex.php
• KantanMT Regex Reference: http://www.kantanmt.com/help_regex.php

Contact me at:
Kevin McCoy
E-mail: [email protected]
+353 86 823 1527

Page 42: KantanMT Advanced Features Presentation

Thank you!