kantanmt advanced features presentation
DESCRIPTION
www.kantanmt.com Learn about some of KantanMT's more advanced features, including KantanWatch, GENTRY and pre-processing.
TRANSCRIPT
No Hardware. No Software. No Hassle MT.
KantanMT.com - Advanced features
Advanced Features of KantanMT.com
- A Path to Real Results
What we aim to cover today
• KantanWatch™ – How the automatic quality metrics are calculated and what we can deduce from them
• GENTRY Parsing – How to develop customised parsers for your client data files
• GENTRY Regex – How to use regular expressions to build GENTRY and PEX rules
• GENTRY PEX – Automate the post-editing of repetitive errors using PEX rules
• GENTRY Pre-Processor – How to use the GENTRY Pre-Processor to improve your training data
• KantanISR™ – Adding training data to engines without a full engine re-build
• Questions & Answers
KantanWatch™: Quality Metrics
Three different methods are available in KantanMT: BLEU, F-Measure & TER
Common characteristic
• Each computes the similarity of MT-generated text to a hand-crafted reference text
• On one hand, the MT output; on the other hand, the equivalent human translation
• The smaller the difference, the better the quality!
• Each quality metric is based on quite simple mathematics
F-Measure: Automated F-Measure
• Recall & Precision metric
• Flaw: no penalty for reordering
• Recall: how accurately an MT system finds words in its phrase tables and uses them to generate a target translation
• Precision: how precise it is in putting those words into the target translation
[Figure: Reference Translation compared with MT Output]
Precision = correct / MT-length = 4/6 = 66%
Recall = correct / Ref-length = 4/5 = 80%
F-Measure = (Precision × Recall) / ((Precision + Recall) / 2) = (80 × 66) / ((80 + 66) / 2) ≈ 73%
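The F-Measure arithmetic on this slide can be reproduced with a short sketch. This is an illustration only, not KantanWatch's implementation: the function name, the whitespace tokenisation and the two sample sentences are assumptions chosen so that 4 of 6 MT words are correct against a 5-word reference, matching the slide's 4/6 and 4/5.

```python
from collections import Counter

def f_measure(reference, mt_output):
    """Return (precision, recall, f) using the formula shown on the slide."""
    ref_tokens = reference.split()
    mt_tokens = mt_output.split()
    # A word counts as correct at most as often as it appears on both sides.
    overlap = Counter(ref_tokens) & Counter(mt_tokens)
    correct = sum(overlap.values())
    precision = correct / len(mt_tokens) if mt_tokens else 0.0
    recall = correct / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (precision * recall) / ((precision + recall) / 2)
    return precision, recall, f

# Hypothetical sentences: 5 reference words, 6 MT words, 4 in common.
reference = "the cat sat on mat"
mt_output = "the cat sat on a rug"
precision, recall, f = f_measure(reference, mt_output)
print(f"precision={precision:.2%} recall={recall:.2%} f={f:.2%}")
```

Because only word counts are compared, shuffling the MT output leaves all three values unchanged, which is the "no penalty for reordering" flaw noted above.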
TER - Automated TER (Translation Error Rate)
• Minimum number of edits needed to transform the MT output into the reference
• WER / Levenshtein distance measure
• General indicator of Post-Editing effort
[Figure: Reference Translation compared with MT Output]
TER = (substitutions + insertions + deletions) / reference length
Assumption: the fewer substitutions, insertions and deletions, the fewer post-edits
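The edit-count formula can be sketched as a word-level Levenshtein distance divided by the reference length. This is a minimal illustration under stated assumptions, not KantanWatch's implementation: the function names and sample sentences are hypothetical, tokenisation is plain whitespace splitting, and only substitutions, insertions and deletions are counted, as on the slide (the published TER metric additionally counts block shifts as edits).

```python
def word_edit_distance(hypothesis, reference):
    """Minimum substitutions + insertions + deletions between word sequences."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] = edits needed to turn the hypothesis prefix seen so far
    # into the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete hyp_word
                current[j - 1] + 1,      # insert ref_word
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

def error_rate(hypothesis, reference):
    """Edits divided by reference length, as in the slide's formula."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len

# Hypothetical example: one deletion ("a") and one substitution ("rug" -> "mat").
rate = error_rate("the cat sat on a rug", "the cat sat on mat")
print(f"error rate = {rate:.0%}")
```

A lower rate suggests fewer post-edits; a rate of zero means the MT output already equals the reference.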
BLEU - Automated BLEU Score
Measures how many words overlap, giving higher scores to sequential words
High correlation between BLEU and human judgement of translation quality
[Figure: Reference Translation compared with MT Output]
Translation is more fluent
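To see why BLEU rewards sequential words, here is a simplified sentence-level sketch. It is not the scoring used by KantanWatch: the function names and sentences are hypothetical, only 1-grams and 2-grams are used (BLEU is usually computed with up to 4-grams over a whole test set), and no smoothing is applied.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, mt_output, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref, hyp = reference.split(), mt_output.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram match to its count in the reference.
        matched = sum((hyp_counts & ref_counts).values())
        if total == 0 or matched == 0:
            return 0.0  # unsmoothed: any empty n-gram level zeroes the score
        log_precisions.append(math.log(matched / total))
    # Penalise output that is shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
in_order = "the cat sat on a mat"       # mostly correct word order
scrambled = "mat a on sat cat the"      # same words, order destroyed
print(simple_bleu(reference, in_order), simple_bleu(reference, scrambled))
```

Both candidate sentences contain the same words, so F-Measure would score them identically; the 2-gram term is what separates the fluent output from the scrambled one.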
KantanWatch™
Taking the three metrics together provides a good indication of the quality of a KantanMT engine.
* KantanWatch Reports
KantanWatch™
The ‘Client Profile Report’ can be used to track and monitor automated metrics
* KantanWatch Reports
KantanWatch™
Time graphs offer a good overview of how a KantanMT engine matures
* KantanWatch Reports
KantanWatch™
Valuable information for developers of MT engines
• Automated BLEU, F-Measure and TER are very useful and practical
• No individual measurement has absolute meaning, but taking the three metrics together provides a good indication of the quality of a KantanMT engine
What is GENTRY™?
• GENTRY is a parsing framework for KantanMT.com
• Data Cleansing – designed to clean both training and client files
• Data Tokenisation/Detokenisation – maximises the performance of KantanMT engines
• File Parsing – all training data and client files are parsed using GENTRY Rule files
• PEX – Post-Editing automation for client files
• Preprocessor – GENTRY provides preprocessor capabilities for all training data formats
File Parsing…
• GENTRY is script driven and uses Rule files (*.rul) to instruct your KantanMT engine what to translate
• For standard file formats, standard Rule files run in the background and determine which sections of a file require translation
• Some file formats for translation are not standard
• Rule files for these are easy to create with a simple text editor in a matter of minutes
File Parsing - The challenge? What to translate?
XML file
Rule file (*.rul)
In a more complicated scenario…
XML file
Rule file (*.rul)
What is a RULE File?
Definition
• Defines the elements that are to be translated by KantanMT.com
Attributes
• Roots – defines all the Root elements in your XML that you want to translate using your KantanMT engine
• Regex – defines regular expressions to run on Root elements: extraction, output and re-insertion
Created in Text Editor
Defaultxml.rul
Encoded in UTF-8
Regex Section – regular expressions
Extraction, output and re-insertion rules
• <gextractrule> – defines a matching regex for each root element. Only root elements that match this rule are processed
• <gextractoutputrule> – defines how each matching root element is presented to KantanMT
• <ginsertrule> – defines the rule used to re-insert the translated segment back into the original client file
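The extract, output and re-insert cycle can be pictured with a small Python sketch. This is not GENTRY rule-file syntax and does not call KantanMT: the `<title>` root element, the `fake_translate` stub and the function names are all hypothetical, chosen only to show the three steps on a toy XML string.

```python
import re

def fake_translate(text):
    """Hypothetical stand-in for an MT engine call; it only upper-cases."""
    return text.upper()

# Step 1 (extract): only elements matching this pattern are processed.
EXTRACT = re.compile(r"(<title>)(.*?)(</title>)", re.DOTALL)

def translate_roots(document, translate=fake_translate):
    def reinsert(match):
        open_tag, segment, close_tag = match.groups()
        # Step 2 (output): present only the inner text to the engine.
        translated = translate(segment.strip())
        # Step 3 (insert): put the result back between the original tags.
        return f"{open_tag}{translated}{close_tag}"
    return EXTRACT.sub(reinsert, document)

sample = "<doc><id>42</id><title>hello world</title></doc>"
result = translate_roots(sample)
print(result)
```

Elements that do not match the extraction pattern, such as `<id>` here, pass through untouched, which mirrors the idea that only matching root elements are processed.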
GENTRY– A demonstration
Click here to ‘Translate’ Successful File Parsing
GENTRY – Regular Expression
Regular Expressions are at the core of how GENTRY operates
• Commonly referred to as Regex
• KantanMT supports standard LINUX Regex
• Used in PEX, Rule and Preprocessor script files
Regex Basics
GENTRY – Regular Expression
More advanced Regex
KantanMT Regex Reference: http://www.kantanmt.com/help_regex.php
Groups
\d{2,4}
[0-9]{2,4}
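The two patterns above can be tried out in Python's `re` module, used here only as a convenient stand-in; GENTRY's own regex engine may differ in details. The sample strings and the helper name are hypothetical. One Python-specific caveat: `\d` also matches non-ASCII digits unless `re.ASCII` is set, so the flag is added to make it behave like `[0-9]`.

```python
import re

# In Python, \d needs re.ASCII to be strictly equivalent to [0-9].
DIGITS_SHORTHAND = re.compile(r"\d{2,4}", re.ASCII)
DIGITS_RANGE = re.compile(r"[0-9]{2,4}")

def find_digit_runs(text):
    """Return the matches of both patterns so they can be compared."""
    return DIGITS_SHORTHAND.findall(text), DIGITS_RANGE.findall(text)

for sample in ["SP17", "SP1294", "SP7", "SP123456"]:
    shorthand, explicit = find_digit_runs(sample)
    print(sample, shorthand, explicit)
```

A single digit is too short to match, and a run of six digits is split into a 4-digit match followed by a 2-digit match, because `{2,4}` takes as many digits as it can, up to four.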
What is PEX?
PEX is Post-Editing automation
• Based on GENTRY REGEX
• Define advanced Search & Replace rules for translation
Features
• Easy to use – based on GENTRY REGEX constructs
• Very powerful – excellent at finding patterns in translated outputs
• Automatically applied to all translated files processed by KantanMT.com
Benefits
• Dramatically cut down on post-editing effort, time and cost
Working with PEX
For example, suppose your KantanMT engine generates the following translation:
Source Text: Modèle SP17 pour ordinateur 17"
Target Text: Modelo sp17 para computador 17”
A single PEX construct can be used to automate the post-editing
Created in Text Editor
Working with PEX
For example, suppose codes are not always 17 and could be any 2-digit number:
Source Text: Modèle SP22 pour ordinateur 22"
Target Text: Modelo sp22 para computador 22”
We can use \d to represent any digit – so \d\d represents two digits!
Working with PEX
Now suppose the product numbers are made up of SP followed by two to four digits.
Source Text: Modèle SP1294 pour ordinateur
Target Text: Modelo sp1294 para computador
\d{2,4} is a more generic way to define a set of 2 to 4 digits! That’s the power of GENTRY Regex!
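The casing fix from these three slides can be sketched as a single search-and-replace in Python. This is not PEX file syntax, which the slides show only as a screenshot; the pattern, the function name and the word-boundary anchors are assumptions made for this illustration.

```python
import re

# "sp" followed by 2 to 4 digits, as a whole word.
PRODUCT_CODE = re.compile(r"\bsp(\d{2,4})\b")

def fix_product_codes(text):
    """Upper-case the 'sp' prefix of product codes, keeping the digits."""
    return PRODUCT_CODE.sub(r"SP\1", text)

for target in [
    "Modelo sp17 para computador 17”",
    "Modelo sp22 para computador 22”",
    "Modelo sp1294 para computador",
]:
    print(fix_product_codes(target))
```

The standalone screen size (17, 22) is left alone because it is not preceded by "sp", and a code with only one digit or more than four digits is not rewritten.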
What does a PEX file look like?
• PEX files are created in a simple text editor
• Must be saved with a .PEX file extension
• Save as UTF-8 to ensure support for accented characters and DBCS languages
Created in Text Editor
Modify this section
PEX – A demonstration
MS Excel file (.xlsx)
Click here to ‘Translate’
PEX – A demonstration
Click here to download
Incorrect casing in output
English Source File
Translated Target File
PEX – A demonstration
PEX file
Click here to ‘Translate’
PEX – A demonstration
Click here to Download
Correct casing in output!!!
English Source File
Translated Target File
Successful Automatic Post-Editing
What is the GENTRY Preprocessor?
What is a Preprocessor?
• Search & Replace script for training material
• Allows quick modifications to training material
Features
• Easy to use – based on GENTRY REGEX constructs
• Very powerful – excellent at finding patterns in training material
Benefits
• Quickest way of updating existing training data sets
What is the GENTRY Preprocessor?
Usage Scenarios
• Product Name changes – write a PPX file to change the Product Name in training material and then rebuild the engine
• Anonymising training data sets – removing company names from training material
• Changing terminology throughout training data sets – substitute a word/phrase with new terminology
• Cleansing Training Data – removing bad data from training data sets
Working with PPX files
PPX files are created using a text editor
• Change terminology
• Cleansing training data
• Reformatting numbers
Rule:
<search>$(\d{,3})\.(\d{,2})</search><replace>€$1</replace>
Example:
Will replace $145.13 with €145, or $24.9 with €24
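The effect of this rule can be checked with an equivalent substitution in Python. This is a sketch under stated assumptions, not a PPX file: the function name is hypothetical, and two details are adapted to Python's `re` syntax, where the dollar sign must be escaped as `\$` to match a literal "$" and the captured group is written `\1` in the replacement.

```python
import re

# Literal "$", up to 3 digits, a literal ".", then up to 2 digits.
PRICE = re.compile(r"\$(\d{,3})\.(\d{,2})")

def apply_price_rule(text):
    """Swap the dollar sign for a euro sign and drop the decimal part."""
    return PRICE.sub(r"€\1", text)

for sample in ["$145.13", "$24.9", "Price: $99.99 per unit"]:
    print(apply_price_rule(sample))
```

Note that the rule only changes the currency symbol and truncates the cents; it does not convert between currencies.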
How do we implement PPX?
• Create a source.ppx & target.ppx
• Upload them in the ‘Training Data’ tab alongside your training data
• Select ‘Build’
• Your entire training data will be pre-processed
Click here to ‘Build’
KantanISR™ (Instant Segment Retrainer)
• Perform instant segment retraining using a pop-up editor
• Add training data to engines quickly and easily without having to fully rebuild an engine
• API Version 2.0 includes full support for the KantanISR™ feature
Training Data tab
Instant Segment Retrainer
KantanISR™...a Demonstration
Training Data tab
Click here to access ISR
Click here to save
Summary
• KantanWatch™ – How the automatic quality metrics are calculated and what we can deduce from them
• GENTRY Parsing – How to develop customised parsers for all your file formats
• GENTRY Regex – How to use regular expressions to build GENTRY and PEX rules
• GENTRY PEX – Automate the post-editing of repetitive errors using PEX rules
• GENTRY Pre-Processor – How to use the GENTRY Pre-Processor to improve your training data
• KantanISR™ – Adding training data to engines without a full re-build
Questions & Answers
Thank you!
Additional information
Sign up for a FREE evaluation at KantanMT.com
Click here to ‘Signup’
Fill in details
Additional information
For additional information please visit:
GENTRY Parsing: http://www.kantanmt.com/help_gentry.php
PEX (Automatic Post-editing): http://www.kantanmt.com/help_pex.php
KantanMT Regex Reference: http://www.kantanmt.com/help_regex.php
Contact me at: Kevin McCoy
E-mail: [email protected]: +353 86 823 1527
Thank you!