training: forward area language converter (falcon) army research laboratory 8/20/02


Training: Forward Area Language Converter (FALCON)

Army Research Laboratory

8/20/02

What is FALCON?

• Forward Area Language CONverter
• Document evaluation tool for non-linguists
• Evaluate captured documents in the field
  – Digitize paper documents when necessary
  – Process electronic file documents when available
  – Search documents for key words
  – Assist in measuring the importance of a document
  – Allows user to prioritize documents for evaluation
• Assists linguists by providing pre-translation

Who Got Us Here?

• Requirement from the 18th ABN Corps
• Specification by FAST Science Advisor
• U.S. Pacific Command
• Consortium of Government and Military groups, National Security Agency/FIDUL/Central Intelligence Agency/Air Force/Contractors building prototypes/…
• ARL providing system integration and improvements

How Does It Work?

• Documents are converted to digital images by the scanner or digital camera
• Digital images are converted to electronic text via Optical Character Recognition (OCR)
• Foreign text is converted to English via Machine Translation (MT)
• Foreign or English text is searched for keywords from a list selected by the operator

System Functions

Scan Document (scanner or digital camera) → Image File (or from disk) → OCR Image → Text File (or from disk) → Translate Text File → Search For Key Words → Output Report file

This slide represents the basic form in which the system works.
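The flow described above can be sketched in Python as a chain of stages. This is only an illustration under stated assumptions: the names `ocr_image`, `translate_text`, `search_keywords`, `process_document`, and `Result` are hypothetical, the OCR and MT stages are stubs (FALCON hands that work to commercial packages), and a plain substring count stands in for the keyword search.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    source_text: str
    target_text: str
    hits: dict = field(default_factory=dict)


def ocr_image(image_name):
    """Stub for the OCR stage: image file -> source-language text."""
    return f"<text recognized from {image_name}>"


def translate_text(source_text):
    """Stub for the MT stage: source-language text -> target-language text."""
    return f"<translation of {source_text}>"


def search_keywords(text, keywords):
    """Simplified search stage: case-insensitive substring count per keyword."""
    lowered = text.lower()
    return {kw: lowered.count(kw.lower()) for kw in keywords}


def process_document(keywords, image_name=None, text=None, search_target=True):
    """Run one document through OCR (if needed), translation, and keyword search."""
    if text is None:
        if image_name is None:
            raise ValueError("provide an image file name or text")
        text = ocr_image(image_name)  # image input: OCR first
    target = translate_text(text)     # text input skips straight to here
    searched = target if search_target else text
    return Result(text, target, search_keywords(searched, keywords))
```

Calling `process_document(["convoy"], text="convoy at dawn", search_target=False)` mirrors loading a text file from disk and searching the Source text: the OCR stub is never called.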

FALCON Components

• Scanning software
  – ScanSoft PaperPort Deluxe 8.0
• OCR software
  – ABBYY FineReader 6.0 Professional
  – BBN 1.0
  – Sakhr Automatic Reader Pro 6.0
• MT software
  – Systran Premium 3.3
  – Gister 6.0
  – Apptek Transphere 2.5

FALCON Components

• Tools
  – Microsoft Word 2000
  – ActivePerl 5.6.1
  – ImageMagick 5.4.4
  – GhostScript 7.0.4
  – Sun Java 2 SDK 1.3.1_04
  – HotSpot 4.0
  – Transcoder 1.0
  – Document Converter 1.3
• Configuration
  – FALCON 4.01

Configure FALCON

• Select language
  – Select translation direction. The English-to-foreign-language feature is limited to very few languages.
  – Source language is the language of the original document
  – Target language is the language you want the original document translated into
• Specify key word search options
  – Select the key word list filename
  – List is automatically updated when you add files
  – Select either Source or Target searching
• Specify Input method
  – Scan: Must go to PaperPort to initiate scans with scanner
  – Image: Load an image file from a disk
    • Digital Camera options to normalize and rotate image
  – Text: Load a text file from a disk. Skips OCR.

Configure FALCON

• Batch Process
  – Allows processing of multiple files
  – Works with both image and text files
    • Digital Camera options to normalize and rotate image files
  – Choose input/output directories
  – Set report file and keyword database file names
• Options
  – Set maximum number of files
  – Choose output directory

Running FALCON

• Double click on the FALCON icon
• All programs must be closed before running FALCON

Click on Continue once programs are closed

Running FALCON

• Specify translation direction (“English to…” only available for a few languages)
• Specify foreign language (broken down by region)
• Keyword options
  – Keyword searching can be turned off
• Digital camera options
• Input methods
  – Scan button is also used to set options and close the window with no processing, or in preparation for using the digital camera
• Falcon options

Input Method: Scan

• Click on the Scan button in the FALCON window
• Launch PaperPort
• Click on the “Acquire from Twain” Button
  – Launches the interface to your scanner
  – Settings must be as follows:
    • Black/white or Line drawing for color depth
    • 300 DPI image resolution
  – Scan as per the directions to your scanner
  – Image appears on PaperPort Desktop
• PaperPort has tools to Cut/Crop/Rotate/Save the image
• Send image through Falcon by clicking on [icon]

Input Method: Load Image

• Click on “Load Image File…” button in the FALCON window
• Standard Open File dialog appears
• Image file formats:
  – .JPG for jpeg compressed files
  – .BMP for Windows bitmap files
  – .TIF for tiff files
  – .PDF for Acrobat files
• Locate image file and click on “Open”

Digital Camera

• Take picture of document
  – Can take pictures of multiple documents
• Transfer Compact Flash (CF) Card to PC
• Process document image
  – Double click on the FALCON icon
  – Set Language
  – Click on “Digital Camera Image” checkbox
  – Continue with “Load Image” input method
• See “Camera Training” slides for further details

Input Method: Load Text

• Click on “Load Text File…” button in FALCON window
• Standard Open File dialog appears
• Text file formats:
  – .RTF for rich text format
  – .TXT for plain or encoded text format
• Locate text file and click on “Open”
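The file types listed for the Load Image and Load Text input methods suggest a simple way to decide which path a file takes. The sketch below is hypothetical (the function `input_method` and the two sets are illustrative names, not part of FALCON) and assumes the file extension alone identifies the type.

```python
from pathlib import PureWindowsPath

# Extensions taken from the Load Image and Load Text slides.
IMAGE_EXTENSIONS = {".jpg", ".bmp", ".tif", ".pdf"}
TEXT_EXTENSIONS = {".rtf", ".txt"}


def input_method(filename):
    """Return "image" or "text" for a supported file; raise ValueError otherwise."""
    ext = PureWindowsPath(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "image"  # goes through OCR, then translation
    if ext in TEXT_EXTENSIONS:
        return "text"   # skips OCR and goes straight to translation
    raise ValueError(f"unsupported file type: {filename}")
```

For example, `input_method(r"C:\Falcon Documents\page1.JPG")` returns `"image"`, so the file would be sent to OCR before translation.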

FALCON Options

• Click on “Options…” button in FALCON window
• Enter maximum number of files
  – User warned when # of files reaches 95% of limit
  – Warned again when # of files is at limit
  – Final warning when # of files is 5% over limit; user is offered the Autoclean option or can cancel the FALCON run
• Choose output folder
• These options only apply to single FALCON runs, not Batch Process
• Click on “OK”
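The three warning thresholds above amount to a small piece of arithmetic. The following sketch is an assumption-laden illustration: the function name `file_limit_warning` and the returned labels are hypothetical, and the thresholds are treated as inclusive, which the slides do not state. Integer arithmetic avoids floating-point rounding at the 95% and 105% boundaries.

```python
def file_limit_warning(file_count, max_files):
    """Return which warning (if any) applies for the current number of files."""
    if file_count * 100 >= max_files * 105:
        return "final"     # 5% over the limit: Autoclean or cancel the run
    if file_count >= max_files:
        return "exceeded"  # at the limit
    if file_count * 100 >= max_files * 95:
        return "approach"  # 95% of the limit reached
    return "none"
```

With a maximum of 100 files, the warnings would start at 95 files, repeat at 100, and become final at 105.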

Falcon Options

• Maximum # of files: Falcon will warn user when amount of files approaches this limit
• Output directory

Falcon Options

• Approach warning
• Exceeded warning
• Final warning with Autoclean option

OCR The Image

• OCR progresses after image file is sent through
  – Converts the image file to text file
  – Very “language” specific (character set)
  – Passes the resulting text to translation software
  – “Invisible” process
    • A progress window opens
    • No user intervention required
• Arabic OCR software (SAKHR) requires a hardware dongle to be present on the parallel port

Translate the Text File

• Translation progresses after the OCR or when input is a text file.
  – Creates a new text file in the new language
  – Very “language” specific (character set)
  – “Invisible” process
    • A progress window opens
    • No user intervention required
• Passes the SOURCE and TARGET text to Microsoft Word for display and key word search.

Display and Search

• Results displayed in Microsoft Word
  – Left document window displays Source text
  – Right document window displays Target text
  – Untranslated words are prefixed by >>
• Key word search conducted automatically
  – Searched window disappears and progress bar appears while search is conducted and reappears when done
  – Key words highlighted in Yellow
  – Key word hit count displayed in pop-up box
  – Key words are matched by whole word or phrase
• Toolbars at top of Source and Target window allow Falcon functions to be redone manually
• Document windows are not displayed under Batch Process
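Whole-word-or-phrase matching with a hit count, as described above, can be illustrated with a regular expression. This is a sketch, not FALCON's actual search (which runs inside Microsoft Word): the function name `count_keyword_hits` is hypothetical, and case-insensitive matching is an assumption the slides do not confirm.

```python
import re


def count_keyword_hits(text, keywords):
    """Count whole-word (or whole-phrase) occurrences of each keyword in text."""
    hits = {}
    for kw in keywords:
        # The lookarounds reject matches embedded in a longer word,
        # so "tank" does not match inside "tankers".
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        hits[kw] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits
```

Summing the values of the returned dictionary gives a total comparable to the hit count shown in the pop-up box.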

FALCON Report File

Word Toolbars

Microsoft Word Falcon Source Toolbar

Microsoft Word Falcon Target Toolbar

• Translate text in the current window (puts results in new Target window)
• Access Key word menu to:
  – Select a different key word list
  – Create a new key word list
  – Edit current key word list
• Do a new search in the current window
• Clear highlighting from a key word search
• Display On Screen Keyboard
• Run FineReader 6.0 OCR

Batch Process

• Click on “Batch Process…” button in FALCON window
• Batch Setup dialog appears
• Select Image or Text files
• Select Input/Output directories
• Select report file/keyword database file names
• Digital camera options available
• Click on “Start”

Batch Process

Image/Text file selection

Digital camera options

Input/Output directories

Report file and keyword database file name selection

Batch Report File

Key Word Lists

• Unicode encoded text files
  – Each key word or phrase on a single line
  – English keyword lists stored in a specific location: C:\Falcon Programs\Key Word Lists
  – Foreign language keyword lists stored in the corresponding sub-directory, broken out by region, for example: C:\Falcon Programs\Key Word Lists\East Europe\Russian
  – Can be edited with any editor, such as Notepad or Microsoft Word. The Word toolbar will open a new Microsoft Word document with the key word list ready for editing.
• See manual for details on encoded text files.
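Reading such a list outside of Word could look like the sketch below. Two things here are assumptions rather than facts from the slides: "Unicode" is taken to mean UTF-16 (the usual meaning in Windows editors of that era), and the helper names `parse_keyword_list` and `keyword_list_dir` are hypothetical. Consult the manual for the actual encoding.

```python
from pathlib import PureWindowsPath

# Root directory named on the slide; English lists live directly in it.
KEYWORD_ROOT = PureWindowsPath(r"C:\Falcon Programs\Key Word Lists")


def keyword_list_dir(region=None, language=None):
    """Return the directory holding key word lists for a language."""
    if region is None or language is None:
        return KEYWORD_ROOT  # English lists
    return KEYWORD_ROOT / region / language


def parse_keyword_list(data, encoding="utf-16"):
    """Decode raw file bytes into a list with one key word or phrase per entry."""
    lines = data.decode(encoding).splitlines()
    return [line.strip() for line in lines if line.strip()]
```

The bytes passed to `parse_keyword_list` would come from opening the list file in binary mode; blank lines are dropped so each entry is one key word or phrase.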

FALCON File Management

• FALCON stores files in default directories
  – “C:\Falcon Documents” for single runs
  – “C:\Falcon Documents\Batch” for batch runs
• OCR software generates “ocr#.rtf” or “tcr#.rtf”
• MT software generates “mt#.rtf”
• Files numbered sequentially after each document is processed
• FALCON generates report file with default name
  – “Falcon Report (SrcLng to TgtLng).txt”
• Batch Process generates report file and MS Access database file
  – “Batch Report (SrcLng to TgtLng).txt”
  – “Batch Report (SrcLng to TgtLng) - Keywords.mdb”
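The naming patterns above can be written out as small helper functions. This is an illustrative sketch: the function names are hypothetical, and it assumes that `SrcLng` and `TgtLng` are replaced by plain language names such as "Russian" and "English", which the slides do not spell out.

```python
def temp_file_name(stage, number):
    """Build a numbered result file name; stage is "ocr", "tcr", or "mt"."""
    if stage not in ("ocr", "tcr", "mt"):
        raise ValueError(f"unknown stage: {stage}")
    return f"{stage}{number}.rtf"


def report_file_name(src_lang, tgt_lang, batch=False):
    """Build the default report file name for a single or batch run."""
    prefix = "Batch Report" if batch else "Falcon Report"
    return f"{prefix} ({src_lang} to {tgt_lang}).txt"


def keyword_database_name(src_lang, tgt_lang):
    """Build the default MS Access keyword database name for a batch run."""
    return f"Batch Report ({src_lang} to {tgt_lang}) - Keywords.mdb"
```

Under these assumptions, a Russian-to-English batch run would produce `Batch Report (Russian to English).txt` alongside `Batch Report (Russian to English) - Keywords.mdb`.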

FALCON File Management

• MT: mt#.rtf
• OCR: ocr#.rtf, tcr#.rtf
• FALCON
  – Falcon Report (SrcLng to TgtLng).txt
  – Batch Report (SrcLng to TgtLng).txt
  – Batch Report (SrcLng to TgtLng) - Keywords.mdb

Saving Files

• User has capability of archiving important files to alternate user-selected directories
• Document image files from PaperPort
  – Exported in JPEG format (.JPG)
• OCR & MT result files (ocr#.rtf & mt#.rtf, displayed in Word)
  – Source and Target window in Rich Text Format (.RTF)
  – Alternatively can be saved in Word native format at your discretion, or encoded text. See Word help and the Falcon Manual for more information.
• Report files
  – Saved in plain text format (.TXT)
• Keyword Database Files
  – Saved in MS Access database format (.MDB)

Saving Files

• PaperPort
  – fname.JPG
• Word
  – Source: fname - SRC.RTF
  – Target: fname - TRG.RTF
• Use one file name for these files generated from a document.
• FALCON
  – Falcon Report (SrcLng to TgtLng).TXT
  – Batch Report (SrcLng to TgtLng).TXT
  – Batch Report (SrcLng to TgtLng) - Keywords.MDB
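The advice to use one file name for everything generated from a document can be made concrete with a small helper. The function `archive_names` is a hypothetical illustration that simply applies the `fname` patterns shown on the slide.

```python
def archive_names(fname):
    """Return the archive file names for one document, sharing a base name."""
    return {
        "image": f"{fname}.JPG",         # exported from PaperPort
        "source": f"{fname} - SRC.RTF",  # Source window saved from Word
        "target": f"{fname} - TRG.RTF",  # Target window saved from Word
    }
```

Keeping the base name identical means the image, the OCR text, and the translation of a document sort together in a directory listing.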

Common Problems

• Following slides describe some of the common problems encountered
• Some solutions are provided.
• For more detail, consult the Falcon Manual.

OCR Errors

• Sometimes characters are not recognized properly
• Causes
  – Poor quality document
  – Scan is too dark/light
  – Software error rate of about 0.1%
• These errors cause words to not translate
  – No options to translate misspelled words

Solution to OCR Errors

• Correct errors prior to translation
• Process is called “Document Evaluation With OCR Error Correction”
  – See user manual
• Improves translation results
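Because untranslated words are prefixed by >> in the Target text, that marker offers a quick way to spot words that OCR may have misread. The sketch below is hypothetical (the function `untranslated_words` is not a FALCON feature) and assumes the marker sits directly in front of the word, with at most some whitespace between them.

```python
import re


def untranslated_words(target_text):
    """Return the words that carry the >> untranslated marker, in order."""
    return re.findall(r">>\s*(\w+)", target_text)
```

A word that appears in this list is a candidate for correction in the Source text before translating again.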

Arabic OCR Dongle

• Problem: Falcon will not work on Arabic files.
• Problem: Sakhr Automatic Reader will not start.
• Solution: You must have the OCR Dongle installed on your computer.
• The dongle attaches to the parallel port and feeds through the port for other peripherals.
• The Dongle is your software license.
  – Lost dongle means you must buy a new license!
• Damaged dongles are returned to ARL for replacement with local OCR distributor.
  – Send by Certified Mail (Get a receipt!)
  – Address in the Falcon Manual

Unusual Documents

• Different Brightness (threshold) values for different paper types
  – White paper, newspaper, glossy documents
• Dirty, folded, or damaged documents
• Require user adjustment of Brightness and/or Contrast to optimize OCR process
• Consult the documentation for your scanner on how to make this adjustment.
• Brightness and Contrast adjustment is an art, not a science, and takes a little practice.

Wrong Language

• Symptoms:
  – The OCR takes a long time
  – The Target file in Word looks like junk
  – Falcon may outright crash during the process
• Fix:
  – Be sure the language setting is right.

File Locations

• Problem: I saved several files on the FALCON. These files are important and I am unable to locate them. What to do?

• Each of the software packages specifies its own default directory. You must correct the directory setting to put your saved files where you want them.

• Fix: Practice careful file maintenance (names and locations).

No Room At The Inn

• PaperPort, and the FALCON hard drive in general, have a limited amount of storage area. If you do not delete unwanted files and archive important files, the storage areas will eventually fill up.
• Maintain the files on your FALCON.
  – Temporary files are created in the C:\Falcon Documents directory each time Falcon is run.
  – These files are numbered sequentially and named mt##.rtf, ocr##.rtf, and tcr##.rtf.
  – Falcon will issue a warning when the number of files approaches the set limit, and will remove all temporary files at the user’s request.
  – Delete these files periodically to maintain a clean disk.
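Picking out the temporary files for periodic cleanup comes down to matching the three name patterns above. The sketch below only selects names and deletes nothing; `temporary_files` and `TEMP_NAME` are hypothetical names, and matching file names case-insensitively is an assumption based on Windows conventions.

```python
import re

# mt##.rtf, ocr##.rtf, and tcr##.rtf, where ## is the sequence number.
TEMP_NAME = re.compile(r"^(mt|ocr|tcr)\d+\.rtf$", re.IGNORECASE)


def temporary_files(names):
    """Return only the names that look like FALCON temporary result files."""
    return [name for name in names if TEMP_NAME.match(name)]
```

The input could be the output of `os.listdir` on the documents directory; report files and archived documents with other names are left out of the result.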

Software Problems / Reporting

Forward questions and problems to Chris Schlesiger
290-2473
Commercial 301-394-2473