training: forward area language converter (falcon) army research laboratory 8/20/02
TRANSCRIPT
Training: Forward Area Language Converter (FALCON)
Army Research Laboratory
8/20/02
What is FALCON?
• Forward Area Language CONverter
• Document evaluation tool for non-linguists
• Evaluate captured documents in the field
– Digitize paper documents when necessary
– Process electronic file documents when available
– Search documents for key words
– Assist in measuring the importance of a document
– Allows user to prioritize documents for evaluation
• Assists linguists by providing pre-translation
Who Got Us Here?
• Requirement from the 18th ABN Corps
• Specification by FAST Science Advisor
• U.S. Pacific Command
• Consortium of Government and Military groups, National Security Agency/FIDUL/Central Intelligence Agency/Air Force/Contractors building prototypes/…
• ARL providing system integration and improvements
How Does It Work?
• Documents converted to digital images by the scanner or digital camera
• Digital images converted to electronic text via Optical Character Recognition (OCR)
• Foreign text is converted to English via Machine Translator (MT)
• Foreign or English text is searched for keywords from a list selected by the operator
System Functions
This slide represents the basic form in which the system works.
[Diagram: Scan Document (scanner or digital camera) -> Image File -> OCR Image -> Text File -> Translate Text File -> Search For Key Words -> Output Report file. Image files and text files can also be loaded from disk.]
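The flow on this slide can be sketched in a few lines of Python. This is only an illustration of the stage ordering, not FALCON's actual code; the names `InputMethod` and `stages_for` are hypothetical.

```python
from enum import Enum


class InputMethod(Enum):
    SCAN = "scan"    # paper document digitized by scanner or digital camera
    IMAGE = "image"  # image file loaded from disk
    TEXT = "text"    # text file loaded from disk


def stages_for(method: InputMethod, keyword_search: bool = True) -> list[str]:
    """Return the ordered stage names a document passes through."""
    stages = []
    if method is InputMethod.SCAN:
        stages.append("scan")
    if method in (InputMethod.SCAN, InputMethod.IMAGE):
        stages.append("ocr")        # image -> electronic text
    stages.append("translate")      # source text -> target text
    if keyword_search:              # key word searching can be turned off
        stages.append("search")
    stages.append("report")
    return stages


print(stages_for(InputMethod.TEXT))
```

A text file enters the flow after OCR, so its stage list starts at translation.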
FALCON Components
• Scanning software
– ScanSoft PaperPort Deluxe 8.0
• OCR software
– ABBYY FineReader 6.0 Professional
– BBN 1.0
– Sakhr Automatic Reader Pro 6.0
• MT software
– Systran Premium 3.3
– Gister 6.0
– Apptek Transphere 2.5
FALCON Components
• Tools
– Microsoft Word 2000
– ActivePerl 5.6.1
– ImageMagick 5.4.4
– GhostScript 7.0.4
– Sun Java 2 SDK 1.3.1_04
– HotSpot 4.0
– Transcoder 1.0
– Document Converter 1.3
• Configuration
– FALCON 4.01
Configure FALCON
• Select language
– Select translation direction. English to foreign language feature is limited to very few languages.
– Source language is what original document is in
– Target language is what you want the original document translated into
• Specify key word search options
– Select the key word list filename
– List is automatically updated when you add files
– Select either Source or Target searching
• Specify Input method
– Scan: Must go to PaperPort to initiate scans with scanner
– Image: Load an image file from a disk
• Digital Camera options to normalize and rotate image
– Text: Load a text file from a disk. Skips OCR.
Configure FALCON
• Batch Process
– Allows processing of multiple files
– Works with both image and text files
• Digital Camera options to normalize and rotate image files
– Choose input/output directories
– Set report file and keyword database file names
• Options
– Set maximum number of files
– Choose output directory
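As a rough way to picture the settings chosen on the configuration screens, the sketch below groups them into one record and checks them for obvious conflicts. The class `FalconSettings`, its field names, and the specific checks are assumptions made for illustration; FALCON's real configuration format is not described in these slides.

```python
from dataclasses import dataclass

INPUT_METHODS = ("scan", "image", "text")
SEARCH_SIDES = ("source", "target")


@dataclass
class FalconSettings:
    source_language: str                 # language the original document is in
    target_language: str = "English"     # language to translate into
    key_word_list: str = ""              # key word list filename (empty: none selected)
    search_side: str = "target"          # search the Source or the Target text
    input_method: str = "scan"           # scan, image, or text
    digital_camera_image: bool = False   # normalize and rotate camera images

    def problems(self) -> list[str]:
        """Return human-readable descriptions of conflicting settings."""
        found = []
        if self.source_language == self.target_language:
            found.append("source and target language are the same")
        if self.search_side not in SEARCH_SIDES:
            found.append(f"unknown search side: {self.search_side}")
        if self.input_method not in INPUT_METHODS:
            found.append(f"unknown input method: {self.input_method}")
        if self.digital_camera_image and self.input_method == "text":
            found.append("digital camera options do not apply to text input")
        return found


print(FalconSettings("Russian", input_method="image").problems())
```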
Running FALCON
• Double click on the FALCON icon
• All programs must be closed before running FALCON
Click on Continue once programs are closed
Running FALCON
Specify translation direction (“English to…” only available for a few languages)
Specify foreign language
Keyword options
Digital camera options
Scan button also used to set options and close window with no processing -- or in preparation for using digital camera
Keyword searching can be turned off
Input Methods
(Broken down by region)
Falcon options
Input Method: Scan
• Click on the Scan button in the FALCON window
• Launch PaperPort
• Click on the “Acquire from Twain” Button
– Launches the interface to your scanner
– Settings must be as follows:
• Black/white or Line drawing for color depth
• 300 DPI image resolution
– Scan as per the directions to your scanner
– Image appears on PaperPort Desktop
• PaperPort has tools to Cut/Crop/Rotate/Save the image
• Send image through Falcon by clicking on
Input Method: Load Image
• Click on “Load Image File…” button in the FALCON window
• Standard Open File dialog appears
• Image file formats:
– .JPG for jpeg compressed files
– .BMP for Windows bitmap files
– .TIF for tiff files
– .PDF for Acrobat files
• Locate image file and click on “Open”
Digital Camera
• Take picture of document
– Can take pictures of multiple documents
• Transfer Compact Flash (CF) Card to PC
• Process document image
– Double click on the FALCON icon
– Set Language
– Click on “Digital Camera Image” checkbox
– Continue with “Load Image” input method
• See “Camera Training” slides for further details
Input Method: Load Text
• Click on “Load Text File…” button in FALCON window
• Standard Open File dialog appears
• Text file formats:
– .RTF for rich text format
– .TXT for plain or encoded text format
• Locate text file and click on “Open”
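The image and text file formats listed for the Load Image and Load Text input methods suggest a simple extension check. The sketch below is illustrative only; `input_kind` is a hypothetical helper, and how FALCON itself decides what a file contains is not stated in the slides.

```python
from pathlib import Path

# Extensions taken from the Load Image and Load Text slides.
IMAGE_EXTENSIONS = {".jpg", ".bmp", ".tif", ".pdf"}
TEXT_EXTENSIONS = {".rtf", ".txt"}


def input_kind(filename: str) -> str:
    """Classify a file as 'image' (needs OCR), 'text' (skips OCR), or 'unsupported'."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "image"
    if ext in TEXT_EXTENSIONS:
        return "text"
    return "unsupported"


for name in ("page1.JPG", "notes.txt", "memo.doc"):
    print(name, "->", input_kind(name))
```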
FALCON Options
• Click on “Options…” button in FALCON window
• Enter maximum number of files
– User warned when # of files reaches 95% of limit
– Warned again when # of files is at limit
– Final warning when # of files is 5% over limit – offered Autoclean option or cancel FALCON run
• Choose output folder
• These options only apply to single FALCON runs, not Batch Process
• Click on “OK”
Falcon Options
Maximum # of files: Falcon will warn user when number of files approaches this limit
Output directory
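The three warning levels for the maximum number of files can be expressed as a small function. This is a sketch under assumptions: the function name `file_limit_status` and the status labels are hypothetical, and whether FALCON treats each threshold as inclusive is not stated in the slides.

```python
def file_limit_status(file_count: int, max_files: int) -> str:
    """Map a file count to the warning level described on the Options slide."""
    # Integer arithmetic avoids floating-point rounding at the thresholds.
    if file_count * 100 >= max_files * 105:
        return "over_limit"   # final warning: Autoclean or cancel the run
    if file_count >= max_files:
        return "at_limit"     # second warning
    if file_count * 100 >= max_files * 95:
        return "near_limit"   # first warning at 95% of the limit
    return "ok"


for count in (94, 95, 100, 105):
    print(count, file_limit_status(count, max_files=100))
```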
OCR The Image
• OCR progresses after image file is sent through
– Converts the image file to text file
– Very “language” specific (character set)
– Passes the resulting text to translation software
– “Invisible” process
• A progress window opens
• No user intervention required
• Arabic OCR software (SAKHR) requires a hardware dongle to be present on the parallel port
Translate the Text File
• Translation progresses after the OCR or when input is a text file.
– Creates a new text file in the new language
– Very “language” specific (character set)
– “Invisible” process
• A progress window opens
• No user intervention required
• Passes the SOURCE and TARGET text to Microsoft Word for display and key word search.
Display and Search
• Results displayed in Microsoft Word
– Left document window displays Source text
– Right document window displays Target text
– Untranslated words are prefixed by >>
• Key word search conducted automatically
– Searched window disappears and a progress bar appears while the search is conducted; the window reappears when done
– Key words highlighted in Yellow
– Key word hit count displayed in pop-up box
– Key words are matched by whole word or phrase
• Toolbars at top of Source and Target window allow Falcon functions to be redone manually
• Document windows are not displayed under Batch Process
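Whole-word and whole-phrase matching with a hit count, as described for the key word search, can be approximated with regular expressions. The sketch below is not the Word macro FALCON uses; the function names are hypothetical, and case-insensitive matching and optional whitespace after the >> marker are assumptions.

```python
import re


def count_keyword_hits(text: str, keywords: list[str]) -> dict[str, int]:
    """Count whole-word or whole-phrase occurrences of each key word."""
    hits = {}
    for keyword in keywords:
        # Lookarounds require a non-word character (or the text edge) on both
        # sides, so "convoy" does not match inside "convoys".
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        hits[keyword] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits


def untranslated_words(target_text: str) -> list[str]:
    """Collect words that the translator marked with the >> prefix."""
    return re.findall(r">>\s*(\w+)", target_text)


sample = "The convoy left the supply depot. Convoys move at night near the >>zavod."
print(count_keyword_hits(sample, ["convoy", "supply depot"]))
print(untranslated_words(sample))
```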
Word Toolbars
Microsoft Word Falcon Source Toolbar
Microsoft Word Falcon Target Toolbar
Translate text in the current window (puts results in new Target window)
Access Key word menu to:
- Select a different key word list
- Create a new key word list
- Edit current key word list
Do a new search in the current window
Clear highlighting from a key word search
Display On Screen Keyboard
Run FineReader 6.0 OCR
Batch Process
• Click on “Batch Process…” button in FALCON window
• Batch Setup dialog appears
• Select Image or Text files
• Select Input/Output directories
• Select report file/keyword database file names
• Digital camera options available
• Click on “Start”
Batch Process
Image/Text file selection
Digital camera options
Input/Output directories
Report file and keyword database file name selection
Key Word Lists
• Unicode encoded text files
– Each key word or phrase on a single line
– English keyword lists stored in a specific location: C:\Falcon Programs\Key Word Lists
– Foreign language keyword lists stored in corresponding sub-directory broken out by regions, for example: C:\Falcon Programs\Key Word Lists\East Europe\Russian
– Can use any editor, such as Notepad or Microsoft Word. The Word toolbar will open a new Microsoft Word document with the key word list ready for editing.
• See manual for details on encoded text files.
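A key word list in this layout (one word or phrase per line, stored under a region and language sub-directory) could be located and parsed roughly as follows. The helper names are hypothetical, and reading "Unicode encoded" as UTF-16 with a byte order mark is an assumption; the manual is the authority on the actual encoding.

```python
from pathlib import PureWindowsPath
from typing import Optional

KEY_WORD_ROOT = PureWindowsPath(r"C:\Falcon Programs\Key Word Lists")


def key_word_list_dir(region: Optional[str] = None,
                      language: Optional[str] = None) -> PureWindowsPath:
    """Return the directory that holds key word lists for a language."""
    if region is None or language is None:
        return KEY_WORD_ROOT                  # English lists
    return KEY_WORD_ROOT / region / language  # foreign lists, by region


def parse_key_word_list(raw: bytes, encoding: str = "utf-16") -> list[str]:
    """Decode a key word list and return one entry per non-empty line."""
    text = raw.decode(encoding)
    return [line.strip() for line in text.splitlines() if line.strip()]


print(key_word_list_dir("East Europe", "Russian"))
print(parse_key_word_list("tank\r\nsupply depot\r\n".encode("utf-16")))
```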
FALCON File Management
• FALCON stores files in default directories
– “C:\Falcon Documents” for single runs
– “C:\Falcon Documents\Batch” for batch runs
• OCR software generates “ocr#.rtf” or “tcr#.rtf”
• MT software generates “mt#.rtf”
• Files numbered sequentially after each document is processed
• FALCON generates report file with default name
– “Falcon Report (SrcLng to TgtLng).txt”
• Batch Process generates report file and MS Access database file
– “Batch Report (SrcLng to TgtLng).txt”
– “Batch Report (SrcLng to TgtLng) - Keywords.mdb”
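The default naming patterns above can be reproduced with a few string templates. This is a sketch for illustration; the function names are hypothetical, and using full language names for SrcLng and TgtLng is an assumption.

```python
def result_file_name(kind: str, number: int) -> str:
    """Build an OCR or MT result name such as 'ocr3.rtf' or 'mt3.rtf'."""
    if kind not in ("ocr", "tcr", "mt"):
        raise ValueError(f"unknown result kind: {kind}")
    return f"{kind}{number}.rtf"


def report_file_name(src: str, tgt: str, batch: bool = False) -> str:
    """Build the default report name for a single run or a batch run."""
    prefix = "Batch Report" if batch else "Falcon Report"
    return f"{prefix} ({src} to {tgt}).txt"


def keyword_database_name(src: str, tgt: str) -> str:
    """Build the default name of the batch keyword database."""
    return f"Batch Report ({src} to {tgt}) - Keywords.mdb"


print(result_file_name("ocr", 3))
print(report_file_name("Russian", "English"))
print(keyword_database_name("Russian", "English"))
```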
FALCON File Management
[Diagram: files generated by each component]
OCR: ocr#.rtf, tcr#.rtf
MT: mt#.rtf
FALCON: Falcon Report (SrcLng to TgtLng).txt, Batch Report (SrcLng to TgtLng).txt, Batch Report (SrcLng to TgtLng) - Keywords.mdb
Saving Files
• User has capability of archiving important files to alternate user-selected directories
• Document image files from PaperPort
– Exported in JPEG format (.JPG)
• OCR & MT result files (ocr#.rtf & mt#.rtf, displayed in Word)
– Source and Target window in Rich Text Format (.RTF)
– Alternatively can be saved in Word native format at your discretion, or encoded text. See Word help and the Falcon Manual for more information.
• Report files
– Saved in plain text format (.TXT)
• Keyword Database Files
– Saved in MS Access database format (.MDB)
Saving Files
[Diagram: files saved from each program]
PaperPort: fname.JPG
Word: fname - SRC.RTF (Source), fname - TRG.RTF (Target)
Use one file name for these files generated from a document.
FALCON: Falcon Report (SrcLng to TgtLng).TXT, Batch Report (SrcLng to TgtLng).TXT, Batch Report (SrcLng to TgtLng) - Keywords.MDB
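The advice to use one file name for everything generated from a document can be illustrated with a tiny helper that derives all three archive names from a single base name. The function `archive_names` and the example base name are hypothetical.

```python
def archive_names(fname: str) -> dict[str, str]:
    """Derive the image, Source, and Target archive names from one base name."""
    return {
        "image": f"{fname}.JPG",         # exported from PaperPort
        "source": f"{fname} - SRC.RTF",  # Source window saved from Word
        "target": f"{fname} - TRG.RTF",  # Target window saved from Word
    }


for role, name in archive_names("captured-memo-01").items():
    print(role, name)
```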
Common Problems
• Following slides describe some of the common problems encountered
• Some solutions are provided.
• For more detail, consult the Falcon Manual.
OCR Errors
• Sometimes characters are not recognized properly
• Causes
– Poor quality document
– Scan is too dark/light
– Software error rate of about 0.1%
• These errors cause words to not translate
– No options to translate misspelled words
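To get a feel for what an error rate of about 0.1% means, the arithmetic can be worked through in a short sketch. It assumes the rate is per character and that errors are independent of one another; neither assumption is stated in the slides, and the function names are hypothetical.

```python
def expected_char_errors(num_chars: int, error_rate: float = 0.001) -> float:
    """Expected number of misrecognized characters in a document."""
    return num_chars * error_rate


def word_error_probability(word_length: int, error_rate: float = 0.001) -> float:
    """Chance that a word of the given length contains at least one OCR error."""
    return 1 - (1 - error_rate) ** word_length


print(expected_char_errors(2000))   # a page of about 2000 characters
print(word_error_probability(6))    # a six-letter word
```

Under these assumptions a 2000-character page averages roughly two bad characters, and each one can leave a word untranslated.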
Solution to OCR Errors
• Correct errors prior to translation
• Process is called “Document Evaluation With OCR Error Correction”
– See user manual
• Improves translation results
Arabic OCR Dongle
• Problem: Falcon will not work on Arabic files.
• Problem: Sakhr Automatic Reader will not start.
• Solution: You must have the OCR Dongle installed on your computer.
• Parallel port. Feeds through the port for other peripherals.
• The Dongle is your software license.
– Lost dongle means you must buy a new license!
• Damaged dongles returned to ARL for replacement with local OCR distributor.
– Send by Certified Mail (Get a receipt!)
– Address in the Falcon Manual
Unusual Documents
• Different Brightness (threshold) values for different paper types
– White paper, newspaper, glossy documents
• Dirty, folded, or damaged documents
• Require user adjustment of Brightness and/or Contrast to optimize OCR process
• Consult the documentation for your scanner on how to make this adjustment.
• Brightness and Contrast adjustment is an art, not a science, and takes a little practice.
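Why the brightness (threshold) setting matters for black/white scans can be shown with a toy binarization. The gray values below are made up for illustration, and `binarize` is a hypothetical helper; real scanner drivers apply their own, more elaborate processing.

```python
def binarize(gray_pixels: list[int], threshold: int) -> list[int]:
    """Map 0-255 gray values to 1 (black ink) or 0 (white paper)."""
    return [1 if value < threshold else 0 for value in gray_pixels]


# Invented sample: dark ink (about 60) on grayish newspaper stock (about 170).
row = [60, 170, 65, 175]

print(binarize(row, 128))  # ink and paper separated
print(binarize(row, 200))  # threshold too high: paper turns black (scan too dark)
print(binarize(row, 50))   # threshold too low: ink disappears (scan too light)
```

A threshold that works for white paper can fail on newspaper or glossy stock, which is why the setting needs adjusting per document type.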
Wrong Language
• Symptoms:
– The OCR takes a long time
– The Target file in Word looks like junk
– Falcon may outright crash during the process
• Fix:
– Be sure the language setting is right.
File Locations
• Problem: I saved several files on the FALCON. These files are important and I am unable to locate them. What to do?
• Each of the software packages specifies its own default directory. You must correct the directory setting to put your saved files where you want them.
• Fix: Practice careful file maintenance (names and locations).
No Room At The Inn
• PaperPort, and the FALCON hard drive in general, have a limited amount of storage. If you do not delete unwanted files and archive important files, the storage areas will eventually fill up.
• Maintain the files on your FALCON.
– Temporary files are created in the C:\Falcon Documents directory each time Falcon is run.
– These files are numbered sequentially and named mt##.rtf, ocr##.rtf, and tcr##.rtf.
– Falcon will issue a warning when the number of files approaches the set limit, and will remove all temporary files at the user’s request.
– Delete these files periodically to maintain a clean disk.
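Periodic cleanup of the numbered temporary files could be scripted along the following lines. This is a sketch, not FALCON's Autoclean feature: the function names are hypothetical, and the pattern assumes the temporary files are exactly those named mt, ocr, or tcr followed by digits and .rtf. The default is a dry run that only lists candidates, so archived files can be checked first.

```python
import re
from pathlib import Path

# Matches names such as mt12.rtf, ocr3.rtf, tcr7.rtf (case-insensitive).
TEMP_NAME = re.compile(r"^(mt|ocr|tcr)\d+\.rtf$", re.IGNORECASE)


def find_temp_files(directory: Path) -> list[Path]:
    """List the numbered temporary result files in a directory."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and TEMP_NAME.match(p.name)
    )


def clean_temp_files(directory: Path, dry_run: bool = True) -> list[Path]:
    """Return the temporary files found; delete them unless dry_run is True."""
    files = find_temp_files(directory)
    if not dry_run:
        for path in files:
            path.unlink()
    return files
```

For example, `clean_temp_files(Path(r"C:\Falcon Documents"))` would list the candidates without deleting anything; passing `dry_run=False` removes them.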
Software Problems / Reporting
Forward questions and problems to Chris Schlesiger [email protected] 290-2473, Commercial 301-394-2473