SIFT™ comparison with Tesseract Bhaarat Sharma

Upload: vuongkhanh

Post on 07-Apr-2018

Page 1: SIFT™ comparison with Tesseract - Aerstone Labs · Overview This document illustrates the difference in OCR results between Aerstone SIFT™ and Google Tesseract. We will begin

Table of Contents

Overview
  SIFT
  Tesseract
Sample Images
  Image One
    Image One Results
  Image Two
    Image Two Results
  Image Three
    Image Three Results
  Image Four
    Image Four Results
  Image Five
    Image Five Results
  Image Six
    Image Six Results
  Image Seven
    Image Seven Results
  Image Eight
    Image Eight Results
  Image Nine
    Image Nine Results
  Image Ten
    Image Ten Results
Conclusion

Overview

This document illustrates the difference in OCR results between Aerstone SIFT™ and Google Tesseract. We will begin with a quick overview of the two solutions, followed by a presentation of extracted text from ten sample images using both solutions, and concluding with a comparison of the results.

SIFT

SIFT™, from Aerstone Labs, is a browser-based and agentless data loss prevention solution. It is designed to protect an organization from accidentally spilling information onto unauthorized network enclaves. SIFT™ can be used as a stand-alone portal, or integrated seamlessly with existing document management systems. Once configured to search for the kind of data an organization considers sensitive, based on keywords or regular expressions, SIFT™ can be used to implement a data transfer approval workflow, and to optionally tag documents with any discovered keywords. SIFT™ natively supports both searchable documents (e.g., MS Office) and non-searchable assets (e.g., pictures, videos, and scanned PDFs).
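The keyword and regular-expression matching described above can be pictured with a short sketch. This is only an illustration, not SIFT™'s implementation: the function name `find_sensitive_terms`, the keyword list, and the pattern name `declassify-date` are all hypothetical, and the sample text is borrowed from the ground truth of Sample Image 1.

```python
import re
from typing import Dict, Iterable, List


def find_sensitive_terms(
    text: str, keywords: Iterable[str], patterns: Dict[str, str]
) -> List[str]:
    """Return the keywords and named patterns found in `text` (hypothetical helper)."""
    tags: List[str] = []
    lowered = text.lower()
    # Plain keywords: case-insensitive substring match.
    for keyword in keywords:
        if keyword.lower() in lowered:
            tags.append(keyword)
    # Regular expressions: report the pattern's name when it matches.
    for name, pattern in patterns.items():
        if re.search(pattern, text):
            tags.append(name)
    return tags


extracted = "Sentinel // Magnolia Declassify on:12.31.48 Maryland"
tags = find_sensitive_terms(
    extracted,
    keywords=["sentinel", "magnolia", "top secret"],
    patterns={"declassify-date": r"Declassify on:\s*\d{2}\.\d{2}\.\d{2}"},
)
print(tags)  # ['sentinel', 'magnolia', 'declassify-date']
```

A non-empty result could then be used to tag the document or to route the transfer for approval.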

To extract the data from non-searchable assets, SIFT™ uses patent-pending pre-processing algorithms. These algorithms detect the regions of the image that contain text and then send only those regions to the OCR engine. SIFT™ is designed to integrate with any OCR engine, but uses the Tesseract open source OCR engine by default.
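The general idea of "find the text regions first, then OCR only those regions" can be sketched as follows. SIFT™'s actual algorithms are patent-pending and not described in this document, so this sketch substitutes a deliberately simple technique: it treats the image as a grid of 0/1 pixels and takes the bounding box of each 4-connected group of "ink" pixels as a candidate region. The names `find_text_regions`, `crop`, and `ocr_regions` are hypothetical, and `ocr_engine` stands in for any pluggable OCR engine.

```python
from collections import deque
from typing import Callable, List, Tuple

Image = List[List[int]]          # 0 = background, non-zero = ink
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def find_text_regions(image: Image) -> List[Box]:
    """Return bounding boxes of 4-connected groups of ink pixels."""
    rows = len(image)
    cols = len(image[0]) if image else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search over one connected group of ink pixels.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] != 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(image: Image, box: Box) -> Image:
    """Cut the given bounding box out of the image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def ocr_regions(image: Image, ocr_engine: Callable[[Image], str]) -> List[str]:
    """Run the OCR engine on each detected region instead of the whole image."""
    return [ocr_engine(crop(image, box)) for box in find_text_regions(image)]
```

Because `ocr_engine` is just a callable, the same pre-processing step could sit in front of Tesseract or any other engine, which mirrors the engine-agnostic design described above.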

To learn more about SIFT™, visit Aerstone Labs SIFT. For a 3-minute video overview of SIFT™, visit SIFT Overview Video.

Tesseract

Google Tesseract is an OCR engine originally developed by Hewlett Packard, and later worked on and released by Google as an open source library under the Apache License. Tesseract OCR results are relatively decent for images with low noise (for example, a scanned book page), provided the scanned text is at least 300 DPI. However, as we’ll show later in the results section, the results obtained by Tesseract are less impressive for images that have high noise (for example, a picture with text annotation), or where the text isn’t well-structured. For more information on Google Tesseract, visit the Tesseract project on GitHub.

Sample Images

Image One

Sample Image 1

Image One Results

Sample Image Number: 1
Ground Truth: Telephone Company Suspicious Activity Sentinel // Magnolia Declassify on:12.31.48 Maryland Columbia Silver Spring Washington Lots of white vans Lots of cars Sentinel // Magnolia 17 January 2013 14:28:56
SIFT™: 17 January 2013 14:28:56 SENTINEL II MAGNOLIA Declassify on:12.31.48 Maryland LOTS OF WHITE VANS Washington Silver Spring Columbia Telephone Company YIJ/c. Suspicious Activity LOTS OF CARS SENTINEL II MAGNOLIA
Tesseract: SENHNEL IIMAGNOIJATelephoneCompanyDeclassifynn:12.31.ususpicious ActivitysamuELuuAGnouA ' ' '17Janualy2013{klflfifi
Accuracy: SIFT (100%), Tesseract (18%)

Image Two

Sample Image 2

Image Two Results

Sample Image Number: 2
Ground Truth: SMI SensoMotoric Instruments Newsletter
SIFT™: JAY ‘L’UJLLI I I.“ NEWSLETTER SensoMotoric Instruments
Tesseract: (none)
Accuracy: SIFT (75%), Tesseract (0%)

Image Three

Sample Image 3

Image Three Results

Sample Image Number: 3
Ground Truth: Mike Wants You! Pixar is always looking for new talent! Check out our latest Career Listings. Up Web Site Visit the official Up website Partly Cloudy Experience the pitfalls of baby delivery in Pixar’s latest short.
SIFT™: Experience the pitfalls Partly Cloudy Pixar’s latest short of baby delivery in "5. Mike Wants You! Visit the official Up website for new talent! Check out our latest career listings
Tesseract: MikuWamsVnu!Up Web Site PartlyElnudy p xarmahmt‘ "rkm’j vm-hr gum Exam-rmma mum : M mmmm mm m Lu mma! hat/nrwn-v m uA’rnEzvanbvakMammmm
Accuracy: SIFT (82%), Tesseract (8%)

Image Four

Sample Image 4

Image Four Results

Sample Image Number: 4
Ground Truth: GIFT FINDER
SIFT™: GIFT FINDER
Tesseract: FINDER
Accuracy: SIFT (100%), Tesseract (50%)

Image Five

Sample Image 5

Image Five Results

Sample Image Number: 5
Ground Truth: Reduce, reuse, recycle recycled bags handcrafted jewelry Reusable Bottles Bamboo Bowls
SIFT™: Reduce, reuse, recycle Bamboo Bowls Reusable Recycled jewelry handcrafted
Tesseract: Reduce, reuse, recycle
Accuracy: SIFT (82%), Tesseract (27%)
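The document does not state how the accuracy percentages are computed. One reading that reproduces the figures reported for Images Four, Five, and Ten is word-level recall: the share of ground-truth words that also appear in the OCR output, ignoring case and word order. The sketch below assumes that metric; the function name `word_accuracy` is hypothetical, and other rows may have been scored more leniently.

```python
import re
from collections import Counter


def word_accuracy(ground_truth: str, ocr_output: str) -> int:
    """Percentage of ground-truth words also present in the OCR output.

    Assumed metric: case-insensitive, order-independent word recall.
    """
    truth = Counter(re.findall(r"\w+", ground_truth.lower()))
    found = Counter(re.findall(r"\w+", ocr_output.lower()))
    total = sum(truth.values())
    if total == 0:
        return 0
    # Counter intersection keeps, per word, the smaller of the two counts.
    matched = sum((truth & found).values())
    return round(100 * matched / total)


truth_5 = ("Reduce, reuse, recycle recycled bags handcrafted jewelry "
           "Reusable Bottles Bamboo Bowls")
sift_5 = ("Reduce, reuse, recycle Bamboo Bowls Reusable Recycled "
          "jewelry handcrafted")
print(word_accuracy(truth_5, sift_5))                    # 82
print(word_accuracy(truth_5, "Reduce, reuse, recycle"))  # 27
print(word_accuracy("GIFT FINDER", "FINDER"))            # 50
```

Using a multiset intersection means a word repeated in the ground truth only counts as matched as many times as it appears in the OCR output.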

Image Six

Sample Image 6

Image Six Results

Sample Image Number: 6
Ground Truth: Country: Syria GEOS:333954N/0361946E 18 Apr 17 Main Prison Probable crematorium 2017 Digital Globe
SIFT™: 2017 Digital Globe Probable crematorium Main Prison GEOS:333954N/0361946E Country Syria 8 Apr 17
Tesseract: W ‘1 N, © 2017DlgltalGlobe //l..\§'9'" ‘\
Accuracy: SIFT (92%), Tesseract (14%)

Image Seven

Sample Image 7

Image Seven Results

Sample Image Number: 7
Ground Truth: Control of the industrial rail ferry northeast of Kerch could allow the Russian military to transfer heavy equipment such as tanks and supplies to flow direct from Russia. This ferry with tracks on the deck allows railroad cars to be loaded directly on and off the ship. Images: DigitalGlobe TerraMetrics and CNES/Astrium via Google Earth UKRAINE
SIFT™: es: DigitalGIobe. TerraMetrics and CNES/Astrium via Google Earth of Kerch could allow the Russian military to Control of the industrial rail ferry northeast and supplies to flow direct from Russia transfer heavy equipment such as tanks UKRAINE This ferry with track railroad cars to be loaded directly on on the deck aIIO' and off the ship
Tesseract: (70mm of memdusmal mu ferrynnnheast M muchcoma anowthePuss-an mmtarv mtransfer "slaweqmpmem such as[Links andSupphes (a flowdirect 1mm Russa,on me deck auowsrmlmad came heInaded mvemy onand game sal. Tm:ferrywuh tracks2‘3"
Accuracy: SIFT (94%), Tesseract (5%)

Image Eight

Sample Image 8

Image Eight Results

Sample Image Number: 8
Ground Truth: Other Fantastic Deals stayparis.com Studios from €80 per night (sleeps 2) 1 Bed Apts from €129 per night (sleeps 3) 2 Bed Apts from €135 per night (sleeps 4)
SIFT™: Studios from €80 per night (sleeps 2) 1 Bed Apts from €129 night (sleeps 3) 2 Bed Apts from €135 per night (sleeps 4) ther Fantastic Deals
Tesseract: studies ham€80pernlgm Isl-spam ‘Immimmempunlgtnmwu —ZMAp’sfiomfiJSpunlgh’mq
Accuracy: SIFT (88%), Tesseract (0%)

Image Nine

Sample Image 9

Image Nine Results

Sample Image Number: 9
Ground Truth: International Green Computing Conference August 15 - 18, 2010 CHICAGO, IL green-conf.com CALL FOR PAPERS NOW OPEN Due April 7, 2010
SIFT™: green-oonlorg AUGUST 15 - 18, 2010 I CHICAGO, IL DUE APRIL 7, 2010 CALL FOR PAPERS NOW OPEN CONFERENCE INTERNATIONAI. OOMPUTING
Tesseract: 0 gaéeNOOMPUTINQCONFERENCEAUGUSY 157 mm»cmuco, l.www.9reenroonrorg CALL ranPAPERS Now mamnu: APRIL 7,2010
Accuracy: SIFT (90%), Tesseract (20%)

Image Ten

Sample Image 10

Image Ten Results

Sample Image Number: 10
Ground Truth: Take an IELTS test
SIFT™: Take an IELTS test
Tesseract: (none)
Accuracy: SIFT (100%), Tesseract (0%)

Conclusion

For assets with high noise and low structure, Aerstone SIFT™ provides markedly better results than Tesseract out of the box. This is due to the extensive patent-pending image pre-processing that SIFT™ performs on each image, which helps identify text in complex images and ultimately yields substantially better results than native tools. For a live demo, or to discuss production integration scenarios, contact [email protected].
