![Page 1: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/1.jpg)
Enhancing NfN using Text Analytics and Visualization
Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate
iDigBio – Notes From Nature Hackathon, December 2013
Increasing Citizen Science Participation in Museum Specimen Digitization
![Page 2: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/2.jpg)
Text Clusters (What)
Preprocess specimen label images with OCR
Remove (and use!) noise from text
Utilize OCR text
◦ create word cloud linked to record ids
◦ differentiate hand-written from typed labels
Allow transcribers to choose terms from word cloud to create individual sets
Allow validators to choose sets to clean
![Page 3: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/3.jpg)
Reasons for Cluster Methodology (Why)
Enhance user experience
User Happiness!
Leverage user expertise
Improve speed
Reduce errors
Enable ditto function
![Page 4: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/4.jpg)
User Stories (Who)
Users like ordered datasets
Transcription
◦ faster with ordered/sorted sets
◦ less error-prone with sorted sets
![Page 5: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/5.jpg)
Handwriting vs. Typed
Segregate hand-written from typed labels
Ben Brumfield code uses regex to sort out garbage (higher garbage = higher likelihood hand-written)
◦ Read all about it at Ben’s blog!
◦ Code is at GitHub
◦ Humanities community using now!
Let transcriber choose label format
Typed? ... go to word cloud workflow
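The garbage-based split can be pictured with the sketch below. It is not Ben Brumfield's code: the three patterns, the helper names `garbage_ratio` and `looks_handwritten`, and the 0.5 threshold are all illustrative assumptions that would need tuning on real OCR output.

```python
import re

# Hypothetical "garbage" patterns: token shapes that are rare in clean typed text.
GARBAGE_PATTERNS = [
    re.compile(r"[^A-Za-z0-9\s]{2,}"),                       # runs of punctuation/symbols
    re.compile(r"(.)\1{3,}"),                                # same character 4+ times in a row
    re.compile(r"^[bcdfghjklmnpqrstvwxz]{5,}$", re.IGNORECASE),  # 5+ consonants, no vowel
]


def is_garbage(token: str) -> bool:
    """Return True if the token matches any assumed garbage pattern."""
    return any(p.search(token) for p in GARBAGE_PATTERNS)


def garbage_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like garbage."""
    tokens = text.split()
    if not tokens:
        return 0.0  # empty OCR output carries no evidence either way
    return sum(is_garbage(t) for t in tokens) / len(tokens)


def looks_handwritten(text: str, threshold: float = 0.5) -> bool:
    """Higher garbage ratio = higher likelihood that the label is hand-written."""
    return garbage_ratio(text) >= threshold
```

Labels flagged as typed would go on to the word cloud workflow; the rest would be offered to transcribers as hand-written sets.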
![Page 6: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/6.jpg)
Estimating OCR confidence
extract character-level n-grams from a corpus (the OCR corpus + an external good corpus, ideally)
obtain a list of character-level n-grams
◦ e.g. bi-grams look like th, sh, ph ...
given a word, compute its probability based on the n-gram probabilities
this is used for computing the final OCR confidence score
can use a standard dictionary in computing the score (if time allows)
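A minimal sketch of the word-scoring step, assuming a character bi-gram model. The add-one smoothing, the `^`/`$` boundary markers, and the length-normalized log probability are assumptions made here for illustration; the slides do not say how the team combined n-gram probabilities into the final score.

```python
import math
from collections import Counter


def char_bigrams(word: str) -> list[str]:
    """Character bi-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(corpus_words):
    """Count bi-grams and their first characters over a list of corpus words."""
    bigram_counts, context_counts = Counter(), Counter()
    for word in corpus_words:
        for bg in char_bigrams(word):
            bigram_counts[bg] += 1
            context_counts[bg[0]] += 1
    alphabet = {ch for bg in bigram_counts for ch in bg}
    return bigram_counts, context_counts, len(alphabet)


def word_score(word: str, model) -> float:
    """Average log probability per bi-gram (add-one smoothed); higher = more word-like."""
    bigram_counts, context_counts, alphabet_size = model
    bigrams = char_bigrams(word)
    log_prob = 0.0
    for bg in bigrams:
        p = (bigram_counts[bg] + 1) / (context_counts[bg[0]] + alphabet_size)
        log_prob += math.log(p)
    return log_prob / len(bigrams)
```

Dividing by the number of bi-grams keeps long words from being penalized simply for their length; a dictionary lookup could be layered on top as an extra signal.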
![Page 7: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/7.jpg)
Word Cloud Workflow
![Page 8: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/8.jpg)
Word Cloud
http://aocr1.acis.ufl.edu/datasets/lichens/silver/ocr/WebrootDatasetsLichensSilverOcr.txt
![Page 9: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/9.jpg)
aOCR iDigBio – BRIT Hackathon Lichen Dataset
[Diagram: dataset workflow]
Lichen label images: 10,000
Subsets: 200 hand-typed, 200 hand-parsed, 200 silver OCR, 200 silver-parsed
Files: Gold txt, Gold csv, Silver txt, Silver csv
Steps: Comparison, Analysis, Scoring, Parsing
Also shown: Stop/not useful words removed, SOLR Index, Word Cloud, Crowd!
![Page 10: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/10.jpg)
aOCR iDigBio – BRIT Hackathon NYBG Herbarium Dataset
[Diagram: dataset workflow]
Herbarium sheet images: 5,000
Subsets: 100 hand-typed, 100 hand-parsed, 100 silver OCR, 100 silver-parsed
Files: Gold txt, Gold csv, Silver txt, Silver csv
Steps: Comparison, Analysis, Scoring, Parsing
Also shown: Word Cloud, Crowd!
![Page 11: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/11.jpg)
aOCR iDigBio – BRIT Hackathon CalBug Entomology Dataset
[Diagram: dataset workflow]
CalBug entomology images: 500
Subsets: 100 hand-typed, 100 hand-parsed, 100 silver OCR, 100 silver-parsed
Files: Gold txt, Gold csv, Silver txt, Silver csv
Steps: Comparison, Analysis, Scoring, Segmentation, Sorting
![Page 12: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/12.jpg)
Word Cloud using Solr + Carrot2
http://ammatsun.acis.ufl.edu:5901/carrot2-webapp-3.8.1/
![Page 13: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/13.jpg)
Folder View of Search
![Page 14: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/14.jpg)
Circles Visualization of Search
![Page 15: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/15.jpg)
Foam Tree Visualization of Search
![Page 16: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/16.jpg)
OCR Confidence Estimation
![Page 17: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/17.jpg)
Token Histogram – Google Charts
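A rough sketch of how the token counts behind such a histogram could be produced. The stop-word list, the minimum token length, and the function names are hypothetical; the output is a header row followed by `[token, count]` rows, a shape that charting tools such as Google Charts' `arrayToDataTable` accept.

```python
import re
from collections import Counter

# Hypothetical stop/not-useful words; a real list would be tuned per collection.
STOP_WORDS = {"the", "of", "and", "in", "on", "at", "by", "det", "coll"}


def token_counts(documents, stop_words=STOP_WORDS, min_len=3):
    """Count alphabetic tokens across OCR texts, skipping stop words and short tokens."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[A-Za-z]+", text.lower()):
            if len(token) >= min_len and token not in stop_words:
                counts[token] += 1
    return counts


def histogram_rows(counts, top=10):
    """Header row plus the `top` most frequent tokens as [token, count] rows."""
    return [["Token", "Count"]] + [[tok, n] for tok, n in counts.most_common(top)]
```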
![Page 18: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/18.jpg)
Using OCR output to enhance the transcription process
Ll Ll Team!
![Page 19: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/19.jpg)
Word Clouds using N-gram Scoring, Faceting, Solr + Carrot2
Data Sets
◦ silver (ABBYY) OCR output from 200 lichen packet images
◦ all ABBYY OCR output from 10,000 lichen packet images
◦ silver and gold (ABBYY, Tesseract) OCR output from 100 herbarium sheets
◦ OCR output from 5000 herbarium sheets
◦ SI BVP dataset
![Page 20: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/20.jpg)
Word Cloud using Solr + Carrot2
Index OCR text output using SOLR
http://ammatsun.acis.ufl.edu:5900/solr/#/lichenssilver/schema-browser?field=content
Using Carrot2 to visualize data
http://ammatsun.acis.ufl.edu:5901/carrot2-webapp-3.8.1/
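The indexing step might look roughly like the sketch below, which only builds the HTTP request for Solr's JSON update handler. The `content` field and the `lichenssilver` core name come from the schema-browser URL above; the `id` field, the localhost base URL, and the helper name are assumptions, and the exact update path can differ between Solr versions.

```python
import json
import urllib.request


def build_update_request(solr_base: str, core: str, docs: list) -> urllib.request.Request:
    """Build (but do not send) a POST that adds `docs` to a Solr core and commits."""
    url = f"{solr_base.rstrip('/')}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One Solr document per OCR text file; "id" is an assumed unique-key field.
request = build_update_request(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "lichenssilver",
    [{"id": "record-0001", "content": "PLANTS OF BAHAMA ISLANDS"}],
)
```

Sending it with `urllib.request.urlopen(request)` requires a running Solr instance; Carrot2 can then cluster search results drawn from the indexed `content` field.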
![Page 21: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/21.jpg)
Use for initial sort or validation
Imagine Integration with NfN/BVP
![Page 22: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/22.jpg)
10,000+ records sample
![Page 23: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/23.jpg)
iDigBio Faceted Collection Browser
discover how many documents have this issue
![Page 24: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/24.jpg)
iDigBio Faceted Collection Browser
discover how many documents have this issue
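Facet counts of this kind boil down to document frequencies: for each term, how many records contain it. The sketch below computes them in plain Python rather than through the browser's actual backend; the record ids and function names are made up for illustration.

```python
import re
from collections import Counter


def tokens(text: str) -> set:
    """Distinct lowercase word tokens in one OCR text."""
    return set(re.findall(r"\w+", text.lower()))


def facet_counts(docs: dict) -> Counter:
    """Map each term to the number of documents that contain it at least once."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokens(text))
    return counts


def records_with(docs: dict, term: str) -> list:
    """Record ids whose text contains the term, e.g. to build a set for cleaning."""
    term = term.lower()
    return sorted(rec_id for rec_id, text in docs.items() if term in tokens(text))
```

Keeping the record ids alongside the counts is what lets a word in the cloud link back to the records it came from.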
![Page 25: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/25.jpg)
Visualizing OCR Confidence (Why)
early triage
◦ score each document
◦ score transcriptions
low scores to human
high scores to automated parsing
humans check, humans correct
◦ transcription errors
◦ ocr errors
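The triage rule can be sketched as a simple threshold split. The threshold value and the score scale (here, average log probabilities, so closer to zero is better) are assumptions; the slides do not give a cut-off.

```python
def triage(scores: dict, threshold: float):
    """Split document ids into (to_human, to_parser) by confidence score.

    Documents scoring below the threshold go to human transcribers;
    the rest go to automated parsing.
    """
    to_human, to_parser = [], []
    for doc_id, score in scores.items():
        if score >= threshold:
            to_parser.append(doc_id)
        else:
            to_human.append(doc_id)
    return sorted(to_human), sorted(to_parser)


# Hypothetical per-document scores.
to_human, to_parser = triage({"a": -0.5, "b": -3.2, "c": -1.0}, threshold=-1.5)
```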
![Page 26: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/26.jpg)
[Diagram: n-grams probability model]
Training Set: Herb gold, 100 docs
Test Set: Herb silver, 100 docs
Example text: PLANTS OF BAHAMA ISLANDS
3-grams: PLA, LAN, ANT, NTS, "TS ", "S O", " OF", "OF ", "F B", " BA", BAH, AHA, HAM, …
AMA, "MA ", "A I", " IS", ISL, SLA, LAN, AND, NDS, …
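The 3-gram split in the diagram is a sliding window over the characters, spaces included. The sketch below reproduces it; `seen_fraction` is a hypothetical helper added here to show one crude way a test text could be compared against trigrams from a training text, not the team's scoring formula.

```python
def char_ngrams(text: str, n: int = 3) -> list:
    """All overlapping character n-grams of the text, in order (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def seen_fraction(train_text: str, test_text: str, n: int = 3) -> float:
    """Share of the test text's n-grams that also occur in the training text."""
    known = set(char_ngrams(train_text, n))
    grams = char_ngrams(test_text, n)
    if not grams:
        return 0.0  # test text shorter than n characters
    return sum(g in known for g in grams) / len(grams)


trigrams = char_ngrams("PLANTS OF BAHAMA ISLANDS")
```

With gold text as training data, a clean silver token such as `PLANTS` is fully covered, while an OCR error such as `PLAMTS` shares only its first trigram.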
![Page 27: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/27.jpg)
Visualizing OCR Confidence
![Page 28: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/28.jpg)
herb_allsilver_trigram_html.zip
5000 OCR html files
See it for yourself!
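One plausible way to turn per-token scores into an HTML view is sketched below. The contents of the zip file were not inspected, so the red-to-green colouring, the score range, and the function names are assumptions rather than a description of the actual files.

```python
import html


def confidence_color(score: float, low: float, high: float) -> str:
    """Map a score in [low, high] to a hex colour from red (low) to green (high)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    t = min(1.0, max(0.0, (score - low) / (high - low)))  # clamp to [0, 1]
    red, green = round(255 * (1 - t)), round(255 * t)
    return f"#{red:02x}{green:02x}00"


def render_html(scored_tokens, low: float, high: float) -> str:
    """Render (token, score) pairs as one paragraph of colour-coded spans."""
    spans = [
        f'<span style="background:{confidence_color(score, low, high)}">'
        f"{html.escape(token)}</span>"
        for token, score in scored_tokens
    ]
    return "<p>" + " ".join(spans) + "</p>"
```

Escaping each token matters here because OCR noise often contains characters such as `<` and `&`.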
![Page 29: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/29.jpg)
Use for initial sort or validation
Let’s make it happen!
![Page 30: Deb Paul, Andrea Matsunaga, Miao Chen, Jason Best, Reed Beaman, Sylvia Orli, William Ulate iDigBio – Notes From Nature Hackathon December 2013 Increasing](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56649e9f5503460f94ba119b/html5/thumbnails/30.jpg)
Happy HoLlidays!
Ll Ll Team!