Scientific data curation and processing with Apache Tika
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
Roadmap
• 1st part of the talk
  – Why Tika?
  – What is Tika?
  – What are the current versions of Tika?
  – What can it do?
• 2nd part of the talk
  – NASA Earth Science Data Systems
  – Data System Needs and Requirements
  – How does Tika help?
And you are?
• Apache Member involved in
  – Tika (VP, PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion)
• Architect/Developer at NASA JPL in Pasadena, CA
• Software Architecture/Engineering Prof at USC
The Information Landscape
Proliferation of content types available
• By some accounts, 16K to 51K content types*
• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene or Solr, or in a Google Appliance
  – Identify what language they belong to
    • Ngrams
*http://filext.com/
Importance of content types
Importance of content type detection
Search Engine Architecture
Goals
• Identify and classify file types
  – MIME detection
    • Glob pattern
      – *.txt
      – *.pdf
    • URL
      – http://…pdf
      – ftp://myfile.txt
    • Magic bytes
    • Combination of the above means
• Classification means reaction can be targeted
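The detection strategies on this slide can be sketched with the JDK alone. This is a minimal illustration and not Tika's implementation: the class `MimeSketch`, its method names, and the two-entry pattern tables are hypothetical. It assumes that content (magic bytes) should win over the file name when both are available, and that unknown input falls back to application/octet-stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: combine glob (file name) and magic-byte detection.
// Not Tika's code; it only knows about two types.
class MimeSketch {
    static final String UNKNOWN = "application/octet-stream";

    // Glob-style detection: match on the file name or URL suffix.
    static String detectByName(String name) {
        String lower = name.toLowerCase();
        if (lower.endsWith(".pdf")) return "application/pdf";
        if (lower.endsWith(".txt")) return "text/plain";
        return UNKNOWN;
    }

    // Magic-byte detection: PDF files start with the ASCII bytes "%PDF-".
    static String detectByMagic(byte[] head) {
        byte[] pdf = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (head.length >= pdf.length
                && Arrays.equals(Arrays.copyOf(head, pdf.length), pdf)) {
            return "application/pdf";
        }
        return UNKNOWN;
    }

    // Combination: trust the content first, then fall back to the name.
    static String detect(String name, byte[] head) {
        String byMagic = detectByMagic(head);
        return byMagic.equals(UNKNOWN) ? detectByName(name) : byMagic;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("report.txt", pdfHead));    // content wins over the name
        System.out.println(detect("notes.txt", new byte[0])); // falls back to the name
    }
}
```

Once a type is known, the reaction can be targeted, for example by routing application/pdf to a PDF parser and text/plain straight to an indexer.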
Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries
• A rich Metadata API for representing different Metadata models
• A command line interface to the underlying Java code
• A GUI interface to the Java code
Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as Lucene sub-project
  – Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
  – A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in April 2010
• Over 90 issues shipping in latest release (0.8)
Community
• Mailing lists
  – User: 153 peeps
  – Dev: 114 peeps
• Committers/PMC
  – 10 peeps
  – Probably 5-6 active
• Releases
  – 7 releases so far
  – Working on 0.8
Credit: svnsearch.org
Getting started rapidly…like now!
• Download Tika from:
  – http://tika.apache.org/download.html
• Grab tika-app-0.7.jar
• alias tika "java -jar tika-app-0.7.jar"
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)
Detecting MIME types from Java
• String type = Tika.detect(…)
  – java.io.InputStream
  – java.io.File
  – java.net.URL
  – java.lang.String
Adding new MIME types
• Got XML?
• Based on freedesktop.org spec (loosely)
Many custom applications and tools
• You need this: to read this:
Third-party parsing libraries
• Most of the custom applications come with software libraries and tools to read/write these files
  – Rather than re-invent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
  – Not all libraries parse text in equivalent manners
  – Some are faster than others
  – Some are more reliable than others
Parsing
• String content = Tika.parseToString(…)
  – InputStream
  – File
  – URL
Streaming Parsing
• Reader reader = Tika.parse(…)
  – InputStream
  – File
  – URL
Extraction of Metadata
• Important to follow common Metadata models
  – Dublin Core: any electronic resource
  – XMP: also general like Dublin Core
  – Word Metadata: specific to .doc, .ppt, etc.
  – EXIF: image related
• Lots of standards and models out there
  – The use and extraction of common models allows for content intercomparison
  – All standardize mechanisms for searching
  – You always know for X file type that field Y is there and of type String or Int or Date
Cancer Research Example
Cancer Research Example
Attributes
Relationships
Metadata
• Metadata met = new Metadata();
  // Dublin Core
  met.set(Metadata.FORMAT, "text/html");
  // multi-valued: add() appends, a second set() would overwrite
  met.add(Metadata.FORMAT, "text/plain");
  System.out.println(Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
  – New in Tika 0.8! run: tika --list-met-models
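A multi-valued field keeps several values under one name, so it matters whether a write replaces the existing values or appends to them. The JDK-only sketch below mimics that behaviour for illustration; `MetSketch` and its methods are hypothetical stand-ins, not Tika's Metadata class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container.
class MetSketch {
    private final Map<String, List<String>> fields = new HashMap<>();

    // set() replaces whatever values the field already had.
    void set(String name, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        fields.put(name, values);
    }

    // add() appends, which is what makes a field multi-valued.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns all values for the field, or an empty array if it is unset.
    String[] getValues(String name) {
        List<String> values = fields.get(name);
        return values == null ? new String[0] : values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MetSketch met = new MetSketch();
        met.set("format", "text/html");
        met.add("format", "text/plain");
        // prints [text/html, text/plain]
        System.out.println(Arrays.toString(met.getValues("format")));
    }
}
```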
Methods for language identification
• N-grams
  – Method of detecting next character or set of characters in a sequence
  – Useful in determining whether small snippets of text come from a particular language, or character set
• Non-computational approaches
  – Tagging
  – Looking for common words or characters
Language Detection
• LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses Ngram analysis included with Tika
  – Originating from Nutch
  – Can be improved
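To make the n-gram idea concrete, here is a small JDK-only sketch. It is not Tika's or Nutch's algorithm: the class `NgramSketch`, the choice of character trigrams, the cosine similarity measure, and the one-sentence training samples are all assumptions made for illustration. Real profiles are built from far more text.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of n-gram language identification; not Tika's code.
class NgramSketch {
    // Count every run of n consecutive characters in the lower-cased text.
    static Map<String, Integer> ngrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String lower = text.toLowerCase();
        for (int i = 0; i + n <= lower.length(); i++) {
            counts.merge(lower.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two n-gram count vectors (0 if either is empty).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Return the language whose sample profile is most similar to the text.
    static String identify(String text, Map<String, String> samples) {
        Map<String, Integer> target = ngrams(text, 3);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, String> e : samples.entrySet()) {
            double score = similarity(target, ngrams(e.getValue(), 3));
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("en", "the quick brown fox jumps over the lazy dog and the cat");
        samples.put("fr", "le renard brun saute par dessus le chien paresseux et le chat");
        System.out.println(identify("the dog and the fox", samples));
    }
}
```

Short snippets share many trigrams with text in the same language and few with others, which is why the approach works even on small inputs.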
Running Tika in GUI form
• tika --gui
<html xmlns:html=“…”><body>…</body></html>
Integrating Tika into your App
• Maven
• Ant
• Eclipse
• It’s just a set of jars
  – tika-core
  – tika-parsers
  – tika-app
  – tika-bundle
Some really great stuff in 0.8
• Container aware detection and MIME improvements
• “Drop in” Parsers
  – Compressed RTF / TNEF / LZFU parsing available via external plugin at GitHub
• New Parsers
  – RSS
  – Scientific files: NetCDF, HDF
Improvements to Tika
• Adding more parsers for content types
  – Omnigraffle?
• Expanding ability to handle random access file parsing
  – Scientific data file formats, some work on this
• Improving language and charset detection
Part 2
Science Data Systems at NASA
NASA Ground Data Systems
Credit: D. Woollard
Context
• NASA develops science data processing systems for multiple earth science missions
• These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research
• Typical characteristics
  – Remote sensing instruments that orbit the Earth multiple times daily
  – Data are acquired constantly
  – Complex algorithms convert instrument measurements to geophysical quantities
The Square Kilometer Array
• 1 sq. km of antennas
• Never-before-seen resolution looking into the sky
• 700 TB
  – Per second!
NASA DESDynI Mission
• 16 TB/day
• Geographically distributed
• 10s of 1000s of jobs per day
• Tier 1 Earth Science Decadal Mission
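To put the daily volume in per-second terms, here is a rough conversion. It assumes decimal units (1 TB = 10^12 bytes) and a load spread evenly over the day, neither of which the slide specifies:

```latex
\frac{16 \times 10^{12}\ \text{B}}{86\,400\ \text{s}}
  \approx 1.85 \times 10^{8}\ \text{B/s}
  \approx 185\ \text{MB/s}
```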
Some Considerations
• Scale
  – Data throughput rates
  – # of data types
  – # of metadata types
  – # of users to send the data to
• Federation
  – Must leave the data where it is
  – Socio/Economic/Political
• Heterogeneity
  – Technology, data formats, skills!
Apache OODT
• We’ve got some components to deal with these issues
How are we building these systems now?
- Allow for push/pull of data over arbitrary protocols
- Ingestion builds std catalog and archive
- Deliver product metadata to search, portal or GIS
- Plug in arbitrary met extractors
How are we building these systems now?
- Separation of file management from workflow management
- Allow for heterogeneous computing resources
- Easily integrate PGEs
- Leverages same ingestion crawler
What does this have to do with Tika?
Metadata Ext: TIKA!
MIME identification: TIKA!
What does this have to do with Tika?
Metadata Ext: TIKA!
MIME identification: TIKA!
Science Data File Formats
• Hierarchical Data Format (HDF)
  – http://www.hdfgroup.org
  – Versions 4 and 5
  – Lots of NASA data is in 4, newer NASA data in 5
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java
Science Data File Formats
• Network Common Data Form (netCDF)
  – www.unidata.ucar.edu/software/netcdf/
  – Versions 3 and 4
  – Heavily used in DOE, NOAA, etc.
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java
  – Not Hierarchical representation: all flat
So how does it work?
• Ingestion
  – Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format
  – Need to extract their met, catalog and archive them, etc.
    • Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability into the Apache trunk
• Processing
  – Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive
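The ingestion flow (extract met, catalog, archive) can be sketched as below. Everything here is hypothetical and for illustration only: `IngestSketch`, its `MetExtractor` interface, and the in-memory catalog and archive lists come from neither OODT nor Tika, and the extractor is chosen by file extension rather than by parsing the file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of ingestion: extract met, catalog, archive.
// None of these types come from OODT or Tika.
class IngestSketch {
    // A met extractor turns a file name into key/value metadata.
    interface MetExtractor {
        Map<String, String> extract(String fileName);
    }

    private final Map<String, MetExtractor> extractors = new HashMap<>();
    final List<Map<String, String>> catalog = new ArrayList<>();
    final List<String> archive = new ArrayList<>();

    // Register an extractor for a file extension such as ".nc" or ".hdf".
    void register(String extension, MetExtractor extractor) {
        extractors.put(extension, extractor);
    }

    // Ingest one file: pick an extractor by extension, then catalog and archive.
    boolean ingest(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot).toLowerCase();
        MetExtractor extractor = extractors.get(ext);
        if (extractor == null) return false; // unknown format: skip the file
        Map<String, String> met = new HashMap<>(extractor.extract(fileName));
        met.put("Filename", fileName);
        catalog.add(met);      // record the metadata
        archive.add(fileName); // stand-in for moving the file into storage
        return true;
    }

    public static void main(String[] args) {
        IngestSketch ingest = new IngestSketch();
        ingest.register(".nc", f -> Collections.singletonMap("Format", "NetCDF"));
        ingest.register(".hdf", f -> Collections.singletonMap("Format", "HDF"));
        System.out.println(ingest.ingest("granule.nc"));
        System.out.println(ingest.catalog);
    }
}
```

In a real system the extractor registered for a format would delegate to a parser for that format, which is the role the slide assigns to Tika.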
Tool support
• Entire stacks of tools written around these formats
  – OPeNDAP, LAS, readers, writers, custom NASA mission toolkits
  – OGC
    • WMS, WCS, etc.
  – Unique, one of a kind software built around these data file formats
• Apache can contribute strongly in this area!
Besides processing science files
• …Tika also helps with
• MIME identification
  – Useful in remote file acquisition
  – Useful in classification (catalog/archive) of existing content
  – Useful in crawling (see my Nutch talk)
• Language identification
  – Can be useful when data is coming from around the world, but need to quickly identify whether or not we can process it
Big Goal
• More closely link OODT and Tika
  – Add new parser to Tika
  – Easily get OODT met extractor based on it
• Contribute back some features still baking in OODT
  – Configuration aspects of parsing
  – File types and extensions for science data files
• Spatial
  – Some work done in my CS572 class on spatial parser for Tika; would be great to integrate with Tika, OODT, SIS, and Solr
NASA Geo Challenges
• Sometimes the data isn’t annotated with lat and lon
  – How to discover this?
• Even when the data is annotated with spatial information, computation of, e.g., a bounding box around the poles is difficult
• Efficiency and speed are difficult since data is at scale
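One reason bounding boxes are tricky is that longitude wraps around. The sketch below shows a closely related failure: taking the min and max longitude gives a nearly global box for points that straddle the date line, while a wrap-aware computation gives the narrow one. `BoundsSketch` and its methods are hypothetical and assume a non-empty array of longitudes in degrees within [-180, 180]. A footprint that actually encloses a pole needs further handling (the box must reach latitude +/-90 and cover all longitudes), which this sketch does not attempt.

```java
import java.util.Arrays;

// Hypothetical sketch: why min/max longitude is not enough for a bounding box.
class BoundsSketch {
    // Naive width: max minus min, ignoring that longitude wraps at +/-180.
    static double naiveWidth(double[] lons) {
        double min = Arrays.stream(lons).min().getAsDouble();
        double max = Arrays.stream(lons).max().getAsDouble();
        return max - min;
    }

    // Wrap-aware width: 360 minus the largest empty gap between
    // neighbouring points on the circle of longitudes.
    static double wrappedWidth(double[] lons) {
        double[] sorted = lons.clone();
        Arrays.sort(sorted);
        // Gap that crosses the antimeridian, from the last point to the first.
        double largestGap = sorted[0] + 360.0 - sorted[sorted.length - 1];
        for (int i = 1; i < sorted.length; i++) {
            largestGap = Math.max(largestGap, sorted[i] - sorted[i - 1]);
        }
        return 360.0 - largestGap;
    }

    public static void main(String[] args) {
        double[] nearDateLine = {179.0, -179.0, 178.0};
        System.out.println(naiveWidth(nearDateLine));   // prints 358.0
        System.out.println(wrappedWidth(nearDateLine)); // prints 3.0
    }
}
```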
Alright, I’ll shut up now
• Any questions?
• THANK YOU!
  – [email protected]
  – @chrismattmann on Twitter
Acknowledgements
• Some Tika material inspired by Jukka Zitting’s talks
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
  – OODT Team
Book
• Jukka and I are writing a book on Tika
  – Working on Chapters 8 and 9 of 15
• Early Access available through MEAP program
• http://manning.com/mattmann/