Content analysis for ECM with Apache Tika
DESCRIPTION
Presentation at ApacheCon US 2008 (New Orleans) by Paolo Mottadelli. It covers the Apache Tika project and how it was integrated into Alfresco to support full-text search of Office Open XML documents.
TRANSCRIPT
Content analysis for ECM with Apache Tika
Paolo Mottadelli -
ON BOARD!
Agenda
Main challenge
Lucene index
Other challenges
A real world challenge
? ? ?
Searching .docx .xlsx .pptx in Alfresco ECM
Agenda
What is Tika?
Another Indian Lucene project? No.
What is Tika?
It is a Toolkit
Current coverage
A brief history of Tika
Sponsored by the Apache Lucene PMC
Tika organization
Changing after graduation
Getting Tika
… and contributing
Tika Design
The Parser interface
void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;
Tika Design
Document input stream
Tika Design
XHTML SAX events
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>...</title>
</head>
<body> ... </body>
</html>
Why XHTML?
• Reflect the structured text content of the document
• Not recreating the low-level details
• For low-level details, use low-level parser libraries
ContentHandler (CH) and Decorators (CHD)
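The decorator idea can be sketched with the SAX classes that ship with the JDK: one handler collects text, and a decorator wrapped around it decides which events get through. The class names TextCollector, BodyOnlyDecorator and XhtmlSketch are hypothetical and are not Tika's own handlers; the sketch only assumes namespace-aware XHTML events of the shape shown on the XHTML slide.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: appends every character event it receives. */
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

/** Hypothetical decorator: forwards only the events that occur inside <body>. */
final class BodyOnlyDecorator extends DefaultHandler {
    private final ContentHandler delegate;
    private int depth = 0; // 0 = outside <body>, 1 = directly inside <body>

    BodyOnlyDecorator(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0) {
            depth++;
            delegate.startElement(uri, localName, qName, atts);
        } else if ("body".equals(localName)) {
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 1) {
            delegate.endElement(uri, localName, qName); // nested element, not </body> itself
        }
        if (depth > 0) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) {
            delegate.characters(ch, start, length);
        }
    }
}

final class XhtmlSketch {
    /** Parses an XHTML string and returns only the text found inside <body>. */
    static String bodyText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            TextCollector collector = new TextCollector();
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xhtml)), new BodyOnlyDecorator(collector));
            return collector.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

Because the decorator is itself a ContentHandler, several decorators can be stacked around one collector without the parser knowing about any of them.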
Tika Design
Document metadata
… more metadata: HPSF
Tika Design
Parser implementations
The AutoDetectParser
• Encapsulates all Tika functionalities
• Can handle any type of document
Type Detection
MimeType type = types.getMimeType(…);
tika-mimetypes.xml
An example: Gzip
<mime-type type="application/x-gzip">
<magic priority="40">
<match value="\037\213" type="string" offset="0" />
</magic>
<glob pattern="*.tgz" />
<glob pattern="*.gz" />
<glob pattern="*-gz" />
</mime-type>
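The gzip entry combines two kinds of evidence: a magic byte sequence at offset 0 and file name globs. A minimal sketch of that lookup in plain Java follows. GzipTypeSketch and its methods are hypothetical names, not Tika classes, and trying the magic bytes before the globs is an assumption of this sketch rather than a statement of Tika's exact rule ordering. The octal escapes \037\213 in the XML correspond to the bytes 0x1F 0x8B.

```java
/** Hypothetical illustration of magic-plus-glob detection; not part of Tika. */
final class GzipTypeSketch {
    static final String GZIP = "application/x-gzip";
    static final String FALLBACK = "application/octet-stream";

    // "\037\213" (octal) from the <match> element
    private static final byte[] MAGIC = {(byte) 0x1F, (byte) 0x8B};
    // The <glob> patterns from the same entry
    private static final String[] GLOBS = {"*.tgz", "*.gz", "*-gz"};

    /** True if the first bytes of the document equal the gzip magic at offset 0. */
    static boolean matchesMagic(byte[] prefix) {
        if (prefix == null || prefix.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (prefix[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if the file name matches one of the glob patterns. */
    static boolean matchesGlob(String fileName) {
        if (fileName == null) {
            return false;
        }
        for (String glob : GLOBS) {
            if (fileName.matches(globToRegex(glob))) {
                return true;
            }
        }
        return false;
    }

    /** Translates '*' and '?' wildcards; every other character is matched literally. */
    private static String globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(java.util.regex.Pattern.quote(String.valueOf(c)));
            }
        }
        return regex.toString();
    }

    /** Magic bytes first, file name as a fallback, generic binary type otherwise. */
    static String detect(byte[] prefix, String fileName) {
        return (matchesMagic(prefix) || matchesGlob(fileName)) ? GZIP : FALLBACK;
    }
}
```

Content-based evidence is checked first here because a file name can be missing or misleading, whereas the leading bytes come from the document itself.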
Supported formats
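A really simple example
// Uses org.apache.tika.metadata.Metadata, org.apache.tika.sax.BodyContentHandler
// and org.apache.tika.parser.microsoft.OfficeParser
InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
// Text content goes to the handler, document properties to the metadata object
new OfficeParser().parse(input, handler, metadata);
String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();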
Demo
?
Future Goals
Who uses Tika?
Agenda
ECM: what is it?
ECM: Manage
• Indexing
• Categorization
ECM: we love SEARCHING!
ECM: we love SEARCHING!
ECM: we love SEARCHING!
Don’t do it on your own
Tika shields ECM from using many single components
Agenda
Alfresco: short presentation
Alfresco: short presentation
Who uses Alfresco?
Alfresco Repository
JSR-170 Level 2 compatible
Repository Architecture
[Diagram: layers labelled Services, Components and Storage. Other labels: Node, Content, Search, Query, Index; Hibernate, Lucene; Database, Content, Index]
Repository Architecture
Alfresco Search
Alfresco Search
Use case
Use case
Without Tika:
Step 1
Step 2
ContentTransformerRegistry: provides the most appropriate ContentTransformer
// Pick the transformer that reports the lowest transformation time
ContentTransformer bestTransformer = null;
long bestTime = -1L;
for (ContentTransformer transformer : transformers)
{
    long transformationTime = transformer.getTransformationTime();
    if (bestTransformer == null || transformationTime < bestTime)
    {
        bestTransformer = transformer;
        bestTime = transformationTime;
    }
}
return bestTransformer;
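The selection rule on this slide, keep whichever candidate reports the lowest transformation time, can be tried out in isolation. The sketch below is self-contained and does not use Alfresco: TimedTransformer, FastestTransformerSketch and fixed(...) are hypothetical stand-ins introduced only for illustration.

```java
import java.util.List;

/** Hypothetical stand-in for a content transformer; not the Alfresco interface. */
interface TimedTransformer {
    String name();

    long getTransformationTime();
}

final class FastestTransformerSketch {
    /** Returns the transformer with the lowest reported time, or null if the list is empty. */
    static TimedTransformer selectBest(List<TimedTransformer> transformers) {
        TimedTransformer best = null;
        long bestTime = -1L;
        for (TimedTransformer candidate : transformers) {
            long time = candidate.getTransformationTime();
            // The first candidate always wins; later ones must be strictly faster
            if (best == null || time < bestTime) {
                best = candidate;
                bestTime = time;
            }
        }
        return best;
    }

    /** Illustration-only factory for a transformer with a fixed reported time. */
    static TimedTransformer fixed(final String name, final long millis) {
        return new TimedTransformer() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public long getTransformationTime() {
                return millis;
            }
        };
    }
}
```

With strict comparison, two candidates that report the same time resolve to whichever comes first in the list, so the registration order acts as a tie-breaker.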
Step 2 (explained)
Too many different ContentTransformer implementations
Step 3
Transform
Example: PoiHssfContentTransformer
public void transformInternal(ContentReader reader, ContentWriter writer,
        TransformationOptions options) throws Exception {
    ...
    HSSFWorkbook workbook = new HSSFWorkbook(is);
    ...
    for (int i = 0; i < sheetCount; i++) {
        HSSFSheet sheet = workbook.getSheetAt(i);
        String sheetName = workbook.getSheetName(i);
        writeSheet(os, sheet, encoding);
    }
    ...
}
Step 3 (explained)
Too many different ContentTransformer implementations
... again !?!
Step 4
Lucene index creation
ContentReader reader = contentService.getReader(nodeRef, propertyName);
ContentTransformer transformer = contentService.getTransformer(
        reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
transformer.transform(reader, writer);
reader = writer.getReader();
. . . . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));
Let’s do it using Tika
Step 1 + Step 2 + Step 3
String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
// AutoDetectParser detects the document type and delegates to the matching parser
new AutoDetectParser().parse(input, handler, metadata);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();
Step 1 to 4 (compressed)
String name = "resource.doc";
InputStream input = getResourceAsStream(name);
// ParsingReader exposes the extracted text as a java.io.Reader
Reader reader = new ParsingReader(input, name);
. . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));
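Handing a Reader to the index means a push-style extractor has to feed a pull-style consumer. One common way to bridge the two is a pipe filled by a background thread, sketched below with JDK classes only. PipedTextReaderSketch and its TextExtractor interface are hypothetical names for illustration; the sketch shows the general pipe technique and does not reproduce ParsingReader's source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical illustration of a pipe-backed text Reader; not Tika code. */
final class PipedTextReaderSketch {
    /** Hypothetical extractor contract: read a document, push its text to a Writer. */
    interface TextExtractor {
        void extract(InputStream in, Writer out) throws IOException;
    }

    /** Starts extraction on a background thread and returns the reading end of the pipe. */
    static Reader open(final InputStream input, final TextExtractor extractor) {
        try {
            final PipedReader reader = new PipedReader();
            final PipedWriter writer = new PipedWriter(reader);
            Thread worker = new Thread(() -> {
                // Closing the writer (also on failure) signals end of stream to the consumer
                try (Writer out = writer) {
                    extractor.extract(input, out);
                } catch (IOException e) {
                    // Sketch only: the consumer simply sees the stream end early
                }
            });
            worker.setDaemon(true);
            worker.start();
            return reader;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains a Reader into a String and closes it. */
    static String readAll(Reader reader) {
        try (Reader in = reader) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[256];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The consumer can start reading before extraction has finished, and the pipe's bounded buffer keeps the extractor from running far ahead of the indexer.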
Results: 1 & 2
Extension use case
Adding support for Microsoft Office Open XML documents (Office 2007+)
Apache POI
Apache POI provides text extraction support for Office Open XML formats
and advanced coverage of the SpreadsheetML specification
(WordprocessingML & PresentationML to come)
Apache POI
Apache POI status
Apache POI TextExtractors
POIXMLDocument document;
Package pkg = Package.open(stream);
textExtractor = ExtractorFactory.createExtractor(pkg);
// Set the MIME type according to the extractor POI selected for the package
if (textExtractor instanceof XSSFExcelExtractor) {
    setType(metadata, OOXML_EXCEL_MIMETYPE);
    document = new XSSFWorkbook(pkg);
}
else if (textExtractor instanceof XWPFWordExtractor) {…}
else if (textExtractor instanceof XSLFPowerPointExtractor) {…}
setPOIXMLProperties(metadata, document);
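The type switch above is needed because .docx, .xlsx and .pptx files are all ZIP packages and cannot be told apart by their leading bytes. As a plain-JDK illustration of the distinction (a swapped-in technique, not the POI ExtractorFactory logic), the sketch below reads the package's [Content_Types].xml entry and looks for the content type of the main part. OoxmlTypeSketch and samplePackage(...) are hypothetical, and the substring check is a deliberate simplification that ignores macro-enabled and template variants.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical illustration of telling OOXML packages apart; does not use POI. */
final class OoxmlTypeSketch {
    static final String WORD =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String EXCEL =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    static final String POWERPOINT =
            "application/vnd.openxmlformats-officedocument.presentationml.presentation";
    private static final String[] KNOWN = {WORD, EXCEL, POWERPOINT};

    /** Returns the document MIME type, or null if the package is not recognized. */
    static String detect(InputStream stream) {
        try (ZipInputStream zip = new ZipInputStream(stream)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (!"[Content_Types].xml".equals(entry.getName())) {
                    continue;
                }
                String xml = new String(readAll(zip), StandardCharsets.UTF_8);
                for (String type : KNOWN) {
                    // The main part is declared as "<document type>.main+xml"
                    if (xml.contains("\"" + type + ".main+xml\"")) {
                        return type;
                    }
                }
                return null;
            }
            return null;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Illustration only: builds a tiny ZIP holding just a [Content_Types].xml entry. */
    static byte[] samplePackage(String mainPartContentType) {
        String xml = "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
                + "<Override PartName=\"/main.xml\" ContentType=\"" + mainPartContentType + "\"/>"
                + "</Types>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }
}
```

A production detector would parse the XML properly and cover the remaining OOXML variants; delegating that knowledge to a library is what the POI-based code on this slide does.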
Can we find it?
Results: 3 & 4