Content analysis for ECM with Apache Tika, Paolo Mottadelli

Upload: paolo-mottadelli

Post on 08-May-2015


DESCRIPTION

Presentation at ApacheCon US 2008 (New Orleans) by Paolo Mottadelli. This is about the Apache Tika project and how it was integrated in Alfresco in order to support Open XML format Full Text Search.

TRANSCRIPT

Page 1: Content analysis for ECM with Apache Tika

Content analysis for ECM with Apache Tika

Paolo Mottadelli

Page 2: Content analysis for ECM with Apache Tika

Paolo Mottadelli

[email protected]


Page 3: Content analysis for ECM with Apache Tika

ON BOARD!

Page 4: Content analysis for ECM with Apache Tika

Agenda

Page 5: Content analysis for ECM with Apache Tika

Main challenge

Lucene index

Page 6: Content analysis for ECM with Apache Tika

Other challenges

Page 7: Content analysis for ECM with Apache Tika

A real world challenge

Searching .docx, .xlsx, and .pptx files in Alfresco ECM

Page 8: Content analysis for ECM with Apache Tika

Agenda

Page 9: Content analysis for ECM with Apache Tika

What is Tika?

Another Indian Lucene project? No.

Page 10: Content analysis for ECM with Apache Tika

What is Tika?

It is a Toolkit

Page 11: Content analysis for ECM with Apache Tika

Current coverage

Page 12: Content analysis for ECM with Apache Tika

A brief history of Tika

Sponsored by the Apache Lucene PMC

Page 13: Content analysis for ECM with Apache Tika

Tika organization

Changing after graduation

Page 14: Content analysis for ECM with Apache Tika

Getting Tika

… and contributing

Page 15: Content analysis for ECM with Apache Tika

Tika Design

Page 16: Content analysis for ECM with Apache Tika

The Parser interface

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;

Page 17: Content analysis for ECM with Apache Tika

Tika Design

Page 18: Content analysis for ECM with Apache Tika

Document input stream

Page 19: Content analysis for ECM with Apache Tika

Tika Design

Page 20: Content analysis for ECM with Apache Tika

XHTML SAX events

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body> ... </body>
</html>

Page 21: Content analysis for ECM with Apache Tika

Why XHTML?

• Reflect the structured text content of the document

• Not recreating the low level details

• For low level details use low level parser libs
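To make the idea of structured text arriving as SAX events concrete, here is a minimal sketch that uses only the JDK's SAX API. It is not Tika code: the class name BodyTextCollector and its extract method are hypothetical, and the input is a literal XHTML string rather than the output of a Tika parser. It only shows how a handler can keep the text inside the body element and ignore the head.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): keeps only the text found inside <body>.
final class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int depthInBody = 0; // 0 while outside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The JDK parser is not namespace-aware by default, so qName carries the tag name.
        if (depthInBody > 0 || "body".equals(qName)) {
            depthInBody++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depthInBody > 0) {
            depthInBody--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depthInBody > 0) {
            text.append(ch, start, length);
        }
    }

    // Parses an XHTML string and returns the text contained in its body.
    static String extract(String xhtml) {
        BodyTextCollector collector = new BodyTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Demo</title></head>"
                + "<body><p>Hello Tika</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

Because the consumer only sees events, it can stop, filter, or stream text without building the whole document in memory.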

Page 22: Content analysis for ECM with Apache Tika

ContentHandler (CH) and Decorators (CHD)

Page 23: Content analysis for ECM with Apache Tika

Tika Design

Page 24: Content analysis for ECM with Apache Tika

Document metadata

Page 25: Content analysis for ECM with Apache Tika

… more metadata: HPSF

Page 26: Content analysis for ECM with Apache Tika

Tika Design

Page 27: Content analysis for ECM with Apache Tika

Parser implementations

Page 28: Content analysis for ECM with Apache Tika

The AutoDetectParser

• Encapsulates all Tika functionality

• Can handle any type of document
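The detect-then-delegate idea behind a single entry point can be sketched in plain Java. This is an illustration only, not how Tika implements AutoDetectParser: DelegatingParserDemo, its register and parse methods, and the use of Function as a stand-in for a parser are all hypothetical names chosen for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: one entry point that detects a media type and
// hands the document to whichever parser was registered for that type.
final class DelegatingParserDemo {
    private final Map<String, Function<byte[], String>> parsersByType = new HashMap<>();
    private final Function<byte[], String> detector;

    DelegatingParserDemo(Function<byte[], String> detector) {
        this.detector = detector;
    }

    // Associates a media type with the parser that extracts text for it.
    void register(String mediaType, Function<byte[], String> parser) {
        parsersByType.put(mediaType, parser);
    }

    // Detects the type first, then delegates to the matching parser.
    String parse(byte[] document) {
        String type = detector.apply(document);
        Function<byte[], String> parser = parsersByType.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(document);
    }
}
```

The caller never names a concrete parser, which is what lets new formats be added without touching calling code.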

Page 29: Content analysis for ECM with Apache Tika

Type Detection

MimeType type = types.getMimeType(…);

Page 30: Content analysis for ECM with Apache Tika

tika-mimetypes.xml

An example: Gzip

<mime-type type="application/x-gzip">
  <magic priority="40">
    <match value="\037\213" type="string" offset="0" />
  </magic>
  <glob pattern="*.tgz" />
  <glob pattern="*.gz" />
  <glob pattern="*-gz" />
</mime-type>
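The gzip rule above combines a magic-byte match at offset 0 with file name globs. A minimal sketch of what such a rule checks, written against the JDK only, might look like the following. GzipRule and its methods are hypothetical names, the fallback type application/octet-stream is an assumption of this sketch, and it does not model how Tika weighs magic priority against glob matches.

```java
// Hypothetical sketch of the gzip rule; not Tika's detection code.
final class GzipRule {
    // The octal escapes \037\213 are the two bytes 0x1F 0x8B.
    private static final byte[] MAGIC = {(byte) 0x1F, (byte) 0x8B};
    // The three glob patterns all reduce to simple suffix checks.
    private static final String[] SUFFIXES = {".tgz", ".gz", "-gz"};

    // True when the data starts with the gzip magic bytes at offset 0.
    static boolean matchesMagic(byte[] header) {
        return header != null
                && header.length >= MAGIC.length
                && header[0] == MAGIC[0]
                && header[1] == MAGIC[1];
    }

    // True when the file name matches *.tgz, *.gz, or *-gz.
    static boolean matchesGlob(String fileName) {
        if (fileName == null) {
            return false;
        }
        for (String suffix : SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Assumption: either signal is enough; unknown data falls back to a generic type.
    static String detect(byte[] header, String fileName) {
        if (matchesMagic(header) || matchesGlob(fileName)) {
            return "application/x-gzip";
        }
        return "application/octet-stream";
    }
}
```

Checking content as well as the name matters in a repository, where uploaded files are often renamed or carry no extension at all.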

Page 31: Content analysis for ECM with Apache Tika

Supported formats

Page 32: Content analysis for ECM with Apache Tika

A really simple example

// Open the sample presentation from the classpath
InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
// Parse the document: text goes to the handler, properties to the metadata
new OfficeParser().parse(input, handler, metadata);
String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Page 33: Content analysis for ECM with Apache Tika

Demo

Page 34: Content analysis for ECM with Apache Tika

Future Goals

Page 35: Content analysis for ECM with Apache Tika

Who uses Tika?

Page 36: Content analysis for ECM with Apache Tika

Agenda

Page 37: Content analysis for ECM with Apache Tika

ECM: what is it?

Page 38: Content analysis for ECM with Apache Tika

ECM: Manage

• Indexing

• Categorization

Page 39: Content analysis for ECM with Apache Tika

ECM: we love SEARCHING!

Page 40: Content analysis for ECM with Apache Tika

ECM: we love SEARCHING!

Page 41: Content analysis for ECM with Apache Tika

ECM: we love SEARCHING!

Page 42: Content analysis for ECM with Apache Tika

Don’t do it on your own

Tika shields the ECM from using many single components

Page 43: Content analysis for ECM with Apache Tika

Agenda

Page 44: Content analysis for ECM with Apache Tika

Alfresco: short presentation

Page 45: Content analysis for ECM with Apache Tika

Alfresco: short presentation

Page 46: Content analysis for ECM with Apache Tika

Who uses Alfresco?

Page 47: Content analysis for ECM with Apache Tika

Alfresco Repository

JSR-170 Level 2 compatible

Page 48: Content analysis for ECM with Apache Tika

Repository Architecture

[Diagram: repository layers labeled Services, Components, and Storage, with boxes for Node, Content, Search, Query, Index, Database, Hibernate, and Lucene]

Page 49: Content analysis for ECM with Apache Tika

Repository Architecture

[Diagram: repository layers labeled Services, Components, and Storage, with boxes for Node, Content, Search, Query, Index, Database, Hibernate, and Lucene]

Page 50: Content analysis for ECM with Apache Tika

Alfresco Search

Page 51: Content analysis for ECM with Apache Tika

Alfresco Search

Page 52: Content analysis for ECM with Apache Tika

Use case

Page 53: Content analysis for ECM with Apache Tika

Use case

Page 54: Content analysis for ECM with Apache Tika

Without Tika:

Page 55: Content analysis for ECM with Apache Tika

Step 1

Page 56: Content analysis for ECM with Apache Tika

Step 2

ContentTransformerRegistry

Provides the most appropriate ContentTransformer

// Keep the transformer with the lowest reported transformation time
for (ContentTransformer transformer : transformers)
{
    long transformationTime = transformer.getTransformationTime();
    if (bestTransformer == null || transformationTime < bestTime)
    {
        bestTransformer = transformer;
        bestTime = transformationTime;
    }
}
return bestTransformer;
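The selection loop on this slide can be tried in isolation with a small self-contained sketch. The Transformer interface and the Fixed implementation below are hypothetical stand-ins for Alfresco's ContentTransformer, kept only to the one method the loop relies on; this is not the registry's actual code.

```java
import java.util.Arrays;
import java.util.List;

final class FastestTransformerDemo {
    // Hypothetical stand-in for ContentTransformer, reduced to what the loop needs.
    interface Transformer {
        String name();
        long getTransformationTime();
    }

    // Hypothetical transformer that reports a fixed transformation time.
    static final class Fixed implements Transformer {
        private final String name;
        private final long time;

        Fixed(String name, long time) {
            this.name = name;
            this.time = time;
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public long getTransformationTime() {
            return time;
        }
    }

    // Returns the candidate with the lowest reported time, or null if there is none.
    static Transformer pickFastest(List<? extends Transformer> candidates) {
        Transformer best = null;
        long bestTime = Long.MAX_VALUE;
        for (Transformer candidate : candidates) {
            long time = candidate.getTransformationTime();
            if (best == null || time < bestTime) {
                best = candidate;
                bestTime = time;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Transformer> candidates = Arrays.asList(
                new Fixed("poi", 40), new Fixed("plain-text", 5), new Fixed("pdf", 120));
        System.out.println(pickFastest(candidates).name());
    }
}
```

The loop itself is simple; the maintenance cost sits in the number of transformer implementations that must be registered for it to choose from.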

Page 57: Content analysis for ECM with Apache Tika

Step 2 (explained)

Too many different ContentTransformer implementations

Page 58: Content analysis for ECM with Apache Tika

Step 3

Transform

Example: PoiHssfContentTransformer

public void transformInternal(ContentReader reader, ContentWriter writer,
        TransformationOptions options) throws Exception {
    ...
    HSSFWorkbook workbook = new HSSFWorkbook(is);
    ...
    // Write each sheet of the workbook out as text
    for (int i = 0; i < sheetCount; i++) {
        HSSFSheet sheet = workbook.getSheetAt(i);
        String sheetName = workbook.getSheetName(i);
        writeSheet(os, sheet, encoding);
    }
    ...
}

Page 59: Content analysis for ECM with Apache Tika

Step 3 (explained)

Too many different ContentTransformer implementations

... again !?!

Page 60: Content analysis for ECM with Apache Tika

Step 4

Lucene index creation

ContentReader reader = contentService.getReader(nodeRef, propertyName);
// Look up a transformer from the source mimetype to plain text
ContentTransformer transformer = contentService.getTransformer(
        reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
transformer.transform(reader, writer);
reader = writer.getReader();
. . . . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));

Page 61: Content analysis for ECM with Apache Tika

Let’s do it using Tika

Page 62: Content analysis for ECM with Apache Tika

Step 1 + Step 2 + Step 3

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
// Type detection, parser selection, and text extraction in one call
new AutoDetectParser().parse(input, handler, metadata);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Page 63: Content analysis for ECM with Apache Tika

Step 1 to 4 (compressed)

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
// The reader exposes the extracted text as a character stream
Reader reader = new ParsingReader(input, name);
. . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));
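Handing the indexer a Reader means the extracted text can be consumed as a stream instead of being built up as one large string first. The sketch below illustrates that general idea with the JDK's piped streams: a background thread produces text while the caller reads it. It is an illustration under stated assumptions, not Tika's ParsingReader; StreamingTextDemo, streamingReader, and readAll are hypothetical names, and the "parser" here just writes prepared chunks.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;

final class StreamingTextDemo {
    // Returns a Reader whose content is produced by a background thread.
    // The chunks stand in for text that a real parser would emit as it goes.
    static Reader streamingReader(String[] chunks) {
        try {
            PipedReader reader = new PipedReader();
            PipedWriter writer = new PipedWriter(reader);
            Thread producer = new Thread(() -> {
                // Closing the writer signals end of stream to the reading side.
                try (PipedWriter w = writer) {
                    for (String chunk : chunks) {
                        w.write(chunk);
                    }
                } catch (IOException e) {
                    // The reading side went away early; nothing more to produce.
                }
            });
            producer.setDaemon(true);
            producer.start();
            return reader;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains a Reader into a String, the way a consumer such as an indexer might.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            char[] buffer = new char[256];
            int n;
            while ((n = r.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(readAll(streamingReader(new String[] {"Hello ", "Tika"})));
    }
}
```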

Page 64: Content analysis for ECM with Apache Tika

Results: 1 & 2

Page 65: Content analysis for ECM with Apache Tika

Extension use case

Adding support for Microsoft Office Open XML documents (Office 2007+)

Page 66: Content analysis for ECM with Apache Tika

Apache POI

Apache POI provides text extraction support for Office Open XML formats and advanced coverage of the SpreadsheetML specification

(WordprocessingML & PresentationML to come)

Page 67: Content analysis for ECM with Apache Tika

Apache POI

Apache POI status

Page 68: Content analysis for ECM with Apache Tika

Apache POI TextExtractors

POIXMLDocument document;
Package pkg = Package.open(stream);
// The factory picks the extractor that matches the package content
textExtractor = ExtractorFactory.createExtractor(pkg);
if (textExtractor instanceof XSSFExcelExtractor) {
    setType(metadata, OOXML_EXCEL_MIMETYPE);
    document = new XSSFWorkbook(pkg);
}
else if (textExtractor instanceof XWPFWordExtractor) {…}
else if (textExtractor instanceof XSLFPowerPointExtractor) {…}
setPOIXMLProperties(metadata, document);

Page 69: Content analysis for ECM with Apache Tika

Can we find it?

Page 70: Content analysis for ECM with Apache Tika

Results: 3 & 4

Page 71: Content analysis for ECM with Apache Tika

Q & A

[email protected]