Metadata Extractors, Content Transformers & Renditions

Neil Mc Erlean

Who am I?

Lead Engineer in the Services Team

4 years at Alfresco (since 3.2)

Previously worked on
• Hybrid Sync
• Alfresco in the Cloud
• Various services/components
  • Transformers & Extractors
  • REST APIs
  • Actions & Behaviours and more…

Ex-astrophysicist (of which more later)

Talk content

What data is in your content?

How does Alfresco get at it?

What does Alfresco do with it?

How can you use these features?

Introductory material
• no prior knowledge assumed

Talk content - Breaking it down

Your content & its metadata

Alternative renditions of your content

Overviews of the 3 services

Java Foundation APIs. JavaScript.

Configuring & extending Alfresco.

All code samples available as runnable tests - download from the website.

#1 Metadata Extraction

#2 Content Transformation

Alfresco uses them to produce

• images (thumbnails)
• plain text (indexing)
• inter-Office transforms

Also generally useful

#3 Rendition Service

• Very similar to transformations

• More general service

• More than just content to content

How do these components work?

Mostly by leveraging existing OSS Java libs
• Notably Apache Tika

Some external OS processes too
• OpenOffice.org (OOo), LibreOffice
• ImageMagick
• pdf2swf (swftools)

Some bespoke implementations, e.g. zip to txt

‘Embedded’ thumbnails/previews: iWorks, Office

General Considerations

CPU, memory

In process vs. out of process vs. Remote CPU

Selection of ‘best’ extractor/transformer

Stay for Andy Hunt’s talk for Support’s troubleshooting tips

Metadata Extraction

#1 Metadata Extraction

• Triggered on content creation or update.
  • or on demand

• ‘Best’ available extractor obtained from MetadataExtracterRegistry.

• This extractor pulls out the metadata.
  • Format depends on the extractor lib/impl.
  • key/value pairs

• These data are mapped onto the Alfresco content model
  • configurable mapping: <ExtractorClass>.properties

Metadata extraction - Java

    MetadataExtracterRegistry registry = appContext.getBean(
            "metadataExtracterRegistry", MetadataExtracterRegistry.class);

    // Reader for the node's binary content
    ContentReader reader =
            contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);

    // 'Best' extractor registered for this MIME type
    MetadataExtracter extractor = registry.getExtracter(reader.getMimetype());

    // Extracted, mapped values are written into this map
    Map<QName, Serializable> props = new HashMap<QName, Serializable>();
    extractor.extract(reader, OverwritePolicy.EAGER, props);

Overwrite Policy – when re-extracting

• EAGER
  • extracted value is not null

• PRUDENT
  • db property doesn’t exist or is null or "" (plus the EAGER condition above)

• CAUTIOUS
  • existing property == undefined
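The three policies above can be read as a small decision function. The following is a minimal plain-Java sketch of that reading, not Alfresco's implementation: the class `OverwritePolicySketch`, its `Policy` enum, and the `propertyExists` flag are hypothetical names introduced for illustration, and the CAUTIOUS rule follows the slide wording literally.

```java
import java.io.Serializable;

public class OverwritePolicySketch {

    /** Hypothetical stand-in for the three policies named on the slide. */
    enum Policy { EAGER, PRUDENT, CAUTIOUS }

    /**
     * Decides whether an extracted value should replace the stored one.
     * 'propertyExists' separates "property absent" from "present but null".
     */
    static boolean shouldOverwrite(Policy policy, Serializable extracted,
                                   boolean propertyExists, Serializable existing) {
        switch (policy) {
            case EAGER:
                // Overwrite whenever something was extracted.
                return extracted != null;
            case PRUDENT:
                // As EAGER, but only if the stored value is missing, null or empty.
                boolean blank = !propertyExists || existing == null || "".equals(existing);
                return extracted != null && blank;
            case CAUTIOUS:
                // Only fill in properties that are not defined at all.
                return !propertyExists;
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        // A node that already has a non-empty title.
        for (Policy p : Policy.values()) {
            System.out.println(p + ": " + shouldOverwrite(p, "New title", true, "Old title"));
        }
    }
}
```

With an existing non-empty value, only the EAGER branch returns true in this sketch, which is why re-extraction under the other two policies leaves user-edited properties alone.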

<ExtractorClass>.properties mapping

    namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
    author=cm:author
    title=cm:title
    # Note: need to escape ':' in key name
    geo\:lat=cm:latitude
    geo\:long=cm:longitude

Mapping properties

• Can map extracted key-value onto multiple content properties

• Can ignore extracted key-values i.e. not map.
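To make the mapping behaviour concrete, here is a small self-contained sketch that parses a mapping in the `.properties` syntax and applies it to some extracted key/value pairs. It uses only `java.util.Properties`; the class `MappingSketch`, the sample keys, and the comma-separated syntax for multiple targets are assumptions made for illustration, not a description of Alfresco's own parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class MappingSketch {

    /** Parses mapping text into: extracted key -> list of target property names. */
    static Map<String, List<String>> loadMapping(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, List<String>> mapping = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("namespace.prefix.")) {
                continue; // namespace declarations are not mappings
            }
            List<String> targets = new ArrayList<>();
            // Assumption: several targets are given as a comma-separated list.
            for (String target : props.getProperty(key).split(",")) {
                if (!target.trim().isEmpty()) {
                    targets.add(target.trim());
                }
            }
            mapping.put(key, targets);
        }
        return mapping;
    }

    /** Copies each extracted value to all of its targets; unmapped keys are dropped. */
    static Map<String, String> apply(Map<String, List<String>> mapping,
                                     Map<String, String> extracted) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            List<String> targets = mapping.get(entry.getKey());
            if (targets == null) {
                continue; // extracted but not mapped: ignored
            }
            for (String target : targets) {
                result.put(target, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
                + "author=cm:author, cm:creator\n"
                + "geo\\:lat=cm:latitude\n";
        Map<String, String> extracted = new TreeMap<>();
        extracted.put("author", "Neil");
        extracted.put("geo:lat", "51.5");
        extracted.put("pageCount", "12"); // has no mapping
        System.out.println(apply(loadMapping(text), extracted));
    }
}
```

Note that the escaped key `geo\:lat` is read back by `Properties` as `geo:lat`, which is why the escape is needed in the file but not in the extracted data.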

Metadata extraction - JavaScript

    var action = actions.create('extract-metadata');
    action.execute(nodeRef);

Ways to customise & extend

• Customisation of existing extractors
  • Define new mappings, to an existing or a new content model.

• Adding new extractors
  • Identify 3rd party lib that can read the binary file
  • Or write your own code to do this
  • Extend AbstractMappingMetadataExtracter
  • Or write a Tika plugin
  • Define metadata mappings

• org.alfresco.repo.content.metadata

Recap

• Metadata extraction harvests ‘hidden’ data and maps it into the Alfresco content model.

• Support for many MIME types

• Metadata insertion coming
  • it’s on HEAD but currently disabled
  • also maps metadata tags to cm:taggable

• “Best” extractor selection covered below

Content Transformers

Out of the box transformers
• text, html, xml
• Microsoft Office (doc & docx formats)
• OpenDocument Format
• iWorks (Keynote, Pages, Numbers)
• Images
• Shockwave Flash (SWF)
• RFC822 email, Outlook .msg email
• Adobe PDF, Illustrator, PSD
• Electronic publication (epub)
• Rich Text (RTF)
• MP3
• Archives (ZIP, tar)
• Many more

Available transformers

• No ‘graph’ of transform paths/mime types

• Spring beans extend “baseContentTransformer”

• They implement isTransformable(from, to)

• They can be
  • simple (A to B)
  • ‘complex’ (A to C, via B)
  • failover (A to B, A to B…)
  • overlapping (multiple beans for same path)
  • dynamically un/available (e.g. OOo)
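The simple, complex and failover shapes listed above can be modelled in a few lines of plain Java. This is only a sketch of the idea: the `Transformer` interface here is a hypothetical two-method stand-in (Alfresco's real transformers are Spring beans working on content readers and writers), and the "content" is just a string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class TransformerKindsSketch {

    /** Hypothetical, much-reduced transformer contract. */
    interface Transformer {
        boolean isTransformable(String from, String to);
        String transform(String content);
    }

    /** Simple: A to B in one step. */
    static Transformer simple(String src, String dst, UnaryOperator<String> fn) {
        return new Transformer() {
            @Override public boolean isTransformable(String from, String to) {
                return src.equals(from) && dst.equals(to);
            }
            @Override public String transform(String content) {
                return fn.apply(content);
            }
        };
    }

    /** Complex: A to C via an intermediate type B. */
    static Transformer complex(Transformer first, String via, Transformer second) {
        return new Transformer() {
            @Override public boolean isTransformable(String from, String to) {
                return first.isTransformable(from, via) && second.isTransformable(via, to);
            }
            @Override public String transform(String content) {
                return second.transform(first.transform(content));
            }
        };
    }

    /** Failover: several transformers for the same path, tried in order. */
    static Transformer failover(List<Transformer> candidates) {
        return new Transformer() {
            @Override public boolean isTransformable(String from, String to) {
                for (Transformer t : candidates) {
                    if (t.isTransformable(from, to)) return true;
                }
                return false;
            }
            @Override public String transform(String content) {
                RuntimeException last = new IllegalStateException("no candidates");
                for (Transformer t : candidates) {
                    try {
                        return t.transform(content);
                    } catch (RuntimeException e) {
                        last = e; // remember the failure and try the next one
                    }
                }
                throw last;
            }
        };
    }

    public static void main(String[] args) {
        Transformer pptxToPdf = simple("pptx", "pdf", c -> "pdf(" + c + ")");
        Transformer pdfToPng = simple("pdf", "png", c -> "png(" + c + ")");
        Transformer pptxToPng = complex(pptxToPdf, "pdf", pdfToPng);
        Transformer broken = simple("pptx", "pdf",
                c -> { throw new IllegalStateException("OOo is down"); });
        Transformer robust = failover(Arrays.asList(broken, pptxToPdf));

        System.out.println(pptxToPng.transform("slides")); // prints png(pdf(slides))
        System.out.println(robust.transform("slides"));    // prints pdf(slides)
    }
}
```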

/api/service/mimetypes webscript

http://localhost:8080/alfresco/service/mimetypes

•MIME types

•Metadata Extractors

•Content Transformers

•As services come and go (OOo), entries may disappear

/api/service/mimetypes webscript

application/vnd.openxmlformats-officedocument.presentationml.presentation - pptx

Extractors: org.alfresco.repo.content.metadata.PoiMetadataExtracter

Transformable To:

application/pdf = Using a Direct Open Office Connection

application/vnd.ms-powerpoint = Using a Direct Open Office Connection

application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

application/x-shockwave-flash = Complex via: application/pdf

image/jpeg = Complex via: application/pdf

image/png = Complex via: application/pdf

text/html = org.alfresco.repo.content.transform.TikaAutoContentTransformer

text/plain = org.alfresco.repo.content.transform.TikaAutoContentTransformer

text/xml = org.alfresco.repo.content.transform.TikaAutoContentTransformer

Transformable From:

application/vnd.ms-powerpoint = Using a Direct Open Office Connection

application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

“Best” transformer selection

• Alfresco prefers
  • available transformers (obviously)
  • ‘explicit’ transformers
  • previously fast transformers*

• Alfresco doesn’t understand the output quality
  • pass/fail
  • fast/slow

* past performance is not a guide to future performance.
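The stated preferences (available, then explicit, then previously fast) amount to a filter followed by a two-level ordering. The sketch below shows one way to express that in plain Java; `Candidate`, its fields and `selectBest` are hypothetical names, and the exact tie-breaking Alfresco applies is not taken from the source.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class TransformerSelectionSketch {

    /** Hypothetical summary of what the registry knows about one transformer. */
    static final class Candidate {
        final String name;
        final boolean available;   // e.g. false while OOo is not running
        final boolean explicit;    // explicitly configured for this source/target pair
        final long averageMillis;  // observed average time of past transformations

        Candidate(String name, boolean available, boolean explicit, long averageMillis) {
            this.name = name;
            this.available = available;
            this.explicit = explicit;
            this.averageMillis = averageMillis;
        }
    }

    /** Drops unavailable candidates, then prefers explicit ones, then faster ones. */
    static Optional<Candidate> selectBest(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.available)
                // Boolean orders false before true, so '!explicit' puts explicit first.
                .min(Comparator.comparing((Candidate c) -> !c.explicit)
                        .thenComparingLong(c -> c.averageMillis));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("offline-explicit", false, true, 1),
                new Candidate("generic-fast", true, false, 5),
                new Candidate("explicit-slow", true, true, 50),
                new Candidate("explicit-fast", true, true, 20));
        System.out.println(selectBest(candidates).map(c -> c.name).orElse("none"));
        // prints explicit-fast
    }
}
```

Nothing in this ordering looks at the quality of the output, which matches the slide's point: selection only knows pass/fail and fast/slow.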

Content Transformation - Java

    ContentTransformerRegistry registry = appContext.getBean(
            "contentTransformerRegistry", ContentTransformerRegistry.class);

    ContentReader reader = contentService.getReader(
            nodeRef, ContentModel.PROP_CONTENT);

    // 'true' asks the writer to update the target node's content property
    ContentWriter writer = contentService.getWriter(
            targetNode, ContentModel.PROP_CONTENT, true);
    writer.setEncoding("UTF-8");
    writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);

    // Now have a reader & writer ready to go

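Content Transformation - Java ctd.

    // Ask the registry for the 'best' zip-to-text transformer for this source size
    ContentTransformer transformer = registry.getTransformer(
            MimetypeMap.MIMETYPE_ZIP,
            reader.getSize(),
            MimetypeMap.MIMETYPE_TEXT_PLAIN,
            null);   // no TransformationOptions supplied

    transformer.transform(reader, writer);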

Content Transformation - JavaScript

var action = actions.create('transform');

action.parameters["destination-folder"] = node.parent;

action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";

action.parameters["assoc-name"] = node.name + "transformed";

action.parameters["mime-type"] = "text/plain";

action.execute(testNode);

Config: Transformer Filtering/Debugging

• org.alfresco.service.cmr.repository.TransformationOptionLimits
  • timeouts, size limits, page limits
  • content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120

• org.alfresco.repo.content.TransformerDebug
  • contextual logging
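As an illustration of what a source size limit such as the `maxSourceSizeKBytes=5120` property means in practice, the sketch below checks a source size against a configured limit. The class `SizeLimitSketch` and its method are hypothetical, and treating a missing or negative value as "no limit" is an assumption of this sketch rather than something stated on the slide.

```java
import java.util.Properties;

public class SizeLimitSketch {

    /**
     * Returns true if a source of the given size may be transformed under the
     * limit stored at 'key'. Assumption: a missing or negative value means no limit.
     */
    static boolean withinLimit(Properties config, String key, long sourceSizeBytes) {
        String raw = config.getProperty(key);
        if (raw == null) {
            return true; // nothing configured for this transformer and MIME type pair
        }
        long maxKBytes = Long.parseLong(raw.trim());
        if (maxKBytes < 0) {
            return true;
        }
        return sourceSizeBytes <= maxKBytes * 1024L;
    }

    public static void main(String[] args) {
        String key = "content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes";
        Properties config = new Properties();
        config.setProperty(key, "5120");

        System.out.println(withinLimit(config, key, 1024L * 1024L));       // 1 MB source
        System.out.println(withinLimit(config, key, 10L * 1024L * 1024L)); // 10 MB source
    }
}
```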

Extending

• Follow the Alfresco patterns
  • org.alfresco.repo.content.transform

• Remember the chains

• Remember the subsystems
  • ImageMagick
  • OpenOffice

• Remember the Enterprise variants
  • JodConverter

Recap

• Many transformations & paths possible
  • No graph

• Can be expensive in CPU/memory

• Transformation to text = free indexing

• No link between source & transformed content
  • Thumbnails are children of their source nodes
  • Bespoke behaviours ensure thumbnails are updated

Renditions

Renditions

• A more general feature than transformers

• Although with a strong overlap
  • Thumbnails are renditions
  • Previews are renditions

• Not all renditions are thumbnails/previews

Renditions

• Flexible location

• Always associated to their source node.
  • Child nodes of their source node.
  • Child nodes of another folder node.

• Updated when their source updates.

• Can be disabled with marker aspect
  • rn:preventRenditions
  • See ‘preventRenditions’ spring bean to register other ‘unrenditionable’ content classes

• Can reflect the content and/or metadata of their source node.

Standard rendition engines

• reformat: redirects to vanilla transforms

• image: image manipulation parameters

• freemarker: run some FTL against source content

• xslt: run XSLT on (XML) source node

• composite: rendition series [reformat, crop]

Persistence of Rendition Definitions

1. Create Rendition Definition

2. Set parameter values on it

3. Execute it against a source node

• Definitions can be persisted

• Useful for complex or commonly used definitions
  • RenditionService.save(), .load()

• Saved into Alfresco’s Data Dictionary
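The create / set parameters / execute lifecycle, plus save and load, can be pictured with a toy in-memory model. This sketch is not the RenditionService API: `RenditionDefinitionSketch`, `Definition` and `execute` are hypothetical, the `HashMap` merely stands in for the Data Dictionary, and "executing" only returns a description string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RenditionDefinitionSketch {

    /** Hypothetical definition: a name, an engine name and some parameters. */
    static final class Definition {
        final String name;
        final String engine;
        final Map<String, Object> parameters = new TreeMap<>();

        Definition(String name, String engine) {
            this.name = name;
            this.engine = engine;
        }
    }

    // In-memory stand-in for persisted definitions.
    private final Map<String, Definition> store = new HashMap<>();

    void save(Definition definition) {
        store.put(definition.name, definition);
    }

    Definition load(String name) {
        return store.get(name);
    }

    /** Pretends to run the definition; only describes what would be done. */
    static String execute(Definition definition, String sourceNode) {
        return definition.engine + " on " + sourceNode + " with " + definition.parameters;
    }

    public static void main(String[] args) {
        RenditionDefinitionSketch service = new RenditionDefinitionSketch();

        // 1. Create the definition, 2. set parameters, then persist it for reuse.
        Definition def = new Definition("cm:cropResize", "imageRenderingEngine");
        def.parameters.put("xSize", 50);
        def.parameters.put("ySize", 50);
        service.save(def);

        // 3. Later: load it by name and execute it against a source node.
        Definition loaded = service.load("cm:cropResize");
        System.out.println(execute(loaded, "photo.jpg"));
        // prints imageRenderingEngine on photo.jpg with {xSize=50, ySize=50}
    }
}
```

Persisting a definition pays off when the same parameter set is applied to many source nodes, since steps 1 and 2 then happen once.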

Renditions - Java

    NodeRef jpgNodeRef;   // source image node, assigned elsewhere

    QName renditionName = QName.createQName(
            NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");

    // 1. Create the definition against the image rendering engine
    RenditionDefinition renditionDef = renditionService.createRenditionDefinition(
            renditionName, "imageRenderingEngine");

    // 2. Set parameter values on it
    renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_WIDTH, 128);
    renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_HEIGHT, 512);
    renditionDef.setParameterValue(ImageRenderingEngine.PARAM_MAINTAIN_ASPECT_RATIO, false);

    // 3. Execute it against the source node
    ChildAssociationRef chAssRef = renditionService.render(jpgNodeRef, renditionDef);

Renditions - JavaScript

    var renditionDef = renditionService.createRenditionDefinition(
        "cm:cropResize", "imageRenderingEngine");

    renditionDef.parameters["destination-path-template"] =
        "/Company Home/Cropped Images/${name}.jpg";
    renditionDef.parameters["isAbsolute"] = true;
    renditionDef.parameters["xSize"] = 50;
    renditionDef.parameters["ySize"] = 50;

    renditionService.render(testNode, renditionDef);

    var renditions = renditionService.getRenditions(testNode);

Recap

• Renditions == Transformations++

• More complex, more powerful
