Metadata Extractors, Content Transformers & Renditions
Neil Mc Erlean
Who am I?
Lead Engineer in the Services Team
4 years at Alfresco (since 3.2)
Previously worked on
• Hybrid Sync
• Alfresco in the Cloud
• Various services/components
  • Transformers & Extractors
  • REST APIs
  • Actions & Behaviours and more…
Ex-astrophysicist (of which more later)
Talk content
What data is in your content?
How does Alfresco get at it?
What does Alfresco do with it?
How can you use these features?
Introductory material
• no prior knowledge assumed
Talk content - Breaking it down
Your content & its metadata
Alternative renditions of your content
Overviews of the 3 services
Java Foundation APIs. JavaScript.
Configuring & extending Alfresco.
All code samples available as runnable tests - download from the website.
#1 Metadata Extraction
#2 Content Transformation
Alfresco uses them to produce
• images (thumbnails)
• plain text (indexing)
• inter-Office transforms
Also generally useful
#3 Rendition Service
• Very similar to transformations
• More general service
• More than just content to content
How do these components work?
Mostly by leveraging existing OSS Java libs
• Notably Apache Tika
Some external OS processes too
• OpenOffice.org (OOo), LibreOffice
• ImageMagick
• pdf2swf (swftools)
Some bespoke impls e.g. zip - txt
‘embedded’ thumbnails/previews iWorks, Office
General Considerations
CPU, memory
In process vs. out of process vs. Remote CPU
Selection of ‘best’ extractor/transformer
Stay for Andy Hunt’s talk for Support’s troubleshooting tips
Metadata Extraction
#1 Metadata Extraction
• Triggered on content creation or update.
  • or on demand
• ‘Best’ available extractor obtained from MetadataExtracterRegistry.
• This Extractor pulls out the metadata.
  • Format depends on the extractor lib/impl.
  • key/value pairs
• These data are mapped onto the Alfresco content model
  • configurable mapping: <ExtractorClass>.properties
Metadata extraction - Java
MetadataExtracterRegistry registry =
    appContext.getBean("metadataExtracterRegistry",
        MetadataExtracterRegistry.class);
ContentReader reader =
    contentService.getReader(nodeRef,
        ContentModel.PROP_CONTENT);
MetadataExtracter extractor =
    registry.getExtracter(reader.getMimetype());
Map<QName, Serializable> props =
    new HashMap<QName, Serializable>();
extractor.extract(reader,
    OverwritePolicy.EAGER, props);
Overwrite Policy – when re-extracting
• EAGER
  • extracted value is not null
• PRUDENT
  • db property doesn’t exist or is null or “” (+ above)
• CAUTIOUS
  • existing property == undefined
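The three policies above can be sketched in plain Java. This is a simplified model, not Alfresco's own OverwritePolicy enum: the class name OverwritePolicySketch, the use of String values, and the assumption that CAUTIOUS also requires a non-null extracted value are all illustrative choices.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the overwrite policies; hypothetical names, not Alfresco code.
final class OverwritePolicySketch {

    enum Policy { EAGER, PRUDENT, CAUTIOUS }

    /** Decides whether an extracted value should replace what is already stored. */
    static boolean shouldWrite(Policy policy, Map<String, String> existing,
                               String key, String extracted) {
        if (extracted == null) {
            return false; // the "+ above" rule; assumed to apply to CAUTIOUS as well
        }
        if (policy == Policy.EAGER) {
            return true; // any non-null extracted value wins
        }
        if (policy == Policy.PRUDENT) {
            String current = existing.get(key); // null when absent or stored as null
            return current == null || current.isEmpty();
        }
        // CAUTIOUS: write only when the property is not defined at all
        return !existing.containsKey(key);
    }

    /** Returns a copy of 'existing' with the extracted values merged in under the policy. */
    static Map<String, String> merge(Policy policy, Map<String, String> existing,
                                     Map<String, String> extracted) {
        Map<String, String> result = new HashMap<>(existing);
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            if (shouldWrite(policy, existing, e.getKey(), e.getValue())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new HashMap<>();
        existing.put("title", "");           // present but empty
        existing.put("description", "Old");  // present with a value

        Map<String, String> extracted = new HashMap<>();
        extracted.put("title", "New title");
        extracted.put("description", "New");
        extracted.put("author", "Ann");      // not present on the node yet

        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + merge(p, existing, extracted));
        }
    }
}
```

In this model only EAGER replaces the existing description, PRUDENT additionally fills the empty title, and all three add the missing author.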
<ExtractorClass>.properties mapping
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
author=cm:author
title=cm:title
#Note need to escape ‘:’ in key name
geo\:lat=cm:latitude
geo\:long=cm:longitude
Mapping properties
• Can map extracted key-value onto multiple content properties
• Can ignore extracted key-values i.e. not map.
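A minimal sketch of how such a mapping file could be read and applied, using only java.util.Properties. The class MappingSketch is hypothetical and is not the Alfresco implementation; the comma-separated form for mapping one key onto several properties and the title-to-description mapping are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper illustrating the mapping idea; not Alfresco code.
final class MappingSketch {

    static final String PREFIX_KEY = "namespace.prefix.";

    /** Parses mapping text into: extracted key -> list of "{uri}localName" targets. */
    static Map<String, List<String>> parse(String mappingText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(mappingText)); // un-escapes '\:' in key names
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // First pass: collect namespace prefixes such as cm -> http://...
        Map<String, String> prefixes = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                prefixes.put(key.substring(PREFIX_KEY.length()), props.getProperty(key));
            }
        }

        // Second pass: resolve each target (comma-separated list assumed) to a full name.
        Map<String, List<String>> mapping = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                continue;
            }
            List<String> targets = new ArrayList<>();
            for (String target : props.getProperty(key).split(",")) {
                String[] parts = target.trim().split(":", 2);
                String uri = parts.length == 2 ? prefixes.get(parts[0]) : null;
                if (uri == null) {
                    throw new IllegalArgumentException("Unknown prefix in: " + target);
                }
                targets.add("{" + uri + "}" + parts[1]);
            }
            mapping.put(key, targets);
        }
        return mapping;
    }

    /** Applies the mapping to raw extracted key/values; unmapped keys are ignored. */
    static Map<String, String> apply(Map<String, List<String>> mapping,
                                     Map<String, String> raw) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            List<String> targets = mapping.get(e.getKey());
            if (targets == null) {
                continue; // no mapping defined: the extracted value is dropped
            }
            for (String target : targets) {
                out.put(target, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
                + "title=cm:title, cm:description\n"
                + "geo\\:lat=cm:latitude\n";
        Map<String, String> raw = new HashMap<>();
        raw.put("title", "Report");
        raw.put("geo:lat", "51.5");
        raw.put("keywords", "unmapped");
        System.out.println(apply(parse(text), raw));
    }
}
```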
Metadata extraction - JavaScript
var action = actions.create('extract-metadata');
action.execute(nodeRef);
Ways to customise & extend
• Customisation of existing extractors
  • Define new mappings – to an existing or a new content model.
• Adding new extractors
  • Identify 3rd party lib that can read the binary file
  • Or write your own code to do this
  • Extend AbstractMappingMetadataExtracter
  • Or write a Tika plugin
  • Define metadata mappings
• org.alfresco.repo.content.metadata
Recap
• Metadata extraction harvests ‘hidden’ data and maps it into Alfresco content model.
• Support for many MIME types
• Metadata insertion coming
  • it’s on HEAD but currently disabled
  • also maps metadata tags to cm:taggable
• “Best” extractor selection covered below
Content Transformers
Out of the box transformers
• text, html, xml
• Microsoft Office (doc & docx formats)
• OpenDocument Format
• iWorks (Keynote, Pages, Numbers)
• Images
• Shockwave Flash (SWF)
• RFC822 email, Outlook .msg email
• Adobe PDF, Illustrator, PSD
• Electronic publication (epub)
• Rich Text (RTF)
• MP3
• Archives (ZIP, tar)
• Many more
Available transformers
• No ‘graph’ of transform paths/mime types
• Spring beans extend “baseContentTransformer”
• They implement isTransformable(from, to)
• They can be
  • simple (A to B)
  • ‘complex’ (A to C, via B)
  • failover (A to B, A to B…)
  • overlapping (multiple beans for same path)
  • dynamically un/available (e.g. OOo)
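The simple, complex and failover shapes listed above can be illustrated with a small self-contained model. The Transformer interface below is a hypothetical stand-in that works on Strings; it is not the Alfresco ContentTransformer interface, and the text/x-upper MIME type is invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model of transformer composition; hypothetical types, not Alfresco code.
final class TransformerSketch {

    interface Transformer {
        boolean isTransformable(String from, String to);
        String transform(String content, String from, String to);
    }

    /** Simple: one direct A -> B step. */
    static Transformer simple(String source, String target, UnaryOperator<String> fn) {
        return new Transformer() {
            public boolean isTransformable(String from, String to) {
                return source.equals(from) && target.equals(to);
            }
            public String transform(String content, String from, String to) {
                if (!isTransformable(from, to)) {
                    throw new IllegalArgumentException(from + " -> " + to + " not supported");
                }
                return fn.apply(content);
            }
        };
    }

    /** Complex: A -> C by chaining A -> via and via -> C. */
    static Transformer complex(Transformer first, Transformer second, String via) {
        return new Transformer() {
            public boolean isTransformable(String from, String to) {
                return first.isTransformable(from, via) && second.isTransformable(via, to);
            }
            public String transform(String content, String from, String to) {
                String intermediate = first.transform(content, from, via);
                return second.transform(intermediate, via, to);
            }
        };
    }

    /** Failover: try each candidate for the same path until one succeeds. */
    static Transformer failover(List<Transformer> candidates) {
        return new Transformer() {
            public boolean isTransformable(String from, String to) {
                return candidates.stream().anyMatch(t -> t.isTransformable(from, to));
            }
            public String transform(String content, String from, String to) {
                RuntimeException last =
                        new IllegalStateException("No transformer for " + from + " -> " + to);
                for (Transformer t : candidates) {
                    if (!t.isTransformable(from, to)) {
                        continue;
                    }
                    try {
                        return t.transform(content, from, to);
                    } catch (RuntimeException e) {
                        last = e; // remember the failure and try the next candidate
                    }
                }
                throw last;
            }
        };
    }

    public static void main(String[] args) {
        Transformer htmlToText =
                simple("text/html", "text/plain", s -> s.replaceAll("<[^>]+>", ""));
        Transformer textToUpper =
                simple("text/plain", "text/x-upper", String::toUpperCase);
        Transformer chain = complex(htmlToText, textToUpper, "text/plain");
        System.out.println(chain.transform("<b>hi</b>", "text/html", "text/x-upper"));

        Transformer broken =
                simple("text/html", "text/plain", s -> { throw new IllegalStateException("down"); });
        Transformer safe = failover(Arrays.asList(broken, htmlToText));
        System.out.println(safe.transform("<p>ok</p>", "text/html", "text/plain"));
    }
}
```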
/api/service/mimetypes webscript
http://localhost:8080/alfresco/service/mimetypes
•MIME types
•Metadata Extractors
•Content Transformers
•As services come and go (OOo), entries may disappear
/api/service/mimetypes webscript
application/vnd.openxmlformats-officedocument.presentationml.presentation - pptx
Extractors: org.alfresco.repo.content.metadata.PoiMetadataExtracter
Transformable To:
application/pdf = Using a Direct Open Office Connection
application/vnd.ms-powerpoint = Using a Direct Open Office Connection
application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection
application/x-shockwave-flash = Complex via: application/pdf
image/jpeg = Complex via: application/pdf
image/png = Complex via: application/pdf
text/html = org.alfresco.repo.content.transform.TikaAutoContentTransformer
text/plain = org.alfresco.repo.content.transform.TikaAutoContentTransformer
text/xml = org.alfresco.repo.content.transform.TikaAutoContentTransformer
Transformable From:
application/vnd.ms-powerpoint = Using a Direct Open Office Connection
application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection
“Best” transformer selection
• Alfresco prefers
  • available transformers (obviously)
  • ‘explicit’ transformers
  • previously fast transformers*
• Alfresco doesn’t understand the output quality
  • pass/fail
  • fast/slow
* past performance is not a guide to future performance.
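One way to picture the preferences above is as a filter followed by an ordering. The sketch below is an illustration only: the Candidate type, its fields and the exact precedence (explicit before speed) are assumptions, not a description of how Alfresco's registry actually ranks transformers.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical ranking model; the real selection logic may weigh these differently.
final class SelectionSketch {

    static final class Candidate {
        final String name;
        final boolean available;   // e.g. false while OOo is not running
        final boolean explicit;    // declared explicitly for this source/target pair
        final long averageMillis;  // observed average time of past transformations

        Candidate(String name, boolean available, boolean explicit, long averageMillis) {
            this.name = name;
            this.available = available;
            this.explicit = explicit;
            this.averageMillis = averageMillis;
        }
    }

    /** Drops unavailable candidates, then prefers explicit ones, then the fastest so far. */
    static Optional<Candidate> best(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.available)
                .min(Comparator.comparing((Candidate c) -> !c.explicit) // false sorts first
                        .thenComparingLong(c -> c.averageMillis));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("A", false, true, 1),
                new Candidate("B", true, false, 5),
                new Candidate("C", true, true, 50),
                new Candidate("D", true, true, 20));
        System.out.println(best(candidates).map(c -> c.name).orElse("none"));
    }
}
```

Nothing in this model looks at output quality, which mirrors the point that selection only knows pass/fail and fast/slow.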
Content Transformation - Java
ContentTransformerRegistry registry =
    appContext.getBean("contentTransformerRegistry",
        ContentTransformerRegistry.class);
ContentReader reader = contentService.getReader
    (nodeRef, ContentModel.PROP_CONTENT);
ContentWriter writer = contentService.getWriter
    (targetNode, ContentModel.PROP_CONTENT, true);
writer.setEncoding("UTF-8");
writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);
// Now have a reader & writer ready to go
Content Transformation – Java ctd.
ContentTransformer transformer =
    registry.getTransformer
        (MimetypeMap.MIMETYPE_ZIP,
         reader.getSize(),
         MimetypeMap.MIMETYPE_TEXT_PLAIN, null);
transformer.transform(reader, writer);
Content Transformation - JavaScript
var action = actions.create('transform');
action.parameters["destination-folder"] = node.parent;
action.parameters["assoc-type"] =
"{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] =
node.name + "transformed";
action.parameters["mime-type"] = "text/plain";
action.execute(testNode);
Config: Transformer Filtering/Debugging
• org.alfresco.service.cmr.repository.TransformationOptionLimits
  • timeouts, size limits, page limits
  • content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120
• org.alfresco.repo.content.TransformerDebug
  • contextual logging
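To make the size-limit property above concrete, here is a small sketch of how a per-transformer, per-MIME-pair limit could be looked up and checked. The LimitSketch class is hypothetical; treating a missing or negative value as "no limit" and 1 KByte as 1024 bytes are assumptions, not documented Alfresco behaviour.

```java
import java.util.Properties;

// Hypothetical helper; it only mirrors the shape of the property key shown above.
final class LimitSketch {

    /** Builds a key like content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes. */
    static String limitKey(String transformer, String sourceExt, String targetExt) {
        return "content.transformer." + transformer + ".mimeTypeLimits."
                + sourceExt + "." + targetExt + ".maxSourceSizeKBytes";
    }

    /** Returns true when the source is small enough for this transformer and path. */
    static boolean withinLimit(Properties config, String transformer,
                               String sourceExt, String targetExt, long sourceSizeBytes) {
        String value = config.getProperty(limitKey(transformer, sourceExt, targetExt));
        if (value == null) {
            return true; // assumption: no entry means no limit
        }
        long maxKBytes = Long.parseLong(value.trim());
        if (maxKBytes < 0) {
            return true; // assumption: a negative value means unlimited
        }
        return sourceSizeBytes <= maxKBytes * 1024L;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(limitKey("OpenOffice", "txt", "pdf"), "5120");
        System.out.println(withinLimit(config, "OpenOffice", "txt", "pdf", 1_000_000L));
        System.out.println(withinLimit(config, "OpenOffice", "txt", "pdf", 10_000_000L));
    }
}
```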
Extending
• Follow the Alfresco patterns
  • org.alfresco.repo.content.transform
• Remember the chains
• Remember the subsystems
  • ImageMagick
  • OpenOffice
• Remember the Enterprise variants
  • JodConverter
Recap
• Many transformations & paths possible
  • No graph
• Can be expensive in CPU/memory
• Transformation to text = free indexing
• No link between source & transformed content
  • Thumbnails are children of their source nodes
  • Bespoke behaviours ensure thumbnails are updated
Renditions
Renditions
• A more general feature than transformers
• Although with a strong overlap
  • Thumbnails are renditions
  • Previews are renditions
• Not all renditions are thumbnails/previews
Renditions
• Flexible location
• Always associated to their source node.
  • Child nodes of their source node.
  • Child nodes of another folder node.
• Updated when their source updates.
• Can be disabled with marker aspect
  • rn:preventRenditions
  • See ‘preventRenditions’ spring bean to register other ‘unrenditionable’ content classes
• Can reflect the content and/or metadata of their source node.
Standard rendition engines
• reformat: redirects to vanilla transforms
• image: image manipulation parameters
• freemarker: run some FTL against source content
• xslt: run XSLT on (XML) source node
• composite: rendition series [reformat, crop]
Persistence of Rendition Definitions
1. Create Rendition Definition
2. Set parameter values on it
3. Execute it against a source node
• Definitions can be persisted
• Useful for complex or commonly used definitions
  • RenditionService.save(), .load()
• Saved into Alfresco’s Data Dictionary
Renditions - Java
NodeRef jpgNodeRef;
QName renditionName = QName.createQName(
    NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
RenditionDefinition renditionDef =
    renditionService.createRenditionDefinition(
        renditionName, "imageRenderingEngine");
renditionDef.setParameterValue(
    ImageRenderingEngine.PARAM_RESIZE_WIDTH, 128);
renditionDef.setParameterValue(
    ImageRenderingEngine.PARAM_RESIZE_HEIGHT, 512);
renditionDef.setParameterValue(
    ImageRenderingEngine.PARAM_MAINTAIN_ASPECT_RATIO, false);
ChildAssociationRef chAssRef =
    renditionService.render(jpgNodeRef, renditionDef);
Renditions - JavaScript
var renditionDef = renditionService
    .createRenditionDefinition("cm:cropResize",
        "imageRenderingEngine");
renditionDef.parameters["destination-path-template"]
    = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xSize"] = 50;
renditionDef.parameters["ySize"] = 50;
renditionService.render(testNode, renditionDef);
var renditions = renditionService.getRenditions(testNode);
Recap
• Renditions == Transformations++
• More complex, more powerful
End