Extreme Content Makeover: Migrating Content to DITA
DESCRIPTION
Presented by Joe Gollner at Documentation and Training West, May 6-9, 2008 in Vancouver, BC
While most organizations would not want to admit it, their content currently exists in a state of disarray. In all too many cases, the legacy content is in such a state that it could become the subject of a prime-time television show, provided someone could be found who would agree to mount such a large-scale renovation job. And however daunting the prospect might be, the movement of the marketplace in almost all industry sectors is such that these renovations are not only unavoidable but often urgently needed.
TRANSCRIPT
Extreme Content Makeover: Migrating Content to DITA
Joe Gollner
Copyright © Stilo International 2008
Vice President, e-Publishing Solutions
Stilo [email protected]
The Essence of Content Conversion
Got this! Want that!
Legacy Content Edition
Topics
The Growing Demand for High Quality Content
Challenges with Converting Content
Solution Patterns for Converting Content
  Conversion
  Refactoring
  Metadata
  Linking
  Validation
Conclusions
  Key Lessons Learned
An Inconvenient Truth – About Content
Case Study: Drug Look-up Tool
Migrating drug information into a precise digital form represented a key challenge.
Sources: Miles33, Quark & vendor monographs
Enterprise Content Frameworks
[Diagram: Content Architecture with Inputs, Outputs, Controls and Mechanisms. Labels: Enterprise, Programs, Domains, Models, Rules, Specialize, Integrate; Document Sources, Ontology Sources, Data Sources (Active, Legacy, External); Publishing Services, Discovery Services, Data Services (Web, Application); Users, Tools.]
Mechanisms
Service Oriented Content Architectures lead to high demands being placed on content resources and the affordability of the overall process.
Users: Authors, Subject Matter Experts, Administrators, Information Architects, Developers
Tools: Content Management, Content Processing, Content Authoring, Development Tools, Web Services
Resources: Budget, Personnel, Infrastructure
Observations on Content Expectations
Within this larger context, what is expected of content?
1. Content will be available as valid XML
2. Content will be modularized
3. Content will be discretely addressable
4. Content will be uniquely identifiable using metadata
5. Content will be linked to related content
6. Content will be process-able with almost perfect confidence
How much legacy content is ready to play this role?
(How much XML content is even ready for this?)
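As a rough illustration of how some of these expectations might be checked automatically, the sketch below tests a single content unit for well-formedness, an identifier, metadata and links. The element and attribute names (`id`, `prolog`, `xref`) are assumptions borrowed from DITA-style topics, and the standard library parser only checks well-formedness, not validity against a schema.

```python
import xml.etree.ElementTree as ET

def readiness_report(xml_text):
    """Check one content unit against a few of the expectations listed above."""
    report = {"well_formed": False, "identifiable": False,
              "has_metadata": False, "has_links": False}
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return report  # not parseable, so nothing else can be checked
    report["well_formed"] = True
    # Assumed conventions: an id attribute, a <prolog> child for metadata,
    # and <xref> elements for links.
    report["identifiable"] = bool(root.get("id"))
    report["has_metadata"] = root.find("prolog") is not None
    report["has_links"] = root.find(".//xref") is not None
    return report

legacy = "<p>Unclosed paragraph"
topic = ('<topic id="t1"><title>Dosage</title><prolog><metadata/></prolog>'
         '<body><p>See <xref href="t2.dita"/></p></body></topic>')

print(readiness_report(legacy))
print(readiness_report(topic))
```

A report like this could be run across a whole collection to estimate how much content is ready for the role described above.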
The Harsh Reality of Legacy Content
Legacy Content: all content resources that require modification in order to be useful
The Legacy Content Spectrum
Opaque
  Not directly processable (e.g., paper)
Annoying
  Aggressively proprietary
  Little or no predictability in usage
Polluted
  Normally processable but frequently filled with deviations & additions (HTML)
Tolerable
  Documented format that exposes format & structure in a processable form
Content Processing Roadmap
[Diagram: phases ACQUIRE, ENRICH and DELIVER across the layers CONTEXT (Metadata), CONTENT and CONNECTIONS (Links). Content processing steps: Convert, Collect, Compile, Refactor, Relate, Resolve. Management steps: Import, Select, Publish.]
Convert Content
[Content Processing Roadmap diagram: Convert step]
Converting Content
Conversion: changing the format of legacy content to make it increasingly suitable for efficient management, revision, reuse and publishing.
Conversion Fundamentals
Conversion is unavoidable and always under-estimated
Conversion is fundamentally a matter of interpretation
  Parsing the legacy format & layout
  Inferring a meaning from this information
  Correlating the format & layout to a target structure
  Addressing problems introduced by format peculiarities
  Leveraging the content itself to guide format interpretation
  Enhancing interpretive rules by matching content patterns
Automating conversion typically relies on two stages:
  Format Interpreter that can make sense of source formatting
  Rules-based Correlation Processor that maps content into structures
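The two stages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tooling used in the presentation: the style names, the `RULES` table and the target element names are all hypothetical, and a real format interpreter would parse a proprietary layout format rather than plain text lines.

```python
import re

def interpret_format(lines):
    """Stage 1 (format interpreter): turn raw legacy lines into neutral style records."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.isupper():
            style = "all-caps"
        elif re.match(r"^\d+\.\s", text):
            style = "numbered"
        else:
            style = "plain"
        records.append({"style": style, "text": text})
    return records

# Stage 2 rules are kept as data so they can be revised without touching code.
RULES = {"all-caps": "title", "numbered": "step", "plain": "p"}

def correlate(records, rules):
    """Stage 2 (correlation processor): map interpreted records onto target elements."""
    output = []
    for record in records:
        element = rules.get(record["style"], "p")
        text = record["text"]
        if element == "step":
            text = re.sub(r"^\d+\.\s+", "", text)  # drop the typed-in step number
        output.append((element, text))
    return output

legacy = ["REPLACING THE FILTER", "", "1. Switch off the pump.",
          "2. Remove the cover.", "Dispose of the old filter safely."]
result = correlate(interpret_format(legacy), RULES)
for element, text in result:
    print(f"<{element}>{text}</{element}>")
```

Keeping the rules in a table rather than in code is one way to let the mapping be adjusted as interpretation improves.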
Conversion Process Template
[Diagram: conversion process flow. Inputs: Legacy Source Content, Target XML Schema, Guidance, Existing Conversion Rules. Activities: Source Analysis, Source to Target Mapping, Subject Matter Experts Interaction, Execute Conversion Process, Result Analysis, Identified Issues, Modify Conversion Process (Modified Conversion Rules, Manual Editing), Validation & Verification, Application Tests. Iterations: 1. Example Set, 2. Sample Set (10%), 3. Complete Set (100%), then Complete.]
Show Me!
Conversion Process Initiation
Content Analysis
  Document all features of source content and format
Establish Control Collections
  Collections can be used to group files with similar features
  Rules can be tailored to address these features
  Collections provide useful management units for tracking & reporting
Clearly Define the Target End State
  Should be well-suited to validation & verification activities
  Conversion should be separate from refactoring which can follow it
  Ensure that application testing is performed for verification
    Structural validation is not sufficient
    The converted content must support its intended uses
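One possible way to form control collections is to group files by the features recorded during content analysis. The sketch below assumes a hypothetical analysis record per file (`name`, `format`, `features`); the file names and feature labels are invented for illustration.

```python
from collections import defaultdict

def feature_signature(doc):
    """Reduce a document's analysed features to a hashable grouping key."""
    return (doc["format"], tuple(sorted(doc["features"])))

def build_control_collections(docs):
    """Group files with the same format and feature set into one collection."""
    collections = defaultdict(list)
    for doc in docs:
        collections[feature_signature(doc)].append(doc["name"])
    return dict(collections)

docs = [
    {"name": "a.qxd", "format": "quark", "features": ["tables", "footnotes"]},
    {"name": "b.qxd", "format": "quark", "features": ["footnotes", "tables"]},
    {"name": "c.htm", "format": "html", "features": ["tables"]},
]
collections = build_control_collections(docs)
for signature, names in collections.items():
    print(signature, len(names), names)
```

Each collection can then carry its own tailored rules and serve as a unit for tracking and reporting.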
Conversion Process Planning
Prepare a Conversion Specification
  Document analysis results & content mapping rules
  Incorporate naming conventions to be applied
    Instances, media resources, identifiers, cross-references…
Establish a representative Example Set early in process
  A limited set of files that exhibit main features of source content
  Matched with converted content that illustrates intended result
  Used to iteratively refine rules & troubleshoot problems
  Forms part of the Conversion Specification
Prepare a Conversion Plan
  Document intervention procedures to be followed
  Define manual editing guidelines
  Explore outsourcing opportunities to enhance process or reduce costs
  Prepare schedule & cost estimates
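An Example Set lends itself to being run as a regression check: each source file is paired with the converted result it should produce, and any mismatch points to a rule that needs refinement. The sketch below is an assumption-laden illustration; `convert` is a deliberately naive stand-in for the real conversion process, and the sample content is invented.

```python
def convert(source_text):
    """Stand-in converter (hypothetical): first line is the title, the rest one paragraph."""
    title, _, body = source_text.partition("\n")
    return (f"<topic><title>{title.strip()}</title>"
            f"<body><p>{body.strip()}</p></body></topic>")

# Each entry pairs a source sample with the converted result that is intended.
EXAMPLE_SET = [
    ("simple", "Dosage\nTake one tablet daily.",
     "<topic><title>Dosage</title><body><p>Take one tablet daily.</p></body></topic>"),
    ("two-paragraph", "Storage\nKeep dry.\nKeep cool.",
     "<topic><title>Storage</title><body><p>Keep dry.</p><p>Keep cool.</p></body></topic>"),
]

def run_example_set(examples, converter):
    """Return the examples whose conversion differs from the intended result."""
    failures = []
    for name, source, expected in examples:
        actual = converter(source)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures

failures = run_example_set(EXAMPLE_SET, convert)
for name, expected, actual in failures:
    print(f"{name}: expected {expected!r} but got {actual!r}")
```

Here the second example fails on purpose, showing how the set exposes a gap (multi-paragraph bodies) in the current rules.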
Conversion Process Refinement
Implement initial Conversion Process
  Maximize automation
  Develop validation & verification scenarios that leverage automation
  Ensure conversion rules can be modified by non-programmers
    The goal is to interact with Subject Matter Experts efficiently
  Based on Conversion Specification & Example Set
Test Conversion Process
  Follow the process from beginning to end
    Including application tests & output review
  Look for opportunities to enhance automation
  Perform trial interventions & manual editing to improve procedures
  Revise Conversion Specification, Example Set & automation
Conversion Process Execution & Adaptation
Process refinement should continue throughout conversion
  Improve automation as the first response to identified issues
  Minimize manual editing and ensure it is made as routine as possible
    Suitable for outsourcing under knowledgeable guidance
Application Testing is important (verification)
  Where all target applications are not available
    Develop tests that will minimize risks
Reduce risk of rework
  Manual clean-up after format interpretation is less at risk
  Manual editing as part of content mapping is at greater risk
  Separate format interpretation from content mapping
    An interim XML format should be used as an interface
    Interim format should retain all details available in source content
Refactor Content
[Content Processing Roadmap diagram: Refactor step]
Refactoring Content
Refactoring: restructuring content, without loss of meaning, to improve its suitability for management, maintenance and specifically reuse.
Refactoring entails two activities: bursting & normalization
Aspects of Refactoring
Refactoring breaks down into two tasks
  Bursting
  Normalization
Content Bursting
  Decomposing content into components optimized for reuse
Content Normalization
  Systematic removal of redundancies to improve maintainability
Challenges
  Maintaining a complete equivalence with the original
  Adapting the linking mechanisms so they remain valid and functional
    Usually entails introduction of an indirect referencing scheme
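Bursting and the equivalence check can be sketched together. In the example below, each `section` becomes its own component and a small map refers to components by key rather than by location, which is one simple form of indirect referencing. The element names (`doc`, `section`, `map`, `ref`) are assumptions for illustration, not a DITA map.

```python
import copy
import xml.etree.ElementTree as ET

def burst(doc_xml):
    """Split each <section> into its own component; return (key map, components)."""
    root = ET.fromstring(doc_xml)
    components = {}
    key_map = ET.Element("map")
    for section in root.findall("section"):
        key = section.get("id")
        components[key] = copy.deepcopy(section)
        # The map points at a key, not a file path, so components can move freely.
        ET.SubElement(key_map, "ref", {"key": key})
    return key_map, components

def reassemble(key_map, components, root_tag="doc"):
    """Rebuild the original document by resolving each key in map order."""
    root = ET.Element(root_tag)
    for ref in key_map.findall("ref"):
        root.append(copy.deepcopy(components[ref.get("key")]))
    return root

source = ('<doc><section id="s1"><p>Intro</p></section>'
          '<section id="s2"><p>Steps</p></section></doc>')
key_map, components = burst(source)
rebuilt = reassemble(key_map, components)

# Equivalence check: the reassembled document serializes identically to the original.
equivalent = ET.tostring(rebuilt) == ET.tostring(ET.fromstring(source))
print(sorted(components), equivalent)
```

Comparing the reassembled output with the original is a cheap way to confirm that bursting lost nothing.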
Refactoring Strategies
Strategy needed to ensure adequate returns on investment
  Approach must balance cost, risk, effort and time in a practical way
[Diagram labels: Conversion Outputs, Compare Outputs]
Refactoring: Planning Granularity Level
Finding the Right Level of Granularity
  What are the most “natural” joints where content can be burst?
  How is content most meaningfully
    Managed
    Authored
    Used
  Ideally there is a level of granularity that is consistent across the views
What to Avoid
  Over-ambition in defining granularity level
  At some point of decomposition, content becomes
    Meaningless
    Very difficult to manage
    Very expensive to achieve across large sets of content
    Challenging to work with for authors
Normalization
Normalization is an optimization appropriate for content that:
  Has a long lifespan
  Exhibits a significant rate of change
  Will be translated into other languages
Normalization occurs at two levels
  At the level of managed granularity (component)
    Commonly performed tasks in technical documentation
    Example: Procedures for accessing a control interface
  At a sub-component level
    Boilerplate text (e.g., copyright notice or disclaimer)
    Advisories (e.g., safety warnings)
Automation can support the process under guidance
  Identify redundancies & implement replacement decisions
  Facilitate verifications that there has been no content loss or output impacts
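A minimal sketch of automated support for sub-component normalization follows. It fingerprints paragraphs, replaces repeats with a reference to one shared copy, and then resolves the references again to verify that no content was lost. The data layout (components as lists of paragraph strings) and the `{"ref": ...}` placeholder are assumptions; note that paragraphs differing only in whitespace would be treated as the same text, which is a replacement decision that needs human review.

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Hash whitespace-normalized text to obtain a short comparison key."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()[:12]

def find_redundancies(components):
    """Map each fingerprint seen more than once to its (component id, index) locations."""
    seen = defaultdict(list)
    for comp_id, paragraphs in components.items():
        for index, paragraph in enumerate(paragraphs):
            seen[fingerprint(paragraph)].append((comp_id, index))
    return {fp: locs for fp, locs in seen.items() if len(locs) > 1}

def normalize(components, redundancies):
    """Replace every repeated paragraph with a reference to one shared copy."""
    shared = {}
    result = {cid: list(paras) for cid, paras in components.items()}
    for fp, locations in redundancies.items():
        first_id, first_index = locations[0]
        shared[fp] = components[first_id][first_index]
        for comp_id, index in locations:
            result[comp_id][index] = {"ref": fp}
    return shared, result

def resolve(shared, normalized):
    """Expand references again, for publishing or for verifying no content loss."""
    return {cid: [shared[p["ref"]] if isinstance(p, dict) else p for p in paras]
            for cid, paras in normalized.items()}

components = {
    "install": ["WARNING: Disconnect power before servicing.", "Mount the unit on the wall."],
    "repair": ["WARNING: Disconnect power before servicing.", "Replace the fuse."],
}
shared, normalized = normalize(components, find_redundancies(components))
print(len(shared), resolve(shared, normalized) == components)
```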
Realizing Savings through Refactoring
Collect Metadata
[Content Processing Roadmap diagram: Collect step]
Collecting Metadata
Metadata: a set of data that provides information about other data.
Collecting Metadata: extracting, validating, integrating, supplementing, synchronizing and storing metadata from, and about, the content.
Sources of Metadata
Internal
  Segments of content designated as valuable metadata
  Attributes available in source format
  Keywords & Abstract
  Annotations
External
  System Data (file information)
  Associated keywords & descriptions
  Ratings & commentary
  Process context
  Additional information drawn from other sources (e.g., part database)
[Diagram labels: Ontology, Taxonomy, Link Network, Topic; Identify, Extract and Insert metadata]
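The split between internal and external sources can be illustrated with a small collector. This is a sketch under assumptions: the element names (`title`, `keyword`), the record fields and the `part_number` supplement are hypothetical, and a real process would also validate and synchronize what it collects.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def internal_metadata(xml_text):
    """Extract metadata carried inside the content itself."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "keywords": [k.text for k in root.iter("keyword")],
    }

def external_metadata(path):
    """Collect system data about the file that holds the content."""
    info = os.stat(path)
    return {"file_name": os.path.basename(path), "size_bytes": info.st_size}

def collect_metadata(path, supplements=None):
    """Integrate internal, external and supplementary metadata into one record."""
    with open(path, encoding="utf-8") as handle:
        record = internal_metadata(handle.read())
    record.update(external_metadata(path))
    record.update(supplements or {})  # e.g. values looked up in another system
    return record

topic = ('<topic id="t1"><title>Dosage</title><keywords>'
         '<keyword>dose</keyword><keyword>tablet</keyword></keywords></topic>')
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "t1.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(topic)
    record = collect_metadata(path, {"part_number": "PN-001"})
print(record)
```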
Establish Relationships
[Content Processing Roadmap diagram: Relate step]
Establishing Relationships
Explicit Links (Actual)
Identifier Source Target Type
A1
A2
Implicit Links (Potential)
Identifier Source Target Type
B1
B2
Reuse Links (Physical)
Identifier Resource Request Condition
R1
R2
Links: the connections or relationships between things that represent a significant portion of the meaning and value of content
All About Links
Increasingly important
Essential for portals (enabling navigation)
Adding links
  Source / target identification
  Link specification
  Link generation
  Link validation
  Link extraction
  Link reporting
  Link activation
Level of precision is high as is the potential for error
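Because link work leaves so little room for error, link validation is a natural candidate for automation. The sketch below models a link with the four columns shown in the link tables (Identifier, Source, Target, Type) and reports links whose endpoints do not resolve. The link type values and topic identifiers are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    identifier: str
    source: str
    target: str
    type: str

def validate_links(links, known_ids):
    """Return (link identifier, problem) pairs; an empty list means every link resolves."""
    problems = []
    seen = set()
    for link in links:
        if link.identifier in seen:
            problems.append((link.identifier, "duplicate identifier"))
        seen.add(link.identifier)
        for role, value in (("source", link.source), ("target", link.target)):
            if value not in known_ids:
                problems.append((link.identifier, f"unknown {role}: {value}"))
    return problems

known_ids = {"t1", "t2", "t3"}
links = [
    Link("A1", "t1", "t2", "see-also"),
    Link("A2", "t2", "t9", "prerequisite"),  # t9 does not exist in the collection
]
problems = validate_links(links, known_ids)
for identifier, problem in problems:
    print(identifier, problem)
```

The resulting list doubles as a basic link report that names the link and the problem to fix.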
Content Validation
Validation
  Essential capability
  Enables consistent processing
  Streamlines processes
  Confirms conversion end-point
Validation must be
  Accurate
  Manageable
  Informative
  Actionable
  Pro-active
  Continuously improving
[Diagram labels: Convert, Transform, Publish, Refactor, Collect, Compile, Relate, Resolve]
Conclusions
Content conversion is an unavoidable undertaking
  Performance Support Portals demand high-precision content
Content conversion is a challenging undertaking
  Particularly given the precision being demanded of the results
Content conversion is a manageable undertaking
  Guided automation
    Substantially reduces costs
    Dramatically improves quality
  But there is no magic…
Your Dreams Can Come True