Taxonomy Strategies LLC
May 22, 2005 Copyright 2005 Taxonomy Strategies LLC. All rights reserved.
Workshop: Why and How to Use Dublin Core for Enterprise-Wide
Metadata Applications
Ron Daniel & Joseph Busch
Taxonomy Strategies
TAXONOMY STRATEGIES LLC The business of organized information
Workshop goals
1. What is the Dublin Core?
2. Answer these enterprise-wide metadata ROI questions:
• What is the value proposition for adding metadata to content?
• Does metadata make content reusable? Findable? Improve productivity?
• How can metadata value be measured in a way that quantifies how it contributes to the bottom line?
3. Answer these business process questions:
• How is Dublin Core tagging being done on content to expose metadata to portals, search engines, and other metadata-aware applications?
• How are metadata value spaces (controlled vocabularies) maintained within an enterprise? Across enterprises?
4. Answer these technology questions:
• What tools exist to use Dublin Core and other metadata standards in enterprise information management environments?
Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn
Who we are: Joseph Busch
Over 25 years in the business of organized information
• Founder, Taxonomy Strategies
• Director, Solutions Architecture, Interwoven
• VP, Infoware, Metacode Technologies (acquired by Interwoven, November 2000)
• Program Manager, Getty Foundation
• Manager, Pricewaterhouse
Metadata and taxonomies community leadership
• President, American Society for Information Science & Technology
• Director, Dublin Core Metadata Initiative
• Adviser, National Research Council Computer Science and Telecommunications Board
• Reviewer, National Science Foundation Division of Information and Intelligent Systems
• Founder, Networked Knowledge Organization Systems/Services
Who we are: Ron Daniel, Jr.
Over 15 years in the business of metadata & automatic classification
• Principal, Taxonomy Strategies
• Standards Architect, Interwoven
• Senior Information Scientist, Metacode Technologies (acquired by Interwoven, November 2000)
• Technical Staff Member, Los Alamos National Laboratory
Metadata and taxonomies community leadership
• Chair, PRISM (Publishers Requirements for Industry Standard Metadata) working group
• Acting chair: XML Linking working group
• Member: RDF working groups
• Co-editor: PRISM, XPointer, 3 IETF RFCs, and Dublin Core 1 & 2 reports.
Recent & current projects
Government
• Commodity Futures Trading Commission
• Defense Intelligence Agency
• ERIC
• Federal Aviation Administration
• Federal Reserve Bank of Atlanta
• Forest Service
• GSA Office of Citizen Services (www.firstgov.gov)
• Head Start
• Infocomm Development Authority of Singapore
• NASA (nasataxonomy.jpl.nasa.gov)
• Small Business Administration
• Social Security Administration
• USDA Economic Research Service
• USDA e-Government Program (www.usda.gov)
Commercial
• Allstate Insurance
• Blue Shield of California
• Debevoise & Plimpton
• Halliburton
• Hewlett Packard
• Motorola
• PeopleSoft
• Pricewaterhouse Coopers
• Siderean Software
• Sprint
• Time Inc.
Commercial subcontracts
• Agency.com – Top financial services
• Critical Mass – Fortune 50 retailer
• Deloitte Consulting – Big credit card
• Gistics/OTB – Direct selling giant
NGOs
• CEN
• IDEAlliance
• IMF
• OCLC
What we do
Organize Stuff
Who are you? Tell us:
• Your name
• Your organization
• Your job title
• The things you want to get from this workshop
Metadata: Different definitions
Library & Information Science
• Author/Title/Subject
• Controlled Vocabularies for Subject Codes (e.g. Dewey)
• Authority Files for Author Names
Database
• Tables/Columns/Datatypes/Relationships
• References for some values
Metadata: Why it matters
“Adding metadata to unstructured content allows it to be managed like structured content. Applications that use structured content work better.”
“Enriching content with structured metadata is critical for supporting search and personalized content delivery.”
“Content that has been adequately tagged with metadata can be leveraged in usage tracking, personalization and improved searching.”
“Better structure equals better access: Taxonomy serves as a framework for organizing the ever-growing and changing information within a company. The many dimensions of taxonomy can greatly facilitate Web site design, content management, and search engineering. If well done, taxonomy will allow for structured Web content, leading to improved information access.”
Metadata: Supports core functions
• Asset metadata – Who: Creator, Publisher, Contributor, Type, Format, Identifier
• Subject metadata – What, Where & Why: Subject, Title, Description, Coverage
• Relational metadata – Links between and to: Source, Relation
• Use metadata – When & How: Date, Language, Rights
[Chart axes: Enabled Functionality, Complexity. Callouts: More efficient editorial process; Better navigation & discovery]
http://dublincore.org/documents/dces/
What is a taxonomy? Systematics view
Hierarchical classification of things into a tree structure
Linnaeus …
Kingdom > Phylum > Class > Order > Family > Genus > Species
Animalia
  Chordata
    Mammalia
      Carnivora
        Canidae
          Canis
            C. familiaris
UNSPSC …
Segment > Family > Class > Commodity
44-Office Equipment and Accessories and Supplies
  .12-Office Supplies
    .17-Writing Instruments
      .05-Mechanical pencils
      .06-Wooden pencils
      .07-Colored pencils
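The UNSPSC levels above nest by position, so a code can be split back into its parent categories. A minimal sketch, assuming the four two-digit levels are written as one 8-digit string (so the wooden pencils entry on the slide becomes 44121706); the function names are illustrative only:

```python
LEVELS = ("segment", "family", "class", "commodity")


def split_unspsc(code: str) -> dict:
    """Split an 8-digit UNSPSC-style code into its four two-digit levels."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"expected 8 digits, got {code!r}")
    return {name: code[i * 2:i * 2 + 2] for i, name in enumerate(LEVELS)}


def ancestors(code: str) -> list:
    """Return the zero-padded code for each level, broadest first."""
    split_unspsc(code)  # reuse the validation above
    return [code[:n].ljust(8, "0") for n in (2, 4, 6, 8)]


wooden_pencils = "44121706"
print(split_unspsc(wooden_pencils))
print(ancestors(wooden_pencils))
```

Because every ancestor is recoverable from the code itself, content tagged at the commodity level can be rolled up to the class, family, or segment without storing extra metadata.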
Dublin Core: A little more complicated
Elements
1. Identifier
2. Title
3. Creator
4. Contributor
5. Publisher
6. Subject
7. Description
8. Coverage
9. Format
10. Type
11. Date
12. Relation
13. Source
14. Rights
15. Language
Refinements
Abstract, Access rights, Alternative, Audience, Available, Bibliographic citation, Conforms to, Created, Date accepted, Date copyrighted, Date submitted, Education level, Extent, Has format, Has part, Has version, Is format of, Is part of, Is referenced by, Is replaced by, Is required by, Issued, Is version of, License, Mediator, Medium, Modified, Provenance, References, Replaces, Requires, Rights holder, Spatial, Table of contents, Temporal, Valid
Encodings
Box, DCMIType, DDC, IMT, ISO3166, ISO639-2, LCC, LCSH, MESH, Period, Point, RFC1766, RFC3066, TGN, UDC, URI, W3CDTF
Types
Collection, Dataset, Event, Image, Interactive Resource, Moving Image, Physical Object, Service, Software, Sound, Still Image, Text
Dublin Core framework for corporate use
• Not just 15 elements
• A framework to enable cross-resource exploration and use
Dublin Core is a framework for “integration metadata” at BellSouth
Source: Todd Stephens, BellSouth
Metadata: A data specification – a recipe example
Element | Data Type | Length | Req. / Repeat | Source | Purpose | Dublin Core
Asset Metadata
Unique ID | Integer | Fixed | 1 | System supplied | Basic accountability | dc:identifier
Recipe Title | String | Variable | 1 | Licensed Content | Text search & results display | dc:title
Recipe summary | String | Variable | 1 | Licensed Content | Content | dc:description
Main Ingredients | List | Variable | ? | Main Ingredients vocabulary | Key index to retrieve & aggregate recipes, & generate shopping list | X
Subject Metadata
Meal Types | List | Variable | * | Meal Types vocab | Browse or group recipes & filter search results | X
Cuisines | List | Variable | * | Cuisines | | X
Courses | List | Variable | * | Courses vocab | | X
Cooking Method | Flag | Fixed | * | Cooking vocab | | X
Link Metadata
Recipe Image | Pointer | Variable | ? | Product Group | Merchandize products | dcterms:hasPart
Use Metadata
Rating | String | Variable | 1 | Licensed Content | Filter, rank, & evaluate recipes |
Release Date | Date | Fixed | 1 | Product Group | Publish & feature new recipes | dc:date
Legend: ? – 1 or more; * – 0 or more
dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”
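The recipe specification can be read as a crosswalk from local fields to Dublin Core elements. A minimal sketch of that idea, assuming the field keys below (they are hypothetical spellings of the slide's element names) and mapping only the five fields for which the slide names a Dublin Core element; the sample recipe is invented for illustration:

```python
# Local field -> Dublin Core element, for the fields the slide maps explicitly.
RECIPE_TO_DC = {
    "unique_id": "dc:identifier",
    "recipe_title": "dc:title",
    "recipe_summary": "dc:description",
    "recipe_image": "dcterms:hasPart",
    "release_date": "dc:date",
}

# Values the slide gives as constant for every recipe.
CONSTANTS = {"dc:type": "recipe", "dc:format": "text/html", "dc:language": "en"}


def to_dublin_core(recipe: dict) -> dict:
    """Build a Dublin Core view of a recipe; unmapped fields are left out."""
    record = dict(CONSTANTS)
    for field, element in RECIPE_TO_DC.items():
        if field in recipe:
            record[element] = recipe[field]
    return record


sample = {  # hypothetical record
    "unique_id": 1042,
    "recipe_title": "Lemon Tart",
    "cuisines": ["French"],  # custom element, no Dublin Core mapping here
    "release_date": "2005-05-22",
}
print(to_dublin_core(sample))
```

Fields without a mapping stay in the local schema, which avoids forcing them into an element that does not fit.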
Why Dublin Core?
Dublin Core is a de-facto standard across many other systems and standards
• RSS (1.0), OAI
• Inside organizations – portals, CMS, …
Mapping to DC elements from most existing schemes is simple
• Beware of force-fits
Why will metadata already exist?
• Because of search projects, portal integration projects, etc. that are creating it or standardizing a mapping.
[Diagram layers: Per-Source Data Types, Access Controls, etc.; Dublin Core and Similar; Taxonomies, Vocabularies, Ontologies]
Source: Todd Stephens, BellSouth
Creator
“An entity primarily responsible for making the content of the resource”
In other words – Author, Photographer, Illustrator, …
• Potential refinements by creative role
• Rarely justified
Creators can be persons or organizations
Key Point – Reminder: Name variations are a big issue in data quality:
• Ron Daniel
• Ron Daniel, Jr.
• Ron Daniel Jr.
• R.E. Daniel
• Ronald Daniel
• Ronald Ellison Daniel, Jr.
• Daniel, R.
Name fields may contain other information
<dc:creator>Case, W. R. (NASA Goddard Space Flight Center, Greenbelt, MD, United States)</dc:creator>
Best practice – Validate names against LDAP or other “Authority File”
Refinements: None
Encodings: None
Example – Name mismatches
One of these things is not like the other:
Ron Daniel, Jr. and Carl Lagoze; “Distributed Active Relationships in the Warwick Framework”
Hojung Cha and Ron Daniel; “Simulated Behavior of Large Scale SCI Rings and Tori”
Ron Daniel; “High Performance Haptic and Teleoperative Interfaces”
Differences may not matter
If they do
• This error cannot be reliably detected automatically
• Authority files and an error-correction procedure are needed
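One way to picture the authority-file recommendation is a lookup from known name variants to a single preferred form. A minimal sketch, assuming an in-memory authority file (a real one would live in LDAP or a similar directory) and a hypothetical preferred heading; names that do not match are returned as None so they can go to a manual correction queue:

```python
import re

# Hypothetical authority file: preferred form -> known variants.
AUTHORITY_FILE = {
    "Daniel, Ron, Jr.": [
        "Ron Daniel", "Ron Daniel, Jr.", "Ron Daniel Jr.", "R.E. Daniel",
        "Ronald Daniel", "Ronald Ellison Daniel, Jr.", "Daniel, R.",
    ],
}


def name_key(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


INDEX = {}
for preferred, variants in AUTHORITY_FILE.items():
    for form in [preferred, *variants]:
        INDEX[name_key(form)] = preferred


def resolve(name: str):
    """Return the preferred form, or None when a person must review the name."""
    return INDEX.get(name_key(name))


for raw in ["ron daniel jr", "R.E. Daniel", "Hojung Cha"]:
    print(raw, "->", resolve(raw))
```

A string match still cannot tell two different people named Ron Daniel apart, which is why the slide pairs the authority file with an error-correction procedure rather than relying on automated detection.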
Contributor
“An entity responsible for making contributions to the content of the resource.”
In practice – rarely used.
• Difficult to distinguish from Creator.
• Adds UI complexity for no real gain
Best Practice?
Recommendation – Don’t use.
Refinements: None
Encodings: None
Publisher
“An entity responsible for making the resource available”.
Problems:
• All the name-handling stuff of Creator.
• Hierarchy of publishers (Bureau, Agency, Department, …)
Refinements: None
Encodings: None
Title
“A name given to the resource”.
Issues:
• Hierarchical Titles, e.g. Conceptual Structures: Information Processing in Mind and Machine (The Systems Programming Series)
• Untitled Works
• Metaphysics
Refinements: Alternative
Encodings: None
Identifier
“An unambiguous reference to the resource within a given context”
Best Practice: URL
Future Best Practice: URI?
Problems
• Metaphysics
• Personalized URLs
• Multiple identifiers for same content
• Non-standard resolution mechanisms for URIs
Recommendations – Plan how to introduce long-lived URLs
Refinements: Bibliographic Citation
Encodings: URI
Date
“A date associated with an event in the life cycle of the resource”
Woefully underspecified.
Typically the publication or last modification date.
Best practice: YYYY-MM-DD
Refinements: Created, Valid, Available, Issued, Modified, Date Accepted, Date Copyrighted, Date Submitted
Encodings: DCMI Period, W3CDTF (Profile of ISO 8601)
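The YYYY-MM-DD recommendation is easy to enforce at tagging time. A minimal sketch, assuming only complete calendar dates are wanted (W3CDTF also allows coarser and finer granularities, which this check deliberately rejects); the function name is illustrative:

```python
import re
from datetime import date

DAY_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_full_date(value: str) -> bool:
    """True when value is a real calendar date written as YYYY-MM-DD."""
    match = DAY_PATTERN.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects impossible dates such as 2005-02-30
    except ValueError:
        return False
    return True


for candidate in ["2005-05-22", "05/22/2005", "2005-02-30", "2005-5-22"]:
    print(candidate, is_full_date(candidate))
```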
Subject
The topic of the content of the resource.
Best practice: Use pre-defined subject schemes, not user-selected keywords.
• Supported Encodings probably not useful for most corporate needs
Factor “Subject” into separate facets.
• People, places, organizations, events, objects, services
• Industry sectors
• Content types, audiences, functions
• Topic
Some of the facets are already defined in DC (Coverage, Type) or DCTERMS (Audience)
Refinements: None
Encodings: DDC, LCC, LCSH, MESH, UDC
Coverage
“The extent or scope of the content of the resource”.
In other words – places and times as topics.
Key Point – Locations important in SOME environments, irrelevant in others.
• Time periods as subjects rarely important in commercial work.
Best Practice – ISO 3166-1, 3166-2
Refinements: Spatial, Temporal
Encodings: Box (for Spatial), ISO3166 (for Spatial), Point (for Spatial), TGN (for Spatial), W3CDTF (for Temporal)
Description
“An account of the content of the resource”.
In other words – an abstract or summary
Key Point – What’s the cost/benefit tradeoff for creating descriptions?
• Quality of auto-generated descriptions is low
• For search results, hit highlighting is probably better
Refinements: Abstract, Table of Contents
Encodings: None
Type
“The nature or genre of the content of the resource”
Best Current Practice: Create a custom list of content types, use that list for the values.
• Try to avoid “image”, “audio”, and other format names in the list of content types; they can be derived from “Format”.
No broadly-acceptable list yet found.
Refinements: None
Encodings: DCMI Type
Format
“The physical or digital manifestation of the resource.”
In other words – the file format
Best practice: Internet Media Types
Outliers: File sizes, dimensions of physical objects
Refinements: Extent, Medium
Encodings: IMT
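Internet Media Types can usually be derived rather than typed in by hand. A minimal sketch using Python's standard mimetypes module, which guesses from the file extension only; the helper name dc_format is an assumption made for this example:

```python
import mimetypes


def dc_format(filename: str):
    """Guess an Internet Media Type from the file extension, or None if unknown."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


for name in ["annual-report.pdf", "index.html", "notes.zzzunknown"]:
    print(name, "->", dc_format(name))
```

Because the guess looks only at the extension, a mislabeled file gets the wrong value, so a content-based check may still be worth adding in a QC step.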
Language
“A language of the intellectual content of the resource”.
Best Practice: ISO 639, RFC 3066
Dialect codes: Advanced practice
Refinements: None
Encodings: ISO639-2, RFC1766, RFC3066
Relation
“A reference to a related resource”
Very weak meaning – not even as strong as “See also”.
Best practice: Use a refinement element and URLs.
Refinements: Is Version Of, Has Version, Is Replaced By, Replaces, Is Required By, Requires, Is Part Of, Has Part, Is Referenced By, References, Is Format Of, Has Format, Conforms To
Encodings: URI
Source
“A reference to a resource from which the present resource is derived”
Original intent was for derivative works
Frequently abused to provide bibliographic information for items extracted from a larger work, such as articles from a Journal
Refinements: None
Encodings: URI
Rights
“Information about rights held in and over the resource”
Could be a copyright statement, or a list of groups with access rights, or …
Refinements: Access Rights, License
Encodings: None
CEN/ISSS Workshop on Dublin Core. Guidance information for the deployment of Dublin Core metadata in Corporate Environments
http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/cwa/cwa15247.asp
Dublin Core: CEN/ISSS Workshop on Dublin Core Metadata – corporate uses
• Applied Information Technique
• AstraZeneca
• BBC
• BellSouth
• Cisco
• Daimler Chrysler
• Giunti Labs
• GSK
• Halliburton
• HP
• IBM
• Intel
• John Wiley & Sons
• Lilly
• PeopleSoft
• Rohm Haas
• SAP
• Software AG
• Unisys
How is Dublin Core used in corporate environments?
[Bar chart, 0% to 60% scale]
• De facto: 57%
• Simple: 43%
• Access enabler: 43%
• Compliance: 29%
Base: 20 corporate information managers
Source: CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments
Taxonomy: e-Forms example
[Diagram: seven facets, each with its own controlled vocabulary]
• Agency: 0001 Legislative; 1000 Judicial; 1100 Executive Office of Pres; 0003 Exec Depts (1200 Agriculture, 1300 Commerce, 9700 Defense, 9100 Education, 8900 Energy, 7500 HHS, 7000 DHS, 8600 HUD, 1400 Interior, 1500 Justice, 1600 Labor, 1900 State, 6900 Transport, 2000 Treasury, 3600 Veterans); Ind Agencies; Intl Orgs
• Form Type: Application; Approval; Claim; Information request; Information submission; Instructions; Legal filing; Payment; Procurement; Renewal; Reservation; Service request; Test; Other input; Other transaction
• Keyword Topic: Agriculture & food; Commerce; Communications; Education; Energy; Env pro; Foreign rels; Govt; Health & safety; Housing & comm dev; Labor; Law; Named grps; National def; Nat resources; Recreation; Sci & tech; Social pgms; Transport
• Audience: All; General; Citizen; Business; Govt; Employee; Native American; Non-resident; Tourist; Special group
• Industry Impact: 00 Generic; 11 Agriculture; 21 Mining; 22 Utilities; 23 Construct; 31-33 Manuf; 42 Wholesale; 44-45 Retail; 48-49 Trans; 51 Info; 52 Finance; 54 Profession; 55 Mgmt; 56 Support; 61 Education; 62 Health Care; 71 Arts; 72 Hospitality; 81 Other Services; 92 Public Admin
• Jurisdiction: Federal; State +; Local +; Other +
• BRM Impact: Citizen Srvcs; Social Srvs; Defense; Disasters; Econ Dev; Education; Energy; Env Mgmt; Law Enf; Judicial; Correctional; Health; Security; Income Sec; Intelligence; Intl Affairs; Nat Resour; Transport; Workforce; Science; Delivery; Support; Management
How is Dublin Core extended?
[Bar chart, 0% to 120% scale]
• Doc Types: 100%
• Products & Services: 86%
• Roles: 57%
• Inconsistent Encoding: 57%
Base: 20 corporate information managers
Source: CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments
Custom business process document types? Ouch!
Oil & gas services company document types
analysis, appraisals, assessments, forecasts, predictions
agendas, plans, designs, schedules, workflow
applications, proposals, requests, requirements
permits, consents, approvals, rejections, certificates
work orders, correspondence
auditing, compliance, testing, inspections, operations reports
lessons learned, after-action reviews, meeting minutes, FAQs
policies, procedures, training manuals, standards, best practices
research notes, journal articles
newsletters, bulletins, press releases
ads, brochures, data sheets, technical notes, case studies, price lists
checklists, templates, forms, logos, branding
software, database forms
The power of taxonomy facets
4 independent categories of 10 nodes each have the same discriminatory power as one hierarchy of 10,000 nodes (10^4)
• Easier to maintain
• Can be easier to navigate
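The facet arithmetic can be checked by enumerating the combinations. A minimal sketch, assuming four placeholder facets (A to D) with ten made-up values each:

```python
from itertools import product

# Four hypothetical facets with ten values each.
facets = {name: [f"{name}-{i}" for i in range(10)] for name in ("A", "B", "C", "D")}

# Terms the taxonomy team has to maintain: 10 + 10 + 10 + 10.
nodes_to_maintain = sum(len(values) for values in facets.values())

# Distinct combinations a piece of content can be tagged with: 10 * 10 * 10 * 10.
combinations = sum(1 for _ in product(*facets.values()))

print(nodes_to_maintain, combinations)
```

Forty maintained terms distinguish as many cases as a single 10,000-node hierarchy, which is the maintenance argument the slide makes.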
Taxonomic metadata example: Form SS-4. Employer Identification Number (EIN)
Facet | Values
Agency | IRS
Content Type | Information Submission
Industry Impact | Generic
Jurisdiction | Federal
Programs & Services | Support Delivery of Services/General Government/Taxation Management
Keyword Topic | Commerce/Employment taxes
Audience | Business
Fundamentals of metadata ROI
Tagging content using metadata and a taxonomy is a cost, not a benefit.
There is no benefit without exposing the tagged content to users in some way that cuts costs or improves revenues.
Putting metadata and a taxonomy into operation requires UI changes and/or backend system changes, as well as data changes.
You need to determine those changes, and their costs, as part of the ROI.
Common metadata ROI scenarios
Catalog site
• Increased sales.
• Increased productivity.
Customer support
• Cutting costs.
• Increased sales.
Compliance
• Avoiding penalties.
Knowledge worker productivity
• Less time searching, more time working.
Executive Mandate
• No ROI study, just someone with a vision and a budget.
Metadata ROI: Catalog site
Guided Navigation
• 2-3 clicks to product
• No dead ends
http://www.tesco.com/winestore
Metadata ROI: Catalog site
Increased sales
• Product findability.
• Product cross-sells and up-sells.
• Customer loyalty.
1-5% increase in sales
• $57.6B sales (’04): $600M to $2B/year
• $2.1B net income (’04): $21M to $105M/year
1-5% increase in productivity
• $50K average cost per employee
• 310,400 employees (’04)
• $155M to $776M/year
Enterprise portal cost
• $6M
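The ranges on this slide are a percentage band applied to a base figure. A minimal sketch that reproduces the arithmetic from the slide's inputs; the helper name pct_range is hypothetical, and integer dollars are used to avoid rounding noise:

```python
M = 1_000_000


def pct_range(base: int, low_pct: int, high_pct: int) -> tuple:
    """Return (low, high) dollar amounts for a percentage band of a base amount."""
    return base * low_pct // 100, base * high_pct // 100


sales = 57_600 * M                # $57.6B sales ('04)
net_income = 2_100 * M            # $2.1B net income ('04)
workforce_cost = 310_400 * 50_000  # employees x $50K average cost

sales_gain = pct_range(sales, 1, 5)
income_gain = pct_range(net_income, 1, 5)
productivity_gain = pct_range(workforce_cost, 1, 5)

for label, (low, high) in [("sales", sales_gain), ("net income", income_gain),
                           ("productivity", productivity_gain)]:
    print(f"{label}: ${low / M:,.1f}M to ${high / M:,.1f}M per year")
```

The net income and productivity ranges match the slide. For sales, the exact arithmetic gives $576M to $2.88B, so the slide's $600M to $2B appears to be a rounded figure.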
Metadata ROI: Customer support model
Policy categories for browsing
Type and go to search for specific policies
Good search results for policy topics, e.g., “pets”
Refine search offered with results
Help on search page, not a click away.
Metadata ROI: Customer support model
Self service
• Fewer customer calls.
• Faster, more accurate CSR responses through better information access.
25-50% service efficiency increase
• 300K customer service calls per month
• $6 cost per call
• $5.4M to $10.8M/yr
Manual processing
• 100,000 documents
• 2 pages per document
• $4 per page
• $800K
1-5% increased sales
• $18.6B sales (’04): $186M to $930M/year
• ($761M) net income (’04): ($575M) to $169M/year
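The customer support figures follow from the inputs on the slide. A minimal sketch of the arithmetic, assuming the call savings are annualized by multiplying the monthly volume by 12 (the slide does not state this, but the numbers agree) and that the sales gain is added directly to net income:

```python
M = 1_000_000

# Call handling: 300K calls per month at $6 per call.
annual_call_cost = 300_000 * 6 * 12
call_savings = (annual_call_cost * 25 // 100, annual_call_cost * 50 // 100)

# Manual processing: documents x pages per document x cost per page.
manual_processing_cost = 100_000 * 2 * 4

# Sales uplift of 1-5% on $18.6B, applied to a ($761M) net loss.
sales = 18_600 * M
sales_gain = (sales * 1 // 100, sales * 5 // 100)
net_income = -761 * M
adjusted_income = (net_income + sales_gain[0], net_income + sales_gain[1])

print(call_savings, manual_processing_cost, sales_gain, adjusted_income)
```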
Metadata ROI: Compliance
Avoiding penalties for breaching regulations
• SOX: up to 5 years in jail
• SOX: up to $5M
Following required procedures
• Loss of company: $100B revenue (’00)
• Loss of partner companies: Arthur Andersen
$100B
[Chart segments: Searching; Creating; Communicating]
Knowledge workers spend up to 2.5 hours each day looking for information …
… But find what they are looking for only 40% of the time.
– Kit Sims Taylor
High cost of not finding information
“The amount of time wasted in futile searching for vital information is enormous, leading to staggering costs …”
— Sue Feldman,
High cost of poor classification
Poor classification costs a 10,000 user organization $10M each year, about $1,000 per employee.
— Jakob Nielsen, useit.com
But “better search” itself is a weak ROI
[Chart segments: Creating new content; Recreating existing content; Searching; Communicating. Values shown: 26%, 9%]
Knowledge workers spend more time re-creating existing content than creating new content
– Kit Sims Taylor
Metadata ROI: Productivity
Decreased cost to market
• Decreased development cost
• Increased R&D productivity
• Reduced time for sales & marketing
1-5% decrease in drug development cost
• $800M/drug
• $8M to $16M/drug
5-10% increase in R&D productivity
• 13% of revenue
• $39B in sales (’04)
• $254M to $507M/year
10-20% decrease in time for sales & marketing
• 13% of revenue
• $254M to $507M/year
Enterprise document management system cost
• $10M
Metadata FAQ: Executive mandate is key
There is no ROI out of the box
• Just someone with a vision
• …and the budget to make it happen.
What’s really needed?
• Demos and proofs of value.
• So that a stronger cost benefit argument can be made for continuing the work
Metadata FAQ: How do you sell it?
Don’t sell “metadata” or “taxonomy”, sell the vision of what you want to be able to do.
Clearly understand what the problem is and what the opportunities are.
Do the calculus (costs and benefits)
Design the taxonomy (in terms of LOE) in relation to the value at hand.
Overview of metadata practices
Identify the team
Use (or map to) Dublin Core for basic information.
Extend with custom elements for specific facts.
Use pre-existing, standard vocabularies as much as possible.
• ISO country codes for locations
• Product & service info from ERP system
• Validate author names with LDAP directory
Design a QC Process
• Start with an error-correction process, then get more formal on error detection
• Large-scale ontologies may be valuable in automated error detection
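A first error-detection pass in a QC process can be as simple as checking tagged values against the controlled vocabularies. A minimal sketch, assuming tiny sample vocabularies (a few ISO 3166-1 alpha-2 country codes and a hypothetical content type list) held in memory:

```python
# Sample vocabularies; a real system would load them from a vocabulary manager.
VOCABULARIES = {
    "dc:coverage": {"US", "SG", "GB"},                # ISO 3166-1 alpha-2 samples
    "dc:type": {"Form", "Policy", "Press Release"},  # hypothetical content types
}


def find_errors(record: dict) -> list:
    """Return (element, value) pairs whose value is not in the vocabulary."""
    errors = []
    for element, allowed in VOCABULARIES.items():
        values = record.get(element, [])
        if isinstance(values, str):
            values = [values]
        errors.extend((element, value) for value in values if value not in allowed)
    return errors


record = {"dc:title": "Travel policy", "dc:coverage": ["US", "UK"], "dc:type": "Policy"}
print(find_errors(record))
```

The flagged pairs feed the error-correction process; "UK" is caught here because the ISO 3166-1 code for the United Kingdom is GB.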
Factor “Subject” into smaller facets
Size
• DMOZ tries to organize all web content, has more than 600k categories!
• Difficulty in navigating, maintaining
Hidden facet structure
• “Classification Schemes” vs. “Taxonomies”
Sources for 7 common vocabularies
Vocabulary | Definition | Potential Sources | Dublin Core
Organization | Organizational structure. | FIPS 95-2, U.S. Government Manual, Your organizational structure, etc. | dc:publisher
Content Type | Structured list of the various types of content being managed or used. | DC Types, AGLS Document Type, AAT Information Forms, Records management policy, etc. | dc:type
Industry | Broad market categories such as lines of business, life events, or industry codes. | FIPS 66, SIC, NAICS, etc. |
Location | Place of operations or constituencies. | FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service, etc. | dc:coverage
Function | Functions and processes performed to accomplish mission and goals. | FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc. |
Topic | Business topics relevant to your mission and goals. | Federal Register Thesaurus, NAL Agricultural Thesaurus, LCSH, etc. | dc:subject
Audience | Subset of constituents to whom a piece of content is directed or intended to be used. | GEM, ERIC Thesaurus, IEEE LOM, etc. | dcterms:audience
Products and Services | Names of products/programs & services. | ERP system, Your products and services, etc. |
Cheap and Easy Metadata
Some fields will be constant across a collection.
In the context of a single collection those kinds of elements add no value, but they add tremendous value when many collections are brought together into one place, and they are cheap to create and validate.
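Collection-level constants can be stamped onto every record when collections are merged. A minimal sketch, assuming records are plain dictionaries and that a value set on an individual item should win over the collection default; the collection and its values are invented for illustration:

```python
def apply_collection_defaults(records: list, defaults: dict) -> list:
    """Return new records with collection-wide values filled in where missing."""
    return [{**defaults, **record} for record in records]


# Hypothetical collection: every item shares publisher, type, and language.
press_release_defaults = {
    "dc:publisher": "Example Corp. Communications",
    "dc:type": "Press Release",
    "dc:language": "en",
}
items = [
    {"dc:title": "Q1 results"},
    {"dc:title": "Résultats T1", "dc:language": "fr"},  # item-level override
]

for tagged in apply_collection_defaults(items, press_release_defaults):
    print(tagged)
```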
Taxonomy Business Processes
• Taxonomies must change, gradually, over time if they are to remain relevant
• Maintenance processes need to be specified so that the changes are based on rational cost/benefit decisions
• A team will need to maintain the taxonomy on a part-time basis
• Taxonomy team reports to some other steering committee
Controlled Vocabulary Governance Environment
Definitions about the Controlled Vocabulary Governance Environment
[Diagram components]
• Syndicated Terminologies: ISO3166-1; Other External; ERP; Other Internal
• Vocabulary Management System: CVs; Other Controlled Items; Custodians
• Published CVs and STs; Notifications; Change Requests & Responses
• Consuming Applications: Intranet Search; Intranet Nav.; Web CMS; DAM; Archives; ERMS; …
Process
1: Syndicated Terminologies change on their own schedule
2: CV Team decides when to update CVs
3: Team adds value via mappings, translations, synonyms, training materials, etc.
4: Updated versions of CVs published to consuming applications
Other Controlled Items
Taxonomy Team will have additional items to manage:
• Charter, Goals, Performance Measures
• Editorial rules
• Team processes
• Tagger training materials (manual and automatic)
• Outreach & ROI
  • Communication plan
  • Website
  • Presentations
  • Announcements
• Roadmap
Taxonomy governance | Generic team charter
Taxonomy Team is responsible for maintaining:
• The Taxonomy, a multi-faceted classification scheme
• Associated taxonomy materials, such as:
  • Editorial Style Guide
  • Taxonomy Training Materials
  • Metadata Standard
• Team rules and procedures (subject to CIO review)
Team evaluates costs and benefits of suggested changes
Taxonomy Team will:
• Manage relationship between providers of source vocabularies and consumers of the Taxonomy
• Identify new opportunities for use of the Taxonomy across the Enterprise to improve information management practices
• Promote awareness and use of the Taxonomy
67TAXONOMY STRATEGIES LLC The business of organized information
Other Controlled Items - Editorial RulesTo ensure consistent style, rules are needed
Issues commonly addressed in the rules: Sources of Terms Abbreviations Ampersands Capitalization Continuations (More… or Other…) Duplicate Terms Hierarchy and Polyhierarchy Languages and Character Sets Length Limits “Other” – Allowed or Forbidden? Plural vs. Singular Forms Relation Types and Limits Scope Notes Serial Comma Spaces Synonyms and Acronyms Term Arrangement (Alphabetic or …) Term Label Order (Direct vs. Inverted)
Must also address issue of what to do when rules conflict – which are more important?
Rule Name | Editorial Rule
Use Existing Vocabularies | Other things being equal, reusing an existing vocabulary is preferred to creating a new one.
Ampersands | The character '&' is preferred to the word ‘and’ in Term Labels. Example: Use Type: “Manuals & Forms”, not “Manuals and Forms”.
Special Characters | Retain accented characters in Term Labels. Example: España
Serial comma | If a category name includes more than two items, separate the items by commas. The last item is separated by the character ‘&’ which IS NOT preceded by a comma. Example: “Education, Learning & Employment”, not “Education, Learning, & Employment”.
Capitalization | Use title case (where all words except articles are capitalized). Example: “Education, Learning & Employment”, NOT “Education, learning & employment”, NOT “EDUCATION, LEARNING & EMPLOYMENT”, NOT “education, learning & employment”.
… | …
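Editorial rules like these can be checked mechanically before a term enters the taxonomy. The sketch below is illustrative only: the function name check_term_label and the messages it returns are assumptions, not part of any tool named in this workshop. It covers just the Ampersands, Serial comma, and Capitalization rules, and its title-case test would wrongly flag all-caps acronyms.

```python
import re

# Per the Capitalization rule, only articles stay lowercase.
ARTICLES = {"a", "an", "the"}


def check_term_label(label: str) -> list[str]:
    """Return the editorial-rule violations found in one term label."""
    problems = []

    # Ampersands: '&' is preferred to the word 'and'.
    if re.search(r"\band\b", label, flags=re.IGNORECASE):
        problems.append("use '&' instead of 'and'")

    # Serial comma: the final '&' is not preceded by a comma.
    if re.search(r",\s*&", label):
        problems.append("no comma before '&'")

    # Capitalization: title case. Letter runs include accented characters,
    # so labels such as "España" pass unchanged.
    for word in re.findall(r"[^\W\d_]+", label):
        lowered = word.lower()
        if lowered in ARTICLES or lowered == "and":
            continue  # articles are exempt; 'and' is already reported above
        if not (word[0].isupper() and word[1:] == word[1:].lower()):
            problems.append("use title case")
            break

    return problems
```

A label with no violations returns an empty list, so the check can run over a whole vocabulary export and report only the labels that need editorial attention.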
Roles in Two Taxonomy Governance Teams
Executive Sponsor
  Advocate for the taxonomy team
Business Lead
  Keeps team on track with larger business objectives
  Balances cost/benefit issues to decide appropriate levels of effort
  Specialists help in estimating costs
  Obtains needed resources if those in team can’t accomplish a particular task
Technical Specialist
  Estimates costs of proposed changes in terms of amount of data to be retagged, additional storage and processing burden, software changes, etc.
  Helps obtain data from various systems
Content Specialist
  Team’s liaison to content creators
  Estimates costs of proposed changes in terms of editorial process changes, additional or reduced workload, etc.
  Small-scale Metadata QA Responsibility
Taxonomy Specialist
  Suggests potential taxonomy changes based on analysis of query logs, indexer feedback
  Makes edits to taxonomy, installs into system with aid of IT specialist
Content Owner
  Reality check on process change suggestions
Team structure at a different org.:
Business Lead
Custodians
  Responsible for content in a specific CV.
Training Representative
  Develops communications plan, training materials
Work Practices Representative
  Develops processes, monitors adherence
IT Representative
  Backups, admin of CV Tool
Info. Mgmt. Representative
  Provides CV expertise, tie-in with larger IM effort in the organization.
Taxonomy governance | Where changes come from
[Diagram. Components: End User, Application UI, Application Logic, Firewall, Content, Taxonomy, Tagging UI, Tagging Logic, Tagging Staff, Taxonomy Editor, Taxonomy Team.]
Inputs to the Taxonomy Editor:
  End User experience
  Staff notes: ‘missing’ concepts
  Query log analysis
  Requests from other parts of the organization
Recommendations by Editor
  1. Small taxonomy changes (labels, synonyms)
  2. Large taxonomy changes (retagging, application changes)
  3. New “best bets” content
Team considerations
  1. Business goals
  2. Changes in user experience
  3. Retagging cost
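Query log analysis, one of the change sources on this slide, can start as a simple count of searches that match no taxonomy term. This is a minimal sketch under stated assumptions: the function name missing_concepts, the exact-match comparison, and the min_count threshold are all illustrative, and a real log would need stemming and synonym handling.

```python
from collections import Counter


def missing_concepts(queries, known_terms, min_count=2):
    """Return (query, count) pairs for repeated queries that match no known term.

    Matching is exact after trimming whitespace and case-folding, which is a
    deliberately simple assumption for this sketch.
    """
    known = {term.casefold() for term in known_terms}
    counts = Counter(q.strip().casefold() for q in queries if q.strip())
    return [
        (query, n)
        for query, n in counts.most_common()  # most frequent first
        if query not in known and n >= min_count
    ]
```

The output is a candidate list for the Taxonomy Editor, not an automatic change: each entry still goes through the team considerations (business goals, user experience, retagging cost).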
Principles
Basic facets with identified items – people, places, projects, instruments, missions, organizations, … Note that these are not subjective “subjects”, they are objective “objects”.
Clearly identify the Custodians of the facets, and the process for maintaining and publishing them.
Subjective views can be laid on top of the objective facts, but should be in a different namespace so they are clearly distinguishable.
For example, labels like “Anarchist” or “Prime Minister” can be applied to the same person at different times (e.g. Nelson Mandela).
Enterprise Portal challenges when organizing content
Multiple subject domains across the enterprise
  Vocabularies vary
  Granularity varies
  Unstructured information represents about 80%
Information is stored in complex ways
  Multiple physical locations
  Many different formats
Tagging is time-consuming and requires SME involvement
Portal doesn’t solve content access problem
  Knowledge is power syndrome
  Incentives to share knowledge don’t exist
  Free flow of information TO the portal might be inhibited
Content silo mentality changes slowly
  What content has changed?
  What exists?
  What has been discontinued?
  Lack of awareness of other initiatives
Challenges when organizing content on enterprise portals
Lack of content standardization and consistency
  Content messages vary among departments
  How do users know which message is correct?
Re-usability low to non-existent
Costs of content creation, management and delivery may not change when portal is implemented:
  Similar subjects, BUT
    Diverse media
    Diverse tools
    Different users
How will personalization be implemented?
How will existing site taxonomies be leveraged?
Taxonomy creation may surface “holes” in content
Methods used to create & maintain metadata
[Bar chart, y-axis 0% to 80%. Categories, in extracted order: Forms, Distributed production, Centralized production, Not automated. Values, in extracted order: 71%, 57%, 43%, 43%.]
Base: 20 corporate information managers
CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments
The Tagging Problem
How are we going to populate metadata elements with complete and consistent values?
What can we expect to get from automatic classifiers?
Tagging
Province of authors (SMEs) or editors?
Taxonomy often highly granular to meet task and re-use needs.
Vocabulary dependent on originating department.
The more tags there are (and the more values for each tag), the more hooks to the content.
If there are too many, authors will resist and use “general” tags (if available)
Automatic classification tools exist, and are valuable, but results are not as good as humans can do.
  “Semi-automated” is best.
  Degree of human involvement is a cost/benefit tradeoff.
Automatic categorization vendors | Analyst viewpoint
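[Chart: vendors positioned by Accuracy Level (high to low) and Content Volumes (low to high). Vendor positions were not recoverable from the slide text.]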
Considerations in automatic classifier performance
Classification Performance is measured by “Inter-cataloger agreement”
Trained librarians agree less than 80% of the time
Errors are subtle differences in judgment, or big goofs
Automatic classification struggles to match human performance
Exception: Entity recognition can exceed human performance
Classifier performance limited by algorithms available, which is limited by development effort
Very wide variance in one vendor’s performance depending on who does the implementation, and how much time they have to do it
1) 80/20 tradeoff where 20% of effort gives 80% of performance.
2) Smart implementation of inexpensive tools will outperform naive implementations of world-class tools.
[Chart: Accuracy versus Development Effort / Licensing Expense. Labels: Regexps, Trained Librarians, potential performance gain.]
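Inter-cataloger agreement can be estimated in several ways. The sketch below uses simple percent agreement, the share of items to which two catalogers assigned the same term, which is an assumption made here for illustration rather than the measure used in the workshop. The function name agreement_rate and the item identifiers are hypothetical.

```python
def agreement_rate(tags_a: dict[str, str], tags_b: dict[str, str]) -> float:
    """Share of jointly tagged items on which two catalogers chose the same term.

    Each argument maps an item identifier to the single term one cataloger
    assigned. Items tagged by only one cataloger are ignored.
    """
    shared = tags_a.keys() & tags_b.keys()
    if not shared:
        raise ValueError("no items were tagged by both catalogers")
    matches = sum(1 for item in shared if tags_a[item] == tags_b[item])
    return matches / len(shared)
```

Running the same calculation with a classifier's output in place of one cataloger gives a figure that can be compared against the human-to-human baseline, rather than against an assumed 100%.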
Tagging tool example: Interwoven MetaTagger
Manual form fill-in w/ check boxes, pull-down lists, etc.
Auto keyword & summarization
Tagging tool example: Interwoven MetaTagger
Auto-categorization
Parse & lookup (recognize names)
Rules & pattern matching
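Rules and pattern matching, in its simplest form, maps regular expressions to taxonomy terms. The sketch below is a generic illustration and does not reflect how MetaTagger works internally; the RULES table, its patterns, and the function name suggest_tags are all assumptions.

```python
import re

# Hypothetical rule table: taxonomy term -> pattern that suggests it.
RULES = {
    "Manuals & Forms": re.compile(r"\b(manual|form)s?\b", re.IGNORECASE),
    "Education, Learning & Employment": re.compile(
        r"\b(training|course|job)s?\b", re.IGNORECASE
    ),
}


def suggest_tags(text: str) -> list[str]:
    """Return every taxonomy term whose pattern occurs in the text."""
    return [term for term, pattern in RULES.items() if pattern.search(text)]
```

The word boundaries keep "form" from matching inside words such as "information". Suggestions produced this way are candidates for a human to approve or edit, not final tags.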
Metadata tagging workflows
Even ‘purely’ automatic meta-tagging systems need a manual error correction procedure.
  Should add a QA sampling mechanism
Tagging models:
  Author-generated
  Central librarians
  Hybrid – central auto-tagging service, distributed manual review and correction
[Flowchart: Sample of ‘author-generated’ metadata workflow. Roles: Analyst, Editor, Copywriter, Tagging Tool, Sys Admin. Boxes: Compose in Template; Submit to CMS; Automatically fill-in metadata; Approve/Edit metadata; Review content; Copy Edit content. Each review step has a “Problem?” decision (Y/N). Outputs: Hard Copy, Web site.]
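A QA sampling mechanism can be as simple as drawing a random fraction of tagged items for manual review. The sketch below is a minimal illustration: the function name qa_sample, the default 10% rate, and the rule that a non-empty batch always yields at least one item are assumptions, not recommendations from the workshop.

```python
import random


def qa_sample(item_ids, rate=0.1, seed=None):
    """Pick a random subset of tagged items for manual metadata review.

    rate is the fraction of items to review; seed makes the draw repeatable
    so that a given audit can be reproduced.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in the interval (0, 1]")
    items = list(item_ids)
    if not items:
        return []
    sample_size = max(1, round(len(items) * rate))  # always review at least one
    return random.Random(seed).sample(items, sample_size)
```

The error rate found in the sample then feeds the cost/benefit decision on how much human review the workflow needs.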
Automatic categorization vendors | Pragmatic viewpoint
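[Chart: vendors positioned by Accuracy Level (high to low) and Content Volumes (low to high). Vendor positions were not recoverable from the slide text.]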
Seven practical rules for taxonomies
1. Incremental, extensible process that identifies and enables users, and engages stakeholders.
2. Quick implementation that provides measurable results as quickly as possible.
3. Not monolithic: has separately maintainable facets.
4. Re-uses existing IP as much as possible.
5. A means to an end, and not the end in itself.
6. Not perfect, but it does the job it is supposed to do, such as improving search and navigation.
7. Improved over time, and maintained.
Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Summary, Q&A
6:45 Adjourn
Summary: Categorize with a purpose
What is the problem you are trying to solve?
  Improve search
  Browse for content on an enterprise-wide portal
  Enable business users to syndicate content
  Otherwise provide the basis for content re-use
How will you control the cost of creating and maintaining the metadata needed to solve these problems?
  CMS with a metadata tagging product
  Semi-automated classification
  Taxonomy editing tools
  Guided navigation tools
Contact Info
Ron Daniel
925-368-8371
Joseph Busch
415-377-7912