Dublin Core Metadata Cheat Sheet
Brian Kennison
Created: 2013-03-22
Modified: 2013-03-20
This is my attempt to create a reference card for using Dublin Core markup.
Dublin Core Elements and Terms
Elements
Identifier
Title
Creator
Contributor
Publisher
Subject
Description
Coverage
Format
Type
Date
Relation
Source
Rights
Language
Refinements
Abstract
Access rights
Alternative
Audience
Available
Bibliographic citation
Conforms to
Created
Date accepted
Date copyrighted
Date submitted
Education level
Extent
Has format
Has part
Has version
Is format of
Is part of
Is referenced by
Is replaced by
Is required by
Issued
Is version of
License
Mediator
Medium
Modified
Provenance
References
Replaces
Requires
Rights holder
Spatial
Table of contents
Temporal
Valid
Encodings
Box
DCMIType
DDC
IMT
ISO3166
ISO639-2
LCC
LCSH
MESH
Period
Point
RFC1766
RFC3066
TGN
UDC
URI
W3CDTF
Types
Collection
Dataset
Event
Image
Interactive Resource
Moving Image
Physical Object
Service
Software
Sound
Still Image
Text
Creator
"An entity primarily responsible for making the content of the resource"
In other words - Author, Photographer, Illustrator, etc.
- Potential refinements by creative role
- Rarely justified
Creators can be persons or organizations
Key Point - Name variations are a big issue in data quality:
- Ron Daniel
- Ron Daniel, Jr.
- Ron Daniel Jr.
- R.E. Daniel
- Ronald Daniel
- Ronald Ellison Daniel, Jr.
- Daniel, R.
Name fields may contain other information
- <dc:creator>Case, W. R. (NASA Goddard Space Flight Center, Greenbelt, MD, United States)</dc:creator>
Best practice - Validate names against LDAP or other "Authority File"
Refinements
None
Encodings
None
Example - Name mismatches
One of these things is not like the other:
- Ron Daniel, Jr. and Carl Lagoze; "Distributed Active Relationships in the Warwick Framework"
- Hojung Cha and Ron Daniel; "Simulated Behavior of Large Scale SCI Rings and Tori"
- Ron Daniel; "High Performance Haptic and Teleoperative Interfaces"
Differences may not matter. If they do:
- This error cannot be reliably detected automatically
- Authority files and an error-correction procedure are needed
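The best practice above - validating names against an authority file - can be sketched as follows. The authority entries and the normalization rule here are hypothetical; a real deployment would query LDAP or a maintained name authority.

```python
# Sketch of authority-file validation for dc:creator values.
# AUTHORITY entries are illustrative only.
AUTHORITY = {
    "ron daniel jr": "Daniel, Ron, Jr.",
    "carl lagoze": "Lagoze, Carl",
}

def normalize(name):
    # Lowercase, turn punctuation into spaces, collapse whitespace.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(cleaned.split())

def validate_creator(name):
    """Return the canonical form, or None if the variant needs human review."""
    return AUTHORITY.get(normalize(name))
```

Variants like "R.E. Daniel" or "Ronald Daniel" come back as None here, which is the point: such mismatches cannot be reliably detected automatically and must be routed to an error-correction procedure.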
Contributor
"An entity responsible for making contributions to the content of the resource."
In practice - rarely used.
- Difficult to distinguish from Creator.
- Adds UI complexity for no real gain.
Best practice? Recommendation - Don't use.
Refinements
None
Encodings
None
Publisher
"An entity responsible for making the resource available".
Problems:
- All the name-handling issues of Creator.
- Hierarchy of publishers (Bureau, Agency, Department, …)
Refinements
None
Encodings
None
Title
"A name given to the resource".
Issues:
- Hierarchical titles, e.g. Conceptual Structures: Information Processing in Mind and Machine (The Systems Programming Series)
- Untitled works, e.g. Metaphysics
Refinements
Alternative
Encodings
None
Identifier
"An unambiguous reference to the resource within a given context"
Best Practice: URL
Future Best Practice: URI?
Problems
- Metaphysics
- Personalized URLs
- Multiple identifiers for same content
- Non-standard resolution mechanisms for URIs
Recommendation - Plan how to introduce long-lived URLs
Refinements
Bibliographic Citation
Encodings
URI
Date
"A date associated with an event in the life cycle of the resource"
Woefully underspecified. Typically the publication or last-modification date.
Best practice: YYYY-MM-DD
Refinements
Created
Valid
Available
Issued
Modified
Date Accepted
Date Copyrighted
Date Submitted
Encodings
DCMI Period
W3CDTF (profile of ISO 8601)
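The YYYY-MM-DD best practice is just the date portion of W3CDTF, a profile of ISO 8601, which most languages can emit directly. A minimal Python sketch (the dates are arbitrary examples):

```python
from datetime import date, datetime, timezone

# dc:date at day granularity - the recommended YYYY-MM-DD form.
day = date(2004, 3, 1).isoformat()

# Finer granularity stays within the W3CDTF profile of ISO 8601.
stamp = datetime(2004, 3, 1, 13, 30, tzinfo=timezone.utc).isoformat()
```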
Subject
"The topic of the content of the resource."
Best practice: Use pre-defined subject schemes, not user-selected keywords.
- Supported encodings are probably not useful for most corporate needs
Factor "Subject" into separate facets:
- People, places, organizations, events, objects, services
- Industry sectors
- Content types, audiences, functions
- Topic
Some of the facets are already defined in DC (Coverage, Type) or DCTERMS (Audience)
Refinements
None
Encodings
DDC
LCC
LCSH
MESH
UDC
Coverage
"The extent or scope of the content of the resource".
In other words - places and times as topics.
Key Point - Locations are important in SOME environments, irrelevant in others. Time periods as subjects are rarely important in commercial work.
Best Practice - ISO 3166-1, 3166-2
Refinements
Spatial
Temporal
Encodings
Box (for Spatial)
ISO3166 (for Spatial)
Point (for Spatial)
TGN (for Spatial)
W3CDTF (for Temporal)
Description
"An account of the content of the resource".
In other words - an abstract or summary
Key Point - What's the cost/benefit tradeoff for creating descriptions?
- Quality of auto-generated descriptions is low
- For search results, hit highlighting is probably better
Refinements
Abstract
Table of Contents
Encodings
None
Type
"The nature or genre of the content of the resource"
Best Current Practice: Create a custom list of content types and use that list for the values.
- Try to avoid "image", "audio", and other format names in the list of content types; they can be derived from "Format".
- No broadly acceptable list yet found.
Refinements
None
Encodings
DCMI Type
Format
"The physical or digital manifestation of the resource."
In other words - the file format
Best practice: Internet Media Types
Outliers: File sizes, dimensions of physical objects
Refinements
Extent
Medium
Encodings
IMT
Language
"A language of the intellectual content of the resource".
Best Practice: ISO 639, RFC 3066
Dialect codes: Advanced practice
Refinements
None
Encodings
ISO639-2
RFC1766
RFC3066
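A syntax-only check for RFC 3066 language tags (a primary subtag of up to eight letters, then optional alphanumeric subtags) can be sketched as below. It validates form only, not whether the subtags are actually registered:

```python
import re

# RFC 3066 tag syntax: 1-8 letter primary subtag, then optional
# 1-8 character alphanumeric subtags separated by hyphens.
LANG_TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")

def is_well_formed(tag):
    """True if the string matches RFC 3066 tag syntax."""
    return bool(LANG_TAG.match(tag))
```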
Relation
"A reference to a related resource"
Very weak meaning - not even as strong as "See also".
Best practice: Use a refinement element and URLs.
Refinements
Is Version Of
Has Version
Is Replaced By
Replaces
Is Required By
Requires
Is Part Of
Has Part
Is Referenced By
References
Is Format Of
Has Format
Conforms To
Encodings
URI
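Following the best practice above - a refinement element carrying a URL - a qualified record might be emitted as below. The dcterms namespace URI is the standard one; the resource URL is made up for illustration:

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dcterms", DCTERMS)

record = ET.Element("record")
rel = ET.SubElement(record, "{%s}isPartOf" % DCTERMS)
rel.text = "http://example.org/journals/kmworld/13/3"  # hypothetical URL

xml = ET.tostring(record, encoding="unicode")
```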
Source
"A reference to a resource from which the present resource is derived"
Original intent was for derivative works
Frequently abused to provide bibliographic information for items extracted from a larger work, such as articles from a journal
Refinements
None
Encodings
URI
Rights
"Information about rights held in and over the resource"
Could be a copyright statement, or a list of groups with access rights, or …
Refinements
Access Rights
License
Encodings
None
Custom business process document types? Ouch!
Oil & gas services company document types:
software, database forms
checklists, templates, forms, logos, branding
ads, brochures, data sheets, technical notes, case studies, price lists
newsletters, bulletins, press releases
research notes, journal articles
policies, procedures, training manuals, standards, best practices
lessons learned, after-action reviews, meeting minutes, FAQs
auditing, compliance, testing, inspections, operations reports
work orders, correspondence
permits, consents, approvals, rejections, certificates
applications, proposals, requests, requirements
agendas, plans, designs, schedules, workflow
analysis, appraisals, assessments, forecasts, predictions
The power of taxonomy facets
- 4 independent categories of 10 nodes each have the same discriminatory power as one hierarchy of 10,000 nodes (10^4)
- Easier to maintain
- Can be easier to navigate
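The arithmetic behind the 10,000-node claim:

```python
# Four independent facets of ten values each yield 10^4 distinct
# combinations - the discriminatory power of a 10,000-node hierarchy -
# at a maintenance cost of only 4 x 10 = 40 nodes.
facets = 4
values_per_facet = 10
combinations = values_per_facet ** facets
nodes_to_maintain = facets * values_per_facet
```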
Taxonomic metadata example: Form SS-4. Employer Identification Number (EIN)
Facet: Value
- Agency: IRS
- Content Type: Application [or Information Submission]
- Industry Impact: Generic
- Jurisdiction: Federal
- Programs & Services (BRM): Support Delivery of Services/General Government/Taxation Management
- Keyword Topic: Commerce/Employment taxes
- Audience: Business
Knowledge workers spend up to 2.5 hours each day looking for information …
… but find what they are looking for only 40% of the time.
- Kit Sims Taylor
K.S. Taylor. "The brief reign of the knowledge worker," 1998. http://online.bcc.ctc.edu/econ/kst/BriefReign/BRwebversion.htm. Cited by Sue Feldman in her original article.
High cost of not finding information
- "The amount of time wasted in futile searching for vital information is enormous, leading to staggering costs …"
- Sue Feldman
High cost of poor classification
- Poor classification costs a 10,000-user organization $10M each year - about $1,000 per employee.
- Jakob Nielsen, useit.com
But "better search" itself is a weak ROI
Sue Feldman. "The high cost of not finding information." KM World 13:3 (March 2004) http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&_ID=1725&Publication_ID=108
The Jakob Nielsen comment may be apocryphal. It was mentioned in several Delphi reports, including Taxonomy and content classification: market milestone report (2002) and Information intelligence: content classification and enterprise taxonomy practice (2004), but the original quote cannot be attributed.
Knowledge workers spend more time re-creating existing content (26%) than creating new content (9%)
- Kit Sims Taylor
K.S. Taylor. "The brief reign of the knowledge worker," 1998. http://online.bcc.ctc.edu/econ/kst/BriefReign/BRwebversion.htm. Cited by Sue Feldman in her original article.
Metadata ROI: Productivity
- Decreased cost to market / decreased development cost
  - 1-5% decrease in drug development cost ($800M/drug) → $8M to $16M/drug
- Increased R&D productivity
  - 5-10% increase in R&D productivity (13% of revenue; $39B in sales '04) → $254M to $507M/year
- Reduced time for sales & marketing
  - 10-20% decrease in time for sales & marketing (13% of revenue) → $254M to $507M/year
- Enterprise document management system cost: $10M
PBS Frontline. The Other Drug War: FAQs. (June 2003) http://www.pbs.org/wgbh/pages/frontline/shows/other/etc/faqs.html
Metadata FAQ: Executive mandate is key
- There is no ROI out of the box
- Just someone with a vision …and the budget to make it happen.
- What's really needed?
  - Demos and proofs of value, so that a stronger cost-benefit argument can be made for continuing the work.
Metadata FAQ: How do you sell it?
- Don't sell "metadata" or "taxonomy"; sell the vision of what you want to be able to do.
- Clearly understand what the problem is and what the opportunities are.
- Do the calculus (costs and benefits).
- Design the taxonomy (in terms of level of effort) in relation to the value at hand.
Sources for 7 common vocabularies
Vocabulary: Definition. Potential sources.
- Products and Services: Names of products/programs & services. Sources: ERP system, Your products and services, etc.
- Audience (dcterms:audience): Subset of constituents to whom a piece of content is directed or intended to be used. Sources: GEM, ERIC Thesaurus, IEEE LOM, etc.
- Topic (dc:subject): Business topics relevant to your mission and goals. Sources: Federal Register Thesaurus, NAL Agricultural Thesaurus, LCSH, etc.
- Function: Functions and processes performed to accomplish mission and goals. Sources: FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc.
- Location (dc:coverage): Place of operations or constituencies. Sources: FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service, etc.
- Industry: Broad market categories such as lines of business, life events, or industry codes. Sources: FIPS 66, SIC, NAICS, etc.
- Content Type (dc:type): Structured list of the various types of content being managed or used. Sources: DC Types, AGLS Document Type, AAT Information Forms, Records management policy, etc.
- Organization (dc:publisher): Organizational structure. Sources: FIPS 95-2, U.S. Government Manual, Your organizational structure, etc.
Cheap and Easy Metadata
- Some fields will be constant across a collection.
- In the context of a single collection those kinds of elements add no value. But they add tremendous value when many collections are brought together into one place, and they are cheap to create and validate.
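A sketch of the idea, assuming dictionary-shaped records; the field values are made up, not from a real collection. Collection-level constants are stamped onto each record at export time, so they cost nearly nothing to create and validate:

```python
# Collection-level constants (illustrative values only).
COLLECTION_DEFAULTS = {
    "dc:publisher": "Example Agency",
    "dc:language": "en",
    "dc:type": "Text",
}

def with_defaults(record):
    # Per-record values win; the constants fill in everything else.
    return {**COLLECTION_DEFAULTS, **record}
```

For example, with_defaults({"dc:title": "Form SS-4"}) yields a record that also carries the collection's publisher, language, and type.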
Principles
- Basic facets with identified items - people, places, projects, instruments, missions, organizations, … Note that these are not subjective "subjects"; they are objective "objects".
- Clearly identify the custodians of the facets, and the process for maintaining and publishing them.
- Subjective views can be laid on top of the objective facts, but should be in a different namespace so they are clearly distinguishable.
  - For example, labels like "Anarchist" or "Prime Minister" can be applied to the same person at different times (e.g. Nelson Mandela).
Enterprise Portal challenges when organizing content
- Multiple subject domains across the enterprise
  - Vocabularies vary
  - Granularity varies
- Unstructured information represents about 80%
- Information is stored in complex ways
  - Multiple physical locations
  - Many different formats
- Tagging is time-consuming and requires SME involvement
- Portal doesn't solve the content access problem
  - "Knowledge is power" syndrome
  - Incentives to share knowledge don't exist
  - Free flow of information TO the portal might be inhibited
- Content silo mentality changes slowly
  - What content has changed?
  - What exists?
  - What has been discontinued?
- Lack of awareness of other initiatives
The complexity of information storage makes it a significant challenge to integrate all the data stores to act as a single seamless repository.
Content silos result in poor communication among groups; lots of extra work because one group doesn't know what the other is doing or has already done.
Yahoo employs a completely manual approach to tagging. All content is considered by SMEs.
Challenges when organizing content on enterprise portals
- Lack of content standardization and consistency
  - Content messages vary among departments
  - How do users know which message is correct?
  - Re-usability low to non-existent
- Costs of content creation, management and delivery may not change when a portal is implemented:
  - Similar subjects, BUT
  - Diverse media
  - Diverse tools
  - Different users
- How will personalization be implemented?
- How will existing site taxonomies be leveraged?
- Taxonomy creation may surface "holes" in content
Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn
Methods used to create & maintain metadata
Base: 20 corporate information managers. Source: CEN/ISSS Workshop on Dublin Core, "Guidance information for the deployment of Dublin Core metadata in Corporate Environments".
Paper or web-based forms widely used:
- Distributed resource origination metadata tagging
- Centralized clean-up and metadata entry
Automated tools & applications not widely used:
- Auto-categorization tools
- Vocabulary/taxonomy editing tools
- Guided navigation applications
- Federated search and repository "wrappers"
The Tagging Problem
- How are we going to populate metadata elements with complete and consistent values?
- What can we expect to get from automatic classifiers?
Tagging
- Province of authors (SMEs) or editors?
- Taxonomy often highly granular to meet task and re-use needs.
- Vocabulary dependent on originating department.
- The more tags there are (and the more values for each tag), the more hooks into the content.
- If there are too many, authors will resist and use "general" tags (if available).
- Automatic classification tools exist, and are valuable, but their results are not as good as humans can produce.
  - "Semi-automated" is best.
  - Degree of human involvement is a cost/benefit tradeoff.
Automatic categorization vendors | Analyst viewpoint
[2x2 chart: Accuracy Level (low to high) vs. Content Volumes (low to high)]
Scalability requires simple creation of granular metadata and taxonomies.
Better content architecture means more accurate categorization, and more precise content delivery.
Surprisingly, most organizations are better off buying tools from the lower-left quadrant. Their absolute accuracy is less, but it comes with a lot of other features - UI, versioning, workflow, storage - that provide the basis for building a QA process.
Considerations in automatic classifier performance
- Classification performance is measured by "inter-cataloger agreement"
  - Trained librarians agree less than 80% of the time
  - Errors are subtle differences in judgment, or big goofs
- Automatic classification struggles to match human performance
  - Exception: Entity recognition can exceed human performance
- Classifier performance is limited by the algorithms available, which are limited by development effort
- Very wide variance in one vendor's performance depending on who does the implementation, and how much time they have to do it
- 80/20 tradeoff where 20% of effort gives 80% of performance.
- Smart implementation of inexpensive tools will outperform naive implementations of world-class tools.
[Chart: Accuracy vs. Development Effort / Licensing Expense, ranging from Regexps up to Trained Librarians, with the gap marked as potential performance gain]
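The low end of that tradeoff - regexps - is cheap but real. A minimal sketch of a rules-based tagger; the patterns and tag names here are hypothetical, and real rules would come from the organization's own vocabularies:

```python
import re

# Hypothetical regexp rules mapping patterns to content-type tags.
RULES = [
    (re.compile(r"\bForm\s+[A-Z]{1,3}-\d+\b"), "Information Submission"),
    (re.compile(r"\bpress release\b", re.IGNORECASE), "Press Release"),
]

def classify(text):
    """Return every tag whose pattern matches the text."""
    return [tag for pattern, tag in RULES if pattern.search(text)]
```

Smartly maintained, even a rule set this simple can anchor the QA process that more expensive classifiers still require.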
Tagging tool example: Interwoven MetaTagger
Manual form fill-in w/ check boxes, pull-down lists, etc.
Auto keyword & summarization
Tagging tool example: Interwoven MetaTagger
Auto-categorization
Parse & lookup (recognize names)
Rules & pattern matching
Metadata tagging workflows
- Even 'purely' automatic meta-tagging systems need a manual error-correction procedure.
  - Should add a QA sampling mechanism
- Tagging models:
  - Author-generated
  - Central librarians
  - Hybrid - central auto-tagging service, distributed manual review and correction
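A QA sampling mechanism can be as simple as drawing a random fraction of auto-tagged records for human review. A minimal sketch; the 5% default rate is an arbitrary illustration, and the right rate is itself a cost/benefit decision:

```python
import random

def qa_sample(records, rate=0.05, seed=None):
    """Draw a random fraction of auto-tagged records for human review."""
    rng = random.Random(seed)  # seed makes the draw reproducible for audits
    k = max(1, int(len(records) * rate))
    return rng.sample(records, k)
```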
[Flowchart: sample 'author-generated' metadata workflow. Roles: Analyst, Editor, Copywriter, Sys Admin. Steps: compose in template; Tagging Tool automatically fills in metadata; approve/edit metadata; submit to CMS; Editor reviews content and Copywriter copy-edits, each with a Y/N "Problem?" loop; output to hard copy and web site.]
Automatic categorization vendors | Pragmatic viewpoint
[2x2 chart: Accuracy Level (low to high) vs. Content Volumes (low to high)]
Scalability requires simple creation of granular metadata and taxonomies.
Better content architecture means more accurate categorization, and more precise content delivery.
Surprisingly, most organizations are better off buying tools from the lower-left quadrant. Their absolute accuracy is less, but it comes with a lot of other features - UI, versioning, workflow, storage - that provide the basis for building a QA process.
Seven practical rules for taxonomies
- Incremental, extensible process that identifies and enables users, and engages stakeholders.
- Quick implementation that provides measurable results as quickly as possible.
- Not monolithic - has separately maintainable facets.
- Re-uses existing IP as much as possible.
- A means to an end, not the end in itself.
- Not perfect, but it does the job it is supposed to do - such as improving search and navigation.
- Improved over time, and maintained.
Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Summary, Q&A
6:45 Adjourn
Summary: Categorize with a purpose
- What is the problem you are trying to solve?
  - Improve search
  - Browse for content on an enterprise-wide portal
  - Enable business users to syndicate content
  - Otherwise provide the basis for content re-use
- How will you control the cost of creating and maintaining the metadata needed to solve these problems?
  - CMS with metadata tagging products
  - Semi-automated classification
  - Taxonomy editing tools
  - Guided navigation tools
Contact Info
Ron Daniel
925-368-8371
rdaniel@taxonomystrategies.com
Joseph Busch
415-377-7912