Dublin Core Metadata Cheat Sheet

Brian Kennison

Created: 2013-03-22

Modified: 2013-03-20

This is my attempt to create a reference card for using Dublin Core markup.

Dublin Core Elements and Terms

Elements

Identifier

Title

Creator

Contributor

Publisher

Subject

Description

Coverage

Format

Type

Date

Relation

Source

Rights

Language

Refinements

Abstract

Access rights

Alternative

Audience

Available

Bibliographic citation

Conforms to

Created

Date accepted

Date copyrighted

Date submitted

Education level

Extent

Has format

Has part

Has version

Is format of

Is part of

Is referenced by

Is replaced by

Is required by

Issued

Is version of

License

Mediator

Medium

Modified

Provenance

References

Replaces

Requires

Rights holder

Spatial

Table of contents

Temporal

Valid

Encodings

Box

DCMIType

DDC

IMT

ISO3166

ISO639-2

LCC

LCSH

MESH

Period

Point

RFC1766

RFC3066

TGN

UDC

URI

W3CDTF

Types

Collection

Dataset

Event

Image

Interactive Resource

Moving Image

Physical Object

Service

Software

Sound

Still Image

Text

Creator

"An entity primarily responsible for making the content of the resource"

In other words - Author, Photographer, Illustrator, etc.

  • Potential refinements by creative role
  • Rarely justified

Creators can be persons or organizations

Key Point - Name variations are a big issue in data quality:

  • Ron Daniel
  • Ron Daniel, Jr.
  • Ron Daniel Jr.
  • R.E. Daniel
  • Ronald Daniel
  • Ronald Ellison Daniel, Jr.
  • Daniel, R.

Name fields may contain other information

  • <dc:creator>Case, W. R. (NASA Goddard Space Flight Center, Greenbelt, MD, United States)</dc:creator>

Best practice - Validate names against LDAP or other "Authority File"
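The variant-name problem above can be attacked with a normalization pass against an authority list before records are accepted. A minimal sketch (the authority entries and the matching rule are illustrative; a real system would query LDAP or a full authority file):

```python
import re

# Hypothetical authority file: canonical heading -> known variants,
# stored lowercased with punctuation stripped.
AUTHORITY = {
    "Daniel, Ron, Jr.": {
        "ron daniel", "ron daniel jr", "ronald daniel",
        "re daniel", "daniel r", "ronald ellison daniel jr",
    },
}

def normalize(name: str) -> str:
    """Strip punctuation and case so variant spellings compare equal."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()

def authorize(name: str):
    """Return the canonical form if the name matches an authority entry,
    else None (unmatched names are routed to manual review)."""
    key = normalize(name)
    for canonical, variants in AUTHORITY.items():
        if key in variants:
            return canonical
    return None
```

For example, `authorize("R.E. Daniel")` and `authorize("Ron Daniel, Jr.")` both resolve to the same heading, while an unknown name returns `None` for human follow-up.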


Refinements


None

Encodings

None

Example - Name mismatches

One of these things is not like the other:


  • Ron Daniel, Jr. and Carl Lagoze; "Distributed Active Relationships in the Warwick Framework"
  • Hojung Cha and Ron Daniel; "Simulated Behavior of Large Scale SCI Rings and Tori"
  • Ron Daniel; "High Performance Haptic and Teleoperative Interfaces"


Differences may not matter

If they do

  • This error cannot be reliably detected automatically
  • Authority files and an error-correction procedure are needed


Contributor

"An entity responsible for making contributions to the content of the resource."

In practice - rarely used.

  • Difficult to distinguish from Creator.
  • Adds UI complexity for no real gain.

Best Practice?

Recommendation - Don't use.

Refinements

None

Encodings

None

Publisher

"An entity responsible for making the resource available."

Problems:

  • All the name-handling issues of Creator.
  • Hierarchy of publishers (Bureau, Agency, Department, …)

Refinements

None

Encodings

None

Title

"A name given to the resource."


Issues:

  • Hierarchical titles, e.g. Conceptual Structures: Information Processing in Mind and Machine (The Systems Programming Series)
  • Untitled works
  • Titles shared by many works, e.g. Metaphysics

Refinements

Alternative

Encodings

None

Identifier

"An unambiguous reference to the resource within a given context"


Best Practice: URL


Future Best Practice: URI?


Problems

  • Metaphysics
  • Personalized URLs
  • Multiple identifiers for same content
  • Non-standard resolution mechanisms for URIs


Recommendations - Plan how to introduce long-lived URLs

Refinements


Bibliographic Citation

Encodings


URI

Date

"A date associated with an event in the life cycle of the resource"


Woefully underspecified.


Typically the publication or last modification date.


Best practice: YYYY-MM-DD
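The YYYY-MM-DD recommendation is easy to enforce mechanically at metadata-entry time. A quick validator (a sketch that checks only the date-only profile of W3CDTF, not times or time zones):

```python
from datetime import date

def valid_dc_date(value: str) -> bool:
    """Accept only the zero-padded YYYY-MM-DD form recommended for dc:date."""
    try:
        date.fromisoformat(value)  # rejects impossible dates like 2013-02-30
    except ValueError:
        return False
    # Guard against alternate ISO 8601 shapes some Python versions accept.
    return len(value) == 10
```

A form handler would call this before storing the value and bounce anything like `03/22/2013` back to the author.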

Refinements


Created

Valid

Available

Issued

Modified

Date Accepted

Date Copyrighted

Date Submitted

Encodings


DCMI Period

W3CDTF (a profile of ISO 8601)

Subject

"The topic of the content of the resource."


Best practice: Use pre-defined subject schemes, not user-selected keywords.

  • Supported Encodings probably not useful for most corporate needs


Factor "Subject" into separate facets.

  • People, places, organizations, events, objects, services
  • Industry sectors
  • Content types, audiences, functions
  • Topic


Some of the facets are already defined in DC (Coverage, Type) or DCTERMS (Audience)

Refinements


None

Encodings


DDC

LCC

LCSH

MESH

UDC

Coverage

"The extent or scope of the content of the resource."


In other words - places and times as topics.


Key Point - Locations are important in some environments and irrelevant in others. Time periods as subjects are rarely important in commercial work.


Best Practice - ISO 3166-1, 3166-2
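An ISO 3166 coverage value can be validated and serialized in one step. A sketch (the code table here is a tiny illustrative subset; a real system would load the full ISO 3166-1 list):

```python
# Illustrative subset of ISO 3166-1 alpha-2 codes.
ISO_3166_1 = {"US": "United States", "GB": "United Kingdom", "DE": "Germany"}

def spatial_coverage(code: str) -> str:
    """Render a dcterms:spatial value using the ISO3166 encoding scheme."""
    if code.upper() not in ISO_3166_1:
        raise ValueError(f"not an ISO 3166-1 code in our table: {code}")
    return f'<dcterms:spatial xsi:type="dcterms:ISO3166">{code.upper()}</dcterms:spatial>'
```

Rejecting unknown codes at entry time is what makes the encoding scheme useful downstream.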

Refinements


Spatial

Temporal

Encodings


Box (for Spatial)

ISO3166 (for Spatial)

Point (for Spatial)

TGN (for Spatial)

W3CDTF (for Temporal)

Description

"An account of the content of the resource".


In other words - an abstract or summary


Key Point - What's the cost/benefit tradeoff for creating descriptions?

  • Quality of auto-generated descriptions is low
  • For search results, hit highlighting is probably better

Refinements


Abstract

Table of Contents

Encodings


None

Type

"The nature or genre of the content of the resource"


Best Current Practice: Create a custom list of content types, use that list for the values.

  • Try to avoid "image", "audio", and other format names in the list of content types; they can be derived from "Format".
  • No broadly-acceptable list yet found.
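A custom type list is only useful if it is enforced at entry time. A minimal check (the content-type list is a hypothetical corporate example; note it avoids format names, per the advice above):

```python
# Hypothetical controlled list of content types for dc:type.
CONTENT_TYPES = {"Policy", "Procedure", "Press Release", "Case Study", "FAQ"}

def validate_type(value: str) -> str:
    """Accept a dc:type value only if it comes from the controlled list."""
    if value not in CONTENT_TYPES:
        raise ValueError(f"dc:type {value!r} not in the controlled list")
    return value
```

In a tagging form this backs a pull-down list rather than a free-text field, so invalid values never get entered at all.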

Refinements


None

Encodings


DCMI Type

Format

"The physical or digital manifestation of the resource."


In other words - the file format


Best practice: Internet Media Types


Outliers: File sizes, dimensions of physical objects
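Internet Media Types can usually be derived from the file itself rather than hand-entered, which makes Format one of the cheapest elements to populate. Python's standard mimetypes table covers the common cases:

```python
import mimetypes

def dc_format(filename: str) -> str:
    """Guess an Internet Media Type (IMT) for dc:format from a file name."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type or "application/octet-stream"  # fallback when unknown
```

For example, `dc_format("report.pdf")` yields `application/pdf`; unrecognized extensions fall back to the generic binary type rather than leaving the field empty.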

Refinements


Extent

Medium

Encodings


IMT

Language

"A language of the intellectual content of the resource."


Best Practice: ISO 639, RFC 3066


Dialect codes: Advanced practice
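RFC 3066 tags are a primary subtag (from ISO 639) followed by optional subtags, so a syntactic check is a one-line regular expression. A sketch (this validates only the tag's shape, not that the subtags are actually registered):

```python
import re

# RFC 3066 syntax: 1-8 letter primary subtag, then optional
# "-" + 1-8 alphanumeric subtags (e.g. "en", "en-US").
TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")

def valid_language_tag(tag: str) -> bool:
    """Check that a dc:language value is shaped like an RFC 3066 tag."""
    return bool(TAG.match(tag))
```

This catches the common failure mode of free-text entries like "English" with spaces or punctuation slipping into dc:language.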

Refinements


None

Encodings


ISO639-2

RFC1766

RFC3066

Relation

"A reference to a related resource"


Very weak meaning - not even as strong as "See also".


Best practice: Use a refinement element and URLs.
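Pairing a refinement with a URL makes the relation machine-actionable. A small helper emitting qualified DC (element names follow the dcterms refinements listed below; the XML shape is an illustrative serialization, not the only valid one):

```python
from xml.sax.saxutils import escape

# Refinements of dc:relation defined in the dcterms namespace.
RELATION_REFINEMENTS = {
    "isVersionOf", "hasVersion", "isReplacedBy", "replaces",
    "isRequiredBy", "requires", "isPartOf", "hasPart",
    "isReferencedBy", "references", "isFormatOf", "hasFormat",
    "conformsTo",
}

def relation(refinement: str, url: str) -> str:
    """Emit a qualified-DC relation element with a URI-encoded value."""
    if refinement not in RELATION_REFINEMENTS:
        raise ValueError(f"unknown dc:relation refinement: {refinement}")
    return (f'<dcterms:{refinement} xsi:type="dcterms:URI">'
            f'{escape(url)}</dcterms:{refinement}>')
```

For example, `relation("isPartOf", "http://example.org/journal/v13")` links an article to its journal far more usefully than a bare dc:relation string.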

Refinements


Is Version Of

Has Version

Is Replaced By

Replaces

Is Required By

Requires

Is Part Of

Has Part

Is Referenced By

References

Is Format Of

Has Format

Conforms To

Encodings


URI

Source

"A reference to a resource from which the present resource is derived"


Original intent was for derivative works


Frequently abused to provide bibliographic information for items extracted from a larger work, such as articles from a journal.

Refinements


None

Encodings


URI

Rights

"Information about rights held in and over the resource"


Could be a copyright statement, a list of groups with access rights, or …


Refinements


Access Rights

License

Encodings


None

Custom business process document types? Ouch!

An oil & gas services company's document types:

  • software, database forms
  • checklists, templates, forms, logos, branding
  • ads, brochures, data sheets, technical notes, case studies, price lists
  • newsletters, bulletins, press releases
  • research notes, journal articles
  • policies, procedures, training manuals, standards, best practices
  • lessons learned, after-action reviews, meeting minutes, FAQs
  • auditing, compliance, testing, inspections, operations reports
  • work orders, correspondence
  • permits, consents, approvals, rejections, certificates
  • applications, proposals, requests, requirements
  • agendas, plans, designs, schedules, workflow
  • analysis, appraisals, assessments, forecasts, predictions

The power of taxonomy facets

  • 4 independent categories of 10 nodes each have the same discriminatory power as one hierarchy of 10,000 nodes (10^4)
    • Easier to maintain
    • Can be easier to navigate
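The arithmetic behind the facet claim is just multiplication of facet sizes:

```python
from math import prod

# 4 independent facets of 10 nodes each (the sizes from the example above).
facet_sizes = [10, 10, 10, 10]

# Discriminatory power: each combination of one node per facet is a distinct slot.
combinations = prod(facet_sizes)      # 10**4 = 10,000

# Maintenance burden: you curate only the nodes, not the combinations.
nodes_to_maintain = sum(facet_sizes)  # 40 nodes vs. a 10,000-node hierarchy
```

Forty maintained nodes buy the same discriminatory power as a ten-thousand-node hierarchy, which is the whole argument for facets.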

Taxonomic metadata example: Form SS-4, Employer Identification Number (EIN)

Facet: Value

  • Agency: IRS
  • Content Type: Application [or Information Submission]
  • Industry Impact: Generic
  • Jurisdiction: Federal
  • Programs & Services (BRM): Support Delivery of Services/General Government/Taxation Management
  • Keyword Topic: Commerce/Employment taxes
  • Audience: Business

Knowledge workers spend up to 2.5 hours each day looking for information … but find what they are looking for only 40% of the time.

- Kit Sims Taylor

K.S. Taylor. "The brief reign of the knowledge worker," 1998. http://online.bcc.ctc.edu/econ/kst/BriefReign/BRwebversion.htm. Cited by Sue Feldman in her original article.

High cost of not finding information

  • "The amount of time wasted in futile searching for vital information is enormous, leading to staggering costs …"

- Sue Feldman

High cost of poor classification

  • Poor classification costs a 10,000-user organization $10M each year, about $1,000 per employee.


- Jakob Nielsen, useit.com


But "better search" by itself is a weak ROI argument

Sue Feldman. "The high cost of not finding information." 13:3 KM World (March 2004) http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&_ID=1725&Publication_ID=108


The Jakob Nielsen comment may be apocryphal. It was mentioned in several Delphi reports, including Taxonomy and content classification: market milestone report (2002) and Information intelligence: content classification and enterprise taxonomy practice (2004), but the original quote cannot be attributed.

Knowledge workers spend more time re-creating existing content (26%) than creating new content (9%).

- Kit Sims Taylor

K.S. Taylor. "The brief reign of the knowledge worker," 1998. http://online.bcc.ctc.edu/econ/kst/BriefReign/BRwebversion.htm. Cited by Sue Feldman in her original article.


Metadata ROI: Productivity

  • Decreased cost to market
    • Decreased development cost
    • Increased R&D productivity
    • Reduced time for sales & marketing
  • 1-5% decrease in drug development cost ($800M/drug) → $8M to $16M/drug
  • 5-10% increase in R&D productivity (13% of revenue; $39B in sales in '04) → $254M to $507M/year
  • 10-20% decrease in time for sales & marketing (13% of revenue) → $254M to $507M/year
  • Enterprise document management system cost: $10M

PBS Frontline. The Other Drug War: FAQs. (June 2003) http://www.pbs.org/wgbh/pages/frontline/shows/other/etc/faqs.html

Metadata FAQ: Executive mandate is key

  • There is no ROI out of the box
  • Just someone with a vision … and the budget to make it happen.


  • What's really needed?
    • Demos and proofs of value.
    • So that a stronger cost benefit argument can be made for continuing the work


Metadata FAQ: How do you sell it?

  • Don't sell "metadata" or "taxonomy", sell the vision of what you want to be able to do.
  • Clearly understand what the problem is and what the opportunities are.
  • Do the calculus (costs and benefits)
  • Design the taxonomy (in terms of level of effort, LOE) in relation to the value at hand.


Sources for 7 common vocabularies

Vocabulary: definition; potential sources; DC mapping where one applies.

  • Products and Services: Names of products/programs & services. Sources: ERP system, your products and services, etc.
  • Audience: Subset of constituents to whom a piece of content is directed or intended to be used. Sources: GEM, ERIC Thesaurus, IEEE LOM, etc. Maps to dcterms:audience.
  • Topic: Business topics relevant to your mission and goals. Sources: Federal Register Thesaurus, NAL Agricultural Thesaurus, LCSH, etc. Maps to dc:subject.
  • Function: Functions and processes performed to accomplish mission and goals. Sources: FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc.
  • Location: Place of operations or constituencies. Sources: FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service, etc. Maps to dc:coverage.
  • Industry: Broad market categories such as lines of business, life events, or industry codes. Sources: FIPS 66, SIC, NAICS, etc.
  • Content Type: Structured list of the various types of content being managed or used. Sources: DC Types, AGLS Document Type, AAT Information Forms, records management policy, etc. Maps to dc:type.
  • Organization: Organizational structure. Sources: FIPS 95-2, U.S. Government Manual, your organizational structure, etc. Maps to dc:publisher.

Cheap and Easy Metadata

  • Some fields will be constant across a collection.
  • In the context of a single collection those kinds of elements add no value, but they add tremendous value when many collections are brought together into one place, and they are cheap to create and validate.
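Collection-constant fields can be stamped onto every record at ingest time, which is what makes them cheap. A sketch (field names and default values here are illustrative):

```python
# Values constant across this hypothetical collection, applied at ingest.
COLLECTION_DEFAULTS = {
    "dc:publisher": "NASA Goddard Space Flight Center",
    "dc:language": "en",
    "dc:rights": "Public domain",
}

def apply_defaults(record: dict) -> dict:
    """Fill gaps with collection constants; per-item values always win."""
    return {**COLLECTION_DEFAULTS, **record}
```

An aggregator merging many collections then gets consistently populated publisher, language, and rights fields for free, with no per-item cataloging effort.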

Principles

  • Basic facets with identified items - people, places, projects, instruments, missions, organizations, etc. Note that these are not subjective "subjects"; they are objective "objects".
  • Clearly identify the custodians of the facets, and the process for maintaining and publishing them.
  • Subjective views can be laid on top of the objective facts, but should be in a different namespace so they are clearly distinguishable.
    • For example, labels like "Anarchist" or "Prime Minister" can be applied to the same person at different times (e.g. Nelson Mandela).

Enterprise Portal challenges when organizing content

  • Multiple subject domains across the enterprise
    • Vocabularies vary
    • Granularity varies
    • Unstructured information represents about 80% of enterprise information
  • Information is stored in complex ways
    • Multiple physical locations
    • Many different formats
  • Tagging is time-consuming and requires SME involvement
  • Portal doesn't solve content access problem
    • Knowledge is power syndrome
    • Incentives to share knowledge don't exist
    • Free flow of information TO the portal might be inhibited
  • Content silo mentality changes slowly
    • What content has changed?
    • What exists?
    • What has been discontinued?
    • Lack of awareness of other initiatives

The complexity of information storage makes it a significant challenge to integrate all the data stores into a single seamless repository.


Content silos result in poor communication among groups; lots of extra work because one group doesn't know what the other is doing or has already done.


Yahoo employs a completely manual approach to tagging. All content is considered by SMEs.


Challenges when organizing content on enterprise portals

  • Lack of content standardization and consistency
    • Content messages vary among departments
    • How do users know which message is correct?
  • Re-usability low to non-existent
  • Costs of content creation, management and delivery may not change when a portal is implemented:
    • Similar subjects, BUT
    • Diverse media
    • Diverse tools
    • Different users
  • How will personalization be implemented?
  • How will existing site taxonomies be leveraged?
  • Taxonomy creation may surface "holes" in content


Agenda

3:30 Introductions: Us and you

3:45 Background: Metadata & controlled vocabularies

4:00 Dublin Core: Elements, issues, and recommendations

4:30 Dublin Core in the wild: CEN study and remarks

4:45 Enterprise-wide metadata ROI questions

5:00 Break

5:15 ROI (Cont.)

5:30 Business processes

6:15 Tools & technologies

6:30 Q&A

6:45 Adjourn

Methods used to create & maintain metadata

Base: 20 corporate information managers. Source: CEN/ISSS Workshop on Dublin Core - Guidance information for the deployment of Dublin Core metadata in Corporate Environments.

Paper or web-based forms widely used:

Distributed resource origination metadata tagging

Centralized clean-up and metadata entry.

Automated tools & applications not widely used:

Auto-categorization tools

Vocabulary/taxonomy editing tools

Guided navigation applications

Federated search and repository "wrappers"


The Tagging Problem

  • How are we going to populate metadata elements with complete and consistent values?
  • What can we expect to get from automatic classifiers?


Tagging

  • Province of authors (SMEs) or editors?
  • Taxonomy often highly granular to meet task and re-use needs.
  • Vocabulary dependent on originating department.
  • The more tags there are (and the more values for each tag), the more hooks to the content.
  • If there are too many, authors will resist and use "general" tags (if available).
  • Automatic classification tools exist, and are valuable, but results are not as good as humans can do.
    • "Semi-automated" is best.
    • Degree of human involvement is a cost/benefit tradeoff.
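The semi-automated model can be as simple as a keyword classifier that auto-tags confident matches and routes weak ones to a human review queue. A sketch (the rules and the confidence threshold are illustrative, nothing like a production classifier):

```python
# Illustrative keyword rules: category -> trigger words.
RULES = {
    "Taxation": {"tax", "irs", "ein"},
    "Employment": {"employee", "payroll", "hiring"},
}

def classify(text: str, min_hits: int = 2):
    """Return (category, needs_review). Weak matches are flagged for a human."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in RULES.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, True                   # no evidence: a human must tag it
    return best, scores[best] < min_hits    # weak evidence: auto-tag, but flag it
```

Tuning `min_hits` is exactly the cost/benefit tradeoff named above: a lower threshold means fewer documents in the human queue but more tagging errors slipping through.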

Automatic categorization vendors | Analyst viewpoint

[2×2 chart: Accuracy Level (low to high) vs. Content Volumes (low to high), with vendors plotted by quadrant]

Scalability requires simple creation of granular metadata and taxonomies.

Better content architecture means more accurate categorization, and more precise content delivery.

Surprisingly, most organizations are better off buying tools from the lower-left quadrant. Their absolute accuracy is less, but it comes with a lot of other features - UI, versioning, workflow, storage - that provide the basis for building a QA process.

Considerations in automatic classifier performance

  • Classification Performance is measured by "Inter-cataloger agreement"
    • Trained librarians agree less than 80% of the time
    • Errors are subtle differences in judgment, or big goofs
  • Automatic classification struggles to match human performance
    • Exception: Entity recognition can exceed human performance
  • Classifier performance limited by algorithms available, which is limited by development effort
  • Very wide variance in one vendor's performance depending on who does the implementation, and how much time they have to do it
  • 80/20 tradeoff where 20% of effort gives 80% of performance.
  • Smart implementation of inexpensive tools will outperform naive implementations of world-class tools.

[Chart: Accuracy vs. Development Effort/Licensing Expense, spanning approaches from Regexps to Trained Librarians, with the gap marked as "potential performance gain"]

Tagging tool example: Interwoven MetaTagger

Manual form fill-in w/ check boxes, pull-down lists, etc.

Auto keyword & summarization

Tagging tool example: Interwoven MetaTagger

Auto-categorization

Parse & lookup (recognize names)

Rules & pattern matching

Metadata tagging workflows

  • Even 'purely' automatic meta-tagging systems need a manual error correction procedure.
    • Should add a QA sampling mechanism
  • Tagging models:
    • Author-generated
    • Central librarians
    • Hybrid - central auto-tagging service, distributed manual review and correction

Sample of 'author-generated' metadata workflow:

[Flowchart: Analyst composes content in a template; a Tagging Tool (maintained by a Sys Admin) automatically fills in metadata; the Analyst approves/edits the metadata and submits to the CMS; an Editor reviews the content (problem? yes/no, looping back on yes); a Copywriter copy-edits the content (same problem loop); approved content is published to hard copy and the web site]

Automatic categorization vendors | Pragmatic viewpoint

[Same 2×2 chart as the analyst-viewpoint slide: Accuracy Level (low to high) vs. Content Volumes (low to high)]


Seven practical rules for taxonomies

  • Incremental, extensible process that identifies and enables users, and engages stakeholders.
  • Quick implementation that provides measurable results as quickly as possible.
  • Not monolithic: has separately maintainable facets.
  • Re-uses existing IP as much as possible.
  • A means to an end, and not the end in itself.
  • Not perfect, but it does the job it is supposed to do, such as improving search and navigation.
  • Improved over time, and maintained.

Agenda

3:30 Introductions: Us and you

3:45 Background: Metadata & controlled vocabularies

4:00 Dublin Core: Elements, issues, and recommendations

4:30 Dublin Core in the wild: CEN study and remarks

4:45 Enterprise-wide metadata ROI questions

5:00 Break

5:15 ROI (Cont.)

5:30 Business processes

6:15 Tools & technologies

6:30 Summary, Q&A

6:45 Adjourn

Summary: Categorize with a purpose

  • What is the problem you are trying to solve?
    • Improve search
    • Browse for content on an enterprise-wide portal
    • Enable business users to syndicate content
    • Otherwise provide the basis for content re-use
  • How will you control the cost of creating and maintaining the metadata needed to solve these problems?
    • CMS with metadata tagging products
    • Semi-automated classification
    • Taxonomy editing tools
    • Guided navigation tools

Contact Info

Ron Daniel

925-368-8371

rdaniel@taxonomystrategies.com


Joseph Busch

415-377-7912

jbusch@taxonomystrategies.com