Semantic Web: Paul Ford responds to Clay Shirky

Usually a fan of Clay Shirky, I heaved a heavy sigh at yet more strawman arguments directed against the Semantic Web efforts. I was pleased to see Paul Ford took the time to respond at length, giving concrete examples as well as demonstrating the techniques on his own site.

The critics of the Semantic Web, methinks, simply lack patience. If everyone thought only one, three, or five years into the future, we’d never solve the very hard problems. The Semantic Web as a practical reality might be 10 or 20 years off, but that’s not really so long. And if the W3C weren’t doing this work, we’d all be sitting around complaining, “Someone has to think about the future of the web and come up with the strategic plan for web technology and use. Why isn’t the W3C doing this?!”

Metadata Glossary

In an attempt to summarize how various metadata-related terms relate to one another and to building Internet systems, I created a metadata glossary. It covers, for example, metadata, taxonomies, indexing, CMS, Semantic Web, and XML. I then ordered the terms and tied them together with a bit of narrative to explain the relationships among them, which helped keep it shorter than an essay but, hopefully, clearer than a glossary alone.

Introduction to Metadata

Our understanding of the world is facilitated by our ability to associate things, to compare and contrast, to categorize, and to form abstract relationships. To help others understand information, we deliberately describe it, and in doing so we shape it into new forms of knowledge. When communicating with computers, we can do this using metadata.

Metadata is simply a piece of information that describes other information. For example, let’s look at some text, a headline from nytimes.com:

Bush Continues to Push Congress for Resolution on Iraq


By THE ASSOCIATED PRESS 12:30 PM ET


President Bush today kept up pressure on Congress to approve action against Iraq amid new criticism from Democrats.


  • Video: Bush Speaks on Iraq Issues
  • C.I.A. and F.B.I. Defend Counterterrorism

The data in this case is the headline and summary:


Bush Continues to Push Congress for Resolution on Iraq


President Bush today kept up pressure on Congress to approve action against Iraq amid new criticism from Democrats.

The metadata is the surrounding information that helps us understand the context or to categorize the data:

Published by: THE ASSOCIATED PRESS


Publish time: 12:30 PM ET


Related information:

  • Video: Bush Speaks on Iraq Issues
  • C.I.A. and F.B.I. Defend Counterterrorism

There may also be other metadata that isn’t displayed but which helps the system display or organize the data:

Desk: National


Information Type: News


Format: Column
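One way to picture the split between data and metadata is as a single record with two parts. The following is only a minimal sketch in Python: the `Article` class and its field names are hypothetical, chosen for illustration, and say nothing about how nytimes.com actually stores its stories.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical record: the data plus the metadata that describes it."""
    headline: str                                  # data
    summary: str                                   # data
    metadata: dict = field(default_factory=dict)   # information about the data


article = Article(
    headline="Bush Continues to Push Congress for Resolution on Iraq",
    summary=("President Bush today kept up pressure on Congress to approve "
             "action against Iraq amid new criticism from Democrats."),
    metadata={
        # Metadata shown to the reader
        "published_by": "THE ASSOCIATED PRESS",
        "publish_time": "12:30 PM ET",
        "related": [
            "Video: Bush Speaks on Iraq Issues",
            "C.I.A. and F.B.I. Defend Counterterrorism",
        ],
        # Metadata that is not displayed but helps organize the data
        "desk": "National",
        "information_type": "News",
        "format": "Column",
    },
)

print(article.metadata["desk"])
```

The point of the sketch is that the headline and summary can be displayed on their own, while everything under `metadata` exists to describe, locate, or organize them.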

To allow readers to search or browse their news, the New York Times might collect one taxonomy of terms – a form of metadata – and display all these terms together. For example, the Desk taxonomy looks like this:

International


National


Politics


Business


Technology


Science


Health


Sports


New York Region


Education


Weather


Obituaries

This collection is called a metadata schema, meaning a systematic combination of elements.
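To make the browsing idea concrete, here is a rough Python sketch that treats the Desk taxonomy as a fixed list of allowed terms and uses it to pull out matching stories. The function name `browse_by_desk` and the second sample story are assumptions made up for this example.

```python
# The Desk taxonomy from the text, as an ordered list of preferred terms.
DESK_TAXONOMY = [
    "International", "National", "Politics", "Business", "Technology",
    "Science", "Health", "Sports", "New York Region", "Education",
    "Weather", "Obituaries",
]


def browse_by_desk(articles, desk):
    """Return headlines filed under `desk`; reject terms outside the taxonomy."""
    if desk not in DESK_TAXONOMY:
        raise ValueError(f"Unknown desk: {desk!r}")
    return [a["headline"] for a in articles if a["desk"] == desk]


# Hypothetical sample records; only the first headline comes from the example above.
articles = [
    {"headline": "Bush Continues to Push Congress for Resolution on Iraq",
     "desk": "National"},
    {"headline": "Example science story", "desk": "Science"},
]

national = browse_by_desk(articles, "National")
print(national)
```

Rejecting unknown terms is what keeps the vocabulary controlled: a story can only be filed under a desk that already exists in the list.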

Metadata can describe other things as well, such as people or places.


Essentially, the benefits of these metadata schemas are:


  • improved browsing and searching by making it easy for the users of a system to find information
  • improved communication among people by creating a common vocabulary
  • simpler maintenance by reducing chaotic use of language

Here are some basic definitions to help tell the different kinds of schema apart:


  • Synonym Ring: A grouping of similar words or phrases. A search engine might use a synonym ring to locate relevant information when someone searches on a related term.


  • Glossary: A collection of terms and definitions within a particular domain. A glossary could be used simply to help people agree on and understand a common terminology.


  • Taxonomy: An arrangement and naming of metadata, usually hierarchical. A taxonomy might be a list of category names.


  • Faceted Taxonomy: A taxonomy with attributes and attribute values. If News is a term, then an attribute could be Country and an attribute value of Country could be France.


  • Thesaurus: A taxonomy that also includes terms that are associated and terms that are related. The term Newspaper is associated with the term Journal and related to the term Town Crier.

  • The above are often referred to as “controlled vocabularies”. If we try to go beyond formal vocabularies and formalize our knowledge of a subject, this is known as “knowledge representation”.


  • Ontology: The specification of one’s conceptualization of a knowledge domain. Ontologies resemble faceted taxonomies but use richer semantic relationships among terms and attributes, as well as strict rules about how to specify terms and relationships.
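As an illustration of the simplest schema in the list, the synonym ring, here is a small Python sketch of search expansion. The rings, the sample documents, and the function names are all invented for this example, and the word matching is deliberately naive.

```python
# Hypothetical synonym rings: every term in a ring is treated as equivalent.
SYNONYM_RINGS = [
    {"newspaper", "paper", "gazette"},
    {"car", "automobile", "auto"},
]


def expand_query(term):
    """Return the ring containing `term`, or just the term if it has no ring."""
    term = term.lower()
    for ring in SYNONYM_RINGS:
        if term in ring:
            return ring
    return {term}


def search(documents, term):
    """Return documents containing the term or any of its synonyms."""
    terms = expand_query(term)
    return [doc for doc in documents
            if any(t in doc.lower().split() for t in terms)]


documents = [
    "The gazette printed a correction",
    "Weather for the weekend",
    "Local paper wins award",
]

results = search(documents, "newspaper")
print(results)
```

Neither matching document contains the word "newspaper"; they are found only because the ring says "gazette" and "paper" mean the same thing for search purposes.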

It might help to define some related terms:


Controlled Vocabularies – a defined set of preferred terms. Types of controlled vocabularies include Synonym Rings, Authority Files, Taxonomies, Faceted Taxonomies, and Thesauri. Ontologies are not usually considered a form of controlled vocabulary but rather a form of knowledge representation.

Attribute – an aspect of an object, such as the publisher name. Attributes are alternately called “facets” when applied to taxonomies, “slots” when applied to ontologies, or “fields” when applied to databases.

Attribute Value – a value assigned to an attribute. For example, the attribute “Publisher Name” can have a value of “New York Times”.
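Attributes and attribute values map naturally onto key/value pairs. The sketch below, in Python, filters items by facet; the item titles, the `filter_by_facets` helper, and the second publisher are hypothetical, while the News/Country/France and Publisher Name examples come from the definitions above.

```python
# Hypothetical items: each has a taxonomy term plus facets (attribute -> value).
items = [
    {"title": "Story A", "term": "News",
     "facets": {"Country": "France", "Publisher Name": "New York Times"}},
    {"title": "Story B", "term": "News",
     "facets": {"Country": "Japan", "Publisher Name": "New York Times"}},
    {"title": "Story C", "term": "News",
     "facets": {"Country": "France", "Publisher Name": "Example Daily"}},
]


def filter_by_facets(items, term, wanted):
    """Keep items under `term` whose facets match every wanted attribute value."""
    return [
        item["title"]
        for item in items
        if item["term"] == term
        and all(item["facets"].get(attr) == value
                for attr, value in wanted.items())
    ]


french_news = filter_by_facets(items, "News", {"Country": "France"})
french_nyt = filter_by_facets(
    items, "News", {"Country": "France", "Publisher Name": "New York Times"})
print(french_news, french_nyt)
```

Each extra attribute narrows the result, which is the behavior people usually expect from faceted browsing.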


A note on metatags: metadata and metatags are related, but are different things. Metatags are found within markup code (like HTML pages) to identify certain attributes of that information. Metadata goes *into* metatags, but metadata has many other uses as well.
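To show metadata going into metatags, here is a brief Python sketch that renders attribute/value pairs as HTML `<meta>` elements. The `render_meta_tags` helper and the keyword values are assumptions for illustration; `author` and `keywords` are standard HTML metadata names.

```python
from html import escape


def render_meta_tags(metadata):
    """Render each attribute/value pair as an HTML <meta> element."""
    return "\n".join(
        f'<meta name="{escape(name, quote=True)}" '
        f'content="{escape(value, quote=True)}">'
        for name, value in metadata.items()
    )


# Hypothetical page metadata; the author value comes from the example above.
metadata = {
    "author": "THE ASSOCIATED PRESS",
    "keywords": "Bush, Congress, Iraq",
}

tags = render_meta_tags(metadata)
print(tags)
```

The same dictionary could just as easily feed a search index or a database row, which is the sense in which metatags are only one of several uses for metadata.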

A Question

How will we create and access information in 10 years? In 20 years?

How do you wish we would?

Progress is so constant we rarely pause to acknowledge it. We have split the atom, put astronauts on the moon, and replaced unhealthy hearts in the living with healthy hearts from the dead. Computer technology leaves a record of faster, smaller, easier to use technology: We’ve come from mainframe computers the size of large rooms to microprocessors embedded in credit cards. From assembly language to Java. From command line interfaces to mouse-and-keyboard driven multimedia interfaces. From computers dialing each other at 1200 bits per second to constant communication at millions of bits per second.

We have so much potential.

With so much progress, it strikes me as odd that we devote so little time to planning for it. We have an understanding of what we need to do now when designing products (“make it usable”, “make it beautiful”, “increase brand equity”…). How helpful it would be to hold a similar common understanding of what we should all have 10 years from now. This could then guide all our efforts towards our goals, rather than leaving us to design products as a series of guesses about what should come next year.

How will we create and access information in 10 years? How do you wish we would?