Inside CDL

OAI Harvesting Infrastructure

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) specifies a method for digital repositories (also called "data providers") to expose metadata about their objects for harvesting by aggregators (also called "service providers"). Metadata is exposed via "sets," which are collections of metadata that data providers decide to make available for harvesting. Service providers harvest sets from data providers of interest and provide search services over the resulting collections of metadata (for a good example of a service provider, see OAIster). Data providers also decide which metadata formats to expose for harvesting, beyond the one required metadata format, simple (unqualified) Dublin Core.
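As a rough illustration of the exchange described above, the following Python sketch builds a ListRecords request URL and pulls record identifiers and the resumption token out of a response. The base URL, set name, and sample response are hypothetical placeholders, and the helper names build_oai_url and parse_list_records are illustrative only; this is not CDL's harvester.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response elements.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def build_oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL; arguments whose value is None are skipped."""
    query = {"verb": verb}
    query.update({k: v for k, v in params.items() if v is not None})
    return base_url + "?" + urlencode(query)


def parse_list_records(xml_text):
    """Return (identifiers, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    identifiers = [e.text for e in root.iter("{%s}identifier" % OAI_NS)]
    token_el = root.find(".//{%s}resumptionToken" % OAI_NS)
    # An empty or missing resumptionToken means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical, trimmed response used only to exercise the parser offline.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

# Hypothetical repository URL and set name.
url = build_oai_url(
    "https://repository.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",  # simple Dublin Core, the one required format
    set="photos",
)
ids, token = parse_list_records(SAMPLE_RESPONSE)
```

A real harvester would fetch the URL, then repeat the request with the returned resumptionToken until no token comes back.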

Creating CDL's OAI Harvesting Infrastructure

As part of CDL's Metasearch Infrastructure Project, which is supporting the development of a number of search portals, we test-harvested metadata from a variety of repositories. Experimentation with a Prototype Harvest Search service exposed a number of problems that need to be addressed. Many of these problems, along with suggested strategies for dealing with them and a proposed infrastructure for metadata harvesting, are outlined in the paper "Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers". In response to the issues outlined in that paper, we are drafting Specifications for Metadata Processing Tools that the recently formed Harvesting Core Group (see below) is charged with implementing.

As a beginning step toward normalizing dates and as a test case for a suite of metadata normalization and transformation tools, we created a prototype date normalization tool. We have also collected date test cases by noting variances in date encodings. This work led to the coding and release of a Date Normalization Utility which anyone is free to download and use (without support).
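To give a sense of what date normalization involves, here is a minimal Python sketch that maps a few common date encodings onto ISO 8601-style strings (YYYY, YYYY-MM, or YYYY-MM-DD). The target format, the handled patterns, and the function name normalize_date are assumptions chosen for illustration; they do not describe the rules of the Date Normalization Utility itself.

```python
import re
from datetime import datetime

# Full-date layouts tried in order. Month names depend on the current locale;
# an English (or default "C") locale is assumed here.
FULL_DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y", "%b %d, %Y")


def normalize_date(raw):
    """Return a YYYY, YYYY-MM, or YYYY-MM-DD string, or None if unrecognized."""
    # Drop surrounding whitespace, cataloguer brackets, and question marks.
    text = raw.strip().strip("[]?").strip()
    # Drop a leading "circa" marker such as "c.", "ca." or "circa".
    text = re.sub(r"^(circa|ca\.?|c\.)\s*", "", text, flags=re.IGNORECASE)

    # Year only.
    if re.fullmatch(r"\d{4}", text):
        return text

    # Year and month, with a basic sanity check on the month.
    match = re.fullmatch(r"(\d{4})-(\d{2})", text)
    if match and 1 <= int(match.group(2)) <= 12:
        return text

    # Full dates in one of the known layouts.
    for fmt in FULL_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue

    # Anything else is left for a human or a later rule to handle.
    return None
```

Returning None for unrecognized input, rather than guessing, keeps ambiguous encodings (for example, 5/4/1999) visible so they can be collected as further test cases.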

Internally, we are exposing our harvested metadata to other applications, such as our metasearch software, via SRU (Search/Retrieve via URL).
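For readers unfamiliar with SRU, a client application queries a server by sending a searchRetrieve request whose query is written in CQL. The Python sketch below builds such a request URL under stated assumptions: the endpoint, the dc.title index, and the "dc" record schema name are hypothetical and do not describe CDL's internal service.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, start_record=1, maximum_records=10,
                  record_schema=None, version="1.1"):
    """Build an SRU searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "startRecord": start_record,      # SRU record positions start at 1
        "maximumRecords": maximum_records,
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
sru_url = build_sru_url(
    "https://sru.example.org/harvest",
    'dc.title = "harvest"',
    record_schema="dc",
)
```

The server's response is an XML document containing the matching records, which the calling application parses in whatever record schema it requested.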

Harvesting Core Group

The Harvesting Core Group is charged with creating an OAI harvesting infrastructure for CDL. Its members are:

  • Lynne Cameron
  • Robin Chandler
  • Heather Christenson
  • John Kunze
  • Bill Landis, Team Lead
  • Jane Lee
  • David Loy
  • Mike McKenna, Technical Lead
  • Roy Tennant

Resources

See also the "Bitter Harvest" paper above for pointers to additional resources.

  • Crawling & Harvesting Glossary: Provides definitions for specialized terminology and key terms appropriate to web crawling and OAI-PMH metadata harvesting activities.
  • Digital Library Federation OAI Best Practices Working Group: An initiative to develop a set of best practices for OAI data and service providers, as well as to foster communication and the sharing of tools. Roy Tennant represents CDL on this group.
  • The Open Archives Initiative: This web site points to many tools and documents that can help repositories become OAI-PMH-compliant and help service providers (harvesters) harvest metadata from compliant repositories.