Inside CDL

Digital Library Building Blocks

The California Digital Library provides software, best practices, and other tools to facilitate digital library operations.

Curation Micro-Services

Developed with CDL programs and partners (e.g., LoC, UMich), curation micro-services offer an unbundled alternative to all-in-one repositories that can be expensive to support and modify (cf. DSpace, Fedora, LOCKSS). Using native operating system file and web services, we define minimal conventions to turn a file system into an "object system" and provide low-barrier tools for full lifecycle enrichment (identity, fixity, replication, annotation, etc.) of objects. For more background see curation services.

Open specifications and tools. We welcome feedback on these works in progress.

  • Noid (Nice Opaque Identifiers): Noid provides minting, binding, and resolving services in support of preservation-ready identifiers. Persistent identifiers may be obtained by a committed provider with help from these kinds of identity services. Software: download.
  • Dflat: Simple File-Based Object Storage: An object residence, or "digital flat". Common amenities, such as versions, metadata, annotations, administrivia, and the occupant itself (as intended by the depositor), if present, are always found under reserved names. We will likely have "Dflats" at the ends of Pairtree paths.
  • Pairtrees for Collection Storage: A filesystem convention for holding a collection of digital object directories. The directory path ending at an object is formed by taking the identifier and making a new sub-directory for each next pair of characters. Conversely, one can recover every object and its identifier simply by "walking" the Pairtree. Software: download.
  • Content Access Node (CAN): A CAN holds a repository instance, which is a set of collections (Pairtrees) plus policy configuration files to govern such things as fixity, replication, indexing, and annotation, depending on the purpose of the repository.
  • UC3 Storage Service Specification: A robust, flexible, and easily deployed environment in which to manage the secure and persistent storage of encoded files that represent digital content. It defines interfaces for a hierarchy of concepts corresponding to Service, Node, Object, Version, and File.
  • CLOP: A Class-Based System for Managing Object Properties: Very preliminary thinking about policy declarations to be attached to files, versions, objects, and entire repositories.
  • Directory Typing with Namaste Tags: Namaste (NAMe AS TExt) tags are primitive directory-level metadata exposed directly via filenames. As such, they greet visitors who request a directory listing with a glimpse of what the directory holds. Alpha software: download.
  • Reverse Directory Deltas (ReDD): ReDD is a way to represent differences between two sets of files, which permits great cost reduction when storing multiple versions. To optimize access to recent versions, a chain of ReDD "reverse deltas" stretches backward in time. We will likely use ReDD for Dflat version directories.
  • Checkm: a checksum-based manifest format: Checkm is a general-purpose text-based manifest format designed to support tools that verify the bit-level integrity of file groups for such things as content fixity, replication, import, and export.
  • JHOVE2 Architecture for Format-Aware Characterization: A next-generation framework and application for format-aware characterization, building on the success of the original JHOVE system. JHOVE2 generalizes the process of characterization to include signature-based identification, validation, feature extraction, and policy-based assessment.
  • BagIt File Package Format: A "bag" is a hierarchical file package format suitable for the exchange of generalized archival content via the network or hard disk. It has just enough structure to safely enclose its payload but does not require the receiver to have any deep knowledge of its internal semantics. Software: download.
  • N2T: Name-to-Thing Resolver: N2T is a centralized, scheme-agnostic identifier resolver to protect URL stability for organizations with web server hostnames that might change.
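
As an illustration of the Pairtree convention described above, the following is a minimal sketch of mapping an identifier to a directory path one character pair at a time, and of recovering the identifier from that path. It is not the CDL software: the function names id_to_pairpath and pairpath_to_id are hypothetical, and the sketch assumes the identifier contains only filesystem-safe characters, so any character escaping that the Pairtree specification may require is omitted.

```python
from pathlib import PurePosixPath


def id_to_pairpath(identifier: str) -> PurePosixPath:
    """Split an identifier into successive two-character directory names.

    Hypothetical helper: assumes the identifier is already filesystem-safe
    (no escaping of special characters is performed here).
    """
    if not identifier:
        raise ValueError("identifier must be non-empty")
    # Take each next pair of characters; a trailing odd character
    # becomes a one-character directory name.
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return PurePosixPath(*pairs)


def pairpath_to_id(path: PurePosixPath) -> str:
    """Recover the identifier by concatenating the directory names."""
    return "".join(path.parts)


path = id_to_pairpath("abcde")
print(path)                  # ab/cd/e
print(pairpath_to_id(path))  # abcde
```

Because the path is built only from the identifier's own characters, walking the directory tree is enough to list every identifier in a collection without consulting a separate index.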

Best Practices and Standards

  • Archival Resource Key (ARK): a naming scheme for preservation-ready identifiers. [HTML]
  • WARC File Format (ISO 28500:2009): co-authored by CDL preservation staff, this international standard specifies a structure for storing and exchanging resources harvested from the web and elsewhere. [HTML]
  • CDL guidelines for digital objects, version 2.0: September 2007 [HTML] [PDF]
    • CDL guidelines for digital images, version 2.0: April 2008 [HTML] [PDF]
    • CDL Text Encoding Initiative (TEI) encoding guidelines: [HTML]
  • OAC best practice guidelines for Encoded Archival Description (EAD), version 2.0: [HTML] [PDF]
  • Minimal level OAC MARC records for CDL, Version 1.1: [HTML]
  • Standards for minimal level MARC bibliographic records for University of California Libraries: [DOC]
  • Standards for UC Union catalog input records: [RTF]

Submission Agreements

  • CDL/UC libraries digital assets agreement: [PDF]
  • CDL/UC libraries digital assets submission inventory: [RTF]

Software and Services

  • UC-eLinks OpenURL resolution: The CDL allows UC campus libraries to customize and localize the SFX OpenURL resolution service, UC-eLinks. For detailed operational information about campus instances of UC-eLinks, go to the UC-eLinks Campus Liaisons page.
  • CDL Access and Preservation Repositories: Provides information about the CDL's digital object repositories.
  • eXtensible Text Framework (XTF): Flexible indexing and query tool that supports searching across collections of heterogeneous data and presents results in a highly configurable manner.
  • 7train: An XSLT 2.0-based tool for generating METS files from standardized XML inputs (e.g., CONTENTdm Standard XML exports, OAI records, etc.).
  • Date Normalization Utility: Java code that outputs machine-readable date strings to enrich collections that weren't originally encoded with machine-readable dates.
  • Markup data dictionary: Encoding strategy for the data dictionary used for processing of all U.S. census studies.
