Henry M. Gladney, Edward A. Fox, Zahid Ahmed, Ron Ashany, Nicholas J. Belkin, and Maria Zemankova
 IBM Almaden Research Center, San Jose, California 95120-6099, firstname.lastname@example.org,
 Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24601-0106, email@example.com,
San Diego Supercomputer Center, Univ. of Calif., La Jolla, California 92093-9784, firstname.lastname@example.org,
 National Science Foundation, Arlington, Virginia 22230, email@example.com,
 Rutgers University, New Brunswick, New Jersey, firstname.lastname@example.org,
 Mitre Corporation, McLean, Virginia 22102, email@example.com
At the IEEE CAIA'94 Workshop on Intelligent Access to On-Line Digital Libraries we began discussing requirements and architecture for digital library systems. This paper provides a first summary of the results of our deliberations, analysis, and synthesis.
We consider the context, definitions and characteristics of digital libraries and then propose using an architecture for such distributed computing services built on the concepts of resource managers and application enablers. Our taxonomy for digital libraries calls for a base of file systems and database managers, a storage subsystem for library items (implemented as resource managers), and a higher layer of document managers (implemented as application enablers). Examples of the latter include Mosaic or a folder manager.
Many classes of modules are needed to build these systems. For a particular situation, it is essential to identify the requirements. As a guide, we outline some of the requirements relating to the document storage services and to catalogs that help with access. We conclude with discussions of document markup, links, interchange, and a reminder to build upon the lessons learned with previous libraries and with other distributed information systems, as we develop the first generation of digital libraries.
Keywords: Application enablers, architecture, digital libraries, distributed resource managers, document managers, requirements, storage subsystem, taxonomy.
A digital library (DL), or electronic library, can be the focus of many productive applications. It is no longer only the relatively obscure concern of a few people in computer science and library disciplines but rather a popular research topic for many groups.
Commercial, academic, and public interest are fueled by U.S. Government interest led by Vice President Gore, under the National Information Infrastructure label, and the national press, under the Information Superhighway slogan. Between November 1993 and February 1994, at least four topical conferences were announced for this area, which had seen no similar calls for papers before that.
The earliest of these activities was a one-day, constrained-size workshop addendum to the annual CAIA conference held in San Antonio, Texas on March 1. Its participants agreed it was worthwhile to document their deliberations, notwithstanding their tentative nature, as a starting point for similar discussions in other 1994 conferences. In addition to plenary sessions, the workshop group mounted the following subgroups:
1. DL Models, Frameworks, and System Requirements
2. Library Sciences and Automation
3. Information Retrieval, Organization, Navigation -- Tools and Paradigms
4. DL Specific Nomenclature, System Integration and Architecture Issues
5. Interfaces to DLs -- Information Delivery and Presentation Issues
6. Role of Knowledge Representation Systems in DL Interactions
We report opinions shared in the first subgroup, drawing on elements of the plenary session. We include refinements generated later as we prepared this report. We focus on what we mean by digital libraries, a system taxonomy for distributed data services, and system requirements in that order, trying to provide an aid for future discussions.
What is a Digital Library?
There are many buzz-words for related activities, including, but not limited to: multi-media database [Wo87], information mining, information warehouse, information retrieval, on-line information repositories, electronic library, imaging database, world-wide web (WWW) [Ni92, Ha94, pp.495-512], and wide area information services (WAIS) [Ha94, pp.476-493]. How many distinct activities does this list represent? What requirements differ from topic to topic? What distinctions are essential, if any? What distinctions are more matters of marketplace focus than technical? Clearly there are too many topics in the list, with too much overlap of related activities, and researchers rediscovering what is already known. Precision is needed; hence we define:
A DIGITAL LIBRARY is an assemblage of digital computing, storage, and communications machinery together with the content and software needed to reproduce, emulate, and extend the services provided by conventional libraries based on paper and other material means of collecting, cataloging, finding, and disseminating information. A full service digital library must accomplish all essential services of traditional libraries and also exploit the well-known advantages of digital storage, searching, and communication.
We note a few circumstances and characteristics for which we expect DLs to emulate conventional libraries holding books, pictures, and other material objects:
* users are usually elsewhere than the information they want, and often wish to correlate things from several sources;
* whoever wants to use a library must show permission to do so;
* different patrons are permitted different actions and to see different parts of each collection;
* to find specific information, each user must understand the catalog structure;
* the catalog may describe items not actually held as part of the collection at hand;
* the catalog and the collected items are used differently and not necessarily housed in the same place;
* documents are cataloged with text descriptors and also with conventional properties, such as author names;
* documents contain cross references to other documents;
* document identifiers are different from document names; a document may have several names, one for each context, e.g., "Tales of Hoffmann" in English, "Les contes d'Hoffmann" in French, and "Hoffmanns Erzaehlungen" in German;
* translations of a document may express essentially the same information, e.g., versions of classic literature in different languages;
* each stored item is valuable, often with part of its residual value owned by its authors or authors' assignees;
* part of the value provided by a library is the provenance information it holds for each item;
* items are put into libraries because, while each is thought valuable for future reference, the specific individuals who will read it and the times when this will occur are not known.
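As an illustration of one characteristic above, the separation of document identifiers from document names, the following minimal sketch keeps a single stable identifier per work and any number of context-dependent names. The class and field names (`CatalogEntry`, `NameIndex`, `doc_id`) and the identifier value are assumptions made for this example only; they are not part of any library system discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One cataloged work; field names are illustrative."""
    doc_id: str                                 # stable identifier, not a title
    names: dict = field(default_factory=dict)   # context (language code) -> name


class NameIndex:
    """Hypothetical index resolving any contextual name to one identifier."""

    def __init__(self):
        self._entries = {}   # doc_id -> CatalogEntry
        self._by_name = {}   # lower-cased name -> doc_id

    def add(self, entry):
        self._entries[entry.doc_id] = entry
        for name in entry.names.values():
            self._by_name[name.lower()] = entry.doc_id

    def resolve(self, name):
        """Return the identifier for a name, or None if the name is unknown."""
        return self._by_name.get(name.lower())

    def name_in(self, doc_id, context):
        """Return the name used for `doc_id` in the given context, if any."""
        return self._entries[doc_id].names.get(context)


index = NameIndex()
index.add(CatalogEntry("doc-0001", {
    "en": "Tales of Hoffmann",
    "fr": "Les contes d'Hoffmann",
    "de": "Hoffmanns Erzaehlungen",
}))
```

Under these assumptions, `index.resolve` returns the same identifier for all three titles, so cross references can be stored against the identifier while each user sees the name appropriate to his or her context.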
We anticipate that a "complete" library service will contain many components from which each installation selects a subset and each user draws on an even smaller set. We need a distributed computing infrastructure and a framework for such components. Part of such a framework is provided by the concepts of resource manager and application enabler, which are well known to architects of distributed computing services. (See the DCE/DME deliberations [Ku91].)
Since these concepts seem to be unfamiliar to at least part of the digital library community we summarize them below. The concept of a resource manager will be seen to embrace notions from object-oriented computing and from client-server computing. Given basic operating system and communication services, we believe that all distributed computing services could be built as a set of application programs, application enablers, and resource managers, with only the resource managers directly invoking the primitive operating system and communication services.
A protected resource is a typically large data collection together with programs which define its semantics entirely if they are used as the only access path to the data held. Each such program set is a resource manager. Services such as authentication, filesystems, network directory services, database management systems, and digital library components can all be constructed as resource managers.
We propose a network of mutually supportive resource managers, each providing a relatively specialized service. Each resource manager (see Figure 1) distributes itself for remote applications and accesses any needed sibling as a client. Whether a sibling service is local or remote is solely a matter of network optimization.
Each service instance encapsulates its own data within a procedural cocoon -- a form of object-oriented programming which is not necessarily bound to any particular programming language. Thus, a resource manager is a service which combines state and processes and is accessible to multiple, concurrent clients (as in Figure 1). To qualify and be used as a resource manager in the sense we need, the program set and the data it manages (the protected object) should satisfy the following criteria:
* There typically will be many instances of each kind of protected resource, with its associated resource manager defining the resource class, e.g., Network File Systems (NFS), DB2 databases, X.500 directories, X-windows services.
* The resource manager programs provide the only access path to the protected data, and therefore define and implement its semantics. (Practical systems always permit someone to bypass this proper access path, e.g., for data backup and recovery; alternative paths need to be protected by physical and administrative means if the data are to be safe.)
* Typically, the protected data itself are highly structured, possibly consisting of well-defined objects. Typically each protected resource consists of many such entities, called "items."
* The resource manager provides distributed access, by having client and server portions. The protocol between client and server portions is private to the resource manager.
* To the extent consistent with maintaining good performance and with practical aspects of software production and distribution, each resource manager avoids reproducing services it can get from other resource managers. For instance, a library catalog manager would exploit a database manager, invoking it just as any other database manager client would.
* A resource manager is often an access control enforcement function (AEF) between a request initiator and a target, in the sense called for in international standards [Is88].
* As well as access control, a quality resource manager provides various data integrity protections, such as those called the ACID (Atomicity, Consistency, Isolation, Durability) properties [Gr93, p.6].
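Several of the criteria above can be made concrete with a small sketch. It is a simplification under stated assumptions: the classes `ResourceManager` and `CatalogManager`, the permission table, and the client names are all hypothetical, the data live in memory, and neither distribution nor the ACID protections are modeled. It shows only that the protected data are reachable solely through the manager's methods, that each request passes an access control check, and that one resource manager can use another as an ordinary client.

```python
import threading


class AccessDenied(Exception):
    """Raised when a client lacks permission for the requested action."""


class ResourceManager:
    """Illustrative resource manager: sole access path to its protected items."""

    def __init__(self, permissions):
        self._items = {}                 # protected data, reached only via methods
        self._permissions = permissions  # client name -> set of allowed actions
        self._lock = threading.Lock()    # serializes concurrent clients

    def _check(self, client, action):
        # Access control enforcement between request initiator and target.
        if action not in self._permissions.get(client, set()):
            raise AccessDenied(f"{client} may not {action}")

    def store(self, client, key, value):
        self._check(client, "write")
        with self._lock:
            self._items[key] = value

    def fetch(self, client, key):
        self._check(client, "read")
        with self._lock:
            return self._items[key]


class CatalogManager:
    """A second manager that invokes the first one as any other client would."""

    def __init__(self, database, identity):
        self._db = database
        self._identity = identity

    def add_record(self, item_id, title):
        self._db.store(self._identity, ("catalog", item_id), {"title": title})

    def title_of(self, item_id):
        return self._db.fetch(self._identity, ("catalog", item_id))["title"]


db = ResourceManager({"catalog-service": {"read", "write"}, "guest": {"read"}})
catalog = CatalogManager(db, "catalog-service")
catalog.add_record("item-1", "Tales of Hoffmann")
```

Here the catalog manager holds no storage of its own; it avoids reproducing a service it can obtain from the database manager, as the fifth criterion suggests.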
Resource managers are generic services. Yet, they can be invoked in turn by other generic services such as editors, filters, formatters, and other generic software which constitute a class collectively called application enablers.
The purpose of application enablers is to make application programming easy and quick, or, optimally, avoidable entirely. Just as resource managers can be modularized by having each exploit other resource managers, application enablers can be cascaded. Figure 2 suggests how applications, application enablers, and resource managers can be layered to exploit open communications and to hide irrelevant operating system and machine differences.
Part of what makes the modularization implicit in this model feasible today is the dramatic improvement taking place in computing performance and costs. In addition, the transport layer interface protocol boundary depicted in Figure 2 makes it possible for the lower communication layers to choose efficient paths independently of how each resource manager calls communications internally (see Figure 1). For example, in one extant implementation [Gl93], the transport layer detects when the client and server happen to be in the same machine, and uses local operating system services for inter-process communications; for a library application, the performance is close to what would be achieved by combining the client and server into a single program.
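The path selection described in this paragraph can be sketched as follows. This is not the implementation reported in [Gl93]; the `Transport` class, its method names, and the host name are assumptions, and the remote path is deliberately left as a placeholder. The sketch shows only the decision itself: the caller uses one interface, and the transport layer chooses a local dispatch when client and server share a machine.

```python
class Transport:
    """Hypothetical transport layer that picks a local or a remote path."""

    def __init__(self, local_host):
        self.local_host = local_host
        self._local_servers = {}   # service name -> callable handler
        self.path_log = []         # records which path each call took

    def register_local(self, service, handler):
        self._local_servers[service] = handler

    def _send_remote(self, host, service, request):
        # Placeholder: a real system would marshal the request onto a network.
        raise NotImplementedError("remote path is not modeled in this sketch")

    def call(self, host, service, request):
        # The caller never states whether the server is local or remote.
        if host == self.local_host and service in self._local_servers:
            self.path_log.append("local")
            return self._local_servers[service](request)
        self.path_log.append("remote")
        return self._send_remote(host, service, request)


transport = Transport(local_host="libhost")
transport.register_local("library", lambda request: {"echo": request})
reply = transport.call("libhost", "library", "fetch item-1")
```

Because the choice is made below the interface, a resource manager written against `call` needs no change when its sibling service moves to another machine.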
Digital Library Taxonomy
Document storage and access software can be realized in two layers above a base of file systems and database managers (see Figure 3). The lower one is a storage subsystem which stores and retrieves items to and from each library collection, updates and searches library catalog records, and limits who can manipulate which data -- giving only services which are identical for all types of documents. Instances of the higher layer, which we call document managers, help applications or end users with their special kinds of documents and varied forms of presentation and manipulation.
The distinction between the document storage subsystem layer, which would be implemented as a resource manager, and document managers, implemented as application enablers, deserves careful articulation. One reason for the distinction is that the storage subsystem layer often is difficult for most users to change or substitute, but document managers often are made as accessible as any individual user cares to have them.
The storage subsystem limits its services to those not dependent on the meaning or representation of items. Usually, items it delivers to requesting applications are faithful copies of items other applications stored. Sometimes, however, partial document retrieval is wanted, and transformations which improve presentation without adding information are valuable. The storage subsystem manages data placement and replication, implements custodial responsibilities for data security, and hides irrelevant network and other environmental dependencies as far as possible. Its application programming interface has three parts: a query interface, identifying items of interest to a browser, allowing whatever inquiries do not violate item owners' confidentiality desires; a retrieval interface delivering items with timing and buffering consistent with the data at hand and with the user's response and cost objectives; and update interfaces for the library catalog and collection enforcing articulated policies for library data integrity and quality. Since searches for information may depend on databases that are not part of what the librarian has chosen to include in the formal library catalog, query services in the storage subsystem also should support joins with external data.
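To make the three-part interface easier to picture, we give a minimal in-memory sketch. Every name in it (`StorageSubsystem`, `store`, `query`, `retrieve`, `join`, and the attribute keywords) is an assumption for illustration; confidentiality checks, replication, and integrity policies are omitted. Items are held as uninterpreted bytes, queries consult only catalog attributes, and retrieval delivers a faithful copy in chunks whose size the caller chooses.

```python
class StorageSubsystem:
    """In-memory stand-in for the three-part interface; names are illustrative."""

    def __init__(self):
        self._catalog = {}   # item_id -> dict of catalog attributes
        self._items = {}     # item_id -> bytes, stored without interpretation

    # Update interface: the collection and its catalog change together.
    def store(self, item_id, content, **attributes):
        if not isinstance(content, bytes):
            raise TypeError("items are opaque byte strings")
        self._items[item_id] = content
        self._catalog[item_id] = dict(attributes)

    # Query interface: identifies items by catalog attributes, never by content.
    def query(self, **criteria):
        return sorted(
            item_id for item_id, attrs in self._catalog.items()
            if all(attrs.get(k) == v for k, v in criteria.items())
        )

    # Retrieval interface: a faithful copy, buffered in caller-sized chunks.
    def retrieve(self, item_id, chunk_size=4):
        data = self._items[item_id]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]

    # Join of query results with data held outside the formal catalog.
    def join(self, external, **criteria):
        return [(i, external[i]) for i in self.query(**criteria) if i in external]


store = StorageSubsystem()
store.store("m-1", b"Dear colleague", author="Gladney", kind="memo")
store.store("m-2", b"Invoice 17", author="Fox", kind="invoice")
```

In this sketch the external data passed to `join` is an ordinary dictionary keyed by item identifier, standing in for a database that the librarian has not included in the formal catalog.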
To provide enough flexibility for all possible applications, the document storage subsystem interface is likely to have many primitive operators, making it somewhat difficult to program for ad hoc applications. This can be overcome with document managers which implement broadly interesting information models, such as hypertext and document-in-folder models. For example, we see Mosaic [Ha94, p.510] as a document manager. The storage subsystem attempts comprehensive coverage of functional requirements in its domain; good document managers would offer less flexibility and fewer options, but would be much easier to explain and understand.
Document managers give services that vary among access incidents because different document types need different presentation / manipulation and users have different objectives and preferences. In the complex of software for library services, document managers are the only implementation (with limited exceptions already noted) of services for document editing, transformation, combination, and presentation, as well as complex information search dependent on content. In this architecture, document managers are workstation programs, readily accessible for users' selection and change.
In a practical system, each document manager embodies a document model -- the set of concepts that create the digital analog of some collection of papers or other physical objects, or some information network for a particular application, such as hypertext [Ha92], or some flow of documents. In contrast, the document storage subsystem layer avoids modeling. Typical document managers interpret scanned data to create catalog entries automatically, manage interrelationships among documents, facilitate the most common search methods, and help move information among workers:
* A folder manager might scan electronic memoranda, letters, contracts, and financial records; such a manager would extract names, addresses and dates to cross-index information received [Ma87] and associate each document with a folder. It might further model and facilitate the information flow of library administration, such as accessions management.
* The entities of a second document manager might be movies; it would communicate with its users in terms of movies, reels, and frames and with the storage subsystem using channels.
* A third document manager might feature a CAD system and be applied to maintenance records of university buildings; it would generate and display building plans with a graphic editor and maintenance contracts with a customized text editor.
* A fourth document manager might model what is found in a university library -- books and pamphlets with individually viewable pages, folders of papers, manuscripts, video tapes, etc.
Generic document managers for applications like geographic data systems, and enterprise-specific ones administering conventions and document quality standards, may evolve over time. While a good document manager would support most library services in its domain, we see the storage subsystem interface being exposed to allow applications to bypass their document managers. Applications and document managers execute in users' machines. The document storage subsystem provides retention and catalog services and manages inter-machine communications, hiding them to the extent possible. Implementation follows a client-server approach.
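The division of labor between the two layers can be sketched with a toy folder manager. All names are hypothetical, a plain dictionary stands in for the storage subsystem, and the only property extracted is a date written in the assumed form YYYY-MM-DD; a real folder manager of the kind described in [Ma87] would recognize far more. The point is that interpretation of content (folders, dates) lives in the document manager, while the store sees only opaque bytes.

```python
import re
from collections import defaultdict

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumed date format


class FolderManager:
    """Illustrative document manager layered on an opaque item store."""

    def __init__(self, store):
        self._store = store                 # stand-in for the storage subsystem
        self._folders = defaultdict(list)   # folder name -> item ids
        self._by_date = defaultdict(list)   # extracted date -> item ids

    def file(self, item_id, text, folder):
        # The store receives bytes only; all modeling happens in this layer.
        self._store[item_id] = text.encode("utf-8")
        self._folders[folder].append(item_id)
        for date in DATE_PATTERN.findall(text):
            self._by_date[date].append(item_id)

    def folder(self, name):
        return list(self._folders.get(name, []))

    def dated(self, date):
        return list(self._by_date.get(date, []))

    def read(self, item_id):
        return self._store[item_id].decode("utf-8")


manager = FolderManager(store={})
manager.file("c-1", "Contract signed 1994-03-01 in San Antonio.", "contracts")
manager.file("l-1", "Letter of 1994-03-01 confirming the workshop.", "letters")
```

An application that needs something the folder model does not offer could still read the underlying store directly, which corresponds to exposing the storage subsystem interface alongside the document manager.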
We feel that the suite of software that creates DLs will include at least the following module classes. Here we say "module classes" because each tabulated item in the list may be represented by several implementations to create a different look and feel or to provide different data transformations or for different hardware and operating systems.
* authentication/authorization server;
* authoring: editor, integrator;
* billing subsystem;
* browser, navigator;
* data analyzer;
* document analyzer;
* format converter;
* link engine;
* multimedia presenter;
* naming service;
* organizer, clusterer;
* presenter or renderer;
* query optimizer;
* recognizer of patterns/structure;
* script interpreter;
* search engine;
* source selector (fuser from sources);
* storage subsystem.
Any social unit (school, business, department, family, individual, ...) might create and manage its own library, and most individuals will want access to many libraries. All libraries should do certain things similarly -- adhere to certain standards -- so people do not need to learn new methods for each library and so information can be exchanged.
At the general level found in requests for proposals, in the trade literature, and in business publications, there is broad consensus on what services the digital library should have in 5-10 years. For a few of the generic components, such as storage subsystems and document markup languages and interpretation, detailed requirements analyses exist; typically they include hundreds of well-justified requirements. For most of the other generic components suggested earlier, similarly comprehensive requirements analyses are not available in the generally accessible literature.
To prioritize the requirements for academic and cultural DL services we must consider a range of objectives which will differ among different institutions. For some the premier objective will be improved accessibility to rare and valuable materials for scholars. For others, as in the TULIP project mounted by Elsevier in partnership with computer science groups at several universities, it will be easier information search by electronic publication of professional journals. For still others, it might be exploitation of interactive formats for instructional materials (IBM/Case Western Reserve University project and the Brown University Intermedia project), or broad public access to one-of-a-kind material (Library of Congress American Memory project), or preservation of fragile materials (Cornell University project [Ke93, We93]).
Requirements for Document Storage Services
An analysis done by IBM Research [Gl90] identified several hundred specific requirements -- too many to tabulate here. However, several broadly applicable elements emerged, and are summarized below because they typify what needs to be worked out for each library component class identified above:
* accessibility from all workstation platforms (in distributed fashion);
* application independence;
* catalog service from all kinds of operating system platforms;
* joining libraries to other databases;
* automatic capture and indexing;
* document managers;
* large and small items;
* low entry point, with growth to giant collections, maintaining performance;
* low installation and administration overhead;
* open subsystem (import/export to workstation application programs);
* standard interfaces and protocols;
* customer-defined data formats;
* support for all kinds of item storage;
* tools for "amateur" application programmers.
Requirements for Catalogs
The radically new possibility for DLs is storage and dissemination of collected items. In contrast, digital catalogs have been in practical use for some time, and there is a considerable body of experience, both good and bad, or at least questionable [Ba94], and some standards in this area. Library cataloging is known to be difficult:
"The preparation of a catalog may seem a light task, to the inexperienced, and to those who are unacquainted with the requirements of the learned world, respecting such works. In truth, however, there is no species of literary labor so arduous and perplexing. The peculiarities of titles are, like the idiosyncrasies of authors, innumerable." [Je53]
We did not consider this topic, but recommend renewed attention to it, either by resurrecting prior requirements analyses and re-examining them for current pertinence, or by constructing afresh something similar to what is available for the storage subsystem [Gl90].
Document Markup, Links, and Interchange Conventions
This topic is critical for documents produced specifically for the digital environment. This has been realized for some years, so that the topic has already received intensive examination, including standards activities and proposed industry conventions. We refer the reader to treatments of the Dexter model for hypertext [Gr94, Ha94a], of SGML and HyTime for standard document markup language [Go90a], and to the trade literature for arguments about the merits of Microsoft OLE (Object Linking and Embedding) and Apple OpenDoc [Pi94]. There is considerable overlap among these tools, which are mostly promulgated for personal computers and office applications; between them and the World Wide Web and Mosaic markup being popularized on the Internet; and probably between all of these and further document markup languages that we have overlooked. In addition, there are at least two incompatible standards for document interchange: ANSI Z39.50 [Ly91] and ISO DFR [Is91], with unresolved relationships with the linking conventions.
We feel that the DL community should avoid further competing activities. In the workshop, we did not consider the extent to which DL progress depends on the emergence of a limited number of document markup conventions or how the DL community should participate, if at all. We note in passing that object-oriented technology might be capable of hiding markup differences from end users, and suggest that this possibility be investigated.
The concept "library" has been refined over several centuries. It would be injudicious to depart from what people expect merely because a digital service is replacing a material one. Except where explicit reasons suggest an improvement that is easily explained to ordinary users (e.g., in query services), library services should implement a familiar model.
Many potential advantages of a digital library over a paper library are similar to those of any digital database over its paper counterpart: faster addition to the collection with better quality control, improved search functionality and faster access to information found, and more freedom and reduced bureaucracy for individual users. Achieving these advantages depends not only on efforts traditionally undertaken by computer scientists, but also on the highest quality engineering for human usability.
This paper is an abbreviated version of a report available from IBM Almaden Research Center, currently being prepared for the proceedings of the IEEE CAIA '94 Workshop on Intelligent Access to On-Line Digital Libraries, which was held in San Antonio, Texas on March 1, 1994, in cooperation with the IEEE Computer Society. Funding for this work comes in part from NSF grant IRI-9116991. The opinions, reflections, and ideas presented in this paper represent only the co-authors' individual (and collective) thoughts, and do not by any means denote the views of their respective organizations.
[Ba94] N. Baker, Annals of Scholarship: Discards, New Yorker, 64-86, (April 4, 1994).
[Ba89] D. Ballantine, Issues Related to the Preservation of Machine Readable Records, Presentation to the Annual Conference of the Assn. of Canadian Archivists, (1989).
[Be93] J. Browning, Libraries without Walls for Books without Pages: What is the Role of Libraries in the Information Economy?, Wired, premiere issue, (1993).
[Gl90] H.M. Gladney and P.E. Mantey, Integrated Records Management - A Statement of Requirements on the Library Subsystem, IBM Research Report RJ 7425, (April 1990).
[Gl93] H.M. Gladney, A Storage Subsystem for Image and Records Management, IBM Systems Journal 32(3), 512-540, (1993).
[Go90a] C.F. Goldfarb and S.R. Newcomb, Hypermedia/Time-based Document Structuring Language (HyTime), ANSI Project X3.749-D, X3V1.8M/SD-7.
[Gr93] Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Mateo, California, (1993).
[Gr94] K. Groenbaek and R.H. Trigg, Design Issues for a Dexter-Based Hypermedia System, Comm. ACM 37(2), 41-49, (1994).
[Ha92] B.J. Haan, P. Kahn, V.A. Riley, J.H. Coombs, and N.K. Meyrowitz, IRIS Hypermedia Services, Comm. ACM 35(1), 36-51, (Jan. 1992).
[Ha94] Harley Hahn and Rick Stout, The Internet Complete Reference, Osborne McGraw-Hill, Berkeley, California, (1994). Mosaic was written by Marc Andreessen of the National Center for SuperComputer Applications (NCSA) at the University of Illinois at Urbana.
[Ha94a] F.G. Halasz and M. Schwartz, The Dexter Hypertext Reference Model, Comm. ACM 37(2), 30-39, (1994). Extended version in Proceedings of the Hypertext Workshop, NIST Special Publication 500-178, 95-133, (March 1990).
[Is88] International Organization for Standardization, Open Systems Interconnection, Reference Model, Part 2: Security Architecture, ISO 7498-2, Geneva, Switzerland, (1988).
[Is91] International Standards Organization (ISO), Information Technology - Text and Office Systems - Document Filing and Retrieval Draft International Standard, ISO/IEC JTC 1/SC 18 10166-1, (June 28, 1991). (This draft standard has been ratified.)
[Je53] Charles Coffin Jewett, the Librarian of the Smithsonian Institution, Smithsonian Report on the Construction of Catalogues of Libraries, (1853).
[Ke93] A.R. Kenney and L.K. Personius, A TestBed for Advancing the Role of Digital Technologies for Library Preservation and Access, Final report by Cornell University to the Commission on Preservation and Access, Cornell University, (October 1993).
[Ku91] R. Kumar, OSF's Distributed Computing Environment, IBM AIXpert, 22-29, (Fall 1991).
[Ly91] C.A. Lynch, The Z39.50 Information Retrieval Protocol: An Overview and Status Report, Computer Communication Review 21(1), 58-70, (1991).
[Ma87] T.W. Malone, K.R. Grant, F.A. Turbak, S.A. Brobst, and M.D. Cohen, Intelligent Information Sharing Systems, Comm. ACM 30(5), 390-402, (1987).
[Ni92] G. Nickerson, WorldWideWeb: Hypertext from CERN. Computers in Libraries 12(11), 75-77, (1992).
[Pi94] K. Piersol, A Close-Up of OpenDoc, Byte 19(4), 183-188, (March 1994).
[We93] K. Webster, Cornell Project Saves Documents, Books -- and Makes Them Accessible, Adv. Imaging, 42-46, (Sept. 1993).
[Wo87] D. Woelk and W. Kim, Multimedia Information Management in an Object-Oriented Database System, Proc. 13th VLDB Conference, 319-329, Brighton (1987).