The mission of research libraries is to acquire information, organize it, make it available and preserve it. This has been their significant, distinctive and successful role with print and other artifactual materials for the past several hundred years. An implicit mission of computing science has been to make the benefits of computing technology of use to society at large. Missions, needs and capabilities now come together so that information users can have added assistance in performing research and in assuring the continuity of scholarship, today and in the future. It will take conscious, planned efforts by both librarianship and computing to make this happen.
This article sets out what must be done for a digital research library to be successful in meeting user needs. The primary requirement for a digital research library (DRL) is that from the start it be committed to organizing, storing and providing electronic information for periods of time longer than human lives. The expression "digital research library" is here used in preference over "digital library," a term preempted and given currency by Vice President Albert Gore, and "virtual library," a companion term brought forth by the National Science Foundation. These terms have so far been used narrowly to define a quantity of data bases available for use at a given time. A library, however, is not simply a network full of data bases nor a building full of books; it is an organization. A DRL is a set of electronic information organized for the long term.
Many libraries are now trying to provide the increasing volume of scholarly electronic information to their clienteles. Current information needs are being met in electronic form with varying success in public, college and research libraries around the country and the world. Research libraries have only begun to take on the provision, organization and preservation of electronic information with the same long-term commitment they have made for print materials. It is an expensive, uncharted and difficult task.
But until the long-term commitments are undertaken, many currently proposed solutions will have only temporary effects. For example, discussion of cataloging network resources will remain tentative, for until resources being cataloged have a permanent network presence (whether at fixed or virtual locations), the cataloging that points to them must also have an ephemeral quality. (Cataloging for some transitory electronic materials will always be necessary.) Similarly, the expensive products of recent valuable digitizing demonstration projects, from microfilm to digital form and vice versa, will be at risk after only a few years if tools and commitments are not in place for the preservation of what has been achieved.
Most important, the ability of the scholarly community to give serious weight to electronic information depends upon their trust in such information being dependably available, with authenticity and integrity maintained. Looked-for changes in scholarly publishing to help alleviate the serials crisis, for example, are usually thought to be bound up with the prestige of electronic journals in the academic tenure process. The ability of the academy to count on long-term, secure existence of electronic scholarly work will be an important determinant of the success of academic electronic publishing. Libraries and universities have a stake in helping electronic publishing to succeed, and therefore have an interest in establishing secure digital research libraries.
Users' needs will continue to be what they long have been. Users will want information reliably locatable, so that when they go there (whether personally or on the net) they can expect to find what they're looking for. Users will want information easily accessible: the cataloging must be clear and accurate, and the information must be promptly retrievable. In the electronic environment the need for access tools will be more evident, and users will expect appropriate and standard software to be readily available. Users will expect information to be available that was placed in the library's care a long time ago; and they will expect that the integrity of the information they get from the library will be assured.
Implementation of a Digital Research Library will require several specific tasks, broadly familiar to a computing audience, and three kinds of new commitments. In what follows the tasks are given more space, yet as technical problems they probably are the easiest to solve; they will only cost money. The institutional commitments described in the final section will be much more difficult to achieve.
All the issues are described here in cursory form. Each could be developed in great detail, but at the moment the outline and overall program are most important. Early implementations will test many of these assumptions and will add more requirements to the list. Work needs to begin.
A Digital Research Library will be manifest to users as collections of information existing in various places (not always evident) and accessible through the use of widely available tools. The locus of information may be called the electronic storage repository; the access tools will be described below.
Over time, we will learn how collection development plays out in an access environment as well as in an ownership environment. It is sometimes loosely proposed (seldom by librarians) that libraries need not acquire electronic information, for it will be available somewhere on the network. Such proposals ignore the obvious truth that some institution must still, in the end, take responsibility for the information. That has always been a definition of the library responsibility.
There will be many electronic storage repositories, responding both to requirements of redundancy and to the individual needs of institutions. In contrast to print collections, it is unlikely that there will be a high degree of content duplication across many electronic repositories, since for most purposes existence in a single place allows world-wide access. Aside from their actual contents, however, repositories that are part of a DRL will have many common characteristics. Some of these are described here; in some cases, open questions are noted that need to be explored in early implementations.
Megadocument contents. Even an initial repository will comprise many gigabytes of information, growing quickly to millions of electronic documents. The medium itself (disk storage) is cheap and the possible resources are plentiful.
Sources and potential participants. It is easy to cite numbers of electronic scholarly resources that now exist. A few are noted here only as examples:
* Johns Hopkins Medical Library medical image data base and its e-Journal of Medical Imaging;
* Texts maintained by the Center for Electronic Texts in the Humanities at Rutgers/Princeton (e.g. those of the Women Writers Project);
* Texts at the Georgetown electronic text center, such as those of C.S. Peirce, Hegel and Feuerbach, under varying licensing arrangements;
* Survey research data from the Interuniversity Consortium for Political and Social Research (ICPSR);
* Aviador, the Columbia University Libraries architecture image resource;
* Commercial publications, either profit or non-profit (from a university press? publications of a scholarly society, such as IEEE? a partnership with a commercial press, as in the TULIP project with Elsevier?); a repository could be a commercial alternative to local storage or no storage;
* Los Alamos National Laboratories Physics Preprint Data Base;
* National Archives and Record Administration materials;
* E-journals now established on the network, especially if peer reviewed (e.g. Psycoloquy, Bryn Mawr Classical Review, Early Modern Literary Studies, Journal of Fluids Engineering, Modal Analysis, OCLC Journal of Online Critical Trials (with attendant copyright issues), Scientist, Solstice);
* Early network activity as examples of ephemera, e.g. selected alternate (alt.x) newsgroups, information located at temporary ftp sites, samples of early advertisements, etc.;
* Listserv and newsgroup electronic archives;
* Commercial information bases which will not be made widely available, e.g. Biosis or Chadwyck-Healey's English Poetry, where it can be recognized that long-term preservation is necessary even though access might be licensed or otherwise constrained.
All these are only examples. None, of course, should automatically be selected; collection development policies should be adapted and followed. The continuing substantial costs of providing electronic information will require that electronic collection decisions be made even as carefully and parsimoniously as for print.
Backup mechanisms. Backup/restore procedures must be in place and must be automated and economical, for libraries are never likely to have expensive labor available in quantity. Backups must be multi-generational, using remote storage, with regular disaster simulations and tests.
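One way to picture a multi-generational backup policy is the familiar "grandfather-father-son" rotation. The following is only a minimal sketch under stated assumptions: the function name `generations_to_keep` and the retention counts (7 daily, 4 weekly, 12 monthly) are hypothetical choices for illustration, not a recommendation drawn from any existing repository.

```python
from datetime import date, timedelta

def generations_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the backup dates to retain under a simple
    grandfather-father-son scheme (retention counts are illustrative)."""
    keep = set()
    ordered = sorted(backup_dates, reverse=True)  # newest first

    # Sons: the most recent `daily` backups.
    keep.update(ordered[:daily])

    # Fathers: the newest backup in each of the most recent `weekly` ISO weeks.
    weeks_seen = []
    for d in ordered:
        key = tuple(d.isocalendar()[:2])  # (ISO year, ISO week)
        if key not in weeks_seen:
            weeks_seen.append(key)
            if len(weeks_seen) <= weekly:
                keep.add(d)

    # Grandfathers: the newest backup in each of the most recent `monthly` months.
    months_seen = []
    for d in ordered:
        key = (d.year, d.month)
        if key not in months_seen:
            months_seen.append(key)
            if len(months_seen) <= monthly:
                keep.add(d)

    return sorted(keep)

# Sixty consecutive daily backups ending on an arbitrary example date.
days = [date(1995, 5, 1) - timedelta(days=i) for i in range(60)]
kept = generations_to_keep(days)
```

Generations outside the retained set could be recycled, while at least one copy of each retained generation would go to remote storage. The disaster simulations the text calls for would then exercise restoration from each generation in turn.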
Staged Access. "Staging" refers to the prioritized use of different mechanical methods of storing data as it waits to be recalled. All data does not need to be immediately available on the most expensive and fastest storage media. Alternatives for providing immediate online access to the enormous potential volume of scholarly information need to be provided. What can be off line, and how can it be retrieved? Present alternatives include magnetic disks, optical disks and jukeboxes, optical disks on shelves, magnetic tapes on site, tapes in remote storage, and automated data warehouses of magnetic tapes.
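The staging decision can be sketched as a simple rule that assigns each item to a storage tier according to how recently it was used. The tier names below follow the media listed above; the idle-day thresholds and the function name `choose_tier` are assumptions made purely for illustration.

```python
# Tiers ordered from fastest and most expensive to slowest and cheapest.
# The thresholds (days since last access) are illustrative assumptions.
TIERS = [
    ("magnetic disk", 30),
    ("optical jukebox", 365),
    ("on-site magnetic tape", 5 * 365),
]
ARCHIVE_TIER = "remote tape storage"

def choose_tier(days_since_last_access: int) -> str:
    """Pick the first tier whose idle threshold the item still satisfies."""
    for name, max_idle_days in TIERS:
        if days_since_last_access <= max_idle_days:
            return name
    return ARCHIVE_TIER
```

A real policy would likely weigh more than recency (size, licensing terms, expected demand), but even this rule makes the trade-off explicit: items move to cheaper media as they go unused and can be recalled on request.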
Data structure standards. In a repository, does information simply exist as is (as first created) or is complementary information associated with it? Widely differing examples include SGML (Standard Generalized Markup Language) headers, ICPSR codebooks, picture captions, hypertext links and early software versions for use with data files. There is an increasing need to link bit-mapped page images to ASCII text versions of the page contents. If there is an association, is it through use of header portions of a file or through supplemental files? How are they indicated and connected?
Refreshing mechanisms. Refreshing is agreed to be necessary for long-term preservation across advances in computing technology, media and software. There will be organizational and bureaucratic issues in addition to the simply technical. If information is copied from magnetic to optical disk, copyright issues must be recognized. Automation will be necessary to reduce labor costs. Other issues include workflow and record-keeping, migration techniques, and standards and techniques that will apply independently of technology. It may be possible to link refreshment to backup techniques for expedience and economy.
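An automated refresh step might copy a file to the new medium, confirm that the copy is bit-for-bit identical, and record what was done. The sketch below assumes both media are reachable as ordinary file paths and uses SHA-256 as the comparison digest; the function names `file_digest` and `refresh` and the shape of the log entries are hypothetical.

```python
import hashlib
import shutil

def file_digest(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def refresh(source, destination, log):
    """Copy `source` to `destination`, verify the copy, and append a
    record of the migration to `log` (a plain list in this sketch)."""
    before = file_digest(source)
    shutil.copyfile(source, destination)
    after = file_digest(destination)
    if before != after:
        raise RuntimeError(f"copy of {source} does not match the original")
    log.append({
        "source": str(source),
        "destination": str(destination),
        "sha256": after,
    })
    return after
```

The log stands in for the record-keeping mentioned above. Because a verified copy is also a usable backup, the same routine suggests how refreshing and backup might share machinery.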
Authentication and integrity. Intellectual preservation goes beyond preservation of the medium and the technology to assure the protection of the intellectual structure of information as it was recorded by its author. To meet user expectations DRL's must implement authentication and integrity techniques that combine mathematical security with ease of use, public trustworthiness and privacy protection. For example, bit patterns of texts, sound and images may be preserved through cryptographic hashing and encoding methods such as the digital time-stamping technique. Standards and conventions for use and citation will be necessary.
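The idea behind hash-based integrity checking can be shown in a few lines. This is a deliberately simplified sketch, loosely in the spirit of the linking idea in the digital time-stamping work cited, and it is not the published protocol: SHA-256 is an assumed hash function, the timestamps are supplied by the caller rather than by a trusted service, and the names `fingerprint`, `append_stamp` and `verify` are hypothetical.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous link" for the first entry

def fingerprint(data: bytes) -> str:
    """Return a fixed-length digest of the document's exact bit pattern."""
    return hashlib.sha256(data).hexdigest()

def _link(prev: str, doc_hash: str, timestamp: str) -> str:
    return fingerprint((prev + doc_hash + timestamp).encode("utf-8"))

def append_stamp(chain, document: bytes, timestamp: str):
    """Add an entry whose link depends on the previous entry, so that
    altering any earlier record changes every later link."""
    prev = chain[-1]["link"] if chain else GENESIS
    doc_hash = fingerprint(document)
    entry = {
        "doc_hash": doc_hash,
        "timestamp": timestamp,
        "prev": prev,
        "link": _link(prev, doc_hash, timestamp),
    }
    chain.append(entry)
    return entry

def verify(chain, index: int, document: bytes) -> bool:
    """Check that `document` matches entry `index` and that the chain
    up to that entry is internally consistent."""
    if fingerprint(document) != chain[index]["doc_hash"]:
        return False
    prev = GENESIS
    for e in chain[: index + 1]:
        if e["prev"] != prev or e["link"] != _link(prev, e["doc_hash"], e["timestamp"]):
            return False
        prev = e["link"]
    return True
```

A single changed character in a text produces a different fingerprint, so a user holding the recorded value can test whether the document received is the one that was deposited. Only digests are recorded, not the documents themselves, which bears on the privacy concern raised above.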
Redundancy. It will be important to establish standards for the number of repository locations necessary to assure long-term existence of specific electronic information and access to it. One location won't do for a particular major electronic document or set; will two, or three? How many? Major institutions may separately or consortially establish repositories. It is not yet clear how much redundancy of their components will be desirable among them.
Aside from assuring longevity, other issues come to bear on decisions to provide multiple permanent copies of electronic information. Informed decisions will be made about the dynamic interplay between costs of network bandwidth, response time and costs of storage. It seems likely that many library consortia will be formed on the basis of joint contracts with information vendors. Geographic location, nationalism and regionalism will likely play a role (at least intercontinentally, and probably intracontinentally).
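The question "will two, or three?" can at least be framed arithmetically. If each repository independently loses a given document over some period with probability p, the chance that all n copies are lost is p to the power n. The sketch below rests entirely on that independence assumption, which shared vendors, shared software or shared geography would weaken; the probabilities used are invented for illustration.

```python
def probability_all_copies_lost(p_single: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent failures."""
    return p_single ** copies

def copies_needed(p_single: float, target: float) -> int:
    """Smallest number of independent copies whose joint loss
    probability does not exceed `target`."""
    if not (0 < p_single < 1 and 0 < target < 1):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    n = 1
    while probability_all_copies_lost(p_single, n) > target:
        n += 1
    return n

# Illustrative figures only: a 5% chance of loss per repository and a
# tolerated overall risk of 0.1%.
needed = copies_needed(0.05, 0.001)
```

The calculation shows why one location will not do and why the marginal benefit of each further copy diminishes. It says nothing about the costs of bandwidth, response time and storage, which must be weighed separately.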
Usage and retrieval mechanisms. The full panoply of present access tools must be supported by a Digital Research Library (e.g. online catalogs and OPACs, FTP, gopher, World Wide Web and its multiple clients) with provision for the new access tools that are likely to appear regularly. The "granularity" of documents needs to be addressed: how may one retrieve only part of a document when the full document may be of substantial size (e.g. the full text of Moby-Dick or of a legal code; or a presentation of many images from which one is desired)? Must documents be pre-coded (or pre-marked) to allow such granular access, or can access-time mechanisms be made available?
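The pre-marked alternative can be illustrated with a small index built from divisions already present in a text. The sketch assumes that sections are introduced by lines of the form "CHAPTER n ..."; that marker pattern and the function names `build_section_index` and `fetch_section` are hypothetical, and an SGML-encoded text would offer richer markers.

```python
import re

def build_section_index(text, marker_pattern=r"^CHAPTER \d+.*$"):
    """Map each marked heading to the (start, end) character offsets
    of the section it introduces."""
    matches = list(re.finditer(marker_pattern, text, flags=re.MULTILINE))
    index = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        index[m.group(0).strip()] = (m.start(), end)
    return index

def fetch_section(text, index, heading):
    """Return only the requested section rather than the whole document."""
    start, end = index[heading]
    return text[start:end]

sample = (
    "CHAPTER 1 Loomings\nCall me Ishmael.\n"
    "CHAPTER 2 The Carpet-Bag\nI stuffed a shirt or two into my old carpet-bag.\n"
)
sections = build_section_index(sample)
```

Once such offsets are stored alongside the document, a server could deliver a single chapter or image on request. An access-time mechanism would instead compute the division when the request arrives, at the cost of scanning the document each time.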
Techniques for document update and consequent archiving and labeling need to be developed, as well as flags indicating obsolescence or supersession (or conversely indicating status as an authorized version), e.g. for ANSI standards, monthly statistical reports or draft versions. A form of SGML may be appropriate in some cases, for example the format proposed by the TEI (Text Encoding Initiative).
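Status flags of this kind amount to a small data structure. In the sketch below, every earlier authorized version is marked superseded when a new authorized version is deposited, and nothing is deleted. The class names, the three status values and the example labels are assumptions for illustration, not an existing standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Version:
    label: str   # e.g. "1995-04" for a monthly report
    status: str  # "draft", "authorized" or "superseded"

@dataclass
class Document:
    title: str
    versions: List[Version] = field(default_factory=list)

    def deposit(self, label: str, authorized: bool = True) -> None:
        """Archive a new version; an authorized deposit supersedes any
        earlier authorized version, which is retained, not removed."""
        if authorized:
            for v in self.versions:
                if v.status == "authorized":
                    v.status = "superseded"
        self.versions.append(Version(label, "authorized" if authorized else "draft"))

    def current(self) -> Optional[Version]:
        """Return the authorized version, if one exists."""
        for v in reversed(self.versions):
            if v.status == "authorized":
                return v
        return None

report = Document("Monthly statistical report")
report.deposit("1995-03")
report.deposit("1995-04")
report.deposit("1995-05 draft", authorized=False)
```

Keeping superseded versions with an explicit flag lets a citation to an older version remain resolvable while still directing most users to the authorized text.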
Cataloging. Providing access to voluminous information is an intellectual problem that historically has been solved in the print environment by abstracting and indexing services and by library cataloging, with attendant rules and procedures to ensure consistency and accuracy. These tools, adapted to suit new needs, will work for electronic information as well. They should be linked to the new retrieval mechanisms so that users can smoothly navigate from location of information to retrieval of it without having to shift their mode of use. Early mechanisms will probably link catalog records to documents using tools such as the WWW, the Uniform Resource Identifier (and Locator) or URI/URL, and the recently proposed MARC 856 field. SGML may offer other possibilities for linking of certain documents through its document description techniques. In any case, there eventually will need to be consensus both for the representation of physical electronic locations in bibliographic records and for representation of virtual locations.
If the DRL's catalog system works well, users will be able to search for information, locate bibliographic records for desiderata, and use those records directly to draw the desired information to their workstation. Where an authentication technique is used (see above), means for including and testing the certification must be provided. Standards for such cataloging and remote access still need to be developed, particularly for providing catalog access to non-owned materials. The present review of AACR2R Chapter 9 is to be applauded, as is the recent OCLC study on the cataloging of non-book materials.
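The step from bibliographic record to retrieval can be sketched with a much simplified stand-in for a MARC record. Field 856 with subfield $u holding a URL follows the MARBI proposals cited in the notes; the dictionary layout, the function name `electronic_locations` and the example location are hypothetical, and no real cataloging software is represented here.

```python
def electronic_locations(record):
    """Return every URL found in subfield $u of each 856 field."""
    urls = []
    for tag, subfields in record["fields"]:
        if tag == "856":
            urls.extend(value for code, value in subfields if code == "u")
    return urls

# A toy record: field 245 carries the title, field 856 the electronic
# location. The URL is an invented placeholder, not a real address.
record = {
    "fields": [
        ("245", [("a", "Bryn Mawr Classical Review")]),
        ("856", [("u", "gopher://repository.example.edu/11/bmcr")]),
    ]
}

locations = electronic_locations(record)
```

A catalog client that finds such a location could hand it to the appropriate access tool, so that the user moves from the record to the document without changing mode of use. Where a certification accompanies the document, the client would test it before presenting the text.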
Remote Access. A DRL should from the outset be intended for access from multiple remote locations. Internet-wide access should generally be possible. In early pilot implementations it may initially be advisable for a few libraries to plan and develop catalog and access mechanisms that integrate the individual libraries' collections with that of the DRL. Procedures for dissemination of such catalog records will be needed; it will be not only a technical matter but a policy matter for libraries associated with the DRL to provide non-local access to their local patrons. Presumably the bibliographic utilities, such as OCLC and RLIN, will play their accustomed role.
Fees and freedom. In practice these are often linked issues. Standards and techniques will be necessary to solve a knot of interconnected problems surrounding access and ownership, including
* Privacy preservation for users, while also protecting
* Copyright protection for intellectual property holders, while also protecting
* Fair use mechanisms, and also providing
* Fee-charging techniques, including billing, where relevant.
Much of what has been described so far is merely technical, and the outlines of solutions are becoming clear even if the details remain to be worked out (set aside here are the non-trivial matters of cost). More difficult will be the social compacts, that is, the agreements on standards, intellectual property and access modes.
Most difficult of all to achieve, if electronic preservation and access are to be accomplished on any significant scale, will be the long term commitments to these goals by institutions. Nothing makes clearer that a library is an organization, rather than a building or a collection, than the requirement for institutional commitment for electronic information to have more than a fleeting existence.
The organization of libraries is already changing as electronic information increasingly becomes part of their charge. Most research libraries now have substantial systems departments. Some libraries locate the responsibility for electronic information distinctly from that for print. Other libraries see the forms as inseparable and include electronic responsibilities along with artifactual responsibilities in assignments for collection development, cataloging and public service.
What is new will be the permanent assignment of staff responsibility for the long term maintenance of electronic information within a library. There is no obvious artifactual parallel for this responsibility: circulation, stack maintenance, preservation and physical plant departments now share it for print. Nor are there present parallels in academic computing centers, where staffs typically focus on technological advance and availability, leaving data to the users. The electronic preservation responsibility will be focused as it will require technical expertise likely to be located in a single functional area.
It is by no means clear that this functional area will be what we used to call the library's systems department. As libraries move more into the electronic environment the historic tripartite division of libraries into public services, technical services and collection development will continue but in more fluid arrangements. People who combine bibliographic understanding, problem-solving abilities and process orientation have often been found in technical services as well as elsewhere in libraries. Similar librarians will take on the demanding new technical, collection and service responsibilities for long-term support of digital collections. At the same time, it is becoming clear that the traditional computing community is fertile with ideas, analysis and skills that will be important to electronic library goals.
The permanent existence of a digital research library will require assured continuity in operational funding. Almost any other library activity can survive a funding hiatus of a year or more. Acquisitions, building maintenance, and preservation can be suspended, or an entire staff can be dispersed and a library shut down for several years, and the artifactual collections will more or less survive. But digital collections, like the online catalog, require continual maintenance if they are to survive more than a very brief interruption of power, environmental control, backup, technological advance and related technical care.
Online catalog maintenance costs have reached a rough steady state, and the capital costs for new OPACs are decreasing relative to the capabilities provided. The catalog size will continue to increase, but catalog records are small relative to the information to which they refer. DRL's, however, as a proportion of the library's supply of information, will grow for the foreseeable future, and the quantity of information requiring care will become considerable (and much larger than the catalog). Unit costs of storage are likely to continue falling for some time, which may make the financial burden manageable. (Staffing costs are not expected to increase, simply because overall staff growth in most libraries is likely to be restrained; reassignments, however, are likely.)
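The interplay between growing volume and falling unit cost is simple compound arithmetic. The sketch below projects annual storage cost as volume times unit cost, with volume growing and unit cost declining at fixed annual rates; every figure in the example is invented for illustration and none is an estimate for any actual library.

```python
def projected_storage_costs(initial_volume, unit_cost, growth_rate,
                            cost_decline_rate, years):
    """Return a list of annual storage costs, one per year.

    initial_volume    -- starting quantity stored (any unit, e.g. gigabytes)
    unit_cost         -- starting cost per unit of storage
    growth_rate       -- annual growth in volume (0.5 means 50% per year)
    cost_decline_rate -- annual fall in unit cost (0.5 means 50% per year)
    """
    costs = []
    volume, unit = initial_volume, unit_cost
    for _ in range(years):
        costs.append(volume * unit)
        volume *= 1 + growth_rate
        unit *= 1 - cost_decline_rate
    return costs

# Illustrative only: volume grows 50% a year while unit cost halves.
costs = projected_storage_costs(100, 10.0, 0.5, 0.5, 3)
```

Total cost falls only when (1 + growth rate) times (1 - decline rate) is less than one. If collections grow faster than unit costs fall, the burden rises despite cheaper media.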
Long term funding will be required to assure long term care. Libraries and their parent institutions will need to develop new fiscal tools and use familiar fiscal tools for new purposes. Public institutions, usually constrained to annual funding, will have particular difficulties; existing procedures for capital or plant funding may provide precedents. One familiar technique is the endowment. It has been difficult to obtain private funding for endowments of concepts and services rather than books and mortar, but it is possible. Institutions might also build endowments out of operating funds over periods of time.
Some revenue streams associated with Digital Research Libraries may be practical. Consortial arrangements may allow for lease or purchase of shares in a DRL. Shorter-term access might be provided to other institutions on a usage basis. Access could be sold to certain classes of users, e.g. businesses, non-local clienteles, or specific information projects. New relations with publishers, presently difficult to perceive through the mists rising from intellectual property, might result in fee income for storage of electronically published materials during the copyright lifetime in which publishers collect usage fees. With commitment and imagination long term fiscal tools will be found.
All these are instrumental means of accomplishing the greatest requirement, that of conscious, planned institutional commitment to preserve that part of human culture which will flower in electronic form. While museums have preserved artifacts (often beautiful) that embody information, libraries have preserved information that has been embedded in artifacts (only occasionally of aesthetic interest in themselves). The advent of electronic information will accentuate the difference between these roles as libraries take the responsibility for the preservation of information in non-artifactual forms.
For the past century most research libraries have been associated with universities, and this connection seems likely to continue in the immediate future. Whatever the governance structure, an institution wishing to benefit from electronic information will have to make a conscious commitment to providing resources. Michael Buckland, of the University of California at Berkeley, has distinguished between a library's role and its mission. Where the role of a library is to facilitate access to information, its mission is to support the mission of its parent institution. Thus if a university wishes to continue relying on mission support from its library, it will have to make commitments to support the library's role. In the electronic environment, this means new longstanding financial commitments which the library and university together must identify and establish.
The commitment will have to be clearly and publicly made if scholars and other libraries are to have confidence that a given DRL is indeed likely to exist for the long term. Guidelines or standards will be desirable that define what is meant by a long term commitment, and that define which electronic repositories of data can qualify to be termed part of a digital research library. Just as donors of books, manuscripts and archives look for demonstration of long term care and commitment, so too will scholars and publishers as electronic information is created and requires a home.
Establishing a Digital Research Library continues the research library role. For a university to do so should be considered as natural as acquiring the next book or cataloging the next journal. Not to do so will be an abdication of that responsibility. The skills and understandings of both the library and computing communities will be essential to carry out this goal of preserving the human record in the electronic environment.
The tasks call not so much on new knowledge nor on new techniques, but upon informed commitment; that is, upon will. For computing experts seeking a goal worthy of their skills, here is their challenge. For librarians wondering what is to come of their profession in the electronic age, here is their challenge. For institutions intending to continue their mission of expanding the permanent acquisition of human knowledge, here is their responsibility.
The author wishes to thank Marianne Gaunt and Robert T. Warwick, both of the Rutgers University Libraries, and Czeslaw Jan Grycz of the University of California, for attentive readings of earlier drafts. Preliminary forms of this material were presented at an Institute of the Association for Library Collections and Technical Services (ALA)(October, 1993) and at a Task Force meeting of the Coalition for Networked Information (November, 1993).
1. Artifactual materials include books, journals, manuscripts, recordings and other information resources which are inseparably linked to the objects that are their medium, and therefore exist in space and require specific physical handling to use. In contrast with such materials, where to preserve the artifact is to preserve the information contained in it, electronic information is easily transferred from one medium to another with no loss.
2. Thomas J. DeLoughry, "Government Provides $24-Million for 'Virtual Libraries' Projects," Chronicle of Higher Education, October 5, 1994, A26.
3. The Research Libraries Group, at the beginning of 1995, established a Digital Collection Project Task Force to carry out its Board of Directors' mandate to investigate these issues. The Library of Congress "Digital Library" project makes brief reference to some of these issues in its Strategic Directions Toward a Digital Library: A Working Paper... (LC: September 13, 1994). The "Digital Library Federation" of about 15 major libraries, including Harvard, Columbia, Stanford, Michigan, Tennessee and Penn State, was announced formally on May 1, 1995 (information is available from the Commission on Preservation and Access, Washington, DC). The Kellogg Foundation has funded a "Digital Libraries Project" at Harvard under Brian Kahin, but early announcements deal with copyright and general information access without mentioning issues of longevity; "Kellogg Gives Harvard $650K," Library Journal (May 1, 1995), p. 14.
4. Paul Conway, "Digitizing Preservation," Library Journal (February 1, 1994), pp. 42-45. See also Donald J. Waters, Electronic Technologies and Preservation (Washington, DC: Commission on Preservation and Access, 1992); and the 1994 pamphlet from the Commission, The Digital Preservation Consortium: Mission and Goals.
5. Others will be found listed in 1994 Directory of Electronic Journals and Newsletters, ed. Ann Okerson (Washington, DC: Association of Research Libraries, 1994); also at <URL:gopher://arl.cni.org:7/11/scomm/edir>.
6. A thoughtful beginning of a formal document architecture for a digital library is contained in Anne R. Kenney and Lynne K. Personius, A Testbed for Advancing (Washington, DC: Commission on Preservation and Access, Oct., 1993), Appendix II, "Document Architecture Description," 75-81.
7. "In this new world, preservation means copying, not physical preservation." Michael Lesk, Preservation of New Technology: A Report of the Technology Assessment Advisory Committee to the Commission on Preservation and Access (Washington, DC: Commission on Preservation and Access, 1992), 13.
8. Clifford A. Lynch, "The Integrity of Digital Information: Mechanics and Definitional Issues," Journal of the American Society for Information Science 45 (1994), 737-744. See also Peter S. Graham, "Preserving the Intellectual Record and the Electronic Environment," Scholarly Communication and the Electronic Environment: Issues for Research Libraries, ed. Robert Sidney Martin (Chicago: ALA, 1993), 71-101.
9. Stuart Haber and W. Scott Stornetta, "How to Time-stamp a Digital Document," Journal of Cryptology 3 (1991), 99-111; also, under the same title, as DIMACS Technical Report 90-80 ([Morristown,] New Jersey: December, 1990). See also D. Bayer, S. Haber and W.S. Stornetta, "Improving the Efficiency and Reliability of Digital Time-stamping," Sequences II: Methods in Communication, Security, and Computer Science, ed. R. M. Capocelli et al (New York: Springer-Verlag, 1993), 329-334. A useful brief account is in Barry Cipra, "Electronic Time-Stamping: The Notary Public Goes Digital," Science 261 (July 9, 1993), 162-163. For an account of digital time-stamping in the library context, see Peter S. Graham, Intellectual Preservation (Washington, DC: Commission on Preservation and Access, March, 1994).
10. Clifford Lynch, A Framework for Identifying, Locating, and Describing Networked Information Resources (March 24, 1993; electronic "Draft for discussion at March-April 1993 IETF Meeting"), n.p., section "Referencing Parts of Objects" (my citation in this form exemplifies the problem).
11. L. Burnard, What is SGML and How Does it Help? TEI document TEI ED W25, October 1991, available from TEI fileserver (firstname.lastname@example.org; send the line "get TEI-L filelist"); International Organization for Standards, ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), ISO, 1986; Eric van Herwijnen, Practical SGML (Kluwer, 1991); C. M. Sperberg-McQueen and Lou Burnard, eds. Guidelines for Electronic Text Encoding and Interchange (TEI P3). (2 vols.) Chicago, Oxford: Text Encoding Initiative, 1994 (available by anonymous ftp from <URL:ftp://ftp-tei.uic.edu/pub/tei>).
12. Lynch, in Framework, proposes "that the emphasis be on describing content...rather than access mechanisms" (§ "Cataloging Networked Information Resources").
13. Tim Berners-Lee. July 14, 1993. Uniform Resource Locators <URL:ftp://ds.internic.net/internet-drafts/draft-ietf-uri-url-01.txt(or ...-01.ps)>. There is a good deal of more recent work in this area being done by IETF groups (for current status, see <URL:http://www.ietf.cnri.reston.va.us/1id-abstracts.html>). See also MARBI Proposal 93-4 (Nov. 20, 1992), p. 5 ff, for comments on the possible relations between the URL and the proposed MARC (Machine-Readable Cataloging) field 856 (Electronic Location and Access); and MARBI Proposal 94-3 (Dec. 6, 1993), which specifically proposed adding a subfield $u to field 856 to accommodate a URL; these proposals have been adopted by the library community.
14. For a further description of this potential for integration see Peter S. Graham, "The Mid-Decade Catalog," in ALCTS Newsletter (January, 1994), pp. A-D.
15. Martin Dillon et al, Assessing Information on the Internet (Dublin, Ohio: OCLC, 1993). AACR2R is the second edition, revised, of the Anglo-American Cataloging Rules.
16. References to the need for long-term commitment are beginning to appear. Paul Conway, and Jim Barker at Case Western Reserve's Library, have called attention to it (Conway, "Digitizing Preservation," p. 44). A rare example in the computing community is in John A. Kunze, Functional Requirements for Internet Resource Locators (IETF URI Working Group Internet-Draft, 27 July 1994), § 4, "Resource Access and Availability" <URL:ftp://ds.internic.net/internet-drafts/draft-ietf-uri-rl-fun-req-01.txt>.
17. See, for example, Jerome Saltzer, "Technology, Networks, and the Library of the Year 2000," in Future Tendencies in Computer Science, Control, and Applied Mathematics, Lecture Notes in Computer Science 653, edited by A. Bensoussan and J.-P. Verjus, Springer-Verlag, New York, 1992, pages 51-67, and available at <URL:http://ltt-www.lcs.mit.edu/ltt-www/Papers/inria.html>. See also the works mentioned of Berners-Lee and the IETF groups working on the URI (the group working on the Uniform Resource Characteristics (URC) <URL:ftp://ietf.cnri.reston.va.us/internet-drafts/draft-ietf-uri-urc-req-01.txt>, however, would benefit from more exposure to cataloging principles).
18. The national libraries are the great exceptions, such as those of Britain, Russia, France and the United States. Exceptions in this country include the handful of independent research libraries such as those at the Folger, the Huntington Institution and the American Antiquarian Society, and some of the great civic institutions such as the Boston and New York Public Libraries. For the possibility of the link between research libraries and universities being lost, see the 1991 Malkin Lecture of Terry Belanger, The Future of Rare Book Libraries (Charlottesville: Book Arts Press, in preparation; text available from Dec. 16, 1991 archive of ExLibris, a listserv at rutvm1.rutgers.edu, message from: email@example.com, subject: Malkin Lecture).
19. Michael Buckland, "Putting It Together: The Principles of Information Access," presentation at the ALCTS Institute, The Electronic Library: Administrative Issues for Organization and Access (San Antonio: October 29, 1994).