The Harvard Self-Enriching Library Facilities (SELF) Project

Gary King[1], H. T. Kung[2], Barbara Grosz[2], Sidney Verba[3],
Dale Flecker[4], and Brian Kahin[5]

[1] Department of Government

[2] Division of Applied Sciences

[3] Department of Government and Harvard University Library

[4] Harvard University Library

[5] John F. Kennedy School of Government, Harvard University, Cambridge, Massachusetts 02138


This paper proposes the development of a library system with an in-depth catalog of social science materials linked initially to a digital collection of survey and other research datasets. We propose that digital libraries can accept information from, as well as dispense information to, users, and that such libraries can be improved and enriched through use. The digital library is composed of the user client software, a metadata or catalog server, a data store server, and a contribution acceptance server. Primary emphasis is given to research and development of the user client software, which must support information retrieval (both traditional and hypertext), tools to capture the extensive knowledge, analysis, and information routinely created by users during the research process, and the ability to contribute such information back to the digital library. The system will be designed to be extensible to various types of digital content beyond survey datasets, and to other types of access networks, including cable television access. In addition to issues of user client design, research areas include the ``bibliographic'' description of datasets, security/authentication in an inter-institutional context, scalable client/server architectures, information certification, and intellectual-property management.

Keywords: User contribution, client software, information retrieval, certification, work environments, user interface

1. Introduction

We have developed a research agenda based on the design and implementation of a ``Self-Enriching Library Facilities'' (SELF) technology. By developing and exploiting the on-line, interactive features of a digital library, SELF systems allow users to contribute to the archive, adding useful, organized information as a natural part of the work they already do. In particular, users will easily be able to (1) enter evaluative comments, (2) add links between related works, and (3) contribute their own data, manipulations, analytical results, and software to the library.

Because the system will capture and incorporate users' knowledge and tools, the quality of the library will be continually enriched and its capabilities will be progressively enhanced. Instead of serving as passive repositories, the SELF technology will convert libraries into information sources that grow as users employ them. Self-enrichment, the capacity to incorporate users' contributions, distinguishes the SELF approach from many other digital library designs. Instead of devoting resources to expensive one-time digitization of printed materials, the SELF project will invest primarily in tools to capture the extensive knowledge, analyses, and information routinely created by users. Tools will be provided to allow users to contribute easily. In addition to improving digital libraries automatically, SELF tools will improve the practice and quality of scientific research: they will enhance tracking of research by an individual researcher (playing a role similar to that of a natural scientist's laboratory notebook) and make it possible to replicate the research of others.

The proposed research project will develop facilities that subject the fundamental features of this self-enriching library vision to intensive empirical evaluation. These features include:

  1. A work environment that facilitates users' contributions to the database
  2. A user interface that supports contribution submission by users
  3. A client/server architecture that is extensible to new hardware platforms and to new types of data, including users' contributions
  4. Mechanisms for certifying information and protecting users' intellectual property by authenticating users' identities and limiting access rights

The primary users of the testbed will be social scientists, using data archives from the Inter-university Consortium for Political and Social Research (ICPSR, the largest collection of social science data in the world), and the Harvard On-Line Library Catalog, HOLLIS, of Harvard University (the largest university library in the world). Social scientists from all over the country are expected to participate in the design and evaluation of the tools and systems to be developed.

An important goal of the project is to create tools and procedures and to gain an understanding of user behavior that extends to all types of digital library use, not only to research by scholars in the social sciences. The system will be extensible to multiple network configurations including the Internet and cable TV, to other data archives including image and video servers, and to other types of users including grade-school pupils and casual browsers.

2. Project Rationale

The profound change in media that the transition from paper to digital libraries represents will undoubtedly be accompanied by equally profound changes in the intellectual organization and day-to-day use of our libraries. During discussions of possible pilot digital library applications at Harvard, we decided to focus on one important aspect of this change. Using traditional libraries is a one-way process whereby users retrieve materials from the library. Indeed, usage actually degrades (through normal wear-and-tear) rather than enhances paper-based collections. Through SELF technology, digital libraries offer the possibility of blurring the distinction between users and librarians, since both will be able to contribute to the collection.

The collection we propose to use in our initial digital library implementation consists of the datasets distributed by ICPSR (the Inter-university Consortium for Political and Social Research), a membership collaborative that collects and distributes social science data of numerous types. Harvard makes very heavy use of ICPSR data, with researchers spread across many schools and academic departments. Several aspects of the present use of ICPSR datasets make the SELF model attractive:

Discussions with empirical data researchers revealed a great willingness to make available through automated contribution to the digital library various products of their research, especially if some of the burdens of making this information available were resolved by the SELF project.

Specifically, the SELF library model will enable researchers to make at least three types of contributions:

The latter two types of contribution are applicable to library materials beyond datasets, and address two shortcomings of traditional library catalogs:

3. Architectural Overview of the Proposed SELF Library

The SELF library is based on the client/server system architecture, and has four basic components: the user client software, a metadata or catalog server, a data store server, and a contribution acceptance server. Figure 1 outlines the architecture.

The most difficult and sophisticated component is the client. Client functions include:

We believe that the user interface is one of the most critical elements in the design of the digital library, and research into the effective design of the client software is a major element in the SELF project.

The metadata server contains traditional library catalog data for objects in the digital library, enhanced with commentary, links to related data and publications, and extended information about datasets, such as lists of data elements. The server will support two retrieval protocols: Z39.50 for search and retrieval, and HTTP for following links from the catalog records to other relevant data.
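The enhanced catalog record and its two retrieval modes can be illustrated with a minimal in-memory sketch. It does not implement Z39.50 or HTTP; the class names, field names, and record identifiers below are hypothetical and chosen only to show a record that carries commentary, links, and data element lists, together with phrase search and link following.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """One metadata record; all field names are illustrative assumptions."""
    record_id: str
    title: str
    comments: list[str] = field(default_factory=list)       # user commentary
    links: list[str] = field(default_factory=list)          # ids of related records
    data_elements: list[str] = field(default_factory=list)  # variables in a dataset


class Catalog:
    """In-memory stand-in for the metadata server (assumed design)."""

    def __init__(self) -> None:
        self._records: dict[str, CatalogRecord] = {}

    def add(self, record: CatalogRecord) -> None:
        self._records[record.record_id] = record

    def search(self, phrase: str) -> list[CatalogRecord]:
        """Word or phrase search over titles, comments, and data element names."""
        needle = phrase.lower()
        hits = []
        for rec in self._records.values():
            haystack = " ".join([rec.title, *rec.comments, *rec.data_elements]).lower()
            if needle in haystack:
                hits.append(rec)
        return hits

    def follow_links(self, record_id: str) -> list[CatalogRecord]:
        """Hypertext-style navigation: return the records this one links to."""
        rec = self._records[record_id]
        return [self._records[i] for i in rec.links if i in self._records]


# Hypothetical records, invented for this example only.
catalog = Catalog()
catalog.add(CatalogRecord("ds-1", "Example Election Survey",
                          data_elements=["party_id", "turnout"]))
catalog.add(CatalogRecord("pub-1", "Turnout and Partisanship", links=["ds-1"]))
```

Here a search for "turnout" finds both the dataset (through its data element list) and the publication (through its title), and following the links of the publication leads back to the dataset it analyzes.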

The data store servers will be tailored to the nature of the digital objects they contain. Different access protocols, different retrieval functions, different security/billing facilities will be needed depending on the materials stored. While the SELF project will initially provide only statistical research datasets, the architecture allows for a variety of data store servers, and we assume that during the project other types of digital materials relevant to users of the SELF library (such as electronic text, video, audio, etc.) will become available and will be incorporated into the system.
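One way to picture data store servers tailored to their contents is a shared interface with store-specific retrieval options. The sketch below is an assumption for illustration, not the project's design: the class names, the `retrieve` method, and the option names (`variables`, `start`, `length`) are all hypothetical, and security and billing are left out.

```python
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Common interface; the method name and options are assumptions."""

    @abstractmethod
    def retrieve(self, object_id: str, **options):
        """Return the object, honoring store-specific retrieval options."""


class DatasetStore(DataStore):
    """Holds tabular datasets and can return a subset of variables."""

    def __init__(self) -> None:
        self._tables: dict[str, list[dict]] = {}

    def put(self, object_id: str, rows: list[dict]) -> None:
        self._tables[object_id] = rows

    def retrieve(self, object_id: str, **options):
        rows = self._tables[object_id]
        variables = options.get("variables")
        if variables is None:
            return [dict(row) for row in rows]
        return [{name: row[name] for name in variables} for row in rows]


class TextStore(DataStore):
    """Holds electronic text and can return a character range."""

    def __init__(self) -> None:
        self._texts: dict[str, str] = {}

    def put(self, object_id: str, text: str) -> None:
        self._texts[object_id] = text

    def retrieve(self, object_id: str, **options):
        text = self._texts[object_id]
        start = options.get("start", 0)
        length = options.get("length")
        end = None if length is None else start + length
        return text[start:end]


def fetch(stores: dict[str, DataStore], kind: str, object_id: str, **options):
    """Dispatch a request to the store registered for this kind of material."""
    return stores[kind].retrieve(object_id, **options)
```

A new kind of material (video, audio) would be added by registering another `DataStore` subclass under a new key, leaving the client-facing `fetch` call unchanged.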

The contribution acceptance server will automate the process of incorporating user contributions into the SELF library. Contributions will include additions to existing catalog records, links between catalog records, and new materials for the data store along with appropriate metadata. One of the major issues to be resolved about the contribution process is the degree to which validation of contributions can be automated, and the degree to which a manual ``editorial'' process is needed.
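The split between automated validation and manual editorial review is an open question in the text, so the following is only one possible policy, sketched under stated assumptions: comments and links are accepted automatically when they refer to known catalog records, new materials with a title are routed to a human editor, and everything else is rejected. All names (`Kind`, `Decision`, `Contribution`, `triage`) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    COMMENT = "comment"    # addition to an existing catalog record
    LINK = "link"          # link between two catalog records
    NEW_MATERIAL = "new"   # new material for the data store, with metadata


class Decision(Enum):
    ACCEPT = "accept"
    EDITORIAL_REVIEW = "editorial_review"
    REJECT = "reject"


@dataclass
class Contribution:
    kind: Kind
    contributor: str
    targets: list[str] = field(default_factory=list)        # catalog record ids
    metadata: dict[str, str] = field(default_factory=dict)  # for new materials


def triage(c: Contribution, known_records: set[str], authenticated: set[str]) -> Decision:
    """Assumed policy: automate structural checks, defer new materials to editors."""
    if c.contributor not in authenticated:
        return Decision.REJECT
    if c.kind is Kind.COMMENT:
        ok = len(c.targets) == 1 and c.targets[0] in known_records
        return Decision.ACCEPT if ok else Decision.REJECT
    if c.kind is Kind.LINK:
        ok = len(c.targets) == 2 and all(t in known_records for t in c.targets)
        return Decision.ACCEPT if ok else Decision.REJECT
    # New material: require minimal metadata, then hand off to a human editor.
    if not c.metadata.get("title"):
        return Decision.REJECT
    return Decision.EDITORIAL_REVIEW
```

Moving a contribution type from `EDITORIAL_REVIEW` to `ACCEPT` (or the reverse) is a one-line change, which makes a policy of this shape convenient for experimenting with how much validation can safely be automated.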

4. Major Research Issues to Be Addressed

The proposed research project will subject the fundamental features of this self-enriching library vision to intensive testing. These features include:

5. The Testbed and Its Three-Layer Model

The proposed SELF project will develop a testbed to support the research work. Figure 2 depicts a three-layer model of the testbed:

The primary users of the testbed under this proposal will be social scientists, using the ICPSR data archives, the Harvard University library catalog, and the infrastructure support shown in Figure 2. Social scientists include political scientists, economists, anthropologists, sociologists, geographers, and scholars from many other areas using data from the ICPSR. The SELF library catalog will draw materials from many sources at Harvard (materials relevant to social science research in all formats, not just datasets, will be included in the catalog), and, given the depth of Harvard's collections, will be a significant resource for social scientists well beyond Harvard. Because the catalog will be accessible over the Internet, our beta test users will come from all over the U.S. These social scientists will participate in and evaluate, in depth, the tools and systems to be researched and developed under this project. The shaded boxes in the figure indicate these core activities of the project.

Results of the proposed research under this project will be most useful if they are applicable to other fields and different types of users. We shall ensure that the system is extensible at each of the three layers. At the infrastructure layer, we will allow (in addition to the Internet) cable networks, based on the Ethernet and TCP/IP protocols, to provide access to households and business offices at speeds ranging from 0.5 to 10 Megabits/sec. At the data layer, we will ensure that the SELF system can incorporate other datasets beyond ICPSR and the Harvard library catalog. We also hope to provide facilities for including datasets based on audio as well as static and full-motion video. At the usage layer, via cable access, we will experiment with introducing the SELF system to people outside of the research community. These users will most likely be consumers rather than producers of contributions submitted to the data archive, and will in general use communication-limited sublanguages such as remote TV control protocols. The unshaded boxes in Figure 2 indicate all these extended activities to ensure the testbed's architectural generality.
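To give a sense of what the quoted cable access speeds mean for dataset delivery, the short calculation below estimates ideal transfer times at the two ends of the 0.5 to 10 Megabits/sec range. The 20-megabyte dataset size is a hypothetical figure chosen for illustration, megabytes and megabits are taken as decimal units, and protocol overhead and network contention are ignored.

```python
def transfer_seconds(size_megabytes: float, rate_megabits_per_sec: float) -> float:
    """Ideal transfer time: 8 bits per byte, no overhead or contention."""
    if rate_megabits_per_sec <= 0:
        raise ValueError("rate must be positive")
    return size_megabytes * 8 / rate_megabits_per_sec


# Hypothetical 20 MB dataset at the slow and fast ends of the stated range.
DATASET_MB = 20
slow = transfer_seconds(DATASET_MB, 0.5)   # 320.0 seconds
fast = transfer_seconds(DATASET_MB, 10)    # 16.0 seconds
```

Under these assumptions the same dataset takes a little over five minutes at the slow end and 16 seconds at the fast end, a factor of 20 that bears on which retrieval functions are practical for household users.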

6. Testbed Development Plan

The testbed will be developed according to the following schedule:

7. Concluding Remarks

The SELF project is a cooperative effort involving Harvard researchers and a number of institutions having complementary expertise in areas relevant to digital libraries, including user interface design (Lotus Notes, NCSA), metadata and data servers (OCLC, CIESIN), digital information collections (CIESIN, ICPSR), and networking infrastructure (Continental Cablevision). The project proposes the development of a library system with an in-depth catalog of social science materials linked to a digital collection initially composed of social science datasets from the ICPSR. This library would be of relevance to a wide range of researchers at Harvard and elsewhere, and could expect to receive a significant level of use from students and scholars in many disciplines.

The project is based on the proposition that, unlike traditional libraries, digital libraries can accept information from, as well as dispense information to, users, and that such libraries can be improved and enriched through use. In particular, as applied to social science datasets, such a library can reduce effort for users by capturing the results of earlier work, can help address the difficulties of research replicability, and can provide enhanced catalog services through expert user input. The enhanced catalog features would be applicable to libraries of any sort.

Technically, the project proposes an architecture that is extensible to digital collections of any sort, and in fact one project participant (CIESIN) will apply the same architecture to a collection of significantly different materials, ensuring that our developments will be generalizable. The design combines two modes of information retrieval for catalog or metadata: word or phrase searching, and hypertext linkage. Primary emphasis is placed on the design of a user's working environment that both simplifies system use and provides for the facile capture of research by-products which can be subsequently shared with other users. Lastly, the project intends to address the problems of security and intellectual property rights control in a multi-institutional environment.

In summary, we have designed the SELF project to be immediately useful to the large number of social and political researchers working in many related academic fields. Our testbed will therefore be a real-world application that will help real-world users. If our application is successful, SELF technology will also fundamentally restructure the practice and improve the quality of scientific research throughout the social sciences and other areas that rely on secondary data analyses. The proposed SELF technology is generic in the sense that it can benefit a variety of mainstream digital library technologies of the future beyond this single application.

(Contacting Author: H. T. Kung; Phone: 617-496-6211; Division of Applied Sciences, Harvard University, 29 Oxford Street, Cambridge, MA 02138)

Alan Chapman, Bell-Northern Research, P.O. Box 3511, Station C, Ottawa, Ontario K1Y 4H7

Figure 2: Three-layer model of the SELF testbed. Shaded boxes indicate the core activities of the project. Unshaded boxes indicate extensions of the testbed to ensure architectural generality. [Box labels recovered from the figure: Intellectual property rights, security, certification; Work environment; User interface; Clients and servers; The Internet; Cable networks; ICPSR datasets; Harvard library catalog; CIESIN datasets; USGS datasets; Image, graphic, video archives; Scholarly research in social & political sciences, contributing to the data archive; Read-only users browsing the data archive (e.g., USGS and CIESIN users, cable users).]