The Harvard Self-Enriching Library Facilities (SELF) Project

Gary King[1], H. T. Kung[2], Barbara Grosz[2], Sidney Verba[3],
Dale Flecker[4], and Brian Kahin[5]

[1] Department of Government, gk@isr.harvard.edu

[2] Division of Applied Sciences, {kung, grosz}@das.harvard.edu

[3] Department of Government and Harvard University Library, sverba@hdc.mhs.harvard.edu

[4] Harvard University Library, dale@harvarda.harvard.edu

[5] John F. Kennedy School of Government, kahin@hulaw1.harvard.edu

Harvard University, Cambridge, Massachusetts 02138

Abstract

This paper proposes the development of a library system with an in-depth catalog of social science materials linked initially to a digital collection of survey and other research datasets. It is proposed that digital libraries can accept information from, as well as dispense information to, users, and that such libraries can be improved and enriched through use. The digital library is composed of the user client software, a metadata or catalog server, a data store server, and a contribution acceptance server. Primary emphasis is given to research and development of the user client software, which must support information retrieval (both traditional and hypertext), tools to capture the extensive knowledge, analysis, and information routinely created by users during the research process, and the ability to contribute such information back to the digital library. The system will be designed to be extensible to various types of digital content beyond survey datasets, and to other types of access networks, including cable television access. In addition to issues of user client design, research areas include the ``bibliographic'' description of datasets, security/authentication in an inter-institutional context, scalable client/server architectures, information certification, and intellectual-property management.

Keywords: User contribution, client software, information retrieval, certification, work environments, user interface

1. Introduction

We have developed a research agenda based on the design and implementation of a ``Self-Enriching Library Facilities'' (SELF) technology. By developing and exploiting the on-line, interactive features of a digital library, SELF systems allow users to contribute to the archive, adding useful, organized information as a natural part of the work they already do. In particular, users will easily be able to (1) enter evaluative comments, (2) add links between related works, and (3) contribute their own data, manipulations, analytical results, and software to the library.

Because the system will capture and incorporate users' knowledge and tools, the quality of the library will be continually enriched and its capabilities will be progressively enhanced. Instead of serving as passive repositories, the SELF technology will convert libraries into information sources that grow as users employ them. Self-enrichment, the capacity to incorporate users' contributions, distinguishes the SELF approach from many other digital library designs. Instead of devoting resources to expensive one-time digitization of printed materials, the SELF project will invest primarily in tools to capture the extensive knowledge, analyses, and information routinely created by users. Tools will be provided to allow users to contribute easily. In addition to improving digital libraries automatically, SELF tools will improve the practice and quality of scientific research: they will enhance tracking of research by an individual researcher (playing a role similar to that of a natural scientist's laboratory notebook) and make it possible to replicate the research of others.

The proposed research project will develop facilities that subject the fundamental features of this self-enriching library vision to intensive empirical evaluation. These features include:

A work environment that facilitates users' contributions to the database
A user interface that supports contribution submission by users
A client/server architecture that is extensible to new hardware platforms and to new types of data, including users' contributions
Mechanisms for certifying information and protecting users' intellectual property by authenticating users' identities and limiting access rights

The primary users of the testbed will be social scientists, using data archives from the Inter-university Consortium for Political and Social Research (ICPSR, the largest collection of social science data in the world), and the Harvard On-Line Library Catalog, HOLLIS, of Harvard University (the largest university library in the world). Social scientists from all over the country are expected to participate in the design and evaluation of the tools and systems to be developed.

An important goal of the project is to create tools and procedures and to gain an understanding of user behavior that extends to all types of digital library use, not only to research by scholars in the social sciences. The system will be extensible to multiple network configurations including the Internet and cable TV, to other data archives including image and video servers, and to other types of users including grade-school pupils and casual browsers.

2. Project Rationale

The profound change in media that the transition from paper to digital libraries represents will undoubtedly be accompanied by equally profound changes in the intellectual organization and day-to-day use of our libraries. During discussions of possible pilot digital library applications at Harvard, we decided to focus on one important aspect of this change. Using traditional libraries is a one-way process whereby users retrieve materials from the library. Indeed, usage actually degrades (through normal wear-and-tear) rather than enhances paper-based collections. Through SELF technology, digital libraries offer the possibility of blurring the distinction between users and librarians, since both will be able to contribute to the collection.

The collection we propose to use in our initial digital library implementation consists of the datasets distributed by ICPSR (the Inter-university Consortium for Political and Social Research), a membership collaborative which collects and distributes social science data of numerous types. Harvard makes very heavy use of ICPSR data, with researchers spread across many schools and academic departments. Several aspects of the present use of ICPSR datasets make the SELF model attractive:

significant effort is frequently required to make use of these datasets, effort in part duplicated by each subsequent user;
researchers frequently spend considerable effort creating derivative datasets, new derived variables, etc., while using the datasets, efforts again frequently duplicated by other users; and
it has been shown that it is frequently difficult to replicate research results reported in the literature, in part because the complex manipulations used to derive results are hard to document and recreate.

Discussions with empirical data researchers revealed a great willingness to make available through automated contribution to the digital library various products of their research, especially if some of the burdens of making this information available were resolved by the SELF project. Specifically, the SELF library model will enable researchers to make at least three types of contributions:

Derivative datasets and software. This addresses the issues of duplicate effort, and provides ways to document the research process which yielded a given research result.
Evaluative commentary, providing commentary by experts about given objects in the digital library (such as cells, variables, observations, datasets, or software).
Linkages between objects in the digital library, allowing ``webs'' or networks of links between various objects in a collection.

The latter two types of contribution are applicable to library materials beyond datasets, and address two shortcomings of traditional library catalogs:

Current library catalogs provide users with little information to help them decide which of the items in a collection retrieved in a search is most likely to be of use. This seeming neutrality of catalogs, where each holding is implicitly attributed equal importance or potential usefulness, has many justifications but is beginning to be questioned in the library field. After all, we have no reason but tradition to prefer this equal weighting scheme to the infinite variety of other possibilities. For example, some institutions are experimenting with including book reviews in their catalogs. Weighting schemes may be particularly important for very large libraries, such as Harvard, where the probability of any volume being used may be exceedingly low (about half of all books in the largest Harvard library have never been checked out).
Budgets constrain how much effort libraries can invest to enhance the retrievability of catalog data. Permitting users to build ``links'' between related materials in a catalog allows the use of popular ``hyperlinking'' techniques in catalogs without requiring additional library resources to add such data during the cataloging process. By ``deputizing'' users as expert librarians, we can significantly increase the level of effort invested in catalog building and enhancement without corresponding increases in library staff and budget.
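As a purely illustrative sketch of one alternative to equal weighting, search hits could be ordered by the evaluations users have contributed. The numeric ratings, the function name `rank_by_evaluation`, and the neutral `default` score for unrated items are all assumptions made for this example; the proposal does not specify any particular weighting scheme.

```python
from statistics import mean


def rank_by_evaluation(hits, ratings, default=0.0):
    """Order catalog hits by mean user rating, highest first.

    hits    -- list of record ids returned by a catalog search
    ratings -- dict mapping record id to a list of numeric user ratings
    default -- score assumed for records nobody has rated yet
    """
    def score(record_id):
        values = ratings.get(record_id)
        return mean(values) if values else default

    # sorted() is stable, so equally scored records keep their search order.
    return sorted(hits, key=score, reverse=True)


# Hypothetical search result and contributed ratings.
hits = ["book-a", "book-b", "book-c"]
ratings = {"book-a": [2, 3], "book-c": [5, 4, 5]}
ranked = rank_by_evaluation(hits, ratings, default=3.0)
```

Giving unrated items a neutral default keeps never-evaluated holdings from sinking to the bottom of every result list, which matters in a collection where most volumes have seen little use.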

3. Architectural Overview of the Proposed SELF Library

The SELF library is based on the client/server system architecture, and has four basic components: the user client software, a metadata or catalog server, a data store server, and a contribution acceptance server. Figure 1 outlines the architecture.

The most difficult and sophisticated component is the client. Client functions include:

Support for traditional word and phrase catalog searching, as implemented in the Z39.50 information retrieval protocol. In the SELF model collection use begins with a catalog search.
Support for following hypertext links between data records in the metadata store, as implemented today in such systems as the WorldWideWeb. Links are used in the SELF system after a relevant catalog record is found for retrieving evaluative commentary about an object, metadata or catalog data about other related objects in the collection (including contributed derivative datasets and software), and extended information about datasets (description of a survey, data element lists, etc.).
Facilities which make it easy during the research process for users to accumulate information about related items in the collection and evaluations of objects in the collection, and to contribute this information to the SELF library at an appropriate time.
Facilities for retrieving digital objects from the collection.

We believe that the user interface is one of the most critical elements in the design of the digital library, and research into the effective design of the client software is a major element in the SELF project.

The metadata server contains traditional library catalog data for objects in the digital library, enhanced with commentary, links to related data and publications, and extended information about datasets, such as data element lists. The server will support two retrieval protocols: Z39.50 search/retrieval, and HTTP for following links from the catalog records to other relevant data.
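The two retrieval modes described above can be sketched with a small in-memory model. This is a minimal illustration only: the class names `CatalogRecord` and `MetadataStore`, the field names, and the sample records are hypothetical, and plain substring matching stands in for Z39.50 searching while a dictionary lookup stands in for following a link over HTTP.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """One metadata record; all field names are illustrative assumptions."""
    record_id: str
    title: str
    description: str = ""
    commentary: list[str] = field(default_factory=list)  # evaluative comments
    links: list[str] = field(default_factory=list)       # ids of related records


class MetadataStore:
    """In-memory stand-in for the metadata server."""

    def __init__(self) -> None:
        self._records: dict[str, CatalogRecord] = {}

    def add(self, record: CatalogRecord) -> None:
        self._records[record.record_id] = record

    def search(self, phrase: str) -> list[CatalogRecord]:
        """Word/phrase search over title and description (case-insensitive)."""
        needle = phrase.lower()
        return [
            r for r in self._records.values()
            if needle in r.title.lower() or needle in r.description.lower()
        ]

    def follow_links(self, record_id: str) -> list[CatalogRecord]:
        """Return the records a given record links to, skipping dangling ids."""
        source = self._records[record_id]
        return [self._records[i] for i in source.links if i in self._records]


store = MetadataStore()
store.add(CatalogRecord("ds-1", "National election survey", "Survey dataset",
                        links=["ds-2"]))
store.add(CatalogRecord("ds-2", "Derived turnout variables",
                        "Contributed derivative dataset"))

# Use begins with a catalog search, then continues by following links.
hits = store.search("election survey")
related = store.follow_links(hits[0].record_id)
```

Keeping commentary and links on the record itself mirrors the idea that user contributions enrich the catalog entry rather than live in a separate system.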

The data store servers will be tailored to the nature of the digital objects they contain. Different access protocols, different retrieval functions, different security/billing facilities will be needed depending on the materials stored. While the SELF project will initially provide only statistical research datasets, the architecture allows for a variety of data store servers, and we assume that during the project other types of digital materials relevant to users of the SELF library (such as electronic text, video, audio, etc.) will become available and will be incorporated into the system.

The contribution acceptance server will automate the process of incorporating user contributions into the SELF library. Contributions will include additions to existing catalog records, links between catalog records, and new materials for the data store along with appropriate metadata. One of the major issues to be resolved about the contribution process is the degree to which validation of contribution can be automated, and the degree to which a manual ``editorial'' process is needed.
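One way to picture the split between automated validation and a manual editorial process is a triage step that accepts, rejects, or defers each contribution. The sketch below is an assumption-laden illustration, not the project's design: the names `Kind`, `Contribution`, and `triage`, and the specific rules (links are checked mechanically, free-form content goes to an editor) are hypothetical choices made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    COMMENTARY = "commentary"  # addition to an existing catalog record
    LINK = "link"              # link between two catalog records
    MATERIAL = "material"      # new data-store object plus its metadata


@dataclass
class Contribution:
    kind: Kind
    contributor: str
    target_ids: tuple[str, ...]  # catalog records the contribution refers to
    body: str = ""               # comment text or metadata description


def triage(c: Contribution, known_ids: set[str]) -> str:
    """Return 'reject', 'accept', or 'editorial' for one contribution."""
    # Automated check: every referenced record must already exist.
    if not c.target_ids or any(t not in known_ids for t in c.target_ids):
        return "reject"
    if c.kind is Kind.LINK:
        # A link needs exactly two distinct endpoints; nothing else to review.
        return "accept" if len(set(c.target_ids)) == 2 else "reject"
    if not c.body.strip():
        return "reject"
    # Commentary and new materials carry free-form content, so this sketch
    # routes them to a human editor instead of accepting them automatically.
    return "editorial"


known = {"ds-1", "ds-2"}
decision = triage(Contribution(Kind.LINK, "user-1", ("ds-1", "ds-2")), known)
```

Where the line between "accept" and "editorial" should fall is exactly the open research question; the rules here only show how such a policy could be made explicit and testable.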

4. Major Research Issues to Be Addressed

The proposed research project will subject the fundamental features of this self-enriching library vision to intensive testing. These features include:

Interactive user interface. Two complementary forms of information-finding tools will be integrated into a single coherent interface. Users will be able to use the precision of traditional keyword and phrase searching, and after locating one relevant item in the catalog pursue related materials through hypertext linking. Hypertext techniques will also be used to provide access to enriched catalog data, such as evaluative commentary and descriptions of data set formatting and content. In addition, the same interface will provide facilities for the user to accumulate information throughout the research process. This information can be easily contributed subsequently, enriching both the catalog and the collection. Finally, the interface will support communication in limited sublanguages such as TV remote-control protocol for cable.
The ``bibliography'' of datasets. Libraries have largely standardized on an international basis the rules and data formats for describing printed materials. Similar consensus does not yet exist for describing datasets, and different conventions are growing up in different communities. One aspect of the SELF project is to analyze these various conventions for use in digital library systems.
Context-rich work environment. Context-tracking software will be used to support and encourage users to make contributions to the database. Authentication, security, and access-control mechanisms that understand context will be devised.
Security/authentication. The use of the ICPSR data, combined with the desire to make the SELF library accessible to users beyond Harvard, raises interesting security issues. ICPSR data is freely available to researchers in any ICPSR member institution. Providing access to non-Harvard users therefore requires a system to validate institutional affiliations over the network. Our project incorporates research into the construction of an inter-institutional authentication system.
Scalable client/server architecture. The architecture will be extensible in the future to servers that incorporate various kinds of users' contributions and to new types of published data, and it will leverage existing searching capabilities. The additions may be in various data types, including text, numeric tables, graphics, and video.
Information certification. Research will address how users' contributions can be properly evaluated, filtered, classified, or edited to enhance their value to other users.
Intellectual-property management. While the ICPSR provides one model for ownership/access for digital materials, any robust digital library system must be able to encompass materials under a range of different intellectual property-rights protocols. The present lack of guidance in intellectual property management has held back the ICPSR (and many other archives) in pursuing advanced services based on digital library concepts. We will work with ICPSR to solve these problems in their case, while attempting to provide more generally applicable solutions. In particular, the new legal challenge of incorporating users' contributions in a digital library will be studied. The proposed research is expected to result in frameworks for handling financing and intellectual-property management issues.
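To make the security/authentication item above concrete, the following sketch shows one conceivable way to validate institutional affiliation over a network: the user's home institution signs an affiliation assertion, and the library verifies it before granting access. Everything here is an assumption for illustration, including the shared-secret HMAC scheme, the `MEMBER_KEYS` table, and the function names; the proposal leaves the actual mechanism as a research question.

```python
import hashlib
import hmac

# Hypothetical shared secrets between the library and each member institution.
MEMBER_KEYS = {"harvard.edu": b"demo-key-1", "example.edu": b"demo-key-2"}


def sign_affiliation(user: str, institution: str, key: bytes) -> str:
    """Run by the home institution: vouch that `user` belongs to it."""
    message = f"{user}@{institution}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def may_access(user: str, institution: str, signature: str) -> bool:
    """Run by the library: accept only valid assertions from member institutions."""
    key = MEMBER_KEYS.get(institution)
    if key is None:
        return False  # not a member institution
    expected = sign_affiliation(user, institution, key)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


# The institution issues an assertion; the library later checks it.
assertion = sign_affiliation("alice", "example.edu", MEMBER_KEYS["example.edu"])
allowed = may_access("alice", "example.edu", assertion)
```

A real inter-institutional system would also need expiry times, replay protection, and key management, none of which this sketch attempts.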

5. The Testbed and Its Three-Layer Model

The proposed SELF project will develop a testbed to support the research work. Figure 2 depicts a three-layer model of the testbed:

Usage Layer: A set of applications for the testbed. Each application is associated with a specific set of users and tools that the testbed will support.
Data Layer: A collection of datasets that the testbed possesses.
Infrastructure Layer: Includes all the hardware and software facilities that are applicable to a large number of data stores and applications. These facilities include networking configurations, clients, servers, user interface, work environments, as well as mechanisms for assuring security, authentication and information certification, the protection of intellectual property rights, etc.

The primary users of the testbed under this proposal will be social scientists, using the ICPSR data archives, the Harvard University library catalog, and the infrastructure support shown in Figure 2. Social scientists include political scientists, economists, anthropologists, sociologists, geographers, and scholars from many other areas using data from the ICPSR. The SELF library catalog will draw materials from many sources at Harvard (materials relevant to social science research in all formats, not just datasets, will be included in the catalog), and, given the depth of Harvard's collections, will be a significant resource for social scientists well beyond Harvard. Because the catalog will be accessible over the Internet, our beta test users will come from all over the U.S. These social scientists will participate in and evaluate, in depth, the tools and systems to be researched and developed under this project. The shaded boxes in the figure indicate these core activities of the project.

Results of the proposed research under this project will be most useful if they are applicable to other fields and different types of users. We shall ensure that the system is extensible at each of the three layers. At the infrastructure layer, we will allow (in addition to the Internet) cable networks, based on the Ethernet and TCP/IP protocols, to provide access to households and business offices at speeds ranging from 0.5 to 10 Megabits/sec. At the data layer, we will ensure that the SELF system can incorporate other datasets beyond ICPSR and the Harvard library catalog. We also hope to provide facilities for including datasets based on audio as well as static and full-motion video. At the usage layer, via cable access, we will experiment with introducing the SELF system to people outside of the research community. These users will most likely be consumers rather than producers of contributions submitted to the data archive, and will in general use communication-limited sublanguages such as TV remote-control protocols. The unshaded boxes in Figure 2 indicate all these extended activities to ensure the testbed's architectural generality.

6. Testbed Development Plan

The testbed will be developed according to the following schedule:

Year 1: (1) Metadata server with URL (Uniform Resource Locator) linkage capability, (2) ICPSR data server, and (3) Microsoft Windows-based clients using both MOSAIC and Notes.
Year 2: (1) Opening of the testbed to selected researchers around the world via the Internet, and high-speed access (i.e., Ethernet speed) to local users in the Greater Boston area through cable services, and (2) implementation of users' contributions.
Year 3: (1) Integration of the client with advanced work environment and user interface facilities to facilitate users' contributions, (2) second-generation servers (e.g., servers for documents such as magazine articles, video, and copyrighted materials), and (3) servers implementing results of research on security and intellectual property.
Year 4: Evaluation of system and user experience to identify effectiveness, limitations, areas for improvement, and possible extensions for the SELF technology.

7. Concluding Remarks

The SELF project is a cooperative effort involving Harvard researchers with a number of institutions having complementary expertise in areas relevant to digital libraries, including user interface design (Lotus Notes, NCSA), metadata and data servers (OCLC, CIESIN), digital information collections (CIESIN, ICPSR), and networking infrastructure (Continental Cablevision). The project proposes the development of a library system with an in-depth catalog of social science materials linked to a digital collection initially composed of social science datasets from the ICPSR. This library would be of relevance to a wide range of researchers at Harvard and elsewhere, and could expect to receive a significant level of use from students and scholars in many disciplines.

The project is based on the proposition that, unlike traditional libraries, digital libraries can accept information from, as well as dispense information to, users, and that such libraries can be improved and enriched through use. In particular, as applied to social science datasets, such a library can reduce effort for users by capturing the results of earlier work, can help address the difficulties of research replicability, and can provide enhanced catalog services through expert user input. The enhanced catalog features would be applicable to libraries of any sort.

Technically, the project proposes an architecture that is extensible to digital collections of any sort, and in fact one project participant (CIESIN) will apply the same architecture to a collection of significantly different materials, ensuring that our developments will be generalizable. The design combines two modes of information retrieval for catalog or metadata: word or phrase searching, and hypertext linkage. Primary emphasis is placed on the design of a user's working environment that both simplifies system use and provides for the facile capture of research by-products which can be subsequently shared with other users. Lastly, the project intends to address the problems of security and intellectual property rights control in a multi-institutional environment.

In summary, we have designed the SELF project to be immediately useful to the large number of researchers across the many related fields of the social sciences. Our testbed will therefore be a real-world application that will help real-world users. If our application is successful, SELF technology will also fundamentally restructure the practice and improve the quality of scientific research throughout the social sciences and other areas that rely on secondary data analyses. The proposed SELF technology is generic in the sense that it can benefit a variety of mainstream digital library technologies of the future beyond this single application.

Figure 2: Three-layer model of the SELF testbed. Shaded boxes indicate the core activities of the project. Unshaded boxes indicate extensions of the testbed to ensure architectural generality.

Usage Layer: Scholarly research in social & political sciences, contributing to the data archive; read-only users browsing the data archive (e.g., USGS and CIESIN users, cable users)
Data Layer: ICPSR datasets; Harvard library catalog; CIESIN datasets; USGS datasets; image, graphic, video archives
Infrastructure Layer: Intellectual property rights, security, certification; work environment; user interface; clients and servers; the Internet; cable networks

Contacting Author: H. T. Kung; Phone: 617-496-6211; Email: Kung@das.harvard.edu; Division of Applied Sciences, Harvard University, 29 Oxford Street, Cambridge, MA 02138

Alan Chapman, Bell-Northern Research, P.O. Box 3511, Station C, Ottawa, Ontario K1Y 4H7