Four lessons learned from managing World Wide Web digital libraries

Robert Pettengill
Schlumberger Austin Research P. O. Box 200015
Austin, Texas 78720-0015
Tel: 1-512-331-3728
E-mail: rcp@austin.sar.slb.com
Guillermo Arango
Schlumberger Austin Research P. O. Box 200015
Austin, Texas 78720-0015
Tel: 1-512-331-3735
E-mail: arango@austin.sar.slb.com

ABSTRACT

Corporations can become aggressive users of digital libraries and electronic distribution because they own the copyright on large collections of documents. There are many common issues in operating and maintaining digital document libraries and software libraries. Many of our software engineering tools and methods are very useful in the digital library domain. In this paper, we describe the application of some software engineering tools to common problems in the development and evolution of libraries of project documentation based on the World Wide Web. We discuss four lessons learned: 1) keep separate maintenance and publication areas, 2) document your process for repeatability, 3) automate your process for reliability and efficiency, 4) use explicit version control mechanisms. We outline methods and describe supporting tools used in the Unix environment. We also introduce a simple approach to implementing versioned hypertext libraries on the World Wide Web.

KEYWORDS: digital library, change management, version control, library methods, maintenance tools, hypertext

INTRODUCTION

Schlumberger engineering projects and sites are distributed worldwide. Network accessible libraries (e.g., networked information servers) can greatly reduce the cost of distributing and accessing project documents. This is one of many options in our evolving process for electronic document management. We use electronic mail, databases and network-based information servers for storing and accessing documentation. We have found that the World Wide Web (WWW) [4] is an effective and flexible platform for implementing digital libraries of project information. After a year and a half of operation, the WWW document server at our site handles over 60,000 requests a month.

Documentation projects must deal with the entire life-cycle of a document, including authoring, editing, publication, distribution, storage and access, and evolution. Our process emphasizes common, sustainable solutions that do not rely on the latest hot product or fad, and compatibility with both printed and electronic media. WWW addresses only some aspects of the documentation life-cycle. From the point of view of process and tools, much more is needed to maintain a high-quality, networked, digital library.

QUALITY ISSUES IN ON-LINE LIBRARY MANAGEMENT

Users of public document archives accessible on the Internet see many different problems. For example:

Difficulty in finding information due to poor organization and/or lack of search tools.
Lack of consistency in the presentation of similar information.
Information that is out of date.
Obvious errors in grammar and spelling.
Bad cross references or links to other documents.
Over-organization: too many links to empty or useless information.
Frequent reorganization that keeps users guessing where to find previously located references.
Documents not available in formats suitable for both on-line reference and printing.

We have faced similar difficulties when managing large software libraries. There are simple lessons from software engineering that we can apply to networked digital libraries.

LESSONS LEARNED FROM SOFTWARE ENGINEERING

Figure 1 outlines a simplified process for maintaining a software library. An analogous process for maintaining documents in a digital library is shown in Figure 2. It was natural for us to take advantage of our experience with software methods and tools to manage document libraries. The methods that we use to manage defects in software libraries can also be used to manage the quality issues listed above in digital document libraries.

Figure 1. Simplified development and maintenance process for a software library.

Figure 2. Simplified development and maintenance process for a digital document library.

We find the most valuable recommendations to be:

Keep separate maintenance and publication areas
Document your process for repeatability
Automate your process for reliability and efficiency
Use explicit version control mechanisms

Keep separate maintenance and publication areas

One of the simplest and most valuable lessons is the value of using a separate document maintenance area. If the documents are edited directly in the publication area, readers are exposed to every editing error and inconsistency of the document collection. The consequences of this may be small in personally published documents. However, the errors and inconsistencies that this exposes in corporate and project documents are unacceptable.

Document your process for repeatability

Maintaining a high-quality digital library of project documents requires that many individual tasks be completed successfully (Figure 2). Documenting the process is the first step in ensuring repeatable quality. We have found that the following questions must be addressed and their answers tracked as part of the digital library development and maintenance process:

What is the original source for a document?
What are the formats used in the original source?
In what formats will the document be available on the server?
How are the documents transformed from their source formats to their server formats?
Where should the document be located on the server?
What indexes and search engines should include this document?
What documents should be updated to contain links to this document?
How long is the information in the document expected to remain useful?
What events or schedules will trigger updates of documents on the server?
Who will be responsible for the maintenance of the documents on the server?
What quality review process is needed before a new or updated document can be released to the server?
What property rights or security review is necessary to determine applicable access restrictions?

Our experience as providers and consumers of digital documents has shown that quality of service suffers if answers to these questions are not available, on demand, to maintainers of a digital library.
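One way to keep these answers available on demand is to store them as a small structured record next to each document. The following is a minimal sketch under that assumption, not a description of our tools; the class name, the field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DocumentRecord:
    """Answers to the process questions for one document (hypothetical schema)."""
    source_path: str                  # where the original source lives
    source_format: str                # format of the original source
    server_formats: List[str]         # formats offered on the server
    transform: str                    # how source formats become server formats
    server_location: str              # where the document lives on the server
    indexes: List[str] = field(default_factory=list)      # indexes and search engines
    linked_from: List[str] = field(default_factory=list)  # documents that should link here
    useful_until: Optional[date] = None  # expected useful life of the information
    update_trigger: str = ""          # events or schedules that trigger updates
    maintainer: str = ""              # who maintains the document on the server
    quality_review: str = ""          # review needed before release
    access_review: str = ""           # property rights or security review

    def unanswered(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        return [name for name, value in vars(self).items()
                if value in ("", None, [])]


if __name__ == "__main__":
    record = DocumentRecord(
        source_path="reports/eval.doc",
        source_format="word processor file",
        server_formats=["html", "postscript"],
        transform="make install",
        server_location="/projects/eval/",
    )
    print("Still to be answered:", ", ".join(record.unanswered()))
```

A release step could refuse to install a document while unanswered() returns anything, which turns the checklist into an enforced part of the process.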

Automate your process for reliability and efficiency

Automating the document maintenance process minimizes the chances for error and reduces the cost of repetitive operations. Checklists, templates, and scripted procedures are all means to this end. One of the most useful software engineering tools for this purpose is the computer program called make [8].

Make lets us declare process steps and the dependencies between them explicitly. It also automates the application of only those steps needed to achieve a desired result. The commands necessary to achieve canonical results are standardized (e.g., `make test' to run appropriate spelling or reference checkers, `make install' to install a tested document in the server). The dependency information and procedural recipes for the make targets (e.g., test, install) are recorded in a file called the makefile, located in the working document directory. Once a process has been described to make, it can easily be executed by authors without special technical training.
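The idea of applying only the steps that are needed can be illustrated with a toy rebuild rule: a target is rebuilt when it is missing or older than one of its prerequisites. The sketch below imitates that rule in a few lines; it is an illustration of the principle, not a replacement for make, and the names Rules, is_stale, and build are our own inventions for this example.

```python
import os
from typing import Callable, Dict, List, Tuple

# target -> (prerequisites, recipe): a toy stand-in for makefile rules
Rules = Dict[str, Tuple[List[str], Callable[[], None]]]


def is_stale(target: str, prerequisites: List[str]) -> bool:
    """A target is stale if it is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)


def build(target: str, rules: Rules) -> List[str]:
    """Bring target up to date; return the targets whose recipes were run."""
    rebuilt: List[str] = []
    visited = set()

    def visit(name: str) -> None:
        if name in visited:
            return
        visited.add(name)
        if name not in rules:
            # A plain source file: it must already exist.
            if not os.path.exists(name):
                raise FileNotFoundError(f"no rule to make {name!r}")
            return
        prerequisites, recipe = rules[name]
        for prerequisite in prerequisites:
            visit(prerequisite)          # bring prerequisites up to date first
        if is_stale(name, prerequisites):
            recipe()                     # the recipe is expected to create the target
            rebuilt.append(name)

    visit(target)
    return rebuilt
```

Running build twice in a row on an unchanged working directory runs no recipes the second time, which is the property that makes repeated `make test' and `make install' cheap.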

Make best supports processes which can be executed by a single individual and computer. When the process is distributed over many individuals and machines, as in the case where many authors and editors have to approve a change, more sophisticated system building or process enactment tools can be used.

Not all authors and publishers are familiar with software engineering tools such as make. As a result we have built tools which simplify the process further. For example, an author who wishes to add a collection of related documents to our on-line project library follows this process:

Identify the location (in this case, the URL) where the material should be placed in the library.
If it does not already exist, create a working directory that collects the files to be published. This directory can be used for future document maintenance.
In the working directory, run publish-edm with the library location as the target. Publish-edm is a script that creates a makefile of instructions for make and any required table-of-contents files (e.g., Gopher links or WWW index.html files). Publish-edm categorizes the files by type in the makefile. Appropriate transformations and tests based on the document type are automatically included in the makefile. Publish-edm also creates a `Read Me' file that documents the original publisher and the location of the working files.
Next publish-edm invokes an editor and a WWW browser on the table-of-contents file so that the user can inspect the working local web and customize the automatically generated files. The table-of-contents file is generated with all of the elements required to conform to our process' content guidelines. The author may then test the results via a combination of visual inspection in the browser and automatic tests (e.g., spelling check) run by `make test'.
Install the verified files in the server by `make install'.
Notify by e-mail any other author/publishers who may wish to view or reference the new documents in the server.
Subsequent document changes are installed using the sequence: `make test', visual inspection with the WWW browser, and `make install'.

Custom scripts like the ones just described can be developed by any competent Unix programmer.
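As an indication of how little is involved, the sketch below covers two of the tasks described above: grouping the files of a working directory by type and generating a table-of-contents page. It is not publish-edm itself; the function names are hypothetical, and guessing the type from the file name with Python's mimetypes module is an assumption made for this example.

```python
import html
import mimetypes
import os
from collections import defaultdict
from typing import Dict, List
from urllib.parse import quote


def categorize(directory: str) -> Dict[str, List[str]]:
    """Group the regular files in a working directory by guessed MIME type."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        if os.path.isfile(os.path.join(directory, name)):
            mime_type, _ = mimetypes.guess_type(name)
            groups[mime_type or "application/octet-stream"].append(name)
    return dict(groups)


def table_of_contents(title: str, names: List[str]) -> str:
    """Return a minimal HTML index page linking to each named file."""
    items = "\n".join(
        f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>'
        for name in names
    )
    heading = html.escape(title)
    return (
        f"<html><head><title>{heading}</title></head>\n"
        f"<body><h1>{heading}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n"
    )


if __name__ == "__main__":
    # Print an index for the current directory, one group of files at a time.
    for mime_type, names in categorize(".").items():
        print(f"# {mime_type}: {len(names)} file(s)")
    all_names = [n for names in categorize(".").values() for n in names]
    print(table_of_contents("Working directory", all_names))
```

A real script would also write the makefile and the `Read Me' file; those parts depend on local conventions and are left out here.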

Use explicit version control mechanisms

It is often necessary for several people to maintain a collection of documents concurrently. In this case, it is necessary to coordinate access to the working files or to reconcile independently made changes. Software engineers call such coordination tools Version Control Systems (VCS) [1]. They allow any past version of the controlled files to be retrieved, and the source of every change to the controlled text to be identified and compared. We have successfully used different systems for this purpose, including the widely available RCS [12]. The use of VCS to control documents published in their current configuration on the WWW is becoming more common [11]. Note that some types of documents are by their nature immutable (e.g., an electronic mail message once it has been sent), and version control is unnecessary in these cases.

THE NEED FOR VERSION-CONTROLLED HYPERTEXT

Digital libraries on the World Wide Web provide access only to their current organization and contents. This is a significant limitation. Sometimes it is more important to have access to yesterday's paper than to the latest edition.

We must be able to manage the history of documents over time, not just offer to our readers the latest versions. For example, in a digital library that captures the history of an engineering project, there are documents that keep a record of design decisions made [2]. The documents recording decisions usually reference other technical reports in the library (e.g., evaluation of products, results of experiments, requirement analysis documents) as justification for the decision. An example is the selection of a hardware component to be used in a system. At some later time, the referenced report may be revised, to include a new alternative which is superior to the originally chosen part. Designers participating in follow-on projects must be able to follow references to both the historical and current documents in order to gain understanding required for product maintenance and redesign decisions. The need for versioned access to hypertext document webs has been recognized by other researchers as well [9].

To enable this type of access, we may treat each version of a document as a separate document. However, the complexity of this approach quickly becomes overwhelming when documents are organized as hypertext, as they are in WWW digital libraries. If our digital library is to contain hypertext documents, a way must be found to access multiple versions of those documents as configurations of versioned documents: our links (i.e., the web) must be versioned.

The draft HTTP 1.0 specification [5] contains two entity header fields, Version and Derived-from, intended to support the development of versioned documents. These are to be used when HTTP is used bidirectionally to store documents in the server as well as retrieve them from the server. However, these constructs do not solve the problem of versioned links. Version information must be encoded in the document cross reference itself (and not just in the client/server communications protocol) to support versioning of the web. Another approach is needed.

A simple solution for versioned hypertext on the WWW

We have implemented a simple approach that will add this capability to documents on any WWW HTTP server that supports the CGI, the Common Gateway Interface [10]. To implement this scheme we need:

A syntactic convention for identifying the version component of a document cross reference (e.g., WWW URI, Universal Resource Identifier [3], or URL, Universal Resource Locator).
A CGI client that retrieves the versioned documents from a version control system.

For our server we chose to add a `comma version' suffix to the URL: "http://server/vcs.cgi/path/name,version".

The comma is allowed in the URL syntax [6] but infrequently used. We have implemented interfaces to both a proprietary VCS and the widely available RCS. RCS is a good choice for the VCS in this application because it supports symbolic version names. Symbolic version names are the foundation for designating versioned configurations of many hypertext files.

Note that this inverts the use of a VCS described earlier. Rather than keeping the VCS archive in the working maintenance area, the VCS is directly available to the server. The CGI interface to the VCS in the URL above is `vcs.cgi'. An algorithm for the CGI implementation of a version-controlled WWW HTTP server involves the following steps:

Get the versioned document path from the CGI environment variable PATH_INFO.
Parse the document version information from the path.
Normalize the path to the form required by the VCS.
Using the server's table of MIME [7] type (i.e., document type) information (or file attributes from the VCS repository) determine the MIME content type of the document. Send the content-type back through the server on standard output as specified in the CGI.
Fetch the versioned document from the VCS using the normalized path and version. Direct the document output to standard output as specified in the CGI.
Exit.

Provisions for error recovery should be included. Browsing the VCS archive by constructing directories of files or versions is easily implemented by adding the required handling of wild card characters.
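The steps above, with a simple provision for error recovery, can be sketched as a small CGI program. This is a minimal sketch and not our production script: it assumes RCS as the VCS and fetches revisions with the RCS `co -p' command; the archive root /var/library/rcs and all function names are hypothetical. It also fetches the document before sending the content type, a small change of order that lets a failed fetch be reported with a CGI Status header.

```python
import mimetypes
import os
import subprocess
import sys
from typing import BinaryIO, Callable, Optional, Tuple

ARCHIVE_ROOT = "/var/library/rcs"  # hypothetical location of the working files


def split_version(path_info: str) -> Tuple[str, str]:
    """Split '/path/name,version' into (path, version); version is '' if absent."""
    head, comma, tail = path_info.rpartition(",")
    if comma and "/" not in tail:
        return head, tail
    return path_info, ""


def normalize(path: str, root: str = ARCHIVE_ROOT) -> str:
    """Map the URL path onto a file under root, refusing paths that escape it."""
    full = os.path.normpath(os.path.join(root, path.lstrip("/")))
    if full != root and not full.startswith(root.rstrip("/") + "/"):
        raise ValueError("path escapes the archive")
    return full


def fetch_from_rcs(working_file: str, version: str) -> bytes:
    """Print one revision with RCS `co -p<rev>`; an empty rev means the latest.

    Retrieval by date needs the co -d option and is not handled here.
    """
    result = subprocess.run(["co", "-q", "-p" + version, working_file],
                            check=True, capture_output=True)
    return result.stdout


def respond(path_info: str,
            fetch: Callable[[str, str], bytes] = fetch_from_rcs,
            out: Optional[BinaryIO] = None) -> None:
    """Write a CGI response for the requested versioned document."""
    if out is None:
        out = sys.stdout.buffer
    try:
        path, version = split_version(path_info)
        body = fetch(normalize(path), version)
    except Exception:
        # Error recovery: report any failure as "not found".
        out.write(b"Status: 404 Not Found\r\nContent-Type: text/plain\r\n\r\n"
                  b"No such document or version.\n")
        return
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    out.write(f"Content-Type: {content_type}\r\n\r\n".encode("ascii"))
    out.write(body)


if __name__ == "__main__":
    respond(os.environ.get("PATH_INFO", ""))
```

Passing the fetch function as a parameter keeps the request handling independent of any one VCS, so an interface to a different system only has to supply its own fetch function.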

With VCS support, versioned documents may be retrieved by number (e.g., paper.html,1.2), named version (e.g., paper.html,DL95) or by date (e.g., paper.html,1995-3-12). Versioning does not require any changes to the WWW HTTP protocol. The syntactic convention need hold only on the versioned server. On the other hand, when references are made without version information (e.g., paper.html) the current document is retrieved with no apparent difference from any other WWW server. Relative URLs, which make local cross references, work in the same way as on an unmodified server.
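The three forms of version suffix can be told apart by their shape alone. The sketch below classifies a suffix and, assuming RCS as the VCS, translates it into options for the RCS `co' command (-p to print a revision, -d to select by date). The function names are hypothetical, and the date is rewritten in zero-padded ISO form on the assumption that the installed RCS accepts that format.

```python
import re
from datetime import date
from typing import List, Optional

_NUMBER = re.compile(r"\d+(\.\d+)+")
_DATE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")


def parse_date(version: str) -> Optional[date]:
    """Return the date if the suffix looks like year-month-day, else None."""
    match = _DATE.fullmatch(version)
    if not match:
        return None
    try:
        return date(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:          # e.g. month 13: not a real date
        return None


def classify(version: str) -> str:
    """Classify a version suffix as 'current', 'date', 'number', or 'name'."""
    if not version:
        return "current"
    if parse_date(version) is not None:
        return "date"
    if _NUMBER.fullmatch(version):
        return "number"
    return "name"


def co_options(version: str) -> List[str]:
    """Translate a version suffix into options for the RCS co command."""
    when = parse_date(version)
    if when is not None:
        # Latest revision checked in on or before the given date.
        return ["-p", "-d" + when.isoformat()]
    # Revision numbers and symbolic names are both passed as the revision;
    # an empty suffix selects the latest revision.
    return ["-p" + version]
```

For example, the suffix DL95 is classified as a name and becomes the single option -pDL95, while 1995-3-12 is classified as a date.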

Other behaviors can be useful. For example, when the user follows a cross reference to a document without a specified version from a past version of the referring document (the referrer, which a CGI program receives in the HTTP_REFERER variable), the document version current on the date of the referring document version could be returned. This approach would allow users to view historical slices of the library without requiring that all cross references be versioned.
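The lookup at the heart of this behavior is simple once the check-in dates of the target document are known: choose the last revision checked in on or before the date of the referring version. A minimal sketch follows, assuming the revision history has already been extracted from the VCS as a list of (check-in date, revision) pairs sorted by date; the function name and the sample history are hypothetical.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple


def version_current_on(history: List[Tuple[date, str]],
                       when: date) -> Optional[str]:
    """Return the revision that was current on `when`.

    `history` holds (check-in date, revision) pairs sorted by ascending date.
    None is returned if the document did not yet exist on that date.
    """
    dates = [checked_in for checked_in, _ in history]
    index = bisect_right(dates, when)   # number of revisions on or before `when`
    return history[index - 1][1] if index else None


if __name__ == "__main__":
    # Hypothetical history of a referenced technical report.
    report_history = [
        (date(1995, 1, 10), "1.1"),
        (date(1995, 3, 1), "1.2"),
        (date(1995, 6, 20), "1.3"),
    ]
    # A decision record dated 12 March 1995 would see revision 1.2.
    print(version_current_on(report_history, date(1995, 3, 12)))
```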

World Wide Web support for Universal Resource Names (URNs, another kind of URI, e.g., an ISBN) will simplify support for versioned web configurations by providing stable names which are independent of the current location of the document.

CONCLUSIONS

We have been able to successfully apply tools and methods from software engineering to our maintenance of digital libraries of project documentation. Others can likewise benefit from our experiences. We have also found that it is straightforward to support versioned webs of hypertext documents in digital libraries implemented with today's World Wide Web technology and standard software engineering tools available in the Unix environment. We expect to improve our digital libraries in the future by making better use of traditional library experience in subject classification, controlled vocabularies, and long-term archive management.

Because we have sung the virtues of applying software engineering methods and tools to digital libraries, we also have an obligation to point out that there are many other issues where such transfer does not seem to provide any leverage and experience from other disciplines must be brought in.

For example, one of the issues that we identified earlier, under quality issues in on-line library management, was the difficulty of finding material in poorly organized and indexed archives. This is a serious usability problem. The organizational schemes used in corporate digital libraries are mostly ad hoc, usually influenced by business organization and project breakdowns. Business organizations and projects are notoriously fluid. Our digital libraries have much to gain from library science in the application of controlled subject classifications.

Another example is the long-term viability of electronic records. The distributed nature of our digital libraries and the variety of storage solutions that networks enable have practical advantages. However, the physical protection of the documents that compose the corporate knowledge base is not one of them. Too often, knowledge captured in electronic documents is put at risk or lost because of media reuse or obsolescence. Managed processes must be in place so that near-term electronic accessibility is not gained at the expense of longer-term loss of our corporate knowledge.

REFERENCES

[1] Alderson, A., `Configuration management', Software Engineer's Reference Book, John A. McDermid ed., Butterworth-Heinemann Ltd, 1991.
[2] Arango, G., Schoen, E., and Pettengill, R., `A Process for Consolidating and Reusing Design Knowledge', 15th International Conference on Software Engineering, pp. 231-242, IEEE Computer Society Press, May 1993.
[3] Berners-Lee, T., `Universal Resource Identifiers in WWW', Internet RFC 1630, Internet Engineering Task Force, June 1994.
[4] Berners-Lee, T., Cailliau, R., Groff, J-F., Pollermann, B., `World Wide Web: The Information Universe', Electronic Networking: Research, Applications and Policy, vol. 2 no. 1, pp. 52-58, Meckler Publishing, Spring 1992.
[5] Berners-Lee, T., Fielding, R., and Nielsen, H. F., `Hypertext Transfer Protocol - HTTP/1.0', Internet-Draft, Internet Engineering Task Force, March 1995.
[6] Berners-Lee, T., Masinter, L., and McCahill, M., `Universal Resource Locators (URL)', Internet RFC 1738, available from http:/www.cis.ohio-state/htbin/rfc/rfc1738.html, Network Working Group, Internet Engineering Task Force, December 1994.
[7] Borenstein, N. and Freed, N., `MIME (Multipurpose Internet Mail Extensions) Part One', Internet RFC 1521, Internet Engineering Task Force, September 1993.
[8] Feldman, S. I., `Make - A Program for Maintaining Computer Programs', Software - Practice & Experience, vol. 9 no. 3, pp. 255-265, March 1979.
[9] Halasz, F. G., `Reflections on Notecards: Seven Issues for the Next Generation of Hypermedia Systems', Communications of the ACM, vol. 31 no. 7, July 1988.
[10] McCool, R., `The CGI Specification', available from http://hoohoo.ncsa.uiuc.edu/cgi/interface.html, National Center for Supercomputing Applications, 1994.
[11] Pitkow, J. E. and Jones, R. K., `Towards an Intelligent Publishing Environment', Graphics, Visualization, & Usability Center, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280, 1995.
[12] Tichy, W. F., `RCS - A System for Version Control', Software - Practice & Experience, vol. 15 no. 7, pp. 637-654, July 1985.