An Architecture and Operation Model for a Spatial Digital Library

Charles Kacmar[1], Susan Hruska[1], Chris Lacher[1], Dean Jue[2], Christie Koontz[2], Myke Gluck[3], and Stuart Weibel[4]

[1] Department of Computer Science, Florida State University, 203 Love, Mail Stop 4019, Tallahassee, Florida 32306-4019, {kacmar, hruska, lacher}@cs.fsu.edu

[2] Florida Resources and Environmental Analysis Center, Mail Stop 4015, Florida State University, Tallahassee, Florida 32306-4015, {djue, ckoontz}@opus.freac.fsu.edu

[3] School of Library and Information Studies, Mail Stop 2048, Florida State University, Tallahassee, Florida 32306-2048, mgluck@lis.fsu.edu

[4] Office of Research, Online Computer Library Center, Inc., Dublin, Ohio 43017-0702, weibel@oclc.org

Abstract

The dependency on and importance of spatial data is well documented but the reality is that a majority of spatial data is inaccessible, even to the most experienced user. The reasons for this situation include the lack of a general and national locator service; facilities to retrieve, convert, relate, and access spatial data; and a diversity of standards for cataloging and representing spatial files. A national spatial digital library would greatly improve this situation. This paper presents a model for a distributed, hierarchical architecture to support a spatial digital library. The goals of this work are to clarify and resolve the problems of access by creating a national spatial metadata locator service that supports the collection and distribution of metadata to geographically distributed nodes. A unique aspect of this approach concerns the distribution network, which is built upon traditional institutions, particularly libraries, at the state and local levels.

keywords: Spatial, metadata, locator service, libraries, distribution.

1. Introduction

Researchers and users in almost every discipline depend upon spatial data to support their research and job activities. The range of spatial data use is extremely diverse, from "coarse" data about geographic characteristics of the earth, to "fine" data concerning the placement of fire hydrants within a city. This wide variation accounts for the fact that almost 80% of all data have some spatial characteristics [2]. It also accounts for some of the difficulties in collecting, representing, and relating spatial data.

The dependency on and importance of spatial data is well documented, but the reality is that a majority of spatial data is inaccessible, even to the most experienced users, because they: 1) do not have access to the necessary computing facilities; 2) do not know where the spatial data files are stored; 3) cannot use the services of a geographic information system (GIS) to view or manipulate the data; or 4) do not know which spatial files are relevant to their work. Two (of many) reasons for this situation are ineffective locator services, which have retarded widespread use and inhibited knowledge about the spatial data that is available; and the flux of current standards for cataloging, representation, and markup of spatial files and their metadata (data about data), which has been a major barrier to locatability and use.

A national spatial digital library would greatly improve this situation. The current state of the field is that many spatial data files exist and are available on the Internet but only a few select groups of researchers are aware of these files and know how to retrieve them. The Federal Geographic Data Committee (FGDC) is one such group and has sponsored a project [7, 18] to provide Federal spatial data over the Internet. The project has provided many interesting results, but it has not accounted for the majority of spatial data because more than half of all spatial data is collected and maintained at the state and local levels.

This paper presents an overview of a distributed, hierarchical architecture to support a spatial digital library. Section 2 reviews the current state of the field and previous research in the area of spatial digital data. Section 3 presents the architecture and operation of the various components supporting the library. Section 4 identifies the current participants in this project and describes each of their roles. Section 5 provides a summary.

2. State of the field

2.1. Characteristics

A spatial document is composed of one or more layers of graphical features, with accompanying attribute/values usually stored in relational data files. Each layer is stored in a separate file, with the collection of related feature variables defining a particular phenomenon. For example, one layer may provide all features related to transportation services for a downtown area while another layer provides zoning information. The various layers and attribute/value pairs are related dynamically at the time of access (e.g., through an SQL-generated query) to produce the desired graphical display for each data layer. In fact, the resulting graphical map of features may be generated from data scattered across several layers. This results in an extremely problematic situation for analysis because GISs do not provide mechanisms for easily generating queries on data of this type.

Attribute/values not only serve to drive the spatial display but also are the basis for search. For example, a demographer can determine if the transportation layer of the USGS Digital Line Graph (DLG) file for a county contains 4-lane bridges or railroad crossings by obtaining the appropriate layer of the spatial document and then utilizing the viewing services of the GIS to filter (generate the appropriate SQL query) the presentation to affect the display. Spatial variables are recorded as attribute/values in the document and are accessible using a GIS. In some cases, data is unavailable until it is processed by a GIS, and for this reason, access to spatial data is inhibited using traditional means. Complex metadata records and data dictionaries/codebooks are necessary to effectively identify and select documents and elements for viewing.
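The filtering step described above can be illustrated with a minimal sketch. It assumes a hypothetical two-table layout (features and attributes) held in an in-memory SQLite database; the table names, column names, and sample rows are invented for illustration and do not reflect the actual DLG file format or any particular GIS.

```python
import sqlite3

def build_demo_db():
    """Create a tiny, hypothetical store of layer features and attribute/values."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE features (feature_id INTEGER PRIMARY KEY, layer TEXT, feature_type TEXT);
        CREATE TABLE attributes (feature_id INTEGER, name TEXT, value TEXT);
    """)
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", [
        (1, "transportation", "bridge"),
        (2, "transportation", "bridge"),
        (3, "transportation", "rail_crossing"),
        (4, "zoning", "parcel"),
    ])
    conn.executemany("INSERT INTO attributes VALUES (?, ?, ?)", [
        (1, "lanes", "4"),
        (2, "lanes", "2"),
        (3, "tracks", "1"),
        (4, "zone", "commercial"),
    ])
    return conn

def find_features(conn, layer, feature_type, min_lanes):
    """Return ids of features in one layer whose 'lanes' attribute meets the filter."""
    query = """
        SELECT f.feature_id
        FROM features AS f
        JOIN attributes AS a ON a.feature_id = f.feature_id
        WHERE f.layer = ? AND f.feature_type = ?
          AND a.name = 'lanes' AND CAST(a.value AS INTEGER) >= ?
        ORDER BY f.feature_id
    """
    return [row[0] for row in conn.execute(query, (layer, feature_type, min_lanes))]
```

Calling find_features(build_demo_db(), "transportation", "bridge", 4) selects only the bridges with at least four lanes. The point of the sketch is that the features and their attribute/values live in separate relations and are joined only at access time.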

Feature variable selection also is used to derive a view of the spatial document. This occurs, however, only if the GIS supports the selection of all variables which are represented in the layers of the document. If a particular feature/abstraction is not supported or the data for the feature is captured at a level other than what is being viewed, the entity is unavailable for view construction and cannot be used to support search or selection of the spatial element. Moreover, few GISs support search across multiple layers at the feature element level; the ability to selectively identify relevant components of a spatial document remains the responsibility of the user. At a document level, the tools used to locate and retrieve spatial documents such as WAIS, ARCHIE, and GOPHER [16] are inadequate [17]. Other researchers have also documented limitations of these tools [13].

The underlying problem in the above discussion centers on the composition and content of the metadata supporting the spatial document, its layers, and the features within each layer. At a minimum, the metadata must provide named geographic features at all levels of resolution. This problem exists because of disagreement over standards, ambiguity in naming conventions, changing of names (e.g., Leningrad to St. Petersburg), different levels of accuracy of data recording, and ineffective locator and access facilities. Current locator and access tools are designed to be most effective on textual documents. In many cases, the textual documents are static or subject to changes which impact only a single or small collection of documents. In contrast, spatial data documents can be extremely dynamic. For example, a cataclysmic event such as the volcanic eruption of Mt. St. Helens had wide reaching effects on all spatial documents associated with that region.

2.2. Standards and standards-setting organizations

The library community, including the Library of Congress, the American Library Association, and others, has developed eight exchange formats and transport mechanisms for the cataloging of library materials. These are known collectively as the MARC (Machine Readable Cataloging) records [1, 5, 10, 11]. These formats include templates for cartographic and machine readable items. The USMARC standard is used in the US while variants of this standard are used in other countries, such as UKMARC for the United Kingdom. The templates support both fixed and variable fields in order to describe library holdings. Fields identify author, title, producer, copyright holder, and so forth as well as support repeating fields and subfields. MARC records are the pervasive standard for exchange of library materials worldwide.

The second standard is the spatial data standard [7] (actually only a proposed standard at this time although the intent is to make it a FIPS standard). A draft content standard for spatial metadata was published in the Federal Register in 1992 and serves a similar purpose as MARC but supports a slightly different set of fields.

The problem with the current state of standards is the lack of agreement across all levels of spatial data. At one level of abstraction, local communities cannot agree on the definition and content of the base map, the basic collection of layers of data which define the primary entities to be represented. Other levels of disagreement occur on the variables and attribute/values needed to support representation. Discrepancies and differences in codebooks are a classic illustration of the problem. One codebook may provide a definition for "Pipeline", another may ignore it completely, and a third may provide four different definitions under two different categorizations.

2.3. Locator and access facilities

There are many spatial digital data sets available for purchase or downloading over the Internet. However, the access rates and usage of such data sets are very low [18] and can be attributed to a lack of established locator and retrieval services for spatial metadata, files, and their surrogates.

Various data locator and distribution mechanisms and services have been created and used in other settings for non-spatial documents. However, few of these services have been applied to the needs of the spatial community. One effort began in September, 1993, sponsored by the FGDC, and involves the use of WAIS to access spatial metadata. The pilot system uses a centralized locator service for spatial documents called the National Geospatial Data Clearinghouse (NGDC). Approximately 180 sites were involved in the initial test, accessing a small number of FGDC data files. The proposed FGDC spatial metadata standard was used as the basis of the metadata structure.

Figure 1. From upper left to lower right. Spatial data set repositories provide storage for the library. Data sets are placed into repositories by producers. Users access the library to locate and retrieve data sets through interface mechanisms. Supporting these mechanisms are expert network tracking and end-user data collection. Connection to the metadata nodes and repositories is through standardized network protocols and tools. Catalogers manage the metadata by creating and editing metadata records. A distributed DBMS supports metadata storage, access, and query activities. Components of the architecture, such as the expert network and data collection elements, may exist elsewhere and may not necessarily be embedded within a single package as implied in the figure.

While the NGDC pilot project has provided some valuable insights into developing a locator service, it does not address some critical issues that must be considered for a national spatial data library. First, some state legislatures require state data to be provided to citizens through the state itself. This mandate necessitates the creation of multiple spatial data repositories and various locator services to support the digital library. Coordination among these agencies and institutions must exist and be of a sufficient level to ensure distribution of metadata throughout the entire library. Second, the majority of spatial data sets are produced by non-federal agencies and in various jurisdictional areas. This indicates a need for catalog and locator services within these agencies. Often, for political or legal reasons, the responsibility for maintaining spatial metadata and files must remain with the data producer. All of these issues affect user access to the spatial digital library.

3. Architecture

The architecture which we have proposed for a spatial digital library is distributed, layered, and supports multiple interface and access methodologies (see Figure 1).

Interaction between the user, metadata access facilities, and data set repositories is supported by data collection, tracking, network protocol, and data management elements. The architecture not only supports spatial digital library activities, but also provides a framework for experimenting with different component designs and implementations to determine how access paradigms relate to user experience levels and task needs and from this suggest to users appropriate paradigms of access. Thus, a component of the research and development of this spatial digital library system will consist of an evaluation of access mechanisms.

3.1. Operation

The production and distribution of spatial data begins with the production of the document entity. At the present time, spatial data producers are usually responsible for creating and maintaining the data and metadata. This production process must be augmented with a metadata cataloging facility to allow producers to specify the contents and accuracies of a spatial data document.

Metadata and changes to metadata will propagate throughout the network relative to nodes that are maintaining or have interest in the changed spatial documents. The goal is to satisfy requests as quickly as possible through geographic distribution of metadata and to allow changes in storage locations as necessary to expand the library. A primary goal of the hierarchical distribution structure is to place metadata nodes close to producers and users of spatial data. A secondary goal is to allow privately held spatial documents to be registered with other metadata nodes to enhance awareness of those documents.
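As a rough illustration of interest-based propagation through a hierarchy of nodes, consider the following sketch. The class name, the notion of a node's "interests" as a set of region codes, and the sample topology are all assumptions made for the example; the paper does not specify a propagation algorithm.

```python
class MetadataNode:
    """Hypothetical metadata node in a hierarchical distribution network."""

    def __init__(self, name, interests):
        self.name = name
        self.interests = set(interests)  # regions this node serves (assumed model)
        self.children = []
        self.records = {}

    def add_child(self, child):
        self.children.append(child)
        return child

    def covers(self, region):
        """True if this node or any descendant has interest in the region."""
        return region in self.interests or any(c.covers(region) for c in self.children)

    def propagate(self, record_id, region, metadata):
        """Store the record here, then forward it only to interested subtrees."""
        self.records[record_id] = metadata
        for child in self.children:
            if child.covers(region):
                child.propagate(record_id, region, metadata)
```

For example, a hub with two state-level children that each serve one region would forward a new record for region "FL" to the Florida subtree only, which keeps metadata close to the producers and users most likely to need it.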

3.1.1. Information retrieval

The issues concerning collection and management of spatial metadata necessitate the use of three different information retrieval access methods for the spatial digital library. Library cataloging provides for the construction, maintenance, and access to documents using surrogate records. This access technique is based on standards such as Anglo-American Cataloging Rules (AACR), Library of Congress Subject Headings for classification (LCSH), and Machine Readable Cataloging (MARC). Surrogates generally represent the most aggregated level of content, for example, a book must be defined in a single record. Retrieval is based on boolean procedures to find information in each item's surrogate record and to use AND, OR and NOT operations to further refine the retrieval. Call numbers, subject headings, and keywords from note fields also support retrieval.
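Boolean retrieval over surrogate records can be sketched as set operations on an inverted index. The record layout, field names, and sample records below are hypothetical and are far simpler than a real MARC record; the sketch only illustrates how AND, OR, and NOT refine a result set.

```python
def build_index(records):
    """Map each lower-cased term to the set of record ids whose surrogate contains it."""
    index = {}
    for record_id, fields in records.items():
        for text in fields.values():
            for term in text.lower().split():
                index.setdefault(term, set()).add(record_id)
    return index

def boolean_search(index, all_ids, must=(), any_of=(), must_not=()):
    """Apply AND over `must`, OR over `any_of`, and NOT over `must_not`."""
    result = set(all_ids)
    for term in must:
        result &= index.get(term.lower(), set())
    if any_of:
        matched = set()
        for term in any_of:
            matched |= index.get(term.lower(), set())
        result &= matched
    for term in must_not:
        result -= index.get(term.lower(), set())
    return sorted(result)

# Hypothetical surrogate records with a title and a subject field.
RECORDS = {
    "r1": {"title": "Leon County transportation layer", "subject": "roads bridges"},
    "r2": {"title": "Leon County zoning layer", "subject": "land use"},
    "r3": {"title": "Franklin County transportation layer", "subject": "roads ferries"},
}
```

With index = build_index(RECORDS), the query boolean_search(index, RECORDS, must=["leon"], must_not=["zoning"]) narrows the Leon County records to the one that is not a zoning layer.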

Indexing and abstracting will be supported by automated methods. The result of this processing is a collection of surrogate items for the document. Surrogates are highly structured and if created manually would require a very labor intensive effort.

Fulltext retrieval allows each word of a document to be searched permitting creation of indexing and thesaurus terms through automatic examination of the document. Fulltext retrieval increases a user's freedom to explore documents that surrogates, such as abstracts or catalog records, cannot provide. However, performance depends upon the choice of vocabulary of the author and user since searches are still based on boolean procedures. If a user searches on a synonym of a term that does not appear in the fulltext document, the document will not be retrieved. Thus, precision in query formulation is critical for effective retrieval, as it is with boolean searching of catalog and index/abstract surrogates.
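The synonym problem can be made concrete with a small sketch. The functions and the tiny thesaurus below are illustrative assumptions, not part of the proposed system: the first function performs a plain boolean AND over the words of a document, and the second expands each query term through a thesaurus before matching.

```python
def fulltext_match(document, query_terms):
    """True if every query term appears verbatim among the document's words."""
    words = set(document.lower().split())
    return all(term.lower() in words for term in query_terms)

def expand(term, thesaurus):
    """Return the term plus any synonyms listed for it in a (hypothetical) thesaurus."""
    return {term.lower()} | {s.lower() for s in thesaurus.get(term.lower(), ())}

def fulltext_match_expanded(document, query_terms, thesaurus):
    """True if, for every query term, the term or one of its synonyms appears."""
    words = set(document.lower().split())
    return all(bool(words & expand(term, thesaurus)) for term in query_terms)
```

A document containing "road" is missed by a search for "highway" under fulltext_match, but is found by fulltext_match_expanded when the thesaurus lists "road" as a synonym of "highway".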

3.1.2. Hypermedia

Hypermedia will be used as an access paradigm over the metadata network by creating hypertext documents which organize metadata in a hierarchical and non-linear manner. More importantly, hypermedia will be the primary access paradigm for relationships among spatial:spatial as well as spatial:non-spatial documents. Researchers navigating a metadata hypertext may select a document for retrieval through a simple mouse click.

Hypermedia research also will benefit the spatial library by reducing instances of disorientation [6, 14, 15]. Whereas many computing environments are characterized by transactions of short duration, spatial data systems often involve long duration transactions, in some cases several minutes. The researcher must contend not only with the complexity of problems, but also the loss of control and concentration during long duration tasks. One way that hypermedia can assist in this task is through the capturing and display of navigational paths through the metadata network. This presentation will aid the spatial data user in recalling link paths and by doing so should reduce some aspects of disorientation.
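One simple way to capture and display a navigational path is sketched below. The class and the document identifiers are hypothetical; an actual interface would likely record richer information, such as link types and timestamps.

```python
class NavigationTrail:
    """Records the sequence of metadata documents a user visits (illustrative only)."""

    def __init__(self):
        self._path = []

    def visit(self, doc_id):
        """Append the newly visited document to the trail."""
        self._path.append(doc_id)

    def back(self):
        """Drop the current document and return the previous one, or None."""
        if self._path:
            self._path.pop()
        return self._path[-1] if self._path else None

    def render(self):
        """Return the trail as a single line suitable for display."""
        return " > ".join(self._path)
```

After a long-running retrieval, redisplaying the rendered trail lets the user see at a glance which links led to the current document.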

Another expected benefit and research focus of hypermedia will be to enhance usability and reduce disorientation over large document collections. The lack of document spaces large enough for doing this type of research and understanding the issues of navigating large document networks is well stated in [12]. A digital library of the magnitude anticipated for use by the spatial data community will provide an appropriate platform. The focus of this work will concern the adequacy of the hypermedia access paradigms to present metadata in a non-linear manner as the size of the metadatabase increases.

3.1.2.1. Defining relationships among spatial documents

The problem of identifying relationships among documents is perhaps the most critical issue for enhancing understanding and increasing the use of spatial documents. By applying hypermedia technology to the spatial data arena, it will be possible to assist users in identifying relationships among documents and hopefully to better understand the use of a particular document in a task.

The integration of machine learning and hypermedia will provide for the development, maintenance, and display of interdocument relationships. From a user's perspective, this integration will improve his or her ability to discover new information while minimizing the effort required to identify and retrieve those documents.

Similar to the work by Chang [4], links will specify both "type" and "strength" of the relationship. For example, two spatial documents may be related because they concern a particular county in a state (type) but these documents may be so different that the strength of the relationship is small (e.g., 0.01). Expert network technology, developed originally to use connectionist-type machine learning in refining static rule bases, will be adapted to attack this problem. Specifically, expert network methods will be used to dynamically update the strengths of the connections (links) based on data gathered as the user navigates the document collection. Expert networks will allow dynamic refinement of a rule-base governing the connections being made and accessed. The logical chain of reasoning represented in the resulting rule-base captures the researcher's preferences in a usable form by which abstraction and application to other documents can be made by the system.
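A typed, weighted link and a usage-driven strength update might be represented as in the sketch below. The update rule shown is a plain exponential moving average chosen only for illustration; it is a stand-in and not the expert network learning method of [9], whose details are not given here. The class name, field names, and learning rate are likewise assumptions.

```python
from dataclasses import dataclass

@dataclass
class Link:
    """Hypothetical interdocument link carrying a type and a strength in [0, 1]."""
    source: str
    target: str
    link_type: str
    strength: float

def reinforce(link, followed, rate=0.1):
    """Nudge strength toward 1 if the user followed the link, toward 0 otherwise.

    Stand-in rule (exponential moving average), not the expert network algorithm.
    """
    target_value = 1.0 if followed else 0.0
    link.strength += rate * (target_value - link.strength)
    return link.strength
```

Starting from a weak link such as Link("doc_a", "doc_b", "same_county", 0.01), repeated calls to reinforce(link, True) raise the strength gradually, so links that users actually traverse become more prominent over time while the value stays between 0 and 1.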

Operation of the access pattern learning facilities is based on expert network technology and rules derived from user preferences [9]. Expert network technology used on spatial documents will provide for the application of connectionist-based learning algorithms to tailor a set of rules which are used to create visuals (e.g., structure chart, map, outline) which depict metadata relationships. These visuals will reflect the relationships between spatial data documents based on the access patterns of an individual spatial data user. The access pattern data will consist of statistical usage patterns and the results of path analysis, both collected during use.

A second approach to automatic relationship identification and construction will be to apply results from Handley and Weibel's work [8] to support the automatic creation of interdocument relationships. Previous findings included a taxonomy of electronic information, an analysis of descriptive data elements present in sample files, recommendations for extensions to existing cataloging standards, and a document relationship discovery system. Three important components appropriate for the creation and management of automatically-created relationships were defined as: 1) automated analysis of electronic files (text, software, data); 2) automated creation of surrogate database records; and 3) search and retrieval capabilities.

The system based on this work extracts identified data elements and creates a structured surrogate record (a catalog record similar to the library MARC record, but incorporating additional fields and links to other data sets and metadata as appropriate). Statistical classification techniques [3] are applied to aid in the correct identification of descriptive elements.

4. Participants

This project is supported by participants from several academic, industrial, and governmental agencies. The major industrial partner is OCLC, the world's largest not-for-profit membership organization providing bibliographic and full-text services to libraries and educational institutions. Government agencies in Florida and Ohio consist of the state libraries, research map library of Ohio, Growth Management Data Network Coordinating Council, Ohio Geographically Referenced Information Program, and various agencies who produce spatial data to support the activities of these organizations. Academic participants include the Departments of Computer Science and Library and Information Studies, Florida Resources and Environmental Analysis Center, and researchers from the Departments of Meteorology, Economics, and the Supercomputer/Computations Research Institute. See Figure 2 for an overview of the participants.

OCLC. OCLC will be the hub of the spatial data locator service. Metadata records will be transmitted from producers, librarians, and catalogers to OCLC and from there may be distributed to other metadata nodes. Researchers and users of spatial data will access the metadata collection through front-ends. The front-ends will interact with OCLC's client/server database engine, Newton, via the Internet using the industry standard Z39.50 protocol supporting search and retrieval of bibliographic and metadata entries. Newton is a distributed DBMS. The search engine may be used as the basis of a fully functional information retrieval system and uses generalized data definitions so that many types of data may be accessed.

State Libraries. The state libraries of Florida and Ohio are connected to the Internet and will provide cataloging support for state agencies and access to the locator service for their patrons. Patrons will be allowed to query the locator service for spatial data documents relevant to their tasks. Identified data sets can be retrieved for viewing or printing (using a GIS) in the library.

State libraries also will serve as regional and local access and distribution nodes for spatial data. The objective of this partitioning is to assess the adequacy of access and retrieval mechanisms relative to the needs of a spatial data community, especially library patrons who are not experienced in retrieving spatial data. By partitioning and distributing the metadata, patrons will have immediate access to spatial data sets particular to their locale or state as well as access to tools and interfaces that have been customized by librarians to meet local needs.

GMDNCC, OGRIP. The Growth Management Data Network Coordinating Council (GMDNCC) of Florida and Ohio Geographically Referenced Information Program (OGRIP) of Ohio will have similar roles in each of their respective states. These participants will serve as the primary state-level contact and coordinate local and regional (state) locator services and responsibilities and support the creation of metadata with state agencies.

Leon County Public Library, Wilderness Coast Library Coalition. These local libraries will serve the needs of local patrons in urban and rural counties, respectively, within the state of Florida. They will assess the adequacy of the access and retrieval mechanisms relative to the needs of a wide range of users, especially library patrons who are not experienced in retrieving spatial data.

Others. The state library of Ohio's map library, FREAC, and other academic departments will serve as users of the digital library. First, these participants will assist in identifying relationships among spatial data and evaluate mechanisms which attempt to identify these relationships automatically. Second, they will form the initial set of researchers and users to participate in usability, task, and domain analysis of the proposed library. Results from these studies will drive the development of some of the front-end and access tools.

SCRI. The Supercomputer/Computations Research Institute on the Florida State University campus will be an evaluator of the access tools. SCRI is currently supporting spatial data research in the areas of thunderstorm effects on environmental and economic conditions in Brazil as well as effects of ozone depletion on climate in Africa.

Figure 2. Participants. OCLC is the primary access and distribution point for metadata. The state libraries of Florida and Ohio provide patrons with access to OCLC and local metadata stores. Agencies in the states supply information to cataloging librarians so spatial data sets can be registered with the system. GMDNCC, OGRIP, SCRI, and other users on the Internet will access the digital library through local metadata nodes or OCLC. Local and public libraries, such as the Leon County [Florida] Public Library and Wilderness Coast [Florida] Library Coalition, will access the digital library through the state's network or through the Internet. Local libraries that are not Internet accessible will be limited in that direct searching of the metadatabase at OCLC will not be possible. For this reason, OCLC must broadcast new metadata or changes to metadata to state library nodes so that state libraries can further broadcast the data to local libraries.

5. Summary

Spatial data is vital to many researchers, businesses, and governmental agencies. Although a proliferation of spatial data exists and is available on the Internet, the lack of a general and national locator service, and of facilities to retrieve, convert, relate, and access spatial data, hinders the use and awareness of spatial data that could be used to solve problems.

The goals of this work are to clarify and resolve many of these locator and access problems. The approach consists of the creation of a national spatial metadata locator service that is distributed and organized around a common collection and distribution point. Metadata is created at the point of production and in distributed metadata production locations, sent to a central location, and from there it is dispersed to other geographically distributed nodes. The distribution is based on necessity, capability, and relevance to the researchers and users served by the node. The distribution network is built upon traditional institutions, particularly libraries, at the state and local levels. Libraries serve as spatial document cataloging agents to provide metadata input to the system and as end-user access points to the digital library. Public and rural libraries and coalitions are served by the distribution network through interlibrary digital connections which generally exist among these institutions. This architecture uses libraries as repositories for metadata while the spatial data files reside in repositories provided by the data producers. Businesses can participate as users of the metadata or become nodes which meet and solve specific needs of a particular spatial data community.

References

[1] American National Standards Institute. 1986. ANSI Standard for Information Sciences Bibliographic Information Interchange. New York, NY.

[2] Antenucci, J. 1989. Technical updates of geographic information. National Association of Counties Conference - Workshop. Cincinnati, OH. (July 16).

[3] Breiman, L. 1984. Classification and regression trees. Wadsworth International Group, Belmont, CA.

[4] Chang, D. 1993. HieNet: A user-centered approach for automatic link generation. Hypertext '93 Proceedings, (Seattle, WA), pp. 145-158.

[5] Crawford, W. 1989. MARC for Library Use. 2nd edition. Boston, MA: G.K. Hall & Co.

[6] Edwards, D. and Hardman, L. 1989. 'Lost in hyperspace': Cognitive mapping and navigation in a hypertext environment. In Hypertext: Theory into Practice, R. McAleese (Ed.), Ablex Publishing Corp., Norwood, NJ, 105-125.

[7] Federal Geographic Data Committee (FGDC). 1993. Content Standards for Spatial Metadata. USGS: Reston, VA.

[8] Handley, J. and Weibel, S. 1990. Automated document architecture processing and tagging. Electronic Publishing (EPODD), 183-192.

[9] Kuncicky, D., Hruska, S., and Lacher, R. 1991. Hybrid systems: The equivalence of rule-based expert system and artificial neural network inference. International Journal for Expert Systems, 4 (3), 281-297.

[10] Library of Congress Cataloging Distribution Service. 1992. USMARC Format for Bibliographic Data Including Guidelines for Content Designation. Washington, DC.

[11] Library of Congress Cataloging Distribution Service. 1993. USMARC Format for Authority Data. Washington, DC.

[12] Malcom, K., Poltrock, S., and Schuler, D. 1991. Industrial strength hypermedia: Requirements for a large engineering enterprise. Proceedings of the Hypertext '91 Conference, (San Antonio, TX, December), pp. 13-24.

[13] Marchionini, G. and Barlow, D. 1994. Extending retrieval strategies to networked environments: Old ways, new ways, and a critical look at WAIS. Brief Communication summary of final report to NASA.

[14] Nielsen, J. 1990a. The art of navigating through hypertext. Commun. ACM, 33 (3), 296-310.

[15] Nielsen, J. 1990b. Hypertext and Hypermedia. Academic Press, New York, NY.

[16] Obraczka, K., Danzig, P., and Li, S. 1993. Internet resource discovery services. IEEE Computer, 26 (9 September), 8-22.

[17] USGS. 1990. Federal Interagency Coordinating Committee on Digital Cartography. A summary of GIS Use in the Federal government. Reston, VA.

[18] Tosta, N. 1994. Personal communication. Federal Geographic Data Committee.