Viewing the U.S. Government Budget as a Digital Library

R. L. Grossman[1], A. Sundaram[1], H. Ramamoorthy[1], M. Wu[1], S. Hogan[2], J. Shuler[2], and O. Wolfson[1]

[1] Laboratory for Advanced Computing, University of Illinois at Chicago, Dept. of Mathematics, Statistics and Computer Science, 851 S. Morgan Street, 322 SEO mail code 249, Chicago, IL 60607-7045, nisp@math.uic.edu

[2] Main Library, University of Illinois at Chicago, 851 S. Morgan Street, 1-280 LIB, mail code 234, Chicago, IL 60607-7045

Abstract

We developed a prototype of a digital library designed to browse, query, mine and visualize large amounts of scientific, numerical and statistical data. The system currently provides access to the U.S. Government Budget for FY93, 94 and 95. Our point of view is to regard the data as collections of objects distributed over a wide area network. We manage the objects using a high performance, low overhead object manager we have developed called ptool. Ptool interfaces to a hierarchical storage system including tape to provide the potential of accessing terabyte size data sets. The system caches, migrates and replicates collections of objects over a wide area network to achieve higher performance. We have also developed specialized tools to query, analyze, mine, and visualize the data. Additional economics and statistics data should be available through the system soon.

Keywords: Digital library, object manager, scientific & statistical database, visualization.

1. Introduction

We describe a prototype we have developed of a digital library designed to browse, query, mine and visualize large amounts of scientific, numerical and statistical data. The prototype exploits a hierarchical storage system including tape to provide the potential of accessing terabyte size data sets. We view the data as collections of objects distributed over a wide area network; we use low overhead, high performance persistent object stores to access the data; we cache, migrate, and replicate collections of objects over a wide area network to achieve higher performance; and we query, analyze, mine, and visualize the data with a suite of modular software tools. Further details are in [7].

Our prototype provides distributed access to the U.S. Government Budget for FY 93 and FY 94. FY 95 data will be available shortly. The budget for each fiscal year contains approximately 5000 tables and 50,000 line items, as well as a modest amount of accompanying text. The challenge was to provide distributed access and analysis tools for tabular data of this type. A typical query retrieves all line items containing the keyword "research" in which fiscal year 94 outlays were over $100 million.
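As a rough illustration of the kind of query described above, the following Python sketch filters a small collection of line items by keyword and by FY 94 outlays. The LineItem class, its field names, and the sample values are hypothetical and invented for illustration; they are not the schema or the data used by the prototype, which expresses such queries in OQL.

```python
from dataclasses import dataclass


@dataclass
class LineItem:
    """Hypothetical line item; the field names are assumptions for illustration."""
    description: str
    outlays: dict  # fiscal year label -> outlays in millions of dollars


def select_line_items(items, keyword, year, min_outlays):
    """Return items whose description contains keyword (case-insensitive)
    and whose outlays for the given year exceed min_outlays."""
    keyword = keyword.lower()
    return [
        item for item in items
        if keyword in item.description.lower()
        and item.outlays.get(year, 0.0) > min_outlays
    ]


# Invented sample data, not actual budget figures.
items = [
    LineItem("Basic research grants", {"FY94": 250.0}),
    LineItem("Research facilities", {"FY94": 40.0}),
    LineItem("Highway maintenance", {"FY94": 900.0}),
]

matches = select_line_items(items, "research", "FY94", 100.0)
print([m.description for m in matches])
```

Only the first sample item satisfies both the keyword condition and the numerical condition, which is the combination that a purely document-based search cannot express directly.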

The prototype uses a standard object oriented data model. This model provides the data with enough structure for queries such as the one just described. To manage and query the data, we used two software tools that we developed for related projects: ptool, a low overhead, high performance persistent object manager, and qtool, a companion tool which implements a subset of the emerging ODMG-93 standard for OQL (Object Query Language) queries. To provide wide area access to the data, we used the Forms Package in NCSA's Mosaic to send OQL queries to a server, which then returned the requested data. A variant of the prototype offers other analysis tools, such as spreadsheets, for those clients who can access the server as an X client.

We have used the same technology to develop digital libraries for high energy physics data [2] and [3]. We have also used this technology to implement data intensive algorithms in high performance computing [5].

2. Related Work

Our prototype is designed to handle digital libraries which contain numerical, statistical or scientific data. Some of the important differences between digital libraries which contain textual or multi-media data and those which contain numerical, statistical or scientific data are:

Data Model.

Textual and multi-media digital libraries are usually "document based." By browsing, navigating, or searching, one identifies the document of interest and then browses or retrieves the document as appropriate. Whether the document is a compound document, multi-media document or hypermedia document does not fundamentally change this. Of course, some documents have a complex or hierarchical structure and contain a variety of data types. In contrast, numerical, statistical or scientific data are usually organized into attributes, which may themselves be further divided into additional attributes. The typical access exploits the attributes to return the data of interest, which often requires a statistical or numerical computation, as in "return all line items in which there is a research related expenditure greater than $100 million." The objects returned are usually not from just one data set, but more often from several.

Searching.

Searching textual and multi-media digital libraries is usually by key word, tag, or through some type of full text retrieval. On the other hand, searching numerical digital libraries is often done by applying a statistical or numerical filter to the data. For example, "return all line items from the FY 94 budget which differ by more than 10% from the estimates in the FY 93 budget."

Use.

Information from a textual or multi-media digital library is usually read or viewed, while information from a numerical digital library is usually used as the basis for further numerical analysis. For example, after all line items which involve more than $100 million of research are retrieved, the data is usually further analyzed with a variety of statistical or visualization tools.
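The numerical filter mentioned under Searching can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the dictionary layout, and the sample values are hypothetical, and the choice to measure the difference relative to the estimate and to skip items with a missing or zero estimate is ours, not something the prototype is known to do.

```python
def relative_difference(actual, estimate):
    """Difference between actual and estimate, as a fraction of the estimate."""
    return abs(actual - estimate) / abs(estimate)


def changed_items(actuals, estimates, threshold=0.10):
    """Return names of line items whose actual value differs from the
    estimate by more than threshold.

    Items with a missing or zero estimate are skipped, because the relative
    difference is undefined for them.
    """
    selected = []
    for name, actual in actuals.items():
        estimate = estimates.get(name)
        if not estimate:
            continue
        if relative_difference(actual, estimate) > threshold:
            selected.append(name)
    return selected


# Invented values in millions of dollars, not actual budget figures.
fy93_estimates = {"Program A": 100.0, "Program B": 200.0, "Program C": 0.0}
fy94_actuals = {"Program A": 115.0, "Program B": 205.0, "Program C": 12.0}

print(changed_items(fy94_actuals, fy93_estimates))
```

A filter of this kind compares values drawn from two data sets, which illustrates why the objects returned by a numerical query often come from more than one collection.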

A variety of technologies have been used to build digital libraries. Many are document based and use the native file system to manage the data. Others use a database to manage the data. Our prototype in contrast uses a low overhead, high performance persistent object store to manage the data and World Wide Web (W3) applications to provide wide area access to the data. Since the data in our digital library was historical, most of the additional functionality provided by a database was not needed and a persistent object manager sufficed.

3. Design

The design of our system was based on just a few basic principles.

Objects and collections.

Our system is based upon objects and collections of objects. For the Federal Budget, we chose the fundamental objects to be the budget tables. The budget tables have an internal structure so that one can query by row or column. In contrast, with a conventional document based system, it would be very difficult to query by row or column.

Object Manager.

We used a high performance, low overhead persistent object manager we developed called ptool to manage these objects. Ptool interfaces to an IEEE compliant hierarchical storage system in order to provide transparent access to data on secondary and tertiary storage. The data was accessed using a variant of a subset of OQL (Object Query Language). The variant supported some table-specific operations we found useful.

Integrated Analysis Tools.

Rather than design stand alone applications, we designed a number of small tools which accepted an input collection of objects and produced an output collection by selecting some objects and computing derived objects.

Wide Area Access.

We provided wide area access to the data by using Mosaic Forms to send OQL queries to a WWW server which accessed the required data. The query could specify whether the server should return the objects themselves, so that they could be further analyzed using local tools, or simply a file containing the objects' attributes.
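The first and third principles above, table objects that can be queried by column and small tools that map one collection to another, can be sketched together in Python. The Table class and the two tool functions are hypothetical names introduced for illustration; they are not part of ptool or qtool, and the sample values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical budget table: a title, column names, and rows of values."""
    title: str
    columns: list
    rows: list = field(default_factory=list)

    def column(self, name):
        """Return the values in the named column, one per row."""
        index = self.columns.index(name)
        return [row[index] for row in self.rows]


def select_tables(tables, keyword):
    """Tool 1: keep the tables whose title contains keyword."""
    return [t for t in tables if keyword.lower() in t.title.lower()]


def column_totals(tables, column_name):
    """Tool 2: derive one (title, total) pair per table for the named column."""
    return [(t.title, sum(t.column(column_name))) for t in tables]


# Invented sample data, not actual budget figures.
tables = [
    Table("Outlays by function", ["item", "FY93"],
          [["Defense", 10.0], ["Energy", 5.0]]),
    Table("Receipts by source", ["item", "FY93"],
          [["Income tax", 7.0]]),
]

# Each tool takes a collection and returns a collection, so tools compose.
totals = column_totals(select_tables(tables, "outlays"), "FY93")
print(totals)
```

Because every tool consumes and produces a collection, a selection step and a derivation step can be chained without either one knowing about the other.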

4. Implementation

As already mentioned, the prototype was implemented using a low overhead, high performance object manager we developed called ptool [4] and [6] and a companion tool called qtool we developed which supports a subset of OQL.

A major part of the implementation was to migrate the legacy data into a usable form. The U.S. Government Budget, as published by the U.S. Government Printing Office, is available as a print document and in electronic form. The electronic form contains the data in a proprietary markup language used by the Government Printing Office called Microcomp. We reverse engineered the Microcomp data, translated it into a set of files containing the data together with matching files describing the logical format of the data, and then populated an object store with this data using ptool.

For this prototype, we used qtool to select tables, rows, columns, and fields from the data. Qtool supports a variant of OQL. We also wrote some specialized functions for the statistical analysis of rows and tables.

We developed a WWW server to provide distributed access to the data. The server could return either the requested objects themselves or ASCII files containing the attributes of the data in HTML format.
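One way to picture the exchange between a forms-based client and such a server is as URL-encoded form data carrying the query text and the requested form of the result. The sketch below uses only the Python standard library. The field names "query" and "result" and the values "objects" and "html" are assumptions made for illustration; the actual form fields used by the prototype are not described here.

```python
from urllib.parse import parse_qs, urlencode


def encode_query(oql, return_objects):
    """Client side: encode a query and the desired result form as form data."""
    result = "objects" if return_objects else "html"
    return urlencode({"query": oql, "result": result})


def decode_query(form_data):
    """Server side: recover the query text and the requested result form."""
    fields = parse_qs(form_data)
    return fields["query"][0], fields["result"][0]


oql = 'select all from * in FY93 where * = "outlays by function"'
body = encode_query(oql, return_objects=False)
print(body)
print(decode_query(body))
```

The round trip shows that the query text survives encoding unchanged, so the server can pass it on to the query tool as written.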

We also developed X-based client-server variants of the system which used the commercial spreadsheet Wingz for the analysis of retrieved and selected data. We put together a simple user interface for the X-based version using Tcl/Tk to integrate the various tools.

The architecture is illustrated in Figure 3. Figure 1 contains a typical query. The objects returned by the query, viewed as a spreadsheet, are displayed in Figure 2.

    select all
    from * in FY93
    where * = "outlays by function"

Figure 1. The query uses the software tool qtool to scan all tables in the collection of tables FY93 and locate all rows containing the string "outlays by function".

Variants of the query allow just selected attributes of the row to be returned, and either the entire table or just the selected rows in the table to be returned. This particular query retrieved 38 out of approximately 4000 tables in the collection FY93.
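The behavior of the query in Figure 1 and of its variants can be approximated by the following Python sketch. It is not how qtool is implemented: the dictionary layout of the collection, the case-insensitive substring match, and the whole_tables parameter are assumptions introduced for illustration, and the sample data is invented.

```python
def scan_tables(collection, text, whole_tables=False):
    """Scan every table in a collection for rows containing text.

    collection maps a table name to a list of rows, and each row is a list
    of fields. The result maps each matching table name either to all of its
    rows (whole_tables=True) or to only the matching rows.
    """
    text = text.lower()
    result = {}
    for name, rows in collection.items():
        hits = [row for row in rows
                if any(text in str(value).lower() for value in row)]
        if hits:
            result[name] = rows if whole_tables else hits
    return result


# Invented stand-in for the FY93 collection, not actual budget data.
fy93 = {
    "Table 1": [["Outlays by function", 100.0], ["Receipts", 80.0]],
    "Table 2": [["Receipts by source", 80.0]],
}

print(scan_tables(fy93, "outlays by function"))
print(scan_tables(fy93, "outlays by function", whole_tables=True))
```

The first call returns only the matching row of the matching table, and the second returns that table in full, mirroring the two variants described above.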

5. Conclusion and Future Directions

Our prototype demonstrates the feasibility of building scalable digital libraries for numerical, statistical or scientific data using wide area object stores. To make effective use of digital libraries of this type, further work is required in a number of areas: especially in developing more efficient methods for migrating unstructured legacy data into object stores; in visualizing large amounts of numerical data; and in providing better techniques for mining such data.

References

[1] S. Coleman and S. Miller, editors, "Mass Storage System Reference Model, Version 4," IEEE.

[2] C. T. Day, S. Loken, J. F. MacFarlane, E. May, D. Lifka, E. Lusk, L. E. Price, D. Baden, R. Grossman, X. Qin, L. Cormell, P. Leibold, D. Liu, U. Nixdorf, B. Scipioni, and T. Song, "Database Computing in HEP - Progress Report," Proceedings of the International Conference on Computing in High Energy Physics '92, C. Verkerk and W. Wojcik, editors, CERN-Service d'Information Scientifique, 1992, ISSN 0007-8328, pp. 557-560.

[3] R. L. Grossman, X. Qin, D. Valsamis, D. Lifka, E. May, D. Malon, and L. Price, "The Architecture of a Multi-level Object Store and its Application to the Analysis of High Energy Physics Data," Laboratory for Advanced Computing Technical Report, Number LAC 94-R8, University of Illinois at Chicago, December 1993.

[4] R. L. Grossman, D. Lifka, and X. Qin, "An object manager utilizing hierarchical storage," Twelfth IEEE Symposium on Mass Storage Systems, IEEE Press, Los Alamitos, 1993, pp. 209-214.

[5] R. L. Grossman, D. Valsamis, and X. Qin, "Persistent stores and Hybrid Systems," Proceedings of the 32nd IEEE Conference on Decision and Control, IEEE Press, 1993, pp. 2298-2302.

[6] R. L. Grossman and X. Qin, "Ptool: a low overhead, scalable object manager," Proceedings of SIGMOD 94, to appear.

[7] R. L. Grossman, X. Qin, A. Sundaram, M. Wu, and W. Xu, "Software Tools for Working with Large Amounts of Complex Tabular Data: An Application to the U.S. Government Budget," Laboratory for Advanced Computing Technical Report, Number LAC 94-R11, University of Illinois at Chicago, December 1993.