Management of the National HPCC Software Exchange - a Virtual Distributed Digital Library
The work described in this paper was sponsored by NASA under
Grant No. NAG 5-2736
Computer Science Department
University of Tennessee
Knoxville, TN 37996-1301
Center for Research on Parallel Computation
Houston, TX 77005
Mathematical Sciences Section
Oak Ridge National Laboratory
Oak Ridge, TN 37831-6367
The National HPCC Software Exchange (NHSE) is a distributed collection
of software, documents, and data for the high performance
computing community. Our experiences with the design and initial
implementation of the NHSE are relevant to a number of general digital
library issues, including publication, quality control,
authentication and integrity, and information retrieval.
This paper describes an authenticated submission process that is
coupled with a multilevel review process.
Browsing and searching tools for aiding
information retrieval are also described.
KEYWORDS: electronic publication, information retrieval,
high performance computing, quality control
The National HPCC Software Exchange (NHSE) is an Internet-accessible
resource that provides access to software and other information related
to High Performance Computing and Communications (HPCC).
The NHSE facilitates the development of discipline-oriented software
and document repositories. Furthermore, it promotes contributions
to and use of such repositories
by members of the high performance computing community,
via a common World Wide Web interface.
The NHSE is also a valuable resource for technology transfer
and educational purposes.
The effectiveness of the NHSE depends on discipline-oriented groups
having ownership of independently maintained repositories.
The information and software residing in these repositories
is best maintained and kept up-to-date by its developers, rather
than by centralized administration. Developers may wish
to provide specialized access methods or services, a remote execution
capability for example.
Central administration is used instead to handle interoperation
and to meet common needs.
Although the different disciplines maintain their own software
repositories, users should not need to access each of these repositories
separately. Rather, the NHSE provides a uniform interface to a virtual
HPCC software repository built on top of a distributed set of
discipline-oriented repositories, as shown in Figure 1.
The interface assists the user in locating
and retrieving relevant resources.
Figure 1: Virtual Repository Architecture.
In order for the NHSE to provide an information retrieval interface
to the distributed collection of materials, it must have the raw
material available from which to build indexes and other searching
and browsing aids. Various techniques for collecting and indexing
descriptive material are used in the NHSE, including manual
construction of catalog records, collection and indexing of
unstructured text, and computer-assisted construction of a hypertext
roadmap.
Users of the NHSE need to have confidence that the software
they obtain is high quality and well-tested. If the software
is experimental or untested, they should be made aware of this.
The NHSE has developed a review process that allows authors to
submit software for consideration at different levels of review
classification, with the rigor of the review process increasing
with increasing levels.
A contributor to the NHSE makes a contribution available by placing
it on a file server accessible via the FTP or HTTP file access protocols
and informing the NHSE of its existence.
The NHSE can then provide a pointer in the form of a URL, along
with a description of the contribution.
For review, version control, and tracking of software contributions
it is important to ensure
fixity of publication -- i.e., that the software has not
been changed since the time of submission unless the NHSE has
been informed of the change.
Because of copyright, liability, and other legal issues,
it is also important that someone not be able to masquerade
as someone else or to make unauthorized changes to someone else's
contributions. For these reasons, the NHSE has developed authenticity
and integrity checking mechanisms for software submissions
based on file fingerprints and a public-key cryptosystem.
SOFTWARE SUBMISSION AND REVIEW
Contributors submit software to the NHSE by filling out an HTML form
using a forms-capable WWW browser such as Mosaic or Netscape.
This form explains the submission and review process, including
the authentication procedures, and gives an example of a completed
submission form. The form asks the user to fill in values for
several attributes, some required and some optional.
These attributes form a subset of those specified
in the Reuse library Interoperability Group (RIG) Basic
Interoperability Data Model (BIDM) [RIG-BIDM]. The remaining BIDM
fields are generated by the NHSE librarian or from default values.
The RIG has been chartered by the IEEE to develop standards
for reuse library interoperation. Use of the BIDM standard
by the NHSE will facilitate interoperation with other reuse
libraries adopting this standard, including a number of
existing government and industry reuse libraries (e.g., ASSET,
CARDS, DSRS, ELSA).
Some contributors may have large collections that are already
indexed using a different data model. The NHSE will provide assistance
to such contributors in converting their indexing information to
the form required for submission to the NHSE and in submitting
such collections en masse.
Currently, three levels of software are recognized in the NHSE,
described as follows:
- Unreviewed. The submission has not been reviewed by the NHSE for
conformance with the software guidelines. This classification is for
unreviewed software available on an ``as is'' basis.
- Partially reviewed. The submission has undergone a partial NHSE
review to verify conformance with the scope, completeness,
documentation, and construction guidelines. These particular
guidelines are those that can be verified through a visual inspection
of the submission.
- Reviewed. The submission has undergone a complete NHSE review
to verify conformance with all the software guidelines. This
classification requires peer-review testing of the submitted software.
This level may be further refined into additional levels
in the future.

To be accorded the reviewed status, the software must first have been
accorded the partially reviewed status. This precondition ensures that
reviewers will be able to access all the information needed to carry out the
review over the National Information Infrastructure.

To receive the partially reviewed rating, software submitted to the NHSE
should conform to the following guidelines:
- Scope. Software submitted to the NHSE should provide a new
capability in numerical or high-performance computation or in support
of those disciplines.
- Completeness. Submissions must include all routines and drivers
necessary for users to run the software.
Source code for widely
available software used by the submission, the BLAS and LAPACK for
example, need not be included as part of the submission.
- Documentation. The software contains complete and understandable
documentation on its use.
- Construction. Submissions must adhere to good mathematical
software programming practice and, where feasible, to language
standards. Software should be constructed in a modular fashion to
facilitate reusability. The use of language checking tools, such as
pfort or ftnchek, is recommended.
Software submitted for full review is reviewed according to the
following criteria:
- Documentation. The software contains complete, understandable,
and correct documentation on its use.
- Correctness. The software is relatively bug-free and works as
advertised on all provided data sets and on data sets constructed by the
reviewer according to the documentation.
- Soundness. The methods employed by the software are sound for
solving the problem it is designed for, as described in the
documentation.
- Usability. The software has an understandable user interface and is
easy to use at the level of a typical NHSE client.
- Efficiency. The software runs fast enough that slow speed does
not make it an ineffective tool.

After software has been submitted for full review,
it is assigned to an area editor, who recruits two to six reviewers to
peer review the software according to the above criteria.
To qualify for full review,
an author must provide sample data and the output from, or a
description of results from, each sample. Each reviewer is asked to read
the software documentation and try the software on some of the data sets
provided by the author. In addition, it is recommended that a reviewer test the
software on inputs not provided by the author.
If source is available, the reviewer
examines the source to ensure that the methods and programming
methodology are of acceptable quality. Each reviewer prepares all
comments in electronic form and returns these, along with a recommendation,
to the editor in charge of the review.
After the peer reviews are returned, the editor makes the final decision
as to whether to accept the software and informs the author of the decision.
If the software is accepted, the area
editor prepares a review abstract for use by the NHSE.
Once the software has been reviewed, one of two things happens.
If it is not accepted,
the author will be so informed and anonymous copies of the reviews
will be provided.
The author may then choose to address the reviewers' comments and
resubmit the revised software.
If the software is accepted, the author will be shown a review abstract
summarizing the reviewer comments. This abstract will be available to anyone
who accesses the software through the NHSE. An author who finds the abstract
unacceptable may withdraw the software and resubmit it for review
at a later date.
After a contributor fills out the NHSE software submission form
and submits it, a program is invoked at an NHSE server that checks
the form for any obvious errors, such as omission of required
attributes, incorrectly formed email addresses, or unretrievable URLs.
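A minimal sketch of such a server-side check is shown below. The field
names and validation rules are illustrative assumptions, not the NHSE's
actual code; the real form uses a subset of the RIG BIDM attributes.

```python
import re

# Hypothetical subset of required catalog attributes for illustration.
REQUIRED = ("title", "abstract", "contact_email", "url")

# Deliberately loose email pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_submission(form):
    """Return a list of obvious errors in a submitted catalog record."""
    errors = []
    for field in REQUIRED:
        if not form.get(field, "").strip():
            errors.append("missing required attribute: %s" % field)
    email = form.get("contact_email", "")
    if email and not EMAIL_RE.match(email):
        errors.append("incorrectly formed email address: %s" % email)
    # A real server would also attempt to retrieve each URL listed in the
    # form and flag any that are unretrievable.
    return errors

print(check_submission({"title": "Example solver",
                        "abstract": "Solves sparse linear systems.",
                        "contact_email": "author@example.edu",
                        "url": "ftp://ftp.example.edu/solver.tar.gz"}))
# -> []
```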
If no errors are found, a plain-text version of the catalog record is
returned to the client program, along with instructions to save the
plain-text version to a file and to carry out one of the following
authentication procedures:

- Method 1: PGP authentication [Zimmerman-PGP].
The author signs the catalog record with his NHSE-certified PGP key
and then mails it back to a designated address.
The mail server at that address verifies the PGP signature
and processes the submission only if the signature is valid.

- Method 2: Notarization.
The author prints out the plain-text form, signs it, has
the signature notarized, and sends the document back via
surface mail. When the form is received, the NHSE librarian PGP-signs the
electronic version of the form (using a special proxy key reserved for
this purpose) on behalf of the author.

Before using Method 1, the author must have PGP installed locally
and must have obtained a PGP key pair. The author's public key must have been
certified by the NHSE librarian. An author may obtain this certification either
in person, via a trusted third party who signs the author's key, or by a method
similar to Method 2 above: print out the key fingerprint, sign it, have it
notarized, and surface mail it to the NHSE librarian.
We considered other authentication methods, such as email addresses
and userid/password based accounts, but rejected such methods
as providing insufficient security.
Identification, Cataloging, and Integrity
Once an author's software submission has been authenticated,
it is processed before being placed in the NHSE on-line software
catalog. This processing involves retrieval of the files specified
by the author as making up the contribution, fingerprinting these files,
assigning the contribution a unique identifier, and additional
cataloging by the NHSE librarian.
If the software has been submitted for partial review, the
NHSE librarian also inspects the submission for adherence to the
NHSE software guidelines.
After the files making up a contribution have been retrieved,
each file is fingerprinted using the MD5 secure hash
function [Rivest-MD5]. The (URL,MD5) pairs for the
files are then placed in another file which is itself fingerprinted.
This top-level fingerprint is used to construct a unique identifier
for the submission, which we call a LIFN, or Location Independent
File Name. The submission can subsequently be retrieved
from the NHSE software catalog by specifying its LIFN.
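The LIFN construction can be sketched as follows. The LIFN prefix and
the treatment of files as in-memory byte strings are simplifying
assumptions; the NHSE retrieves the actual files by URL.

```python
import hashlib

def md5_hex(data):
    """MD5 fingerprint of a file's contents, as a hex string."""
    return hashlib.md5(data).hexdigest()

def make_fingerprint_file(files):
    """Given {url: contents}, build the (URL, MD5) listing as one file."""
    lines = ["%s %s" % (url, md5_hex(data))
             for url, data in sorted(files.items())]
    return "\n".join(lines) + "\n"

def make_lifn(files):
    """Fingerprint the fingerprint file itself to obtain a
    location-independent name for the whole submission."""
    listing = make_fingerprint_file(files)
    # Hypothetical LIFN format: a fixed prefix plus the top-level MD5.
    return "LIFN:" + md5_hex(listing.encode("utf-8"))

files = {
    "ftp://ftp.example.edu/solver/solver.f": b"      program solver\n",
    "ftp://ftp.example.edu/solver/README": b"Example solver, v1.0\n",
}
print(make_lifn(files))
```

Because the name is derived from the file contents, any change to any
file in the submission yields a different top-level fingerprint, which
is what makes the name usable for integrity checking.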
The LIFN concept is part of a more general naming structure
that is being developed to provide for transparent mirroring
of files and to address other scalability and reliability
problems that will result from the expected growth of the NHSE [ssr95].
As part of the processing, the NHSE librarian categorizes the
software submission into one of four main categories: application
libraries and programs, data analysis and visualization tools,
numerical libraries and routines, and parallel processing tools.
Software falling under parallel processing tools is categorized
further into one of eight subcategories. The NHSE librarian
also assigns keywords drawn from the HPCC thesaurus (currently
under development) and, for mathematical software, from the
GAMS classification scheme [BoHK91].
The NHSE provides a form, called the LIFN verification form,
that allows a user to verify the integrity of a submission.
A contributor may use this form to check whether any
of the files have changed since their submission.
To use the form, the user or contributor enters the LIFN to be verified
and presses the Verify button. This action causes a program
to be invoked on an NHSE server that carries out the following steps:
- retrieves the fingerprint file that was constructed when
the LIFN was assigned and that contains the URLs and the stored
fingerprints for the files making up the submission,
- retrieves the files using the designated URLs,
- computes the MD5 fingerprint for each of the retrieved files
and compares it with the stored fingerprint that was previously
computed for the same URL,
- flags any file that has been changed since the LIFN was assigned
and gives the user the option of retrieving the original file
as archived by the NHSE.
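The steps above amount to recomputing each fingerprint and comparing it
with the stored one. A sketch, with local byte strings standing in for
files fetched over FTP or HTTP:

```python
import hashlib

def verify_lifn(stored_fingerprints, fetch):
    """Compare stored (url -> md5 hex) pairs against freshly
    retrieved files; return the list of URLs whose contents
    have changed since the LIFN was assigned.

    fetch is a function mapping a URL to the file's current
    contents (it stands in for retrieval over the network).
    """
    changed = []
    for url, stored_md5 in stored_fingerprints.items():
        current_md5 = hashlib.md5(fetch(url)).hexdigest()
        if current_md5 != stored_md5:
            # Flag for the user; the NHSE would offer the archived copy.
            changed.append(url)
    return changed

store = {"ftp://ftp.example.edu/solver/README": b"version 1.0\n"}
stored = {url: hashlib.md5(data).hexdigest() for url, data in store.items()}

print(verify_lifn(stored, lambda url: store[url]))  # -> []
store["ftp://ftp.example.edu/solver/README"] = b"version 1.1\n"
print(verify_lifn(stored, lambda url: store[url]))  # -> the changed URL
```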
Updating a Previous Submission
A contributor may update or withdraw a previous submission by using
the NHSE software submission change form.
This form asks the contributor to enter the LIFN for the previous
submission. A contributor who does not know the LIFN can
search for the submission in the NHSE software catalog in order to
determine it. After entering the LIFN, the contributor presses a button
that causes the catalog record for the LIFN to be retrieved and
displayed in a second form. The contributor may then specify
any files that have been changed or added, describe changes made to
the files, and/or update cataloging information.
After filling out the change form and submitting it,
the contributor authenticates the change request using one of the
two authentication procedures described in the previous section.
Note, however, that if the submission was initially authenticated
using PGP, the NHSE will be extremely cautious about accepting
updates authenticated using the notarization method.
INFORMATION RETRIEVAL AIDS
Depending on the size, rate of change, and nature of the underlying
software or document database, the NHSE uses different techniques
for assisting the user in searching and browsing the information.
Small or fairly stable collections permit labor-intensive
indexing and abstracting, with resulting benefits of
improved recall and precision for searches. Large or rapidly
changing collections require the use of less precise automatic
indexing techniques.

The current NHSE software catalog
is fairly small, with fewer than 300 entries. Thus, it has been
possible to manually abstract and index this collection.
The cataloging process has been carried out jointly by the software
authors and the NHSE librarian, with the software authors providing the
title and abstract fields, and the NHSE librarian categorizing
each entry and assigning thesaurus keywords. The NHSE software
catalog is available in the following formats:

- An HTML version that can be browsed by category.

- A searchable version that allows the user to search separately
by different attributes or to do a free-text search on the catalog.
A link to an on-line copy of the HPCC thesaurus is provided so that
users can select controlled vocabulary terms for searching.
The current interface requires users to cut and paste thesaurus
terms into the search form. We plan to develop a hypertext version of
the thesaurus that will statically link thesaurus terms to scope and
definition notes and to related terms (also broader terms and
narrower terms), as well as dynamically link thesaurus terms to indexed
material.

- A PostScript version that can be downloaded and printed.

A number of sites involved with the NHSE maintain collections
of technical reports on numerical or high performance computing.
These collections are frequently already indexed and abstracted,
although they may use different indexing formats. One such collection
is maintained at the University of Tennessee Computer Science
Department (UTKCS). UTKCS is joining the Computer Science
Technical Report (CSTR) project, and other
NHSE sites will be encouraged to do likewise.
The CSTR project is developing standards and technologies
for digital document libraries.
The Dienst server software available from Cornell University facilitates
searching for and retrieving documents from a repository and linking
together different repositories so that all may be searched from
any site. Dienst also provides utilities that assist sites
with installing the document database and converting from other
formats.
In addition to the software catalog, the NHSE has a distributed
hypertext structure that contains a variety of information
on high performance computing. Most of this information is
in the form of HTML pages, but there are also links to documents
in other formats, such as plain text and PostScript.
Links are provided to various HPCC programs and activities,
to descriptions of Grand Challenge applications, and to other
software repositories. Because the collection of information
has grown very large, a search interface has been provided.
This search interface currently uses the Harvest system
[Bowman-Harvest] to collect information from remote sites,
index that information using WAIS, and process queries from users.
The Harvest system worked satisfactorily at first, but the
underlying database has now grown so large and diverse that
(1) gathering takes on the order of several days to a few
weeks, so the search index is out of date in the meantime, and
(2) many searches return extremely large result sets.
Work is underway both by the Harvest development group and
by NHSE researchers at Argonne National Laboratory
to address these scalability problems.
Hypertext roadmaps are being developed at Syracuse University
to provide guided tours of HPCC software and technologies.
The roadmap consists of encyclopedia-style articles written
by experts in the field, with links to relevant software and
technologies. Because construction of such a guide is labor-intensive
and because the resulting structure is static, the roadmap can
encompass only a portion of the available information.
However, we hope to use semantic indexing techniques such
as LSI [Deerwester-LSI] to simplify the work by automatically
inferring relationships between the roadmap and new material.
Digital libraries have been identified as a National Challenge
by the Information Infrastructure Technical Application (IITA)
component of the HPCC program.
A joint initiative by NSF, ARPA, and NASA has funded six four-year
research projects to develop new technologies for digital libraries.
The goals of this initiative are to advance the techniques
for collecting, storing, and organizing information in digital
forms, and for searching and retrieving the information over
communications networks. Each project is centered at a university
and is focused on a particular area. For example, Carnegie Mellon
University is developing an on-line digital video library system,
while the University of California at Berkeley is developing
a digital library focused on environmental information.
The NHSE is an example of a digital library that is focusing
on a particular type of resource, software, in a particular
subject area, high performance computing.
Software-specific issues include the following:

- Software evolves (e.g., through bug fixes and enhancements),
while documents (with the exception of software documentation)
tend to remain fixed after publication.

- Authentication and integrity checking are more important
for software than for documents because more is at stake.
Small changes to code can cause large changes in results.
Bugs or viruses that are not easily detectable by visual inspection
may be introduced by malicious third parties.

- The relationships between software components are more
structured and more complicated than the relationships between documents.

An ARPA-funded project led by the Corporation for National Research
Initiatives (CNRI) is developing the network infrastructure for
a distributed digital library system [Kahn-CSTR].
CNRI is addressing only the network-based aspects of the
infrastructure, and not the content-based aspects, which
are expected to be addressed by specialized communities.
The infrastructure defines the basic components of a distributed
digital library system, including digital objects,
repositories, naming authorities, and properties
records. Repositories provide access to digital objects
and also provide value-added services such as organizing,
cataloging, searching, and evaluating.
Each repository is responsible for providing meta-information
about its own collection of digital objects in the
form of a set of properties records.
The NHSE is an example of a repository, but it is a virtual
repository that provides access to resources maintained by
a distributed collection of autonomously maintained physical repositories.
The physical repositories store and provide access to files,
and the NHSE provides the value-added services.
For example, software available through the NHSE is maintained on local
file servers by its contributors, but is searchable and retrievable
through the NHSE software catalog.
The NHSE software catalog is an example of a set of properties records.
CNRI's infrastructure includes a system for assigning globally
unique names to digital objects and for using these names to
retrieve objects. An object's name is called its handle.
A distributed system of handle servers maps handles to the repositories
that contain the objects. The NHSE developers have designed
a similar system for name assignment and resolution
[ssr95]. The use of LIFNs described earlier in this paper
is a forerunner of the full deployment of our naming system.
We are currently investigating the possibility of merging
our naming system with CNRI's system.
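In either system, resolution is essentially a lookup from a
location-independent name to one or more current locations. A toy
sketch of that idea (the table contents and LIFN value are invented;
a real deployment would use a distributed set of resolution servers):

```python
# Toy resolver mapping location-independent names (LIFNs or handles)
# to the mirror URLs currently holding the object.
RESOLUTION_TABLE = {
    "LIFN:0123abcd": [
        "ftp://ftp.netlib.org/mirror/solver.tar.gz",
        "http://www.example.edu/archive/solver.tar.gz",
    ],
}

def resolve(name):
    """Return candidate URLs for a name, or an empty list if unknown."""
    return RESOLUTION_TABLE.get(name, [])

def retrieve(name, fetch):
    """Try each location in turn until one succeeds, giving the
    client transparent mirroring."""
    for url in resolve(name):
        try:
            return fetch(url)
        except IOError:
            continue  # this mirror failed; fall through to the next
    raise KeyError("no working location for %s" % name)

print(resolve("LIFN:0123abcd")[0])
```

Because clients name the object rather than a location, mirrors can be
added or removed by updating the resolution table alone, without
invalidating any published names.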
Software reuse is the process of creating software systems
from existing software rather than building software systems
from scratch [Krueger-survey].
Software reuse may be more broadly defined as the use
of engineering knowledge or artifacts from existing systems
to build new ones [Frakes-success].
Reuse of software components ranges from black-box reuse
of domain-specific components to white-box reuse through modification
and adaptation of existing components [Prieto-Diaz-status].
A problem domain is defined as a class of problems
considered to be significant and related by members of a
particular applications community [Arango-domain-analysis].
Systematic reuse requires domain engineering, which consists
of the following two phases:

- domain analysis, the process of discovering and
recording the commonalities and variabilities of the systems
in a domain, and

- domain implementation, the use of the information
uncovered in domain analysis to create reusable components and
systems.

Parallel high performance
computing will realize its full potential only if it is
accepted and adopted in the real world of industrial applications.
Cost-effective parallel computing will require
widespread reuse of parallel software and related artifacts.
The NHSE will support software reuse by providing access to
the following resources:

- Tools for developing
portable, scalable applications software.
Because writing parallel programs requires substantial effort and
investment, it is unacceptable to rewrite for every current
and future parallel architecture. Code that is portable and scalable,
however, will run with high efficiency on almost all current and
anticipated future parallel architectures.
Appropriate tools include data parallel compilers for high-level,
machine-independent parallel programming languages and message
passing communication subsystems that provide a machine-independent
communication layer.
Many such parallel processing tools are already included in the NHSE
software catalog, and a hypertext roadmap is being constructed that
will guide the unfamiliar user in locating and using the appropriate
tools.
- High-quality reusable mathematical software components.
A large number of scientific and engineering applications rely heavily
on linear algebra and other mathematical software routines.
Scalable, portable, efficient,
and easy-to-use libraries of such routines provide the fundamental
building blocks for applications.
A number of such libraries can be found under the numerical
category in the NHSE software catalog, and more are under development.
- Domain engineering information.
The NHSE will provide access to resources for carrying out domain
analysis, including information about and access to domain analysis tools.
The NHSE will also store and provide access to artifacts produced
by domain analysis.
For example, information is currently available about
a problem classification scheme for parallel applications
that has been developed by researchers at Syracuse University.
A number of software repositories have been established over the
last decade that provide access to reusable software components.
These include the Netlib and GAMS mathematical software repositories,
as well as government sponsored reuse libraries such as Ada-IC,
CARDS, DSRS, ELSA, and STARS. Information about all these repositories
is available from the NHSE.
Netlib began operation in 1985 to fill a need for cost-effective,
timely distribution of high-quality mathematical software to
the research community [Dongarra-netlib].
Netlib is accessible through an email interface or from a World
Wide Web browser such as Mosaic or Netscape.
The number of Netlib servers has grown from the original
two, at Oak Ridge National Laboratory (initially at Argonne
National Laboratory) and Bell Laboratories, to servers in Norway,
the United Kingdom, Germany, Australia, Japan, and Taiwan.
A mirroring mechanism keeps the repository contents at the different
sites consistent on a daily basis [Grosse-mirroring].
The Guide to Available Mathematical Software (GAMS) project of the National
Institute of Standards and Technology (NIST) studies techniques to provide
scientists and engineers with improved access to reusable computer software
components available to them for use in mathematical modeling and statistical
analysis. One of the products of this work is the GAMS system,
an on-line cross-index and virtual repository of mathematical software.
It performs the function of an interrepository and interpackage cross-index,
collecting and maintaining data about software available from external
repositories and presenting it as a homogeneous whole.
GAMS currently contains information on more than 9800 problem-solving software
modules from about 85 packages in four physically distributed software
repositories (three maintained at NIST, plus Netlib).
The NHSE is similar to GAMS in that both are virtual repositories,
but the NHSE encompasses a much larger number of physical repositories than
GAMS. GAMS indexes the contents of a handful of repositories, while
the NHSE provides access to software residing at hundreds of sites.
Netlib and GAMS both specialize in the fairly narrow domain of
mathematical software. Although the NHSE collection includes
mathematical software written for high performance machines,
the coverage of the NHSE is much broader, ranging from data visualization
and parallel processing tools to software for individual application
areas. The NHSE uses the GAMS classification scheme to classify
mathematical software, as does Netlib. A new classification
will need to be devised for the general area of high performance
computing, however, and portions of it will need to be refined
by sub-communities and specialists in different areas.
The Reuse Library Interoperability Group (RIG) was founded in
1991 for the purpose of developing standards for interoperability
between software reuse libraries. The RIG has developed and
approved the Basic Interoperability Data Model (BIDM) as a minimum
standard data model for interoperability, and the BIDM has
been submitted for balloting as an IEEE standard [RIG-BIDM].
The NHSE is working with the RIG to develop and promote standard
data models for software repositories.
There is close correspondence between BIDM concepts and the digital
library framework proposed by CNRI [Kahn-CSTR].
Ideally the software reuse library community and other
digital library communities
should work together to promote interoperability between
all types of digital libraries.
We have described a digital library of software and related artifacts
that is being developed for the HPCC community.
Rather than being a single central repository, this library provides
a uniform interface to a distributed collection of autonomously maintained
repositories.
Although some of our concerns are specific to software repositories,
much of our work will be applicable to management of other
types of digital data.
In particular, we have constructed a mechanism that allows individuals
to contribute material from a World Wide Web browser.
We have implemented an authentication mechanism that uses public
key cryptography and file signatures to prevent impersonation and
unauthorized changes to contributed material.
The review process we have set up is similar to the peer review process
used by refereed journals, and our experiences in applying this
process to software will help determine whether the peer review
concept generalizes to non-document and electronically available
resources. The solutions we devise for providing searchable access
to a large quantity of diverse information available from geographically
dispersed sources will be applicable as well
to other distributed digital library systems.
REFERENCES

[RIG-BIDM] Standard reuse library Basic Interoperability Data Model (BIDM).
Technical Report RPS-0001, Reuse Library Interoperability Group.

[Arango-domain-analysis] G. Arango. Domain analysis methods.
In W. Schafer, R. Prieto-Diaz, and M. Matsumoto, editors,
Software Reusability, chapter 2, pages 17--49. Ellis Horwood, 1992.

[Boisvert-GAMS] R. F. Boisvert.
The architecture of an intelligent virtual mathematical software
repository. Math. & Comp. in Simul., 36:269--279, 1994.

[BoHK91] R. F. Boisvert, S. E. Howe, and D. K. Kahaner.
The Guide to Available Mathematical Software problem
classification system. Comm. Stat. -- Simul. Comp., 20(4):811--842, 1991.

[Bowman-Harvest] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber,
and M. F. Schwartz.
Harvest: A scalable, customizable discovery and access system.
Technical Report CU-CS-732-94, Department of Computer Science,
University of Colorado - Boulder, Aug. 1994.

[ssr95] S. Browne, J. Dongarra, S. Green, K. Moore, T. Pepin, T. Rowan,
and R. Wade.
Location-independent naming for virtual distributed software
repositories. In ACM-SIGSOFT 1995 Symposium on Software Reusability,
Seattle, Washington, Apr. 1995.

[Deerwester-LSI] S. Deerwester, S. Dumais, G. Furnas, T. Landauer,
and R. Harshman.
Indexing by latent semantic analysis.
Journal of the American Society for Information Science,
41(6):391--407, Sept. 1990.

[Dongarra-netlib] J. J. Dongarra and E. Grosse.
Distribution of mathematical software via electronic mail.
Commun. ACM, 30(5):403--407, May 1987.

[Frakes-success] W. B. Frakes and S. Isoda.
Success factors of systematic reuse.
IEEE Software, pages 15--19, Sept. 1994.

[Grosse-mirroring] E. Grosse. Repository mirroring.
ACM Trans. Math. Softw., 21(1), Mar. 1995.

[Kahn-CSTR] R. Kahn and R. Wilensky.
Accessing digital library services and objects: A frame of reference,
draft 4.4 for discussion purposes. Available on-line at
http://www.cnri.reston.va.us/home/cstr/arch.html, Feb. 1995.

[Krueger-survey] C. W. Krueger. Software reuse.
ACM Computing Surveys, 24(2):131--183, June 1992.

[Prieto-Diaz-status] R. Prieto-Diaz.
Status report: Software reusability.
IEEE Software, pages 61--66, May 1993.

[Rivest-MD5] R. Rivest.
The MD5 message-digest algorithm.
Internet Request for Comments, 1321, Apr. 1992.

[Zimmerman-PGP] P. Zimmermann. PGP user's guide.
PGP Version 2.6.2, Oct. 1994.