Defining and Using Structure in Digital Documents

Richard Furuta

Hypermedia Research Laboratory, Department of Computer Science, Texas A&M University, College Station, Texas 77843-3112, furuta@cs.tamu.edu

Abstract

Understanding structure is a critical step in the process of developing the design of a digital library. Understanding the structures required for a particular digital library requires an understanding of the scope of objects to be stored in the library, of the classes of clients to be served, and of the needs of each of the client groups. The preexisting work in the area of structured documents with its emphasis on logical structuring illustrates a successful case of separating the concerns of the different client classes in the structural design. The specific tree-based, context-free grammar-constrained structures that predominate in the structured document world are not likely to be sufficiently general to handle the wide range of objects in the digital library--collections that include not only text but also graphics, audio, video, computations, and process. Powerful metaphors will have to be developed for these other objects and interrelationships defined. The definition process can be focused by consideration of key structural characteristics.

Keywords: Structure, documents, process, specification, constraint, consistency, reusability.

This material is based, in part, upon work supported by the National Science Foundation under Grant No. IRI-9496187 and in part on a grant from the Texas Higher Education Coordinating Board, Advanced Research Program.

1. The importance of structure in documents in the digital library

The collection of a library represents the individual efforts of thousands of authors, working together and separately across thousands of years and using a tremendous range of composition tools to capture their thoughts. Despite the diversity of authors and of the document preparation tools they used, there is a striking consistency in the representation of their work. The standard cataloging rules [24] focus on the media upon which the work is presented (e.g., book, cartographic, music). Within a media type, both cataloger and library patron expect to find a standardized organization of the document's content. For example, books have titles that are represented on an initial title page, followed by a body that is subdivided into chapters. The ability to assume such a standardized organization helps the reader to become oriented within the document's information space quickly and to focus on finding information rather than puzzling over its organization.

Over the past few years similar conventions have arisen for structuring the electronic form of paper-based documents. Such conventions are usually centered either around the logical relationships of components in the document or upon their physical presentation relationships. Essentially, the logical representation reflects the standardized organization noted above (chapters, sections, subsections, paragraphs, etc.). The physical representation focuses on characteristics of the display medium--pages, lines, characters, margins, indentation, fonts, etc.

From the point of view of the author of a traditional document, an important feature of computer systems based on logical-structure-oriented editing of paper documents is that the author is freed to concentrate on the content of the document and is not forced also to consider its presentation [36]. The advantages of structure can be carried further, especially in technically-oriented documents. Grammatically-constrained definitions of the interrelationships among logical document components have been standardized in ISO's SGML [28] and ODA [27] international standards. Such grammatically constrained representations provide the means for guaranteeing that the document's structure corresponds, for example, to a corporate style; aid in the maintenance of the document; and provide the leverage needed to develop applications that can automatically reuse the document's contents, for example by transforming between publication styles or by converting form to permit inclusion of portions within an electronic database. Such representations also provide the basis for considering issues of document interchange--for example converting the document's specification from one markup language to another [30].

Providing analogous structuring mechanisms in the digital library is required if it is important to maintain a consistency among the individual components of the collection. However, the digital library presents situations and opportunities not encountered in the static cases that have preceded it. For example, it may be desirable to structure not only the component parts of a document but also the interrelationships among those parts. In addition, a wide range of media will be represented in the digital library, and consequently structuring techniques will need to be developed and extended to represent this wider variety of objects.

Because of the wider variety of object types, it is sensible to reexamine the characteristics of the structures to be used in the digital library. The traditional documents discussed above primarily rely on what is known as the logical structure of the information space--an encoding of the standard organization of technical papers and books. While the content relationships frequently reflect the logical structural relations, as they correspond to the accepted presentation of information, one may also find inconsistencies. Additionally, logical relationships may not be the most natural way to define and access other kinds of media (for example graphical or aural).

The characteristics of interactive access to the collection are also an important consideration. Indeed, it may be desirable to separately describe different browsing interfaces for use by different categories of patrons of the digital library. The structuring of process in addition to the structuring of content is a unique requirement imposed by the interactive nature of the media in the digital library.

1. Section 1's title

Section 1 text.

* list element 1

* list element 2

More section 1 text.

1.1 Subsection 1.1's title

Subsection 1.1's text

Figure 1. A small sample document fragment (from [13]).

2. Logical structure

In earlier reports [11,12], I have discussed the characteristics of logical structures in paper-based document markup languages and the interactions between those structural characteristics and the characteristics of interactive implementation, and have developed a taxonomy for those structures [13]. Coombs, Renear, and DeRose [7] have also presented a high-level defense of the benefits of representations based on logical structure. Goldfarb [22,23,26] and Reid [34,35,36,37] are generally credited with originating and popularizing this approach, which also goes under the name generic markup. The documents described in this manner are generally known as structured documents [2,14].

An important characteristic of systems based on generic markup is an emphasis on a separation of concerns. Generally two roles are involved in creation of a document: specification of its content and specification of its appearance. The task of content specification is the job of an author while the task of appearance specification is the job of a style designer. In many cases a single person may occupy both roles, but frequently leverage can be gained by separating the tasks as the skills needed are different. Indeed, the Scribe system provides separate languages, specialized for use by author and by style designer.
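
As an illustration only, the separation of the two roles can be sketched in Python. The tag names, the two style tables, and the `render` function are hypothetical and are not drawn from Scribe or any other system; the point is simply that one content specification can be paired with more than one appearance specification.

```python
# Author's role: content tagged by logical role, with no appearance information.
content = [
    ("title", "Logical structure"),
    ("para", "Generic markup separates roles."),
]

# Style designer's role: one formatting rule per logical tag.
# Both style tables are invented for this sketch.
plain_style = {
    "title": lambda text: text.upper(),
    "para": lambda text: "  " + text,
}
html_style = {
    "title": lambda text: f"<h1>{text}</h1>",
    "para": lambda text: f"<p>{text}</p>",
}


def render(content, style):
    """Apply the style's rule for each tag to the corresponding content."""
    return "\n".join(style[tag](text) for tag, text in content)


print(render(content, plain_style))
print(render(content, html_style))
```

Changing the appearance means swapping the style table; the author's content specification is left untouched.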

The specification of documents using generic coding is often perceived initially by authors as more difficult because it seems "less direct," as it seems to contradict the concept of "direct manipulation" [40]. In general, the more complex the relationships among the components of the document and the more strongly constrained those relationships, the more challenging the task of the system's designer becomes to produce a straightforward-seeming authoring user interface [9,10]. On the other hand, a less-complex, relatively unconstrained representation makes it difficult to develop applications that can reuse the document instance specifications effectively and makes it increasingly possible that a document can violate an externally-defined style that is supposed to be followed.

A variety of logical-object-based representations can be, and have been, defined that fall along the axis between complex, constrained structures and simple, unconstrained structures. Figure 1 shows a small document fragment and figure 2 shows three representations that have been used to represent that fragment in markup. In figure 2(a), the appearance of structure in the formatted document is not directly reflected in the markup. The document is specified as a sequence of content chunks, and a particular transformation is defined for each chunk that defines its physical appearance. This is the kind of representation found in systems such as Microsoft Word.

Figure 2(b) shows a small adaptation of the previous structure in that the itemized list is now represented in nested form. Such representations are found, for example, in LaTeX. In figure 2(c) the representation of hierarchy has been carried out to a complete degree, representing, for example, sections and subsections as hierarchical objects. Such representations can distinguish, for example, between a block of text that completes subsection 1.1 and a block, located at the same place, that completes section 1. When structures like this are defined grammatically, they reflect the characteristics that can be achieved when specification languages such as SGML are used.
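
The three kinds of representation can be approximated with nested Python tuples. This is a sketch under stated assumptions: the tag names, the artificial "document" root, and the `parent_tag` helper are hypothetical and do not reproduce the notation of [13]. The helper shows what each encoding can say about where a block of text belongs.

```python
# Each node is (tag, content); content is either text or a list of child nodes.

# Flat: a sequence of chunks, each carrying only a formatting tag.
flat = ("document", [
    ("heading1", "Section 1's title"),
    ("para", "Section 1 text."),
    ("bullet", "list element 1"),
    ("bullet", "list element 2"),
    ("para", "More section 1 text."),
    ("heading2", "Subsection 1.1's title"),
    ("para", "Subsection 1.1's text"),
])

# Partially nested: only the itemized list is a nested object.
nested = ("document", [
    ("heading1", "Section 1's title"),
    ("para", "Section 1 text."),
    ("itemize", [("item", "list element 1"), ("item", "list element 2")]),
    ("para", "More section 1 text."),
    ("heading2", "Subsection 1.1's title"),
    ("para", "Subsection 1.1's text"),
])

# Fully hierarchical: sections and subsections contain their contents.
hierarchical = ("section", [
    ("title", "Section 1's title"),
    ("para", "Section 1 text."),
    ("itemize", [("item", "list element 1"), ("item", "list element 2")]),
    ("para", "More section 1 text."),
    ("subsection", [
        ("title", "Subsection 1.1's title"),
        ("para", "Subsection 1.1's text"),
    ]),
])


def parent_tag(node, text, parent=None):
    """Return the tag of the object that directly contains the given text."""
    tag, content = node
    if isinstance(content, str):
        return parent if content == text else None
    for child in content:
        found = parent_tag(child, text, tag)
        if found is not None:
            return found
    return None


print(parent_tag(flat, "Subsection 1.1's text"))          # document
print(parent_tag(hierarchical, "Subsection 1.1's text"))  # subsection
print(parent_tag(hierarchical, "More section 1 text."))   # section
```

Only the hierarchical form records whether a paragraph belongs to the subsection or to the enclosing section; in the flat form every chunk simply belongs to the document.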

A number of lessons of relevance to the digital library can be learned from an examination of the previous uses of logical structure in document specification. These include the competing benefits and drawbacks of more complex versus less complex structures; of strongly constrained versus unconstrained specifications; and of externally-defined (e.g., grammatical) versus ad-hoc style specifications. An additional consideration is the application of the "separation of concerns" principle, again with potential benefits and drawbacks. Strongly separating the roles of reader, author, and style designer requires corresponding specialization of skills--for example the author may find himself writing in what initially seems to be an unfamiliar and constrained environment--and its success depends highly on the degree to which it makes sense to partition the interested parties into such categories. On the other hand, with the selection of the right structuring metaphor (e.g., logical structuring is the metaphor being discussed in this section) and with the provision of an appropriate specification mechanism, benefits can result that justify the extra expenditure of effort, for example, consistency, reusability, and verifiability.

Figure 2. Document representation structures corresponding to Figure 1 (From [13]).

3. Characterizing structure

In this section we will discuss some of the dimensions along which one might characterize the structure of objects in a digital library, or indeed the structure of the library itself. The dimensions that will be discussed further are listed in figure 3. Certainly a diverse set of objects will be associated with a diverse set of structures. In this section's discussion, we focus on the characteristics of an individual structure and on the structural interconnections that associate this object with others in the universe of objects.

* Structuring metaphor

* Homogeneous or heterogeneous data structures

* Granularity of structure

* Structural constraints

* Dynamic or static definitions

Figure 3. Key characteristics of digital library structures.

3.1. Structuring metaphor

We have already examined the structuring metaphor of "logical structure" in the context of printed documents. We mentioned earlier in this paper that another common structure for markup intended to describe printed documents is that representing the document's appearance, often called its physical structure or its "layout structure." A graphical representation might more naturally be structured based on the spatial relationship of objects to each other. Structures describing process may focus on the temporal relationships, as may multimedia-related specifications (see for example the temporal description mechanisms described by Buchanan and Zellweger [5,6]). Meta-structures also are found, for example hierarchical or flat directory structures intended to help organize an information space, presentation structures intended to join different views or perspectives of a data space, and search and indexing structures intended to help locate information.

3.2. Homogeneous or heterogeneous data structures

Orthogonal to the question of structuring metaphor is what data structure or structures are used to represent the structure in an information space. Common choices include trees, directed acyclic graphs, directed graphs, and undirected graphs. Continuously varying objects such as those found in motion video and audio require reexamination of the use of such structures, which are most easily oriented to the relationships among "discrete" components. When process is included in the description, structures may be based on automata, which associate execution semantics with a graphical representation. For example, in hypertext, finite automata and Petri nets [18,19,20,41,42,43] have been used for these purposes.
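
To make the idea of execution semantics concrete, the following Python sketch implements a minimal place/transition net. It is not the model of [41]; the class, the place and transition names, and the reading of marked places as "visible" content are assumptions made for illustration, and each arc is assumed to carry a single token.

```python
from dataclasses import dataclass, field


@dataclass
class PetriNet:
    # transition name -> (input places, output places); one token per arc.
    transitions: dict
    # place name -> current token count.
    marking: dict = field(default_factory=dict)

    def enabled(self):
        """Transitions whose input places all hold at least one token."""
        return sorted(
            name
            for name, (inputs, _) in self.transitions.items()
            if all(self.marking.get(place, 0) > 0 for place in inputs)
        )

    def fire(self, name):
        """Consume one token per input place, produce one per output place."""
        if name not in self.enabled():
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] = self.marking.get(place, 0) + 1

    def visible(self):
        """Marked places, read here as the content currently on display."""
        return sorted(p for p, count in self.marking.items() if count > 0)


# Hypothetical document: an introduction leads to text and a figure shown
# together; both must be present before the reader can move to the summary.
net = PetriNet(
    transitions={
        "next": (["intro"], ["text", "figure"]),
        "done": (["text", "figure"], ["summary"]),
    },
    marking={"intro": 1},
)
net.fire("next")
print(net.visible())   # ['figure', 'text']
print(net.enabled())   # ['done']
```

The graph alone says which elements are connected; the marking and firing rule say what a reader can see and select at each moment.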

Heterogeneous data structures may be used to describe different elements of an information space. Even if data structures are homogeneous, multiple instances may be defined to reflect, for example, separate documents. When multiple structures are defined over a set of contents, the general question is whether they are interrelated in any way. Such issues will be discussed in conjunction with the granularity of the structure.

One note in passing: multiple structures also include the case in which the different structures are defined over the same content elements. Perhaps a particular work is used in different contexts or perhaps the structures represent different views of the same information space.

3.3. Granularity of structure

When we discuss granularity of structure we are focusing also on what the constituents of the structure are and how they relate to one another. We wish to determine whether there is an atomic element from the perspective of a particular structure and, if so, whether it has internal structure of its own. As an example of an atomic element with internal structure, consider the case of a meta-structure such as a directory, as discussed above. The constituents of this structure, indivisible from the perspective of the directory structure, are themselves complex, structured objects. From a graph perspective, a graph may be defined whose components may themselves be graphs. Once again, the atomic element of the higher-level graph in turn possesses structure.

Discussion of granularity also requires consideration of whether the elements are continuous or discrete. A particularly interesting example of a continuous structure is that of Pad [32,33], which defines a continuously scalable display surface, which from the standpoint of spatial representations is also continuous. It is possible that other structuring metaphors for the same space will result in discrete elements. As just one example, using a quadtree style representation of an image provides a collection of "snapshots" of the data space at different resolutions [38,39].
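
The notion of discrete snapshots at several resolutions can be sketched as follows. This is a simplified, quadtree-style pyramid rather than any of the structures described in [38,39]; the function names are hypothetical, and the image is assumed to be a square grid of numbers whose side is a power of two.

```python
def coarsen(image):
    """Halve the resolution by averaging each 2x2 block of cells."""
    half = len(image) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4
            for c in range(half)
        ]
        for r in range(half)
    ]


def snapshots(image):
    """Return the image at every resolution, from finest to a single cell."""
    levels = [image]
    while len(levels[-1]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels


# Illustrative 4x4 image made of four uniform quadrants.
image = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [8, 8, 12, 12],
    [8, 8, 12, 12],
]
for level in snapshots(image):
    print(level)
```

Each level is a discrete structure in its own right, even though the collection as a whole approximates a continuously scalable space.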

3.4. Structural constraints

Two interrelated questions are whether structural relationships are constrained and, if so, whether the constraints are externally-defined. The inclusion relationships in the example presented in section 2's figure 2(c) can be defined and consequently constrained by an appropriate grammar requiring, for example, that a subsection can be defined only after the body of the enclosing section is completed. On the other hand, such relationships can also be unconstrained in a system that permits arbitrary nesting of environments. In such a case, the choice to nest the list within the section rather than vice versa is a matter of convention rather than a syntactic requirement. In addition, constraints, when present, can be encoded into an implementation rather than specified by external means such as the grammatical means just discussed.
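
A small Python sketch can illustrate an externally-defined constraint of this kind. The content models below are hypothetical, written as regular expressions over child tag names in loose imitation of a grammar; they are not SGML syntax. The "section" model requires that all subsections follow the section's own body.

```python
import re

# Hypothetical content models: for each structured tag, a regular expression
# over the space-separated tags of its children. Tags not listed are leaves.
CONTENT_MODEL = {
    "section": r"title( para| itemize)*( subsection)*",
    "subsection": r"title( para| itemize)*",
    "itemize": r"item( item)*",
}


def valid(node):
    """Check a (tag, content) tree against the content models above."""
    tag, content = node
    model = CONTENT_MODEL.get(tag)
    if model is None:
        return isinstance(content, str)   # leaf tags may hold only text
    if isinstance(content, str):
        return False                      # structured tags need child nodes
    child_tags = " ".join(child[0] for child in content)
    if re.fullmatch(model, child_tags) is None:
        return False
    return all(valid(child) for child in content)


good = ("section", [
    ("title", "Section 1's title"),
    ("para", "Section 1 text."),
    ("subsection", [("title", "Subsection 1.1's title"),
                    ("para", "Subsection 1.1's text")]),
])

# Same pieces, but a section-level paragraph appears after the subsection.
bad = ("section", [
    ("title", "Section 1's title"),
    ("subsection", [("title", "Subsection 1.1's title"),
                    ("para", "Subsection 1.1's text")]),
    ("para", "Section 1 text."),
])

print(valid(good), valid(bad))
```

Because the models live in a table rather than in the checking code, they can be replaced without changing the implementation, which is the sense of "externally-defined" used here.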

The issue of structural constraint is tightly bound into the question of interactivity of the author's user interface. The more strongly constrained the structure, the greater the need for the author's user interface to give assistance to help the author understand what environment an object is contained within and what constructs are permissible within that context.

The issue of constraints is also tied into the ability of a system designer to build a computer system that can automatically verify properties of the specification, that can automatically locate, extract, and reuse elements of the instances, and that can perform operations on the structure in addition to the contents.

Two examples of structural operations that we have examined in the past are the searching for hypertext components based on their structural characteristics, perhaps in conjunction with their contextual components [21], and the transformation of document instance structures to match changes in the grammatical specification of their relationships [1,3,4,8,17,29]. The general question of conversion among different representations (again in the realm of hypertext in this particular example) also benefits from strong structure [15,16].
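
As a rough illustration of searching by structural characteristics, links can be held in a table and queried with relational-style selections and joins. The link table, link types, and function names below are invented for this sketch and do not reproduce the query language of [21].

```python
# Hypothetical link table: (source node, target node, link type).
LINKS = [
    ("overview", "methods", "next"),
    ("methods", "results", "next"),
    ("overview", "glossary", "reference"),
    ("results", "glossary", "reference"),
]


def targets_of(link_type):
    """Selection: nodes that are the target of some link of the given type."""
    return sorted({target for _, target, kind in LINKS if kind == link_type})


def two_step_paths(first_type, second_type):
    """Self-join: paths a -> b -> c made of two links with the given types."""
    return sorted(
        (a, b, c)
        for a, b, kind1 in LINKS
        for b2, c, kind2 in LINKS
        if b == b2 and kind1 == first_type and kind2 == second_type
    )


print(targets_of("reference"))
print(two_step_paths("next", "reference"))
```

Neither query inspects node contents; both are answered from the structure alone, which is possible only because the links are explicit and typed.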

3.5. Dynamic or static definitions

A final issue in structure definition is when (or indeed if) the definition can change. It may be an advantage to be able to change the structure while readers are actively traversing the corresponding information space, but accommodating such changes raises a host of questions concerning issues such as maintaining a consistent state, preventing readers from becoming trapped in newly-created dead ends, incremental revalidation, and versioning.
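
One of these questions, the detection of newly-created dead ends, can be sketched in Python. The representation of the structure as a dictionary from node to outgoing link targets, and the function names, are assumptions made for illustration.

```python
def reachable(links, start):
    """All nodes a reader can reach from start by following links."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for target in links.get(node, ()):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen


def dead_ends(links, start):
    """Reachable nodes that offer the reader no outgoing link."""
    return sorted(n for n in reachable(links, start) if not links.get(n))


# Hypothetical structure before and after an edit that adds node "d".
before = {"a": ["b"], "b": ["c"], "c": ["a"]}
after = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": []}

print(dead_ends(before, "a"))   # []
print(dead_ends(after, "a"))    # ['d']
```

Running such a check on each proposed change is one simple form of revalidation; doing so incrementally, while readers are mid-traversal, is the harder problem raised above.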

4. Discussion and conclusion

Support for authoring in the digital library will involve the ability to model and incorporate new media types; to define, analyze, and enforce the object element's structure and the interrelationships with other object elements; to transform between related representations and between different versions of the same representation; to permit authoring to occur at a higher level of abstraction; to provide sufficient definition that automatic processes can reason about the structure and can reuse components specified in it; and to model and specify the interfaces presented to author and reader, specializing those interfaces for distinct classes of individuals as desired.

The experience from structured documents suggests that higher-level constructs for use by authors will permit writing to focus on what is to be communicated rather than the fine details of how it is connected together with other related fragments. Such constructs will also provide the basis for the automatic conversion of existing documents into a format relevant for inclusion in the digital library [15,16]. An open research question is the development of structure definition mechanisms that permit the verification of the dynamic properties of the specification, perhaps to allow authors to identify "bugs" in their documents. Existing standards such as HyTime [31] may not provide the best framework for addressing this issue, as interrelationships among the components are not strongly represented in the underlying SGML-based structure definitions (i.e., there is no HyTime meta-specification mechanism that structures and constrains the permissible link relationships in a way corresponding to the way in which the SGML Document Type Definition structures and constrains the interrelationships among document objects).

A concern is the ability to incorporate the existing corpus of knowledge now found in the traditional library into the digital library while at the same time developing the basis needed to extend the types of media represented to make best use of the interactive characteristics of the digital medium. Our ability to address all of these questions is tightly bound up with our ability to understand, describe, and categorize the relevant structures.

References

[1] Extase Akpotsui and Vincent Quint. "Type transformation in structured editing systems." In C. Vanoirbeek and G. Coray, editors, EP92: Proceedings of Electronic Publishing, 1992, pages 27-41. Cambridge University Press, April 1992.

[2] Jacques André, Richard Furuta, and Vincent Quint, editors. Structured Documents. Cambridge University Press, 1989.

[3] Dennis S. Arnon. "Scrimshaw: A language for document queries and transformations." In Hüser et al. [25], pages 385-396.

[4] Fred Cole and Heather Brown. "Editing structured documents--problems and solutions." Electronic Publishing: Origination, Dissemination, and Design, 5(4):209-216, December 1992.

[5] Mariano P. Consens and Alberto O. Mendelzon. "Expressing structural hypertext queries in GraphLog." In Hypertext '93 Proceedings, pages 269-292. ACM, New York, November 1989.

[6] Mariano P. Consens and Alberto O. Mendelzon. "Expressing structural hypertext queries in GraphLog." In Multimedia '93 Proceedings, pages 269-292. ACM, New York, November 1989.

[7] James H. Coombs, Allen H. Renear, and Steven J. DeRose. "Markup systems and the future of scholarly text processing." Communications of the ACM, 30(11):933-947, November 1987.

[8] An Feng and Toshiro Wakayama. "SIMON: A grammar-based transformation system for structured documents." In Hüser et al. [25], pages 361-372.

[9] Richard Furuta. "An Integrated, but not Exact-Representation, Editor/Formatter." In J. C. van Vliet, editor, Text Processing and Document Manipulation, pages 246-259. Cambridge University Press, April 1986. Proceedings of the international conference, University of Nottingham, 14-16 April 1986.

[10] Richard Furuta. An Integrated, but not Exact-Representation, Editor/Formatter. Ph.D. dissertation, University of Washington, Department of Computer Science, Seattle, WA, 1986. Also available as Technical Report No. 86-09-08, Department of Computer Science, University of Washington (August 1986).

[11] Richard Furuta. "Structured document models and representations." In J. J. H. Miller, editor, Issues in Generalized Text Processing: Lecture notes of a Short Course held in association with PROTEXT IV the Fourth International Conference on Text Processing Systems, pages 1-14. Boole Press, 1987.

[12] Richard Furuta. "Complexity in structured documents: User interface issues." In J. J. H. Miller, editor, PROTEXT IV: Proceedings of the Fourth International Conference on Text Processing Systems, pages 7-22. Boole Press, 1987.

[13] Richard Furuta. "An object-based taxonomy for abstract structure in document models." The Computer Journal, 32(6):494-504, 1989.

[14] Richard Furuta. "Important papers in the history of document preparation systems: Basic sources." Electronic Publishing: Origination, Dissemination, and Design, 5(1):19-44, March 1992.

[15] Richard Furuta, Catherine Plaisant, and Ben Shneiderman. "A spectrum of automatic hypertext constructions." Hypermedia, 1(2):179-195, 1989.

[16] Richard Furuta, Catherine Plaisant, and Ben Shneiderman. "Automatically transforming linear documents into hypertext." Electronic Publishing: Origination, Dissemination, and Design, 2(4):211-229, December 1990.

[17] Richard Furuta and P. David Stotts. "Specifying structured document transformations." In J. C. van Vliet, editor, Document Manipulation and Typography, pages 109-120. Cambridge University Press, April 1988. Proceedings of the International Conference on Electronic Publishing, Document Manipulation, and Typography, Nice (France), April 20-22, 1988.

[18] Richard Furuta and P. David Stotts. "Separating hypertext content from structure in Trellis." In Proceedings of Hypertext 2, June 1989. University of York, June 29th and 30th, 1989.

[19] Richard Furuta and P. David Stotts. "Object structures in paper documents and hypertexts." BIGRE, (63-64):147-151, May 1989.

[20] Richard Furuta and P. David Stotts. "Interpreted collaboration protocols and their use in groupware prototyping." In Proceedings of Computer Supported Cooperative Work '94. Association for Computing Machinery, October 1994. To appear.

[21] Leonard Gallagher, Richard Furuta, and P. David Stotts. "Increasing the power of hypertext search with relational queries." Hypermedia, 2(1):1-14, 1990.

[22] C. F. Goldfarb. "Use of an integrated text processing system in commercial textbook production." Abstracts of the Presented Papers, International Conference on Research and Trends in Document Preparation Systems, Lausanne, Switzerland, pages 121-122, February 1981.

[23] C. F. Goldfarb. "A generalized approach to document markup." Proceedings of the ACM SIGPLAN SIGOA Symposium on Text Manipulation, SIGPLAN Notices, 16(6):68-73, June 1981. The proceedings of the conference containing this paper are also available as SIGOA Newsletter 2(1&2), Spring/Summer 1981.

[24] Michael Gorman and Paul W. Winkler, editors. Anglo-American Cataloguing Rules, 2nd Edition. American Library Association, 1988.

[25] Christoph Hüser, Wiebke Möhr, and Vincent Quint, editors. Proceedings of the Fifth International Conference on Electronic Publishing, Document Manipulation and Typography. Wiley, April 1994. Also available as Electronic Publishing: Origination, Dissemination and Design, Volume 6, Number 4, December 1993.

[26] IBM Corporation. Document Composition Facility Generalized Markup Language: Starter set reference, April 1980. Order number SH20-9187-0.

[27] International Standard Organisation. Text and Office Systems--Office Document Architecture (ODA) and Interchange Format, 1989. International Standard 8613.

[28] ISO. Text and Office Systems--Standard Generalized Markup Language, October 1986. Document Number: ISO 8879-1986(E).

[29] Eila Kuikka and Martti Penttonen. "Transformation of structured documents with the use of grammar." In Hüser et al. [25], pages 373-383.

[30] Sandra A. Mamrak, Michael J. Kaelbling, Charles K. Nicholas, and Michael Share. "Chameleon: A system for solving the data-translation problem." IEEE Transactions on Software Engineering, 15(9):1090-1108, September 1989.

[31] Steven R. Newcomb, Neill A. Kipp, and Victoria T. Newcomb. "The `HyTime' hypermedia/time-based document structuring language." Communications of the ACM, 34(11):67-83, November 1991.

[32] Ken Perlin. "Pad: An alternative approach to the computer interface." In Coordination Theory and Collaboration Technology Workshop, pages 103-118, 1993.

[33] Ken Perlin and David Fox. "Pad: An alternative approach to the computer interface." In Proceedings of SIGGRAPH 93, pages 57-64, 1993.

[34] B. K. Reid. "The Scribe document specification language and its compiler." Abstracts of the Presented Papers, International Conference on Research and Trends in Document Preparation Systems, Lausanne, Switzerland, pages 59-62, February 1981.

[35] Brian K. Reid. "A high-level approach to computer document formatting." Conference Record of the Seventh Annual ACM Symposium on Principles of Programming Languages, January 1980.

[36] Brian K. Reid. Scribe: A Document Specification Language and its Compiler. Ph.D. dissertation, Carnegie-Mellon University Computer Science Department, Pittsburgh, PA, October 1980. Also issued as Technical Report CMU-CS-81-100.

[37] Brian K. Reid and Janet H. Walker. Scribe Introductory User's Manual, Third Edition, Preliminary Draft. UniLogic, Ltd., Pittsburgh, May 1980. Previous editions were issued by the Carnegie-Mellon University Computer Science Department, Pittsburgh, PA.

[38] Hanan Samet. The Design and Analysis of Spatial Data Structures. Addison Wesley, 1990.

[39] Hanan Samet. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison Wesley, 1990.

[40] Ben Shneiderman. "Direct manipulation: A step beyond programming languages." Computer, 16(8):57-69, August 1983.

[41] P. David Stotts and Richard Furuta. "Petri-net-based hypertext: Document structure with browsing semantics." ACM Transactions on Information Systems, 7(1):3-29, January 1989.

[42] P. David Stotts and Richard Furuta. "Hierarchy, composition, scripting languages, and translators for structured hypertext." In A. Rizk, N. Streitz, and J. André, editors, Hypertext: Concepts, Systems, and Applications, pages 180-193. Cambridge University Press, November 1990. Proceedings of the European Conference on Hypertext.

[43] P. David Stotts, Richard Furuta, and J. Cyrano Ruiz. "Hyperdocuments as automata: Trace-based browsing property verification." In D. Lucarella, J. Nanard, M. Nanard, and P. Paolini, editors, Proceedings of the ACM Conference on Hypertext (ECHT '92), pages 272-281. ACM Press, 1992.