KEYWORDS: Searcher's thesaurus, graphical user interface, hypertext, information retrieval.
Each entry in the INSPEC Thesaurus has a structure typical of most subject thesauri. The relationships used in the INSPEC Thesaurus appear in the sample entry in figure 1.
Figure 1 shows the INSPEC Thesaurus entry for shock waves. The two-letter codes UF, NT, BT, TT, RT, and DI stand, respectively, for Use For, Narrower Term(s), Broader Term(s), Top Term(s), Related Term(s), and Date Input. Every term designated as an NT, BT, TT, or RT in an entry has, in turn, its own entry, which indicates the UF, NT, BT, TT, and RT relationships for that term. (In this paper, term is synonymous with descriptor, meaning preferred or approved INSPEC indexing term. Nonpreferred terms are referred to as access terms.)
An additional relation, PT, not shown here, indicates a Prior Term. When used with the DI field, the PT field helps to demark the beginning and end of a term's lifetime in the thesaurus.
Figure 1: Sample INSPEC Thesaurus entry.
The INSPEC Thesaurus consists of approximately 8,000 descriptors and another 8,000 or so access terms which lead to the descriptors. All of the terms in figure 1 except for sonic boom are INSPEC Thesaurus descriptors; sonic boom is an access term, as is any other term falling under the UF tracing in an entry. The user who looks up the term sonic boom, for example, is directed to use shock waves instead.
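The entry structure and the redirection from access terms to descriptors can be sketched as a small data structure. This is only an illustration under our own assumptions: the class name Entry, the function build_lookup, and the field names are hypothetical, and the RT list is deliberately partial.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One descriptor entry; field names mirror the two-letter codes."""
    term: str
    uf: list = field(default_factory=list)  # Use For (access terms)
    nt: list = field(default_factory=list)  # Narrower Terms
    bt: list = field(default_factory=list)  # Broader Terms
    tt: list = field(default_factory=list)  # Top Terms
    rt: list = field(default_factory=list)  # Related Terms


# Relationships taken from the prose description of the shock waves entry;
# the RT list is partial and for illustration only.
shock_waves = Entry(
    term="shock waves",
    uf=["sonic boom"],
    nt=["detonation waves", "plasma shock waves"],
    bt=["acoustic waves"],
    tt=["waves"],
    rt=["Mach number", "seismic waves"],
)


def build_lookup(entries):
    """Map every descriptor and every access term to its preferred descriptor."""
    lookup = {}
    for entry in entries:
        lookup[entry.term] = entry.term
        for access_term in entry.uf:
            lookup[access_term] = entry.term
    return lookup


lookup = build_lookup([shock_waves])
print(lookup["sonic boom"])  # shock waves
```

A user who types the access term is thus led to the descriptor with a single dictionary lookup.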
The INSPEC Thesaurus is particularly well suited to this project not only because of its subject scope but also because of its overall structure. In several important respects it is a "well-behaved" thesaurus and is therefore amenable to algorithmically controlled access and display. The INSPEC Thesaurus consists of several hundred subject hierarchies, ranging in length from two terms up to several hundred terms, with a maximum hierarchical depth of about six levels. Subject hierarchies are built by recursively tracing all the NT (narrower term) relationships from each TT (top term) in the thesaurus. Orphaned terms do occur in the INSPEC Thesaurus, but the few that exist are still connected to neighboring hierarchies by RT (related term) tracings. Its moderate size, varied though consistently applied structure, rich interconnectivity, and avoidance of inordinately deep hierarchies make the INSPEC Thesaurus well suited for the kinds of subject access research we report here and have planned for the future.
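The recursive tracing of NT relationships from a top term can be sketched as follows. This is a minimal illustration, not the project's off-line processing software: the function name trace_hierarchy is hypothetical, the NT table is a small fragment built from terms mentioned in this paper, and the direct link from waves to elastic waves is an assumption.

```python
def trace_hierarchy(nt, top, depth=0, path=()):
    """Yield (depth, term) pairs by recursively following NT links from `top`.

    `path` holds the terms on the current branch so that a malformed,
    cyclic NT chain cannot recurse forever.
    """
    path = path + (top,)
    yield depth, top
    for child in nt.get(top, []):
        if child not in path:
            yield from trace_hierarchy(nt, child, depth + 1, path)


# Fragment only; the waves -> elastic waves link is assumed to be direct.
NT = {
    "waves": ["elastic waves"],
    "elastic waves": ["acoustic waves", "seismic waves"],
    "acoustic waves": ["shock waves"],
    "shock waves": ["detonation waves", "plasma shock waves"],
}

rows = list(trace_hierarchy(NT, "waves"))
for depth, term in rows:
    print("  " * depth + term)
```

Because a term with two BTs is reached along two branches, a polyhierarchic term simply appears once under each of its broader terms.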
The time has now come, thanks to Graphical User Interfaces (GUIs), hypertext, and better understanding of user requirements during the search process, to provide a Searcher's Thesaurus at the outset of a search. Marcia Bates, Wilf Lancaster, Jean Aitchison, Pauline Atherton Cochrane, and Susan Jones, among other authors, have expressed a need for such a thesaurus because the user needs help immediately upon contact with an information system. Only in this way can the system clearly show a willingness to help with synonyms, variations in phrasing compound concepts, broader and narrower terms, and other means of expanding or limiting queries to bibliographic databases.
Susan Jones et al. describe experiments in interactive thesaurus navigation with intelligence rules. They review the various attempts at weighting relationships between terms, processing co-occurrence of terms to present a concordance to the user, hypertextual thesaurus files, and user navigation techniques, all in an attempt to provide heuristics for increasing or decreasing recall and precision via a thesaurus. They end their discussion of related research by saying that "the thesaurus component was not considered separately so it is impossible to tell how much it contributed to the overall success" of the search. In our work, because we consider this component of the IR system so important, we are studying thesaurus use separately and will redesign the interface and thesaurus as needed after user evaluation tests.
To quote Jones et al. again: "A thesaurus can be viewed as a bridge (emphasis in original) between queries phrased in natural language and an abstract classification structure which constitutes a map of a particular domain....we can view the thesaurus mainly as a source of natural language terms for query enhancement in a more general context..." (page 59) These two statements seem to us to imply the need for a man-made thesaurus and a machine-made term relationship list (sometimes called an automatic thesaurus) to be conceived of as a single file for purposes of creating a Searcher's Thesaurus. This is our intended line of research. This paper represents the bare beginnings of our attempt to enhance IR system use at the outset and throughout the searching and retrieval processes by providing easy access and manipulation of information in a "thesaurus" file.
The thesaurus browser described in this paper is an important component in the redesign of a user interface for a digital library collection. Its novel features provide access to a thesaurus descriptor's total hierarchy and the "cloud" of related terms surrounding it. The display represents an ordered, hypertextual concept space in which the user can move about at will, selecting search terms for immediate use or for use during subsequent searches without leaving the thesaurus or going into a separate search "mode."
In figure 2, the thesaurus display has two sections. On the left, below where you enter your initial subject search, it displays the current thesaurus term in boldface with the related terms (RTs) floating in space around it. The image conveys the related terms as having no hierarchical relationship to the current term, but merely near it in as equidistant a way as possible. RTs which appear closer to the current term are not in any way more "closely related" to it than those which appear farther away; in the parlance of the crowd, they merely got there first.
The current term appears on the right in boldface as well, but in a list of hierarchical relationships between it and other terms. In figure 2, the narrower terms (NTs) to shock waves are detonation waves and plasma shock waves, the broader term (BT) is acoustic waves, and the top term (TT) is waves. In cases where a term has more than one BT or TT, the interface can display a polyhierarchy as well (see figure 7).
The right-hand section of the display allows you to discern relationships among terms other than the current term and its immediate broader and narrower terms. For example, acoustic waves, the BT for shock waves, has the BT elastic waves. Elastic waves has the NTs acoustic waves, Love waves, magnetoelastic waves, etc.
The left-hand section of the display only shows terms listed in the thesaurus as immediately related to the current term.
The thesaurus display as a whole reveals other interesting "polyrelations" between terms not discernible from the printed form. In figure 2 again, seismic waves is shown as an RT to shock waves, although the hierarchic display on the right also reveals that seismic waves is a sibling of acoustic waves, the BT of shock waves (both have the same BT, elastic waves). Depending on your gender preference, you might then call seismic waves an "Aunt" or "Uncle" term of shock waves.
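Such "Aunt" or "Uncle" terms could also be found mechanically rather than by eye. The sketch below is an illustration only: the function names siblings_of and aunt_uncle_rts are hypothetical, and the BT and RT tables contain just the relationships named in the text.

```python
# Fragments of the BT and RT relations described for figure 2.
BT = {
    "shock waves": ["acoustic waves"],
    "acoustic waves": ["elastic waves"],
    "seismic waves": ["elastic waves"],
}
RT = {"shock waves": ["Mach number", "seismic waves"]}


def siblings_of(bt, term):
    """Terms that share at least one BT with `term` (excluding `term` itself)."""
    parents = set(bt.get(term, []))
    return {t for t, ps in bt.items() if t != term and parents & set(ps)}


def aunt_uncle_rts(bt, rt, term):
    """RTs of `term` that are also siblings of one of its BTs."""
    related = set(rt.get(term, []))
    result = set()
    for parent in bt.get(term, []):
        result |= siblings_of(bt, parent) & related
    return result


print(aunt_uncle_rts(BT, RT, "shock waves"))  # {'seismic waves'}
```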
Since RTs are symmetrical relationships, shock waves is shown in turn as an RT to Mach number when Mach number becomes the current term (figure 3). An interesting thing about this particular display is that all the other RTs for Mach number also appeared in figure 2 as RTs for shock waves. At the current stage of development of the interface you have to flip back and forth between displays a number of times to discover this fact.
Continuing with the example display shown in figure 3, the hierarchical display shows Mach number as an NT for fluid mechanics (in other words, fluid mechanics is the BT for Mach number), and mechanics is the TT for the whole hierarchy. Mach number itself, however, has no NTs.
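A comparison of this kind is easy to compute from the RT data, which suggests how a later version of the interface might spare the user the flipping back and forth. The following is a sketch under stated assumptions: the function names are hypothetical, and "placeholder RT" stands in for the RTs of Mach number, which are not named in this paper.

```python
def make_symmetric(rt):
    """Return an RT table in which every link is recorded in both directions."""
    sym = {}
    for a, related in rt.items():
        for b in related:
            sym.setdefault(a, set()).add(b)
            sym.setdefault(b, set()).add(a)
    return sym


def rts_contained_in(rt, a, b):
    """True if every RT of `a` (other than `b`) is also an RT of `b`."""
    return (rt.get(a, set()) - {b}) <= (rt.get(b, set()) - {a})


# One-directional input; "placeholder RT" is a stand-in, not a real descriptor.
raw = {
    "shock waves": {"Mach number", "seismic waves", "placeholder RT"},
    "Mach number": {"placeholder RT"},
}
sym = make_symmetric(raw)

print("shock waves" in sym["Mach number"])                 # True
print(rts_contained_in(sym, "Mach number", "shock waves"))  # True
print(rts_contained_in(sym, "shock waves", "Mach number"))  # False
```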
From the hierarchical display in figure 3 you can tell that Mach number has no NTs in two ways.
First, there are no other terms shown as indented immediately below it; intermolecular mechanics, though it is immediately below Mach number, is actually indented relative to mechanics, the top term of the hierarchy, and therefore an NT of mechanics, not Mach number.
The second and more useful way of knowing that Mach number has no NTs is the absence of either a "+" (plus) or "-" (minus) sign to the immediate left of it. These signs both mean the same thing: that the term they precede has NTs. The difference is that when the sign is a "+" the NT hierarchy beneath the term is collapsed, but when the sign is a "-" the NT hierarchy beneath the term is expanded. Both figures 2 and 3 show NT hierarchies in states of expansion and collapse: in figure 2, the NT hierarchies under elastic waves, acoustic waves, and shock waves are expanded, with the rest collapsed; in figure 3, the NT hierarchy under fluid mechanics is expanded, and all the rest are collapsed.
The thesaurus interface software automatically expands or collapses NT hierarchies to show the BTs and NTs surrounding the current term. Thus, even when the current term occurs in a long hierarchy, the software can display it as part of a fairly short and thus readily comprehensible display. The fully expanded hierarchy under mechanics in figure 3, for example, is one of the longest in the INSPEC Thesaurus. Yet, because only the immediately broader parts of the hierarchy are expanded around the current term Mach number, its relation to the overall body of knowledge to which engineers ascribe the term "mechanics" is clearly illustrated.
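One way to obtain this behavior is to expand only the terms that lie on a path from the top term to the current term, plus the current term itself, and to show every other term with NTs as collapsed. The sketch below illustrates that rule; it is not the interface's actual code. The function names are hypothetical, the NT table is a small fragment of the mechanics hierarchy, and the table is assumed to be free of cycles.

```python
def terms_on_paths(nt, top, current):
    """Set of terms on any NT path from `top` down to `current`, inclusive."""
    on_path = set()

    def visit(term, trail):
        trail = trail + (term,)
        if term == current:
            on_path.update(trail)
            return
        for child in nt.get(term, []):
            if child not in trail:
                visit(child, trail)

    visit(top, ())
    return on_path


def render(nt, top, current):
    """Return display lines with '-' (expanded), '+' (collapsed) or ' ' (no NTs)."""
    expanded = terms_on_paths(nt, top, current)
    lines = []

    def walk(term, depth):
        has_nt = bool(nt.get(term))
        is_open = has_nt and term in expanded
        marker = "-" if is_open else ("+" if has_nt else " ")
        lines.append("  " * depth + marker + " " + term)
        if is_open:
            for child in nt[term]:
                walk(child, depth + 1)

    walk(top, 0)
    return lines


# Fragment only; the real mechanics hierarchy is far longer.
NT = {
    "mechanics": ["fluid mechanics", "intermolecular mechanics"],
    "fluid mechanics": ["fluid dynamics", "Mach number"],
    "fluid dynamics": ["surface waves (fluid)"],
    "intermolecular mechanics": ["intermolecular forces"],
}

lines = render(NT, "mechanics", "Mach number")
print("\n".join(lines))
```

Calling render with intermolecular mechanics as the current term collapses fluid mechanics and expands intermolecular mechanics instead.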
You can see directly how the thesaurus interface automatically expands and collapses hierarchies by clicking on different terms in the same hierarchy. Clicking on intermolecular mechanics as it appears in figure 3 changes the hierarchy to the one shown in figure 4 (because the thesaurus now has intermolecular mechanics as its current term, it changes the RTs displayed as well). Note that it collapsed the NT hierarchy under fluid mechanics (where Mach number appears) and expanded the NT hierarchy under intermolecular mechanics. The sole NT of intermolecular mechanics is intermolecular forces, which has, as indicated by the "+" sign before it, one or more NTs itself.
You can expand and collapse NT hierarchies yourself by clicking on the "+" or "-" signs, respectively, in the hierarchical display, rather than on the terms they occur next to. You can thus see other hierarchical relationships in the thesaurus without changing the current term.
The change in the mechanics hierarchy shown between figures 3 and 4 was a minor one because in each case the immediate hierarchies surrounding the current terms were small. If, however, in figure 3 you were to click on fluid dynamics the change in the mechanics hierarchy would appear drastic and disorienting, as shown in figure 5. The causes of this sudden increase in the complexity of the display have to do with the structure of the thesaurus itself and with how the current version of the interface software displays it.
The structure of the thesaurus places fluid dynamics into a polyhierarchy, which complicates not only the conceptual space surrounding it but also the way in which the interface software must display it. Fluid dynamics has two BTs, dynamics and fluid mechanics, meaning that the entire NT hierarchy under fluid dynamics occurs twice. Since the displayed NT hierarchy for fluid dynamics is identical in each case, we can collapse the redundant NT hierarchies by clicking on the "-" signs next to all but one of the redundant occurrences of the polyhierarchic term. Figure 6 illustrates how collapsing one of the redundant NT hierarchies under fluid dynamics simplifies the hierarchical display somewhat, though it is still long enough to require a scrollbar. Automatic collapsing of redundant NT hierarchies could be added easily enough to the thesaurus interface software.
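Automatic collapsing of this kind might work by remembering which terms have already been shown expanded and leaving every later occurrence collapsed. The following sketch illustrates the idea; it is a possible approach, not a feature of the current software. The function name is hypothetical, the NT table is a fragment, and the direct link from mechanics to dynamics is an assumption.

```python
def render_collapsing_repeats(nt, top, expanded):
    """Render a hierarchy, expanding each term in `expanded` only once.

    Later occurrences of an already expanded term are shown collapsed ('+'),
    which also keeps the walk from looping on a cyclic NT chain.
    """
    lines = []
    already_open = set()

    def walk(term, depth):
        has_nt = bool(nt.get(term))
        is_open = has_nt and term in expanded and term not in already_open
        marker = "-" if is_open else ("+" if has_nt else " ")
        lines.append("  " * depth + marker + " " + term)
        if is_open:
            already_open.add(term)
            for child in nt[term]:
                walk(child, depth + 1)

    walk(top, 0)
    return lines


# Fragment only; mechanics -> dynamics is assumed to be a direct link.
NT = {
    "mechanics": ["dynamics", "fluid mechanics"],
    "dynamics": ["fluid dynamics"],
    "fluid mechanics": ["fluid dynamics", "Mach number"],
    "fluid dynamics": ["surface waves (fluid)"],
}
expanded = {"mechanics", "dynamics", "fluid mechanics", "fluid dynamics"}

lines = render_collapsing_repeats(NT, "mechanics", expanded)
print("\n".join(lines))
```

In this fragment the NT hierarchy under fluid dynamics is listed once, under dynamics, and the second occurrence under fluid mechanics carries a "+" sign.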
Polyhierarchy comes in two distinct flavors in the INSPEC thesaurus. Fluid dynamics occurs twice under the TT mechanics, but surface waves (fluid) occurs twice under the TT mechanics (due to it being an NT of fluid dynamics) and also under the TT surface phenomena (which is not in the mechanics hierarchy). When a term has more than one TT, the interface software puts the TTs in a pulldown box at the top of the hierarchical display, allowing you to choose the TT for which you would like to see the hierarchy. Figure 7 illustrates the term surface waves (fluid) in the mechanics hierarchy, with the pulldown being used to select the other TT, surface phenomena. Figure 8 illustrates the term surface waves (fluid) in the surface phenomena hierarchy.
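The list of TTs offered in the pulldown can be derived by climbing the BT links from the current term until terms with no BT are reached. The sketch below is an illustration under our own assumptions: the function name top_terms is hypothetical, and the direct BT links from dynamics to mechanics and from surface waves (fluid) to surface phenomena are assumed, since the intermediate levels are not spelled out in this paper.

```python
def top_terms(bt, term, seen=None):
    """Return the set of top terms reachable from `term` by following BT links."""
    seen = set() if seen is None else seen
    if term in seen:          # already explored along another branch
        return set()
    seen.add(term)
    parents = bt.get(term, [])
    if not parents:           # no broader term: this is a top term
        return {term}
    tops = set()
    for parent in parents:
        tops |= top_terms(bt, parent, seen)
    return tops


# Fragment only; see the assumptions stated above.
BT = {
    "surface waves (fluid)": ["fluid dynamics", "surface phenomena"],
    "fluid dynamics": ["dynamics", "fluid mechanics"],
    "dynamics": ["mechanics"],
    "fluid mechanics": ["mechanics"],
}

print(sorted(top_terms(BT, "surface waves (fluid)")))
print(sorted(top_terms(BT, "fluid dynamics")))
```

A term such as fluid dynamics, whose two BTs lead to the same TT, yields a single entry and so needs no pulldown.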
Terms in the thesaurus can appear in more than one hierarchy because they often fit into more than one conceptual scheme.
Short of extending the entry vocabulary of the thesaurus, a useful tool in these circumstances would be a lexical avenue into the controlled as well as entry vocabulary of the thesaurus. KeyWord-Out-of-Context (KWOC, also known as just "keyword") lists and KeyWord-In-Context (KWIC) lists can be useful for this purpose.
Figures 9 and 10 illustrate the use of Keyword and Keyword in Context lists to help find thesaurus terms containing the word stem "computer." The "Keywords" list continually tries to match the current word that the user is typing with a word in the keyword database compiled from all INSPEC Thesaurus descriptors and access vocabulary terms. As such, it can also act as a spell-checker. You can also use it to transfer a word from the Keywords list to the search entry area by double-clicking (or dragging-and-dropping). Once you have typed "zeu" for example, you can transfer the whole word "zeugmatography" into the search form directly.
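The continual matching of the typed prefix against the keyword database can be sketched with a binary search over a sorted word list. This is a minimal illustration, not the keyword server's implementation: the function names and the handful of sample terms are our own assumptions.

```python
import bisect


def build_keywords(terms):
    """Collect the distinct lowercased words of all terms into a sorted list."""
    return sorted({w.strip("()").lower() for t in terms for w in t.split()})


def keyword_matches(sorted_words, prefix, limit=10):
    """Return up to `limit` words that start with what has been typed so far."""
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_words, prefix)  # first word >= prefix
    matches = []
    for word in sorted_words[start:start + limit]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


# A few descriptors and access terms mentioned in this paper.
words = build_keywords([
    "shock waves", "sonic boom", "computer industry",
    "DP industry", "Mach number", "zeugmatography",
])

print(keyword_matches(words, "zeu"))   # ['zeugmatography']
print(keyword_matches(words, "s"))     # ['shock', 'sonic']
print(keyword_matches(words, "zeuk"))  # []
```

An empty result plays the role of the spell-checker: no word in the thesaurus vocabulary begins with the letters typed so far.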
Returning to the example illustrated in figures 9 and 10, the Keywords in Context list activates after you type a complete word. In figure 9, the Keywords in Context list has returned all INSPEC Thesaurus descriptors and access terms containing the word stem "computer." The list scrolls in case there are more items in it than can be shown in the list window (both the Keywords list and Keywords in Context list are in resizable windows).
In figure 10, computer industry has been selected from the Keywords in Context list, and the Thesaurus has responded by displaying that term in its context. In this case computer industry is an access term, the preferred term being DP industry, as indicated in the note area below where you would normally enter the term. With the Keywords in Context list in use, you can scroll through it and click on any number of terms displayed therein, displaying the thesaurus entry for each in turn.
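The lookup behind the Keywords in Context list, including the redirection from an access term to its preferred term, can be sketched as follows. The function name keyword_in_context and the two small tables are assumptions made for illustration; only relationships named in this paper are included.

```python
# Access term -> preferred descriptor (the UF relation read in reverse).
USE = {"computer industry": "DP industry", "sonic boom": "shock waves"}
DESCRIPTORS = ["DP industry", "shock waves"]


def keyword_in_context(stem, descriptors, use):
    """Return (term, preferred term) pairs for terms with a word starting with `stem`."""
    stem = stem.lower()
    hits = []
    for term in sorted(list(descriptors) + list(use), key=str.lower):
        if any(word.lower().startswith(stem) for word in term.split()):
            hits.append((term, use.get(term, term)))
    return hits


for term, preferred in keyword_in_context("computer", DESCRIPTORS, USE):
    note = "" if term == preferred else "  (use " + preferred + ")"
    print(term + note)
```

For a descriptor the two members of the pair are identical, so the note line appears only for access terms.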
You can use the Keywords and the Keywords in Context list separately or together, depending on your situation. If you know of a term in the thesaurus but aren't sure about how one of its words is spelled, you can use the Keywords list to check your spelling. If you want to pick out a thesaurus descriptor or access term from a word contained in it (as in figures 9 and 10), you can use the Keywords in Context list with or without the Keywords list.
We feel that our current solution to keyword access, with free-floating, resizable, modelessly accessible Keyword and Keyword in Context lists, offers the best of both querying and browsing without requiring that each have its own mode of operation. You can summon or dismiss each list independently at any point during the typing of your search term, and using them does not complicate the search form (you access the lists through a floating control palette, not shown in figures 9 and 10). An added bonus to this generic application of Keyword and Keyword in Context lists is that they can also be used, with the same programming code and same interface elements, for other kinds of searching such as title, author, or full text. The list software merely has to know to switch keyword lists when you decide to use a different search method. And once you learn how to use Keyword and Keyword in Context lists in one kind of search, you know how to use them in any other kind of search as well, because the lists look and work exactly the same in each case.
In earlier designs for the interface, the Keywords and Keywords in Context lists were a fixed part of the thesaurus interface, with fixed places on the screen and several modes of display. This produced several problems that made the use of the interface unsatisfactory.
First, it was difficult, from a design standpoint, to decide where best to place the respective list displays. Given how we wanted the hierarchical and related term displays to work, there was no good place to put them to begin with, and we even considered at one point doing without the lists altogether. The only solution, keeping the lists as part of the thesaurus display form, would have been to elongate it in a manner that would only allow it to be used on a 1024x768 or higher resolution screen, and that was not in keeping with our purpose of providing an interface that would work on public terminals with screen resolutions of only 640x480, as well as on notebook computers.
Second, as useful as Keywords and Keywords in Context lists may be, you don't always want to use them. This is why almost all OPACs and other information retrieval systems maintain the distinction in their interfaces between browsing and querying: you should be able to browse when you want to, but when you know an item is there, or know the correct subject heading for your search, you just don't want to be bothered with scrolling through lists of choices. Even when, as with our system, you can still type your request when the Keywords and/or the Keywords in Context lists are displayed, they are still a distraction when you don't want to use them.
Third, realities of server loads and network traffic mean that during periods of peak use, or when using a keyword server located some distance from your client machine, the performance of both the Keywords and Keywords in Context lists can be quite sluggish. This is tolerable if you really need them to help you spell a word or check for the existence of a heading or phrase before submitting a query to an even more heavily loaded bibliographic database server. But putting up with a sluggish keyword list that encumbers your typing when you don't need it is intolerable.
To combat this phenomenon without limiting the number of displayed terms or the means to navigate the thesaurus, we have implemented a generic tool called the "Hold File," into which you can place thesaurus terms (as well as free text descriptors, author names, and other bibliographic information objects) for later use. The metaphor is that of an index card file into which you can place search items that you might want to try later.
Figure 11 illustrates use of the Hold File. In this example, the descriptor "research initiatives" is of interest, but not immediately. So as not to lose track of it while continuing a search for something else, you can use the mouse to drag it into the Hold File (represented by a card file icon) on the main search form (where queries and short record lists are displayed). Dragging a subject descriptor to the Hold File stores a copy of the descriptor there, and does not remove it from the thesaurus display.
Like the Keyword and Keyword in Context lists, the Hold File is a generic interface tool. It can hold title and full-text keywords and author names as well as thesaurus descriptors, and you use it the same way for each. The Hold File automatically keeps each type of data in a separate list, but also allows you to move items between lists, as when you want to use a thesaurus descriptor in a title or full-text search.
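The behavior described here, with one list per type of data and movement of items between lists, can be sketched as a small class. This is an illustration only, not the Hold File's actual implementation: the class layout, the method names, and the kind labels are our own assumptions.

```python
class HoldFile:
    """Keeps held search items in a separate list for each kind of data."""

    KINDS = ("descriptor", "title keyword", "full-text keyword", "author")

    def __init__(self):
        self._lists = {kind: [] for kind in self.KINDS}

    def hold(self, kind, item):
        """Store a copy of `item` under `kind`, ignoring duplicates."""
        if item not in self._lists[kind]:
            self._lists[kind].append(item)

    def items(self, kind):
        """Return a copy of the list for `kind`."""
        return list(self._lists[kind])

    def move(self, item, src, dst):
        """Move `item` from list `src` to list `dst`."""
        self._lists[src].remove(item)  # raises ValueError if not held under src
        self.hold(dst, item)


hold_file = HoldFile()
hold_file.hold("descriptor", "research initiatives")
hold_file.move("research initiatives", "descriptor", "title keyword")
print(hold_file.items("title keyword"))  # ['research initiatives']
```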
Development of both the thesaurus interface and the off-line processing software will continue for the length of the Digital Library project. There are several problems in particular that need further work. The RT placement algorithm, for instance, does not yet consistently distribute terms of arbitrary lengths in a way which appears spatially balanced.
Preliminary usability studies have been done to record user problems and acceptance of display features and tools (e.g., Keyword lists, Keyword in Context lists, and ways of transferring information objects between interface forms, such as drag and drop). Findings from these studies have helped and will continue to help us improve the interface. Among other preliminary findings, users seem to prefer looking at the "cloud" of RTs rather than the subject hierarchy, even though to us the hierarchy seems to contain more interesting and useful information. Some users even question what the hierarchy is, and try to navigate using only the RT cloud ("cloud" is the word users most often used when asked to describe the RT display). The "Done" button also seems to need a different label.
The INSPEC Thesaurus with this interface is only a portion of our plan to create a searcher's thesaurus. We also intend to display an automatic thesaurus made from term co-occurrences in abstracts and full text documents in the Digital Library Initiative testbed. We think linkages between these two types of thesauri will provide more lead-in vocabulary for the user and direct access to portions of the full-text documents. Navigation through the document retrieval space will be hypertextual as well.
We also plan to apply this interface to other science and engineering thesauri besides INSPEC and to thesauri in other disciplines. This will allow us to evaluate its robustness and general applicability to features of other thesauri.
We also wish to thank the DLI user study group (Ann Bishop, Laura Neumann, and Emily Ignacio) for providing us with preliminary user feedback, and Leigh Star for being our preliminary test subject.