The Faceted Access to Cultural hEritage Terminology (FACET) project was a collaborative project investigating the potential of semantic expansion in retrieval. It aimed to take advantage of facet structure in both the interface and the retrieval mechanism. More information and a web demonstrator can be found on the project webpages. An online journal paper with links to the web demonstrator discusses various issues in providing KOS-based services. This research strand is continuing in the STAR project.
Dates: April 2000 – March 2003
Funding source: EPSRC, £121,130
Principal Investigator: Douglas Tudhope
Co-Investigator: Daniel Cunliffe
Research Associates: Ceri Binding, Dorothee Blocks
FACET was a three-year EPSRC-funded collaborative project investigating the retrieval potential of faceted thesauri. The original project finished in 2003. The EPSRC’s assessment of our final report rated the project as ‘Tending to Outstanding’. Two aspects (Communication of Research Outputs and Cost Effectiveness) were awarded the top grade of ‘Outstanding’.
Today we are seeing major efforts to digitise collections for the WWW. This involves opening up databases, previously the domain of the professional, to a new range of users. There is a critical need for tools that assist users to formulate and refine queries, and to navigate through the information space. The recent growth of cultural heritage applications has served as a major impetus in promoting access to multimedia collections more generally and has coincided with an interest in applying traditional cataloguing techniques to the WWW. The move by museums to unlock their collections databases to the public has also foregrounded the issues of access points and indexing practice.
The thesaurus is one of the most commonly used controlled vocabulary indexing tools – the aim of the FACET project is to investigate the retrieval potential of thesauri. FACET investigates the closer integration of the thesaurus into the interface and search techniques that do not require the user to exactly match how an item has been indexed.
FACET collaborates with the J. Paul Getty Trust in exploring the retrieval potential of its vocabularies, in particular the Art and Architecture Thesaurus (AAT), and with the National Museum of Science and Industry (NMSI) in its attempts to promote wider access to its collections database. The aim is to complement NMSI’s development of major areas of ‘rich content’. Railway/Locomotive History has been selected as one area particularly appropriate for the project due to its AAT coverage and synergy with ongoing work at the National Railway Museum on extending the AAT with railway terms. The MDA and CHIN act as advisors to the project.
Thesauri and classifications are types of controlled indexing vocabulary, in which index terms are restricted to a controlled set of terms. A large number of systems exist, covering a variety of subject domains, for example the Medical Subject Headings (MeSH), the Art and Architecture Thesaurus and the Dewey Decimal Classification. These controlled vocabularies have long been part of standard cataloguing practice in libraries and museums and are now being applied to digital hypertexts via thematic keywords in metadata resource descriptors. Metadata sets for the WWW, such as Dublin Core and the Resource Description Framework (RDF), typically include the more complex notion of the Subject of a resource in addition to elements for Title, Creator, Date, etc. It is recommended that, where possible, the Subject element be taken from a relevant controlled vocabulary. This semantic index approach offers the potential for searcher and indexer to speak the same language, and for a user to be guided to fruitful terms when searching a particular collection for a particular purpose. Links between concepts in the subject domain can be expressed by the semantic relationships in a thesaurus (or classification). The three main thesaurus relationships are Equivalence (equivalent terms), Hierarchical (broader/narrower terms), and Associative (more loosely Related Terms). Specialisations of the three main relationships offer possibilities for semantic web applications.
Facet analysis is a key technique in thesaurus construction; concepts are decomposed into elemental classes, or facets, which form homogeneous, mutually exclusive groups. The faceted approach to subject analysis began in 1933 with Ranganathan’s Colon Classification (Personality, Matter, Energy, Space and Time) and was subsequently elaborated by the British Classification Research Group. Faceted thesauri or classification systems include MeSH, BLISS, PRECIS and the main thesaurus used in the project, the Art and Architecture Thesaurus (AAT).
The AAT is a large, evolving thesaurus (nearly 120,000 terms), organised into 7 facets (and 33 hierarchies as subdivisions) according to semantic role: Associated concepts, Physical attributes, Styles and periods, Agents, Activities, Materials, Objects and optional facets for time and place.
Faceted thesauri are similar in structure to faceted classifications but explicitly represent equivalence, hierarchical and associative links between concepts. A thesaurus can be used as a search thesaurus for refining or expanding a free text query (either interactively or automatically). Alternatively, a thesaurus can be used both in searching and indexing with controlled vocabulary indexed datasets – and this latter use is the immediate application of our current work (although we also see the techniques as useful with free text searching). In retrieval, thesaurus relationships are conventionally used to expand synonyms and sometimes narrower query terms, but the FACET system also performs more general semantic term expansion (to broader and to related concepts). Reasoning over the semantic relationships in the thesaurus permits imprecise matching between query and index terms. This allows the ranking of matching items in a result list or a ‘More like this’ option for similar but not necessarily identically indexed items.
Faceted systems are based on a primary division of terminology into fundamental, high-level categories, or facets. A knowledge system can be considered as enumerative, when all possible simple and compound terms are explicitly listed in their hierarchical position, or as synthetic. Faceted systems are normally synthetic; they do not attempt to include the vast number of possible multi-concept headings or descriptors in a domain, but combine terms from a limited number of fundamental facets, as needed when indexing or querying. This flexibility allows highly specific, nuanced metadata descriptions (or annotations). Matching such compound descriptors poses significant challenges when searching and the full potential for retrieval has remained untapped.
The overarching objective of the research was to develop and evaluate retrieval tools based on a matching function incorporating thesaurus semantic closeness measures.
Further objectives were to:
The research is directly relevant to cultural heritage organisations and the users of their digital collections, and also to collection management vendors and commercial image providers. Thesauri are one of the most common Knowledge Organisation Systems and frequently underpin higher level schemas and ontologies. Initiatives to update international thesaurus standards are currently underway and various groups are working on XML/RDF representations for thesauri. Thesauri and faceted approaches have been applied to website architecture and hierarchical browsing interfaces to web databases.
The final FACET system comprises a tiered component-based architecture (Figure 1), accessing a SQL Server relational database. Queries with associated results are stored persistently using XML format data.
Figure 1 – underlying system architecture
This architecture has enabled the reuse of key underlying components in the development of two main client interfaces – the first a compiled standalone VB ‘fat’ client, the second a browser-based ASP web application. Intrinsic to both systems is the C++ semantic expansion engine, which operates over an in-memory directed graph structure populated from the relational tables representing the thesaurus.
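As a rough illustration of how an in-memory directed graph might be populated from relational tables, the following minimal sketch uses Python and an in-memory SQLite database purely as stand-ins for the C++ engine and the SQL Server database. The table name `relationship`, its columns and the sample rows are all hypothetical and are not taken from the FACET schema.

```python
import sqlite3
from collections import defaultdict


def load_graph(conn):
    """Build {source concept: [(relationship type, target concept), ...]}.

    Assumes a hypothetical table relationship(source, rel_type, target);
    each row becomes one directed, labelled edge.
    """
    graph = defaultdict(list)
    rows = conn.execute("SELECT source, rel_type, target FROM relationship")
    for source, rel_type, target in rows:
        graph[source].append((rel_type, target))
    return dict(graph)


# Illustrative data only (not actual thesaurus content).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relationship (source TEXT, rel_type TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO relationship VALUES (?, ?, ?)",
    [
        ("chairs", "NT", "armchairs"),
        ("armchairs", "BT", "chairs"),
        ("armchairs", "NT", "carver chairs"),
        ("carver chairs", "BT", "armchairs"),
    ],
)
graph = load_graph(conn)
```

Loading the edges once into an adjacency structure of this kind means that later expansion steps can follow relationships without issuing a database query per traversal.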
Multi-concept subject headings (metadata descriptors), built by synthesising single concept vocabulary elements, allow highly specific descriptions. This poses challenges for retrieval systems. Indexer and searcher may be operating at different levels of specificity, and at different times both may make different choices from a set of possible term options. It may be impractical to drill down deep hierarchies or browse several dimensions, trying combinations to match exactly all descriptors that might be considered relevant, taking into account both indexing exhaustivity (number of terms) and specificity (level of detail). In distributed search systems, there may not even be a coupling with a single database that permits easy feedback of postings when browsing the thesaurus. Toni Petersen, then Director of the Getty Art and Architecture Thesaurus Project, outlined key unsolved issues for system designers (in a discussion of the National Art Library database at the Victoria and Albert Museum):
“The major problem lies in developing a system whereby individual parts of subject headings containing multiple AAT terms are broken apart, individually exploded hierarchically, and then reintegrated to answer a query with relevance” (Petersen T. 1994. The National Art Library and the AAT. Art and Architecture Thesaurus Bulletin, 22, 6-8.)
The FACET matching function addresses this challenge by generalising queries via semantic expansion of concepts. Results are ranked based on measures of semantic closeness. Semantic closeness is based on the minimum number of (weighted) semantic relationships that must be traversed in order to connect any two distinct thesaurus concepts. This can range from traversing one relationship to more complex chains that combine traversals.
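The idea of closeness as a minimum weighted traversal can be sketched as a bounded shortest-path search over the thesaurus graph. This is a minimal sketch under stated assumptions, not the FACET implementation: the traversal costs in `COST`, the cut-off `max_cost`, the linear conversion from cost to closeness and the thesaurus fragment itself are all hypothetical.

```python
import heapq

# Assumed traversal cost per relationship type (hypothetical values).
COST = {"NT": 1.0, "BT": 1.5, "RT": 2.0}

# Illustrative fragment: concept -> [(relationship type, neighbour), ...].
THESAURUS = {
    "chairs": [("NT", "armchairs"), ("NT", "side chairs"), ("BT", "seating furniture")],
    "armchairs": [("BT", "chairs"), ("NT", "carver chairs")],
    "carver chairs": [("BT", "armchairs")],
    "side chairs": [("BT", "chairs")],
    "seating furniture": [("NT", "chairs"), ("RT", "upholstery")],
    "upholstery": [("RT", "seating furniture")],
}


def expand(start, max_cost):
    """Return {concept: cheapest total cost} for concepts within max_cost of start."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, concept = heapq.heappop(queue)
        if cost > best.get(concept, float("inf")):
            continue  # stale queue entry; a cheaper path was already found
        for rel, neighbour in THESAURUS.get(concept, []):
            new_cost = cost + COST[rel]
            if new_cost <= max_cost and new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best


def ranked_expansion(start, max_cost):
    """List (concept, closeness) pairs, closest first; closeness = 1 - cost / max_cost."""
    costs = expand(start, max_cost)
    ordered = sorted(costs.items(), key=lambda pair: (pair[1], pair[0]))
    return [(concept, 1.0 - cost / max_cost) for concept, cost in ordered]


expanded = ranked_expansion("armchairs", 3.0)
```

With these assumed costs, expanding "armchairs" with a cut-off of 3.0 reaches "carver chairs", "chairs", "side chairs" and "seating furniture", while "upholstery" lies beyond the cut-off and is excluded.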
For example, a descriptor brocade, oak, Victorian, Carver Chairs can be considered a partial match for a query on brocading, mahogany, Edwardian, armchairs, although no terms match exactly. Relevance judgements will depend on context. The point is to provide a semantic expansion capability as an option when exact matches are not available.
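One way to picture partial matching of multi-concept descriptors is to score each query term against its closest index term and then average the results. The sketch below is illustrative only: the pairwise closeness values are invented for the example (in practice they would come from semantic expansion over the thesaurus), and averaging the best per-term scores is an assumed combination rule, not necessarily the one FACET uses.

```python
# Hypothetical pairwise closeness values in [0, 1]; identical terms score 1.0.
CLOSENESS = {
    frozenset(["brocading", "brocade"]): 0.75,
    frozenset(["mahogany", "oak"]): 0.5,
    frozenset(["Edwardian", "Victorian"]): 0.75,
    frozenset(["armchairs", "carver chairs"]): 0.875,
}


def term_closeness(a, b):
    """Closeness of two terms: 1.0 if identical, table value if known, else 0.0."""
    if a == b:
        return 1.0
    return CLOSENESS.get(frozenset([a, b]), 0.0)


def match_score(query_terms, index_terms):
    """Average, over query terms, of the best closeness to any index term."""
    if not query_terms:
        return 0.0
    total = 0.0
    for q in query_terms:
        total += max((term_closeness(q, t) for t in index_terms), default=0.0)
    return total / len(query_terms)


def rank(query_terms, items):
    """Return (item id, score) pairs with a non-zero score, best first."""
    scored = [(item_id, match_score(query_terms, terms)) for item_id, terms in items.items()]
    scored = [pair for pair in scored if pair[1] > 0.0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


query = ["brocading", "mahogany", "Edwardian", "armchairs"]
items = {
    "item-1": ["brocade", "oak", "Victorian", "carver chairs"],
    "item-2": ["mahogany", "armchairs"],
    "item-3": ["porcelain", "vases"],
}
results = rank(query, items)
```

Under these assumed values, item-1 is retrieved although none of its terms matches the query exactly, and it ranks above item-2, which matches only two of the four query terms. Whether that ordering is desirable depends on context, as noted above.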
Figure 2: FACET Query Editor (stand alone)
The direct manipulation faceted query editor facilitates the construction of multi-concept queries (Figure 2). The left pane combines a number of navigable views of the thesaurus: a mapping facility to controlled terms, a hierarchical browser and a ‘semantic browser’ (see below). Once controlled thesaurus terms are selected (by any of the methods), they can be dragged (or added via context menu) to the query on the right, where they are automatically associated with the appropriate facet. Figure 2 shows the Query Expansion view in the right pane, a colour coded visualisation of the terms affected by the current expansion setting. Functionality includes term navigation history and bookmarking, and display of scope notes and related terms. Colour coded icons indicate facet membership (and also presence of related terms). At any point, the user may double-click a term to browse the thesaurus and explore local context in order to discover if a term corresponds to the user’s information need (there may be homonyms).
Semantic browsing presents semantic expansion as a simple navigation option. It is one of the left pane browsing options in the Query Editor and offers an innovative alternative to cumbersome sequential hierarchical navigation of a complex thesaurus structure, with the chance of missing a Related Term link or failing to explore a key line of hierarchical descent in a complex tree. With semantic browsing, the hierarchical display is replaced by a linear list (and indication of relative semantic closeness if desired). This is not only useful when dealing with Related Terms. In some situations, semantic expansion may be an easier browsing option than investigating which sub-hierarchies are fruitful to explore in large thesauri. In fact, a user can continue to browse via semantic expansion by double-clicking terms in this view.
A thesaurus typically employs a restricted set of core semantic relationships between concepts, following well established standards (ISO 2788, ISO 5964). The three main thesaurus relationships are Equivalence (synonyms and equivalent terms), Hierarchical (broader/narrower terms) and Associative (more loosely related terms). This tradition of confining relationships to a core set assists interoperability. It can also serve to facilitate automated reasoning over a small well-defined set of relationships.
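The three core relationships can be pictured as fields on a simple concept record, using the conventional UF, BT, NT and RT labels. This is a minimal sketch for illustration: the `Concept` class, the `build_entry_index` helper and the sample entry are hypothetical and are not taken from the AAT or from the FACET code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    preferred: str                                      # preferred (indexing) term
    used_for: List[str] = field(default_factory=list)   # Equivalence: non-preferred terms (UF)
    broader: List[str] = field(default_factory=list)    # Hierarchical: broader terms (BT)
    narrower: List[str] = field(default_factory=list)   # Hierarchical: narrower terms (NT)
    related: List[str] = field(default_factory=list)    # Associative: related terms (RT)


def build_entry_index(concepts: List[Concept]) -> Dict[str, str]:
    """Map every entry term (preferred or non-preferred) to its preferred term."""
    index: Dict[str, str] = {}
    for concept in concepts:
        index[concept.preferred.lower()] = concept.preferred
        for alternative in concept.used_for:
            index[alternative.lower()] = concept.preferred
    return index


# Illustrative entry only.
armchairs = Concept(
    preferred="armchairs",
    used_for=["arm chairs"],
    broader=["chairs"],
    narrower=["carver chairs"],
    related=["upholstery"],
)
entry_index = build_entry_index([armchairs])
```

Keeping the relationship set this small is what makes generic processing straightforward: any tool that understands these four fields can map entry terms to preferred terms or walk the hierarchy, regardless of subject domain.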
In retrieval, thesaurus relationships are conventionally used to expand synonyms and sometimes narrower hierarchical concepts, but the FACET system also performs more general semantic term expansion (to broader and to related concepts). Reasoning over the semantic relationships in the thesaurus permits imprecise matching between query and index terms. This allows the ranking of matching items in a result list or a ‘More like this’ option for similar but not necessarily identically indexed items. Results are ranked based on measures of semantic closeness. Semantic closeness is based on the minimum number of transitive relationships that must be traversed in order to connect any two distinct thesaurus concepts.
A web-based demonstrator was developed, illustrating many of the techniques used in the standalone FACET system. This ran until March 2017, when the underlying server was decommissioned. However, the online JoDI paper gives a flavour of the application.
Query expansion via conceptual distance in thesaurus indexed collections (author version)
Tudhope, D., Binding, C., Blocks, D. & Cunliffe, D. 2006. Journal of Documentation, 62(4), 509-533.
A reference model for user-system interaction in thesaurus-based searching (author version)
Blocks, D., Cunliffe, D. & Tudhope, D. 2006. Journal of the American Society for Information Science and Technology, 57(12), 1655-1665.
KOS at your Service: Programmatic Access to Knowledge Organisation Systems (open access)
Binding, C. & Tudhope, D. 2004. Journal of Digital Information, 4(4).