Integrating Heterogeneous Search Engines

Position Paper for the
W3C Distributed Indexing/Searching Workshop

Gary Adams and W. A. Woods, Sun Microsystems Laboratories, Chelmsford, Mass.
contact: Gary.Adams@East.Sun.Com, William.Woods@East.Sun.Com

Introduction

Integrating heterogeneous search engines will require protocols for communicating with search engines about their capabilities and for reporting, in result lists, which scoring method was used and what constitutes a hit. The growing diversity of search methods poses challenges to integration that can be addressed if the protocols are sufficiently expressive. For example, the Conceptual Indexing System being developed at Sun Microsystems Laboratories is a concept matching engine that reports a penalty-based score for dynamically identified text passages. In this dynamic passage retrieval system, scores are assigned to regions of text determined at query time, based on groupings of query terms or conceptually related terms. This differs from document retrieval, which generates scores for entire documents, and from static passage retrieval, which identifies rankable passages at indexing time. Integrating this system with a traditional system requires a way to identify dynamic passages and a way to indicate that smaller penalty scores are better.
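The need to report score direction can be illustrated with a minimal sketch. The Hit record, the HIGHER_IS_BETTER table, and the function name are hypothetical and are not part of the Conceptual Indexing System. Only the direction of PENALTY scores is stated in this paper; the other entries are assumptions. The sketch orders hits within a single engine's result list and does not attempt to compare raw scores across scoring methods.

```python
from dataclasses import dataclass

# Direction of each scoring method. Only PENALTY (smaller is better) comes
# from the paper; the other entries are assumptions for illustration.
HIGHER_IS_BETTER = {"PENALTY": False, "TWIDF": True, "PROB": True}


@dataclass
class Hit:
    url: str
    score: float
    score_type: str


def best_first(hits):
    """Sort one engine's hits best-first, respecting the score direction."""
    def key(hit):
        # Negate scores where larger is better so an ascending sort works.
        return -hit.score if HIGHER_IS_BETTER[hit.score_type] else hit.score
    return sorted(hits, key=key)
```

Without the SCORE-TYPE information, an integrating system would have no way to tell that a penalty of .01 should rank ahead of a penalty of .5.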

Negotiating about Engine Capabilities and Reporting Results

A multi-engine search system may want to interrogate a search engine to determine its capabilities or to negotiate with the engine about what information it should return. For example, it may want to determine whether a given engine supports a proximity operator and, for engines that do not, pass the results through a postprocessing filter. A system that integrates heterogeneous results may want to ask a search engine to report the following kinds of information for each returned hit, if available:

what score was assigned by the engine
what scoring method was used
what query terms were matched
what were the corresponding hit terms (which may be related, but different)
what were the term frequencies, if known
what were their positions, if known
what is the size of the document
what is the size of the collection (number of documents)
what are the document frequencies of the terms in the collection, if known
what are the word frequencies of the terms, if known
what are the positions (ranges) of hits within the document
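The proximity example above can be sketched as a postprocessing step that uses reported term positions. This is an illustration under stated assumptions: the capability name PROXIMITY, the representation of a hit as a mapping from matched terms to word positions, and both function names are hypothetical.

```python
def within_proximity(positions_a, positions_b, max_distance):
    """True if some occurrence of term A is within max_distance words of term B."""
    return any(abs(a - b) <= max_distance
               for a in positions_a
               for b in positions_b)


def apply_proximity(hits, capabilities, term_a, term_b, max_distance):
    """Filter hits locally only when the engine could not do it itself."""
    if capabilities.get("PROXIMITY") == "Y":
        return list(hits)  # the engine already enforced the operator
    return [hit for hit in hits
            if within_proximity(hit["positions"].get(term_a, []),
                                hit["positions"].get(term_b, []),
                                max_distance)]
```

A filter of this kind is only possible if the engine reports term positions, which is why the integrating system needs to ask for them.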

One could use SOIF (Summary Object Interchange Format, the notation used by Harvest) to make such requests. For example, the following might be used to specify desired capabilities, and a similar format could be used to report available capabilities:

@CAPABILITIES-REQUEST { labboot:9112
POSITIONS{1}:	Y
SCORES{1}:	Y
WORD-FREQUENCIES{1}:	Y
SCORE-TYPE{33}:	TWIDF,IDF,PROB,WORD-COUNT,PENALTY}
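In the request above, the number in braces after each attribute name is the length of the value in bytes. A minimal sketch of producing such an object follows. The function name soif_object is hypothetical, UTF-8 encoding is an assumption, and the code is not taken from Harvest or from any existing SOIF library.

```python
def soif_object(template, url, attributes):
    """Serialize (name, value) pairs as a SOIF-style object.

    The size in braces is the byte length of the value, matching the
    examples in this paper. UTF-8 is assumed for counting bytes.
    """
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes:
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    # The closing brace follows the last value, as in the paper's examples.
    return "\n".join(lines) + "}"


request = soif_object(
    "CAPABILITIES-REQUEST",
    "labboot:9112",
    [("POSITIONS", "Y"),
     ("SCORES", "Y"),
     ("WORD-FREQUENCIES", "Y"),
     ("SCORE-TYPE", "TWIDF,IDF,PROB,WORD-COUNT,PENALTY")],
)
```

Computing the sizes rather than typing them by hand avoids mismatches between the declared and actual value lengths.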

Returning a result list as a collection of SOIF objects would give a way to encode collateral information about results. For example, the following could be a passage retrieval result:

@DPASSAGE { http://www.sunlabs.com/
SCORE{3}:	.01
SCORE-TYPE{7}:	PENALTY
PASSAGE-REGION{11}:	01736,01895
HIGHLIGHT-REGIONS{23}:	01754,01799 01804,01815}
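A consumer of such a result could recover the score and the regions roughly as follows. This is a sketch, not a complete SOIF parser: it assumes ASCII text (so character counts equal byte counts) and a single object per string, and the function names are hypothetical. The paper does not say whether region offsets count bytes or characters, so the code simply returns them as integers.

```python
import re

HEADER = re.compile(r"@(\S+)\s*\{\s*(\S+)\s*\n")
ATTRIBUTE = re.compile(r"([^\s{}]+)\{(\d+)\}:\t")


def parse_soif(text):
    """Parse one SOIF-style object into (template, url, attributes)."""
    header = HEADER.match(text)
    if header is None:
        raise ValueError("not a SOIF object")
    template, url = header.groups()
    attributes = {}
    pos = header.end()
    while True:
        match = ATTRIBUTE.match(text, pos)
        if match is None:
            break  # reached the closing brace or the end of the input
        size = int(match.group(2))
        start = match.end()
        # Read exactly `size` characters; ASCII input is assumed.
        attributes[match.group(1)] = text[start:start + size]
        pos = start + size
        if text.startswith("\n", pos):
            pos += 1  # skip the newline separating attributes
    return template, url, attributes


def parse_regions(value):
    """Turn 'start,end start,end' into a list of (start, end) integer pairs."""
    return [tuple(int(n) for n in pair.split(",")) for pair in value.split()]
```

Applied to the DPASSAGE example above, parse_soif yields the SCORE, SCORE-TYPE, PASSAGE-REGION, and HIGHLIGHT-REGIONS values as strings, and parse_regions converts the two region attributes into integer ranges.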

References

Darren R. Hardy, Michael F. Schwartz, and Duane Wessels, Harvest User's Manual, U. Colorado, January 31, 1996.
EARN Staff, Request For Comments 1580, "Guide to Network Resource Tools", EARN Association, March 1994.
J. Foster, ed., Request For Comments 1689, "A Status Report on Networked Information Retrieval: Tools and Groups", University of Newcastle, August 1994.
Conceptual Indexing Fiscal 1995 Project Portfolio Report, Sun Microsystems Laboratories, November 1995.
Sun Microsystems Laboratories Knowledge Technology Group -- Conceptual Indexing Project home page, Sun Microsystems Laboratories, February 1996.


This page is part of the DISW 96 workshop.
Last modified: Tue Jul 9 17:19:02 EST 1996.