Integrating Heterogeneous Search Engines
Position Paper for the W3C Distributed Indexing/Searching Workshop
Gary Adams and W. A. Woods,
Sun Microsystems Laboratories, Chelmsford, Mass.
contact: Gary.Adams@East.Sun.Com, William.Woods@East.Sun.Com
Introduction
Integrating heterogeneous search engines will require protocols for
communicating with search engines about their capabilities and for
reporting, in result lists, which scoring method was used and
what constitutes a hit.
The growing diversity of search methods poses interesting challenges
to integration that can be addressed if there are sufficiently
expressive protocols.
For example, the Conceptual Indexing System being developed at
Sun Microsystems Laboratories is a concept matching engine
that reports a penalty-based score for dynamically identified
text passages.
In this dynamic passage retrieval system, scores are assigned
to regions of text determined at query time, based on groupings of
query terms or conceptually related terms. This differs from
document retrieval, which generates scores for entire
documents, and from static passage retrieval, which
identifies rankable passages at indexing time. Integrating this
system with a traditional system requires a way to identify dynamic
passages and a way to know that smaller penalty scores are better.
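A minimal sketch of what "knowing that smaller penalty scores are better" could mean for an integrator follows. It assumes each hit carries a score type label, orders each engine's hits best-first according to that label, and then interleaves the lists by rank. The `Hit` class, the `LOWER_IS_BETTER` set, the locators, and the round-robin merge are all illustrative assumptions, not part of the Conceptual Indexing System or of any proposed protocol.

```python
from dataclasses import dataclass
from itertools import zip_longest

# Assumption: score types for which a smaller value means a better hit.
LOWER_IS_BETTER = {"PENALTY"}


@dataclass(frozen=True)
class Hit:
    locator: str      # hypothetical identifier for a document or passage
    score: float
    score_type: str


def best_first(hits):
    """Sort one engine's hits so the best hit comes first."""
    def key(hit):
        # Negate scores where larger is better so one ascending sort serves both.
        return hit.score if hit.score_type in LOWER_IS_BETTER else -hit.score
    return sorted(hits, key=key)


def interleave(*result_lists):
    """Merge by rank (round-robin), since raw scores are not comparable across engines."""
    ranked = [best_first(hits) for hits in result_lists]
    merged = []
    for tier in zip_longest(*ranked):
        merged.extend(hit for hit in tier if hit is not None)
    return merged


passages = [Hit("doc-b#120-300", 0.40, "PENALTY"), Hit("doc-a#1736-1895", 0.01, "PENALTY")]
documents = [Hit("doc-c", 0.35, "TWIDF"), Hit("doc-d", 0.90, "TWIDF")]
merged = interleave(passages, documents)
```

Interleaving by rank sidesteps the question of how to compare a penalty with a relevance weight; a system that wanted a single combined score would need more information about each scoring method than a type label provides.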
Negotiating about Engine Capabilities and Reporting Results
A multi-engine search system may want to interrogate a search engine
to determine its capabilities or to negotiate with the engine about
what information the engine should return. For example, it may want to
determine whether a given engine supports a proximity operator and, for
engines that do not, pass the results through a postprocessing filter. A system that
integrates heterogeneous results may want to ask a search engine to
report the following kinds of information for each returned hit, if
available:
- what score was assigned by the engine
- what scoring method was used
- what query terms were matched
- what were the corresponding hit terms (which may be related, but different)
- what were the term frequencies, if known
- what were their positions, if known
- what is the size of the document
- what is the size of the collection (number of documents)
- what are the document frequencies of the terms in the collection, if known
- what are the word frequencies of the terms, if known
- what are the positions (ranges) of hits within the document
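The proximity example above can be sketched as follows, under the assumption that an engine without a proximity operator can still report term positions for each hit. The capability name `PROXIMITY`, the `Hit` class, and the word-offset representation of positions are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    locator: str
    # Matched term -> word positions; left empty when the engine reports none.
    positions: dict = field(default_factory=dict)


def within_window(hit, term_a, term_b, window):
    """True if some occurrence of term_a is within `window` words of term_b."""
    return any(
        abs(a - b) <= window
        for a in hit.positions.get(term_a, [])
        for b in hit.positions.get(term_b, [])
    )


def proximity_search(capabilities, hits, term_a, term_b, window):
    """Apply the proximity constraint here only if the engine could not."""
    if "PROXIMITY" in capabilities:
        # Assumption: the engine already enforced the operator itself.
        return list(hits)
    return [h for h in hits if within_window(h, term_a, term_b, window)]


hits = [
    Hit("doc-1", {"search": [4, 90], "engine": [6]}),
    Hit("doc-2", {"search": [10], "engine": [200]}),
    Hit("doc-3"),  # the engine reported no positions for this hit
]
kept = proximity_search({"POSITIONS"}, hits, "search", "engine", 5)
```

In this sketch a hit without position information is dropped by the filter. An integrator could instead pass such hits through unfiltered or fetch the document and compute positions itself, which is one reason to ask the engine up front what it can report.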
One could use SOIF (the Summary Object Interchange Format used by
Harvest) notation to request such information. For example, the
following might be used to specify desired capabilities, and a similar
format could be used to report available capabilities:
@CAPABILITIES-REQUEST { labboot:9112
POSITIONS{1}: Y
SCORES{1}: Y
WORD-FREQUENCIES{1}: Y
SCORE-TYPE{33}: TWIDF,IDF,PROB,WORD-COUNT,PENALTY}
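As a rough sketch, a request of this shape could be generated programmatically so that the size in braces always matches the value. The function name `soif_object` is hypothetical, and the output follows the layout of the example above (single-line values, closing brace on the last line) rather than claiming conformance to the full SOIF grammar in the Harvest manual.

```python
def soif_object(template, url, attributes):
    """Render one object in the layout used by this paper's examples."""
    lines = [f"@{template} {{ {url}"]
    for name, value in attributes.items():
        # The number in braces is the size of the value in bytes.
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}: {value}")
    return "\n".join(lines) + "}"


request = soif_object(
    "CAPABILITIES-REQUEST",
    "labboot:9112",
    {
        "POSITIONS": "Y",
        "SCORES": "Y",
        "WORD-FREQUENCIES": "Y",
        "SCORE-TYPE": "TWIDF,IDF,PROB,WORD-COUNT,PENALTY",
    },
)
print(request)
```

Computing the sizes rather than typing them by hand avoids mismatches between the declared size and the actual value, which a size-driven reader would otherwise misparse.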
Returning a result list as a collection of SOIF objects would give a
way to encode collateral information about results. For example, the
following could be a passage retrieval result:
@DPASSAGE { http://www.sunlabs.com/
SCORE{3}: .01
SCORE-TYPE{7}: PENALTY
PASSAGE-REGION{11}: 01736,01895
HIGHLIGHT-REGIONS{23}: 01754,01799 01804,01815}
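An integrator receiving such an object would need to read it back. The sketch below parses the layout of the example above, using the declared size to delimit each value, and converts the region attributes into pairs of integers. The names `parse_soif` and `parse_regions` are hypothetical, values are assumed to fit on one line, and the interpretation of a region as a start and end offset is an assumption based on the example.

```python
import re

HEADER = re.compile(r"@(\S+)\s*\{\s*(\S+)")
ATTRIBUTE = re.compile(r"^([A-Za-z0-9-]+)\{(\d+)\}: ")


def parse_soif(text):
    """Parse one object laid out like the example above (single-line values)."""
    header, *body = text.strip().splitlines()
    template, url = HEADER.match(header).groups()
    attributes = {}
    for line in body:
        match = ATTRIBUTE.match(line)
        if match is None:
            continue
        size = int(match.group(2))
        # Slice by the declared byte size so the trailing "}" is not included.
        raw = line[match.end():].encode("utf-8")[:size]
        attributes[match.group(1)] = raw.decode("utf-8")
    return template, url, attributes


def parse_regions(value):
    """Turn '01754,01799 01804,01815' into [(1754, 1799), (1804, 1815)]."""
    return [tuple(int(n) for n in pair.split(",")) for pair in value.split()]


example = """@DPASSAGE { http://www.sunlabs.com/
SCORE{3}: .01
SCORE-TYPE{7}: PENALTY
PASSAGE-REGION{11}: 01736,01895
HIGHLIGHT-REGIONS{23}: 01754,01799 01804,01815}"""

template, url, attrs = parse_soif(example)
passage = parse_regions(attrs["PASSAGE-REGION"])[0]
highlights = parse_regions(attrs["HIGHLIGHT-REGIONS"])
```

With the score type available next to the score, the caller can decide how to rank the hit, for instance treating a PENALTY of .01 as a strong match.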
References
- Darren R. Hardy, Michael F. Schwartz, and Duane Wessels,
Harvest User's Manual,
U. Colorado,
January 31, 1996.
- EARN Staff,
Request For Comments 1580,
"Guide to Network Resource Tools",
EARN Association,
March 1994.
- J. Foster, ed.,
Request For Comments 1689,
"A Status Report on Networked Information Retrieval:
Tools and Groups",
University of Newcastle,
August 1994.
- Conceptual Indexing Fiscal 1995 Project Portfolio Report,
Sun Microsystems Laboratories,
November 1995.
- Sun Microsystems Laboratories Knowledge Technology Group
-- Conceptual Indexing Project home page,
Sun Microsystems Laboratories,
February 1996.
This page is part of the DISW 96 workshop.
Last modified: Tue Jul 9 17:19:02 EST 1996.