The Query Routing and Engine ID Breakout Session at DISW '96 was
attended by a good cross-section of the online search industry. The
discussion centered mostly on Query Routing, then moved to Engine
Identification, and at the end touched on Result Merging Strategies.
Among the successes of the workshop was the creation of an informal
working group which will explore various implementations of query
routing.
Two sets of problems were addressed, divided into two broad
categories:
Where to Search and
How to Search.
There were also a few topics, notably Query Refinement,
that did not fit easily into this simple categorization and weren't
discussed in the meeting.
Easy tools to create centroids
Ron Daniel <rdanial@lanl.gov> will be doing
some work with lq-text in the near future.
Medium (end of 1996) and Long Term Goals
Prototype Implementation
Erik Selberg of UW has offered (threatened? :) to co-ordinate an
implementation of query routing on the MetaCrawler web search service
(http://www.metacrawler.com). The idea is to implement the RFC and the
suggested modifications and see what problems arise. The first step is
to identify other interested parties.
Apart from the technological issues of defining and implementing an
efficient query routing standard, it is of paramount importance to consider
the issues facing existing search engine systems (e.g. Verity, Fulcrum) which
have not been designed with centroids in mind. That is, what tools
and techniques will be necessary in order to allow existing systems to
generate effective standard centroid-information?
Some answers can be drawn from existing work with centroid-passing software,
such as Bunyip's Digger software, which implements the Whois++ protocol.
This is currently being used for meshes of white pages (people) data, and
there are other projects underway for using this with corpora
of other template-based data (e.g., metadata).
Potential problems arise when engines use stemming. Some agreement will
be needed on the format of Centroids built from stemmed terms.
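The stemming problem can be illustrated with a minimal sketch. The suffix stripper below is a toy stand-in (an assumption for illustration, not any engine's actual stemmer), and the function names are hypothetical. It shows that a client gets the wrong answer if it does not know whether a Centroid holds stems or full words.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy stand-in, not a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup(centroid, term, stemmed):
    """Look up a query term, stemming it first if the centroid holds stems."""
    key = toy_stem(term) if stemmed else term
    return centroid.get(key, 0)


# A hypothetical centroid published by an engine that stems its index.
stemmed_centroid = {"rout": 3}

# The client must apply the same stemming to find the entry.
hit = lookup(stemmed_centroid, "routing", stemmed=True)
miss = lookup(stemmed_centroid, "routing", stemmed=False)
```

Here `hit` finds the entry and `miss` does not, even though both ask about the same word. This is why the Centroid format would need to state whether, and how, terms were stemmed.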
One question which needs to be explored is the performance of
Centroids: how big are they, and how much CPU time do they take to
compute on the server's end?
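One way to start answering these questions is simply to measure a candidate implementation. The sketch below is an assumption-laden illustration: the function name `measure_centroid`, the whitespace tokenizer, and the one-entry-per-line serialization are all hypothetical choices, not part of any agreed format.

```python
import time


def measure_centroid(documents):
    """Build a word -> document-count centroid and report its size and CPU cost."""
    start = time.process_time()
    centroid = {}
    for text in documents:
        # set() ensures a word is counted once per document.
        for word in set(text.lower().split()):
            centroid[word] = centroid.get(word, 0) + 1
    cpu_seconds = time.process_time() - start

    # Size is measured on a simple "<word> <count>" line-per-entry rendering.
    serialized = "\n".join(f"{w} {n}" for w, n in sorted(centroid.items()))
    return {
        "entries": len(centroid),
        "bytes": len(serialized.encode("utf-8")),
        "cpu_seconds": cpu_seconds,
    }


stats = measure_centroid(["a b", "b c"])
```

Running this over collections of increasing size would give a first picture of how Centroid size and build time scale with the corpus.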
There are significant research efforts already underway to explore the
scaling issues associated with the use of centroids in query routing. One
example is the NSF/CNIDR/MCNC supported University of California systemwide
Whois++ testbed (http://www.ucdavis.edu/whoisplus/). Another example is
the GlOSS project at Stanford (http://gloss.stanford.edu).
Related Standards / Scope / References
Based on RFC 1913 (WHOIS++ syntax) and RFC 1914 (WHOIS++ interact)
These aren't general enough for non-text databases (e.g. image
databases), so we should be open to further expansion.
Headers
Need to expand 1913/14 for databases (defined headers are
for white pages and aren't general enough)
Stemming could be tricky
Comments
Data
"Enough for clients to rank services"
<word> <coll freq> (coll freq = # of docs in coll that contain word)
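A minimal sketch of this data follows, assuming a simple lowercase alphanumeric tokenizer. The function names and the ranking rule (sum of collection frequencies of the query words) are hypothetical; the session only agreed that the data should be enough for clients to rank services, not how the ranking is done.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_centroid(documents):
    """Map each word to the number of documents in the collection containing it."""
    centroid = Counter()
    for text in documents:
        # set() counts a word once per document, giving collection frequency.
        centroid.update(set(tokenize(text)))
    return centroid


def format_centroid(centroid):
    """Render the centroid as '<word> <coll freq>' lines, sorted by word."""
    return "\n".join(f"{word} {freq}" for word, freq in sorted(centroid.items()))


def rank_services(query, centroids):
    """Order service names by summed collection frequency of the query words."""
    terms = tokenize(query)
    scores = {name: sum(c.get(t, 0) for t in terms) for name, c in centroids.items()}
    # Highest score first; ties broken alphabetically by service name.
    return sorted(scores, key=lambda name: (-scores[name], name))


centroids = {
    "a": build_centroid(["query routing with centroids", "centroids for routing"]),
    "b": build_centroid(["image databases"]),
}
ranking = rank_services("routing centroids", centroids)
```

With these two toy collections, service "a" ranks ahead of "b" for the query, since only "a" contains the query words.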
Transport
Patrik said RFC 1913/4 had stuff for this, so we'll use that
Alternatives to Centroid Model
We touched briefly on alternatives, but didn't go into any
details. Below are two ideas which got over a sentence's worth of
discussion. Mike - this is pretty much all I intend to write for
these; they're here more for completeness' sake than anything
else. Feel free to scrap 'em. -E
How does a Client Interact with appropriate Servers?
Related Standards / Scope / References
FIND folks Query Routing model
Resource Discovery?
KQML - Knowledge Query and Manipulation Language, out of UMBC, which
is used with agents in the AI arena.
Required Work
We spent a lot of time discussing what problem we were actually trying
to address. In the end, we determined that a Client needs to obtain the
information necessary to communicate with a Server, formulate queries, and
interpret results. This breaks the problem into two parts: Definition of a
Data Structure which contains this information, and Transport of this Data
Structure from the Engine / Engine Provider to the Client.
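The two parts can be sketched as follows. Everything here is an assumption for illustration: the field names, the example values, and the use of JSON as the transport encoding are hypothetical and were not decided in the session.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EngineDescription:
    """Hypothetical record of what a Client needs to know about a Server."""
    name: str            # label for the engine
    host: str            # where to connect (communicate with the Server)
    port: int
    query_syntax: str    # how to formulate queries
    result_format: str   # how to interpret results


def to_wire(description):
    """Encode the description for transport (JSON chosen only for illustration)."""
    return json.dumps(asdict(description), sort_keys=True)


def from_wire(payload):
    """Decode a transported description on the Client side."""
    return EngineDescription(**json.loads(payload))


original = EngineDescription(
    name="example-engine",
    host="search.example.org",
    port=8080,
    query_syntax="boolean",
    result_format="ranked-list",
)
received = from_wire(to_wire(original))
```

Keeping the structure definition separate from the encoding mirrors the split above: the fields could be agreed on first, and a different transport substituted later without changing them.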