We have also applied these natural language processing methods to the problems of information retrieval and organization. We are now focussing our attention on the Web as a textual resource of vast breadth and depth.
The Lycos project at Carnegie Mellon is in its early stages: we have a Web explorer in operation, and our indexer will come on-line later this month. We will use the SCOUT indexer, which has an HTTP gateway (a sample database of the Tipster corpus from the Wall Street Journal is available intermittently from the Experimental SCOUT server).
Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs. It uses a random search and keeps its record of URLs visited in a Perl associative array stored in DBM (thanks to Charlie Stross for the tip that GNU DBM doesn't have arbitrary size limits!). It searches HTTP, FTP, and GOPHER sites, ignoring TELNET, MAILTO, and WAIS. Lycos uses a data reduction scheme to limit the information stored about each document, described further below.
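As a rough illustration of this bookkeeping, the fragment below is a minimal sketch of our own, not the actual Lycos code; the file name and subroutine names are invented. It keeps the visited-URL record in an associative array bound to a DBM file, so the explorer can stop and restart without losing its history:

    #!/usr/bin/perl
    # Minimal sketch (not the Lycos source): remember which URLs have
    # been visited by keeping them in an associative array bound to a
    # DBM file on disk.
    my %visited;
    dbmopen(%visited, "visited", 0644) || die "cannot open DBM file: $!";

    # Schemes the explorer follows; telnet, mailto, and wais are skipped.
    my %searchable = (http => 1, ftp => 1, gopher => 1);

    # Returns 1 if $url should be fetched, and marks it as visited.
    sub want_url {
        my ($url) = @_;
        my ($scheme) = $url =~ m|^([a-z]+):|;
        return 0 unless defined($scheme) && $searchable{$scheme};
        return 0 if $visited{$url};       # already explored
        $visited{$url} = time;            # record when we saw it
        return 1;
    }

    print "fetching\n" if want_url("http://info.cern.ch/");
    dbmclose(%visited);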
Lycos searches the entire Web, not just HTTP servers. So far, the breakdown of URLs found, by access scheme, is:
    142132  http
    102910  ftp
     84143  gopher
      4314  news
      1396  telnet
       379  mailto
       244  wais
        13  rlogin
Citation counting (number of "parents" by URL): these are the first 20 URLs, sorted by the number of documents that reference each one (a small counting sketch follows the list). What I did not do was count only references from different sites (I'm sure that 99% of the references to http://gdbwww.gdb.org/omim come from the Genome Database server itself).
    1703  http://gdbwww.gdb.org/omim/
    1578  http://cossack.cosmic.uga.edu/keywords.html
     692  ftp://ftp.network.com/IPSEC/rfcindex4.html
     421  ftp://ftp.network.com/IPSEC/rfcindex3.html
     322  ftp://ftp.network.com/IPSEC/rfcauthor.html
     319  ftp://ftp.network.com/IPSEC/rfcindex5.html
     234  ftp://ftp.network.com/IPSEC/rfcindex2.html
     202  ftp://ftp.network.com/IPSEC/rfcindex1.html
     177  http://info.cern.ch/hypertext/WWW/TheProject.html
     166  http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html
     135  http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/MetaIndex.html
     133  http://www.cs.columbia.edu/~radev/
     133  http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html
     118  http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html
     108  http://www.mcs.anl.gov/home/gropp/
     107  http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
     105  http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html
     101  http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html
      85  http://cui_www.unige.ch/w3catalog
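The counting itself is simple; the following sketch (ours, not taken from the Lycos sources) shows how parent counts like those above can be produced from lines of the form "parent-URL child-URL":

    # Sketch: read "parent_url child_url" pairs on standard input, count
    # how many documents reference each child URL, and print the most
    # frequently cited URLs first.  (Counting only references from
    # different sites would mean keying on the parent's host instead of
    # counting every citing document.)
    while (<STDIN>) {
        my ($parent, $child) = split;
        $parents{$child}++;          # one more document cites $child
    }
    for my $url (sort { $parents{$b} <=> $parents{$a} } keys %parents) {
        printf "%6d %s\n", $parents{$url}, $url;
    }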
The Lycos philosophy is to keep a finite model of the web that enables subsequent searches to proceed more rapidly. The idea is to prune the ``tree'' of documents and to represent the clipped ends with a summary of the documents found under that node. The lists of the 100 most important words from several documents can be combined to produce a list of the 100 most important words in the set of documents.
Alternative fixed representations of documents or document sets include vector models such as those of Dumais at Bellcore [Deerwester et al. 90, Dumais et al. 94] and of Gallant & Caid at Hecht-Nielsen Corporation [Gallant et al. 94]. The number 100 was chosen arbitrarily, so we will need to investigate whether that number is too high or too low.
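To make the summary idea concrete, the sketch below is our own approximation (it assumes an importance weight has already been assigned to each word); it merges several per-document lists of the 100 most important words into a single top-100 list for the combined node:

    # Sketch: each document summary is a hash mapping a word to its
    # importance weight.  Merging sums the weights across documents
    # and keeps the 100 highest-weighted words for the combined node.
    sub merge_top100 {
        my @summaries = @_;               # list of hash references
        my %total;
        for my $doc (@summaries) {
            $total{$_} += $doc->{$_} for keys %$doc;
        }
        my @top = sort { $total{$b} <=> $total{$a} } keys %total;
        splice(@top, 100) if @top > 100;  # keep only the best 100
        return map { $_ => $total{$_} } @top;
    }

    # Usage: %node_summary = merge_top100(\%doc1_words, \%doc2_words);

The cutoff of 100 in this merge step is the parameter we would vary to test whether that number is too high or too low.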
One perhaps unique aspect of Lycos is that it uses the text in the parent document to describe the unexplored links (currently, the highlighted text from each anchor is associated with the URL for that anchor). In this way, Lycos can retrieve documents that it has not even searched.
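The sketch below (again our own approximation, not the Lycos code itself) shows the essence of this idea: the highlighted text of each anchor in a parent page is filed under the anchor's URL, so that URL can later be retrieved by its description:

    # Sketch: collect the highlighted text of each anchor in a parent
    # document and file it under the URL that anchor points to.
    my %description;   # URL => descriptive text gathered from parents

    sub note_anchors {
        my ($html) = @_;
        while ($html =~ m{<a\s[^>]*href\s*=\s*"([^"]+)"[^>]*>(.*?)</a>}sgi) {
            my ($url, $text) = ($1, $2);
            $text =~ s/<[^>]+>//g;            # drop markup inside the anchor
            $text =~ s/\s+/ /g;               # normalize whitespace
            $description{$url} .= " $text";   # accumulate across parents
        }
    }

A query can then be matched against these descriptions even for URLs that have never been fetched.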
Once the exploration and retrieval functions are in place, we will turn to experiments in ``best-first search,'' to be able to process on-demand queries by searching the web in a knowledge-directed way.
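A rough sketch of what such a best-first search might look like is given below; the scoring and link-fetching routines are caller-supplied placeholders we assume for illustration, not existing Lycos functions:

    # Sketch: knowledge-directed best-first search over the web.
    # $score rates a link's anchor text against the query; $fetch
    # returns [url, anchor-text] pairs for the links in a document.
    # Both are caller-supplied subroutine references.
    sub best_first {
        my ($start, $score, $fetch, $limit) = @_;
        my %frontier = ($start => 1);     # URL => estimated relevance
        my (%visited, @results);

        while (%frontier && @results < $limit) {
            # always expand the most promising unexplored URL
            my ($best) = sort { $frontier{$b} <=> $frontier{$a} } keys %frontier;
            my $value = delete $frontier{$best};
            next if $visited{$best}++;
            push @results, [$best, $value];

            for my $link ($fetch->($best)) {
                my ($url, $text) = @$link;
                next if $visited{$url};
                my $estimate = $score->($text);
                $frontier{$url} = $estimate
                    if !defined $frontier{$url} || $estimate > $frontier{$url};
            }
        }
        return @results;                  # [URL, score] pairs, in visit order
    }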
The WebAnts project, also in its early stages, is progressing with cooperation in mind. Like Lycos, this project has a clear preference for explorer-based schemes over those that require cooperation from each information server [Flater & Yesha 93, Koster 94a, Koster 94b]. There are several motivating issues behind this project. First, information discovery on the web (including gopher-space and ftp-space) is now, and will remain, too large a task for any single agent. Given that there are currently several thousand web and gopher servers and an order of magnitude more ftp servers, and that these numbers are growing, the scale is too great for a single explorer agent to be effective.
Second, it is undesirable to rely on a single site for such services, as this would create a bottleneck. Creating a single explorer site does nothing to solve the fan-in, fan-out nature of the information discovery problem, but rather exacerbates it, making that one site a bottleneck for all users.
Finally, it is undesirable for multiple explorers to examine the same sites [Eichmann et al. 94, Koster 94c]. This situation would be worse than relying on a single information server, since it is certain that net traffic would be higher and it is uncertain that better results would be achieved. This is, unfortunately, almost what we have now. To search for information on a general topic, a user cannot rely on a single search engine, since none of them explore everything.
To address these issues, the WebAnts project is developing a cooperative explorer agent; that is, an explorer (or ant) which will share results with other ants and not duplicate their efforts. Another system that employs multiple agents is the WebCrawler [Pinkerton 94], although it is unclear how distributed those agents are. The WebAnts model could be used for either searching or indexing; at the moment we are concentrating on searching, until we have settled on a design for the index.
For searching, we expect this cooperation to be very effective, in that the different ants may be directed based on others' results. If one ant finds a document that satisfies the search criteria, it can share the references from that document with any other ants that are not currently exploring hits of their own. And as each ant explores a document, it lets other ants know, so that they do not have to examine the same document.
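The sketch below gives the flavor of this cooperation; the message formats and names are invented for illustration and do not describe a settled WebAnts protocol:

    # Sketch: each ant announces which documents it has claimed and
    # shares the references found in documents that satisfy the
    # search criteria.
    my %claimed;      # URL => name of the ant exploring it
    my @to_explore;   # URLs this ant should consider next

    sub handle_message {
        my ($msg) = @_;
        if ($msg =~ /^CLAIM (\S+) (\S+)$/) {
            # ant $2 is exploring URL $1; don't duplicate its effort
            $claimed{$1} = $2;
        }
        elsif ($msg =~ /^HIT (\S+) (.*)$/) {
            # another ant found a relevant document at $1; adopt any of
            # its references that no ant has claimed yet
            push @to_explore, grep { !$claimed{$_} } split ' ', $2;
        }
    }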
For indexing, this cooperation allows each indexer to conserve resources, and also to distribute the indexing load between different explorers. Each index server would provide all the information gathered by one of the ants during exploration. When querying, a user could restrict the query to the local ant or allow it to propagate to the entire colony. While this does not solve the fan-in, fan-out problem at the macro level, it does serve to reduce the bottleneck effect. This is a simpler version of the Übernet used by ALIBI [Flater & Yesha 93].
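A colony query might then look roughly like the following sketch, in which the server list and the per-server query interface are assumptions made for illustration:

    # Sketch: restrict a query to the local ant's index or fan it out
    # to the whole colony, then merge the per-server result lists.
    # $ask_server is a caller-supplied subroutine reference that sends
    # the query to one index server and returns (URL => score) pairs.
    sub colony_query {
        my ($query, $scope, $ask_server, @servers) = @_;
        my @targets = $scope eq 'local' ? ($servers[0]) : @servers;
        my %best;
        for my $server (@targets) {
            my %hits = $ask_server->($server, $query);
            for my $url (keys %hits) {
                $best{$url} = $hits{$url}
                    if !defined $best{$url} || $hits{$url} > $best{$url};
            }
        }
        return sort { $best{$b} <=> $best{$a} } keys %best;
    }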
One issue that the WebAnts project does not yet entirely address, but which any indexer must eventually deal with, is that the size of an index will grow in proportion to the size of the web. The storage, retrieval, and distribution of information on this scale will no doubt prove a compelling challenge in the coming years.
We categorize Web agents into two broad classes: those that ``pull'' information out, such as the WebCrawler and Lycos, and those that ``push'' information back out, of which the best example is fictional. In Cold as Ice [Sheffield 92], Sheffield describes ``faxes,'' which can be described as expert systems that oversee experiments, answer questions, and take over many of the functions of secretaries and assistants.
It is this second kind of agent that we hope to develop in the future, building on our success with TinyMUD agents such as Julia [Mauldin 94]. Instead of looking for answers, a proxy agent might read newsgroups looking for questions relevant to its owner's areas of interest, might send email to people working on a problem to let them know that its owner has several publications that might be relevant, and might sort, prioritize, and even answer email.