Posted to robots@nexor.co.uk on June 6, 1994
The Lycos project at Carnegie Mellon is in its early stages. We have a Web explorer in operation, and our indexer will come on-line later this month. We will use the SCOUT indexer, which has an HTTP gateway (a sample database of the Tipster corpus from the Wall Street Journal is available intermittently from the Experimental SCOUT server).
Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs. It uses a random search and keeps its record of URLs visited in a Perl associative array stored in DBM (thanks to Charlie Stross for the tip that GNU DBM doesn't have arbitrary limits!). It searches HTTP, FTP, and GOPHER sites, ignoring TELNET, MAILTO, and WAIS. Lycos uses a data reduction scheme to reduce the stored information about each document:
Lycos keeps a word frequency count as it runs; so far it has read over 25 million words. A list of the most frequent words found after searching 6.3 million words is available off the Lycos home page.
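A running word frequency count of this kind can be sketched roughly as follows. This is a minimal Python illustration, not the Lycos Perl code: the tokenizer (lowercased runs of letters), the function name update_counts, and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercased runs of ASCII letters. The post does not say
# how Lycos actually splits text into words.
WORD_RE = re.compile(r"[a-z]+")


def update_counts(counts: Counter, text: str) -> int:
    """Add the words of one document to the running tally.

    Returns the number of words read from this document.
    """
    words = WORD_RE.findall(text.lower())
    counts.update(words)
    return len(words)


counts = Counter()
total = 0
# Two made-up documents standing in for fetched pages.
for doc in ("The Web is a web of documents.", "Documents cite other documents."):
    total += update_counts(counts, doc)

print(total)                   # total words read so far
print(counts.most_common(3))   # most frequent words so far
```

Keeping the tally in a single counter that is updated per document means the most-frequent-words list can be produced at any point during the run.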
So far, Lycos has run for less than a month.
142132 http
102910 ftp
 84143 gopher
  4314 news
  1396 telnet
   379 mailto
   244 wais
    13 rlogin
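A per-scheme breakdown like this one, together with the rule that only HTTP, FTP, and GOPHER URLs are searched, might be sketched as below. This is an illustrative Python sketch under stated assumptions, not the Lycos implementation: the names SEARCHED, scheme_of, tally_schemes, and should_fetch are invented here, and the sample list is just a handful of URLs that appear in this message.

```python
from collections import Counter
from urllib.parse import urlsplit

# Schemes the explorer follows, per the description above. Other schemes are
# only counted here; that treatment is an assumption of this sketch.
SEARCHED = {"http", "ftp", "gopher"}


def scheme_of(url: str) -> str:
    """Return the URL scheme in lowercase ('' if there is none)."""
    return urlsplit(url).scheme.lower()


def tally_schemes(urls) -> Counter:
    """Count how many URLs were seen for each scheme."""
    return Counter(scheme_of(u) for u in urls)


def should_fetch(url: str) -> bool:
    """True if the URL uses one of the searched schemes."""
    return scheme_of(url) in SEARCHED


seen = [
    "http://info.cern.ch/hypertext/WWW/TheProject.html",
    "http://wings.buffalo.edu/world",
    "ftp://cs.nott.ac.uk/pub/sat-images/",
    "gopher://ssec.wisc.edu",
    "mailto:fuzzy@cmu.edu",
]

tally = tally_schemes(seen)
for scheme, n in tally.most_common():
    print(f"{n:6d} {scheme}")

to_fetch = [u for u in seen if should_fetch(u)]
```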
Citation counting (number of "parents" by URL): these are the first 50 URLs, sorted by the number of documents that reference each URL. What I did not do was count only references from different sites (I'm sure that 99% of the refs to http://gdbwww.gdb.org/omim come from the Genome Database server itself).
1703 http://gdbwww.gdb.org/omim/
1578 http://cossack.cosmic.uga.edu/keywords.html
 692 ftp://ftp.network.com/IPSEC/rfcindex4.html
 421 ftp://ftp.network.com/IPSEC/rfcindex3.html
 322 ftp://ftp.network.com/IPSEC/rfcauthor.html
 319 ftp://ftp.network.com/IPSEC/rfcindex5.html
 234 ftp://ftp.network.com/IPSEC/rfcindex2.html
 202 ftp://ftp.network.com/IPSEC/rfcindex1.html
 177 http://info.cern.ch/hypertext/WWW/TheProject.html
 166 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html
 135 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/MetaIndex.html
 133 http://www.cs.columbia.edu/~radev/
 133 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html
 118 http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html
 108 http://www.mcs.anl.gov/home/gropp/
 107 http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
 105 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html
 101 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html
  85 http://cui_www.unige.ch/w3catalog
  84 http://wings.buffalo.edu/world
  82 http://sass577.endo.sandia.gov/SEACAS/CUBIT/Developers/
  80 http://cui_www.unige.ch/OSG/MultimediaInfo/mmsurvey/
  79 http://www.nta.no/telektronikk/4.93.dir/
  76 http://asp.esam.nwu.edu/chris/dce_prodlist.html
  76 http://hypatia.gsfc.nasa.gov/NASA_homepage.html
  76 http://info.cern.ch/hypertext/DataSources/WWW/Servers.html
  75 http://www.ncsa.uiuc.edu/demoweb/demo.html
  75 http://www.rtd.com/people/rawn/
  74 ftp://ftp.network.com/IPSEC/rfcindex0.html
  74 http://tns-www.lcs.mit.edu/cgi-bin/value-added/sports/register.sos.texas.gov/texreg/
  73 http://rs560.cl.msu.edu/weather/getmegif.html
  71 http://rs560.cl.msu.edu/weather/interactive.html
  70 http://rs560.cl.msu.edu/weather/textindex.html
  70 http://rs560.cl.msu.edu/~henrich/
  70 http://www.seas.upenn.edu/~mengwong/
  68 http://info.cern.ch/hypertext/DataSources/WWW/Geographical.html
  68 http://rs560.cl.msu.edu/weather/uscmp.gif
  66 http://rs560.cl.msu.edu/weather/uscmp.mpg
  66 http://www.cso.uiuc.edu/~kline/cvk.html
  65 ftp://cs.nott.ac.uk/pub/sat-images/
  65 http://rs560.cl.msu.edu/weather/goes7ir.mpg
  65 http://rs560.cl.msu.edu/weather/worldir.mpg
  65 http://www.hmc.edu/~irilyth/diplomacy/
  64 gopher://burrow.cl.msu.edu/00/news/weather/lan
  64 gopher://ssec.wisc.edu
  64 http://rs560.cl.msu.edu/weather/6panel.mpg
  64 http://rs560.cl.msu.edu/weather/d2.jpg
  64 http://rs560.cl.msu.edu/weather/gmsvis.mpg
  63 http://cui_www.unige.ch/meta-index.html
  63 http://rd13doc.cern.ch/public/doc/Rd13StatusReport.html
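The refinement that was not done, counting only references from different sites, could look something like the sketch below. It is a Python illustration under assumptions, not part of Lycos: a "site" is taken to be the URL's host name, the function names are invented, and the three referring pages in the sample are made-up paths used only to show the effect.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Treat the host name as the 'site' (an assumption of this sketch)."""
    return (urlsplit(url).hostname or "").lower()


def count_citations(links):
    """Count parents per cited URL.

    links is an iterable of (referring_url, cited_url) pairs.
    Returns {cited_url: (all_parents, other_sites)}, where all_parents is the
    number of distinct referring documents and other_sites is the number of
    distinct referring sites other than the cited URL's own site.
    """
    parents = defaultdict(set)
    for src, dst in links:
        parents[dst].add(src)

    result = {}
    for dst, srcs in parents.items():
        other_sites = {site_of(s) for s in srcs} - {site_of(dst)}
        result[dst] = (len(srcs), len(other_sites))
    return result


# Hypothetical link data: the referring page paths are invented.
links = [
    ("http://gdbwww.gdb.org/a.html", "http://gdbwww.gdb.org/omim/"),
    ("http://gdbwww.gdb.org/b.html", "http://gdbwww.gdb.org/omim/"),
    ("http://info.cern.ch/x.html", "http://gdbwww.gdb.org/omim/"),
]

citations = count_citations(links)
for url, (all_parents, other_sites) in citations.items():
    print(all_parents, other_sites, url)
```

Sorting on the second number instead of the first would rank a URL by how many other sites point at it, so a server that links heavily to its own pages would no longer dominate the list.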
Alternative fixed representations of documents or document sets include vector models such as those of Dumais at Bellcore and Gallant & Caid at Hecht-Nielsen Corp. The number 100 was chosen arbitrarily, so we will need to investigate whether that number is too high or too low.
I also subscribe to the dream of a single format and indexing scheme that each server runs on its own data, but given the current state of the community I believe it is premature to settle on a single format. Various information retrieval schemes depend on wildly different kinds of data. We should try out more ideas and evaluate them carefully and only then should we try to settle on a single format.
I will make lists, statistics, reports, and the index server accessible off the Lycos home page as they become available.
--Michael L. Mauldin
Carnegie Mellon University
Center for Machine Translation
5000 Forbes Avenue
Pittsburgh, PA 15213-3890
fuzzy@cmu.edu