You can also read these pages as multiple short documents in the Lycos Home Page.
216,008 documents fetched totaling 1,306,495,064 bytes
1,277,779 unexplored URLs with descriptions
458,129,769 bytes of Lycos summaries
206,431,394 bytes of inverted index
You'll also notice that the score now indicates how many terms matched (if there was more than 1 term in your query), plus there are bonuses for adjacency of search terms.
December 18, 1994 New hardware:
Since my Pentiums haven't arrived yet, I have donated my own workstation to the cause as the Lycos3 server.
I have made arrangements to borrow another Sparc (Lycos4), but
it probably needs a new kernel to increase the maximum number of
processes before it can be a Lycos server (I tried Friday, and it
died by Saturday).
Four additional machines are on order (two P90s and two Sparc 5 clones). I have also obtained a beta copy of Netsite from Netscape Communications to evaluate its speed in comparison to NCSA httpd. I won't be able to unpack and install Netsite until Monday or Tuesday.
Lastly, if anyone knows about server software to gracefully distribute HTTP load among several servers, please send email to fuzzy@cmu.edu.
December 18, 1994 Network interruptions:
Note that my network connection may be flaky on Dec. 26 and 27.
Here's the announcement from facilities:
Planned Network Upgrades:
-------------------------
In conjunction with the scheduled Cyert Hall Fire System test and the university holidays, the Data Communications department will be upgrading various core network components on Monday, December 26th and Tuesday, December 27th. During these two days, starting at the beginning of Monday December 26th (midnight) various network components will be worked on, replaced or upgraded. In some cases, the network service will be down but start working for short time periods while tests are performed and in other cases long term outages will exist.
December 13, 1994 New catalog:
The large catalog now contains 1,056,523 documents found by Lycos
between Nov 21 and Dec 11th (including 148,667 documents actually
retrieved). This is also the first catalog incorporating the
new Lycos URL Deletion function (no more pointers to mtv.com).
148,667 documents fetched totaling 887,792,616 bytes
907,856 unexplored URLs with descriptions
319,514,940 bytes of Lycos summaries
175,426,295 bytes of inverted index
December 7, 1994 New catalog:
The small catalog is now a subset of the big catalog, containing
documents that were retrieved by Lycos between Nov 21 and Dec 5th.
It contains
113,794 documents fetched totaling 675,696,928 bytes
46,621 unexplored URLs for images or postscript files
160,415 documents all together
December 7, 1994 New load limits:
To cope with the load, we've been forced to limit access to the larger
catalog when the load average exceeds 10.0, and to reject new queries
entirely when the load exceeds 15.0. We are adding new hardware and
new servers soon. The Pentiums are scheduled to arrive Tuesday --
don't worry, Lycos/Pursuit only does one floating point divide per hit. :-)
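The load limits above can be sketched as a small admission check. This is only an illustration, not the code running on the Lycos servers: the two thresholds come from the announcement, while the function name `admit_query` and the decision to fall back to the small catalog (rather than refuse outright) between 10.0 and 15.0 are assumptions.

```python
# Thresholds quoted in the announcement; everything else is illustrative.
LARGE_CATALOG_LIMIT = 10.0  # above this load, the larger catalog is off limits
REJECT_LIMIT = 15.0         # above this load, new queries are rejected


def admit_query(load_average, wants_large_catalog):
    """Return 'reject', 'small', or 'large' for a newly arriving query."""
    if load_average > REJECT_LIMIT:
        return "reject"
    if wants_large_catalog and load_average > LARGE_CATALOG_LIMIT:
        # Assumption: fall back to the small catalog instead of refusing.
        return "small"
    return "large" if wants_large_catalog else "small"
```

On a Unix machine, the one-minute value from `os.getloadavg()[0]` would be one plausible source for `load_average`.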
December 6, 1994 New catalog:
The big catalog now contains only 840,327 documents, but they
were all collected between Nov 21 and Dec 5, so you should see
fewer bad links.
113,794 documents fetched totaling 675,696,928 bytes
714,764 unexplored URLs with one or more descriptions
251,648,743 bytes of Lycos summaries
184,618,784 bytes of inverted index
December 2, 1994 Now running NCSA HTTPD:
In an effort to reduce the system load and improve system
response time, we are trying the NCSA HTTPD 1.3 server on
the Fuzine server and
the Lycos1 server.
The Lycos2 server
is still running CERN HTTPD.
November 28, 1994 Reorganization:
Given that Lycos is handling up to 30,000 requests a day, I have made a
smaller catalog (481k recent URLs) the default, and made the big catalog
(1.3 million URLs) the test database. For most people, the smaller
catalog may be better, since it contains only URLs found in the last
two months, and has fewer bad links in it.
More hardware is on the way...stay tuned to this channel.
It really is a fast indexer...it only seems slow because you're sharing with 34,999 other people...
November 25, 1994 New catalog:
The main catalog (June-Nov) is up to 1,284,907 URLs.
This 1.3-million-URL catalog will be available tomorrow, and includes:
175,887 documents fetched totaling 1,081,826,971 bytes
1,109,020 unexplored URLs with one or more descriptions
368,328,875 bytes of Lycos summaries
264,708,701 bytes of inverted index
November 8, 1994 Update:
The main catalog (June-Nov) is up to 999,461 URLs.
This 999k catalog includes
131,173 documents fetched totaling 831,633,976 bytes
868,288 unexplored URLs with one or more descriptions
276,290,984 bytes of Lycos summaries
200,665,350 bytes of inverted index
The other good news is that we now have a second big disk, so both Lycos and Lycos2 servers have their own copies of the catalog. So the searches should run faster (for a while).
November 2, 1994 Update:
Okay, I give up. You win. You can run more searches in a day
than I can find extra computers to run them.
Lycos ran on one computer for 4 months, on two computers for 2 months, and now you've overloaded the third computer in less than a week.
Okay, seriously. We're getting an additional disk (to improve the inverted file access times), and we've moved data around to reduce NFS file accesses needed to run searches on the big DB.
October 30, 1994 Update:
Because of the heavy demand for Lycos, I am now using 3 computers to
provide HTTP service (note that CGI scripts have been re-enabled on Fuzine):
Other changes to reduce the load include raising the default match threshold from 0.20 to 0.40, reducing the default number of hits from 50 to 20, and commissioning an even scarier spider picture for the logo from a graphic artist.
October 27, 1994 Update:
The main catalog (June-Oct) is up to 862,858 URLs.
This 862k catalog includes
109,462 documents fetched totaling 699,070,847 bytes
753,396 unexplored URLs with one or more descriptions
235,741,898 bytes of Lycos summaries
171,769,116 bytes of inverted index
October 26, 1994 Update:
To see the Lycos usage, check out these documents:
October 10, 1994 Update:
There is now a
Forms-based Lycos
search page that allows you to set the min-score,
max-hits, and terse mode.
October 9, 1994 Update:
You can now request that Lycos explore a specific URL by using the
Lycos URL
Registry.
October 5, 1994 Update:
The test catalog (June-Oct) is up to 701,466 URLs, including all URLs from the
production catalog (June-Sep).
This 701k catalog includes
84,239 documents fetched totaling 531,276,671 bytes
617,227 unexplored URLs with one or more descriptions
180,980,745 bytes of Lycos summaries
110,741,009 bytes of inverted index
September 20, 1994 Update:
I've merged all the Lycos search results and removed duplicate
URLs (by name, not content), so the main
Lycos search
now covers 547,675 unique URLs.
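Removing duplicates "by name, not content" might look roughly like the sketch below. It is a hypothetical illustration in Python, not the merge script actually used; the `(url, title)` record shape and the name `dedupe_by_url` are assumptions. Two different URLs that serve the same page are both kept, which is exactly the limitation the note describes.

```python
def dedupe_by_url(results):
    """Keep the first record seen for each exact URL string, preserving order."""
    seen = set()
    unique = []
    for url, title in results:
        if url not in seen:
            seen.add(url)
            unique.append((url, title))
    return unique
```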
September 4, 1994 Update:
Lycos/Pursuit is now available for courageous beta testers.
The Lycos
beta test source is a compressed tar file.
Documentation is included, but is still minimal. The faint of heart may wish to wait a few days for better documentation. Users desiring new features should check the Lycos To Do List to see if that feature is already on the list.
August 26, 1994 Update:
Carnegie Mellon has dedicated a Sparcstation,
lycos.cs.cmu.edu,
to running Lycos searches. This machine was used for the ARPA Tipster
phase I program, and has now been reassigned. Please note the
new Lycos search
engine URL and update your hotlists and web pages accordingly.
August 15, 1994 Update:
I've merged the June and August catalogs...so there may well be some
duplicates in the test version of Lycos. The test catalog is 634,066
documents, 152.9 megabytes. I will be modifying the PURSUIT
engine to weed out duplicates by default.
August 14, 1994 Update:
Current catalog is up to 173,000 documents, and 49 megabytes.
August 11, 1994 Update:
Lycos is searching the web again, and its current catalog is
available here.
So far, starting from scratch on August 7, Lycos has found 4,784 HTTP
servers, 18,687 documents (totalling 56 megabytes of text), and the
names of 115,000 more documents. The new catalog is up to 37 meg.
The next experiment is to add the ability to do best-first search starting with the finite document set to find specific topics. Until then, you can cast around using the search function.
Lycos's web crawler is written in PERL, with a C program that uses CERN's libwww library to fetch documents. Lycos will not fetch TELNET, MAILTO, NEWS, FILE, or WAIS type files (that leaves mostly HTTP, GOPHER and FTP files). It also ignores files that start with "/dev/tty" or end with these extensions: AU, AVI, BIN, DAT, DVI, EXE, FLI, GIF, GZ, HDF, HQX, JPEG, LHA, MAC, MPEG, PS, TAR, TGA, TIFF, UU, UUE, WAV, Z or ZIP.
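The fetch rules above can be expressed as a simple URL filter. The real crawler is PERL and C; the Python sketch below only restates the listed rules. The function name `should_fetch`, the case-insensitive extension match, and applying the "/dev/tty" test to the URL path are assumptions.

```python
from urllib.parse import urlsplit

# Schemes and extensions copied from the description above.
SKIPPED_SCHEMES = {"telnet", "mailto", "news", "file", "wais"}
SKIPPED_EXTENSIONS = {
    "au", "avi", "bin", "dat", "dvi", "exe", "fli", "gif", "gz", "hdf",
    "hqx", "jpeg", "lha", "mac", "mpeg", "ps", "tar", "tga", "tiff",
    "uu", "uue", "wav", "z", "zip",
}


def should_fetch(url):
    """Return True if the URL passes the scheme, path, and extension rules."""
    parts = urlsplit(url)
    if parts.scheme.lower() in SKIPPED_SCHEMES:
        return False
    if parts.path.startswith("/dev/tty"):
        return False
    last_segment = parts.path.rsplit("/", 1)[-1]
    if "." in last_segment:
        # Assumption: extensions are compared without regard to case.
        extension = last_segment.rsplit(".", 1)[-1].lower()
        if extension in SKIPPED_EXTENSIONS:
            return False
    return True
```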
Lycos's search engine, PURSUIT, is a C program that uses a disk-based inverted file retrieval system and a simple sum of weights to score documents. One unique feature is that PURSUIT scores words by how far into the document they appear. Thus hits in the title or first paragraph are scored higher. As soon as the bugs are squashed, the search engine will be made available to all for non-commercial use (send mail to fuzzy@cmu.edu if you would like to be a beta tester).
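To make the position-based scoring idea concrete, here is a toy sketch. PURSUIT itself is a C program over a disk-based inverted file, and its actual weights are not given here, so the linear decay in `position_weight` is purely hypothetical, as are both function names. The sketch only shows the general shape: sum one weight per matching query term, with earlier hits worth more.

```python
def position_weight(index, length):
    """Hypothetical weight: 1.0 for the first word, falling linearly toward 0."""
    return 1.0 - index / length


def score(document_words, query_terms):
    """Sum, over query terms, the weight of each term's earliest occurrence."""
    words = [w.lower() for w in document_words]
    total = 0.0
    for term in query_terms:
        term = term.lower()
        if term in words:
            total += position_weight(words.index(term), len(words))
    return total
```

With this weighting, a query word found in the title (near index 0) contributes close to 1.0, while the same word found only in the last line contributes close to 0.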
We might upgrade the search engine's language at some future point to implement more standard boolean operators. We will definitely add the spelling correction and phonetic and semantic match capabilities from the SCOUT project.
For each document fetched, Lycos keeps the title, headings, subheadings, and links, plus the 100 highest weighted words (using TF*IDF weighting) plus the first 20 lines. Lycos uses a random search to prevent bunching up accesses to any one server.
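Picking the highest weighted words with TF*IDF can be sketched as follows. The exact formula Lycos uses is not stated, so this assumes one common variant (raw term count times the natural log of total documents over document frequency). The function name, the `document_frequency` mapping, and the treatment of unseen words are all assumptions.

```python
import math
from collections import Counter


def top_weighted_words(words, document_frequency, total_documents, limit=100):
    """Return up to `limit` (word, weight) pairs, highest TF*IDF weight first."""
    counts = Counter(w.lower() for w in words)
    weights = {}
    for word, term_frequency in counts.items():
        # Assumption: a word missing from the table appears in one document.
        df = document_frequency.get(word, 1)
        weights[word] = term_frequency * math.log(total_documents / df)
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

A word that appears in every document gets weight 0 under this variant, so very common words naturally drop out of the top 100.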
Lycos now complies with the standard for robot exclusion to keep unwanted robots off WWW servers, and sets the USER-AGENT field to "Lycos".
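For readers who want to see what honoring robot exclusion involves, the check can be sketched with Python's standard `urllib.robotparser` module. This is a modern stand-in, not the code Lycos runs. The helper name `allowed_for_lycos` is hypothetical, and the robots.txt text is passed in as a string so the example needs no network access.

```python
from urllib.robotparser import RobotFileParser


def allowed_for_lycos(robots_txt, url, user_agent="Lycos"):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A server that wants to keep Lycos out of one directory would publish a robots.txt containing `User-agent: Lycos` followed by a `Disallow:` line for that path.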
Posts to NetNews are archived here:
SIGNIDR 94 materials
The Robots mailing list is for discussion of issues related to automated Web searching programs.
Some of the hardware used by Lycos was originally purchased with funds provided by ARPA for the Tipster phase I program, and Michael Mauldin is partially supported by ARPA's CS-TR project.