The documentation will be coming along slowly; this is the first manual for the Beta version. Check lycos-beta.html for more.
00README          This file
Makefile          For make
Notes             This is my list of notes to myself on what to do next,
                  what limitations there are in the code, etc.
buildindex        A csh script that builds the inverted index from a
                  list of files
config.h          Defaults for paths and compile time settings
copyright.h       Statement of copyright
hash.c            Implements an expanding hash table
hash.h            Header for hash.c
lycos1.perl       Lycos web crawler, in perl.
names2doc.perl    Perl script to convert the name entries from Lycos
                  into SGML style abstracts.
pbuild.c          Given a word index file, build a quadgram file
pcompact.c        Given a sorted word occurrence file, produce the
                  inverted index (.wrd) file.
pinvert.c         Reads the files to be indexed, creates a list of
                  files, a document index, and a list of word
                  occurrences for sorting.
pursuit-beta      A CGI script that shows how to use pursuit from httpd.
pursuit.c         The retrieval function.
pursuit.hdr       Sample HTML header for search output
pursuit.help      Sample HTML help included when no query is given
pursuit.trl       Sample HTML trailer for search output
pdb.idx     Maps document id numbers to offsets in the file list (pdb.lst)

pdb.lst     Has one entry for every document indexed. Entries are of
            the form:

            Field     Bytes  Description
            -----     -----  -----------
            stx       1      Start of text marker (0x02)
            scoutid   4      Id number of this document
            docbyte   4      Fseek offset from beginning of file
            nextbyte  4      Fseek offset of next document
            txttp     4      Up to 4 character file type
            txtnm     var    Null terminated file name
            etx       1      End of text marker (newline)

            Since a file may contain multiple documents, each entry
            contains both the file name and byte offset of the
            document, and the nextbyte offset points to the character
            immediately after the document.

pdb.wrd     Is a list of words, each followed by long words indicating
            document ids and word numbers in those documents. Thus

            abacus\0 0x80000002 0x00000001 0x00000005
                     0x80000006 0x00000002 0xc0000007

            means that the word "abacus" occurs in document 2 at words
            1 and 5, and in document 6 at words 2 and 7.

            Field     Length Description
            -----     ------ -----------
            word      var    Null terminated ascii string
            docid     4      0x80000000 | scout id of the document
                             containing the word
            wordno    4      Word number of the occurrence; if it is the
                             last entry in the list, ORed with
                             0xc0000000

pdb.wtb     Is a quadgram table that takes the first 4 characters of a
            word and hashes them into a single bucket, with a byte
            offset in pdb.wrd to the first word that starts with those
            4 letters. Only a-z and 0-9 are significant.
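The pdb.wrd layout described above can be decoded roughly as follows. This is only a sketch and is not taken from pursuit.c: the names (struct occurrence, decode_entry, read_u32) are made up for illustration, and the code assumes the 4-byte values are stored most significant byte first, which this manual does not state. If the index is written in the host's native byte order, only read_u32 needs to change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLAG_MASK  0xc0000000u  /* top two bits carry the entry type    */
#define DOC_FLAG   0x80000000u  /* value is a document id               */
#define LAST_FLAG  0xc0000000u  /* value is the last word number        */
#define VALUE_MASK 0x3fffffffu  /* remaining 30 bits carry the number   */

/* Hypothetical result record: one occurrence of the word. */
struct occurrence {
    uint32_t docid;   /* scout id of the document */
    uint32_t wordno;  /* word number within that document */
};

/* ASSUMPTION: big-endian storage. The manual does not give the byte order. */
static uint32_t read_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Decode one pdb.wrd entry (null terminated word plus its occurrence
 * list) starting at buf. Up to max occurrences are stored in out; the
 * total number found is stored in *count. Returns the number of bytes
 * consumed, or 0 if the entry is truncated or malformed.
 */
size_t decode_entry(const unsigned char *buf, size_t len,
                    struct occurrence *out, size_t max, size_t *count)
{
    assert(buf != NULL && count != NULL);
    *count = 0;

    /* The word itself ends at the first null byte. */
    const unsigned char *nul = memchr(buf, '\0', len);
    if (nul == NULL)
        return 0;

    size_t pos = (size_t)(nul - buf) + 1;
    uint32_t docid = 0;
    int have_doc = 0;

    while (pos + 4 <= len) {
        uint32_t v = read_u32(buf + pos);
        uint32_t flags = v & FLAG_MASK;
        pos += 4;

        if (flags == DOC_FLAG) {
            /* Start of the occurrences for a new document. */
            docid = v & VALUE_MASK;
            have_doc = 1;
            continue;
        }
        if (!have_doc)
            return 0;  /* word number before any document id */

        if (*count < max) {
            out[*count].docid = docid;
            out[*count].wordno = v & VALUE_MASK;
        }
        (*count)++;

        if (flags == LAST_FLAG)
            return pos;  /* last entry in this word's list */
    }
    return 0;  /* ran out of bytes before the end-of-list marker */
}
```

Given the bytes of the "abacus" example (document 2 at words 1 and 5, document 6 at words 2 and 7), decode_entry reports four occurrences and consumes all 31 bytes of the entry, so the next word begins at the returned offset.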
Then create a subdirectory called .robots, and type "perl lycos11.perl".
Watch the beta test page for information about lycos11.perl.
The industrious user may wish to modify pursuit to use this file, using either an in-memory hash table or a binary search of the file to locate the word occurrence lists in pdb.wrd without using the pdb.wtb file.
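As a rough illustration of the lookup idea, the sketch below does a binary search over an in-memory array sorted by word, which is a simpler stand-in for the hash table or on-disk search mentioned above. Everything in it is hypothetical: struct word_offset and find_word_offset are invented names, and it assumes the word and pdb.wrd offset pairs have already been loaded into memory and sorted in strcmp order. How that table is read from disk is not shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory record: one per indexed word. */
struct word_offset {
    const char *word;  /* null terminated word */
    long offset;       /* byte offset of the word's entry in pdb.wrd */
};

/* bsearch comparator: the key is a plain C string, the element a record. */
static int compare_word(const void *key, const void *elem)
{
    const struct word_offset *e = elem;
    return strcmp((const char *)key, e->word);
}

/*
 * Return the pdb.wrd offset recorded for word, or -1 if the word is not
 * in the table. The table must be sorted by word in strcmp order.
 */
long find_word_offset(const struct word_offset *table, size_t n,
                      const char *word)
{
    assert(word != NULL);
    const struct word_offset *hit =
        bsearch(word, table, n, sizeof *table, compare_word);
    return hit != NULL ? hit->offset : -1L;
}
```

The returned offset could then be handed to fseek on pdb.wrd to read that word's occurrence list directly.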