Lycos Beta Test: Readme file


README

Welcome to Lycos (tm) and Pursuit. This is the first publicly available version, Lycos 0.9beta, so please send comments and constructive criticism to Michael L. Mauldin (fuzzy@cmu.edu).

1.0 Introduction

As shipped, Lycos/Pursuit is intended to provide a search capability for the WWW. This file describes a sample installation for a site that wishes to index a single directory.

The documentation will be coming along slowly; this is the first manual for the Beta version. Check for more in lycos-beta.html.

2.0 Files

	00README	This file

	Makefile	For make

	Notes		This is my list of notes to myself on
			what to do next, what limitations there are
			in the code, etc.

	buildindex	A csh script that builds the inverted index from
			a list of files

	config.h	Defaults for paths and compile time settings

	copyright.h	Statement of copyright

	hash.c		Implements an expanding hash table

	hash.h		Header for hash.c

	lycos1.perl	Lycos web crawler, in perl.

	names2doc.perl	Perl script to convert the name entries from
			Lycos into SGML style abstracts.

	pbuild.c	Given a word index file, build a quadgram file

	pcompact.c	Given a sorted word occurrence file, produce
			the inverted index (.wrd) file.

	pinvert.c	Reads the files to be indexed, creates a list
			of files, a document index, and a list of word
			occurrences for sorting.

	pursuit-beta	A CGI script that shows how to use pursuit
			from httpd.

	pursuit.c	The retrieval function.

	pursuit.hdr	Sample HTML header for search output
	pursuit.help	Sample HTML help included when no query is given
	pursuit.trl	Sample HTML trailer for search output

3.0 Installation

For better descriptions, stay tuned to the Beta test page, http://fuzine.vperson.com/mlm/lycos-beta.html.
  1. First create a directory pursuit-beta (or pick your own name).
  2. Uncompress and untar the pursuit-beta.tar.Z file in this directory.
  3. Edit config.h, pursuit.hdr, and pursuit.trl to conform to your local file structure and httpd location.
  4. Type make install
  5. Copy buildindex, pursuit.hdr, pursuit.trl, and pursuit.help into the directory you are indexing. Make sure buildindex is executable.
  6. Build the inverted index in one of two ways:
    1. Type "buildindex *.html", or
    2. Type "find . -name '*.html' -print | buildindex"
  7. Copy pursuit-beta into your httpd script directory, and modify the file locations. The default is to build the index in the same directory with names pdb.*
  8. Try opening http://yourcgipath/pursuit-beta from a WWW client.

4.0 File formats

Lycos/Pursuit uses four files to store the inverted index. These are:
	pdb.idx		Maps document id numbers to offsets in the
			file list (pdb.lst)

	pdb.lst		Has one entry for every document indexed.
			Entries are of the form:

			Field	Bytes	Description
			-----	-----	-----------
			stx	  1	Start of text marker (0x02)
			scoutid	  4	Id number of this document
			docbyte	  4	Fseek offset from beginning of file
			nextbyte  4	Fseek offset of next document
			txttp	  4	Up to 4 character file type
			txtnm	 var	Null terminated file name
			etx	  1	End of text marker (newline)

			Since a file may contain multiple documents,
			each entry contains both the file name and
			byte offset of the document, and the nextbyte
			offset points to the character immediately
			after the document.
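As a rough illustration of the pdb.lst layout above, the following Python sketch parses a single entry. The field table does not state a byte order or say whether the numeric fields are signed, so the big-endian unsigned integers used here are an assumption, and the name parse_lst_entry is hypothetical rather than anything in the distribution.

```python
import struct

STX = 0x02          # start-of-text marker from the field table
ETX = ord("\n")     # end-of-text marker (newline)

# Assumption: scoutid, docbyte and nextbyte are big-endian unsigned 32-bit
# integers. The README gives only their widths, not their encoding.
_FIXED = struct.Struct(">B I I I 4s")


def parse_lst_entry(buf, pos=0):
    """Parse one pdb.lst entry at buf[pos]; return (entry, next_pos)."""
    stx, scoutid, docbyte, nextbyte, txttp = _FIXED.unpack_from(buf, pos)
    if stx != STX:
        raise ValueError("entry does not begin with STX (0x02)")
    name_start = pos + _FIXED.size
    name_end = buf.index(b"\0", name_start)   # txtnm is null terminated
    if buf[name_end + 1] != ETX:
        raise ValueError("entry does not end with a newline")
    entry = {
        "scoutid": scoutid,
        "docbyte": docbyte,
        "nextbyte": nextbyte,
        # Assumption: short file types are padded with NULs or spaces.
        "txttp": txttp.rstrip(b"\0 ").decode("ascii"),
        "txtnm": buf[name_start:name_end].decode("ascii"),
    }
    return entry, name_end + 2                # skip the NUL and the newline


# Usage: build a made-up entry for document 6 and parse it back.
sample = b"\x02" + struct.pack(">III", 6, 100, 250) + b"html" + b"index.html\0\n"
entry, next_pos = parse_lst_entry(sample)
```

Returning next_pos lets a caller walk the file entry by entry, while pdb.idx gives direct access by document id.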

	pdb.wrd		Is a list of words, each followed by 4-byte longwords
			indicating document ids and word numbers in
			those documents.  Thus

				abacus\0
				0x80000002
				0x00000001
				0x00000005
				0x80000006
				0x00000002
				0xc0000007

			Means that the word "abacus" occurs in document
			2 at words 1 and 5, and in document 6 at words
			2 and 7.

			Field	Length	Description
			-----	------	-----------
			word	 var	Null terminated ascii string
			docid	  4	0x80000000 | scout id of document
					containing the word
			wordno	  4	Word number of occurrence; the
					last entry in the list is ORed
					with 0xc0000000
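The flag bits in the field table can be decoded along the following lines. This is only a sketch: it assumes big-endian longwords and that the low 30 bits carry the document id or word number, neither of which the README states, and decode_postings is a hypothetical name.

```python
import struct

DOC_FLAG = 0x80000000    # set on document id longwords
LAST_FLAG = 0xC0000000   # set on the final word number in a word's list
VALUE_MASK = 0x3FFFFFFF  # assumption: the low 30 bits hold the id or number


def decode_postings(buf, pos=0):
    """Decode one pdb.wrd record; return (word, {docid: [wordno...]}, next_pos)."""
    end = buf.index(b"\0", pos)               # the word is null terminated
    word = buf[pos:end].decode("ascii")
    pos = end + 1
    postings = {}
    doc = None
    while True:
        (value,) = struct.unpack_from(">I", buf, pos)   # assumption: big-endian
        pos += 4
        flags = value & LAST_FLAG
        if flags == DOC_FLAG:
            doc = value & VALUE_MASK          # start of a new document's list
            postings[doc] = []
        else:
            if doc is None:
                raise ValueError("word number seen before any document id")
            postings[doc].append(value & VALUE_MASK)
            if flags == LAST_FLAG:            # last occurrence for this word
                return word, postings, pos


# Usage: encode the "abacus" record with the flags from the field table.
record = b"abacus\0" + struct.pack(
    ">6I", 0x80000002, 1, 5, 0x80000006, 2, 0xC0000007)
word, postings, next_pos = decode_postings(record)
```

Because the end-of-list flag is carried on the last word number itself, a reader needs no separate length field to find where the next word begins.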


	pdb.wtb		Is a quadgram table that takes the first 4
			characters of a word and hashes them into a
			single bucket, with a byte offset in pdb.wrd
			to the first word that starts with those 4
			letters.  Only a-z and 0-9 are significant.
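The README does not spell out the hash itself, so the Python sketch below shows just one plausible way to turn the first four significant characters of a word into a bucket number. Folding case, dropping non-significant characters before taking the first four, and the base-37 packing are all assumptions for illustration; quadgram_bucket is a hypothetical name.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits    # the 36 significant characters
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # code 0 is kept for padding
BASE = len(ALPHABET) + 1                             # 37 possible codes per position


def quadgram_bucket(word):
    """Map the first four significant characters of word to a bucket number."""
    # Assumption: case is folded and characters outside a-z/0-9 are dropped.
    codes = [CODE[ch] for ch in word.lower() if ch in CODE][:4]
    codes += [0] * (4 - len(codes))                  # pad short words
    bucket = 0
    for code in codes:
        bucket = bucket * BASE + code                # positional base-37 value
    return bucket


# Usage: words sharing their first four letters land in the same bucket.
same = quadgram_bucket("abacus") == quadgram_bucket("abactor")
```

Under this scheme the table needs at most 37**4 (1,874,161) slots, each holding a byte offset into pdb.wrd.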

5.0 Lycos Explorer

To run the web crawler, compile scoutget.c (you must have libwww compiled for that to work). See C code to fetch URLs (scoutget).

Then create a subdirectory called .robots, and type "perl lycos11.perl".

Watch the beta test page for information about lycos11.perl.

6.0 Miscellany

The quadgram table pdb.wtb is a performance enhancer that works well for huge databases (more than 50 megabytes), but is excessive for small or medium databases. The pcompact program builds a word index file (pdb.wdx) that contains a list of words and hex offsets into the pdb.wrd file.

The industrious user may wish to modify pursuit to use this file, either with an in-memory hash table or binary file search to locate the word occurrence lists in pdb.wrd without using the wtb file.
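A minimal sketch of the in-memory hash table approach, written in Python rather than as a change to pursuit.c, might look like this. The exact layout of pdb.wdx is not documented here, so the sketch assumes one "word hex-offset" pair per line separated by whitespace; load_word_index is a hypothetical name.

```python
def load_word_index(lines):
    """Build a hash table mapping each word to its byte offset in pdb.wrd."""
    offsets = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue                          # skip blank or malformed lines
        word, hex_offset = fields
        offsets[word] = int(hex_offset, 16)   # offsets are stored in hex
    return offsets


# Usage with made-up entries; with a real index one would pass the open
# file instead, e.g. load_word_index(open("pdb.wdx")).
index = load_word_index(["abacus 0", "abalone 1f", "zebra 2a0"])
offset = index.get("abalone")                 # seek here in pdb.wrd
```

A lookup then costs one dictionary probe plus one seek into pdb.wrd, at the price of holding every indexed word in memory, which is why it suits small and medium databases.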



Last updated 02-Sep-94