Measuring the Web with Lycos
Michael L. Mauldin,
Lycos Inc., Pittsburgh, PA, USA
fuzzy@lycos.com or http://fuzine.vperson.com/mlm/
Keywords: Web Size, Information Discovery and Retrieval
What is the Web?
The first question in measuring the size of the web is deciding
what counts as being "on" the web. We define the web as any
document in any of
- FTP space,
- Gopher space, or
- HTTP space.
By design, Lycos does not index ephemeral or changing data, or
infinite virtual spaces. Therefore, the following are not considered
part of the web:
- WAIS databases
- USENET news
- TELNET services
- Email
Further, we do not count the output of CGI scripts as part of
the web. To codify this constraint, we ignore URLs containing
either a question mark (?) or an equals sign (=), since these
characters are used primarily by CGI scripts. We also eliminate
certain URLs, such as those of the human genome database or the
Blackbox glossary, that encode user state information in the URL
itself (without this limitation, a robot could count a single page
with state information an infinite number of times).
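To make the rule concrete, here is a minimal sketch of such a filter
in Python. The function name and the scheme check are illustrative
assumptions, not the actual Lycos code.

    # Hypothetical sketch of the URL filter described above;
    # is_indexable and the scheme list are assumptions, not Lycos code.
    def is_indexable(url):
        """True if a URL counts as part of the web for this census."""
        # Only FTP, Gopher, and HTTP space are counted.
        if not url.lower().startswith(("ftp://", "gopher://", "http://")):
            return False
        # Skip CGI output and state-carrying URLs.
        if "?" in url or "=" in url:
            return False
        return True

    print(is_indexable("http://example.com/page.html"))      # True
    print(is_indexable("http://example.com/cgi-bin/q?x=1"))  # False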
The figure shows our taxonomy of CyberSpace: Lycos' view of the
Web is a strict subset of the Internet, but larger than the space
provided by HTTP servers alone.
Sampling the space
Lycos samples the web continuously, and the search results are merged
into the catalog weekly. To estimate the size of the web, we take
a week's worth of new searches, assume they are an independent
random sample of the web as a whole, and multiply the size of the
old document set by the ratio of the size of the new sample set to
the size of the intersection of the two sets.
As of April 4, 1995, the "old" document set contained
2,687,146 known URLs. The new search consisted of 680,011 URLs,
of which 408,839 were also in the old set. That gives a ratio
of 1.508, which, multiplied by the old size, gives 4.051 million URLs.
As of October 17, 1995, the "old" set contained 9,019,100 known URLs.
The new search found 1,634,356 URLs, of which 1,285,164 were
already known, giving a ratio of 1.272, and multiplying gives a
new estimate of 11.469 million URLs.
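In code, the estimator is a one-line ratio. The sketch below is our
own formulation in Python (the function name is an assumption, not
part of the Lycos system), checked against the October 1995 figures.

    # Capture-recapture style estimate: scale the old catalog size by
    # |new sample| / |new sample intersected with old catalog|.
    def estimate_web_size(old_size, sample_size, overlap_size):
        return old_size * sample_size / overlap_size

    # October 17, 1995 figures from the text:
    print(estimate_web_size(9_019_100, 1_634_356, 1_285_164))
    # -> roughly 11.47 million URLs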
Other Measures
How many servers?
Between Nov 21, 1994 and April 4, 1995, Lycos successfully downloaded at
least one file from 23,550 unique HTTP servers.
As of October 17, 1995, the Lycos catalog contained 80,985 unique HTTP servers.
How big is the average document?
During that same time period, the average text file size
downloaded was 7,920 characters.
As of October 17, 1995, the average file size was 7,874 bytes.
So how big is it?
Multiplying the estimated number of URLs by the average file size
gives an estimate of 90 billion bytes (84.1 gigabytes) for the size
of the web.
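The arithmetic, reproduced under the same assumptions (the gigabyte
figure uses binary units, 2^30 bytes):

    # Total size = estimated URL count * average file size.
    urls = 11_469_000         # estimated URLs, October 17, 1995
    avg_bytes = 7_874         # average file size in bytes
    total = urls * avg_bytes  # about 90.3 billion bytes
    print(total / 2**30)      # about 84.1 gigabytes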
Sources of Error
The biggest problem with this number is that the search is almost
certainly not a truly random sample. Lycos uses a biased weighting
scheme to download "popular" documents first, so the new search
will tend to overlap the old more than a truly random sample would.
Since the size of the intersection is therefore inflated, and
since it appears in the denominator, the estimate of 11.469 million
URLs is a lower bound.
Acknowledgements
Lycos is a registered trademark of Carnegie Mellon University.
This document is Copyright 1995 by Lycos, Inc.
Originally presented at the
Third International World-Wide Web Conference, April 11, 1995.
Last updated 29-Oct-95