Measuring the Web with Lycos
Michael L. Mauldin, Carnegie Mellon University, Pittsburgh, PA, USA
fuzzy@cmu.edu or http://fuzine.vperson.com/mlm/
Keywords: Web Size, Information Discovery and Retrieval
What is the Web?
The first step in measuring the size of the web is to decide what
counts as being ``on'' the web. We define the web as any document
in one of the following:
- FTP space,
- Gopher space, or
- HTTP space.
By design, Lycos does not index ephemeral or changing data or
infinite virtual spaces. Therefore, the following are not considered
part of the web:
- WAIS databases
- USENET news
- TELNET services
- Email
Further, we do not count the output of CGI scripts. To codify this
constraint, we ignore URLs containing either a question mark (?) or
an equals sign (=), as these characters are used primarily by CGI
scripts. We also exclude some sites, such as the human genome
database and the Blackbox glossary, which encode user state
information in the URL (without this limitation, a robot could count
a single page with state information an infinite number of times).
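
The counting rule above can be sketched in a few lines of Python. This is only an illustration, not the Lycos implementation: the function name `is_countable` and the constant `COUNTED_SCHEMES` are hypothetical, and the hand-maintained exclusions (such as the human genome database) are not modeled.

```python
from urllib.parse import urlsplit

# Schemes counted as part of the web under the definition above
# (the constant name is an assumption made for this sketch).
COUNTED_SCHEMES = {"ftp", "gopher", "http"}


def is_countable(url: str) -> bool:
    """Return True if the URL would be counted under the rules above."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in COUNTED_SCHEMES:
        return False
    # '?' and '=' are treated as markers of CGI script output.
    return "?" not in url and "=" not in url


if __name__ == "__main__":
    for candidate in (
        "http://example.org/index.html",
        "http://example.org/cgi-bin/search?q=web",
        "telnet://example.org/",
    ):
        print(candidate, is_countable(candidate))
```

A plain substring test is used for the two characters because the rule in the text applies to the URL as a whole rather than to any one component.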
The figure shows our taxonomy of CyberSpace: Lycos' view of the Web
is a strict subset of the Internet, but is larger than the space
provided by HTTP servers alone.
Sampling the space
Lycos samples the web continuously, and the search results are merged
with the catalog weekly. To estimate the size of the web, we take
a week's worth of newly searched URLs, assume they are an independent
random sample of the web as a whole, and multiply the size of the old
document set by the ratio of the size of the new sample set to the
size of the intersection of the two sets.
As of April 4th, 1995, the ``old'' document set contained
2,687,146 known URLs. The new search consisted of 680,011 URLs,
of which 408,839 were also in the old set. That gives a ratio
of 1.508, which multiplied by the old size gives 4.051 million URLs.
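
The estimator described in this section (old set size times the ratio of new sample size to overlap) might be written as the following Python sketch. The function name `estimate_web_size` is hypothetical, and the example call uses small made-up counts rather than the April 1995 figures.

```python
def estimate_web_size(old_size: int, new_size: int, overlap: int) -> float:
    """Estimate the total population from two overlapping samples.

    old_size: number of URLs in the existing catalog
    new_size: number of URLs in the new sample
    overlap:  number of URLs present in both sets
    """
    if overlap <= 0:
        raise ValueError("the two sets must share at least one URL")
    ratio = new_size / overlap
    return old_size * ratio


if __name__ == "__main__":
    # Toy numbers: 50 of 200 new URLs were already among 1,000 known URLs.
    print(estimate_web_size(old_size=1000, new_size=200, overlap=50))  # 4000.0
```

Note that feeding in the counts exactly as quoted above (680,011 new URLs, 408,839 shared) produces a ratio near 1.66 rather than the stated 1.508; the text does not say which counts entered the published ratio, so the sketch illustrates the formula only.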
Other Measures
How many servers?
Between Nov 21, 1994 and April 4, 1995, Lycos successfully downloaded at
least one file from 23,550 unique HTTP servers.
What is the distribution of file types?
Scheme    Cataloged           Downloaded
ftp         486906 (16.5%)     31931 ( 6.2%)
gopher      736091 (24.9%)    106340 (20.6%)
http       1722152 (58.2%)    377505 (73.2%)
mailto         276                NA
news           218                NA
rlogin         157                NA
telnet       11401 ( 0.4%)        NA
wais           784                NA
total      2957985            515776
How big is the average document?
During that same time period, the average size of a downloaded text
file was 7,920 characters.
So how big is it?
Multiplying the estimated 4.051 million URLs by the average size of
7,920 characters gives an estimate of 32 billion bytes (29.9 gigabytes)
for the size of the web.
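
The arithmetic behind this figure can be checked with a short Python sketch. It assumes one byte per character and that the average over downloaded text files applies to every estimated URL; the function name `estimate_total_bytes` is hypothetical.

```python
def estimate_total_bytes(n_docs: int, avg_chars: int, bytes_per_char: int = 1) -> int:
    """Total size in bytes, assuming every document has the average length."""
    return n_docs * avg_chars * bytes_per_char


if __name__ == "__main__":
    total = estimate_total_bytes(n_docs=4_051_000, avg_chars=7_920)
    print(f"{total:,} bytes")               # about 32 billion bytes
    print(f"{total / 2**30:.1f} gigabytes")  # using 2**30 bytes per gigabyte
```

The 29.9 figure corresponds to gigabytes of 2**30 bytes; with 10**9 bytes per gigabyte the same total would read as roughly 32.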
Sources of Error
The biggest problem with this number is that the search is almost
certainly not a truly random sample. Lycos uses a biased weighting
scheme to download ``popular'' documents first, so the new search
will tend to overlap the old more than a truly random sample.
Since the size of the intersection is therefore inflated, and
since it's in the denominator, the estimate of 4.051 million is a lower
bound.
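
A small numeric sketch in Python shows why an inflated intersection pushes the estimate down. All numbers here are made up for illustration: a population of 10,000 documents, an old set of 2,000, a new sample of 500, and an assumed overlap of 150 under a popularity bias.

```python
def estimate(old_size: float, new_size: float, overlap: float) -> float:
    """Old set size times the ratio of new sample size to overlap."""
    return old_size * new_size / overlap


TRUE_SIZE = 10_000     # hypothetical true number of documents
OLD, NEW = 2_000, 500  # hypothetical old catalog and new sample sizes

# With a truly random sample, the expected overlap is NEW * OLD / TRUE_SIZE.
expected_overlap = NEW * OLD / TRUE_SIZE
unbiased = estimate(OLD, NEW, expected_overlap)

# A popularity bias makes the new sample revisit known documents more often,
# so the observed overlap is larger (150 is an assumed value).
biased = estimate(OLD, NEW, 150)

if __name__ == "__main__":
    print(unbiased)  # recovers the true size
    print(biased)    # falls below the true size
```

Because the overlap sits in the denominator, any bias toward already-known documents can only lower the estimate, which is the sense in which the published figure is a lower bound.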
Acknowledgements
Lycos is generously supported by funds from Carnegie Mellon University.
Some of the hardware is re-used from the Tipster Data Extraction
Project funded by ARPA. Dr. Mauldin is also funded by a research grant
from the Corporation for National Research Initiatives as part of
ARPA's Computer Science Technical Report project.
Lycos is a registered trademark of Carnegie Mellon University.
Presented at the
Third International World-Wide Web Conference, April 11, 1995.
Last updated 7-Apr-95