The Net is Unfilterable

by Michael Sims jellicle@inch.com

A recent study reported in the prestigious journal Science concluded that there are roughly 320 million separate web pages - 320 million unique HTML files - available on the World Wide Web.[1] The authors estimate this comes to some 15 billion words of information available over the web. For comparison, the Library of Congress, which has been collecting books since 1800, possesses 17 million books and 95 million other works.

But the web is different from the Library of Congress. Web pages are easier to catalog - feed them through a computer database indexer, and you can quickly create a full-text searchable index of every page you can suck through your system. On the other hand, it's harder to find pages to suck in the first place, because they're not confined to a library's storage archives.

AltaVista is one of the fastest search engines, perhaps the fastest. Created as a showcase for Digital Equipment Corporation's high-end line of superfast computers, the search engine processes a mammoth amount of data daily to create an up-to-date, searchable index of web documents.

Doing this takes money, and hardware, and bandwidth. AltaVista is connected to the rest of the net by links totalling 25 gigabits/second of bandwidth.[2] For comparison, this is enough bandwidth to support 87 million phone lines. If downloading that new game demo or the latest version of Netscape took two hours with your 28.8 kilobits/second modem, it would take .008 seconds for the AltaVista search engine to download that same file.[3]
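
To put numbers on that comparison, here is a back-of-envelope sketch. Everything in it is derived from the figures quoted above (the two-hour modem download and the 25 gigabits/second of connectivity); it ignores protocol overhead and the limits of the sending server:

    # Back-of-envelope check of the download comparison above. All inputs
    # come from the article's own figures; protocol overhead is ignored.

    MODEM_BPS = 28_800                   # a 28.8 kilobit/second modem
    ALTAVISTA_BPS = 25e9                 # 25 gigabits/second of connectivity
    MODEM_DOWNLOAD_SECONDS = 2 * 3600    # "two hours" on the modem

    # Infer the size of the hypothetical game demo from the modem figures.
    file_bits = MODEM_BPS * MODEM_DOWNLOAD_SECONDS        # ~207 million bits

    # Time to pull the same file over AltaVista's links.
    altavista_seconds = file_bits / ALTAVISTA_BPS

    print(f"file size: {file_bits / 8 / 1e6:.1f} megabytes")            # ~25.9 MB
    print(f"AltaVista transfer time: {altavista_seconds:.4f} seconds")  # ~0.0083 s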

AltaVista runs on a tremendous amount of hardware, as well. Currently, 16 of Digital's top-of-the-line AlphaServer 8400 5/440s form the backbone of the search engine, with several other computers for various specialized tasks. The total value of the hardware that runs AltaVista is something over $15,000,000.[4]

Yet with all this hardware, all this bandwidth, and no human interaction in the indexing of web pages (adding a new page takes a tiny fraction of a second), AltaVista still can't index the entire web. According to the Science study, it has about 28% of the web in its database. Nearly three quarters of all web pages are NOT in the AltaVista database.

Let's compare this achievement to the claims of censorware makers. All of them claim to have scanned more or less the entire web for naughty words, homosexual references, and anything else they don't want you to see. That alone would be remarkable, but they further claim that each page is reviewed by a human being before it is added to the blacklist.

CyberPatrol, for instance, employs about a dozen people to review sites for its list. Other censorware makers seem to have even fewer, relying (though they deny it) on a computerized search to determine which pages are fit or unfit for human perusal.

But how effective can their blocking be? AltaVista covers about a quarter of the web. The most complete (and newest) of the search engines studied is HotBot, covering some 34% of the web. Yet somehow these companies - running home-grown software on a few Pentiums connected to the net by, at best, a T-1 line, a connection with .0062% of the speed of AltaVista's link to the net - assert that they index the ENTIRE web. No censorware company has more than a tiny fraction of AltaVista's bandwidth. No censorware company has more than a tiny fraction of the hardware, or the technical expertise, of the people at Digital Equipment Corporation. And supposedly every one of those pages is viewed by a human being, taking a minute or more instead of a split second?
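
That .0062% figure is easy to verify. The only number added in the sketch below is the standard T-1 capacity of 1.544 megabits/second; the rest comes from the bandwidth quoted above:

    # Sanity check of the bandwidth ratio quoted above.
    T1_BPS = 1_544_000       # a T-1 line carries 1.544 megabits/second
    ALTAVISTA_BPS = 25e9     # AltaVista's 25 gigabits/second, as above

    ratio_percent = T1_BPS / ALTAVISTA_BPS * 100
    print(f"a T-1 is {ratio_percent:.4f}% of AltaVista's bandwidth")   # ~0.0062%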

Hah.

Not only are the criteria for adding pages to the blacklists arbitrary and vague, but the companies simply cannot - it is mathematically impossible - even *view* any significant portion of the web. A T-1 line sucking data full-time, 365 days a year, does not take in enough data to index the entire web at its current size and rate of change, yet the claim is still made with a straight face.
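
The arithmetic behind that last claim is easy to sketch. In the figures below, the 320 million pages come from the Science study cited above and the T-1 speed is the standard 1.544 megabits/second; the average page size is my own illustrative assumption, since the study's word counts leave out HTML markup:

    # How much can a T-1 line pull down in a year of nonstop crawling,
    # and how does that compare with a single pass over the web?

    T1_BYTES_PER_SECOND = 1_544_000 / 8        # T-1: 1.544 megabits/second
    SECONDS_PER_YEAR = 365 * 24 * 3600

    WEB_PAGES = 320e6                          # the Science study's estimate
    AVG_PAGE_BYTES = 20_000                    # assumed ~20 KB of HTML per page

    t1_year_bytes = T1_BYTES_PER_SECOND * SECONDS_PER_YEAR    # ~6.1 terabytes
    one_pass_bytes = WEB_PAGES * AVG_PAGE_BYTES                # ~6.4 terabytes

    print(f"T-1, saturated for a year: {t1_year_bytes / 1e12:.1f} TB")
    print(f"one crawl of the web:      {one_pass_bytes / 1e12:.1f} TB")
    # Even with zero overhead and 100% utilization, a single pass over the
    # web eats roughly a full year of the line, leaving nothing for
    # re-crawling pages as they change.

Halve the assumed page size and the picture barely improves: one static snapshot of the web still ties up a perfectly utilized line for months, before a single changed page is re-visited.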

It doesn't take more than basic math to determine that their claims cannot even approach the truth.
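
Here is that math for the human-review claim, using nothing but numbers already quoted above - about a dozen reviewers, a minute per page, and 320 million pages:

    # The human-review claim, taken at face value.
    REVIEWERS = 12                 # "about a dozen people"
    SECONDS_PER_PAGE = 60          # "a minute or more" per page
    WEB_PAGES = 320e6              # the Science study's estimate

    total_seconds = WEB_PAGES * SECONDS_PER_PAGE / REVIEWERS
    years_nonstop = total_seconds / (365 * 24 * 3600)

    print(f"{years_nonstop:.0f} years of round-the-clock work per reviewer")
    # Roughly 51 years each, with no sleep and no weekends, assuming the
    # web never grows or changes in the meantime.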

[1] Reported at: http://www.news.com/News/Item/0,4,20728,00.html , among other places.

[2] http://www.internetworld.com/print/1998/03/23/infrastructure/19980323-huge.html

[3] Assuming the originating computer can handle outputting the file at that rate, of course.

[4] Retail. Estimated from retail prices of Digital's servers in standard configurations. Considering the extreme amount of disk space and memory used in the search engine, this estimate of total hardware value may be substantially low.


Michael Sims is a charter member of The Censorware Project.