The Volokh Conspiracy

One Trillion Unique URLs:

A terafic milestone, according to Google's report,

[O]ur systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

How do we find all those pages? We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.

So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite — for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We're not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what's a useful page, and there is no exact answer.

We don't index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers....
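The crawl Google describes (start from seed pages, follow links outward, then drop exact duplicates) can be sketched in a few lines. This is only a toy illustration, not Google's system: the `TOY_WEB` dictionary, the `crawl` function, and the `max_urls` cap are all hypothetical names invented here, and the "web" is an in-memory table rather than real HTTP fetches. The cap stands in for whatever policy keeps a real crawler from following a calendar's "next day" link forever.

```python
from collections import deque
import hashlib

# Hypothetical toy "web": URL -> (page content, outgoing links).
TOY_WEB = {
    "a.example/": ("home", ["a.example/about", "a.example/index.html", "b.example/"]),
    "a.example/index.html": ("home", ["a.example/about"]),  # same content as a.example/
    "a.example/about": ("about us", ["a.example/"]),
    "b.example/": ("blog", ["a.example/", "b.example/missing"]),
}


def crawl(seeds, fetch, max_urls=1000):
    """Breadth-first link discovery with exact-duplicate removal.

    fetch(url) returns (content, links), or None for a dead link.
    Returns (all URLs discovered, {content hash: first URL with that content}).
    """
    seen_urls = set(seeds)
    queue = deque(seeds)
    unique_pages = {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # dead link: it counts as a URL but not as a page
            continue
        content, links = page
        # Pages with byte-identical content collapse to one hash entry.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        unique_pages.setdefault(digest, url)
        for link in links:
            # The cap stops endless auto-generated chains (the calendar case).
            if link not in seen_urls and len(seen_urls) < max_urls:
                seen_urls.add(link)
                queue.append(link)
    return seen_urls, unique_pages


urls, pages = crawl(["a.example/"], TOY_WEB.get)
print(len(urls), "URLs discovered,", len(pages), "unique pages")
```

On the toy data this discovers 5 URLs but only 3 unique pages, which is the same gap between "links found" and "unique pages" that the report describes, just twelve orders of magnitude smaller.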

Thanks to Dan Friedman for the pointer.

Grobstein:
A terafic milestone

Well played.
7.28.2008 1:21pm
DNL:
1 trillion URLs... but their index is only about 25% of that, max. What gives? (Here's my guess -- they aren't indexing all those blog posts.)
7.28.2008 1:43pm
Sean O'Hara:
I wonder how many of those unique URLs lead to misconfigured home servers? I know there are certain letter combinations that you can put into Google and turn up millions of webcams that aren't properly secured, and all those count as distinctive URLs.
7.28.2008 1:48pm
Houston Lawyer:
The Google search goes quite deep. If I google myself, I find documents filed by clients on EDGAR (the SEC's website) where my name was included as a designated contact person. That's fairly useless information for me, but could be useful to someone.
7.28.2008 2:26pm
Displaced Midwesterner:
So I'm assuming this means that there are approximately 800 billion unique URLs for porn now. Not bad.
7.28.2008 7:07pm
Dave N:
Displaced Midwesterner: 900 billion unique URLs if you add gambling and phishing sites to the porn ones.
7.28.2008 8:13pm