What about a Google cache on my desk?

Yesterday I said that within a decade disk space should be cheap enough to put the entire visible web on your desk for under $1000. I think that’s actually a pretty conservative estimate, since it assumes a 100 KB average page size, up to an order of magnitude higher than some estimates.

Here’s another back-of-the-envelope calculation: let’s say we wanted the equivalent of Google’s web cache on your desktop (that is, all the HTML but no images). The 2003 update to Berkeley’s How Much Info? study estimated that in 2002 the web was only 167 terabytes total, with only 30 TB of that as HTML (69 TB when you include images). Assuming 75% compression, the HTML shrinks to 7.5 TB, so call it around 8 TB.

A 2002 OCLC study calculated that the total number of web pages was only increasing by about 5% per year (with the number of sites actually shrinking, but the number of pages per site growing). That rate had been decreasing ever since the explosion in the mid ’90s, but let’s assume growth became a steady 5% and will stay at that rate for the next few years. (There are a lot of assumptions going on here, but the nice thing about these kinds of curves is that even if my numbers are off by a factor of two somewhere, so long as disk prices keep falling at the same rate, the crossover point only moves by about one year.)
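The factor-of-two claim is easy to sanity-check. Here is a minimal Python sketch, assuming the price of a terabyte halves every year (the rate implied by the price column in the table below) and the web grows 5% a year. The function name is hypothetical, chosen just for illustration.

```python
import math

# Assumptions, not measured values:
PRICE_FACTOR = 0.5   # disk price per TB is multiplied by this each year (halving)
WEB_GROWTH = 1.05    # web size is multiplied by this each year (5% growth)

def crossover_shift_years(error_factor):
    """Years the crossover point moves if the size estimate is off by error_factor.

    Cost to store changes by PRICE_FACTOR * WEB_GROWTH each year, so an
    error of error_factor is absorbed after log(error_factor) divided by
    the yearly log-decline in cost.
    """
    yearly_cost_ratio = PRICE_FACTOR * WEB_GROWTH  # 0.525, i.e. cost nearly halves
    return math.log(error_factor) / -math.log(yearly_cost_ratio)

print(round(crossover_shift_years(2.0), 2))
```

Under these assumptions a factor-of-two error moves the crossover by a little over one year, and even a factor-of-ten error moves it by less than four.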

Now we’ve got two trends, the web growing about 5% a year and the price of a terabyte of disk halving every year, and just need to find the intersection point for the price we want:

                            Size of public web
                            (compressed HTML only,
Year   Price of 1 TB disk   assumes 5% growth/year)   Cost to store
2002                        8 TB
2003                        8.5 TB
2004                        8.8 TB
2005   $500                 9.25 TB                   $4,625
2006   $250                 9.7 TB                    $2,425
2007   $125                 10.2 TB                   $1,275
2008   $62.50               10.7 TB                   $670
2009   $31.25               11.25 TB                  $350
2010   $15.50               11.8 TB                   $185

So given a few assumptions, we’ll be able to cache all the raw text on the public web for under $1000 (disk cost) within 3 years, since 2008 is the first row in the table where the cost drops below that mark!
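The table above can be reproduced in a few lines of Python. This is only a sketch under the post's stated assumptions: 8 TB of compressed HTML in 2002, 5% growth per year, and a $500 terabyte in 2005 that halves in price every year. The constant and function names are hypothetical.

```python
# Assumptions taken from the post, not independent data.
BASE_YEAR = 2002
BASE_SIZE_TB = 8.0      # compressed HTML on the public web in 2002
WEB_GROWTH = 1.05       # 5% growth per year
PRICE_YEAR = 2005
PRICE_PER_TB = 500.0    # dollars per TB of disk in 2005
PRICE_FACTOR = 0.5      # price halves every year

def web_size_tb(year):
    """Projected size of the compressed-HTML web in TB."""
    return BASE_SIZE_TB * WEB_GROWTH ** (year - BASE_YEAR)

def price_per_tb(year):
    """Projected disk price in dollars per TB."""
    return PRICE_PER_TB * PRICE_FACTOR ** (year - PRICE_YEAR)

def cost_to_store(year):
    """Projected disk cost in dollars to hold the whole compressed-HTML web."""
    return web_size_tb(year) * price_per_tb(year)

def first_year_under(budget, start=PRICE_YEAR, end=2030):
    """First year in [start, end] whose storage cost is below budget, else None."""
    for year in range(start, end + 1):
        if cost_to_store(year) < budget:
            return year
    return None

for year in range(2005, 2011):
    print(f"{year}  ${price_per_tb(year):>7.2f}  "
          f"{web_size_tb(year):5.2f} TB  ${cost_to_store(year):>8.2f}")

print("First year under $1000:", first_year_under(1000.0))
```

The printed figures differ slightly from the table because the table rounds each year's size before multiplying, but the crossover year comes out the same.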