Spectrum Online—Tomorrows Technology Today

When is a terabyte not a terabyte?


Consider the following exchange from the generally excellent public radio show, On The Media.

BROOKE GLADSTONE: Mathematician Martin Wattenberg observed in Wired that the sum total of all the words you'll hear in your lifetime amount to less than a terabyte of text. So then how much is a petabyte?

CHRIS ANDERSON: A petabyte is, mathematically it is, you know, 1,000 terabytes, but we have a hard time understanding that scale. We usually use the sort of, you know, the Library of Congress, as an example. The Library of Congress is sort of, you know, on the - you know, on a couple of terabytes scale; a petabyte’s a thousand of those.


We've never seen petabyte scale data aggregations before. There’s never been anything like that because we're still relatively early in, you know, the digital age. But Google has just hit that state. Google processes about a petabyte of information every 72 minutes, and a year from now it'll process a petabyte every half an hour, and so on.

There's a category mistake here to the tune of at least two, maybe three or four, orders of magnitude; it's an awful lot like mixing little-c calories and big-c Calories in the same sentence.

Consider a simple document containing the words "hello world" - 11 bytes of information, in a fairly straightforward way of counting them. The same words took up 19 456 bytes when put into a Microsoft Word document. Take an image-file snapshot of it, and it might blow up to a 76 468-byte file, as it did when I used the Grab utility a few minutes ago.
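To put rough numbers on that blow-up, here is a minimal sketch that compares the three sizes quoted above. The function name `orders_of_magnitude` is just an illustrative helper, and the Word and screenshot sizes are the ones I measured, so your own files will likely differ.

```python
import math

# Sizes in bytes. The plain-text size is computed; the other two are
# the figures reported in this post and will vary by software version.
PLAIN_TEXT = len("hello world".encode("ascii"))
WORD_DOC = 19_456
SCREENSHOT = 76_468


def orders_of_magnitude(size: int, baseline: int = PLAIN_TEXT) -> float:
    """Return log10 of the ratio between a file size and the plain-text baseline."""
    return math.log10(size / baseline)


for label, size in [
    ("plain text", PLAIN_TEXT),
    ("Word document", WORD_DOC),
    ("screenshot", SCREENSHOT),
]:
    ratio = size / PLAIN_TEXT
    print(f"{label:>14}: {size:>6} bytes, {ratio:8.1f}x, "
          f"{orders_of_magnitude(size):.2f} orders of magnitude")
```

Both the Word file and the screenshot land between three and four orders of magnitude above the 11 bytes of raw text, before anyone has added a single extra word.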

When we measure the Library of Congress, we tend to make successive estimates of the number of books, pages, words, and, finally, bytes, taking as a rough measure one byte per character. The makers of these estimates generally look for the smallest possible number, and I suppose we should be grateful they don't subject the resulting number to some ZIP-like compression.
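That chain of estimates can be sketched in a few lines. Every input below is a made-up placeholder chosen only to show the arithmetic, not a figure from the Library of Congress or from anyone's published estimate, and `estimate_text_bytes` is a hypothetical helper name.

```python
BYTES_PER_CHAR = 1          # the "one byte per character" rule of thumb
BYTES_PER_TERABYTE = 10**12  # decimal terabyte


def estimate_text_bytes(books: int, pages_per_book: int,
                        words_per_page: int, chars_per_word: int) -> int:
    """Chain the successive estimates: books -> pages -> words -> characters -> bytes."""
    pages = books * pages_per_book
    words = pages * words_per_page
    chars = words * chars_per_word
    return chars * BYTES_PER_CHAR


# Placeholder inputs (assumptions for illustration only).
total = estimate_text_bytes(
    books=20_000_000,
    pages_per_book=300,
    words_per_page=400,
    chars_per_word=6,  # includes the trailing space
)
print(f"{total:,} bytes, or about {total / BYTES_PER_TERABYTE:.1f} TB of bare text")
```

Because every factor is multiplied through, the result is only as good as the loosest guess in the chain, and it counts nothing but characters: no page images, no formatting, no file-format overhead.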

When we consider Google's petabytes, we're presumably looking at the number of bytes its spiders crawl through, or the bytes on its millions of hard disks, throwing video, audio, JPEGs, and PDFs in with the text. Indeed, a lot of simple text gets counted in terms of the HTML pages it resides on. For comparison purposes, "hello world" is a cool 2500 bytes when you let Microsoft Word turn it into a .htm file.

Then there's the question of information, and Information. The Library of Congress is filled mainly with books. That is, it contains words that have been carefully thought out, then written, then vetted by a publisher, then pushed out into the world at great expense, with the expectation that they will be useful and interesting to thousands, often millions, of people, across several decades.

Google's cache, on the other hand, is filled with MySpace diary entries, LOLcat images, video clips of Jon Stewart, and millions of copies of ABBA songs. Compress those terabytes down to their truly useful and interesting elements, and you have, well, not much more than the 11 bytes of "hello world."

About

This post was last updated July 23, 2008 7:19 PM.
