Who's got more data? The largest retailer in the world or the largest library in the world?
Walmart tends to over 500 terabytes of data (see here, here, etc.) while the Library of Congress, the largest according to the Guinness Book of World Records, has a paltry 20 terabytes, dwarfed by comparison.
To hear it from data warehouse vendors, data mining academics, data-savvy politicians, or data-fixated citizens, Walmart versus the LOC is like New World versus Old World, the future versus the past, fast versus slow, wired versus tired.
The more things change, the more they stay the same. The flood of data has not washed away these two age-old truisms.
- It's not the Quantity, it's the Quality.
Walmart knows in excruciating detail who bought what and when: "we capture data on every item, for every customer, for every store, every day". In the Library of Congress reside a Gutenberg Bible, books in 470 languages, world newspapers from the past 300 years, and millions of maps and pieces of sheet music, among other treasures.
So who's got more data now?
Corollary: When More is Less
With the riches of data has come a profusion of "data-rich" graphics, which take Tufte's data-ink ratio to ridiculous extremes. The result is more clutter, less clarity.
It's an example of pages of ink used to illustrate one number. But let the imagination turn this into a data-rich graphic, and assume that the Googlers are ordered from most to least tenured, and that the colors are associated with distinct job functions. So the blue one is employee #1, an engineer; the red one is employee #2, a business person; etc.
A lot of data, but how much data really?
Some time ago, the New York Times used a more interesting take on this concept to illustrate casualties in Iraq.
Good statistics is always about data reduction. In this age of abundant data, it is technically challenging to infuse graphics with as much of it as possible. Resist the temptation to overdo it.