Monday, August 18, 2008

The Multiplication of little Bits

The data explosion is no longer a hypothesis but a fact. Analysts keep predicting exponential data growth for the coming years.

Each and every one of us has surely heard such comments more than once. This is not news and does not really bring anything useful to the debate. As a corollary, the capacities of hard drives and memory cards are also increasing at a surprising pace. This is not news either, but it is rather cool for our mobile devices, digital cameras, GPS systems, MP3 players and other computers.

But where is all this data coming from?
Of course, anyone who claims to know a bit about hi-tech will immediately blame voice and video applications as the main source of the data explosion. People with a bit more experience will talk about automated sensors measuring and recording data about almost everything, from the environment (temperature, pollution levels…) and car traffic (traffic jams) to security (alarms). But what most people forget, and what remains the nightmare of quite a few IT managers, is email. Email is the perfect example of useless data replication in frightening proportions. Here are a couple of examples.

Let’s start with something simple. You have just decided to invite 10 of your friends to a barbecue by email. You send them a nice, small text email of only 1 kilobyte (1KB). You hit the send button and, miracle, your email has been multiplied tenfold. Ten more copies have been created, each of them surfing the World Wide Web and hunting for its recipient’s mailbox. Now, when each of your friends replies, they will most likely include your original message in their email along with their response. If we suppose that such a reply is 2KB in size, then 31 emails will have been stored, for a total of 51KB of data. How did we manage to create 51KB of data?
- 1 original email still stored in your “Sent Items” folder (1KB)
- 10 emails stored in the recipients’ mailboxes (10x 1KB)
- 10 answers received in your inbox (10x 2KB)
- 10 original answers stored in the “Sent Items” folders of your friends (10x 2KB)

Out of these 51KB, the only interesting information is your invitation (1KB) and your friends’ answers, excluding the part of each mail that quotes your original invitation, which leaves only 1KB of meaningful data per answer. Summed that way, we have 11KB of meaningful information and 40KB of uselessly duplicated data.
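For readers who want to replay the count with their own numbers, here is a minimal Python sketch of the tally above. The function name and its parameters are made up for illustration, and it assumes, as the example does, that each reply is simply the quoted invitation plus new text.

```python
def barbecue_tally(friends=10, invitation_kb=1, reply_kb=2):
    """Return (stored emails, total KB, meaningful KB, duplicated KB).

    Hypothetical helper: sizes are in kilobytes and every copy is
    assumed to be kept forever, as in the barbecue example.
    """
    # Each entry is (number of stored copies, size of one copy in KB).
    copies = {
        "invitation in my Sent Items": (1, invitation_kb),
        "invitation in friends' inboxes": (friends, invitation_kb),
        "replies in my inbox": (friends, reply_kb),
        "replies in friends' Sent Items": (friends, reply_kb),
    }
    emails = sum(count for count, _ in copies.values())
    total_kb = sum(count * size for count, size in copies.values())
    # Meaningful data: the invitation once, plus the new text of each reply
    # (reply size minus the quoted invitation).
    meaningful_kb = invitation_kb + friends * (reply_kb - invitation_kb)
    return emails, total_kb, meaningful_kb, total_kb - meaningful_kb


print(barbecue_tally())  # (31, 51, 11, 40)
```

Changing `friends` shows how quickly the duplicated share grows with the length of the guest list.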

And sometimes it can go even further and have more impressive consequences. Here is an example taken directly from my past experience. As a member of a workgroup of 10 people, I received a one-megabyte document from the group leader. Each workgroup member had to fill in a part of the document and send it back. Can you estimate the amount of data that such a process will generate? Well, here is the result according to the previous method of calculation.
- 1x 1MB for the original document stored on the sender’s hard drive.
- 10x 1MB for the original document stored in the 9 recipients’ inboxes and in the sender’s “Sent Items” folder.
- 9x 1MB for the 9 local copies containing each recipient’s updates.
- 18x 1MB for the 9 replies, each stored twice: once in the workgroup leader’s inbox and once in the sending member’s “Sent Items” folder.
- 1x 1MB for the final document gathering all updates, stored in a shared folder.

That already makes 39MB of data for only one megabyte of useful data, mostly because most members wanted to keep all of their email for as long as possible (i.e., forever). And was that the end of it? Not even close. All members of this workgroup were equipped with laptops, and each laptop had an automated backup system copying all data onto a NAS. As a result, the total data volume doubled to 78MB: seventy-eight megabytes of data for only one megabyte of final useful information.
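The same bookkeeping can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are invented, every copy of the document is assumed to stay at the same size, and the backup is assumed to double the whole total, which is the simplification used in the count above.

```python
def workgroup_tally(members=10, doc_mb=1, backup_copies=1):
    """Return (MB before backup, MB after backup) for the workgroup example.

    Hypothetical helper: assumes one leader, members - 1 recipients, and
    that each backup copy duplicates the whole total.
    """
    recipients = members - 1
    # Each entry is (where the copies live, number of copies).
    copies = [
        ("original on the leader's hard drive", 1),
        ("original in recipients' inboxes and leader's Sent Items", recipients + 1),
        ("edited local copies on recipients' machines", recipients),
        ("replies in leader's inbox and members' Sent Items", 2 * recipients),
        ("final merged document on the shared folder", 1),
    ]
    before_backup = sum(count for _, count in copies) * doc_mb
    after_backup = before_backup * (1 + backup_copies)
    return before_backup, after_backup


print(workgroup_tally())  # (39, 78)
```

With a 5MB document instead of 1MB, the same ten people would end up storing 390MB.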

Is this a bad thing?
Well, it is certainly a bad habit to want to keep all of one’s emails at all costs instead of concentrating on the final deliverable (when applicable). Therefore, it is probably more a people problem than a technology problem, and people are hard to change. As a side note, it also means that IT managers need to plan for storage space and that CEOs must give them the budget for it.