Operational Excellence with Technology

Bunches and Bunches of Data

Every year I have the privilege of attending Defrag, a conference unlike any other I attend during the year. Where other technology conferences are about what's new, what's hot, or some specific topic, Defrag often appears as a random walk through technologies that will be. But every year I come away with one or two insights that prove to be predictors of future trends. It is an amazing, confusing experience.

Last year, there were a couple of presentations on "big data." One I especially remember, by Jeff Jonas at IBM, started me thinking about time- and location-tagged information. He talked about some databases with hundreds of billions of rows, which, compared to most business databases with thousands of rows, is big.

This year, there were more presentations on big data, but the size has gotten even bigger. One presenter's company, GNIP, is processing 30 billion data elements a month. Previously, big data was considered to be at least 1 terabyte (that's 1 million megabytes); now it is considered to be 10 terabytes. Once the supply disruption from the Thailand floods settles out, you will be able to buy 10 terabytes of storage for about $750-1,000. Every business will have the capacity to have big data. DOD and the NSA are working with petabyte-sized data sets (a petabyte is 1,000,000,000,000,000 characters of data!). Google processes about 24 PB of data per day!

More and more big data is unstructured, so the tools for analyzing this data will evolve away from the tools used to analyze traditional databases. Fortunately a lot of big data software is being developed open source, making it accessible to any size business with the will (and some manpower) to make use of it.

One thing that big data is teaching us is that data is valuable (e.g. a temperature), but data in context is worth a whole lot more (e.g. the temperature in Fargo, North Dakota at 7:45 am on January 15th compared to the same location and time over the previous 20 years). For unstructured data, it becomes more complex. For instance, if someone tweeted something about one of your products: where were they, what were they doing, was there some event going on, what does their social network look like (not only on Twitter but perhaps also on Facebook)? All that context makes that one 140-character tweet more valuable to you.

And data exhibits the same effect as a network. The bigger your data set, the more insight you can get. The common network example is a fax machine. If you are the only one with one, it's not very useful. When another person gets one, it's more valuable. But when everyone you know has one, you all can communicate with each other. Called Metcalfe's law, it says that the value of your network is proportional to the square of the number of members. It's the same with data: one data point is interesting, a million lets you see trends.
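To make the fax machine example concrete, here is a small Python sketch of Metcalfe's law. The function names and the constant `k` are my own, chosen for illustration; the law only says "proportional to," so the absolute numbers mean nothing and only the growth rate matters.

```python
def possible_links(members: int) -> int:
    """Number of distinct pairs that can talk to each other."""
    return members * (members - 1) // 2


def metcalfe_value(members: int, k: float = 1.0) -> float:
    """Network value under Metcalfe's law: proportional to members squared.

    k is an arbitrary proportionality constant (an assumption for this sketch).
    """
    return k * members ** 2


if __name__ == "__main__":
    for n in (1, 2, 10, 100, 1000):
        print(f"{n:>5} members: {possible_links(n):>7} possible links, "
              f"relative value {metcalfe_value(n):,.0f}")
```

One fax machine has no one to call. Growing the network tenfold grows the value roughly a hundredfold.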

And data collection is increasing. Utilities used to read your meter once a month. It was expensive to send someone out to read that meter. Then they got meters that could be read remotely, via telephone or, now, a cellular data connection. So now some utilities are reading your meter every 15 minutes, or more often. They know the demand curve of every house by time of day. With that data they can better plan their generation needs. The question for you is, "What can I measure that might help me run my business better, or design a better product?"
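Here is a rough Python sketch of what "demand curve by time of day" could look like for one house. The readings are made up for illustration (two days of 15-minute readings, 96 per day, with heavier usage in the early evening), and `demand_curve` is a hypothetical helper, not anything a utility actually publishes.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def demand_curve(readings):
    """Average usage per hour of day from (timestamp, kwh) interval readings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ts, kwh in readings:
        totals[ts.hour] += kwh
        counts[ts.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


# Made-up sample data: two days of 15-minute readings for one house.
start = datetime(2012, 1, 15)
readings = []
for i in range(2 * 96):
    ts = start + timedelta(minutes=15 * i)
    kwh = 0.5 if 17 <= ts.hour < 21 else 0.2  # assumed evening peak
    readings.append((ts, kwh))

curve = demand_curve(readings)
peak_hour = max(curve, key=curve.get)

if __name__ == "__main__":
    print(f"Peak demand starts around hour {peak_hour}")
```

A monthly reading gives one number per house. Readings every 15 minutes give almost 3,000 per house per month, and that is where the curve comes from.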

You may be thinking to yourself "I won't be doing any big data work in my lifetime." I think you'll be surprised. Just a few months after hearing about big data at Defrag I find myself working on a project to integrate Goodreads data with a library. Goodreads currently has more than 6,600,000 members and 230,000,000 books added to member profiles and has 20 billion data points, mostly in the form of reviews.

Amazon has over 50 large public data sets that you can download, or use with their cloud services. The U.S. Government is making more and more of its data available (just under 400,000 data sets). You might consider whether your business, with the addition of one or more of these public domain data sets, could create some new product, or add value to an existing product.

There are more than nine billion internet-connected devices… and growing. Two billion of those are devices creating data sent to some server. If your products aren't generating data now, they soon will be. That data adds up, fast.

Big data is in your future; it's just a matter of when. The good news is you can dip your feet in the water and start experimenting with big data analysis tools without making a full investment. Amazon Web Services has an on-demand, cloud-based option called Elastic MapReduce.
