

Beware the Pirates of Big Data

Andy Harrison, Director, Analytic Consulting, FICO | May 29, 2015
More data is not always better and the blind application of statistical analysis and data mining can lead to spurious correlations.

This vendor-written piece has been edited by Executive Networks Media to eliminate product promotion, but readers should note it will likely favour the submitter's approach.

How can you make the best use of the explosion in the availability, complexity, variety and volume of data over the last decade, popularly known as Big Data? Think! Don't be led blindly by the data, Big or otherwise.

If you're familiar with Big Data, you will be familiar with the four Vs of Volume, Velocity, Variety and Variability - five if you add Veracity. The word Big focuses us on the V of Volume, and talk quickly turns to exabytes and zettabytes (a zettabyte is 10^21 bytes, or 1,000 exabytes).

You'll have been told that all those zettabytes offer you great opportunities to gain profitable insights.  But do they?

There is an implicit assumption that more data means more information.  There is a danger that if you blindly apply statistical and data mining tools to massive data sets you will be doing little more than numerology.  Given enough data you can prove anything!
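To see how easily this happens, here is a minimal sketch in Python. It assumes nothing more than random noise: one target series and 1,000 candidate variables, all generated independently, so none of them has any real relationship to the target. The sample sizes, the seed and the variable names are illustrative choices, not taken from any real data set.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_obs = 20               # assumed: only a few observations per variable
n_candidates = 1000      # assumed: many candidate variables to trawl through

# The target and every candidate are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)
]

# Blindly correlate every candidate with the target and keep the "best" one.
correlations = [pearson(candidate, target) for candidate in candidates]
strongest = max(correlations, key=abs)

print(f"Strongest correlation among {n_candidates} unrelated variables: "
      f"{strongest:.2f}")
```

Search enough unrelated variables and one of them will appear to track the target closely, purely by chance. The more columns you trawl, the more impressive the best accident looks.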

More than that, if you forget the difference between correlation and causality and fail to sense-check your results, you can find yourself drawing ridiculous conclusions.

Consider pirates.  If you look at the number of pirates in the world since the early 19th century and plot this against the global average temperature you can see that there is a statistically significant inverse relationship.  If you take this at face value, you could conclude that global warming is due to the decline in the number of pirates in the world.

[Chart: Pirates of big data]

The data suggests we could solve global warming if we encouraged piracy.

You can be confident that the decline in pirate numbers is not the cause of global warming. This chart demonstrates that even a strong correlation can be nonsensical, and that if you don't sense-check your conclusions you may look foolish.
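The pirate effect is easy to reproduce. The sketch below uses entirely synthetic numbers, not real pirate counts or temperature records: one series is built to fall steadily over 200 "years", the other to rise, and each gets its own independent noise. The slopes, noise levels and names are assumptions made for illustration. Differencing the series (looking at year-on-year changes instead of levels) is one common sense-check, offered here as an example and not as a method from this article.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def first_differences(xs):
    """Year-on-year changes: removes a steady trend from a series."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]


rng = random.Random(7)  # fixed seed so the run is repeatable
years = range(200)

# Synthetic, made-up series: one trends down, one trends up.
# Their noise terms are independent, so neither drives the other.
falling = [1000 - 4 * t + rng.gauss(0, 30) for t in years]
rising = [13 + 0.01 * t + rng.gauss(0, 0.1) for t in years]

raw_r = pearson(falling, rising)
diff_r = pearson(first_differences(falling), first_differences(rising))

print(f"Correlation of the raw series:         {raw_r:.2f}")
print(f"Correlation of year-on-year changes:   {diff_r:.2f}")
```

The raw series show a strong inverse relationship simply because both change steadily over time. Once the trends are removed, the relationship all but disappears. Any two quantities that drift over the same period will correlate, whether or not they have anything to do with each other.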

As we move to an era where we are processing bigger and bigger data sets and where analysis, by necessity, is becoming more and more automated, can you be confident that you are not finding pirates in your data?  Validating against historic data and measuring statistical significance won't save you.  So what can you do?

Look to the principles of Operational Research.  Don't start with data and see what it tells you.  Start with a question or a hypothesis and see if the data support or refute it.
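As a rough illustration of working hypothesis-first, the sketch below fixes a single question before touching the data and then runs one test of it. The scenario, the group names and the numbers are all hypothetical and the data are synthetic. A permutation test is used here only because it is simple to write out in full; it is one possible choice, not a prescribed method.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_permutations, rng):
    """One-sided permutation test of 'group_a has a higher mean than group_b'.

    Returns the share of random relabellings whose difference in means is
    at least as large as the one actually observed.
    """
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


rng = random.Random(2015)  # fixed seed so the run is repeatable

# Hypothesis, stated BEFORE looking at the data (hypothetical scenario):
# "customers who received the offer spend more than those who did not."
# Synthetic data with a built-in difference, for illustration only.
offer_group = [rng.gauss(60, 5) for _ in range(50)]
control_group = [rng.gauss(50, 5) for _ in range(50)]

p_value = permutation_p_value(offer_group, control_group, 2000, rng)
print(f"p-value for the single pre-stated hypothesis: {p_value:.4f}")
```

The point is not the particular test but the order of events: one question, chosen because it makes sense, then one look at the data. That is very different from testing thousands of combinations and reporting whichever happens to come out significant.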

Would you start with the question, is the rise in global temperature related to the number of pirates?  I hope not!

More data is not always better and the blind application of statistical analysis and data mining can lead to spurious correlations.  Remember the principles of Operational Research: start with a hypothesis and use the data to support or refute it.  But, above all, beware the pirates of Big Data!

P.S. You may wish to inspect the x-axis a little more closely.

 
