Top 10 data mining mistakes

SAS | Aug. 13, 2010
Avoid common pitfalls on the path to data mining success. Any one mistake that is left undetected can leave decision makers and analysts worse off than if they hadn’t collected data at all.

Arguably more art than science, data mining can provide useful patterns to fuel strategic decision-making.

So what are some data mining pitfalls to avoid? This complimentary book excerpt from the Handbook of Statistical Analysis and Data Mining Applications shares the top 10 mistakes of data mining. Most are basic, though a few are subtle. But any one that is left undetected can leave analysts worse off than if they hadn’t collected data at all. Briefly, numbering like a computer scientist (with an overflow problem), here are the mistakes.

ZERO: Lack proper data.

Some projects shouldn’t proceed until enough critical data is gathered to make them worthwhile.

ONE: Focus on training.

Obsession with getting the most out of training cases focuses the model too much on the peculiarities of that data to the detriment of inducing general lessons that will apply to similar, but unseen, data. Try resampling, with multiple modeling experiments and different samples of the data, to illuminate the distribution of results.
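As a rough illustration of the resampling advice, the sketch below repeats a train/test split many times and looks at the spread of held-out accuracy rather than a single score. Everything in it is an assumption made for the example: the synthetic data, the toy threshold model and the helper names (fit_threshold, accuracy, resampled_scores) do not come from the book excerpt.

```python
import random
import statistics

def fit_threshold(train):
    """Toy model: put the cut-off halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def accuracy(threshold, rows):
    """Share of rows where 'x above threshold' agrees with the label."""
    hits = sum((x > threshold) == (y == 1) for x, y in rows)
    return hits / len(rows)

def resampled_scores(rows, n_rounds=200, test_fraction=0.3, seed=0):
    """Repeat a random train/test split and collect held-out accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(accuracy(fit_threshold(train), test))
    return scores

# Synthetic data (an assumption for the example): two overlapping classes.
data_rng = random.Random(42)
data = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(data_rng.gauss(1.5, 1.0), 1) for _ in range(100)]

scores = resampled_scores(data)
print("min", min(scores), "mean", statistics.mean(scores), "max", max(scores))
```

The gap between the minimum and maximum is the point: a single split could have reported either extreme, and only the distribution shows how far a result can be trusted.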

TWO: Rely on one technique.

For many reasons, most researchers and practitioners focus too narrowly on one type of modelling technique. At the very least, be sure to compare any new and promising method against a stodgy conventional one. But, for best results, you need a whole toolkit.

THREE: Ask the wrong question.

It’s important first to have the right project goal or ask the right question of the data. It’s also essential to have an appropriate model goal.

FOUR: Listen (only) to the data.

Don’t tune out received wisdom while letting the data speak. No modelling technology alone can correct for flaws in the data.

FIVE: Accept leaks from the future.

Data warehouses are built to hold the best information known to date; they are not naturally able to pull out what was known during the timeframe that you wish to study. So when storing data for future mining, it’s important to date-stamp records and to archive the full collection at regular intervals.
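One way to picture the date-stamping advice is an "as of" lookup over archived records, which returns only what was known at the chosen cut-off date. This is a minimal sketch under assumed conditions: the record layout (customer id, date recorded, value), the sample rows and the function name value_as_of are all hypothetical.

```python
from datetime import date

def value_as_of(history, customer_id, cutoff):
    """Return the latest value recorded on or before cutoff, or None."""
    known = [
        (stamp, value)
        for cid, stamp, value in history
        if cid == customer_id and stamp <= cutoff
    ]
    if not known:
        return None  # nothing was known about this customer yet
    return max(known, key=lambda record: record[0])[1]

# Hypothetical date-stamped archive: (customer id, date recorded, value).
history = [
    ("c1", date(2009, 1, 15), "bronze"),
    ("c1", date(2009, 9, 1), "gold"),
    ("c2", date(2010, 2, 1), "silver"),
]

# A model studying mid-2009 should see "bronze", not the later "gold".
print(value_as_of(history, "c1", date(2009, 6, 30)))
print(value_as_of(history, "c1", date(2010, 1, 1)))
print(value_as_of(history, "c2", date(2009, 6, 30)))
```

Building training inputs through a lookup like this, instead of reading the current warehouse value, keeps later knowledge from leaking into a model of the earlier period.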

SIX: Discount pesky cases.

Outliers and leverage points can greatly affect summary results and cloud general trends. Don’t dismiss them; they could be the result.

SEVEN: Extrapolate.

We tend to learn too much from our first few experiences with a technique or problem.

EIGHT: Answer every inquiry.

If only a model answered “Don’t know!” for situations in which its training has no standing!
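A crude way to approximate a "Don’t know!" answer is to wrap a model so that it declines to predict for inputs outside the range seen in training. The sketch below assumes a single numeric input and a simple range check; the wrapper name make_guarded and the example model are hypothetical, and real projects would likely need a richer test of whether a case resembles the training data.

```python
def make_guarded(predict, training_inputs, margin=0.0):
    """Wrap predict so it returns None outside the training range."""
    lowest, highest = min(training_inputs), max(training_inputs)

    def guarded(x):
        if x < lowest - margin or x > highest + margin:
            return None  # the model's way of saying "Don't know!"
        return predict(x)

    return guarded

# Hypothetical model fitted on customers aged 20 to 60.
training_ages = [20, 25, 31, 44, 52, 60]
spend_model = lambda age: 100 + 3 * age
guarded_model = make_guarded(spend_model, training_ages)

print(guarded_model(40))  # inside the training range, so a number comes back
print(guarded_model(85))  # outside the training range, so None comes back
```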

NINE: Sample casually.

The interesting cases for many data mining problems are rare and the analytic challenge is akin to finding needles in a haystack.
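One deliberate alternative to casual sampling, sketched here as an assumption rather than as the book’s prescription, is to keep every rare case, down-sample the common ones and attach a weight so that totals can still be scaled back to the full population. The function name sample_keeping_rare and the example data are hypothetical.

```python
import random

def sample_keeping_rare(rows, is_rare, common_rate, seed=0):
    """Keep all rare rows; keep common rows with probability common_rate.

    Each kept row is paired with a weight: 1.0 for rare rows and
    1 / common_rate for common rows, so weighted counts estimate
    the size of the original population.
    """
    rng = random.Random(seed)
    sample = []
    for row in rows:
        if is_rare(row):
            sample.append((row, 1.0))
        elif rng.random() < common_rate:
            sample.append((row, 1.0 / common_rate))
    return sample

# Hypothetical population: 1,000 cases, of which 10 are the needles.
rows = list(range(1000))
is_needle = lambda row: row % 100 == 0

sample = sample_keeping_rare(rows, is_needle, common_rate=0.25)
needles = [row for row, weight in sample if is_needle(row)]
print(len(sample), "rows kept, including", len(needles), "needles")
```

A plain 25 per cent random sample of the same data would be expected to keep only two or three of the ten needles, which is the casual-sampling trap this mistake warns about.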

TEN: Believe the best model.

Don’t read too much into models; doing so may do more harm than good.

No Longer Taxing

An organisation that has managed to sidestep data mining pitfalls and gleaned the benefits is the Australian Taxation Office (ATO). The ATO is Australia’s main revenue collection agency and one of the biggest arms of the Commonwealth Government of Australia.

Data mining and analytical software is helping the ATO develop models to understand its clients better and to customise its services to them, for both customer relationship and compliance purposes. The ATO currently has more than 30 analytical models in development, among them models that analyse information for clues to incorrect tax liability reporting, such as tax evasion or avoidance.

For example, using both its own data and data sourced from third parties, it is possible to spot apparent mismatches between, say, an individual person’s declared income and a propensity for expensive motor cars or other less common lifestyle spending. SAS calls this getting at one version of the truth. In times of prosperity, self-employment tends to proliferate and, with it, the expansion of cash economies which lend themselves to concealment. In the case of a company, for example, modelling a company’s returns with its declared staffing levels and generic data relating to the business the company is in can reveal apparent under-reporting of revenues and over-reporting of outlays.

Highlighting the ATO’s return on its investment in analytics and other initiatives, the agency’s preliminary results for the 2005-06 tax year revealed that AUD $51.8 million was restrained, confiscated or recovered from the proceeds of crime, and that 100 successful convictions were secured, resulting in many millions of dollars in fines and prison sentences of up to five years.

 
