September 1, 1998

THE LAND MINES OF DATA MINING

by Andy Oram
American Reporter Correspondent

CAMBRIDGE, MASS.—We all guard closely our privacy, and some go to great lengths to withhold information about themselves from the computerized databases of large corporations. But does that effort keep other people from making and acting on judgments about us? In truth, large companies and governments are making decisions that affect us all the time.

Companies prefer to learn as much as they can about each individual. That knowledge lets them market their products to you (or, in rather alarming military terms, target you) directly with ads or promotional offers that they hope will interest you.

But while holding back data can give them less fodder—and perhaps keep you free of their mailing lists—they will still make judgments about who you are and what you want. They will simply use statistical sampling, and to some extent you will have to live with the consequences.

The growing influence of statistical sampling, or “data mining” as it is called in the computer trade press, has been critiqued by sociologist Oscar H. Gandy. While the computer consultants enthusiastically push the efficiency of data mining, Gandy points out that it subjects the entire population to an involuntary computerized dissecting lab.

Technically, data mining is messy to carry out but conceptually simple. A business combines many different databases—such as sales patterns, service calls, etc.—and perhaps throws in a number of databases bought from specialized marketing firms. It then uses traditional database operations on a massive, automated scale to find patterns in the data.

A traditional operation, for instance, is to create a table by “joining” two relations: say, a list of purchasers with their ages and a list of the products each one bought. You can then retrieve all sales of cottage cheese to people under the age of 21. In data mining, the program performs the join monotonously over and over again, checking each age group against each product.
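
To make the repeated join concrete, here is a minimal sketch in Python. The customer ages, the purchases, the age brackets, and the function names are all invented for illustration and are not drawn from any real retailer's data or software.

```python
# Hypothetical data, invented for this sketch.
customers = {1: 17, 2: 19, 3: 34, 4: 52, 5: 68}  # customer id -> age
purchases = [                                     # (customer id, product)
    (1, "cottage cheese"), (2, "soda"), (2, "soda"),
    (3, "cottage cheese"), (3, "soda"),
    (4, "cottage cheese"), (4, "cottage cheese"), (5, "cottage cheese"),
]


def age_group(age):
    """Assumed brackets, chosen only for this example."""
    if age < 21:
        return "under 21"
    if age < 50:
        return "21-49"
    return "50 and over"


def join_sales(customers, purchases):
    """Join each purchase to the buyer's age, giving (age, product) rows."""
    return [(customers[cid], product)
            for cid, product in purchases if cid in customers]


def mine(rows):
    """Check every age group against every product and count the sales."""
    groups = sorted({age_group(age) for age, _ in rows})
    products = sorted({product for _, product in rows})
    table = {}
    for group in groups:
        for product in products:
            table[(group, product)] = sum(
                1 for age, p in rows
                if age_group(age) == group and p == product
            )
    return table


rows = join_sales(customers, purchases)

# The single traditional query: cottage cheese sold to people under 21.
under_21_cheese = [(age, p) for age, p in rows
                   if age < 21 and p == "cottage cheese"]

# The mining step: the same question asked for every combination.
table = mine(rows)
```

With this made-up data, the table shows three cottage cheese sales and no soda sales in the oldest bracket, which is the kind of pattern a marketer would then try to explain.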

From sales patterns come judgments about how to make and market products. Perhaps an age group that doesn’t buy much cottage cheese can be drawn to the dairy shelf by a certain kind of packaging. Decisions about which new products to design can also be based on what the company learns about the sales of old products.

Since databases can contain millions of entries, and a straightforward join requires checking each of the millions against each of the millions in another database (that is, the work grows with the product of the two sizes), mining requires enormous processing power. That’s why it did not become common until the past few years, when computers became fast enough to handle it.
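
To see how quickly that cross-checking grows, consider the sketch below. It assumes the most naive approach, in which every row of one table is compared with every row of the other. The table sizes and names are hypothetical. Real database systems typically use indexes or hash-based joins to avoid much of this work, so treat the count as a worst case rather than a description of any particular product.

```python
def nested_loop_join(left, right):
    """Naive join: compare every left row with every right row."""
    comparisons = 0
    matches = []
    for left_key, left_value in left:
        for right_key, right_value in right:
            comparisons += 1
            if left_key == right_key:
                matches.append((left_key, left_value, right_value))
    return matches, comparisons


# Hypothetical tables: 200 customers and 300 sales.
left = [(i, f"customer-{i}") for i in range(200)]
right = [(i % 200, f"sale-{i}") for i in range(300)]

matches, comparisons = nested_loop_join(left, right)
# comparisons is 200 * 300 = 60,000 for these small tables.

# The same arithmetic for two tables of a million rows each.
million_by_million = 1_000_000 * 1_000_000  # one trillion comparisons
```

Doubling the size of both tables quadruples the number of comparisons, which is why the naive method stops being practical long before the databases reach millions of entries.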

Of course, marketing departments have always made judgments about consumers, and for a long time they’ve used surveys to compensate for the limitations of their guesswork. But data mining introduces Taylorism into what has always been seen as a subjective, creative process. It gives the marketing group a new confidence in the scientific validity of their choices. And their choices affect you whether or not you’re in their databases.

Suppose the data miners have decided that African-American males have little interest in classical music. If you happen to be a black man with a passion for Puccini, you may rarely hear about new recordings. So long as data mining affects just direct marketing, such stereotyping may be fairly harmless. But how are we to know how far the practice will extend? How about to job and training opportunities?

Other commentators have complained about companies drawing broad conclusions based on limited ranges of facts. In addition, people change, and the preferences they showed at earlier times can follow them around after their tastes change.

I am not completely opposed to the use of statistics. For instance, I tend to the view that statistical adjustments to the U.S. census will make results more accurate. The Republicans who oppose the use of statistical methods have never done so on scientific grounds. The roadblocks they’ve put in the way of the plan are based on explicitly political concerns over losing votes as the census of low-income neighborhoods comes closer to their true populations.

But we must always acknowledge the risks of relying on statistics. Furthermore, sampling in a census is a much better understood practice than the new field of data mining. Any statistician can tell you that relationships may turn up that lack external validity.

Thus, one infamous “fact” that a store turned up through data mining a few years ago was a correlation between sales of beer and sales of diapers. While this amusing result was widely reported, nobody could figure out why people seemed to buy beer and diapers at the same time. What coherent marketing strategy can emerge from such factlets?

One profound problem is that databases contain lots of incorrect information, and the errors compound when databases are combined. Many databases look compatible but differ in subtle ways that render data mining invalid. For instance, both may list a date for a transaction. But if clerks enter the sale date in one database while other clerks enter the shipping date in another, combining the two introduces inaccuracies.
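
The date problem is easy to reproduce. In the sketch below, two hypothetical databases each store a "date" for the same two orders, but one holds the sale date and the other the shipping date. The order names and dates are invented for illustration.

```python
from datetime import date

# Hypothetical records: the same orders, with differently defined dates.
sales_db = {                      # clerks entered the sale date
    "order-1": date(1998, 8, 3),
    "order-2": date(1998, 8, 28),
}
shipping_db = {                   # clerks entered the shipping date
    "order-1": date(1998, 8, 5),
    "order-2": date(1998, 9, 2),
}


def monthly_counts(records):
    """Count transactions per (year, month)."""
    counts = {}
    for d in records.values():
        key = (d.year, d.month)
        counts[key] = counts.get(key, 0) + 1
    return counts


def month_disagreements(a, b):
    """List orders that the two databases place in different months."""
    return sorted(
        order for order in a.keys() & b.keys()
        if (a[order].year, a[order].month) != (b[order].year, b[order].month)
    )


august_by_sale = monthly_counts(sales_db).get((1998, 8), 0)
august_by_shipping = monthly_counts(shipping_db).get((1998, 8), 0)
mismatched = month_disagreements(sales_db, shipping_db)
```

Here one database reports two August transactions and the other reports one, although both describe the same two orders. A mining program that treats the two date columns as interchangeable would inherit that discrepancy without any warning.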

The fact is that data mining itself is in the diaper stage. Trade journals like ComputerWorld warn regularly against trusting the results of current techniques, even while reasserting that data mining will soon become a critical part of business planning. It looks like data mining is one toy that managers will not relinquish until it blows up underneath them.


Editor, O’Reilly Media