Data Mining Taps the Trends

gcahlink@govexec.com

From predicting how many Marines will leave the Corps to rooting out fraudulent health-care bills for the Health Care Financing Administration, data mining is becoming a popular technological tool for agency managers trying to make sense and better use of mounds of government data.

Data mining is accomplished with commercial-off-the-shelf software applications that use sophisticated statistical analysis and advanced modeling to turn volumes of data into usable information. By uncovering subtle trends or patterns in data, the application can enable users to draw conclusions or make predictions. "Data mining is getting the 'aha' out of data," says Adam Crafton, a data-mining manager for IBM's public-sector operations in Bethesda, Md.

Commercial retailers first developed data-mining software nearly 30 years ago to track consumer-buying habits. By combining those early models with advances in artificial intelligence over the past decade, data-mining firms have moved beyond helping retailers to assisting many businesses. Data mining's uses now range from predicting production breakdowns for large manufacturers to helping financial institutions uncover patterns of money laundering.

Data mining first made inroads in federal agencies as a method for sorting through reams of records to find fraud, waste and abuse. However, as agencies have increased their focus on results-oriented management, data mining is now proving useful in measuring performance and finding ways to improve it. "The need for data mining is growing with the need for agencies to do business better," says Fiona McKenna, public-sector marketing manager for SPSS, a Chicago data-mining vendor. With agencies increasingly doing business online, data mining will play a role in helping agencies monitor which portions of their Web sites customers most frequently visit and spot any patterns in their online activities, McKenna says.

Many agencies already are using data mining to make better management decisions and improve services. The Justice Department has used it to find crime patterns so it can focus its money and resources on the most pressing issues. The Veterans Affairs Department has used it to predict demographic changes among its 3.6 million patients so it can prepare more accurate budgets. The Internal Revenue Service uses the technology at its customer service center to track calls in order to pinpoint the most common customer needs. And the Federal Aviation Administration uses it to scour plane crash data to find common causes so future failures can be avoided.

Making Predictions
"Data mining is more than slicing and dicing data; it's doing predictive work as well," says Peter Caron, a data-mining product manager for SPSS. The Marine Corps is using data mining to predict which types of officers and enlisted members will stay in the Corps and which will bail out. "We want to use historical data to predict our loss rates in specific areas," says Maj. Joe Van Steenbergen, a manpower manager for the Marine Corps.

By mining a data warehouse containing career and biographical information for every officer and enlisted member in the past decade, the Marine Corps expects to find answers to questions such as, "Are married or single Marines more or less likely to leave the service?" and, "Are members in high-skill career fields like information technology more likely to leave than members with jobs that require less training?"

Van Steenbergen says that by answering those questions and many others, the Marine Corps can create a profile of those likely to stay in various positions and use it to make better management decisions when recruiting, assigning and promoting personnel.

Improving Service
The Centers for Disease Control and Prevention's National Immunization Program in Atlanta is installing data-mining software that allows better tracking of reactions to vaccines. The program has a huge database of adverse reactions to vaccines reported by physicians, clinics and hospitals, patients, and pharmaceutical companies across the nation. Federal researchers and statisticians monitor the data regularly to find problems caused by a single vaccine or vaccine combinations.

"The difficulty is you have to know what to watch for. We are installing data-mining software because we want a system that will monitor incoming reports on adverse reactions and alert us to associations that have not been discovered," says David Walker, a public health analyst for the program. For example, data-mining software may process the incoming reports and reveal a cluster of children who became sick after taking a measles vaccine shortly after receiving one for hepatitis. Once the associations are found, researchers can study the problem and decide whether to recommend pulling the vaccines off the market or changing immunization schedules.

Fighting Fraud
Ferreting out fraud is still the most common use of data mining among federal agencies, particularly fraud in health care, says Kristin Nauta, director of the public-sector technology center for SAS, a data-mining company in Cary, N.C. By searching through medical claims, agencies such as HCFA can compare costs for the same medical services and find health-care providers who are overcharging. In other cases, data mining has allowed HCFA to compare treatments for various medical conditions and determine whether patients are receiving inadequate or excessive care.
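
The cost comparison described here can be illustrated with a minimal sketch. It is not HCFA's method or any vendor's product. The claim records, the provider and service names, and the rule of flagging bills above twice the median for a service are all hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical claims: (provider, service, billed amount in dollars).
claims = [
    ("Provider A", "office-visit", 80.0),
    ("Provider B", "office-visit", 85.0),
    ("Provider C", "office-visit", 240.0),
    ("Provider A", "x-ray", 120.0),
    ("Provider B", "x-ray", 110.0),
    ("Provider C", "x-ray", 125.0),
]


def flag_overcharges(claims, ratio=2.0):
    """Return claims billed at more than `ratio` times the median for that service.

    The ratio is an assumed threshold, not a figure from the article.
    """
    # Group billed amounts by service so like is compared with like.
    amounts_by_service = defaultdict(list)
    for _provider, service, amount in claims:
        amounts_by_service[service].append(amount)

    # The median is less sensitive to a single inflated bill than the mean.
    medians = {s: median(a) for s, a in amounts_by_service.items()}

    return [c for c in claims if c[2] > ratio * medians[c[1]]]


flagged = flag_overcharges(claims)
for provider, service, amount in flagged:
    print(f"{provider} billed ${amount:.2f} for {service}")
```

With this sample data, only Provider C's office-visit bill exceeds the threshold. A flagged claim is a lead for an investigator to review, not proof of fraud.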

Betty Jackson, director of the enterprise database group at HCFA, says data mining has helped the agency recover millions of dollars in fraud cases. HCFA is building a data warehouse for hundreds of thousands of Medicare records, which the agency can then mine for fraud.

Preparation Key
Like HCFA, many agencies are still moving their electronic records off bulky mainframe computers, on which data searches can take weeks, to data warehouses built on network servers that can complete searches in minutes. "Data mining has become popular within the government in the last few years, but it has been lagging a bit because there has to be a concerted effort to get data organized so that it can be processed," says Nauta.

Agency information need not be stored in data warehouses that can cost millions to build (a commercial data-mining application costs about 1 percent of the price of a typical warehouse), but it must be sorted or summarized to make data mining effective. Vendors say 70 percent to 80 percent of the time invested in data mining goes toward preparing the information before it is mined.

An agency manager seeking to pinpoint unauthorized transactions on government-issued purchase cards, for example, would gain little by running a data-mining application through a database of all transactions. However, if the data were presorted by transactions made on weekends, when business travel is less common, then a data-mining application could sort through and pinpoint patterns of unusually high spending or unauthorized purchases.
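
A minimal sketch of that presorting step follows. The card numbers, dates, amounts and the $500 weekend limit are hypothetical. A commercial data-mining application would look for patterns far more subtle than a fixed dollar limit.

```python
from collections import defaultdict
from datetime import date

# Hypothetical purchase-card transactions: (card, date, amount in dollars).
transactions = [
    ("card-001", date(2000, 6, 3), 45.00),    # Saturday
    ("card-001", date(2000, 6, 5), 300.00),   # Monday
    ("card-002", date(2000, 6, 4), 1250.00),  # Sunday
    ("card-002", date(2000, 6, 10), 980.00),  # Saturday
    ("card-003", date(2000, 6, 6), 75.00),    # Tuesday
]

# Step 1: presort. Keep only weekend transactions.
# date.weekday() returns 5 for Saturday and 6 for Sunday.
weekend = [t for t in transactions if t[1].weekday() >= 5]

# Step 2: total weekend spending per card.
weekend_totals = defaultdict(float)
for card, _day, amount in weekend:
    weekend_totals[card] += amount

# Step 3: flag cards whose weekend total exceeds an assumed limit.
WEEKEND_LIMIT = 500.00  # hypothetical threshold for illustration
flagged = {card: total for card, total in weekend_totals.items()
           if total > WEEKEND_LIMIT}

for card, total in flagged.items():
    print(f"{card}: ${total:.2f} in weekend purchases")
```

Narrowing the data first means the later analysis runs on three records instead of five here. On a real transaction database the reduction would be far larger.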

"Data mining is growing faster than databases," says Henry Morris, an industry analyst for International Data Corp., a vendor market research company in Framingham, Mass. Worldwide sales of data-mining applications were $343 million in 1999 and are expected to reach $1.4 billion in 2004, an annual growth rate of 32 percent, Morris says.
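
The quoted growth rate can be checked with the standard compound annual growth rate formula. The short sketch below assumes the rate is compounded over the five years from 1999 to 2004. The variable names are illustrative only.

```python
# Figures from the article, in millions of dollars.
sales_1999 = 343.0
sales_2004 = 1400.0
years = 2004 - 1999

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (sales_2004 / sales_1999) ** (1 / years) - 1

print(f"Annual growth rate: {cagr:.1%}")
```

The result rounds to 32 percent, which matches the figure Morris cites.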

"The use of data mining will continue to grow because we are collecting more and more data," says Alexander Linden, a senior analyst with the international market research company Gartner Group.

Linden says the top data-mining products available are Enterprise Miner from SAS, Clementine from SPSS and IBM's Intelligent Miner. The SAS product offers more flexibility and functionality than the IBM and SPSS products, he says. However, the application is difficult to use, requiring mathematicians to operate it, he adds.

Linden calls the IBM and SPSS products good tools. Clementine is flexible and easy to use, he says, but does not have the ability to sort through the tens of millions of records kept by larger federal agencies. Intelligent Miner has the capability to sort through large volumes of data and offers good graphics but lacks the ease of use and flexibility of the other products.
