Knowledge Discovery in Databases is the process of searching for hidden knowledge in the massive amounts of data that we are technically capable of generating and storing. Data, in its raw form, is simply a collection of elements, from which little knowledge can be gleaned. With the development of data discovery techniques the value of the data is significantly improved.
A variety of methods are available to assist in extracting patterns that, when interpreted, provide valuable and possibly previously unknown insight into the stored data. This information can be predictive or descriptive in nature. Data mining, the pattern extraction phase of KDD, can take many forms, and the choice depends on the desired results. KDD is a multi-step process that facilitates the conversion of data to useful information.
Our increased ability to gain information from stored data raises the ethical dilemma of how the information should be treated and safeguarded.
The desire and need for information has led to the development of systems and equipment that can generate and collect massive amounts of data. Many fields, especially those involved in decision making, are participants in the information acquisition game. Examples include finance, banking, retail sales, manufacturing, monitoring and diagnosis, health care, marketing, and scientific data acquisition. Advances in storage capacity and in digital data-gathering equipment such as scanners have made it possible to generate massive datasets, sometimes called data warehouses, that measure in terabytes. For example, NASA's Earth Observing System is expected to return data at rates of several gigabytes per hour by the end of the century.(1) Modern scanning equipment records millions of transactions from common daily activities such as supermarket or department store checkout-register sales. The explosion in the number of resources available on the World Wide Web poses another challenge: indexing and searching a continually changing and growing "database."
Our ability to wade through the data and turn it into meaningful information is hampered by the size and complexity of the stored information base. In fact, the sheer size of the data makes human analysis untenable in many instances, negating the effort spent in collecting the data. Several viable options are currently being used to assist in weeding out usable information. The information retrieval process using these various tools is referred to as Knowledge Discovery in Databases (KDD).
"The basic task of KDD is to extract knowledge (or information) from lower level data (databases)."(2) There are several formal definitions of KDD; all agree that the intent is to harvest information by recognizing patterns in raw data. Let us examine the definition proposed by Fayyad, Piatetsky-Shapiro and Smyth: "Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."(3) The goal is to distinguish, from unprocessed data, something that may not be obvious but is valuable or enlightening in its discovery. Extraction of knowledge from raw data is accomplished by applying data mining methods. KDD has a much broader scope: data mining is only one step in a multi-step process.
Steps in the KDD process are depicted in the following diagram. It is important to note that KDD is not accomplished without human interaction. The first step, the selection of a data set and subset, requires an understanding of the domain from which the data is to be extracted. For example, a database may contain customer addresses that would not be pertinent to discovering patterns in the selection of food items at a grocery store. Deleting unrelated data elements from the dataset reduces the search space during the data mining phase of KDD. If the dataset can be analyzed using a sampling of the data, the sample size and composition are determined during this stage.
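The selection step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the field names (customer_address, item, quantity), the 50% sample fraction and the random seed are all hypothetical and do not come from any particular system.

```python
import random

# Hypothetical grocery transactions; field names are illustrative assumptions.
transactions = [
    {"customer_address": "12 Elm St", "item": "milk", "quantity": 2},
    {"customer_address": "9 Oak Ave", "item": "bread", "quantity": 1},
    {"customer_address": "12 Elm St", "item": "eggs", "quantity": 12},
    {"customer_address": "4 Pine Rd", "item": "milk", "quantity": 1},
]

def select_fields(records, keep):
    """Keep only the attributes relevant to the question being asked."""
    return [{name: rec[name] for name in keep} for rec in records]

def sample_records(records, fraction, seed=0):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)

# Drop the address field, then work on half of the remaining records.
target = select_fields(transactions, keep=("item", "quantity"))
sample = sample_records(target, fraction=0.5)
```

Dropping the address field before sampling keeps the later mining step from searching over an attribute that the domain expert has already judged irrelevant.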
Databases are notoriously "noisy": they often contain inaccurate or missing data. During the preprocessing stage the data is cleaned. This involves removing "outliers" where appropriate, deciding on strategies for handling missing data fields, accounting for time sequence information, and normalizing the data where applicable.(4)
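A minimal sketch of the cleaning operations named above follows. The readings are made up, and the specific choices (median imputation, interquartile-range fences with a factor of 1.5, min-max scaling) are assumptions picked for illustration; the source does not prescribe any particular strategy.

```python
import statistics

# Hypothetical readings; None marks a missing value (an assumption).
raw = [10.0, 12.0, None, 11.0, 13.0, 250.0, 9.0, None, 12.0]

def impute_missing(values):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def remove_outliers(values, k=1.5):
    """Drop values that fall outside the interquartile-range fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

filled = impute_missing(raw)
cleaned = remove_outliers(filled)
normalized = min_max_normalize(cleaned)
```

The median is used for imputation here because, unlike the mean, it is not pulled toward an extreme reading such as 250.0 that has not yet been removed.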
The transformation phase attempts to reduce the number of data elements that are evaluated while maintaining the validity of the data. During this stage data is organized, converted from one type to another (e.g. changing nominal to numeric), and new or "derived" attributes are defined.
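Two of the transformations mentioned above, converting a nominal attribute to numeric form and defining a derived attribute, might look like the sketch below. The records, the field names and the derived spend_per_visit attribute are hypothetical, and indicator ("one-hot") encoding is only one of several possible nominal-to-numeric conversions.

```python
# Hypothetical customer records; field names are assumptions.
records = [
    {"region": "north", "total_spent": 120.0, "visits": 4},
    {"region": "south", "total_spent": 45.0, "visits": 3},
    {"region": "north", "total_spent": 300.0, "visits": 10},
]

def one_hot(rows, field):
    """Convert a nominal attribute into 0/1 indicator attributes."""
    categories = sorted({row[field] for row in rows})
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != field}
        for category in categories:
            new_row[f"{field}_{category}"] = 1 if row[field] == category else 0
        result.append(new_row)
    return result

def add_spend_per_visit(rows):
    """Define a derived attribute from two existing ones."""
    return [{**row, "spend_per_visit": row["total_spent"] / row["visits"]}
            for row in rows]

transformed = add_spend_per_visit(one_hot(records, "region"))
```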
At this point the data is subjected to one or several data mining methods such as classification, regression, or clustering. The data mining component of KDD often involves repeated iterative application of particular data mining methods. "For example, to develop an accurate, symbolic classification model that predicts whether magazine subscribers will renew their subscriptions, a circulation manager might need to first use clustering to segment the subscriber database, then apply rule induction to automatically create a classification for each desired cluster."(5) Various data mining methods will be discussed in more detail in following sections.
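The idea of chaining methods, first segmenting the data and then building a classifier per segment, can be sketched roughly as follows. This is not the procedure from the quoted example: the subscriber data is invented, the clustering is a tiny one-dimensional two-means routine, and a simple majority-vote rule stands in for rule induction.

```python
from collections import Counter

# Hypothetical subscribers: (years_subscribed, renewed). Values are made up.
subscribers = [(1, False), (2, False), (1, True), (2, False),
               (8, True), (9, True), (10, True), (9, False)]

def two_means(values, iterations=20):
    """Tiny 1-D k-means with k=2; returns cluster labels and centers."""
    centers = [float(min(values)), float(max(values))]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Update step: move each center to the mean of its members.
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

def majority_rule(outcomes):
    """Stand-in for rule induction: predict the most common outcome."""
    return Counter(outcomes).most_common(1)[0][0]

years = [y for y, _ in subscribers]
labels, centers = two_means(years)

# One rule per segment, learned only from that segment's members.
rules = {}
for c in (0, 1):
    outcomes = [renewed for (_, renewed), lab in zip(subscribers, labels)
                if lab == c]
    if outcomes:
        rules[c] = majority_rule(outcomes)
```

With this data the short-tenure segment is mostly non-renewers and the long-tenure segment mostly renewers, so the two segments end up with different rules, which is the point of segmenting first.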
The final step is the interpretation and documentation of the results from the previous steps. Actions at this stage could consist of returning to a previous step in the KDD process to further refine the acquired knowledge, or translating the knowledge into a form understandable to the user. A commonly used interpretive technique is visualization of the extracted patterns. The results should be critically reviewed and conflicts with previously believed or extracted knowledge resolved.
Understanding and committing to all phases of the KDD process, not only the data mining step, is crucial to its success.
A few of the many model functions being incorporated in KDD include:
Classification: mapping or classifying data into one of several predefined classes.(6) For example, a bank may establish two classes based on debt-to-income ratio. The classification algorithm determines into which of the two classes an applicant falls and generates a loan decision based on the result.
Regression: "a learning function which maps a data item to a real-valued prediction variable."(7) Comparing a particular instance of an electric bill to a predetermined norm for that same time period and observing deviations from that norm is an example of regression analysis.
Clustering: "maps a data item into one of several categorical classes (or clusters) in which the classes must be determined from the data, unlike classification in which the classes are predefined. Clusters are defined by finding natural groupings of data items based on similarity metrics or probability density models."(8) An example of this technique would be grouping patients based on symptoms exhibited. The clusters need not be mutually exclusive.
Summarization: generating a concise description of the data. Routine examples of these techniques include the mean and standard deviation of specific data elements within the dataset.
Dependency modeling: developing a model that shows how variables are interrelated. An example would be a model showing that electrical usage is highly correlated with the ambient temperature.
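Several of the model functions listed above can be illustrated with very small examples. The sketch below is a simplification under stated assumptions: the temperature and usage figures are invented, the 0.40 debt-to-income cutoff and the class names are hypothetical, a least-squares line stands in for regression in general, and a Pearson correlation stands in for a full dependency model.

```python
import statistics

# Hypothetical data; all values and the 0.40 cutoff are assumptions.
temperature = [20.0, 25.0, 30.0, 35.0]   # ambient temperature
usage = [200.0, 260.0, 310.0, 370.0]     # electricity usage

def classify_applicant(debt, income, cutoff=0.40):
    """Classification: assign one of two predefined classes."""
    return "high_risk" if debt / income > cutoff else "low_risk"

def summarize(values):
    """Summarization: mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def pearson(xs, ys):
    """Dependency modeling: linear correlation between two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = a*x + b."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

decision = classify_applicant(debt=5000, income=10000)
mean_usage, sd_usage = summarize(usage)
correlation = pearson(temperature, usage)
slope, intercept = fit_line(temperature, usage)
predicted_usage_at_40 = slope * 40.0 + intercept
```

Note the difference between the last two functions: the correlation only reports how strongly usage and temperature move together, while the fitted line yields a real-valued prediction for a temperature that was never observed.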
There are no established guidelines to assist in choosing the correct algorithm to apply to a dataset. Typically, the more complex models may fit the data better but may also be more difficult to understand and to fit reliably.(9) Successful applications often use simpler models because they are easier to interpret. Each technique tends to lend itself to a particular type of problem. Understanding the domain will assist in determining what kind of information is needed from the discovery process, thereby narrowing the field of choices. Results can be broken into two general categories: prediction and description. Prediction, as the name implies, attempts to forecast the possible future values of data elements. Prediction is being applied extensively in finance in an attempt to forecast movement in the stock market. Description seeks to discover interpretable patterns in the data. Fraud detection is an application that uses description to identify characteristics of potentially fraudulent transactions.
Clustering, summarization and dependency modeling are primarily descriptive models, while classification and regression are primarily predictive.
Several knowledge discovery applications have been successfully implemented. One is "SKICAT, a system which automatically detects and classifies sky objects image data resulting from a major astronomical sky survey. SKICAT can outperform astronomers in accurately classifying faint sky objects."(10) KDD is being used to flag suspicious activities on two fronts: Falcon alerts banks to possibly fraudulent credit card transactions, and the FAIS system employed by the Financial Crimes Enforcement Network detects financial transactions that may indicate money laundering.(11) Market Basket Analysis (MBA) has incorporated discovery-driven data mining techniques to gain insights about customer behavior. Other applications are being used in molecular biology, global climate change modeling, and other fields where the volume of data exceeds our ability to decipher its meaning.
Although the problem is not unique to Knowledge Discovery, sensitive information is being collected and stored in these huge data warehouses. Concerns have been raised about what information should be protected from KDD-type access. The ethical and moral issues of invasion of privacy are intrinsically connected to pattern recognition. Safeguards are being discussed to prevent misuse of the technology.
Knowledge Discovery in Databases is answering a need to make use of the mountains of data that are accumulating daily. KDD enlists the power of computers to assist in recognizing patterns in data, a task that exceeds human ability as the size of data warehouses increases. New methods of analysis and pattern extraction are being developed and adapted to KDD. Which method is used depends on the domain and the results expected. The accuracy of the recorded data must not be overlooked during the KDD process. Domain-specific knowledge assists with the subjective analysis of KDD results. Much attention has been given to the data mining phase of KDD, but earlier steps, such as data cleaning, play a significant role in the validity of the results.
The potential benefits of discovery-driven data mining techniques in extracting valuable information from large, complex databases are considerable. Successful applications are surfacing in industries and areas where data collection is outpacing our ability to effectively analyze its content. Users must be aware of the potential moral conflicts in using sensitive information.
(1) Way, J.; and Smith, E.A. "The Evolution of Synthetic Aperture Radar Systems and Their Progression to the EOS SAR." IEEE Trans. Geoscience and Remote Sensing. Vol. 29, No. 6. 1991. pp. 962-985.
(2) Fayyad, U.; Simoudis, E.; "Knowledge Discovery and Data Mining Tutorial MA1" from Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95) July 27, 1995 www-aig.jpl.nasa.gov/public/kdd95/tutorials/IJCAI95-tutorial.html
(3) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. "From Data Mining to Knowledge Discovery: An Overview" in Advances in Knowledge Discovery and Data Mining. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. MIT Press. Cambridge, Mass. 1996 pp. 1-36
(4) Fayyad, U. "Data Mining and Knowledge Discovery: Making Sense Out of Data" in IEEE Expert October 1996 pp. 20-25
(5) Simoudis, E. "Reality Check for Data Mining" in IEEE Expert October 1996 pp. 26-33
(6) Hand, D. J. 1981 Discrimination and Classification. Chichester, U.K.: John Wiley and Sons
(7) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. "From Data Mining to Knowledge Discovery: An Overview" in Advances in Knowledge Discovery and Data Mining. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. MIT Press. Cambridge, Mass. 1996 pp. 1-36
(8) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. "The KDD Process for Extracting Useful Knowledge from Volumes of Data" in Communications of the ACM, November 1996, Vol. 39, No. 11 pp. 27-34
(9) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. "From Data Mining to Knowledge Discovery: An Overview" in Advances in Knowledge Discovery and Data Mining. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. MIT Press. Cambridge, Mass. 1996 pp. 1-36
(10) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. "From Data Mining to Knowledge Discovery: An Overview" in Advances in Knowledge Discovery and Data Mining. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. MIT Press. Cambridge, Mass. 1996 pp. 1-36
(11) Simoudis, E. "Reality Check for Data Mining" in IEEE Expert October 1996 pp. 26-33