Data Mining for Multiple Antibiotic Resistance

This document gives an overview of the research project "SEI: Data Mining for Multiple Antibiotic Resistance," funded by the National Science Foundation under grant number 0612170. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF. For more information or for any comments or questions, please contact the project's principal investigator, Chris Jermaine.

Multiple Antibiotic Resistance. Controlling antimicrobial or antibiotic resistance in nosocomial (hospital acquired) infections due to common micro-organisms (or "bugs") is one of the most difficult and significant problems facing epidemiologists and other health scientists today. The cost of increasing resistance is significant. As the incidence of common micro-organisms that are resistant to a great variety of antibiotics increases, it becomes possible for hospital patients to become very (even fatally) ill from simple infections that have no applicable antibiotic treatment. Approximately 2 million of the 36 million Americans hospitalized each year will acquire a nosocomial infection, resulting in more than 90,000 deaths, many of them directly related to drug-resistant bugs. Estimates of the financial burden associated with resistance among nosocomial pathogens range from 4.5 billion dollars to 30 billion dollars annually.

One potential avenue for reducing the huge human and financial cost of antimicrobial resistance in nosocomial infection arises from the association between antimicrobial usage and antimicrobial resistance. Overuse of drugs has long been recognized as a key factor in antimicrobial resistance among hospital organisms. The problem is that it is not feasible to simply stop using antimicrobials in a hospital. Instead, antimicrobials should be used responsibly so as to reduce the risk of widespread resistance. Unfortunately, no good statistical models exist that describe exactly how antimicrobial use relates to antimicrobial resistance, and so it is very difficult to institute usage guidelines that can reduce the chance of widespread resistance. The ultimate goal of this project is to use data mining and machine learning methods to determine how antimicrobial use can cause antimicrobial resistance over time. In contrast to classical statistical methods, which test whether human-postulated hypotheses or patterns are significant, data mining and machine learning methods search through the data with the goal of identifying important patterns that human beings may be unaware of.

A Two-Pronged Approach. Developing useful models for how antimicrobial resistance increases in response to usage requires two main research thrusts. On one hand, we need to develop new data mining methods that can detect the sort of patterns that will be useful in our models. For example, given a large number of infected hospital patients and the microbiology data that describes the extent to which those infections resist the applicable drug treatments, we want to be able to extract at a very fine-grained level how the resistance changes over time. On the other hand, we need to actually apply those mining methods to real-life data in an attempt to find the sort of patterns that will allow us to construct our models.

Results So Far. We have some interesting results both on the methodological side, where the goal is to develop new and relevant data mining methods, and on the application side. What follows are a few of the papers that we have published as part of the project.

Evolution of Mixture Models. Classic mixture models assume that the prevalence of the various mixture components is fixed and does not vary over time. This presents problems for applications where the goal is to learn how complex data distributions evolve. In our motivating application for this work, the goal is to track the prevalence of various subclasses of bacteria over time, in order to determine whether the more virulent classes (those with broad antimicrobial resistance) increase or decrease in prevalence. We have developed models and Bayesian learning algorithms for inferring how the prevalence of each component in a mixture model changes over time. We have applied our methods to the problem of tracking changes in the rates of antibiotic resistance in Escherichia coli and Staphylococcus aureus. The results show that our methods can derive meaningful temporal antibiotic resistance patterns. This work will be published as:

Xiuyao Song, Christopher M. Jermaine, Sanjay Ranka, John Gums: A Bayesian Mixture Model with Linear Regression Mixing Proportions. To appear (full presentation), Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008).
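The idea of mixing proportions that change with time can be illustrated with a minimal Python sketch. It is not the published model or its Bayesian learning algorithm: it assumes, based only on the paper's title, that each proportion is a linear function of time, and it uses one-dimensional Gaussian components. The function names, the component parameters, and the "susceptible"/"resistant" labels are all hypothetical.

```python
import math

def mixing_proportions(intercepts, slopes, t):
    """Return pi_k(t) = intercepts[k] + slopes[k] * t for every component k."""
    props = [a + b * t for a, b in zip(intercepts, slopes)]
    # Assumed validity conditions: proportions stay non-negative and sum to 1.
    if any(p < 0.0 for p in props) or abs(sum(props) - 1.0) > 1e-9:
        raise ValueError("proportions must be non-negative and sum to 1")
    return props

def normal_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, t, intercepts, slopes, means, stds):
    """Mixture density at value x and time t; only the proportions depend on t."""
    props = mixing_proportions(intercepts, slopes, t)
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(props, means, stds))

# Hypothetical two-component example: "susceptible" and "resistant" isolates.
intercepts = [0.8, 0.2]   # proportions at t = 0
slopes = [-0.05, 0.05]    # change per time unit; slopes sum to zero
means = [1.0, 6.0]
stds = [0.5, 0.5]

early = mixing_proportions(intercepts, slopes, 0.0)
late = mixing_proportions(intercepts, slopes, 10.0)
print("resistant share at t=0: ", early[1])
print("resistant share at t=10:", late[1])
```

In this toy setting the component shapes stay fixed while the share of the "resistant" component grows, which is the kind of temporal trend the paragraph above describes.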

Conditional Anomaly Detection. This paper describes a new method for outlier or anomaly detection that is specifically tailored to our application. In outlier detection, the goal is to find those data points that are substantially different from all others in a data set. During analysis of antimicrobial resistance, this is useful because we can find patients whose treatment/infection is quite different from the norm.

The unique aspect of our work is that we propose a method for conditional outlier detection, where certain data attributes are considered to not be directly indicative of outlier status. We call those environmental attributes. Environmental attributes (like a patient's age or his or her gender) cannot be ignored because they may affect the other data attributes -- younger patients may have different treatment regimens. But our application may dictate that finding a very old patient is not interesting in and of itself. Thus, we perform outlier detection where the "important" or result attributes (such as the particular characteristics of a patient's infection) are conditioned on the environmental attributes. If the observed result attribute values are strange or are not in keeping with the environmental attribute values, then an outlier has been observed. This work has been published as:

Xiuyao Song, Mingxi Wu, Christopher M. Jermaine, Sanjay Ranka: Conditional Anomaly Detection. IEEE Trans. Knowl. Data Eng. 19(5): 631-645 (2007).
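The distinction between environmental and result attributes can be shown with a small Python sketch. It is a deliberately simplified stand-in, not the method from the paper: it assumes a single categorical environmental attribute (an age group) and a single numeric result attribute, and it scores each record by a z-score computed within its own environmental group. The function name and the data are hypothetical.

```python
import statistics
from collections import defaultdict

def conditional_scores(records):
    """records: list of (env, result) pairs.
    Returns, for each record, the z-score of `result` within its `env` group."""
    groups = defaultdict(list)
    for env, result in records:
        groups[env].append(result)
    stats = {env: (statistics.mean(v), statistics.pstdev(v))
             for env, v in groups.items()}
    scores = []
    for env, result in records:
        mean, std = stats[env]
        scores.append(abs(result - mean) / std if std > 0 else 0.0)
    return scores

# Hypothetical data: the result attribute is typically low for young
# patients and high for old patients. The last young patient has a value
# that is ordinary overall but unusual for that age group.
young = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 9.0]
old = [8.8, 9.1, 9.3, 8.9, 9.0, 9.2, 9.1, 8.7]
records = [("young", r) for r in young] + [("old", r) for r in old]

cond = conditional_scores(records)
# Ignoring the environmental attribute is the same as putting every
# record in one group.
uncond = conditional_scores([("all", r) for _, r in records])

cond_top = max(range(len(records)), key=lambda i: cond[i])
uncond_top = max(range(len(records)), key=lambda i: uncond[i])
print("most anomalous when conditioning on age:", records[cond_top])
print("most anomalous when ignoring age:       ", records[uncond_top])
```

Conditioning on the age group flags the young patient with the high value, while the unconditional score ranks a different record first because that value is unremarkable across the whole data set.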

Multi-Dimensional Change Detection. A paper that recently appeared in KDD 2007 describes a new statistical test for distributional change in multi-dimensional data. In checking for trends in antimicrobial resistance, the ability to rigorously compare two data sets to see if they are substantially different is a key requirement. The ability to perform such a comparison allows one to group all of a hospital's infection data for one time period into one data set, to group all of the hospital's infection data for a later time period into a second data set, and to then compare the data sets (in a statistically rigorous fashion) to see if they are different. A "yes" result means that the resistance has evolved over time.

In our paper, we describe a new statistical test called the density test that builds a kernel density estimator F for the baseline data, and then checks to see if F evaluated at the new data has a high or low value. If F is relatively high, then it means that the new data set is not much different from the data that was used to construct F, and the test returns "no". If F is relatively low, then it means that the new data set is quite different from the data used to construct F, and the test returns "yes". This work was published as:

Xiuyao Song, Mingxi Wu, Christopher M. Jermaine, Sanjay Ranka: Statistical Change Detection for Multi-Dimensional Data. Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007).
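The following Python sketch illustrates the general shape of such a density-based test. It is an approximation under stated assumptions, not the published density test: it uses a Gaussian kernel with a fixed, hand-picked bandwidth, and it decides what counts as "relatively low" by bootstrapping held-out baseline points, which is a calibration chosen here for simplicity. All function names and parameters are hypothetical.

```python
import math
import random

def kde_log_density(point, sample, bandwidth):
    """Log of a Gaussian-kernel density estimate built on `sample`, at `point`."""
    d = len(point)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
    exps = [-sum((p - s) ** 2 for p, s in zip(point, row)) / (2.0 * bandwidth ** 2)
            for row in sample]
    top = max(exps)  # log-sum-exp trick to avoid underflow far from the sample
    return log_norm + top + math.log(sum(math.exp(e - top) for e in exps) / len(sample))

def density_test(baseline, new_data, bandwidth=0.5, alpha=0.05, n_boot=200, seed=0):
    """Return "yes" if new_data looks different from baseline, else "no"."""
    rng = random.Random(seed)
    shuffled = list(baseline)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    build, holdout = shuffled[:half], shuffled[half:]

    # F is built on one half of the baseline; the other half shows what
    # values of F to expect when nothing has changed.
    holdout_scores = [kde_log_density(p, build, bandwidth) for p in holdout]
    new_score = sum(kde_log_density(p, build, bandwidth) for p in new_data) / len(new_data)

    # Bootstrap the mean held-out log density for samples of the same size.
    boot = []
    for _ in range(n_boot):
        draw = [rng.choice(holdout_scores) for _ in range(len(new_data))]
        boot.append(sum(draw) / len(draw))
    boot.sort()
    threshold = boot[int(alpha * n_boot)]
    return "yes" if new_score < threshold else "no"

# Hypothetical two-dimensional data: a baseline period and a later period
# whose distribution has shifted substantially.
rng = random.Random(1)
baseline = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
shifted = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(100)]
result = density_test(baseline, shifted)
print("distribution changed?", result)
```

Averaging log densities rather than raw densities keeps a few points in dense regions from masking many points in sparse ones; this is a design choice of the sketch.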

Spatial Variation in Resistance Trends. One problem that we have looked at in depth is identifying local areas in spatial data where the trends in the data are significantly different (in a statistical sense) from the norm. Our motivating application is the following. We have a large number of hospitals, and want to find subsets of those hospitals where the trends in antimicrobial resistance are significantly different from the trends in the rest of the hospitals, since these subsets may be candidates for further examination. "Significantly different" is defined in terms of a likelihood ratio test that takes into account the temporal or time-evolving nature of the data. From a computer science point of view, the difficulty is that the number of subsets -- defined as those sets of hospitals that fall in contiguous, regularly shaped sub-regions of the US -- can be huge, even if the overall number of hospitals in the data set is only in the hundreds or thousands. Thus, it is challenging to find all of the interesting subsets in a scalable fashion. In addition to a paper that has recently been submitted, we have published two abstracts detailing the results of applying our methods to real data (abstract 1, abstract 2). We conclude that there is significant heterogeneity in antimicrobial resistance trends across the country. This is important, because it suggests that resistance is not a monolithic phenomenon, and so careful local antimicrobial usage may control local resistance.
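To make the search problem concrete, here is a brute-force Python sketch under strong simplifying assumptions. Each hospital is reduced to a grid cell and a single estimated trend (a slope), the sub-regions are axis-aligned rectangles, and the likelihood ratio compares "one common Gaussian mean" against "different means inside and outside the rectangle". The project's actual test and its scalable search are not reproduced here; the function names and data are hypothetical.

```python
import math
from itertools import combinations

def lr_statistic(inside, outside):
    """Gaussian likelihood ratio statistic (common unknown variance) for
    'inside and outside share one mean' versus 'they have different means'."""
    def rss(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)
    everything = inside + outside
    return len(everything) * math.log(rss(everything) / (rss(inside) + rss(outside)))

def best_rectangle(hospitals, grid_size):
    """hospitals: list of (col, row, slope). Scans every axis-aligned
    rectangle [x0, x1) x [y0, y1) and returns (statistic, rectangle)."""
    best = (float("-inf"), None)
    cuts = range(grid_size + 1)
    for x0, x1 in combinations(cuts, 2):
        for y0, y1 in combinations(cuts, 2):
            inside, outside = [], []
            for c, r, s in hospitals:
                (inside if x0 <= c < x1 and y0 <= r < y1 else outside).append(s)
            if len(inside) < 2 or len(outside) < 2:
                continue
            stat = lr_statistic(inside, outside)
            if stat > best[0]:
                best = (stat, (x0, x1, y0, y1))
    return best

# Hypothetical 6 x 6 grid with one hospital per cell. Trends are nearly flat
# everywhere except a 2 x 2 block where resistance rises faster.
grid_size = 6
hospitals = []
for c in range(grid_size):
    for r in range(grid_size):
        slope = 0.001 * ((c * 7 + r * 13) % 5)   # small deterministic noise
        if 1 <= c < 3 and 3 <= r < 5:
            slope += 0.05                        # the unusual local trend
        hospitals.append((c, r, slope))

stat, rect = best_rectangle(hospitals, grid_size)
print("most unusual rectangle (x0, x1, y0, y1):", rect)
```

Even this 6 x 6 grid has 21 * 21 = 441 candidate rectangles, and the count grows with the fourth power of the grid resolution, which is why exhaustive enumeration of this kind does not scale to fine-grained regions.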

Model-Based Correlation Detection. Using a mixture of random variables to model data is a tried-and-tested method common in data mining, machine learning, and statistics. By using mixture modeling it is often possible to accurately model even complex, multimodal data via very simple components. However, the classical mixture model assumes that a data point is generated by a single component in the model. This means that a database point can only embody a single correlation that encompasses all of the database attributes. Many datasets (particularly the sort of microbial resistance datasets studied in the project) can be modeled more faithfully if we drop this restriction. For example, consider the sort of patient-level antimicrobial resistance data described above. It is not necessarily as useful to discover a weak correlation in resistance level that covers all of the antimicrobials as it is to find a strong correlation that covers a particular family of advanced drugs. To find such relationships, our mixture-of-subsets (MOS) model makes two fundamental changes to the classical mixture model. First, we allow a data point to be generated by a set of components, rather than just a single component. Second, we limit the number of data attributes that each component can influence, which forces the model to be sensitive to lower-dimensional correlations. In addition to defining the MOS model, we also developed a generic EM framework to learn the MOS model from a dataset. This paper recently appeared in TKDD as:

Manas Somaiya, Christopher M. Jermaine, Sanjay Ranka: Learning correlations using the mixture-of-subsets model. ACM Transactions on Knowledge Discovery in Data 1(4): (2008)
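The two changes to the classical mixture model can be illustrated with a small generative sketch in Python. It is only a toy reading of the description above and should not be taken as the MOS model's actual definition or its EM learning procedure. The assumptions made here are: each component is independently active with some probability, each component covers only a few attributes, an attribute covered by several active components is generated by one of them chosen uniformly at random, and uncovered attributes fall back to a default distribution. All names and numbers are hypothetical.

```python
import random

def sample_point(components, default, n_attrs, rng):
    """Draw one data point.
    components: list of dicts with 'prob' (chance the component is active)
                and 'attrs' (attribute index -> (mean, std)).
    default:    list of (mean, std), one per attribute, used when no active
                component covers that attribute.
    Returns (point, number_of_active_components)."""
    # Change 1: a point may be generated by a set of components.
    active = [c for c in components if rng.random() < c["prob"]]
    point = []
    for j in range(n_attrs):
        # Change 2: each component influences only a limited set of attributes.
        owners = [c for c in active if j in c["attrs"]]
        mean, std = rng.choice(owners)["attrs"][j] if owners else default[j]
        point.append(rng.gauss(mean, std))
    return point, len(active)

# Hypothetical setup: four resistance-level attributes, one per drug.
# Component A ties together drugs 0 and 1; component B ties together
# drugs 2 and 3. A patient can exhibit A, B, both, or neither.
components = [
    {"prob": 0.3, "attrs": {0: (8.0, 0.5), 1: (8.0, 0.5)}},
    {"prob": 0.2, "attrs": {2: (7.0, 0.5), 3: (7.0, 0.5)}},
]
default = [(1.0, 0.5)] * 4

rng = random.Random(0)
sample = [sample_point(components, default, 4, rng) for _ in range(5)]
for point, n_active in sample:
    print(n_active, [round(v, 2) for v in point])
```

Because each component touches only two attributes, a strong correlation within one drug family is represented directly instead of being diluted across every attribute.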

Project Personnel. The work of the following people has been sponsored at one time or another through the NSF funding associated with this project (in alphabetical order):

Last modified: June 20th, 2008