COMBINE

Customization of classification ensembles based on data characteristics

This project, a key initiative within the Software Campus program, receives its funding from the German Federal Ministry of Education and Research (BMBF). It is a collaborative effort that brings together the industry expertise of Software AG with the academic excellence of the University of Stuttgart, aiming to foster innovation and leadership in the field of technology.

For companies today, data analysis is the basis for making reliable decisions in their business processes. Classification models, a machine learning (ML) method, are often used for this purpose. These models make predictions about future events based on existing data. However, the more complex the data, the harder it becomes for the models to predict accurately. For example, when only a small amount of data is available, classification models often achieve low prediction accuracy and thus make many incorrect predictions. Other data characteristics, such as outliers or noise in the data, exacerbate the problem.

To still achieve high prediction accuracy on such data, several different classification models can be combined, for example models trained on different subsets of the data. Such a combination of multiple classification models is referred to as a classification ensemble. The models are combined using a decision fusion method, which merges the predictions of the individual models into a single prediction. For an ensemble to achieve the highest possible prediction accuracy, several requirements must be met: 1) the prediction accuracy of the individual models must already be as high as possible, 2) the models must be diverse, that is, they must complement each other by making accurate predictions on different data, and 3) the fusion method must be optimized for this set of models, so that, as far as possible, only the correct predictions of the individual models are adopted as the merged prediction.
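The interplay of the three requirements can be illustrated with a minimal sketch. It assumes majority voting as the decision fusion method and mean pairwise disagreement as the diversity measure; both are common choices picked here for illustration, not necessarily the ones used in the project. The predictions of the three models are made-up toy values.

```python
from collections import Counter
from itertools import combinations


def majority_vote(sample_predictions):
    """Fuse the labels predicted by all models for one sample into one label."""
    return Counter(sample_predictions).most_common(1)[0][0]


def fuse(model_predictions):
    """Apply majority voting sample by sample.

    model_predictions: one list of predicted labels per model, all equally long.
    """
    return [majority_vote(preds) for preds in zip(*model_predictions)]


def accuracy(predicted, truth):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def mean_disagreement(model_predictions):
    """Average fraction of samples on which two models predict different labels."""
    pairs = list(combinations(model_predictions, 2))
    rates = [sum(a != b for a, b in zip(m1, m2)) / len(m1) for m1, m2 in pairs]
    return sum(rates) / len(rates)


# Toy data (hypothetical): true labels and the predictions of three models
truth = [1, 0, 1, 1, 0, 1]
model_a = [1, 0, 1, 0, 0, 0]
model_b = [1, 1, 1, 1, 0, 1]
model_c = [0, 0, 0, 1, 0, 1]
models = [model_a, model_b, model_c]

fused = fuse(models)
individual_accuracies = [accuracy(m, truth) for m in models]
ensemble_accuracy = accuracy(fused, truth)
diversity = mean_disagreement(models)
```

In this toy case each model errs on different samples, so the vote outperforms every individual model. If all three models made the same mistakes, the disagreement would be zero and fusion would bring no gain.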

However, for a specific dataset, it is unclear which classification models and which fusion method will meet these requirements. Existing solutions, such as Random Forest, build ensembles through randomization, so the diversity and accuracy of the resulting models are left to chance. Because these approaches do not consider the characteristics of the data, the achieved prediction accuracy also depends on chance. The alternative, in which an expert manually creates an ensemble optimized for the data, is equally impractical because it is time-consuming: it requires, among other things, testing various classification and decision fusion algorithms as well as their hyperparameters in order to select models for the ensemble.

Complex data characteristics frequently occur in practice, especially in industrial applications and healthcare, leading to challenges in training ML models. To address this problem, ensembles are often used instead of classical ML models. However, there is currently no approach for the targeted creation of optimized ensembles based on data characteristics. This project aims to investigate which data characteristics influence the creation of a classification ensemble and thus its prediction accuracy and diversity. Subsequently, an approach will be developed that makes it possible to create optimized classification ensembles based on data characteristics.
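To make the notion of data characteristics more concrete, the following sketch computes a few simple ones for a small tabular dataset. The selection (sample count, class imbalance ratio, fraction of rows with outliers according to the 1.5 × IQR rule) is an assumption for illustration only; the project description does not state which characteristics are examined, and the function name and toy values are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def data_characteristics(rows, labels):
    """Compute a few simple characteristics of a numeric, labeled dataset.

    rows:   list of samples, each a list of numeric feature values
    labels: list of class labels, one per sample
    """
    class_counts = Counter(labels)

    # Per-feature outlier bounds using the 1.5 * IQR rule (an assumed heuristic)
    bounds = []
    for column in zip(*rows):
        q1, _, q3 = quantiles(column, n=4)
        iqr = q3 - q1
        bounds.append((q1 - 1.5 * iqr, q3 + 1.5 * iqr))

    # A row counts as an outlier if any of its values leaves the bounds
    outlier_rows = sum(
        any(not (low <= value <= high) for value, (low, high) in zip(row, bounds))
        for row in rows
    )

    return {
        "n_samples": len(rows),
        "n_features": len(rows[0]),
        "imbalance_ratio": max(class_counts.values()) / min(class_counts.values()),
        "outlier_fraction": outlier_rows / len(rows),
    }


# Toy dataset (hypothetical): the last row holds an extreme first feature
rows = [[1, 10], [2, 11], [3, 12], [4, 13], [5, 14], [6, 15], [7, 16], [100, 17]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

characteristics = data_characteristics(rows, labels)
```

Values like these could serve as input when deciding which classification models and which fusion method to consider for a given dataset. How such a mapping should look is exactly the open question the project addresses.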