Jan Pablo Burgard – Optimization Online

Mixed-Integer Linear Optimization for Cardinality-Constrained Random Forests

Published: 2024/05/16, Updated: 2025/01/23

(Mixed) Integer Linear Programming, Optimization in Data Science; cardinality constraints, Mixed-Integer Linear Optimization, preprocessing, Random forests, Semi-Supervised Learning

Random forests are among the most famous algorithms for solving classification problems, in particular for large-scale data sets. Considering a set of labeled points and several decision trees, the method takes the majority vote to classify a new given point. In some scenarios, however, labels are only accessible for a proper subset of the given …
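The majority vote mentioned in the abstract above can be illustrated with a minimal sketch. It only shows plain voting over already trained trees; the cardinality-constrained, mixed-integer model of the paper is not reproduced here. The names `majority_vote` and `stumps` and the three threshold rules are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

Point = Sequence[float]
Tree = Callable[[Point], Hashable]  # a trained tree, seen as a function point -> label


def majority_vote(trees: Sequence[Tree], x: Point) -> Hashable:
    """Return the label that most trees predict for the point x."""
    votes = Counter(tree(x) for tree in trees)
    # most_common(1) yields [(label, count)]; on a tie, the label
    # that was counted first wins.
    return votes.most_common(1)[0][0]


# Three hypothetical depth-one trees (assumed for illustration only).
stumps = [
    lambda x: "A" if x[0] <= 0.5 else "B",
    lambda x: "A" if x[1] <= 0.3 else "B",
    lambda x: "A" if x[0] + x[1] <= 1.0 else "B",
]

print(majority_vote(stumps, (0.4, 0.7)))  # votes are A, B, B, so this prints B
```

Using an odd number of trees avoids ties in the two-class case.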

Mixed-Integer Linear Optimization for Semi-Supervised Optimal Classification Trees

Published: 2024/01/17

Jan Pablo Burgard

Maria Eduarda Pinheiro

Martin Schmidt

(Mixed) Integer Linear Programming, Optimization in Data Science; Mixed-Integer Linear Optimization, optimal classification trees, Semi-Supervised Learning

Decision trees are one of the most famous methods for solving classification problems, mainly because of their good interpretability properties. Moreover, due to advances in recent years in mixed-integer optimization, several models have been proposed to formulate the problem of computing optimal classification trees. The goal is, given a set of labeled points, to split …

Mixed-Integer Quadratic Optimization and Iterative Clustering Techniques for Semi-Supervised Support Vector Machines

Published: 2023/03/22, Updated: 2023/10/02

Martin Schmidt

Jan Pablo Burgard

Maria Eduarda Pinheiro

(Mixed) Integer Nonlinear Programming, Data Science Algorithms, Data Science Applications; clustering, mixed-integer quadratic optimization, Semi-Supervised Learning, support vector machines

Among the most famous algorithms for solving classification problems are support vector machines (SVMs), which find a separating hyperplane for a set of labeled data points. In some applications, however, labels are only available for a subset of points. Furthermore, this subset can be non-representative, e.g., due to self-selection in a survey. Semi-supervised SVMs tackle …

Mixed-Integer Programming Techniques for the Minimum Sum-of-Squares Clustering Problem

Published: 2022/03/09, Updated: 2022/11/29

(Mixed) Integer Nonlinear Programming, Cutting Plane Approaches, Data-Mining; Computational Techniques, global optimization, Minimum Sum-of-Squares Clustering, Mixed-integer nonlinear optimization

The minimum sum-of-squares clustering problem is an important problem in data mining and machine learning with many applications in, e.g., medicine or social sciences. However, it is known to be NP-hard in all relevant cases and to be notoriously hard to solve to global optimality in practice. In this paper, we develop …

Robustification of the k-Means Clustering Problem and Tailored Decomposition Methods: When More Conservative Means More Accurate

Published: 2020/05/17, Updated: 2022/04/27

Jan Pablo Burgard

Carina Moreira Costa

Martin Schmidt

(Mixed) Integer Nonlinear Programming, Robust Optimization; alternating direction method, gamma-robustness, k-means clustering, robust optimization, strict robustness

k-means clustering is a classic method of unsupervised learning with the aim of partitioning a given number of measurements into k clusters. In many modern applications, however, this approach suffers from unstructured measurement errors because the k-means clustering result then represents a clustering of the erroneous measurements instead of retrieving the true underlying clustering structure. …

Exact solution of the donor-limited nearest neighbor hot deck imputation problem

Published: 2019/08/20

Graphs and Matroids, Network Optimization, Statistics; b-matching, combinatorial optimization, missing data, nearest neighbor hot deck imputation, survey data

Data quality in population surveys suffers from missing responses. We use combinatorial optimization to create a complete and coherent data set. The methods are based on the widespread nearest neighbor hot deck imputation method that replaces the missing values with observed values from a close unit, the so-called donor. As a repeated use of donors …
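To make the imputation idea in the abstract above concrete, here is a rough sketch of nearest neighbor hot deck imputation with an optional cap on how often a donor may be used. It is a greedy heuristic written for illustration and not the exact combinatorial method of the paper. The function `hot_deck_impute`, the parameter `donor_limit`, the squared Euclidean distance, and the toy `survey` data are all assumptions.

```python
from collections import defaultdict


def hot_deck_impute(records, target, covariates, donor_limit=None):
    """Fill missing `target` values (None) with the value of a close donor.

    Greedy illustration: candidate (recipient, donor) pairs are processed by
    increasing distance, and a donor is skipped once it has been used
    `donor_limit` times. A recipient stays None if no donor is left for it.
    """
    donors = [i for i, r in enumerate(records) if r[target] is not None]
    recipients = [i for i, r in enumerate(records) if r[target] is None]

    def dist(i, j):
        # Squared Euclidean distance on the fully observed covariates (assumed).
        return sum((records[i][c] - records[j][c]) ** 2 for c in covariates)

    pairs = sorted((dist(i, j), i, j) for i in recipients for j in donors)
    uses = defaultdict(int)  # how often each donor has been used so far
    assigned = {}            # recipient index -> donor index
    for _, i, j in pairs:
        if i in assigned:
            continue
        if donor_limit is not None and uses[j] >= donor_limit:
            continue
        assigned[i] = j
        uses[j] += 1

    completed = [dict(r) for r in records]  # copy, so the input is not mutated
    for i, j in assigned.items():
        completed[i][target] = records[j][target]
    return completed


# Hypothetical toy survey: two respondents did not report their income.
survey = [
    {"age": 30, "income": 2500},
    {"age": 32, "income": None},
    {"age": 33, "income": None},
    {"age": 50, "income": 4000},
]
unlimited = hot_deck_impute(survey, "income", ["age"])
limited = hot_deck_impute(survey, "income", ["age"], donor_limit=1)
```

Without a limit, both missing incomes are copied from the 30-year-old donor. With `donor_limit=1`, the second recipient has to fall back to the more distant 50-year-old donor, which shows the trade-off that a donor limit introduces.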