An Optimization-Based Algorithm for Fair and Calibrated Synthetic Data Generation

Agent-based microsimulations, such as those used for epidemiological modeling during the COVID-19 pandemic, depend on a realistic base population. Beyond demographic variables, this population should also include health-related variables. In Germany, health-related surveys are typically small in scale, which poses several challenges when generating these variables. In particular, strongly imbalanced classes and too few observations within sensitive groups call for advanced synthetic data generation methods. To address these challenges, we present a method formulated as a mixed-integer linear optimization model that generates health variables from class probabilities. The model incorporates fairness by imposing the class distribution across sensitive groups as constraints. Furthermore, we prove that the proposed model possesses unimodularity properties and present a preprocessing technique. Together, these allow us to generate data for large populations, such as Germany’s population of over 80 million. Our numerical tests, using one of the largest German health surveys (GEDA), demonstrate that our approach yields better classification results than a standard random forest when different ages are treated as sensitive groups.
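To make the idea of assigning classes from probabilities under group-wise distribution constraints concrete, the following is a minimal sketch and not the model from the paper. It assumes a single binary health variable and a fixed, known number of positive cases per sensitive group; the names `assign_labels`, `probs`, `groups`, and `target_count` are hypothetical. Under these assumptions, maximizing the total probability of the assigned labels subject to the group counts reduces to picking the highest-probability individuals within each group, so no solver is needed.

```python
from collections import defaultdict

def assign_labels(probs, groups, target_count):
    """Toy binary case (an illustrative assumption, not the paper's model).

    probs[i]        -- estimated probability that person i is positive
    groups[i]       -- sensitive group of person i
    target_count[g] -- required number of positives in group g (assumed known)

    Maximizes sum_i [p_i * x_i + (1 - p_i) * (1 - x_i)] subject to
    sum_{i in g} x_i == target_count[g]; with fixed counts this is solved
    exactly by taking the top-k probabilities inside each group.
    """
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)

    labels = [0] * len(probs)
    for g, idx in members.items():
        k = target_count[g]
        # Rank the group's members by descending probability, keep the top k.
        for i in sorted(idx, key=lambda i: -probs[i])[:k]:
            labels[i] = 1
    return labels

# Hypothetical data: two age groups with three people each.
probs = [0.9, 0.2, 0.6, 0.1, 0.4, 0.3]
groups = ["young", "young", "young", "old", "old", "old"]
target_count = {"young": 1, "old": 2}

labels = assign_labels(probs, groups, target_count)
print(labels)  # [1, 0, 0, 0, 1, 1]
```

In this toy data, a plain 0.5 threshold on `probs` would label persons 0 and 2 as positive and nobody in the "old" group, whereas the constrained assignment reproduces the prescribed per-group counts. The multi-class case with distribution constraints, as described in the abstract, would generally require an actual optimization model rather than sorting.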
