Numalis

Studying the robustness of models trained on synthetic data

Challenges

  • Prove the robustness and reliability of models: Measure whether training on avatar (synthetic) data yields the same classification accuracy as training on real data.
  • Handle complex time-series data: Demonstrate the effectiveness of the method on sensitive time-ordered data, in particular automated heartbeat (ECG) analysis. (https://www.kaggle.com/datasets/shayanfazeli/heartbeat/data)

Maintaining statistical quality & utility

The unstable areas (yellow zones) are the same for the original and avatar models.

The correlations of the original model are maintained in the avatar model.

Results

  • Temporal correlations are maintained: The morphological properties of the ECG signal are statistically preserved after avatarization.
  • The model trained on avatar data is as robust as the one trained on original data.
  • GDPR compliance:
    • Both models have the same robustness dynamics. If one is compliant, the other will also be considered compliant.
    • The use of Avatar allows the sharing of ECG signals between research centers without legal constraints.
“In our collaboration with Octopize, we studied the robustness of models trained on avatar data. The analyses carried out with Saimple showed that avatar data does not change the behavior of the model or its performance. The metrics obtained remained comparable to those of the original data. These results confirm that avatar data is a reliable alternative for training models while respecting confidentiality constraints.” - Noëmie Rodriguez, Data Scientist & Project Manager @Numalis