Not all synthetic data is created equal: why we don't use generative AI

While GANs and LLMs are seductive in their power, Octopize has made a different choice: one of understanding and rigor. Our statistical method generates high-quality synthetic data that is scientifically traceable and GDPR-compliant. A deliberate choice, guided by a simple conviction: without transparency, there is no trust.

At Octopize, we have made a conscious choice: transparency and control.

While many rely on generative models such as GANs or LLMs to produce synthetic data, we have chosen an explainable statistical approach.

A fundamental choice to guarantee the quality, compliance and robustness of the data generated.

Understanding rather than guessing

Generative models work like black boxes: they produce data through complex learned representations. However appealing, this approach does not allow one to understand or explain the logic that produced the results.

When it comes to generating synthetic data, this opacity prevents controlling the statistical signal, and therefore guaranteeing the fidelity of the data produced.

Our statistical approach is based on understanding and controlling the process.

We know how each avatar record is generated and can adapt the method to each use case in order to preserve the properties that matter.

In this way, we maintain total control over the transformations applied and verify that the datasets generated are reliable, balanced and scientifically consistent.
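As a toy illustration of what such a transparent pipeline can look like (a generic sketch with made-up data, not Octopize's actual Avatar algorithm), here is a generator in which every transformation is an explicit, inspectable step, so the statistical effect of each parameter can be examined and tuned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "personal" dataset: each row is one individual (age, income).
original = np.column_stack([
    rng.normal(40, 10, 200),        # age
    rng.normal(30_000, 5_000, 200)  # income
])

def synthesize(data, k=5, noise_scale=0.5, rng=rng):
    """Replace each record with a random convex combination of its
    k nearest neighbours plus calibrated noise. Every step is
    explicit, so its statistical effect can be inspected."""
    # Standardise so distances are comparable across variables.
    mean, std = data.mean(axis=0), data.std(axis=0)
    z = (data - mean) / std
    synthetic = np.empty_like(z)
    for i, row in enumerate(z):
        # k nearest neighbours of this individual (excluding itself).
        dist = np.linalg.norm(z - row, axis=1)
        knn = np.argsort(dist)[1:k + 1]
        # Random convex weights over the neighbours, then noise.
        w = rng.dirichlet(np.ones(k))
        synthetic[i] = w @ z[knn] + rng.normal(0, noise_scale, z.shape[1])
    return synthetic * std + mean

avatars = synthesize(original)
# Inspect whether first moments are preserved by construction.
print(original.mean(axis=0).round(1), avatars.mean(axis=0).round(1))
```

Because the scheme is just neighbour averaging and zero-mean noise, one can reason directly about why column means are preserved, which is exactly the kind of argument a black-box generator does not permit.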

Compliance and traceability: a prerequisite

At the regulatory level:

  • Synthetic data remains personal data until proven otherwise.
  • Any model trained on personal data remains personal data within the meaning of the GDPR.

To be considered anonymous, data must be shown to resist re-identification by any reasonably likely means (Recital 26 of the GDPR). Technically, this re-identification risk must be measured against the three criteria identified by the EDPB (European Data Protection Board): singling out, linkability and inference. The assessment must be systematically documented.
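One common building block of such a documented assessment is a distance-to-closest-record check, sketched below on made-up data (a generic proxy related to the singling-out criterion, not Octopize's actual audit code): synthetic rows that sit almost on top of a real individual are candidates for singling out and must be investigated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data: 500 "real" individuals and a synthetic set produced
# by some generator (here just noisy copies, purely for the demo).
original = rng.normal(size=(500, 3))
synthetic = original + rng.normal(scale=0.8, size=original.shape)

def distance_to_closest_record(synthetic, original):
    """For each synthetic row, the Euclidean distance to its nearest
    original row. Near-zero distances flag records that could single
    out a real individual."""
    d = np.linalg.norm(synthetic[:, None, :] - original[None, :, :], axis=2)
    return d.min(axis=1)

dcr = distance_to_closest_record(synthetic, original)
# Share of synthetic rows suspiciously close to a real record; a full
# audit would compare this against a holdout-based baseline.
print(f"{(dcr < 0.05).mean():.1%} of rows within 0.05 of a real record")
```

A single metric like this is not anonymity proof on its own; a documented assessment has to cover all three criteria.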

Our individual-centered approach natively integrates state-of-the-art attack scenarios to assess these risks quantitatively and provide proof of regulatory compliance.

Conversely, generative models are vulnerable to specific attacks (such as Membership Inference Attacks), making risk assessment more complex and more expensive.
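To make this kind of attack concrete, here is a minimal membership inference sketch on toy data (a generic illustration, not a claim about any specific generator): a deliberately overfit 1-nearest-neighbour classifier is far more confident on the records it was trained on than on fresh ones, and that gap is exactly the signal an attacker measures.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy task: the label depends on the first feature only.
def make_data(n):
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] > 0).astype(int)
    return X, y

X_members, y_members = make_data(200)   # used for "training"
X_outside, y_outside = make_data(200)   # never seen by the model

def nn_confidence(X_train, y_train, X, y):
    """Confidence a 1-nearest-neighbour model assigns to the true
    label: 1 if the closest training point shares the label, else 0."""
    d = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    pred = y_train[d.argmin(axis=1)]
    return (pred == y).astype(float)

conf_in = nn_confidence(X_members, y_members, X_members, y_members)
conf_out = nn_confidence(X_members, y_members, X_outside, y_outside)

# Members are recognised perfectly (each point is its own nearest
# neighbour); outsiders are not. The gap is the membership signal.
print(f"members: {conf_in.mean():.2f}, outsiders: {conf_out.mean():.2f}")
```

A model that leaks membership this way cannot support a claim that the data it was trained on, or produces, is anonymous.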

Performance and robustness

Generative models require large volumes of training data and massive computational resources, and can converge to biased results.

Because of its algorithmic construction, our method is:

  • 25 times faster than generative approaches,
  • scalable, suited to both small and large datasets,
  • and robust, because it is independent of the biases of deep learning algorithms.


In summary

The Avatar method is based on a simple principle:

“Understand to master, master to trust.”

We prefer a transparent, explainable and verifiable approach to opaque generation, however powerful it may be.


To learn more: see our scientific article on the method
