FAQS

Does the method work on small data sets (a few dozen records)?

Yes, the method supports small data sets, with a few precautions:
A low number of individuals increases the risk of re-identification due to uniqueness.
At the same level of confidentiality, fewer statistical properties can be maintained than in a large data set.
The data generated reflects the confidence interval of the source data. The fewer records there are, the wider this interval is, so results may vary more from one iteration to the next.

Can this method be used to augment data?

Yes. The Avatar method is based on a random, repeatable local process used to generate several synthetic data sets. Initially designed for confidentiality, this randomness also makes it possible to augment and balance classes. The records generated explore the space of the original data without introducing out-of-range values (e.g. ages between 20 and 60 remain in this range).
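
As an illustration (not our production code), the Python sketch below shows the two properties mentioned above: with a fixed seed the generation is repeatable, and because each new record is built as a convex combination of real records, generated values cannot leave the observed range.

```python
import numpy as np

def augment_class(records: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Toy augmentation: each new record is a convex combination of two real
    records, so every generated value stays within the observed range."""
    rng = np.random.default_rng(seed)           # fixed seed -> repeatable output
    i = rng.integers(0, len(records), n_new)    # first parent of each new record
    j = rng.integers(0, len(records), n_new)    # second parent
    w = rng.uniform(0.0, 1.0, size=(n_new, 1))  # mixing weight per new record
    return w * records[i] + (1 - w) * records[j]

# Ages observed between 20 and 60: generated ages cannot leave that range.
ages = np.array([[20.0], [34.0], [41.0], [60.0]])
new_ages = augment_class(ages, n_new=10, seed=42)
assert new_ages.min() >= 20 and new_ages.max() <= 60
assert np.array_equal(new_ages, augment_class(ages, n_new=10, seed=42))  # repeatable
```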

Can certain results be preserved exactly in the anonymized data set?

Yes, the method can be configured to maintain certain exact values. This naturally impacts privacy, but that impact is quantified by the metrics in the report, helping you assess whether the trade-off is reasonable in your context.

Can the method maintain the characteristics of small groups of individuals?

Yes, the method can preserve patterns that are rare but not unique in the data. Unique individuals are automatically re-centered toward the rest of the population to reduce the risk of re-identification, which also helps to clean the data. However, if a small group shares a rare characteristic, you can preserve that group by setting the anonymization parameter k below the group size. This makes the method suitable for the analysis of rare events or sub-populations without compromising confidentiality.
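
To make the role of the parameter k concrete, here is a simplified Python sketch of local generation from k nearest neighbours. It is only a toy version of the idea, not the Avatar implementation: when k is smaller than the size of a rare group, the neighbours of each member all belong to that group, so its characteristic pattern survives in the synthetic data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_synthesis(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Toy local generation: each synthetic record is a random convex
    combination of a record's k nearest neighbours (itself excluded)."""
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbours = X[idx[:, 1:]]                        # drop the record itself
    weights = rng.dirichlet(np.ones(k), size=len(X))  # random weights summing to 1
    return np.einsum("nk,nkf->nf", weights, neighbours)

# 95 "common" individuals plus a rare group of 5 sharing a distinctive marker.
rng = np.random.default_rng(1)
common = np.c_[rng.normal(0, 1, (95, 2)), np.zeros(95)]
rare = np.c_[rng.normal(5, 1, (5, 2)), np.ones(5)]
X = np.vstack([common, rare])

synthetic = local_synthesis(X, k=4)                   # k = 4 < group size (5)
# The last 5 synthetic records are built only from other rare-group members,
# so the distinctive marker (last column == 1) is preserved.
print(synthetic[-5:, -1])
```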

Can we keep the link between the personal data and its avatar?

No, that would defeat anonymization as defined by the GDPR, which is an irreversible process.

How does the method differ from competitors?

Our anonymization metrics and report, which allow you to prove both compliance and usefulness, are unique. In addition, our computation speed, as well as the transparency and explainability of the method, are differentiating points. To learn more about the method: https://www.nature.com/articles/s41746-023-00771-5

Can data streams be anonymized?

Yes, we have already successfully completed stream anonymization projects. The challenge is to anonymize small volumes of data while maintaining maximum usefulness. To meet this challenge we have developed a batch approach.
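
Schematically, the batch approach buffers incoming records until there are enough individuals for the anonymization to keep good utility, then processes them batch by batch. In the Python sketch below, `anonymize_batch` and the threshold `MIN_BATCH` are illustrative placeholders, not part of our product's API.

```python
from collections.abc import Iterable, Iterator

MIN_BATCH = 200  # illustrative threshold: enough records for useful statistics

def stream_anonymize(records: Iterable[dict],
                     anonymize_batch) -> Iterator[dict]:
    """Buffer a stream of records and anonymize it batch by batch.
    `anonymize_batch` is a placeholder for the anonymization step."""
    buffer: list[dict] = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= MIN_BATCH:          # enough individuals: process now
            yield from anonymize_batch(buffer)
            buffer = []
    if buffer:                                # flush the remaining records
        yield from anonymize_batch(buffer)
```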

How can I trust the robustness of the method?

Confidence in our method rests on three pillars: it has been certified compliant by the CNIL, it has been published in Nature Digital Medicine, and each generation of synthetic data is accompanied by a quality report measuring confidentiality and usefulness. We also believe in transparency: part of our code is available as open source. Read the full article on the robustness and transparency of our approach.

Does the method allow free text to be anonymized?

Not directly. Free text is unstructured, and there is not yet a clear legal framework defining what constitutes anonymous text. However, it is possible to structure the text using NLP, anonymize the structured version with Avatar, and then regenerate text (if necessary) with a language model.
This ensures that personal data is not used to train a model.
We have already implemented this workflow; contact us to find out more.
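
To give an idea of what this workflow can look like, here is a minimal Python sketch. spaCy is used as an example NER library; `avatar_anonymize` and `rewrite_from_fields` are hypothetical placeholders for the anonymization step and the optional text regeneration step.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # example NER model

def text_to_structured(text: str) -> dict:
    """Step 1: turn free text into a structured record via NLP entities.
    (For simplicity, only the last entity of each label is kept.)"""
    doc = nlp(text)
    return {ent.label_: ent.text for ent in doc.ents}

def anonymize_free_text(texts: list[str], avatar_anonymize, rewrite_from_fields) -> list[str]:
    """Steps 2-3: anonymize the structured records, then optionally
    regenerate text from the anonymized fields with a language model."""
    structured = [text_to_structured(t) for t in texts]
    anonymized = avatar_anonymize(structured)        # placeholder call
    return [rewrite_from_fields(fields) for fields in anonymized]
```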

Can images be anonymized using this method?

Not at the moment. There is no clear legal definition of what makes an image anonymous. However, you can anonymize the structured data linked to pseudonymized images and then re-associate the two through probabilistic matching. This framework has been used in real cases; contact us for more details.
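
As a simplified illustration of the re-association step (not our production implementation), the sketch below matches anonymized structured records back to pseudonymized image metadata with an optimal one-to-one assignment on shared numeric attributes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_records(anonymized: np.ndarray, image_metadata: np.ndarray) -> np.ndarray:
    """Return, for each anonymized record, the index of the most plausible
    pseudonymized image record, via a one-to-one optimal assignment on
    distances between shared numeric attributes (both sets of equal size)."""
    cost = cdist(anonymized, image_metadata)   # pairwise distances
    rows, cols = linear_sum_assignment(cost)   # assignment with minimal total cost
    mapping = np.empty(len(anonymized), dtype=int)
    mapping[rows] = cols
    return mapping
```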

When is my data considered anonymous?

Data anonymity is not a binary status: it depends on the context of use. To assess it, the residual risk of re-identification must be measured according to the criteria defined by the EDPB (Opinion 05/2014). The Avatar method provides automatic metrics and recommended thresholds for strict cases such as open data. If the risk is too high, additional protections can be applied to ensure compliance.
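
These are not our exact metrics, but the Python sketch below illustrates how a residual re-identification risk can be quantified: it computes the share of original records whose nearest synthetic record is precisely the one generated from them. The 5% threshold in the comment is purely illustrative, not a legal criterion.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reidentification_rate(original: np.ndarray, synthetic: np.ndarray) -> float:
    """Share of original records whose nearest synthetic record is the one
    generated from them (rows are assumed to be aligned one-to-one).
    A high value means individuals are too easy to single out."""
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    _, idx = nn.kneighbors(original)                # closest synthetic record
    hits = idx[:, 0] == np.arange(len(original))    # does it point back to "self"?
    return float(hits.mean())

# Example decision rule (illustrative threshold only):
# if reidentification_rate(X, X_syn) > 0.05: apply additional protections.
```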

What infrastructure is needed to deploy the solution on our premises?

Deployment is fully industrialized thanks to Docker and Kubernetes. Our teams can adapt to any architecture within a few hours.

Why is the Avatar method recognized as compliant by the CNIL?

The CNIL successfully evaluated our anonymization method on the basis of our security and utility metrics, which respect the three criteria defining anonymization (singling out, linkability and inference) set out in Opinion 05/2014 of the Article 29 Working Party, the EDPB's predecessor.

Why not use generative models (GAN, LLM, etc.) to create synthetic data?

We prefer a transparent and explainable statistical approach over generative “black box” models.
This allows us to control the statistical signal, to ensure regulatory compliance (including re-identification risk assessment) and to obtain results that are faster, more robust and more adaptable. Read the full article on our choice of statistical approach.

Why doesn't the Avatar method use differential privacy?

Differential privacy remains an interesting but imperfect approach: it is difficult to configure, often degrades data quality and is computationally costly.
We preferred a method that checks confidentiality after generation, ensures full compliance with the EDPB criteria and offers stronger, measurable guarantees. Read our full article on the limits of differential privacy.

What is the difference between anonymization and pseudonymization?

Pseudonymization masks a person's identity by replacing their identifiers with a pseudonym or a code, but maintains a reversible link: it is still possible to find the person using a key. This data therefore remains personal.

Anonymization, on the contrary, permanently severs all ties between the data and the individual. The data then becomes non-personal, because it is impossible to re-identify someone, even indirectly. For more information, see this full article dedicated to the question.