Can an attacker tell if your data was used? Understanding Membership Inference Attacks

When an organisation releases synthetic data, a natural question arises: can someone tell whether a specific individual was part of the original dataset used to create it? That question is exactly what a Membership Inference Attack (MIA) tries to answer. It is one of the studied privacy threats for machine-learning models and synthetic data generators, and a useful lens to understand what "privacy" really means in this context.

What is a Membership Inference Attack?

Consider a hospital that publishes a synthetic dataset derived from a clinical study on a rare disease. An attacker has access to one specific patient's medical record and asks a simple question:

> Was this person part of the original study?

If the answer can be guessed reliably from the synthetic data alone, the dataset leaks information, even if no individual synthetic record matches a real one.

This kind of leakage matters for two reasons:

- Participation can be sensitive on its own. Confirming that someone contributed to a dataset about a stigmatised condition, a specific treatment, or a particular behaviour already reveals something about them, independently of the actual attribute values.
- It is a proxy for memorisation. If a generator retains subtle traces of its training data, those traces are exactly what an attacker exploits. A method resistant to MIA is a method that has not memorised its inputs.

MIA is also directly tied to the singling-out re-identification risk defined by the EDPB in its GDPR opinion on anonymisation. A successful MIA is a concrete, measurable instance of singling out.

Why generative models can leak membership

Generative models (whether GANs, diffusion models, or nearest-neighbour-based methods like avatar) learn patterns from a training dataset and produce new records that resemble it. The risk is that the generated samples resemble the training records slightly more closely than they resemble unseen records from the same population.

That asymmetry is the membership signal. Intuitively:

- For a training individual, the generator has "seen" their record and may produce synthetic points that sit close to it.
- For a non-training individual from the same population, the generator has never seen them, so synthetic points are only as close as the overall distribution allows.

If this gap is detectable, membership can be inferred. If it is not detectable, the generator behaves the same whether a given person was in the training set or not, which is the informal property that formal definitions like differential privacy try to capture.
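
For reference, differential privacy makes this informal property precise: a randomised generator $G$ is $\varepsilon$-differentially private if, for any two training sets $D$ and $D'$ that differ in a single individual's record and any set of outputs $S$, $\Pr[G(D) \in S] \le e^{\varepsilon}\,\Pr[G(D') \in S]$. No single record can shift the output distribution by more than a factor $e^{\varepsilon}$, which in turn bounds the advantage of any membership inference attack.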

How MIA is generally measured

Different MIA techniques exist in the literature, but most follow the same three-step recipe:

1. Split the real data into a *member* set (used to train the generator) and a *non-member* set (held out).
2. Generate the synthetic dataset from the member set only.
3. Score each real record, both members and non-members, with some function of the synthetic data, then test whether the two groups of scores can be distinguished.
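
A minimal Python sketch of this recipe, purely for illustration: the names `mia_auc`, `generate_fn`, and `score_fn` are hypothetical placeholders rather than a real library API, since the generator and the scoring function are exactly the parts that vary across methods.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mia_auc(real_data, generate_fn, score_fn, seed=0):
    """Generic three-step MIA evaluation (sketch).

    generate_fn(members)        -> synthetic records, trained on the member set only
    score_fn(record, synthetic) -> membership score for one real record (higher = "more member-like")
    real_data is assumed to be a NumPy array with one row per individual.
    """
    rng = np.random.default_rng(seed)

    # Step 1: split the real data into members (used for training) and held-out non-members.
    idx = rng.permutation(len(real_data))
    half = len(real_data) // 2
    members, non_members = real_data[idx[:half]], real_data[idx[half:]]

    # Step 2: generate the synthetic dataset from the member set only.
    synthetic = generate_fn(members)

    # Step 3: score members and non-members, then measure how separable the two groups are.
    scores = [score_fn(r, synthetic) for r in members] + [score_fn(r, synthetic) for r in non_members]
    labels = [1] * len(members) + [0] * len(non_members)
    return roc_auc_score(labels, scores)  # 0.5 = chance level, 1.0 = perfect membership inference
```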

The score function is what varies across methods:

- Shadow-model attacks train many auxiliary generators on known member/non-member splits, then train a classifier to predict membership from the generator's outputs ([Shokri et al., 2017]; [Hayes et al., 2019 — LOGAN]). Powerful but expensive and knowledge-hungry.
- Likelihood-based attacks estimate a density from the synthetic data and flag records with unusually high likelihood as likely members ([DOMIAS, Van Breugel et al., 2023]; [GAN-Leaks, Chen et al., 2020]).
- Distance-based attacks use a geometric score — typically the distance to the nearest synthetic record.

Whatever the score, the final question is always the same: are member scores systematically different from non-member scores? The degree of separation, usually summarised as an AUC, is the measure of leakage.

What the metric actually tells you

Regardless of which MIA variant is used, the output is best understood as a single spectrum:

- No separation between members and non-members → the generator behaves the same whether someone was in the training set or not → no detectable membership leakage.
- Strong separation → the generator's output carries a clear signature of its training data → high membership leakage.
- Anything in between → a residual, quantifiable signal whose practical impact depends on context.
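
To put numbers on that spectrum: an AUC of 0.5 means the attacker ranks a random member above a random non-member no better than a coin flip, an AUC around 0.55 corresponds to a weak residual signal, and an AUC approaching 1.0 means the training set is almost perfectly identifiable from the synthetic release.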


How we compute it

Our implementation follows the distance-based family. For each real record, we compute its distance to the nearest synthetic point in a shared latent space (FAMD), then normalise it by the distance to the nearest real record in the opposite subset to correct for local density effects. The separation between the member and non-member distance distributions is summarised with a Mann–Whitney $U$ statistic, expressed as an AUC and converted into a protection rate. Details are in the [metric documentation].
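
For readers who prefer code, here is a simplified sketch of that distance-based computation, not the actual implementation: it assumes the member, non-member, and synthetic records have already been projected into a shared latent space, and it stops at the AUC, leaving the conversion into a protection rate to the metric documentation.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import mannwhitneyu

def distance_mia_auc(members, non_members, synthetic):
    """Distance-based membership signal (illustrative sketch).

    All three inputs are NumPy arrays of coordinates assumed to already live in
    the same latent space (e.g. FAMD components); the projection itself and the
    final AUC-to-protection-rate conversion are not reproduced here.
    """
    syn_tree = cKDTree(synthetic)
    member_tree = cKDTree(members)
    non_member_tree = cKDTree(non_members)

    eps = 1e-12  # avoids division by zero when a record coincides with its nearest neighbour

    # Distance to the nearest synthetic point, normalised by the distance to the
    # nearest real record of the opposite subset to correct for local density.
    member_scores = syn_tree.query(members)[0] / (non_member_tree.query(members)[0] + eps)
    non_member_scores = syn_tree.query(non_members)[0] / (member_tree.query(non_members)[0] + eps)

    # Mann-Whitney U between the two distributions; U / (n_members * n_non_members)
    # is the AUC of an attacker who flags low normalised distances as members.
    u_stat, _ = mannwhitneyu(non_member_scores, member_scores, alternative="two-sided")
    return u_stat / (len(member_scores) * len(non_member_scores))  # 0.5 = no detectable signal
```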

What does a "good" MIA score look like?

A protection rate slightly below 100% does not automatically indicate a privacy breach. It indicates that an attacker who already has a candidate record and only the synthetic data can, on average, guess membership slightly better than chance. Whether that matters depends on:

- The sensitivity of participation itself. Being in a diabetes cohort is not the same as being in a study on a stigmatised condition.
- The attacker's realistic capabilities. Does the adversary actually have candidate records? Background knowledge? Or only the synthetic release?
- Other privacy metrics. Membership inference is one facet of privacy. Our full [privacy metrics catalogue] covers complementary risks: Hidden Rate, Local Cloaking, and Column Direct Match Protection assess record-level closeness; [Anonymeter] simulates the three GDPR re-identification risks (singling out, linkability, inference). The MIA metric alone does not tell the full story.
- The utility cost of stricter protection. Pushing privacy metrics toward near-perfect values often degrades synthetic data utility.

In practice, the acceptable level is a decision, not a threshold. It is informed by the threat model, the use case, and the full set of privacy metrics, not by MIA alone.

Strengths and limits of MIA

MIA gives an empirical, interpretable answer to a concrete question: can an attacker tell whether this record was used? It is simple to communicate, fast to compute, and comparable across generators.

It also has limitations:

- It requires a held-out subset, which is not always available in production settings.
- It measures a specific attack model. A generator robust to distance-based MIA may still be vulnerable to a very different attack, and vice versa.
- It is empirical, not formal. Unlike differential privacy, it does not provide a worst-case mathematical guarantee.

In summary

A Membership Inference Attack asks whether participation in a training set can be detected from a synthetic release. It formalises one of the most intuitive privacy questions — *"am I in there?"* — and turns it into a measurable quantity. Different techniques implement it, but they all reduce to the same idea: compare how the synthetic data behaves on people it has seen versus people it has not. The metric reports how separable these two behaviours are, and a high protection rate means the generator does not betray its training set.

Used alongside other privacy metrics and a realistic threat model, MIA is a valuable building block for assessing whether a synthetic dataset can be safely released.

Links:

- Metric documentation
- Full privacy metrics catalogue
- FastDP privacy-utility comparison
- Technical documentation
- Contact
