TLDR/What to remember from this study:
- Performance: +6.3 points accuracy on your models with only 30% of data collected.
- Time-to-market: 691 days savings in collection and labelling.
- Financial ROI: 774,213€ savings generated from the first year of production.
The Data Science dilemma: Let's imagine a classic case: you need to train a model to detect fraudulent transactions. Your ideal target? 492 individuals labelled fraudsters. The problem? The reality on the ground. Each identification and validation of a fraud case takes on average 2 days.
This delay puts your project at risk. Waiting for 100% of the data delays the launch of the algorithm and leaves your company vulnerable to current fraud.
The critical question: What happens if you decide to leave early, with only 10%, 30%, or 50% of the data? → Traditionally, the Model robustness collapses. Who says reduced precision, says undetected fraud and dry financial losses.
But is there a third way?
The Avatar method
This path is data increase. The idea is simple but powerful: use the partial data already collected (for example 30%) to generate synthetic data (Avatars). These data are statistically relevant and guarantee a anonymization in accordance with the GDPR.
Let's analyze quantitatively the impact of this method on a concrete case.
1. Performance: the power of augmentation

The graph above illustrates the impact of increased data on the accuracy of the model (metric: Average Precision - statistical metric corresponding to the percentage of fraudulent transactions that are well detected as such).
We observe a systematic improvement performance when the data is increased with Avatar (green line), compared to the standard method (gray line), regardless of the starting volume. The error bars confirm that this gain is statistically significant.
The number to remember : With only 30% of real data collected, the addition of synthetic data makes it possible to jump from an accuracy of 79.9% to 86.2%.
Result: A gain of +6.3 accuracy points without waiting for the end of the collection.
In other words, out of a thousand fraudulent transactions, the model trained with augmented data will detect 63 more.
2. Temporal ROI: a major acceleration of the project

Time is money, and recruiting data requires a lot of time. This graph shows the potential time savings resulting from the use of an augmentation method.
The green curve represents the number of “recruitment” days (collection/certification) saved by stopping the collection earlier and compensating with synthetic data. Since the time to generate synthetic data is negligible (a few minutes), the gain is significant.
The observation : If you stop collecting at 30% of the goal to move on to the increase:
👉 You save 691 days of the collection phase, helping to shorten the time needed to put the algorithm into production (Time-to-Market).
3. Economic ROI: the dual economy

The saving of time is mechanically accompanied by a reduction in costs. But the economy is twofold:
- Operational costs: Less analyst/recruiter time to find and label data.
- Business performance: Since the model is better (see point 1), it detects more fraud, reducing losses.
The numerical impact (at 30% of collection) : By stopping the collection at 30% and increasing the data, the estimated total savings after 1 year of production amounts to 774,213€.
- Dont €747,618 saved on the recruitment/certification process.
- Dont €26,594 won thanks to better fraud detection.
In summary: the winning strategy
The results of the experiment are unquestionable. With only 30% real data supplemented with Avatar synthetic data, you get:
- Better performance (+6.3 points of accuracy).
- Accelerated production (691 days of collection avoided).
- Immediate financial gain (~750k € in savings the first year).
- RGPD compliance total (privacy-by-design).
Conclusion : You get a more efficient, faster, and cheaper model.
IT'S UP TO YOU TO PLAY
This use case on bank fraud is perfectly transposable to other critical sectors where data is rare or expensive to acquire:
- Health : recruitment of patients for clinical trials.
- Banking/Insurance : analysis of atypical claims.
- Industry : detection of specific failures on production lines.
- Administration : detection of fraud, undue payments or predictive analysis of savings measures.
- Defense : increasing the statistical power of existing data
🔎 Appendix: methodology and assumptions
For technical profiles wishing to reproduce or understand the calculation, here are the parameters of the study (based on the dataset): Credit card and an XGBoost model).
Key parameters:
$$
\begin{array}{|c|l|r|}
\hline
\textbf{Symbol} & \textbf{Description} & \textbf{Value} \\
\hline
r & \text{Sampling ratio (fraction of target collected)} & \mathbf{30\%} \\
\hline
PTF & \text{Percentage of fraudulent transactions} & \mathbf{0.1727\%}^\ast \\
\hline
NTA & \text{Number of annual transactions} & \mathbf{2,000,000} \\
\hline
NFA & \text{Number of annual frauds } (PTF \times NTA) & \mathbf{3,454} \\
\hline
CMTF & \text{Average cost per fraudulent transaction} & \mathbf{122.21\,\text{€}}^\ast \\
\hline
CHA & \text{Analyst hourly cost} & \mathbf{45\,\text{€/h}} \\
\hline
TLD & \text{Labeling time per data point} & \mathbf{5\,\text{min}} \\
\hline
TAFL & \text{Time to acquire labeled fraud } (TLD/PTF) & \mathbf{2,895.6\,\text{min}} \\
\hline
\end{array}
$$
* Value calculated from data
Formula for calculating savings:
$$\boxed{\text{Savings}_{\text{Total}}(r) = \text{Savings}_{\text{Recruitment}}(r) + \text{Savings}_{\text{Detection}}(r)}$$
where :
- $\text{Savings}_{\text{Recruitment}}(r) = \frac{(100 - r) \times \text{TAFL}}{60} \times \text{CHA}$ are the savings related to reduced recruitment and labeling time;
- $\text{Savings}_{\text{Detection}}(r) = \Delta\text{Precision}(r) \times \text{CMTF} \times \text{NFA}$ are the savings related to improved fraud detection.
Note : $\Delta\text{Precision}(r) = \text{Precision with Avatar augmentation} - \text{Baseline precision}$
For more information:
🔗 Documentation : docs.octopize.io
📅 Calculate your potential ROI with our experts: https://meeting.octopize.io/meetings/gabrielle-crolard/ai-diagnostic
📧 Contact us: contact@octopize.io



.png)



