SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

Petros Andreou¹, Jamie Lanyon¹, Axel Finke¹,², Georgina Cosma¹
¹ Department of Computer Science, Loughborough University, UK
² School of Mathematics, Statistics and Physics, Newcastle University, UK

SUPREME's seeded multi-stage pipeline (Fig. 1 from the paper). All three stages (training, unlearning, evaluation) execute across P devices. Training and unlearning use gradient synchronisation; evaluation uses result aggregation.

Abstract

Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an unlearning method requires repeating training, unlearning, and evaluation across multiple seeds, which is computationally expensive. To our knowledge, existing image classification unlearning frameworks run on a single GPU, which limits how many seeds can be evaluated in reasonable time.

We introduce SUPREME (Standardised Unlearning Platform for REproducible Method Evaluation), an open-source framework that distributes these stages across multiple GPUs. SUPREME makes three contributions: a registry-based design for adding new methods, metrics, models, and scenarios; a multi-GPU architecture supporting multiple accelerators and precision modes; and a demonstration on Pins Face Recognition using ResNet18 and ViT under full-class and random-sample unlearning across ten seeds. The framework is available at github.com/pedroandreou/supreme-unlearning.

Key Contributions

  Extensible framework

Registry-based design covers datasets, model architectures, unlearning methods, evaluation metrics, and unlearning scenarios. New components are added by implementing an interface and registering a module path, with no framework changes required.
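
As a rough illustration of what registering a module path can look like, the sketch below resolves a name to a class lazily through `importlib`. The `Registry` class, its `register` and `create` methods, and the `"module:Class"` path format are assumptions made for this example and are not taken from the SUPREME codebase; a standard-library class stands in for a real unlearning method.

```python
import importlib
from typing import Any, Dict


class Registry:
    """Hypothetical name -> module-path registry (not SUPREME's actual API)."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._paths: Dict[str, str] = {}

    def register(self, name: str, path: str) -> None:
        """Record a 'package.module:ClassName' path under a short name."""
        if name in self._paths:
            raise ValueError(f"{self.kind} '{name}' is already registered")
        self._paths[name] = path

    def create(self, name: str, **kwargs: Any) -> Any:
        """Import the module on first use and instantiate the class."""
        if name not in self._paths:
            raise KeyError(f"unknown {self.kind} '{name}'")
        module_name, _, attr = self._paths[name].partition(":")
        cls = getattr(importlib.import_module(module_name), attr)
        return cls(**kwargs)


# Stand-in entry: a stdlib class plays the role of an unlearning method.
methods = Registry("unlearning method")
methods.register("counter", "collections:Counter")
instance = methods.create("counter", a=2)
```

Because entries are plain strings, a new component only needs an importable module and one `register` call; nothing in the registry itself has to change.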

  Multi-GPU support

Hardware-agnostic architecture built on PyTorch and Lightning Fabric, with DDP, FSDP, and DeepSpeed ZeRO-1/2/3 distribution applied to training, unlearning, and evaluation. To our knowledge, SUPREME is the first image classification unlearning framework to do so.

  Demonstration

Pins Face Recognition with ResNet18 and ViT, full-class and random-sample unlearning, across ten seeds. The first multi-seed image unlearning study on this benchmark, surfacing across-seed variance that single-seed evaluations miss.

What ships in the registry

| Component | Available implementations |
|---|---|
| Datasets | CIFAR-10, CIFAR-20, CIFAR-100, PinsFaceRecognition, Caltech-101 |
| Models | ResNet18, Vision Transformer (ViT) |
| Baselines | Retrain, Original |
| Unlearning methods | Fine-Tuning (FT), Bad Teacher (BadT), Random Labels (RL), UNSIR, SSD, LFSSD |
| Evaluation metrics | Accuracy, Loss, ZRF, Activation Distance, JS-Divergence, Layer-wise Distance, Membership Inference Attack, Completeness, Resource Consumption, Time |
| Unlearning scenarios | Full-class, Subclass, Random-sample |
| Distributed strategies | DDP, FSDP, DeepSpeed ZeRO-1, ZeRO-2, ZeRO-3 |
| Accelerators | CPU, CUDA, MPS (Apple Silicon), TPU (via PyTorch XLA) |
| Precision modes | fp32, fp16, bf16, mixed-precision via Lightning Fabric |
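
To make the last three rows concrete, here is a minimal sketch of how such options could be translated into arguments for Lightning Fabric. The right-hand strings are the strategy and precision identifiers that Fabric accepts; the short keys on the left (`"zero1"`, `"fp16-mixed"`, and so on) and the `fabric_kwargs` helper are hypothetical names chosen for this example, not SUPREME's configuration schema.

```python
from typing import Any, Dict

# Hypothetical short names -> Fabric `strategy` strings.
STRATEGIES = {
    "ddp": "ddp",
    "fsdp": "fsdp",
    "zero1": "deepspeed_stage_1",
    "zero2": "deepspeed_stage_2",
    "zero3": "deepspeed_stage_3",
}

# Hypothetical short names -> Fabric `precision` strings.
PRECISIONS = {
    "fp32": "32-true",
    "fp16": "16-true",
    "bf16": "bf16-true",
    "fp16-mixed": "16-mixed",
    "bf16-mixed": "bf16-mixed",
}

ACCELERATORS = {"cpu", "cuda", "mps", "tpu"}


def fabric_kwargs(accelerator: str, devices: int,
                  strategy: str, precision: str) -> Dict[str, Any]:
    """Validate the options and build keyword arguments for Fabric(...)."""
    if accelerator not in ACCELERATORS:
        raise ValueError(f"unsupported accelerator: {accelerator}")
    if devices < 1:
        raise ValueError("devices must be a positive integer")
    return {
        "accelerator": accelerator,
        "devices": devices,
        "strategy": STRATEGIES[strategy],
        "precision": PRECISIONS[precision],
    }


kwargs = fabric_kwargs("cuda", devices=4, strategy="zero2", precision="bf16-mixed")

# With Lightning installed, the dictionary would be used roughly as:
#   from lightning.fabric import Fabric
#   fabric = Fabric(**kwargs)
#   fabric.launch()
```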

The pipeline

SUPREME runs a three-stage pipeline parameterised by three independent seeds, allowing users to isolate variance from each stage:

  1. Stage 1, Training. Train the original model M_o on the full dataset D using training seed s_t.
  2. Stage 2, Unlearning. For each forget target c ∈ C, train the retrained baseline M_r from scratch on the retain set D_r using unlearning seed s_u. Then, for each method a ∈ A, apply a to M_o to obtain the unlearned model M_u.
  3. Stage 3, Evaluation. Evaluate M_u against M_r on the test forget and retain sets using evaluation seed s_e, computing the configured metrics in E.

Every method runs under the same seed configuration, so for a given seed all methods start from identical conditions and differences in results reflect the methods rather than pipeline randomness. The retrained baseline M_r is produced once per (s_t, j, c) and shared across all methods, which avoids redundant retraining.
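
The control flow of the three stages, including the shared retrained baseline, can be sketched as follows. The stage functions are stand-ins that return descriptive strings instead of models, the seed values for s_u and s_e are illustrative, and the cache key (training seed, unlearning seed, forget target) is one assumed reading of the triple above.

```python
from typing import Dict, List, Sequence, Tuple


# Stand-in stage functions: they return labels, not real models.
def train(s_t: int) -> str:
    return f"M_o(s_t={s_t})"


def retrain(s_t: int, s_u: int, c: str) -> str:
    return f"M_r(s_t={s_t}, s_u={s_u}, c={c})"


def unlearn(method: str, m_o: str, c: str, s_u: int) -> str:
    return f"M_u({method}, {m_o}, c={c}, s_u={s_u})"


def evaluate(m_u: str, m_r: str, s_e: int) -> Dict[str, str]:
    return {"unlearned": m_u, "baseline": m_r, "s_e": str(s_e)}


def run_pipeline(train_seeds: Sequence[int], s_u: int, s_e: int,
                 targets: Sequence[str],
                 methods: Sequence[str]) -> Tuple[List[Dict[str, str]], int]:
    results: List[Dict[str, str]] = []
    baselines: Dict[Tuple[int, int, str], str] = {}
    for s_t in train_seeds:
        m_o = train(s_t)                        # Stage 1: original model
        for c in targets:
            key = (s_t, s_u, c)
            if key not in baselines:            # Stage 2a: retrain once per key
                baselines[key] = retrain(*key)
            m_r = baselines[key]
            for a in methods:
                m_u = unlearn(a, m_o, c, s_u)   # Stage 2b: apply method a to M_o
                results.append(evaluate(m_u, m_r, s_e))  # Stage 3
    return results, len(baselines)


results, n_retrains = run_pipeline(
    train_seeds=[260, 261], s_u=0, s_e=0,
    targets=["bill_gates", "hugh_jackman"],
    methods=["FT", "RL", "SSD"],
)
```

With two training seeds, two forget targets, and three methods, this produces twelve evaluations from only four retraining runs, since each baseline is reused by every method.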

Demonstration on Pins Face Recognition

We demonstrate SUPREME on Pins Face Recognition, an image classification benchmark of 17,534 facial images across 105 celebrity identities. Two scenarios are evaluated:

  • Full-class unlearning removes all samples for five identities: alex_lawther, bill_gates, danielle_panabaker, hugh_jackman, josh_radnor.
  • Random-sample unlearning removes a 0.1% subset of training samples drawn from across all classes.
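
The two scenarios differ only in how the forget set is chosen. A minimal sketch over sample indices, assuming integer class labels and a seeded uniform draw without replacement (the function names and the rounding rule are illustrative, not taken from the framework):

```python
import random
from typing import List, Sequence, Tuple


def full_class_split(labels: Sequence[int],
                     forget_class: int) -> Tuple[List[int], List[int]]:
    """Forget every sample of one class; retain everything else."""
    forget = [i for i, y in enumerate(labels) if y == forget_class]
    retain = [i for i, y in enumerate(labels) if y != forget_class]
    return forget, retain


def random_sample_split(n_samples: int, fraction: float,
                        seed: int) -> Tuple[List[int], List[int]]:
    """Forget a seeded random fraction of samples drawn across all classes."""
    rng = random.Random(seed)
    k = max(1, round(n_samples * fraction))  # assumed rounding rule
    forget = sorted(rng.sample(range(n_samples), k))
    forget_set = set(forget)
    retain = [i for i in range(n_samples) if i not in forget_set]
    return forget, retain


class_forget, class_retain = full_class_split([0, 1, 2, 1, 0], forget_class=1)
rand_forget, rand_retain = random_sample_split(10_000, fraction=0.001, seed=260)
```

At a 0.1% fraction the forget set is very small (10 of 10,000 samples in this toy call), so each forgotten sample moves the forget-set accuracy by a large step.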

Both architectures (ResNet18 and ViT) are evaluated across I = 10 training seeds (260–269) on a single NVIDIA L40S GPU to maintain exact numerical parity with the reference implementations. Table 1 reports accuracy differences (ΔAcc) and layer-wise weight distances between M_u and M_r; closer to 0 is better.

Table 1: Accuracy differences and layer-wise distances on Pins Face Recognition

Mean ± std across 10 seeds. Bold marks the best (closest to 0) value in each column within a model–scenario block.

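| Model | Scenario | Method | ΔAcc on D'_f | ΔAcc on D'_r | Layer |
|---|---|---|---|---|---|
| ResNet18 | Full-class | FT | 29.22 ± 5.21 | 2.52 ± 0.11 | 31.52 ± 0.22 |
| ResNet18 | Full-class | BadT | 0.26 ± 0.17 | -2.35 ± 0.27 | 31.72 ± 0.34 |
| ResNet18 | Full-class | UNSIR | 89.44 ± 1.98 | 2.51 ± 0.12 | 32.25 ± 0.34 |
| ResNet18 | Full-class | RL | 0.00 ± 0.00 | 2.58 ± 0.12 | 31.98 ± 0.34 |
| ResNet18 | Full-class | SSD | 1.97 ± 6.22 | -9.14 ± 7.43 | 31.55 ± 0.36 |
| ResNet18 | Full-class | LFSSD | 0.00 ± 0.00 | -3.66 ± 1.53 | 31.56 ± 0.34 |
| ResNet18 | Random | FT | 2.78 ± 24.32 | -2.60 ± 15.25 | 37.70 ± 7.91 |
| ResNet18 | Random | BadT | -35.00 ± 25.93 | -32.47 ± 29.83 | 41.99 ± 8.78 |
| ResNet18 | Random | RL | -48.89 ± 34.03 | -4.59 ± 20.12 | 42.11 ± 8.72 |
| ResNet18 | Random | SSD | -75.00 ± 19.47 | -79.97 ± 23.79 | 40.25 ± 9.06 |
| ResNet18 | Random | LFSSD | -68.33 ± 15.28 | -58.61 ± 19.20 | 40.82 ± 8.72 |
| ViT | Full-class | FT | 0.03 ± 0.06 | -22.66 ± 23.47 | 105.29 ± 1.31 |
| ViT | Full-class | BadT | 17.90 ± 5.18 | -0.88 ± 0.10 | 33.97 ± 0.12 |
| ViT | Full-class | UNSIR | 35.52 ± 4.92 | -0.31 ± 0.13 | 39.38 ± 0.18 |
| ViT | Full-class | RL | 0.00 ± 0.00 | 0.20 ± 0.05 | 36.92 ± 0.18 |
| ViT | Full-class | SSD | 0.00 ± 0.00 | -2.40 ± 0.99 | 63.38 ± 2.58 |
| ViT | Full-class | LFSSD | 0.00 ± 0.00 | -3.88 ± 2.11 | 86.77 ± 6.07 |
| ViT | Random | FT | 8.33 ± 5.40 | 1.42 ± 0.12 | 60.40 ± 0.19 |
| ViT | Random | BadT | -8.33 ± 7.97 | 0.21 ± 0.52 | 32.76 ± 0.18 |
| ViT | Random | RL | -76.67 ± 11.94 | 1.33 ± 0.12 | 35.71 ± 0.20 |
| ViT | Random | SSD | -55.00 ± 37.99 | -54.09 ± 45.60 | 171.80 ± 84.31 |
| ViT | Random | LFSSD | -89.44 ± 6.11 | -90.70 ± 6.59 | 231.91 ± 15.22 |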

UNSIR is excluded from the random scenario by design. Full-class values average over 5 forget classes; random uses a 0.1% forget set. The large standard deviations in the random-sample rows underline the paper's main practical point: single-seed image unlearning results can misrepresent a method's behaviour, and multi-seed evaluation is required for a fair comparison.
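
For readers reproducing entries of this kind, the sketch below shows one way to turn per-seed values into a "mean ± std" string and to pick the method whose mean is closest to 0. It assumes the sample standard deviation (n − 1 denominator), which the page does not specify, and the per-seed numbers are toy values, not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def summarise(per_seed: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (assumed convention)."""
    return mean(per_seed), stdev(per_seed)


def fmt(per_seed: Sequence[float]) -> str:
    """Format per-seed values in the 'mean ± std' style used in Table 1."""
    m, s = summarise(per_seed)
    return f"{m:.2f} ± {s:.2f}"


def closest_to_zero(per_method: Dict[str, Sequence[float]]) -> str:
    """Name of the method whose across-seed mean has the smallest magnitude."""
    return min(per_method, key=lambda name: abs(mean(per_method[name])))


# Toy per-seed accuracy differences for two hypothetical methods.
toy = {
    "method_a": [1.0, 2.0, 3.0],
    "method_b": [-0.5, 0.5, 0.0],
}
cell_a = fmt(toy["method_a"])
best = closest_to_zero(toy)
```

Selecting on the mean alone ignores spread, so a method with a near-zero mean and a large standard deviation can still behave poorly on any individual seed.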

BibTeX

@misc{andreou2026supreme,
  title  = {SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation},
  author = {Andreou, Petros and Lanyon, Jamie and Finke, Axel and Cosma, Georgina},
  year   = {2026},
  howpublished = {}
}

Acknowledgement

Petros Andreou is supported by a PhD studentship.