SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

Abstract

Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an unlearning method requires repeating training, unlearning, and evaluation across multiple seeds, which is computationally expensive. To our knowledge, existing image classification unlearning frameworks run on a single GPU, which limits how many seeds can be evaluated in reasonable time.

We introduce SUPREME, an open-source framework that distributes these stages across multiple GPUs. SUPREME makes three contributions: a registry-based design for adding new methods, metrics, models, and scenarios; a multi-GPU architecture supporting multiple accelerators and precision modes; and a demonstration on Pins Face Recognition using ResNet18 and ViT under full-class and random-sample unlearning across ten seeds. The framework is available at github.com/pedroandreou/supreme-unlearning.

Key Contributions

Extensible framework

Registry-based design covers datasets, model architectures, unlearning methods, evaluation metrics, and unlearning scenarios. New components are added by implementing an interface and registering a module path, with no framework changes required.
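A minimal sketch of how such a registry could work is shown below. The names UnlearningMethod, UNLEARNING_METHODS, register, and resolve are hypothetical and are not taken from the SUPREME codebase. A standard-library class stands in for a real method so the snippet runs without the framework installed.

```python
import importlib
from abc import ABC, abstractmethod
from typing import Any, Dict


class UnlearningMethod(ABC):
    """Hypothetical interface a new unlearning method would implement."""

    @abstractmethod
    def unlearn(self, model: Any, forget_set: Any, retain_set: Any) -> Any:
        """Return the unlearned model."""


# Hypothetical registry: short name -> "module.path:AttributeName".
UNLEARNING_METHODS: Dict[str, str] = {}


def register(registry: Dict[str, str], name: str, target: str) -> None:
    """Record where a component lives without importing it yet."""
    if name in registry:
        raise KeyError(f"{name!r} is already registered")
    registry[name] = target


def resolve(registry: Dict[str, str], name: str) -> Any:
    """Import the module on demand and return the registered attribute."""
    module_path, _, attr = registry[name].partition(":")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


# Stand-in target from the standard library, used only so the sketch runs.
register(UNLEARNING_METHODS, "demo", "collections:OrderedDict")
demo_cls = resolve(UNLEARNING_METHODS, "demo")
```

Storing a module path rather than the class itself means a component is only imported when a configuration asks for it, which is one way to add components without touching framework code.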

Multi-GPU support

Hardware-agnostic architecture built on PyTorch and Lightning Fabric, with DDP, FSDP, and DeepSpeed ZeRO-1/2/3 distribution applied to training, unlearning, and evaluation. To our knowledge, this is the first image classification unlearning framework to distribute all three stages.
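As an illustration, the listed strategies and precision modes could be mapped onto Lightning Fabric arguments roughly as follows. The short configuration keys and the helper names fabric_kwargs and launch are assumptions made for this sketch. Only the values on the right-hand side are Fabric's own strategy and precision strings.

```python
from typing import Any, Callable, Dict

# Left: hypothetical config keys. Right: strings accepted by Lightning Fabric.
STRATEGIES = {
    "ddp": "ddp",
    "fsdp": "fsdp",
    "zero1": "deepspeed_stage_1",
    "zero2": "deepspeed_stage_2",
    "zero3": "deepspeed_stage_3",
}
PRECISIONS = {
    "fp32": "32-true",
    "fp16": "16-true",
    "bf16": "bf16-true",
    "fp16-mixed": "16-mixed",
    "bf16-mixed": "bf16-mixed",
}


def fabric_kwargs(accelerator: str, devices: int,
                  strategy: str, precision: str) -> Dict[str, Any]:
    """Translate a short config into keyword arguments for Fabric(...)."""
    return {
        "accelerator": accelerator,  # e.g. "cpu", "cuda", "mps", "tpu"
        "devices": devices,
        "strategy": STRATEGIES[strategy],
        "precision": PRECISIONS[precision],
    }


def launch(stage_fn: Callable[[Any], Any], **kwargs: Any) -> Any:
    """Run one pipeline stage under Fabric (requires the lightning package)."""
    from lightning.fabric import Fabric  # imported lazily; optional dependency

    fabric = Fabric(**kwargs)
    fabric.launch()
    return stage_fn(fabric)


kwargs = fabric_kwargs("cuda", devices=4, strategy="zero2", precision="bf16-mixed")
```

The same launch helper could then wrap the training, unlearning, and evaluation stages, so switching from DDP to ZeRO-3 would be a configuration change rather than a code change.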

Demonstration

Pins Face Recognition with ResNet18 and ViT, full-class and random-sample unlearning, across ten seeds. The first multi-seed image unlearning study on this benchmark, surfacing across-seed variance that single-seed evaluations miss.

What ships in the registry

Component	Available implementations
Datasets	CIFAR-10, CIFAR-20, CIFAR-100, PinsFaceRecognition, Caltech-101
Models	ResNet18, Vision Transformer (ViT)
Baselines	Retrain, Original
Unlearning methods	Fine-Tuning (FT), Bad Teacher (BadT), Random Labels (RL), UNSIR, SSD, LFSSD
Evaluation metrics	Accuracy, Loss, ZRF, Activation Distance, JS-Divergence, Layer-wise Distance, Membership Inference Attack, Completeness, Resource Consumption, Time
Unlearning scenarios	Full-class, Subclass, Random-sample
Distributed strategies	DDP, FSDP, DeepSpeed ZeRO-1, ZeRO-2, ZeRO-3
Accelerators	CPU, CUDA, MPS (Apple Silicon), TPU (via PyTorch XLA)
Precision modes	fp32, fp16, bf16, mixed-precision via Lightning Fabric

The pipeline

SUPREME runs a three-stage pipeline parameterised by three independent seeds, allowing users to isolate variance from each stage:

Stage 1, Training. Train the original model M_o on the full dataset D using training seed s_t.
Stage 2, Unlearning. For each forget target c ∈ C, train the retrained baseline M_r from scratch on the retain set D_r using unlearning seed s_u. Then for each method a ∈ A, apply a to M_o to obtain the unlearned model M_u.
Stage 3, Evaluation. Evaluate M_u against M_r on the test forget and retain sets using evaluation seed s_e, computing the configured metrics in E.

Running every method under the same seed configuration gives each method identical starting conditions for a given seed, which separates method differences from pipeline randomness. The retrained baseline M_r is produced once per (s_t, j, c) and shared across all methods, which avoids redundant retraining.
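The three stages and the shared baseline can be sketched as a nested loop. This is an illustration under stated assumptions, not SUPREME's implementation: run_pipeline and its callable arguments are hypothetical, and reading the j in (s_t, j, c) as the index of the unlearning seed is an assumption.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence, Tuple


def run_pipeline(train_seeds: Sequence[int], unlearn_seeds: Sequence[int],
                 eval_seeds: Sequence[int], targets: Sequence[str],
                 methods: Dict[str, Callable[..., Any]],
                 train: Callable[[int], Any],
                 retrain: Callable[[int, str], Any],
                 evaluate: Callable[[Any, Any, int], Any]) -> List[Tuple]:
    rows = []
    for s_t in train_seeds:
        m_o = train(s_t)                          # Stage 1: original model
        for s_u, c in product(unlearn_seeds, targets):
            m_r = retrain(s_u, c)                 # Stage 2: baseline, built once here
            for name, method in methods.items():
                m_u = method(m_o, c, s_u)         # Stage 2: unlearned model
                for s_e in eval_seeds:            # Stage 3: compare M_u with M_r
                    rows.append((s_t, s_u, s_e, c, name, evaluate(m_u, m_r, s_e)))
    return rows


# Stub callables that only count calls, so the sketch runs without a GPU.
calls = {"retrain": 0}


def fake_retrain(s_u: int, c: str) -> Tuple:
    calls["retrain"] += 1
    return ("M_r", s_u, c)


rows = run_pipeline(
    train_seeds=[260, 261], unlearn_seeds=[0], eval_seeds=[0],
    targets=["bill_gates", "hugh_jackman"],
    methods={"FT": lambda m_o, c, s_u: ("FT", m_o, c),
             "RL": lambda m_o, c, s_u: ("RL", m_o, c)},
    train=lambda s_t: ("M_o", s_t),
    retrain=fake_retrain,
    evaluate=lambda m_u, m_r, s_e: 0.0,
)
```

With two training seeds, one unlearning seed, two forget targets, and two methods, the baseline is retrained four times rather than eight, because both methods reuse the same M_r.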

Demonstration on Pins Face Recognition

We demonstrate SUPREME on Pins Face Recognition, an image classification benchmark of 17,534 facial images across 105 celebrity identities. Two scenarios are evaluated:

Full-class unlearning removes all samples for five identities: alex_lawther, bill_gates, danielle_panabaker, hugh_jackman, josh_radnor.
Random-sample unlearning removes a 0.1% subset of training samples drawn from across all classes.
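The two scenarios differ only in how the forget set is chosen. A minimal sketch of both splits over sample indices follows. The function names are hypothetical, and rounding the forget-set size to the nearest integer with a minimum of one sample is an assumption.

```python
import random
from typing import List, Sequence, Set, Tuple


def full_class_split(labels: Sequence[str],
                     forget_classes: Set[str]) -> Tuple[List[int], List[int]]:
    """Forget every sample whose label is in forget_classes."""
    forget = [i for i, y in enumerate(labels) if y in forget_classes]
    retain = [i for i, y in enumerate(labels) if y not in forget_classes]
    return forget, retain


def random_sample_split(n: int, fraction: float,
                        seed: int) -> Tuple[List[int], List[int]]:
    """Forget a seeded random fraction of the n training samples."""
    rng = random.Random(seed)
    k = max(1, round(n * fraction))  # assumed rounding rule
    forget = sorted(rng.sample(range(n), k))
    chosen = set(forget)
    retain = [i for i in range(n) if i not in chosen]
    return forget, retain


# Toy labels, not the real dataset.
toy_labels = ["bill_gates", "other", "bill_gates", "other"]
class_forget, class_retain = full_class_split(toy_labels, {"bill_gates"})
rand_forget, rand_retain = random_sample_split(10_000, 0.001, seed=0)
```

Passing the seed explicitly keeps the random forget set reproducible across runs of the same configuration.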

Both architectures (ResNet18 and ViT) are evaluated across I = 10 training seeds (260–269) on a single NVIDIA L40S GPU to maintain exact numerical parity with the reference implementations. Table 1 reports accuracy differences (ΔAcc) and layer-wise weight distances between M_u and M_r; closer to 0 is better.
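The two quantities reported in Table 1 could be computed along the following lines. The exact definitions are not given on this page, so both are assumptions: ΔAcc is taken as the accuracy of M_u minus the accuracy of M_r, and the layer-wise distance is taken as the Euclidean distance between corresponding layers, averaged over layers. Weights are plain lists here to keep the sketch dependency-free.

```python
import math
from typing import Dict, List


def delta_acc(acc_unlearned: float, acc_retrained: float) -> float:
    """Assumed sign convention: accuracy of M_u minus accuracy of M_r."""
    return acc_unlearned - acc_retrained


def layerwise_distance(w_u: Dict[str, List[float]],
                       w_r: Dict[str, List[float]]) -> float:
    """Assumed definition: mean over layers of the per-layer L2 distance."""
    dists = []
    for name, layer_u in w_u.items():
        layer_r = w_r[name]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(layer_u, layer_r))))
    return sum(dists) / len(dists)


# Toy weights, not taken from any trained model.
w_unlearned = {"fc": [3.0, 0.0], "conv": [0.0, 0.0]}
w_retrained = {"fc": [0.0, 4.0], "conv": [0.0, 0.0]}
dist = layerwise_distance(w_unlearned, w_retrained)   # (5.0 + 0.0) / 2
gap = delta_acc(12.5, 10.0)
```

Under either definition, a value of 0 means the unlearned model is indistinguishable from the retrained baseline on that measure.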

Table 1: Accuracy differences and layer-wise distances on Pins Face Recognition

Mean ± std across 10 seeds. Bold marks the best (closest to 0) value in each column within a model–scenario block.

Model	Scenario	Method	ΔAcc on D'_f	ΔAcc on D'_r	Layer-wise distance
ResNet18	Full-class	FT	29.22 ± 5.21	2.52 ± 0.11	31.52 ± 0.22
		BadT	0.26 ± 0.17	-2.35 ± 0.27	31.72 ± 0.34
		UNSIR	89.44 ± 1.98	2.51 ± 0.12	32.25 ± 0.34
		RL	0.00 ± 0.00	2.58 ± 0.12	31.98 ± 0.34
		SSD	1.97 ± 6.22	-9.14 ± 7.43	31.55 ± 0.36
		LFSSD	0.00 ± 0.00	-3.66 ± 1.53	31.56 ± 0.34
	Random	FT	2.78 ± 24.32	-2.60 ± 15.25	37.70 ± 7.91
		BadT	-35.00 ± 25.93	-32.47 ± 29.83	41.99 ± 8.78
		RL	-48.89 ± 34.03	-4.59 ± 20.12	42.11 ± 8.72
		SSD	-75.00 ± 19.47	-79.97 ± 23.79	40.25 ± 9.06
		LFSSD	-68.33 ± 15.28	-58.61 ± 19.20	40.82 ± 8.72
ViT	Full-class	FT	0.03 ± 0.06	-22.66 ± 23.47	105.29 ± 1.31
		BadT	17.90 ± 5.18	-0.88 ± 0.10	33.97 ± 0.12
		UNSIR	35.52 ± 4.92	-0.31 ± 0.13	39.38 ± 0.18
		RL	0.00 ± 0.00	0.20 ± 0.05	36.92 ± 0.18
		SSD	0.00 ± 0.00	-2.40 ± 0.99	63.38 ± 2.58
		LFSSD	0.00 ± 0.00	-3.88 ± 2.11	86.77 ± 6.07
	Random	FT	8.33 ± 5.40	1.42 ± 0.12	60.40 ± 0.19
		BadT	-8.33 ± 7.97	0.21 ± 0.52	32.76 ± 0.18
		RL	-76.67 ± 11.94	1.33 ± 0.12	35.71 ± 0.20
		SSD	-55.00 ± 37.99	-54.09 ± 45.60	171.80 ± 84.31
		LFSSD	-89.44 ± 6.11	-90.70 ± 6.59	231.91 ± 15.22

UNSIR is excluded from the random scenario by design. Full-class values average over 5 forget classes; random uses a 0.1% forget set. The large standard deviations in the random-sample rows underline the paper's main practical point: single-seed image unlearning results can misrepresent a method's behaviour, and multi-seed evaluation is required for a fair comparison.
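The "mean ± std" entries can be reproduced from per-seed values with a few lines. The per-seed numbers below are made up for illustration and do not come from Table 1, and using the sample standard deviation rather than the population one is an assumption.

```python
from statistics import mean, stdev
from typing import Sequence


def summarise(values: Sequence[float]) -> str:
    """Format per-seed values as 'mean ± std' (sample standard deviation)."""
    return f"{mean(values):.2f} ± {stdev(values):.2f}"


# Hypothetical per-seed ΔAcc values for one method.
per_seed = [0.0, 10.0, -10.0, 20.0, -20.0]
summary = summarise(per_seed)             # "0.00 ± 15.81"
spread = max(per_seed) - min(per_seed)    # 40.0
```

In this toy case the mean is 0, yet any single seed could report a value anywhere between -20 and 20, which is the kind of discrepancy a single-seed evaluation cannot reveal.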

BibTeX

@misc{andreou2026supreme,
  title  = {SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation},
  author = {Andreou, Petros and Lanyon, Jamie and Finke, Axel and Cosma, Georgina},
  year   = {2026},
  eprint = {2606.00380},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url    = {https://arxiv.org/abs/2606.00380}
}

Acknowledgement

Petros Andreou is supported by a PhD studentship.