scLatent

Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction

Radar plot comparing model variants across distributional and perturbation metrics

Abstract

We introduce ExpressionVAE, the first discrete-latent perturbation model for single-cell data: a scalar-quantized variational autoencoder paired with a perturbation-conditioned generative prior. On Replogle and Parse 1M, multiple variants of this framework achieve state-of-the-art results on every distributional metric and on all cell-eval-derived generation and perturbation metrics we evaluate, with order-of-magnitude gaps in Fréchet distance and MMD^2 over the strongest continuous-latent baselines. We test two prior families (autoregressive and masked discrete diffusion) and find that they achieve effectively identical numbers, isolating the gain to the discrete latent space. A controlled output-head ablation further reveals a single design axis governing decoder-head choice: the richness of the inference-time sampling distribution. Standard evaluation metrics partition into three groups whose rankings flip along this axis. Finally, on a held-out CRISPRi reversion benchmark of 1732 perturbations under inflammatory cytokine stress, the frozen encoder outperforms existing methods such as UMAP and DE and matches scGPT (trained on a 10 times larger dataset) on target ranking.

Approach

ExpressionVAE factorizes perturbation prediction into two stages: (1) an FSQ-VAE encoder that maps gene expression profiles into discrete or continuous latent codes, and (2) a generative prior trained on those codes to predict how a cell's latent representation shifts under perturbation. We perform a controlled cross-product study over two axes:

Prior (generative model)

  • Autoregressive (AR)
  • Masked discrete diffusion (MDLM)
  • Flow matching

Output Head (tokenization)

  • Cross-entropy / quantile (ce-quantile)
  • Hurdle model
  • MSE
  • Negative binomial (nb)
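
The cross-product over the two axes listed above can be enumerated as in the minimal sketch below. The identifier strings (`ar`, `mdlm`, `flow_matching`, `hurdle`, `mse`) and the `build_grid` helper are hypothetical names chosen for illustration, not the project's configuration keys, and the sketch does not imply that every combination is reported.

```python
from itertools import product

# Hypothetical identifiers for the two study axes (labels are assumptions).
PRIORS = ("ar", "mdlm", "flow_matching")
OUTPUT_HEADS = ("ce-quantile", "hurdle", "mse", "nb")


def build_grid(priors=PRIORS, heads=OUTPUT_HEADS):
    """Return one config dict per (prior, output head) combination."""
    return [
        {"prior": prior, "output_head": head, "name": f"{prior}+{head}"}
        for prior, head in product(priors, heads)
    ]


grid = build_grid()
for config in grid:
    print(config["name"])
```

Three priors and four output heads give twelve candidate configurations in this enumeration.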

We also vary quantizers (FSQ, Gaussian/continuous) across two large-scale benchmarks. The output head choice is analogous to the "output tokenization" decision in LLM design and proves to be the dominant architectural variable.
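
To make the FSQ quantizer concrete, here is a minimal pure-Python sketch of finite scalar quantization as it is commonly described: each latent dimension is bounded with tanh and rounded to one of a small number of integer levels, and the per-dimension codes combine into a single token index. The function names, the restriction to odd level counts, and the example level configuration are assumptions for illustration. The sketch omits the straight-through gradient used during training and is not the project's implementation.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each latent value to an integer in [-half, half].

    Assumes every entry of `levels` is an odd integer >= 3, so that the
    grid is symmetric around zero (a simplification made for this sketch).
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels < 3 or n_levels % 2 == 0:
            raise ValueError("this sketch supports odd level counts >= 3 only")
        half = (n_levels - 1) // 2
        bounded = half * math.tanh(value)  # lies strictly inside (-half, half)
        codes.append(int(round(bounded)))
    return codes


def codes_to_index(codes, levels):
    """Combine per-dimension codes into one token index (mixed radix)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)  # shift code to 0..n_levels-1
    return index


levels = [5, 5, 5]  # hypothetical configuration: 5 * 5 * 5 = 125 possible tokens
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = codes_to_index(codes, levels)
print(codes, token)
```

The resulting token indices are what a discrete prior (autoregressive or masked diffusion) would model, whereas a Gaussian/continuous quantizer leaves the latent as real-valued vectors.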

Materials

Project
PDF (coming soon)
Code

BibTeX

Coming soon.