SAFFRON-1: Inference Scaling for LLM Safety Assurance

Authors: Ruizhong Qiu*, Gaotang Li*, Tianxin Wei, Jingrui He, Hanghang Tong
University of Illinois Urbana-Champaign

Paper | GitHub

More details coming soon... Star our GitHub repo to stay tuned!


Overview

SAFFRON-1 introduces the first inference scaling paradigm tailored to LLM safety assurance, addressing the shortcomings that existing methods such as Best-of-N, Beam Search, and MCTS exhibit under adversarial settings. Our method replaces expensive process reward models (PRMs) with a novel multifurcation reward model (MRM) that reduces the number of reward evaluations required during search while improving robustness and efficiency.


Motivation

While inference-time scaling has greatly advanced reasoning tasks, it scales poorly in safety settings. Existing techniques suffer from what we identify as the exploration-efficiency dilemma: exploring more of the search tree requires more frequent PRM calls, and the cost of these calls quickly dominates, limiting how far these methods can scale.

Figure 1: Comparison between PRM-based and MRM-based tree search procedures.

Figure 2: Comparison with existing inference scaling methods.


SAFFRON Paradigm

We propose Safe Multifurcation (SAFFRON), an inference scaling paradigm with the following key innovations:

  • Multifurcation Reward Model (MRM): A vector-valued reward model that estimates the rewards of all candidate next tokens in a single call, rather than one PRM call per candidate (see the sketch after this list).
  • Partial Supervision Training: Efficient learning from token-level reward annotations.
  • Conservative Exploration: Avoids unsafe out-of-distribution tokens by masking unseen outputs.
  • Trie-based Key-Value (KV) Caching: Efficient reuse of KV caches across searched sequences that share a common prefix (sketched after Figure 3).
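
As a rough, self-contained sketch of the first three components, the PyTorch snippet below pairs a vector-valued reward head with a partial-supervision loss and a conservative exploration mask. Everything here is illustrative: the class and function names, the dimensions, and the single linear head are our assumptions for exposition, not the released SAFFRON-1 implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 32000  # hypothetical vocabulary size
HIDDEN_DIM = 4096   # hypothetical hidden dimension


class MultifurcationRewardHead(nn.Module):
    """Vector-valued head: one forward pass yields a reward estimate for
    every candidate next token (an illustrative stand-in for the MRM)."""

    def __init__(self, hidden_dim: int = HIDDEN_DIM, vocab_size: int = VOCAB_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_dim) at the prefix's final position.
        # Returns (batch, vocab_size): a reward for each possible next token,
        # replacing one PRM call per expanded child with a single call.
        return self.proj(last_hidden)


def partial_supervision_loss(pred: torch.Tensor, target: torch.Tensor,
                             annotated: torch.Tensor) -> torch.Tensor:
    """Regress only onto tokens that carry reward annotations; tokens
    without labels contribute nothing to the gradient."""
    sq_err = (pred - target) ** 2
    return (sq_err * annotated).sum() / annotated.sum().clamp(min=1)


def conservative_mask(rewards: torch.Tensor, seen: torch.Tensor) -> torch.Tensor:
    """Conservative exploration: tokens never observed during training get
    reward -inf, so the search cannot branch into unscored continuations."""
    return rewards.masked_fill(~seen, float("-inf"))


# Toy usage: score all children of one prefix in a single call.
mrm_head = MultifurcationRewardHead()
h = torch.randn(1, HIDDEN_DIM)      # stand-in for the prefix encoding
seen = torch.zeros(VOCAB_SIZE, dtype=torch.bool)
seen[:1000] = True                  # pretend only these tokens were annotated
child_rewards = conservative_mask(mrm_head(h), seen)  # (1, VOCAB_SIZE)
```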

Figure 3: Trie-based cache sharing across sequences with common prefixes.
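
The cache sharing in Figure 3 can be sketched as a token-level trie whose nodes store per-token KV entries, so sequences that share a prefix reuse the same nodes and only compute KV states past the shared part. The `KVTrie` structure and its method names below are hypothetical, chosen for illustration under the assumption that KV entries are stored one per token.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class TrieNode:
    """One token along a searched sequence; `kv` holds this position's
    cached key/value tensors so descendant sequences can reuse them."""
    kv: Any = None
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class KVTrie:
    """Share KV caches across searched sequences with common prefixes."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def lookup(self, tokens: List[int]) -> Tuple[List[Any], int]:
        """Return cached per-token KV entries for the longest cached prefix
        of `tokens`, plus the index where recomputation must resume."""
        node, cached = self.root, []
        for i, tok in enumerate(tokens):
            nxt = node.children.get(tok)
            if nxt is None:
                return cached, i      # cache miss: recompute from i onward
            cached.append(nxt.kv)
            node = nxt
        return cached, len(tokens)    # whole sequence already cached

    def insert(self, tokens: List[int], kvs: List[Any]) -> None:
        """Store one KV entry per token along the path for `tokens`."""
        node = self.root
        for tok, kv in zip(tokens, kvs):
            node = node.children.setdefault(tok, TrieNode())
            node.kv = kv


# Toy usage: [1, 2, 3] and [1, 2, 4] share the work for prefix [1, 2].
trie = KVTrie()
trie.insert([1, 2, 3], ["kv1", "kv2", "kv3"])
cached, start = trie.lookup([1, 2, 4])
assert start == 2 and cached == ["kv1", "kv2"]  # only token 4 needs compute
```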

Resources

To facilitate future research on LLM safety, we release our trained MRM, SAFFRON-1, and our training dataset, Safety4M: