SAFFRON-1: Inference Scaling for LLM Safety Assurance

Authors: Ruizhong Qiu*, Gaotang Li*, Tianxin Wei, Jingrui He, Hanghang Tong
University of Illinois Urbana-Champaign

Paper | GitHub

More details coming soon... Star our GitHub repo to stay tuned!


Overview

SAFFRON-1 introduces the first inference scaling paradigm tailored to LLM safety assurance, addressing the shortcomings that existing methods such as Best-of-N, Beam Search, and MCTS exhibit under adversarial settings. Our method replaces expensive process reward models (PRMs) with a novel multifurcation reward model (MRM) that reduces the number of reward evaluations required during search while improving robustness and efficiency.


Motivation

While inference-time scaling has greatly advanced reasoning tasks, it scales poorly in safety settings. Existing techniques suffer from what we identify as the exploration-efficiency dilemma: exploring more of the search tree requires more frequent PRM calls, and the cost of these calls quickly dominates, limiting how far these methods can scale.

Figure 1: Comparison between PRM-based and MRM-based tree search procedures.

Figure 2: Comparison with existing inference scaling methods.


SAFFRON Paradigm

We propose Safe Multifurcation (SAFFRON), an inference scaling paradigm with the following key innovations:

  • Multifurcation Reward Model (MRM): A vector-valued reward model that estimates the rewards of all candidate next tokens in a single call, rather than one PRM call per candidate (see the sketch after this list).
  • Partial Supervision Training: Efficient learning from token-level reward annotations.
  • Conservative Exploration: Avoids unsafe out-of-distribution tokens by masking unseen outputs.
  • Trie-based Key-Value (KV) Caching: Efficient reuse of KV caches across searched sequences that share a common prefix (sketched after Figure 3).
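
As a rough, self-contained sketch of the first three components, the PyTorch snippet below pairs a vector-valued reward head with a partial-supervision loss and a conservative exploration mask. Everything here is illustrative: the class and function names, the dimensions, and the single linear head are our assumptions for exposition, not the released SAFFRON-1 implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 32000  # hypothetical vocabulary size
HIDDEN_DIM = 4096   # hypothetical hidden dimension


class MultifurcationRewardHead(nn.Module):
    """Vector-valued head: one forward pass yields a reward estimate for
    every candidate next token (an illustrative stand-in for the MRM)."""

    def __init__(self, hidden_dim: int = HIDDEN_DIM, vocab_size: int = VOCAB_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_dim) at the prefix's final position.
        # Returns (batch, vocab_size): a reward for each possible next token,
        # replacing one PRM call per expanded child with a single call.
        return self.proj(last_hidden)


def partial_supervision_loss(pred: torch.Tensor, target: torch.Tensor,
                             annotated: torch.Tensor) -> torch.Tensor:
    """Regress only onto tokens that carry reward annotations; tokens
    without labels contribute nothing to the gradient."""
    sq_err = (pred - target) ** 2
    return (sq_err * annotated).sum() / annotated.sum().clamp(min=1)


def conservative_mask(rewards: torch.Tensor, seen: torch.Tensor) -> torch.Tensor:
    """Conservative exploration: tokens never observed during training get
    reward -inf, so the search cannot branch into unscored continuations."""
    return rewards.masked_fill(~seen, float("-inf"))


# Toy usage: score all children of one prefix in a single call.
mrm_head = MultifurcationRewardHead()
h = torch.randn(1, HIDDEN_DIM)      # stand-in for the prefix encoding
seen = torch.zeros(VOCAB_SIZE, dtype=torch.bool)
seen[:1000] = True                  # pretend only these tokens were annotated
child_rewards = conservative_mask(mrm_head(h), seen)  # (1, VOCAB_SIZE)
```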

Figure 3: Trie-based cache sharing across sequences with common prefixes.
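
The cache sharing in Figure 3 can be sketched as a token-level trie whose nodes store per-token KV entries, so sequences that share a prefix reuse the same nodes and only compute KV states past the shared part. The `KVTrie` structure and its method names below are hypothetical, chosen for illustration under the assumption that KV entries are stored one per token.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class TrieNode:
    """One token along a searched sequence; `kv` holds this position's
    cached key/value tensors so descendant sequences can reuse them."""
    kv: Any = None
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class KVTrie:
    """Share KV caches across searched sequences with common prefixes."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def lookup(self, tokens: List[int]) -> Tuple[List[Any], int]:
        """Return cached per-token KV entries for the longest cached prefix
        of `tokens`, plus the index where recomputation must resume."""
        node, cached = self.root, []
        for i, tok in enumerate(tokens):
            nxt = node.children.get(tok)
            if nxt is None:
                return cached, i      # cache miss: recompute from i onward
            cached.append(nxt.kv)
            node = nxt
        return cached, len(tokens)    # whole sequence already cached

    def insert(self, tokens: List[int], kvs: List[Any]) -> None:
        """Store one KV entry per token along the path for `tokens`."""
        node = self.root
        for tok, kv in zip(tokens, kvs):
            node = node.children.setdefault(tok, TrieNode())
            node.kv = kv


# Toy usage: [1, 2, 3] and [1, 2, 4] share the work for prefix [1, 2].
trie = KVTrie()
trie.insert([1, 2, 3], ["kv1", "kv2", "kv3"])
cached, start = trie.lookup([1, 2, 4])
assert start == 2 and cached == ["kv1", "kv2"]  # only token 4 needs compute
```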

Resources

To facilitate future research on LLM safety, we release our trained MRM, SAFFRON-1, and our training dataset, Safety4M: