OXRL Study Finds Post-Training Algorithm Rankings Invert with Model Scale
A comprehensive study evaluated 51 post-training algorithms across different model scales, revealing that algorithm performance rankings completely invert between 1.5B and 7B parameter models, and that the choice of loss function contributes far less to performance than model scale does.
Why it matters
The finding that algorithm rankings are scale-dependent means a post-training method validated on a small model may not transfer to a larger one, which directly affects how the alignment community selects and benchmarks fine-tuning algorithms.
Key Points
- Researchers built the OXRL framework to enable controlled comparisons of post-training algorithms
- Algorithm performance rankings are not stable across different model scales, with best and worst performers inverting
- At 1.5B parameters, online RL (SGRPO) performed best, while SimPO was worst
- At 7B parameters, SimPO became the best performer, while SGRPO was no longer optimal
Details
The OXRL study systematically evaluated 51 post-training algorithms across four model scales (0.5B to 7B parameters), three evaluation domains, and 20 DPO variants. The key finding is that algorithm performance rankings completely invert between smaller and larger models: what works best for 1.5B-parameter models can perform worst at 7B, and vice versa. This scale-dependent behavior challenges the common assumption that algorithm choice transfers across scales and could reshape how practitioners approach model alignment. The researchers used controlled experimental design and statistical analysis to isolate the effects of model scale, algorithm category, and evaluation domain, finding that the choice of loss function provides less than one percentage point of leverage compared with the dramatic impact of model scale.
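To make the "choice of loss function" axis concrete, here is a minimal sketch of two of the preference-optimization losses discussed above: the standard DPO loss (which the study's 20 variants modify) and SimPO's reference-free, length-normalized variant. The formulas follow the published definitions of DPO and SimPO; the beta, gamma, sequence lengths, and log-probability values below are illustrative, not taken from the OXRL study.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO: negative log-sigmoid of the policy's log-prob margin
    over a frozen reference model, scaled by beta."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def simpo_loss(logp_w: float, logp_l: float,
               len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO: reference-free; the implicit reward is the
    length-normalized log-prob, with a target margin gamma."""
    margin = (beta / len_w) * logp_w - (beta / len_l) * logp_l - gamma
    return -math.log(sigmoid(margin))

# Toy sequence log-probs for a chosen (y_w) and rejected (y_l) response.
# These values are made up purely to exercise the two formulas.
print(dpo_loss(-12.0, -15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))
print(simpo_loss(-12.0, -15.0, len_w=20, len_l=25))
```

Both losses reduce to a logistic loss on a preference margin; they differ only in how that margin is built (reference model vs. length normalization), which is exactly the kind of small structural change whose payoff the study found to be dwarfed by model scale.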