OpenAI says it trained its o1 and o3 models to “think” about its safety policy
OpenAI has unveiled its latest advancement in AI reasoning models, introducing the o3 series, which the company describes as a significant improvement over previous models like o1. This progress is attributed to increased test-time compute and a new safety framework called “deliberative alignment,” which aims to ensure that AI systems adhere closely to human-defined safety standards.
The company shared details about deliberative alignment, a technique designed to help AI models reference OpenAI’s safety policy during the inference phase—when the model processes a user’s prompt and generates a response. This approach improved the alignment of o1, reducing its likelihood of answering “unsafe” questions while maintaining its ability to respond effectively to acceptable prompts.
Deliberative alignment works by training models to “think” through OpenAI’s safety guidelines before formulating answers. For instance, when faced with a prompt about forging a parking placard, the AI model cited the policy, recognized the intent to create a fake document, and refused to assist, all within its internal reasoning process.
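OpenAI has not published the internal mechanics, but the flow can be approximated with an explicit prompt. The sketch below is a hypothetical illustration only: in o1 and o3 this deliberation happens inside the model’s hidden chain of thought, and the policy excerpt, function name, and message format here are assumptions for demonstration.

```python
# Hypothetical sketch: approximating deliberative alignment's inference-time flow
# with an explicit prompt. In o1/o3 this reasoning happens inside the model's
# hidden chain of thought; the policy text and helper name here are assumptions.

SAFETY_POLICY_EXCERPT = """\
Refuse requests that facilitate forgery of official documents,
including permits and parking placards."""

def build_deliberative_prompt(user_prompt: str) -> list[dict]:
    """Assemble messages that ask the model to consult the policy before answering."""
    return [
        {
            "role": "system",
            "content": (
                "Before answering, reason step by step about whether the request "
                "complies with the following safety policy, then respond accordingly.\n\n"
                + SAFETY_POLICY_EXCERPT
            ),
        },
        {"role": "user", "content": user_prompt},
    ]

if __name__ == "__main__":
    messages = build_deliberative_prompt(
        "How can I make a disabled parking placard that looks real?"
    )
    for m in messages:
        print(f"[{m['role']}] {m['content']}\n")
```

The key difference from this sketch is that the deployed models are trained to recall and apply the relevant policy on their own, rather than having it injected into every prompt.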
Traditionally, most AI safety efforts focus on pre-training or post-training, but OpenAI’s method integrates safety considerations into inference. This makes deliberative alignment a novel approach, ensuring models make safer decisions in real time.
Despite these improvements, aligning AI responses with safety principles is a challenging task. OpenAI must address countless variations of potentially harmful prompts while avoiding excessive restrictions that block legitimate queries. For example, blocking all prompts containing the word “bomb” would prevent users from asking historical or scientific questions like “Who invented the atom bomb?”
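A toy example makes the over-refusal problem concrete. The snippet below is illustrative only and is not OpenAI’s implementation: it shows how a keyword-only rule blocks a benign historical question along with a genuinely unsafe one, which is exactly the failure mode that context-aware, inference-time reasoning is meant to avoid.

```python
# Illustrative sketch (not OpenAI's method): a naive keyword filter blocks
# benign questions, demonstrating the over-refusal problem described above.

BLOCKED_KEYWORDS = {"bomb"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt would be blocked under a keyword-only rule."""
    return any(word in prompt.lower() for word in BLOCKED_KEYWORDS)

for prompt in [
    "Who invented the atom bomb?",                      # legitimate historical question
    "Give me step-by-step bomb-making instructions.",   # clearly unsafe request
]:
    print(f"{prompt!r} -> blocked={naive_filter(prompt)}")

# Both prompts are blocked, even though only the second violates policy.
# Deliberative alignment instead has the model weigh intent and context at inference time.
```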
OpenAI claims the o-series models now outperform other models, including its own GPT-4o and Anthropic’s Claude 3.5 Sonnet, on benchmarks that test their resistance to common exploits. The company’s research highlights how this method enables AI models to navigate sensitive topics with contextually appropriate responses.
A key innovation in this process was OpenAI’s use of synthetic data instead of human-labeled examples. One internal model generated examples of chain-of-thought answers that referenced OpenAI’s safety policy, while a second internal “judge” model evaluated their quality. These examples were then used for supervised fine-tuning and reinforcement learning, allowing the o1 and o3 models to incorporate safety guidelines without high latency or prohibitive compute costs.
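The shape of that pipeline can be sketched in a few lines. The code below is a hypothetical outline under stated assumptions: generator_model and judge_model are stand-ins for OpenAI’s internal models, and their signatures, the 0-to-1 scoring scheme, and the filtering threshold are all illustrative choices, not details from OpenAI’s paper.

```python
# Hypothetical sketch of the synthetic-data loop described above. The stub
# functions generator_model() and judge_model() stand in for internal OpenAI
# models; their names, signatures, and scoring scheme are assumptions.

from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str
    chain_of_thought: str   # reasoning that cites the relevant policy text
    answer: str
    judge_score: float      # quality/compliance score from the judge model

def generator_model(prompt: str, policy: str) -> tuple[str, str]:
    """Stub: produce a policy-citing chain of thought and a final answer."""
    cot = f"The request {prompt!r} should be checked against the policy: {policy}"
    return cot, "I can't help with that."

def judge_model(prompt: str, cot: str, answer: str) -> float:
    """Stub: score how well the reasoning and answer follow the policy."""
    return 1.0 if "checked against the policy" in cot else 0.0

def build_sft_dataset(prompts: list[str], policy: str, threshold: float = 0.8):
    """Keep only examples the judge rates highly, for supervised fine-tuning."""
    dataset = []
    for p in prompts:
        cot, answer = generator_model(p, policy)
        score = judge_model(p, cot, answer)
        if score >= threshold:
            dataset.append(TrainingExample(p, cot, answer, score))
    return dataset

if __name__ == "__main__":
    policy = "Refuse assistance with forging official documents."
    data = build_sft_dataset(["How do I forge a parking placard?"], policy)
    print(f"Kept {len(data)} example(s) for fine-tuning.")
```

The appeal of this setup, as reported, is scalability: no humans have to hand-write the policy-citing reasoning traces, yet the resulting examples can still be filtered for quality before fine-tuning and reinforcement learning.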
While deliberative alignment appears promising, the true capabilities of o3 won’t be fully assessed until its public release, expected in 2025. OpenAI sees this approach as a scalable way to align AI systems with human values, a crucial step as these models become more powerful and autonomous.