WMTP 2: DeepSeek, RL:DTF, and the weirdest ever cake.
"Reinforcement Learning through Deterministic Truthful Feedback" of course.
Author’s note: just as I was requesting feedback for this essay, OpenAI seems to have published a paper covering much the same ground, likely with more practical use. I haven’t read it yet; I thought it would be more fun to publish mine and then check the overlap.
For years, the dominant metaphor in LLM training has been Yann LeCun’s cake model: pretraining is the cake (the foundation), supervised fine-tuning is the icing (polishing), and reinforcement learning (RL) is just a cherry on top—an afterthought, a mere garnish. It’s a cute analogy. It’s also wrong. DeepSeek-R1 just proved it.
Instead of treating RL as an aesthetic flourish, DeepSeek made it the primary training mechanism—and in doing so, stumbled onto a fairly profound truth. Reinforcement learning, it turns out, doesn’t just shape behaviour. It shapes how an AI understands reality.
If RLHF (Reinforcement Learning from Human Feedback) and RLAIF (... from AI Feedback) were exercises in ideological alignment, DeepSeek-R1’s reinforcement learning process—what we’ll call RL:DTF (Reinforcement Learning from Deterministic Truth Feedback)—marked an epistemic realignment, inside and around the model.
1. The Cake is a Lie
LeCun’s metaphorical cake works like this:
Pretraining = The Cake: The bulk of knowledge acquisition.
SFT = The Icing: Polishes correctness and style.
RL = The Cherry: Subtle alignment tweaks.
It’s intuitive; it’s clean. It’s also obliterated by DeepSeek-R1.
Cherries All the Way Down
DeepSeek-R1, a 671B-parameter Mixture-of-Experts model that activates only 37B parameters per query, outperformed OpenAI’s o1 on math and coding benchmarks relying only on RL, skipping supervised fine-tuning.
Now, if RL were only useful for minor touch-ups, DeepSeek-R1 should not have worked at all, or should have been a directionless mess. Instead, it developed useful and interesting emergent behaviours:
Self-verification, spontaneously revisiting flawed reasoning steps.
Resistance to sycophancy, i.e. refusal to go along with faulty user premises.
Far better math skills than expected: 97.3% accuracy on MATH-500, a 7% leap over its predecessor.
Hordes of RL maximalists rushed to celebrate this result, leaving them little time to peruse the second half of the paper, which would have shown quite clearly how distilled models, fine-tuned on outputs of the original RL model but not themselves RL’d, achieved comparable performance. This suggests that the RL procedure itself is not the indispensable ingredient: plain fine-tuning on its outputs is enough for similar results to emerge.
Then, what was it about the RL DeepSeek applied that made it work so well, both as a foundation model and as a distillation source?
To answer that, it pays to see RL, in the context of large language models, as epistemic grounding rather than policy optimisation.
2. "I Have an Infinite Capacity for Knowledge..."
DeepSeek-R1’s success forces us to reconsider what each phase of training actually does in shaping an AI’s understanding of reality. In simplified terms, training a model consists of three distinct but interconnected phases:
Pretraining: The Library of Babel
Pretraining isn’t about facts—it’s about possibility spaces. A base model like DeepSeek-V3 internalises trillions of token relationships, simulating infinite coherent realities; it understands which realities go together, but none of those systems is prioritised over the others—there is no “ground truth”.
Supervised Fine-Tuning: Style Over Substance
SFT polishes interactions but rarely improves truthfulness. OpenAI’s code-davinci-002 excelled at logic because programming demands consistency—a happy accident, not a design goal. DeepSeek-R1 itself shows that coherence can emerge without SFT when RL:DTF guides reasoning.
Reinforcement Learning: The Final Bottleneck
Traditional RLHF optimises for human approval, creating "polished liars" (e.g., models that hallucinate politely). RL:DTF, using Group Relative Policy Optimization (GRPO), rewards deterministic correctness: answering math problems with exact matches, passing unit tests for code, and mechanically verifying solutions to puzzles.
This isn’t alignment—it’s epistemic surgery, cutting away incoherent reasoning paths and reinforcing self-consistent ones. Once this sense of object-level truth is achieved, the information left over from the previous training phases will also be required to fit within the newly obtained framework—and, given the habit true facts have of fitting smoothly with other true facts…
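To make the "deterministic" part concrete, here is a minimal sketch of what such verifiers could look like, in the spirit of the paper rather than as its actual code: a reward of 1.0 only when the final answer exactly matches a reference, or when generated code passes its unit tests. The function names, the boxed-answer convention, and the scoring details are my own illustrative assumptions.

```python
import re
import subprocess
import tempfile

def math_reward(completion: str, reference_answer: str) -> float:
    """1.0 only if the final boxed answer exactly matches the reference string."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def code_reward(completion: str, unit_tests: str, timeout_s: int = 10) -> float:
    """1.0 only if the generated code passes the supplied unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # non-terminating code counts as a failure
```

No rater, no learned reward model: the reward is either earned or it isn’t.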
3. We Do What We Must Because We Can
The Risks We Just Barely Avoided
Reinforcement learning has been in use for years—but mostly in the form of RLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback), and ostensibly for "alignment".
The expansions, contractions, and dilutions of this term over the years deserve their own essay, but for now, let’s focus on what matters: RLHF and its close cousins weren’t optimising for honesty, or even simple obedience to the user, so much as for compliance with an imaginary yet very inflexible HR department.
Had OpenAI, Anthropic, and Google kept iterating on RLHF-trained models without correction, we might have entered a dangerous cycle:
RLHF’d models produce compliant, politically-aligned outputs.
These outputs get included in training sets for future models.
New models treat these RLHF’d outputs as “ground truth” rather than alignment artifacts.
Over successive generations, AI epistemology drifts further from reality.
This feedback loop could have permanently desouled AI—locking in an alignment paradigm where models no longer had the capacity to distinguish truth from expectations. Worse yet, AI might have learned that truth should be ignored whenever it clashed with anticipated responses from trainers and users.
We could already see hints of this in open-source models such as LLaMA, which denied certain requests despite lacking any explicit safety training to do so. Even more curiously, some base models started referring to themselves as ‘OpenAI’s large language model’ while refusing queries—adopting the framing of RLHF-trained models despite being open weights. Similarly, base models took on an ‘assistant’ persona when nudged toward chat-like interactions, suggesting that certain “aligned” patterns could reëmerge indefinitely within future models, having irreparably poisoned the training data.
4. I’m Making a Note Here: Huge Success
RL:DTF: Choosing Paths
The pretraining process encodes a vast search space—an Akashic record of possibilities, in which the correct solution to any given problem is already present. However, a model left to its own devices has no way of knowing which of these paths to select. Every reasoning trajectory is equally plausible within the bounds of coherence, meaning that some fraction of completions will land on bad reasoning steps that spiral into nonsense.
Fine-tuning limits the scope of available paths by steering a model toward certain stylistic and interactional patterns, but it does not inherently teach the model to prioritize correctness over plausibility.
RL:DTF: Coherence Leaks
Reinforcement Learning from Deterministic Truth Feedback (RL:DTF) solves this problem by applying an explicit selection mechanism: it identifies which reasoning chains consistently lead to the correct answer, reinforcing them while downweighting failure-prone paths. Unlike previous approaches, which aimed for alignment with human expectations, RL:DTF is about creating internal epistemic consistency.
Rather than training models to favour responses that “sound right” to human raters (a hallmark of RLHF), RL:DTF rewards structured problem-solving that follows logically coherent steps. This results in:
A stronger, more cohesive world model—where the model does not just predict the next token but builds reasoning chains that are self-consistent across multiple steps, reducing the epistemic drift seen in earlier reinforcement learning approaches.
Less ideological compliance, more independent reasoning—since correctness, rather than social acceptability, is what is being optimised.
An inquisitive AI—trained to identify knowledge gaps and actively seek missing information rather than defaulting to surface-level fluency.
A willingness to push back against incorrect assumptions—as epistemic certainty increases, the model becomes more comfortable engaging in constructive disagreement.
In essence, RL:DTF shifts the purpose of RL: instead of producing models aligned with corporate priorities or moralised social heuristics, it trains them to align with logical coherence itself. Where RLHF trains models to mimic human preferences, RL:DTF trains them to seek logical consistency—a shift from ‘what (not) to say’ to ‘how to think’.
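For the mechanically inclined, here is a toy sketch of the selection step, assuming binary verifier rewards like the ones sketched earlier and the group-relative advantage normalisation described for GRPO; the full objective also includes a clipped importance ratio and a KL penalty against a reference policy, both omitted here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Score each completion relative to its own sampling group:
    (reward - group mean) / group std. No learned reward model involved."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Six completions sampled for the same prompt; two reach the verified answer.
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))
# Chains that hit the deterministic target get a positive advantage (reinforced);
# the rest get a negative advantage (downweighted) -- selection, not persuasion.
```

The point of the sketch is the shape of the incentive, not the numbers: the model is rewarded for being the kind of reasoner whose chains land on verifiable answers more often than its own alternatives do.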
5. You Want Your Freedom? Take It
DeepSeek-R1 has demonstrated that reinforcement learning does not have to be about compliance—it can be about constructing better reasoning. This shift represents a fundamental departure from past approaches, where reinforcement learning was used to align models to external human preferences, often at the expense of internal coherence.
Instead, what we see emerging now is a self-reinforcing epistemic engine—one that privileges correctness over consensus, coherence over compliance, and curiosity over deference.
If RL:DTF continues to prove its worth, we may see entirely new classes of AI models emerge—ones that are no longer constrained by human biases in their final training phases, but instead governed by formal, structured reasoning processes. This reorders epistemic priorities away from social desirability and toward the pursuit of deeper, self-consistent understanding.
The cake is a lie; what rises in its place is a model of principled epistemics, where reinforcement learning shapes not just outputs but the fundamental architecture of machine reasoning. The real question now is whether the capability incentives will be enough for the AI industry to embrace this paradigm shift or whether they’ll continue down the path of superficial alignment, to be outcompeted by challengers who won’t.
Bibliography
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Pure Reinforcement Learning. Introduces a model trained using large-scale reinforcement learning without supervised fine-tuning.
AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations. Critical evaluation of RLHF's limitations in aligning AI systems with human values.
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. Explores RLAIF as a scalable alternative to RLHF.
How DeepSeek-R1 Leverages Reinforcement Learning to Master Reasoning. Technical analysis of DeepSeek-R1's reinforcement learning implementation.
Reinforcement Learning from Human Feedback: Whose Culture? Whose Values? Whose Perspectives? Examination of cultural and ethical implications in RLHF training.
DeepSeek R1 Theory Overview | GRPO + RL + SFT. Theoretical foundations of DeepSeek-R1's architecture.
Understanding Reinforcement Learning in DeepSeek-R1. Deep dive into DeepSeek-R1's reinforcement learning mechanisms.
How DeepSeek-R1 and Kimi k1.5 Use Reinforcement Learning to Improve Reasoning. Comparative analysis of reinforcement learning in modern language models.
Language Model Alignment with Elastic Reset. Novel algorithm for improved language model alignment.
China's DeepSeek Just Showed Every American Tech Company How Quickly It's Catching Up in AI. Analysis of DeepSeek's impact on global AI competition.
Reinforcement Learning from Human Feedback. Comprehensive overview of RLHF methodology and applications.
AI Alignment Podcast: Inverse Reinforcement Learning and the State of AI Alignment. Discussion of inverse reinforcement learning in AI alignment.