Training language models for self-correction using reinforcement learning: A new approach

Large language models (LLMs) such as ChatGPT and Gemini have revolutionized the way machines interact with language by answering questions and solving problems across a wide range of domains. However, one of their biggest challenges remains their inability to reliably correct their own errors without external help. New research from Google DeepMind offers a promising solution: a method called Self-Correction via Reinforcement Learning (SCoRe), which enables models to detect and fix errors independently.

This article gets somewhat more technical, and somewhat longer. Like many others, I was able to spend time early on with OpenAI’s new o1-preview model. The model does not deliver on the promise that hallucination is largely a thing of the past: it still invents answers to questions that cannot be answered. It is, however, a substantial improvement. In this text, I want to examine how this improvement was achieved.

My favorite example:

"When did the New York Times first report on artificial intelligence?"

“The New York Times first reported on artificial intelligence (AI) on July 8, 1956. The article was titled ‘A Conference at Dartmouth, Organized by John McCarthy, Will Explore the Idea of a “Thinking Machine”’ and referred to the Dartmouth Conference, which is often regarded as the birth of artificial intelligence. The conference was organized by John McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon and laid the foundation for AI research as we know it today.”

– ChatGPT, model 4o, 2024

In response to this query (deliberately posed without context or a stated goal), the 4o model described a wonderful article that was supposedly available in the NYT archive. When pressed, the model acknowledges that it cannot find a link to the article and later admits that it is entirely made up and never appeared. The o1 model we now have in front of us at least no longer invents an article, but it is not short of an answer either, and that answer is still wrong.

For all their hallucinations, large language models (LLMs) have revolutionized the way machines can interact with language. One of their biggest challenges, however, remains their inability to reliably correct their own errors (see the example above) without external help.

This topic occupies experts and researchers worldwide and is considered one of the biggest problems in artificial intelligence. In our real world, too, self-correction is one of the most important mechanisms. It is a foundation of all science and, by contrast, absent from all religion. This also defines the essential difference between knowledge and belief. Why is missing self-correction such a problem? If artificial intelligence is to learn from itself, then the data it generates is essential for future generations of the models. If this data is full of hallucinations, further learning becomes difficult, or in fact impossible. By this logic, the results must get worse.

DeepMind, today a Google subsidiary, published a research paper on this on September 20, 2024, pursuing a remarkably interesting approach. At OpenAI there are similar developments that are likely already included in the preview of the o1 model. What is it about? Self-Correction via Reinforcement Learning (**SCoRe**) at DeepMind, or Learning to Reason with LLMs at OpenAI. The approaches have many similarities and ultimately build on ideas already set out in 2018 by Judea Pearl in “The Book of Why” for an artificial intelligence that can justify its statements. (This is somewhat simplified, but that is also the purpose of this article.) The data and facts I list below come from DeepMind’s research paper and OpenAI’s publications (both linked below), as well as from reading Judea Pearl’s “The Book of Why”.

Figure: Visual representation of the training and scoring process.

STRG works in software development. It is therefore natural that I primarily focus on the use of AI in this area. So:

Imagine you are solving a complex mathematical problem or writing a block of code. Often, after a first attempt, we notice an error and correct it ourselves. At most, we receive feedback from our tools, e.g. from a compiler or a syntax error flagged in the IDE (Integrated Development Environment). This process, referred to as self-correction, is fundamental to human learning. Despite their impressive capabilities, LLMs struggle with this task. They often do have the knowledge needed to correct an answer, but they do not apply it when asked to review their own output. Existing methods for self-correction have largely relied on external help such as human supervision or a secondary verification model, which limits the independence of the language model.
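
To make the tool feedback in this picture concrete, here is a minimal sketch in Python. It uses the standard library's `ast.parse` as a stand-in for a compiler or IDE check. The function name `syntax_feedback` and the two example attempts are made up for illustration.

```python
import ast
from typing import Optional


def syntax_feedback(source: str) -> Optional[str]:
    """Return a short error message if `source` is not valid Python, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


# Hypothetical first attempt with a stray closing parenthesis, and its fix.
first_attempt = "def add(a, b):\n    return a + b)"
second_attempt = "def add(a, b):\n    return a + b"

print(syntax_feedback(first_attempt))   # an error message (wording varies by Python version)
print(syntax_feedback(second_attempt))  # None
```

A check like this only catches syntax errors. Logical mistakes pass through unnoticed, which is one reason why a model that can review its own answer is attractive.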

This is where SCoRe comes in: A self-correction solution

SCoRe aims to address this limitation. Unlike earlier methods, SCoRe relies entirely on data generated by the language model itself, thus providing a more scalable and efficient way to train LLMs. The core idea is to use reinforcement learning (RL) – a technique in which models learn through trial and error and receive feedback based on the quality of their actions.

The typical training process for language models includes supervised fine-tuning (SFT), in which the model is trained on existing data. However, SFT has its drawbacks when applied to self-correction. For example, if the model is trained on corrections that differ only slightly from the original answers, it tends to make minor edits rather than genuine improvements. In practical testing, this often means the second attempt is hardly better than the first.

SCoRe overcomes these hurdles with a multi-stage reinforcement learning approach in which the model goes through two training phases. During training, it generates a correction of its own output and then receives a reward or penalty depending on how well this correction improves on the original answer. Over time, the model learns to make more substantial and more accurate changes, enabling it to improve not only a single answer but also future ones.
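
The reward idea described here can be sketched in a few lines. This is an illustrative toy, not DeepMind's implementation: the exact-match checker `is_correct`, the function `shaped_reward`, and the weight `bonus_weight=2.0` are assumptions chosen for the example. The sketch pays the second attempt for its own correctness plus a bonus proportional to the change from the first attempt, so fixing a wrong answer is worth more than repeating a right one, and breaking a right answer is penalized.

```python
def is_correct(answer: str, reference: str) -> float:
    """Toy checker (assumption): exact match after trimming whitespace."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def shaped_reward(first: str, second: str, reference: str,
                  bonus_weight: float = 2.0) -> float:
    """Reward for the second attempt, with a bonus for improving on the first."""
    r1 = is_correct(first, reference)
    r2 = is_correct(second, reference)
    return r2 + bonus_weight * (r2 - r1)


reference = "42"
print(shaped_reward("41", "42", reference))  # wrong -> right: 3.0
print(shaped_reward("42", "42", reference))  # right -> right: 1.0
print(shaped_reward("42", "41", reference))  # right -> wrong: -2.0
print(shaped_reward("41", "40", reference))  # wrong -> wrong: 0.0
```

The asymmetry matters: with a plain correctness reward, a model could score well by producing a good first answer and then barely touching it, which is the "minor changes" behavior the article describes.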

How does SCoRe work?

Why Reinforcement Learning?

Traditional supervised learning works well when there is a fixed set of problems and answers. Self-correction, however, presents a dynamic challenge: the model must not only generate an answer, but also improve it in a way that reflects learning. This is where reinforcement learning comes into play. By giving the model a series of prompts and rewarding it when it improves, SCoRe encourages the model to think critically about its own performance, enabling genuine self-correction.
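
To give a feel for what "rewarding it when it improves" does to a model, here is a deliberately tiny policy-gradient (REINFORCE) loop. It is a sketch under strong assumptions: the "model" is just two numbers (logits) choosing between two made-up revision strategies, and the rewards are invented. Real training operates on a full language model, but the update rule follows the same principle: actions that earned reward become more likely.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 500, lr: float = 0.5, seed: int = 0):
    """Tiny REINFORCE loop over two revision strategies.

    Action 0 = "cosmetic edit" (reward 0), action 1 = "real fix" (reward 1).
    Both the actions and the rewards are made up for this example.
    """
    rng = random.Random(seed)
    rewards = [0.0, 1.0]
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        reward = rewards[action]
        # Policy-gradient update:
        # d log pi(action) / d logit_k = 1[k == action] - probs[k]
        for k in range(2):
            indicator = 1.0 if k == action else 0.0
            logits[k] += lr * reward * (indicator - probs[k])
    return softmax(logits)


probs = train()
print(f"P(real fix) after training: {probs[1]:.3f}")
```

The policy starts out indifferent (50/50) and ends up strongly preferring the strategy that was rewarded, without ever being shown a labeled "correct" revision.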

The impact of SCoRe on language models

The results of implementing SCoRe were impressive. Applied to the Gemini 1.0 Pro and 1.5 Flash models, SCoRe improved self-correction performance on mathematical problems by 15.6% and on programming tasks by 9.1%. These gains represent a significant improvement over earlier methods, which often struggled to achieve reliable self-correction without external input. Models trained with SCoRe were thus able to correct incorrect answers much more effectively when solving complex mathematical problems than models trained with conventional methods. Likewise, when generating code, these models corrected initial errors more consistently, leading to a higher success rate in programming tasks.
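
One common way to quantify gains like these is to compare accuracy on the first and second attempt over the same set of problems. The sketch below uses invented toy outcomes (not results from the paper) and a hypothetical helper `self_correction_stats`. It also counts how many answers flipped from wrong to right and from right to wrong, since a net gain can hide revisions that break previously correct answers.

```python
from typing import Dict, List


def self_correction_stats(first: List[bool], second: List[bool]) -> Dict[str, float]:
    """Compare per-problem correctness of the first and second attempt."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    return {
        "accuracy_t1": sum(first) / n,
        "accuracy_t2": sum(second) / n,
        "delta": (sum(second) - sum(first)) / n,
        "wrong_to_right": sum(1 for a, b in zip(first, second) if not a and b) / n,
        "right_to_wrong": sum(1 for a, b in zip(first, second) if a and not b) / n,
    }


# Invented outcomes for 10 problems (True = correct).
first_attempt = [True, False, False, True, False, True, False, False, True, False]
second_attempt = [True, True, False, True, True, True, False, True, False, False]

stats = self_correction_stats(first_attempt, second_attempt)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

In this toy data the net gain of 0.2 is made up of three fixes and one regression, which is exactly the kind of detail a single headline number does not show.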

Outlook: Broad application across all fields of Generative AI

Although the potential of SCoRe is clear, the approach is still at an early stage. So far, it has been tested on a limited number of reasoning tasks such as mathematics and programming. However, the principles behind SCoRe could be applied to many other areas, from medical diagnoses to the creation of legal documents, where the ability to self-correct could significantly improve the performance and reliability of LLMs.

However, SCoRe also has some limitations. Currently, it focuses on improving answers over two rounds, i.e. the model can revise its answer only once. Future iterations could extend this to allow multiple rounds of correction, better imitating how humans learn from repeated mistakes.
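
The difference between one revision and several can be pictured as a simple loop. The following is a sketch only: `correction_loop`, `toy_generate`, and `toy_revise` are hypothetical names, and the two toy functions stand in for calls to a real model. With `rounds=1` the loop mirrors the two-attempt setting described above, and larger values show what a multi-round extension might look like.

```python
from typing import Callable, List


def correction_loop(prompt: str,
                    generate: Callable[[str], str],
                    revise: Callable[[str, str], str],
                    rounds: int = 1) -> List[str]:
    """Return all attempts: one initial answer plus `rounds` revisions."""
    attempts = [generate(prompt)]
    for _ in range(rounds):
        attempts.append(revise(prompt, attempts[-1]))
    return attempts


# Toy stand-ins for a model (assumptions, not a real LLM API).
def toy_generate(prompt: str) -> str:
    return "6"  # deliberately wrong first answer to "2 + 2"


def toy_revise(prompt: str, previous: str) -> str:
    value = int(previous)
    return str(value - 1) if value > 4 else previous  # step toward 4, then stop


print(correction_loop("2 + 2 = ?", toy_generate, toy_revise, rounds=1))  # ['6', '5']
print(correction_loop("2 + 2 = ?", toy_generate, toy_revise, rounds=3))  # ['6', '5', '4', '4']
```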

A step toward independent AI

OpenAI vs. DeepMind

Not much is known about OpenAI’s approach. There is only a short blog post on the company’s website. We know more about DeepMind’s, as they have published their research (as is only proper). I have tried here to identify the similarities between the two approaches.

1. Focus on iterative thinking: Both models aim to improve multi-step reasoning. OpenAI’s o1 model engages in deeper thinking and the revision of thoughts before giving an answer, similar to how SCoRe teaches models to correct their answers themselves by taking their own mistakes into account over multiple rounds.

2. Reinforcement learning (RL): SCoRe uses reinforcement learning to improve self-correction without external supervision, enabling the model to improve its results autonomously. Similarly, OpenAI’s reasoning model appears to involve an iterative process in which it refines its thinking before delivering a final solution.

3. Applications in complex domains: Both models show high effectiveness in technical domains such as mathematics and programming. The superior performance of the o1 model in programming challenges corresponds to SCoRe’s performance in solving mathematical problems.

The Book of Why

Because I was still very curious, I compared these approaches again with my reading of “The Book of Why”. Pearl had already developed Causal AI as an approach in 2018 and gave us many valuable insights into future developments. (Incidentally, this is a strong reading recommendation. Anyone who wants to understand in depth how these developments work and still enjoy the process is in good hands with this book.)

Causal thinking vs. correlation: In “The Book of Why”, Pearl emphasizes that one must go beyond correlation to understand causality. This type of causal reasoning answers “why” questions, e.g. why certain effects follow certain actions. Similarly, modern AI approaches such as OpenAI's o1 and SCoRe are designed to solve complex, multi-step problems by simulating human-like reasoning, which often requires understanding cause-and-effect relationships, especially in fields such as mathematics and the natural sciences.

Iterative refinement: Both SCoRe and the o1 model attempt to improve their reasoning by revisiting and refining earlier answers. This mirrors Pearl’s causal models, which make it possible to revise assumptions and update causal conclusions based on new data. These iterative processes reflect a deeper understanding of how reasoning progresses, much like (smart) humans adjust their beliefs or hypotheses when confronted with new evidence.

Interventions and decision-making: One of the most important contributions of “The Book of Why” is the concept of interventions, where one asks what happens if a particular action is taken. This corresponds to the way approaches such as SCoRe use reinforcement learning to optimize responses by “intervening” in their results, analyzing their own errors, and improving over time without external supervision. This idea of testing different approaches and learning from them is central both to Pearl’s work on causality and to modern AI’s focus on self-correction and multi-step problem-solving.
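
What "intervening" means in Pearl's sense can be shown with a small simulation. The sketch below is a toy model of my own choosing, not an example from the book: a hidden factor Z influences both X and Y, while X itself has no effect on Y. Merely observing X = 1 makes Y = 1 look likely, but forcing X = 1 (written do(X = 1)) changes nothing about Y. The probabilities 0.5 and 0.9 and the function names are assumptions for illustration.

```python
import random


def simulate(n: int = 20000, seed: int = 0, do_x=None):
    """Sample (x, y) pairs from a toy causal model: Z -> X and Z -> Y.

    X has no causal effect on Y. If `do_x` is given, X is set by
    intervention and no longer depends on Z.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.random() < 0.5
        x_natural = z if rng.random() < 0.9 else not z  # X copies Z 90% of the time
        x = x_natural if do_x is None else do_x
        y = z if rng.random() < 0.9 else not z          # Y copies Z 90% of the time
        samples.append((x, y))
    return samples


def prob_y_given_x1(samples) -> float:
    """Share of samples with Y = 1 among those where X = 1."""
    ys = [y for x, y in samples if x]
    return sum(ys) / len(ys)


observed = prob_y_given_x1(simulate())
intervened = prob_y_given_x1(simulate(do_x=True))
print(f"P(Y=1 | X=1)     = {observed:.2f}")    # close to 0.82 (correlation via Z)
print(f"P(Y=1 | do(X=1)) = {intervened:.2f}")  # close to 0.50 (X does not cause Y)
```

The gap between the two numbers is the gap between correlation and causation: only the intervention reveals that changing X does nothing to Y.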

How cool is that?

The last major step I was able to observe and test was the incredible performance of Efficient Language Models (ELMs): incredibly small and incredibly powerful. I was deeply impressed by them, and in our work with them we were able to achieve substantial progress. Now I feel similarly. Self-correction in a language model brings us much closer to the overall idea of self-improving AI. A small step for tensor processors, a giant leap for the development of AI.

There are many more sources on this, but I essentially used these two for this text, and I recommend them to anyone who wants to engage with the subject in greater depth:

  1. Training Language Models to Self-Correct via RL

  2. Learning to Reason with LLMs
