
New OpenAI models hallucinate more often than their predecessors

According to OpenAI’s own tests, the latest reasoning models, o3 and o4-mini, hallucinate significantly more often than the previous model, o1.

As TechCrunch first reported, OpenAI's system card describes the results of PersonQA, an evaluation designed to measure hallucinations. On this test, o3 hallucinated 33 percent of the time, and o4-mini even more often, at 48 percent, nearly half of its answers. By comparison, o1 had a hallucination rate of 16 percent, meaning o3 hallucinated roughly twice as often.
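
The system card does not spell out how the rate is computed, but a PersonQA-style score is typically the fraction of graded answers that contain a fabricated claim. The sketch below illustrates that idea only; the grades are invented stand-ins, not OpenAI's actual evaluation harness or data.

```python
# Minimal sketch of a PersonQA-style hallucination rate, assuming each
# model answer has already been graded against ground-truth facts.
# The grades below are illustrative, not OpenAI's pipeline or results.

def hallucination_rate(graded_answers: list[bool]) -> float:
    """Fraction of answers flagged as containing a fabricated claim."""
    if not graded_answers:
        return 0.0
    return sum(graded_answers) / len(graded_answers)

# Hypothetical grades: True means the answer contained a hallucination.
o1_grades = [True] * 16 + [False] * 84   # models a ~16 percent rate
o3_grades = [True] * 33 + [False] * 67   # models a ~33 percent rate

print(f"o1: {hallucination_rate(o1_grades):.0%}")  # o1: 16%
print(f"o3: {hallucination_rate(o3_grades):.0%}")  # o3: 33%
```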

New models show no improvement

Hallucinations remain one of the biggest and most difficult problems in AI, affecting even today's best-performing systems. Historically, each new model has hallucinated somewhat less than its predecessor. That pattern does not hold for o3 and o4-mini.

According to the system card, o3 simply makes more assertions overall, which produces both more correct statements and more incorrect or hallucinated ones. OpenAI says the underlying cause is still unknown and that more research is needed to understand it.
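
A purely hypothetical calculation (the numbers below are invented, not from the system card) shows how a model that asserts more can produce more correct answers and a worse hallucination rate at the same time:

```python
# Invented numbers illustrating how making more assertions can raise
# both the count of correct claims and the hallucination rate.
old_claims, old_wrong = 100, 16  # 84 correct, 16% hallucinated
new_claims, new_wrong = 200, 66  # 134 correct, 33% hallucinated

print(old_claims - old_wrong, old_wrong / old_claims)  # 84 0.16
print(new_claims - new_wrong, new_wrong / new_claims)  # 134 0.33
```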

More computing power

OpenAI presents its reasoning models as more accurate than non-reasoning models such as GPT-4o and GPT-4.5, because they use more computing power to think longer before responding. In the o1 announcement, this was described as a process in which the models refine their thinking, try different strategies, and learn to recognize their own mistakes, relying less on purely stochastic output than earlier models.

The GPT-4.5 system card, released in February, reports a hallucination rate of 19 percent for that model on the PersonQA evaluation. The same card lists GPT-4o at 30 percent.