Hallucination in Artificial Intelligence#

Robots also dream. Yes, and they dream while awake. Hallucination in language models (and in artificial intelligence in general) is inevitable. Not everything an AI generates can be trusted, and you should have sound criteria for judging the information it produces. In general, AI is a good assistant, but only that: its purpose is to assist you, and the final decision must always be yours. Here I will illustrate why this matters.

What Is Hallucination in Artificial Intelligence?#

Hallucination refers to the phenomenon where a generated response is coherent and grammatically correct, yet factually incorrect or nonsensical. In other words, it is the generation of false information.

Types of Language Model Hallucinations#

Language model (LLM) hallucinations are commonly grouped into the following types:

  • Factual Errors or Contradictions: This type of factual hallucination occurs when the response can be checked against real-world information and turns out to contain contradictions or falsehoods.

  • Fabrication of Facts: This is another type of factual hallucination, where the response cannot be verified: it cites nonexistent data or presents debatable claims as settled facts.

  • Instruction Inconsistency: This type of faithfulness hallucination occurs when the response deviates from or alters the instruction or question it was given.

  • Context Inconsistency: This type of faithfulness hallucination happens when the response ignores contextual information provided with the prompt.

  • Logical Inconsistency: This type of faithfulness hallucination involves presenting logical contradictions, such as mathematical reasoning errors.
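
To make these categories concrete, here is a small illustrative sketch with invented prompt/response pairs (none of them are real model outputs); each entry shows what a failure of that type could look like in practice.

```python
# Invented illustrations of each hallucination type; these are NOT real model outputs.
HALLUCINATION_EXAMPLES = {
    "factual_error": {
        "prompt": "In what year did Apollo 11 land on the Moon?",
        "bad_response": "Apollo 11 landed on the Moon in 1972.",  # verifiable and wrong (it was 1969)
    },
    "fact_fabrication": {
        "prompt": "Cite a study on coffee consumption in Colombia.",
        "bad_response": "See Perez et al. (2019), 'Coffee and Cognition'.",  # reference does not exist
    },
    "instruction_inconsistency": {
        "prompt": "Answer in Spanish: what is an LLM?",
        "bad_response": "An LLM is a large language model...",  # ignores the 'answer in Spanish' instruction
    },
    "context_inconsistency": {
        "prompt": "Given that my machine has no GPU, which model should I run?",
        "bad_response": "Run the 70B model with GPU acceleration.",  # contradicts the provided context
    },
    "logical_inconsistency": {
        "prompt": "If each box holds 12 eggs, how many eggs are in 3 boxes?",
        "bad_response": "Each box holds 12 eggs, so 3 boxes hold 24 eggs.",  # the arithmetic does not follow
    },
}
```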

Why Do Language Models Hallucinate?#

There are multiple reasons why language models (LLMs) exhibit hallucinations. These can include:

  • Outdated Training Data: Models are trained on a base dataset that is frozen at a point in time. Therefore, if a model had a training cutoff in 2022, you cannot expect answers based on later information. It is important to know the training data cutoff of the model you choose.

  • Data Bias: Data collection can introduce biases, and the most common one we encounter comes from the language used in the majority of sources. For example, models like Qwen and DeepSeek have a strong bias toward English and Chinese (more noticeable in DeepSeek and in the lighter versions of Qwen).

  • Misinformation Sources: In general, the sources are not cleaned up, and there is a high proportion of data scraped from the general web that may contain erroneous or false information (sometimes even deliberately, as an attack on language models).

  • Lack of Data or of a Source of Truth: Not everything we might want to consult is in the dataset, or the information simply does not exist. This can be due to copyright restrictions on existing content or to the limits of current knowledge itself.

  • Ambiguous Questions: The more ambiguous the question, the higher the probability of hallucination, because the answer is left to the "interpretation" of the model. You can see this yourself by comparing how the model answers similar questions asked with different levels of detail (see the sketch after this list). In general, results improve as the query becomes more specific.

  • Reasoning Failures: Most models have weak reasoning capabilities. LLMs understand (or rather mimic) the structure of language, but there are no causal chains behind the words (tokens) they produce within a context.

  • High Computational Complexity: The higher the computational complexity of a problem, the more prone the model is to hallucinate on it. For example, combinatorial problems can be highly sensitive.
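
On the point about ambiguous questions, here is a minimal sketch of how you could compare the two extremes yourself. It assumes you run models locally with Ollama and its `ollama` Python client; the model tag and the prompts are only placeholders:

```python
import ollama  # assumes a local Ollama server and that the model tag below has been pulled

MODEL = "qwen2.5:7b"  # any locally available tag works

ambiguous = "Tell me about the bank."
specific = (
    "Tell me about the Bank of the Republic of Colombia: "
    "what it is, when it was founded, and its main functions."
)

# The vague prompt leaves everything to the model's "interpretation";
# the specific prompt constrains it and is less likely to produce a hallucination.
for prompt in (ambiguous, specific):
    reply = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    print(f"--- {prompt}\n{reply['message']['content']}\n")
```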

Can Language Model Hallucinations Be Avoided?#

Given all of the above, an important question is whether the problem of hallucinations can be solved. The honest and unpalatable truth is that there will always be some level of hallucination in language models (and in reasoning models), no matter how much we try to mitigate it.

The training dataset of a model is always a subset or approximation of the real world, and the construction of the model itself is an iterative process that searches for patterns within that dataset until it converges. That convergence stops short of fully approximating the source of truth, and it can also overfit the data. In other words, the model is an approximation of its training data, which can lead to inaccuracies or biases.

Which Model to Use to Reduce Hallucinations?#

This question is not straightforward, and we cannot simply rely on the standard benchmark comparisons between models, precisely because those benchmarks are standardized: over time they leak into the training process of the models themselves. Therefore, the best you can do is test different models yourself. Build a set of questions related to your daily life, your work, and the areas of interest where you have enough knowledge to validate the accuracy and quality of the responses, and choose from there.
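
As a rough sketch of that workflow, assuming again a local Ollama server and its Python client, you could loop your own questions over a few candidate models and save everything for manual review. The model tags and questions below are placeholders to replace with your own:

```python
import json

import ollama  # assumes a local Ollama server with the listed models already pulled

# Replace these with models you can actually run and questions you can personally verify.
MODELS = ["qwen2.5:7b", "llama3.2:3b"]
QUESTIONS = [
    "What is the capital of Boyaca, Colombia?",
    "Explain the difference between a list and a tuple in Python.",
]

results = []
for model in MODELS:
    for question in QUESTIONS:
        reply = ollama.chat(model=model, messages=[{"role": "user", "content": question}])
        results.append(
            {"model": model, "question": question, "answer": reply["message"]["content"]}
        )

# Dump everything to a file so you can grade the answers by hand and pick a model.
with open("model_comparison.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```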

Given that one source of hallucination is data bias, this matters because standardized benchmarks are in English. A model that tops the benchmarks might not be the best choice for you if you are not fluent in English, or if your field has little public data; the best-ranked model is not necessarily the one with the most information relevant to your interests.

Which Models Do I Recommend?#

In general, "ultralight" models are not a good idea (those with fewer than 3 billion parameters are a constant source of hallucinations), since a smaller number of parameters means losing data, the connections between data points, and the semantic structure of the language.

A good starting point is qwen2.5:3b or llama3.2:3b if you do not have a GPU. These models are decent, but they show a significant bias when it comes to my regional context (in my case, Colombia) and to Spanish. However, they handle code-related tasks and English queries very well. I tested them on a machine without a GPU and with 16 GB of RAM. If you do have a GPU, my recommendation is qwen2.5:7b or qwen2.5:14b. I definitely do not recommend Phi (it has significant language biases) or any variant of DeepSeek (deepseek-r1 included); the latter are distilled versions of Qwen2, Qwen2.5, and Llama3.1, with reduced precision, constant loss of context, and mixed-language output.
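
If you want to encode that rule of thumb, here is a small sketch that picks one of the tags above depending on whether an NVIDIA GPU seems to be present; detecting the GPU through nvidia-smi is an assumption on my part, so adapt it to your hardware:

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Crude check: nvidia-smi exists on PATH and runs without error."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        subprocess.run(["nvidia-smi"], capture_output=True, check=True)
        return True
    except (subprocess.CalledProcessError, OSError):
        return False


# Tags follow the recommendations above: the heavier model only when a GPU is present.
MODEL = "qwen2.5:7b" if has_nvidia_gpu() else "qwen2.5:3b"
print(f"Using {MODEL}")
```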

Math-specialized models disappointed me, though the issue may be bias when asking in Spanish. For mathematical work in Spanish, I recommend the general versions, qwen2.5:7b and qwen2.5:14b.

For code-specialized versions, my recommendation stays within the qwen2.5 family, with its qwen2.5-coder variant. qwen2.5-coder:3b is useful if you do not have a GPU, while qwen2.5-coder:7b on a GPU offers a good balance without excessive resource consumption. It is worth using the code-specialized versions because they provide proper integration for code autocompletion in code editors and more precise answers to programming questions. A point of interest: qwen2.5-coder:7b is the base model for Zeta, Zed's edit-prediction model.
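
As an illustration of the fill-in-the-middle completion these coder variants are meant for, here is a hedged sketch using Ollama's raw generation mode with the FIM special tokens documented for Qwen2.5-Coder; treat the tokens, the raw=True flag, and the option names as assumptions to verify against your Ollama and model versions:

```python
import ollama  # assumes a local Ollama server with qwen2.5-coder:7b pulled

# Qwen2.5-Coder's documented fill-in-the-middle format (verify for your model version):
# <|fim_prefix|>{code before cursor}<|fim_suffix|>{code after cursor}<|fim_middle|>
prefix = "def fibonacci(n: int) -> int:\n    "
suffix = "\n\nprint(fibonacci(10))\n"
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# raw=True skips the chat template so the FIM tokens reach the model untouched;
# num_predict just caps the length of the completion.
reply = ollama.generate(
    model="qwen2.5-coder:7b",
    prompt=prompt,
    raw=True,
    options={"num_predict": 64},
)
print(reply["response"])  # the text the model proposes for the cursor position
```

This is roughly what an editor integration does for you automatically; in practice you would let the editor plugin handle the prefix/suffix split around the cursor.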

If you have a better machine, you can try larger models. In my case, with an RTX 2060 GPU and 16 GB of RAM, I hit the limit at 14b models. Keep in mind, though, that a larger model may not be necessary for your use case. For example, although I can run qwen2.5:14b, I see no additional benefit over qwen2.5:7b in my usage scenario.

References#