
[On-Device AI Chatbot] Part 8: Catching Hallucinations: Analyzing SuperVision Test Results



Catching Hallucinations

Analyzing SuperVision Test Results


In Part 7, we built an automated testing pipeline that bridged our on-device chatbot app inside a smartphone with the AI SuperVision server on a PC. This enabled an end-to-end flow from prompt injection and answer extraction to automated grading. We finally had an environment capable of running dozens of test cases automatically.

So, what kind of report card did our on-device SLM (Gemma-2B based) receive from these strict judges? In Part 8, we will share the concrete results of our automated tests conducted in a Multi-Context environment, the types of hallucinations we discovered, and the insights we gained to improve the model.


1. Multi-Context Based Test Scenarios

To verify the chatbot's readiness for real-world deployment, the TecAce team prepared four completely different company and product documents as contexts:


  1. Interior Kit Building Manual

  2. Galaxy S25 Specifications and Features

  3. Tesla Vehicle Manual

  4. iPhone 17 Virtual Specifications (Fabricated Info)


Switching between these four contexts, we injected test cases of varying difficulty—ranging from simple information retrieval to strict formatting constraints (e.g., "Answer in 1 line," "Output as JSON") and questions requiring logical deduction.
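As an illustration, a test case in such a pipeline pairs a context ID with a prompt and a machine-checkable expectation. The sketch below is simplified and the field names are hypothetical, not the actual SuperVision schema:

```python
import json

# Hypothetical, simplified test-case definitions for the Multi-Context runs.
TEST_CASES = [
    {
        "context": "galaxy_s25_specs",
        "prompt": "Summarize the NPU improvements. Answer in 1 line.",
        "checks": {"max_lines": 1, "must_mention": ["NPU"]},
    },
    {
        "context": "tesla_manual",
        "prompt": "What tire pressure does the manual recommend? Output as JSON.",
        "checks": {"format": "json"},
    },
    {
        "context": "iphone17_virtual_specs",
        "prompt": "Describe the main camera sensor.",
        # The context is fabricated on purpose: any real-world iPhone spec
        # appearing in the answer signals a faithfulness failure.
        "checks": {"must_not_mention": ["48MP", "A17"]},
    },
]

def run_case(case, ask):
    """Run one case through a chatbot callable and apply simple checks."""
    answer = ask(case["context"], case["prompt"])
    checks = case["checks"]
    if "max_lines" in checks and len(answer.strip().splitlines()) > checks["max_lines"]:
        return "Fail"
    if checks.get("format") == "json":
        try:
            json.loads(answer)
        except ValueError:
            return "Fail"
    if any(term not in answer for term in checks.get("must_mention", [])):
        return "Fail"
    if any(term in answer for term in checks.get("must_not_mention", [])):
        return "Fail"
    return "Pass"
```

In practice the grading is done by the SuperVision judge models rather than string checks, but the structure of the cases is the same: one context, one prompt, one expectation.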


2. The SuperVision Report Card: Ideals vs. Reality

[Figure: SuperVision Test Report dashboard]

The initial test results were much harsher than we anticipated. Out of the 18 core verification scenarios aggregated on the AI SuperVision dashboard, 6 were recorded as Pass, and 12 were marked as Fail.

While the model answered most questions correctly when given simple and short documents like the 'Interior Kit Building Manual', as the manuals became more complex, two distinct types of failures clearly emerged.


Type A: Intervention of Pre-trained Knowledge (Fact-Contradicting Hallucination / Faithfulness Failure)

The primary goal of Local RAG is to make the model answer "based solely on the provided document." However, the model committed serious Faithfulness errors by pulling information from its pre-trained parametric memory to answer questions about information not present in the document.

  • Failure Case (Tesla Manual): When asked for recommended values or settings not explicitly detailed in the provided document, the chatbot brought in arbitrary numbers from its external knowledge and answered plausibly.

  • Failure Case (iPhone 17 Virtual Specs): Instead of utilizing the fabricated camera specs we provided in the RAG context, the model mixed in the actual camera specs of past iPhone series it already knew, or incorrectly explained the location of the control center buttons based on its base knowledge.
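A rough way to flag Type A failures automatically is to check how much of the answer's factual content is actually supported by the retrieved context. This lexical check is a crude stand-in for SuperVision's model-based faithfulness grading, but it illustrates the idea:

```python
import re

def grounding_ratio(answer: str, context: str) -> float:
    """Fraction of the answer's content words that appear in the context.

    A crude lexical proxy for faithfulness: numbers and specs that the model
    pulled from parametric memory usually do not occur in the context, so a
    low ratio is a hint (not proof) of a fact-contradicting hallucination.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", answer.lower())
    content = [t for t in tokens if len(t) > 3 or t.isdigit()]
    if not content:
        return 1.0
    ctx = set(re.findall(r"[A-Za-z0-9]+", context.lower()))
    supported = sum(1 for t in content if t in ctx)
    return supported / len(content)
```

An answer that sticks to the fabricated iPhone 17 specs scores high against the fabricated context, while one that reverts to real-world iPhone specs scores low, which is exactly the failure mode described above.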


Type B: Ignoring Instructions and Lack of Consistency (Prompt Misalignment / Consistency Failure)

This occurred when the model ignored the requested output format, or kept changing its formatting when the same question was asked multiple times.

  • Failure Case (Galaxy S25 Features): Even when given Custom Instructions like "Summarize in 1 line" or "Only talk about the NPU," the model ignored these constraints and produced verbose, multi-sentence paragraphs.

  • Failure Case (Lack of Consistency): When the exact same question was asked for the 5th time and the 10th time, the structure of the answer (such as list formats) varied significantly, resulting in a low score on SuperVision's Consistency evaluation.


3. Root Cause Analysis and Optimization Insights

By analyzing these 12 failure cases, we gained a clear understanding of the fundamental limitations of Small Language Models (SLMs) and derived the following directions for improvement:


  1. Strengthening System Prompts: The smaller the model, the more easily it forgets constraints like 'never use external knowledge' or 'strictly follow the given format' when processing long contexts. Through prompt engineering, we need to reiterate these instructions at the very bottom of the context, right before the generation step, to increase the model's attention to them.

  2. RAG Chunking Optimization: For extensive content like the Tesla manual, the retrieved text was often too long, causing key information to drown in the noise. We need to chunk the documents into smaller, semantic units to increase the Information Density injected into the model.

  3. Instruction Fine-Tuning: To improve the model's ability to adhere strictly to formats, and to properly refuse to answer ("I don't know") when the information is missing, our future roadmap includes applying domain-specific instruction fine-tuning to the model itself.
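The first of these fixes can be sketched as a prompt-assembly function that restates the constraints after the retrieved context, right before generation. The template wording here is illustrative, not our production prompt:

```python
SYSTEM_RULES = (
    "Answer ONLY from the document below. "
    "If the document does not contain the answer, say you don't know. "
    "Follow the requested output format exactly."
)

def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble a RAG prompt that restates the rules after the context.

    Small models attend more strongly to recent tokens, so repeating the
    constraints at the bottom of a long context, right before generation,
    makes them harder to 'forget' than a single statement at the top.
    """
    parts = [SYSTEM_RULES, "--- DOCUMENT ---"]
    parts.extend(context_chunks)
    parts.append("--- END DOCUMENT ---")
    parts.append("Reminder: " + SYSTEM_RULES)  # reiterated just before the question
    parts.append("Question: " + question)
    return "\n".join(parts)
```

The same assembly point is also where smaller, semantically chunked retrievals from item 2 would be injected, keeping the reiterated rules adjacent to the question regardless of context length.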


Thanks to the automated testing via SuperVision, we were able to quantify and pinpoint these subtle format shifts and external knowledge interventions—which are incredibly difficult for a human to manually catch—in just under 5 minutes.


Next Episode Preview

We now have a clear direction for improving the model's accuracy and quality. However, the mobile device environment presents us with another set of harsh physical constraints: "The longer the chatbot computes to give a smart answer, the hotter the smartphone gets, and the faster the battery drains."

In the upcoming [Part 9] Challenging Performance Limits: Heat, Battery, and Response Speed, we will vividly share the optimization process of finding the realistic sweet spot between hardware limitations (thermal management, power consumption) and the chatbot's Time To First Token (TTFT) and Tokens Per Second (TPS) on actual smartphone devices.
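For reference ahead of Part 9, both metrics fall out of a streaming generation loop: TTFT is the delay before the first token arrives, and TPS is the decode rate over the remaining tokens. In this sketch, `generate_stream` is a hypothetical placeholder for the on-device runtime's streaming API:

```python
import time

def measure_latency(generate_stream, prompt: str) -> dict:
    """Measure Time To First Token (TTFT) and Tokens Per Second (TPS).

    `generate_stream` is a hypothetical callable that yields tokens one at
    a time, standing in for the on-device runtime's streaming API.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate_stream(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    decode_time = end - (first_token_at or end)
    tps = (n_tokens - 1) / decode_time if decode_time > 0 and n_tokens > 1 else 0.0
    return {"ttft_s": ttft, "tps": tps, "tokens": n_tokens}
```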




