
[On-Device AI Chatbot] Part 6: How to Verify AI Quality? (Introduction to SuperVision)




In Part 5, we explored how to inject our company's proprietary knowledge into the on-device chatbot using Local RAG (Retrieval-Augmented Generation) and Multi-Context Switching. However, equipping the chatbot with knowledge does not immediately solve all problems.

"How can we be sure that this chatbot isn't fabricating answers and is speaking truthfully, based only on the provided documents?"

In Part 6, we will introduce the chronic issue of generative AI known as 'Hallucination' and present 'AI SuperVision', an automated verification tool adopted by TecAce to objectively evaluate the chatbot's quality.


1. The Limits of Traditional Testing and the Threat of Hallucination

In traditional software development, quality is verified through unit tests that compare expected outputs with actual outputs (assert A == B). However, Large Language Models (LLMs) inherently generate text based on probabilities, meaning they respond with slightly different sentence structures each time even when asked the exact same question. Therefore, traditional rule-based testing methods cannot effectively verify AI quality.
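
To see concretely why `assert A == B` breaks down, consider two responses to the same question that are factually identical but worded differently. The sketch below is illustrative (the answer strings are hypothetical chatbot outputs); a naive token-overlap score already shows the answers agree even though a string comparison fails:

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, keeping internal hyphens and dots (e.g. 'wi-fi', '5.3')."""
    return set(re.findall(r"\w+(?:[.-]\w+)*", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Two hypothetical runs of the same question -- both factually correct.
run1 = "The device supports Wi-Fi 6 and Bluetooth 5.3."
run2 = "Bluetooth 5.3 and Wi-Fi 6 are both supported by the device."

print(run1 == run2)               # False: a traditional exact-match assert fails
print(jaccard(run1, run2) > 0.5)  # True: the answers still largely overlap
```

Token overlap is of course far too crude for real quality verification (it cannot tell a paraphrase from a contradiction), which is exactly why a more semantic evaluation approach is needed.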

Furthermore, LLMs often suffer from Hallucinations, generating plausible but factually incorrect information. Hallucinations can be broadly divided into two categories:


  • Fact-Contradicting Hallucination: When the model generates information that directly contradicts the provided context or established facts.

  • Prompt Misalignment: When the model ignores the user's intent or specific instructions in the prompt (e.g., "Summarize in 3 lines", "Output as JSON") and responds in an irrelevant format or context.
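
Some prompt-misalignment cases can be caught with plain rule checks before any AI-based judging is involved. A minimal sketch for the two instruction examples above (the helper names and sample responses are our own, hypothetical illustrations):

```python
import json

def honors_json_instruction(response: str) -> bool:
    """True if a response to an 'Output as JSON' prompt actually parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def honors_line_limit(response: str, max_lines: int = 3) -> bool:
    """True if a 'Summarize in 3 lines' response stays within the line limit."""
    return len(response.strip().splitlines()) <= max_lines

# Hypothetical responses to: 'List the supported colors. Output as JSON.'
print(honors_json_instruction('{"colors": ["black", "silver"]}'))         # True
print(honors_json_instruction("Sure! The colors are black and silver."))  # False
```

Checks like these only cover format compliance; judging whether the *content* of a response is misaligned with the user's intent still requires the evaluation approach described below.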


Catching these errors through human evaluation, where a person manually asks questions, reads the responses, and grades them, is incredibly time-consuming, expensive, and impossible to scale objectively.


2. The Relief Pitcher: Testworks 'AI SuperVision'

AI SuperVision Eval Studio

To overcome these limitations in LLM evaluation, TecAce adopted an automated LLM verification tool called 'AI SuperVision', developed by Testworks.

AI SuperVision utilizes the 'LLM-as-a-judge' methodology. Rather than humans, another powerful AI (the evaluator model) reads the responses generated by our on-device chatbot (the evaluatee model), scores them according to strictly predefined criteria, and analyzes the reasoning behind the scores. This system allows us to automatically, consistently, and rapidly evaluate hundreds or thousands of test cases without human intervention.
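
Reduced to its essentials, LLM-as-a-judge means prompting an evaluator model with the question, the retrieved context, and the candidate answer, then parsing a structured score out of its reply. The sketch below is our own illustration of the pattern, not AI SuperVision's actual implementation; the prompt wording is an assumption, and the stub function stands in for a real evaluator-model API call:

```python
import re

# Hypothetical judge prompt -- the real tool's criteria are far more detailed.
JUDGE_PROMPT = """You are a strict evaluator.
Context: {context}
Question: {question}
Answer: {answer}
Rate how faithful the answer is to the context, from 1 (fabricated) to 5 (fully grounded).
Reply in exactly this form: SCORE: <number> REASON: <one sentence>"""

def judge_answer(context, question, answer, call_judge_model):
    """Ask an evaluator model to grade an answer, then parse the score from its reply."""
    reply = call_judge_model(JUDGE_PROMPT.format(
        context=context, question=question, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])", reply)
    score = int(match.group(1)) if match else None
    return score, reply

# Stub standing in for a real evaluator-model call (illustration only).
def stub_judge(prompt: str) -> str:
    return "SCORE: 4 REASON: The answer is mostly grounded in the context."

score, raw = judge_answer("The warranty is 24 months.",
                          "How long is the warranty?",
                          "The warranty lasts 24 months.",
                          stub_judge)
print(score)  # 4
```

Because the judge returns both a score and a reason, the same loop can be run over hundreds of test cases while keeping a human-readable audit trail of why each answer passed or failed.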


3. Core Evaluation Metrics of AI SuperVision

When evaluating TecAce's on-device RAG chatbot via AI SuperVision, we focused primarily on these three critical metrics:


  1. Faithfulness / Groundedness: This is the most critical metric in a RAG environment. It evaluates whether the chatbot's response is grounded solely in the provided context (internal documents, manuals, etc.). If the model mixes in external knowledge from its original pre-training, or fabricates information not present in the documents, this score drops significantly.

  2. Answer Relevance: This evaluates whether the chatbot accurately grasped the intent of the user's question and delivered a to-the-point answer. Even if the information is factually correct, a chatbot that rambles on with unrequested background knowledge is not a good assistant.

  3. Consistency: This verifies whether the chatbot provides the same reliable information even when the user asks the exact same question with slightly different wording or sentence structure (paraphrasing). This is a crucial factor in the overall reliability of the model.
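
As a rough intuition for how consistency can be measured, one can collect answers to several paraphrases of the same question and compute how much they agree. A judge model can do this semantically; the naive token-overlap average below merely illustrates the idea (the paraphrased answers are hypothetical):

```python
import re
from itertools import combinations

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, keeping internal hyphens and dots."""
    return set(re.findall(r"\w+(?:[.-]\w+)*", text.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def consistency(answers: list[str]) -> float:
    """Average pairwise token overlap across answers to paraphrased questions."""
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical answers to three paraphrases of 'How long is the warranty?'
answers = [
    "The warranty period is 24 months from the date of purchase.",
    "You are covered by a 24-month warranty starting at purchase.",
    "The product carries a 24-month warranty from purchase.",
]
print(0.0 < consistency(answers) <= 1.0)  # True
```

A high average means the chatbot tells the same story regardless of phrasing; a low average flags questions where the model's answers drift and deserve human review.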


Next Episode Preview

While the evaluation tools and criteria were set, a new hurdle awaited us. Our chatbot operates entirely offline "On-device" (inside an Android smartphone), whereas the AI SuperVision tool exists on a "PC / Web Server". How can we connect these two entirely different environments to build an automated testing pipeline?

In the upcoming [Part 7] Building SuperVision: An Automated Chatbot Testing Pipeline, we will dive deep into the development of a 'Broker App' that bridges the Android app on the smartphone with the verification tool on the PC, and walk through building a fully automated testing environment using ADB and Python scripts.
