
[On-Device AI Chatbot] Part 7: Building SuperVision: An Automated Chatbot Testing Pipeline




In Part 6, we explained why we introduced Testworks' 'AI SuperVision' tool to objectively evaluate the chronic hallucination problem inherent in generative AI. To actually apply this tool to our project, however, we first had to overcome a significant technical barrier.

Our LLM chatbot operates completely offline, "On-device" (inside a smartphone), whereas the AI SuperVision system evaluating it lives in a "PC or Web Server (Host)" environment. Manually typing dozens or hundreds of test cases into a smartphone and transcribing the results back to a PC is impractical at scale and highly error-prone.

In Part 7, we will share the detailed process of building an automated testing pipeline that bridges this physical gap, reducing the entire process—from injecting questions and extracting answers to executing AI validation—down to under 5 minutes.


System Overview

1. Bridging the Physical Gap: Utilizing a Broker App and ADB

To enable communication between the test scripts on the PC and the chatbot app inside the smartphone (the target device), we developed and installed a lightweight intermediary Android app called the 'Broker App' directly onto the smartphone.

The overall communication architecture using the Broker App operates as follows:


  1. ADB Port Forwarding: First, the PC and the smartphone are connected via USB (or on the same network). We use ADB (Android Debug Bridge) to forward a specific port on the PC to a port on the smartphone (e.g., adb forward tcp:8080 tcp:8080).

  2. PC to Broker App Request: The Python-based Data Collector and Orchestrator scripts on the PC send an HTTP POST request (containing the prompt data) to the smartphone's Broker App via the localhost address.

  3. Inter-App Communication (Broadcast Intent): Upon receiving the HTTP request, the Broker App utilizes the Android system's Broadcast Intent feature (com.tecace.supervision.broker.ACTION_CHAT_MESSAGE) to pass the question to the on-device Chatbot App (Test Runner).

  4. Answer Generation and Return: The MediaPipe engine within the Chatbot App drives the on-device LLM to generate an answer, which is then sent back to the Broker App via an ACTION_CHAT_REPLY intent. The Broker App finally returns this result as an HTTP response to the Python script on the PC.
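The PC side of this bridge can be sketched in a few lines of Python. The sketch below is illustrative only: the endpoint path (`/chat`), the JSON field names (`prompt`, `answer`), and the port are assumptions, and the real Broker App API may differ.

```python
import json
import subprocess
import urllib.request

BROKER_PORT = 8080  # must match the port the Broker App listens on


def forward_port(port: int = BROKER_PORT) -> None:
    """Map localhost:<port> on the PC to the same port on the phone via ADB."""
    subprocess.run(["adb", "forward", f"tcp:{port}", f"tcp:{port}"], check=True)


def build_payload(prompt: str) -> dict:
    """Request body sent to the Broker App (field name is illustrative)."""
    return {"prompt": prompt}


def ask_chatbot(prompt: str, port: int = BROKER_PORT, timeout: float = 120.0) -> str:
    """POST a question to the Broker App and block until the on-device answer returns."""
    req = urllib.request.Request(
        f"http://localhost:{port}/chat",  # endpoint path is an assumption
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["answer"]  # response field is an assumption


# Usage (requires a connected device with the Broker App installed):
#   forward_port()
#   answer = ask_chatbot("What is the warranty period?")
```

Because the Broker App hides the Broadcast Intent plumbing behind a plain HTTP interface, the PC script needs no Android-specific code beyond the one-time `adb forward` call.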


Through this architecture, we established a foundation to fully remote-control the on-device app and extract its text output to an external PC without ever having to touch the smartphone screen.


2. AI SuperVision Integration: Automated Evaluation via Python SDK

Now that we have successfully extracted the actual output from the device, it is time to send it to the AI SuperVision server for grading. To accomplish this, we pre-defined a golden dataset (Ground Truth, the expected correct answers) alongside various test cases.

Automated Evaluation Script Logic (Python)

We integrated the Python client SDK (supervision.client) provided by SuperVision into our scripts.


  • Initialize supervision.client and create an Evaluation Session.

  • Select a pre-defined Evaluator Model (i.e., the judge model that will score metrics like faithfulness and answer relevance).

  • Iterate through a For-loop, creating an EvaluationData object for each response retrieved from the smartphone. This object contains the prompt (question), the injected context (internal document content), the expected output (Ground Truth), and the actual output generated by the chatbot.

  • By sending this data to the SuperVision server using the client.session.request() function, the LLM-based evaluation is automatically executed in the background, and the results are recorded on the dashboard.
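The loop above can be sketched as follows. This is a minimal, self-contained illustration: the `EvaluationData` dataclass here is a simplified stand-in for the SDK's object of the same name, and the `session` argument is assumed to expose a `request(data)` method mirroring `client.session.request()` in the real supervision.client SDK.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EvaluationData:
    """Simplified stand-in for the SDK's EvaluationData object."""
    prompt: str            # the question injected into the chatbot
    context: str           # internal document content given as grounding
    expected_output: str   # Ground Truth answer from the golden dataset
    actual_output: str     # answer extracted from the on-device chatbot


def evaluate_batch(session, cases: Iterable[EvaluationData]) -> int:
    """Submit each case to the SuperVision server; returns the number submitted.

    The LLM-based grading (faithfulness, answer relevance, etc.) runs
    server-side in the background and lands on the dashboard.
    """
    sent = 0
    for data in cases:
        session.request(data)  # assumed to mirror client.session.request()
        sent += 1
    return sent
```

In the real pipeline, `session` would come from initializing supervision.client with an Evaluation Session and the chosen Evaluator Model; for illustration it can be any object with a `request` method.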


3. Completing the Foundation for Data-Driven Optimization

By building this fully automated End-to-End (E2E) pipeline connecting GitHub Actions (or a local script) -> Python (ADB) -> Broker App -> Chatbot App -> Python SDK -> SuperVision, we gained two major benefits.


  • Overwhelming Efficiency Gain: Regression testing, which previously consumed over 60 minutes due to manual input and verification on the device, can now be completed in under 5 minutes with a single script execution.

  • Securing Objective Data: Even in environments with device fragmentation (such as varying Android OS versions and NPU compatibilities), we are now able to consistently and objectively accumulate the model's performance (Latency, TPS) and quality (Accuracy, Hallucination) data into our database for every release.
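The E2E chain described above can also be triggered from CI. The workflow fragment below is a hypothetical sketch only: the script names, runner labels, and file names are assumptions, and it presumes a self-hosted runner with the test device attached over USB.

```yaml
name: chatbot-regression          # hypothetical workflow name
on: workflow_dispatch
jobs:
  e2e-eval:
    runs-on: [self-hosted, android-device]   # runner with the test phone attached
    steps:
      - uses: actions/checkout@v4
      - name: Forward broker port
        run: adb forward tcp:8080 tcp:8080
      - name: Collect on-device answers       # script name is illustrative
        run: python collect_responses.py --cases testcases.json --out responses.json
      - name: Run SuperVision evaluation      # script name is illustrative
        run: python evaluate_supervision.py --responses responses.json
```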


Next Episode Preview

At last, the entire process from 'injecting questions' to 'validating answers' is running automatically on our pipeline. So, what kind of report card did our on-device SLM receive from the strict AI SuperVision judges?

In the upcoming [Part 8] Catching Hallucinations: Analyzing SuperVision Test Results, we will disclose the detailed report card and analysis, including the specific scores our chatbot achieved on SuperVision's Factuality and Consistency metrics based on actual answers extracted from Galaxy devices, and uncover the root causes behind the failed test cases.


