top of page

The Journey to Automatically Measure LLM Performance on Smartphones – Building the On-Device LLM Tester


"You want to put AI on a smartphone?" — The Beginning of a Reckless Challenge

The story of how TecAce's AI Supervision Team built the On-Device LLM Tester


"Um… I'd like to automatically measure LLM performance on a smartphone."


A brief silence fell over the meeting room.


Our team was developing our own on-device AI chatbot. The problem was that every time we swapped models, a tester had to physically hold the phone, send prompts one by one, and time everything by hand. That process took over 60 minutes.

At the time, our chatbot ran a 3B-parameter model, but new models like Phi-4 Mini (3.8B), Gemma 3 (4B), Llama 3.2 (3B), and Qwen 3 (4B) were pouring out. Testing each of these manually would have meant spending a whole month just on model comparisons.


"Let's just automate it."


And so the On-Device LLM Tester project began.


We had exactly three goals:

  • Automatically pull the latest models from Hugging Face and load them onto a smartphone

  • Run inference on the phone and automatically collect performance metrics

  • Organize the data so you can tell at a glance: "Is this model faster than that one? More accurate?"


Our core values boiled down to three words:

  • Efficiency: Save time by automating manual work.

  • Data-Driven: Secure objective performance data across different hardware environments on each device.

  • Security: Meet internal security requirements and support NPU optimization.


Metrics to Measure

We weren't just looking at "fast or slow."

  • Performance: TTFT (Time to First Token), TPS (Tokens Per Second), peak memory

  • Accuracy: Context Fidelity against a golden dataset

  • Battery consumption rate, hallucination rate (the ultimate goal)



"Remotely Controlling a Smartphone via ADB" — Designing the System Architecture


Before building something, you have to decide how to build it.

The biggest challenge for the On-Device LLM Tester was how to automate the flow of "issuing commands from a PC, running the AI on the smartphone, and bringing the results back to the PC." The answer was surprisingly simple: ADB (Android Debug Bridge) — the tool developers know well for controlling Android devices over USB.


A Pipeline Split Into Host and Target

The overall system is broadly divided into the Host (PC/server) side and the Target (smartphone) side.


Host Components

  • Model Controller — Downloads models from Hugging Face and handles quantization conversion

  • Orchestrator — Triggers the entire process via GitHub Actions and controls the phone via ADB

  • Data Collector — Parses test logs, stores them in the DB, and generates reports


Target (Smartphone)

  • Lightweight Test App — A simple Android app responsible for loading models and running inference


The Tech Stack Selector

The part that sparked the most debate within the team was the tech stack.

  • Inference Engine: Google MediaPipe — the de facto standard for mobile AI inference, packaged in the .task format

  • CI/CD: GitHub Actions (Self-hosted) — Runner installed directly to access a physically USB-connected phone

  • Data Analysis: Pandas/Python

  • Quality Evaluation: Internal GPT Tester API

  • Report Visualization: Streamlit / Plotly


MVP Within 10 Days

Under the principle of "first build something that works," we scoped the MVP to exactly four things.

  • Deploy: Send the model to the phone using a Python script + ADB

  • Run: Launch the test app via adb shell

  • Collect: Extract latency data to CSV using adb logcat

  • Result: Print summary metrics to the terminal and generate an Excel report


"Finish a 60-minute test in under 5 minutes" — that was the MVP's true goal.



"The Data Started to Pile Up" — The Birth of the Dashboard and DB

Once the MVP started running, a new problem emerged.


"The test results pile up as CSV files… but how do we compare them?"


Gemma vs Qwen, CPU vs GPU, Samsung S25 vs S26. As the combinations grew, the Excel files just kept stacking up. We needed a proper visualization tool.


Dashboard Architecture: FastAPI + React

We could have simply served the JSON files directly, but considering future needs—like swapping out the DB or attaching AI Quality Eval—it made sense to add an API layer.


The conclusion was a FastAPI backend + React (Vite + TypeScript) frontend combination.

  • FastAPI: Reuses the existing report.py parsing/statistics logic, auto-generates Swagger docs, and scales well to multi-device

  • React + TypeScript: Handles the complex, deeply nested JSON metric structure in a type-safe way


Five Dashboard Pages


Overview

KPI cards (total test count, success rate, average latency, average Decode TPS), latency box plots by model, and success-rate charts by category. A landing page to grasp the whole situation at a glance.


Performance

Latency histograms, p50/p95/p99 distributions, TPS analysis by category, TTFT comparison, and detailed memory usage analysis.


Compare

Select two models from a dropdown to compare them side by side. View latency, TPS, and actual response text for the same prompt side-by-side, with a radar chart visualizing strengths and weaknesses by category.


Validation


Responses

A viewer for actual response text, with space already reserved for AI quality evaluation scores to be added later.


Raw Data

A full data table with sorting, filtering, search, and CSV download functionality.


Information Stored in SQLite

  • Device info: manufacturer, model name, SoC, Android version, CPU core count

  • Model info: name, path, backend

  • Metrics: TTFT, prefill/decode TPS, peak memory, and p50/p95/p99 of ITL


Once the data started to pile up, "comparison" finally began to mean something.



"One Button Runs the Benchmark" — The Reality of CI/CD Automation

The MVP was complete, and we had a dashboard. But there was still something inconvenient.


"To run a test, you have to open a terminal, run a script, check the phone connection…"


If it's true automation, shouldn't everything run with a single button press on GitHub?


Why a Self-hosted Runner?

The first wall we hit when adopting GitHub Actions was that "cloud Runners can't access a physical phone." The answer was a Self-hosted Runner — installing the GitHub Actions Runner directly on a development PC so GitHub uses our PC like a CI server.

We limited the trigger to workflow_dispatch (manual execution) only. If an LLM benchmark ran automatically on every commit, the phone would be testing all day long.


Pipeline Flow


GitHub Actions trigger

Self-hosted Runner (development PC, USB phone connected)

Step 1: runner.py — Sends test commands to the phone via ADB

Step 2: sync_results.py — Collects the phone's JSON results onto the PC

Step 3: ingest.py — Loads JSON into the SQLite DB + records the runs table

Step 4: GitHub Artifact — Uploads the .db file (90-day retention)

Prints a result summary to the GitHub Step Summary


One thing we paid special attention to was preventing concurrent runs. If you accidentally press "Run workflow" twice, the ADB commands overlap and the test breaks completely. We used a concurrency setting so that only one run executes at a time within the same group.


The Team's Code Review Culture

During this process, our teammate Luke gave the architecture document a thorough review.

  • "You should note that partial results are kept only locally when a timeout is exceeded"

  • "The method for injecting commit_sha isn't visible in the YAML"

  • "The upload-artifact failure handling doesn't match between the doc description and the YAML"

Andrew immediately incorporated the valid feedback, and politely declined the out-of-scope items with reasons.

Good technical documentation is built through this kind of give-and-take.

"Run GGUF Too!" — Evolving Into a llama.cpp Multi-Engine

Our pipeline, which started with the MediaPipe + .task format, finally hit a limit.

Most of the latest models uploaded to Hugging Face were in GGUF format. This format, pushed by llama.cpp, has become the de facto standard for PC/server use, and its variety of quantization options makes it great for comparison experiments. But our app could only read .task.


"Couldn't we run llama.cpp on Android too?"


Engine Abstraction: Adopting the Strategy Pattern

Simply branching engines with if-else quickly hits a limit. The code becomes a mess every time a third or fourth engine (MLC LLM, ExecuTorch, Qualcomm QNN…) is added.

So we introduced an InferenceEngine interface. MediaPipeEngine and LlamaCppEngine implement this interface, and MainActivity picks the appropriate implementation based on the engine type. It's a classic Strategy pattern.


interface InferenceEngine {
    val engineName: String  // "mediapipe" | "llamacpp"
    suspend fun init(modelPath: String, maxTokens: Int, params: Map<String, String>)
    suspend fun generate(prompt: String, inputTokenCount: Int): InferenceMetrics
    fun close()
}

The key was unifying the result JSON format. No matter which engine runs it, only a single engine field is added and the rest of the structure is identical. Thanks to that, the downstream pipeline (sync → ingest → dashboard) barely needed any changes.


How to Run llama.cpp on Android


After reviewing three approaches:

  • HTTP server mode: complex separate process management, ADB forwarding gets tangled → dropped

  • Termux + llama-cli: easy for a PoC but no CI automation possible → dropped

  • JNI binding: build llama.cpp as a C++ native library and embed it directly in the Android app → chosen


We added llama.cpp as a git submodule and built an arm64-v8a .so with CMake to include it in the app.


Handling Chat Templates

MediaPipe .task models include the chat template internally, so you can feed in raw text and it handles it automatically. GGUF models require the caller to apply it directly. The llama_chat_apply_template() API automatically reads and processes the template from the GGUF metadata.

An example comparison now possible

{
  "engine": "llamacpp",
  "model_name": "qwen2.5-1.5b-instruct-q4_k_m.gguf",
  "backend": "CPU",
  "metrics": {

    "ttft_ms": 120,
    "decode_tps": 4.49,
    "peak_native_memory_mb": 890
  }
}

If you mix .task models and .gguf models into a single test_config.json, the pipeline picks the engine and runs them automatically. Q4_K_M vs Q5_K_M vs Q8_0 quantization comparisons also became possible.


The Road Ahead

  • Phase 6: Detailed CPU/GPU/memory profiling

  • Phase 6.5: Quantization comparison pipeline

  • Phase 7: Cloud deployment architecture


What we're most excited about is integration with AI Supervision. The idea is to bring responses generated on the phone back to the PC, pass them to AI Supervision to check for hallucinations/accuracy, and produce a Final Score that combines the performance score and the quality score. With that single score, we can objectively judge "whether this model is a good fit for our chatbot."


What began as a journey to turn a 60-minute manual test into a 5-minute automated one grew into a multi-engine benchmark platform.


About TecAce

TecAce is a software development company built on AI/ML technology, researching AI evaluation automation and on-device AI optimization. This series is based on real Confluence development records.

bottom of page
AI Transformation
How Far Along Is Your AI Transformation?
Start your AI transformation
FREE