The Journey to Automatically Measure LLM Performance on Smartphones – Building the On-Device LLM Tester

TecAce Software
Jun 5
7 min read

"You want to put AI on a smartphone?" — The Beginning of a Reckless Challenge

The story of how TecAce's AI Supervision Team built the On-Device LLM Tester

"Um… I'd like to automatically measure LLM performance on a smartphone."

A brief silence fell over the meeting room.

Our team was developing our own on-device AI chatbot. The problem was that every time we swapped models, a tester had to physically hold the phone, send prompts one by one, and time everything by hand. That process took over 60 minutes.

At the time, our chatbot ran a 3B-parameter model, but new models like Phi-4 Mini (3.8B), Gemma 3 (4B), Llama 3.2 (3B), and Qwen 3 (4B) were pouring out. Testing each of these manually would have meant spending a whole month just on model comparisons.

"Let's just automate it."

And so the On-Device LLM Tester project began.

We had exactly three goals:

Automatically pull the latest models from Hugging Face and load them onto a smartphone
Run inference on the phone and automatically collect performance metrics
Organize the data so you can tell at a glance: "Is this model faster than that one? More accurate?"

Our core values boiled down to three words:

Efficiency: Save time by automating manual work.
Data-Driven: Secure objective performance data across different hardware environments on each device.
Security: Meet internal security requirements and support NPU optimization.

Metrics to Measure

We weren't just looking at "fast or slow."

Performance: TTFT (Time to First Token), TPS (Tokens Per Second), peak memory
Accuracy: Context Fidelity against a golden dataset
Battery consumption rate, hallucination rate (the ultimate goal)

"Remotely Controlling a Smartphone via ADB" — Designing the System Architecture

Before building something, you have to decide how to build it.

The biggest challenge for the On-Device LLM Tester was how to automate the flow of "issuing commands from a PC, running the AI on the smartphone, and bringing the results back to the PC." The answer was surprisingly simple: ADB (Android Debug Bridge) — the tool developers know well for controlling Android devices over USB.

A Pipeline Split Into Host and Target

The overall system is broadly divided into the Host (PC/server) side and the Target (smartphone) side.

Host Components

Model Controller — Downloads models from Hugging Face and handles quantization conversion
Orchestrator — Triggers the entire process via GitHub Actions and controls the phone via ADB
Data Collector — Parses test logs, stores them in the DB, and generates reports

Target (Smartphone)

Lightweight Test App — A simple Android app responsible for loading models and running inference

The Tech Stack Selector

The part that sparked the most debate within the team was the tech stack.

Inference Engine: Google MediaPipe — the de facto standard for mobile AI inference, packaged in the .task format
CI/CD: GitHub Actions (Self-hosted) — Runner installed directly to access a physically USB-connected phone
Data Analysis: Pandas/Python
Quality Evaluation: Internal GPT Tester API
Report Visualization: Streamlit / Plotly

MVP Within 10 Days

Under the principle of "first build something that works," we scoped the MVP to exactly four things.

Deploy: Send the model to the phone using a Python script + ADB
Run: Launch the test app via adb shell
Collect: Extract latency data to CSV using adb logcat
Result: Print summary metrics to the terminal and generate an Excel report

"Finish a 60-minute test in under 5 minutes" — that was the MVP's true goal.

"The Data Started to Pile Up" — The Birth of the Dashboard and DB

Once the MVP started running, a new problem emerged.

"The test results pile up as CSV files… but how do we compare them?"

Gemma vs Qwen, CPU vs GPU, Samsung S25 vs S26. As the combinations grew, the Excel files just kept stacking up. We needed a proper visualization tool.

Dashboard Architecture: FastAPI + React

We could have simply served the JSON files directly, but considering future needs—like swapping out the DB or attaching AI Quality Eval—it made sense to add an API layer.

The conclusion was a FastAPI backend + React (Vite + TypeScript) frontend combination.

FastAPI: Reuses the existing report.py parsing/statistics logic, auto-generates Swagger docs, and scales well to multi-device
React + TypeScript: Handles the complex, deeply nested JSON metric structure in a type-safe way

Five Dashboard Pages

Overview

KPI cards (total test count, success rate, average latency, average Decode TPS), latency box plots by model, and success-rate charts by category. A landing page to grasp the whole situation at a glance.

Performance

Latency histograms, p50/p95/p99 distributions, TPS analysis by category, TTFT comparison, and detailed memory usage analysis.

Compare

Select two models from a dropdown to compare them side by side. View latency, TPS, and actual response text for the same prompt side-by-side, with a radar chart visualizing strengths and weaknesses by category.

Validation

Responses

A viewer for actual response text, with space already reserved for AI quality evaluation scores to be added later.

Raw Data

A full data table with sorting, filtering, search, and CSV download functionality.

Information Stored in SQLite

Device info: manufacturer, model name, SoC, Android version, CPU core count
Model info: name, path, backend
Metrics: TTFT, prefill/decode TPS, peak memory, and p50/p95/p99 of ITL

Once the data started to pile up, "comparison" finally began to mean something.

"One Button Runs the Benchmark" — The Reality of CI/CD Automation

The MVP was complete, and we had a dashboard. But there was still something inconvenient.

"To run a test, you have to open a terminal, run a script, check the phone connection…"

If it's true automation, shouldn't everything run with a single button press on GitHub?

Why a Self-hosted Runner?

The first wall we hit when adopting GitHub Actions was that "cloud Runners can't access a physical phone." The answer was a Self-hosted Runner — installing the GitHub Actions Runner directly on a development PC so GitHub uses our PC like a CI server.

We limited the trigger to workflow_dispatch (manual execution) only. If an LLM benchmark ran automatically on every commit, the phone would be testing all day long.

Pipeline Flow

GitHub Actions trigger

↓

Self-hosted Runner (development PC, USB phone connected)

↓

Step 1: runner.py — Sends test commands to the phone via ADB

↓

Step 2: sync_results.py — Collects the phone's JSON results onto the PC

↓

Step 3: ingest.py — Loads JSON into the SQLite DB + records the runs table

↓

Step 4: GitHub Artifact — Uploads the .db file (90-day retention)

↓

Prints a result summary to the GitHub Step Summary

One thing we paid special attention to was preventing concurrent runs. If you accidentally press "Run workflow" twice, the ADB commands overlap and the test breaks completely. We used a concurrency setting so that only one run executes at a time within the same group.

The Team's Code Review Culture

During this process, our teammate Luke gave the architecture document a thorough review.

"You should note that partial results are kept only locally when a timeout is exceeded"
"The method for injecting commit_sha isn't visible in the YAML"
"The upload-artifact failure handling doesn't match between the doc description and the YAML"

Andrew immediately incorporated the valid feedback, and politely declined the out-of-scope items with reasons.

Good technical documentation is built through this kind of give-and-take.

"Run GGUF Too!" — Evolving Into a llama.cpp Multi-Engine

Our pipeline, which started with the MediaPipe + .task format, finally hit a limit.

Most of the latest models uploaded to Hugging Face were in GGUF format. This format, pushed by llama.cpp, has become the de facto standard for PC/server use, and its variety of quantization options makes it great for comparison experiments. But our app could only read .task.

"Couldn't we run llama.cpp on Android too?"

Engine Abstraction: Adopting the Strategy Pattern

Simply branching engines with if-else quickly hits a limit. The code becomes a mess every time a third or fourth engine (MLC LLM, ExecuTorch, Qualcomm QNN…) is added.

So we introduced an InferenceEngine interface. MediaPipeEngine and LlamaCppEngine implement this interface, and MainActivity picks the appropriate implementation based on the engine type. It's a classic Strategy pattern.

interface InferenceEngine {

    val engineName: String  // "mediapipe" | "llamacpp"

    suspend fun init(modelPath: String, maxTokens: Int, params: Map<String, String>)

    suspend fun generate(prompt: String, inputTokenCount: Int): InferenceMetrics

    fun close()

The key was unifying the result JSON format. No matter which engine runs it, only a single engine field is added and the rest of the structure is identical. Thanks to that, the downstream pipeline (sync → ingest → dashboard) barely needed any changes.

How to Run llama.cpp on Android

After reviewing three approaches:

HTTP server mode: complex separate process management, ADB forwarding gets tangled → dropped
Termux + llama-cli: easy for a PoC but no CI automation possible → dropped
JNI binding: build llama.cpp as a C++ native library and embed it directly in the Android app → chosen

We added llama.cpp as a git submodule and built an arm64-v8a .so with CMake to include it in the app.

Handling Chat Templates

MediaPipe .task models include the chat template internally, so you can feed in raw text and it handles it automatically. GGUF models require the caller to apply it directly. The llama_chat_apply_template() API automatically reads and processes the template from the GGUF metadata.

An example comparison now possible

  "engine": "llamacpp",

  "model_name": "qwen2.5-1.5b-instruct-q4_k_m.gguf",

  "backend": "CPU",

  "metrics": {

    "ttft_ms": 120,

    "decode_tps": 4.49,

    "peak_native_memory_mb": 890

If you mix .task models and .gguf models into a single test_config.json, the pipeline picks the engine and runs them automatically. Q4_K_M vs Q5_K_M vs Q8_0 quantization comparisons also became possible.

The Road Ahead

Phase 6: Detailed CPU/GPU/memory profiling
Phase 6.5: Quantization comparison pipeline
Phase 7: Cloud deployment architecture

What we're most excited about is integration with AI Supervision. The idea is to bring responses generated on the phone back to the PC, pass them to AI Supervision to check for hallucinations/accuracy, and produce a Final Score that combines the performance score and the quality score. With that single score, we can objectively judge "whether this model is a good fit for our chatbot."

What began as a journey to turn a 60-minute manual test into a 5-minute automated one grew into a multi-engine benchmark platform.

About TecAce

TecAce is a software development company built on AI/ML technology, researching AI evaluation automation and on-device AI optimization. This series is based on real Confluence development records.

The Journey to Automatically Measure LLM Performance on Smartphones – Building the On-Device LLM Tester

Related Posts

Comments