[On-Device AI Chatbot] Part 9: Challenging Performance Limits: Heat, Battery, and Response Speed
- TecAce Software
- 6 days ago
- 3 min read

In Part 8, we shared how we caught hallucinations and improved response quality using 'AI SuperVision'. Making the model smarter and more accurate is a huge milestone, but running it in a real-world smartphone environment (such as the Galaxy S25 FE) forces us to confront hard physical walls: thermal management, battery consumption, and latency.
Unlike the effectively limitless resources of a cloud data center, a mobile device that fits in the palm of your hand has severely limited power and cooling capacity. In Part 9, we walk through our performance benchmarking process and share the insights the TecAce team gained while trading off and optimizing against these physical constraints to deliver true on-device AI.
1. The Big 3 Performance Metrics of Mobile AI: TTFT, TPS, and IPW
To make users feel that the chatbot is "fast" and "natural," the system must meet the following core performance metrics:
TTFT (Time To First Token): The time it takes for the AI to process the user's prompt and display the very first word (token) on the screen. In mobile environments, achieving a TTFT of under 500ms (0.5 seconds) is ideal to prevent users from feeling conversational lag.
TPS (Tokens Per Second): The speed at which subsequent words are generated after the first token. To stream text seamlessly at an average human reading speed, a rate in the 20~50 TPS range (50 ms down to 20 ms per output token) must be guaranteed.
IPW (Inferences Per Watt): A critical power efficiency metric directly linked to battery life. It indicates how many inferences can be performed per unit of power consumed. A higher IPW means the chatbot can be used longer without draining the battery.
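TTFT and TPS can both be derived from a single pass over any streaming generation API. A minimal sketch, assuming a hypothetical `generate_stream()` callable that yields tokens one at a time (substitute your runtime's actual streaming interface):

```python
import time

def benchmark_stream(generate_stream, prompt):
    """Measure TTFT and TPS from a token-streaming generator."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _token in generate_stream(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now          # first token arrived: TTFT endpoint
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    # TPS is measured over the decode phase, i.e. tokens after the first one
    decode_time = end - first_token_at
    tps = (n_tokens - 1) / decode_time if n_tokens > 1 and decode_time > 0 else 0.0
    return {"ttft_s": ttft, "tps": tps, "tokens": n_tokens}

# Usage with a stand-in generator that simulates a 0.3 s prefill and ~25 TPS decode:
def fake_stream(prompt):
    time.sleep(0.3)           # prompt processing before the first token
    for _ in range(20):
        time.sleep(0.04)      # ~40 ms per token
        yield "tok"

stats = benchmark_stream(fake_stream, "hello")
print(stats)
```

The same harness works against a real on-device runtime as long as it exposes token-by-token streaming; only `fake_stream` is a placeholder.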
2. The Hot Smartphone: The Dilemma of Thermal Throttling and Battery
Our biggest hurdle during initial testing was Thermal Throttling.
While a Large Language Model (LLM) is generating an answer, the device's compute units (CPU, GPU, NPU) run at near 100% utilization. If the workload relies heavily on the CPU or standard GPU, power draw can momentarily spike over 15W, causing the device temperature to skyrocket. When a smartphone reaches its thermal limit (usually around 85°C internally), it forcibly lowers clock speeds to prevent hardware damage. At this exact moment, TPS plummets drastically, causing the chatbot to "stutter."
Additionally, extended inference times visibly drained the battery. We were caught in a dilemma: "The longer the chatbot thinks and speaks to give a smart answer, the hotter the device gets, and the faster the battery melts."
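The feedback loop described above (high power draw heats the device until the governor cuts clocks and TPS collapses) can be illustrated with a toy thermal model. Every constant below is illustrative, not a measured value; only the 85°C limit and the 15 W vs. 2~5 W power figures come from our observations:

```python
def simulate_decode(n_tokens, power_w, tps_full=30.0, tps_throttled=8.0,
                    start_temp=35.0, limit_c=85.0, heat_per_j=0.2, cool_rate=0.3):
    """Toy model: each token's compute energy heats the device; once the
    internal limit is reached, the governor pins the temperature there
    by cutting clocks, which collapses TPS."""
    temp, elapsed, throttled = start_temp, 0.0, False
    for _ in range(n_tokens):
        if temp >= limit_c:
            throttled = True
        tps = tps_throttled if temp >= limit_c else tps_full
        dt = 1.0 / tps                      # seconds spent on this token
        temp += power_w * dt * heat_per_j   # heating from compute energy
        temp -= cool_rate * dt              # passive cooling
        temp = min(temp, limit_c)           # governor holds temp at the limit
        elapsed += dt
    return temp, elapsed, throttled

# A long 2000-token answer: 15 W CPU/GPU spike vs. a 3 W NPU-class draw
t_cpu, time_cpu, hot_cpu = simulate_decode(2000, power_w=15.0)
t_npu, time_npu, hot_npu = simulate_decode(2000, power_w=3.0)
print(f"CPU/GPU path: {t_cpu:.0f}°C, {time_cpu:.0f}s, throttled={hot_cpu}")
print(f"NPU path:     {t_npu:.0f}°C, {time_npu:.0f}s, throttled={hot_npu}")
```

Even in this crude model, the high-power path hits the limit partway through the answer and spends the rest of it crawling at throttled speed, while the low-power path finishes the same answer cooler and faster.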
3. Finding the Realistic Sweet Spot: TecAce's Optimization Strategy
To overcome these hardware limitations, the TecAce team executed an optimization strategy that combined hardware accelerator utilization with strict inference parameter control.

Power Diet through Active NPU Utilization
Instead of relying on power-hungry CPUs/GPUs, we offloaded inference workloads to the NPU (Neural Processing Unit), which is purpose-built for matrix multiplication. Tests showed that running an optimized model on the NPU slashed power consumption down to the 2~5W range, effectively suppressing heat generation and maintaining a relatively consistent TPS even when processing long contexts.
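Using the power figures above (15 W spikes on the CPU/GPU path vs. 2~5 W on the NPU), the IPW gap follows directly; here IPW is normalized per watt-hour of energy, and the per-response decode time of 6 seconds is an assumed placeholder:

```python
def inferences_per_watt_hour(inferences, duration_s, avg_power_w):
    """Energy-normalized efficiency: inferences per watt-hour consumed."""
    energy_wh = avg_power_w * duration_s / 3600.0
    return inferences / energy_wh

# Same workload (100 responses, ~6 s of decoding each), two power profiles:
cpu_ipw = inferences_per_watt_hour(100, 100 * 6.0, avg_power_w=15.0)
npu_ipw = inferences_per_watt_hour(100, 100 * 6.0, avg_power_w=3.5)  # mid of 2~5 W
print(f"CPU/GPU path: {cpu_ipw:.0f} inferences/Wh")
print(f"NPU path:     {npu_ipw:.0f} inferences/Wh")
print(f"NPU efficiency gain: {npu_ipw / cpu_ipw:.1f}x")
```

Because both paths run the same workload for the same duration in this comparison, the efficiency gain reduces to the power ratio, roughly 4x at the midpoint of the NPU's 2~5 W band.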
Limiting Compute via Hyperparameter Control
As confirmed in our [Part 2] Chatbot Test Matrix, the single factor with the greatest impact on performance is the number of tokens the model generates.
- Strict max_tokens Limit: Every additional generated token linearly increases latency and power consumption. Through prompt engineering, we instructed the chatbot to "answer concisely in 3 sentences or less without unnecessary background," forcing inference to exit early, well before hitting the max_tokens ceiling.
- Tuning top_k and top_p: By appropriately narrowing the candidate word pool (e.g., lowering the top_k value), we reduced the computational overhead of softmax operations by about 10~20%. As soon as inference concludes, the processor immediately transitions into a quiescent state to preserve battery life.
RAG Context Length Optimization (Chunking)
When using Local RAG, injecting excessively long documents (context) dramatically increases the initial prompt-processing load, slowing down TTFT. By optimizing the chunk size during the vector retrieval process, we injected only the most essential information into the model. This reduced bottlenecks and significantly accelerated the time it takes to output the first word.
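The chunking strategy above reduces to a simple rule: retrieve ranked chunks, but inject only as many as fit a fixed context budget. A sketch, assuming chunks arrive already scored by vector similarity (the scores, chunk texts, and whitespace token approximation are all illustrative):

```python
def select_chunks(ranked_chunks, budget_tokens,
                  approx_tokens=lambda s: len(s.split())):
    """Greedily pack the highest-scoring chunks into the prompt
    until the context token budget is exhausted."""
    picked, used = [], 0
    for score, text in ranked_chunks:
        cost = approx_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip chunks that would blow the TTFT budget
        picked.append(text)
        used += cost
    return picked, used

ranked = [  # (similarity score, chunk text) -- illustrative values
    (0.92, "The S25 FE NPU handles matrix multiplication workloads."),
    (0.81, "Thermal throttling engages near the internal limit."),
    (0.40, "Unrelated marketing copy that would only slow down TTFT " * 20),
]
chunks, used = select_chunks(ranked, budget_tokens=40)
print(len(chunks), "chunks,", used, "tokens injected")
```

A real implementation would use the model's own tokenizer rather than a whitespace count, but the shape of the trade-off is the same: a smaller injected context means less prefill work and a faster first token.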
Next Episode Preview (The Finale)
We selected the right model for the device (SLM), loaded it via optimization and quantization, gave it ears and a mouth (STT/TTS, RAG), passed the rigorous verification of an AI judge (SuperVision), and finally overcame the physical barriers of heat and battery limits.
In our grand finale, [Part 10] The Future of On-Device AI and TecAce's Roadmap (Conclusion), we will reflect on the invaluable Lessons Learned throughout this journey and share TecAce's vision for evolving beyond a simple chatbot into an 'Agentic AI' capable of autonomous reasoning and device control.
