[On-Device AI Chatbot] Part 3: Core Technologies of Mobile AI: Quantization and NPU Optimization


Core Technologies of Mobile AI

Quantization and NPU Optimization


In Part 2, we discussed our selection of Gemma-2B as the ideal Small Language Model (SLM) for our project and shared our experiences benchmarking CPU and GPU performance in a constrained smartphone environment. However, the initial tests revealed significant challenges: noticeable latency and out-of-memory errors.

To run LLMs in real-time on a mobile device held in the palm of your hand—not on a data center’s GPU rack—you need an extreme technical diet. This involves drastically reducing the model's size and utilizing the hardware accelerator (NPU) to its absolute limits. In this third installment, we dive deep into the core technologies that make on-device AI possible: Quantization and NPU optimization strategies.


1. The Magic of Model Compression: Quantization

Quantization is an essential technique for reducing the size and computational requirements of deep learning models. It involves converting the model’s weights and activations from high-precision 32-bit floating-point (FP32) values down to lower-precision integers, such as 8-bit (INT8) or 4-bit (INT4).
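As a minimal illustration, here is a sketch of symmetric post-training INT8 quantization (the weight values are invented for the example; real toolchains quantize per-channel or per-group rather than with one scale for the whole tensor):

```python
def quantize_int8(weights):
    """Map FP32 weights to INT8 codes using a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 codes."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.5]   # toy FP32 weights
q, scale = quantize_int8(weights)      # q = [82, -127, 0, 50]
restored = dequantize(q, scale)
```

Each weight now occupies 1 byte instead of 4, at the cost of a small rounding error — notice how 0.003 collapses to 0 here. Keeping that error small is exactly what the more advanced schemes below are about.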


  • Memory Savings and Speed Boost: Quantization can shrink a model's memory footprint by 4x to 8x. It also accelerates inference and significantly cuts power consumption by leveraging the integer-arithmetic units built into modern NPUs. In particular, INT4 post-training quantization can reduce language model sizes by 2.5 to 4 times while delivering major improvements in both latency and memory consumption.


  • Two Approaches to Quantization:

    1. Post-Training Quantization (PTQ): Performed after the model has been fully trained using standard precision. It is relatively simple and fast, though it may result in slight accuracy loss.

    2. Quantization-Aware Training (QAT): Incorporates quantization simulation directly into the training process, allowing the model to adapt to lower precision and generally resulting in higher accuracy than PTQ.
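The difference between the two approaches can be sketched in a few lines: QAT wraps each weight in a "fake quantization" step during the forward pass, so training already sees the rounding error that PTQ only discovers after the fact (the 0.01 scale and weight value here are arbitrary example numbers):

```python
def fake_quant(w, scale=0.01):
    """QAT forward pass: quantize then immediately dequantize, so the
    training loss is computed on the value the INT8 model will actually use."""
    q = max(-128, min(127, round(w / scale)))
    return q * scale

w = 0.3372
print(fake_quant(w))  # ~0.34 — the loss sees the quantized value, not 0.3372
```

Because the optimizer is penalized for this error during training, the final weights settle into values that survive quantization well, which is why QAT generally beats PTQ on accuracy.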


Through advanced techniques like Activation-aware Weight Quantization (AWQ) and QLoRA, modern mobile AI chatbots can load 3-billion-parameter models smoothly into the roughly 4GB of available mobile memory, all while maintaining their powerful reasoning capabilities.
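A rough back-of-the-envelope check (ignoring quantization scales, the KV cache, and runtime overhead) shows why 4-bit weights are the difference between impossible and comfortable for a 3-billion-parameter model:

```python
params = 3_000_000_000

fp32_gib = params * 4 / 2**30    # 4 bytes per weight
int4_gib = params * 0.5 / 2**30  # 4 bits = half a byte per weight

print(f"FP32: {fp32_gib:.1f} GiB, INT4: {int4_gib:.1f} GiB")
# FP32: 11.2 GiB, INT4: 1.4 GiB
```

At FP32 the weights alone would exceed a flagship phone's entire RAM; at INT4 they fit well inside the ~4GB budget, leaving room for activations and the rest of the app.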


2. Pushing the NPU to the Limit: Manufacturer Toolchains

Once the model is compressed, the next step is ensuring it runs optimally on the mobile device's core brain for AI: the Neural Processing Unit (NPU). TecAce analyzed the tools provided by the two pillars of the Android ecosystem—Qualcomm and Samsung—to establish an optimal deployment strategy.



Optimization via Qualcomm AI Hub

For devices equipped with Snapdragon platforms, the Qualcomm AI Hub Workbench provides robust optimization capabilities.


  1. Hardware-Aware Optimization: Upon uploading a trained PyTorch or ONNX model, the toolkit automatically performs hardware-aware optimizations tailored specifically for the target platform.

  2. Physical Device Profiling: The system provisions actual physical smartphones hosted in the cloud to run on-device inference. It gathers precise metrics such as layer-to-compute-unit mapping, inference latency, and peak memory usage.


Structural Approach with Samsung Exynos AI Studio

When targeting Samsung devices, we use Samsung's dedicated SDK, Exynos AI Studio.


  • EHT (High Level Toolchain): Takes open-source framework models (like ONNX or TFLite) and converts them into an internal intermediate representation (IR). It then modifies the graph structure for NPU execution and applies quantization to shrink the model.

  • ELT (Low Level Toolchain): Performs generation-specific lowering operations, converting the model into hardware-executable code and compiling it for the specific Exynos NPU. During this process, the output is rigorously verified on simulators and emulators by comparing operator-level Signal-to-Noise Ratio (SNR) against the original model, ensuring minimal accuracy loss.
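As an illustration of the kind of check such a toolchain performs, here is a toy operator-level SNR computation between an FP32 reference output and its quantized counterpart (the tensor values and the 0.01 scale are invented for the example):

```python
import math

def snr_db(reference, quantized):
    """10 * log10(signal power / quantization-noise power), in decibels."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)

reference = [0.82, -1.27, 0.003, 0.5]                    # FP32 operator output
quantized = [round(r / 0.01) * 0.01 for r in reference]  # after an INT8 round-trip

print(f"{snr_db(reference, quantized):.1f} dB")
```

A high SNR means the quantized operator tracks the original closely; a low SNR on a particular operator pinpoints exactly where quantization hurt accuracy.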


3. Android Native Integration: MediaPipe and LiteRT

To finally deploy our optimized models into an Android application, we utilized Google's AI Edge Stack.


  • MediaPipe Tasks: Provides high-level cross-platform APIs for easily deploying ready-made ML solutions, including LLM inference.

  • LiteRT (formerly TensorFlow Lite): Serves as Google’s high-performance runtime engine for on-device AI. It natively supports hardware acceleration (via GPU and NPU delegates), providing a fast and stable execution environment on mobile.


Furthermore, with the recent addition of mobile-ready Retrieval-Augmented Generation (RAG) and Function Calling libraries, on-device chatbots can now go beyond simple text generation: they can ground their answers in device-specific data and invoke predefined applications and features on the user's behalf.
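To make the RAG idea concrete, here is a deliberately tiny retriever sketch (the documents, overlap scoring, and prompt format are all invented for illustration; the real libraries use embedding-based vector search):

```python
def retrieve(query, documents, k=1):
    """Rank on-device snippets by word overlap with the query (toy scoring)."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

device_docs = [
    "Battery level is 80 percent",
    "Bluetooth is turned off",
    "Alarm set for 7 AM",
]

question = "is bluetooth on"
context = retrieve(question, device_docs)[0]
prompt = f"Context: {context}\nQuestion: {question}"
```

The SLM then answers from the retrieved context rather than from its frozen training data — which is what lets an on-device chatbot answer questions about this phone, right now, without any network call.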


Next Episode Preview

Now that our mobile device successfully hosts an optimized, fast, and smart "brain", it’s time to give our chatbot the "eyes, ears, and mouth" it needs to interact with users.

In the upcoming [Part 4] The Ears and Mouth of a Chatbot: On-Device STT/TTS Integration, we will share TecAce's hands-on experience integrating fully offline Speech-to-Text (STT) and Text-to-Speech (TTS) into an Android native app — breaking free from the limits of text typing and achieving a seamless multimodal conversational interface.
