
[On-Device AI Chatbot] Part 4: The Ears and Mouth of a Chatbot: On-Device STT/TTS Integration




In Part 3, we explored the optimization process of compressing a massive language model to fit the constrained resources of a smartphone and boosting inference speed using the mobile NPU. Now that we have successfully embedded a fast and smart "brain" inside the device, it is time to give our chatbot the "ears and mouth" it needs to interact naturally with users.

In a mobile environment, having to type out long messages for every turn is a significant drag on the user experience (UX). The TecAce team therefore set out to integrate on-device STT (Speech-to-Text) and TTS (Text-to-Speech), implementing a multimodal interface that allows seamless voice conversations while remaining fully offline.


1. The Chatbot's Ears: Adopting Offline STT (Speech-to-Text)

Given the nature of this project, where security and privacy are paramount, the user's voice data must never be transmitted to a cloud server. During the initial sprint phases, the TecAce team compared and analyzed various STT solutions.

Solution Review and Selection

  • Online/Cloud: Azure STT (Fast and highly accurate, but failed our offline/security requirements)

  • Offline 1: Vosk-Kaldi (Lightweight and fast, but fell short in contextual understanding and multilingual processing)

  • Offline 2: Whisper C++ (A C/C++ port of OpenAI's Whisper model)


Ultimately, we decided to integrate Whisper C++, which boasted the highest accuracy and best multilingual processing capabilities, into our Android app. We built a pipeline where pressing the microphone button immediately filters background noise and listens to the user's voice; as soon as the speech ends, it locally converts the audio to text and passes it directly to the LLM prompt.
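The hand-off from microphone to STT hinges on detecting when the user has finished speaking. The sketch below shows one common way to do this: an energy-threshold endpoint detector that declares the utterance over after enough consecutive quiet frames. The class name and thresholds are illustrative, not the app's actual implementation:

```java
/** Illustrative end-of-speech detector: flags the end of an utterance
 *  once audio energy stays below a threshold for enough consecutive frames. */
public class EndpointDetector {
    private final double silenceThreshold;  // RMS level treated as silence
    private final int silenceFramesNeeded;  // consecutive quiet frames to end
    private int quietFrames = 0;
    private boolean speechSeen = false;

    public EndpointDetector(double silenceThreshold, int silenceFramesNeeded) {
        this.silenceThreshold = silenceThreshold;
        this.silenceFramesNeeded = silenceFramesNeeded;
    }

    /** Feed one PCM frame; returns true when the utterance has ended. */
    public boolean feed(short[] frame) {
        double sumSquares = 0;
        for (short s : frame) sumSquares += (double) s * s;
        double rms = Math.sqrt(sumSquares / frame.length);
        if (rms >= silenceThreshold) {
            speechSeen = true;   // user is (still) talking
            quietFrames = 0;
        } else if (speechSeen) {
            quietFrames++;       // trailing silence after speech
        }
        return speechSeen && quietFrames >= silenceFramesNeeded;
    }
}
```

Once `feed` returns true, the buffered audio can be handed to the Whisper C++ transcriber and the recognized text forwarded to the LLM prompt.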


2. The Chatbot's Mouth: What Voice Will It Use? (TTS)

Once the STT understands the question and the LLM generates an answer, we need TTS (Text-to-Speech) to read it aloud in a natural-sounding voice.

Exploring TTS Solutions and Their Limitations

Initially, we reviewed both the native TTS built into the Android OS and third-party offline AI TTS solutions (such as Kokoro) that offer more natural voices. The biggest technical hurdle we faced was synthesis latency: if we passed the entire long response generated by the LLM to the TTS engine all at once, it took far too long to synthesize the audio and begin playback. To resolve this, it was essential to implement optimization logic that slices the response into sentence or paragraph units (chunking) and plays the audio sequentially, much like streaming.
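The chunking step itself is plain string handling. A minimal sketch (names are illustrative) that slices LLM output at sentence boundaries so each piece can be handed to the TTS engine as soon as it is complete:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sentence chunker: splits LLM output at sentence-ending
 *  punctuation so each piece can be synthesized and played immediately. */
public class SentenceChunker {
    public static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            current.append(c);
            if (c == '.' || c == '!' || c == '?') {
                String piece = current.toString().trim();
                if (!piece.isEmpty()) chunks.add(piece);
                current.setLength(0);
            }
        }
        String tail = current.toString().trim();
        if (!tail.isEmpty()) chunks.add(tail); // flush trailing partial sentence
        return chunks;
    }
}
```

In a streaming setup, each chunk is queued for synthesis while the next one is still being generated, so playback of the first sentence can begin almost immediately.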


TTS Platform Research & Test

3. The Pitfalls of Mobile Environments: Edge Cases and UX Troubleshooting

Running three heavy engines—LLM, STT, and TTS—simultaneously within a device inevitably led to various bugs related to app lifecycles and memory management. Here are some of the key troubleshooting cases we resolved while preparing for the v1.0.5 release:


  • App Crash Issue (DeadObjectException): When navigating to the settings screen or when the app went into the background, reinitializing the TTS engine occasionally threw a DeadObjectException, causing the app to crash. This occurred because the connection to the TTS service was severed by Android's memory reclamation policies. We fixed this by reinforcing our exception handling logic and safely releasing objects in sync with the app's lifecycle.
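The recovery pattern can be sketched independently of the Android APIs. DeadObjectException signals that the process behind a bound service has died, so the guard amounts to catching the failure, releasing the stale engine, and rebuilding the connection once before retrying. A simplified, platform-free sketch (the interfaces and names are illustrative, not the app's actual classes):

```java
/** Illustrative guard around a service-backed TTS engine: if a call fails
 *  because the service connection died, rebuild the engine once and retry. */
public class GuardedTts {
    public interface Engine {
        void speak(String text) throws Exception; // stands in for the binder call
        void shutdown();
    }
    public interface EngineFactory { Engine create(); }

    private final EngineFactory factory;
    private Engine engine;

    public GuardedTts(EngineFactory factory) {
        this.factory = factory;
        this.engine = factory.create();
    }

    /** Returns true if the text was spoken, possibly after one re-init. */
    public boolean speakSafely(String text) {
        try {
            engine.speak(text);
            return true;
        } catch (Exception dead) {     // e.g. DeadObjectException on Android
            engine.shutdown();         // release the stale service proxy
            engine = factory.create(); // rebuild the connection
            try {
                engine.speak(text);
                return true;
            } catch (Exception stillDead) {
                return false;          // give up; surface an error in the UI
            }
        }
    }

    /** Release in sync with the lifecycle, e.g. from onDestroy(). */
    public void release() { engine.shutdown(); }
}
```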

  • The Chatbot's Endless Monologue (Background Playback): We discovered a UX flaw where the chatbot would continue reading the generated response aloud to the end even after the user pressed the home button to exit to the home screen or opened the settings page. We synchronized the lifecycle events (onPause / onStop) to immediately halt active TTS playback when the app disappears from the screen. Similarly, for STT, we caught a bug where previous voice inputs were repeatedly fed into the system after returning from the settings screen, and we resolved it by strictly separating state management.

  • 'Stop Button' UI/UX Improvements: We added a 'Stop Speech' button for instances where the user wanted to cut off the LLM's long response midway. Initially, there were issues where the stop button would disappear as soon as text generation finished (even if audio playback was still ongoing), or the previous audio wouldn't stop and overlapped when the user asked a new question during playback. We improved this to create a smart conversational flow where existing TTS output immediately stops if the user interrupts by initiating a new voice input.
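The barge-in rule above reduces to a small piece of state tracking: the stop button must stay visible while either text generation or audio playback is still in flight, and starting a new voice input always halts playback first. A platform-free sketch with illustrative names:

```java
/** Illustrative conversation state: tracks generation and playback so the UI
 *  knows when to show the stop button, and new voice input barges in. */
public class SpeechSession {
    private boolean generating = false;
    private boolean speaking = false;

    public void onGenerationStarted()  { generating = true; }
    public void onGenerationFinished() { generating = false; }
    public void onPlaybackStarted()    { speaking = true; }
    public void onPlaybackFinished()   { speaking = false; }

    /** Stop button stays while either text or audio is still in flight. */
    public boolean showStopButton() { return generating || speaking; }

    /** New voice input interrupts: halt playback before listening. */
    public void onUserStartsSpeaking() {
        speaking = false; // in the real app this would also call tts.stop()
    }
}
```

Keeping `showStopButton()` true until both flags clear fixes the original bug where the button vanished as soon as generation finished, even though audio was still playing.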


Conclusion: The Completion of a True 'Personal Assistant'

By successfully integrating STT and TTS into an on-device environment, TecAce's chatbot evolved from a simple text messenger into a "true AI assistant that you can talk to even in airplane mode." This made natural voice conversations possible for field workers whose hands might be wet or occupied, as well as for personnel working inside highly secure facilities.


Next Episode Preview

"So, how does this chatbot understand our company's tasks or my personal context to provide answers?" No matter how smart a model is, if it doesn't know the specific context the user is in or the internal company regulations, it is bound to give irrelevant answers.

In the upcoming [Part 5] A Chatbot That Understands Context: Implementing RAG and Multi-Context Switching, we will cover the detailed implementation of Local RAG (Retrieval-Augmented Generation)—which allows the chatbot to read and answer based solely on documents stored inside the smartphone without an external internet connection—and how we handled multi-context switching across various conversational topics.
