A Deep Dive into Building a Robust Text To Speech Handler with Pause, Resume, and Word Highlighting.

Table Of Contents:
· ✨ The Trailer: Initial Setup
· 🤫 Chapter 1: The First Tension — “Unexpected Silence”
· ⚡️ Chapter 2: The “Wow” Factor — Lighting Up Words on the Screen
· 💪 Chapter 3: The “Too Much Text!” Problem
· ⚖️ Chapter 4: The Real Brain-Teaser — Pause, Resume, and All That Drama
· 👻 Chapter 5: The “Ghost Speaker” That Won’t Shut Up
· 😉 Bonus Tips:
· 🧠 Conclusion:
· 💬 Let’s Talk
· 📚 My Articles:
Hey Android Developers!
Let’s talk about the Android TextToSpeech (TTS) engine. At first glance, it looks simple, right? You give it some text, it speaks. Done! But if you have ever tried to use it in a proper production-level application, you will know the real story 🥸. It’s like expecting simple two-minute Maggi 🍜 noodles and ending up having to cook a full biriyani 🥘.
- What happens when you need to pause and resume?
- How do you highlight words as they’re spoken?
- What if the text is longer than the engine’s input limit?
- How do you handle the user minimising the app or navigating to another screen?
Today, we are going on a journey. We’ll start with the basic, “it’s not working” code and, step-by-step, turn it into a rock-solid TextToSpeech handler that can handle anything you throw at it. We will be using modern Android libraries like Coroutines, Flows, and Lifecycle Observers.
Let’s get started!
✨ The Trailer: Initial Setup
class TextToSpeechManager(private val context: Context) {
    private val coroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.Main.immediate)
    private var tts: TextToSpeech? = null

    fun speak(text: String) {
        // TODO..
    }

    fun stop() {
        tts?.stop()
    }

    fun shutdown() {
        stop()
        tts?.shutdown()
        tts = null
        coroutineScope.cancel()
    }
}
In TextToSpeechManager, the tts instance starts out as null. It is created lazily, the first time speak() is called.
🤫 Chapter 1: The First Tension — “Unexpected Silence”
You write your first piece of TextToSpeech code, run the app, click the button, and… nothing. Pin-drop silence. You click it again, and suddenly it works. What’s this magic?
fun speak(text: String) {
    // The code that gives every developer trust issues
    tts = TextToSpeech(context) { status ->
        if (status == TextToSpeech.SUCCESS) {
            // This part runs after some time... a long time...
        }
    }
    // ...but your code reaches here in a flash!
    tts?.speak("Hello, world!", TextToSpeech.QUEUE_FLUSH, null, null) // Silently fails
}
The main problem is that the TextToSpeech engine takes its own sweet time to get ready. Your code calls .speak() and moves on, while the engine is still waking up.
The modern, clean way to solve this is to stop thinking in callbacks and start thinking in coroutines. We will use suspendCancellableCoroutine to create a “bridge”. This function will patiently wait for the TextToSpeech engine to be ready before letting our code proceed.
Create a suspend function that only returns when the engine is 100% ready.
private suspend fun initialize() {
    // If already initialized, do nothing.
    if (tts != null) return
    tts = suspendCancellableCoroutine { continuation ->
        var ttsInstance: TextToSpeech? = null
        val listener = TextToSpeech.OnInitListener { status ->
            if (status == TextToSpeech.SUCCESS) {
                ttsInstance?.let {
                    if (continuation.isActive) {
                        // It worked! Resume the coroutine with the engine instance.
                        continuation.resume(it)
                    }
                }
            } else {
                if (continuation.isActive) {
                    // It failed. Resume the coroutine with an exception.
                    continuation.resumeWithException(Exception("Init failed!"))
                }
            }
        }
        ttsInstance = TextToSpeech(context, listener)
        // A safety net: if the coroutine is cancelled while waiting, shut down the engine.
        continuation.invokeOnCancellation {
            ttsInstance?.shutdown()
        }
    }
}
This function is our gateway to a ready-to-use TTS engine. Now our code is simple and readable: we just call initialize() before touching the TextToSpeech instance. No more guesswork!
⚡️ Chapter 2: The “Wow” Factor — Lighting Up Words on the Screen
To make our app look really premium, let’s highlight words as they are being spoken. This needs a bit of teamwork between our handler and the UI.
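The TextToSpeech API has a secret weapon for this: UtteranceProgressListener. Inside it, the onRangeStart() method tells us exactly which characters are about to be spoken. Keep in mind that onRangeStart() is only available from API level 26, and it is only called if the installed engine reports timing information.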
Our plan is simple:
- When onRangeStart() is called, we will emit an event with the start and end character positions.
- We will use a StateFlow to send these events out from our handler.
- Our ViewModel and UI will listen to this flow and update the screen.
class TextToSpeechManager(private val context: Context) {
    // (1)
    private val _highlight = MutableStateFlow<Pair<Int, Int>?>(null)
    val highlight: StateFlow<Pair<Int, Int>?> = _highlight

    // (2)
    private fun createProgressListener() = object : UtteranceProgressListener() {
        override fun onStart(utteranceId: String?) {}
        override fun onError(utteranceId: String?) { stop() }
        override fun onDone(utteranceId: String?) {}
        override fun onRangeStart(utteranceId: String?, start: Int, end: Int, frame: Int) {
            // (3) start and end are relative to the text passed to speak()
            coroutineScope.launch {
                _highlight.emit(Pair(start, end))
            }
        }
    }

    private suspend fun initialize() {
        // existing code...
        if (status == TextToSpeech.SUCCESS) {
            // (4)
            it.setOnUtteranceProgressListener(createProgressListener())
            // resume...
        }
        // assign tts instance..
    }
}
- (1) We first declare a StateFlow inside our TextToSpeechManager to track the currently spoken word.
- (2) We create an UtteranceProgressListener object.
- (3) In the onRangeStart method, we just emit the start and end values into our highlight flow.
- (4) Finally, we assign the listener to the TextToSpeech instance in our initialize method.
In Jetpack Compose, this is super easy to handle. We just collect the flow and use an AnnotatedString to paint a nice yellow background on the text. Simple, na?
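Before painting anything, the UI has to make sure the range it received actually fits the text on screen. Here is a minimal sketch of that check in plain Kotlin, so it can be unit-tested without Android. The names splitForHighlight and HighlightParts are hypothetical helpers for this illustration, not part of the handler.

```kotlin
// Hypothetical helper (not part of TextToSpeechManager): the three parts of a
// text around a highlighted range.
data class HighlightParts(val before: String, val highlighted: String, val after: String)

// Splits `text` around `range` (start inclusive, end exclusive).
// A missing or out-of-bounds range highlights nothing.
fun splitForHighlight(text: String, range: Pair<Int, Int>?): HighlightParts {
    if (range == null) return HighlightParts(text, "", "")
    val (start, end) = range
    val valid = start >= 0 && start < end && end <= text.length
    if (!valid) return HighlightParts(text, "", "")
    return HighlightParts(
        before = text.substring(0, start),
        highlighted = text.substring(start, end),
        after = text.substring(end)
    )
}
```

For example, splitForHighlight("Hello world", 6 to 11) yields "Hello " as the part before and "world" as the highlighted part. In Compose, the same bounds check would guard the call that adds the background style to the AnnotatedString.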
💪 Chapter 3: The “Too Much Text!” Problem
The TextToSpeech engine isn’t a miracle worker. If you give it a 10,000-character essay, it will simply give up. The limit for a single speak() call is reported by TextToSpeech.getMaxSpeechInputLength(), which is 4000 characters.
The solution? We become smarter. We’ll chop the big text into smaller, digestible chunks (say, 3000 characters each). For a natural feel, we’ll try to chop at the end of a sentence (., !, ?).
We will feed the first chunk to the engine. When the onDone() callback tells us it’s finished, we simply feed it the next chunk.
Let’s create a data class for better handling of this case:
private data class TtsSession(
    val text: String,
    val chunks: List<String>,
    val chunkOffsets: List<Int>, // Start index of each chunk in text
    var currentChunkIndex: Int = 0,
)
- text: the original text.
- chunks: the original text split into chunks.
- chunkOffsets: the start index of each chunk in the original text.
- currentChunkIndex: the index of the chunk that is currently being spoken.
Let’s create a util function to split the original text into chunks:
class TextToSpeechManager(private val context: Context) {
    // ... existing code
    companion object {
        private const val TTS_CHUNK_SIZE_LIMIT = 3000

        private fun splitTextIntoChunks(text: String): List<String> {
            if (text.length <= TTS_CHUNK_SIZE_LIMIT) {
                return listOf(text)
            }
            val chunks = mutableListOf<String>()
            var remainingText = text
            while (remainingText.isNotEmpty()) {
                if (remainingText.length <= TTS_CHUNK_SIZE_LIMIT) {
                    chunks.add(remainingText)
                    break
                }
                val potentialChunk = remainingText.substring(0, TTS_CHUNK_SIZE_LIMIT)
                // Prefer the last sentence end inside the window...
                var splitIndex = potentialChunk.lastIndexOfAny(charArrayOf('.', '!', '?'))
                // ...then the last space, so a word is not cut in half...
                if (splitIndex == -1) {
                    splitIndex = potentialChunk.lastIndexOf(' ')
                }
                // ...and only then a hard cut at the limit.
                if (splitIndex == -1) {
                    splitIndex = TTS_CHUNK_SIZE_LIMIT - 1
                }
                chunks.add(remainingText.substring(0, splitIndex + 1))
                remainingText = remainingText.substring(splitIndex + 1)
            }
            return chunks
        }
    }
}
As mentioned, this method splits the original text into chunks. For a natural feel, it first looks for the last sentence end (‘.’, ‘!’, ‘?’) within the limit. If there is none, it falls back to the last space, so a word is never cut in half. For example, if a long word starts at index 2995, the chunk is not cut at 3000; it is cut at index 2994 (the space before that word). A word split in two would be spoken as two separate words, which sounds weird. Only when the window contains no space at all does the method cut at the hard limit.
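To see the three fallbacks in action, here is a small sketch that follows the same splitting rules but takes the limit as a parameter, so it can be tried on short strings. The limit parameter and the name splitIntoChunks are additions for this illustration only.

```kotlin
// Same rules as splitTextIntoChunks above; `limit` is a parameter here
// (an assumption for this sketch) so small inputs can be used.
fun splitIntoChunks(text: String, limit: Int): List<String> {
    require(limit > 0) { "limit must be positive" }
    if (text.length <= limit) return listOf(text)
    val chunks = mutableListOf<String>()
    var remaining = text
    while (remaining.length > limit) {
        val window = remaining.substring(0, limit)
        // 1) last sentence end, 2) last space, 3) hard cut at the limit
        var splitIndex = window.lastIndexOfAny(charArrayOf('.', '!', '?'))
        if (splitIndex == -1) splitIndex = window.lastIndexOf(' ')
        if (splitIndex == -1) splitIndex = limit - 1
        chunks.add(remaining.substring(0, splitIndex + 1))
        remaining = remaining.substring(splitIndex + 1)
    }
    if (remaining.isNotEmpty()) chunks.add(remaining)
    return chunks
}
```

With a limit of 12, the text "Hi there. How are you? Fine!" becomes "Hi there.", " How are ", and "you? Fine!". The first cut lands on a sentence end and the second on a space. Note that the chunks keep their whitespace, so joining them gives back the original text exactly; the highlight offsets depend on that.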
Let’s put it all together:
class TextToSpeechManager(...) {
    private var currentSession: TtsSession? = null

    suspend fun speak(text: String) {
        initialize()
        stop() // Stop any previous session
        // Create and store the new session
        val chunks = splitTextIntoChunks(text)
        // Use runningFold to correctly calculate start offsets for each chunk
        val offsets = chunks.runningFold(0) { acc, chunk -> acc + chunk.length }.dropLast(1)
        currentSession = TtsSession(text, chunks, offsets)
        speakCurrentChunk()
    }

    private fun speakCurrentChunk() {
        val session = currentSession ?: return
        val ttsInstance = tts ?: return
        val chunkIndex = session.currentChunkIndex
        if (chunkIndex >= session.chunks.size) {
            stop() // All chunks are done
            return
        }
        val chunk = session.chunks[chunkIndex]
        ttsInstance.speak(chunk, TextToSpeech.QUEUE_FLUSH, null, "chunk_${chunkIndex}")
    }
}
- In the speak method: we get the chunks from our splitTextIntoChunks() method and compute the chunkOffsets entry for each chunk. Then we store everything in the TtsSession state.
- In the speakCurrentChunk method: it’s pretty straightforward. We just get the chunk text at currentChunkIndex and feed it to the TTS speak method.
Finally, onDone() in our progress listener moves on to the next chunk:
override fun onDone(utteranceId: String?) {
    // This chunk is done, let's play the next one.
    currentSession?.let {
        it.currentChunkIndex++
        speakCurrentChunk()
    }
}
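The offset calculation in speak() is easy to get wrong by one element, so it helps to look at it in isolation. This is a small sketch; chunkStartOffsets is a hypothetical name for the expression used in speak().

```kotlin
// Start index of each chunk in the original text, assuming the chunks
// concatenate back to that text without gaps.
fun chunkStartOffsets(chunks: List<String>): List<Int> =
    // runningFold yields [0, len0, len0 + len1, ..., total];
    // the last value is the total length, not a start index, so drop it.
    chunks.runningFold(0) { acc, chunk -> acc + chunk.length }.dropLast(1)
```

For the chunks "Hi there.", " How are ", and "you? Fine!" (lengths 9, 9, and 10), the result is [0, 9, 18]: one start index per chunk.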
⚖️ Chapter 4: The Real Brain-Teaser — Pause, Resume, and All That Drama
Here’s where the real headache begins. TextToSpeech has a stop() method, but no pause() or resume() method. And stop() is destructive — it forgets everything it was supposed to say.
So, what to do? We have to build our own pause / resume logic. It’s a bit of a hack, but a well-engineered one!
The Master Plan:
- We will extend our TtsSession data class to be our “tracker”. It keeps the full text and, most importantly, a bookmark of where we last stopped.
- On Pause: We note that we are paused and then call tts.stop().
- On Resume: We check that we are paused, find our bookmark, chop the current chunk from that point, and give this new, smaller piece of text to the TTS engine to speak.
private data class TtsSession(
    // ...
    var lastSpokenTextEndIndex: Int = 0,
    var resumeOffsetInChunk: Int = 0
)

class TextToSpeechManager(...) {
    fun pause() {
        tts?.stop()
    }

    fun resume() {
        speakCurrentChunk(fromPause = true)
    }

    private fun speakCurrentChunk(fromPause: Boolean = false) {
        // existing code...
        val chunk = session.chunks[chunkIndex]
        val textToSpeak: String
        if (fromPause) {
            // We are resuming. Calculate the starting point within the chunk.
            val resumeIndex = session.lastSpokenTextEndIndex - session.chunkOffsets[chunkIndex]
            // Set the state for the onRangeStart listener to use.
            session.resumeOffsetInChunk = if (resumeIndex > 0) resumeIndex else 0
            textToSpeak = if (session.resumeOffsetInChunk < chunk.length) {
                chunk.substring(session.resumeOffsetInChunk)
            } else {
                // Paused at the very end, move to the next chunk.
                session.currentChunkIndex++
                speakCurrentChunk() // Recursive call for the next chunk
                return
            }
        } else {
            // Playing normally from the start of the chunk. Reset the offset.
            session.resumeOffsetInChunk = 0
            textToSpeak = chunk
        }
        ttsInstance.speak(textToSpeak, TextToSpeech.QUEUE_FLUSH, null, "chunk_${chunkIndex}")
    }
}
But wait, there’s a catch! When you resume, the onRangeStart() callback gives you indices based on the smaller, chopped text. This will mess up our highlighting!
The Fix: Our session “tracker” must also store an offset — the starting point of the current text chunk.
override fun onRangeStart(utteranceId: String?, start: Int, end: Int, frame: Int) {
    val session = currentSession ?: return
    // 1. Get the offset of the current chunk within the full text.
    val chunkOffset = session.chunkOffsets[session.currentChunkIndex]
    // 2. Get the resume offset we explicitly set in speakCurrentChunk().
    //    This is 0 for normal playback, and > 0 for resumed playback.
    val resumeOffset = session.resumeOffsetInChunk
    // 3. Calculate the true absolute start and end indices.
    //    Absolute Index = (Start of Chunk) + (Start of Spoken Substring) + (Start from TTS)
    val absoluteStart = chunkOffset + resumeOffset + start
    val absoluteEnd = chunkOffset + resumeOffset + end
    // 4. Save our position and emit the highlight for the UI.
    session.lastSpokenTextEndIndex = absoluteEnd
    coroutineScope.launch {
        _highlight.emit(Pair(absoluteStart, absoluteEnd))
    }
}
With this, our pause, resume, and highlighting work together like a charm.
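A worked example makes the index arithmetic easier to trust. The sketch below pulls the mapping out into a pure function; toAbsoluteRange is a hypothetical name, and it assumes the engine reports end as an exclusive index, which is how the highlight code treats it.

```kotlin
// Maps a range reported by the engine (relative to the string passed to speak())
// back to indices in the full original text.
fun toAbsoluteRange(
    chunkOffset: Int,         // where the current chunk starts in the full text
    resumeOffsetInChunk: Int, // how much of the chunk was skipped on resume (0 if none)
    start: Int,               // range start as reported by onRangeStart()
    end: Int                  // range end as reported by onRangeStart()
): Pair<Int, Int> {
    val base = chunkOffset + resumeOffsetInChunk
    return Pair(base + start, base + end)
}
```

Take the text "Hi there. How are you? Fine!" split into chunks starting at 0, 9, and 18. Suppose we pause in the third chunk right after "you?", so the bookmark is 22. On resume, the offset inside the chunk is 22 - 18 = 4 and the engine is given " Fine!". When it reports the word "Fine!" as the range 1 to 6, toAbsoluteRange(18, 4, 1, 6) gives 23 to 28, which is exactly where "Fine!" sits in the full text.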
👻 Chapter 5: The “Ghost Speaker” That Won’t Shut Up
This is a common mistake. The user leaves a screen, but the TextToSpeech keeps talking in the background. It’s annoying!
- App goes to the background: We can make our handler a LifecycleObserver and watch the entire app’s lifecycle using DefaultLifecycleObserver. When the app’s onPause() is called, we can automatically pause all our TextToSpeech sessions.
import androidx.lifecycle.DefaultLifecycleObserver
import androidx.lifecycle.LifecycleOwner

class TextToSpeechManager(...): DefaultLifecycleObserver {
    override fun onPause(owner: LifecycleOwner) {
        pause()
    }

    override fun onDestroy(owner: LifecycleOwner) {
        shutdown()
    }
}
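An observer only sees the lifecycle it is registered on. If the goal really is the whole app rather than a single screen, one option is ProcessLifecycleOwner from the androidx.lifecycle:lifecycle-process artifact. The sketch below assumes a single app-wide manager held by a hypothetical ReaderApplication class; the article itself does not prescribe this setup.

```kotlin
import android.app.Application
import androidx.lifecycle.ProcessLifecycleOwner

// Hypothetical Application subclass; only the addObserver line matters here.
class ReaderApplication : Application() {
    // Assumption: one manager shared by the whole app.
    val ttsManager by lazy { TextToSpeechManager(applicationContext) }

    override fun onCreate() {
        super.onCreate()
        // Observes the whole process, not a single Activity or screen.
        ProcessLifecycleOwner.get().lifecycle.addObserver(ttsManager)
    }
}
```

Two things to keep in mind with this choice: ProcessLifecycleOwner dispatches ON_PAUSE with a short delay after the last Activity is paused, and it never dispatches ON_DESTROY, so onDestroy() would not call shutdown() for you here.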
- User navigates back: We can stop the TextToSpeech when the user leaves the page. We expose a simple function, destroy(), to handle this. There are two ways to call it:
  - Using the ViewModel’s onCleared method.
  - Using DisposableEffect in a composable.
DisposableEffect(Unit) {
    onDispose {
        viewModel.destroy() // like calling shutdown in TTSManager
    }
}
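The ViewModel route could look roughly like the sketch below. ReaderViewModel is a hypothetical class for illustration; destroy() is the function mentioned above, and here it simply forwards to shutdown().

```kotlin
import android.app.Application
import androidx.lifecycle.AndroidViewModel

// Hypothetical ViewModel that owns the manager.
class ReaderViewModel(application: Application) : AndroidViewModel(application) {
    // The Application context avoids holding on to an Activity.
    val ttsManager = TextToSpeechManager(application)

    // Forwards to the manager; safe to call more than once.
    fun destroy() = ttsManager.shutdown()

    // Called once, when the ViewModel is about to be destroyed.
    override fun onCleared() {
        destroy()
        super.onCleared()
    }
}
```

The difference between the two options is timing: onCleared() runs when the ViewModel’s owner is gone for good, while onDispose runs as soon as the composable leaves the composition.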
😉 Bonus Tips:
There are a few more things we can add, like tracking the current state of the TTS instance (idle, speaking, or paused) so that the UI can react to it.
Here is the entire code for your reference:
sealed class TtsState {
    data object Idle : TtsState()
    data object Speaking : TtsState()
    data object Paused : TtsState()
}

private data class TtsSession(
    val text: String,
    val chunks: List<String>,
    val chunkOffsets: List<Int>,
    var currentChunkIndex: Int = 0,
    var lastSpokenTextEndIndex: Int = 0,
    var resumeOffsetInChunk: Int = 0
)
import android.content.Context
import android.speech.tts.TextToSpeech
import android.speech.tts.UtteranceProgressListener
import androidx.lifecycle.DefaultLifecycleObserver
import androidx.lifecycle.LifecycleOwner
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.launch
import kotlinx.coroutines.suspendCancellableCoroutine

class TextToSpeechManager(private val context: Context): DefaultLifecycleObserver {
    private val coroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.Main.immediate)
    private var tts: TextToSpeech? = null
    private var currentSession: TtsSession? = null

    private val _state = MutableStateFlow<TtsState>(TtsState.Idle)
    val state: StateFlow<TtsState> = _state

    private val _highlight = MutableStateFlow<Pair<Int, Int>?>(null)
    val highlight: StateFlow<Pair<Int, Int>?> = _highlight

    suspend fun speak(text: String) {
        initialize()
        stop()
        val chunks = splitTextIntoChunks(text)
        val offsets = chunks.runningFold(0) { acc, chunk -> acc + chunk.length }.dropLast(1)
        currentSession = TtsSession(text, chunks, offsets)
        _state.value = TtsState.Speaking
        speakCurrentChunk()
    }

    fun pause() {
        if (_state.value == TtsState.Speaking) {
            // Update the state first so that a late onDone() does not start the next chunk.
            _state.value = TtsState.Paused
            tts?.stop()
        }
    }

    fun resume() {
        if (_state.value == TtsState.Paused) {
            _state.value = TtsState.Speaking
            speakCurrentChunk(fromPause = true)
        }
    }

    fun stop() {
        tts?.stop()
        currentSession = null
        _state.value = TtsState.Idle
        _highlight.value = null
    }

    fun shutdown() {
        stop()
        tts?.shutdown()
        tts = null
        coroutineScope.cancel()
    }

    private fun speakCurrentChunk(fromPause: Boolean = false) {
        val session = currentSession ?: return
        val ttsInstance = tts ?: return
        val chunkIndex = session.currentChunkIndex
        if (chunkIndex >= session.chunks.size) {
            stop()
            return
        }
        val chunk = session.chunks[chunkIndex]
        val textToSpeak: String
        if (fromPause) {
            val resumeIndex = session.lastSpokenTextEndIndex - session.chunkOffsets[chunkIndex]
            session.resumeOffsetInChunk = if (resumeIndex > 0) resumeIndex else 0
            textToSpeak = if (session.resumeOffsetInChunk < chunk.length) {
                chunk.substring(session.resumeOffsetInChunk)
            } else {
                session.currentChunkIndex++
                speakCurrentChunk()
                return
            }
        } else {
            session.resumeOffsetInChunk = 0
            textToSpeak = chunk
        }
        ttsInstance.speak(textToSpeak, TextToSpeech.QUEUE_FLUSH, null, "chunk_${chunkIndex}")
    }

    private suspend fun initialize() {
        if (tts != null) return
        tts = suspendCancellableCoroutine { continuation ->
            var ttsInstance: TextToSpeech? = null
            val listener = TextToSpeech.OnInitListener { status ->
                if (status == TextToSpeech.SUCCESS) {
                    ttsInstance?.let {
                        it.setOnUtteranceProgressListener(createProgressListener())
                        if (continuation.isActive) {
                            continuation.resume(it)
                        }
                    }
                } else {
                    if (continuation.isActive) {
                        continuation.resumeWithException(Exception("Init failed."))
                    }
                }
            }
            ttsInstance = TextToSpeech(context, listener)
            continuation.invokeOnCancellation {
                ttsInstance?.shutdown()
            }
        }
    }

    private fun createProgressListener() = object : UtteranceProgressListener() {
        override fun onStart(utteranceId: String?) {}

        override fun onError(utteranceId: String?) { stop() }

        override fun onDone(utteranceId: String?) {
            if (_state.value == TtsState.Speaking) {
                currentSession?.let {
                    it.currentChunkIndex++
                    speakCurrentChunk()
                }
            }
        }

        override fun onRangeStart(utteranceId: String?, start: Int, end: Int, frame: Int) {
            val session = currentSession ?: return
            val chunkOffset = session.chunkOffsets[session.currentChunkIndex]
            val resumeOffset = session.resumeOffsetInChunk
            val absoluteStart = chunkOffset + resumeOffset + start
            val absoluteEnd = chunkOffset + resumeOffset + end
            session.lastSpokenTextEndIndex = absoluteEnd
            coroutineScope.launch {
                _highlight.emit(Pair(absoluteStart, absoluteEnd))
            }
        }
    }

    companion object {
        private const val TTS_CHUNK_SIZE_LIMIT = 3000

        private fun splitTextIntoChunks(text: String): List<String> {
            if (text.length <= TTS_CHUNK_SIZE_LIMIT) {
                return listOf(text)
            }
            val chunks = mutableListOf<String>()
            var remainingText = text
            while (remainingText.isNotEmpty()) {
                if (remainingText.length <= TTS_CHUNK_SIZE_LIMIT) {
                    chunks.add(remainingText)
                    break
                }
                val potentialChunk = remainingText.substring(0, TTS_CHUNK_SIZE_LIMIT)
                var splitIndex = potentialChunk.lastIndexOfAny(charArrayOf('.', '!', '?'))
                if (splitIndex == -1) {
                    splitIndex = potentialChunk.lastIndexOf(' ')
                }
                if (splitIndex == -1) {
                    splitIndex = TTS_CHUNK_SIZE_LIMIT - 1
                }
                chunks.add(remainingText.substring(0, splitIndex + 1))
                remainingText = remainingText.substring(splitIndex + 1)
            }
            return chunks
        }
    }

    override fun onPause(owner: LifecycleOwner) {
        if (_state.value == TtsState.Speaking) {
            pause()
        }
    }

    override fun onDestroy(owner: LifecycleOwner) {
        shutdown()
    }
}
Here is the sample compose UI:
@Composable
fun RecipeReaderScreen() {
    val context = LocalContext.current
    val lifecycleOwner = androidx.lifecycle.compose.LocalLifecycleOwner.current

    // Remember the manager across recompositions
    val ttsManager = remember { TextToSpeechManager(context) }

    // Clean up the manager when the composable leaves the screen
    DisposableEffect(lifecycleOwner) {
        onDispose {
            ttsManager.shutdown()
        }
    }

    DisposableEffect(lifecycleOwner) {
        // When the Composable enters the composition, add the observer.
        lifecycleOwner.lifecycle.addObserver(ttsManager)
        // When the Composable leaves the composition, remove the observer.
        // Removing it does not trigger onDestroy(); the effect above
        // calls shutdown() to clean up all resources.
        onDispose {
            lifecycleOwner.lifecycle.removeObserver(ttsManager)
        }
    }

    val recipeText = "First, preheat the oven to 180 degrees Celsius. " +
        "While it heats, whisk together the flour, sugar, and cocoa powder in a large bowl. " +
        "Slowly mix in the eggs and milk until the batter is smooth. " +
        "Finally, pour the batter into a greased baking pan and bake for 30 minutes. Enjoy your delicious cake!"

    // Collect states from our manager to drive the UI
    val ttsState by ttsManager.state.collectAsState()
    val highlight by ttsManager.highlight.collectAsState()
    val coroutineScope = rememberCoroutineScope()

    // Create the annotated string for highlighting text
    val annotatedText = buildAnnotatedString {
        append(recipeText)
        highlight?.let { (start, end) ->
            if (start in recipeText.indices && end <= recipeText.length && start < end) {
                addStyle(
                    style = SpanStyle(
                        fontWeight = FontWeight.ExtraBold,
                        background = Color(0xFFFFF59D) // A pleasant yellow
                    ),
                    start = start,
                    end = end
                )
            }
        }
    }

    Column(
        modifier = Modifier
            .fillMaxSize()
            .padding(16.dp),
        horizontalAlignment = Alignment.CenterHorizontally
    ) {
        Text(
            text = "Chocolate Lava Cake Recipe",
            style = MaterialTheme.typography.headlineMedium,
            modifier = Modifier.padding(bottom = 16.dp)
        )
        Card(
            modifier = Modifier
                .fillMaxWidth()
                .weight(1f),
            elevation = CardDefaults.cardElevation(4.dp)
        ) {
            Text(
                text = annotatedText,
                style = MaterialTheme.typography.bodyMedium,
                modifier = Modifier.padding(16.dp)
            )
        }
        Spacer(modifier = Modifier.height(24.dp))
        Row(
            horizontalArrangement = Arrangement.Center,
            modifier = Modifier.fillMaxWidth()
        ) {
            // The main Play/Pause/Resume Button
            Button(
                onClick = {
                    when (ttsState) {
                        TtsState.Idle -> {
                            coroutineScope.launch { ttsManager.speak(recipeText) }
                        }
                        TtsState.Paused -> {
                            ttsManager.resume()
                        }
                        TtsState.Speaking -> {
                            ttsManager.pause()
                        }
                    }
                },
                modifier = Modifier.height(48.dp)
            ) {
                val buttonText = when (ttsState) {
                    TtsState.Idle -> "Read Recipe"
                    TtsState.Speaking -> "Pause"
                    TtsState.Paused -> "Resume"
                }
                Text(buttonText)
            }
            Spacer(modifier = Modifier.width(16.dp))
            // The Stop Button
            Button(
                onClick = { ttsManager.stop() },
                enabled = ttsState != TtsState.Idle,
                modifier = Modifier.height(48.dp),
                colors = ButtonDefaults.buttonColors(
                    containerColor = MaterialTheme.colorScheme.error,
                    contentColor = Color.White
                )
            ) {
                Text("Stop")
            }
        }
    }
}
If you run the app, you will see something like this:

In the video sample, I have demonstrated the Speak, Pause, Resume, and Highlight functionality. Because the sample is embedded on Medium.com as a GIF image, it has no audio.
🧠 Conclusion:
And there we have it! We started with a simple, buggy piece of code and tackled every problem one by one. Our final TextToSpeech handler is now a thing of beauty:
- It waits patiently for the engine to be ready.
- It highlights words beautifully.
- It handles Pause and Resume like a pro.
- It doesn’t choke on large texts.
- It behaves itself when the user leaves the app or the screen.
💬 Let’s Talk
Have you tried building your own Text To Speech handler? Have you added any other features, or handled something in a better way? I’d love to hear from you!

📚 My Articles:
This article was previously published on proandroiddev.com.


