📌 Introduction

Lately, I’ve been interested in how AI can run directly on Android devices, without sending data to the cloud. Thanks to modern tools and better hardware, it’s now possible to handle tasks like image recognition and text generation locally, on the device. This brings several benefits: it’s faster, more secure, protects user privacy, and can save money since you’re not relying on paid APIs or cloud services. Tools like LiteRT (the newer, simpler version of TensorFlow Lite) and MediaPipe make it easy to bring these features into Android apps.

I started this project out of pure curiosity — to explore what on-device AI can do on Android. If you’re an Android developer curious about adding AI features into your app without relying on the cloud, this could be a good place to start exploring.

To test out these ideas, I created a Cats & Dogs demo app, where I experimented with on-device image classification, image captioning, and lightweight LLM integration. I’ll explain the process in this article.

🔍 Project Overview

The app fetches and displays a random image of a cat or a dog, classifies the fetched image, and then generates a caption for it. Finally, it uses the generated caption as a prompt for a lightweight Gemma 3 LLM to generate an image description. All steps (except image fetching) are performed on-device, without an internet connection.

Showcase app through iterations

🧪 Iteration 1: Image Classification

The first goal was to implement image classification. After loading an image via the Coil library, I used an ImageRequest to extract a bitmap. This bitmap needed preprocessing before being fed into the model, which involved resizing it and converting it into a byte buffer.

Stages in image classification

Loading and running a LiteRT model involves the following steps:

  1. Loading the model into memory.
  2. Building an Interpreter based on an existing model.
  3. Setting input tensor values.
  4. Invoking inferences.
  5. Outputting tensor values.

LiteRT is used on Android with Google Play services, Android’s official ML inference runtime, to run high-performance ML inference in your app.

LiteRT for ML runtime: https://developer.android.com/ai/custom

This logic is placed inside the ImageClassifier class. In this class I initialised an Interpreter from the LiteRT library with a .tflite model loaded into memory. The model was trained using the Transfer Learning Colab notebook and exported as a TensorFlow Lite model. Additionally, we can configure interpreter options, e.g. the number of threads used for executing operations on the CPU, or whether to enable the Neural Networks API (NNAPI) for model execution.

// Load the .tflite model from assets into memory
val modelByteBuffer = assetManager.loadModelFile(modelPath)
// Configure interpreter options: CPU thread count and NNAPI acceleration
val options = Interpreter.Options().apply {
    setNumThreads(NUM_OF_THREADS)
    setUseNNAPI(true)
}
val interpreter = Interpreter(modelByteBuffer, options)
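
The loadModelFile call above is a project helper rather than a framework API. A minimal sketch of such a helper, assuming the model is bundled in the app’s assets folder, could memory-map the file like this:

import android.content.res.AssetManager
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Hypothetical helper: memory-maps a model file bundled in assets into a ByteBuffer
fun AssetManager.loadModelFile(modelPath: String): MappedByteBuffer {
    val fileDescriptor = openFd(modelPath)
    FileInputStream(fileDescriptor.fileDescriptor).use { inputStream ->
        return inputStream.channel.map(
            FileChannel.MapMode.READ_ONLY,
            fileDescriptor.startOffset,
            fileDescriptor.declaredLength
        )
    }
}

Depending on the build configuration, the .tflite asset may need to be excluded from compression so that openFd can open it.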

After interpreter instantiation, we can use its run method to execute inference:

fun recognizeImage(bitmap: Bitmap): List<Recognition> {
    // Resize the bitmap to the model's expected input dimensions
    val scaledBitmap = bitmap.scale(inputSize, inputSize, false)
    // Convert pixels into the model's input tensor format
    val byteBuffer = convertBitmapToByteBuffer(scaledBitmap)
    // One output tensor holding a confidence score per label
    val result = Array(OUTPUT_TENSORS_COUNT) { FloatArray(labels.size) }
    interpreter.run(byteBuffer, result)
    return getSortedResult(result)
}
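
convertBitmapToByteBuffer is another project-specific helper that is not shown here. A rough sketch for a float32 model follows; inputSize comes from the surrounding class, and the IMAGE_MEAN / IMAGE_STD normalisation constants and channel order are assumptions that must match how the model was trained (a quantized model would write single bytes instead of floats):

import android.graphics.Bitmap
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical conversion of a bitmap into a float32 input tensor buffer.
// IMAGE_MEAN and IMAGE_STD are assumed constants (e.g. 127.5f each).
private fun convertBitmapToByteBuffer(bitmap: Bitmap): ByteBuffer {
    val byteBuffer = ByteBuffer
        .allocateDirect(4 * inputSize * inputSize * 3) // 4 bytes per float, 3 channels
        .order(ByteOrder.nativeOrder())
    val pixels = IntArray(inputSize * inputSize)
    bitmap.getPixels(pixels, 0, inputSize, 0, 0, inputSize, inputSize)
    for (pixel in pixels) {
        byteBuffer.putFloat((((pixel shr 16) and 0xFF) - IMAGE_MEAN) / IMAGE_STD) // R
        byteBuffer.putFloat((((pixel shr 8) and 0xFF) - IMAGE_MEAN) / IMAGE_STD)  // G
        byteBuffer.putFloat(((pixel and 0xFF) - IMAGE_MEAN) / IMAGE_STD)          // B
    }
    return byteBuffer
}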
🧪 Iteration 2: Image Captioning

The next step was to generate a caption using a two-stage “Show and Tell” TFLite model. It combines:

  • CNN model — processes the image and extracts a 1024-dimensional feature vector.
  • LSTM language model — generates text based on those visual features.

I used the CNN and LSTM models from the ImageCaptioningAndroid repository.

The CNN model uses a 346x346x3 image as input (height x width x RGB channels) and outputs a 1024-dimensional feature vector, which then serves as input to the LSTM model. The LSTM generates a caption, delimited with special start/end tags.

fun generateCaption(bitmap: Bitmap): String {
    val imageFeed = preprocessImage(bitmap)
    val stateFeed = runCNNInference(imageFeed)
    return generateCaptionText(stateFeed)
}

The first step in generating a caption from an image is to preprocess the image bitmap. The preprocessImage method converts the input bitmap into a normalised float array suitable as CNN model input. We resize the image to the model’s expected inputSize, initialise a 3D array representing [height][width][channels] with three RGB channels, and finally populate it with normalised RGB values.

private fun preprocessImage(bitmap: Bitmap): Array<Array<FloatArray>> {
    val scaledBitmap = bitmap.scale(inputSize, inputSize, false)
    val imageFeed = Array(inputSize) { Array(inputSize) { FloatArray(3) } }
    for (i in 0 until inputSize) {
        for (j in 0 until inputSize) {
            val pixelValue = scaledBitmap[i, j]
            imageFeed[i][j][0] = ((pixelValue shr 16) and 0xFF) / IMAGE_STD
            imageFeed[i][j][1] = ((pixelValue shr 8) and 0xFF) / IMAGE_STD
            imageFeed[i][j][2] = (pixelValue and 0xFF) / IMAGE_STD
        }
    }
    return imageFeed
}

The output of the preprocessImage method serves as the input for the runCNNInference method, which runs the CNN model to extract image features as the initial LSTM state.

private fun runCNNInference(imageFeed: Array<Array<FloatArray>>): Array<FloatArray> {
    val lstmInitialState = cnnInterpreter.getOutputIndex("import/lstm/initial_state")
    val stateFeed = Array(1) { FloatArray(LSTM_STATE_SIZE) }
    // Run the CNN to get the image features as the initial LSTM state
    val outputsCnn = hashMapOf<Int, Any>(lstmInitialState to stateFeed)
    cnnInterpreter.runForMultipleInputsOutputs(arrayOf(imageFeed), outputsCnn)
    return stateFeed
}

First, we identify the tensor index in the CNN output that corresponds to the LSTM’s initial state. Then, we create a 2D float array to hold the LSTM initial state (batch size = 1). Next, we use the preprocessed image as input to the CNN and return the initial LSTM state needed for caption generation. Using that output, the generateCaptionText method runs the LSTM model iteratively to generate the caption one word at a time.

private fun generateCaptionText(stateFeed: Array<FloatArray>): String {
    val softmax = Array(1) { FloatArray(VOCABULARY_SIZE) }
    val lstmState = Array(1) { FloatArray(LSTM_STATE_SIZE) }
    // Set up the LSTM output buffers
    val outputsLstm = hashMapOf<Int, Any>(
        lstmInterpreter.getOutputIndex("import/softmax") to softmax,
        lstmInterpreter.getOutputIndex("import/lstm/state") to lstmState
    )
    val words = mutableListOf<Int>()
    val inputFeed = Array(1) { LongArray(1) }
    repeat(MAX_CAPTION_LENGTH) {
        lstmInterpreter.runForMultipleInputsOutputs(arrayOf(inputFeed, stateFeed), outputsLstm)
        val maxId = softmax[0].findMaxId()
        if (maxId == vocabulary.getClosingTagWordIndex()) return buildCaption(words.toList())
        words.add(maxId)
        inputFeed[0][0] = maxId.toLong()
        stateFeed[0] = lstmState[0].copyOf()
    }
    return buildCaption(words.toList())
}
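
Helpers like findMaxId and buildCaption are small project utilities rather than LiteRT APIs. The argmax step, for example, could look roughly like this sketch:

// Hypothetical argmax helper: index of the highest softmax score,
// i.e. the most likely next word id
private fun FloatArray.findMaxId(): Int {
    var maxId = 0
    for (i in indices) {
        if (this[i] > this[maxId]) maxId = i
    }
    return maxId
}

buildCaption then maps the collected word ids back to words through the vocabulary and joins them into the final caption, without the special start/end tags.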
🧪 Iteration 3: LLM Integration with MediaPipe

In the final step, I used the caption as input for a lightweight LLM to generate a descriptive sentence. I integrated the MediaPipe LLM Inference API, which provides the LlmInference class to facilitate this.

https://developers.googleblog.com/en/large-language-models-on-device-with-mediapipe-and-tensorflow-lite

With an instance of LlmInference, I created an LlmInferenceSession and provided the generated caption as a prompt via the addQueryChunk method. Finally, I started generating output with the generateResponseAsync method. This allowed the app to provide a more human-like image description.

fun generateLlmDescription(
    prompt: String,
    progressListener: ProgressListener<String>
): ListenableFuture<String> {
    if (!::llmInference.isInitialized) {
        if (model.isDownloaded(context)) {
            llmInference = createEngine(context)
        } else {
            // Model not downloaded yet: return a pending future that never completes
            return SettableFuture.create()
        }
    }
    // Close any previous session and start a fresh one for this prompt
    if (::llmInferenceSession.isInitialized) {
        llmInferenceSession.close()
    }
    llmInferenceSession = createSession()
    llmInferenceSession.addQueryChunk(prompt)
    return llmInferenceSession.generateResponseAsync(progressListener)
}

private fun createEngine(context: Context) = try {
    LlmInference.createFromOptions(context, getInferenceOptions())
} catch (e: Exception) {
    Log.e(TAG, "Load model error: ${e.message}", e)
    throw IllegalStateException("Failed to load model")
}

private fun getInferenceOptions() = LlmInference.LlmInferenceOptions.builder()
    .setModelPath(model.path(context))
    .setMaxTokens(MAX_TOKENS)
    .apply { model.preferredBackend?.let { setPreferredBackend(it) } }
    .build()

private fun createSession(): LlmInferenceSession = try {
    LlmInferenceSession.createFromOptions(llmInference, getSessionOptions())
} catch (e: Exception) {
    Log.e(TAG, "LlmInferenceSession create error: ${e.message}", e)
    throw IllegalStateException("Failed to create model session")
}

private fun getSessionOptions() = LlmInferenceSessionOptions.builder()
    .setTemperature(model.temperature)
    .setTopK(model.topK)
    .setTopP(model.topP)
    .build()
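
For completeness, here is a sketch of how a caller might consume the streamed output. The two-argument listener shape (partial result plus a done flag) follows the MediaPipe LLM Inference samples; caption is the text produced in the previous iteration, and appendToDescription is a hypothetical UI-side helper:

// Hypothetical call site: stream partial LLM output into the UI as it arrives
val listener = ProgressListener<String> { partialResult, done ->
    appendToDescription(partialResult) // hypothetical UI-side helper
    if (done) {
        // The full description has been generated
    }
}
val descriptionFuture = generateLlmDescription(prompt = caption, progressListener = listener)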


🏁 Outcome

This project was a deep dive into the practical aspects of using on-device AI on Android. These are my key takeaways and experiences:

  • LiteRT abstracts away a lot of TensorFlow Lite boilerplate, and integrating models is almost seamless.
  • Two-stage pipelines can run efficiently. Even multi-model architectures like image → caption → description can work responsively on modern devices.
  • MediaPipe makes LLM integration straightforward and performance was suitable for a lightweight prompt.

https://github.com/stevan-milovanovic/LiteRT-for-Android

This project started as a personal learning exercise, and it showed me how easy it is to bring on-device AI features to Android users.

🧩 What’s Next

One possible idea to extend the app would be adding an offline image generation model like FastGAN. That would make the entire pipeline (image creation, classification, captioning, and description) fully on-device.

👉 Check out the project: GitHub Repository

💬 I’d love to hear your feedback, ideas and suggestions for improvements!

#Android #OnDeviceAI #LiteRT #MediaPipe #Gemma #LLM #TFLite #ImageClassification #ImageCaptioning #GenAI #JetpackCompose

This article was previously published on proandroiddev.com.
