Build Live Text on Android

Blog Infos

Author

Peng Jiang

Published

29. September 2021

Topics

Jetpack Compose

Posted By: Peng Jiang

One of the new features announced at WWDC this year was Live Text. It allows users to select, translate, and search for text found within any image. The demo used during the keynote was a meeting whiteboard with handwritten text. When you open the camera app on an iPhone and point it at the whiteboard, a small indicator appears in the bottom right corner to show that text has been recognized in the viewfinder.

Android users may notice that Live Text works much like Google Lens. However, Google Lens isn’t as integrated into the Android experience as Live Text is on iOS, where you can trigger Live Text from any text input field. In this post, I would like to show you how to build a simple version of Live Text with Jetpack Compose and CameraX. Here are some screenshots from the demo app. You can find the demo app code here.

Requirements

Before we dive into the implementation, let’s first define the basic functionality of Live Text. If you have seen the Live Text demo or used Google Lens, the basic steps are:

Recognize the text in the image or camera frame.
Handle the text selection.
Understand the text content, such as an address, phone number, or email, and suggest potential actions.
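
The third step can start out much simpler than a machine learning model. As a minimal sketch, assuming plain regular expressions are good enough for a first pass, the snippet below maps a piece of recognized text to a suggested action. The `SuggestedAction` enum, the `suggestAction` function, and the patterns themselves are hypothetical and are not part of the demo app.

```kotlin
// Hypothetical action types for illustration; the demo app does not define these.
enum class SuggestedAction { EMAIL, URL, PHONE, NONE }

// Deliberately loose patterns: they favour recall over strict validation.
val EMAIL_REGEX = Regex("""[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}""")
val URL_REGEX = Regex("""(https?://|www\.)\S+""", RegexOption.IGNORE_CASE)
val PHONE_REGEX = Regex("""\+?\d[\d\s().-]{6,}\d""")

// Returns the first action whose pattern occurs anywhere in the text.
// Email is checked before URL so that "jane@example.com" is not treated as a link.
fun suggestAction(text: String): SuggestedAction {
    val trimmed = text.trim()
    return when {
        EMAIL_REGEX.containsMatchIn(trimmed) -> SuggestedAction.EMAIL
        URL_REGEX.containsMatchIn(trimmed) -> SuggestedAction.URL
        PHONE_REGEX.containsMatchIn(trimmed) -> SuggestedAction.PHONE
        else -> SuggestedAction.NONE
    }
}
```

Addresses are much harder to match with a pattern, so they are left out here; that kind of content is where a trained model would likely be needed.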

You may notice that this is essentially a basic optical character recognition (OCR) process with context-aware suggestions. However, when I started to dig deeper and implement the features, difficulties came up, especially from the user experience point of view.

App Structure

Because the demo uses the ViewModel and other Jetpack libraries, the app structure is very simple. The following diagram shows the components and data flows:

The activity hosts three composable views, which display the preview, the detected text content, and the action button.

The CameraX library is used to handle the preview and image frame analysis. The image analysis use case provides a CPU-accessible image on which to perform image processing for each frame. The analyzer is set to work in non-blocking mode (using the STRATEGY_KEEP_ONLY_LATEST flag), so it always receives the latest image frame and checks it for text.
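
As a rough sketch of that configuration, the function below builds an ImageAnalysis use case with the keep-only-latest strategy and attaches an analyzer to it. The function name `buildTextAnalysis` and its parameters are assumptions made for illustration; the demo app may wire this up differently.

```kotlin
import androidx.camera.core.ImageAnalysis
import java.util.concurrent.Executor

// Hypothetical helper: builds an analysis use case that never queues frames.
// While the analyzer is busy, CameraX drops older frames and keeps only the
// newest one, so the preview is not held back by slow text recognition.
fun buildTextAnalysis(
    analysisExecutor: Executor,          // executor the analyzer runs on (assumed to be a background one)
    analyzer: ImageAnalysis.Analyzer     // for example, the text analyzer described in this post
): ImageAnalysis {
    val imageAnalysis = ImageAnalysis.Builder()
        .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
        .build()
    imageAnalysis.setAnalyzer(analysisExecutor, analyzer)
    return imageAnalysis
}
```

The returned use case would then be bound together with the preview use case, typically through `ProcessCameraProvider.bindToLifecycle`. With this strategy the analyzer has to close each `ImageProxy` when it is done with it, otherwise no further frames are delivered.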

In this demo app, the Huawei OCR library is used to handle the OCR process. You can also use other OCR libraries, such as Firebase ML Kit. The OCR service can run in the cloud or on the device. The cloud text recognition service recognizes text with higher accuracy and supports more languages. The on-device version, a small ML model that is bundled with your app, can process frames in real time but has limited language support; you can find more details here. The recognition steps below are similar across most OCR libraries, so you can replace the library with the one you prefer.

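	import android.annotation.SuppressLint
	import androidx.camera.core.ImageAnalysis
	import androidx.camera.core.ImageProxy
	import com.huawei.hms.mlsdk.MLAnalyzerFactory
	import com.huawei.hms.mlsdk.common.MLFrame
	import com.huawei.hms.mlsdk.text.MLLocalTextSetting
	import com.huawei.hms.mlsdk.text.MLText

	class TextAnalyzer(private val onTextDetected: (MLText) -> Unit) : ImageAnalysis.Analyzer {
	    // Tracking mode is intended for video streams; only English is recognized.
	    private val setting = MLLocalTextSetting.Factory()
	        .setOCRMode(MLLocalTextSetting.OCR_TRACKING_MODE)
	        .setLanguage("en")
	        .create()

	    private val analyzer = MLAnalyzerFactory.getInstance().getLocalTextAnalyzer(setting)

	    @SuppressLint("UnsafeOptInUsageError")
	    override fun analyze(imageProxy: ImageProxy) {
	        val image = imageProxy.image
	        if (image == null) {
	            // Nothing to analyze; close the proxy so CameraX can deliver the next frame.
	            imageProxy.close()
	            return
	        }
	        analyzer.asyncAnalyseFrame(MLFrame.fromMediaImage(image, imageProxy.imageInfo.rotationDegrees))
	            .addOnSuccessListener { mlText ->
	                mlText?.let { onTextDetected.invoke(it) }
	                // Always close the frame, otherwise no new frames arrive.
	                imageProxy.close()
	            }.addOnFailureListener {
	                imageProxy.close()
	            }
	    }
	}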

TextAnalyzer

Create a text analyzer instance (MLTextAnalyzer) to recognize text in the camera frame. You can use MLLocalTextSetting to specify the supported languages and the OCR detection mode. There are two detection modes: one for single image detection (OCR_DETECT_MODE) and one for video stream text detection (OCR_TRACKING_MODE). In tracking mode, the detection result of the preceding frame is used as the basis to quickly detect the text position in the current camera frame.
Create an MLFrame object using MLFrame.fromMediaImage. You need to pass the rotation value when creating the MLFrame, but if you are using CameraX, the ImageProxy passed to ImageAnalysis.Analyzer provides the rotation value (imageProxy.imageInfo.rotationDegrees) for you.
Pass the MLFrame object to the asyncAnalyseFrame method for text recognition. The recognition result is returned as an array of MLText.Block objects.

Display the text and action

Once the text is recognized, the library returns the detected text content and the vertices of each text bounding box. The share action button is shown, and the detected text is displayed on top of the image. Tapping a piece of text copies it to the clipboard. You can also tap the share button to share all the text detected in the image frame.

To display the detected text, I first thought of using Text components placed above the preview, but the performance was not good, especially with lots of text blocks. I then switched to drawing the text on a canvas, which looks better, but handling taps needs more code: you have to check whether the touch point falls inside any text block area. The final result looks like this:
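
The touch check mentioned above can be sketched as a point-in-polygon test over the vertices of each text block. This is a minimal illustration, not the demo app's code: the `Vertex` and `TextBlock` types and the `findTappedBlock` function are hypothetical, and the sketch assumes the vertices have already been mapped into the same coordinate space as the touch event.

```kotlin
// Hypothetical model of one recognized block: its text plus the corner points
// of its bounding box, listed in order around the shape.
data class Vertex(val x: Float, val y: Float)
data class TextBlock(val text: String, val vertices: List<Vertex>)

// Ray-casting test: count how many polygon edges a horizontal ray starting at
// the touch point crosses. An odd count means the point is inside.
// This also works for rotated (non axis-aligned) boxes.
fun TextBlock.contains(touch: Vertex): Boolean {
    var inside = false
    var j = vertices.lastIndex
    for (i in vertices.indices) {
        val a = vertices[i]
        val b = vertices[j]
        // The first condition guarantees a.y != b.y, so the division is safe.
        val crosses = (a.y > touch.y) != (b.y > touch.y) &&
            touch.x < (b.x - a.x) * (touch.y - a.y) / (b.y - a.y) + a.x
        if (crosses) inside = !inside
        j = i
    }
    return inside
}

// Returns the first block that contains the touch point, or null if none does.
fun findTappedBlock(blocks: List<TextBlock>, touch: Vertex): TextBlock? =
    blocks.firstOrNull { it.contains(touch) }
```

A polygon test is used rather than a simple rectangle check because text in a camera frame is often tilted, so its bounding box is not necessarily axis-aligned.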

Conclusion

After finishing this simple demo, I think the most difficult part is still the text recognition. The demo app only detects English characters. If the camera stream contains characters from more than one language, or the characters are not easily recognised, the demo app’s performance drops dramatically. On-device OCR can improve privacy protection, but it makes handling multiple languages at the same time more difficult. I think that may be why Live Text on iOS 15 only recognizes seven languages in its initial release.

The second difficulty is the user experience, especially when users try to select the detected text. Drawing text on the canvas displays the text more accurately, but it makes selection more difficult. As I mentioned at the beginning, Live Text is integrated into the iOS system, so it needs to be fast and lightweight. My current implementation uses around 130 MB of memory at runtime and 18 MB of installed space, including the OCR model of around 500 KB.

Another part I didn’t touch in this demo is on-device machine learning, which takes you from the detected text content to an understanding of its context. As more and more actions are triggered, the local model can be improved based on the feedback. Feel free to fork the repo and create your own Live Text.