Image credit: Steve Johnson
Last week we looked at the basics of the CameraX library. That laid the foundations for something really exciting … AI vision! Now we can use your Android device to interpret and understand the physical world around us.
AI vision has incredible potential, like recognising what’s in a photo, separating/masking areas in images, or recognising body poses, smiles and other gestures. And this can all run on your phone — no need for internet access, or to pass your camera data to a third party.
Building a hand gesture recogniser
In this article we’re going to demo this by building a hand gesture recogniser. Like this:
CameraX delivers a live stream to an AI, which recognises my gestures
In the example, my hand gestures are being turned into emojis. We’ll use two libraries for this:
- CameraX is the Android Jetpack library which makes using camera features on Android much, much easier. (If you’ve ever tried to build a camera experience without it, you’ll know what I mean 😆).
- MediaPipe is a great cross-platform library for on-device AI tasks. It has a vast range of uses, including driving AI models for audio, video and text tasks. We’re going to use it with the gesture_recognizer model to recognise hand gestures.
We’ll start with the CameraX side of things and get a live stream of frames from the camera. Then we’ll use MediaPipe to asynchronously deliver those frames to the AI model for its analysis. And, of course, all processing will be done on the device.
A working app demonstrating all the code in this article is available here: https://github.com/tdcolvin/MediaPipeCameraXDemo
Reminder: CameraX’s use cases
CameraX is good for four specific tasks, which it calls use cases. In my last article we built a sample which used two of those use cases:
- The Preview use case allows us to display what the camera is pointing at (like a viewfinder)
- The Image Capture use case allows us to take photos
The other two use cases are Video Capture (for, um, capturing video) and Image Analysis (for receiving a live stream of video frames).
We’re going to use the Image Analysis use case here. The live stream of frames it provides is going to be delivered directly to MediaPipe, which will in turn use them to crank the handle on our AI model.
Step 1: Add CameraX library and set up the Preview use case
My last article showed you how to add the CameraX dependency to Gradle, how to add the PreviewView, how to link it to the Preview use case, and how to bind all that together using a CameraProvider.
If you’ve not used CameraX at all before then I’d recommend you start with that. There are some concepts to get your head around as a prerequisite to this article.
Here, we’ll start with a version of the CameraPreview composable that we built previously. It’s the same as before, but for simplicity I’ve taken out support for camera switching, zoom, and image capture:
@Composable
fun CameraPreview(
    modifier: Modifier = Modifier
) {
    val previewUseCase = remember { androidx.camera.core.Preview.Builder().build() }
    var cameraProvider by remember { mutableStateOf<ProcessCameraProvider?>(null) }
    val localContext = LocalContext.current

    fun rebindCameraProvider() {
        cameraProvider?.let { cameraProvider ->
            val cameraSelector = CameraSelector.Builder()
                .requireLensFacing(CameraSelector.LENS_FACING_FRONT)
                .build()
            cameraProvider.unbindAll()
            cameraProvider.bindToLifecycle(
                localContext as LifecycleOwner,
                cameraSelector,
                previewUseCase
            )
        }
    }

    LaunchedEffect(Unit) {
        cameraProvider = ProcessCameraProvider.awaitInstance(localContext)
        rebindCameraProvider()
    }

    AndroidView(
        modifier = modifier.fillMaxSize(),
        factory = { context ->
            PreviewView(context).also {
                previewUseCase.surfaceProvider = it.surfaceProvider
                rebindCameraProvider()
            }
        }
    )
}
Step 2: Create the ImageAnalysis use case
Next we need the ImageAnalysis use case to get that live video stream. Like other use cases in CameraX, it’s created using a builder pattern:
val imageAnalysisUseCase = remember {
    ImageAnalysis.Builder().build().apply {
        setAnalyzer(context.mainExecutor, viewModel.imageAnalyzer)
    }
}
That imageAnalyzer function is going to be called by CameraX whenever there’s an image ready for us to process. Later, we’ll use it to call the MediaPipe code. For now we’ll just add an empty implementation into our view model:
class HandGestureViewModel(application: Application): AndroidViewModel(application) {
    ...
    val imageAnalyzer = ImageAnalysis.Analyzer { image ->
        Log.v("cameraxdemo", "Received frame for analysis: ${image.width} x ${image.height}")
        image.close()
    }
}
We need to close() the image that we’re given so that CameraX knows we’re ready for the next one.
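As an aside, this one-frame-at-a-time behaviour is governed by the analyzer’s backpressure strategy. The default, STRATEGY_KEEP_ONLY_LATEST, drops stale frames while the analyzer is busy rather than queueing them, which is exactly what we want for live AI analysis. Here’s a minimal sketch showing how you could set it explicitly on the builder from step 2 (optional, since it’s already the default):

val imageAnalysisUseCase = remember {
    ImageAnalysis.Builder()
        // STRATEGY_KEEP_ONLY_LATEST is the default: while our analyzer is still
        // working, newer frames replace older ones instead of queueing up.
        .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
        .build()
        .apply { setAnalyzer(context.mainExecutor, viewModel.imageAnalyzer) }
}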
Step 3: Bind the ImageAnalysis use case
Now that we’ve created the ImageAnalysis use case, we need to get CameraX to actually use it. The CameraProvider.bindToLifecycle(…) function is the glue which binds together a physical camera with the use cases, against a particular lifecycle. In our demo app, that is called by the CameraPreview composable’s rebindCameraProvider function. And so we must pass our ImageAnalysis use case to that composable:
@Composable
fun CameraPreview(
    modifier: Modifier = Modifier,
    imageAnalysisUseCase: ImageAnalysis?
) {
    ...
    fun rebindCameraProvider() {
        ...
        cameraProvider.bindToLifecycle(
            localContext as LifecycleOwner,
            cameraSelector,
            previewUseCase,
            imageAnalysisUseCase
        )
    }
    ...
}
Great! Now, if we run the app we’ll see a camera preview.
…And more importantly, we can see that there’s a stream of frames being blasted at our imageAnalyzer:
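Something like this appears in Logcat for every frame, from the Log.v call above (purely illustrative output; the exact resolution depends on your device and camera):

V/cameraxdemo: Received frame for analysis: 640 x 480
V/cameraxdemo: Received frame for analysis: 640 x 480
V/cameraxdemo: Received frame for analysis: 640 x 480
...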
Woohoo! That’s the CameraX bit done. Now let’s get MediaPipe to detect hand gestures.
Step 4: Add the MediaPipe dependencies
The MediaPipe library we’re going to use is tasks-vision, which is added to Gradle like so:
dependencies {
    ...
    implementation(libs.tasks.vision)
}
[versions]
...
mediapipe = "0.10.20"

[libraries]
...
tasks-vision = { group = "com.google.mediapipe", name = "tasks-vision", version.ref = "mediapipe" }
Aside: MediaPipe dependency version problem
At time of writing, the current MediaPipe version is 0.10.20. For some reason, a few years ago, a version of the MediaPipe libraries was released to Maven with a version number formed from the date: 0.20230731. This was probably a mistake — but whatever the reason, it means that Android Studio thinks that one is the latest version:

Because 20230731 > 10, right? Nope. Don’t be tempted to accept this change; it will break things. And you’ll have to manually check the MediaPipe releases page for new versions, because until v1.0.0 comes out, Android Studio is always going to get it wrong.
Step 5: Add the AI model
MediaPipe is just the chauffeur. It will need to be given a car to drive — that is, an AI model to input to and receive output from.
MediaPipe will accept almost any AI model built for LiteRT (formerly known as Tensorflow Lite), though obviously some might be too big to fit in a phone’s memory. HuggingFace is a good resource for downloadable models as you can search by library support.
For hand gesture recognition, I’m using gesture_recognizer.task. I found this in the MediaPipe samples, where it was provided without proper credit to the original author (although perhaps it was created specifically for that sample). If you know who it belongs to, let me know so I can credit!
MediaPipe expects to find this file in an asset directory / source set, so we’ll put it there.
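For a standard single-module project that means the app module’s main assets folder, creating the directory if it doesn’t exist yet. Assuming the usual module and source set names, the layout looks like this (the file name must match what we pass to setModelAssetPath later):

app/
  src/
    main/
      assets/
        gesture_recognizer.task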

Video vs image models
Our gesture_recognizer.task operates on single, static images. It’s also, therefore, fine for video, since a video is just a stream of single images. We will run the model separately on each frame, and the model itself won’t use or even remember data from previous frames.
Some models are explicitly designed to work on videos. Often these have quite a large memory footprint.
Step 6: Creating the gesture recogniser
We now have the MediaPipe library installed and the AI model in place. To open and use the model, we need a MediaPipe GestureRecognizer instance. In the next steps, that’s what we’ll pass the image frames to.
Like CameraX, MediaPipe makes heavy use of the builder pattern. So we build a GestureRecognizerOptions object, making use of a BaseOptions object which tells MediaPipe where the model file is. Then from that GestureRecognizerOptions we’ll create the GestureRecognizer.
The GestureRecognizerOptions methods we’ll use are:
- setRunningMode(RunningMode.LIVE_STREAM) which means that we’ll pass it frames from a live video feed and it’ll send us continuous results asynchronously. (The alternative choice would be RunningMode.IMAGE where we’d feed it a single image and it would give a single answer synchronously).
- setResultListener(…) which specifies a function to be called asynchronously when results are available from the model.
Once the GestureRecognizerOptions instance is built, we can use it to create our GestureRecognizer instance. That’s done using GestureRecognizer.createFromOptions(…):
private val gestureRecognizer by lazy {
    val baseOptionsBuilder = BaseOptions.builder().setModelAssetPath("gesture_recognizer.task")
    val baseOptions = baseOptionsBuilder.build()

    val optionsBuilder =
        GestureRecognizer.GestureRecognizerOptions.builder()
            .setBaseOptions(baseOptions)
            .setResultListener { result, _ -> handleGestureRecognizerResult(result) }
            .setRunningMode(RunningMode.LIVE_STREAM)

    val options = optionsBuilder.build()
    GestureRecognizer.createFromOptions(getApplication(), options)
}

private fun handleGestureRecognizerResult(result: GestureRecognizerResult) {
    // Handle the result from the model
}
Step 7: Delivering the frames to the gesture recogniser
OK, so now we have CameraX delivering its frames to an imageAnalyzer function (from step 2), and a gesture recogniser ready to analyse an image.
Let’s join those two ends together, so the gesture recogniser gets its frames!
The images we get from CameraX are going to be in the camera’s natural orientation — which is not necessarily the way round that the phone/tablet is being held. So we need to rotate them to match the orientation of the device.
Also, modern cameras produce pictures which are way too big for most AI tasks. Generally, AI models — particularly LiteRT ones — run on very small images. There’s no need to feed our gesture recogniser anything bigger than, say, 500px. Even that is probably too big. Larger images just add latency as the model has to work harder.
So, we also need to resize the image.
We’ll add that scale-and-rotate step to our imageAnalyzer from step 2:
val imageAnalyzer = ImageAnalysis.Analyzer { image ->
    val imageBitmap = image.toBitmap()
    val scale = 500f / max(image.width, image.height)

    // Create a bitmap that's scaled as needed for the model, and rotated as needed to match display orientation
    val scaleAndRotate = Matrix().apply {
        postScale(scale, scale)
        postRotate(image.imageInfo.rotationDegrees.toFloat())
    }
    val scaledAndRotatedBitmap = Bitmap.createBitmap(imageBitmap, 0, 0, image.width, image.height, scaleAndRotate, true)

    image.close()
}
Finally, we need to pass the processed image to our gesture recogniser:
val imageAnalyzer = ImageAnalysis.Analyzer { image ->
    ...
    gestureRecognizer.recognizeAsync(
        BitmapImageBuilder(scaledAndRotatedBitmap).build(),
        System.currentTimeMillis()
    )
}
When results become available, they will be delivered to our handleGestureRecognizerResult method. So let’s fill that in now.
Step 8: Handling the gesture recogniser results
At this point, images are being delivered to the model, and the model is processing them and providing a response. That response tells us what gestures it’s recognised. So let’s parse those results, and display them on the screen as an emoji.
The response comes in the form of an instance of GestureRecognizerResult, which has a function gestures(). This gives a list of gestures it’s recognised, each of which has a list of possible options for what that gesture might be.
That’s a bit complex. Let’s give an example.
Say it saw this image:

Image credit: Mike Murray
There are two hand gestures there. If our model is doing well, the result will show that there were two gestures recognised.
Let’s say the first one is the gesture on the left. For that, it would hopefully be pretty sure it’s a thumbs up. But AI isn’t perfect, and that thumbs up might be recognised as something else. So those results include a list of gestures it could be, along with a score for each. The score is between 0 and 1, with 1 meaning it’s totally confident. An example here might be:
- Thumb_Up, score = 0.9
- Closed_Fist, score = 0.2
- Pointing_Up, score = 0.05
Here it’s pretty sure that the gesture is a thumbs up, but it might instead be a closed fist. It’s unlikely to be an index finger pointing up.
And there would be a similar list of possibilities with confidence scores for the second gesture.
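If you want to see that nested structure for yourself, one purely illustrative option is to log every candidate for every hand. result.gestures() returns one inner list per recognised hand, and each inner list holds that hand’s candidate categories with their scores:

private fun handleGestureRecognizerResult(result: GestureRecognizerResult) {
    // Illustrative only: dump every candidate category for every recognised hand
    result.gestures().forEachIndexed { handIndex, candidates ->
        candidates.forEach { category ->
            Log.v("cameraxdemo", "Hand $handIndex: ${category.categoryName()} (score: ${category.score()})")
        }
    }
}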
Each possible gesture is called a category in AI terms. What we’ll do is pick the first gesture it recognises, and for that gesture we’ll pick the category with the highest confidence score:
private fun handleGestureRecognizerResult(result: GestureRecognizerResult) {
    // Figure out the most likely gesture
    val bestCategory = result.gestures()
        .firstOrNull()
        ?.maxByOrNull { it.score() }
}
Finally we’ll convert that category into an emoji which we’ll display in the UI:
// Convert it to an emoji
val gesture = when (bestCategory?.categoryName()) {
    "Thumb_Up" -> "\uD83D\uDC4D"
    "Thumb_Down" -> "\uD83D\uDC4E"
    "Pointing_Up" -> "☝\uFE0F"
    "Open_Palm" -> "✋"
    "Closed_Fist" -> "✊"
    "Victory" -> "✌\uFE0F"
    "ILoveYou" -> "\uD83E\uDD1F"
    else -> null
}

// Display it in the UI
if (gesture != null) {
    uiState.update { it.copy(mostRecentGesture = gesture) }
}
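The uiState being updated there is the usual MutableStateFlow-of-a-data-class pattern. The names below are my own sketch rather than the demo app’s exact code, but the shape is roughly this:

// Hypothetical sketch: names are illustrative, not lifted from the demo app
data class HandGestureUiState(
    val mostRecentGesture: String? = null
)

class HandGestureViewModel(application: Application): AndroidViewModel(application) {
    val uiState = MutableStateFlow(HandGestureUiState())
    ...
}

// In Compose, collect the flow and draw the emoji over the camera preview
@Composable
fun GestureText(viewModel: HandGestureViewModel) {
    val state by viewModel.uiState.collectAsState()
    Text(
        text = state.mostRecentGesture ?: "",
        fontSize = 60.sp
    )
}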
And that’s it! Finally our app will detect hand gestures.

To wrap up
We’ve seen how to:
- Use CameraX’s ImageAnalysis use case to deliver us a live stream of frames from the camera
- Add the MediaPipe tasks-vision library and use it to detect hand gestures
- Rotate and resize the CameraX output
- Handle results from the gesture recogniser, and understand confidence scores.
You can see demo code for all the above, and a working app, on my GitHub.
I hope this has been a helpful tutorial, but as always please let me know any feedback or questions. I’m here in the comments, on LinkedIn or BlueSky. (Yes please do follow me on BlueSky! I’ve only just joined! 😆)
Need to build an app that uses AI or camera features? Or any app at all, whether Android, iOS or web? I’m a consultant Android developer and the cofounder of Apptaura, the app development experts. Please get in touch if I can help with your latest project.
This article was previously published on proandroiddev.com.