Transcription APIs that support word-level alignment make it possible to trim exact words or sentences from a video. This lets users remove filler words, bad takes, or silences easily, leaving a clean and polished result. This blog explains how I built QuickTrim using Media3 Transformer and the ElevenLabs APIs.

QuickTrim Android | Transcription Based Video Trimmer
High Level Overview
The core idea is to extract the audio from a video, convert it into a transcript using a speech-to-text API, let the user trim the video based on words or segments in the transcript, and export the edited media sequence.

Step 1: Audio Extraction
Most transcription APIs only need audio input. Calling setRemoveVideo(true) on EditedMediaItem.Builder strips the video track and keeps only the audio, which is then saved in the cache directory.
```kotlin
// Create EditedMediaItem with video removed
val editedMediaItem = EditedMediaItem.Builder(MediaItem.fromUri(uri))
    .setRemoveVideo(true)
    .build()

// Define audio output path
val outputPath = File(context.cacheDir, "audio.m4a").absolutePath

// Configure and start Transformer
Transformer.Builder(context)
    .setAudioMimeType(MimeTypes.AUDIO_AAC)
    .addListener(object : Transformer.Listener {
        override fun onCompleted(composition: Composition, exportResult: ExportResult) {
            // Handle success
        }

        override fun onError(
            composition: Composition,
            exportResult: ExportResult,
            exportException: ExportException
        ) {
            // Handle failure
        }
    })
    .build()
    .start(editedMediaItem, outputPath)
```
Step 2: Generating Transcript
Once audio extraction is complete, the file is sent to the ElevenLabs Speech-to-Text API, which returns a transcription with word-level timestamps. The API also supports additional output formats such as:
- SegmentedJson (sentence-level breakdown)
- SRT (subtitle format)
In this application, the segmented JSON format is used. More details can be found in the official ElevenLabs documentation.
API Interface for calling speech-to-text endpoint:
```kotlin
interface QuickTrimApiService {
    @Multipart
    @POST("speech-to-text")
    suspend fun speechToText(
        @Part file: MultipartBody.Part,
        @Part("model_id") modelId: String? = null,
        @Part("timestamps_granularity") timestampGranularity: String? = null,
        @Part("additional_formats") additionalFormats: String? = null,
        @Part("diarize") diarize: Boolean = false,
        @Header("xi-api-key") apiKey: String = BuildConfig.API_KEY
    ): ResponseBody
}
```
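To illustrate how word-level timestamps feed the trimming step, here is a minimal Kotlin sketch that models a transcript word and collects the time ranges of filler words. The `TranscriptWord` class, its field names, the `fillerWords` list, and `findFillerSegments` are all assumptions made for illustration. They are not taken from QuickTrim or from the ElevenLabs response schema, so check the official documentation for the actual fields.

```kotlin
// Hypothetical model of one transcribed word; field names are assumptions.
// start and end are assumed to be in seconds.
data class TranscriptWord(val text: String, val start: Double, val end: Double)

// Illustrative filler list, not taken from QuickTrim.
val fillerWords = setOf("um", "uh", "er", "hmm")

// Returns (start, end) pairs in seconds for every word that matches the filler list.
fun findFillerSegments(
    words: List<TranscriptWord>,
    fillers: Set<String> = fillerWords
): List<Pair<Double, Double>> =
    words
        .filter { word ->
            // Normalise case and strip surrounding punctuation such as "Um," or "uh."
            word.text.lowercase().trim { !it.isLetterOrDigit() } in fillers
        }
        .map { it.start to it.end }
```

Given the words "So", "um," and "welcome", this would return only the range covered by "um,". Ranges produced this way could be treated like any other removed segment when the user toggles words in the transcript.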

Step 3: Removing Filler Words and Segments with Real-Time Preview
Once the transcription is generated, users can remove specific words or segments by toggling them in the transcript. After each update, a real-time preview of the trimmed video is played. Since Media3 does not support direct preview of Composition objects, an alternative approach is used. Instead of building a new composition on every change, the application computes keep intervals, that is, the sections of the video that remain. Clipped MediaItems are then created for each interval and passed to ExoPlayer as a list using setMediaItems(), enabling seamless playback of the edited sequence in real time.
Real-time preview is generated on each edit as follows:
```kotlin
fun playTrimVideoPreview() {
    // Extract the duration of the original video. METADATA_KEY_DURATION is reported
    // in milliseconds; convert it to seconds, assuming the transcript timestamps are
    // in seconds (as the toMs() calls below imply).
    val retriever = MediaMetadataRetriever().apply { setDataSource(context, mediaUri) }
    val duration =
        (retriever.extractMetadata(MediaMetadataRetriever.METADATA_KEY_DURATION)?.toDouble()
            ?: 0.0) / 1000.0
    retriever.release()

    // Compute keep intervals: the gaps between the removed segments
    val removedSegments = getRemovedSegments()
    val sortedRemovals = removedSegments.sortedBy { it.first }
    val keepIntervals = mutableListOf<Pair<Double, Double>>()
    var lastEnd = 0.0
    for ((start, end) in sortedRemovals) {
        if (lastEnd < start) keepIntervals.add(Pair(lastEnd, start))
        // maxOf keeps lastEnd from moving backwards when removals overlap
        lastEnd = maxOf(lastEnd, end)
    }
    if (lastEnd < duration) keepIntervals.add(Pair(lastEnd, duration))

    // Create one clipped MediaItem per keep interval
    val mediaItemList = keepIntervals.map { (start, end) ->
        MediaItem.Builder()
            .setUri(mediaUri)
            .setClippingConfiguration(
                MediaItem.ClippingConfiguration.Builder()
                    .setStartPositionMs(start.toMs())
                    .setEndPositionMs(end.toMs())
                    .build()
            )
            .build()
    }

    // Play the clipped items back to back as the preview
    exoPlayer?.setMediaItems(mediaItemList)
    exoPlayer?.prepare()
    controller?.start()
}
```
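The keep-interval logic can also be pulled out into a pure function, which makes it easy to test without a player. The sketch below is one possible version, not QuickTrim's actual implementation. It assumes all times are in seconds, and the `toMs()` extension is a hypothetical stand-in for the helper used in the snippets in this article. The name `computeKeepIntervals` mirrors the call in the export code of Step 4, whose body is not shown here.

```kotlin
import kotlin.math.roundToLong

// Hypothetical helper: converts seconds to milliseconds.
fun Double.toMs(): Long = (this * 1000).roundToLong()

/**
 * Returns the (start, end) ranges to keep, in seconds, given the ranges to remove.
 * Removals may be unsorted or overlapping; they are clamped to [0, duration].
 * Assumes duration is not negative.
 */
fun computeKeepIntervals(
    removedSegments: List<Pair<Double, Double>>,
    duration: Double
): List<Pair<Double, Double>> {
    val keep = mutableListOf<Pair<Double, Double>>()
    var lastEnd = 0.0
    for ((start, end) in removedSegments.sortedBy { it.first }) {
        val clampedStart = start.coerceIn(0.0, duration)
        val clampedEnd = end.coerceIn(0.0, duration)
        // Keep the gap between the previous removal and this one, if any
        if (lastEnd < clampedStart) keep.add(lastEnd to clampedStart)
        // Never move backwards when a removal is nested inside an earlier one
        lastEnd = maxOf(lastEnd, clampedEnd)
    }
    // Keep whatever remains after the last removal
    if (lastEnd < duration) keep.add(lastEnd to duration)
    return keep
}
```

For a 10 second video with removals at 0.5 to 1.0, 2.0 to 3.0, and 2.5 to 4.0, this yields the keep intervals 0.0 to 0.5, 1.0 to 2.0, and 4.0 to 10.0.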

Step 4: Exporting Edited Media
Once editing is finalized, the final trimmed video is exported through the following steps:
- Compute Keep Intervals: Identify portions of the video to retain based on removed segments like filler words or silences.
- Build Clipped MediaItems: Create a MediaItem for each keep interval using clipping configurations for the corresponding start and end times.
- Compose Edited Sequence: Wrap the clipped items into an EditedMediaItemSequence and include it in a Composition.
- Start Export: Pass the Composition to Transformer, which encodes and muxes the result into a .mp4 file.
```kotlin
// METADATA_KEY_DURATION is reported in milliseconds; convert it to seconds,
// assuming the removed segments use seconds (as the toMs() calls below imply).
val retriever = MediaMetadataRetriever().apply { setDataSource(context, uri) }
val duration =
    (retriever.extractMetadata(MediaMetadataRetriever.METADATA_KEY_DURATION)?.toDouble()
        ?: 0.0) / 1000.0
retriever.release()

// Calculate keep intervals
val keepIntervals = computeKeepIntervals(removedSegments, duration)

// Build one clipped EditedMediaItem per keep interval
val editedMediaItems = keepIntervals.map { (start, end) ->
    EditedMediaItem.Builder(
        MediaItem.Builder()
            .setUri(uri)
            .setClippingConfiguration(
                MediaItem.ClippingConfiguration.Builder()
                    .setStartPositionMs(start.toMs())
                    .setEndPositionMs(end.toMs())
                    .build()
            )
            .build()
    ).build()
}

// Wrap the items in a sequence and a composition
val composition = Composition.Builder(EditedMediaItemSequence(editedMediaItems)).build()
val outputPath = File(context.cacheDir, "final_trimmed.mp4").absolutePath

// Configure and start the export
Transformer.Builder(context)
    .experimentalSetTrimOptimizationEnabled(true)
    .addListener(object : Transformer.Listener {
        override fun onCompleted(composition: Composition, exportResult: ExportResult) {
            // Handle success
        }

        override fun onError(
            composition: Composition,
            exportResult: ExportResult,
            exportException: ExportException
        ) {
            // Handle error
        }
    })
    .build()
    .start(composition, outputPath)
```
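One simple sanity check after export is to compare the length of the output file with the sum of the keep intervals. The helper below is a hypothetical sketch, not part of QuickTrim. It assumes the intervals are (start, end) pairs in seconds, as in the snippets above.

```kotlin
import kotlin.math.roundToLong

// Sums the kept ranges (in seconds) and returns the expected output length in milliseconds.
fun expectedDurationMs(keepIntervals: List<Pair<Double, Double>>): Long =
    keepIntervals.sumOf { (start, end) -> ((end - start) * 1000).roundToLong() }
```

In onCompleted, this value could be compared against exportResult.durationMs. Allow some tolerance, since the exported length may differ slightly from the requested cut points.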
Wrapping Up
Thanks for reading! For low-level implementation details, refer to the GitHub repository or drop a question in the comments.
This article was previously published on proandroiddev.com.


