With transcription APIs that support word alignment, we can trim exact words or sentences from a video. This lets users remove filler words, bad takes, or silences easily. The result is a clean and polished video. This blog explains how I built QuickTrim using Media3 Transformer and ElevenLabs APIs.

QuickTrim Android | Transcription Based Video Trimmer

High Level Overview

The core idea is to extract the audio from a video, convert it into a transcript using a speech-to-text API, let the user trim the video by selecting words or segments in the transcript, and export the edited media sequence.

🔊 Step 1: Audio Extraction

Most transcription APIs require only audio input. Calling setRemoveVideo(true) on EditedMediaItem.Builder strips the video track and retains only the audio, which Transformer then saves in the cache directory.

// Create EditedMediaItem with video removed
val editedMediaItem = EditedMediaItem.Builder(MediaItem.fromUri(uri))
    .setRemoveVideo(true)
    .build()

// Define audio output path
val outputPath = File(context.cacheDir, "audio.m4a").absolutePath

// Configure and start Transformer
Transformer.Builder(context)
    .setAudioMimeType(MimeTypes.AUDIO_AAC)
    .addListener(object : Transformer.Listener {
        override fun onCompleted(composition: Composition, exportResult: ExportResult) {
            // Handle success
        }

        override fun onError(composition: Composition, exportResult: ExportResult, exportException: ExportException) {
            // Handle failure
        }
    })
    .build()
    .start(editedMediaItem, outputPath)

📃 Step 2: Generating Transcript

Once audio extraction is complete, the file is sent to the ElevenLabs Speech-to-Text API, which returns a transcription with word-level timestamps. The API also supports additional output formats such as:

  • SegmentedJson (sentence-level breakdown)
  • SRT (subtitle format)

In this application, the segmented JSON format is used. More details can be found in the official ElevenLabs documentation.

API Interface for calling speech-to-text endpoint:

interface QuickTrimApiService {
    @Multipart
    @POST("speech-to-text")
    suspend fun speechToText(
        @Part file: MultipartBody.Part,
        @Part("model_id") modelId: String? = null,
        @Part("timestamps_granularity") timestampGranularity: String? = null,
        @Part("additional_formats") additionalFormats: String? = null,
        @Part("diarize") diarize: Boolean = false,
        @Header("xi-api-key") apiKey: String = BuildConfig.API_KEY
    ): ResponseBody
}
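
Word-level timestamps also make it possible to suggest cuts before the user touches the transcript. The following is only a rough sketch of that idea, not something taken from QuickTrim: the TimedWord class, its field names, the filler list, and both functions are hypothetical, and timestamps are assumed to be in seconds.

```kotlin
// Hypothetical in-memory model of one transcript word. The field names are
// illustrative and are not taken from the ElevenLabs response schema.
data class TimedWord(val text: String, val startSec: Double, val endSec: Double)

// Illustrative filler list; a real app would tune this per language and speaker.
val fillerWords = setOf("um", "uh", "erm", "hmm")

// Returns the indices of words whose text, lowercased and stripped of
// surrounding punctuation, is a known filler.
fun findFillerIndices(words: List<TimedWord>): List<Int> =
    words.indices.filter { i ->
        words[i].text.lowercase().trim { !it.isLetter() } in fillerWords
    }

// Returns (start, end) pairs for gaps between consecutive words that last
// at least minGapSec seconds.
fun findSilences(words: List<TimedWord>, minGapSec: Double): List<Pair<Double, Double>> =
    words.zipWithNext()
        .filter { (a, b) -> b.startSec - a.endSec >= minGapSec }
        .map { (a, b) -> a.endSec to b.startSec }
```

Suggestions produced this way would still be shown to the user for confirmation, since a word like "hmm" can be intentional.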

🎥 Step 3: Removing Filler Words and Segments with Real-Time Preview

Once the transcription is generated, users can remove specific words or segments by toggling them in the transcript. After each update, a real-time preview of the trimmed video is played. Since Media3 does not support direct preview of Composition objects, an alternative approach is used. Instead of building a new composition on every change, the application computes keep intervals, the sections of the video that remain. Clipped MediaItems are then created for each interval and passed to ExoPlayer as a list using setMediaItems(), enabling seamless playback of the edited sequence in real time.
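
Before keep intervals can be computed, the toggle state has to be turned into time ranges. The preview code calls getRemovedSegments() for this without showing it, so here is one possible shape. The TranscriptWord class and the removedSegments function are hypothetical names, and timestamps are assumed to be in seconds.

```kotlin
// Hypothetical transcript entry; `removed` mirrors the toggle state in the UI.
data class TranscriptWord(
    val text: String,
    val startSec: Double,
    val endSec: Double,
    val removed: Boolean = false,
)

// Collapses runs of consecutive removed words into single (start, end) segments,
// so the short gap between two adjacent removed words is cut as well.
fun removedSegments(words: List<TranscriptWord>): List<Pair<Double, Double>> {
    val segments = mutableListOf<Pair<Double, Double>>()
    var runStart: Double? = null
    var runEnd = 0.0
    for (word in words) {
        if (word.removed) {
            if (runStart == null) runStart = word.startSec
            runEnd = word.endSec
        } else if (runStart != null) {
            // A kept word ends the current run of removed words.
            segments += runStart to runEnd
            runStart = null
        }
    }
    // Flush a run that extends to the last word.
    if (runStart != null) segments += runStart to runEnd
    return segments
}
```

Merging adjacent words keeps the number of clipped items small, which matters because every keep interval becomes its own MediaItem in the playlist.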

Real-time preview is generated on each edit as follows:

fun playTrimVideoPreview() {
    // Extract the duration of the original video. MediaMetadataRetriever reports
    // milliseconds, so convert to seconds to match the transcript timestamps
    // (assumes timestamps are in seconds and toMs() converts seconds to milliseconds).
    val retriever = MediaMetadataRetriever().apply { setDataSource(context, mediaUri) }
    val duration =
        (retriever.extractMetadata(MediaMetadataRetriever.METADATA_KEY_DURATION)?.toDouble()
            ?: 0.0) / 1000.0
    retriever.release()

    // Compute keep intervals
    val removedSegments = getRemovedSegments()
    val sortedRemovals = removedSegments.sortedBy { it.first }
    val keepIntervals = mutableListOf<Pair<Double, Double>>()
    var lastEnd = 0.0
    for ((start, end) in sortedRemovals) {
        if (lastEnd < start) keepIntervals.add(Pair(lastEnd, start))
        // maxOf keeps lastEnd from moving backwards when removals overlap
        lastEnd = maxOf(lastEnd, end)
    }
    if (lastEnd < duration) keepIntervals.add(Pair(lastEnd, duration))

    // Create one clipped MediaItem per keep interval
    val mediaItemList = keepIntervals.map { (start, end) ->
        MediaItem.Builder()
            .setUri(mediaUri)
            .setClippingConfiguration(
                MediaItem.ClippingConfiguration.Builder()
                    .setStartPositionMs(start.toMs())
                    .setEndPositionMs(end.toMs())
                    .build()
            )
            .build()
    }

    // Play the clips back to back as a playlist
    exoPlayer?.setMediaItems(mediaItemList)
    exoPlayer?.prepare()
    controller?.start()
}
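
The toMs() extension used in the clipping configuration is not shown in the post. Assuming the transcript timestamps are in seconds, a minimal version could look like this (the rounding behavior is my choice, not necessarily the author's):

```kotlin
import kotlin.math.roundToLong

// Hypothetical helper: converts a timestamp in seconds to whole milliseconds,
// rounding to the nearest millisecond.
fun Double.toMs(): Long = (this * 1000.0).roundToLong()
```

Rounding rather than truncating avoids losing a millisecond when a value such as 2.3456 is stored slightly below its decimal representation.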

⏳ Step 4: Exporting Edited Media

Once editing is finalized, the final trimmed video is exported through the following steps:

  1. Compute Keep Intervals: Identify portions of the video to retain based on removed segments like filler words or silences.
  2. Build Clipped MediaItems: Create a MediaItem for each keep interval using clipping configurations for the corresponding start and end times.
  3. Compose Edited Sequence: Wrap the clipped items into an EditedMediaItemSequence and include it in a Composition.
  4. Start Export: Pass the Composition to Transformer, which muxes and encodes the result into a .mp4 file.

val retriever = MediaMetadataRetriever().apply { setDataSource(context, uri) }
// METADATA_KEY_DURATION is reported in milliseconds; convert to seconds to match
// the transcript timestamps (assumes toMs() converts seconds to milliseconds)
val duration =
    (retriever.extractMetadata(MediaMetadataRetriever.METADATA_KEY_DURATION)?.toDouble() ?: 0.0) / 1000.0
retriever.release()

// Calculate keep intervals
val keepIntervals = computeKeepIntervals(removedSegments, duration)

// Build one clipped EditedMediaItem per keep interval
val editedMediaItems = keepIntervals.map { (start, end) ->
    EditedMediaItem.Builder(
        MediaItem.Builder()
            .setUri(uri)
            .setClippingConfiguration(
                MediaItem.ClippingConfiguration.Builder()
                    .setStartPositionMs(start.toMs())
                    .setEndPositionMs(end.toMs())
                    .build()
            )
            .build()
    ).build()
}

// Wrap the clips in a single sequence and composition
val composition = Composition.Builder(EditedMediaItemSequence(editedMediaItems)).build()
val outputPath = File(context.cacheDir, "final_trimmed.mp4").absolutePath

// Configure and start the export
Transformer.Builder(context)
    .experimentalSetTrimOptimizationEnabled(true)
    .addListener(object : Transformer.Listener {
        override fun onCompleted(composition: Composition, exportResult: ExportResult) {
            // Handle success
        }

        override fun onError(composition: Composition, exportResult: ExportResult, exportException: ExportException) {
            // Handle error
        }
    })
    .build()
    .start(composition, outputPath)
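
The computeKeepIntervals helper called in the export code is not shown in the post. A minimal sketch, assuming removed segments and the duration are both expressed in seconds, might be the following; the clamping to the media bounds is an extra safeguard I added, not something the post describes.

```kotlin
// Returns the (start, end) intervals, in seconds, that survive after cutting
// every removed segment out of [0, durationSec].
fun computeKeepIntervals(
    removedSegments: List<Pair<Double, Double>>,
    durationSec: Double,
): List<Pair<Double, Double>> {
    val keep = mutableListOf<Pair<Double, Double>>()
    var lastEnd = 0.0
    for ((start, end) in removedSegments.sortedBy { it.first }) {
        // Clamp to the media bounds so stray timestamps cannot produce invalid clips.
        val s = start.coerceIn(0.0, durationSec)
        val e = end.coerceIn(0.0, durationSec)
        // Everything between the previous removal and this one is kept.
        if (s > lastEnd) keep += lastEnd to s
        // maxOf handles overlapping or nested removals.
        lastEnd = maxOf(lastEnd, e)
    }
    // Keep the tail after the last removal, if any.
    if (lastEnd < durationSec) keep += lastEnd to durationSec
    return keep
}
```

Sharing one function between preview and export means both paths cut at exactly the same positions, so the exported file matches what the user previewed.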

🚀 Wrapping Up

Thanks for reading! For low-level implementation details, refer to the GitHub repository or drop a question in the comments 😄.

This article was previously published on proandroiddev.com.
