TL;DR
- (optional) Extract relevant video snippets using LosslessCut
- Re-encode using HandBrakeCLI with Apple's Video Toolbox
HandBrakeCLI -v --preset-import-file ~/WorkingDirectory/handbrake_presets/DJI_4K.json -Z "DJI Drone 4K" -i original.mp4 -o converted.mp4
- (optional) Evaluate video quality perception with VMAF
ffmpeg -i converted.mp4 -i original.mp4 \
-filter_complex "[0:v][1:v]libvmaf=model='path=/opt/homebrew/share/libvmaf/model/vmaf_4k_v0.6.1.json':log_fmt=json:log_path=vmaf.json" \
-f null -
Uh oh, no space left on device
Lately, across several trips, I have been hoarding an insane amount of cool footage recorded with my DJI Mini 3 drone. This wouldn't be as much of a problem if I were just a normal guy wanting to share a few seconds on social media. The ideal basic pipeline involves copying the videos to your computer, cropping them (size and length), and uploading the interesting bits. There is nothing wrong with this pipeline, apart from having to copy several GBs of data from the SD card to my computer or storage after every trip.
But it turns out I refuse to let go of all these incredible shots, and at some point I started running out of storage (I know, first world problems).
The immediate solutions that presented themselves were:
- Crop the videos in size (because I don't need professional 4K footage, just something I can watch with decent quality on current computers and screens). But this is a destructive action: I would be losing information, with no possible recovery in the future. Gone.
- Crop the videos in length (make snappier, shorter videos that capture the main actions or events). But again, this is destructive: in this particular case I would be losing context, with no possible recovery in the future. Gone.
- Compress the videos
Here we go, let's take a deep dive into the amateur video compression world.
Why Cameras Don't Pre-Compress Enough
One might think that GoPro or DJI products would record video with a combination of parameters that delivers high-quality footage at a reduced file size, with the right balance so it is easy to store (more footage per storage device) and quick to share on social media.
But there are more factors influencing the parameters these products use to record your holidays in Vietnam: write speed to storage, battery optimization, heat management, and the idea of using all the power of the camera, capturing the raw data so it is available for editing later based on the user's preferences (social media sharing, professional use).
- Write speed to storage: If the camera has to compress raw footage or apply filters before saving it, saving takes longer. During a continuous stream of data (while recording video), the write speed cannot drop below the recording speed; otherwise the buffers would overflow and the camera would have to discard data (parts of the footage). This means the camera has to find a tradeoff between processing the raw footage and writing it out faster than new data comes in.
- Battery optimization: Spending CPU cycles on further processing or compressing incoming video negatively affects the battery life of the device. These small handheld devices already have limited battery capacity, so saving power is high on the list of priorities.
- Heat management: The more you use a CPU, the more its underlying transistors switch on and off, which increases heat losses. Too much heat can degrade or break electronics, especially if your camera is enclosed in a plastic waterproof case.
- Capturing raw data: If you have a good recording device, why would you downgrade the quality of the output without letting the user choose? These products record as much data as the hardware allows, in a convenient format, so the user can decide later how to treat that data in post-production.
DJI drones typically record in MP4 or MOV containers, using the HEVC (H.265) codec or sometimes AVC (H.264) depending on settings. Drone footage is recorded at very high bitrates (often 80–150 Mbps for 4K HEVC). This preserves fine details like leaves, water ripples, and sky gradients, but it also results in huge file sizes. A short 5-minute flight can easily consume multiple gigabytes.
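Those bitrates also explain why these cameras demand fast SD cards: 150 Mbps is about 150/8 ≈ 19 MB/s of sustained writes. If you want to check what your own clips use, ffprobe (bundled with ffmpeg) can report it; the filename here is just an example:
ffprobe -v error -select_streams v:0 \
-show_entries stream=codec_name,width,height,avg_frame_rate,bit_rate \
-of default=noprint_wrappers=1 DJI_0189.MP4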
Re-encoding: Software vs. Hardware
Now that we have covered the basics of why these recording devices do not compress video files for us, let's see how we can do better. The objective is to compress the files so they are easy to share and store long term without losing information, or to be more precise, to compress them in a way that retains the quality perceived by the human eye.
HEVC (H.265) is already highly efficient, offering $\sim50\%$ better compression than H.264 at the same quality. By adjusting the bitrate or using constant quality encoding, we can reduce size 5–10$\times$ while retaining near-identical visual fidelity.
I found out that we can use either software or hardware encoders.
1. Software Encoders (CPU-based)
- Examples: x264 (H.264), x265 (H.265/HEVC).
- Pros: Maximum quality per bitrate. Best for archival or critical compression tasks.
- Cons: Very slow. High CPU usage.
2. Hardware Encoders (GPU/ASIC-based)
- Examples: Apple VideoToolbox (Mac/iOS), NVENC (NVIDIA), Intel Quick Sync, AMD VCE.
- Pros: Extremely fast. Low CPU usage. Good for batch compression or streaming.
- Cons: Slightly less efficient. Fewer options. Quality per bitrate often slightly lower, especially on complex textures or high-motion scenes.
I tried software encoding, but it's crazy slow! Speed peaked at just $0.1\times$. Hardware encoding, on the other hand, averaged $1\times$, meaning the re-encoding takes roughly as long as the video's duration.
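For reference, this is roughly what the two flavors look like in ffmpeg. The quality values here are illustrative, and the two scales are not comparable: x265 uses CRF (0–51, lower is better), while Video Toolbox uses a 1–100 quality scale (higher is better).
# Software encoder (x265): slow, best quality per bit
ffmpeg -i original.mp4 -c:v libx265 -preset slow -crf 22 -c:a copy sw.mp4
# Hardware encoder (Apple Video Toolbox): fast, slightly less efficient
ffmpeg -i original.mp4 -c:v hevc_videotoolbox -q:v 55 -c:a copy hw.mp4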
Evaluating Quality with VMAF
To select the re-encoding quality, we need to find settings that reduce file size while still preserving perceptual video quality. This is done using algorithms that compare the original and compressed footage and score how close they look.
The evaluation metrics include:
- Traditional metrics: PSNR / SSIM. Good, but they don't always match human perception.
- VMAF (Video Multi-Method Assessment Fusion): Developed by Netflix, it combines multiple metrics and machine learning to align closely with how people actually perceive video quality. I started out using SSIM with ffmpeg, but VMAF appears to be more accurate in terms of human perception, and it is backed by a company whose main priority is storing and delivering video that is perceived as high quality using the least storage/bandwidth possible.
VMAF output score:
- 90–100 → virtually indistinguishable
- 80–90 → very good, minor differences
- <80 → noticeable quality loss
For quality evaluation we will be using ffmpeg and libvmaf.
Compressing and Evaluating: The Workflow
We will be using ffmpeg and HandBrake for re-encoding.
Pre-requisites
- ffmpeg: brew install ffmpeg
- HandBrake CLI: brew install handbrake
- libvmaf: brew install libvmaf
The Compression Attempt (and Pivot)
The plan was to use a command-line tool to compress our video footage, in this case ffmpeg, using H.265 hardware encoding. As I am doing this on a Mac, we will be using Video Toolbox.
ffmpeg -i DJI_20250927215013_0189_D.MP4 -c:v hevc_videotoolbox -pix_fmt p010 -q:v 30 -c:a copy DJI_20250927215013_0189_D_converted.MP4
It turns out this encoder does not expose parameters to properly control the quality of the output file: while the file size was greatly reduced, running VMAF returned an overall score of 81 out of 100.
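You can check which options your ffmpeg build actually exposes for this encoder with:
ffmpeg -h encoder=hevc_videotoolbox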
Here's where HandBrake comes into play: it exposes the Constant Quality parameter when encoding with Apple Video Toolbox. I used the HandBrake GUI to configure the exact profile that re-encodes the video with HEVC at a specific quality while keeping the video dimensions and pixel format.
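I won't paste the full preset here, but a trimmed sketch of the exported JSON looks roughly like this (field names follow HandBrake's preset format; the encoder name and quality value are illustrative, not my exact settings):
{
  "PresetList": [
    {
      "PresetName": "DJI Drone 4K",
      "VideoEncoder": "vt_h265_10bit",
      "VideoQualityType": 2,
      "VideoQualitySlider": 55.0
    }
  ]
}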
The HandBrakeCLI command:
HandBrakeCLI -v --preset-import-file ~/WorkingDirectory/handbrake_presets/DJI_4K.json -Z "DJI Drone 4K" -i original.mp4 -o converted.mp4
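Since the preset lives in a file, batch re-encoding a whole card is a one-liner away. A minimal sketch, assuming a folder of .MP4 clips and the same preset (the output naming is just my own convention):
for f in *.MP4; do
  HandBrakeCLI --preset-import-file ~/WorkingDirectory/handbrake_presets/DJI_4K.json \
  -Z "DJI Drone 4K" -i "$f" -o "${f%.MP4}_converted.mp4"
done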
The VMAF Evaluation
The command for using libvmaf (note that the filter expects the distorted video as its first input and the reference as its second, so converted.mp4 goes first):
ffmpeg -i converted.mp4 -i original.mp4 \
-filter_complex "[0:v][1:v]libvmaf=model='path=/opt/homebrew/share/libvmaf/model/vmaf_4k_v0.6.1.json':log_fmt=json:log_path=vmaf.json" \
-f null -
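Note the added log_fmt=json: the filter writes XML by default regardless of the file extension. With a JSON log, the pooled score can be pulled out afterwards; a small sketch assuming jq is installed (brew install jq):
jq '.pooled_metrics.vmaf.mean' vmaf.json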
Results
138 MB of 4K footage turned into 41 MB of 4K footage with a VMAF score of 91.
[Parsed_libvmaf_0 @ 0x600003090000] VMAF score: 91.919797
Bonus Track
Note how I put the optional step of extracting relevant video snippets using LosslessCut BEFORE re-encoding. This is for several reasons:
- The less video footage you have to re-encode, the sooner you are done.
- Re-encoding gets rid of a lot of keyframes, making it harder for tools to preview, seek, or split the video afterwards.
Keyframes? What are those? A keyframe (also called an I-frame) is like a full photo inside your video.
Other frames (P-frames and B-frames) don't store the whole picture; they just store the changes relative to nearby frames. That's why they're smaller and make compression efficient.
So in a video:
- Keyframes (I-frames): Full standalone images.
- P-frames (predicted): Only store the changes from a previous frame.
- B-frames (bidirectional): Store changes relative to both previous and future frames.
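You can actually see this structure in your own footage: ffprobe can print the type of each frame. A quick sketch that inspects only the first couple of seconds (filename illustrative):
ffprobe -v error -select_streams v:0 -read_intervals "%+2" \
-show_entries frame=pict_type -of csv=p=0 converted.mp4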
Why it matters:
- Editing / Scrubbing: You can only “jump” to a keyframe instantly. If you scrub to a non-keyframe, the player has to go back to the last keyframe and “rebuild” the video forward.
- File size vs. usability: More keyframes = bigger file, but smoother seeking and easier editing. Fewer keyframes = smaller file, but slower seeking.
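If smooth seeking matters to you, you can tip that tradeoff when re-encoding: most encoders accept a maximum keyframe interval, exposed in ffmpeg through the generic -g (GOP size) option. A sketch with an illustrative value (whether hevc_videotoolbox honors it may depend on your build):
# request a keyframe at least every 60 frames (~2 s at 30 fps)
ffmpeg -i original.mp4 -c:v hevc_videotoolbox -q:v 55 -g 60 -c:a copy converted.mp4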
Because P-frames and B-frames depend on other frames, you can’t just cut cleanly at them without re-encoding. If you try, the video would break since a dependent frame wouldn’t have its reference data anymore.
LosslessCut is designed for lossless editing — meaning it avoids re-encoding video and instead just copies the original compressed data. Since it doesn’t decode/re-encode, it can only safely cut on I-frames.
Editors like QuickTime, Premiere, or DaVinci Resolve let you trim anywhere, even between predicted frames, because they decode and re-encode around the cut point.