Syncing RGB and depth frames: is SEI with GStreamer the solution?

Background

I’m building an iPhone app that captures synchronized RGB and depth frames. After encoding and decoding the RGB frames, I need to re-synchronize them with the corresponding depth frames.

Currently, I’m using AVCaptureDataOutputSynchronizer to capture RGB and depth frames with wall-clock timestamps. The depth frames are saved raw and zipped, with the wall-clock timestamp embedded in each frame. The RGB frames, on the other hand, are encoded and written with AVAssetWriter. This is where the problem starts: once the RGB frames are encoded, they lose all connection to their wall-clock timestamps. As a result, after decoding I cannot synchronize them exactly with the depth frames.

To work around this, I write a timed metadata track alongside the video. Each metadata entry is written at the same time as the corresponding RGB frame and includes both RGB and depth wall-clock timestamps. Later, I locate the metadata with the closest PTS to each RGB frame and use that to realign the frames.
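The lookup described above can be sketched roughly as follows. This is a simplified illustration, not my actual code: `SyncEntry` and `nearestEntry` are hypothetical names, times are plain `Double` seconds rather than `CMTime`, and the entries are assumed to be sorted by PTS.

```swift
// Hypothetical type for illustration; not an AVFoundation API.
struct SyncEntry {
    let pts: Double            // presentation time of the metadata sample, in seconds
    let rgbWallClock: Double   // wall-clock time of the RGB frame, in seconds
    let depthWallClock: Double // wall-clock time of the paired depth frame, in seconds
}

/// Returns the entry whose PTS is closest to `framePTS`,
/// or nil if no entry lies within `tolerance` seconds.
/// Assumes `entries` is sorted by ascending PTS.
func nearestEntry(to framePTS: Double, in entries: [SyncEntry], tolerance: Double) -> SyncEntry? {
    guard !entries.isEmpty else { return nil }

    // Binary search for the first entry with pts >= framePTS.
    var lo = 0
    var hi = entries.count
    while lo < hi {
        let mid = (lo + hi) / 2
        if entries[mid].pts < framePTS {
            lo = mid + 1
        } else {
            hi = mid
        }
    }

    // The closest entry is either the one just before or the one at `lo`.
    var best: SyncEntry? = nil
    var bestDistance = Double.infinity
    for i in [lo - 1, lo] where i >= 0 && i < entries.count {
        let distance = abs(entries[i].pts - framePTS)
        if distance < bestDistance {
            bestDistance = distance
            best = entries[i]
        }
    }
    return bestDistance <= tolerance ? best : nil
}
```

With a tolerance of about half a frame interval (for example `1.0 / 60.0` at 30 fps), a frame whose metadata is missing comes back as nil instead of silently being matched to a neighbouring entry.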

While this approach mostly works, I still see a drift of 1–3 frames at random times, which is unacceptable for my application.

Is SEI with GStreamer my best option?

From what I understand, the only reliable way to retain precise synchronization is to embed the wall-clock timestamp into the video frame itself, for example as an SEI message. Unfortunately:

  • AVFoundation doesn’t support SEI messages,
  • FFmpeg isn’t supported on iOS,
  • GStreamer is(!), but its iOS build relies on the GStreamer Bad Plug-ins.
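To make the SEI idea concrete, here is a minimal sketch of the kind of message I have in mind: an H.264 `user_data_unregistered` SEI (payload type 5) carrying a 16-byte UUID followed by the wall-clock timestamp. The function name, the choice of an 8-byte big-endian nanosecond timestamp, and the UUID are my own assumptions. The sketch only builds the NAL unit bytes (no start code or length prefix); how to get them into the encoded stream is exactly the part I have not solved.

```swift
/// Builds an H.264 SEI NAL unit (without start code or length prefix) that carries
/// a user_data_unregistered message: 16-byte UUID + 8-byte big-endian timestamp.
/// The payload layout after the UUID is an assumption made for this sketch.
func makeTimestampSEI(uuid: [UInt8], wallClockNanos: UInt64) -> [UInt8] {
    precondition(uuid.count == 16, "user_data_unregistered requires a 16-byte UUID")

    // Payload: UUID followed by the timestamp, most significant byte first.
    var payload = uuid
    for shift in stride(from: 56, through: 0, by: -8) {
        payload.append(UInt8((wallClockNanos >> UInt64(shift)) & 0xFF))
    }

    // SEI message: payload type 5, then the size coded as 0xFF bytes plus a remainder.
    var rbsp: [UInt8] = [5]
    var size = payload.count
    while size >= 255 {
        rbsp.append(0xFF)
        size -= 255
    }
    rbsp.append(UInt8(size))
    rbsp += payload
    rbsp.append(0x80) // rbsp_trailing_bits: a stop bit followed by zero padding

    // NAL header 0x06: nal_ref_idc 0, nal_unit_type 6 (SEI).
    var nal: [UInt8] = [0x06]
    var zeroRun = 0
    for byte in rbsp {
        // Emulation prevention: after two zero bytes, a byte <= 0x03 is escaped with 0x03.
        if zeroRun >= 2 && byte <= 3 {
            nal.append(0x03)
            zeroRun = 0
        }
        nal.append(byte)
        zeroRun = (byte == 0) ? zeroRun + 1 : 0
    }
    return nal
}
```

On the decoding side, the same UUID would be used to recognise the message and read the timestamp back. H.265 uses a different NAL header, so this layout would not carry over unchanged.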

My questions are then:

  • Is GStreamer (and the “bad” plugins) a good option for my case? The app needs to be distributed.
  • Should I consider something else?

Other ideas I’ve considered:

  • Bake timestamps into the alpha channel
    • but the alpha channel is compressed too, so the timestamps could get corrupted, right?
  • Somehow utilize HDR metadata, as the encoder seems to preserve this.
    • I don’t know how this works, or whether it works at all.

I’m quite new to both media encoding and GStreamer, and I’ve been working on this issue for a while. Any advice or alternative approaches would be deeply appreciated!