Qml6glsink pipeline performance issues on Raspberry Pi 5

Hi,

I have a fairly simple application that uses qml6glsink on a Raspberry Pi 5. The video bin is created like this:

void Player::createVideoBin() {
    VLOG(1) << Q_FUNC_INFO;

    if ((m_videoBin = gst_bin_new("videobin")) == nullptr) {
        LOG(ERROR) << "Unable to create videobin";
        return;
    }

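    m_queuePostDecodeBin = gst_element_factory_make("queue", nullptr);
    if (m_queuePostDecodeBin == nullptr) {
        LOG(ERROR) << "Unable to create queue";
        return;
    }
    // The queue blocks as soon as any one of these three limits is reached.
    g_object_set(G_OBJECT(m_queuePostDecodeBin),
                 "max-size-time", 300 * GST_MSECOND,
                 "max-size-bytes", 4 * 1024 * 1024,
                 "max-size-buffers", 8,
                 NULL);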

    m_qmlSink = gst_element_factory_make(getQmlSinkName(), nullptr);
    if (m_qmlSink == nullptr) {
        LOG(ERROR) << "Unable to create qmlglsink";
        return;
    }
    if (m_output) {
        g_object_set(G_OBJECT(m_qmlSink), "widget", m_output, NULL);
        m_needToSetOutputItem = false;
    }

    gst_bin_add(GST_BIN(m_videoBin), m_queuePostDecodeBin);

    // glsinkbin wraps the QML sink and adds the GL upload/convert elements.
    GstElement *glsinkbin = gst_element_factory_make("glsinkbin", nullptr);
    if (glsinkbin == nullptr) {
        LOG(ERROR) << "Unable to create glsinkbin";
        return;
    }

    g_object_set(G_OBJECT(glsinkbin), "sink", m_qmlSink, NULL);
    gst_bin_add(GST_BIN(m_videoBin), glsinkbin);
    if (!gst_element_link(m_queuePostDecodeBin, glsinkbin)) {
        LOG(ERROR) << "Unable to link queue to glsinkbin";
        return;
    }

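    // Create the ghost pad. gst_ghost_pad_new() takes its own reference to the
    // target pad, so ours is released as soon as the ghost pad exists.
    GstPad *videopad = gst_element_get_static_pad(m_queuePostDecodeBin, "sink");
    Q_ASSERT(videopad);
    GstPad *ghostpad = gst_ghost_pad_new("sink", videopad);
    gst_object_unref(videopad);
    if (!gst_element_add_pad(m_videoBin, ghostpad)) {
        LOG(ERROR) << "Unable to add sink ghost pad to m_videoBin";
        return;
    }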

    gst_bin_add(GST_BIN(m_pipeline), m_videoBin);
}
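
One thing that can be checked from the numbers alone is how many decoded frames the post-decode queue can hold. This is a rough sketch, assuming avdec_h264 outputs 8-bit I420 without stride padding (the format is not stated above) and that the queue counts as full once its byte level reaches max-size-bytes. The helper names i420FrameBytes and framesUntilFull are made up for this illustration:

```cpp
#include <cstdint>

// Bytes in one raw I420 (8-bit, 4:2:0) frame, ignoring any stride padding.
// Assumption: the decoder output is I420; other formats give other sizes.
constexpr std::uint64_t i420FrameBytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 3 / 2;
}

// Number of frames that can be queued before the byte level reaches the limit
// (ceiling division, because the frame that crosses the limit is still accepted).
constexpr std::uint64_t framesUntilFull(std::uint64_t limitBytes, std::uint64_t frameBytes) {
    return (limitBytes + frameBytes - 1) / frameBytes;
}

// Values from the snippet above: 1080p video and "max-size-bytes" of 4 MiB.
constexpr std::uint64_t kFrameBytes = i420FrameBytes(1920, 1080);
constexpr std::uint64_t kLimitBytes = 4 * 1024 * 1024;

static_assert(kFrameBytes == 3110400, "one 1080p I420 frame");
static_assert(framesUntilFull(kLimitBytes, kFrameBytes) == 2, "byte limit hit after 2 frames");
```

Under those assumptions a 1080p frame is 3110400 bytes, so the 4 MiB byte limit is reached after 2 frames, well before the max-size-buffers limit of 8. This may or may not be related to the slowdown, but it means the queue is shallower than the buffer count suggests.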

This same binary works perfectly on the Pi 4. Some difference is expected, of course, since on the Pi 5 the pipeline does not use the hardware decoder and is not zero-copy, but what I get does not look normal:
even at 1080p I see a slideshow. With the Mesa overlay enabled, it reports 0.5 fps.

I tried both kmssink:
gst-play-1.0 --videosink=kmssink $SNAP_COMMON/assets/22mwU~lzePUa_Le5w67012
and waylandsink under Wayland. Both show no issues and play the video perfectly. Any pointers on what my problem might be?

The pipeline I end up with is:
avdec_h264 → queue → gstgluploadelement → gstglcolorconvertelement → gstglcolorbalance → gstqml6glsink

Disabling QoS manually brought me to 10 fps, which is better than before but still very low for 1080p video.
It certainly feels like I am doing something wrong: one of the cores is at 100% CPU usage.