An unexpected source of latency (part 2)
CPB Removal Delay
What we see here is that the moment at which we *display* a picture doesn't necessarily correspond to the moment at which we *finish receiving* a picture.
There's a variable delay between these two moments, depending on the amount of data in the CPB. This delay is called the "CPB removal delay", where "CPB removal" here means "removing a picture from the CPB and instantaneously decoding it".
This has an unexpected consequence:
When a set-top-box starts, it shouldn't display its first picture as soon as it's received: it should wait for a specified, variable delay.
Displaying the first picture as soon as it's available would basically mean that we're keeping our CPB empty. If the next picture is a "long picture", we're guaranteed to underflow.
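A toy simulation makes this concrete. All numbers here are made up for illustration (a constant-rate channel and an empty CPB at startup), but they show how a single long picture empties a buffer that was never given a head start:

```python
# Toy CPB simulation (hypothetical numbers): bits arrive at a constant
# bitrate while the decoder removes one whole picture per frame period.
bitrate = 4_000_000          # bits per second (assumed constant channel)
frame_period = 1 / 25        # seconds between two decoded pictures
bits_per_period = bitrate * frame_period   # 160_000 bits per frame period

# A "long picture" (e.g. a scene change) followed by short ones.
picture_sizes = [160_000, 480_000, 80_000, 80_000]

cpb = 0   # starting empty: we displayed the first picture immediately
for size in picture_sizes:
    cpb += bits_per_period          # receive data for one frame period
    if cpb < size:
        print(f"underflow: need {size} bits, only {cpb:.0f} in CPB")
        break
    cpb -= size                     # remove and decode the picture
```

The first picture fits exactly, so the buffer is back to zero when the long picture arrives, and the decoder stalls.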
So, how should a decoder know how much time it should wait before displaying a picture?
Easy: this information is conveyed in the bitstream.
- Explicitly: for example, H.264 has an SEI NAL unit which literally contains a field called "cpb_removal_delay".
- Or implicitly, through transport-layer signaling: for example, an MPEG-2 Transport Stream carries values for PTS (presentation timestamp), DTS (decoding timestamp), and PCR (program clock reference). Here the CPB removal delay is equal to `DTS - PCR`. Decoders don't need to compute this value explicitly, though: they only need to compare the PCR of the stream with the DTS of each picture.
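In an MPEG-2 TS, PTS/DTS are expressed on a 90 kHz clock (the PCR also carries a 27 MHz extension, ignored here). A minimal sketch with made-up timestamp values:

```python
# Hypothetical MPEG-2 TS values, both reduced to the 90 kHz clock
# (the PCR's 27 MHz extension is ignored for simplicity).
pcr = 900_000            # program clock when the picture finishes arriving
dts = 1_170_000          # decoding timestamp of that picture

cpb_removal_delay = (dts - pcr) / 90_000   # convert ticks to seconds
print(cpb_removal_delay)                   # 3.0: decode 3 s after arrival
```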
Back to latency
This CPB removal delay is variable along the stream: it decreases on long pictures, and increases on short pictures.
Of course, it can't go below zero. It also has an upper limit, which depends on the size, in bits, of the CPB:
`max_cpb_removal_delay_in_seconds = cpb_size_in_bits / bitrate`.
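Plugging in some plausible broadcast-like numbers (illustrative, not taken from any spec):

```python
cpb_size_in_bits = 9_000_000   # hypothetical CPB size
bitrate = 3_000_000            # bits per second

max_cpb_removal_delay_in_seconds = cpb_size_in_bits / bitrate
print(max_cpb_removal_delay_in_seconds)   # 3.0 seconds
```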
We've just introduced algorithmic latency!
The more we increase `cpb_size_in_bits`, the more flexibility the encoder has in picture sizes, which increases the average video quality. However, this also increases the overall latency.
If we decrease `cpb_size_in_bits`, the latency decreases, but the encoder's job gets harder, and the average video quality decreases too, especially on scene changes.
To give you an idea, in a classic DVB scenario, the value of `max_cpb_removal_delay_in_seconds` is around 3s, which is a lot bigger than 20ms, or than any jitter a transmission channel could have!
This is why, when you switch channels on a set-top-box, the video doesn't start immediately: the decoder is waiting for its CPB to fill to the required level.
Is all of this relevant, for example, to live streaming on the Web, where there's no guarantee on the transfer rates anyway?
The short answer is yes, more than ever!
Do you remember the early days of YouTube? Sometimes, playback would freeze because the video wasn't downloading fast enough. This is an underflow!
After such a freeze, a solution was to pause playback while the download continued, allowing the buffer to fill up a little. We would then resume playback, hoping it wouldn't catch up with the download again. (It's now obvious why, on a live stream, this introduces latency.)
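That pause-and-refill strategy boils down to a tiny policy. Here is a sketch assuming a hypothetical player API where we decide, on each buffer-level update, whether playback should be running:

```python
# Naive rebuffering policy (hypothetical player API): when the buffer
# underflows, pause until TARGET seconds are buffered, then resume.
TARGET = 2.0   # seconds of media to accumulate before resuming playback

def on_buffer_change(buffered_seconds, playing):
    """Return the new playing state for a given buffer level."""
    if playing and buffered_seconds <= 0.0:
        return False    # underflow: pause, let the download catch up
    if not playing and buffered_seconds >= TARGET:
        return True     # buffer refilled: resume, but on a live stream
                        # we are now ~TARGET seconds behind (added latency)
    return playing      # otherwise, keep the current state
```

The `TARGET` level plays exactly the role of a manually chosen CPB removal delay.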
Without knowing it, we were manually introducing a "CPB removal delay", doing by hand what DVB set-top-boxes had been silently and perfectly doing for ages.
Although live streaming on the Web removes the need for a lot of costly infrastructure, it makes the problem a lot harder, by removing guarantees on the transfer rates. This means the encoder cannot simulate the CPB state, and thus, can provide no guarantee.
To mitigate a non-guaranteed, variable-rate transfer channel, you can allow your application or player to drop frames. After all, in a live streaming scenario, there's no time to retransmit an obsolete frame. This is what most multiplayer games do, and it's also what RTP video streaming does. Of course, this implies some smart mechanisms on the decoder side, which must now deal with an incomplete bitstream. DVB-T and DVB-S set-top-boxes implement such mechanisms because, with radio transmission, packet loss and corruption are expected.
Unfortunately, dropping lost frames is not an option on the Web.
The Web, by definition, relies on HTTP, which means TCP, which means retransmission of all lost packets. Thus the application has to wait for the successful retransmission of lost packets *before* being able to process the next packets, although those next packets might technically already be somewhere in local memory.
The last option is to oversize the reception buffer, which heavily increases latency. It's actually rather common for YouTube live streaming to introduce several seconds of latency.
Whether this latency is problematic or not depends on the use-case.
And if your use-case requires low latency, high resolution, and high video quality ... what are you doing on the Web anyway?