April 04, 2022 by Sebastien Alaiwan

An unexpected source of latency (part 1)

Introduction

In digital video streaming, there's a source of latency that is rather subtle, yet it's often a major one. It's related to the way video compression works, and it cannot be mitigated by buying more expensive equipment.

The setup

Let's suppose a perfect transmission channel, with:

negligible latency (which implies negligible jitter)
zero packet loss
a fixed and guaranteed bitrate

At one end of this transmission channel, there's a perfect live video encoder, powered by an infinitely fast CPU, encoding pictures that come from a camera at a constant framerate.

At the other end of this transmission channel, there's a live video decoder, also powered by an infinitely fast CPU, connected to a display, say, a TV set.

Our goal here is to minimize the latency between the camera and the TV set.

Of course, we also want to enforce a constant framerate on the display, which means that every picture must be displayed for the exact same duration. If, at some point, the display needs to show the next picture, but this next picture is not yet available from the decoder ... it's game over, we lose.

But things are looking pretty good in our hypothetical setup. What could possibly go wrong? Is there any latency left at all?

An expected source of latency

We've seen in a previous article that many video encoders introduce algorithmic latency, with a mechanism called "lookahead", in order to improve video quality. Briefly, this means the encoder waits for N more input pictures to arrive before encoding the current picture. Knowing the contents of the next input pictures lets it make more efficient encoding choices for the current one. This obviously introduces, between the encoder input and its output, a latency of N pictures.
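To get a feel for the order of magnitude, here is a minimal sketch that converts a lookahead depth into a delay. The function name and the example values (25 pictures of lookahead, 50 frames per second) are assumptions chosen for illustration, not figures from any particular encoder.

```python
def lookahead_latency_ms(lookahead_pictures: int, fps: float) -> float:
    """Delay added by waiting for `lookahead_pictures` extra input pictures."""
    picture_duration_ms = 1000.0 / fps
    return lookahead_pictures * picture_duration_ms


# Hypothetical example: a lookahead of 25 pictures at 50 fps.
print(lookahead_latency_ms(25, 50))  # 500.0
```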

Here, we're going to put this problem aside, and suppose that our encoder doesn't do any lookahead: as soon as a picture arrives on the encoder input, it's instantaneously compressed and sent to the transmission channel.

The problem

So, no latency can come from the encoder: it has an infinitely fast CPU, so all processing is instantaneous, and it doesn't introduce algorithmic latency.

The decoder also has an infinitely fast CPU.

The transmission channel also has negligible latency.

However, there are two limitations:

The transmission channel has a finite bitrate.
There's a big variability in the size (in bits) of the pictures produced by the encoder.

Both of these factors, combined, are going to force us to introduce an unexpectedly huge amount of algorithmic latency. Let's see why.

Pictures: size matters

Let's suppose the framerate in our setup is 50 frames per second. This means each picture corresponds to 20ms of capture, and so, must be displayed for 20ms.

Let's suppose the bitrate of our transmission channel is 1 Mbps. This means that during 20ms, we transmit exactly 20kb (kilobits).

So, if every picture produced by the encoder is exactly 20kb big, everything's fine: on the decoder side, everything runs like clockwork. Every 20ms, a new picture is received, instantaneously decompressed, and displayed.

In this scenario, we would only have 20ms of total latency: this is the time it takes to transmit one compressed picture (we're assuming here that the picture transmission begins exactly when the whole compressed picture is produced by the encoder).
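The arithmetic behind these figures can be sketched as follows. The constant and function names are made up for this illustration; the values are the ones used in this article.

```python
BITRATE_BPS = 1_000_000  # channel bitrate: 1 Mbps
FPS = 50                 # pictures per second


def bits_per_picture(bitrate_bps: int, fps: int) -> float:
    """Number of bits the channel carries during one picture period."""
    return bitrate_bps / fps


def transmission_time_ms(picture_bits: float, bitrate_bps: int) -> float:
    """Time needed to push one compressed picture through the channel."""
    return 1000.0 * picture_bits / bitrate_bps


budget = bits_per_picture(BITRATE_BPS, FPS)
print(budget)                                     # 20000.0
print(transmission_time_ms(budget, BITRATE_BPS))  # 20.0
```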

For live video broadcasting, 20ms latency is pretty good. However, a fixed-picture-size requirement on a video encoder is simply unrealistic.

Not all input content is created equal: some content is "harder" to encode, and some content is "easier" to encode.

Let's take two examples:

1) an action scene from a Hollywood movie
2) a still picture showing a blue sky

For a given video quality, content #1 will require more bits than content #2, because content #1 is going to be full of motion, flashes, etc. while content #2 is going to be low-frequency, predictable data.

We will say that content #1 is "harder to encode", and that content #2 is "easier to encode".

Now, let's go back to our budget of 20kb-per-picture. Having such a fixed budget for every picture is going to:

waste bits on "easy" content.
starve "hard" content of bits, which will degrade the video quality.

So, let's relax the 20kb-per-picture constraint a little bit, and make it instead "20kb-per-picture-on-average".

This allows variability in picture size, so we can redistribute the bits between the pictures, depending on the content. This leads to much better video quality. The "easy" pictures will be smaller than the "hard" pictures.
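As a small illustration of the relaxed constraint, the sketch below uses a made-up sequence of picture sizes (not measured from any real encoder) that vary widely yet still average exactly 20kb.

```python
# Hypothetical compressed picture sizes, in bits (illustrative values only).
sizes_bits = [10_000, 10_000, 30_000, 30_000, 30_000, 10_000]

BUDGET_BITS = 20_000  # 1 Mbps / 50 fps

average_bits = sum(sizes_bits) / len(sizes_bits)
print(average_bits)                      # 20000.0
print(min(sizes_bits), max(sizes_bits))  # 10000 30000
```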

However, now, there's a high variability in the picture sizes ... and we've just opened a big can of worms.

Pictures: display duration versus transmission duration

So, we have:

50 fps content
1 Mbps transmission channel

Statements:

1) If a picture is 20kb big, it will take exactly 20ms to transmit it, and we will display it for 20ms. We're at equilibrium here. We will call such pictures "even pictures". If all pictures were like this, this whole article would be pointless!
2) If a picture is bigger than 20kb (e.g. 30kb), transmitting it takes more than 20ms (e.g. 30ms), but of course, we're still going to display this picture for 20ms. So this picture takes longer to transmit than to display: its transfer needs to start earlier (e.g. 10ms earlier) than for an "even" picture. We will call such pictures "long pictures".
3) If a picture is smaller than 20kb (e.g. 10kb), transmitting it takes less than 20ms (e.g. 10ms), but of course, we're still going to display this picture for 20ms. So this picture takes less time to transmit than to display. We will call such pictures "short pictures".
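The three statements above can be sketched as a small classifier. The function name and its return convention are assumptions made for this illustration: it returns the kind of picture, plus the difference between its transmission time and its display time.

```python
BITRATE_BPS = 1_000_000           # 1 Mbps channel
FPS = 50                          # 50 fps content
PICTURE_DURATION_MS = 1000 / FPS  # each picture is displayed for 20 ms


def classify(picture_bits: int) -> tuple:
    """Return (kind, surplus_ms), where surplus_ms is the transmission
    time minus the display time of the picture."""
    transmit_ms = 1000 * picture_bits / BITRATE_BPS
    surplus_ms = transmit_ms - PICTURE_DURATION_MS
    if surplus_ms > 0:
        kind = "long"
    elif surplus_ms < 0:
        kind = "short"
    else:
        kind = "even"
    return kind, surplus_ms


for bits in (20_000, 30_000, 10_000):
    print(bits, classify(bits))
# 20000 ('even', 0.0)
# 30000 ('long', 10.0)
# 10000 ('short', -10.0)
```

A positive surplus is how much earlier the transfer has to start compared to an "even" picture.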

If the decoder receives a succession of "short pictures", it cannot display them at the same rate as it receives them: this would violate the framerate constraint. So, it's going to store those pictures in a buffer, which we call the "coded picture buffer", aka "CPB" (this is MPEG-4 terminology; MPEG-2 uses the term "video buffering verifier" (VBV), which is basically the same thing).

As long as the succession of "short pictures" lasts, the picture count in the CPB will grow.

If the decoder receives a succession of "long pictures", we have a difficulty, as pictures are arriving at a slower rate than we need to display them!

Here, there are two possibilities:

The CPB contains no picture: we're dead, as the display wants the next picture, and the decoder can't provide it, because the reception of this picture hasn't completed yet! This will result in a freeze on the display. This situation is called *underflow*. This should *never* occur.
The CPB contains some pictures: we take the oldest one, remove it from the CPB, instantaneously decompress it, and send it to the display.

Of course, if the sequence of "long pictures" lasts for too long, the CPB will eventually go empty, and an underflow will occur.
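Here is a minimal sketch of this reasoning, under simplifying assumptions that go beyond what's stated above: pictures are sent back to back over the channel, and the display shows the first picture a fixed delay after transmission starts. The function name, the `startup_delay_ms` parameter and the picture sizes are made up for this illustration.

```python
BITRATE_BPS = 1_000_000           # 1 Mbps channel
FPS = 50                          # 50 fps content
PICTURE_DURATION_MS = 1000 / FPS  # 20 ms per picture


def find_underflow(picture_sizes_bits, startup_delay_ms):
    """Return the index of the first picture that is not fully received
    when the display needs it, or None if no underflow occurs.

    Assumptions: pictures are transmitted back to back, and picture 0 is
    displayed `startup_delay_ms` after the transmission starts.
    """
    arrival_ms = 0.0
    for index, size_bits in enumerate(picture_sizes_bits):
        # The picture is usable only once its last bit has arrived.
        arrival_ms += 1000 * size_bits / BITRATE_BPS
        display_ms = startup_delay_ms + index * PICTURE_DURATION_MS
        if arrival_ms > display_ms:
            return index  # the CPB is empty when the display asks
    return None


# Hypothetical sequence: two short pictures, three long ones, one short.
sizes = [10_000, 10_000, 30_000, 30_000, 30_000, 10_000]
print(find_underflow(sizes, startup_delay_ms=20))  # 4
print(find_underflow(sizes, startup_delay_ms=40))  # None
```

In this example, the two short pictures give the CPB a small head start, but the three long pictures that follow eat it up: with only 20ms of initial delay, the fifth picture arrives too late. Waiting longer before displaying the first picture avoids the underflow, at the cost of more latency.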

Video encoders simulate the filling of the decoder's CPB, and it's their job to guarantee that they won't cause decoders to underflow.

To be continued: next time, we'll see how encoders do that!

Part 2 is now available. Check it out!
