Low-latency

What is "latency"?

In the engineering domain, the term latency generally describes the time delay between the input of a system and its response, between a cause and its consequence.

In the communication and IT domain, latency represents the delay occurring in the transport of information.

For instance, in the context of live TV coverage, it can represent the delay between a live action and the moment it is displayed on the viewer’s screen.

Looking more closely at video processing equipment in the production chain, latency generally represents the processing time elapsed between the instant a video frame enters the system and the moment it is output.

Latency requirements vary widely with an application’s purpose and its context of execution.

As an example, latency in TV broadcasting to the final subscribers is usually counted in seconds, while it is counted in frames (tens of milliseconds) when feeding a giant screen installed in a soccer stadium or a live event hall. In industrial, computer vision, and medical applications, latency requirements can be expressed in units below the millisecond.


How to measure latency?

First of all, it is important to specify what to measure.

We defined latency above as the processing delay of a piece of video equipment. Actually, this delay is a combination of a capture latency, a processing latency, and a playout latency.

As the DELTACAST products implement the capture and playout interfaces of the video equipment, while the processing part depends on the final application, our measurements focus on these video transport steps and assume that the processing part is instantaneous. We hence measure the DELTACAST device’s contribution to the final product’s end-to-end latency.

How we measure this contribution is illustrated in the following figure:

In this test setup, a video generator emits a source signal that goes through a zero-delay splitter producing two identical outputs.

One of these signals is directly connected to the measuring equipment, while the second signal goes through the device under test before reaching the measuring device.

Inside the device under test, an application captures video frames, or smaller portions of video content, from a DELTACAST input channel and forwards them as soon as possible to a DELTACAST output channel.
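In code, the device under test runs little more than a capture-and-forward loop. The following C sketch illustrates that structure; capture_lock_slot, slot_video_buffer, capture_unlock_slot, and playout_send_frame are hypothetical placeholders standing in for the SDK’s capture and playout primitives, not actual VideoMaster API names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical I/O primitives, assumed to wrap the SDK's capture and
 * playout calls; they are placeholders, not real VideoMaster functions. */
void    *capture_lock_slot(void *rx_channel);          /* blocks until a frame is captured */
uint8_t *slot_video_buffer(void *slot, size_t *size);  /* gives access to the frame pixels */
void     capture_unlock_slot(void *rx_channel, void *slot);
void     playout_send_frame(void *tx_channel, const uint8_t *pixels, size_t size);

void pass_through(void *rx_channel, void *tx_channel, volatile bool *stop)
{
    while (!*stop) {
        /* 1. Wait for the next fully captured frame. */
        void *slot = capture_lock_slot(rx_channel);

        /* 2. Forward it to the output channel as soon as possible
         *    (a real application would process the pixels here). */
        size_t size;
        uint8_t *pixels = slot_video_buffer(slot, &size);
        playout_send_frame(tx_channel, pixels, size);

        /* 3. Release the slot so the driver can reuse it. */
        capture_unlock_slot(rx_channel, slot);
    }
}
```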

To provide figures that are meaningful for applications in the TV broadcast domain, the transmitters of the video source generator and of the device under test can be genlocked onto a common reference. This output signal synchronization influences the end-to-end latency of the device under test, as the emitted video frames are locked to the reference frame rate and are therefore delayed if they become available sooner.


Frame-based versus sub-frame approaches

The DELTACAST products historically implement a frame-based behavior. The VideoMaster API and its concept of slots also operate at frame granularity.

This is the obvious approach for most of the use cases in which DELTACAST cards and FLEX modules are used: media servers, video encoders, video processors, subtitling engines, multiviewers, and video servers all capture, process, timestamp, encode, and analyze video frames.

Beyond this semantic consideration, these products also rely on architectures, toolsets, and graphics engines that work on a frame-by-frame basis.

Working on video frames is the default method and the most common way of processing video content. On the other hand, video frames are not transported atomically and instantaneously over video interfaces like SDI, HDMI, and IP.

All these transport technologies serialize the content on the cable, line after line and pixel after pixel, so that a frame occupies its complete time slot within the frame period to be sent over the transport medium:

As an example, a 1080p60 video means that 60 frames are emitted each second. Put differently, one frame is emitted every 60th of a second, i.e. every 16.66667 milliseconds.

Video transport being serialized, a device receives the last pixel of a frame 16.66667 milliseconds after it received the first pixel of that frame.

For systems working on entire frames, acquiring the complete frame before processing imposes a minimum latency of one frame. Similarly, the emission on the cable of one frame made available by the system also lasts for one frame.

Besides the video transport layers, most applications also require internal buffering, again counted in a number of entire frames.

Frame-based systems often exhibit an end-to-end latency of several frames, and this delay can be problematic in some use cases. Think of lip-sync problems, of a live concert where the audience notices the delay between the band’s live performance and its display on an on-stage giant screen, or of video feedback equipment during a surgical operation.

Because end-to-end latency really matters in certain specific cases, it is sometimes worthwhile – yet more complex and challenging – to work on a sub-frame basis.

A sub-frame approach means that the video processing is designed to work on portions of frames, as soon as they become available.

Let’s imagine a video frame subdivided into four slices. Since the video is serialized and transmitted pixel by pixel, an application capable of working on such portions of frames can cut the transport-related latency down by a factor of 4.
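A quick back-of-the-envelope computation, in plain C with no SDK involved, gives the corresponding transport timings for 1080p60 divided into four slices:

```c
#include <stdio.h>

int main(void)
{
    const double frame_rate = 60.0;                 /* 1080p60           */
    const double frame_ms   = 1000.0 / frame_rate;  /* 16.667 ms / frame */
    const int    slices     = 4;

    /* Transport is serialized, so the first slice is fully received
     * after a quarter of the frame period: slice-level processing can
     * start that much sooner than frame-level processing. */
    printf("frame time: %.3f ms\n", frame_ms);           /* 16.667 ms */
    printf("slice time: %.3f ms\n", frame_ms / slices);  /*  4.167 ms */
    return 0;
}
```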

DELTACAST supports both the frame-based and the sub-frame methods. Let’s detail them.


Details on the frame-based video interfacing

DELTACAST video interface cards and FLEX modules implement hardware buffering and software buffering. At both stages, buffers contain entire frames.

In the case of reception, one complete frame is captured into the hardware memory. As soon as the last pixel of the frame is written into memory, the card issues an interrupt that instructs the driver to start a DMA transfer moving the frame to the host computer memory. As soon as that DMA transfer finishes, the frame is available to the calling application. Thanks to double-buffering at the hardware level, the next frame is captured by the card while the current frame is DMA-transferred to the host computer.

This framework results in a cable-to-application latency of 1 frame plus the DMA transfer time of that frame:

As an example, for a 1080p60 video source reception in 8-bit 4:2:2 YUV, one frame time is 16.7 msec, and the DMA transfer takes around 3 msec on a PCIe Gen2 x4 card (~1500MB/sec) or around 1.5 msec on a PCIe Gen2 x8 card (~3000MB/sec).

The resulting minimal cable-to-application latency is hence around 18.2 msec (16.7 + 1.5, on the x8 card) in a single-channel use case. It increases in multi-channel use cases, as the PCIe bandwidth is shared amongst all concurrent channels.
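That budget can be reproduced numerically, as in the sketch below; it uses the approximate bandwidth figures quoted above, and the small differences with the rounded figures come from rounding. Actual DMA throughput depends on the platform and on concurrent PCIe traffic.

```c
#include <stdio.h>

int main(void)
{
    /* 1080p60 in 8-bit 4:2:2 YUV: 2 bytes per pixel on average. */
    const double frame_bytes = 1920.0 * 1080.0 * 2.0;  /* ~4.15 MB  */
    const double frame_ms    = 1000.0 / 60.0;          /* 16.667 ms */

    const double bw_x4 = 1500e6;  /* ~1500 MB/sec, PCIe Gen2 x4 */
    const double bw_x8 = 3000e6;  /* ~3000 MB/sec, PCIe Gen2 x8 */

    double dma_x4 = 1000.0 * frame_bytes / bw_x4;  /* ~2.8 ms, quoted above as ~3 msec   */
    double dma_x8 = 1000.0 * frame_bytes / bw_x8;  /* ~1.4 ms, quoted above as ~1.5 msec */

    /* Cable-to-application latency = 1 frame + DMA transfer time. */
    printf("Gen2 x4: %.1f + %.1f = %.1f ms\n", frame_ms, dma_x4, frame_ms + dma_x4);
    printf("Gen2 x8: %.1f + %.1f = %.1f ms\n", frame_ms, dma_x8, frame_ms + dma_x8);
    return 0;
}
```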


In the case of transmission, the concept of latency does not make sense in every situation, because the generation of video frames is not necessarily linked or locked to their transmission. Actually, in most cases the output channel start and its frame ticks are asynchronous with the software transmission loop.


For video processing applications performing an on-the-fly task on a video feed, end-to-end latency is a fully meaningful measure. The minimum achievable latency in this case is 2 frames, as illustrated by the following figure:

If the video alignment differs a bit, if the application processing lasts a bit longer, or if DMA transfers are slower or delayed due to multi-channel operation, the minimum latency quickly increases to 3 frames:
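This jump from 2 to 3 frames can be captured by a simple rounding model: the frame is fully serialized after one frame time, then transferred in, processed, and transferred out, and the genlocked output can only start on the next frame tick. The DMA and processing durations below are illustrative assumptions, not measured values.

```c
#include <math.h>
#include <stdio.h>

/* Simplified model of a genlocked frame-based pipeline. */
static double e2e_frames(double dma_in_ms, double proc_ms, double dma_out_ms)
{
    const double frame_ms = 1000.0 / 60.0;                         /* 1080p60 */
    double ready_ms = frame_ms + dma_in_ms + proc_ms + dma_out_ms;
    return ceil(ready_ms / frame_ms);  /* wait for the next output tick */
}

int main(void)
{
    /* Everything fits within the frame following the capture: 2 frames. */
    printf("fast path: %.0f frames\n", e2e_frames(1.5, 5.0, 1.5));
    /* Slower DMA and longer processing spill over one more tick: 3 frames. */
    printf("slow path: %.0f frames\n", e2e_frames(3.0, 14.0, 3.0));
    return 0;
}
```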


Details on the sub-frame working mode

While all the DELTACAST I/O boards and FLEX modules implement a frame-based API, some of them also implement a sub-frame API. We call this mode Ultimate Low Latency (ULL).

On reception channels, the sub-frame mode allows the application to access the frame buffer while it is still being filled by DMA from the underlying hardware card buffer. The application can poll the frame buffer filling to know which portion of the frame is ready for processing.

This buffer-filling polling method in itself yields only a small reduction of the capture latency, because the DMA transfer runs concurrently with the onboard frame capture: by the time the complete frame has been captured at the DELTACAST card level, most of it has already been transferred to the host computer buffer, and only the last pixels remain to be sent by DMA.

The real gain in latency comes when the application is able to start processing the pixels as soon as the first few are received, instead of waiting for the complete frame as in frame mode.
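Structurally, a ULL reception loop polls the buffer filling level and hands each slice to the processing stage as soon as that slice is complete. The sketch below assumes hypothetical helpers, get_slot_filling and process_slice; the actual VideoMaster ULL entry points differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not actual VideoMaster API names. */
size_t get_slot_filling(void *slot);  /* bytes of the frame already DMA'd to host memory */
void   process_slice(const uint8_t *data, size_t offset, size_t length);

void process_frame_ull(void *slot, const uint8_t *buffer, size_t frame_size)
{
    const size_t slice_size = frame_size / 4;  /* four slices, as above */
    size_t done = 0;

    while (done < frame_size) {
        /* How much of the frame has already landed in host memory? */
        size_t filled = get_slot_filling(slot);

        /* Process every complete slice not handled yet; a real
         * application would yield or wait instead of spinning. */
        while (filled - done >= slice_size) {
            process_slice(buffer, done, slice_size);
            done += slice_size;
        }
    }
}
```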

That gain in latency makes complete sense when implementing a video processor that performs some task on a captured live video feed before re-emitting the processed signal. In this case, if the application is able to perform its processing on portions of frames during the capture, the end-to-end latency can be cut down to one frame when the output is genlocked to the input, and to even less in the non-genlocked use case.


ULL - a new way of thinking about frame buffer processing

Besides a drastic reduction of the minimal end-to-end latency, the new ULL mode and its sub-frame approach allow reshaping the way video processing applications are implemented.

Instead of implementing frame-based processing, applications can work on smaller portions of the video frame buffer – from a few slices down to a granularity as small as a few lines or a few tens of microseconds.
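To give that granularity a concrete scale: a 1080p60 signal carries 1125 total lines per frame (SMPTE ST 274 raster), so a few lines indeed amount to a few tens of microseconds:

```c
#include <stdio.h>

int main(void)
{
    const double frame_us = 1e6 / 60.0;        /* 16667 us per frame      */
    const double line_us  = frame_us / 1125.0; /* ~14.8 us per total line */

    printf("1 line : %5.1f us\n", line_us);        /* ~14.8 us */
    printf("4 lines: %5.1f us\n", 4.0 * line_us);  /* ~59.3 us */
    return 0;
}
```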

Through this new mode, VideoMaster even allows handling the processing more dynamically. As an example, imagine processing jobs driven by CPU availability, instead of overprovisioning the system’s CPU power just to guarantee that entire frames can be processed the moment they become available.

The new Ultimate Low Latency (ULL) mode is currently available on the following products: