Thursday, May 14, 2009

HD Video Standard

Despite the last couple of years being a time when "high definition" video has really gained traction, there's one surprising thing about HD video: it doesn't have an obvious definition. Dan Rayburn brings up this observation in a recent blog post:
For an entire industry that defines itself based on the word "quality", today there is still no agreed upon standard for what classifies HD quality video on the web....If the industry wants to progress with HD quality video, we're going to have to agree on a standard - and fast.
He's absolutely right. Many companies attempt to pass off 480p as HD video, but most video enthusiasts would reject such an assertion--after all, if it isn't HD for an analog signal, why would it be HD for a digital signal? Likewise, lots of video is encoded at an unacceptably low bit rate which results in obvious artifacts. Why would such poor quality video be considered "high definition?"

Wikipedia's definition for High-definition television is a decent start:
High-definition television (or HDTV) is a digital television broadcasting system with higher resolution than traditional television systems (standard-definition TV, or SDTV). HDTV is digitally broadcast; the earliest implementations used analog broadcasting, but today digital television (DTV) signals are used, requiring less bandwidth due to digital video compression.
This still falls short. What exactly is "higher resolution than traditional television systems?" And which SDTV system, since there were many of them? Is resolution all there is to it? What if I encode video at 1080p but at a horrible bit rate, which causes lots of blocking artifacts? What about video with an odd aspect ratio, where the number of vertical lines doesn't pass muster? Clearly this definition is incomplete.

Some aspects of creating a standard are fairly straightforward: most people seem to be fairly comfortable with 720p being the "minimum" resolution at which video can be encoded. Ben Waggoner had an interesting proposal where 720p was acceptable, but it was also acceptable to generalize it to anything with "at least 16 million pixels per second," which takes into account both frame rate and resolution. He also brought up the issue of using horizontal resolution as a criterion, since not everything is 16:9.
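To make the "pixels per second" idea concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the 16,000,000 floor is simply the figure quoted above, not an agreed standard.

```cpp
#include <cassert>
#include <cstdint>

// Pixels delivered per second for a given frame size and frame rate.
std::int64_t PixelsPerSecond(int width, int height, int framesPerSecond)
{
    assert(width > 0 && height > 0 && framesPerSecond > 0);
    return static_cast<std::int64_t>(width) * height * framesPerSecond;
}

// Hypothetical helper: true when a clip meets the proposed
// "at least 16 million pixels per second" floor.
bool MeetsPixelRateFloor(int width, int height, int framesPerSecond)
{
    const std::int64_t kMinPixelsPerSecond = 16000000; // figure from the proposal
    return PixelsPerSecond(width, height, framesPerSecond) >= kMinPixelsPerSecond;
}
```

By this measure, 1280x720 at 24 frames per second (about 22.1 million pixels per second) clears the floor, while VGA at 30 frames per second (about 9.2 million) does not.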

But on the question of "quality," most simply punted, and I find this odd. Ben Waggoner mentions:

Hassan Wharton-Ali brought up another good point on the thread - HD should actually be HD quality. It can’t be a lousy, over-quantized encode using a suboptimally high resolution just so it can be called HD.

A good test is the video should look worse (due to less detail), not better (due to less artifacts), if encoded at a lower resolution at the same data rate. If reducing your frame size makes the video look better when scaled to the same size, then the frame size is too high!

It is a good point, and I don't completely disagree with Ben's proposal--it should look worse due to less detail if encoded at a lower resolution. But this is the crux of the issue: what does it mean to look worse? Is this just a subjective judgment call on behalf of the person encoding the video? I don't think this addresses the problem of having a minimum acceptable "quality" for HD video.

Dan Rayburn's suggestion is even less desirable, in my opinion:

To me, the term HD should refer to and be defined by the resolution and a minimum bitrate requirement. Since you could have a 1080p HD video encoded at a very low bitrate, which could result in a poor viewing experience inferior to that of a higher-bitrate video in SD resolution, the resolution and bitrate is the only way to define HD.

The first issue with this is the "minimum bit rate" requirement would have to somehow scale with the resolution and frame rate. It would have to account for the codec being used. This would result in an impossibly complicated system, endless arguments, etc. (for example, would we impose the same bit rate requirement on H.264 as we would on VC1? What about "future" codecs?)

A bigger issue is not all video content is the same. The resulting "quality" of a video encoded to a given bitrate absolutely has a relationship with the video being encoded. A video with very little movement can often be encoded with a low bit rate and look fantastic, so the bit rate requirement would essentially amount to wasted bandwidth. Conversely, a video with a lot of motion and scene changes may require a lot more bits to get an acceptable, block-free viewing experience--and it's not clear what that acceptable threshold would be.

I find both of these suggestions insufficient. I propose an alternative: objective video quality algorithms. The idea is straightforward: by comparing the source material with the output material, we can objectively establish a score that at least has some meaningful relationship with Mean Opinion Scores (MOS). In a nutshell, a MOS is how "good" the average person thinks some piece of video appears.

Peak Signal-to-Noise Ratio (PSNR), which is derived from Mean Squared Error (MSE), is the most common algorithm, albeit a crude one that many engineers and scientists consider deficient. But it's 2009, baby--we can do better. We have better.
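For reference, here is a rough sketch of how MSE and PSNR are usually computed for a single pair of frames. It assumes 8-bit samples (peak value 255) held in flat arrays of equal size; the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Mean squared error between two 8-bit frames of identical size.
double MeanSquaredError(const std::vector<std::uint8_t>& reference,
                        const std::vector<std::uint8_t>& encoded)
{
    assert(!reference.empty() && reference.size() == encoded.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i)
    {
        const double diff = static_cast<double>(reference[i]) - static_cast<double>(encoded[i]);
        sum += diff * diff;
    }
    return sum / static_cast<double>(reference.size());
}

// PSNR in decibels, assuming 8-bit samples (peak value 255).
// Identical frames have zero error, which is reported as infinity.
double PeakSignalToNoiseRatio(const std::vector<std::uint8_t>& reference,
                              const std::vector<std::uint8_t>& encoded)
{
    const double mse = MeanSquaredError(reference, encoded);
    if (mse == 0.0)
    {
        return std::numeric_limits<double>::infinity();
    }
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```

Because every sample error is weighted equally regardless of where it falls in the picture, the score says nothing about structure, which is a large part of why it tracks perceived quality so loosely.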

My suggestion would be the Structural SIMilarity (SSIM) index, which is relatively inexpensive to compute (its closely related brother, MSSIM, is much more pricey) and definitely correlates better with MOS.

How would this work?

  1. During the encode process, an SSIM score is computed for each output frame, using the corresponding input frame as the reference image.
  2. This process is repeated for every pair of input and output frames.
  3. The lowest observed SSIM score is the resulting quality score for that piece of encoded video. (I suppose another alternative is to use the average. Yet another option is to use the variance. I'd avoid the median, since it's robust against outliers, and outliers matter)
  4. If the lowest observed SSIM score is less than some threshold, then the video cannot be considered High Definition.
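The steps above can be sketched roughly as follows. This is a simplification under stated assumptions: it computes SSIM from whole-frame statistics on a single 8-bit luma plane, whereas real SSIM implementations score many small local windows and average them. The names, and the idea of passing the threshold in as a parameter, are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Frame = std::vector<std::uint8_t>; // one 8-bit luma plane

// Simplified SSIM using whole-frame statistics (a single window).
double GlobalSsim(const Frame& reference, const Frame& encoded)
{
    assert(!reference.empty() && reference.size() == encoded.size());
    const double n = static_cast<double>(reference.size());

    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i)
    {
        meanX += reference[i];
        meanY += encoded[i];
    }
    meanX /= n;
    meanY /= n;

    double varX = 0.0, varY = 0.0, covXY = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i)
    {
        const double dx = reference[i] - meanX;
        const double dy = encoded[i] - meanY;
        varX += dx * dx;
        varY += dy * dy;
        covXY += dx * dy;
    }
    varX /= n;
    varY /= n;
    covXY /= n;

    // Stabilizing constants from the SSIM definition for 8-bit data.
    const double c1 = (0.01 * 255.0) * (0.01 * 255.0);
    const double c2 = (0.03 * 255.0) * (0.03 * 255.0);

    return ((2.0 * meanX * meanY + c1) * (2.0 * covXY + c2)) /
           ((meanX * meanX + meanY * meanY + c1) * (varX + varY + c2));
}

// Steps 1-3: score every input/output frame pair and keep the lowest score.
double LowestSsim(const std::vector<Frame>& input, const std::vector<Frame>& output)
{
    assert(!input.empty() && input.size() == output.size());
    double lowest = 1.0;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        lowest = std::min(lowest, GlobalSsim(input[i], output[i]));
    }
    return lowest;
}

// Step 4: the clip qualifies only if its worst frame meets the threshold.
bool PassesQualityFloor(const std::vector<Frame>& input,
                        const std::vector<Frame>& output, double threshold)
{
    return LowestSsim(input, output) >= threshold;
}
```

Taking the minimum means a single badly encoded frame fails the whole clip; swapping `std::min` for a running average would give the more forgiving alternative mentioned in step 3.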
For a visual representation of how this works, take a look at this graph:


This is a graph of SSIM over time, displaying multiple bit rates. The x-axis is frame number, and the y-axis is SSIM score. My input was a VGA, 30 FPS, ~30 second raw-RGB video clip. Each line corresponds to a bit rate requested of the encoder (x264's H.264 implementation, using a baseline profile--you can see this in the low SSIM scores at the beginning of the video, which are due to single-pass encoding). Notice the clear relationship between SSIM scores and bit rate. Also note how much variance there is in video quality: clearly certain portions of this clip are "more difficult" to encode, and this results in a degradation of video quality. Finally, notice a clear law of diminishing returns: as more and more bits are thrown at the video clip, the SSIM scores converge on 1.0--the SSIM scores at 2 Mbit/sec aren't substantially different from the SSIM scores at 500 kbit/sec.

There are a few gotchas with this plan: what if we're changing the frame rate (e.g. 3:2 pulldown) and there is no clear reference frame to which we compare the output? How do we determine the SSIM threshold? Do we really want to use SSIM, or is some other algorithm better?

The first question is answered relatively easily: we compare only what was input to the encoder and the resulting output. Presumably the process of manipulating the frame rate is separate from the process of encoding. What we're talking about is how good a job our encoder does of matching the input.

The second question is easier, but it requires someone conducting subjective video quality assessment tests to determine what threshold corresponds with a baseline SSIM number. In effect, someone has to do some statistical analysis on data captured during viewing sessions of actual people watching actual footage encoded with an actual compression algorithm, and determine a threshold that correlates well with people's perception of "High Definition." But at least with SSIM, this is a manageable process: once a threshold is determined, it's really independent of a whole slew of factors, like codec, the video being encoded, etc.

Let's say we decide that any SSIM score below 0.9 disqualifies the video from being called "high definition"--for the above video, this would mean 500 kbps would be just slightly too poor to call HD (notice the poor quality at the beginning of the clip). And 1000 kbps would be more than acceptable.

Lastly, even though PSNR is an outdated method, I see no reason a high-definition standard could not include metrics for both objective quality tests. There are other objective video quality algorithms, and certainly more will be developed in the future, so any standard should be open to extension at a later date.

(side note: maybe part of our problem is this emphasis on bit rate--which has no relationship to quality beyond "more is probably better"--when our real emphasis should be on a metric that correlates with quality, but I digress)

I don't really care what objective metric is used, and certainly there is plenty of debate over which objective method correlates best with MOS, and what threshold should be used--but let's at least be scientific about this. If there's going to be a "standard" for High Quality video, then let's choose a standard that will carry us forward and not create a quagmire.

Monday, May 04, 2009

Dealing with Image Formats

One of the most common tasks when working with video is dealing with colorspaces and image formats. In this post, I'll discuss the two major colorspaces commonly used in Microsoft code and how to convert between different formats within a given colorspace. In some future post, I might talk about converting one colorspace to a totally separate colorspace, but that topic is worthy of its own discussion.

In the Microsoft world, there are two colorspaces that we're concerned about: YUV and RGB.

RGB Color Space
RGB is generally the easiest colorspace to visualize, since most of us have dabbled with finger paints or crayons. By mixing various amounts of red, green, and blue, the result is a broad spectrum of colors. Here is a simple illustration to convey this colorspace:

The top image of the barn is what you see. The three pictures below it are the red, green and blue components, respectively. When you add them together, voila, you get barnyard goodness. (sidenote: because you "add" colors together in the RGB colorspace, we call this an "additive" color model)

In the digital world, we have a convenient representation for RGB. Typically 0, 0, 0 corresponds with black (i.e. red, green and blue values are set to 0), and 255, 255, 255 is white. Intermediate values result in a large palette of colors. A common RGB format is RGB24, which allocates three 8 bit channels for red, green, and blue values. Since each channel has 256 possible values, the total number of colors this format can represent is 256^3, or 16,777,216 colors. There are also other RGB formats that use less/more data per channel (and thus, less/more data per pixel), but the general idea is the same. To get an idea of how many RGB formats exist, one need not go any farther than fourcc.org.

Despite the multitude of RGB formats, in the MSFT world, you can basically count on dealing with RGB24 or RGB32. RGB32 is simply RGB24, but with 8 bits devoted to an "alpha" channel specifying how translucent a given value is.
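As a minimal sketch of that relationship, the following expands a packed 3-byte-per-pixel buffer into a 4-byte-per-pixel one. It assumes tightly packed rows (no stride padding), keeps the three color bytes in whatever order they arrive, and defaults the fourth byte to fully opaque; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands packed 3-byte pixels to packed 4-byte pixels, keeping the three
// color bytes in their original order and appending a fourth (alpha) byte.
std::vector<std::uint8_t> ExpandRgb24ToRgb32(const std::vector<std::uint8_t>& rgb24,
                                             std::uint8_t alpha = 0xFF)
{
    assert(rgb24.size() % 3 == 0);
    const std::size_t pixelCount = rgb24.size() / 3;

    std::vector<std::uint8_t> rgb32;
    rgb32.reserve(pixelCount * 4);
    for (std::size_t p = 0; p < pixelCount; ++p)
    {
        rgb32.push_back(rgb24[p * 3 + 0]);
        rgb32.push_back(rgb24[p * 3 + 1]);
        rgb32.push_back(rgb24[p * 3 + 2]);
        rgb32.push_back(alpha); // assumption: 0xFF means fully opaque
    }
    return rgb32;
}
```

The output is always exactly four thirds the size of the input, since each pixel grows from three bytes to four.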

YUV Color Space
YUV is substantially different from RGB. Instead of mixing three different colors, YUV separates out the luminance and chroma into separate values, whereas RGB implicitly contains this information in the combination of its channels. Y represents the luminance component (think of this as a "black and white" channel, much like black and white television) and U and V are the chrominance (color) components. There are several advantages to this format over RGB that make it desirable in a number of situations:
  • The primary advantage of luminance/chrominance systems such as YUV is that they remain compatible with black and white analog television.
  • Another advantage is that the signal in YUV can be easily manipulated to deliberately discard some information in order to reduce bandwidth.
  • The human eye is more sensitive to luminance than chroma; in this sense, YUV is generally considered to be "more efficient" than RGB because more information is spent on data that the human eye is sensitive to.
  • It is more efficient to perform many common operations in the YUV colorspace than in RGB--for example, image/video compression. By nature, these operations occur more easily in a YUV colorspace. Often, the heavy lifting in many image processing algorithms is applied only to the luminance channel.
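To illustrate the "deliberately discard some information" point, here is a small sketch of 2:1 horizontal chroma subsampling on one row of U or V values. It simply keeps every second sample, which is one of several possible approaches (averaging neighbours is another); the function name is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keeps every second chroma sample in a row (2:1 horizontal subsampling).
// The luminance row is left untouched, so brightness detail is preserved.
std::vector<std::uint8_t> SubsampleChromaRow(const std::vector<std::uint8_t>& chroma)
{
    assert(chroma.size() % 2 == 0);
    std::vector<std::uint8_t> half;
    half.reserve(chroma.size() / 2);
    for (std::size_t i = 0; i < chroma.size(); i += 2)
    {
        half.push_back(chroma[i]);
    }
    return half;
}
```

Applied to both the U and V rows, this takes a row of pixels from three bytes per pixel down to two, without touching the Y values the eye is most sensitive to.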
Thus far, the best way I've seen to visualize the YUV colorspace was on this site.

Original image on the left, and the single Y (luminance) channel on the right:



...And here are the U and V channels combined:



Notice that the Y channel is simply a black and white picture. All of the color information is contained in the U and V channels.

Like RGB, YUV has a number of sub-formats. Another quick trip to fourcc.org reveals a plethora of YUV types, and Microsoft also has this article on a handful of the different YUV types used in Windows. YUV formats are even more varied than RGB formats.

The bad news is that there are a lot of redundant YUV image formats. For example, YUY2 and YUYV are the exact same format and merely have different fourcc names. YUY2 and UYVY have the same layout (16 bpp, "packed" format) but with the per-pixel byte order reversed. IMC4 and IMC2 have the same layout (both 12 bpp, "planar" formats) but with the U and V "planes" swapped. (more on planar/packed in a moment)
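As a quick sketch of how thin the difference between YUY2 and UYVY is, swapping each adjacent pair of bytes turns one into the other. This assumes a tightly packed buffer whose size is a multiple of four bytes (one two-pixel group); the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Swaps each adjacent byte pair in place. Applied to YUY2 (Y0 U0 Y1 V0) it
// yields UYVY (U0 Y0 V0 Y1); applied a second time it restores the original.
void SwapYuy2AndUyvy(std::vector<std::uint8_t>& packed)
{
    assert(packed.size() % 4 == 0);
    for (std::size_t i = 0; i + 1 < packed.size(); i += 2)
    {
        std::swap(packed[i], packed[i + 1]);
    }
}
```

The same function works in both directions, since the operation is its own inverse.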

The good news is that it's pretty easy to go between the different formats without too much trouble, as we'll demonstrate later.

Packed/Planar Image Formats
The majority of image formats (in both the RGB and YUV colorspaces) are in either a packed or a planar format. These terms refer to how the image is formatted in computer memory:
  • Packed: the channels (either YUV or RGB) are stored in a single array, and all of the values are mixed together in one monolithic chunk of memory.
  • Planar: the channels are stored as three separate planes.
For example, the following image shows a packed format:



This is YUY2. Notice that the different Y, U, and V values simply sit alongside one another; they are not segregated in memory in any way. Also note that the above represents six pixels. RGB24/RGB32/YUY2 are all examples of packed formats.

This image shows a planar format:



This is YV12. Notice that the three planes have been separated in memory, rather than being in a single, monolithic array. Often times this format is desirable (especially in the YUV colorspace, where the luminance values can then easily be extracted). YV12 is an example of a planar format.
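To show the difference between the two layouts in code, here is a minimal sketch that splits a packed buffer into three planes. It assumes a simple packed format with three one-byte channels per pixel and no subsampling (RGB24, for instance), so it does not apply to YUY2 as is. The struct and function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Three separate channel planes, in the order the channels appear
// within each packed pixel.
struct Planes
{
    std::vector<std::uint8_t> first;
    std::vector<std::uint8_t> second;
    std::vector<std::uint8_t> third;
};

// Splits an interleaved 3-byte-per-pixel buffer into three planes.
Planes SplitPackedIntoPlanes(const std::vector<std::uint8_t>& packed)
{
    assert(packed.size() % 3 == 0);
    const std::size_t pixelCount = packed.size() / 3;

    Planes planes;
    planes.first.reserve(pixelCount);
    planes.second.reserve(pixelCount);
    planes.third.reserve(pixelCount);
    for (std::size_t p = 0; p < pixelCount; ++p)
    {
        planes.first.push_back(packed[p * 3 + 0]);
        planes.second.push_back(packed[p * 3 + 1]);
        planes.third.push_back(packed[p * 3 + 2]);
    }
    return planes;
}
```

Once the data is planar, reading one channel is a single contiguous pass over one array, which is why planar layouts are convenient when only the luminance values are needed.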


Converting Between Different Formats in the Same Color Space
Within a given colorspace are multiple formats. For example, YUV has multiple formats with differing amounts of information per pixel and layout in memory (planar vs. packed). Additionally, you may have different amounts of information for the individual Y, U, and V values, but most Microsoft formats typically allocate no more than 8 bits per channel.

As long as the Y, U, and V values for the source and destination images have equivalent allocation, converting between various YUV formats is reduced to copying memory around. For this section we'll deal with YUV formats, since RGB will follow the same general principles. As an example, let's convert from YUY2 to AYUV.

YUY2 is a packed, 16 bits/pixel format. In memory, it looks like so (one byte per value):

Y0 U0 Y1 V0   Y2 U1 Y3 V1   Y4 U2 Y5 V2

The above would represent the first six pixels of the image. Notice that each pixel ends up with a Y value, and every other pixel contains a U and a V value. There is no alpha channel. The image contains a 2:1 horizontal down sampling.

A common misconception is that the # of bits per pixel is directly related to the color depth (i.e. the # of colors that can be represented). In YUY2, our color depth is 24 bits (there are 2^24 possible color combinations), but it's only 16 bits/pixel because the U and V channels have been down sampled.

AYUV, on the other hand, is a 32 bits/pixel packed format. Each pixel contains a Y, U, V, and Alpha channel. In memory, it ends up looking like so (one byte per value):

V0 U0 Y0 A0   V1 U1 Y1 A1   V2 U2 Y2 A2

The above would represent the first three pixels of the image. Notice that each pixel has three full 8 bit values for the Y, U and V channels. There is no down sampling. There is also a fourth channel for an alpha value.
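The bits-per-pixel figures fall out of simple arithmetic on the smallest repeating group of bytes. The following sketch shows the calculation; the helper names are invented for illustration, and the buffer size ignores any stride padding.

```cpp
#include <cassert>
#include <cstddef>

// Average bits per pixel = bits in one repeating group / pixels in that group.
int BitsPerPixel(int bytesPerGroup, int pixelsPerGroup)
{
    assert(bytesPerGroup > 0 && pixelsPerGroup > 0);
    return (bytesPerGroup * 8) / pixelsPerGroup;
}

// Size in bytes of a tightly packed image (no stride padding).
std::size_t PackedBufferBytes(int width, int height, int bitsPerPixel)
{
    assert(width > 0 && height > 0 && bitsPerPixel > 0);
    return static_cast<std::size_t>(width) * height * bitsPerPixel / 8;
}
```

YUY2 stores four bytes (Y0 U0 Y1 V0) for every two pixels, so `BitsPerPixel(4, 2)` gives 16. AYUV stores four bytes for every pixel, so `BitsPerPixel(4, 1)` gives 32. In both cases each individual Y, U, or V value is still a full 8 bits.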

In going from YUY2 to AYUV, notice that the YUY2 image contains 16 bits/pixel whereas the AYUV image contains 32 bits/pixel. We have a couple of options for the conversion, but the easiest is to simply reuse the U and V values shared by each pair of YUY2 pixels for both output pixels. Thus, we have to do no interpolation at all to go from YUY2 to AYUV--it's simply a matter of re-arranging memory. Since all the values are 8 bit, there isn't any additional massaging to do; they can simply be reused as is.

Here is a sample function to convert YUY2 to AYUV:
  
// Converts an image from YUY2 to AYUV. Input and output images must
// have identical dimensions, and the width must be even (YUY2 stores
// pixels in pairs). Function does not deal with any potential stride
// issues. Uses HRESULT and BYTE from <windows.h>.
HRESULT ConvertYUY2ToAYUV( const BYTE * pYUY2Buffer, BYTE * pAYUVBuffer, int IMAGEHEIGHT, int IMAGEWIDTH )
{
    if( pYUY2Buffer == NULL || pAYUVBuffer == NULL || IMAGEHEIGHT < 2
        || IMAGEWIDTH < 2 || ( IMAGEWIDTH % 2 ) != 0 )
    {
        return E_INVALIDARG;
    }

    const BYTE * pSource = pYUY2Buffer; // Note: this buffer will be w * h * 2 bytes (16 bpp)
    BYTE * pDest = pAYUVBuffer;         // Note: this buffer will be w * h * 4 bytes (32 bpp)

    for( int rows = 0; rows < IMAGEHEIGHT; rows++ )
    {
        for( int columns = 0; columns < ( IMAGEWIDTH / 2 ); columns++ )
        {
            // We copy two pixels at a time, since YUY2 stores them as a
            // four-byte Y0 U0 Y1 V0 group.
            BYTE Y0 = *pSource++;
            BYTE U0 = *pSource++;
            BYTE Y1 = *pSource++;
            BYTE V0 = *pSource++;

            // Because the U and V values are subsampled, we *reuse* them for
            // both output pixels. AYUV bytes are laid out as V, U, Y, A.

            // First pixel
            *pDest++ = V0;
            *pDest++ = U0;
            *pDest++ = Y0;
            *pDest++ = 0xFF; // alpha: fully opaque, rather than leaving the byte uninitialized

            // Second pixel
            *pDest++ = V0;
            *pDest++ = U0;
            *pDest++ = Y1;
            *pDest++ = 0xFF; // alpha: fully opaque
        }
    }

    return S_OK;
}

Note that the inner "for" loop processes two pixels at a time.

For a second example, let's convert from YV12 to YUY2. YV12 is a 12 bit/pixel, planar format. In memory, it looks like so:

...notice that every four pixel Y block has one corresponding U and V value, or to put it a different way, each 2*2 Y block has a U and V value associated with it. And, yet another way to visualize it: the U and V planes are one quarter the size of the Y plane.
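The plane arithmetic can be sketched as follows. This assumes even dimensions and tightly packed planes with no stride padding, in the Y, then V, then U order that YV12 uses; the struct and function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Byte offsets of each plane within a tightly packed YV12 buffer.
struct Yv12Layout
{
    std::size_t yOffset;
    std::size_t vOffset;
    std::size_t uOffset;
    std::size_t totalBytes;
};

Yv12Layout ComputeYv12Layout(int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    const std::size_t ySize = static_cast<std::size_t>(width) * height;
    const std::size_t chromaSize = ySize / 4; // one sample per 2x2 block of pixels
    return Yv12Layout{ 0, ySize, ySize + chromaSize, ySize + 2 * chromaSize };
}

// Index within the U (or V) plane of the sample shared by pixel (x, y).
std::size_t ChromaIndex(int x, int y, int width)
{
    assert(x >= 0 && y >= 0 && x < width && width % 2 == 0);
    return static_cast<std::size_t>(y / 2) * (width / 2) + (x / 2);
}
```

For a 4x4 image the buffer is 16 + 4 + 4 = 24 bytes for 16 pixels, which is where the 12 bits/pixel figure comes from. Pixels (0,0), (1,0), (0,1) and (1,1) all map to chroma index 0.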

Since every Y, U, and V value is 8 bits, it again comes down to selectively moving memory around. No interpolation is required:
  
// Converts an image from YV12 to YUY2. Input and output images must
// have identical dimensions, and the width and height must both be even
// (each U and V value covers a 2x2 block of pixels). Function does not deal
// with any potential stride issues. Uses HRESULT and BYTE from <windows.h>.
HRESULT ConvertYV12ToYUY2( const BYTE * pYV12Buffer, BYTE * pYUY2Buffer, int IMAGEHEIGHT, int IMAGEWIDTH )
{
    if( pYUY2Buffer == NULL || pYV12Buffer == NULL || IMAGEHEIGHT < 2
        || IMAGEWIDTH < 2 || ( IMAGEHEIGHT % 2 ) != 0 || ( IMAGEWIDTH % 2 ) != 0 )
    {
        return E_INVALIDARG;
    }

    // Let's start out by getting pointers to the individual planes in our
    // YV12 image. Note that the Y plane in a YV12 image's size is
    // simply the image height * image width. This is because all values
    // are 8 bits. The V plane follows the Y plane and the U plane follows
    // the V plane; each is one quarter the size of the Y plane (hence the
    // division by 4).
    const BYTE * pYV12YPlane = pYV12Buffer;
    const BYTE * pYV12VPlane = pYV12YPlane + ( IMAGEHEIGHT * IMAGEWIDTH );
    const BYTE * pYV12UPlane = pYV12VPlane + ( ( IMAGEHEIGHT * IMAGEWIDTH ) / 4 );

    BYTE * pYUY2BufferCursor = pYUY2Buffer;

    // Keep in mind that YV12 has only half of the U and V information that
    // a YUY2 image contains. Because of that, we need to reuse each row of
    // U and V values for two rows of pixels, so we only advance the plane
    // pointers every other row.
    bool bMustIncrementUVPlanes = false;

    for( int ImageHeight = 0; ImageHeight < IMAGEHEIGHT; ImageHeight++ )
    {
        // Two temporary cursors that walk the current row of the U and V planes.
        const BYTE * pUCursor = pYV12UPlane;
        const BYTE * pVCursor = pYV12VPlane;

        // We process two pixels per pass through this loop,
        // hence the (IMAGEWIDTH / 2).
        for( int ImageWidth = 0; ImageWidth < ( IMAGEWIDTH / 2 ); ImageWidth++ )
        {
            // first things first: copy our Y0 value.
            *pYUY2BufferCursor++ = *pYV12YPlane++;

            // Copy U0 value
            *pYUY2BufferCursor++ = *pUCursor++;

            // Copy Y1 value
            *pYUY2BufferCursor++ = *pYV12YPlane++;

            // Copy V0 value
            *pYUY2BufferCursor++ = *pVCursor++;
        }

        // Since YV12 has half the UV data that YUY2 has, we reuse these
        // values--so we only advance these planes every other pass
        // through.
        if( bMustIncrementUVPlanes )
        {
            pYV12VPlane += IMAGEWIDTH / 2;
            pYV12UPlane += IMAGEWIDTH / 2;
        }
        bMustIncrementUVPlanes = !bMustIncrementUVPlanes;
    }

    return S_OK;
}


This code is a little more complicated than the previous sample. Because YV12 is a planar format and contains half of the U and V information contained in a YUY2 image, we end up reusing U and V values. Still, the code itself isn't particularly daunting.

One thing to realize: neither of the above functions is optimized in any way, and there are multiple ways of doing the conversion. For example, here's an in-depth article about converting YV12 to YUY2 and some performance implications on P4 processors. Some people have also recommended doing interpolation on pixel values, but in my (limited and likely anecdotal) experience, it doesn't make a substantial difference.
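For the curious, here is a minimal sketch of what interpolating chroma (rather than reusing it) might look like when doubling one row of U or V values. It uses a plain rounded average of neighbouring samples, which is only one of many possible filters, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Doubles the length of a chroma row. Even outputs copy the source sample;
// odd outputs are the rounded average of the two neighbouring source samples.
// The final sample has no right-hand neighbour, so it is simply repeated.
std::vector<std::uint8_t> UpsampleChromaRowLinear(const std::vector<std::uint8_t>& chroma)
{
    assert(!chroma.empty());
    std::vector<std::uint8_t> out;
    out.reserve(chroma.size() * 2);
    for (std::size_t i = 0; i < chroma.size(); ++i)
    {
        const unsigned current = chroma[i];
        const unsigned next = (i + 1 < chroma.size()) ? chroma[i + 1] : current;
        out.push_back(static_cast<std::uint8_t>(current));
        out.push_back(static_cast<std::uint8_t>((current + next + 1) / 2));
    }
    return out;
}
```

For a row of `{10, 20, 40}` this produces `{10, 15, 20, 30, 40, 40}`, whereas plain reuse would produce `{10, 10, 20, 20, 40, 40}`.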