Rate Control and H.264
Dynamically adjust encoder parameters to achieve a target bitrate
Concepts
A rate control algorithm dynamically adjusts encoder parameters
to achieve a target bitrate. It allocates a budget of bits to each
group of pictures, individual picture and/or sub-picture in a video
sequence. Rate control is not a part of the H.264 standard, but
the standards group has issued non-normative guidance to aid in
implementation. The purpose of this white paper is to offer 1) a basic understanding of what rate control is and why it is essential
and 2) a common framework and terminology so that schemes
originating from H.264 and other standards groups can be more easily
understood and compared.
Block-based hybrid video encoding schemes such as the MPEG [1,2]
and H.26* [3] families are inherently lossy
processes. They achieve compression not only by removing truly redundant
information from the bitstream, but also by making small quality
compromises in ways that are intended to be minimally perceptible.
In particular, the quantization parameter QP
regulates how much spatial detail is saved. When QP is very small,
almost all that detail is retained. As QP is increased, some of
that detail is aggregated so that the bit rate drops – but
at the price of some increase in distortion and
some loss of quality. Figure 1 suggests that relationship for a
particular input picture – if you want to lower bit rate,
you can do so by raising QP at a cost of increased distortion.
Figure 2 suggests that as source complexity varies
during a sequence, you move from one such curve to another.
Figure 1. For a particular source frame
Figure 2. But when source complexity varies…
Figure 3 illustrates open loop (or VBR) operation of a video encoder.
The user supplies two key inputs – the uncompressed video
source and a value for QP. As the source sequence progresses, you
will get compressed video of fairly constant quality, but the bitrate
may vary dramatically. Because the complexity of pictures is continually
changing in a real video sequence, it is not so obvious what value
of QP to pick. If you fix QP for an "easy" part of the
sequence having slow motion and uniform areas, then the bit rate
will go up dramatically when you reach the "hard" (i.e.,
more complex) parts.
In reality, constraints imposed by decoder buffer size and network
bandwidth force us to encode video at a more nearly constant bitrate.
To do this, Figure 4 suggests that we must dynamically vary QP based
upon estimates of the source complexity, so that each picture (or
group of pictures) gets an appropriate allocation of bits to work
with. Rather than specifying QP as input, the user specifies demanded
bitrate instead.
Figure 3. Open Loop Encoding (VBR)
Figure 4. Closed Loop Rate Control (CBR)
Elements of H.264 Rate Control
With a focus on the recommended approach [4, 5, 6] for H.264, Figure 5 identifies important elements within the rate controller. Most of these elements are common to other rate control schemes. Note that Figure 5 is conceptual and is not a literal representation of any software implementation. Many details are glossed over – for example, that B and P pictures are treated differently, and that some estimates are averages of sampled data over multiple pictures.
Figure 5. Elements of H.264 Rate Controller
Rate-Quantization Model
The heart of the algorithm is a quantitative model describing Figure
2 — the relationship between QP, actual bitrate and
a surrogate for encoding complexity. However, the bits and complexity
terms should be associated only with the residuals.
Why? Because the quantization parameter QP can only influence the
detail of information carried in the transformed residuals. QP has
no direct effect on the bitrates associated with overhead, prediction data, or motion vectors. The Mean Absolute Difference (MAD) of the prediction error is used as
the complexity surrogate.
The model takes an algebraic form such as
$$\mathrm{ResidualBits} = C_1 \cdot \frac{\mathrm{MAD}}{QP} + C_2 \cdot \frac{\mathrm{MAD}}{QP^2} \qquad (2)$$
but it may take a simpler form (with C2 = 0) or a more complicated form
involving exponentials or other basis curves for fitting. This equation
corresponds to equation 2-84 of [6] and to equation 1 of [2]; our term
ResidualBits is synonymous with the term Texture Bits used by other
authors [2]. The free coefficients C1 and C2 may be
estimated empirically, by providing hooks in the encoder for extracting
the residual coefficients, as well as the number of residual bits
needed to transmit them.
Having established the model in (2), we can solve for the demanded
QP when the target value of ResidualBits is supplied by the Bit
Allocation modules in Figure 5.
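To make the inversion concrete, the following minimal sketch solves the quadratic model for QP given a residual bit target. The function names and the sample coefficient values are assumptions chosen for illustration; they are not taken from [6] or from any particular encoder.

```python
import math


def residual_bits(qp, mad, c1, c2=0.0):
    """Evaluate the model: ResidualBits = c1*MAD/QP + c2*MAD/QP^2."""
    return c1 * mad / qp + c2 * mad / qp ** 2


def solve_qp(target_bits, mad, c1, c2=0.0):
    """Invert the model for QP, taking the positive root of the quadratic.

    Hypothetical helper: rearranging the model gives
    target_bits*QP^2 - c1*MAD*QP - c2*MAD = 0.
    """
    if target_bits <= 0 or mad <= 0:
        raise ValueError("target_bits and mad must be positive")
    if c2 == 0.0:
        # Simplified first-order model.
        return c1 * mad / target_bits
    disc = (c1 * mad) ** 2 + 4.0 * target_bits * c2 * mad
    if disc < 0:
        raise ValueError("model has no real solution for these coefficients")
    return (c1 * mad + math.sqrt(disc)) / (2.0 * target_bits)


if __name__ == "__main__":
    # Illustrative numbers only.
    qp = solve_qp(target_bits=20000.0, mad=5.0, c1=100000.0, c2=200000.0)
    print(round(qp, 2), round(residual_bits(qp, 5.0, 100000.0, 200000.0)))
```

A real controller would round the result to an integer QP and clip it to the legal range before use.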
Complexity Estimation
As indicated above, we need a simple metric that reflects the encoding
complexity associated with the residuals. The MAD of the prediction
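error is a convenient surrogate for this purpose:
$$\mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \hat{x}_i|$$
where $x_i$ is the source value of the ith pixel and $\hat{x}_i$ is its predicted value.
This MAD is an inverse measure of the predictor's accuracy and (in the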
case of interprediction) the temporal similarity of adjacent pictures.
Ideally, the MAD would be estimated after encoding the current picture,
but that would require us to encode the picture again after the
QP is selected – quite a burden for a computationally intensive
standard like H.264! Instead, we can usually assume that this complexity
surrogate varies gradually from picture to picture, and estimate
it based upon data extracted from the encoder for previous pictures.
Note that this assumption fails at a scene change.
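As a rough illustration, the sketch below computes the MAD for a block and then predicts the next picture's MAD from the previous one. The linear predictor with coefficients a1 and a2 is an assumption made here for simplicity; see [5] or [6] for the predictor actually recommended, and note that all names are hypothetical.

```python
def mean_absolute_difference(source, prediction):
    """MAD of the prediction error over two equal-length pixel sequences."""
    if len(source) != len(prediction) or not source:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(s - p) for s, p in zip(source, prediction)) / len(source)


def predict_mad(previous_mad, a1=1.0, a2=0.0):
    """Assumed linear predictor: MAD_current ~= a1 * MAD_previous + a2.

    With the defaults this simply reuses the previous picture's MAD,
    which is the 'complexity varies gradually' assumption in its
    simplest form.
    """
    return a1 * previous_mad + a2


if __name__ == "__main__":
    mad_prev = mean_absolute_difference([10, 20, 30], [12, 18, 30])
    print(mad_prev, predict_mad(mad_prev))
```

At a scene change the previous MAD is a poor guide, so an implementation would typically need a separate detection or fallback path there.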
QP-Limiter
Figures 4 and 5 represent a closed loop control system which must be appropriately damped to guarantee stability and to minimize perceptible variations in quality. For difficult sequences having rapid changes in complexity, QP-demand may oscillate noticeably, so a rate limiter is applied which typically limits changes in QP to no more than ± 2 units between pictures.
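A limiter of this kind can be sketched in a few lines. The function name and defaults are hypothetical; the 0 to 51 range is the usual QP range for 8-bit H.264 video, and the ± 2 step follows the text above.

```python
def limit_qp(demanded_qp, previous_qp, max_delta=2, qp_min=0, qp_max=51):
    """Clamp the demanded QP to within max_delta of the previous picture's QP,
    and to the legal QP range."""
    low = max(qp_min, previous_qp - max_delta)
    high = min(qp_max, previous_qp + max_delta)
    return max(low, min(high, demanded_qp))


if __name__ == "__main__":
    # The model asks for a jump from 28 to 35; the limiter allows only 30.
    print(limit_qp(35, 28))
```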
Virtual Buffer Model
Any compliant decoder is equipped with a buffer to smooth out variations
in the rate and arrival time of incoming data. The corresponding
encoder must produce a bitstream that satisfies constraints of the
decoder, so a virtual buffer model is used to simulate the fullness
of the real decoder buffer.
The change in fullness of the virtual buffer is the total bits encoded
into the stream, less a constant removal
rate assumed to equal the bandwidth (or demanded bitrate). The buffer
fullness is bounded by zero from below and by the buffer capacity
from above. The user must specify appropriate values for buffer
capacity and initial buffer fullness, consistent with the decoder
levels supported.
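The buffer update described above can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes the removal per picture is bitrate / frame rate and that fullness is simply clamped at both bounds.

```python
def update_buffer(fullness, picture_bits, bitrate, frame_rate, capacity):
    """Advance the virtual buffer by one picture.

    Adds the bits just encoded, removes one picture interval's worth of
    channel bits, and clamps the result to [0, capacity].
    """
    drained = bitrate / frame_rate
    new_fullness = fullness + picture_bits - drained
    return min(max(new_fullness, 0.0), capacity)


if __name__ == "__main__":
    # 1 Mbit/s at 25 fps drains 40,000 bits per picture interval.
    print(update_buffer(100000, 60000, 1_000_000, 25, 500000))
```

In a real encoder, hitting either bound is a signal that corrective action (such as frame skip or stuffing bits) is needed, rather than something to clamp silently.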
QP Initializer
QP must be initialized at the start of a video sequence. An initial
value may be input manually, but a better approach is to estimate
it from the demanded bits per pixel, i.e.,
DemandedBitsPerPixel = DemandedBitrate / (FrameRate * height * width)
Equation 2-67 of [6] provides a recommended
table relating initial QP to DemandedBitsPerPixel.
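As a worked example, the sketch below evaluates DemandedBitsPerPixel for a CIF sequence and looks up an initial QP from a threshold table. The table values here are placeholders assumed for illustration only; the recommended thresholds should be taken from [6].

```python
def demanded_bits_per_pixel(bitrate, frame_rate, width, height):
    """DemandedBitsPerPixel = DemandedBitrate / (FrameRate * height * width)."""
    return bitrate / (frame_rate * width * height)


# Placeholder table of (upper bits-per-pixel bound, initial QP), ascending.
# These numbers are assumptions for this sketch, not the values in [6].
ILLUSTRATIVE_TABLE = [(0.15, 40), (0.45, 30), (0.9, 20), (float("inf"), 10)]


def initial_qp(bpp, table=ILLUSTRATIVE_TABLE):
    """Return the QP of the first table row whose bound is not exceeded."""
    for upper_bound, qp in table:
        if bpp <= upper_bound:
            return qp
    raise ValueError("table must end with an unbounded row")


if __name__ == "__main__":
    bpp = demanded_bits_per_pixel(1_000_000, 30, 352, 288)
    print(round(bpp, 4), initial_qp(bpp))
```

For 1 Mbit/s at 30 frames per second and 352 x 288 pixels, the demanded budget is roughly 0.33 bits per pixel.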
GOP Bit Allocation
Based upon the demanded bit rate and the current fullness of the
virtual buffer, a target bit rate for the entire group of pictures
(GOP) is determined, and the QP values for the GOP's I-picture
and first P-picture are also determined.
The GOP Target is fed into the next block for detailed bit allocation
to pictures or to smaller basic units.
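One plausible reading of this step, assumed here for illustration, is that the GOP receives its nominal share of the channel minus whatever excess the virtual buffer is already carrying. The function and parameter names are hypothetical, and the exact formula in [6] differs in detail.

```python
def gop_target_bits(bitrate, frame_rate, gop_length, buffer_fullness,
                    initial_fullness=0.0):
    """Nominal GOP budget, reduced by the excess carried in the buffer.

    Assumption of this sketch: carry-over is measured relative to the
    initial buffer fullness, and the budget never goes negative.
    """
    nominal = bitrate * gop_length / frame_rate
    carry_over = buffer_fullness - initial_fullness
    return max(nominal - carry_over, 0.0)


if __name__ == "__main__":
    # 15 pictures at 25 fps and 1 Mbit/s: 600,000 bits nominal.
    print(gop_target_bits(1_000_000, 25, 15, buffer_fullness=50_000))
```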
Basic Unit Bit Allocation
The "Basic Unit" is useful terminology introduced in [4],
which is the basis for H.264 rate control recommendations [6].
With this approach, scalable rate control may be pursued to different
levels of granularity – such as picture, slice, macroblock
row or any contiguous set of macroblocks. That level is referred
to as a "basic unit" at which rate control is resolved,
and for which distinct values of QP are calculated.
If the basic unit is smaller than a picture, then this block in
Figure 5 actually breaks out into two layers – one for the
picture itself and another for the basic unit. Figure 5 and our
discussion are limited to the case where the picture itself is the
basic unit. For details on how to treat smaller basic units, please
see [5] or [6].
For H.264, the emphasis is on computing QP for each stored picture
(usually a P-picture). Strictly speaking, the H.264 standard allows
B pictures to be used as reference pictures, but this is not
expected to be common usage. The QPs for non-stored pictures
(ordinarily B-pictures) are then interpolated (and offset) from
QP values for their neighboring P pictures. First, considering the
MAD of the picture, one can determine a target level for the buffer
fullness. Then using the buffer target level, it is easy to calculate
the target bits for the picture.
Comparison with MPEG-2 (Test Model 5) Rate Control
Because of the influence and familiarity of MPEG's Test Model 5 rate control [7], it is useful to compare its similarities and differences with the H.264 approach. To do so, we transmogrify Figure 5 into Figure 6, which corresponds conceptually to the MPEG2/TM5 approach.
Figure 6. Comparison to MPEG2 Test Model 5
Similarities include the use of the virtual buffer model, the calculation
of layered bit targets for the GOP and picture, and the overall
goal of generating a quantization parameter (in this case, called
Mquant) for a basic unit. The Mquant for the basic unit (always
a single macroblock) is adjusted in proportion to its estimated
complexity.
Differences include:
- The Basic Unit is always the macroblock in this scheme. It is possible to get significant variations of quantization parameter across different macroblocks in the same picture.
- Differences between I, P and B picture types arise in the allocation of target bits. Otherwise, they are treated similarly.
- MPEG-2 does not have the same multiplicity of prediction modes. In the absence of advanced intra prediction, it need not be so rigorous in relating quantization parameter (which controls residual quality) to measured properties of the residual itself.
- Macroblock-level spatial complexity is estimated from the source activity, regardless of whether the complexity is handled by transmitting motion vectors (inter-prediction) or residual coefficients.
- Allocation of bits to a picture considers the picture type, GOP structure and demanded bitrate, but not the picture's measured complexity. However, within the picture, the buffer fullness and relative spatial activity of each macroblock are used to allocate the picture bits among the macroblocks.
It is easy to recognize this Test Model 5 approach as an ancestor of
the H.264 approach, which accommodates the more general prediction
methods of H.264 and provides more flexibility to scale the granularity
of control.
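The macroblock-level adjustment in Test Model 5 can be sketched as below. The normalized-activity formula and the 1 to 31 clipping range are as commonly described for TM5 [7]; the function names, and the assumption that a reference quantizer value has already been derived from buffer fullness, belong to this sketch.

```python
def normalized_activity(act, avg_act):
    """TM5-style normalized activity; ranges between 0.5 and 2.0.

    act is the macroblock's spatial activity and avg_act the average
    activity of the previous picture (both assumed >= 1).
    """
    return (2.0 * act + avg_act) / (act + 2.0 * avg_act)


def mquant(reference_q, act, avg_act):
    """Scale the buffer-derived reference quantizer by normalized activity
    and clip to the MPEG-2 range 1..31."""
    q = reference_q * normalized_activity(act, avg_act)
    return int(min(max(round(q), 1), 31))


if __name__ == "__main__":
    # A macroblock of average activity keeps the reference quantizer.
    print(mquant(10, 100.0, 100.0))
```

Busy macroblocks therefore receive a coarser quantizer and flat ones a finer quantizer, which is why Mquant can vary significantly within a single picture.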
H.264 Rate-Distortion Optimization and Global Rate Control
H.264 provides 7 modes for inter (temporal) prediction, 9 modes
for intra (spatial) prediction of 4x4 blocks, 4 modes for intra
prediction of 16 x 16 macroblocks, and one skip mode. Each 16 x
16 macroblock can be broken down in numerous ways. Thus, mode selection
for each macroblock is a critical and time-consuming step that enables
much of the dramatic bitrate reduction.
Selection of the optimal mode is done by an algorithm called rate-distortion
optimization (RDO) [8], which essentially involves
1) an exhaustive pre-calculation of all feasible modes to determine
the bits and distortion of each; 2) evaluation of a metric that
considers both bitrate and distortion; and 3) selection of the mode
that minimizes the metric.
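The three steps can be sketched as follows, assuming the metric is the usual Lagrangian cost J = D + λ·R discussed in [8]. The mode names, distortion and bit figures, and the multiplier values are made up for illustration.

```python
def select_mode(candidates, lagrange_multiplier):
    """Pick the mode with the lowest cost J = D + lambda * R.

    candidates is an iterable of (mode_name, distortion, bits) tuples
    produced by trial-encoding the macroblock in each feasible mode.
    """
    best = min(candidates, key=lambda c: c[1] + lagrange_multiplier * c[2])
    return best[0]


if __name__ == "__main__":
    trial_results = [
        ("inter16x16", 1200.0, 40),
        ("intra4x4", 700.0, 150),
        ("skip", 2500.0, 1),
    ]
    for lam in (0.5, 5.0, 100.0):
        print(lam, select_mode(trial_results, lam))
```

A small multiplier favors the low-distortion mode, while a large one favors the cheapest mode. In practice the multiplier is tied to QP, which is one way RDO and rate control interact.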
QP is input to the RDO process, which does not regulate QP or modify
the quality of the residual coefficients. RDO is complementary to
rate control; these two aspects of the problem are decoupled because
a fully coupled optimization would require a more expensive iterative
solution.
The interplay with RDO, described in [4] as
a "chicken and egg" dilemma, influences implementation
of a rate control algorithm. The MAD is needed by the rate control
algorithm, but it is available only after the RDO has used a QP
value to generate it. Thus, the rate control algorithm must use
an estimate for MAD based upon complexity of prior pictures in the
sequence.
ExpertH264 Implementation of Rate Control
PixelTools has implemented the H.264 rate control recommendations
in a recent release of ExpertH264. For this release, we have provided
picture level control without frame skip. Especially for offline
applications for encoding to stored media, this algorithm provides
excellent tracking of bitrates for GOPs of a wide variety of sizes.
Typical results track GOP bitrate within 1% without B pictures or
2-3% with B pictures, with good stabilization of QP to prevent noticeable
swings in quality. You can try this for yourself by requesting a
free demo of ExpertH264 from PixelTools
Corporation.
In subsequent releases, we plan to allow flexibility for smaller
basic units, which will allow closer bitrate tracking on the individual
picture level, as well as for smaller virtual buffer capacities.
We will also support both frame skip and stuffing bits in a subsequent
release – depending upon the end requirements, use of one
or both of these techniques will reduce variations in bitrate.
The algorithm is a separate module having several interfaces that
can be called by the encoder, and with callbacks to the encoder
for retrieving key information such as residual bits and residual
coefficients. Construction of the complexity metric (i.e., prediction
error MAD) is part of the rate control algorithm. C Interfaces and
utility functions include:
- init_rateControl
- initRateControlParams
- gopRateControl
- frameRateControl
- getQB
- updateModel
- updateBFrameState
- getMbMAD
- initialQP
Thus, developers of hardware and software encoders can consider
integrating this algorithm into their own environments. For example,
after the encoding step, a call to updateModel refreshes the empirical
coefficients such as C1 and C2 in equation (2). Similarly, frameRateControl
is called prior to encoding each picture and supplies the quantization
parameter.
Terminology
The following glossary is intended to help with a common understanding
of rate control issues.
Prediction. Both H.264 and MPEG-*
may predict a macroblock by traditional inter (temporal) prediction,
i.e., a motion estimation from previous reference pictures followed
by transmission of the motion vector. Additionally, H.264 supports
advanced intra (spatial) prediction of a macroblock from encoded
values for neighboring pixels that have already been encoded (e.g.,
in raster-scan order).
Residual. The difference between the
source and prediction signals is called the residual,
or the prediction error. A spatial transform is then applied to
the residual to produce transformed coefficients that carry any
spatial detail that is not captured in the prediction itself or
its reference pictures.
Distortion. Distortion refers to the
difference between the original source image x, and the reconstructed
image y after it has been decoded. In H.264, the sum of squared differences
is used to quantify distortion as $\frac{1}{N} \sum_i |y_i - x_i|^2$,
for any set of N pixels.
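For reference, this distortion measure is straightforward to compute. The sketch below uses a hypothetical function name and plain Python lists of pixel values.

```python
def distortion(source, reconstructed):
    """Mean of squared differences between source x and reconstruction y."""
    if len(source) != len(reconstructed) or not source:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(source)
    return sum((y - x) ** 2 for x, y in zip(source, reconstructed)) / n


if __name__ == "__main__":
    print(distortion([1, 2, 3], [1, 4, 6]))
```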
Complexity. As the saying goes, I can't
define complexity, but I know it when I see it! A single source
picture is complex if it is "busy" and has lots of spatial
detail. The term spatial activity is synonymous
with source complexity for this case. However, for a video sequence,
the meaning of complexity is, well, more complex! For example, if
a video sequence consists of one busy object that translates slowly
across the field of view, it may not require very many bits because
the temporal prediction can easily capture the motion using a single
reference picture and a series of motion vectors. It is difficult
to define an inclusive video complexity metric that is also easy
to calculate. See MAD below.
MAD: Mean Absolute Difference of Prediction
Error. For rate control, what is more important is the encoding
complexity of the residuals that are left over after the inter or
intra prediction process is finished. The Mean Absolute Difference
of Prediction Error is usually closely related to encoding complexity.
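Suppose $x_i$ is the source value for the ith pixel and $\hat{x}_i$ is its predicted value; then $\mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \hat{x}_i|$ over a set of N pixels.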
Spatial Activity. This term is used to quantify the amount of spatial
variation within a part of the picture, normally a block of N pixels.
Suppose the N pixel values are $x_i$, i = 1,..,N. Then the activity for
those N pixels is $\frac{1}{N} \sum_i (x_i - \langle x \rangle)^2$, where $\langle x \rangle = \frac{1}{N} \sum_i x_i$.
In other words, the spatial activity is the sample variance of
a block's values. It is the measure for local complexity used in
MPEG-2.
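A direct transcription of this definition, with a hypothetical function name, looks like this:

```python
def spatial_activity(pixels):
    """Variance of a block's pixel values, using the 1/N normalization."""
    if not pixels:
        raise ValueError("block must contain at least one pixel")
    n = len(pixels)
    mean = sum(pixels) / n
    return sum((x - mean) ** 2 for x in pixels) / n


if __name__ == "__main__":
    print(spatial_activity([1, 2, 3, 4]))      # some variation
    print(spatial_activity([128] * 16))        # a flat block
```

A flat block has zero activity, while a block containing edges or texture scores high.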
Bitrate. Bitrate refers to the bits per
second consumed by a sequence of pictures, i.e., bitrate = (average
bits per picture) × (frames per second). In practice, it is equated
to the reliable network bandwidth that is provisioned or available
for the stream.
Quantization Parameter (QP). Residuals
are transformed into the spatial frequency domain by an integer
transform that approximates the familiar Discrete Cosine Transform
(DCT). The Quantization Parameter determines the step size for associating
the transformed coefficients with a finite set of steps. Large values
of QP represent big steps that crudely approximate the spatial transform,
so that most of the signal can be captured by only a few coefficients.
Small values of QP more accurately approximate the block's spatial
frequency spectrum, but at the cost of more bits. In H.264, each
unit increase of QP lengthens the step size by 12% and reduces the
bitrate by roughly 12%.
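The 12% figure follows from the step size doubling for every increase of 6 in QP, since 2^(1/6) ≈ 1.12. The sketch below uses the step-size values commonly tabulated for QP 0 through 5 in H.264; treat the table as an assumption of this example, since the text above states only the approximate growth rate.

```python
# Commonly tabulated quantizer step sizes for QP = 0..5 (assumed here).
QSTEP_BASE = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Quantizer step size for an integer QP in 0..51.

    The base table repeats with a doubling every 6 QP units.
    """
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))


if __name__ == "__main__":
    for qp in (4, 10, 16, 28, 51):
        print(qp, qstep(qp))
```

Individual steps are not exactly 12% apart, but the average growth over any six consecutive QP values is a factor of two.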
Group of Pictures (GOP). The
Group of Pictures concept is inherited from MPEG and refers to an
I-picture, followed by all the P and B pictures until the next I
picture. A typical MPEG GOP structure might be IBBPBBPBBI. Although
H.264 does not strictly require more than one I picture per video
sequence, the recommended rate control approach does require a repeating
GOP structure to be effective. Thus, H.264 rate control will not
work properly if the IntraPeriod parameter is set to 0.
Basic unit. The authors of references
[4] and [5] introduced this
useful term that expresses the granularity on which QP is adjusted
in the feedback control loop. If the basic unit is a picture, then
the rate controller's adjustments to QP are uniform across the picture.
In MPEG-2, the basic unit is a macroblock. Initially, most H.264
applications will probably use the picture as basic unit, but ultimately
a full or partial row of macroblocks is expected to yield the best
compromise between uniform bitrate and uniform quality.
Summary
This white paper presents the basics of rate control for H.264 and
compares them to the Test Model 5 approach of MPEG-2. Implementers
needing a detailed description of the algorithm should see [5]
or [6]. The structure shown in our Figure 5, the discussion of its
modules, and the terminology glossary should provide a useful companion
to help in understanding the densely packed equations found in these
references.
References
1. C. Poynton, Digital Video
and HDTV, Elsevier Science 2003, pp. 491-2
2. A. Vetro, "MPEG-4 Rate Control for Multiple
Video Objects," IEEE Transactions on Circuits and Systems for
Video Technology, Vol. 9, No. 1, February 1999
3. G. Sullivan, T. Wiegand and K.P. Lim, "Joint
Model Reference Encoding Methods and Decoding Concealment Methods;
Section 2.6: Rate Control" JVT-I049, San Diego, September 2003
4. Z. Li et al., "Adaptive Basic Unit Layer
Rate Control for JVT," JVT-G012, 7th Meeting: Pattaya, Thailand,
March 2003
5. Z. Li et al., "Proposed Draft of Adaptive
Rate Control," JVT-H017, 8th Meeting: Geneva, May 2003
6. G. Sullivan, T. Wiegand and K.P. Lim, "Joint
Model Reference Encoding Methods and Decoding Concealment Methods;
Section 2.6: Rate Control" JVT-I049, San Diego, September 2003
7. MPEG 2 Test Model 5, Rev. 2, Section 10: Rate
Control and Quantization Optimization, ISO/IEC/JTC1SC29WG11, April
1993
8. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini
and G. Sullivan, "Rate-Constrained Coder Control and Comparison
of Video Coding Standards," IEEE Transactions on Circuits &
Systems for Video Technology, 13, #7, July 2003