Series Editor: J. Biemond, Delft University of Technology, The Netherlands
Volume 1 Three-Dimensional Object Recognition Systems
(edited by A.K. Jain and P.J. Flynn)
Volume 2 VLSI Implementations for Image Communications
(edited by P. Pirsch)
Volume 3 Digital Moving Pictures - Coding and Transmission on ATM Networks
(J.-P. Leduc)
Volume 4 Motion Analysis for Image Sequence Coding (G. Tziritas and C. Labit)
Volume 5 Wavelets in Image Communication (edited by M. Barlaud)
Volume 6 Subband Compression of Images: Principles and Examples
(T.A. Ramstad, S.O. Aase and J.H. Husøy)
Volume 7 Advanced Video Coding: Principles and Techniques
(K.N. Ngan, T. Meier and D. Chai)
ADVANCES IN IMAGE COMMUNICATION 7
Advanced Video Coding:
Principles and Techniques
King N. Ngan, Thomas Meier and Douglas Chai
University of Western Australia,
Dept. of Electrical and Electronic Engineering,
Visual Communications Research Group,
Nedlands, Western Australia 6907
1999
Elsevier
Amsterdam - Lausanne - New York - Oxford - Shannon - Singapore - Tokyo
ELSEVIER SCIENCE B.V.
Sara Burgerhartstraat 25
P.O. Box 211, 1000 AE Amsterdam, The Netherlands
© 1999 Elsevier Science B.V. All rights reserved.
This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use:
Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher
and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or
promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make
photocopies for non-profit educational classroom use.
Permissions may be sought directly from Elsevier Science Rights & Permissions Department, PO Box 800, Oxford OX5 1DX, UK;
phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.co.uk. You may also contact Rights & Permissions
directly through Elsevier's home page (http://www.elsevier.nl), selecting first 'Customer Support', then 'General Information', then
'Permissions Query Form'.
In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive,
Danvers, MA 01923, USA; phone: (978) 7508400, fax: (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid
Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 171 631 5555; fax: (+44) 171 631 5500.
Other countries may have a local reprographic rights agency for payments.
Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or
distribution of such material.
Permission of the Publisher is required for all other derivative works, including compilations and translations.
Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part
of a chapter.
Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means,
electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher.
Address permissions requests to: Elsevier Science Rights & Permissions Department, at the mail, fax and e-mail addresses noted above.
Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability,
negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein.
Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 1999
Library of Congress Cataloging in Publication Data
A catalog record from the Library of Congress has been applied for.
ISBN: 0444 82667 X
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.
To Nerissa, Xixiang, Simin, Siqi
To Elena
To June
Preface
The rapid advancement in computer and telecommunication technologies is
affecting every aspect of our daily lives. It is changing the way we interact
with each other and the way we conduct business, and it has a profound
impact on the environment in which we live. Increasingly, we see the
boundaries between the computer, telecommunication and entertainment
industries blurring as the three become more integrated. Nowadays, one no
longer uses the computer solely as a computing tool, but often as a console
for video games, movies and increasingly as a telecommunication terminal
for fax, voice or videoconferencing. Similarly, the traditional telephone net-
work now supports a diverse range of applications such as video-on-demand,
videoconferencing, Internet, etc.
One of the main driving forces behind the explosion in information traffic
across the globe is the ability to move large chunks of data over the exist-
ing telecommunication infrastructure. This is made possible largely due to
the tremendous progress achieved by researchers around the world in data
compression technology, in particular for video data. This means that for
the first time in human history, moving images can be transmitted over long
distances in real-time, i.e., at the same time as the event unfolds at the
sender's end.
Since the invention of image and video compression using DPCM (differ-
ential pulse-code-modulation), followed by transform coding, vector quanti-
zation, subband/wavelet coding, fractal coding, object-oriented coding and
model-based coding, the technology has matured to a stage where various
coding standards have been promulgated to enable interoperability between
different equipment manufacturers implementing the standards. This promotes the
adoption of the standards by the equipment manufacturers and popularizes
the use of the standards in consumer products.
JPEG is an image coding standard for compressing still images accord-
ing to a compression/quality trade-off. It is a popular standard for image
exchange over the Internet. For video, MPEG-1 caters for storage media
up to a bit rate of 1.5 Mbits/s; MPEG-2 is aimed at video transmission
of typically 4-10 Mbits/s, but it can also go beyond that range to include
HDTV (high-definition TV) images. At the lower end of the bit rate spec-
trum, there are H.261 for videoconferencing applications at p x 64 Kbits/s,
where p = 1, 2, ..., 30; and H.263, which can transmit at bit rates of less
than 64 Kbits/s, clearly aiming at the videophony market.
The standards above have a number of commonalities: firstly, they are
based on a predictive/transform coder architecture, and secondly, they pro-
cess video images as rectangular frames. These assumptions place severe
constraints as the demand for greater variety of, and access to, video content
increases. Much of the information content encountered in daily life is
multimedia, comprising sound, video, graphics, text, and animation. Standards
have to evolve to integrate and code this multimedia content. The concept
of video as a sequence of rectangular frames displayed in time is outdated,
since video nowadays can be captured in different locations and composed
into a composite scene. Furthermore, video can be mixed with graphics and
animation to form a new video, and so on. The new paradigm is to view video
content as audiovisual objects which, as entities, can be coded, manipulated
and composed in whatever way an application requires.
MPEG-4 is the emerging standard for the coding of multimedia con-
tent. It defines a syntax for a set of content-based functionalities, namely,
content-based interactivity, compression and universal access. However, it
does not specify how the video content is to be generated. The process of
video generation is difficult and under active research. One simple way is to
capture the visual objects separately, as is done in TV weather reports,
where the weather reporter stands in front of a weather map captured sepa-
rately and then composited together with the reporter. However, this is
not always possible, as in the case of outdoor live broadcasts. Therefore, au-
tomatic segmentation has to be employed to generate the visual content in
real-time for encoding. Visual content is segmented into semantically mean-
ingful objects known as video object planes. The video object plane is then
tracked, making use of the temporal correlation between frames, so that its
location is known in subsequent frames. Encoding can then be carried out
using MPEG-4.
This book addresses the more advanced topics in video coding not in-
cluded in most of the video coding books on the market. The focus of the
book is on the coding of arbitrarily shaped visual objects and its associated
topics.
It is organized into six chapters: Image and Video Segmentation (Chap-
ter 1), Face Segmentation (Chapter 2), Foreground/Background Coding
(Chapter 3), Model-based Coding (Chapter 4), Video Object Plane Ex-
traction and Tracking (Chapter 5), and MPEG-4 Video Coding Standard
(Chapter 6).
Chapter 1 deals with image and video segmentation. It begins with
a review of Bayesian inference and Markov random fields, which are used
in the various techniques discussed throughout the chapter. An important
component of many segmentation algorithms is edge detection. Hence, an
overview of some edge detection techniques is given. The next section deals
with low level image segmentation involving morphological operations and
Bayesian approaches. Motion is one of the key parameters used in video
segmentation and its representation is introduced in Section 1.4. Motion
estimation and some of its associated problems like occlusion are dealt with
in the following section. In the last section, video segmentation based on
motion information is discussed in detail.
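As a concrete illustration of the edge detection surveyed in Chapter 1, the following is a minimal sketch, not the chapter's method, of a Sobel gradient-magnitude edge map; the threshold value is an arbitrary assumption:

```python
import numpy as np

def sobel_edges(img, thresh=100.0):
    """Gradient-magnitude edge map via 3x3 Sobel operators.

    img: 2-D float array (grayscale). Returns a boolean edge mask.
    Threshold is illustrative; practical detectors pick it adaptively.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    # Convolve over the valid interior; borders stay zero.
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    mag = np.hypot(gx, gy)  # gradient magnitude
    return mag > thresh
```

Such a binary edge map is typically only one input to a segmentation pipeline; morphological or Bayesian post-processing, as discussed in the chapter, turns it into closed region boundaries.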
Chapter 2 focuses on the specific problem of face segmentation and its
applications in videoconferencing. The chapter begins by defining the face
segmentation problem followed by a discussion of the various approaches
along with a literature review. The next section discusses a particular face
segmentation algorithm based on a skin color map. Results show that this
approach is capable of segmenting facial images regardless of facial color,
and that it offers a fast and reliable method for face segmentation suitable
for real-time applications. The face segmentation information is exploited
in a video coding scheme, described in the next chapter, in which the facial
region is coded with a higher image quality than the background region.
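The core idea of skin-color classification can be sketched in a few lines: pixels whose chrominance falls inside a fixed region of the Cb-Cr plane are labeled skin. The ranges below are commonly cited approximations, not necessarily the exact map derived in Chapter 2:

```python
import numpy as np

# Illustrative chrominance bounds for skin pixels (assumed values;
# the map in Chapter 2 is derived from training data).
CB_RANGE = (77, 127)
CR_RANGE = (133, 173)

def skin_mask(cb, cr):
    """Label a pixel as skin if its (Cb, Cr) pair falls inside a
    fixed rectangular region of the chrominance plane.

    cb, cr: arrays of chrominance values in [0, 255].
    Returns a boolean mask of candidate skin pixels.
    """
    return ((cb >= CB_RANGE[0]) & (cb <= CB_RANGE[1]) &
            (cr >= CR_RANGE[0]) & (cr <= CR_RANGE[1]))
```

Working in chrominance alone is what makes such a classifier largely insensitive to skin tone and lighting intensity; a full algorithm then applies spatial regularization to reject non-face skin-colored regions.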
Chapter 3 describes the foreground/background (F/B) coding scheme
where the facial region (the foreground) is coded with more bits than the
background region. The objective is to achieve an improvement in the
perceptual quality of the region of interest, i.e., the face, in the encoded
image. The F/B coding algorithm is integrated into the H.261 coder with
full compatibility, and into the H.263 coder with slight modifications of
its syntax. Rate control in the foreground and background regions is also
investigated using the concept of joint bit assignment. Lastly, the MPEG-4
coding standard in the context of foreground/background coding scheme is
studied.
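The effect of joint bit assignment can be caricatured with a toy quantizer-assignment rule: foreground macroblocks receive a finer quantizer (more bits, better quality) than background ones. The function name and offsets below are hypothetical, not the book's actual rate-control formula:

```python
def assign_quantizers(mb_is_fg, base_q=20, fg_offset=-8, bg_offset=6):
    """Toy foreground/background quantizer assignment.

    mb_is_fg: list of booleans, one per macroblock (True = foreground).
    Foreground macroblocks get a lower (finer) quantizer step, the
    background a coarser one; values are clipped to the 1..31 range
    used by H.261/H.263 quantizer parameters.
    """
    def clip(q):
        return max(1, min(31, q))
    return [clip(base_q + (fg_offset if fg else bg_offset))
            for fg in mb_is_fg]
```

A real scheme couples the two offsets to a shared bit budget so that improving the face region degrades the background gracefully rather than overflowing the channel rate.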
As mentioned above, multimedia content can contain synthetic objects
or objects which can be represented by synthetic models. One such model
is the 3-D wire-frame model (WFM) consisting of 500 triangles commonly
used to model human head and body. Model-based coding is the technique
used to code the synthetic wire-frame models. Chapter 4 describes the pro-
cedure involved in model-based coding for a human head. In model-based
coding, the most difficult problem is the automatic location of the object
in the image. The object location is crucial for accurate fitting of the 3-D
WFM onto the physical object to be coded. The techniques employed for
automatic extraction of facial feature contours are active contours (or snakes)
for face profile and eyebrow extraction, and deformable templates for eye
and mouth extraction. For synthesis of the facial image sequence, head mo-
tion parameters and facial expression parameters need to be estimated. At
the decoder, the facial image sequence is synthesized using the facial struc-
ture deformation method, which deforms the structure of the 3-D WFM to
simulate facial expressions. Facial expressions can be represented by 44 ac-
tion units, and the deformation of the WFM is done through the movement
of vertices according to the deformation rules defined by the action units.
Facial texture is then updated to improve the quality of the synthesized
images.
Chapter 5 addresses the extraction of video object planes (VOPs) and
their tracking thereafter. An intrinsic problem of video object plane extrac-
tion is that objects of interest are not homogeneous with respect to low-level
features such as color, intensity, or optical flow. Hence, conventional seg-
mentation techniques will fail to obtain semantically meaningful partitions.
The most important cue exploited by most of the VOP extraction algo-
rithms is motion. In this chapter, an algorithm which makes use of motion
information in successive frames to perform a separation of foreground ob-
jects from the background and to track them subsequently is described in
detail. The main hypothesis underlying this approach is the existence of
a dominant global motion that can be assigned to the background. Areas
in the frame that do not follow this background motion then indicate the
presence of independently moving physical objects which can be character-
ized by a motion that is different from the dominant global motion. The
algorithm consists of the following stages: global motion estimation, ob-
ject motion detection, model initialization, object tracking, model update
and VOP extraction. Two versions of the algorithm are presented where
the main difference is in the object motion detection stage. Version I uses
morphological motion filtering whilst Version II employs change detection
masks to detect the object motion. Results will be shown to illustrate the
effectiveness of the algorithm.
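The change-detection idea behind Version II can be sketched minimally: after compensating the previous frame for the estimated global (background) motion, pixels whose difference from the current frame exceeds a threshold are flagged as belonging to independently moving objects. This is an illustrative reduction; the chapter's algorithm adds relaxation and post-processing stages:

```python
import numpy as np

def change_detection_mask(prev_compensated, curr, thresh=25.0):
    """Binary change mask between the globally motion-compensated
    previous frame and the current frame.

    Pixels that do not follow the dominant (background) motion show a
    large residual difference and are flagged as object candidates.
    The threshold is an assumed constant; practical systems derive it
    from the noise statistics of the sequence.
    """
    diff = np.abs(curr.astype(float) - prev_compensated.astype(float))
    return diff > thresh
```

The resulting mask is then cleaned up and combined with the object model to track the VOP from frame to frame.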
The last chapter of the book, Chapter 6, contains a description of the
MPEG-4 standard. It begins with an explanation of the MPEG-4 devel-
opment process, followed by a brief description of the salient features of
MPEG-4 and an outline of the technical description. Coding of audio ob-
jects including natural sound and synthesized sound coding is detailed in
Section 6.5. The next section, Coding of Natural Textures, Images and
Video, contains the main part of the chapter and is extracted from the
MPEG-4 Video Verification Model 11. This section gives a succinct explanation of
the various techniques employed in the coding of natural images and video
including shape coding, motion estimation and compensation, prediction,
texture coding, scalable coding, sprite coding and still image coding. The
following section gives an overview of the coding of synthetic objects. The
approach adopted here is similar to that described in Chapter 4. In order
to handle video transmission in error-prone environments such as mobile
channels, MPEG-4 has incorporated error resilience functionality into the
standard. The last section of the chapter describes the error resilient tech-
niques used in MPEG-4 for video transmission over mobile communication
networks.
King N. Ngan
Thomas Meier
Douglas Chai
June 1999
Acknowledgments
The authors would like to thank Professor K. Aizawa of the University of
Tokyo, Japan, for the use of the "Makeface" 3-D wireframe synthesis soft-
ware package, from which some of the images in Chapter 4 are obtained.