Real-time H264/AVC Encoder based on Enhanced
Frame Level Parallelism for Smart Multicore DSP
Camera
Nejmeddine Bahri(1), Nidhameddine Belhadj(1), Thierry Grandpierre(2), Mohamed Ali Ben Ayed(1), Nouri
Masmoudi(1), Mohamed Akil(2)
(1)National School of Engineers, LETI Laboratory, University of Sfax, Tunisia
(2)ESIEE Engineering, LIGM Laboratory, University Paris-EST, France
nejmeddine.bahri@gmail.com, Thierry.grandpierre@esiee.fr, nouri.masmoudi@enis.rnu.tn, mohamed.akil@esiee.fr
Abstract The latest generation of multicore Digital Signal Processors (DSPs), with their high computing power, low consumption and integrated peripherals, will allow them to be embedded in the next generation of smart cameras. Such DSPs allow designers to evolve the vision landscape and simplify the developer's task of running more complex image and video processing applications without the need for a separate Personal Computer (PC). This paper explains how to exploit the computing power of a multicore DSP TMS320C6472 in order to implement a real-time H264/AVC video encoder. This work prepares the way for the implementation of the new High Efficiency Video Coding standard (HEVC-H265). To improve encoding speed, the enhanced Frame Level Parallelism (FLP) approach is presented and implemented. A fully functional real-time video demo is given, taking into account video capture and bitstream storage. Experimental results show how we efficiently exploit the potential and the features of the multicore platform without inducing PSNR degradation or bitrate increase. The enhanced FLP using five DSP cores achieves a speedup factor of 4.3 times on average compared to a mono-core processor implementation for Common Intermediate Format (CIF 352x288), Standard Definition (SD 720x480) and High Definition (HD 1280x720) resolutions. This optimized implementation allows us to exceed real-time performance by reaching encoding speeds of 98 f/s (frames/second) and 32 f/s for CIF and SD resolutions respectively, and it saves up to 77% of the encoding time for the HD resolution.

Keywords H264/AVC encoder, DSP, multi-core, Frame Level Parallelism, real-time.

1 Introduction

Nowadays, smart cameras or machine vision solutions [1], [2] need to run complex image and video processing applications on growing amounts of data while meeting hard real-time constraints. New technologies of programmable processors, such as multicore DSPs and embedded heterogeneous systems (ARM-DSP [3], DSP-FPGA, ARM-FPGA), offer a very promising solution for these applications that require high computing performance. They are characterized by a high processing frequency with low power consumption compared to General Purpose Processors (GPP) or Graphics Processing Units (GPU). Several manufacturers [4] such as Freescale [5] and Texas Instruments (TI) [6] address the challenges of smart cameras with their high-performance multicore DSP processors. Exploiting these embedded technologies, smart cameras are changing the vision landscape and pushing developers to run several applications without the need for any connected PC.

In the area of video applications, compression is one of the main tasks of a smart camera or machine vision system, alongside other tasks such as object detection, tracking and recognition. Commercialized encoding IPs allow real-time performance but lack flexibility: they cannot be upgraded to follow the latest protocol enhancements and the latest advances in video compression. A new video coding standard, HEVC-H265, has recently appeared on the market, while many smart cameras still work with older video coding standards such as Motion JPEG or MPEG-4. So it is time to follow the developments in this field.

DSPs offer the software flexibility that is important to allow upgradability. They allow us to build highly flexible and scalable cameras that can follow the latest advances in video compression. Encoder parameters can also be finely tuned depending on the application's requirements. DSPs are also characterized by relatively low software development cost and a time-to-market reduction compared to ASIC development or FPGA implementation, which requires tremendous VHDL expertise and may not meet time-to-market constraints.

In this context, TI's high-performance multicore DSP processor TMS320C6472 is used in order to achieve a real-time implementation of the H264/AVC [7] video encoder. This work will be our starting point for the new video standard HEVC [8]. Indeed, since the HEVC encoder keeps the majority of H264/AVC features (GOP, frame and slice structures), our proposed approach will also benefit our future H265 implementation.

The H264 encoder is characterized by high coding efficiency compared with previous standards. However, this efficiency comes with a high computational complexity that requires high-performance processing capability to satisfy the real-time constraint (25 to 30 f/s).
When moving to high resolutions, encoding time increases drastically. The frequency limitation of embedded mono-core processors makes it hard to achieve real-time encoding, especially for HD resolutions. Using parallel and multicore architectures is therefore crucial to reduce the processing time of the H264/AVC encoder.

Several works have been published exploiting the potential parallelism of the H264/AVC standard by applying functional partitioning algorithms, data partitioning algorithms or both. Multi-processor, multi-core and multi-threading encoding systems and parallel algorithms have been discussed in many papers [9] to [27]. This paper presents the Frame Level Parallelism (FLP) approach and describes its complete implementation in an H.264/AVC encoder using the multicore DSP TMS320C6472.

The remainder of this paper is organized as follows: the next section provides an overview of data dependencies and parallelism in the H.264/AVC standard. Section 3 details related works on parallel implementations of the H264/AVC encoder. The internal architecture of our multicore DSP TMS320C6472 is described in Sect. 4. Section 5 presents our optimized implementation of the H264 encoder on a single DSP core. Section 6 focuses on the FLP algorithm implementation on five DSP cores; it details the whole coding chain (image capture, bitstream transfers) and gives experimental results. The best approach, based on the enhanced FLP, is detailed in Sect. 7, which also includes experimental results. Finally, Section 8 concludes this paper and presents some perspectives.

2 Overview of data dependencies and parallelism in the H264/AVC encoder

The H.264/AVC encoder baseline profile is a video compression standard used to reduce the video data amount in order to overcome the limitation of transmission bandwidth and the huge memory requirement for storing high definition video sequences. This standard consists of several functions in order to generate the compressed bitstream of the input video, as shown in Fig. 1.

Fig. 1 H264/AVC video encoder structure: intra prediction and motion estimation/compensation over reference frames, integer transform and quantification, entropy coding to the bitstream, plus the inverse quantification/transform and de-blocking filter loop producing the reconstructed frame

This standard divides a video sequence into a hierarchical structure with six levels, as shown in Fig. 2. The top level of this structure is the sequence, which contains one or more groups of pictures (GOP). Each GOP is composed of one or more frames. Finally, the frames are divided into one or more independent slices, themselves subdivided into macroblocks of 16x16 pixels (MB) and blocks of 4x4 pixels. Each MB undergoes two prediction types: 1) intra prediction, which consists of performing intra16x16 and intra4x4 prediction modes in order to reduce spatial redundancies in the current frame; 2) inter prediction, which consists of determining the motion vector of the current MB relative to its position in the reference frames. It includes 7 prediction modes in order to reduce temporal redundancies existing among successive frames. A mode decision is then performed to select the best prediction mode. Integer transform and quantification modules are applied to the best predicted MB in order to keep only the most significant coefficients. An entropy coding is finally performed to generate the compressed bitstream. A decoding chain is included in the encoder structure in order to keep the reconstructed frame, which is filtered with a de-blocking filter in order to eliminate artifacts. The reconstructed frame will be used as a reference for the next frames to perform motion estimation.

Fig. 2 Hierarchical decomposition of an H.264 video sequence: sequence, GOP, frame, slice, MB, block
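For illustration, this hierarchical decomposition can be modeled with C structures as follows (a minimal sketch; the type and field names are ours, not those of the LETI codec):

/* Illustrative model of the six-level H.264 data hierarchy. */
#include <stdint.h>

typedef struct {                      /* 4x4 block                    */
    uint8_t pel[4][4];
} Block;

typedef struct {                      /* 16x16 macroblock: luma plus  */
    uint8_t luma[16][16];             /* 4:2:0 subsampled chroma      */
    uint8_t cb[8][8], cr[8][8];
} Macroblock;

typedef struct { Macroblock *mbs;    int num_mbs;    } Slice;
typedef struct { Slice      *slices; int num_slices; } Frame;
typedef struct { Frame      *frames; int num_frames; } Gop;
typedef struct { Gop        *gops;   int num_gops;   } Sequence;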
According to the functions organization and the hierarchical sequence structure of the H.264/AVC encoder, there are mainly two partitioning families:

Task-level parallelization (TLP) or functional partitioning: it consists of splitting the encoder into several steps, grouping them into a number of task groups equal to the number of threads available on the system, and running these groups of tasks simultaneously as a pipeline. Thus, the appropriate functions that can be grouped together to be processed in parallel, and the other functions that must be executed serially to respect data dependencies, should be chosen carefully. Also, the computational complexity of each task should be taken into consideration in order to maximize the encoding gain and ensure a workload balance between the parallel tasks. Finally, when grouping functions, synchronization overhead should be minimized as much as possible by eliminating data dependencies between the different function blocks. For example, the intra prediction modes (13 modes) and the inter prediction modes (7 modes) can be processed in parallel because no dependencies exist among them. On the other hand, integer transform, quantification and entropy coding have to be processed serially given the dependencies among them.
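As a rough C sketch of this chain (all functions and types below are placeholders, not the actual LETI codec API), the two prediction branches are independent and could be assigned to different processing units, while the tail of the chain must stay serial:

/* Per-MB encoding chain: parallelizable head, serial tail (sketch). */
typedef struct { int mode; long cost; } PredResult;   /* placeholder */
typedef struct { short coef[16][16];  } Coeffs;       /* placeholder */

extern PredResult intra_predict(const void *mb);            /* 13 modes */
extern PredResult inter_predict(const void *mb,
                                const void *search_window); /* 7 modes  */
extern PredResult mode_decision(PredResult a, PredResult b);
extern Coeffs     transform_quantify(const void *mb, PredResult best);
extern void       entropy_encode(const Coeffs *c);
extern void       reconstruct(const Coeffs *c, void *rec_mb);

void encode_mb(const void *cur_mb, const void *search_window,
               void *rec_mb)
{
    /* No mutual dependency: in a TLP scheme these two calls could   */
    /* run simultaneously on two cores.                              */
    PredResult intra = intra_predict(cur_mb);
    PredResult inter = inter_predict(cur_mb, search_window);

    /* Serial part: each stage consumes the previous stage's output. */
    PredResult best = mode_decision(intra, inter);
    Coeffs c = transform_quantify(cur_mb, best);
    entropy_encode(&c);        /* bitstream output                   */
    reconstruct(&c, rec_mb);   /* inverse quant/transform + filter   */
}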
Data-level parallelization (DLP) or data partitioning: it exploits the hierarchical data structure of the H264/AVC encoder by simultaneously processing several data units on multiple processing units. DLP is limited by the data dependencies among the different data units.

For the H264/AVC encoder, there are two major types of data dependencies:

Spatial dependencies: they exist among macroblocks within the current encoding frame. In fact, to perform intra prediction, motion vector prediction and reconstructed MB filtering for the current MB, data are required from its already encoded neighboring MBs (Left, Top-Left, Top and Top-Right), as shown in Fig. 3. So the current MB can be encoded only if its neighboring MBs have been encoded.

Fig. 3 Spatial dependencies for the current MB: left, top-left, top and top-right neighboring MBs and their boundary pixels

Temporal dependency: to determine the motion vector of the current MB relative to its position in the previously encoded frames, a motion estimation (ME) algorithm such as MB matching is performed. The search for the corresponding MB is restricted to a specific area called the "search window" in the reference frames (the previously encoded frames) instead of scanning the whole frame, in order to reduce the computing complexity. A partial dependency among MBs of successive frames is thus imposed, limited to the search window.
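Both constraints reduce to a simple readiness test before a MB is dispatched to a processing unit. A sketch follows; the bookkeeping flags, array sizes and the clamping of the window are hypothetical illustrations, not the codec's actual implementation:

/* Readiness test for MB (x, y) under both dependency types.        */
#define MB_COLS 80                /* e.g. HD 720p: 1280/16           */
#define MB_ROWS 45                /* 720/16                          */

/* done[0] = previous (reference) frame, done[1] = current frame.    */
extern unsigned char done[2][MB_ROWS][MB_COLS];

int mb_ready(int x, int y, int is_p_frame)
{
    /* Spatial: left, top-left, top and top-right must be encoded.   */
    if (x > 0 && !done[1][y][x-1])                      return 0;
    if (y > 0 && x > 0           && !done[1][y-1][x-1]) return 0;
    if (y > 0 && !done[1][y-1][x])                      return 0;
    if (y > 0 && x < MB_COLS - 1 && !done[1][y-1][x+1]) return 0;

    /* Temporal (P frames): the reference rows covered by the 48x48  */
    /* search window (one MBs row above and below) must be done.     */
    if (is_p_frame) {
        int last = (y + 1 < MB_ROWS) ? y + 1 : MB_ROWS - 1;
        for (int k = 0; k < MB_COLS; k++)
            if (!done[0][last][k]) return 0;
    }
    return 1;
}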
As data partitioning is restricted by these data dependencies, several points can be noticed. No dependencies exist among different GOPs, because each GOP starts with an intra frame "I" where only intra prediction is performed, so dependencies exist only among MBs in the same frame. The remaining frames of the GOP are predicted frames "P" where both intra and inter prediction are performed. Hence, several GOPs can be encoded in parallel; this method is called GOP Level Parallelism [14]. A partial dependency exists between successive frames of the same GOP due to motion estimation in the search window. Thus, multiple frames can also be encoded in a pipeline once the search window is encoded; this method is called Frame Level Parallelism [12]. When dividing a frame into independent slices, several slices can be processed in parallel; this approach is called Slice Level Parallelism [13]. Finally, within the same frame, multiple MBs can be encoded at the same time once their neighboring MBs are already encoded; this scheme is called MB Level Parallelism [12].

3 Related works

To overcome the high complexity of the H264/AVC encoder and to resolve the problem of mono-core processor frequency limitation, many researchers have worked on the parallelism of the H264/AVC encoder in order to meet the real-time constraint and achieve a good encoding speedup, which can be expressed by the following equation:

speedup = time of sequential encoding / time of parallel encoding    (1)

Several implementations exploiting multi-threading, multi-processor and multicore architectures are discussed in many papers:

Zhibin Xiao et al. [9] exploit the task-level parallelism approach. They partition and map the dataflow of the H.264/AVC encoder onto a 167-core asynchronous array of simple processors (AsAP) computation platform coupled with two shared memories and a hardware accelerator for motion estimation. They process the luminance and the chrominance components in parallel, and the intra4x4 and intra16x16 modes are calculated in parallel. Only 3 modes for intra4x4 instead of 9, and 3 modes for intra16x16, are considered in order to reduce the top-right dependency. Eight processors are used for transform and quantification and 17 processors for CAVLC. Despite all these hardware resources, a real-time implementation is not achieved: the presented encoder is capable of encoding VGA (640x480) video at 21 frames per second (fps). Reducing the number of candidate modes for intra4x4 and intra16x16 induces visual quality degradation and a bitrate increase.

Sun et al. [10] implement a parallel algorithm for the H.264 encoder based on MB region partition (MBRP). They split the frame into several MB regions composed of adjoining columns of MBs. Then, they map the MB regions onto different processors to be encoded while satisfying the data dependencies in the same MBs row. Simulation results on 4 processors running at 1.7 GHz show that the proposed partitioning achieves a speedup by a factor of 3.33 without any rate distortion (quality, bitrate) compared to the H264 reference software JM10.2 [11]. On the other hand, they are still far from a real-time implementation, which requires at least 25 f/s: they can encode only 1 frame/1.67 s for CIF resolution and 1 frame/6.73 s for SD resolution.

Zhuo Zhao et al. [12] propose a new wave-front parallelization method for the H.264 encoder. They mix two partitioning methods: MB row level parallelism and frame level parallelism. All MBs in the same MB row are processed by the same processor or thread to reduce data exchanges between processors. MBs in different frames can be processed concurrently if the reconstructed MBs
in the reference frame forming the search window are all available. They implement this method using JM9.0 on a Pentium 4 processor running at 2.8 GHz. Simulations on 4 processors show that a speedup by a factor of 3 is achieved (3.17 for QCIF resolution (Quarter CIF, 176x144) and 3.08 for CIF). Encoding quality is not changed and remains the same as the original JM 9.0 software. On the other hand, the runtime is far from a real-time implementation: only 1 frame/1.72 s is encoded for CIF resolution.

Yen-Kuang et al. [13] parallelize the H.264 encoder exploiting thread-level parallelism using the OpenMP programming model. Slice level partitioning is performed on 4 Intel Xeon™ processors with Hyper-Threading Technology. Results show speedups ranging from 3.74x to 4.53x. The drawback of slice parallelism is that it affects the rate-distortion performance. Indeed, it produces PSNR degradation and an important increase in bitrate, especially when the frame is decomposed into several independent slices.

S. Sankaraiah et al. [14] [15] apply GOP level parallelism using a multithreading algorithm in order to avoid data dependencies. Each GOP is handled by a separate thread. Frames in each GOP are encoded by two threads: I and P frames by the first thread and B frames by the second thread. The obtained speedups using dual and quad core processors are 5.6 and 10 respectively. The drawback of GOP level parallelism is its very high encoding latency, which is not compatible with video conference applications.

Rodriguez et al. [16] go a step further and propose an implementation of the H.264 encoder using GOP level parallelism combined with slice level parallelism on clustered workstations using the Message Passing Interface (MPI). The first approach speeds up the processing but introduces a high latency; the second approach is used to reduce this latency by dividing each frame into several slices and distributing these slices to computers belonging to a subgroup of computers. With this technique, the encoding latency is relatively reduced. However, increasing the number of slices per frame has significant adverse effects on the rate distortion (bitrate increment). Also, clustered workstations are a costly solution and are not intended for embedded applications.

Shenggang Chen et al. [17] introduce an implementation of an on-chip parallel H.264/AVC encoder on a hierarchical 64-core DSP platform. This platform consists of 16 super nodes (4 DSP cores for each node). A 2D wavefront algorithm for macroblock level parallelism is used and one macroblock is assigned to one super node. Subtasks for encoding one macroblock, such as motion estimation, intra prediction and mode decision, are further parallelized to keep the four DSP cores that form a node busy. Speedup factors of 13, 24, 26 and 49 are achieved for QCIF, SIF (352x240), CIF and HD sequences respectively. The proposed wavefront parallel algorithm does not introduce any quality loss; however, the used CABAC-based bitrate estimation and parallel CABAC evolutional entropy coder cause a bitrate increment. Real-time processing is not reported in this paper.

Ming-Jiang Yang et al. [18] implement the H264/AVC encoder on the dual-core DSP processor ADSP-BF561 using functional partitioning. Core A of the BF561 processor is dedicated to mode decision, intra prediction, motion compensation, integer transform (IT), quantization, de-quantization, inverse integer transform, and entropy encoding. Core B is assigned to in-loop filtering, boundary extension, and half-pel interpolation. Core A and core B execute tasks in two pipeline stages. The proposed encoder system achieves real-time encoding for CIF resolution but not for higher resolutions (VGA, SD and HD).

Zrida et al. [19] present a parallelization approach for embedded Systems on Chip (SoCs). It is based on the exploration of task and data level parallelism, the parallel Kahn process network (KPN) model of computation and the YAPI C++ runtime programming library. The used SoC platform relies on 4 MIPS processors. Simulation results of this work show that a speedup of 3.6 is achieved for the QCIF format, but real-time encoding is not reached even for low resolutions (7.7 f/s for QCIF).

António Rodrigues et al. [20] implement the H264/AVC encoder on a 32-core Non-Uniform Memory Access (NUMA) computational architecture with eight AMD 8384 quad-core processors running at 2.7 GHz. Two parallelism levels are combined: slice level and macroblock level. A multi-threading algorithm using OpenMP is applied to the JM software. The frame is decomposed into slices; each slice is processed by a group of cores. Several MBs in the same slice are encoded in parallel by the different cores of the group while respecting the data dependencies. The achieved speedup using the whole set of 32 cores is between 2.5 and 6.8 for 4CIF video (704x576). These speedups are not significant compared to the number of cores used for encoding. Using MB level parallelism requires data to be shared, which leads to a memory bottleneck and higher latency. Also, increasing the number of slices introduces a bitrate distortion. Real-time performance is not reported in this work.

Olli Lehtoranta et al. [21] implement a row-wise data parallel video coding method on a quad TMS320C6201 DSP system. The frame is decomposed into slices row-wise and each slice is mapped to a DSP slave. A DSP master is devoted to swapping data to/from the DSP slaves. Real-time 30 f/s is reached only for CIF resolution, not yet for higher resolutions. The main drawback of this approach is an increase in bitrate and PSNR degradation because of using slice level parallelism.

In [22], the author develops a parallel implementation of the H.264/AVC intra-prediction module by using the computational resources of a GPU and exploiting the Compute Unified Device Architecture (CUDA) programming model. They apply two partitioning
methods: 1) data partitioning, by processing the luminance Y and the two chroma components Cr, Cb in different parallel tasks; 2) task partitioning, by processing the intra prediction 16x16, intra prediction 4x4 and chroma intra prediction in parallel. This parallel implementation of the intra prediction module achieved a speedup of 11x relative to the sequential implementation of the reference software. However, inter prediction is the most important module in the H264/AVC encoder, taking the lion's share of the processing time; preferably, this module should be accelerated in addition to intra prediction. Moreover, processing chrominance data in parallel with the luminance component does not give a significant speedup, since chrominance processing is negligible relative to luminance processing. Finally, the luminance and chrominance processes are not totally independent: a dependency exists between the two components during the filtering and entropy coding processes, which leads to a high latency.

Fang Ji et al. [23] propose an H264/AVC encoder on an MPSoC platform using the GOP level parallelism approach. They build three MicroBlaze soft cores on a Xilinx FPGA. A main processor is devoted to preparing the frames in the shared memory. Then, each processor among the remaining coprocessors encodes its appropriate GOP. Experiments show that the average speedup is 1.831. The problem of the GOP approach is its high latency. Real-time is not achieved: this solution encodes only 3 f/s for QCIF resolution.

Xiulian Peng et al. [24] propose a pure line-by-line coding scheme (LBLC) for intra frame coding. The input image is processed line by line sequentially, and each line is divided into small fixed-length segments. The encoding of all segments, from prediction to entropy coding, is completely independent and concurrent across many cores. Results on a general-purpose computer illustrate that the proposed scheme can reach a speedup factor of 13.9 with 15 cores, but on the other hand this method affects the rate distortion because it discards the left dependency for each MB.

Huayou Su et al. [25] propose a parallel framework for H.264/AVC based on a massively parallel architecture implemented on NVIDIA's GPU using CUDA. They present several optimizations to accelerate the encoding speed on the GPU. A parallel implementation of the inter prediction is proposed based on a novel algorithm, MRMW (Multi-resolution Multi-windows), which uses the motion trend of a lower resolution frame to estimate that of the original (higher resolution) frame. The steps of MRMW are parallelized with CUDA on the different cores of the GPU. They also perform multilevel parallelism for intra coding; for that, a multi-slice method is introduced. Each frame is partitioned into independent slices. At the same time, the wave-front method is adopted for parallelizing the MBs in the same slice. Some dependencies within MBs are not respected, in order to maximize the parallelism. Moreover, the CAVLC coding and filtering processes are also parallelized by decomposing these modules into several tasks. Experimental results show that a speedup of about 20x can be obtained for the proposed parallel method compared to the reference program. The presented parallel H.264 encoder can satisfy the requirement of real-time HD encoding at 30 fps. On the other hand, this implementation affects the visual quality, inducing a PSNR degradation ranging from 0.14 dB to 0.77 dB and a small increase in bitrate, because multi-slice parallelism is used and some dependencies are not respected.

O. Adeyemi et al. [26] present 4kUHD video streaming over wireless 802.11n. They perform the entire encoding chain, including 4K camera capture, YUV color space conversion, H264 encoding using CUDA on an NVIDIA Quadro 510 GPU, and real-time live streaming. To speed up the encoding, several modules are parallelized, such as the intra and inter prediction modules, by exploiting a dynamic parallel motion algorithm and a novel intra prediction mode. Also, they use a zero-copy memory allocation technique to reduce memory copy latencies between the host memory (CPU) and the GPU memory. Experiments confirm that 4kUHD real-time encoding for live streaming at low bitrates is possible, with a small deviation in visual quality. Despite this, GPUs are a costly solution and are not suitable for embedded applications because of their high power consumption compared to other platforms.

Wajdi Elhamzi et al. [27] present a configurable H264 motion estimator dedicated to video codecs on a smart camera accelerator based on a Virtex6 FPGA. They propose a flexible solution to adjust the video stream transferred by the smart camera. The accelerator is able to support several search strategies at the IME (Integer Motion Estimation) stage and different configurations for the FME (Fractional Motion Estimation) stage. Experiments show that the obtained FPGA-based architecture can process IME on 720x576 video streams at 67 fps using a full search strategy. FPGA solutions remain an interesting way to achieve real-time processing, but implementing the whole H264 encoder requires a huge FPGA surface and a lot of design and compilation time with tremendous VHDL expertise, which may not meet time-to-market constraints. Finally, the low hardwired block frequency and the bus bandwidth for data transfers between processor and accelerator are the major drawbacks of FPGA implementations.

4 DSP platform description

Software flexibility, low power consumption, time-to-market reduction, and low cost make DSPs an attractive solution for embedded systems implementations and high performance applications. Motivated by these merits and encouraged by the great evolution of DSP architectures, we chose to implement the H264/AVC encoder on a low cost multicore DSP TMS320C6472 to profit from its high processing frequency and optimized architecture in order to achieve a real-time embedded video encoder.
The TMS320C6472 DSP [28] belongs to the latest generation of multicore DSPs made by Texas Instruments. Low power consumption and a competitive price tag make the TMS320C6472 DSP ideal for high-performance applications and suitable for many embedded implementations. Several benchmarks have been performed by Texas Instruments to compare DSPs, General Purpose Processors (GPP) and Graphics Processing Units (GPU) [29] [30]. These benchmarks demonstrate that the C6472 consumes 0.15 mW/MIPS (Million Instructions Per Second) at 3 GHz. Also, at 3.7 watts per device, it offers even greater power savings compared to GPPs in the same performance range. When performance is normalized to power, the DSP is 4x better than the GPU and 18x better than the GPP.

As presented in Fig. 4, six C64x+ DSP cores with a very long instruction word (VLIW) architecture, 4.8 M-Byte (MB) of on-chip memory, a Single Instruction Multiple Data (SIMD) instruction set and a frequency of 700 MHz for each core are combined to deliver 33600 MIPS of performance.

Fig. 4 Internal architecture of TMS320C6472 DSP

Each C64x+ core integrates a large amount of on-chip memory organized as a two-level memory system. The level-1 (L1) program and data memories on this C64x+ core are 32 K-Byte (KB) each. This memory can be configured as mapped RAM, cache, or any combination of the two. The level-2 (L2) memory is shared between program and data space and is 608 KB in size. L2 memory can also be configured as mapped RAM, cache, or any combination of the two. In addition to the L1 and L2 memory dedicated to each core, the six cores also share 768 KB of L2 shared memory. Shared L2 memory is managed by a separate controller and can be configured as either program or data memory. This large amount of on-chip memory can avoid accesses to the external DDR2 memory, thereby reducing power dissipation and accelerating processing, since internal memory is faster than external memory. Performance is also enhanced by the EDMA (Enhanced Direct Memory Access) controller, which is able to manage memory transfers independently of the CPU. Therefore, no additional overhead is caused when large data blocks are moved between internal and external memory. The TMS320C6472 DSP supports different communication peripherals such as Gigabit Ethernet for Internet Protocol (IP) networks, UTOPIA 2 for telecommunications and Serial RapidIO for DSP-to-DSP communications. This DSP consequently includes all the necessary components (DMA, RAM (Random Access Memory), input/output management) required to communicate with a camera sensor. Note also that VLIW architectures are deterministic and dedicated to embedded hard real-time applications. This must be taken into account when comparing with superscalar GPPs based on expensive and power-consuming memory management units, out-of-order units, etc.

Finally, it is clear that the power and the features of such a DSP family perfectly fit the needs of an intelligent vision system embedded in smart cameras. It should also allow designers to build highly scalable cameras that can follow the latest advances in video compression.

5 Optimized implementation on a single DSP core

Our choice of this multicore DSP platform enabled us to develop an academic H264/AVC codec [31] in our LETI laboratory (Laboratory of Electronics and Information Technologies) for future research and development targeting embedded video applications. The standard-compliant LETI codec was developed and tested first in a PC environment for validation and then migrated to the TMS320C6472 DSP platform. This work will also benefit our future H265 implementation.

To efficiently take advantage of multicore technology and the potential parallelism present in the H264 standard, we must, as a first step, elaborate an optimized H264/AVC architecture on a single DSP core and then move to a multicore implementation. This step consists of designing a data model that exploits the DSP core architecture and especially the internal memory, which is faster than the external SDRAM memory. Each core of the TMS320C6472 DSP has 608 KB of internal memory (LL2RAM) shared between program and data. Preferably, and to the extent possible, we should load both program and data within the LL2RAM. For that reason, two implementations are proposed [32].

5.1 «MB level» implementation

This implementation follows the conventional data structure processing of the H264/AVC standard. It is based on encoding one MB after another until the entire frame's MBs are finished. The principle of this first proposed architecture is detailed as follows: the program is loaded into the internal memory LL2RAM. The current, reconstructed and reference frames and the bitstream are stored in the external SDRAM memory, given their large sizes for HD resolution. In order to avoid working directly with the slow external memory, some data are moved into internal memory, namely the current MB, the search window and the reconstructed MB for the 3 YCrCb components.
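A minimal sketch of this per-MB traffic for the luminance component follows (plain memcpy stands in for the EDMA transfers described in Sect. 4, the buffer names are ours, and boundary clamping at frame edges is omitted):

/* «MB level»: per-MB copies between external frames and LL2RAM.    */
#include <string.h>
#include <stdint.h>

#define W 1280                        /* HD 720p frame width         */
static uint8_t cur_mb[16][16];        /* internal: current MB        */
static uint8_t search_win[48][48];    /* internal: search window     */
static uint8_t rec_mb[20][20];        /* internal: reconstructed MB  */
                                      /* + 4-pel left/top apron      */

void fetch_mb(const uint8_t *cur_frame, int mbx, int mby)
{
    /* In the real design the EDMA controller performs this copy     */
    /* without loading the CPU; memcpy is only a stand-in.           */
    for (int r = 0; r < 16; r++)
        memcpy(cur_mb[r], cur_frame + (mby*16 + r)*W + mbx*16, 16);
}

void store_rec_mb(uint8_t *rec_frame, int mbx, int mby)
{
    for (int r = 0; r < 20; r++)      /* includes the 4-pel apron    */
        memcpy(rec_frame + (mby*16 - 4 + r)*W + mbx*16 - 4,
               rec_mb[r], 20);
}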
The design of the MB level implementation is presented in Fig. 5. It highlights the memory allocations for the luminance component.

Fig. 5 «MB level» implementation: the current, reconstructed and reference frames and the bitstream reside in external DDR2 memory; the current MB, search window and reconstructed MB are processed in internal memory LL2RAM

The DSP core transfers the current MB (16x16) and the search window (48x48) respectively from the current and the reference frames from external to internal memory. Consequently, the data processing can be performed by the DSP core without external memory accesses. The reconstructed MB (20x20), extended by 4 pixels at the left and the top as needed for MB filtering, is transferred from the local memory to the reconstructed frame buffer in external memory. This process is repeated until completion of the entire current frame's MBs. The most important advantage of this architecture is its adaptability to any DSP, even one with a small internal memory. In fact, only 55.54 KB of internal memory space is required for 720p HD (1280x720) resolution. The major drawback of this architecture is the multiple accesses to external memory required to transfer each current or reconstructed MB. It also needs to store the left and top neighboring pixels used in the prediction and filtering of the next MBs after each MB processing.

5.2 «MBs row level» implementation

To avoid the first architecture's drawbacks, a second implementation is proposed. The principle of this implementation, as illustrated in Fig. 6, consists of loading one MBs row (16 x frame_width) from the current frame and 3 MBs rows (48 x (16 + frame_width + 16)) for the search window from the reference frame into the appropriate buffers created in internal memory. The DSP core encodes the whole current MBs row without external memory access. Then, the reconstructed MBs row (20 x (16 + frame_width + 16)) is transferred from LL2RAM to the reference frame in SDRAM memory. Thus, it is not necessary to create another memory buffer for the reconstructed frame: the reference frame buffer can be exploited to store the reconstructed MBs row, since the overwritten data will not be used again (they are already copied into the 3 MBs rows of the search window). Moving to the second current MBs row, it is not necessary to load 3 MBs rows for the search window from the reference frame; it suffices to shift up the last two MBs rows of the search window in internal memory and bring the third from the fourth MBs row of the reference image.

Fig. 6 «One MBs row level» implementation: the current MBs row, the 3-MBs-row search window and the reconstructed MBs row are held in internal memory LL2RAM; the frames stay in external memory

This approach significantly reduces the accesses to external memory: only one external memory access is needed to read one MBs row, instead of the 80 accesses needed to read the 80 MBs that form a MBs row of an HD 720p frame (1280/16 = 80). The same holds for saving the reconstructed MBs row. In addition, when proceeding at the MBs row level, all the left boundaries required for the prediction and filtering of the next MB are already available in internal memory, so the backup of left neighboring pixels is removed. Moreover, this implementation reduces the backup of top boundaries, since storing the top boundaries is required only once, after processing the whole MBs row, whereas the MB level implementation needs to store the top neighboring pixels after processing each current MB.
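The encoding loop of this architecture can be sketched as follows (memcpy again stands in for EDMA, and the helper functions and clamping at the bottom of the frame are hypothetical):

/* «MBs row level»: one external read per row, sliding search window. */
#include <string.h>
#include <stdint.h>

#define W     1280                      /* frame width               */
#define PADW  (16 + W + 16)             /* padded window width       */
#define NROWS (720 / 16)                /* MBs rows per HD frame     */

static uint8_t cur_row[16 * W];         /* internal: current MBs row */
static uint8_t win[3][16 * PADW];       /* internal: 3 MBs rows      */

extern void load_ref_row(uint8_t *dst, const uint8_t *ref, int row);
extern void encode_mbs_row(const uint8_t *row,
                           uint8_t window[3][16 * PADW]);
extern void store_rec_row(uint8_t *ref, int row);  /* 20 x PADW      */

void encode_frame(const uint8_t *src, uint8_t *ref)
{
    for (int y = 0; y < NROWS; y++) {
        /* One transfer for the whole row instead of 80 MB copies.   */
        memcpy(cur_row, src + y*16*W, 16*W);

        if (y == 0) {                   /* initial 3-row window      */
            for (int k = 0; k < 3; k++)
                load_ref_row(win[k], ref, k);
        } else {                        /* shift up, fetch one row   */
            memcpy(win[0], win[1], sizeof win[0]);
            memcpy(win[1], win[2], sizeof win[1]);
            load_ref_row(win[2], ref, (y+2 < NROWS) ? y+2 : NROWS-1);
        }

        encode_mbs_row(cur_row, win);

        /* The reconstructed row overwrites the reference frame:     */
        /* those pels are already inside the window, so no separate  */
        /* reconstructed frame buffer is needed.                     */
        store_rec_row(ref, y);
    }
}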
5.3 Experimental results for the mono-core implementation

In this preliminary work, the two proposed architectures are implemented on a single DSP core of the TMS320C6472 running at 700 MHz using the H264/AVC LETI codec. Experimental simulations are performed on the most commonly used video test sequences at CIF resolution, downloaded from [33]. These sequences are raw data in the YUV 4:2:0 format recommended by the Joint Video Team (JVT) of the ISO/IEC MPEG & ITU-T VCEG organizations [34] that developed the H264 codec. The number of processed frames is 300. The GOP size is equal to 8, as follows: IPPPPPPP IPPP....P. The Quantification Parameter (QP) is 30. Table 1 shows the performance of the two implementations in terms of encoding speed. The second architecture saves up to 18.71% of the run-time. Using a single DSP core, the encoding speed is already close to real-time (25 f/s) for CIF resolution.

Table 1 Performance evaluation of the proposed implementations for CIF (352x288) resolution

CIF sequence        Encoding speed (f/s),       Encoding speed (f/s),
                    MB level implementation     MBs row level implementation
Foreman             19.72                       24.19
Akiyo               20.69                       24.73
News                21.76                       25.21
Container           20.39                       24.80
Tb420               18.64                       22.79
Mobile              18.73                       22.77
Speed average       19.98                       24.58

After verifying that the «MBs row level» architecture is the better optimized implementation, this architecture is then evaluated for higher resolutions: SD (720x480) and HD (1280x720). Table 2 presents the encoding speeds achieved when applying the «MBs row level» architecture on a single DSP core to several uncompressed YUV 4:2:0 SD and HD video sequences. These video sequences were first downloaded in a compressed HD RGB format from the video-sharing website YouTube; they were then converted into YUV 4:2:0 format and resized to SD resolution using the OpenCV library [35]. The number of processed frames is 1200, QP = 30 and the GOP size is equal to 8.

Table 2 Performance evaluation of the «MBs row level» implementation for SD and HD resolutions on a single DSP core

Sequence            Encoding speed (f/s),       Encoding speed (f/s),
                    SD resolution               HD resolution
Planets             7.047                       2.663
Power of natures    7.193                       2.651
Turtle              6.827                       2.609
Vague               7.03                        2.696
Nature              7.36                        2.756
Bird                7.928                       2.999
Speed average       7.23                        2.73

It is clear that mono-core processors with low CPU frequency cannot meet the real-time requirement for high resolution videos. Thus, moving to a multicore implementation and exploiting the H264 parallelism is mandatory to reach real-time encoding for VGA and SD resolutions and to improve the encoding speed for HD resolution.

6 Multicore implementation using Frame Level Parallelism

6.1 Implementation strategy

From the previous works detailed in the last section, the conclusion that can be drawn is that each partitioning method has as many advantages as drawbacks. 1) GOP level parallelism ensures a good speedup but involves a very high encoding latency. 2) Frame level parallelism efficiently improves the encoding run-time with low latency. 3) Slice level parallelism improves the encoding speed but induces PSNR degradation and a bitrate increase. 4) Load balancing, large data transfers and synchronizations between processors are the important drawbacks of MB level parallelism. 5) Regarding functional partitioning, this approach is not suitable for the H.264/AVC encoder [10] for two reasons. First, the large amount of data transferred among processors demands a large system bandwidth to ensure inter-processor communication. Second, the functions of the H.264/AVC encoder have different loads, so it is hard to map the functions equally among processors; the final performance is always restricted by the processor with the heaviest load. Based on these observations, the frame level parallelism approach is applied in order to enhance the encoding speed and obtain a low latency without inducing any rate distortion (PSNR degradation or bitrate increase).

Our multicore implementation using the FLP approach exploits the optimized mono-core architecture implemented on a single DSP core, namely the «MBs row level» implementation. The approach of our real-time demo implementation is described in Fig. 7.

In a preliminary step, our DSP platform is connected to a personal computer (PC) via a Gigabit Ethernet link in order to achieve real-time TCP/IP (Transmission Control Protocol/Internet Protocol) data transfers between them. The PC itself is linked to a Universal Serial Bus (USB) HD webcam to capture RAW video and send it to the DSP for encoding. Once the DSP is integrated in a smart camera system, the PC will no longer be needed.

In this work, the personal computer is used only for validation purposes, because our platform does not yet have a frame grabber interface. Commonly used video test sequences in YUV 4:2:0 format are used for encoding. Then, the similarity between the output of our DSP implementation and that of the PC implementation is verified. Likewise, the Ethernet data transfer is used only to ensure a real-time data transfer; it is not our principal aim. The main goal of our work is to bring out the efficiency of our processor in meeting the real-time constraint for the most complex application (video encoding). So if this processor becomes the kernel of a smart camera platform, several image and video processing applications could be performed and the real-time constraint could be satisfied.
Fig. 7 Encoding demo using the Frame Level Parallelism algorithm: a Visual C/C++ project on the PC (OpenCV frame capture, resize and RGB to YUV 4:2:0 conversion, TCP client) exchanges current frames and bitstreams over a TCP/IP socket (@IP, port number) with core0 of the EVMC6472 board (TCP server, NDK); cores 1 to 5 each run the H264 encoder (DSP/BIOS project, CSL APIs) on their SRC frame, RECT frame and bitstream buffers in external SDRAM memory
The image acquisition will then be in real-time, and in this case TCP/IP data transfers over Gigabit Ethernet will not be required, since the camera interface will be directly connected to the DSP memory. This work may encourage producers to develop a new generation of smart cameras based on multicore DSPs that can perform sophisticated image and video processing applications and satisfy real-time constraints.

As our DSP platform includes 6 DSP cores, the first core, "core0", is assigned as the master. It executes a TCP server program and is devoted to establishing the TCP/IP connection with the client (PC), exploiting the Texas Instruments (TI) NDK library (Network Developer's Kit [36]). In a first step, it receives the current frames sent by the PC after camera capture and stores them in the external memory, which is shared between all the DSP cores. The 5 remaining DSP cores are used to encode the 5 received frames. For each core, a memory section is reserved to store the current frame (SRC), the reconstructed frame (RECT, which will be the reference frame for the next core) and finally a bitstream buffer where the bitstream will be stored. After encoding, the core0 server sends the bitstream of all the encoded frames to the client (PC) in order to store or display it. The TCP server program is loaded into the internal program memory of core0 to establish the connection between the DSP and the PC, and the H264/AVC algorithm is loaded into the internal program memory of each of the 5 remaining cores. On the PC side, a C++ project is developed and executed in order to capture video from the camera. Our program is based on the OpenCV library, which is used to convert the captured frames from RGB to YCrCb 4:2:0 format. A TCP socket (@IP, port number) is created to transmit data between core0 (the server) and the PC (the client).
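A skeleton of the core0 server loop is given below. It is written against BSD-style socket calls (the TI NDK exposes an equivalent sockets API); the port number, the helper functions and the buffer accessors are hypothetical, and error handling is omitted:

/* core0 master: receive 5 SRC frames, trigger cores 1-5, return the */
/* 5 bitstreams (sketch).                                            */
#include <sys/socket.h>
#include <netinet/in.h>

#define PORT        5000                 /* arbitrary choice          */
#define FRAME_BYTES (1280*720*3/2)       /* one YUV 4:2:0 HD frame    */

extern void  recv_all(int s, void *dst, int n);  /* loop over recv() */
extern void *src_buf(int core);          /* SRC buffer in SDRAM      */
extern void *bs_buf(int core);           /* bitstream buffer         */
extern int   bs_len(int core);
extern void  ipc_notify(int core);       /* raise IPC event          */
extern void  ipc_wait(int core);         /* block on IPC event       */

void core0_server(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = {0};
    a.sin_family      = AF_INET;
    a.sin_port        = htons(PORT);
    a.sin_addr.s_addr = INADDR_ANY;
    bind(srv, (struct sockaddr *)&a, sizeof a);
    listen(srv, 1);
    int pc = accept(srv, 0, 0);          /* the PC client            */

    for (;;) {
        for (int c = 1; c <= 5; c++)     /* 1. five current frames   */
            recv_all(pc, src_buf(c), FRAME_BYTES);
        for (int c = 1; c <= 5; c++)     /* 2. "SRC frame is ready"  */
            ipc_notify(c);
        for (int c = 1; c <= 5; c++)     /* 3. wait for bitstreams   */
            ipc_wait(c);
        for (int c = 1; c <= 5; c++)     /* 4. return them to the PC */
            send(pc, bs_buf(c), bs_len(c), 0);
    }
}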
When applying frame level parallelism and exploiting the «MBs row level» implementation, core i starts encoding its appropriate frame only if core i-1 has finished encoding at least 3 MBs rows of the previous frame. These 3 MBs rows will be used as the search window for the motion estimation of the first MBs row of the current frame processed by core i (see Sect. 5.2). Thus, the inter-frame data dependency is respected and, consequently, no rate distortion is introduced.

The steps of encoding a video sequence using FLP are detailed as follows (cf. Fig. 8):

After establishing the connection between the PC and the DSP, core0 receives 5 frames from the PC, as 5 cores are devoted to encoding. Each frame is loaded into the SRC buffer of one of the remaining cores (1-5).

When the reception of the 5 current frames is completed, core0 sends 5 inter-processor communication interrupt events (IPC) to cores 1-5, which are in a wait state for an interrupt event from core0, to indicate that the SRC frames are already in external memory so they can start encoding.

Core1 is the first core that begins encoding. Upon completing the encoding of the first 3 MBs rows of its SRC frame, it sends an IPC to the next core (core2), which is itself in a wait state for an interrupt from core1 to start encoding its appropriate frame. The same procedure is reproduced from core3 up to core5.

To prevent core i from overtaking core i-1 (which is possible because the load is not uniformly balanced between successive frames, and which would give an erroneous result), the encoding of the next MBs row is conditioned on the reception of an IPC from the previous core. Thus, each core sends an IPC to its next core after encoding each MBs row whose index is higher than 3. Since each core starts encoding after its previous core has finished encoding 3 MBs rows, it should not wait for an IPC from the previous core to encode the last 2 MBs rows of each SRC frame; otherwise, encoding would block waiting for an incoming IPC. As a result, each core sends a total of Max_MBs_rows - 2 interrupts to the next core. When all the cores finish encoding the current frames, and specifically core5, which is the last core to finish its task, cores 1 to 5 send 5 IPCs to core0, which is in a wait state, to indicate that the bitstream of the 5 frames is ready in external memory to be transferred to the PC.
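Putting these rules together, each worker core i (1-5) runs a loop of roughly the following shape. This is only a sketch: ipc_wait/ipc_notify stand for the C6472 inter-core interrupt events under hypothetical names, and encode_mbs_row is the row encoder of Sect. 5.2:

/* Worker core i: FLP pipeline with a 3-MBs-row lead (sketch).       */
extern void ipc_wait(int from_core);
extern void ipc_notify(int to_core);
extern void encode_mbs_row(int row);

void worker(int i, int max_rows)         /* i = 1..5                 */
{
    ipc_wait(0);                         /* core0: SRC frame loaded  */

    for (int row = 0; row < max_rows; row++) {
        /* Gate on core i-1: one IPC before the first row (its 3-row */
        /* head start), then one per row, except for the last 2 rows */
        /* which that head start already covers.                     */
        if (i > 1 && row < max_rows - 2)
            ipc_wait(i - 1);

        encode_mbs_row(row);

        /* Release core i+1: first after 3 finished rows, then once  */
        /* per row, i.e. max_rows - 2 notifications in total.        */
        if (i < 5 && row >= 2)
            ipc_notify(i + 1);
    }

    ipc_notify(0);                       /* bitstream ready for core0 */
}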