Research Assignment

Multiview Imaging and 3D TV. A Survey.

Anastasia Manta
January 2008

Supervisors: Dr. Emile A. Hendriks, Dr. ir. Andre Redert

Contents

1  Introduction
2  Capturing system
   2.1  Static Scenes
   2.2  Dynamic Scenes
        2.2.1  Camera Setup
        2.2.2  Synchronization of cameras
        2.2.3  Camera calibration
   2.3  Multiview correspondence
3  3-D scene representation
   3.1  Geometry-based modeling
   3.2  Image-based modeling
   3.3  Hybrid Image-Based modeling techniques
4  Rendering
   4.1  Model-based rendering (MBR) - rendering with explicit geometry
   4.2  Image-based rendering (IBR) - rendering with no geometry
   4.3  Rendering with implicit geometry
5  Coding
6  Transporting 3D Video
7  3-D Display
8  Discussion and conclusions
9  Appendix A
10 References

1 Introduction

Multiview imaging has attracted increasing attention, thanks to the rapidly dropping cost of digital cameras. This opens up a wide variety of interesting research topics and applications, such as virtual view synthesis, high-performance imaging, image/video segmentation, object tracking/recognition, environmental surveillance, remote education, industrial inspection and 3DTV. While some of these tasks can be handled with conventional single-view images/video, the availability of multiple views of the scene significantly broadens the field of applications, while enhancing performance and user experience.
3DTV is one of the most important applications of multiview imaging and a new type of media that expands the user experience beyond what is offered by traditional media. It has been developed through the convergence of new technologies from computer graphics, computer vision, multimedia, and related fields. 3DTV, also referred to as stereo TV, offers a three-dimensional (3-D) depth impression of the observed scene. To enable the use of 3DTV in real-world applications, the entire processing chain, including multiview image capture, 3-D scene representation, coding, transmission, rendering and display, needs to be considered [1].

There are numerous challenges in this chain. A system that can capture and store large numbers of videos in real time poses many engineering difficulties. Accurate calibration of camera position and color properties is required. From the acquired multiview data, one must consider how to represent the 3-D scene in a form suitable for the subsequent processes. Depth reconstruction is a central task in 3-D representation, but rendering novel images precisely from reconstructed depth remains a very difficult problem. The amount of multiview image data is usually huge, hence compressing and streaming the data with minimal degradation and delay over limited bandwidth are also challenging tasks. In addition, there are strong interrelations between all of the processes involved. The camera configuration (array or dome) and density (number of cameras) impose practical limitations on navigation and on the quality of rendered views at a given virtual position. There is therefore a classical trade-off between costs (for equipment, cameras, processors) and quality (navigation range, quality of virtual views).
In general, denser capturing of multiview images with a larger number of cameras provides a more precise 3-D representation, resulting in higher-quality views through the rendering and display processes, but requires a higher compression rate in the coding process, and vice versa. An interactive display that requires random access to 3-D data affects the performance of a coding scheme that is based on data prediction. Various quite diverse types of 3-D scene representation can be employed, which implies a number of different data types.

This report aims to explore some of the challenges in multiview imaging technology that stand in the way of the ultimate research goal. It provides an overview of multiview imaging based on the available literature on the subject. Focus is placed on the rendering part of the process. The report does not propose a new algorithm; it reviews up-to-date algorithms from the literature and recommends some improvements. It is structured as follows. The second chapter introduces the capturing system and possible camera configurations, and also discusses the important issues of camera calibration and stereo correspondence. The third chapter reviews the different data representations used in current projects. Chapter four deals with rendering, relates rendering to the data representations, and assesses rendering algorithms for the corresponding applications. Chapters five and six deal with the coding and transport of 3-D video. Chapter seven outlines the available 3-D displays. Finally, chapter eight addresses some open issues with respect to the literature reviewed.

2 Capturing system

For the generation of future 3D content, two complementary approaches are anticipated. In the first case, novel three-dimensional material is created by simultaneously capturing video and associated per-pixel depth information.
The techniques involved in this procedure are explained in this chapter. The second approach satisfies the need for sufficient three-dimensional content by converting already existing 2D video material into 3D, but this is out of the scope of this report.

2.1 Static Scenes

Capturing multiple views of a static scene is relatively simple because only a single camera is needed. The camera can be moved along a predetermined path to take multiple images of the scene. Novel views can then be synthesized given the camera position/geometry, which in this case is assumed to be known. The camera geometry can be established in two ways. The first is to use a robotic arm or a similar mechanism to control the movement of the camera. For instance, a camera gantry is used in [5] to capture a light field, under the assumption that the camera locations form a uniform grid on a 2-D plane. In concentric mosaics [6], a camera is mounted on the tip of a rotating arm, capturing a series of images whose centers of projection lie along a circle. The second approach to obtaining the camera geometry is through calibration. In the Lumigraph work [7], the authors used a handheld camera to capture the scene, which contains three planar patterns that are used for camera calibration. In [8], a camera attached to a spherical gantry arm is used to capture images roughly evenly over the sphere; calibration is still performed to register the camera locations to the scene geometry obtained through range scans. When the scene itself contains many points of interest [9], it is possible to extract and match feature points directly for camera calibration.

2.2 Dynamic Scenes

For the acquisition of dynamic scenes, an array of cameras is in most cases needed. Most existing camera arrays consist of a set of static cameras. One exception is the self-reconfigurable camera array developed in [10], which has 48 cameras mounted on robotic servos.
In this case, the cameras move during capture to acquire better images for rendering; they have to be calibrated on the fly using a calibration pattern in the scene.

Capturing dynamic scenes with multiple cameras raises a number of challenges. For instance, the cameras need to be synchronized if correspondence between images is to be exploited (in the rendering stage). The amount of data captured by a camera array is often huge, and it is necessary to write these data to storage devices as fast as possible. Color calibration is another issue that needs to be addressed in order to render seamless synthetic views.

2.2.1 Camera Setup

Camera setups range from dense configurations (Stanford Light Field Camera, [11]) to intermediate camera spacing [12] to wide camera distributions (Virtualized Reality™, [13]). The wider spacing between the cameras in the latter system makes it more challenging to produce locally consistent geometries, and hence photorealistic views, mainly because of occurring occlusions. A significantly denser camera configuration such as that of the Stanford Light Field Camera allows effects such as synthetic aperture and focusing; synthetic aperture imagery allows objects that are occluded with respect to any given camera to be seen. In general, dense sampling permits photorealistic rendering with either a simple planar geometric representation or a rough geometric approximation. The disadvantage, however, is the large number of images required for rendering. There is therefore an apparent image-geometry trade-off. Approaches in the middle try to reduce the number of required cameras and compensate by providing high-quality stereo data, for example. Zitnick et al. [12] proposed a layered depth image representation using an eight-camera configuration.
However, this approach still needs a rather dense camera setup for a limited viewing range (a horizontal field of view of about 30°). For configurations that cover an entire hemisphere with a small number of cameras, either model-based approaches need to be employed (e.g., Carranza et al. [6] with 8 cameras) or degradation in visual quality has to be accepted (e.g., Wurmlin et al. [33] with 16 cameras). The latter two systems are also limited by the employed reconstruction algorithms to the capture of foreground objects, or even humans only. Scalability in terms of camera configurations is another important issue, which Waschbusch et al. [11] try to solve. In their work they introduce sparsely placed, scalable 3D video bricks that act as low-cost z-cameras. The importance of z-cameras in the content acquisition process will become clearer in the stereo correspondence section. A single brick consists of a projector, two grayscale cameras and one color camera. To fully cover 360° in all dimensions, about 8 to 10 3D video bricks are needed.

Resolution also plays an important role in achieving photorealism, but a higher resolution will not help if rendering artifacts are not properly handled. These artifacts include boundary or cut-out effects, incorrect or blurred texturing, missing data, and flickering. Humans are highly sensitive to high-frequency spatial and temporal artifacts. Although a reduced resolution would conveniently help to mask or ameliorate such artifacts, it should not be viewed as a solution. Indicatively, Zitnick et al. [12] use high-resolution (1024×768) color cameras capturing at 15 fps, whereas Matusik et al. [19] used an array of cameras with 1300×1030 resolution and a frame rate of 12 frames per second.

2.2.2 Synchronization of cameras

When the number of cameras in the array is small, synchronization between cameras is often simple.
A series of IEEE 1394 FireWire cameras can be daisy-chained to capture multiple videos, and synchronization of the exposure start of all cameras on the same 1394 bus is guaranteed. Alternatively, the cameras' exposure can be synchronized using a common external trigger. This is a very widely used configuration and can scale up to large camera arrays [12], [15]–[18]. In the worst case, where the cameras in the system cannot be genlocked, camera synchronization can still be roughly achieved by pulling images from the cameras at a common pace from the computer. Slightly unsynchronized images may cause artifacts in scene geometry reconstruction for fast-moving objects, but the rendering results may still be acceptable, since human eyes are not very sensitive to details in moving objects.

When multiple videos are recorded simultaneously, the amount of data that needs to be stored and processed is huge. Most existing systems employ multiple computers to record and process the data from the cameras. The Stanford multicamera array [29] used a modular embedded design based on the IEEE 1394 high-speed serial bus, with an image sensor and MPEG-2 compression at each node. Since video compression is performed on the fly, the system is capable of recording a synchronized video data set from over 100 cameras to a hard disk.

2.2.3 Camera calibration

Camera calibration is the process of determining the internal geometric and optical characteristics of the camera (intrinsic parameters) and/or the 3-D position and orientation of the camera frame relative to a certain world coordinate system (extrinsic parameters). The purpose of the calibration is to establish the relationship between 3-D world coordinates and their corresponding 2-D image coordinates. Once this relationship is established, 3-D information can be inferred from 2-D information and vice versa.
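The 3-D-to-2-D relationship that calibration recovers can be illustrated with the standard pinhole camera model. The sketch below (plain Python, with purely illustrative parameter values; the report does not prescribe a particular camera model) projects a world point through the extrinsic parameters (rotation R, translation t) and the intrinsic parameters (focal length f in pixels, principal point (cx, cy)), ignoring skew and lens distortion for simplicity:

```python
def project_point(point_world, R, t, f, cx, cy):
    """Project a 3-D world point to 2-D pixel coordinates.

    R (3x3 rotation) and t (3-vector) are the extrinsic parameters;
    f, cx, cy are the intrinsic parameters (square pixels, no skew
    or lens distortion assumed).
    """
    # World -> camera coordinates: Xc = R * Xw + t
    xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
          for i in range(3)]
    # Perspective division, then map to pixel coordinates
    u = f * xc[0] / xc[2] + cx
    v = f * xc[1] / xc[2] + cy
    return u, v

# Identity rotation, camera 2 m in front of the world origin
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0.0, 0.0, 2.0]
# A point 0.5 m to the right of the origin lands right of the
# principal point: u = 800 * 0.5 / 2.0 + 320 = 520, v = 240
u, v = project_point([0.5, 0.0, 0.0], R, t, f=800, cx=320, cy=240)
```

Calibration is the inverse task: given many known 3-D points and their observed 2-D projections, recover R, t, f, cx, cy (and, in practice, lens distortion terms as well).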
In an application involving multiple cameras, this step is necessary to guarantee geometric consistency across the different terminals. Calibration techniques can be roughly classified into two categories, as discussed in [21]: photogrammetric calibration and self-calibration. In photogrammetric calibration methods, calibration is performed by observing a calibration object whose geometry in 3-D space is known with very good precision, and it can be done very efficiently [22]. The calibration object usually consists of two or three planes orthogonal to each other. Sometimes a plane undergoing a precisely known translation is used instead [23]. These approaches require an expensive calibration apparatus and an elaborate setup. Self-calibration techniques do not use any calibration object. Just by moving a camera in a static scene, the rigidity of the scene in general provides two constraints on the camera's internal parameters per camera displacement, using image information alone. Therefore, if images are taken by the same camera with fixed internal parameters, correspondences between three images are sufficient to recover both the internal and external parameters, which allows us to reconstruct the 3-D structure up to a similarity. While this approach is very flexible, it is not yet mature. Recent research on camera calibration has focused on the problem of self-calibration; a critical review of self-calibration techniques can be found in [24].

2.3 Multiview correspondence

Multiview correspondence, or multiple-view matching, is the fundamental problem of determining which parts of two or more images (views) are projections of the same scene element. The output is a disparity map for each pair of cameras, giving the relative displacement, or disparity, of corresponding image elements (Figures 1 and 2). Disparity maps allow us to estimate the 3-D structure of the scene and the geometry of the cameras in space.
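The link between disparity and scene structure is direct: for a rectified stereo pair with parallel optical axes, depth follows from the well-known relation Z = f·B/d. A minimal sketch (plain Python; the variable names and example values are illustrative, not taken from the report):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a point from a rectified stereo pair: Z = f * B / d,
    where f is the focal length in pixels, B the camera baseline in
    metres, and d the disparity in pixels. Larger disparity means a
    closer point; zero disparity corresponds to a point at infinity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Example: f = 1000 px, baseline = 10 cm, disparity = 25 px -> Z = 4 m
z = depth_from_disparity(25.0, focal_px=1000.0, baseline_m=0.10)
```

This inverse relationship also explains why depth accuracy degrades for distant objects: at large Z a whole-pixel disparity error corresponds to a large depth error, which is one motivation for the large-baseline systems discussed below.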
Passive stereo [25] remains one of the fundamental technologies for estimating 3-D geometry. Unlike active approaches that project structured light onto the scene, it requires no modifications to the scene, which makes it desirable in many applications; moreover, dense information (that is, at each image pixel) can nowadays be obtained at video rate on standard processors for medium-resolution images (e.g., CIF, CCIR) [26]–[28]. For instance, systems in the late 1990s already reported a frame rate of 22 Hz for images of size 320×240 on a Pentium III at 500 MHz [29]. The availability of real-time disparity maps also enables segmentation by depth, which can be useful for layered scene representations [30], [31]–[33]. Large-baseline stereo, which generates significantly different images, can be of great importance for some virtual environment applications, as it is not always possible to position cameras close enough to achieve small baselines, or because doing so would imply using too many cameras given speed or bandwidth constraints. The VIRTUE system [34] is an example: only four cameras can be positioned around a large plasma screen, and using more than four cameras would increase delay and latency beyond acceptable levels for usability (but see recent systems using high numbers of cameras [13], [35], [36]).

Figure 1: Original stereo pair acquired from two cameras a and b.
Figure 2: Visualization of the associated disparity maps from camera a to b (left) and from b to a (right).

There are two broad classes of correspondence algorithms, seeking to achieve, respectively, a sparse set of corresponding points (yielding a sparse disparity map) or a dense set (yielding a dense disparity map).

1) Sparse Disparities and Rectification: Determining a sparse set of correspondences among the images is a key problem for multiview analysis.
It is usually performed as a first step in order to calibrate (fully or weakly) the system, when nothing about the geometry of the imaging system is known yet and no geometric constraint can be used to aid the search. The algorithms presented in the literature so far can be classified into two categories: feature matching and
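For the dense class of correspondence algorithms mentioned above, the classic baseline is window-based block matching along rectified scanlines. The sketch below (plain Python, brute-force and unoptimized, for illustration only; the report does not prescribe a specific matcher) assigns each left-image pixel the horizontal shift into the right image that minimizes the sum of absolute differences (SAD) over a small window:

```python
def disparity_map_sad(left, right, max_disp, win=1):
    """Dense disparity by SAD block matching on a rectified pair.

    left/right are 2-D lists of grayscale intensities. For each left
    pixel, candidate disparities 0..max_disp are tested by comparing a
    (2*win+1)^2 window against the right image shifted left by d; the
    lowest-cost shift is kept. Border pixels are left at disparity 0.
    """
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(win, h - win):
        for x in range(win + max_disp, w - win):
            best_d, best_cost = 0, float("inf")
            for d in range(max_disp + 1):
                cost = sum(
                    abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
                    for dy in range(-win, win + 1)
                    for dx in range(-win, win + 1))
                if cost < best_cost:
                    best_d, best_cost = d, cost
            disp[y][x] = best_d
    return disp

# Synthetic rectified pair: a bright stripe at x=5 in the left image
# appears at x=3 in the right image, i.e. a true disparity of 2 px.
left = [[100 if x == 5 else 0 for x in range(10)] for _ in range(5)]
right = [[100 if x == 3 else 0 for x in range(10)] for _ in range(5)]
disp = disparity_map_sad(left, right, max_disp=3, win=1)
```

Real systems add refinements (left-right consistency checks, subpixel interpolation, smoothness constraints) on top of this basic scheme, and it is such refinements that the real-time implementations cited above optimize.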
