3. Registration

3.1 Introduction

One of the most important elements in any augmented reality system is registration. Virtual objects that are inserted into images should appear to really exist in the environment. To achieve such an illusion, virtual objects and real objects must be properly aligned with respect to each other. Registration is a general AR problem; many interactive AR applications, such as surgery systems, require accurate registration. Augmented imagery systems must contend with the incredible sensitivity of human users to registration errors [20].

Registration is the determination of transformations between the real image model and the virtual content model. A transformation is a function from points in one model to points in another. A function that translates pixel coordinates in a virtual advertisement image to pixel coordinates in a destination image is an example transformation. This 2D to 2D example describes the warping of the original image to an arbitrary placement on the screen. This placement may have to contend with perspective projection of the advertisement patch; so the transformation might convert an upright rectangle to an arbitrary quadrilateral.

Transformations in augmented imagery applications can be 2D to 2D or 3D to 2D. 2D to 2D registration assumes a 2D model and is typically used to replace planar regions in images. 3D to 2D registration assumes a 3D graphical model will be aligned with a 3D representation of the environment as projected to a 2D image by the camera system. Both cases are common in augmented imagery systems.

Registration to a camera image requires knowledge of the location of the camera relative to the scene. Many trackers and sensors, such as ultrasonic trackers, GPS receivers, mechanical gyroscopes and accelerometers, have been used to mechanically assist in achieving accurate registration. Tracking is the determination of the location and/or orientation of physical objects such as cameras. As an example, a pan/tilt/zoom (PTZ) head measures the physical parameters of a camera. Tracking of position or orientation alone is called 3 degree-of-freedom (3DOF) tracking; tracking of both position and orientation is called 6 degree-of-freedom (6DOF) tracking. In addition, other system parameters such as zoom may be tracked. Tracking can also be used to determine the position and orientation of objects in the environment other than the camera. For example, instrumenting a portable, handheld advertisement at a sporting event with tracking would allow the advertisement to be replaced at a later date.

Tracking can be achieved optically from the video image or through instrumentation of the camera or objects. Optical tracking can utilize natural features (unprepared spaces) [21] or placed fiducial marks (prepared spaces) [22]. Rolland, Davis, and Baillot present an overview of tracking systems [23]. Hybrid tracking techniques combine multiple tracking technologies and are employed to compensate for the weakness of individual tracking technologies. Vision-based tracking techniques have received more attention in recent years. Bajura and Neumann [24] point out that vision-based techniques have an advantage in that the digitized image provides a mechanism for bringing feedback into the system. It is possible to aid the registration with that feedback.

3.2 2D–2D Registration

A simple augmented imagery system might replace a blank planar region in a video sequence with a commercial advertisement. This is a typical 2D–2D registration problem. All that need be done is to track the blank planar region correctly and replace its pixels with corresponding pixels in the advertisement. The replacement advertisement is warped to the proper shape and rendered into the space using a transformation that converts locations in the ad to locations in the final image. Region tracking techniques differ in the geometric transformation model chosen and the optimization criteria. Rigid, affine, homographic and deformable transformation models have been used. The optimization criteria include texture correlation [25], mutual information [26] and optical flow measurement. Three simple transformation models are introduced here. A review of image registration methods can be found in Brown [27].

3.2.1 Geometric Transformation Models

A geometric transformation model transforms points in one frame of reference to points in another. Transformations are used to translate, rotate, scale and warp 2D and 3D content. This section describes several common 2D geometric transformations and their application in augmented imagery.

Rigid Transformation

A rigid transformation maps a point (x, y) to a point (x', y') via rotation and translation only:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = R_\theta \begin{pmatrix} x \\ y \end{pmatrix} + T \qquad (13.1)$$

In this equation, T is a translation vector and Rθ a rotation matrix:

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \qquad (13.2)$$

It is convenient to use homogeneous coordinates for points. The homogeneous coordinates of a 2D point (x, y) are (sx, sy, s), where s is an arbitrary scale factor, commonly 1.0. Thus Equation 13.1 can be written as:

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (13.3)$$

where $T = (t_x, t_y)^T$.

This transformation is easily modified to include scaling. Rigid transformations are very basic and generally only used for overlay information. As an example, a rigid transformation with scaling is used to place the bug (a small station or network logo) in the corner of a television picture. However, for most applications the rigid transformation is too restrictive.
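
As a concrete illustration, the following Python/NumPy sketch (the function name and sample values are ours, not from the original text) builds the homogeneous matrix of equation 13.3 and applies it to a point:

```python
import numpy as np

def rigid_transform(theta, tx, ty):
    """Homogeneous rigid-transformation matrix of equation 13.3."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1.0]])

# Rotate the point (1, 0) by 90 degrees about the origin, then
# translate by (5, 0); the result is approximately (5, 1).
p = np.array([1.0, 0.0, 1.0])                 # homogeneous coordinates
print(rigid_transform(np.pi / 2, 5, 0) @ p)   # -> [5. 1. 1.]
```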

Affine Transformation

Affine motion describes translation, rotation, scaling and skew. These are the most common object motions. The mathematical model for the affine transformation is:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = A \begin{pmatrix} x \\ y \end{pmatrix} + T \qquad (13.4)$$

A is an arbitrary 2 × 2 matrix and T is a vector of dimension two. Using homogeneous coordinates, an affine transformation can be written as:

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (13.5)$$

This model has the property that parallel lines remain parallel under transformation. The affine transformation is simple, but applies well in many applications. V. Ferrari et al. [28] implemented an augmented reality system using an affine region tracker. First, the affine transformation mapping the region from the reference frame to the current frame is computed. Then an augmentation image is attached to that region using an affine transformation.

General Homographic Transformation

The general homographic transformation maps a point (x, y) to a point (x', y') as follows:

$$s \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (13.6)$$

In this equation, H is an arbitrary 3 × 3 matrix and s is an arbitrary nonzero value. H has 9 parameters, 8 of which are independent since s can take an arbitrary value. Because H can be arbitrarily scaled, it is common to set the lower-right entry to 1. Unlike the rigid and affine motion models, the general homographic transformation can be used to model perspective effects.

Other nonlinear motion models are also used. For example, B. Bascle and R. Deriche [25] applied a deformable motion model to track a region.

3.2.2 Computation of Transformation Matrices

In most cases the required transformation is not known and must be computed from image content or tracking results. In the planer region example, the transformation might be computed from the corner points of the destination region. Let p1-p4 be 2D coordinates of the corners of the region that the advertisement is to be written into. A transformation is required that will warp the advertisement such that the corners of the ad, c1–c4, will be warped to p1–p4. If p1–p4 represent a region with the same width and height as the image (unlikely), a rigid transformation can be computed by determining the rotation that will rotate c1c2 to the same angle as p1p2, then composing this rotation with a translation of c1 to p1 (assuming cl is (0,0)).
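
A minimal sketch of this rigid construction, under the stated assumptions (c1 at the origin and the two segments equal in length; names and structure are illustrative), could be:

```python
import numpy as np

def rigid_from_segments(c1, c2, p1, p2):
    """Rotation aligning segment c1->c2 with p1->p2, composed with a
    translation taking c1 (assumed at the origin) to p1; returns the
    homogeneous matrix in the form of equation 13.3."""
    theta = (np.arctan2(p2[1] - p1[1], p2[0] - p1[0])
             - np.arctan2(c2[1] - c1[1], c2[0] - c1[0]))
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, p1[0]],
                     [s,  c, p1[1]],
                     [0,  0,  1.0]])
```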

A similar solution exists for computing an affine transform. Only three point correspondences are required to compute an affine transform. The problem is easily modelled as:

$$\begin{pmatrix} p_1 & p_2 & p_3 \end{pmatrix} = T \begin{pmatrix} c_1 & c_2 & c_3 \end{pmatrix} \qquad (13.7)$$

It is assumed in this example that p1–p4 and c1–c4 are represented as homogeneous coordinates (three element column vectors with the third element set to 1). In this example, T can be easily computed as:

$$T = \begin{pmatrix} p_1 & p_2 & p_3 \end{pmatrix} \begin{pmatrix} c_1 & c_2 & c_3 \end{pmatrix}^{-1} \qquad (13.8)$$
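
Under these assumptions, equation 13.8 is a single matrix computation. In the following sketch the corner coordinates are hypothetical; any three non-collinear correspondences work:

```python
import numpy as np

# Columns are homogeneous points c1..c3 and p1..p3 (third element 1).
C = np.array([[0.0, 100.0,  0.0],
              [0.0,   0.0, 50.0],
              [1.0,   1.0,  1.0]])
P = np.array([[10.0, 110.0, 20.0],
              [20.0,  30.0, 70.0],
              [ 1.0,   1.0,  1.0]])

T = P @ np.linalg.inv(C)      # equation 13.8
print(np.allclose(T @ C, P))  # True: T maps each ci to pi
```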

The most general solution for a four point correspondence is a general homographic transform and, indeed, this is the solution required for most applications. Let the coordinates of $c_i$ be $(x_i, y_i)$ and the coordinates of $p_i$ be $(x_i', y_i')$. The general homographic transformation, as described in equation 13.6, can be written as:

$$x_i' = \frac{h_{11} x_i + h_{12} y_i + h_{13}}{h_{31} x_i + h_{32} y_i + 1} \qquad (13.9)$$

$$y_i' = \frac{h_{21} x_i + h_{22} y_i + h_{23}}{h_{31} x_i + h_{32} y_i + 1} \qquad (13.10)$$

Equations 13.9 and 13.10 can be restructured as:

$$h_{11} x_i + h_{12} y_i + h_{13} - h_{31} x_i x_i' - h_{32} y_i x_i' = x_i' \qquad (13.11)$$

$$h_{21} x_i + h_{22} y_i + h_{23} - h_{31} x_i y_i' - h_{32} y_i y_i' = y_i' \qquad (13.12)$$

Equations 13.11 and 13.12 can be used to construct a matrix solution for the unknown entries of H, as illustrated in equation 13.13. Equation 13.13 represents the problem as an 8 × 8 matrix of knowns multiplied by an 8-element vector of unknowns, equal to an 8-element vector of knowns. The problem is in the familiar Ax = b form for simultaneous equations and can be solved using a variety of conventional matrix methods.

$$\begin{pmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 x_1' & -y_1 x_1' \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 y_1' & -y_1 y_1' \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2 x_2' & -y_2 x_2' \\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2 y_2' & -y_2 y_2' \\
x_3 & y_3 & 1 & 0 & 0 & 0 & -x_3 x_3' & -y_3 x_3' \\
0 & 0 & 0 & x_3 & y_3 & 1 & -x_3 y_3' & -y_3 y_3' \\
x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4 x_4' & -y_4 x_4' \\
0 & 0 & 0 & x_4 & y_4 & 1 & -x_4 y_4' & -y_4 y_4'
\end{pmatrix}
\begin{pmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \end{pmatrix}
=
\begin{pmatrix} x_1' \\ y_1' \\ x_2' \\ y_2' \\ x_3' \\ y_3' \\ x_4' \\ y_4' \end{pmatrix}
\qquad (13.13)$$

Similar computations can be used to determine the appropriate affine or general homographic transformation that will correspond to the points. The general homographic transformation is the most common solution for this problem, since it can model the effects of perspective projection.
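
As an illustration, the four-point solution of equations 13.11–13.13 can be implemented directly. In the following Python/NumPy sketch the function name and sample points are illustrative, not from the original text:

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve equation 13.13 for the eight unknowns of H (h33 fixed at 1).

    src, dst: sequences of four corresponding (x, y) points.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])  # equation 13.11
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])  # equation 13.12
        b.extend([xp, yp])
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

# Map the unit square's corners onto an arbitrary quadrilateral.
H = homography_from_points([(0, 0), (1, 0), (1, 1), (0, 1)],
                           [(10, 10), (200, 30), (180, 150), (20, 120)])
```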

In general, control points must be determined in the images that are to be corresponded. These control points may be defined in source content or located using tracking, manual selection or image recognition. Fiducial images can be placed in the physical environment to assist in the location of control points. For example, fiducial chemical markers are widely used in medical imaging. Control points can also be features of a region; for example, corners or intersections of lines can be selected. Obviously, selected features must be robust in the image sequence, meaning they undergo small changes from frame to frame and are easy to detect. Shi and Tomasi [29] proposed a method to select "good" features to track. To match features between images, template matching is often applied. After n pairs of points are found, a transformation can be computed.

For many applications, more points will be determined than are absolutely needed for the transformation computation. In these cases, least-squares estimation of the transformation allows the additional points to help decrease errors. A least-squares error criterion function for an affine transformation is defined as follows:

$$E(A, T) = \sum_{i=1}^{n} \left\| p_i - (A\,c_i + T) \right\|^2 \qquad (13.14)$$

where $(c_i, p_i)$ are the $n$ matched point pairs.

Least-squares estimation minimizes this error function.
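
As an illustration, for an affine transformation the minimizer of equation 13.14 can be obtained with a standard linear least-squares solver. The following sketch (names are illustrative) accepts n ≥ 3 correspondences:

```python
import numpy as np

def affine_least_squares(src, dst):
    """Minimize equation 13.14 over n >= 3 correspondences.

    src, dst: (n, 2) arrays of corresponding points.
    Returns the 2x3 matrix [A | T].
    """
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    M = np.hstack([src, np.ones((len(src), 1))])  # rows: (x_i, y_i, 1)
    X, *_ = np.linalg.lstsq(M, dst, rcond=None)   # solves M @ X ~ dst
    return X.T                                    # [A | T]
```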

3.3 2D–3D Registration

2D to 3D registration is typically used to match a virtual camera in a 3D graphics environment to a real camera in the physical world. The starting point for this class of registration is nearly always a set of points in the real world and the corresponding points in the camera image. These points may be determined automatically or manually.

3.3.1 The Camera Model.

For convenience, homogeneous coordinates will be used to describe the model for a camera. For a 3D point X and a corresponding 2D point x, the homogeneous coordinates can be written as X = (x, y, z, 1) and x = (u, v, 1), respectively. The 2D–3D correspondence is expressed as:

$$\lambda\,x = P\,X \qquad (13.15)$$

for some nonzero scale factor $\lambda$.

In this equation, P is a 3 × 4 camera projection matrix, which can be decomposed into:

$$P = K \left[\, R \mid t \,\right] \qquad (13.16)$$

R is a 3 × 3 rotation matrix and t is a translation vector of dimension 3. K is the calibration matrix of the camera. The | symbol indicates concatenation. The equation for K is:

$$K = \begin{pmatrix} f & s & c_x \\ 0 & a f & c_y \\ 0 & 0 & 1 \end{pmatrix} \qquad (13.17)$$

In this equation, f is the camera focal length, (c_x, c_y) is the principal point (the center of projection in the camera image), s is the skew and a is the image aspect ratio. These intrinsic parameters may be calibrated in advance.

In a video-based augmented reality system, the world coordinates of 3D virtual objects are often known, and, if the camera projection matrix is also known, 3D virtual objects can be inserted into the real scene using equation 13.15 to determine the 2D coordinate that any 3D virtual point will project to.
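
As an illustration, equations 13.15–13.17 can be assembled directly; all parameter values in the following sketch are hypothetical:

```python
import numpy as np

# Hypothetical intrinsic parameters.
f, skew, a = 800.0, 0.0, 1.0        # focal length, skew, aspect ratio
cx, cy = 320.0, 240.0               # principal point
K = np.array([[f, skew,  cx],
              [0, a * f, cy],
              [0, 0,     1.0]])     # equation 13.17

R, t = np.eye(3), np.zeros((3, 1))  # camera at the world origin
P = K @ np.hstack([R, t])           # equation 13.16: P = K[R | t]

X = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous 3D point
u, v, w = P @ X                      # equation 13.15, up to scale w
print(u / w, v / w)                  # pixel coordinates (520.0, 340.0)
```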

3.3.2 Estimation of Projection Matrix.

Model-based Method.

The most common approach for computing the projection matrix is the model-based method. The projections of landmarks (artificial or natural) in an image are identified and the 3D-2D correspondences are established based on knowledge of the real-world coordinates of those landmarks. Like estimation of transformation matrices in the 2D–2D registration problem, the projection matrix can also be computed by using a least-squares method. The error function is:

$$E(P) = \sum_{i=1}^{n} d\left(x_i,\, P X_i\right)^2 \qquad (13.18)$$

where $d(\cdot,\cdot)$ is the image-plane distance between the observed point $x_i$ and the projection of $X_i$.

This function is also referred to as the re-projection error. Computer vision texts describe methods to compute P given this error function [30].
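
One such method is the direct linear transformation (DLT), a common linear approach that minimizes an algebraic approximation of this error and is often used to initialize a nonlinear refinement of equation 13.18. The following sketch is illustrative, not the specific method of [30]:

```python
import numpy as np

def estimate_projection_dlt(pts3d, pts2d):
    """Direct linear transformation estimate of the 3x4 matrix P from
    n >= 6 3D-2D correspondences (X, Y, Z) <-> (u, v)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    A = np.array(rows, float)
    _, _, Vt = np.linalg.svd(A)      # smallest right singular vector
    return Vt[-1].reshape(3, 4)      # minimizes ||A p|| with ||p|| = 1
```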

Natural landmarks include street lamps and 3D curves [31, 32]. Multi-ring color fiducials [33] and black squares with different patterns inside [34] have been proposed as artificial landmarks. The primary advantage of model-based registration is high accuracy and absence of drift, as long as enough landmarks are in view. For outdoor applications, it is hard to prepare fiducials or measure natural landmarks. For this reason, several extendible tracking methods have been proposed. A tracking method that extends the tracking range to unprepared environments by line auto-calibration is introduced in [35].

Structure From Motion.

A technology that appears to provide general solutions for outdoor applications is structure-from-motion estimation. Two-dimensional image motion is the projection of the three-dimensional motion of objects, relative to a camera, onto the image plane. Sequences of time-ordered images allow the estimation of projected two-dimensional image motion as optical flow. Provided that optical flow is a reliable approximation to image motion, the flow data can be used to recover the three-dimensional motion of the camera [21, 36]. A review of the computation of optical flow can be found in [37]. However, this method is computationally expensive and not yet suitable for real-time applications.



