Bernd Porr, Alex Cozzi and Florentin Wörgötter
In a stereoscopic system both eyes or cameras have a slightly different view. As a consequence, small variations between the projected images exist ("disparities"), which are spatially evaluated in order to retrieve depth information [Sanger, 1988, Fleet et al., 1991]. A strong similarity exists between the analysis of visual disparities and the determination of the azimuth of a sound source [Wagner and Frost, 1993]. The direction of the sound is thereby determined from the temporal delay between the left and right ear signals [Konishi and Sullivan, 1986]. Similarly, here we transpose the spatially defined problem of disparity analysis into the temporal domain and utilize two resonators implemented in the form of causal (electronic) filters to determine the disparity as local temporal phase differences between the left and right filter responses. This approach permits real-time analysis and can be solved analytically for step function contrast changes, which is an important case in all real-world applications. The proposed theoretical framework for spatial depth retrieval directly utilizes a temporal algorithm borrowed from auditory signal analysis. Thus, the suggested similarity between the visual and the auditory system in the brain [Wagner and Frost, 1993] finds its analogy here at the algorithmic level. We will compare the results from the temporal resonance algorithm with those obtained from several other techniques, such as cross-correlation and spatial phase-based disparity estimation, showing that the novel algorithm achieves performance similar to the "classical" approaches while using far fewer computational resources.
The field of biological cybernetics and neural modeling has undergone several transitions over the last decades. Classical "cybernetical" approaches which dominated before 1970 (often involving linear systems theory) were soon followed by neuronal network models with different degrees of biological realism. The domain of artificial neural networks (ANN) began to exert its massive influence in the last 10 years or so. The strongest driving force behind ANN research was probably the attempt to transfer ideas taken from biology to a more technological domain. Thus, this aspect of neuronal modeling (in its widest sense) was especially attractive to engineers and other application-oriented researchers. As a consequence, at about this time neural modeling "became useful" also outside the field of brain science. The transfer of biological ideas to technology, however, is not necessarily restricted to ANNs, and this may be a sensible consideration given the intrinsic disadvantages of ANNs (e.g., slow relaxation behavior). Instead, it is sometimes possible to design an application-oriented algorithm in a rather direct way from a biologically inspired model.
Thus, the goal of this article is twofold: we will try to show 1) that an algorithm borrowed from the auditory system can be applied to a visual problem, and 2) that it is possible to transfer this algorithm directly to a chain of electronic filters which operate in real-time. To this end we will concentrate on the problem of stereo-image analysis.
In any vision-based system the 3-dimensional world is projected onto 2-dimensional receptor surfaces. These could be the two retinas of a binocularly viewing animal or the cameras of an artificial system. During that process depth information is lost but can be recovered from the disparities between matching image parts. In technical systems vertical disparities are often neglected by assuming a strictly fronto-parallel camera geometry. In this case, it is sufficient to analyze corresponding cross-sections of both images line by line because the epipolar lines are then horizontal. Thus, stereoscopic depth estimation is reduced to a 1-dimensional spatial problem and common methods use acausal spatial filters to retrieve the disparity as a convolution result [Sanger, 1988, Fleet et al., 1991]. The inherent restriction to one dimension, however, also makes it possible to interpret each line from the left and right image as a temporal signal x(t), which could for example be imagined as a scan-line from a CCD camera arriving pixel after pixel. With the help of this interpretation a causal filter approach can now be defined such that the disparity is detected continuously with the incoming data.
The system we present is very simple: It takes the luminance signal of the image scan-lines from the left and the right image and pipes it through a left and a right band-pass filter (a resonator). This way two signals are generated which are quasi-oscillatory at the resonance frequency. The (local) phase difference between these two oscillations is directly equivalent to the disparity. Thus, subsequently our system measures this phase difference by two more simple electronic operations as shown in Fig. 1 and explained in the next section.
We assume a fronto-parallel camera arrangement, which leads to horizontal epipolar lines. Disparity changes can only be detected when they coincide with a luminance change. For digitized camera data the smallest luminance change is a 1-bit step function. In addition, stronger step-like luminance changes occur rather often in images, for example at the edges of a protruding object. Thus, step functions are a very generic case for which we will solve the "Ansatz" analytically. Let xl(t), xr(t) be the two corresponding pixel lines of a stereo image pair in which a single contrast step exists at different disparities (viz. different times tl and tr). To obtain the disparity between the images each signal is used to excite a resonator with characteristic frequency f0. We assume that the contrast step in the left image occurs earlier than that of the right image (tl<tr); therefore the resulting resonance starts earlier for xl than for xr. This temporal phase difference is directly equivalent to the spatial disparity between the images and can be obtained from an operator which compares the phases.
The two step functions xl(t) and xr(t), taken with unit amplitude, are defined in the Laplace domain by (Fig. 1):
$$X_{l,r}(s) = \frac{1}{s}\, e^{-s\, t_{l,r}} \qquad (1)$$
and the transfer function of the resonator is given as:
![]() |
(2) |
where
is a filter pole and specifies the filter
characteristic defined by f0 and the filter quality Q, which
determines the damping; the ``*'' denotes the complex conjugate.
| (3) |
Convolution of signal and filter yields for the right image:
We define
,
then the inverse Laplace transformation of Yr(s) yields:
The temporal resonator signal y(t) is a damped sine wave with frequency f0 (Fig. 1, yl, yr). The number of full cycles until the signal fades is roughly equivalent to the value of Q. Note that any DC component present in the input signal is removed by the resonator. This is an advantage of the new method because the DC component usually poses a severe problem in all spatial filter approaches [Sanger, 1988, Fleet et al., 1991].
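The behavior described above can be reproduced with a short numerical sketch. The discrete-time filter below is an assumption made for illustration and is not the authors' implementation: it uses the standard second-order pole placement (decay rate pi*f0/Q per pixel, damped angular frequency 2*pi*f0*sqrt(1 - 1/(4Q^2))) and places a zero at DC, so that a luminance step excites a pure damped sine. The function names and the step amplitudes are hypothetical; f0 = 0.08 and Q = 1.5 are the values quoted for the test scene in the results.

```python
import math

def resonator(signal, f0=0.08, q=1.5):
    """Causal two-pole band-pass with a zero at DC (illustrative design)."""
    sigma = math.pi * f0 / q                      # decay rate per pixel
    omega = 2.0 * math.pi * f0 * math.sqrt(1.0 - 1.0 / (4.0 * q * q))
    r = math.exp(-sigma)                          # pole radius
    a1 = 2.0 * r * math.cos(omega)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    x_prev = signal[0]          # assume the input was constant before the line starts
    for x in signal:
        y = (x - x_prev) + a1 * y1 + a2 * y2      # differencing removes the DC level
        out.append(y)
        y2, y1, x_prev = y1, y, x
    return out

def luminance_step(n, t0, low=50.0, high=200.0):
    """Scan line of n pixels with a single contrast step at pixel t0."""
    return [low if i < t0 else high for i in range(n)]

response = resonator(luminance_step(200, 40))
```

In this sketch `response` is zero up to the step, starts oscillating at pixel 40, changes sign within the first half period, and has decayed to a negligible level by the end of the line, regardless of the constant luminance level on either side of the step.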
Finally, disparity is determined from the phase difference between the
resonator signals from both images. Phase comparison is achieved by
multiplication of the two signals in the time domain and subsequent
low-pass filtering (Fig.1,
).
Multiplication yields (Fig. 1,
):
with:
and
The term
g2f0(t) reflects an oscillation with 2f0. In an
implementation it will be eliminated by low-pass filtering with low
cut-off (Fig. 1 LP). The second part represents the phase
between the two signals and contains an exponential relaxation term
and a constant term K, which encodes the true disparity.
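As a sketch of why multiplication followed by low-pass filtering isolates the phase, assume that both resonator outputs are damped sines with a common amplitude A, decay rate sigma and angular frequency omega. These symbols are introduced only for this illustration (omega approaches 2*pi*f0 for large Q), and the identity holds for t >= tr:

```latex
\begin{align*}
y_l(t) &= A\,e^{-\sigma (t-t_l)} \sin\bigl(\omega (t-t_l)\bigr), \qquad
y_r(t) = A\,e^{-\sigma (t-t_r)} \sin\bigl(\omega (t-t_r)\bigr), \\
y_l(t)\,y_r(t) &= \tfrac{1}{2} A^2 e^{-\sigma (2t - t_l - t_r)}
  \Bigl[\cos\bigl(\omega (t_r - t_l)\bigr) - \cos\bigl(\omega (2t - t_l - t_r)\bigr)\Bigr].
\end{align*}
```

Under these assumptions the second cosine oscillates at twice the resonance frequency and is removed by the low-pass filter, while the first cosine is constant in t and depends only on the delay tr - tl, scaled by the exponential relaxation.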
Like all other phase-based approaches, our algorithm is also subject to the so-called phase wrap-around problem. The periodic characteristic of the resonators limits the resolution of the system. This generic problem is reflected at the output of the system in Eq. 9 by the periodic behavior of the cosine. To avoid such ambiguities we restrict the argument of the cosine to:
.
From this we get:
;
a constraint which is similar to that observed
in the spatial filter (Gabor filter) approaches. The phase wrap-around
problem disappears for
at the cost of low spatial
resolution and an increasing noise susceptibility because of the
shallow filter characteristic. As in other spatial phase-based stereo algorithms, our approach could also be used in a cascaded way, utilizing several filter modules with different frequencies in order to address the phase wrap-around problem.
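The ambiguity can be made concrete with an idealized model in which the low-pass output is proportional to cos(2*pi*f0*d) for a disparity of d pixels. This model, the function names, and the half-period bound are assumptions for illustration (damping is ignored); f0 = 0.08 is the value quoted for the test scene in the results.

```python
import math

def phase_output(d, f0=0.08):
    """Idealized, undamped low-pass output for a disparity of d pixels."""
    return math.cos(2.0 * math.pi * f0 * d)

def unambiguous_range(f0=0.08):
    """Largest |d| for which the cosine argument stays within half a period."""
    return 1.0 / (2.0 * f0)

limit = unambiguous_range()                  # half a resonance period, in pixels
inside = phase_output(3.0)                   # a disparity inside the range
aliased = phase_output(1.0 / 0.08 - 3.0)     # one full period minus 3 pixels
coarse_limit = unambiguous_range(0.04)       # a lower-frequency module covers more
```

Here `inside` and `aliased` are indistinguishable at the output, which is the wrap-around ambiguity, while halving the frequency doubles the unambiguous range; this is the motivation for cascading modules with different frequencies.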
For the damping coefficient we get Q>1/2, which means that the whole resonance may be restricted to about one half-cycle of the sine wave. Given that disparity changes rarely exceed 5-10 pixels (empirical observation from publicly available stereo-image data), this restriction drastically limits the necessary computational effort in any implementation.
An analytical solution can also be obtained for other simple functional descriptions of disparity changes. In general, however, all disparity changes can be detected by such a system regardless of their shape as long as the frequency content of the change contains enough power at the resonance frequency.
The block diagram in Fig. 1 shows that this system can be easily
implemented in analog or digital hardware. In particular, a few modern
digital signal processors can be used to implement the individual
filters which are then coupled to a rather simple quasi real-time
processing system such as the one used to generate the data in the
figure. In such a system the disparity is determined continuously from
the incoming data and the computational delay observed in the
implementation (Fig. 1,
)
is constant. Its duration is mainly
determined by the low pass filter (Fig. 1 LP) and it is independent
of the input image. In order to make this algorithm applicable the
output signal needs to be normalized to be independent of overall
luminance variations. Such a normalization has been performed to
obtain the results shown in Fig. 4.
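A minimal end-to-end sketch of the processing chain for one pair of scan lines follows. It is not the authors' implementation: the discrete resonator design, the first-order low-pass, its coefficient `alpha`, and the particular normalization (dividing by the low-passed signal energies so that the output does not depend on step contrast) are all assumptions chosen for illustration.

```python
import math

def resonator(signal, f0=0.08, q=1.5):
    """Causal two-pole band-pass with a zero at DC (illustrative design)."""
    r = math.exp(-math.pi * f0 / q)
    omega = 2.0 * math.pi * f0 * math.sqrt(1.0 - 1.0 / (4.0 * q * q))
    a1, a2 = 2.0 * r * math.cos(omega), -r * r
    out, y1, y2, x_prev = [], 0.0, 0.0, signal[0]
    for x in signal:
        y = (x - x_prev) + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1, x_prev = y1, y, x
    return out

def low_pass(signal, alpha=0.05):
    """First-order recursive low-pass (exponential smoothing)."""
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out

def temporal_resonance(left, right, eps=1e-12):
    """Normalized phase signal per pixel: near +1 for zero delay, lower for larger delay."""
    yl, yr = resonator(left), resonator(right)
    cross = low_pass([a * b for a, b in zip(yl, yr)])    # multiply, then low-pass
    energy_l = low_pass([a * a for a in yl])
    energy_r = low_pass([b * b for b in yr])
    return [c / (math.sqrt(el * er) + eps)
            for c, el, er in zip(cross, energy_l, energy_r)]

def step_line(n, t0, low, high):
    """Scan line of n pixels with a single contrast step at pixel t0."""
    return [low if i < t0 else high for i in range(n)]

# Output at the end of the line for delays of 0..6 pixels between the two steps.
readout = [temporal_resonance(step_line(256, 60, 50.0, 200.0),
                              step_line(256, 60 + d, 50.0, 200.0))[-1]
           for d in range(7)]
```

For these synthetic steps the readout is close to 1 at zero delay and falls monotonically as the delay grows toward half a resonance period, so within that range it can be mapped back to a disparity. Because of the normalization, changing the step contrast leaves the readout essentially unchanged.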
Fig. 2 shows what the signal which originates from a single scan-line looks like at the different stages of the filtering process. The aperiodic brightness signal is transformed into a quasi-periodic signal at the resonator, where only a single frequency dominates (see spectra). The phase comparison (by multiplication) produces a signal with a DC and a "double-frequency" component. Only the DC component survives the low-pass filtering and, as explained above, this DC signal represents the phase and, hence, the disparity.
In the following we will compare our approach with several existing techniques. The next section gives some background about the algorithms used for comparison. Readers familiar with them may skip it.
Several techniques have been proposed to recover depth information from epipolar line pairs. The classical approach uses a measure of similarity, cross-correlation for example, to find matching points in the two images composing the stereo pair. This technique selects one image of the stereo pair, for example the left one, as the reference image. For each point of the reference image the corresponding point is sought in the other image by searching for a maximum in the similarity measure along the corresponding epipolar line. To this end the algorithm selects a rectangular window around a point in the reference image and computes its similarity measure with all the rectangular windows surrounding every point on the corresponding line in the second image. The point in the second image where the similarity measure has its maximum is considered the correct match.
This scheme, called "area-based matching" [Haralick and Shapiro, 1992], can be implemented in quite different ways, depending on the chosen similarity measure, on the algorithmic solution and on the complexity of the model of the disparity field. The similarity measures frequently used are sum of products, covariances, sum of squared differences, sum of absolute differences and cross-correlation. The algorithmic solutions range from complete search to iterative least squares, simplex algorithms and dynamic programming, depending strongly on the a-priori knowledge about the scene, the similarity measure and the model of the disparity field. The model of the disparity field varies from simple translation (horizontal plane) to affinity (locally planar surface) to smooth (smooth surfaces without occlusions) or piecewise smooth (piecewise smooth, possibly with occlusions).
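The search procedure described above can be sketched for a single scan line as follows. This is an illustrative sketch, not a reference implementation: it uses the sum of squared differences as the similarity measure, a 1-dimensional window, an exhaustive search over non-negative integer disparities, and a parabolic sub-pixel refinement. The function name, window size, and search range are hypothetical.

```python
import math

def ssd_disparity(left, right, x, half_window=5, max_disparity=8):
    """Disparity at pixel x of the left line, by exhaustive SSD search in the right line."""
    costs = []
    for d in range(max_disparity + 1):
        cost = 0.0
        for i in range(-half_window, half_window + 1):
            diff = left[x + i] - right[x + i + d]
            cost += diff * diff
        costs.append(cost)
    best = min(range(len(costs)), key=costs.__getitem__)
    # Parabolic refinement around the integer minimum (skipped at the search borders).
    if 0 < best < max_disparity:
        c_prev, c_best, c_next = costs[best - 1], costs[best], costs[best + 1]
        curvature = c_prev - 2.0 * c_best + c_next
        if curvature > 0.0:
            return best + 0.5 * (c_prev - c_next) / curvature
    return float(best)

# Synthetic test line: the right line is the left line delayed by 3 pixels.
left = [128.0 + 60.0 * math.sin(0.37 * i) + 40.0 * math.sin(0.11 * i + 1.0)
        for i in range(100)]
shift = 3
right = [left[max(i - shift, 0)] for i in range(100)]
estimate = ssd_disparity(left, right, x=40)
```

The inner loops make the cost of this scheme proportional to window size times search range for every pixel, which is the computational burden that the comparison with the temporal approach refers to.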
For comparison purposes we implemented an area-based stereo
algorithm that uses exhaustive search to identify the minimum at
integer positions; the sub-pixel value of the minimum is then computed
via cubic interpolation of the similarity function. We produced
disparity maps of a test scene with different similarity measures. The
assumed model of the disparity field is that of locally constant
disparity. We denote the signal of each corresponding
pair of scan lines as
and
,
where
the subscript indicates whether the scan-line comes from the right or
from the left image of the stereo pair. Wx and Wy define the size of
the window in the x and y coordinates respectively. Using this
notation, the cross-correlation of point
(x,y) with the disparity value d is defined as:
As an alternative to correlation-based techniques, Sanger
[Sanger, 1988] proposed in 1988 the use of the phase difference between two
local filter responses to compute the disparities of the different
objects in the two stereo images. Gabor filters are commonly used for
this purpose. In the phase difference method
[Sanger, 1988, Fleet et al., 1991], disparity is computed from the phase
difference between the convolutions of the two stereo images with
local bandpass filters. Since the two signals,
and
,
are locally related by a shift
,
i.e., in the vicinity of
each point x0
| (14) |
We can extract the local Fourier components by convolving the images
with the Gabor filters:
| = | ![]() |
||
| = | (15) |
As a function of the spatial position, the phase of the filter
response,
,
has a quasi-linear behavior dictated
by the center frequency k0:
The local frequency, i.e., the derivative of the phase
,
is
generally close to the value of the center frequency k0. In fact,
the Gabor filter is a bandpass filter around k0.
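A first-order version of the phase-difference estimate can be sketched as follows. The code is an illustration under stated assumptions rather than the algorithm of Fleet et al.: it evaluates a single complex Gabor kernel at one position of each line, takes the phase difference of the two responses, and divides by the center frequency k0 instead of the local frequency. The convention right(x) = left(x - d), the kernel width, and all names are hypothetical.

```python
import cmath
import math

def gabor_response(line, x0, k0, sigma):
    """Complex Gabor response of a scan line at position x0 (kernel cut at 3 sigma)."""
    radius = int(3 * sigma)
    total = 0j
    for u in range(-radius, radius + 1):
        envelope = math.exp(-(u * u) / (2.0 * sigma * sigma))
        total += line[x0 + u] * envelope * cmath.exp(-1j * k0 * u)
    return total

def phase_disparity(left, right, x0, k0, sigma):
    """Disparity d (with right(x) = left(x - d)) from the local phase difference."""
    gl = gabor_response(left, x0, k0, sigma)
    gr = gabor_response(right, x0, k0, sigma)
    delta_phi = cmath.phase(gl * gr.conjugate())   # wrapped to (-pi, pi]
    return delta_phi / k0

k0 = 2.0 * math.pi / 10.0          # modulation period of 10 pixels
true_d = 2
left = [50.0 * math.cos(k0 * x + 0.3) for x in range(120)]
right = [50.0 * math.cos(k0 * (x - true_d) + 0.3) for x in range(120)]
estimate = phase_disparity(left, right, x0=60, k0=k0, sigma=5.0)
```

The test signal is deliberately zero-mean and tuned to k0. With a constant luminance offset the truncated kernel picks up a small DC response that biases the phase, which is the DC problem of spatial filter approaches mentioned earlier.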
In the Fleet et al. algorithm [Fleet et al., 1991], the disparity is
extracted from the phase difference,
,
by expanding
to
the second order in
:
The phase is not defined when the amplitude vanishes, i.e., when
(singularity). Around these singular points the phase is
very sensitive to spatial or scale variations. As a consequence, the
approximation of Eq. 17 fails, and the calculation of
disparity in the neighborhood of a singularity is unreliable. The
neighborhoods of singular points can be detected by
means of [Fleet et al., 1991]:
In Fig. 4 we show the disparity maps produced by six
different techniques for disparity estimation. The Temporal
Resonance (TR) technique used parameters Q=1.5 and
f0 =
0.08. For the phase-based difference technique of Fleet and Jepson we
used two different Gabor filters: the first with a modulation period
of 10 pixels and half-octave bandwidth (FJ10-0.5) and the second with
a modulation period of 20 pixels and one-octave bandwidth
(FJ20-1). All the correlation-based techniques, Normalized
Cross-Correlation (NCC), Zero-Mean Cross-Correlation (ZCC), and
Sum of Squared Differences (SSD), used a window size of
pixels with a disparity limit of
pixels.
Judging from Fig. 4, the Temporal Resonance technique produces intermediate results compared to the other techniques, except on the uniform part of the source image. In these areas there is no way to measure disparity and the resonator response slowly fades. The filter-based techniques produce results in characteristic "bands" centered on the original image's edges. These bands are a consequence of the spatial extent of the Gabor filter and have approximately the size of the filter's modulation period. The correlation-based techniques produce widely different results. In the case of ZCC the differences in luminance cause the technique to find incorrect maxima in the correlation function, so most of the disparities are discarded in the verification phase. As a result, the disparity map is almost black. This problem is overcome by the normalization used by the NCC, which produces much better maps, visually very similar to the maps from the SSD.
In Fig. 5 we show a quantitative summary of the results from the different techniques we presented. Two quantities are most important in characterizing the performance of disparity estimators: (1) density, i.e., the fraction of pixels where the algorithms are able to measure a disparity value, and (2) precision, i.e., the mean error that affects the measurements. By varying the parameters in the different algorithms it is usually possible to trade density against precision or vice versa. Using the synthetic image "corridor" (Fig. 3) as a testbed, Temporal Resonance (TR) achieves the best results in density (91%) with an intermediate score in error (2.49 pixels). The comparison with NCC is interesting: it is a much slower technique that still is not able to beat TR in either density or precision. Fleet and Jepson's algorithm is very sensitive to the choice of the Gabor filter parameters [Cozzi et al., 1997], achieving very good precision (FJ20-1) or very bad precision (FJ10-0.5) with nearly constant density around 40%. The ZCC produced the worst result in density (10%) with sufficient precision (1.1 pixels). A good tradeoff is reached by SSD, which gives the best precision (0.53 pixels) with a reasonable density (52%). The group of T. Kanade at CMU [Kanade et al., 1995] succeeded in producing a real-time implementation of the SSD, but this implementation is rather computationally expensive.
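The two figures of merit can be computed as in the following sketch. The representation of a disparity map as a flat list with `None` for pixels without a measurement, the use of the mean absolute error as the error measure, and the function name are assumptions for illustration. The numbers in the usage lines are made up and are not results from the comparison above.

```python
def density_and_mean_error(estimated, ground_truth):
    """Return (density, mean absolute error) of a disparity map.

    `estimated` holds a float per pixel, or None where no disparity was measured.
    """
    errors = [abs(e - g) for e, g in zip(estimated, ground_truth) if e is not None]
    density = len(errors) / len(estimated)
    mean_error = sum(errors) / len(errors) if errors else float("nan")
    return density, mean_error

# Toy example with made-up values: 3 of 4 pixels carry a measurement.
density, mean_error = density_and_mean_error([2.0, None, 3.5, 1.0],
                                             [2.5, 4.0, 3.0, 1.0])
```

Because the error is averaged only over pixels that carry a measurement, an algorithm can lower its mean error by discarding uncertain pixels, which is the trade-off between density and precision described above.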
Fig. 6 shows an example of the performance of our algorithm tested with a real image pair. Disparity is retrieved with sufficient accuracy but the map is more blurred than that obtained from the artificial scene which contained much sharper edges. A comparison of this result with those obtained from the other algorithms (not shown) demonstrates that the performance of the temporal resonance algorithm falls well within the range of the other approaches.
The theoretical framework presented here borrows its strength from the combination of computational principles found in the auditory and visual system of vertebrates. The convolution of the stereo images with oscillating local filters (Eq. 4) is a step quite commonly performed in the majority of technical approaches [Sanger, 1988, Fleet et al., 1991] and reflects the response of cortical simple cells which have Gabor-like receptive fields [Marcelja, 1980, Daugman, 1980, Jones and Palmer, 1987]. It has been suggested that the evaluation of spatial phase differences from such receptive field responses could indeed be used to compute disparity in the brain [De Angelis et al., 1991]. The dense coverage of the visual field by cells with many different receptive field sizes, which is necessary to produce reliable depth maps, does not pose a problem for the massively parallel architecture of the visual cortex. Technical systems, however, soon reach their limits because of the tremendously high computational effort of such an architecture.
The 1-dimensional structure of an auditory signal, on the other hand, makes it possible to determine the phase differences of two incoming sound waves sequentially by temporal correlation of both signals, and such a process probably takes place in the auditory cortex [Konishi and Sullivan, 1986]. The interpretation of image lines as sequential signals allows for a similar "temporal" processing. The computational complexity of the temporal correlation involved, however, is reduced to simple multiplication and low-pass filtering as a consequence of the previously applied resonance filter. The transfer from the auditory to the visual domain worked well concerning the design of the novel algorithm. It should, however, not be forgotten that the algorithm shown here is rather unlikely to play any role in the visual system of the brain. After all, our visual cortex does not perform scan-line analysis.
On the other hand, the system is ideally suited for technical implementation in serial data acquisition systems and it performs in quasi real-time. More than that, however, the combination of spatial and temporal computational principles similar to those in the visual and auditory system generates a theoretical framework for causal stereoscopic depth processing in which the computational effort is strongly reduced. The comparison of the novel algorithm with other well-known techniques shows that intermediate results are obtained. Better results, however, require rather complex algorithms, and real-time performance is then mostly prevented. Thus, our novel approach may be a good compromise in all those situations where real-time performance is necessary and a limited accuracy is sufficient. It may, however, be possible to further improve the performance of the temporal resonance algorithm by wiring up several algorithmic modules with different parameters in parallel. Such an architecture would still operate in real-time and the results should be more accurate. In particular, the so-called stereo-correspondence problem could also be addressed by combining modules. The correspondence problem always arises if the same features (here gray levels) occur more than once on an image scan-line. In this case the match between left and right image can become ambiguous. One could use very wide filters to avoid this problem, but these filters would almost always average over different disparity changes, leading to a wrong estimate. Thus, commonly used spatial approaches implement filter cascades of different width and combine their results in order to achieve more accuracy and to reduce or eliminate the correspondence problem. To this end, we are currently investigating the theoretical background for such a combination of temporal resonance modules.
One should, however, realize that the spatial resolution of all techniques which use filter cascades is usually reduced due to the limited spatial resolution of the filter with the lowest frequency. For this reason Henkel [Henkel, 94] has designed a more intelligent approach to combining different filters without changing their spatial frequency.
So far our algorithm remains restricted to one dimension. Thus, an additional extension of the algorithm would be to combine the results from several scan-lines. Due to the continuity of objects, similar or identical disparity changes can often be tracked over a certain vertical distance. Thus, it would make sense to combine the results from different scan-lines obtained from our algorithm in order to exploit vertical disparity continuities. Our algorithm in itself omits the second dimension, but this important source of information could be re-introduced afterward by means of regularization techniques. Real-time performance would then depend on the speed of the regularization method but still seems within range when using fast and simple techniques (e.g. averaging).
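The simplest form of such a regularization, averaging each estimate with those at the same column in neighboring scan lines, can be sketched as follows. The data layout (a list of rows, with `None` marking pixels without a measurement), the neighborhood radius, and the function name are assumptions for illustration.

```python
def average_over_scan_lines(disparity_rows, radius=1):
    """Average each disparity with valid values in the same column of nearby rows."""
    n_rows = len(disparity_rows)
    smoothed = []
    for r, row in enumerate(disparity_rows):
        new_row = []
        for c in range(len(row)):
            values = [disparity_rows[k][c]
                      for k in range(max(0, r - radius), min(n_rows, r + radius + 1))
                      if disparity_rows[k][c] is not None]
            new_row.append(sum(values) / len(values) if values else None)
        smoothed.append(new_row)
    return smoothed

# Toy disparity map with made-up values: three scan lines, three columns.
rows = [[2.0, None, 4.0],
        [2.0, None, 1.0],
        [5.0, None, 4.0]]
smoothed = average_over_scan_lines(rows)
```

Each output row needs only a fixed number of neighboring input rows, so a post-processing step of this kind adds a constant delay of a few scan lines and keeps the causal, line-by-line character of the processing.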
The authors are grateful to R. Opara for critical comments on the manuscript. The "corridor" test image is courtesy of the Computer Vision Group of Prof. D. Fellner, Computer Science Dept., Bonn University, Germany. F.W. acknowledges support from the Deutsche Forschungsgemeinschaft and from the European Community (ESPRIT 3, BRA 8305). A patent is pending for this system. In addition, preliminary results using multiple, cascaded filter modules can be inspected here.