How to ``hear'' visual disparities: real-time stereoscopic spatial depth analysis using temporal resonance

Bernd Porr, Alex Cozzi and Florentin Wörgötter

Institut für Physiologie, Ruhr-Universität Bochum,

D-44780 Bochum, Germany

worgott@neurop.ruhr-uni-bochum.de
porr@neurop.ruhr-uni-bochum.de

Published in Biological Cybernetics

Abstract

In a stereoscopic system the two eyes or cameras have slightly different views. As a consequence, small variations between the projected images exist (``disparities''), which are spatially evaluated in order to retrieve depth information [Sanger, 1988, Fleet et al., 1991]. A strong similarity exists between the analysis of visual disparities and the determination of the azimuth of a sound source [Wagner and Frost, 1993]. The direction of the sound is determined from the temporal delay between the left and right ear signals [Konishi and Sullivan, 1986]. Similarly, here we transpose the spatially defined problem of disparity analysis into the temporal domain and utilize two resonators, implemented in the form of causal (electronic) filters, to determine the disparity as local temporal phase differences between the left and right filter responses. This approach permits real-time analysis and can be solved analytically for step-function contrast changes, which is an important case in all real-world applications. The proposed theoretical framework for spatial depth retrieval directly utilizes a temporal algorithm borrowed from auditory signal analysis. Thus, the suggested similarity between the visual and the auditory system in the brain [Wagner and Frost, 1993] finds its analogy here at the algorithmic level. We compare the results from the temporal resonance algorithm with those obtained from several other techniques, such as cross-correlation and spatial phase-based disparity estimation, showing that the novel algorithm achieves performance similar to the ``classical'' approaches while using far fewer computational resources.

Introduction

The field of biological cybernetics and neural modeling has undergone several transitions over the last decades. Classical ``cybernetical'' approaches, which dominated before 1970 (often involving linear systems theory), were soon followed by neuronal network models with different degrees of biological realism. The domain of artificial neural networks (ANN) began to exert its massive influence in the last 10 years or so. The strongest driving force behind ANN research was probably the attempt to transfer ideas taken from biology to a more technological domain. Thus, this aspect of neuronal modeling (in its widest sense) was especially attractive to engineers and other application-oriented researchers. As a consequence, at about this time neural modeling ``became useful'' also outside the field of brain science. The transfer of biological ideas to technology, however, is not necessarily restricted to ANNs, and this may be a sensible consideration given the intrinsic disadvantages of ANNs (e.g., slow relaxation behavior). Instead, it is sometimes possible to design an application-oriented algorithm in a rather direct way from a biologically inspired model.

Thus, the goal of this article is twofold: 1) we will try to show that an algorithm stolen from the auditory system can be applied to a visual problem, and 2) that it is possible to transfer this algorithm directly to a chain of electronic filters which operate in real-time. To this end we will concentrate on the problem of stereo-image analysis.

In any vision-based system the 3-dimensional world is projected onto 2-dimensional receptor surfaces. These could be the two retinas of a binocularly viewing animal or the cameras of an artificial system. During that process depth information is lost, but it can be recovered from the disparities between matching image parts. In technical systems vertical disparities are often neglected by assuming a strictly fronto-parallel camera geometry. In this case it is sufficient to analyze corresponding cross-sections of both images line by line, because the epipolar lines are now horizontal. Thus, stereoscopic depth estimation is reduced to a 1-dimensional spatial problem, and common methods use acausal spatial filters to retrieve the disparity as a convolution result [Sanger, 1988, Fleet et al., 1991]. The inherent restriction to one dimension, however, also makes it possible to interpret each line from the left and right image as a temporal signal $x(t)$, which could for example be imagined as a scan-line from a CCD camera arriving pixel after pixel. With the help of this interpretation a causal filter approach can be defined such that the disparity is detected continuously with the incoming data.

Causal Filtering of the Stereo-Images

General Description

The system we present is very simple: it takes the luminance signals of the image scan-lines from the left and the right image and pipes them through a left and a right band-pass filter (a resonator). This way two signals are generated which are quasi-oscillatory at the resonance frequency. The (local) phase difference between these two oscillations is directly equivalent to the disparity. The system then measures this phase difference with two further simple electronic operations, as shown in Fig. 1 and explained in the next section.


  
Figure 1: Block diagram of the computational process and results of a disparity estimation from the two input step functions $x_r$ and $x_l$. The initial disparity was 1 pixel, $f_0$ was 0.1 pixel$^{-1}$, and for graphical reasons we have set $Q$ to 2.0 such that two full oscillation cycles are shown. $y_r$ and $y_l$ reflect the resonator responses (Eq. 5), $\Phi $ is the signal from the phase comparison (Eq. 6) and $\phi $ shows the disparity output after low-pass filtering (Eq. 8). The scaling of $\phi $ reflects the damped cosine characteristic of Eq. 9. In this technical implementation the constant delay until read-out of the disparity was 10.0 pixels. For a CCD camera based system we can assume a pixel input rate of 10 MHz, i.e., 0.1 $\mu s$ per pixel. Thus, the final output of this disparity processing system would be available after a total delay of only $1.00~\mu s$. The measured disparity after that delay, i.e., at the peak of $\phi $, is 0.97 pixels.
\begin{figure}
\begin{center}\leavevmode
\psfig{file=dsp7.eps,width=12cm}
\end{center}\end{figure}

Equations for the Generic Case

We assume a fronto-parallel camera arrangement, which leads to horizontal epipolar lines. Disparity changes can only be detected when they coincide with a luminance change. For digitized camera data the smallest luminance change is a 1-bit step function. In addition, stronger step-like luminance changes occur rather often in images, for example at the edges of a protruding object. Thus, step functions are a very generic case, for which we will solve the ``Ansatz'' analytically. Let $x_l(t)$, $x_r(t)$ be the two corresponding pixel lines of a stereo image pair in which a single contrast step exists at different disparities (viz. different times $t_l$ and $t_r$). To obtain the disparity between the images each signal is used to excite a resonator with characteristic frequency $f_0$. We assume that the contrast step in the left image occurs earlier than that of the right image ($t_l<t_r$); therefore the resulting resonance starts earlier for $x_l$ than for $x_r$. This temporal phase difference is directly equivalent to the spatial disparity between the images and can be obtained from an operator which compares the phases.

The two step functions $x_l(t) \leftrightarrow X_l(s)$ and $x_r(t)
\leftrightarrow X_r(s)$ are defined in the Laplace domain by (Fig.1):

\begin{displaymath}X_l(s):={1\over s}{\mathrm e}^{-t_l s}\quad\mathrm{and}\quad X_r(s):={1\over s}{\mathrm e}^{-t_r s},
\end{displaymath} (1)

and the transfer function of the resonator is given as:

\begin{displaymath}H(s) = {{s\over{(s-s_\infty)(s - s_\infty^*)}}}
\end{displaymath} (2)

where $s_\infty$ is a filter pole and specifies the filter characteristic defined by f0 and the filter quality Q, which determines the damping; the ``*'' denotes the complex conjugate.

\begin{displaymath}{\mathrm{Re}(s_\infty)} = -{{2\pi f_0}\over{2Q}};~~~{\mathrm{Im}(s_\infty)} = \sqrt{(2\pi f_0)^2 -
({\mathrm{Re}(s_\infty)})^2}
\end{displaymath} (3)
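
As a small illustration of Eq. 3, the pole can be computed numerically from $f_0$ and $Q$. The following is a minimal sketch, not part of the original implementation; the function name `resonator_pole` is a hypothetical choice, and the example parameters are those quoted in the caption of Fig. 1.

```python
import math

def resonator_pole(f0, q):
    """Complex pole s_inf of the resonator (Eq. 3) for frequency f0 and quality q."""
    if q <= 0.5:
        # For Q <= 1/2 the square root in Eq. 3 is not positive: no oscillation.
        raise ValueError("Q must be larger than 1/2")
    re = -2.0 * math.pi * f0 / (2.0 * q)
    im = math.sqrt((2.0 * math.pi * f0) ** 2 - re ** 2)
    return complex(re, im)

# Parameters quoted in the caption of Fig. 1: f0 = 0.1 pixel^-1, Q = 2.
s_inf = resonator_pole(0.1, 2.0)
```

By construction the modulus of the pole equals $2\pi f_0$, while the damping (the real part) grows as $Q$ decreases.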

Convolution of signal and filter, which corresponds to a multiplication in the Laplace domain, yields for the right image:

 \begin{displaymath}Y_r(s) = X_r(s)H(s)={{s\over{(s-s_\infty)(s-s_\infty^*)}}}{1\over s}{\mathrm e}^{-t_r s}
\end{displaymath} (4)

A similar convolution is performed for the left image.

We define $a:=(s_\infty-s_\infty^*)^{-1}$, then the inverse Laplace transformation of Yr(s) yields:

 \begin{displaymath}y_r(t)=\left\{\begin{array}{ll}
a {\mathrm e}^{s_\infty (t-t_r)} + a^* {\mathrm e}^{s_\infty^* (t-t_r)} & \mbox{if $t\geq t_r$ } \\
0 & \mbox{if $t<t_r$ }
\end{array} \right.
\end{displaymath} (5)

The temporal resonator signal y(t) reflects a damped sine-wave with frequency f0 (Fig. 1, yl,yr). The number of full cycles until the signal fades is roughly equivalent to the value of Q. Note that any DC component present in the input signal is removed by the resonator. This is an advantage of the new method because the DC usually poses a severe problem in all spatial filter approaches [Sanger, 1988,Fleet et al., 1991].
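
The damped sine-wave character of the resonator response can be checked numerically. The sketch below evaluates the step response as the sum of the two complex-conjugate pole contributions, with $a=(s_\infty-s_\infty^*)^{-1}$ as defined above; the function name `step_response` and the pole value `S_INF` are illustrative assumptions, not values from the original system.

```python
import cmath
import math

def step_response(t, t_step, s_inf):
    """Resonator response to a unit contrast step at time t_step (cf. Eq. 5)."""
    if t < t_step:
        return 0.0  # causal filter: no output before the step arrives
    a = 1.0 / (s_inf - s_inf.conjugate())
    tau = t - t_step
    y = a * cmath.exp(s_inf * tau) + a.conjugate() * cmath.exp(s_inf.conjugate() * tau)
    return y.real  # the imaginary parts of the two conjugate terms cancel

S_INF = complex(-0.15, 0.6)  # hypothetical pole, chosen only for illustration
print(step_response(4.5, 2.0, S_INF))
```

Because $a$ is purely imaginary, the sum reduces to ${\mathrm e}^{{\mathrm{Re}(s_\infty)}\tau}\sin({\mathrm{Im}(s_\infty)}\tau)/{\mathrm{Im}(s_\infty)}$ with $\tau=t-t_r$, which makes the damped sine explicit and shows that no constant (DC) term survives.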

Finally, disparity is determined from the phase difference between the resonator signals from both images. Phase comparison is achieved by multiplication of the two signals in the time domain and subsequent low-pass filtering (Fig.1, $\Phi,~LP$).

Multiplication yields (Fig. 1, $\Phi $):

 \begin{displaymath}\Phi(t) = y_l(t)y_r(t) =\left\{\begin{array}{ll}
g_{2f_0}(t) + \phi(t) & \mbox{if $t\geq t_r$ } \\
0 & \mbox{if $t<t_r$ }
\end{array} \right.
\end{displaymath} (6)

with:

 
\begin{displaymath}g_{2f_0}(t) = \underbrace{
a^2 {\mathrm e}^{s_\infty(2t-t_l-t_r)}+{a^*}^2 {\mathrm e}^{s_\infty^*(2t-t_l-t_r)}
}_{\textrm{\small double frequency term}}
\end{displaymath} (7)

and

\begin{displaymath}\begin{array}{rcl}
\phi(t) & = & \underbrace{
\vert a\vert^2 {\mathrm e}^{-s_\infty t_l-s_\infty^* t_r+(s_\infty+s_\infty^*)t}
+\vert a\vert^2 {\mathrm e}^{-s_\infty^* t_l-s_\infty t_r+(s_\infty^*+s_\infty)t}
}_{\textrm{\small phase}} \\
 & = & \underbrace{2 \vert a\vert^2 \cos\left[(t_r-t_l){\mathrm{Im}(s_\infty)}\right]}_{K}
{\mathrm e}^{{\mathrm{Re}(s_\infty)}(2t-t_r-t_l)}
\end{array}
\end{displaymath} (8)

The term $g_{2f_0}(t)$ reflects an oscillation with $2f_0$. In an implementation it will be eliminated by low-pass filtering with a low cut-off (Fig. 1, LP). The second part represents the phase $\phi(t)$ between the two signals and contains an exponential relaxation term and a constant term $K$, which encodes the true disparity.
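
The decomposition of the product into a double-frequency term and a phase term can be verified numerically. The following sketch multiplies two step responses and compares the result with the sum of the two terms; all function names, the pole value `S` and the step times `TL`, `TR` are hypothetical choices made for this illustration.

```python
import cmath
import math

def resonance(t, t0, s):
    """Resonator step response for a step at t0 (valid for t >= t0)."""
    a = 1.0 / (s - s.conjugate())
    return 2.0 * (a * cmath.exp(s * (t - t0))).real

def double_frequency_term(t, tl, tr, s):
    """a^2 exp(s(2t-tl-tr)) plus its complex conjugate (cf. Eq. 7)."""
    a = 1.0 / (s - s.conjugate())
    return 2.0 * (a * a * cmath.exp(s * (2.0 * t - tl - tr))).real

def phase_term(t, tl, tr, s):
    """K times the exponential relaxation (cf. Eq. 8)."""
    a = 1.0 / (s - s.conjugate())
    k = 2.0 * abs(a) ** 2 * math.cos((tr - tl) * s.imag)
    return k * math.exp(s.real * (2.0 * t - tr - tl))

S = complex(-0.15, 0.6)   # hypothetical pole
TL, TR = 10.0, 11.0       # hypothetical step times, disparity of 1 pixel
t = 14.0
product = resonance(t, TL, S) * resonance(t, TR, S)
print(product, double_frequency_term(t, TL, TR, S) + phase_term(t, TL, TR, S))
```

Both printed values agree up to rounding, which confirms that only the phase term carries the cosine of the disparity.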

 \begin{displaymath}K = {{Q^2}\over{2 \pi^2 f_0^2 (4 Q^2 - 1)}} \cos\left[(t_r-t_l){\mathrm{Im}(s_\infty)}\right]
\end{displaymath} (9)

The disparity which is the spatial equivalent of tr-tl can be computed by inverting Eq. 9 and is obtained immediately at the second contrast step (i.e., for t=tr), after which the signal relaxes to zero. This relaxation behavior which originates from the characteristic of the resonator assures temporal (viz. spatial) locality. Otherwise only the average phase (viz. disparity) of each image line could be computed.
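
The inversion of Eq. 9 can be sketched as follows. The code assumes that the constant $K$ has already been separated from the exponential relaxation term and that the disparity lies inside the unambiguous range of the cosine; the function names are hypothetical.

```python
import math

def _pole_imag(f0, q):
    # Im(s_inf) from Eq. 3, written in closed form.
    return 2.0 * math.pi * f0 * math.sqrt(1.0 - 1.0 / (4.0 * q * q))

def _prefactor(f0, q):
    # Q^2 / (2 pi^2 f0^2 (4 Q^2 - 1)), the factor in front of the cosine in Eq. 9.
    return q * q / (2.0 * math.pi ** 2 * f0 ** 2 * (4.0 * q * q - 1.0))

def amplitude_k(disparity, f0, q):
    """Forward model: K for a given disparity t_r - t_l (Eq. 9)."""
    return _prefactor(f0, q) * math.cos(disparity * _pole_imag(f0, q))

def disparity_from_k(k, f0, q):
    """Inverse of Eq. 9, valid for 0 < disparity * Im(s_inf) < pi."""
    ratio = max(-1.0, min(1.0, k / _prefactor(f0, q)))  # guard against rounding
    return math.acos(ratio) / _pole_imag(f0, q)

k = amplitude_k(1.0, 0.1, 2.0)
recovered = disparity_from_k(k, 0.1, 2.0)
```

In a real system the measured output would first have to be normalized, since overall luminance scales the amplitude of both resonator signals.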

Like all other phase-based approaches, our algorithm is subject to the so-called phase wrap-around problem. The periodic characteristic of the resonators limits the range of disparities that can be resolved unambiguously. This generic problem is reflected at the output of the system in Eq. 9 by the periodic behavior of the cosine. To avoid such ambiguities we restrict the argument of the cosine to $0<(t_r-t_l){\mathrm{Im}(s_\infty)}<\pi$. With ${\mathrm{Im}(s_\infty)} \approx 2\pi f_0$ we get $0<f_0< {{1}\over{2}} (t_r - t_l)^{-1}$, a constraint which is similar to that observed in the spatial filter (Gabor filter) approaches. The phase wrap-around problem disappears for $f_0 \rightarrow 0$ at the cost of low spatial resolution and an increasing noise susceptibility because of the shallow filter characteristic. As in the other spatial phase-based stereo algorithms, our approach could also be used in a cascaded way, utilizing several filter modules with different frequencies in order to address the phase wrap-around problem.
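
The step from the cosine constraint to the bound on $f_0$ can be made explicit. The short derivation below uses only Eq. 3; the closed form for ${\mathrm{Im}(s_\infty)}$ is a rearrangement that is not written out in the original.

```latex
\begin{displaymath}
{\mathrm{Im}(s_\infty)} = \sqrt{(2\pi f_0)^2 - \left({\pi f_0 \over Q}\right)^2}
= 2\pi f_0 \sqrt{1 - {1 \over 4Q^2}}
\end{displaymath}
\begin{displaymath}
0 < (t_r-t_l)\,{\mathrm{Im}(s_\infty)} < \pi
\quad\Longrightarrow\quad
0 < f_0 < {1 \over 2\,(t_r-t_l)\sqrt{1 - 1/(4Q^2)}}
\end{displaymath}
```

For $Q \gg 1/2$ the square root approaches 1 and the bound reduces to $f_0 < {1\over 2}(t_r-t_l)^{-1}$. As an example, for $f_0=0.08$ pixel$^{-1}$ and $Q=1.5$ the square root is about 0.943, so the largest unambiguous disparity is about 6.6 pixels, slightly more than the 6.25 pixels given by the simplified bound.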

For the pole in Eq. 3 to remain complex, the filter quality must satisfy $Q>1/2$, which means that the whole resonance may be restricted to about one half-cycle of the sine-wave. Given that disparity changes rarely exceed 5-10 pixels (empirical observation from publicly available stereo-image data), this restriction drastically limits the necessary computational effort in any implementation.

An analytical solution can also be obtained for other simple functional descriptions of disparity changes. In general, however, all disparity changes can be detected by such a system regardless of their shape as long as the frequency content of the change contains enough power at the resonance frequency.

The block diagram in Fig. 1 shows that this system can be easily implemented in analog or digital hardware. In particular, a few modern digital signal processors can be used to implement the individual filters which are then coupled to a rather simple quasi real-time processing system such as the one used to generate the data in the figure. In such a system the disparity is determined continuously from the incoming data and the computational delay observed in the implementation (Fig. 1, $\phi $) is constant. Its duration is mainly determined by the low pass filter (Fig. 1 LP) and it is independent of the input image. In order to make this algorithm applicable the output signal needs to be normalized to be independent of overall luminance variations. Such a normalization has been performed to obtain the results shown in Fig. 4.


  
Figure 2: Signals from a complete scan-line at different filtering stages and the power spectra of these signals.
\begin{figure}
\begin{center}\leavevmode
\psfig{file=uebers2.eps,width=0.9\textwidth} \end{center}\end{figure}

Fig. 2 shows what the signal which originates from a single scan-line looks like at the different stages of the filtering process. The aperiodic brightness signal is transformed into a quasi-periodic signal at the resonator, where only a single frequency dominates (see spectra). The phase comparison (by multiplication) produces a signal with a DC and a ``double-frequency'' component. Only the DC component survives the low-pass filtering and, as explained above, this DC signal represents the phase and, hence, the disparity.
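
The three stages (resonator, multiplication, low-pass) can be sketched in discrete time. The code below is not the authors' implementation: it assumes a hypothetical discretisation in which the poles are mapped to $z={\mathrm e}^{s_\infty}$ for a sampling step of one pixel, a zero at DC removes constant luminance, and a one-pole smoother with an assumed coefficient `alpha` stands in for the low-pass filter.

```python
import math

def resonate(x, f0, q):
    """Two-pole recursive resonator with a zero at DC (assumed discretisation)."""
    re = -math.pi * f0 / q
    im = 2.0 * math.pi * f0 * math.sqrt(1.0 - 1.0 / (4.0 * q * q))
    r, theta = math.exp(re), im            # pole radius and angle per pixel
    b1, b2 = 2.0 * r * math.cos(theta), -r * r
    y = []
    x_prev = x[0] if x else 0.0            # avoid a start-up transient
    y1 = y2 = 0.0
    for sample in x:
        out = (sample - x_prev) + b1 * y1 + b2 * y2
        y.append(out)
        x_prev, y2, y1 = sample, y1, out
    return y

def low_pass(x, alpha):
    """One-pole smoother; alpha in (0, 1] is an assumed tuning parameter."""
    state, y = 0.0, []
    for sample in x:
        state += alpha * (sample - state)
        y.append(state)
    return y

def phase_signal(left, right, f0, q, alpha=0.2):
    """Resonate both scan-lines, multiply, then low-pass the product."""
    yl, yr = resonate(left, f0, q), resonate(right, f0, q)
    return low_pass([a * b for a, b in zip(yl, yr)], alpha)

line_l = [0.0] * 5 + [1.0] * 25            # contrast step at pixel 5
line_r = [0.0] * 6 + [1.0] * 24            # same step, one pixel later
phi = phase_signal(line_l, line_r, 0.1, 2.0)
```

Every operation uses only the current and previous samples, so the output is available with a constant delay while the scan-line is still arriving, which is the causal property the text relies on.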

Results

In the following we compare our approach with several existing techniques. The next section gives some background on the algorithms used for comparison. Readers familiar with them may skip it.

Other methods for disparity estimation

Several techniques have been proposed to recover depth information from epipolar line pairs. The classical approach uses a measure of similarity, cross-correlation for example, to find matching points in the two images composing the stereo pair. This technique selects one image of the stereo pair, for example the left one, as the reference image. For each point of the reference image the corresponding point is sought in the other image by searching for a maximum in the similarity measure along the corresponding epipolar line. To this end the algorithm selects a rectangular window around a point in the reference image and computes its similarity measure with the rectangular windows surrounding every point on the corresponding line in the second image. The point in the second image where the similarity measure has its maximum is considered the correct match.

This scheme, called ``area-based matching'' [Haralick and Shapiro, 1992], can be implemented in quite different ways, depending on the chosen similarity measure, on the algorithmic solution and on the complexity of the model of the disparity field. The similarity measures frequently used are sum of products, covariances, sum of squared differences, sum of absolute differences and cross-correlation. The algorithmic solutions range from complete search to iterative least squares, simplex algorithms and dynamic programming, depending strongly on the a-priori knowledge about the scene, the similarity measure and the model of the disparity field. The model of the disparity field varies from simple translation (horizontal plane) to affinity (locally planar surface) to smooth (smooth surfaces without occlusions) or piecewise smooth (piecewise smooth, possibly with occlusions).

For comparison purposes we implemented an area-based stereo algorithm that uses an exhaustive search to identify the extremum of the similarity function at integer positions; the sub-pixel position of the extremum is then computed via cubic interpolation of the similarity function. We produced disparity maps of a test scene with different similarity measures. The assumed model of the disparity field is that of locally constant disparity. We denote the signal of each corresponding pair of scan-lines as $f_{\rm R}(x, y)$ and $f_{\rm L}(x, y)$, where the subscript indicates whether the scan-line comes from the right or from the left image of the stereo pair. $W_x$ and $W_y$ define the size of the window in the $x$ and $y$ coordinates, respectively. Using this notation, the cross-correlation of point $(x,y)$ with the disparity value $d$ is defined as:

 \begin{displaymath}
{\rm CC}(x, y, d) = \sum_{i=-W_x}^{W_x}\sum_{j=-W_y}^{W_y}
f_{\rm L}(x+i, y+j)\, f_{\rm R}(x+i + d, y+j)
\end{displaymath} (10)

Plain cross-correlation is too sensitive to the local characteristics of the signal to be used in real applications. A better alternative is to use the zero-mean cross-correlation:

 \begin{displaymath}
{\rm ZCC}(x, y, d) = \sum_{i=-W_x}^{W_x}\sum_{j=-W_y}^{W_y}
\left(f_{\rm L}(x+i, y+j) - \overline{f_{\rm L}} \right)\,
\left(f_{\rm R}(x+i + d, y+j) - \overline{f_{\rm R}} \right)
\end{displaymath} (11)

where $\overline{f_{\rm R/L}}$ are the means of the signals in the windows. An even better alternative is the normalized cross correlation:

 \begin{displaymath}
{\rm NCC}(x, y, d) = \sum_{i=-W_x}^{W_x}\sum_{j=-W_y}^{W_y}
\frac{f_{\rm L}(x+i, y+j)\, f_{\rm R}(x+i + d, y+j)}
{\sigma_{f_{\rm L}} \, \sigma_{f_{\rm R}}}
\end{displaymath} (12)

where $\sigma_{f_{R/L}}$ are the standard deviations of the signals in the windows. Another approach is to search the minimum of the sum of squared differences between the two signals:

 \begin{displaymath}
{\rm SSD}(x, y, d) = \sum_{i=-W_x}^{W_x}\sum_{j=-W_y}^{W_y}
\left[f_{\rm L}(x+i, y+j) - f_{\rm R}(x+i + d, y+j)\right]^2
\end{displaymath} (13)
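
The exhaustive search over integer disparities can be illustrated for a single scan-line. This is a minimal sketch restricted to one dimension ($W_y=0$) and without the sub-pixel interpolation step; the function names and the random test signal are assumptions made for the example.

```python
import random

def ssd(left, right, x, d, half_width):
    """Sum of squared differences for one scan-line (Eq. 13 with W_y = 0)."""
    return sum((left[x + i] - right[x + i + d]) ** 2
               for i in range(-half_width, half_width + 1))

def best_disparity(left, right, x, half_width, max_disp):
    """Exhaustive integer search for the disparity that minimizes the SSD."""
    candidates = [d for d in range(-max_disp, max_disp + 1)
                  if 0 <= x - half_width + d and x + half_width + d < len(right)]
    return min(candidates, key=lambda d: ssd(left, right, x, d, half_width))

rng = random.Random(0)                          # fixed seed: reproducible test line
left = [rng.random() for _ in range(64)]
true_shift = 3
right = [0.0] * true_shift + left[:-true_shift]  # right[x + 3] == left[x]

estimate = best_disparity(left, right, 30, 3, 5)
```

The cost of this search grows with the product of window size and disparity range for every pixel, which is the computational burden that the comparison in the next section refers to.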

As an alternative to correlation-based techniques, Sanger [Sanger, 1988] proposed the use of the phase difference between two local filter responses to compute the disparities of the different objects in the two stereo images. Gabor filters are commonly used for this purpose. In the phase difference method [Sanger, 1988, Fleet et al., 1991], disparity is computed from the phase difference between the convolutions of the two stereo images with local bandpass filters. Since the two signals, $f_{\rm R}(x)$ and $f_{\rm L}(x)$, are locally related by a shift $\delta(x_0)$, i.e., in the vicinity of each point $x_0$

\begin{displaymath}f_{\rm L}(x+\delta(x_0)/2) \approx f_{\rm R}(x-\delta(x_0)/2)
\end{displaymath} (14)

the local k0 Fourier components of $f_{\rm L}(x)$ and $f_{\rm R}(x)$:

\begin{displaymath}\widehat{f}_{\rm L/R}(k_0) = \int \, {\mathrm e}^{-{\mathrm i}\,k_0\,x} f_{\rm L/R}(x) \,dx
= \rho(x)_{\rm L/R} \, {\mathrm e}^{-{\mathrm i}\phi(x)_{\rm L/R}}
\end{displaymath}

are related by a phase difference equal to $\Delta \phi(x) =
\phi_L(x)-\phi_R(x)~=~k_0 \, \delta$.

We can extract the local Fourier components by convolving the images with the Gabor filters:

\begin{displaymath}\begin{array}{rcl}
F_{\rm L/R}(x,k_0) & = & \displaystyle \int G(x-y) \exp({{\mathrm i}\,k_0\,(x-y)})
f_{\rm L/R}(y) \,dy \\
 & = & \rho_{\rm L/R}(x) \, \exp({\mathrm i}\, \psi_{\rm L/R}(x))
\end{array}
\end{displaymath} (15)

where G(x-y) is the Gaussian function and k0 is the tuning frequency of the filter:

\begin{displaymath}G(x) = \frac{1}{\sqrt{2 \, \pi} \, \sigma} \, \exp ({-
\frac{x^2}{2 \sigma^2}})
\end{displaymath}

As a function of the spatial position, the phase of the filter response, $\psi(x)$, has a quasi-linear behavior dictated by the center frequency $k_0$:

 \begin{displaymath}\psi(x) \approx \psi^{'}(x_0) \, (x-x_0) \approx k_0 \, (x-x_0).
\end{displaymath} (16)


  
Figure 3: The ``corridor'' synthetic stereo pair and its disparity map.
\begin{figure}
\begin{center}\leavevmode \psfig{file=orig.eps,width=0.5\textwidth}
\end{center}\end{figure}

The local frequency, i.e., the derivative of the phase $\psi(x)$, is generally close to the value of the center frequency k0. In fact, the Gabor filter is a bandpass filter around k0. In the Fleet et al. algorithm [Fleet et al., 1991], the disparity is extracted from the phase difference, $\Delta \psi(x) = \psi_{\rm
L}(x) - \psi_{\rm R}(x)$, by expanding $\Delta \psi(x)$ to the second order in $\delta$:

 \begin{displaymath}\delta(x) \approx 2 \,\frac{\left[ \Delta \psi(x) \right]_{2\,\pi}}
{\psi^{'}_L(x)+\psi^{'}_R(x)}.
\end{displaymath} (17)
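
The phase-difference estimate can be sketched for one-dimensional signals. The code below samples the Gabor response of Eq. 15 at integer pixels and uses the simplification $\psi^{'}_L \approx \psi^{'}_R \approx k_0$, so that Eq. 17 reduces to $\delta \approx [\Delta\psi]_{2\pi}/k_0$. The test signals are built such that the left line equals the right line evaluated at $x+\delta$; this sign convention, the function names and the filter parameters are assumptions of the example, and the opposite convention flips the sign of the result.

```python
import cmath
import math

def gabor_response(signal, x, k0, sigma):
    """Discrete version of Eq. 15 at pixel x, truncated at 4 sigma."""
    radius = int(4 * sigma)
    total = 0j
    for y in range(max(0, x - radius), min(len(signal), x + radius + 1)):
        u = x - y
        g = math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)
        total += g * cmath.exp(1j * k0 * u) * signal[y]
    return total

def phase_disparity(left, right, x, k0, sigma):
    """Wrapped phase difference divided by k0 (Eq. 17 with local frequency k0)."""
    f_l = gabor_response(left, x, k0, sigma)
    f_r = gabor_response(right, x, k0, sigma)
    delta_psi = cmath.phase(f_l * f_r.conjugate())  # wrapped to (-pi, pi]
    return delta_psi / k0

K0 = 2.0 * math.pi / 10.0          # modulation period of 10 pixels
TRUE_DELTA = 1.5
right = [math.cos(K0 * n) for n in range(100)]
left = [math.cos(K0 * (n + TRUE_DELTA)) for n in range(100)]
estimate = phase_disparity(left, right, 50, K0, 5.0)
```

Multiplying one response by the conjugate of the other yields the wrapped phase difference directly, without unwrapping either phase separately. For a pure sinusoid at the tuning frequency the estimate is nearly exact; for real images the local frequency deviates from $k_0$, which is why Eq. 17 uses the measured phase derivatives.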

The phase is not defined when the amplitude vanishes, i.e., when $\rho(x) =0$ (singularity). Around these singular points the phase is very sensitive to spatial or scale variations. As a consequence, the approximation of Eq. 17 fails, and the calculation of disparity in the neighborhood of a singularity is unreliable. The neighborhoods of singular points can be detected by means of [Fleet et al., 1991]:

  
    $\displaystyle S(x) = \sigma \,
\sqrt{\left(\psi'-k_0\right)^2+\left(\frac{\rho'}{\rho}\right)^2} \leq T_1$ (18)
    $\displaystyle \rho(x)/\rho^* > T_2$ (19)

where $T_1$ and $T_2$ are suitably chosen constants, and $\rho^*$ denotes the maximum value of the amplitude. The first term of Eq. 18 measures the difference between the peak frequency, $k_0$, and the local frequency, $\psi^{'}(x)$, in relation to the width of the filter $1/\sigma$. The second term of Eq. 18 measures local amplitude variations with respect to the spatial width $\sigma$. The relation in Eq. 19 measures the ``energy'' of the response. The result at point $x$ is accepted only if the above relations are satisfied. Usually, $T_2$ is set to $\approx 5\%$, and $T_1 \approx 1.25$.

Comparison of the results

In Fig. 4 we show the disparity maps produced by six different techniques for disparity estimation. The Temporal Resonance (TR) technique used parameters $Q=1.5$ and $f_0 = 0.08$. For the phase-based difference technique of Fleet and Jepson we used two different Gabor filters: the first with a modulation period of 10 pixels and half-octave bandwidth (FJ10-0.5) and the second with a modulation period of 20 pixels and one-octave bandwidth (FJ20-1). All the correlation-based techniques, Normalized Cross-Correlation (NCC), Zero-Mean Cross-Correlation (ZCC), and Sum of Squared Differences (SSD), used a window size of $7 \times 3$ pixels with a disparity limit of $\pm 20 $ pixels.


  
Figure 4: Disparity maps produced by the different disparity estimation techniques.
\begin{figure}
\begin{center}\leavevmode
\psfig{file=compari3.eps,width=12cm}
\end{center}\end{figure}

Judging from Fig. 4, the Temporal Resonance technique produces intermediate results as compared to the other techniques, except on the uniform part of the source image. In these areas there is no way to measure disparity and the resonator response slowly fades. The filter-based techniques produce results in characteristic ``bands'' centered on the original image's edges. These bands are a consequence of the spatial extent of the Gabor filter and have approximately the size of the filter's modulation period. The correlation-based techniques produce widely different results. In the case of ZCC the differences in luminance cause the technique to find incorrect maxima in the correlation function, so that most of the disparities are discarded in the verification phase. As a result, the disparity map is almost black. This problem is overcome by the normalization used by the NCC, which produces much better maps, visually very similar to the maps from the SSD.


  
Figure 5: Precision and density of different disparity estimation techniques. Results for the ``corridor'' synthetic test image. The labels have the following meaning: TR is Temporal Resonance, NCC is Normalized Cross-Correlation, ZCC is Zero-mean Cross-Correlation, SSD is Sum of Squared Differences, FJ is Fleet and Jepson, with two parameters for the Gabor filter: 20 and 10 pixels of tuning period, 1 and 0.5 octaves of bandwidth.
\begin{figure}
\begin{center}
\leavevmode \psfig{file=quanti.eps,width=12cm}
\end{center}\end{figure}

In Fig. 5 we show a quantitative summary of the results from the different techniques we presented. Two quantities chiefly characterize the performance of disparity estimators: (1) density, i.e., the number of pixels where the algorithms are able to measure a disparity value, and (2) precision, i.e., the mean error that affects the measurements. By varying the parameters in the different algorithms it is usually possible to trade density against precision or vice versa. Using the synthetic image ``corridor'' (Fig. 3) as a testbed, Temporal Resonance (TR) achieves the best result in density (91%) with an intermediate score in error (2.49 pixels). The comparison with NCC is interesting: NCC is a much slower technique, yet it beats TR in neither density nor precision. Fleet and Jepson's algorithm is very sensitive to the choice of the Gabor filter parameters [Cozzi et al., 1997], achieving very good precision (FJ20-1) or very bad precision (FJ10-0.5) with nearly constant density around 40%. The ZCC produced the worst result in density (10%) with sufficient precision (1.1 pixels). A good tradeoff is reached by SSD, which gives the best precision (0.53 pixels) with a reasonable density (52%). The group of T. Kanade at CMU [Kanade et al., 1995] succeeded in producing a real-time implementation of the SSD, but this implementation is computationally rather expensive.
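
The two quantities can be computed as in the following sketch. It assumes that rejected pixels are marked with `None`, that density is expressed as a fraction of all pixels, and that the mean error is the mean absolute deviation from the ground-truth disparity; the original text does not specify these details, and the sample values are invented for illustration.

```python
def density_and_error(estimates, ground_truth):
    """Return (density, mean absolute error) of a disparity map.

    estimates: disparity values, with None where no value was accepted.
    ground_truth: true disparities at the same pixels.
    """
    valid = [(e, g) for e, g in zip(estimates, ground_truth) if e is not None]
    density = len(valid) / len(ground_truth)
    if not valid:
        return density, float("nan")
    mean_error = sum(abs(e - g) for e, g in valid) / len(valid)
    return density, mean_error

# Invented four-pixel example: one pixel rejected, three measured.
density, mean_error = density_and_error([1.0, None, 2.5, 3.0], [1.0, 2.0, 2.0, 4.0])
```

Because the error is averaged only over accepted pixels, a stricter acceptance test lowers density but can improve precision, which is the trade-off described above.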


  
Figure 6: Disparity map obtained by the temporal resonance algorithm from a stereo image pair, only the left image of which is shown. The disparity is coded with a gray scale (bottom); parameters were: $f_0=0.1443$ pixel$^{-1}$, $Q=1.0$.
\begin{figure}\begin{center}
\makebox{
\psfig{file=baumf.eps,width=0.9\textwidth} }
\end{center}
\end{figure}

Fig. 6 shows an example of the performance of our algorithm tested with a real image pair. Disparity is retrieved with sufficient accuracy but the map is more blurred than that obtained from the artificial scene which contained much sharper edges. A comparison of this result with those obtained from the other algorithms (not shown) demonstrates that the performance of the temporal resonance algorithm falls well within the range of the other approaches.

Discussion

The theoretical framework presented here borrows its strength from the combination of computational principles found in the auditory and visual system of vertebrates. The convolution of the stereo images with oscillating local filters (Eq. 4) is a step quite commonly performed in the majority of technical approaches [Sanger, 1988, Fleet et al., 1991] and reflects the response of cortical simple cells, which have Gabor-like receptive fields [Marcelja, 1980, Daugman, 1980, Jones and Palmer, 1987]. It has been suggested that the evaluation of spatial phase differences from such receptive field responses could indeed be used to compute disparity in the brain [De Angelis et al., 1991]. The dense coverage of the visual field by cells with many different receptive field sizes, which is necessary to produce reliable depth maps, does not pose a problem for the massively parallel architecture of the visual cortex. Technical systems, however, soon reach their limits because of the tremendously high computational effort of such an architecture.

The 1-dimensional structure of an auditory signal, on the other hand, makes it possible to determine the phase differences of two incoming sound waves sequentially by temporal correlation of both signals, and such a process probably takes place in the auditory cortex [Konishi and Sullivan, 1986]. The interpretation of image lines as sequential signals allows for a similar ``temporal'' processing. The computational complexity of the temporal correlation involved, however, is reduced to simple multiplication and low-pass filtering as a consequence of the previously applied resonance filter. The transfer from the auditory to the visual domain worked well concerning the design of the novel algorithm. It should, however, not be forgotten that the algorithm shown here is rather unlikely to play any role in the visual system of the brain. After all, our visual cortex does not perform scan-line analysis.

On the other hand, the system is ideally suited for technical implementation in serial data acquisition systems and it performs in quasi real-time. More than that, however, the combination of spatial and temporal computational principles similar to those in the visual and auditory system generates a theoretical framework for causal stereoscopic depth processing in which the computational effort is strongly reduced. The comparison of the novel algorithm with other well-known techniques shows that intermediate results are obtained. Better results, however, require rather complex algorithms, and real-time performance is then mostly prevented. Thus, our novel approach may be a good compromise in all those situations where real-time performance is necessary and a limited accuracy is sufficient. It may, however, be possible to further improve the performance of the temporal resonance algorithm by wiring up several algorithmic modules with different parameters in parallel. Such an architecture would still operate in real-time and the results should be more accurate. In particular, the so-called stereo-correspondence problem could also be addressed by combining modules. The correspondence problem always arises if the same features (here gray levels) occur more than once on an image scan-line. In this case the match between left and right image can become ambiguous. One could use very wide filters to avoid this problem, but these filters would almost always average over different disparity changes, leading to a wrong estimate. Thus, commonly used spatial approaches implement filter cascades of different widths and combine their results in order to achieve more accuracy and to reduce or eliminate the correspondence problem. To this end, we are currently investigating the theoretical background for such a combination of temporal resonance modules.
One should, however, realize that the spatial resolution of all techniques which use filter cascades is usually reduced due to the limited spatial resolution of the filter with the lowest frequency. For this reason Henkel [Henkel, 94] has designed a more intelligent approach of combining different filters without changing their spatial frequency.

So far our algorithm remains restricted to one dimension. Thus, an additional extension of the algorithm would be to combine the results from several scan-lines. Due to the continuity of objects, similar or identical disparity changes can often be tracked over a certain vertical distance. Thus, it would make sense to combine the results from different scan-lines obtained from our algorithm in order to exploit vertical disparity continuities. Our algorithm in itself omits the second dimension, but this important source of information could be re-introduced afterward by means of regularization techniques. Real-time performance would then depend on the speed of the regularization method, but it still seems within range when using fast and simple techniques (e.g., averaging).

Bibliography

Cozzi et al., 1997
Cozzi, A., Crespi, B., Valentinotti, F., and Woergoetter, F. (1997).
Performance of phase-based algorithms for disparity estimation.
Machine Vision and Applications, 9(5-6):334-340.

Daugman, 1980
Daugman, J. (1980).
Two-dimensional spectral analysis of cortical receptive field profiles.
Vision Res., 20:847-856.

De Angelis et al., 1991
De Angelis, G., Ohzawa, I., and Freeman, R. (1991).
Depth is encoded in the visual cortex by a specialized receptive field structure.
Nature, 352:156-159.

Fleet et al., 1991
Fleet, D., Jepson, A., and Jenkin, M. (1991).
Phase-based disparity measurement.
Computer Vision, Graphics, and Image Processing, 53(2):198-210.

Haralick and Shapiro, 1992
Haralick, R. M. and Shapiro, L. G. (1992).
Computer and Robot Vision, volume 1.
Addison Wesley.

Henkel, 94
Henkel, R. D. (94).
Hierarchical calculation of 3D-structure.
Technical report, Zentrum für Kognitionswissenschaften, Universität Bremen.
http://axon.physik.uni-bremen.de/ rdh/research/.

Jones and Palmer, 1987
Jones, J. and Palmer, L. (1987).
An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex.
J. Neurophysiol., 58:1233-1258.

Kanade et al., 1995
Kanade, T., Kano, H., and Kimura, S. (1995).
Development of a video-rate stereo machine.
In Proc. of International Robotics and System Conference (IROS-95), Pittsburgh (PA).

Konishi and Sullivan, 1986
Konishi, M. and Sullivan, W. (1986).
Neural map of interaural phase difference in the owl's brainstem.
Proc Natl Acad Sci USA, 83:8400-8404.

Marcelja, 1980
Marcelja, S. (1980).
Mathematical description of the responses of simple cortical cells.
J. Opt. Soc. Amer., 70:1297-1300.

Sanger, 1988
Sanger, T. D. (1988).
Stereo disparity computation using gabor filters.
Biol. Cybern., 59:405-418.

Wagner and Frost, 1993
Wagner, H. and Frost, B. (1993).
Disparity-sensitive cells in the owl have a characteristic disparity.
Nature, 364:796-798.

Acknowledgements

The authors are grateful to R. Opara for critical comments on the manuscript. The ``corridor'' test image is courtesy of the Computer Vision Group of Prof. D. Fellner, Computer Science Dept., Bonn University, Germany. F.W. acknowledges the support of the Deutsche Forschungsgemeinschaft and from the European Community ESPRIT 3, BRA 8305. A patent is pending for this system. In addition, preliminary results using multiple, cascaded filter modules can be inspected here.
