Two eyes see more than one. By combining two flat images taken from slightly different viewpoints, we can reconstruct the 3D geometry of the part of the scene that both cameras can see.
1. The Geometry of Two Eyes
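Stereo Vision is based on Epipolar Geometry. When two cameras (Left and Right) look at the same scene, a point in the real world appears at different pixel coordinates in each image. The line connecting the two camera centers is the Baseline. If we know the focal length and the length of the baseline, similar triangles give the distance (z) to that point: the depth equals the focal length times the baseline, divided by the horizontal shift of the point between the two images. This is effectively 'Triangulation' using light. The result is an estimate, not an exact value: it is only as good as the calibration and the pixel match, and its error grows roughly with the square of the distance.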
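The triangulation step can be sketched in a few lines of Python. This is a minimal illustration, assuming an already aligned (rectified) camera pair, a focal length expressed in pixels, and a horizontal pixel shift that has already been measured; the function name and the rig values (700 px focal length, 12 cm baseline) are made up for the example.

```python
def depth_from_shift(focal_px: float, baseline_m: float, shift_px: float) -> float:
    """Depth z = f * B / d for an aligned stereo pair (similar triangles)."""
    if shift_px <= 0:
        # A point in front of both cameras always has a positive shift.
        raise ValueError("pixel shift must be positive")
    return focal_px * baseline_m / shift_px


# Hypothetical rig: 700 px focal length, 0.12 m baseline.
# A point seen at x = 100 in the left image and x = 90 in the right image
# has shifted by 10 px.
z = depth_from_shift(focal_px=700.0, baseline_m=0.12, shift_px=100 - 90)
print(f"{z:.2f} m")  # 8.40 m
```

Halving the shift to 5 px doubles the estimated depth to 16.8 m, which is why a one-pixel matching error matters far more for distant points than for near ones.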
2. The Search for Matches
The hardest part of stereo vision is the Correspondence Problem: how do we know that pixel (100, 200) in the left image shows the same physical point as pixel (90, 200) in the right image? We use matching methods that look for similar patterns of light and texture. SSD (Sum of Squared Differences) is a simple cost that compares small windows of pixels; SGM (Semi-Global Matching) is a full algorithm that combines per-pixel matching costs with a smoothness penalty accumulated along several directions through the image. The difference in horizontal position between the two matched pixels is called Disparity (10 pixels in the example above). Large disparity = Close object; Small disparity = Far object.
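As a rough sketch of the SSD idea, the following Python snippet searches along one row of the right image for the window that best matches a window in the left image. It assumes NumPy, grayscale images that are already row-aligned, and a synthetic test pair in which every point is shifted by 5 pixels; the function name, window size, and search range are illustrative choices, not a reference implementation.

```python
import numpy as np


def ssd_disparity(left, right, row, col, half=2, max_disp=16):
    """Return the disparity with the lowest SSD cost for one left-image pixel."""
    patch_l = left[row - half:row + half + 1,
                   col - half:col + half + 1].astype(np.float64)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        c = col - d  # the match sits d pixels further left in the right image
        if c - half < 0:
            break  # window would fall off the image
        patch_r = right[row - half:row + half + 1,
                        c - half:c + half + 1].astype(np.float64)
        cost = np.sum((patch_l - patch_r) ** 2)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic pair: random texture, with the left view shifted 5 px to the right.
rng = np.random.default_rng(0)
right = rng.integers(0, 256, size=(20, 60))
left = np.roll(right, 5, axis=1)

print(ssd_disparity(left, right, row=10, col=30))  # 5
```

A real matcher repeats this for every pixel and usually adds smoothing or consistency checks, since the lowest-cost window is not always the correct one.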
3. Calibration and Constraints
For the math to work, we need to know precisely how the two cameras are arranged; they do not have to be physically perfect. We use Camera Calibration (often with a checkerboard pattern) to find the 'Intrinsics' of each camera (focal length, principal point, lens distortion) and the 'Extrinsics' (the rotation and translation between the two cameras). We then Rectify the images, mathematically warping them so that matching points always lie on the same horizontal row, which reduces the search for a match to a single line. Stereo vision's biggest weaknesses are Textureless Surfaces (like a plain white wall), where there are no patterns to match, and Repetitive Patterns, which can confuse the algorithm about which 'Brick' it is looking at.
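Both failure cases show up directly in the matching cost. The sketch below, assuming NumPy and a simple window-based SSD cost along a single aligned row, computes the cost of every candidate disparity for a uniform 'wall' and for a 'brick' pattern that repeats every 4 pixels. The signals and the function name are invented for illustration.

```python
import numpy as np


def ssd_costs(left_row, right_row, col, half=2, max_disp=12):
    """SSD cost of each candidate disparity for one pixel on an aligned row."""
    patch_l = left_row[col - half:col + half + 1].astype(np.float64)
    costs = []
    for d in range(max_disp + 1):
        c = col - d
        patch_r = right_row[c - half:c + half + 1].astype(np.float64)
        costs.append(float(np.sum((patch_l - patch_r) ** 2)))
    return np.array(costs)


# Plain wall: every pixel has the same intensity in both views.
wall = np.full(40, 200)
wall_costs = ssd_costs(wall, wall, col=20)

# Brick-like pattern with a period of 4 px; the true shift is 2 px.
bricks_right = np.tile([0, 0, 255, 255], 10)
bricks_left = np.roll(bricks_right, 2)
brick_costs = ssd_costs(bricks_left, bricks_right, col=20)

print(np.count_nonzero(wall_costs == 0))        # 13
print(np.flatnonzero(brick_costs == 0).tolist())  # [2, 6, 10]
```

On the wall, all 13 candidate disparities match perfectly, so the cost gives no information at all. On the bricks, disparities 2, 6, and 10 are equally good, and nothing in the local window says which one is the true shift.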
