An image is an orderly grid of data points. To manipulate images, you must first understand the matrix and its coordinates.
1. The CV Coordinate System
Welcome to the absolute foundation of Computer Vision. Before we can build intelligent algorithms, we must understand exactly how a computer sees an image. To a machine, there are no shapes, no colors, no faces—there is only a vast mathematical grid.
Unlike traditional Cartesian geometry, where the origin (0,0) sits in the bottom-left corner and Y increases upward, Computer Vision places the origin in the TOP-LEFT corner, and the Y-axis points DOWN. Why? Early CRT monitors drew each frame line by line starting from the top-left, and image data is conventionally stored in memory in that same order, top row first.
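To make the flipped Y-axis concrete, here is a minimal sketch of converting a point between the two conventions. The function names `cartesian_to_image` and `image_to_cartesian` are hypothetical helpers invented for this illustration, and the sketch assumes integer pixel indices, so the bottom row of an image with `height` rows has image Y equal to `height - 1`.

```python
def cartesian_to_image(x, y, height):
    """Convert a bottom-left-origin (math) point to top-left-origin (image) coordinates."""
    # X is unchanged; Y is mirrored so that the bottom row becomes row height - 1.
    return x, (height - 1) - y


def image_to_cartesian(x, y, height):
    """Inverse mapping: applying the same flip again restores the original point."""
    return x, (height - 1) - y


height = 1080
bottom_left = cartesian_to_image(0, 0, height)          # the math origin
top_left = cartesian_to_image(0, height - 1, height)    # top row in math coordinates
print(bottom_left)  # (0, 1079)
print(top_left)     # (0, 0)
```

Only Y changes, which is why plots drawn with a math library can look upside down when pasted onto an image without this conversion.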
# Standard Math vs Computer Vision
# Math: Origin = Bottom-Left, Y goes UP
# CV: Origin = Top-Left, Y goes DOWN
2. Spatial Mapping (X, Y)
This means when we talk about coordinates, (x,y) represents a physical pixel location. 'X' is the column (Width) moving right, and 'Y' is the row (Height) moving down.
A coordinate like (100, 50) means moving 100 pixels to the right, and then moving 50 pixels down from the top edge. It is crucial to internalize this spatial mapping before attempting to slice or crop images in code.
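As a rough illustration of this mapping, the sketch below draws a tiny text grid and marks a single point. The helper `render_point` is a hypothetical name made up for this example, and the grid is kept small (8 columns by 4 rows) so the result is easy to read.

```python
def render_point(x, y, width, height):
    """Return a text grid with 'X' at column x, row y and '.' everywhere else."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("point lies outside the grid")
    lines = []
    for row in range(height):  # rows run from the top edge downward (+Y)
        line = "".join(
            "X" if (col == x and row == y) else "."
            for col in range(width)  # columns run from the left edge rightward (+X)
        )
        lines.append(line)
    return "\n".join(lines)


grid = render_point(x=5, y=2, width=8, height=4)
print(grid)
# ........
# ........
# .....X..
# ........
```

The mark sits 5 columns in from the left and 2 rows down from the top, exactly as (x, y) = (5, 2) describes.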
# Visualizing Coordinates
# (0,0) --------> +X (Columns / Width)
# |
# |
# v +Y (Rows / Height)
3. The Indexing Trap (Row-First)
However, there is a massive trap here. When we actually code this in Python using NumPy, matrices are indexed 'Row-First'. This means to access a pixel, the syntax is image[row, column].
Since rows define height, the syntax translates to image[Y, X]. It feels backward, and it is one of the most common sources of beginner errors. This 'Row-First' logic also applies to the .shape attribute of an image matrix. If you ask Python for the shape of a color image, it returns (Height, Width, Channels). So a 1080p color image (1920x1080) will return (1080, 1920, 3), while its grayscale version returns (1080, 1920). Always remember: matrix indexing puts the vertical rows before the horizontal columns.
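The sketch below shows the trap on a blank 1080p frame. The array is synthetic (all zeros) and the coordinates are arbitrary values picked for illustration; the point is only that the row index comes first in indexing, in `.shape`, and in cropping.

```python
import numpy as np

# A blank 1080p color frame: shape is (height, width, channels).
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
height, width, channels = frame.shape
print(height, width, channels)  # 1080 1920 3

# Pixel at x=1500, y=200: row index (y) first, then column index (x).
x, y = 1500, 200
frame[y, x] = 255  # sets all three channel values of that pixel

# Swapping the order asks for row 1500, which does not exist in a 1080-row image.
try:
    frame[x, y]
except IndexError as err:
    print("swapped indices failed:", err)

# Cropping follows the same order: the row (y) range first, then the column (x) range.
crop = frame[100:300, 1400:1700]
print(crop.shape)  # (200, 300, 3)
```

Note that a swapped index only raises an error when it falls outside the array. On a square image, or near the top-left corner, the wrong order silently reads the wrong pixel instead.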
import numpy as np
image = np.zeros((10, 20)) # H=10, W=20
# WARNING: Accessing pixel at x=5, y=2
# Syntax is image[row, col] -> image[y, x]
pixel = image[2, 5]
4. Resolution & Bit Depth
Now let's talk about the actual values inside these matrix cells. A pixel is just a number representing brightness. In standard computer vision, we use an 8-bit format called uint8 (unsigned 8-bit integer).
This gives us 2^8, or 256 possible values. Therefore, pixel brightness ranges from 0 (pure black) to 255 (pure white). Resolution describes how many pixels the grid contains, usually written as width x height. A 1920x1080 image contains 1920 × 1080 = 2,073,600 pixels, just over 2 million.
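These numbers can be checked directly in NumPy. The following is a small sketch on a synthetic all-black grayscale image; `np.iinfo` reports the limits of an integer type, and `.size` counts the elements in the array.

```python
import numpy as np

# Limits of the unsigned 8-bit integer type.
info = np.iinfo(np.uint8)
print(info.min, info.max)  # 0 255
print(2 ** info.bits)      # 256

# A blank grayscale 1080p image: one uint8 value per pixel.
gray = np.zeros((1080, 1920), dtype=np.uint8)
print(gray.size)    # 2073600 pixels
print(gray.nbytes)  # 2073600 bytes, since each pixel occupies one byte
```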
# Bit Depth and Pixel Values
# Data Type: uint8 (0 to 255)
black_pixel = 0
white_pixel = 255
mid_gray = 127
5. The 3D Color Tensor
What about color? A grayscale image is just a 2D matrix (Height x Width). But an RGB color image is a 3D matrix (Height x Width x 3 Channels).
It is literally three separate 2D matrices (one for Red, one for Green, one for Blue) stacked on top of each other. For a 1080p color image, that means 1920 × 1080 × 3 = 6,220,800 individual integer values for a neural network to process. This data volume is a major reason Computer Vision workloads rely on powerful GPUs.
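# Color Depth
# Grayscale: Shape = (H, W)
# Color (RGB): Shape = (H, W, 3)
import numpy as np
image = np.zeros((20, 20, 3), dtype=np.uint8) # blank color image, H=20, W=20
# Accessing the Red value at y=10, x=5
red_value = image[10, 5, 0] # Assuming RGB order; OpenCV's cv2.imread returns BGR, where index 0 is Blue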
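To see the "three stacked matrices" idea in code, the sketch below builds three constant-valued channel planes and stacks them. The fill values (200, 100, 50) are arbitrary choices for illustration, and the RGB channel order is an assumption, as in the snippet above.

```python
import numpy as np

height, width = 1080, 1920

# Three separate 2D channel planes, filled with constant values for illustration.
red = np.full((height, width), 200, dtype=np.uint8)
green = np.full((height, width), 100, dtype=np.uint8)
blue = np.full((height, width), 50, dtype=np.uint8)

# Stack along a new last axis to get (H, W, 3), assuming RGB channel order.
rgb = np.stack([red, green, blue], axis=-1)
print(rgb.shape)  # (1080, 1920, 3)
print(rgb.size)   # 6220800

# The pixel at y=10, x=5 holds one value per channel.
print(rgb[10, 5].tolist())  # [200, 100, 50]

# Slicing the last axis recovers a single 2D channel plane.
red_plane = rgb[:, :, 0]
print(red_plane.shape)  # (1080, 1920)
```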