I(x,y) = R(x,y) · L(x,y)
I(x,y)
Measured intensity — what the camera actually records
R(x,y) — Reflectance
Intrinsic surface property. Ideally invariant to lighting. Encodes color + material.
L(x,y) — Illumination
How light arrives at the surface. Varies over time & scene.
Key ambiguity: The same pixel value can come from a bright surface in shadow OR a dark surface under strong light. A computer can't tell them apart without extra info.
Linear Mapping to 8-bit Grayscale
Assume a linear mapping model where:
pixel_value = (I / 1000) · 255, where I = R · L and an intensity of 1000 maps to full scale (255)
Let's compare two different scene points that produce the exact same pixel intensity:
  • Pixel A (Bright surface in shadow):
    Reflectance R_A = 0.80 (80% reflecting), Illumination L_A = 250 units.
    Intensity: I_A = 0.80 · 250 = 200.
    Pixel Value = (200 / 1000) · 255 = 51.0
  • Pixel B (Dark surface under strong light):
    Reflectance R_B = 0.20 (20% reflecting), Illumination L_B = 1000 units.
    Intensity: I_B = 0.20 · 1000 = 200.
    Pixel Value = (200 / 1000) · 255 = 51.0
Both Pixel A and Pixel B map to the same digital value (51.0), so the camera records them identically. The pixel value alone cannot separate reflectance from illumination.
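The comparison above can be reproduced with a few lines of Python. This is a minimal sketch of the linear model only; the function name `pixel_value` and its parameter names are illustrative choices, not part of the lecture.

```python
def pixel_value(reflectance, illumination, max_intensity=1000.0, full_scale=255):
    """Linear mapping from I = R * L to an 8-bit scale, as in the model above."""
    intensity = reflectance * illumination
    return intensity / max_intensity * full_scale

pixel_a = pixel_value(0.80, 250)    # bright surface in shadow
pixel_b = pixel_value(0.20, 1000)   # dark surface under strong light

# Both print as (approximately) 51: the product R * L is all the camera sees.
print(pixel_a, pixel_b)
```

Any pair (R, L) with the same product gives the same pixel value, which is why extra information is needed to separate the two factors.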

Geometry

Controls: Structure & shape
Q: Where does a 3D point land?
Determined by: Camera pose, focal length, FOV, projection model

Photometry

Controls: Appearance (brightness/color)
Q: How bright/what color is that point?
Determined by: Illumination, reflectance, sensor response
Suppress L(x,y) — illumination
Normalize away shadows, shading, time-varying light → illumination normalization, gradient-based features, ratio representations
Preserve R(x,y) — reflectance
Capture material/texture/color for reliable recognition, matching, segmentation → reflectance-based features
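One way to see why ratio representations suppress illumination: if L is the same for neighboring pixels, it cancels in the ratio of their intensities, leaving only the ratio of reflectances. The sketch below assumes illumination is constant across the row, which real scenes only approximate; `neighbor_ratios` is a hypothetical helper name.

```python
def neighbor_ratios(row):
    """Ratio of each pixel to its left neighbor; a shared factor L cancels out."""
    return [row[i + 1] / row[i] for i in range(len(row) - 1)]

reflectance = [0.2, 0.4, 0.8, 0.4]          # R(x) along one image row

dim_row = [r * 250 for r in reflectance]     # I = R * L with L = 250
bright_row = [r * 1000 for r in reflectance]  # same surface, L = 1000

ratios_dim = neighbor_ratios(dim_row)
ratios_bright = neighbor_ratios(bright_row)

# The raw intensities differ by 4x, but the ratios agree (up to rounding).
print(ratios_dim)
print(ratios_bright)
```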
1. Scene Interaction & Photometry
Real World Scene (geometry + materials) → Illumination (light sources, shadows) → Light-Surface Interaction (reflectance/material) → Photometry / Appearance (brightness, color, shading) modeled as I(x,y) = R(x,y) · L(x,y).
2. Imaging Geometry
Real World Scene (geometry + materials) → Imaging Geometry (pinhole, projection, field of view (FOV), depth of field (DOF)). Determines where 3D points project on the 2D plane.
3. Camera System Integration
Imaging Geometry + Photometry / Appearance → Camera System (optics + sensor + ISP). Combines physical projection rays with measured light intensities.
4. Optical & Digital Processing
Camera System → Optical Imperfections (lens distortion) + Digital Representation (sampling, quantization, coordinate grids) + Color Representation (RGB and other color spaces).
Final Output
Optical Imperfections + Digital Representation + Color Representation → Final Digital Image (grid of pixels as digital numbers).
Mental model: We do not "see objects" — we measure light.
Phase 1: Optics
Optics & Optical Controls
Light enters through the physical camera lens and is restricted by the aperture.
Components: Lens Elements, Aperture, Shutter, Focus Mechanism.
Phase 2: Sensors
Sensors & Electronic Components
Photons strike the Color Filter Array (CFA) sensor and generate analog voltages, which are conditioned by the Analog Front-End and digitized.
Intermediate Output: RAW Image (Mosaiced, Linear, 12-bit).
Phase 3: Digital ISP
In-Camera Image Processing Pipeline
The RAW image goes through digital processing steps:
1. Denoise → 2. Demosaic → 3. White Balance → 4. Color Transform → 5. Tone Reproduction → 6. Compression

Final Output: RGB Image (Full Color, Non-linear/Gamma-encoded, 8-bit).
What is a Mosaiced RAW Image?

A digital camera sensor does not measure full RGB color at each pixel. Instead, it is covered by a Color Filter Array (CFA). Each individual pixel measures only one color component: Red (R), Green (G), or Blue (B).

The spatial arrangement of these filters forms a mosaic pattern, commonly called the Bayer Pattern (a repeating 2×2 grid containing 2 Green filters, 1 Red filter, and 1 Blue filter per block). More green is used because the human eye is most sensitive to green luminance/detail.

G R
B G

Because each pixel contains only a single intensity value, the raw sensor output looks black-and-white. The demosaicing step mathematically interpolates neighboring values to reconstruct the missing color channels and create a full-color RGB image.
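To make the mosaic concrete, the sketch below simulates what the sensor keeps from a full-color scene, using the G R / B G layout shown above. It only models the sampling step (not demosaicing), and the tiny 2×2 test image is made up for illustration.

```python
# 2x2 Bayer tile in the layout shown above (row 0: G R, row 1: B G).
BAYER_TILE = [["G", "R"],
              ["B", "G"]]
CHANNEL_INDEX = {"R": 0, "G": 1, "B": 2}

def mosaic(rgb_image):
    """Keep a single channel per pixel, chosen by the repeating 2x2 tile."""
    raw = []
    for y, row in enumerate(rgb_image):
        raw_row = []
        for x, pixel in enumerate(row):
            color = BAYER_TILE[y % 2][x % 2]
            raw_row.append(pixel[CHANNEL_INDEX[color]])
        raw.append(raw_row)
    return raw

# Hypothetical full-color scene: each pixel is (R, G, B).
scene = [[(10, 20, 30), (40, 50, 60)],
         [(70, 80, 90), (15, 25, 35)]]

raw = mosaic(scene)
print(raw)  # one number per pixel; the other two channels are discarded
```

Demosaicing has to estimate the two discarded values at every pixel from neighboring samples of the right color.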

RAW image (after sensor)

Color: Mosaiced (1 channel/pixel)
Linearity: Linear w.r.t. scene radiance
Bit depth: 12-bit
Viewable? No — requires ISP processing

Final RGB image

Color: Full RGB at every pixel
Linearity: Non-linear (gamma encoded)
Bit depth: 8-bit per channel
Viewable? Yes — standard display format
1. Denoising
Removes photon noise + electronic noise from RAW. Done FIRST so noise isn't spread into all channels by demosaicing. Over-denoising destroys texture.
2. CFA Demosaicing
Each pixel only has R, G, or B (Bayer pattern: RGGB — 2 green, 1 red, 1 blue per 2×2 block). Demosaicing estimates the 2 missing channels per pixel from neighbors → full RGB.
3. White Balance
Different light sources (sun, incandescent, fluorescent) cast different color temperatures → color cast. WB adjusts per-channel gains so neutral objects appear neutral.
4. Color Transform
Sensor RGB ≠ display RGB. A matrix (typically 3×3) maps from the sensor's color space to a standard display color space for consistency across devices.
5. Tone Reproduction
Real scenes have much wider dynamic range than displays. Tone mapping compresses range while preserving contrast. Defines the image's "look".
6. Compression
Encodes the final image (e.g. JPEG, PNG) for storage. Trades file size against visual quality. Lossy compression (JPEG) introduces artifacts; PNG is lossless.
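Two of the stages above, white balance and tone reproduction, can be sketched in a few lines. This is a simplified illustration, not a real ISP: white balance is computed from a single known-neutral patch with green as the reference channel, and tone reproduction is reduced to a plain power-law gamma of 2.2 applied to 12-bit linear values. Both choices, and all names and sample values, are assumptions.

```python
def white_balance(pixel, neutral_rgb):
    """Apply per-channel gains that make the measured neutral patch gray.

    Green is used as the reference channel (gain 1.0)."""
    r_n, g_n, b_n = neutral_rgb
    gains = (g_n / r_n, 1.0, g_n / b_n)
    return tuple(c * k for c, k in zip(pixel, gains))

def gamma_encode_8bit(linear_value, raw_max=4095, gamma=2.2):
    """Map a linear 12-bit value to a gamma-encoded 8-bit code."""
    normalized = min(max(linear_value / raw_max, 0.0), 1.0)
    return round(255 * normalized ** (1.0 / gamma))

# Hypothetical gray card under warm light: too much red, too little blue.
neutral_patch = (2000.0, 1600.0, 1000.0)

balanced = white_balance(neutral_patch, neutral_patch)
encoded = tuple(gamma_encode_8bit(c) for c in balanced)

print(balanced)  # the three channels are now (nearly) equal
print(encoded)   # equal 8-bit codes, so the patch renders as neutral gray
```

The gamma step lifts dark values: a linear value at about 10% of full scale is encoded well above 10% of 255, which is why linear RAW data looks dark when shown without this step.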
Critical exam point: Many CV failures originate in early ISP stages (noise, white balance, demosaicing) — NOT in the vision algorithm itself.
Why RAW is not directly viewable: only 1 color per pixel (CFA); linear, so it looks dark and flat on a display; 12-bit, while displays are 8-bit.

Sensors: What They Measure

An image sensor does not measure an ideal mathematical point in the scene. Each pixel records the integrated light energy arriving at that pixel.
Integration occurs over:
Spatial area: The pixel footprint on the image plane.
Time interval: Exposure time (can cause motion blur).
Spectral range: Defined by the color filter placed above the pixel.
Thus, a pixel represents an average measurement, not an exact sample.
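The averaging effect of spatial integration can be illustrated in one dimension. In this sketch a "scene" made of stripes finer than one pixel is integrated over each pixel footprint; the box-shaped footprint and the helper name `integrate_pixels` are simplifying assumptions.

```python
def integrate_pixels(fine_signal, samples_per_pixel):
    """Average the fine-scale signal over each pixel footprint (box integration)."""
    pixels = []
    last_start = len(fine_signal) - samples_per_pixel
    for start in range(0, last_start + 1, samples_per_pixel):
        footprint = fine_signal[start:start + samples_per_pixel]
        pixels.append(sum(footprint) / samples_per_pixel)
    return pixels

# Black/white stripes that alternate faster than the pixel pitch.
stripes = [0.0, 1.0] * 8

pixels = integrate_pixels(stripes, samples_per_pixel=4)
print(pixels)  # every pixel reads mid-gray; the stripes are no longer visible
```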

"All Points Contribute" Problem

In the absence of optical constraints (no lens or pinhole), light rays from every scene point scatter and reach every sensor pixel.
Consequences of spatial mixing:
• Each pixel contains information from multiple scene locations.
• Fine spatial details become completely indistinguishable.
• The resulting output is a completely blurred, featureless gradient.
The role of optics (lenses/pinholes) is to enforce a one-to-one mapping between visible scene points and image points.
Core principle
An infinitesimal aperture admits a single ray path per scene point → each visible scene point maps to exactly one image point. Result: inverted image.

Larger pinhole

More light → brighter image
Multiple rays per point → blur

Smaller pinhole

Less light → dimmer image
Sharper until diffraction dominates
Optimal size exists — there's a sweet spot before diffraction takes over.
u = f · (X / Z),   v = f · (Y / Z)
(u, v)
Image point on the 2D plane
f
Focal length: distance from the center of projection (pinhole) to the image plane
(X, Y, Z)
3D scene point in camera coords. Z = depth.
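The projection equations translate directly into code. The sketch below assumes an ideal pinhole with the principal point at the image origin and no distortion; the function name `project` is illustrative.

```python
def project(point, f):
    """Ideal pinhole projection: (X, Y, Z) in camera coordinates -> (u, v)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)

# Example: P = (100, 50, 500) with f = 100.
u, v = project((100.0, 50.0, 500.0), f=100.0)
print(u, v)  # 20.0 10.0
```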
Consequences of the 1/Z Term

The perspective projection is non-linear due to the division by depth Z. Any two points on the same projection ray, (X, Y, Z) and (λX, λY, λZ) with λ > 0, project to the same image location (u, v), so depth information is lost.

  • Foreshortening: Objects appear smaller as depth Z increases (larger Z moves coordinates closer to the principal point).
  • View-dependent appearance: The same object projects to different shapes depending on camera viewpoint.
  • "Visual Illusions": Many visual illusions are natural geometric consequences of perspective projection rather than errors in human perception.
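Both consequences of the 1/Z term can be checked numerically. The sketch below uses the same ideal pinhole equations; the sample points and the focal length of 50 are arbitrary illustrative values.

```python
def project(X, Y, Z, f):
    """Ideal pinhole projection u = f*X/Z, v = f*Y/Z."""
    return (f * X / Z, f * Y / Z)

f = 50.0

# Depth ambiguity: (X, Y, Z) and (3X, 3Y, 3Z) lie on the same projection ray.
near = project(1.0, 2.0, 4.0, f)
far = project(3.0, 6.0, 12.0, f)
print(near, far)  # identical image coordinates

# Size vs. depth: same X and Y, twice the depth -> half the image offset.
pushed_back = project(1.0, 2.0, 8.0, f)
print(pushed_back)
```

A single image therefore cannot tell a small nearby object from a large distant one without additional cues.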
m = y' / y = f / Z
Magnification decreases as depth Z increases. Derived from similar triangles. Larger f → more magnification at same distance.

Preserved

Straight lines → still straight
Incidence relations (points on lines)

NOT preserved

Distances
Angles
Parallelism (lines can converge)
Implication: Metric measurements (real distances, angles) require camera calibration and explicit geometric modeling.
Why they exist
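Parallel lines in 3D never intersect in Euclidean space. Under perspective projection, their images converge to a vanishing point: the image of the lines' shared point at infinity in projective geometry. (Lines parallel to the image plane are the exception and stay parallel in the image.) Railroad tracks are the classic example.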
What they encode
Camera orientation (rotation), dominant scene directions. Used for: camera calibration, horizon estimation, scene understanding (roads, corridors, architecture).
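The railroad-track example can be sketched numerically. For a 3D line P(t) = P0 + t·d with direction d = (dx, dy, dz) and dz ≠ 0, the pinhole projection tends to (f·dx/dz, f·dy/dz) as t grows, independent of P0. The code below assumes the ideal pinhole model used earlier; the rail positions and direction are made-up values.

```python
def project(point, f):
    """Ideal pinhole projection of a camera-frame point."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)

def point_on_line(origin, direction, t):
    """Point P0 + t * d on a 3D line."""
    return tuple(o + t * d for o, d in zip(origin, direction))

def vanishing_point(direction, f):
    """Limit of the projection along direction d; depends only on d, not on P0."""
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("lines parallel to the image plane have no finite vanishing point")
    return (f * dx / dz, f * dy / dz)

f = 100.0
direction = (0.0, -0.1, 1.0)     # both rails share this 3D direction
left_rail = (-1.0, 2.0, 5.0)     # hypothetical starting points
right_rail = (1.0, 2.0, 5.0)

vp = vanishing_point(direction, f)
far_left = project(point_on_line(left_rail, direction, 1e6), f)
far_right = project(point_on_line(right_rail, direction, 1e6), f)

print(vp)         # where both rails appear to meet
print(far_left)   # very close to vp
print(far_right)  # very close to vp
```

Because the vanishing point depends only on the line direction expressed in camera coordinates, it carries information about camera orientation.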

Larger f (telephoto)

Narrower FOV
More magnification (zoom-in)
Smaller scene portion captured

Smaller f (wide-angle)

Wider FOV
Less magnification
Larger scene portion visible
Important: Changing f affects magnification, NOT scene depth.
FOV = 2 · tan⁻¹(w / (2f))

w = sensor dimension, f = focal length. FOV depends on BOTH sensor size and focal length.

  • Focal Length: Increasing focal length f reduces FOV and produces a zoomed-in view.
  • Sensor Size: Increasing sensor size w (with fixed f) increases FOV and captures a larger portion of the scene.
  • Measurement: Defined in the camera coordinate system and measured with respect to the optical axis. It can be horizontal, vertical, or diagonal.
  • Conceptual Note: FOV describes how much of the world is visible, not how large objects appear. Object size in the image depends on both focal length and distance, whereas FOV itself does not depend on scene depth.
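The FOV formula is easy to evaluate directly. In this sketch `sensor_size` and `focal_length` must be in the same units (both in pixels or both in millimetres); the 36 mm / 50 mm pair is an illustrative example, not taken from the lecture.

```python
import math

def field_of_view_deg(sensor_size, focal_length):
    """FOV = 2 * atan(w / (2f)), returned in degrees."""
    return math.degrees(2.0 * math.atan(sensor_size / (2.0 * focal_length)))

wide = field_of_view_deg(1000, 100)   # w = 1000, f = 100 (same units)
normal = field_of_view_deg(36, 50)    # 36 mm sensor width, 50 mm lens
longer = field_of_view_deg(36, 100)   # doubling f narrows the FOV

print(round(wide, 2), round(normal, 2), round(longer, 2))
```

Note that doubling f does not exactly halve the FOV, because of the arctangent; the relationship is only approximately inverse for narrow fields of view.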
Definition
Range of scene distances that appear acceptably sharp. A consequence of finite aperture size. Points OUTSIDE DOF project to blurred circles (circle of confusion), not single points.

Large aperture (small f-number, e.g. f/2.8)

More light → brighter
Shallow DOF → strong background blur (bokeh)
Portrait / subject isolation

Small aperture (large f-number, e.g. f/16)

Less light → darker
Deep DOF → everything sharp
Landscape / architecture
DOF is optical, not projection geometry. It's about blur from aperture, not about where things project.
  • More light collection
  • Focus control
  • Controllable FOV and magnification
A lens approximates pinhole projection with much better light efficiency. But lenses introduce distortion.

Barrel distortion

Magnification decreases from center
Straight lines bow OUTWARD
Wide-angle / action cameras

Pincushion distortion

Magnification increases from center
Straight lines bow INWARD
Telephoto / zoom lenses
r_d = r · (1 + k₁·r² + k₂·r⁴ + k₃·r⁶ + ...)
r = distance from image center. k₁, k₂, k₃ = radial distortion coefficients. Distortion is systematic → correctable through calibration. Points farther from center → larger displacement.
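The radial model can be sketched as follows. The code assumes normalized image coordinates (so r is around 1 near the image edge) and uses the forward convention written above, where a negative k₁ shrinks radii (barrel) and a positive k₁ stretches them (pincushion); the coefficient values are made up.

```python
def distort_radius(r, k1, k2=0.0, k3=0.0):
    """r_d = r * (1 + k1*r^2 + k2*r^4 + k3*r^6)."""
    r2 = r * r
    return r * (1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3)

def distort_point(x, y, k1, k2=0.0, k3=0.0):
    """Move a point along its radial direction by the same scale factor."""
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    return (x * scale, y * scale)

barrel_edge = distort_radius(1.0, k1=-0.1)      # pulled toward the center
pincushion_edge = distort_radius(1.0, k1=0.1)   # pushed away from the center
barrel_mid = distort_radius(0.5, k1=-0.1)       # much smaller displacement

print(barrel_edge, pincushion_edge, barrel_mid)
print(distort_point(0.6, 0.8, k1=-0.1))
```

Calibration estimates the coefficients, after which the mapping can be inverted numerically to undistort an image.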
Directional
Strong shadows, high contrast. Like direct sunlight.
Coaxial
Light along camera axis → no shadows, uniform. Used for flat reflective surfaces.
Diffuse
Soft transitions, reduced shadows. Like overcast sky or softbox.
Backlighting
Light behind subject → silhouettes, reduced front detail.
Key: Changes in illumination can dominate pixel values even when objects and camera don't move at all.

Diffuse (matte)

Light scattered in all directions.
Appearance same from any angle.
Shows texture well.

Specular (glossy)

Light concentrated in highlights.
View-dependent appearance.
Can hide texture, create glare.
  • Transparency & Reflections: Transparent and highly reflective materials mix background colors with reflections, making their appearance depend strongly on the environment.
  • Color & Spectral Reflectance: Surface color depends on spectral reflectance, which determines which specific wavelengths of light are absorbed versus reflected.
  • Texture & Micro-geometry: Physical texture is determined by micro-geometry, which creates fine-scale shading and microscopic specular highlights.
Continuous: I : Ω ⊂ ℝ² → ℝᵏ
Digital: I : ℤ² → ℝᵏ
k = 1
Grayscale image (single intensity value per pixel)
k = 3
Color image (RGB channels)
k = N
Multispectral / hyperspectral (multiple bands)

RGB Images

Number of Bands: k = 3
Spectrum: Three discrete channels (Red, Green, Blue).
Usage: Acquisition and display systems matching human sight.

Multispectral Images

Number of Bands: k = N (typically 3 to 10)
Spectrum: Multiple separated, discrete bands.
Usage: Standard remote sensing, satellite imaging.
Hyperspectral Images

Number of Bands: k = hundreds of narrow bands.

Spectrum: A continuous, contiguous spectral range across electromagnetic wavelengths.

Usage: Chemical analysis, precise agricultural monitoring, mineral mapping.

Sampling

Selects discrete spatial locations
Each location = one pixel
More samples = finer spatial detail

Quantization

Converts continuous intensity to discrete levels
8-bit = 256 levels (0–255)
More bits = finer intensity resolution
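Sampling and quantization can be sketched together in one dimension. The code assumes intensities normalized to [0, 1] and uniform quantization; the function names are illustrative.

```python
def quantize(intensity, bits=8):
    """Map a continuous intensity in [0, 1] to one of 2**bits integer levels."""
    levels = 2 ** bits
    clipped = min(max(intensity, 0.0), 1.0)
    return min(int(clipped * levels), levels - 1)

def sample_and_quantize(signal, num_samples, bits):
    """Sample a continuous signal on [0, 1) at evenly spaced points, then quantize."""
    return [quantize(signal(i / num_samples), bits) for i in range(num_samples)]

def ramp(t):
    """A simple continuous signal: intensity rises linearly with position."""
    return t

print(quantize(0.5, bits=8))              # 128
print(quantize(1.0, bits=8))              # 255 (top level)
print(sample_and_quantize(ramp, 4, 2))    # [0, 1, 2, 3]
```

Increasing `num_samples` refines spatial detail, while increasing `bits` refines intensity resolution; the two are independent choices.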

Image coordinates

Row, column indexing → (y, x)
Origin: top-left corner
Used in pixel access / arrays

Cartesian coordinates

Position as (x, y)
Used in geometric / projection math
Must specify convention to avoid bugs
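A small example of the convention mismatch, using a nested list as the image array. The helper `pixel_at` is a hypothetical wrapper that accepts Cartesian-style (x, y) arguments and performs the (row, column) lookup internally.

```python
# A 2-row x 3-column image; image[row][col], with the origin at the top-left.
image = [[0, 1, 2],
         [10, 11, 12]]

def pixel_at(image, x, y):
    """Look up a pixel given (x, y): x selects the column, y selects the row."""
    return image[y][x]

value = pixel_at(image, x=2, y=1)
print(value)  # 12: column 2 of row 1

# Swapping the indices is the classic bug: image[2][1] would ask for row 2,
# which does not exist in a 2-row image.
```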
RGB (Red, Green, Blue)
Acquisition & Display

Purpose & Design: Directly matches camera sensors and monitor displays. Mixes brightness and color information.

Limitations: Highly sensitive to illumination changes, making pure RGB comparison difficult under shadows or varying brightness.

Applications: Image capture, standard displays, object detection, image classification, face recognition, deep learning pipelines.

HSV / HSL (Hue, Saturation, Value / Lightness)
Segmentation & Tracking

Purpose & Design: Separates color info (Hue) from brightness (Value/Lightness) and purity (Saturation) for intuitive color-based analysis.

Limitations: Not perceptually uniform (a mathematical distance between colors doesn't map to human color perception).

Applications: Color-based segmentation, tracking object markers, traffic sign detection, skin segmentation.

YCbCr (Luminance + Chrominance Blue / Red)
Video Compression

Purpose & Design: Separates brightness (Luminance Y) from color (Chrominance Cb, Cr). Exploits human eye limits by subsampling color data.

Advantages: Significantly reduces storage/bandwidth needs. Skin tone clusters very compactly in Cb-Cr space.

Applications: Video compression (MPEG, H.264), JPEG images, broadcasting, face detection, surveillance systems.

CMY / CMYK (Cyan, Magenta, Yellow, Black)
Printing Only

Purpose & Design: Subtractive color mixing used for physical inks (reflecting light instead of emitting it).

Limitations: Completely unsuited for image sensing (cameras) or digital computer vision processing algorithms.

Applications: Industrial printing, packaging inspection, color consistency verification in production.

HSI (Hue, Saturation, Intensity)
Vision Analysis

Purpose & Design: Similar to HSV but optimized for mathematical vision analysis, separating intensity from pure color properties.

Limitations: Has limited hardware standardization, requiring color space conversions in software.

Applications: Satellite image analysis, agricultural monitoring, medical imaging, industrial surface inspection.

Why switch color spaces? RGB mixes brightness and color together — that makes many tasks harder. Alternative spaces isolate the property you care about → simpler algorithms, better robustness to lighting.
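The benefit of separating brightness from color can be checked with a short sketch. It uses Python's standard `colorsys` module for HSV and, for YCbCr, the full-range BT.601 coefficients used by JPEG/JFIF (an assumption, since the lecture does not specify a variant). Dimming is modeled as a uniform scaling of RGB, which is a simplification of real illumination changes.

```python
import colorsys

def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG/JFIF-style) conversion for 0..255 inputs."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (y, cb, cr)

# The same orange surface, fully lit and at half brightness (RGB in 0..1).
lit = (0.8, 0.4, 0.2)
dimmed = (0.4, 0.2, 0.1)

h_lit, s_lit, v_lit = colorsys.rgb_to_hsv(*lit)
h_dim, s_dim, v_dim = colorsys.rgb_to_hsv(*dimmed)

print(h_lit, h_dim)   # hue is unchanged by the brightness change
print(v_lit, v_dim)   # value carries the brightness difference

# A neutral gray has no chrominance: Cb and Cr sit at the midpoint 128.
print(rgb_to_ycbcr(128, 128, 128))
```

In RGB all three numbers change when the light dims, whereas in HSV only V changes here, which is what makes hue-based thresholds more robust for segmentation.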
Demonstrates how two completely different surfaces can produce identical pixel values.
Pixel A
R_A (reflectance): 0.80
L_A (illumination): 250
Pixel B
R_B (reflectance): 0.20
L_B (illumination): 1000
Given a 3D point P=(X,Y,Z) and focal length f, compute image coordinates (u,v).
X: 100
Y: 50
Z (depth): 500
f: 100
Sensor width (px): 1000
Focal length f: 100
Figures Extracted from the Original Lecture Document
These figures are preserved in their original document order as a complete visual reference. Captions identify the source part and figure number; explanatory text remains in the study-guide sections.
Lecture 2: original figures 1 to 27