ICS604 — Introduction to Image Processing & Computer Vision
Midterm Exam Guide — Full Consolidated Study Page (Lectures 1–6) · Dr. Maggie Shammaa
Lecture 1 — Introduction to Computer Vision
IP vs CV
Image Processing
Does: transforms images
Level: pixel-level
Output: an image
Meaning: no semantics
Computer Vision
Does: interprets images
Level: object/scene reasoning
Output: decisions/info
Meaning: high-level understanding
Ventral vs Dorsal stream
Ventral = "what" pathway → Classification/Recognition. Dorsal = "where/how" pathway → Detection/Tracking. Detection = Recognition + Localization (harder, needs both).
Why a photo ≠ ground truth
Every imaging step introduces bias: optics, sensor noise, exposure, color processing, denoising, compression.
Real-world constraints (exam favorite)
Environment: illumination change, occlusion, clutter. Sensors/Data: motion blur, noise, dataset bias. Deployment: real-time limits, memory/power — high benchmark accuracy ≠ real-world robustness.
3 conditions for CV progress
Sufficient data, enough compute, and sound math models: all three are needed simultaneously.
Camera pipeline (overview)
Optics (lens, aperture) → Sensor + AFE (+ CFA) → ISP (demosaic, WB, denoise) → Post-capture (enhance, compress)
CFA = Color Filter Array | AFE = Analog Front-End | WB = White Balance | ISP = Image Signal Processor
Lecture 2 — Image Formation
Core image formation equation
\[I(x,y) = R(x,y) \cdot L(x,y)\]
$R$ = reflectance (intrinsic surface property, ideally light-invariant). $L$ = illumination (varies with scene/time). Ambiguity: same pixel value can come from a bright surface in shadow OR a dark surface in strong light — camera can't tell apart.
Worked example — intrinsic ambiguity
Model: pixel_value = (I/1000)·255, I = R·L
Pixel A: R=0.80, L=250 → I=200 → pixel=51.0
Pixel B: R=0.20, L=1000 → I=200 → pixel=51.0
Both map to same digital value despite totally different physical causes.
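The ambiguity above can be checked numerically. This is a minimal Python sketch that assumes the notes' linear model with a full-scale intensity of 1000; the function name `pixel_value` is illustrative and not from the lecture.

```python
def pixel_value(reflectance, illumination, full_scale=1000.0):
    """Map scene intensity I = R * L to an 8-bit-range value (linear model from the notes)."""
    intensity = reflectance * illumination
    return intensity / full_scale * 255.0

pixel_a = pixel_value(0.80, 250)    # bright surface under dim light
pixel_b = pixel_value(0.20, 1000)   # dark surface under strong light
print(pixel_a, pixel_b)             # the two values coincide (about 51)
```

Because only the product R·L reaches the sensor, no per-pixel operation can separate the two causes afterward.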
Geometry vs Photometry
Geometry
Controls structure/shape
Where does 3D point land?
Camera pose, f, FOV, projection
Photometry
Controls brightness/color
How bright is that point?
Illumination, reflectance, sensor
ISP pipeline
Optics → Sensor (Bayer CFA: 2G,1R,1B per 2×2) → RAW (mosaiced, linear, 12-bit)
→ Denoise → Demosaic → White Balance → Color Transform → Tone Reproduction → Compression
→ Final RGB (full color, non-linear/gamma, 8-bit)
Exam point: many CV failures originate in early ISP stages (noise, WB, demosaicing), not the algorithm.
RAW vs final RGB
RAW
Mosaiced (1 ch/px)
Linear w.r.t. radiance
12-bit
Not viewable directly
Final RGB
Full RGB/pixel
Non-linear (gamma)
8-bit/channel
Display-ready
Perspective projection
\[u = f \cdot \frac{X}{Z}, \qquad v = f \cdot \frac{Y}{Z}\]
Non-linear due to division by depth $Z$. Points on same ray $(X,Y,Z)$ and $(\lambda X,\lambda Y,\lambda Z)$ project identically → depth lost. Causes foreshortening (objects shrink with depth).
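As a small illustration of the depth loss described above, the following Python sketch projects two points on the same ray. The focal length and point coordinates are arbitrary example values, and `project` is a hypothetical helper name.

```python
def project(point, f=1.0):
    """Pinhole projection: (X, Y, Z) -> (u, v) = (f*X/Z, f*Y/Z)."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)

p = (2.0, 1.0, 4.0)
scaled = tuple(3.0 * c for c in p)   # same ray, three times farther away

print(project(p), project(scaled))   # both give (0.5, 0.25)
```

Any scale factor λ cancels in the division by Z, which is why a single image cannot recover depth.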
Magnification
\[m = \frac{y'}{y} = \frac{f}{Z}\]
What projection preserves / loses
Preserved
Straight lines
Incidence relations
NOT preserved
Distances
Angles
Parallelism (→ vanishing points)
Field of View
\[\text{FOV} = 2 \cdot \arctan\!\left(\frac{w}{2f}\right)\]
Depends on BOTH sensor width $w$ and focal length $f$. Larger $f$ → narrower FOV (zoom). Larger $w$ → wider FOV.
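A quick numeric check of the FOV formula, written as a Python sketch. The 36 mm sensor width and the 18 mm / 50 mm focal lengths are example values chosen here, not figures from the lecture.

```python
import math

def fov_degrees(sensor_width_mm, focal_length_mm):
    """Horizontal field of view: FOV = 2 * arctan(w / (2f)), in degrees."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

wide = fov_degrees(36, 18)   # w / (2f) = 1, so FOV is 90 degrees
tele = fov_degrees(36, 50)   # longer focal length, narrower FOV (about 39.6 degrees)
print(round(wide, 1), round(tele, 1))
```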
Depth of Field
Large aperture (small f-number)
More light, brighter
Shallow DOF, bokeh
Small aperture (large f-number)
Less light, darker
Deep DOF, all sharp
Radial lens distortion
\[r_d = r \cdot (1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots)\]
Barrel (wide-angle): magnification ↓ from center, lines bow outward. Pincushion (telephoto): magnification ↑ from center, lines bow inward.
Color spaces
| Space | Use case | Limitation |
| RGB | Cameras, displays, DL | Mixes brightness + color |
| HSV/HSL | Segmentation, tracking | Not perceptually uniform |
| YCbCr | Video compression, face detection | Less intuitive |
| CMYK | Printing only | Not for digital sensing |
| HSI | Medical, satellite, agriculture | Limited standardization |
Lecture 3 — Digital Images
Sensor = photon bucket
Photosite accumulates electrons ∝ light intensity × exposure time → AFE (analog voltage) → ADC (digital value) → RAW. Saturation when well fills.
Pixel size trade-off
Larger pixels
Higher full-well capacity
Higher dynamic range
Better low-light/SNR
Lower spatial resolution
Smaller pixels
Lower full-well capacity
Higher spatial resolution
More noise
Reduced dynamic range
Bayer CFA
Each photosite measures only one of R/G/B. RGGB pattern: 2 green, 1 red, 1 blue per 2×2 block (human vision green-sensitive for luminance). Demosaicing interpolates missing channels from neighbors.
Sampling vs Quantization — KEY distinction
Sampling
WHERE you measure (spatial)
Determines resolution
Artifact: aliasing/pixelation
Quantization
HOW PRECISE the value (intensity)
Determines bit depth
Artifact: banding
Binning
Groups neighboring pixels (e.g. 2×2 sum) → better SNR, smaller data, lower spatial resolution, irreversible detail loss.
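The 2×2 sum can be sketched in a few lines of Python. This assumes an image stored as a list of rows with even height and width; `bin_2x2` and the sample values are illustrative.

```python
def bin_2x2(image):
    """Sum each non-overlapping 2x2 block (height and width assumed even)."""
    height, width = len(image), len(image[0])
    return [[image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]
             for c in range(0, width, 2)]
            for r in range(0, height, 2)]

raw = [[10, 12, 200, 202],
       [11, 13, 201, 203],
       [50, 52,  90,  92],
       [51, 53,  91,  93]]

binned = bin_2x2(raw)
print(binned)   # [[46, 806], [206, 366]]
```

The 4×4 input becomes 2×2: the four original values in each block cannot be recovered from their sum, which is the irreversible detail loss noted above.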
10-step ISP (detailed)
1. RAW → 2. Pre-processing → 3. Noise reduction → 4. Demosaicing → 5. White balance
→ 6. Color transform I → 7. Color manipulation → 8. Tone mapping → 9. Color transform II → 10. sRGB output
Histograms
Count of pixels per intensity value [0,255]. Discards ALL spatial info — irreversible, two structurally different images can share identical histograms.
Under-exposed
Spike at low values
Over-exposed
Spike at high values
| Pattern | Cause | Effect |
| Spike at 0/255 | Severe under/over-exposure clipping | Irreversible detail loss |
| Gaps between bins | Contrast stretch | Posterization/banding |
| Spikes at intervals | Contrast compression | Flat washed look |
| Few occupied bins | GIF/heavy quantization | Posterized bands |
| Altered frequency pattern | JPEG (DCT quantization) | Block/ringing artifacts |
Per-channel histogram limitation: marginal R/G/B histograms can be identical for very different-colored images. Use 2D/joint histograms to capture channel relationships.
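To make the "histograms discard spatial information" point concrete, here is a minimal Python sketch with two tiny example images (invented for illustration) that differ in structure but share one histogram.

```python
from collections import Counter

def histogram(image):
    """Count pixels per intensity value; pixel positions are ignored."""
    return Counter(value for row in image for value in row)

half_split = [[0, 0, 255, 255],
              [0, 0, 255, 255]]      # dark left half, bright right half
checker    = [[0, 255, 0, 255],
              [255, 0, 255, 0]]      # checkerboard

print(histogram(half_split) == histogram(checker))   # True
print(half_split == checker)                          # False
```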
HDR vs LDR
LDR forces a compromise (clip highlights OR crush shadows). HDR = bracket exposures, merge, then tone-map. Capture in HDR and reduce the dynamic range later: clipped/saturated data can never be recovered afterward.
RGB → HSV/HSL conversion formulas
\[V_{max}=\max(R,G,B), \quad V_{min}=\min(R,G,B), \quad \text{diff}=V_{max}-V_{min}\]
\[S_{HSV} = \frac{\text{diff}}{V_{max}} \ \ (\text{0 if } V_{max}=0)\]
\[L_{HSL} = \frac{V_{max}+V_{min}}{2}\]
\[S_{HSL} = \begin{cases}\dfrac{\text{diff}}{V_{max}+V_{min}} & L<0.5\\[4pt]\dfrac{\text{diff}}{2-(V_{max}+V_{min})} & L\ge 0.5\end{cases}\]
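\[H = \begin{cases}60\cdot\dfrac{G-B}{\text{diff}} & V_{max}=R\\[4pt]120+60\cdot\dfrac{B-R}{\text{diff}} & V_{max}=G\\[4pt]240+60\cdot\dfrac{R-G}{\text{diff}} & V_{max}=B\end{cases}\]
If diff = 0 (a gray pixel) the hue is undefined and is conventionally set to 0. If $H<0$, add 360° so that $H\in[0°,360°)$.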
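The conversion formulas above can be sketched in Python as follows. Inputs are assumed to be normalized to [0, 1]; setting hue and HSL saturation to 0 for gray pixels and wrapping negative hues by 360° are common conventions assumed here, and `rgb_to_hsv_hsl` is a hypothetical helper name.

```python
def rgb_to_hsv_hsl(r, g, b):
    """Return (H in degrees, S_HSV, V, S_HSL, L) for r, g, b in [0, 1]."""
    v_max, v_min = max(r, g, b), min(r, g, b)
    diff = v_max - v_min
    s_hsv = diff / v_max if v_max > 0 else 0.0
    lightness = (v_max + v_min) / 2
    if diff == 0:
        # Gray pixel: hue undefined, set to 0 by convention (assumption).
        return 0.0, s_hsv, v_max, 0.0, lightness
    denom = (v_max + v_min) if lightness < 0.5 else 2 - (v_max + v_min)
    s_hsl = diff / denom
    if v_max == r:
        hue = 60 * (g - b) / diff
    elif v_max == g:
        hue = 120 + 60 * (b - r) / diff
    else:
        hue = 240 + 60 * (r - g) / diff
    return hue % 360, s_hsv, v_max, s_hsl, lightness

print(rgb_to_hsv_hsl(0.2, 0.6, 0.4))   # hue near 150 degrees (green-cyan)
print(rgb_to_hsv_hsl(1.0, 0.0, 0.5))   # raw hue of -30 wraps to 330 degrees
```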
Lecture 4 — Image Enhancement
General point operation
\[g(x) = T(f(x)) \qquad \text{or} \qquad I'(u,v) = f\big(I(u,v)\big)\]
Output depends ONLY on input value at that pixel — spatially invariant, no neighbors. Contrast with neighborhood operations: $I'(u,v) = f(I(u,v), N(p))$.
All point operations — reference table
| Operation | Formula | Histogram effect | Reversible? |
| Add k | clamp(I+k) | shift right | yes if no clamp |
| Subtract k | clamp(I−k) | shift left | yes if no clamp |
| Multiply α>1 | clamp(αI) | stretch+clip | no (clamping) |
| Multiply 0<α<1 | clamp(αI) | compress | yes if no clamp |
| Invert | 255−I | mirror about 127.5 | yes |
| Threshold | a₀ or a₁ | two spikes | no |
| Quantize | ⌊I/Δ⌋·Δ | bin merging | no |
\[\text{clamp}(v) = \max(0, \min(255, v))\]
Clamping destroys info irreversibly — collapses many out-of-range inputs into one boundary value.
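A short Python sketch of why clamping is irreversible, using an invented row of 8-bit pixel values: brightening by k and then darkening by k does not return the original wherever the clamp was active.

```python
def clamp(v):
    """Limit a value to the 8-bit range [0, 255]."""
    return max(0, min(255, v))

def add_constant(pixels, k):
    """Point operation I' = clamp(I + k), applied independently per pixel."""
    return [clamp(p + k) for p in pixels]

row = [10, 100, 200, 250]
brighter = add_constant(row, 60)         # [70, 160, 255, 255]
restored = add_constant(brighter, -60)   # [10, 100, 195, 195]
print(brighter, restored)
```

The inputs 200 and 250 both collapsed to 255, so after subtracting 60 they are indistinguishable.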
Image differencing — motion detection
\[D(u,v) = I_1(u,v) - I_2(u,v)\]
\[D_t(u,v) = \left| I_t(u,v) - I_{t-1}(u,v) \right|\]
\[M_t(u,v) = \begin{cases}1 & D_t(u,v) \ge T\\0 & \text{otherwise}\end{cases}\]
Use absolute difference — plain clamped subtraction hides darkening (negative → 0); $|D|$ treats brightening and darkening equally as "change."
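The difference between absolute and clamped subtraction can be seen in a small Python sketch. The two frames and the threshold T = 30 are invented example values.

```python
def motion_mask(previous, current, threshold):
    """M = 1 where |I_t - I_{t-1}| >= T, else 0."""
    return [[1 if abs(c - p) >= threshold else 0 for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(previous, current)]

def clamped_mask(previous, current, threshold):
    """Same test, but on a subtraction clamped at 0 (negative changes are lost)."""
    return [[1 if max(0, c - p) >= threshold else 0 for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(previous, current)]

frame_prev = [[100, 100, 100],
              [100, 100, 100]]
frame_cur  = [[100, 160, 100],    # one pixel brightened by 60
              [ 40, 100, 105]]    # one pixel darkened by 60, one changed by 5

print(motion_mask(frame_prev, frame_cur, 30))    # [[0, 1, 0], [1, 0, 0]]
print(clamped_mask(frame_prev, frame_cur, 30))   # [[0, 1, 0], [0, 0, 0]]
```

The darkened pixel is detected only with the absolute difference, and the small change of 5 stays below the threshold in both cases.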
Quantization
\[\Delta = \frac{256}{L} \qquad I'(u,v) = \left\lfloor \frac{I(u,v)}{\Delta} \right\rfloor \cdot \Delta\]
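A minimal Python sketch of the quantization formula, assuming the number of levels L divides 256 evenly so that the step Δ is an integer; the sample values are illustrative.

```python
def quantize(value, levels):
    """I' = floor(I / step) * step with step = 256 / levels (levels must divide 256)."""
    step = 256 // levels
    return (value // step) * step

samples = [0, 63, 64, 200, 255]
print([quantize(v, 4) for v in samples])   # [0, 0, 64, 192, 192]
```

With L = 4 the step is 64, so every input falls onto one of the values 0, 64, 128, 192.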
Contrast stretching
Simple: stretch actual min/max to [0,255] — single outlier pixel can dominate, leaves image flat.
Robust (percentile clipping): use percentiles $q_{low}, q_{high}$ instead of absolute extremes.
\[a' = \left[\frac{a'_{high}-a'_{low}}{a_{high}-a_{low}}\right]\cdot(a-a_{low}) + a'_{low}\]
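The effect of one outlier on simple stretching can be sketched in Python. The pixel list is invented, and the robust limits (50 and 90) are picked by hand here rather than computed from percentiles, so treat this as an illustration of the mapping only.

```python
def stretch(pixels, a_low, a_high, out_low=0, out_high=255):
    """Linearly map [a_low, a_high] to [out_low, out_high], clipping values outside."""
    scale = (out_high - out_low) / (a_high - a_low)
    result = []
    for a in pixels:
        mapped = (a - a_low) * scale + out_low
        result.append(round(max(out_low, min(out_high, mapped))))
    return result

pixels = [50, 60, 70, 80, 90, 255]                    # one bright outlier at 255
simple = stretch(pixels, min(pixels), max(pixels))    # outlier sets the upper limit
robust = stretch(pixels, 50, 90)                      # outlier is clipped instead

print(simple)   # the bulk of the pixels stays in roughly 0..50
print(robust)   # the bulk of the pixels now spans 0..255
```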
Histogram equalization
\[H(a) = \sum_{i=0}^{a} h(i), \qquad \text{CDF}(a) = \frac{H(a)}{M\cdot N}\]
\[a' = \left\lfloor (K-1)\cdot \text{CDF}(a) \right\rfloor\]
Steep CDF (spike regions) → expanded contrast. Flat CDF → compressed. Limitation: can over-amplify noise in flat regions (skies); global, ignores local structure (→ motivates CLAHE).
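The equalization mapping can be traced on a tiny example. This Python sketch works on a flat list of pixel values (so M·N is the list length) and uses integer arithmetic for the floor; the eight sample pixels are invented.

```python
def equalize(pixels, levels=256):
    """Map each value a to floor((K - 1) * CDF(a)), with K = levels."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    mapping, cumulative = [], 0
    for count in hist:
        cumulative += count                               # cumulative histogram H(a)
        mapping.append((levels - 1) * cumulative // total)
    return [mapping[p] for p in pixels]

low_contrast = [100, 100, 101, 101, 102, 102, 103, 103]
print(equalize(low_contrast))   # [63, 63, 127, 127, 191, 191, 255, 255]
```

Four adjacent gray levels are spread across most of the output range, which is the "steep CDF expands contrast" behavior described above.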
Histogram matching (specification)
\[a' = P_R^{-1}\big(P_A(a)\big)\]
Maps original CDF $P_A$ to reference CDF $P_R$. Preserves relative pixel rank, transfers tonal "style".
Global thresholding
\[I'(u,v) = \begin{cases}a_0 & I(u,v) < a_{th}\\a_1 & I(u,v) \ge a_{th}\end{cases}\]
Alpha blending
\[I_{blend}(u,v) = \alpha \cdot I_{left}(u,v) + (1-\alpha)\cdot I_{right}(u,v)\]
Small feathering window → preserves detail, risk of visible seam. Large window → smooth, risk of ghosting (both images visible).
Gamma correction
\[\gamma \equiv \frac{\Delta D}{\Delta B} \qquad\qquad b = f_\gamma(a) = a^\gamma,\ a\in[0,1]\]
$\gamma<1$: expands/brightens shadows. $\gamma>1$: compresses/darkens shadows. $\gamma=1$: identity. Historical origin: CRT non-linearity compensation for TV broadcast.
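A small Python sketch of the gamma curve applied to 8-bit values, assuming the usual normalize, raise to the power, rescale sequence; the test value 64 is arbitrary.

```python
def gamma_correct(value, gamma):
    """Apply b = a**gamma to an 8-bit value after normalizing to [0, 1]."""
    a = value / 255
    return round(255 * a ** gamma)

shadow = 64
print(gamma_correct(shadow, 0.5))   # gamma < 1 brightens the shadow value
print(gamma_correct(shadow, 1.0))   # gamma = 1 leaves it unchanged
print(gamma_correct(shadow, 2.0))   # gamma > 1 darkens it
```

Black (0) and white (255) are fixed points for every gamma, so only the tones in between move.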
Lecture 5 — Image Filtering
LSI filtering model
\[g[m,n] = \sum_{k,l} h[k,l]\cdot f[m+k, n+l]\]
Linear (output = weighted sum of neighbors) + Shift-invariant (same kernel at every pixel). Foundation of CNN convolutional layers.
Cross-correlation vs Convolution — THE only difference
\[\text{Correlation: } G[i,j] = \sum_u\sum_v H[u,v]\cdot F[i+u,j+v] \qquad G = H \otimes F\]
\[\text{Convolution: } G[i,j] = \sum_u\sum_v H[u,v]\cdot F[i-u,j-v] \qquad G = H \star F\]
Convolution flips the kernel 180° before sliding. Identical results for symmetric kernels (Gaussian, box, Laplacian); differ for asymmetric kernels (Sobel).
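The kernel flip is easiest to see on an impulse image. The Python sketch below uses zero padding and a deliberately asymmetric example kernel; both function names are illustrative.

```python
def correlate(image, kernel):
    """G[i,j] = sum_u sum_v H[u,v] * F[i+u, j+v], kernel centered, zero padding."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for u in range(kh):
                for v in range(kw):
                    r, c = i + u - ph, j + v - pw
                    if 0 <= r < h and 0 <= c < w:
                        out[i][j] += kernel[u][v] * image[r][c]
    return out

def convolve(image, kernel):
    """Convolution = correlation with the kernel rotated by 180 degrees."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return correlate(image, flipped)

impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(convolve(impulse, kernel))    # reproduces the kernel
print(correlate(impulse, kernel))   # reproduces the kernel rotated by 180 degrees
```

For a symmetric kernel such as a box filter, the rotation changes nothing and the two operations agree.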
Cross-correlation properties
NOT associative
NOT commutative
Convolution properties
Commutative: f⋆h=h⋆f
Associative: (f⋆h₁)⋆h₂=f⋆(h₁⋆h₂)
Separable filters
\[H = v\cdot w^T \qquad G = H\star F = v \star (w^T \star F)\]
\[\text{Non-separable: } O(M^2 N^2) \qquad \text{Separable: } O(M^2 \cdot 2N)\]
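Here the image is M×M and the kernel N×N, so the per-pixel cost drops from N² to 2N multiplications. Gaussian is separable → big speedup for large kernels (e.g. N=15: 225→30 mult/pixel, 7.5× speedup).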
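The equivalence of one 2D pass and two 1D passes can be verified with a Python sketch. It uses the integer binomial weights [1, 2, 1] (an unnormalized Gaussian-like kernel chosen here so that results compare exactly) and zero padding; all names and the test image are illustrative.

```python
def filter_rows(image, weights):
    """Correlate every row with a 1D kernel, treating out-of-range pixels as zero."""
    half = len(weights) // 2
    width = len(image[0])
    return [[sum(wk * row[j + k - half]
                 for k, wk in enumerate(weights)
                 if 0 <= j + k - half < width)
             for j in range(width)]
            for row in image]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def filter_separable(image, v, w):
    """Horizontal pass with w, then vertical pass with v (2N multiplications per pixel)."""
    horizontal = filter_rows(image, w)
    return transpose(filter_rows(transpose(horizontal), v))

def filter_2d(image, kernel):
    """Direct 2D correlation with zero padding (N*N multiplications per pixel)."""
    n, half = len(kernel), len(kernel) // 2
    height, width = len(image), len(image[0])
    return [[sum(kernel[u][x] * image[i + u - half][j + x - half]
                 for u in range(n) for x in range(n)
                 if 0 <= i + u - half < height and 0 <= j + x - half < width)
             for j in range(width)]
            for i in range(height)]

v = w = [1, 2, 1]
kernel = [[a * b for b in w] for a in v]     # H = v * w^T (outer product)
image = [[3, 0, 7, 1], [4, 9, 2, 6], [5, 8, 1, 0]]

print(filter_separable(image, v, w) == filter_2d(image, kernel))   # True
```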
Boundary handling
| Strategy | Behavior |
| Zero padding | Dark border artifacts |
| Clamp/replicate | Good default, avoids dark border |
| Mirror/reflect | Smooth, preferred for derivative filters |
| Wrap/toroidal | Periodic, used in Fourier domain |
Box vs Gaussian filter
Box filter
Equal weights
Blocky artifacts, sharp freq cutoff
Gaussian filter
Center-weighted decay
Separable, smooth blur, no blocky artifacts
Non-linear filters
Median filter: sorts neighborhood, takes middle. Best for salt-and-pepper/impulse noise — rejects outliers, preserves edges (strictly non-linear). Dilation (max filter): grows bright regions. Erosion (min filter): shrinks bright regions.
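A 1D Python sketch contrasts the median with a mean filter on a signal that contains one impulse and one step edge. The signal, the window size of 3, and the replicate-border handling are choices made for this illustration.

```python
from statistics import median

def sliding_windows(signal, size=3):
    """Yield size-wide windows, replicating the border samples."""
    half = size // 2
    padded = [signal[0]] * half + signal + [signal[-1]] * half
    for i in range(len(signal)):
        yield padded[i:i + size]

def median_filter(signal, size=3):
    return [median(window) for window in sliding_windows(signal, size)]

def mean_filter(signal, size=3):
    return [sum(window) / size for window in sliding_windows(signal, size)]

signal = [10, 10, 10, 255, 10, 10, 200, 200, 200]   # impulse at index 3, edge at index 6

print(median_filter(signal))   # [10, 10, 10, 10, 10, 10, 200, 200, 200]
print(mean_filter(signal))     # impulse is smeared and the edge is blurred
```

The median removes the outlier completely and keeps the step sharp, while the mean spreads the impulse into its neighbors.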
Edge detection — derivatives
\[f'(x) = \frac{f(x+1) - f(x-1)}{2}\]
Derivatives amplify noise → always smooth (Gaussian) before differentiating.
Sobel — direction of detection vs direction of edge (critical)
G_x (detects VERTICAL edges):
+1 0 −1
+2 0 −2
+1 0 −1
G_y (detects HORIZONTAL edges):
−1 −2 −1
0 0 0
+1 +2 +1
\[|G| = \sqrt{G_x^2 + G_y^2} \qquad \theta = \tan^{-1}\!\left(\frac{G_y}{G_x}\right)\]
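The "direction of detection vs direction of edge" point can be checked on two 3×3 patches. This Python sketch applies the kernels above as-is (correlation, so signs would flip under true convolution) and uses `math.atan2` as a quadrant-aware stand-in for the tan⁻¹(G_y/G_x) formula; the patches are invented.

```python
import math

G_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]      # responds to vertical edges
G_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]      # responds to horizontal edges

def response(patch, kernel):
    """Kernel response at the center of a 3x3 patch (correlation, no flip)."""
    return sum(kernel[u][v] * patch[u][v] for u in range(3) for v in range(3))

def gradient(patch):
    gx, gy = response(patch, G_X), response(patch, G_Y)
    magnitude = math.hypot(gx, gy)
    theta_degrees = math.degrees(math.atan2(gy, gx))
    return gx, gy, magnitude, theta_degrees

vertical_edge = [[0, 0, 255], [0, 0, 255], [0, 0, 255]]       # dark left, bright right
horizontal_edge = [[0, 0, 0], [0, 0, 0], [255, 255, 255]]     # dark top, bright bottom

print(gradient(vertical_edge))     # only G_x responds
print(gradient(horizontal_edge))   # only G_y responds
```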
DoG vs LoG
Derivative of Gaussian (DoG)
1st derivative, directional
$\frac{\partial}{\partial x}(G\star I)=(\frac{\partial G}{\partial x})\star I$
Laplacian of Gaussian (LoG)
2nd derivative, isotropic
$\nabla^2(G\star I)=(\nabla^2 G)\star I$
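\[\text{DoG} = G_{\sigma_1} - G_{\sigma_2} \approx \text{LoG} \quad (\text{faster; used in SIFT})\]
Note the abbreviation clash: in this formula DoG means Difference of Gaussians (a band-pass approximation of the LoG), not the Derivative of Gaussian listed above.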
Lecture 6 — Image Analysis
Why image pyramids
Multi-scale object search — same object appears at different scales depending on distance. Pyramid organizes representation scale-by-scale.
Reduce vs Expand
Reduce (up the pyramid)
1. Gaussian smooth
2. Subsample by 2
Must smooth BEFORE subsampling (anti-aliasing)
Expand (down the pyramid)
1. Double dimensions
2. Insert zeros
3. Interpolating low-pass filter
Storage: each level = 1/4 pixels of level below. Full pyramid = 4/3× original storage (geometric series $1+\frac14+\frac1{16}+\dots=\frac43$).
Gaussian vs Laplacian pyramid
Gaussian pyramid
Smoothed + downsampled
Lossy — no exact reconstruction
Laplacian pyramid
Stores difference between levels
Lossless — exact reconstruction
Laplacian pyramid build & reconstruct
\[L_l = G_l - \text{Expand}(G_{l+1})\]
\[G_l = \text{Expand}(G_{l+1}) + L_l\]
Full representation: $\{L_0, L_1, \dots, L_{n-1}, G_n\}$. Reconstruct coarse-to-fine: start from the smallest level $G_n$ and apply the second equation level by level to recover $G_0$ exactly.
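The build and reconstruct equations can be exercised on a 1D signal. This Python sketch makes two simplifications that are assumptions, not the lecture's method: Reduce uses an integer [1, 2, 1]/4 smoothing, and Expand uses nearest-neighbor upsampling instead of zero insertion plus an interpolating low-pass. Reconstruction is still exact because the same Expand is used in both directions.

```python
def reduce_level(signal):
    """Smooth with [1, 2, 1] / 4 (replicated borders), then keep every second sample."""
    padded = [signal[0]] + signal + [signal[-1]]
    smoothed = [(padded[i] + 2 * padded[i + 1] + padded[i + 2]) // 4
                for i in range(len(signal))]
    return smoothed[::2]

def expand_level(signal, length):
    """Nearest-neighbor upsampling to the requested length (simplified Expand)."""
    return [signal[i // 2] for i in range(length)]

def build_laplacian(signal, levels):
    """Return ([L_0, ..., L_{n-1}], G_n) with L_l = G_l - Expand(G_{l+1})."""
    laplacians, current = [], signal
    for _ in range(levels):
        smaller = reduce_level(current)
        expanded = expand_level(smaller, len(current))
        laplacians.append([g - e for g, e in zip(current, expanded)])
        current = smaller
    return laplacians, current

def reconstruct(laplacians, top):
    """Apply G_l = Expand(G_{l+1}) + L_l from the coarsest level down to G_0."""
    current = top
    for lap in reversed(laplacians):
        expanded = expand_level(current, len(lap))
        current = [e + l for e, l in zip(expanded, lap)]
    return current

g0 = [12, 40, 41, 200, 180, 30, 31, 90]
laplacians, top = build_laplacian(g0, levels=2)
print(reconstruct(laplacians, top) == g0)   # True
```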
Laplacian pyramid blending (apple/orange)
1) Build Laplacian pyramids of both images. 2) Build Gaussian pyramid of binary mask (smoother/wider transition at coarse levels). 3) Blend per level: $B_l = mask_l\cdot LA_l + (1-mask_l)\cdot LB_l$. 4) Reconstruct coarse-to-fine, starting from the smallest blended level. Low-freq blended wide, high-freq blended narrow → no ghosting, no harsh seam.
Aliasing & Nyquist
\[f_{sampling} \ge 2\times f_{max\_signal}\]
Need ≥2 samples per cycle. Violate it → high frequencies fold back as false low-frequency patterns (jagged edges, moiré, distorted textures — zebra-stripe example). Fix: low-pass filter BEFORE subsampling (exactly the Gaussian pyramid Reduce step).
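The fold-back effect can be reproduced with a sine wave. In this Python sketch the 9 Hz signal and 10 Hz sampling rate are example values that violate the Nyquist condition; the samples turn out to match a 1 Hz sine with inverted sign.

```python
import math

def sample_sine(freq_hz, rate_hz, count):
    """Sample sin(2*pi*f*t) at t = n / rate for n = 0 .. count - 1."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

undersampled = sample_sine(9, 10, 10)   # 9 Hz sampled at 10 Hz (needs at least 18 Hz)
alias = sample_sine(1, 10, 10)          # genuine 1 Hz signal at the same rate

# The 9 Hz samples coincide with the negated 1 Hz samples.
print(all(abs(u + a) < 1e-9 for u, a in zip(undersampled, alias)))   # True
```

After sampling, nothing distinguishes the two signals, which is why the low-pass filter has to be applied before subsampling.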
Frequency domain
\[f(x) = A\cdot\sin(\omega x + \varphi)\]
Low freq = smooth regions, center of spectrum. High freq = edges/textures/noise, periphery of spectrum.
Low-pass
Keeps low freq
Smoothing (Gaussian)
High-pass
Keeps high freq
Edge enhancement (Sobel/Laplacian)
Band-pass
Keeps a range
Texture/blob (DoG)
Convolution theorem
\[f \star g \ \longleftrightarrow\ F(u,v)\cdot G(u,v)\]
Spatial convolution = frequency-domain multiplication. Steps: FFT(image) → multiply by FFT(kernel) → IFFT. Faster than spatial convolution for large kernels.
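The theorem can be verified numerically in 1D. The Python sketch below uses a naive O(n²) DFT instead of an FFT to stay dependency-free, and it compares against circular convolution, which is what the discrete transform yields without zero padding; the two signals are invented.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (the inverse divides by n)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [value / n for value in out] if inverse else out

def circular_convolve(f, g):
    """Spatial-domain circular convolution of two equal-length signals."""
    n = len(f)
    return [sum(f[k] * g[(m - k) % n] for k in range(n)) for m in range(n)]

f = [1, 2, 3, 4]
g = [1, 0, 0, 1]

spatial = circular_convolve(f, g)
product = [a * b for a, b in zip(dft(f), dft(g))]     # multiply the spectra
via_frequency = [value.real for value in dft(product, inverse=True)]

print(spatial)
print([round(value, 6) for value in via_frequency])   # same values up to rounding
```

For linear (non-wrapping) filtering of an image, both the image and the kernel would be zero-padded before the transform.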
Formula summary by lecture
Lecture 2 — Image Formation
\[I(x,y) = R(x,y)\cdot L(x,y)\]
\[u = f\cdot\frac{X}{Z}, \qquad v = f\cdot\frac{Y}{Z}\]
\[m = \frac{y'}{y} = \frac{f}{Z}\]
\[\text{FOV} = 2\cdot\arctan\!\left(\frac{w}{2f}\right)\]
\[r_d = r\cdot(1+k_1 r^2+k_2 r^4+k_3 r^6+\dots)\]
Lecture 3 — Digital Images / Color
\[V_{max}=\max(R,G,B),\quad V_{min}=\min(R,G,B),\quad L=\frac{V_{max}+V_{min}}{2}\]
\[S_{HSV} = \frac{V_{max}-V_{min}}{V_{max}}\]
\[S_{HSL} = \frac{V_{max}-V_{min}}{V_{max}+V_{min}}\ (L<0.5) \quad\text{or}\quad \frac{V_{max}-V_{min}}{2-(V_{max}+V_{min})}\ (L\ge 0.5)\]
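\[H=60\cdot\frac{G-B}{V_{max}-V_{min}}\ (V_{max}=R),\quad 120+60\cdot\frac{B-R}{V_{max}-V_{min}}\ (V_{max}=G),\quad 240+60\cdot\frac{R-G}{V_{max}-V_{min}}\ (V_{max}=B)\]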
Lecture 4 — Enhancement
\[g(x)=T(f(x))\]
\[I'(u,v)=I(u,v)+k\]
\[D_t(u,v)=\left|I_t(u,v)-I_{t-1}(u,v)\right|\]
\[M_t(u,v)=1\ (D_t\ge T),\ \text{else } 0\]
\[I'(u,v)=\text{clamp}(\alpha\cdot I(u,v))\]
\[I'(u,v)=\left\lfloor\frac{I(u,v)}{\Delta}\right\rfloor\cdot\Delta, \quad \Delta=\frac{256}{L}\]
\[s=255-r\]
\[a' = \left[\frac{a'_{high}-a'_{low}}{a_{high}-a_{low}}\right](a-a_{low})+a'_{low}\]
\[a'=\left\lfloor(K-1)\cdot\text{CDF}(a)\right\rfloor\]
\[a'=P_R^{-1}(P_A(a))\]
\[I'(u,v)=a_0\ (I(u,v)<a_{th}),\quad a_1\ (I(u,v)\ge a_{th})\]
Lecture 5 — Filtering
\[g[m,n]=\sum_{k,l}h[k,l]\cdot f[m+k,n+l]\]
\[G=H\otimes F \quad (\text{correlation}) \qquad G=H\star F \quad (\text{convolution})\]
\[(f\star h_1)\star h_2 = f\star(h_1\star h_2)\]
\[f'(x)=\frac{f(x+1)-f(x-1)}{2}\]
\[\frac{\partial}{\partial x}(G\star I)=\left(\frac{\partial G}{\partial x}\right)\star I\]
\[\nabla^2(G\star I)=(\nabla^2 G)\star I\]
\[\text{DoG}=G_{\sigma_1}-G_{\sigma_2}\]
\[|G|=\sqrt{G_x^2+G_y^2}\]
Lecture 6 — Analysis
\[L_l = G_l - \text{Expand}(G_{l+1})\]
\[G_l = \text{Expand}(G_{l+1}) + L_l\]
\[f(x) = A\cdot\sin(\omega x+\varphi)\]
\[f_{sampling} \ge 2\cdot f_{max\_signal}\]
Key Comparisons (exam favorites)
| Pair | Quick answer |
| Image processing vs computer vision | pixel transform (image out) vs semantic interpretation (decision out) |
| Geometry vs photometry | where a point lands vs how bright/what color |
| Reflectance vs illumination | intrinsic material property vs external light source |
| RAW vs final RGB | mosaiced/linear/12-bit vs full-color/gamma/8-bit |
| Sampling vs quantization | spatial discretization (resolution) vs intensity discretization (bit depth) |
| Point ops vs neighborhood ops | own pixel only vs local region |
| Box filter vs Gaussian filter | uniform weights/blocky vs center-weighted/smooth |
| Cross-correlation vs convolution | kernel as-is vs kernel flipped 180° |
| Smoothing vs edge-detection filters | low-pass (Gaussian) vs high-pass/derivative (Sobel, Laplacian) |
| Gaussian pyramid vs Laplacian pyramid | lossy blur+downsample vs lossless residual storage |
| Low-pass vs high-pass filtering | keeps center/smooth vs keeps periphery/edges |
Answering Strategy
Conceptual questions
Define the concept → explain why it's needed → connect to an image example.
Numerical/operation questions
1. Identify input pixel values/region/kernel. 2. Apply the operation. 3. Check if clamping/normalization/thresholding is needed. 4. Interpret result in words (brighter, darker, smoother, sharper, saturated, detected as foreground, etc.) — a correct number alone is not enough.
Connect everything: image formation (why pixel values exist) → digital representation (how continuous becomes discrete) → histograms (describe exposure/contrast) → point operations (independent pixel transforms) → filtering (neighborhood-based: smooth/detect edges) → pyramids/frequency (multi-scale, multi-frequency analysis).