ICS604 — Introduction to Image Processing & Computer Vision

Midterm Exam Guide — Full Consolidated Study Page (Lectures 1–6) · Dr. Maggie Shammaa

Contents

Lecture 1 — Intro to CV
Lecture 2 — Image Formation
Lecture 3 — Digital Images
Lecture 4 — Image Enhancement
Lecture 5 — Image Filtering
Lecture 6 — Image Analysis
Master Formula Sheet
Key Comparisons (exam favorites)
Answering Strategy

Lecture 1 — Introduction to Computer Vision

IP vs CV

Image Processing

Does: transforms images

Level: pixel-level

Output: an image

Meaning: no semantics

Computer Vision

Does: interprets images

Level: object/scene reasoning

Output: decisions/info

Meaning: high-level understanding

Ventral vs Dorsal stream

Ventral = "what" pathway → Classification/Recognition. Dorsal = "where/how" pathway → Detection/Tracking. Detection = Recognition + Localization (harder, needs both).

Why a photo ≠ ground truth

Every imaging step introduces bias: optics, sensor noise, exposure, color processing, denoising, compression.

Real-world constraints (exam favorite)

Environment: illumination change, occlusion, clutter. Sensors/Data: motion blur, noise, dataset bias. Deployment: real-time limits, memory/power — high benchmark accuracy ≠ real-world robustness.

3 conditions for CV progress

Sufficient data · Enough compute · Sound math models. All three are needed simultaneously.

Camera pipeline (overview)

Optics (lens, aperture) → Sensor + AFE (+ CFA) → ISP (demosaic, WB, denoise) → Post-capture (enhance, compress)

CFA = Color Filter Array | AFE = Analog Front-End | WB = White Balance | ISP = Image Signal Processor

Lecture 2 — Image Formation

Core image formation equation

\[I(x,y) = R(x,y) \cdot L(x,y)\]

$R$ = reflectance (intrinsic surface property, ideally light-invariant). $L$ = illumination (varies with scene/time). Ambiguity: the same pixel value can come from a bright surface in shadow OR a dark surface in strong light, and the camera cannot tell the two apart from a single pixel.

Worked example — intrinsic ambiguity

Model: pixel_value = (I/1000)·255, I = R·L
Pixel A: R=0.80, L=250 → I=200 → pixel=51.0
Pixel B: R=0.20, L=1000 → I=200 → pixel=51.0
Both map to same digital value despite totally different physical causes.

Geometry vs Photometry

Geometry

Controls structure/shape

Where does 3D point land?

Camera pose, f, FOV, projection

Photometry

Controls brightness/color

How bright is that point?

Illumination, reflectance, sensor

ISP pipeline

Optics → Sensor (Bayer CFA: 2G,1R,1B per 2×2) → RAW (mosaiced, linear, 12-bit) → Denoise → Demosaic → White Balance → Color Transform → Tone Reproduction → Compression → Final RGB (full color, non-linear/gamma, 8-bit)

Exam point: many CV failures originate in early ISP stages (noise, WB, demosaicing), not the algorithm.

RAW vs final RGB

RAW

Mosaiced (1 ch/px)

Linear w.r.t. radiance

12-bit

Not viewable directly

Final RGB

Full RGB/pixel

Non-linear (gamma)

8-bit/channel

Display-ready

Perspective projection

\[u = f \cdot \frac{X}{Z}, \qquad v = f \cdot \frac{Y}{Z}\]

Non-linear due to division by depth $Z$. Points on same ray $(X,Y,Z)$ and $(\lambda X,\lambda Y,\lambda Z)$ project identically → depth lost. Causes foreshortening (objects shrink with depth).
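The depth ambiguity can be checked numerically. The sketch below assumes a point given in camera coordinates with $Z > 0$; the function name `project` and the sample values are illustrative, not from the lecture.

```python
def project(point, f=1.0):
    """Pinhole projection u = f*X/Z, v = f*Y/Z for a camera-frame point."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)

p = (2.0, 1.0, 4.0)
scaled = tuple(3.0 * c for c in p)   # same ray, three times farther away

# Both points land on the same image coordinates, so depth cannot be recovered.
print(project(p), project(scaled))
```

Because only the ratios $X/Z$ and $Y/Z$ survive, any $\lambda > 0$ gives the same result.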

Magnification

\[m = \frac{y'}{y} = \frac{f}{Z}\]

What projection preserves / loses

Preserved

Straight lines

Incidence relations

NOT preserved

Distances

Angles

Parallelism (→ vanishing points)

Field of View

\[\text{FOV} = 2 \cdot \arctan\!\left(\frac{w}{2f}\right)\]

Depends on BOTH sensor width $w$ and focal length $f$. Larger $f$ → narrower FOV (zoom). Larger $w$ → wider FOV.
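A quick numerical check of the FOV formula. The sensor width and focal lengths below are illustrative values chosen for the example (a 36 mm wide sensor is assumed).

```python
import math

def fov_degrees(sensor_width_mm, focal_length_mm):
    """Horizontal field of view: FOV = 2 * arctan(w / (2 f)), in degrees."""
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))

wide = fov_degrees(36.0, 18.0)    # w / (2f) = 1, so FOV = 2 * 45 degrees
tele = fov_degrees(36.0, 200.0)   # long focal length narrows the view
print(round(wide, 2), round(tele, 2))
```

Doubling $f$ does not halve the FOV exactly, because of the arctangent; the relation is only approximately linear for small angles.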

Depth of Field

Large aperture (small f-number)

More light, brighter

Shallow DOF, bokeh

Small aperture (large f-number)

Less light, darker

Deep DOF, all sharp

Radial lens distortion

\[r_d = r \cdot (1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots)\]

Barrel (wide-angle): magnification ↓ from center, lines bow outward. Pincushion (telephoto): magnification ↑ from center, lines bow inward.
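A minimal sketch of the radial model on a normalized radius. The coefficient values are made up for illustration. The sign choice follows from the definitions above: magnification that falls with distance from the center means $r_d/r$ decreases with $r$, which requires $k_1 < 0$ in this form.

```python
def distort_radius(r, k1, k2=0.0, k3=0.0):
    """Apply r_d = r * (1 + k1*r^2 + k2*r^4 + k3*r^6)."""
    r2 = r * r
    return r * (1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3)

for r in (0.0, 0.5, 1.0):
    barrel = distort_radius(r, k1=-0.2)      # k1 < 0: points pulled toward the center
    pincushion = distort_radius(r, k1=0.2)   # k1 > 0: points pushed away from the center
    print(r, barrel, pincushion)
```

The center ($r = 0$) is never moved, and the displacement grows fastest near the image border.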

Color spaces

Space	Use case	Limitation
RGB	Cameras, displays, DL	Mixes brightness + color
HSV/HSL	Segmentation, tracking	Not perceptually uniform
YCbCr	Video compression, face detection	Less intuitive
CMYK	Printing only	Not for digital sensing
HSI	Medical, satellite, agriculture	Limited standardization

Lecture 3 — Digital Images

Sensor = photon bucket

Photosite accumulates electrons ∝ light intensity × exposure time → AFE (analog voltage) → ADC (digital value) → RAW. Saturation when well fills.

Pixel size trade-off

Larger pixels

Higher full-well capacity

Higher dynamic range

Better low-light/SNR

Lower spatial resolution

Smaller pixels

Lower full-well capacity

Higher spatial resolution

More noise

Reduced dynamic range

Bayer CFA

Each photosite measures only one of R/G/B. RGGB pattern: 2 green, 1 red, 1 blue per 2×2 block (green is doubled because human vision is most sensitive to green, which carries most of the luminance signal). Demosaicing interpolates missing channels from neighbors.

Sampling vs Quantization — KEY distinction

Sampling

WHERE you measure (spatial)

Determines resolution

Artifact: aliasing/pixelation

Quantization

HOW PRECISE the value (intensity)

Determines bit depth

Artifact: banding

Binning

Groups neighboring pixels (e.g. 2×2 sum) → better SNR, smaller data, lower spatial resolution, irreversible detail loss.
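A small sketch of 2×2 sum binning on a nested-list image. It assumes even height and width; the function name `bin_2x2` is illustrative.

```python
def bin_2x2(image):
    """Sum each non-overlapping 2x2 block (height and width must be even)."""
    h, w = len(image), len(image[0])
    return [[image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
binned = bin_2x2(image)
print(binned)   # 4x4 input becomes 2x2 output
```

The four original values in each block cannot be recovered from their sum, which is the irreversible detail loss noted above.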

10-step ISP (detailed)

1. RAW → 2. Pre-processing → 3. Noise reduction → 4. Demosaicing → 5. White balance → 6. Color transform I → 7. Color manipulation → 8. Tone mapping → 9. Color transform II → 10. sRGB output

Histograms

Count of pixels per intensity value [0,255]. Discards ALL spatial info — irreversible, two structurally different images can share identical histograms.
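The loss of spatial information can be shown with two tiny images. The sketch below uses made-up 2×4 images: a left/right split and a checkerboard.

```python
def histogram(image, levels=256):
    """Count how many pixels take each intensity value."""
    counts = [0] * levels
    for row in image:
        for value in row:
            counts[value] += 1
    return counts

left_right = [[0, 0, 255, 255],
              [0, 0, 255, 255]]
checker = [[0, 255, 0, 255],
           [255, 0, 255, 0]]

# Different structure, identical histogram: four pixels at 0, four at 255.
print(histogram(left_right) == histogram(checker), left_right == checker)
```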

Under-exposed

Spike at low values

Over-exposed

Spike at high values

Pattern	Cause	Effect
Spike at 0/255	Severe under/over-exposure clipping	Irreversible detail loss
Gaps between bins	Contrast stretch	Posterization/banding
Spikes at intervals	Contrast compression	Flat washed look
Few occupied bins	GIF/heavy quantization	Posterized bands
Altered frequency pattern	JPEG (DCT quantization)	Block/ringing artifacts

Per-channel histogram limitation: marginal R/G/B histograms can be identical for very different-colored images. Use 2D/joint histograms to capture channel relationships.

HDR vs LDR

LDR forces a compromise (clip highlights OR crush shadows). HDR = bracket exposures, merge, then tone-map. Capture in HDR and reduce the dynamic range later: clipped/saturated data can never be recovered afterward.

RGB → HSV/HSL conversion formulas

\[V_{max}=\max(R,G,B), \quad V_{min}=\min(R,G,B), \quad \text{diff}=V_{max}-V_{min}\] \[S_{HSV} = \frac{\text{diff}}{V_{max}} \ \ (\text{0 if } V_{max}=0)\] \[L_{HSL} = \frac{V_{max}+V_{min}}{2}\] \[S_{HSL} = \begin{cases}\dfrac{\text{diff}}{V_{max}+V_{min}} & L<0.5\\[4pt]\dfrac{\text{diff}}{2-(V_{max}+V_{min})} & L\ge 0.5\end{cases}\] \[H = \begin{cases}60\cdot\dfrac{G-B}{\text{diff}} & V_{max}=R\\[4pt]120+60\cdot\dfrac{B-R}{\text{diff}} & V_{max}=G\\[4pt]240+60\cdot\dfrac{R-G}{\text{diff}} & V_{max}=B\end{cases}\]

R, G, B are normalized to [0,1] and H is in degrees. The HSV value is $V = V_{max}$. If diff = 0 (a grey pixel) the hue is undefined and both saturations are 0; if H comes out negative, add 360°.
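The conversion formulas translate almost line by line into code. This is a sketch assuming inputs in [0,1]; setting the hue to 0 for grey pixels is a convention chosen here, not something the formulas specify.

```python
def rgb_to_hsv_hsl(r, g, b):
    """Return (H in degrees, S_hsv, V, S_hsl, L) for r, g, b in [0, 1]."""
    v_max, v_min = max(r, g, b), min(r, g, b)
    diff = v_max - v_min
    s_hsv = 0.0 if v_max == 0 else diff / v_max
    lightness = (v_max + v_min) / 2.0
    if diff == 0:
        return 0.0, s_hsv, v_max, 0.0, lightness   # grey: hue set to 0 by convention
    if lightness < 0.5:
        s_hsl = diff / (v_max + v_min)
    else:
        s_hsl = diff / (2.0 - (v_max + v_min))
    if v_max == r:
        hue = 60.0 * (g - b) / diff
    elif v_max == g:
        hue = 120.0 + 60.0 * (b - r) / diff
    else:
        hue = 240.0 + 60.0 * (r - g) / diff
    return hue % 360.0, s_hsv, v_max, s_hsl, lightness   # wrap negative hues

print(rgb_to_hsv_hsl(1.0, 0.0, 0.0))   # pure red
print(rgb_to_hsv_hsl(1.0, 0.0, 1.0))   # magenta: raw hue is -60, wrapped to 300
```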

Lecture 4 — Image Enhancement

General point operation

\[g(x) = T(f(x)) \qquad \text{or} \qquad I'(u,v) = f\big(I(u,v)\big)\]

Output depends ONLY on input value at that pixel — spatially invariant, no neighbors. Contrast with neighborhood operations: $I'(u,v) = f(I(u,v), N(p))$.

All point operations — reference table

Operation	Formula	Histogram effect	Reversible?
Add k	clamp(I+k)	shift right	yes if no clamp
Subtract k	clamp(I−k)	shift left	yes if no clamp
Multiply α>1	clamp(αI)	stretch+clip	no (clamping)
Multiply 0<α<1	clamp(αI)	compress	yes if no clamp
Invert	255−I	mirror at 127	yes
Threshold	a₀ or a₁	two spikes	no
Quantize	⌊I/Δ⌋·Δ	bin merging	no

\[\text{clamp}(v) = \max(0, \min(255, v))\]

Clamping destroys info irreversibly — collapses many out-of-range inputs into one boundary value.
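A short illustration of why clamping is irreversible, using a made-up row of four pixels: adding 60 and then subtracting 60 does not restore the values that hit the upper bound.

```python
def clamp(v):
    """Limit a value to the 8-bit range [0, 255]."""
    return max(0, min(255, v))

row = [10, 120, 200, 250]
brighter = [clamp(p + 60) for p in row]        # 200 and 250 both saturate
restored = [clamp(p - 60) for p in brighter]   # attempt to undo the shift

print(brighter)
print(restored)   # the two saturated pixels come back as the same value
```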

Image differencing — motion detection

\[D(u,v) = I_1(u,v) - I_2(u,v)\] \[|D_t(u,v)| = \left| I_t(u,v) - I_{t-1}(u,v) \right|\] \[M_t(u,v) = \begin{cases}1 & |D_t(u,v)| \ge T\\0 & \text{otherwise}\end{cases}\]

Use absolute difference — plain clamped subtraction hides darkening (negative → 0); $|D|$ treats brightening and darkening equally as "change."
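The difference between the two approaches shows up on a frame pair where one pixel brightens and another darkens. The frames and threshold below are made-up values; both function names are illustrative.

```python
def abs_diff_mask(prev, curr, threshold):
    """Motion mask from the absolute frame difference."""
    return [[1 if abs(c - p) >= threshold else 0 for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]

def clamped_diff_mask(prev, curr, threshold):
    """Motion mask from plain subtraction clamped at 0 (negative changes vanish)."""
    return [[1 if max(0, c - p) >= threshold else 0 for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]

prev = [[50, 50, 50], [50, 200, 50]]
curr = [[50, 120, 50], [50, 90, 50]]   # one pixel brightens, one darkens

print(abs_diff_mask(prev, curr, 30))       # both changes detected
print(clamped_diff_mask(prev, curr, 30))   # the darkening pixel is missed
```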

Quantization

\[\Delta = \frac{256}{L} \qquad I'(u,v) = \left\lfloor \frac{I(u,v)}{\Delta} \right\rfloor \cdot \Delta\]
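A worked instance of the quantization formula for 8-bit input. The sketch assumes the number of levels divides 256 so that the step $\Delta$ is an integer.

```python
def quantize(value, levels):
    """Quantize an 8-bit value to `levels` bins with step = 256 / levels."""
    if 256 % levels != 0:
        raise ValueError("this sketch assumes levels divides 256")
    step = 256 // levels
    return (value // step) * step   # floor(I / step) * step

# With 4 levels the step is 64, so outputs are 0, 64, 128 or 192.
print([quantize(v, 4) for v in (0, 63, 64, 200, 255)])
```

Every input in the same bin maps to the bin's lower edge, which is why the operation cannot be undone.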

Contrast stretching

Simple: stretch actual min/max to [0,255]. A single outlier pixel can set the min or max, so the rest of the image stays flat.

Robust (percentile clipping): use percentiles $q_{low}, q_{high}$ instead of absolute extremes.

\[a' = \left[\frac{a'_{high}-a'_{low}}{a_{high}-a_{low}}\right]\cdot(a-a_{low}) + a'_{low}\]
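A sketch comparing simple and percentile-based stretching on a flat list of pixels. The rule used to pick the percentile values (index `int(q * (n - 1))` into the sorted pixels) is an assumption made for this example; other definitions of a percentile exist.

```python
def stretch(pixels, q_low=0.0, q_high=1.0, out_low=0, out_high=255):
    """Linear contrast stretch between two quantiles, clamped to the output range."""
    ordered = sorted(pixels)
    n = len(ordered)
    a_low = ordered[int(q_low * (n - 1))]
    a_high = ordered[int(q_high * (n - 1))]
    if a_high == a_low:
        return [out_low] * n   # flat input: nothing to stretch
    scale = (out_high - out_low) / (a_high - a_low)
    return [max(out_low, min(out_high, round(scale * (a - a_low) + out_low)))
            for a in pixels]

pixels = [50, 60, 70, 80, 90, 100, 110, 120, 130, 255]   # 255 is a lone outlier
simple = stretch(pixels)                # uses the true min and max
robust = stretch(pixels, q_high=0.9)    # ignores the top 10 percent

print(simple)
print(robust)
```

With the outlier setting the maximum, the simple stretch barely spreads the bulk of the values; the robust version uses the full output range and lets the outlier saturate.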

Histogram equalization

\[H(a) = \sum_{i=0}^{a} h(i), \qquad \text{CDF}(a) = \frac{H(a)}{M\cdot N}\] \[a' = \left\lfloor (K-1)\cdot \text{CDF}(a) \right\rfloor\]

Steep CDF (spike regions) → expanded contrast. Flat CDF → compressed. Limitation: can over-amplify noise in flat regions (skies); global, ignores local structure (→ motivates CLAHE).
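The equalization mapping can be built as a lookup table from the cumulative histogram. This sketch works on a flat list of pixel values and uses integer division, which gives the same result as $\lfloor (K-1)\cdot\text{CDF}(a) \rfloor$ without floating-point error.

```python
def equalize(pixels, levels=256):
    """Histogram equalization: a' = floor((K - 1) * CDF(a))."""
    hist = [0] * levels
    for a in pixels:
        hist[a] += 1
    total = len(pixels)
    lut, cumulative = [], 0
    for count in hist:
        cumulative += count                               # H(a)
        lut.append((levels - 1) * cumulative // total)    # floor((K-1) * H(a) / (M*N))
    return [lut[a] for a in pixels]

# Four pixels crowded into the narrow range 100..102 get spread out.
print(equalize([100, 100, 101, 102]))
```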

Histogram matching (specification)

\[a' = P_R^{-1}\big(P_A(a)\big)\]

Maps original CDF $P_A$ to reference CDF $P_R$. Preserves relative pixel rank, transfers tonal "style".
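For discrete histograms the inverse $P_R^{-1}$ is not unique, so an implementation has to pick a rule. The sketch below assumes the common choice of the smallest reference level whose CDF reaches $P_A(a)$; function names are illustrative.

```python
def cumulative_counts(pixels, levels):
    """Cumulative histogram H(a) as integer counts."""
    counts = [0] * levels
    for a in pixels:
        counts[a] += 1
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running)
    return out

def match_histogram(pixels, reference, levels=256):
    """Map each value a to the smallest a' with P_R(a') >= P_A(a)."""
    cum_a = cumulative_counts(pixels, levels)
    cum_r = cumulative_counts(reference, levels)
    n_a, n_r = len(pixels), len(reference)
    lut, j = [], 0
    for a in range(levels):
        # Cross-multiplying compares the two CDF fractions exactly.
        while j < levels - 1 and cum_r[j] * n_a < cum_a[a] * n_r:
            j += 1
        lut.append(j)
    return [lut[a] for a in pixels]

dark = [0, 0, 1, 1]
bright_reference = [2, 2, 3, 3]
print(match_histogram(dark, bright_reference, levels=4))
```

The lookup table is non-decreasing, which is what preserves the relative rank of pixels.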

Global thresholding

\[I'(u,v) = \begin{cases}a_0 & I(u,v) < a_{th}\\a_1 & I(u,v) \ge a_{th}\end{cases}\]

Alpha blending

\[I_{blend}(u,v) = \alpha \cdot I_{left}(u,v) + (1-\alpha)\cdot I_{right}(u,v)\]

Small feathering window → preserves detail, risk of visible seam. Large window → smooth, risk of ghosting (both images visible).
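A one-dimensional sketch of feathered blending along a single image row. The linear ramp for $\alpha$ and the parameters `seam` and `window` are assumptions for this example; the lecture formula only fixes how $\alpha$ is used, not how it varies.

```python
def feather_blend(left, right, seam, window):
    """Blend two equal-length rows; alpha ramps from 1 to 0 across `window`
    pixels centered on column `seam` (window must be positive)."""
    start = seam - window / 2.0
    out = []
    for x, (l, r) in enumerate(zip(left, right)):
        alpha = min(1.0, max(0.0, 1.0 - (x + 0.5 - start) / window))
        out.append(alpha * l + (1.0 - alpha) * r)
    return out

left = [100] * 8
right = [200] * 8
print(feather_blend(left, right, seam=4, window=4))   # gradual transition
print(feather_blend(left, right, seam=4, window=1))   # near-hard seam
```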

Gamma correction

\[\gamma \equiv \frac{\Delta D}{\Delta B} \qquad\qquad b = f_\gamma(a) = a^\gamma,\ a\in[0,1]\]

$\gamma<1$: expands/brightens shadows. $\gamma>1$: compresses/darkens shadows. $\gamma=1$: identity. Historical origin: CRT non-linearity compensation for TV broadcast.
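A small check of the three cases on an 8-bit value. The sketch normalizes to [0,1], applies $b = a^\gamma$, and rescales; rounding to the nearest integer is a choice made here.

```python
def gamma_correct(value, gamma):
    """Apply b = a ** gamma to an 8-bit value after normalizing to [0, 1]."""
    a = value / 255.0
    return round(255.0 * a ** gamma)

dark_pixel = 64
print(gamma_correct(dark_pixel, 0.5))   # gamma < 1 brightens shadows
print(gamma_correct(dark_pixel, 1.0))   # identity
print(gamma_correct(dark_pixel, 2.0))   # gamma > 1 darkens shadows
```

The end points 0 and 255 are fixed for any $\gamma > 0$, so only the mid-tones move.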

Lecture 5 — Image Filtering

LSI filtering model

\[g[m,n] = \sum_{k,l} h[k,l]\cdot f[m+k, n+l]\]

Linear (output = weighted sum of neighbors) + Shift-invariant (same kernel at every pixel). Foundation of CNN convolutional layers.

Cross-correlation vs Convolution — THE only difference

\[\text{Correlation: } G[i,j] = \sum_u\sum_v H[u,v]\cdot F[i+u,j+v] \qquad G = H \otimes F\] \[\text{Convolution: } G[i,j] = \sum_u\sum_v H[u,v]\cdot F[i-u,j-v] \qquad G = H \star F\]

Convolution flips the kernel 180° before sliding. Identical results for symmetric kernels (Gaussian, box, Laplacian); differ for asymmetric kernels (Sobel).
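The flip can be seen directly on a step edge. This sketch computes only the "valid" output region and indexes the kernel from its top-left corner rather than its center, which shifts the output but does not change the comparison.

```python
def correlate2d(image, kernel):
    """Valid-region cross-correlation: G[i][j] = sum of H[u][v] * F[i+u][j+v]."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]

def flip180(kernel):
    """Rotate a kernel by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in kernel[::-1]]

def convolve2d(image, kernel):
    """Convolution = correlation with the flipped kernel."""
    return correlate2d(image, flip180(kernel))

step_edge = [[0, 0, 10, 10]] * 3
sobel_x = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # asymmetric
box = [[1, 1, 1]] * 3                            # symmetric

print(correlate2d(step_edge, sobel_x), convolve2d(step_edge, sobel_x))   # opposite signs
print(correlate2d(step_edge, box) == convolve2d(step_edge, box))         # identical
```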

Cross-correlation properties

NOT associative

NOT commutative

Convolution properties

Commutative: f⋆h=h⋆f

Associative: (f⋆h₁)⋆h₂=f⋆(h₁⋆h₂)

Separable filters

\[H = v\cdot w^T \qquad G = H\star F = v \star (w^T \star F)\] \[\text{Non-separable: } O(M^2 N^2) \qquad \text{Separable: } O(M^2 \cdot 2N)\]

Gaussian is separable → big speedup for large kernels (e.g. N=15: 225→30 mult/pixel, 7.5× speedup).
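Separability can be verified on a small example. The sketch uses the 3-tap binomial kernel [1, 2, 1] as a stand-in for a Gaussian and a made-up 5×5 image; since the kernels are symmetric, correlation and convolution coincide.

```python
def correlate2d(image, kernel):
    """Valid-region filtering with a kernel given as a list of rows."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]

v = [[1], [2], [1]]    # column vector (3x1)
w_t = [[1, 2, 1]]      # row vector (1x3)
full = [[a[0] * b for b in w_t[0]] for a in v]   # H = v * w^T, a 3x3 kernel

image = [[1, 2, 3, 4, 5],
         [5, 4, 3, 2, 1],
         [0, 1, 0, 1, 0],
         [2, 2, 2, 2, 2],
         [9, 8, 7, 6, 5]]

direct = correlate2d(image, full)                      # 9 multiplications per pixel
two_pass = correlate2d(correlate2d(image, w_t), v)     # 3 + 3 multiplications per pixel
print(direct == two_pass)

n = 15
print(n * n, 2 * n, n * n / (2 * n))   # cost per pixel for a 15x15 kernel
```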

Boundary handling

Strategy	Behavior
Zero padding	Dark border artifacts
Clamp/replicate	Good default, avoids dark border
Mirror/reflect	Smooth, preferred for derivative filters
Wrap/toroidal	Periodic, used in Fourier domain

Box vs Gaussian filter

Box filter

Equal weights

Blocky artifacts, sharp freq cutoff

Gaussian filter

Center-weighted decay

Separable, smooth blur, no blocky artifacts

Non-linear filters

Median filter: sorts neighborhood, takes middle. Best for salt-and-pepper/impulse noise — rejects outliers, preserves edges (strictly non-linear). Dilation (max filter): grows bright regions. Erosion (min filter): shrinks bright regions.
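A one-dimensional sketch comparing a median filter with a mean (box) filter on a signal that contains one impulse and one genuine step edge. Border samples are replicated, which is a choice made for this example.

```python
def window_at(signal, i, radius):
    """Neighborhood around index i, replicating the border samples."""
    n = len(signal)
    return [signal[min(n - 1, max(0, i + k))] for k in range(-radius, radius + 1)]

def median_filter_1d(signal, radius=1):
    return [sorted(window_at(signal, i, radius))[radius] for i in range(len(signal))]

def mean_filter_1d(signal, radius=1):
    return [sum(window_at(signal, i, radius)) / (2 * radius + 1)
            for i in range(len(signal))]

# One impulse (255) in a flat region, followed by a real edge from 10 to 200.
signal = [10, 10, 10, 255, 10, 10, 200, 200, 200]

print(median_filter_1d(signal))   # impulse removed, edge still sharp
print(mean_filter_1d(signal))     # impulse smeared, edge blurred
```

Replacing `sorted(...)[radius]` with `max(...)` or `min(...)` turns the same loop into dilation or erosion.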

Edge detection — derivatives

\[f'(x) = \frac{f(x+1) - f(x-1)}{2}\]

Derivatives amplify noise → always smooth (Gaussian) before differentiating.

Sobel — direction of detection vs direction of edge (critical)

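\[G_x = \begin{bmatrix}+1 & 0 & -1\\+2 & 0 & -2\\+1 & 0 & -1\end{bmatrix}\ (\text{detects VERTICAL edges}) \qquad G_y = \begin{bmatrix}-1 & -2 & -1\\0 & 0 & 0\\+1 & +2 & +1\end{bmatrix}\ (\text{detects HORIZONTAL edges})\]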

\[|G| = \sqrt{G_x^2 + G_y^2} \qquad \theta = \tan^{-1}\!\left(\frac{G_y}{G_x}\right)\]
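The direction rule can be checked by applying both kernels at one pixel of a synthetic edge. The kernels are applied here as correlation (no flip), and `atan2` is used in place of a plain arctangent so that $G_x = 0$ does not cause a division by zero; both are choices made for this sketch.

```python
import math

GX = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def response(image, kernel, i, j):
    """Correlate a 3x3 kernel with the neighborhood centered at (i, j)."""
    return sum(kernel[u][v] * image[i - 1 + u][j - 1 + v]
               for u in range(3) for v in range(3))

vertical_edge = [[0, 0, 100, 100]] * 4                 # intensity changes along x
horizontal_edge = [[0] * 4, [0] * 4, [100] * 4, [100] * 4]   # changes along y

for name, img in (("vertical", vertical_edge), ("horizontal", horizontal_edge)):
    gx, gy = response(img, GX, 1, 1), response(img, GY, 1, 1)
    magnitude = math.hypot(gx, gy)
    theta = math.degrees(math.atan2(gy, gx))
    print(name, gx, gy, magnitude, theta)
```

Only $G_x$ responds to the vertical edge and only $G_y$ to the horizontal one, matching the labels on the kernels.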

DoG vs LoG

Derivative of Gaussian (DoG)

1st derivative, directional

$\frac{\partial}{\partial x}(G\star I)=(\frac{\partial G}{\partial x})\star I$

Laplacian of Gaussian (LoG)

2nd derivative, isotropic

$\nabla^2(G\star I)=(\nabla^2 G)\star I$

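Naming caution: "DoG" above abbreviates Derivative of Gaussian, while in the formula below it abbreviates Difference of Gaussians, a band-pass filter that approximates the LoG up to a scale factor.

\[\text{DoG} = G_{\sigma_1} - G_{\sigma_2} \approx \text{LoG} \quad (\text{faster; used in SIFT})\]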

Lecture 6 — Image Analysis

Why image pyramids

Multi-scale object search — same object appears at different scales depending on distance. Pyramid organizes representation scale-by-scale.

Reduce vs Expand

Reduce (up the pyramid)

1. Gaussian smooth

2. Subsample by 2

Must smooth BEFORE subsampling (anti-aliasing)

Expand (down the pyramid)

1. Double dimensions

2. Insert zeros

3. Interpolating low-pass filter

Storage: each level = 1/4 pixels of level below. Full pyramid = 4/3× original storage (geometric series $1+\frac14+\frac1{16}+\dots=\frac43$).

Gaussian vs Laplacian pyramid

Gaussian pyramid

Smoothed + downsampled

Lossy — no exact reconstruction

Laplacian pyramid

Stores difference between levels

Lossless — exact reconstruction

Laplacian pyramid build & reconstruct

\[L_l = G_l - \text{Expand}(G_{l+1})\] \[G_l = \text{Expand}(G_{l+1}) + L_l\]

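Full representation: $\{L_0, L_1, \dots, L_{n-1}, G_n\}$. Reconstruct coarse-to-fine (down the pyramid): start from the smallest level $G_n$ and apply $G_l = \text{Expand}(G_{l+1}) + L_l$ for $l = n-1, \dots, 0$ to recover $G_0$ exactly.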
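A one-dimensional sketch of building and inverting a Laplacian pyramid. The [1/4, 1/2, 1/4] blur, the linear-interpolation Expand, and the border handling are assumptions for this example. Exact rational arithmetic (`fractions.Fraction`) is used so the round trip can be compared with `==`; with floats it would hold up to rounding error.

```python
from fractions import Fraction

def reduce_(signal):
    """Blur with [1/4, 1/2, 1/4] (replicated border), then keep every other sample."""
    n = len(signal)
    blurred = [(signal[max(0, i - 1)] + 2 * signal[i] + signal[min(n - 1, i + 1)]) / 4
               for i in range(n)]
    return blurred[::2]

def expand(signal):
    """Double the length by linear interpolation (in the interior this equals
    zero insertion followed by a [1/2, 1, 1/2] filter)."""
    out = []
    for i, s in enumerate(signal):
        nxt = signal[min(i + 1, len(signal) - 1)]
        out += [s, (s + nxt) / 2]
    return out

def build_laplacian(signal, levels):
    gaussian = [[Fraction(s) for s in signal]]
    for _ in range(levels):
        gaussian.append(reduce_(gaussian[-1]))
    laplacian = [[g - e for g, e in zip(gaussian[l], expand(gaussian[l + 1]))]
                 for l in range(levels)]            # L_l = G_l - Expand(G_{l+1})
    return laplacian, gaussian[-1]

def reconstruct(laplacian, top):
    current = top
    for band in reversed(laplacian):                # coarse to fine
        current = [e + l for e, l in zip(expand(current), band)]   # G_l = Expand(G_{l+1}) + L_l
    return current

signal = [3, 7, 1, 8, 2, 9, 4, 6]
bands, top = build_laplacian(signal, levels=2)
print(reconstruct(bands, top) == signal)
```

Reconstruction is exact for any choice of Expand, because the same Expand output that was subtracted when building each band is added back.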

Laplacian pyramid blending (apple/orange)

1) Build Laplacian pyramids of both images. 2) Build Gaussian pyramid of binary mask (smoother/wider transition at coarse levels). 3) Blend per level: $B_l = mask_l\cdot LA_l + (1-mask_l)\cdot LB_l$. 4) Reconstruct coarse-to-fine, from the smallest level down to full resolution. Low-freq blended wide, high-freq blended narrow → no ghosting, no harsh seam.

Aliasing & Nyquist

\[f_{sampling} \ge 2\times f_{max\_signal}\]

Need ≥2 samples per cycle. Violate it → high frequencies fold back as false low-frequency patterns (jagged edges, moiré, distorted textures — zebra-stripe example). Fix: low-pass filter BEFORE subsampling (exactly the Gaussian pyramid Reduce step).
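Aliasing can be reproduced with a single cosine. In this sketch the frequencies and sampling rate are made-up values: a 9 Hz signal sampled at 10 Hz (below the required 18 Hz) yields exactly the samples of a 1 Hz signal.

```python
import math

def sample_cosine(freq_hz, rate_hz, count):
    """Samples of cos(2*pi*f*t) taken at t = n / rate."""
    return [math.cos(2.0 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

undersampled = sample_cosine(9.0, 10.0, 20)   # violates rate >= 2 * 9 Hz
genuine_low = sample_cosine(1.0, 10.0, 20)    # a real 1 Hz signal, properly sampled

# After sampling, the two are indistinguishable: 9 Hz has folded back to 1 Hz.
max_gap = max(abs(a - b) for a, b in zip(undersampled, genuine_low))
print(max_gap < 1e-9)
```

Once the samples are taken there is no way to tell which signal produced them, which is why the low-pass filter has to come before subsampling.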

Frequency domain

\[f(x) = A\cdot\sin(\omega x + \varphi)\]

Low freq = smooth regions, center of spectrum. High freq = edges/textures/noise, periphery of spectrum.

Low-pass

Keeps low freq

Smoothing (Gaussian)

High-pass

Keeps high freq

Edge enhancement (Sobel/Laplacian)

Band-pass

Keeps a range

Texture/blob (DoG = Difference of Gaussians)

Convolution theorem

\[f \star g \ \longleftrightarrow\ F(u,v)\cdot G(u,v)\]

Spatial convolution = frequency-domain multiplication. Steps: FFT(image) → multiply by FFT(kernel) → IFFT. Faster than spatial convolution for large kernels.
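The theorem can be verified on short 1D signals. This sketch uses a naive $O(N^2)$ DFT from the standard library for clarity, so it shows the equivalence but not the speedup, which comes from using an FFT. Note that the DFT version corresponds to circular convolution; the inputs are zero-padded here so that nothing wraps around.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out

def circular_convolve(f, g):
    """Direct spatial-domain circular convolution."""
    n = len(f)
    return [sum(f[k] * g[(m - k) % n] for k in range(n)) for m in range(n)]

f = [1, 2, 3, 4, 0, 0, 0, 0]   # signal, zero-padded
g = [1, 1, 1, 0, 0, 0, 0, 0]   # kernel, zero-padded

spatial = circular_convolve(f, g)
product = [a * b for a, b in zip(dft(f), dft(g))]          # multiply spectra
via_frequency = [v.real for v in dft(product, inverse=True)]

print(spatial)
print([round(v, 6) for v in via_frequency])
```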

Master Formula Sheet

Lecture 2 — Image Formation

\[I(x,y) = R(x,y)\cdot L(x,y)\] \[u = f\cdot\frac{X}{Z}, \qquad v = f\cdot\frac{Y}{Z}\] \[m = \frac{y'}{y} = \frac{f}{Z}\] \[\text{FOV} = 2\cdot\arctan\!\left(\frac{w}{2f}\right)\] \[r_d = r\cdot(1+k_1 r^2+k_2 r^4+k_3 r^6+\dots)\]

Lecture 3 — Digital Images / Color

\[V_{max}=\max(R,G,B),\quad V_{min}=\min(R,G,B),\quad L=\frac{V_{max}+V_{min}}{2}\] \[S_{HSV} = \frac{V_{max}-V_{min}}{V_{max}}\] \[S_{HSL} = \frac{V_{max}-V_{min}}{V_{max}+V_{min}}\ (L<0.5) \quad\text{or}\quad \frac{V_{max}-V_{min}}{2-(V_{max}+V_{min})}\ (L\ge 0.5)\] \[H=60\cdot\frac{G-B}{V_{max}-V_{min}}\ (V_{max}=R)\]

Lecture 4 — Enhancement

\[g(x)=T(f(x))\] \[I'(u,v)=I(u,v)+k\] \[\left|D_t(u,v)\right|=\left|I_t(u,v)-I_{t-1}(u,v)\right|\] \[M_t(u,v)=1\ (|D_t|\ge T),\ \text{else } 0\] \[I'(u,v)=\text{clamp}(\alpha\cdot I(u,v))\] \[I'(u,v)=\left\lfloor\frac{I(u,v)}{\Delta}\right\rfloor\cdot\Delta, \quad \Delta=\frac{256}{L}\] \[s=255-r\] \[a' = \left[\frac{a'_{high}-a'_{low}}{a_{high}-a_{low}}\right](a-a_{low})+a'_{low}\] \[a'=\left\lfloor(K-1)\cdot\text{CDF}(a)\right\rfloor\] \[a'=P_R^{-1}(P_A(a))\] \[I'(u,v)=a_0\ (I(u,v)<a_{th}),\quad a_1\ (I(u,v)\ge a_{th})\]

Lecture 5 — Filtering

\[g[m,n]=\sum_{k,l}h[k,l]\cdot f[m+k,n+l]\] \[G=H\otimes F \quad (\text{correlation}) \qquad G=H\star F \quad (\text{convolution})\] \[(f\star h_1)\star h_2 = f\star(h_1\star h_2)\] \[f'(x)=\frac{f(x+1)-f(x-1)}{2}\] \[\frac{\partial}{\partial x}(G\star I)=\left(\frac{\partial G}{\partial x}\right)\star I\] \[\nabla^2(G\star I)=(\nabla^2 G)\star I\] \[\text{DoG (difference of Gaussians)}=G_{\sigma_1}-G_{\sigma_2}\] \[|G|=\sqrt{G_x^2+G_y^2}\]

Lecture 6 — Analysis

\[L_l = G_l - \text{Expand}(G_{l+1})\] \[G_l = \text{Expand}(G_{l+1}) + L_l\] \[f(x) = A\cdot\sin(\omega x+\varphi)\] \[f_{sampling} \ge 2\cdot f_{max\_signal}\]

Key Comparisons (exam favorites)

Pair	Quick answer
Image processing vs computer vision	pixel transform (image out) vs semantic interpretation (decision out)
Geometry vs photometry	where a point lands vs how bright/what color
Reflectance vs illumination	intrinsic material property vs external light source
RAW vs final RGB	mosaiced/linear/12-bit vs full-color/gamma/8-bit
Sampling vs quantization	spatial discretization (resolution) vs intensity discretization (bit depth)
Point ops vs neighborhood ops	own pixel only vs local region
Box filter vs Gaussian filter	uniform weights/blocky vs center-weighted/smooth
Cross-correlation vs convolution	kernel as-is vs kernel flipped 180°
Smoothing vs edge-detection filters	low-pass (Gaussian) vs high-pass/derivative (Sobel, Laplacian)
Gaussian pyramid vs Laplacian pyramid	lossy blur+downsample vs lossless residual storage
Low-pass vs high-pass filtering	keeps center/smooth vs keeps periphery/edges

Answering Strategy

Conceptual questions

Define the concept → explain why it's needed → connect to an image example.

Numerical/operation questions

1. Identify input pixel values/region/kernel. 2. Apply the operation. 3. Check if clamping/normalization/thresholding is needed. 4. Interpret result in words (brighter, darker, smoother, sharper, saturated, detected as foreground, etc.) — a correct number alone is not enough.

Connect everything: image formation (why pixel values exist) → digital representation (how continuous becomes discrete) → histograms (describe exposure/contrast) → point operations (independent pixel transforms) → filtering (neighborhood-based: smooth/detect edges) → pyramids/frequency (multi-scale, multi-frequency analysis).