ICS604 Final — Full Study Guide

Final Exam Guide — Full Consolidated Study Page (Lectures 7–12 + MCQ Revision) · Dr. Maggie Shammaa

Lecture 7 — Feature Detection

Why features, not full images

Compact, distinctive structures enable matching/recognition/reconstruction without comparing every pixel. Four common feature types: edges (1D, strong intensity change, Sobel/Canny/LoG), corners (multi-directional intensity change — strongest interest points), blobs (homogeneous regions distinct from surroundings, DoG/LoG), ridges (curves at local extrema — road networks, vessels).

Why corners > edges

Edges suffer the aperture problem: ambiguous along the edge direction itself. Corners are fully localized in both directions.

The three region types

Region	SSD pattern	Min SSD	Status
Flat	small in all directions	small	not interest point
Edge	large ⊥, small ∥ to edge	small	not a corner
Corner	large in all directions	large	interest point

Moravec detector

\[E_{m,n}(x,y) = \sum_{(u,v)\in W} \big[I(m+u,n+v) - I(m+x+u,n+y+v)\big]^2\] \[F_{m,n} = \min_{(x,y)\in D} E_{m,n}(x,y), \qquad D = \{(1,0),(-1,0),(0,1),(0,-1)\}\] \[\text{Corner at } (m,n) \iff F_{m,n}\text{ is a local max AND } F_{m,n} > T\]

Algorithm: place window → shift in 4 directions → compute SSD each → take minimum → mark corner if locally maximal & above threshold.
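A minimal pure-Python sketch of the steps above, assuming a grayscale image stored as a list of rows and a 3×3 window. The helper names (`moravec_response`, `make_image`) and the threshold `T = 1000` are illustrative assumptions, not from the lecture, and the local-maximum check is left out for brevity.

```python
def moravec_response(img, m, n, half=1):
    """Return F = min SSD over the four unit shifts for the window centred at (m, n)."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # the shift set D
    ssds = []
    for x, y in shifts:
        ssd = 0
        for u in range(-half, half + 1):
            for v in range(-half, half + 1):
                diff = img[m + u][n + v] - img[m + x + u][n + y + v]
                ssd += diff * diff
        ssds.append(ssd)
    return min(ssds)


def make_image(size, is_object):
    """Build a size x size image: 0 where is_object(row, col) holds, else 128."""
    return [[0 if is_object(r, c) else 128 for c in range(size)] for r in range(size)]


flat = make_image(7, lambda r, c: False)
edge = make_image(7, lambda r, c: c <= 3)                # vertical edge
corner = make_image(7, lambda r, c: r <= 3 and c <= 3)   # object fills the top-left block

T = 1000  # hypothetical threshold
for name, img in [("flat", flat), ("edge", edge), ("corner", corner)]:
    F = moravec_response(img, 3, 3)
    print(name, F, "corner candidate" if F > T else "rejected")
```

On these toy images the flat and edge windows give F = 0 and the corner window gives F = 32768, so only the corner survives the threshold.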

Worked example — corner-like boundary

Background=128, object=0. Shift up: SSD=(0−128)²+(128−0)²=32768 (large). Shift right: SSD=16384 (large). Only when ALL 4 shift directions give large SSD does the minimum stay large → corner.

Moravec limitations → motivates Harris

Limitation	Cause	Harris fix
Not rotation invariant	only 4 discrete shifts	gradient structure tensor (continuous)
Sensitive to noise	raw intensity comparison	Gaussian-weighted window
Edges as corners	no shift runs along a diagonal edge, so the min over 4 directions stays large	eigenvalue analysis
Poor localization	uniform window weighting	center-weighted Gaussian
Anisotropic response	depends on chosen shift set	continuous gradient formulation

Lecture 8 — Harris Corner Detection

Structure tensor (covariance matrix)

\[M = \sum_{x,y} w(x,y)\begin{bmatrix}I_x^2 & I_x I_y\\ I_x I_y & I_y^2\end{bmatrix}\]

$I_x,I_y$ = image gradients, $w$ = Gaussian weighting window (center pixels weighted more → better localization, less noise sensitivity).

Taylor expansion derivation

\[I(x+u,y+v) \approx I(x,y) + u I_x + v I_y\] \[E(u,v) = \sum_{x,y} w(x,y)\big[I(x+u,y+v)-I(x,y)\big]^2 \approx \sum_{x,y} w(x,y)\,(uI_x+vI_y)^2 = [u\ v]\, M \,[u\ v]^T\]

Harris response

\[R = \det(M) - \alpha \cdot \text{trace}(M)^2, \qquad \alpha \in [0.04, 0.06]\] \[\det(M)=\lambda_1\lambda_2 \qquad \text{trace}(M)=\lambda_1+\lambda_2\] \[\lambda = \frac{\text{trace}(M) \pm \sqrt{\text{trace}(M)^2 - 4\det(M)}}{2}\]

Eigenvalue / R classification

Eigenvalues	R	Region
both ≈ 0	R ≈ 0	Flat
one large, one ≈0	R < 0	Edge
both large	R ≫ 0	Corner

α too small → edges falsely flagged as corners. α too large → true corners missed. Standard α=0.04.

Worked example — Harris R at pixel (2,2)

$\Sigma I_x^2=403,\ \Sigma I_y^2=381,\ \Sigma I_xI_y=385$
$M=\begin{bmatrix}403&385\\385&381\end{bmatrix}$
$\det(M)=403\times381-385^2=5318$, $\text{trace}(M)=784$
$R = 5318 - 0.04\times784^2 = 5318-24586.24 = -19268.24$ → R<0 → EDGE.
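The arithmetic above can be reproduced with a short Python sketch. The function name `harris_from_sums` is an illustrative assumption; the eigenvalues use the closed-form expression from the Harris response section.

```python
import math


def harris_from_sums(sxx, syy, sxy, alpha=0.04):
    """Harris quantities from the three windowed gradient sums."""
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    r = det - alpha * trace ** 2
    root = math.sqrt(max(0.0, trace ** 2 - 4 * det))
    lam1 = (trace + root) / 2   # larger eigenvalue
    lam2 = (trace - root) / 2   # smaller eigenvalue
    return det, trace, r, lam1, lam2


det, trace, r, lam1, lam2 = harris_from_sums(403, 381, 385)
print("det =", det, "trace =", trace)
print("R =", round(r, 2))
print("eigenvalues ~", round(lam1, 1), round(lam2, 1))
```

The eigenvalues come out near 777.2 and 6.8: one large and one small, which agrees with the EDGE verdict from R < 0.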

Shi-Tomasi (alternative response)

\[R_{ShiTomasi} = \min(\lambda_1,\lambda_2)\]

No empirical α needed, better localization, used in "Good Features to Track" — optical flow/tracking.

Detector comparison

Property	Moravec	Harris	Shi-Tomasi
Directions	4 discrete	all continuous	all continuous
Window	uniform	Gaussian	Gaussian
Response	min(SSD)	det(M)−α·tr(M)²	min(λ₁,λ₂)
Free param	none	α	none
Rotation invariant	no	yes	yes
Use	historical	general corners	tracking, optical flow

Why no explicit eigenvalue decomposition: Harris uses only det(M) and trace(M), both directly computable from ΣI_x², ΣI_y², ΣI_xI_y — efficient for real-time.
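A hedged sketch comparing the two responses on three made-up structure tensors (flat, edge, corner). The tensor values, the threshold `t = 1000`, and the function names are assumptions chosen for illustration. Harris needs only det and trace; Shi-Tomasi needs the smaller eigenvalue, which for a 2×2 matrix still has a closed form.

```python
import math


def responses(sxx, syy, sxy, alpha=0.04):
    """Return (Harris R, Shi-Tomasi min eigenvalue) for one structure tensor."""
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    harris = det - alpha * trace * trace
    root = math.sqrt(max(0.0, trace * trace - 4 * det))
    shi_tomasi = (trace - root) / 2   # smaller eigenvalue
    return harris, shi_tomasi


def classify(harris, t=1000.0):
    """Map R to a region type; t is a hypothetical threshold."""
    if harris > t:
        return "corner"
    if harris < -t:
        return "edge"
    return "flat"


tensors = {"flat": (1, 1, 0), "edge": (1000, 1, 0), "corner": (1000, 900, 0)}
results = {name: responses(*m) for name, m in tensors.items()}
labels = {name: classify(res[0]) for name, res in results.items()}
for name in tensors:
    print(name, results[name], labels[name])
```

Note that the Shi-Tomasi value is the same (1.0) for the flat and edge tensors: it separates corners from everything else, while the sign of R also separates edges from flat regions.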

Lecture 9 — Feature Descriptors (SIFT)

Harris is NOT scale-invariant

Rotation invariant (eigenvalues unchanged under rotation) but window size is fixed — same corner looks different at different zoom. Motivates SIFT.

Scale-normalized LoG & DoG

\[\text{LoG}_{norm}(x,y,\sigma) = \sigma^2 \nabla^2 G(x,y,\sigma)\] \[D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma) \approx (k-1)\sigma^2 \nabla^2 G(x,y,\sigma) * I(x,y)\] \[k = 2^{1/S} \quad (S = \text{scale intervals per octave})\]

DoG is a fast approximation of LoG. Canonical SIFT S=3 → 6 Gaussian images, 5 DoG images per octave.
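The image counts follow directly from S. A small sketch, assuming a base scale `sigma0 = 1.6` (a commonly quoted SIFT default that is not stated in this guide) and an illustrative function name:

```python
def octave_levels(sigma0=1.6, S=3):
    """Scales of the Gaussian images in one octave and the DoG pairs built from them."""
    k = 2 ** (1 / S)
    gaussian_sigmas = [sigma0 * k ** i for i in range(S + 3)]   # S + 3 blurred images
    dog_pairs = list(zip(gaussian_sigmas[:-1], gaussian_sigmas[1:]))  # S + 2 differences
    return k, gaussian_sigmas, dog_pairs


k, sigmas, dogs = octave_levels()
print("k =", round(k, 4))
print("Gaussian images:", len(sigmas), "DoG images:", len(dogs))
print("sigmas:", [round(s, 3) for s in sigmas])
```

After S steps the scale has doubled (k^S = 2), which is where the next octave starts.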

SIFT 4-step pipeline

1. Scale-space extrema detection

Build Gaussian pyramid (octaves)

Compute DoG at each scale

Compare to 26 neighbors (8+9+9)

2. Keypoint localization & filtering

Fit 3D quadratic, subpixel refine

Reject low contrast (<0.03)

Reject edge-like via Hessian ratio (r_th=10)

\[x^* = -\left(\frac{\partial^2 D}{\partial x^2}\right)^{-1}\frac{\partial D}{\partial x}\] \[\text{Reject edge-like if } \frac{\text{Tr}(H)^2}{\det(H)} \ge \frac{(r_{th}+1)^2}{r_{th}}\]
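A sketch of the two filters and a one-dimensional version of the sub-pixel refinement. The function names and sample numbers are assumptions; the real refinement solves the same equation in 3D (x, y, σ) using finite differences on the DoG stack.

```python
def subpixel_offset_1d(d_prev, d_cur, d_next):
    """1-D analogue of x* = -(D'')^-1 D' using central finite differences."""
    first = (d_next - d_prev) / 2.0
    second = d_next - 2.0 * d_cur + d_prev
    return -first / second


def passes_contrast_test(d_value, threshold=0.03):
    """Keep the keypoint only if |D| at the refined location is large enough."""
    return abs(d_value) >= threshold


def passes_edge_test(dxx, dyy, dxy, r_th=10.0):
    """Keep the keypoint only if Tr(H)^2 / det(H) is below (r_th + 1)^2 / r_th."""
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:          # curvatures of opposite sign: not a stable extremum
        return False
    return tr * tr / det < (r_th + 1) ** 2 / r_th


# Samples of f(x) = -(x - 0.25)^2 at x = -1, 0, 1: the true peak sits at 0.25.
offset = subpixel_offset_1d(-1.5625, -0.0625, -0.5625)
blob_like = passes_edge_test(4.0, 4.0, 0.0)    # ratio 4.0
edge_like = passes_edge_test(20.0, 1.0, 0.0)   # ratio 22.05
print(offset, blob_like, edge_like)
```

With r_th = 10 the cutoff is 12.1, so the blob-like Hessian is kept and the edge-like one is rejected.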

3. Orientation assignment

Gradient mag + direction per pixel

36 bins (10° each), Gaussian-weighted vote

Dominant bin = orientation; ≥80% peak → extra keypoint

4. Descriptor (128-D)

16×16 window, rotated to orientation

4×4 cells × 8 orientation bins = 128

Interpolated votes reduce boundary effects

\[m(x,y)=\sqrt{(L(x{+}1,y){-}L(x{-}1,y))^2+(L(x,y{+}1){-}L(x,y{-}1))^2}\] \[\theta(x,y) = \text{atan2}\big(L(x,y{+}1){-}L(x,y{-}1),\ L(x{+}1,y){-}L(x{-}1,y)\big)\]
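A simplified sketch of the 36-bin orientation vote built from these two formulas. It assumes the smoothed image `L` is a list of rows indexed as `L[y][x]` and, unlike full SIFT, it skips the Gaussian weighting and vote interpolation. All names are illustrative.

```python
import math


def orientation_histogram(L, bins=36):
    """Accumulate gradient magnitude into orientation bins over the patch interior."""
    hist = [0.0] * bins
    height, width = len(L), len(L[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            dx = L[y][x + 1] - L[y][x - 1]
            dy = L[y + 1][x] - L[y - 1][x]
            magnitude = math.hypot(dx, dy)
            theta = math.degrees(math.atan2(dy, dx)) % 360.0
            hist[int(theta // (360.0 / bins)) % bins] += magnitude
    return hist


def dominant_bin(hist):
    return max(range(len(hist)), key=hist.__getitem__)


ramp_x = [[float(x) for x in range(5)] for y in range(5)]         # brightens to the right
ramp_diag = [[float(x + y) for x in range(5)] for y in range(5)]  # brightens diagonally

hist_x = orientation_histogram(ramp_x)
hist_diag = orientation_histogram(ramp_diag)
print(dominant_bin(hist_x), dominant_bin(hist_diag))
```

The horizontal ramp votes into bin 0 (0° to 10°) and the diagonal ramp into bin 4 (40° to 50°).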

Descriptor normalization (illumination robustness)

\[v_1 = v/\|v\|_2 \quad\to\quad v_2[i]=\min(v_1[i],0.2) \quad\to\quad \hat v = v_2/\|v_2\|_2\]
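The three steps translate almost line for line into Python. A minimal sketch on a toy 4-D vector instead of a real 128-D descriptor; the function name is an assumption.

```python
import math


def normalize_descriptor(v, clip=0.2):
    """L2-normalize, clip large entries at `clip`, then L2-normalize again."""
    norm1 = math.sqrt(sum(x * x for x in v))
    if norm1 == 0:
        return list(v)
    v1 = [x / norm1 for x in v]
    v2 = [min(x, clip) for x in v1]
    norm2 = math.sqrt(sum(x * x for x in v2))
    return [x / norm2 for x in v2]


toy = [10.0, 1.0, 1.0, 1.0]
out = normalize_descriptor(toy)
out_brighter = normalize_descriptor([2.0 * x for x in toy])  # contrast doubled
print([round(x, 4) for x in out])
```

Doubling every entry (a contrast change) gives the same output because of the first normalization, and the clip keeps one dominant gradient from swamping the rest.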

Matching — Lowe ratio test

\[d(d_A,d_B) = \sqrt{\sum_{i=1}^{128}(d_A[i]-d_B[i])^2}\] \[\text{ratio} = d_1/d_2, \quad \text{accept if ratio} < \tau\ (\tau\approx 0.7\text{–}0.8)\]
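A minimal sketch of the ratio test, using 2-D toy descriptors instead of 128-D ones. The function names and the choice τ = 0.75 are assumptions within the range given above.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_match(query, candidates, tau=0.75):
    """Return the index of the best match, or None if the match is ambiguous."""
    if len(candidates) < 2:
        return None
    ranked = sorted((euclidean(query, c), i) for i, c in enumerate(candidates))
    (d1, best), (d2, _) = ranked[0], ranked[1]
    if d2 == 0:
        return None
    return best if d1 / d2 < tau else None


distinct = ratio_test_match((0.0, 0.0), [(1.0, 0.0), (5.0, 0.0)])   # ratio 0.2
ambiguous = ratio_test_match((0.0, 0.0), [(1.0, 0.0), (0.0, 1.1)])  # ratio ~0.91
print(distinct, ambiguous)
```

The first query has a clear nearest neighbour and is accepted; the second has two almost equally close candidates and is rejected.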

Harris vs SIFT

Property	Harris	SIFT
Feature type	corners	blobs
Scale invariant	no	yes (multi-scale DoG)
Rotation invariant	yes	yes (orientation align)
Descriptor	none	128-D histogram
Matching	n/a	Euclidean + Lowe ratio

Lecture 10 — Classical Image Recognition

Part I — Hough Transform

\[x\cos\theta + y\sin\theta = \rho, \qquad \theta\in[0^\circ,180^\circ)\]

Why polar > slope-intercept: $m\to\infty$ for vertical lines under $y=mx+b$; polar form is bounded, uniform, compact accumulator.

Duality

1 image point → 1 sinusoid in (ρ,θ)

Collinear points → sinusoids intersect at common (ρ*,θ*)

Algorithm (6 steps)

Edge detect → init accumulator → vote per edge pixel → find peaks → threshold/NMS → convert peaks to lines

\[\rho_i(\theta) = x_i\cos\theta + y_i\sin\theta\]

Gradient-guided Hough: vote only near gradient direction θ_g (≈ line normal) instead of all θ → O(N_edge) instead of O(N_edge·N_θ), fewer false peaks.
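A small sketch of the plain (not gradient-guided) voting step, assuming 1° θ bins, ρ rounded to the nearest integer, and a dictionary as accumulator. The names and the test points are illustrative. The points lie on the vertical line x = 3, the case that breaks the slope-intercept form.

```python
import math


def hough_accumulate(points, n_theta=180):
    """Vote in (rho, theta-index) space; theta index t means t * 180 / n_theta degrees."""
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.radians(t * 180.0 / n_theta)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return acc


points = [(3, y) for y in range(0, 100, 10)]   # ten edge pixels on x = 3
acc = hough_accumulate(points)
(rho_peak, theta_peak), votes = max(acc.items(), key=lambda kv: kv[1])
print("peak at rho =", rho_peak, "theta =", theta_peak, "deg with", votes, "votes")
```

All ten sinusoids meet in the single cell (ρ = 3, θ = 0°), which is the polar description of the line x = 3.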

Part II — K-Means

\[J = \sum_{i=1}^K \sum_{x_n\in C_i} \|x_n-\mu_i\|_2^2\]

Steps: choose K → init centroids → assign each point to nearest centroid → update centroid = mean of assigned → repeat to convergence. Clustering ≠ classification: cluster IDs are arbitrary, not semantic labels.
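The steps above can be sketched in pure Python. Initial centroids are passed in explicitly to keep the run deterministic; the function names and toy points are assumptions.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and mean-update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iters):
        labels = [min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
                  for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)   # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def objective(points, labels, centroids):
    """J = sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, [(0, 0), (10, 10)])
J = objective(points, labels, centroids)
print(labels, centroids, round(J, 4))
```

The labels 0 and 1 only say which points belong together; swapping the initial centroids would swap the IDs without changing the grouping.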

Feature vector choices for pixel clustering:

Feature	Groups by	Caution
[I]	brightness	distant same-intensity objects merge
[R,G,B]	color	ignores spatial coherence
[I,x,y]	brightness+proximity	features must be balanced/scaled
[R,G,B,λx·x,λy·y]	color+spatial	raw coords can dominate w/o scaling

Part III — Bag of Visual Words

\[q(d_j) = \arg\min_{k} \|d_j - w_k\|_2^2\] \[h_k = \sum_{j=1}^M \mathbb{1}[q(d_j)=k]\] \[h_{L1} = h/\textstyle\sum_k|h_k| \qquad h_{L2}=h/\|h\|_2\] \[\text{tf}_{k,d} = \frac{n_{k,d}}{\sum_j n_{j,d}}, \qquad \text{idf}_k = \log\frac{N}{df_k}\]

Pipeline: detect keypoints → describe (SIFT) → K-means cluster training descriptors → vocabulary {w₁..w_K} → quantize each descriptor to nearest word → build K-D histogram per image → normalize → feed to SVM. Converts variable-length descriptor sets into fixed-length vectors, permutation-invariant.
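A sketch of the quantize-and-count part of the pipeline, assuming the vocabulary has already been learned by K-means on training descriptors. The 2-D "descriptors", the three-word vocabulary, and the function names are toy assumptions.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(descriptor, vocabulary):
    """Index of the nearest visual word: q(d) = argmin_k ||d - w_k||^2."""
    return min(range(len(vocabulary)), key=lambda k: sq_dist(descriptor, vocabulary[k]))


def bovw_histogram(descriptors, vocabulary):
    """Count word occurrences, then L1-normalize to a fixed-length vector."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[quantize(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


vocabulary = [(0, 0), (10, 0), (0, 10)]            # K = 3 hypothetical words
image_a = [(1, 1), (0, 2), (9, 1)]                 # 3 descriptors
image_b = [(9, 1), (1, 1), (0, 2)]                 # same set, different order
hist_a = bovw_histogram(image_a, vocabulary)
hist_b = bovw_histogram(image_b, vocabulary)
print(hist_a)
```

Both images give the same K-length histogram, which shows the permutation invariance; an image with 300 descriptors would still produce a vector of length 3.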

Part IV — SVM

\[f(x) = w^Tx+b, \qquad \hat y = \text{sign}(f(x))\] \[\text{Hard-margin: } \min_{w,b}\tfrac12\|w\|_2^2 \quad \text{s.t. } y_i(w^Tx_i+b)\ge1\] \[\text{Soft-margin: } \min_{w,b,\xi}\tfrac12\|w\|_2^2+C\sum_i\xi_i \quad \text{s.t. } y_i(w^Tx_i+b)\ge1-\xi_i,\ \xi_i\ge0\] \[\hat y = \text{sign}\Big(\sum_{i\in SV}\alpha_i y_i K(x_i,x)+b\Big)\]

Larger C → tighter margin, penalizes violations more. Kernel trick (RBF: $\exp(-\gamma\|x-z\|_2^2)$) for non-linear separation. Critical: vocabulary/IDF/SVM all fit on training set only — fitting on test = data leakage.
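A sketch that evaluates the linear decision rule, the slack variables, and the soft-margin objective for a fixed hyperplane. Nothing is being trained here: the weights `w = (1, 1)`, `b = -3`, the three points, and the function names are assumed for illustration.

```python
import math


def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    return 1 if decision(w, b, x) >= 0 else -1


def slack(w, b, x, y):
    """xi_i = max(0, 1 - y_i (w^T x_i + b)); zero when the margin constraint holds."""
    return max(0.0, 1.0 - y * decision(w, b, x))


def soft_margin_objective(w, b, data, C):
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slack(w, b, x, y) for x, y in data)


def rbf_kernel(x, z, gamma):
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))


w, b = (1.0, 1.0), -3.0
data = [((3.0, 3.0), 1), ((1.0, 1.0), -1), ((2.0, 1.5), 1)]
slacks = [slack(w, b, x, y) for x, y in data]
objective = soft_margin_objective(w, b, data, C=1.0)
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(slacks, objective, round(margin_width, 4))
```

The third point is classified correctly but sits inside the margin, so it pays slack 0.5; raising C makes that violation more expensive in the objective.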

\[\text{Classical pipeline: } \text{keypoints} \to \text{descriptors} \to \text{visual words} \to \text{histogram} \to \text{SVM}\]

Lecture 11 — Classical Vision to Deep Learning

RANSAC

\[\lambda[u,v,1]^T = H[x,y,1]^T \qquad e = \|x'-\hat x'\|_2\] \[N = \left\lceil \frac{\log(1-p)}{\log(1-w^s)} \right\rceil\]

$w$=inlier fraction, $s$=min sample size (2 for line, 4 for homography), $p$=desired success prob. Algorithm: sample minimal set → fit model → count inliers (residual<τ) → keep best consensus set → re-estimate using ALL inliers after N iterations.
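The iteration count is easy to evaluate. A sketch with an assumed inlier fraction of 0.5 and success probability 0.99; the function name is illustrative.

```python
import math


def ransac_iterations(p, w, s):
    """N = ceil(log(1 - p) / log(1 - w^s))."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))


n_line = ransac_iterations(p=0.99, w=0.5, s=2)        # line: 2 points per sample
n_homography = ransac_iterations(p=0.99, w=0.5, s=4)  # homography: 4 correspondences
print(n_line, n_homography)
```

With half the matches being outliers, a line needs 17 samples and a homography 72: the count grows quickly with the sample size s because all s points must be inliers at once.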

CNN spatial arithmetic

\[N_{out} = \left\lfloor \frac{N-F+2P}{S} \right\rfloor + 1\] \[N_{out}^{(dilated)} = \left\lfloor \frac{N+2P-D(F-1)-1}{S} \right\rfloor + 1\] \[r_l = r_{l-1} + (F_l-1)D_l\, j_{l-1}, \qquad j_l = j_{l-1}\cdot S_l\] \[\text{Params} = (F_h\times F_w\times C_{in}+1)\times C_{out}\]
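A sketch of these formulas as small helpers. The dilated version here carries the padding term 2P, so it reduces to the plain formula at D = 1. Function names and the example layer sizes are assumptions.

```python
def conv_output_size(n, f, p=0, s=1, d=1):
    """floor((N + 2P - D(F - 1) - 1) / S) + 1; with D = 1 this is floor((N - F + 2P) / S) + 1."""
    return (n + 2 * p - d * (f - 1) - 1) // s + 1


def conv_params(fh, fw, c_in, c_out):
    """(Fh * Fw * Cin + 1) * Cout, the +1 being one bias per output channel."""
    return (fh * fw * c_in + 1) * c_out


def receptive_field(layers):
    """layers: list of (F, S, D) tuples, applied in order, starting from r = 1, j = 1."""
    r, j = 1, 1
    for f, s, d in layers:
        r += (f - 1) * d * j
        j *= s
    return r


print(conv_output_size(32, 3, p=1, s=1))        # 'same' padding
print(conv_output_size(32, 3, p=0, s=2))        # strided
print(conv_output_size(32, 3, p=0, s=1, d=2))   # dilated
print(conv_params(3, 3, 3, 64))
print(receptive_field([(3, 1, 1), (3, 1, 1)]), receptive_field([(3, 2, 1), (3, 1, 1)]))
```

Two stacked 3×3 layers see a 5×5 region; making the first one stride 2 widens that to 7×7 because the jump between neighbouring outputs doubles.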

Worked example — 3×3 convolution

X=[[5,2,6],[4,3,4],[3,9,2]], K=[[-1,0,1],[2,1,2],[1,-2,0]]
X⊙K = -5+0+6+8+3+8+3-18+0 = 5
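The same sum in Python, as a check on the arithmetic. As in the worked example, the kernel is applied without flipping (cross-correlation, which is what CNN layers compute). The function name is an assumption.

```python
def window_response(X, K):
    """Element-wise product of a window and a kernel of equal size, summed."""
    return sum(x * k for row_x, row_k in zip(X, K) for x, k in zip(row_x, row_k))


X = [[5, 2, 6], [4, 3, 4], [3, 9, 2]]
K = [[-1, 0, 1], [2, 1, 2], [1, -2, 0]]
value = window_response(X, K)
print(value)
```

A full convolution layer repeats this at every window position, which is where the output-size formula comes from.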

Activation functions

Function	Equation	Note
ReLU	max(0,x)	simple, sparse, efficient
Leaky ReLU	max(0.1x,x)	small grad for negatives
Sigmoid	1/(1+e⁻ˣ)	(0,1), vanishing gradient
Tanh	tanh(x)	(-1,1), zero-centered

Loss & training

\[\theta_{t+1} = \theta_t - \eta\nabla_\theta L_{batch}(\theta_t)\] \[p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad L_{CE} = -\log p_{true} \quad (\text{softmax, multi-class})\] \[\sigma(z_k)=\frac1{1+e^{-z_k}}, \qquad L_{BCE} = -\sum_k[y_k\log p_k+(1-y_k)\log(1-p_k)] \quad (\text{multi-label})\]

Training loop (6 steps): init filters → forward pass → compute loss → backprop ($\nabla_\theta L$ via chain rule) → optimizer update → repeat over batches/epochs.
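A sketch of the two losses and a single gradient step. To stay short it treats the logits themselves as the parameters θ (a simplifying assumption; in a real CNN the gradient is backpropagated to the filters) and uses the known result that the gradient of softmax cross-entropy with respect to the logits is p minus the one-hot target.

```python
import math


def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, true_index):
    return -math.log(softmax(z)[true_index])


def binary_cross_entropy(z, y):
    """Multi-label loss: an independent sigmoid per output."""
    loss = 0.0
    for zk, yk in zip(z, y):
        p = 1.0 / (1.0 + math.exp(-zk))
        loss -= yk * math.log(p) + (1 - yk) * math.log(1 - p)
    return loss


def sgd_step(z, true_index, eta):
    """theta <- theta - eta * grad, with grad = softmax(z) - one_hot(true_index)."""
    p = softmax(z)
    grad = [pk - (1.0 if k == true_index else 0.0) for k, pk in enumerate(p)]
    return [zk - eta * gk for zk, gk in zip(z, grad)]


logits = [0.0, 0.0, 0.0]
loss_before = cross_entropy(logits, 0)
logits_after = sgd_step(logits, 0, eta=1.0)
loss_after = cross_entropy(logits_after, 0)
print(round(loss_before, 4), round(loss_after, 4))
```

The loss starts at log 3 (uniform prediction over three classes) and drops after one update, which is the behaviour the training loop repeats over batches and epochs.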

Dropout

\[\tilde a_i = \frac{m_i}{q}\cdot a_i \ (\text{train}), \qquad \tilde a_i = a_i \ (\text{inference}), \qquad m_i\sim\text{Bernoulli}(q)\]
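A sketch of this (inverted) dropout rule, reading q as the keep probability, which is what the Bernoulli(q) mask and the 1/q scaling imply. The function name, the seed, and q = 0.8 are assumptions.

```python
import random


def dropout(activations, q, training, rng):
    """Train: keep each unit with probability q and scale survivors by 1/q. Inference: identity."""
    if not training:
        return list(activations)
    return [a / q if rng.random() < q else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 10000
train_out = dropout(acts, q=0.8, training=True, rng=rng)
infer_out = dropout(acts, q=0.8, training=False, rng=rng)
mean_train = sum(train_out) / len(train_out)
print(round(mean_train, 3), infer_out[:3])
```

Scaling by 1/q keeps the expected activation the same in both modes, so no rescaling is needed at inference time.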

Padding, stride, pooling

Zero padding: fair edge processing, F=3,S=1,P=1 → same output size. Stride: larger S downsamples, less compute, less detail. Max pooling: keeps strongest feature. GAP: H×W×C→C, compact + regularizing (vs flattening H×W×C→huge vector, overfit risk).

Classical vs CNN pipeline

Classical

Hand-designed features (SIFT)

Separate classifier (SVM)

CNN

Learns filters from data

Unified feature extraction + classification

Feature hierarchy: early layers = edges/textures, middle = shapes/parts, deep = class-specific (faces, wheels).

Lecture 12 — Modern Vision Models & Applications

Five core output types

Task	Question	Output
Classification	What is main content?	one label
Object Detection	What and where?	boxes + labels
Segmentation	Which pixel = what?	pixel-level masks
Pose Estimation	Structural layout?	keypoints
Tracking	Same object across frames?	persistent track IDs

IoU, NMS, detection metrics

\[\text{IoU}(A,B) = \frac{|A\cap B|}{|A\cup B|}\] \[\text{NMS: keep highest-score box, suppress others with IoU} > \tau_{NMS}\] \[L_{YOLO} = \lambda_{box}L_{box}+\lambda_{cls}L_{cls}+\lambda_{obj}L_{obj}\]

Segmentation & classification metrics

\[\text{Dice}(P,G) = \frac{2|P\cap G|}{|P|+|G|} = \frac{2\cdot\text{IoU}}{1+\text{IoU}}\] \[\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \quad \text{Precision}=\frac{TP}{TP+FP}, \quad \text{Recall}=\frac{TP}{TP+FN}\] \[F1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} = \frac{2TP}{2TP+FP+FN}\] \[\text{AP}_c=\int_0^1 P_c(R)\,dR, \qquad \text{mAP}=\frac1C\sum_c \text{AP}_c\]

Recall matters when missing positives is dangerous (disease, defects). Precision matters when false alarms are costly.

CLIP

\[s_k = \cos(v,t_k)=\frac{v^Tt_k}{\|v\|_2\|t_k\|_2}, \qquad p_k=\frac{e^{s_k/T}}{\sum_j e^{s_j/T}}\]

Maps image + text to shared embedding space. Zero-shot classification via cosine similarity to candidate text prompts. Does NOT generate captions.

Two-stage vs single-stage detectors

Two-stage (R-CNN family)

Region proposals → classify/refine

Faster R-CNN, Mask R-CNN

Single-stage (YOLO family)

Direct box+class prediction

YOLO, SSD, RetinaNet — real-time

YOLO architecture: Backbone (feature extraction) → Neck (feature pyramid, multi-scale) → Detection Head (boxes + class/confidence). Post-process: confidence threshold + NMS.
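The post-processing step can be sketched as greedy NMS over axis-aligned boxes. Boxes are assumed to be `(x1, y1, x2, y2)` tuples; the example boxes, scores, and thresholds are made up, and the function names are illustrative.

```python
def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold=0.25, iou_threshold=0.5):
    """Drop low-confidence boxes, then keep the best box and suppress its heavy overlaps."""
    order = sorted((i for i in range(len(boxes)) if scores[i] >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms(boxes, scores)
print(kept, round(iou(boxes[0], boxes[1]), 4))
```

Box 1 overlaps box 0 with IoU of about 0.68 and is suppressed, box 3 falls below the confidence threshold, and the distant box 2 is kept.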

SAM vs CLIP vs task models

Model	Output	Use
YOLO/SSD/RetinaNet	boxes+classes	real-time detection
Faster/Mask R-CNN	proposals, boxes, masks	accurate detect/instance seg
U-Net, DeepLab	pixel class maps	semantic seg (medical, satellite)
SAM/SAM2	promptable masks	zero-shot interactive seg
CLIP	image-text similarity	zero-shot recognition
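CLIP's zero-shot scoring (cosine similarity followed by a temperature softmax) can be sketched without the model itself. The 2-D embeddings below are invented stand-ins for encoder outputs, the prompts are examples, and `temperature = 0.1` is an arbitrary assumption; none of this calls a real CLIP API.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_probs(image_vec, text_vecs, temperature=0.1):
    """s_k = cos(v, t_k), then p_k = softmax(s_k / T)."""
    sims = [cosine(image_vec, t) for t in text_vecs]
    scaled = [s / temperature for s in sims]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return sims, [e / total for e in exps]


prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_vecs = [(0.9, 0.1), (0.1, 0.9), (-0.5, 0.5)]   # hypothetical text embeddings
image_vec = (1.0, 0.2)                               # hypothetical image embedding

sims, probs = zero_shot_probs(image_vec, text_vecs)
best = max(range(len(prompts)), key=probs.__getitem__)
print(prompts[best], [round(p, 3) for p in probs])
```

The predicted label is simply the prompt with the highest probability; changing the candidate prompts changes the label set without any retraining.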

Semantic vs instance segmentation: semantic = one label/pixel, no instance split. Instance = separate mask per object instance.
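Either kind of mask is scored against ground truth with the overlap metrics from this lecture. A sketch on two tiny masks, each stored as a set of (row, col) pixels; the masks and the function name are assumptions, and both masks are assumed non-empty.

```python
def mask_metrics(pred, gt):
    """Pixel-level IoU, Dice, precision, and recall for two non-empty pixel sets."""
    tp = len(pred & gt)    # predicted and truly object
    fp = len(pred - gt)    # predicted but background
    fn = len(gt - pred)    # missed object pixels
    return {
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt = {(0, 1), (1, 1), (0, 2), (1, 2)}
m = mask_metrics(pred, gt)
print(m)
```

Here IoU is 1/3 and Dice is 0.5, consistent with Dice = 2·IoU/(1+IoU); at pixel level Dice is the same quantity as F1.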

Tracking algorithms

Algorithm	Mechanism	Best for	Limitation
SORT	Kalman + IoU	fast, clear scenes	ID switches in crowding
DeepSORT	SORT + appearance embeddings	crowded, re-ID after occlusion	more compute
ByteTrack	high+low conf detections	unstable detections	tied to detector quality
BoT-SORT	motion+overlap+appearance	complex tracking	heavy pipeline

Failures: ID switch (swap identities), fragmentation (one object → multiple tracks), occlusion failure (lost/reassigned ID).

Master Formula Sheet

Lecture 7 — Feature Detection

\[E_{m,n}(x,y) = \sum_{(u,v)\in W}\big[I(m+u,n+v)-I(m+x+u,n+y+v)\big]^2\] \[F_{m,n} = \min_{(x,y)\in D} E_{m,n}(x,y)\]

Lecture 8 — Harris

\[M=\sum w(x,y)\begin{bmatrix}I_x^2&I_xI_y\\I_xI_y&I_y^2\end{bmatrix}\] \[R=\det(M)-\alpha\,\text{trace}(M)^2\] \[R_{ShiTomasi}=\min(\lambda_1,\lambda_2)\]

Lecture 9 — SIFT

\[D(x,y,\sigma)\approx(k-1)\sigma^2\nabla^2G(x,y,\sigma)*I(x,y)\] \[m=\sqrt{(L_{x+1}-L_{x-1})^2+(L_{y+1}-L_{y-1})^2}\] \[\text{ratio}=d_1/d_2 < \tau\]

Lecture 10 — Hough / K-Means / BoVW / SVM

\[x\cos\theta+y\sin\theta=\rho\] \[J=\sum_i\sum_{x_n\in C_i}\|x_n-\mu_i\|_2^2\] \[\text{idf}_k=\log(N/df_k)\] \[\min_{w,b}\tfrac12\|w\|_2^2 \ \text{s.t.}\ y_i(w^Tx_i+b)\ge1\]

Lecture 11 — RANSAC / CNN

\[N=\left\lceil\frac{\log(1-p)}{\log(1-w^s)}\right\rceil\] \[N_{out}=\left\lfloor\frac{N-F+2P}{S}\right\rfloor+1\] \[\theta_{t+1}=\theta_t-\eta\nabla_\theta L\]

Lecture 12 — Detection / Metrics

\[\text{IoU}=\frac{|A\cap B|}{|A\cup B|}, \qquad \text{Dice}=\frac{2|P\cap G|}{|P|+|G|}\] \[F1=\frac{2\cdot P\cdot R}{P+R}\] \[s_k=\cos(v,t_k)\]

Key Comparisons (exam favorites)

Pair	Quick answer
Moravec vs Harris	4 discrete shifts/SSD vs continuous gradient structure tensor
Harris vs Shi-Tomasi	det−α·tr² (needs α) vs min(λ₁,λ₂) (no α, better for tracking)
Harris vs SIFT	rotation-invariant corners only vs scale+rotation invariant blobs+descriptor
DoG vs LoG	fast approx via Gaussian subtraction vs exact 2nd-derivative
Slope-intercept vs polar Hough	unbounded m, fails on vertical lines vs bounded θ∈[0,180°)
Clustering vs classification	unsupervised, arbitrary IDs vs supervised, predefined labels
Two-stage vs single-stage detector	region proposals then classify (accurate) vs direct prediction (fast/real-time)
Semantic vs instance segmentation	one label/pixel vs separate mask per object instance
SAM vs CLIP	promptable pixel masks, no labels vs image-text similarity, zero-shot labels
Flattening vs Global Average Pooling	huge vector, overfit risk vs compact, regularizing
Precision vs Recall	cost of false alarms vs cost of missed positives

Answering Strategy

Conceptual questions

Define concept → explain why needed → connect to example/application.

Numerical questions

1. Identify inputs (pixel values, gradients, K, thresholds). 2. Apply formula step by step. 3. Check rounding/thresholding/classification needed. 4. State result in words (corner/edge/flat, inlier/outlier, detected/rejected).

4 guiding questions per method: 1) What problem does it solve? 2) What's the input/output? 3) Why better than the previous method? 4) What are its limitations/trade-offs?

Connections: Moravec→Harris (robustness). Harris→SIFT (scale invariance). SIFT descriptors→BoVW (fixed-length representation). BoVW histograms→SVM (classification needs fixed vectors). Classical pipeline→CNN (learned vs hand-crafted features). Modern models chosen by required output type (label/box/mask/keypoints/track ID).

Practice Assignment Revision — Lectures 7–12 (30 MCQs)

1. Robot needs stable points to match shelf corners across frames. Best image structure?

a) Flat regions b) Corners with strong multi-directional variation c) Uniform walls d) Global brightness

Answer: b — corners produce significant changes in multiple directions, easy to detect/match.

2. Moravec on plain white wall — expected SSD behavior?

a) Large in all dirs b) Large in one dir c) Small in all dirs d) Random

Answer: c — flat region, no intensity change on shift.

3. Moravec on vertical edge, shifted horizontally/vertically — response?

a) Large in all dirs, definitely corner b) Small in all dirs c) Large ⊥ to edge, small ∥ to edge d) Depends on histogram only

Answer: c — minimum SSD stays small along the edge, distinguishing edge from corner.

4. Moravec misclassifies a diagonal edge as a corner. Why?

a) Too many color channels b) Evaluates only limited discrete shift directions c) Wrong eigenvalues d) Uses learned CNN filter

Answer: b — only checks 4 discrete shifts, rotated/diagonal structures misread.

5. Harris structure tensor gives two large eigenvalues. Indicates?

a) Flat b) Edge c) Corner d) Saturated

Answer: c — strong variation in two independent directions = corner.

6. Flower field image: corners near petals/centers, few in smooth sky. Why?

a) Sky has stronger gradients b) Flower regions have multi-directional intensity variation c) Harris detects only blue pixels d) Smooth regions always give large positive R

Answer: b — sky has small gradients, response near zero.

7. R = det(M) − α·trace(M)². If R large and positive, region is most likely?

a) Corner b) Flat c) Pure edge d) Low-contrast

Answer: a — strong variation in both principal directions.

8. Student raises the Harris threshold significantly. Effect on detected corners?

a) More weak corners detected b) Detected points decrease c) All flat regions become corners d) Resolution increases

Answer: b — higher threshold keeps only stronger responses.

9. Same object far vs close-up. Harris corners don't match well. Why?

a) Harris not scale-invariant b) Harris can't detect corners c) Works only on binary images d) Doesn't use gradients

Answer: a — Harris is rotation-invariant but not scale-invariant; SIFT fixes this.

10. Logo must be recognized at different sizes. Which SIFT stage supports this directly?

a) Scale-space extrema detection b) Histogram equalization c) RGB normalization d) Alpha blending

Answer: a — searches extrema across scale, detects features at their most stable scale.

11. SIFT descriptor window aligned to dominant gradient orientation. Main purpose?

a) Reduce image size b) Improve rotation invariance c) Remove all noise d) Convert to grayscale

Answer: b — same structure produces similar descriptor under rotation.

12. SIFT descriptor: 16×16 neighborhood, 4×4 cells, 8-bin histogram each. Final length?

a) 32 b) 64 c) 128 d) 256

Answer: c — 4×4=16 cells × 8 bins = 128.

13. Road-lane detection needs straight lines even with missing parts. Best method?

a) K-means b) Hough Transform c) Dropout d) Global average pooling

Answer: b — Hough lets multiple edge points vote for the same line even with gaps.

14. Polar Hough accumulator: two parallel lines detected. Expected accumulator pattern?

a) One peak, same d and θ b) Two peaks, similar θ, different d c) No peaks d) Single intensity histogram

Answer: b — same orientation (θ) but different distance from origin (d/ρ).

15. Satellite K-means on intensity only — roads/rooftops merge. Why?

a) K-means can't use numerical features b) Regions may share similar brightness c) K-means needs labels d) Pixel count too small

Answer: b — intensity-only fails when objects share brightness; add color/spatial coords.

16. K-means with very small K. Likely result?

a) Over-segmented into tiny regions b) Different objects merged into same cluster c) Becomes supervised d) Image automatically sharper

Answer: b — too few clusters to represent all meaningful regions.

17. Variable number of SIFT descriptors per image, want SVM. Why need Bag of Visual Words?

a) Convert variable-length descriptors to fixed-length histograms b) Remove all support vectors c) Replace feature extraction d) Perform tracking

Answer: a — SVM needs fixed-length feature vectors.

18. BoVW vocabulary too small. Likely effect?

a) Different structures grouped together, reduced discriminative power b) Histograms too sparse and perfectly accurate c) Model becomes scale-invariant automatically d) SVM no longer needs training data

Answer: a — small vocabulary forces different descriptors into same words.

19. Panorama stitching: many wrong SIFT matches from repeated building windows. Method before homography estimation?

a) RANSAC b) Dropout c) Non-Maximum Suppression d) Global thresholding

Answer: a — RANSAC rejects geometrically inconsistent outliers.

20. RANSAC for homography. Minimum point correspondences needed?

a) 2 b) 3 c) 4 d) 8

Answer: c — homography has 8 DOF, each correspondence gives 2 constraints → 4 needed.

21. CNN classifying cats/dogs/horses/birds. Main difference from classical hand-designed filters?

a) CNN filters learned from data during training b) Always fixed before training c) Can't detect edges d) Used only after SVM

Answer: a — classical pipelines use hand-designed filters/descriptors; CNNs learn them.

22. CNN layer uses stride 2 instead of 1. Main effect?

a) Kernel moves farther each step, usually reduces output size b) Number of classes doubles c) Loss function removed d) Image color-inverted

Answer: a — larger stride skips more positions → smaller feature maps.

23. CNN performs great on training, poorly on new test images (different lighting). Problem?

a) Overfitting b) Perfect generalization c) Correct dropout use d) Under-segmentation

Answer: a — model learned training-specific patterns instead of general features.

24. Traffic system needs real-time bounding boxes + class labels for cars/buses/trucks/pedestrians. Best model family?

a) YOLO b) K-means c) SIFT only d) Harris only

Answer: a — single-stage detector, boxes+labels+confidence in one forward pass, real-time.

25. Detector produces several overlapping boxes around same pedestrian. Post-processing step to keep strongest?

a) Non-Maximum Suppression b) Histogram equalization c) K-means centroid update d) RANSAC homography

Answer: a — NMS removes duplicate overlapping detections, keeps highest-confidence box.

26. Medical system must mark exact tumor boundary, not just abnormal/normal. Best output?

a) One image-level label b) Histogram of visual words c) Segmentation mask d) Single Harris corner point

Answer: c — boundary marking requires pixel-level localization.

27. Dataset: 950 normal, 50 disease images. High accuracy but misses many disease cases. Metric to emphasize?

a) Recall b) File size c) Number of conv layers d) Image width

Answer: a — recall measures detected actual positives; missing disease cases dangerous → minimize false negatives.

28. Sports tracker swaps player identities when two players cross. Best fix?

a) Use only image classification b) DeepSORT with appearance features c) Increase Harris threshold d) Histogram equalization only

Answer: b — appearance features (beyond motion/position) better distinguish overlapping objects.

29. Quality inspection must separate each defective bottle with its own mask. Best output?

a) Classification label b) Detection bounding box only c) Instance segmentation mask d) Global histogram

Answer: c — instance segmentation separates each object instance with own pixel mask.

30. Team wants quick annotation: user clicks object, model produces mask. Best model?

a) SAM b) YOLO only c) SVM d) K-means only

Answer: a — SAM is promptable segmentation, generates masks from point/box prompts.

ICS604 — Introduction to Image Processing & Computer Vision
