ICS604 — Introduction to Image Processing & Computer Vision

Final Exam Guide — Full Consolidated Study Page (Lectures 7–12 + MCQ Revision) · Dr. Maggie Shammaa
Contents
  1. Lecture 7 — Feature Detection
  2. Lecture 8 — Harris Corner Detection
  3. Lecture 9 — Feature Descriptors (SIFT)
  4. Lecture 10 — Classical Image Recognition (Hough, K-Means, BoVW, SVM)
  5. Lecture 11 — Classical Vision to Deep Learning (RANSAC, CNN)
  6. Lecture 12 — Modern Vision Models & Applications
  7. Master Formula Sheet
  8. Key Comparisons (exam favorites)
  9. Answering Strategy
  10. Practice Assignment Revision — 30 MCQs

Lecture 7 — Feature Detection

Why features, not full images

Compact, distinctive structures enable matching/recognition/reconstruction without comparing every pixel. Four common feature types: edges (1D, strong intensity change, Sobel/Canny/LoG), corners (multi-directional intensity change — strongest interest points), blobs (homogeneous regions distinct from surroundings, DoG/LoG), ridges (curves at local extrema — road networks, vessels).

Why corners > edges

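Edges suffer from the aperture problem: a point on an edge can slide along the edge direction without changing the local appearance, so its position is ambiguous in that direction. Corners are fully localized in both directions.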

The three region types

Region | SSD pattern | Min SSD | Status
Flat | small in all directions | small | not an interest point
Edge | large ⊥, small ∥ to edge | small | not a corner
Corner | large in all directions | large | interest point

Moravec detector

\[E_{m,n}(x,y) = \sum_{(u,v)\in W} \big[I(m+u,n+v) - I(m+x+u,n+y+v)\big]^2\] \[F_{m,n} = \min_{(x,y)\in D} E_{m,n}(x,y), \qquad D = \{(1,0),(-1,0),(0,1),(0,-1)\}\] \[\text{Corner at } (m,n) \iff F_{m,n}\text{ is a local max AND } F_{m,n} > T\]

Algorithm: place window → shift in 4 directions → compute SSD each → take minimum → mark corner if locally maximal & above threshold.
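The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the image is assumed to be a list of rows indexed as `img[row][col]`, the window half-width `half` and the 9×9 test image are made up for the example, and the local-maximum and threshold checks are left out.

```python
def moravec_response(img, m, n, half=1):
    """Return F_{m,n}: the minimum SSD over the four unit shifts in D."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # the shift set D
    ssds = []
    for x, y in shifts:
        ssd = 0
        for u in range(-half, half + 1):
            for v in range(-half, half + 1):
                diff = img[m + u][n + v] - img[m + x + u][n + y + v]
                ssd += diff * diff
        ssds.append(ssd)
    return min(ssds)


# Hypothetical 9x9 test image: background 128, dark object (0) in the
# lower-right block, so (4, 4) sits on the object's corner.
img = [[0 if (r >= 4 and c >= 4) else 128 for c in range(9)] for r in range(9)]

flat_f = moravec_response(img, 2, 2)    # window entirely in the background
edge_f = moravec_response(img, 6, 4)    # on the vertical edge of the object
corner_f = moravec_response(img, 4, 4)  # on the object's corner
print(flat_f, edge_f, corner_f)
```

Only the corner keeps a large minimum, because every one of the four shifts changes the window content there; on the edge, the shifts along the edge give SSD = 0.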

Worked example — corner-like boundary
Background=128, object=0. Shift up: SSD=(0−128)²+(128−0)²=32768 (large). Shift right: SSD=16384 (large). Only when ALL 4 shift directions give large SSD does the minimum stay large → corner.

Moravec limitations → motivates Harris

Limitation | Cause | Harris fix
Not rotation invariant | only 4 discrete shifts | gradient structure tensor (continuous)
Sensitive to noise | raw intensity comparison | Gaussian-weighted window
Edges as corners | an edge not aligned with any of the 4 shifts (e.g. diagonal) gives a large SSD for every shift | eigenvalue analysis
Poor localization | uniform window weighting | center-weighted Gaussian
Anisotropic response | depends on chosen shift set | continuous gradient formulation

Lecture 8 — Harris Corner Detection

Structure tensor (covariance matrix)

\[M = \sum_{x,y} w(x,y)\begin{bmatrix}I_x^2 & I_x I_y\\ I_x I_y & I_y^2\end{bmatrix}\]

$I_x,I_y$ = image gradients, $w$ = Gaussian weighting window (center pixels weighted more → better localization, less noise sensitivity).

Taylor expansion derivation

\[I(x+u,y+v) \approx I(x,y) + u I_x + v I_y\] \[E(u,v) \approx [u\ v]\, M \,[u\ v]^T\]

Harris response

\[R = \det(M) - \alpha \cdot \text{trace}(M)^2, \qquad \alpha \in [0.04, 0.06]\] \[\det(M)=\lambda_1\lambda_2 \qquad \text{trace}(M)=\lambda_1+\lambda_2\] \[\lambda = \frac{\text{trace}(M) \pm \sqrt{\text{trace}(M)^2 - 4\det(M)}}{2}\]

Eigenvalue / R classification

Eigenvalues | R | Region
both ≈ 0 | R ≈ 0 | Flat
one large, one ≈ 0 | R < 0 | Edge
both large | R ≫ 0 | Corner

α too small → edges falsely flagged as corners. α too large → true corners missed. Standard α=0.04.
Worked example — Harris R at pixel (2,2)
$\Sigma I_x^2=403,\ \Sigma I_y^2=381,\ \Sigma I_xI_y=385$
$M=\begin{bmatrix}403&385\\385&381\end{bmatrix}$
$\det(M)=403\times381-385^2=5318$, $\text{trace}(M)=784$
$R = 5318 - 0.04\times784^2 = 5318-24586.24 = -19268.24$ → R<0 → EDGE.
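The arithmetic of this worked example can be checked with a few lines of Python. The function name `harris_response` is only illustrative; the three sums are the ones given above.

```python
def harris_response(sxx, syy, sxy, alpha=0.04):
    """R = det(M) - alpha * trace(M)^2 for M = [[sxx, sxy], [sxy, syy]]."""
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - alpha * trace * trace


r = harris_response(403, 381, 385)
print(round(r, 2))  # negative response, so the pixel is classified as an edge
```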

Shi-Tomasi (alternative response)

\[R_{ShiTomasi} = \min(\lambda_1,\lambda_2)\]

No empirical α needed, better localization, used in "Good Features to Track" — optical flow/tracking.
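As a rough sketch, the Shi-Tomasi response can be computed from the same three sums using the closed-form eigenvalue formula from the Harris section. The function name is illustrative, and the input values are reused from the Harris worked example.

```python
import math


def shi_tomasi_response(sxx, syy, sxy):
    """Smaller eigenvalue of M = [[sxx, sxy], [sxy, syy]]."""
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    disc = math.sqrt(trace * trace - 4 * det)
    return (trace - disc) / 2


lam_min = shi_tomasi_response(403, 381, 385)
lam_max = (403 + 381) - lam_min  # the eigenvalues sum to trace(M)
print(lam_min, lam_max)
```

One eigenvalue is far smaller than the other here, which agrees with the edge classification that Harris gives for the same matrix.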

Detector comparison

Property | Moravec | Harris | Shi-Tomasi
Directions | 4 discrete (the set D) | all continuous | all continuous
Window | uniform | Gaussian | Gaussian
Response | min(SSD) | det(M)−α·tr(M)² | min(λ₁,λ₂)
Free param | none | α | none
Rotation invariant | no | yes | yes
Use | historical | general corners | tracking, optical flow
Why no explicit eigenvalue decomposition: Harris uses only det(M) and trace(M), both directly computable from ΣI_x², ΣI_y², ΣI_xI_y — efficient for real-time.

Lecture 9 — Feature Descriptors (SIFT)

Harris is NOT scale-invariant

Rotation invariant (eigenvalues unchanged under rotation) but window size is fixed — same corner looks different at different zoom. Motivates SIFT.

Scale-normalized LoG & DoG

\[\text{LoG}_{norm}(x,y,\sigma) = \sigma^2 \nabla^2 G(x,y,\sigma)\] \[D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma) \approx (k-1)\sigma^2 \nabla^2 G(x,y,\sigma) * I(x,y)\] \[k = 2^{1/S} \quad (S = \text{scale intervals per octave})\]

DoG is a fast approximation of LoG. Canonical SIFT S=3 → 6 Gaussian images, 5 DoG images per octave.

SIFT 4-step pipeline

1. Scale-space extrema detection

Build Gaussian pyramid (octaves)
Compute DoG at each scale
Compare to 26 neighbors (8+9+9)

2. Keypoint localization & filtering

Fit 3D quadratic, subpixel refine
Reject low contrast (<0.03)
Reject edge-like via Hessian ratio (r_th=10)
\[x^* = -\left(\frac{\partial^2 D}{\partial x^2}\right)^{-1}\frac{\partial D}{\partial x}\] \[\text{Reject edge-like if } \frac{\text{Tr}(H)^2}{\det(H)} \ge \frac{(r_{th}+1)^2}{r_{th}}\]

3. Orientation assignment

Gradient mag + direction per pixel
36 bins (10° each), Gaussian-weighted vote
Dominant bin = orientation; ≥80% peak → extra keypoint

4. Descriptor (128-D)

16×16 window, rotated to orientation
4×4 cells × 8 orientation bins = 128
Interpolated votes reduce boundary effects
\[m(x,y)=\sqrt{(L(x{+}1,y){-}L(x{-}1,y))^2+(L(x,y{+}1){-}L(x,y{-}1))^2}\] \[\theta(x,y) = \text{atan2}\big(L(x,y{+}1){-}L(x,y{-}1),\ L(x{+}1,y){-}L(x{-}1,y)\big)\]
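Two of the pipeline formulas are easy to check numerically: the edge-rejection test from step 2 and the gradient magnitude/orientation used in steps 3 and 4. The sketch below assumes a smoothed image stored as `L[y][x]` (row-major) and made-up Hessian entries. Treating a non-positive determinant as a rejection is an extra assumption, not something stated in these notes.

```python
import math


def is_edge_like(dxx, dyy, dxy, r_th=10.0):
    """Step 2: reject if Tr(H)^2 / det(H) >= (r_th + 1)^2 / r_th."""
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:  # assumption: curvatures of opposite sign are also rejected
        return True
    return trace * trace / det >= (r_th + 1) ** 2 / r_th


def gradient_mag_ori(L, x, y):
    """Central-difference magnitude and orientation (degrees in [0, 360))."""
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx)) % 360.0


def orientation_bin(theta_deg, n_bins=36):
    """Step 3: index of the 10-degree histogram bin for an orientation."""
    return int(theta_deg // (360.0 / n_bins)) % n_bins


patch = [[0, 0, 0],
         [0, 0, 2],
         [0, 2, 0]]  # hypothetical 3x3 neighbourhood
mag, theta = gradient_mag_ori(patch, 1, 1)
print(is_edge_like(20, 1, 0), is_edge_like(2, 2, 0), mag, theta, orientation_bin(theta))
```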

Descriptor normalization (illumination robustness)

\[v_1 = v/\|v\|_2 \quad\to\quad v_2[i]=\min(v_1[i],0.2) \quad\to\quad \hat v = v_2/\|v_2\|_2\]
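The three normalization steps can be written directly from the formula. This is a small sketch on a made-up 128-D vector with one dominant entry, to show that clipping limits the influence of a single large gradient bin.

```python
import math


def normalize_descriptor(v, clip=0.2):
    """L2-normalize, clip each entry at `clip`, then L2-normalize again."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    v1 = [x / norm for x in v]
    v2 = [min(x, clip) for x in v1]
    norm2 = math.sqrt(sum(x * x for x in v2))
    return [x / norm2 for x in v2]


raw = [10.0] + [1.0] * 127  # hypothetical descriptor with one dominant bin
desc = normalize_descriptor(raw)
print(desc[0], desc[1])
```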

Matching — Lowe ratio test

\[d(d_A,d_B) = \sqrt{\sum_{i=1}^{128}(d_A[i]-d_B[i])^2}\] \[\text{ratio} = d_1/d_2, \quad \text{accept if ratio} < \tau\ (\tau\approx 0.7\text{–}0.8)\]
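A brute-force version of the ratio test might look like the following. The 2-D toy descriptors stand in for real 128-D SIFT vectors and are invented for the example; τ = 0.75 is one value inside the range quoted above.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(desc_a, desc_b, tau=0.75):
    """Return (i, j) pairs whose nearest/second-nearest distance ratio is < tau."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((euclidean(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d2 > 0 and d1 / d2 < tau:
            matches.append((i, j1))
    return matches


desc_a = [[0.0, 0.0], [5.0, 5.0]]
desc_b = [[0.1, 0.0], [4.0, 0.0], [5.2, 5.1], [5.0, 4.8]]
matches = ratio_test_matches(desc_a, desc_b)
print(matches)
```

The second query descriptor has two nearly equidistant candidates, so its ratio is close to 1 and the match is discarded as ambiguous.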

Harris vs SIFT

Property | Harris | SIFT
Feature type | corners | blobs
Scale invariant | no | yes (multi-scale DoG)
Rotation invariant | yes | yes (orientation align)
Descriptor | none | 128-D histogram
Matching | n/a | Euclidean + Lowe ratio

Lecture 10 — Classical Image Recognition

Part I — Hough Transform

\[x\cos\theta + y\sin\theta = \rho, \qquad \theta\in[0^\circ,180^\circ)\]

Why polar > slope-intercept: $m\to\infty$ for vertical lines under $y=mx+b$, so the $(m,b)$ space is unbounded; the polar form keeps both $\rho$ and $\theta$ bounded, which gives a compact, uniformly sampled accumulator.

Duality

1 image point → 1 sinusoid in (ρ,θ)
Collinear points → sinusoids intersect at common (ρ*,θ*)

Algorithm (6 steps)

Edge detect → init accumulator → vote per edge pixel → find peaks → threshold/NMS → convert peaks to lines
\[\rho_i(\theta) = x_i\cos\theta + y_i\sin\theta\]
Gradient-guided Hough: vote only near gradient direction θ_g (≈ line normal) instead of all θ → O(N_edge) instead of O(N_edge·N_θ), fewer false peaks.
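The voting step of the basic (non-gradient-guided) algorithm can be sketched with a dictionary as the accumulator. The parameters `n_theta` and `rho_res` and the ten sample points are assumptions chosen for the example; a real pipeline would take the points from an edge detector.

```python
import math


def hough_votes(points, n_theta=180, rho_res=1.0):
    """Accumulate votes in (rho_bin, theta_index) cells for each edge point."""
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta  # theta in [0, 180 degrees)
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (round(rho / rho_res), t)
            acc[key] = acc.get(key, 0) + 1
    return acc


# Ten collinear points on the horizontal line y = 5.
points = [(x, 5) for x in range(0, 100, 10)]
acc = hough_votes(points)
peak = max(acc, key=acc.get)
print(peak, acc[peak])
```

The peak cell corresponds to ρ = 5 and θ = 90°, i.e. a line whose normal points along the y-axis at distance 5 from the origin.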

Part II — K-Means

\[J = \sum_{i=1}^K \sum_{x_n\in C_i} \|x_n-\mu_i\|_2^2\]

Steps: choose K → init centroids → assign each point to nearest centroid → update centroid = mean of assigned → repeat to convergence. Clustering ≠ classification: cluster IDs are arbitrary, not semantic labels.
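The assignment/update loop can be sketched as below. To keep the example deterministic, the initial centroids are passed in by the caller instead of being chosen at random; the intensity values are invented.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's iterations on tuples of numbers with caller-supplied centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


pixels = [(10.0,), (12.0,), (11.0,), (200.0,), (205.0,), (198.0,)]  # [I] features
centroids, clusters = kmeans(pixels, [(0.0,), (255.0,)])
print(centroids)
```

The returned cluster indices 0 and 1 carry no meaning by themselves, which is the clustering-versus-classification point made above.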

Feature | Groups by | Caution
[I] | brightness | distant same-intensity objects merge
[R,G,B] | color | ignores spatial coherence
[I,x,y] | brightness+proximity | features must be balanced/scaled
[R,G,B,λx·x,λy·y] | color+spatial | raw coords can dominate w/o scaling

Part III — Bag of Visual Words

\[q(d_j) = \arg\min_{k} \|d_j - w_k\|_2^2\] \[h_k = \sum_{j=1}^M \mathbb{1}[q(d_j)=k]\] \[h_{L1} = h/\textstyle\sum_k|h_k| \qquad h_{L2}=h/\|h\|_2\] \[\text{tf}_{k,d} = \frac{n_{k,d}}{\sum_j n_{j,d}}, \qquad \text{idf}_k = \log\frac{N}{df_k}\]

Pipeline: detect keypoints → describe (SIFT) → K-means cluster training descriptors → vocabulary {w₁..w_K} → quantize each descriptor to nearest word → build K-D histogram per image → normalize → feed to SVM. Converts variable-length descriptor sets into fixed-length vectors, permutation-invariant.
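The quantization and histogram steps of this pipeline can be illustrated with a toy vocabulary. The 2-D "descriptors" and the three visual words below are invented; in practice the words would be K-means centroids of training SIFT descriptors.

```python
def bovw_histogram(descriptors, vocabulary):
    """Quantize each descriptor to its nearest word and return the L1-normalized histogram."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, w)) for w in vocabulary]
        hist[dists.index(min(dists))] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


vocabulary = [(0, 0), (10, 10), (0, 10)]
descriptors = [(1, 1), (9, 9), (0, 1), (1, 9)]
hist = bovw_histogram(descriptors, vocabulary)
print(hist)
```

The histogram length equals the vocabulary size regardless of how many descriptors the image produced, and reordering the descriptors leaves it unchanged.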

Part IV — SVM

\[f(x) = w^Tx+b, \qquad \hat y = \text{sign}(f(x))\] \[\text{Hard-margin: } \min_{w,b}\tfrac12\|w\|_2^2 \quad \text{s.t. } y_i(w^Tx_i+b)\ge1\] \[\text{Soft-margin: } \min_{w,b,\xi}\tfrac12\|w\|_2^2+C\sum_i\xi_i \quad \text{s.t. } y_i(w^Tx_i+b)\ge1-\xi_i,\ \xi_i\ge0\] \[\hat y = \text{sign}\Big(\sum_{i\in SV}\alpha_i y_i K(x_i,x)+b\Big)\]

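Larger C → narrower margin, margin violations penalized more heavily (less slack tolerated); smaller C → wider margin, more violations allowed. Kernel trick (RBF: $\exp(-\gamma\|x-z\|_2^2)$) for non-linear separation. Critical: fit the vocabulary, IDF weights, and SVM on the training set only; fitting any of them on test data is data leakage.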

\[\text{Classical pipeline: } \text{keypoints} \to \text{descriptors} \to \text{visual words} \to \text{histogram} \to \text{SVM}\]

Lecture 11 — Classical Vision to Deep Learning

RANSAC

\[\lambda[u,v,1]^T = H[x,y,1]^T \qquad e = \|x'-\hat x'\|_2\] \[N = \left\lceil \frac{\log(1-p)}{\log(1-w^s)} \right\rceil\]

$w$=inlier fraction, $s$=min sample size (2 for line, 4 for homography), $p$=desired success prob. Algorithm: sample minimal set → fit model → count inliers (residual<τ) → keep best consensus set → re-estimate using ALL inliers after N iterations.
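The iteration-count formula is simple to evaluate. The values p = 0.99 and w = 0.5 below are example settings, not numbers from the lecture.

```python
import math


def ransac_iterations(p, w, s):
    """N = ceil(log(1 - p) / log(1 - w^s))."""
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))


n_line = ransac_iterations(p=0.99, w=0.5, s=2)        # line: 2 points per sample
n_homography = ransac_iterations(p=0.99, w=0.5, s=4)  # homography: 4 correspondences
print(n_line, n_homography)
```

With the same inlier fraction, the larger minimal sample for a homography needs several times more iterations than a line, because all s sampled points must be inliers at once.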

CNN spatial arithmetic

\[N_{out} = \left\lfloor \frac{N-F+2P}{S} \right\rfloor + 1\] \[N_{out}^{(dilated)} = \left\lfloor \frac{N+2P-D(F-1)-1}{S} \right\rfloor + 1 \quad (\text{reduces to the first formula for } D=1)\] \[r_l = r_{l-1} + (F_l-1)D_l\, j_{l-1}, \qquad j_l = j_{l-1}\cdot S_l\] \[\text{Params} = (F_h\times F_w\times C_{in}+1)\times C_{out}\]
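The output-size and parameter-count formulas can be checked with two short helpers. The layer shapes used below are common illustrative choices, not values taken from the lecture.

```python
def conv_output_size(n, f, p=0, s=1):
    """N_out = floor((N - F + 2P) / S) + 1."""
    return (n - f + 2 * p) // s + 1


def conv_params(fh, fw, c_in, c_out):
    """(F_h * F_w * C_in + 1) * C_out, the +1 being the bias of each filter."""
    return (fh * fw * c_in + 1) * c_out


same_size = conv_output_size(32, f=3, p=1, s=1)    # F=3, S=1, P=1 keeps the size
downsampled = conv_output_size(224, f=7, p=3, s=2)
n_params = conv_params(3, 3, c_in=3, c_out=64)
print(same_size, downsampled, n_params)
```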
Worked example — 3×3 convolution
X=[[5,2,6],[4,3,4],[3,9,2]], K=[[-1,0,1],[2,1,2],[1,-2,0]]
Output = Σ(X⊙K) (sum of element-wise products) = -5+0+6+8+3+8+3-18+0 = 5
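The same sum of element-wise products in Python. Note that the kernel is applied without flipping, which is how CNN layers compute "convolution" (strictly speaking, cross-correlation).

```python
X = [[5, 2, 6], [4, 3, 4], [3, 9, 2]]
K = [[-1, 0, 1], [2, 1, 2], [1, -2, 0]]

# Multiply matching entries and add them all up.
response = sum(x * k for row_x, row_k in zip(X, K) for x, k in zip(row_x, row_k))
print(response)  # 5
```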

Activation functions

Function | Equation | Note
ReLU | max(0,x) | simple, sparse, efficient
Leaky ReLU | max(0.1x,x) | small grad for negatives
Sigmoid | 1/(1+e⁻ˣ) | range (0,1), vanishing gradient
Tanh | tanh(x) | range (-1,1), zero-centered

Loss & training

\[\theta_{t+1} = \theta_t - \eta\nabla_\theta L_{batch}(\theta_t)\] \[p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad L_{CE} = -\log p_{true} \quad (\text{softmax, multi-class})\] \[\sigma(z_k)=\frac1{1+e^{-z_k}}, \qquad L_{BCE} = -\sum_k[y_k\log p_k+(1-y_k)\log(1-p_k)] \quad (\text{multi-label})\]
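A minimal sketch of softmax and the multi-class cross-entropy loss follows. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that does not change the result; the logits are made up.

```python
import math


def softmax(z):
    """p_k = exp(z_k) / sum_j exp(z_j), computed in a numerically stable way."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, true_idx):
    """L_CE = -log p_true for a single example."""
    return -math.log(softmax(z)[true_idx])


probs = softmax([2.0, 1.0, 0.1])
loss_uniform = cross_entropy([0.0, 0.0], true_idx=0)  # two equally likely classes
print(probs, loss_uniform)
```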

Training loop (6 steps): init filters → forward pass → compute loss → backprop ($\nabla_\theta L$ via chain rule) → optimizer update → repeat over batches/epochs.

Dropout

\[\tilde a_i = \frac{m_i}{q}\cdot a_i \ (\text{train}), \qquad \tilde a_i = a_i \ (\text{inference}), \qquad m_i\sim\text{Bernoulli}(q)\]
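Reading $q$ as the keep probability, which the $1/q$ rescaling in the formula implies, inverted dropout can be sketched as follows. The `rng` argument is an addition for reproducibility, not part of the formula.

```python
import random


def dropout(a, q, train=True, rng=random):
    """Keep each activation with probability q and rescale by 1/q during training."""
    if not train:
        return list(a)  # inference: activations pass through unchanged
    return [x / q if rng.random() < q else 0.0 for x in a]


acts = [1.0] * 10000
train_out = dropout(acts, q=0.8, rng=random.Random(0))
eval_out = dropout(acts, q=0.8, train=False)
mean_train = sum(train_out) / len(train_out)
print(mean_train)
```

The rescaling keeps the expected activation the same in training and inference, which is why no correction is needed at test time.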

Padding, stride, pooling

Zero padding: fair edge processing, F=3,S=1,P=1 → same output size. Stride: larger S downsamples, less compute, less detail. Max pooling: keeps strongest feature. GAP: H×W×C→C, compact + regularizing (vs flattening H×W×C→huge vector, overfit risk).

Classical vs CNN pipeline

Classical

Hand-designed features (SIFT)
Separate classifier (SVM)

CNN

Learns filters from data
Unified feature extraction + classification

Feature hierarchy: early layers = edges/textures, middle = shapes/parts, deep = class-specific (faces, wheels).

Lecture 12 — Modern Vision Models & Applications

Five core output types

Task | Question | Output
Classification | What is main content? | one label
Object Detection | What and where? | boxes + labels
Segmentation | Which pixel = what? | pixel-level masks
Pose Estimation | Structural layout? | keypoints
Tracking | Same object across frames? | persistent track IDs

IoU, NMS, detection metrics

\[\text{IoU}(A,B) = \frac{|A\cap B|}{|A\cup B|}\] \[\text{NMS: keep highest-score box, suppress others with IoU} > \tau_{NMS}\] \[L_{YOLO} = \lambda_{box}L_{box}+\lambda_{cls}L_{cls}+\lambda_{obj}L_{obj}\]
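IoU and greedy NMS can be sketched for axis-aligned boxes. The `(x1, y1, x2, y2)` box format, the threshold of 0.5, and the three boxes are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, tau=0.5):
    """Keep the highest-scoring box, drop boxes whose IoU with a kept box exceeds tau."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= tau for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive
```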

Segmentation & classification metrics

\[\text{Dice}(P,G) = \frac{2|P\cap G|}{|P|+|G|} = \frac{2\cdot\text{IoU}}{1+\text{IoU}}\] \[\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \quad \text{Precision}=\frac{TP}{TP+FP}, \quad \text{Recall}=\frac{TP}{TP+FN}\] \[F1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} = \frac{2TP}{2TP+FP+FN}\] \[\text{AP}_c=\int_0^1 P_c(R)\,dR, \qquad \text{mAP}=\frac1C\sum_c \text{AP}_c\]
Recall matters when missing positives is dangerous (disease, defects). Precision matters when false alarms are costly.
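The gap between accuracy and recall on imbalanced data is easy to see numerically. The confusion-matrix counts below are hypothetical (50 positives among 1000 images), chosen only to illustrate the formulas.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1


def dice_from_iou(iou):
    """Dice = 2 * IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)


# Hypothetical counts: only 10 of 50 positives are found.
accuracy, precision, recall, f1 = classification_metrics(tp=10, tn=945, fp=5, fn=40)
print(accuracy, precision, recall, f1)
print(dice_from_iou(1 / 3))
```

Accuracy stays above 95% even though 80% of the positives are missed, which is why recall is the metric to watch in that situation.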

CLIP

\[s_k = \cos(v,t_k)=\frac{v^Tt_k}{\|v\|_2\|t_k\|_2}, \qquad p_k=\frac{e^{s_k/T}}{\sum_j e^{s_j/T}}\]

Maps image + text to shared embedding space. Zero-shot classification via cosine similarity to candidate text prompts. Does NOT generate captions.
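The zero-shot scoring step can be sketched with toy vectors. The 2-D embeddings and the temperature value are invented for illustration; a real system would obtain the embeddings from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def zero_shot_probs(image_emb, text_embs, temperature=0.5):
    """Softmax over cosine similarities s_k / T between one image and K prompts."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


image_emb = [1.0, 0.0]                # hypothetical image embedding
text_embs = [[2.0, 0.0], [0.0, 3.0]]  # hypothetical prompt embeddings
probs = zero_shot_probs(image_emb, text_embs)
print(probs)
```

Because cosine similarity ignores vector length, only the direction of each embedding matters; a lower temperature makes the resulting distribution sharper.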

Two-stage vs single-stage detectors

Two-stage (R-CNN family)

Region proposals → classify/refine
Faster R-CNN, Mask R-CNN

Single-stage (YOLO family)

Direct box+class prediction
YOLO, SSD, RetinaNet — real-time

YOLO architecture: Backbone (feature extraction) → Neck (feature pyramid, multi-scale) → Detection Head (boxes + class/confidence). Post-process: confidence threshold + NMS.

SAM vs CLIP vs task models

Model | Output | Use
YOLO/SSD/RetinaNet | boxes+classes | real-time detection
Faster/Mask R-CNN | proposals, boxes, masks | accurate detect/instance seg
U-Net, DeepLab | pixel class maps | semantic seg (medical, satellite)
SAM/SAM2 | promptable masks | zero-shot interactive seg
CLIP | image-text similarity | zero-shot recognition

Semantic vs instance segmentation: semantic = one label/pixel, no instance split. Instance = separate mask per object instance.

Tracking algorithms

Algorithm | Mechanism | Best for | Limitation
SORT | Kalman + IoU | fast, clear scenes | ID switches in crowding
DeepSORT | SORT + appearance embeddings | crowded, re-ID after occlusion | more compute
ByteTrack | high+low conf detections | unstable detections | tied to detector quality
BoT-SORT | motion+overlap+appearance | complex tracking | heavy pipeline

Failures: ID switch (swap identities), fragmentation (one object → multiple tracks), occlusion failure (lost/reassigned ID).

Master Formula Sheet

Lecture 7 — Feature Detection

\[E_{m,n}(x,y) = \sum_{(u,v)\in W}\big[I(m+u,n+v)-I(m+x+u,n+y+v)\big]^2\] \[F_{m,n} = \min_{(x,y)\in D} E_{m,n}(x,y)\]

Lecture 8 — Harris

\[M=\sum w(x,y)\begin{bmatrix}I_x^2&I_xI_y\\I_xI_y&I_y^2\end{bmatrix}\] \[R=\det(M)-\alpha\,\text{trace}(M)^2\] \[R_{ShiTomasi}=\min(\lambda_1,\lambda_2)\]

Lecture 9 — SIFT

\[D(x,y,\sigma)\approx(k-1)\sigma^2\nabla^2G(x,y,\sigma)*I(x,y)\] \[m=\sqrt{(L_{x+1}-L_{x-1})^2+(L_{y+1}-L_{y-1})^2}\] \[\text{ratio}=d_1/d_2 < \tau\]

Lecture 10 — Hough / K-Means / BoVW / SVM

\[x\cos\theta+y\sin\theta=\rho\] \[J=\sum_i\sum_{x_n\in C_i}\|x_n-\mu_i\|_2^2\] \[\text{idf}_k=\log(N/df_k)\] \[\min_{w,b}\tfrac12\|w\|_2^2 \ \text{s.t.}\ y_i(w^Tx_i+b)\ge1\]

Lecture 11 — RANSAC / CNN

\[N=\left\lceil\frac{\log(1-p)}{\log(1-w^s)}\right\rceil\] \[N_{out}=\left\lfloor\frac{N-F+2P}{S}\right\rfloor+1\] \[\theta_{t+1}=\theta_t-\eta\nabla_\theta L\]

Lecture 12 — Detection / Metrics

\[\text{IoU}=\frac{|A\cap B|}{|A\cup B|}, \qquad \text{Dice}=\frac{2|P\cap G|}{|P|+|G|}\] \[F1=\frac{2\cdot P\cdot R}{P+R}\] \[s_k=\cos(v,t_k)\]

Key Comparisons (exam favorites)

Pair | Quick answer
Moravec vs Harris | 4 discrete shifts/SSD vs continuous gradient structure tensor
Harris vs Shi-Tomasi | det−α·tr² (needs α) vs min(λ₁,λ₂) (no α, better for tracking)
Harris vs SIFT | rotation-invariant corners only vs scale+rotation invariant blobs+descriptor
DoG vs LoG | fast approx via Gaussian subtraction vs exact 2nd-derivative
Slope-intercept vs polar Hough | unbounded m, fails on vertical lines vs bounded θ∈[0,180°)
Clustering vs classification | unsupervised, arbitrary IDs vs supervised, predefined labels
Two-stage vs single-stage detector | region proposals then classify (accurate) vs direct prediction (fast/real-time)
Semantic vs instance segmentation | one label/pixel vs separate mask per object instance
SAM vs CLIP | promptable pixel masks, no labels vs image-text similarity, zero-shot labels
Flattening vs Global Average Pooling | huge vector, overfit risk vs compact, regularizing
Precision vs Recall | cost of false alarms vs cost of missed positives

Answering Strategy

Conceptual questions
Define concept → explain why needed → connect to example/application.
Numerical questions
1. Identify inputs (pixel values, gradients, K, thresholds). 2. Apply formula step by step. 3. Check rounding/thresholding/classification needed. 4. State result in words (corner/edge/flat, inlier/outlier, detected/rejected).
4 guiding questions per method: 1) What problem does it solve? 2) What's the input/output? 3) Why better than the previous method? 4) What are its limitations/trade-offs?

Connections: Moravec→Harris (robustness). Harris→SIFT (scale invariance). SIFT descriptors→BoVW (fixed-length representation). BoVW histograms→SVM (classification needs fixed vectors). Classical pipeline→CNN (learned vs hand-crafted features). Modern models chosen by required output type (label/box/mask/keypoints/track ID).

Practice Assignment Revision — Lectures 7–12 (30 MCQs)

1. Robot needs stable points to match shelf corners across frames. Best image structure?
a) Flat regions   b) Corners with strong multi-directional variation   c) Uniform walls   d) Global brightness
Answer: b — corners produce significant changes in multiple directions, easy to detect/match.
2. Moravec on plain white wall — expected SSD behavior?
a) Large in all dirs   b) Large in one dir   c) Small in all dirs   d) Random
Answer: c — flat region, no intensity change on shift.
3. Moravec on vertical edge, shifted horizontally/vertically — response?
a) Large in all dirs, definitely corner   b) Small in all dirs   c) Large ⊥ to edge, small ∥ to edge   d) Depends on histogram only
Answer: c — minimum SSD stays small along the edge, distinguishing edge from corner.
4. Moravec misclassifies a diagonal edge as a corner. Why?
a) Too many color channels   b) Evaluates only limited discrete shift directions   c) Wrong eigenvalues   d) Uses learned CNN filter
Answer: b — only checks 4 discrete shifts, rotated/diagonal structures misread.
5. Harris structure tensor gives two large eigenvalues. Indicates?
a) Flat   b) Edge   c) Corner   d) Saturated
Answer: c — strong variation in two independent directions = corner.
6. Flower field image: corners near petals/centers, few in smooth sky. Why?
a) Sky has stronger gradients   b) Flower regions have multi-directional intensity variation   c) Harris detects only blue pixels   d) Smooth regions always give large positive R
Answer: b — sky has small gradients, response near zero.
7. R = det(M) − α·trace(M)². If R large and positive, region is most likely?
a) Corner   b) Flat   c) Pure edge   d) Low-contrast
Answer: a — strong variation in both principal directions.
8. Student raises the Harris threshold significantly. Effect on detected corners?
a) More weak corners detected   b) Detected points decrease   c) All flat regions become corners   d) Resolution increases
Answer: b — higher threshold keeps only stronger responses.
9. Same object far vs close-up. Harris corners don't match well. Why?
a) Harris not scale-invariant   b) Harris can't detect corners   c) Works only on binary images   d) Doesn't use gradients
Answer: a — Harris is rotation-invariant but not scale-invariant; SIFT fixes this.
10. Logo must be recognized at different sizes. Which SIFT stage supports this directly?
a) Scale-space extrema detection   b) Histogram equalization   c) RGB normalization   d) Alpha blending
Answer: a — searches extrema across scale, detects features at their most stable scale.
11. SIFT descriptor window aligned to dominant gradient orientation. Main purpose?
a) Reduce image size   b) Improve rotation invariance   c) Remove all noise   d) Convert to grayscale
Answer: b — same structure produces similar descriptor under rotation.
12. SIFT descriptor: 16×16 neighborhood, 4×4 cells, 8-bin histogram each. Final length?
a) 32   b) 64   c) 128   d) 256
Answer: c — 4×4=16 cells × 8 bins = 128.
13. Road-lane detection needs straight lines even with missing parts. Best method?
a) K-means   b) Hough Transform   c) Dropout   d) Global average pooling
Answer: b — Hough lets multiple edge points vote for the same line even with gaps.
14. Polar Hough accumulator: two parallel lines detected. Expected accumulator pattern?
a) One peak, same d and θ   b) Two peaks, similar θ, different d   c) No peaks   d) Single intensity histogram
Answer: b — same orientation (θ) but different distance from origin (d/ρ).
15. Satellite K-means on intensity only — roads/rooftops merge. Why?
a) K-means can't use numerical features   b) Regions may share similar brightness   c) K-means needs labels   d) Pixel count too small
Answer: b — intensity-only fails when objects share brightness; add color/spatial coords.
16. K-means with very small K. Likely result?
a) Over-segmented into tiny regions   b) Different objects merged into same cluster   c) Becomes supervised   d) Image automatically sharper
Answer: b — too few clusters to represent all meaningful regions.
17. Variable number of SIFT descriptors per image, want SVM. Why need Bag of Visual Words?
a) Convert variable-length descriptors to fixed-length histograms   b) Remove all support vectors   c) Replace feature extraction   d) Perform tracking
Answer: a — SVM needs fixed-length feature vectors.
18. BoVW vocabulary too small. Likely effect?
a) Different structures grouped together, reduced discriminative power   b) Histograms too sparse and perfectly accurate   c) Model becomes scale-invariant automatically   d) SVM no longer needs training data
Answer: a — small vocabulary forces different descriptors into same words.
19. Panorama stitching: many wrong SIFT matches from repeated building windows. Method before homography estimation?
a) RANSAC   b) Dropout   c) Non-Maximum Suppression   d) Global thresholding
Answer: a — RANSAC rejects geometrically inconsistent outliers.
20. RANSAC for homography. Minimum point correspondences needed?
a) 2   b) 3   c) 4   d) 8
Answer: c — homography has 8 DOF, each correspondence gives 2 constraints → 4 needed.
21. CNN classifying cats/dogs/horses/birds. Main difference from classical hand-designed filters?
a) CNN filters learned from data during training   b) Always fixed before training   c) Can't detect edges   d) Used only after SVM
Answer: a — classical pipelines use hand-designed filters/descriptors; CNNs learn them.
22. CNN layer uses stride 2 instead of 1. Main effect?
a) Kernel moves farther each step, usually reduces output size   b) Number of classes doubles   c) Loss function removed   d) Image color-inverted
Answer: a — larger stride skips more positions → smaller feature maps.
23. CNN performs great on training, poorly on new test images (different lighting). Problem?
a) Overfitting   b) Perfect generalization   c) Correct dropout use   d) Under-segmentation
Answer: a — model learned training-specific patterns instead of general features.
24. Traffic system needs real-time bounding boxes + class labels for cars/buses/trucks/pedestrians. Best model family?
a) YOLO   b) K-means   c) SIFT only   d) Harris only
Answer: a — single-stage detector, boxes+labels+confidence in one forward pass, real-time.
25. Detector produces several overlapping boxes around same pedestrian. Post-processing step to keep strongest?
a) Non-Maximum Suppression   b) Histogram equalization   c) K-means centroid update   d) RANSAC homography
Answer: a — NMS removes duplicate overlapping detections, keeps highest-confidence box.
26. Medical system must mark exact tumor boundary, not just abnormal/normal. Best output?
a) One image-level label   b) Histogram of visual words   c) Segmentation mask   d) Single Harris corner point
Answer: c — boundary marking requires pixel-level localization.
27. Dataset: 950 normal, 50 disease images. High accuracy but misses many disease cases. Metric to emphasize?
a) Recall   b) File size   c) Number of conv layers   d) Image width
Answer: a — recall measures detected actual positives; missing disease cases dangerous → minimize false negatives.
28. Sports tracker swaps player identities when two players cross. Best fix?
a) Use only image classification   b) DeepSORT with appearance features   c) Increase Harris threshold   d) Histogram equalization only
Answer: b — appearance features (beyond motion/position) better distinguish overlapping objects.
29. Quality inspection must separate each defective bottle with its own mask. Best output?
a) Classification label   b) Detection bounding box only   c) Instance segmentation mask   d) Global histogram
Answer: c — instance segmentation separates each object instance with own pixel mask.
30. Team wants quick annotation: user clicks object, model produces mask. Best model?
a) SAM   b) YOLO only   c) SVM   d) K-means only
Answer: a — SAM is promptable segmentation, generates masks from point/box prompts.