ICS604 — Introduction to Image Processing & Computer Vision
Final Exam Guide — Full Consolidated Study Page (Lectures 7–12 + MCQ Revision) · Dr. Maggie Shammaa
Lecture 7 — Feature Detection
Why features, not full images
Compact, distinctive structures enable matching/recognition/reconstruction without comparing every pixel. Four common feature types: edges (1D, strong intensity change, Sobel/Canny/LoG), corners (multi-directional intensity change — strongest interest points), blobs (homogeneous regions distinct from surroundings, DoG/LoG), ridges (curves at local extrema — road networks, vessels).
Why corners > edges
Edges suffer the aperture problem: ambiguous along the edge direction itself. Corners are fully localized in both directions.
The three region types
| Region | SSD pattern | Min SSD | Status |
| Flat | small in all directions | small | not interest point |
| Edge | large ⊥, small ∥ to edge | small | not a corner |
| Corner | large in all directions | large | interest point |
Moravec detector
\[E_{m,n}(x,y) = \sum_{(u,v)\in W} \big[I(m+u,n+v) - I(m+x+u,n+y+v)\big]^2\]
\[F_{m,n} = \min_{(x,y)\in D} E_{m,n}(x,y), \qquad D = \{(1,0),(-1,0),(0,1),(0,-1)\}\]
\[\text{Corner at } (m,n) \iff F_{m,n}\text{ is a local max AND } F_{m,n} > T\]
Algorithm: place window → shift in 4 directions → compute SSD each → take minimum → mark corner if locally maximal & above threshold.
Worked example — corner-like boundary
Background=128, object=0. Shift up: SSD=(0−128)²+(128−0)²=32768 (large). Shift right: SSD=16384 (large). Only when ALL 4 shift directions give large SSD does the minimum stay large → corner.
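The Moravec steps above can be sketched in plain Python. This is a minimal illustration, assuming a 3×3 window and three hypothetical 5×5 test images (flat, vertical edge, corner); the function names are illustrative and not from the lecture.

```python
# Moravec response F = min over the 4 shifts in D of the windowed SSD.
# Images are lists of rows indexed img[m][n]; all names here are illustrative.
SHIFTS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def ssd(img, m, n, x, y, half=1):
    """SSD between the window centred at (m, n) and the window shifted by (x, y)."""
    total = 0
    for u in range(-half, half + 1):
        for v in range(-half, half + 1):
            diff = img[m + u][n + v] - img[m + x + u][n + y + v]
            total += diff * diff
    return total

def moravec(img, m, n):
    return min(ssd(img, m, n, x, y) for x, y in SHIFTS)

# Hypothetical test images: background = 128, object = 0.
flat = [[128] * 5 for _ in range(5)]
edge = [[0, 0, 0, 128, 128] for _ in range(5)]
corner = [[0, 0, 0, 128, 128]] * 3 + [[128] * 5] * 2

print(moravec(flat, 2, 2))    # 0: small in all directions
print(moravec(edge, 2, 2))    # 0: shifting along the edge changes nothing
print(moravec(corner, 2, 2))  # 32768: every shift gives a large SSD
```

Only the corner image keeps a large minimum, which matches the region table: a single small SSD direction is enough to pull the minimum down.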
Moravec limitations → motivates Harris
| Limitation | Cause | Harris fix |
| Not rotation invariant | only 4 discrete shifts | gradient structure tensor (continuous) |
| Sensitive to noise | raw intensity comparison | Gaussian-weighted window |
| Edges as corners | min over 4 dirs stays large for edges not aligned with a shift direction (e.g. diagonal) | eigenvalue analysis |
| Poor localization | uniform window weighting | center-weighted Gaussian |
| Anisotropic response | depends on chosen shift set | continuous gradient formulation |
Lecture 8 — Harris Corner Detection
Structure tensor (second-moment matrix, loosely the gradient covariance matrix)
\[M = \sum_{x,y} w(x,y)\begin{bmatrix}I_x^2 & I_x I_y\\ I_x I_y & I_y^2\end{bmatrix}\]
$I_x,I_y$ = image gradients, $w$ = Gaussian weighting window (center pixels weighted more → better localization, less noise sensitivity).
Taylor expansion derivation
\[I(x+u,y+v) \approx I(x,y) + u I_x + v I_y\]
\[E(u,v) \approx [u\ v]\, M \,[u\ v]^T\]
Harris response
\[R = \det(M) - \alpha \cdot \text{trace}(M)^2, \qquad \alpha \in [0.04, 0.06]\]
\[\det(M)=\lambda_1\lambda_2 \qquad \text{trace}(M)=\lambda_1+\lambda_2\]
\[\lambda = \frac{\text{trace}(M) \pm \sqrt{\text{trace}(M)^2 - 4\det(M)}}{2}\]
Eigenvalue / R classification
| Eigenvalues | R | Region |
| both ≈ 0 | R ≈ 0 | Flat |
| one large, one ≈0 | R < 0 | Edge |
| both large | R ≫ 0 | Corner |
α too small → edges falsely flagged as corners. α too large → true corners missed. Standard α=0.04.
Worked example — Harris R at pixel (2,2)
$\Sigma I_x^2=403,\ \Sigma I_y^2=381,\ \Sigma I_xI_y=385$
$M=\begin{bmatrix}403&385\\385&381\end{bmatrix}$
$\det(M)=403\times381-385^2=5318$, $\text{trace}(M)=784$
$R = 5318 - 0.04\times784^2 = 5318-24586.24 = -19268.24$ → R<0 → EDGE.
Shi-Tomasi (alternative response)
\[R_{ShiTomasi} = \min(\lambda_1,\lambda_2)\]
No empirical α needed, better localization, used in "Good Features to Track" — optical flow/tracking.
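The worked example and both response functions can be checked numerically. A minimal sketch, assuming the three gradient sums are already available; the function names are illustrative.

```python
import math

def harris_response(sxx, syy, sxy, alpha=0.04):
    """R = det(M) - alpha * trace(M)^2 from the three gradient sums."""
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - alpha * trace * trace

def eigenvalues(sxx, syy, sxy):
    """Closed-form eigenvalues of the symmetric 2x2 matrix M."""
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    root = math.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

# Values from the worked example at pixel (2,2).
r = harris_response(403, 381, 385)
lam1, lam2 = eigenvalues(403, 381, 385)
shi_tomasi = min(lam1, lam2)

print(round(r, 2))            # -19268.24, so R < 0: edge
print(lam1 > 100 * lam2)      # True: one dominant eigenvalue, edge-like
```

The two eigenvalues are very unequal here, so Shi-Tomasi (small minimum eigenvalue) and Harris (negative R) agree that this pixel is not a corner.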
Detector comparison
| Property | Moravec | Harris | Shi-Tomasi |
| Directions | 4 discrete | all continuous | all continuous |
| Window | uniform | Gaussian | Gaussian |
| Response | min(SSD) | det(M)−α·tr(M)² | min(λ₁,λ₂) |
| Free param | none | α | none |
| Rotation invariant | no | yes | yes |
| Use | historical | general corners | tracking, optical flow |
Why no explicit eigenvalue decomposition: Harris uses only det(M) and trace(M), both directly computable from ΣI_x², ΣI_y², ΣI_xI_y — efficient for real-time.
Lecture 9 — Feature Descriptors (SIFT)
Harris is NOT scale-invariant
Rotation invariant (eigenvalues unchanged under rotation) but window size is fixed — same corner looks different at different zoom. Motivates SIFT.
Scale-normalized LoG & DoG
\[\text{LoG}_{norm}(x,y,\sigma) = \sigma^2 \nabla^2 G(x,y,\sigma)\]
\[D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma) \approx (k-1)\sigma^2 \nabla^2 G(x,y,\sigma) * I(x,y)\]
\[k = 2^{1/S} \quad (S = \text{scale intervals per octave})\]
DoG is a fast approximation of LoG. Canonical SIFT S=3 → 6 Gaussian images, 5 DoG images per octave.
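The image counts per octave follow directly from $k = 2^{1/S}$. A small sketch, assuming a base scale of σ₀ = 1.6 (a commonly quoted SIFT value that is not stated in these notes):

```python
S = 3                 # scale intervals per octave
sigma0 = 1.6          # assumed base sigma, not given in the notes
k = 2 ** (1 / S)

# S + 3 Gaussian images per octave, so that S + 2 DoG images can be formed.
gaussian_sigmas = [sigma0 * k ** i for i in range(S + 3)]
num_dog = len(gaussian_sigmas) - 1

print(len(gaussian_sigmas), num_dog)   # 6 5
```

After S steps the scale has doubled (σ₀·k^S = 2σ₀), which is where the next octave starts.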
SIFT 4-step pipeline
1. Scale-space extrema detection
Build Gaussian pyramid (octaves)
Compute DoG at each scale
Compare to 26 neighbors (8+9+9)
2. Keypoint localization & filtering
Fit 3D quadratic, subpixel refine
Reject low contrast (<0.03)
Reject edge-like via Hessian ratio (r_th=10)
\[x^* = -\left(\frac{\partial^2 D}{\partial x^2}\right)^{-1}\frac{\partial D}{\partial x}\]
\[\text{Reject edge-like if } \frac{\text{Tr}(H)^2}{\det(H)} \ge \frac{(r_{th}+1)^2}{r_{th}}\]
3. Orientation assignment
Gradient mag + direction per pixel
36 bins (10° each), Gaussian-weighted vote
Dominant bin = orientation; ≥80% peak → extra keypoint
4. Descriptor (128-D)
16×16 window, rotated to orientation
4×4 cells × 8 orientation bins = 128
Interpolated votes reduce boundary effects
\[m(x,y)=\sqrt{(L(x{+}1,y){-}L(x{-}1,y))^2+(L(x,y{+}1){-}L(x,y{-}1))^2}\]
\[\theta(x,y) = \text{atan2}\big(L(x,y{+}1){-}L(x,y{-}1),\ L(x{+}1,y){-}L(x{-}1,y)\big)\]
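The two rejection tests from step 2 can be sketched as a single filter. This is an illustration under assumptions: the Hessian entries and contrast value are passed in directly, and the extra `det <= 0` check (curvatures of opposite sign) is a common addition that is not stated in the notes.

```python
def keep_keypoint(dxx, dyy, dxy, contrast, r_th=10.0, contrast_th=0.03):
    """Return True if a DoG extremum passes the contrast and edge tests."""
    if abs(contrast) < contrast_th:
        return False                      # low contrast
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False                      # assumption: reject saddle-like points
    # Keep only if Tr(H)^2 / det(H) is below (r_th + 1)^2 / r_th.
    return trace * trace / det < (r_th + 1) ** 2 / r_th

print(keep_keypoint(1.0, 1.0, 0.0, 0.1))    # True: equal curvatures, ratio 4
print(keep_keypoint(20.0, 1.0, 0.0, 0.1))   # False: ratio 22.05 >= 12.1
print(keep_keypoint(1.0, 1.0, 0.0, 0.01))   # False: contrast below 0.03
```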
Descriptor normalization (illumination robustness)
\[v_1 = v/\|v\|_2 \quad\to\quad v_2[i]=\min(v_1[i],0.2) \quad\to\quad \hat v = v_2/\|v_2\|_2\]
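The normalize, clip, renormalize chain can be sketched as follows. A toy 4-D vector stands in for the 128-D descriptor; the function name is illustrative.

```python
import math

def normalize_descriptor(v, clip=0.2):
    """L2-normalize, clip each entry at `clip`, then L2-normalize again."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    v1 = [x / norm for x in v]
    v2 = [min(x, clip) for x in v1]
    norm2 = math.sqrt(sum(x * x for x in v2))
    return [x / norm2 for x in v2]

d = normalize_descriptor([10.0, 1.0, 1.0, 1.0])
d_brighter = normalize_descriptor([20.0, 2.0, 2.0, 2.0])
```

Scaling the input (a contrast change) leaves the result unchanged, and clipping limits the influence of one very large gradient bin.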
Matching — Lowe ratio test
\[d(d_A,d_B) = \sqrt{\sum_{i=1}^{128}(d_A[i]-d_B[i])^2}\]
\[\text{ratio} = d_1/d_2, \quad \text{accept if ratio} < \tau\ (\tau\approx 0.7\text{–}0.8)\]
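A minimal sketch of the ratio test, assuming brute-force nearest-neighbour search over at least two candidate descriptors; the 2-D vectors are hypothetical stand-ins for 128-D descriptors.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_match(query, candidates, tau=0.75):
    """Return the index of the best match, or None if the match is ambiguous."""
    dists = sorted((euclidean(query, c), i) for i, c in enumerate(candidates))
    (d1, best), (d2, _) = dists[0], dists[1]
    if d2 == 0:
        return None                       # two identical candidates: ambiguous
    return best if d1 / d2 < tau else None

print(ratio_test_match([0.0, 0.0], [[0.1, 0.0], [1.0, 0.0], [2.0, 0.0]]))  # 0
print(ratio_test_match([0.0, 0.0], [[1.0, 0.0], [0.0, 1.1]]))              # None
```

The second query is rejected because its two nearest candidates are almost equally close, the typical situation with repeated structures.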
Harris vs SIFT
| Property | Harris | SIFT |
| Feature type | corners | blobs |
| Scale invariant | no | yes (multi-scale DoG) |
| Rotation invariant | yes | yes (orientation align) |
| Descriptor | none | 128-D histogram |
| Matching | n/a | Euclidean + Lowe ratio |
Lecture 10 — Classical Image Recognition
Part I — Hough Transform
\[x\cos\theta + y\sin\theta = \rho, \qquad \theta\in[0^\circ,180^\circ)\]
Why polar > slope-intercept: $m\to\infty$ for vertical lines under $y=mx+b$; the polar form keeps both parameters bounded, so the accumulator can be sampled uniformly and stays compact.
Duality
1 image point → 1 sinusoid in (ρ,θ)
Collinear points → sinusoids intersect at common (ρ*,θ*)
Algorithm (6 steps)
Edge detect → init accumulator → vote per edge pixel → find peaks → threshold/NMS → convert peaks to lines
\[\rho_i(\theta) = x_i\cos\theta + y_i\sin\theta\]
Gradient-guided Hough: vote only near gradient direction θ_g (≈ line normal) instead of all θ → O(N_edge) instead of O(N_edge·N_θ), fewer false peaks.
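The voting step of the basic algorithm can be sketched with a dictionary accumulator. This is an illustration under assumptions: θ is sampled in whole degrees, ρ is rounded to integers, and the edge points are a hypothetical vertical line x = 3.

```python
import math

def hough_votes(points, n_theta=180):
    """Each point votes once per theta bin for rho = x cos(theta) + y sin(theta)."""
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.radians(t * 180 / n_theta)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return acc

points = [(3, y) for y in range(10)]   # hypothetical edge pixels on x = 3
acc = hough_votes(points)

print(acc[(3, 0)])          # 10: all points agree on rho = 3, theta = 0 degrees
print(max(acc.values()))    # 10
```

With such coarse bins, a few neighbouring θ bins collect the same number of votes, which is why the peak-finding step is followed by thresholding and NMS.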
Part II — K-Means
\[J = \sum_{i=1}^K \sum_{x_n\in C_i} \|x_n-\mu_i\|_2^2\]
Steps: choose K → init centroids → assign each point to nearest centroid → update centroid = mean of assigned → repeat to convergence. Clustering ≠ classification: cluster IDs are arbitrary, not semantic labels.
| Feature | Groups by | Caution |
| [I] | brightness | distant same-intensity objects merge |
| [R,G,B] | color | ignores spatial coherence |
| [I,x,y] | brightness+proximity | features must be balanced/scaled |
| [R,G,B,λx·x,λy·y] | color+spatial | raw coords can dominate w/o scaling |
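The assign/update loop can be sketched on a single intensity feature, the [I] row of the table. A minimal version, assuming fixed initial centroids and a fixed iteration count rather than a convergence test; the intensity values are hypothetical.

```python
def kmeans_1d(values, centroids, iters=10):
    """K-means on scalar features: assign to nearest centroid, then update means."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([10, 12, 11, 200, 205, 198], [0.0, 255.0])
print(centroids)   # [11.0, 201.0]
```

The cluster indices 0 and 1 carry no meaning of their own; swapping the initial centroids would swap the IDs.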
Part III — Bag of Visual Words
\[q(d_j) = \arg\min_{k} \|d_j - w_k\|_2^2\]
\[h_k = \sum_{j=1}^M \mathbb{1}[q(d_j)=k]\]
\[h_{L1} = h/\textstyle\sum_k|h_k| \qquad h_{L2}=h/\|h\|_2\]
\[\text{tf}_{k,d} = \frac{n_{k,d}}{\sum_j n_{j,d}}, \qquad \text{idf}_k = \log\frac{N}{df_k}\]
Pipeline: detect keypoints → describe (SIFT) → K-means cluster training descriptors → vocabulary {w₁..w_K} → quantize each descriptor to nearest word → build K-D histogram per image → normalize → feed to SVM. Converts variable-length descriptor sets into fixed-length vectors, permutation-invariant.
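The quantize-and-count step can be sketched as below, assuming the vocabulary has already been learned; the 2-D vectors are hypothetical stand-ins for SIFT descriptors and the result is L1-normalized.

```python
def bovw_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and return an L1 histogram."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        k = min(range(len(vocabulary)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(d, vocabulary[i])))
        hist[k] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

vocab = [[0.0, 0.0], [1.0, 1.0]]
h_a = bovw_histogram([[0.1, 0.0], [0.9, 1.0], [1.0, 1.2], [0.0, 0.2]], vocab)
h_b = bovw_histogram([[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]], vocab)
print(h_a)   # [0.5, 0.5]
```

Both images yield a vector of length K = 2 even though they contain different numbers of descriptors, which is what the SVM needs.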
Part IV — SVM
\[f(x) = w^Tx+b, \qquad \hat y = \text{sign}(f(x))\]
\[\text{Hard-margin: } \min_{w,b}\tfrac12\|w\|_2^2 \quad \text{s.t. } y_i(w^Tx_i+b)\ge1\]
\[\text{Soft-margin: } \min_{w,b,\xi}\tfrac12\|w\|_2^2+C\sum_i\xi_i \quad \text{s.t. } y_i(w^Tx_i+b)\ge1-\xi_i,\ \xi_i\ge0\]
\[\hat y = \text{sign}\Big(\sum_{i\in SV}\alpha_i y_i K(x_i,x)+b\Big)\]
Larger C → violations penalized more, narrower margin (closer to hard-margin); smaller C → wider margin, more violations tolerated. Kernel trick (RBF: $\exp(-\gamma\|x-z\|_2^2)$) for non-linear separation. Critical: vocabulary/IDF/SVM are all fit on the training set only; fitting on test data = data leakage.
\[\text{Classical pipeline: } \text{keypoints} \to \text{descriptors} \to \text{visual words} \to \text{histogram} \to \text{SVM}\]
Lecture 11 — Classical Vision to Deep Learning
RANSAC
\[\lambda[u,v,1]^T = H[x,y,1]^T \qquad e = \|x'-\hat x'\|_2\]
\[N = \left\lceil \frac{\log(1-p)}{\log(1-w^s)} \right\rceil\]
$w$=inlier fraction, $s$=min sample size (2 for line, 4 for homography), $p$=desired success prob. Algorithm: sample minimal set → fit model → count inliers (residual<τ) → keep best consensus set → re-estimate using ALL inliers after N iterations.
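The iteration count is easy to evaluate for exam-style numbers. A small sketch; the values p = 0.99 and w = 0.5 are illustrative choices, not from the lecture.

```python
import math

def ransac_iterations(p, w, s):
    """Number of samples N so that at least one all-inlier sample occurs with prob. p."""
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))

print(ransac_iterations(0.99, 0.5, 2))   # 17 (line, s = 2)
print(ransac_iterations(0.99, 0.5, 4))   # 72 (homography, s = 4)
```

The larger minimal sample for a homography makes an all-inlier draw much rarer, so N grows quickly with s at the same inlier fraction.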
CNN spatial arithmetic
\[N_{out} = \left\lfloor \frac{N-F+2P}{S} \right\rfloor + 1\]
\[N_{out}^{(dilated)} = \left\lfloor \frac{N+2P-D(F-1)-1}{S} \right\rfloor + 1\]
\[r_l = r_{l-1} + (F_l-1)D_l\, j_{l-1}, \qquad j_l = j_{l-1}\cdot S_l\]
\[\text{Params} = (F_h\times F_w\times C_{in}+1)\times C_{out}\]
Worked example — 3×3 convolution
X=[[5,2,6],[4,3,4],[3,9,2]], K=[[-1,0,1],[2,1,2],[1,-2,0]]
X⊙K = -5+0+6+8+3+8+3-18+0 = 5
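The output-size and parameter formulas, together with the worked 3×3 response, can be verified in a few lines. A minimal sketch; the layer sizes in the usage lines are hypothetical.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((N - F + 2P) / S) + 1, assuming N + 2P >= F."""
    return (n - f + 2 * p) // s + 1

def conv_params(fh, fw, c_in, c_out):
    """Weights plus one bias per output channel."""
    return (fh * fw * c_in + 1) * c_out

X = [[5, 2, 6], [4, 3, 4], [3, 9, 2]]
K = [[-1, 0, 1], [2, 1, 2], [1, -2, 0]]
response = sum(X[i][j] * K[i][j] for i in range(3) for j in range(3))

print(response)                          # 5
print(conv_output_size(32, 3, p=1))      # 32: F=3, S=1, P=1 keeps the size
print(conv_output_size(32, 3, s=2))      # 15
print(conv_params(3, 3, 3, 16))          # 448
```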
Activation functions
| Function | Equation | Note |
| ReLU | max(0,x) | simple, sparse, efficient |
| Leaky ReLU | max(0.1x,x) | small grad for negatives (slope 0.1 here; 0.01 is another common choice) |
| Sigmoid | 1/(1+e⁻ˣ) | (0,1), vanishing gradient |
| Tanh | tanh(x) | (-1,1), zero-centered |
Loss & training
\[\theta_{t+1} = \theta_t - \eta\nabla_\theta L_{batch}(\theta_t)\]
\[p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad L_{CE} = -\log p_{true} \quad (\text{softmax, multi-class})\]
\[\sigma(z_k)=\frac1{1+e^{-z_k}}, \qquad L_{BCE} = -\sum_k[y_k\log p_k+(1-y_k)\log(1-p_k)] \quad (\text{multi-label})\]
Training loop (6 steps): init filters → forward pass → compute loss → backprop ($\nabla_\theta L$ via chain rule) → optimizer update → repeat over batches/epochs.
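The softmax cross-entropy used in the loss step can be sketched as below. Subtracting the maximum logit is a standard numerical-stability trick that is an addition here, not part of the notes; the logits are hypothetical.

```python
import math

def softmax(z):
    m = max(z)                              # stability: avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, true_idx):
    """L_CE = -log p_true for a single sample."""
    return -math.log(softmax(z)[true_idx])

uniform_loss = cross_entropy([0.0, 0.0, 0.0], 0)    # equals log(3)
confident_loss = cross_entropy([5.0, 0.0, 0.0], 0)  # smaller: correct class dominates
```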
Dropout
\[\tilde a_i = \frac{m_i}{q}\cdot a_i \ (\text{train}), \qquad \tilde a_i = a_i \ (\text{inference}), \qquad m_i\sim\text{Bernoulli}(q),\ q=\text{keep probability}\]
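A minimal sketch of this (inverted) dropout rule, assuming q is the keep probability; the `rng` parameter is an illustrative addition so that results can be reproduced.

```python
import random

def dropout(activations, q=0.8, train=True, rng=random):
    """Keep each activation with probability q and rescale by 1/q during training."""
    if not train:
        return list(activations)            # inference: identity
    return [a / q if rng.random() < q else 0.0 for a in activations]

out = dropout([1.0] * 1000, q=0.5, rng=random.Random(0))
mean_activation = sum(out) / len(out)       # stays close to 1.0 on average
```

Dividing by q during training keeps the expected activation unchanged, which is why no rescaling is needed at inference.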
Padding, stride, pooling
Zero padding: fair edge processing, F=3,S=1,P=1 → same output size. Stride: larger S downsamples, less compute, less detail. Max pooling: keeps strongest feature. GAP: H×W×C→C, compact + regularizing (vs flattening H×W×C→huge vector, overfit risk).
Classical vs CNN pipeline
Classical
Hand-designed features (SIFT)
Separate classifier (SVM)
CNN
Learns filters from data
Unified feature extraction + classification
Feature hierarchy: early layers = edges/textures, middle = shapes/parts, deep = class-specific (faces, wheels).
Lecture 12 — Modern Vision Models & Applications
Five core output types
| Task | Question | Output |
| Classification | What is main content? | one label |
| Object Detection | What and where? | boxes + labels |
| Segmentation | Which pixel = what? | pixel-level masks |
| Pose Estimation | Structural layout? | keypoints |
| Tracking | Same object across frames? | persistent track IDs |
IoU, NMS, detection metrics
\[\text{IoU}(A,B) = \frac{|A\cap B|}{|A\cup B|}\]
\[\text{NMS: keep highest-score box, suppress others with IoU} > \tau_{NMS}\]
\[L_{YOLO} = \lambda_{box}L_{box}+\lambda_{cls}L_{cls}+\lambda_{obj}L_{obj}\]
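IoU and greedy NMS can be sketched in plain Python. This is an illustration, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the boxes and scores are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, tau=0.5):
    """Visit boxes by descending score; keep a box unless it overlaps a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= tau for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2]: box 1 overlaps box 0 with IoU 81/119
```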
Segmentation & classification metrics
\[\text{Dice}(P,G) = \frac{2|P\cap G|}{|P|+|G|} = \frac{2\cdot\text{IoU}}{1+\text{IoU}}\]
\[\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \quad \text{Precision}=\frac{TP}{TP+FP}, \quad \text{Recall}=\frac{TP}{TP+FN}\]
\[F1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} = \frac{2TP}{2TP+FP+FN}\]
\[\text{AP}_c=\int_0^1 P_c(R)\,dR, \qquad \text{mAP}=\frac1C\sum_c \text{AP}_c\]
Recall matters when missing positives is dangerous (disease, defects). Precision matters when false alarms are costly.
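The metric formulas can be checked on an imbalanced case. A sketch with hypothetical counts: 950 normal and 50 disease images, of which only 10 disease cases are detected and no false alarms occur.

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, precision, recall, f1

def dice_from_iou(iou):
    return 2 * iou / (1 + iou)

# Hypothetical confusion counts for the imbalanced dataset described above.
acc, prec, rec, f1 = metrics(tp=10, tn=950, fp=0, fn=40)
print(acc, prec, rec)   # 0.96 1.0 0.2
```

Accuracy and precision look excellent while recall shows that 80% of disease cases are missed, which is the situation where recall should be emphasized.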
CLIP
\[s_k = \cos(v,t_k)=\frac{v^Tt_k}{\|v\|_2\|t_k\|_2}, \qquad p_k=\frac{e^{s_k/T}}{\sum_j e^{s_j/T}}\]
Maps image + text to shared embedding space. Zero-shot classification via cosine similarity to candidate text prompts. Does NOT generate captions.
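Zero-shot scoring reduces to cosine similarity followed by a temperature softmax. A toy sketch: the 2-D embeddings are hypothetical, the temperature value 0.07 is an assumption, and no real CLIP model or API is involved.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_probs(image_vec, text_vecs, temperature=0.07):
    """Softmax over cosine similarities s_k / T between one image and K prompts."""
    sims = [cosine(image_vec, t) / temperature for t in text_vecs]
    m = max(sims)                            # stability: avoids overflow in exp
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings: the image is close to prompt 0, far from prompt 1.
probs = zero_shot_probs([1.0, 0.2], [[0.9, 0.1], [0.0, 1.0]])
predicted = probs.index(max(probs))
```

A small temperature sharpens the distribution, so even modest similarity gaps produce a confident prediction.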
Two-stage vs single-stage detectors
Two-stage (R-CNN family)
Region proposals → classify/refine
Faster R-CNN, Mask R-CNN
Single-stage (YOLO family)
Direct box+class prediction
YOLO, SSD, RetinaNet — real-time
YOLO architecture: Backbone (feature extraction) → Neck (feature pyramid, multi-scale) → Detection Head (boxes + class/confidence). Post-process: confidence threshold + NMS.
SAM vs CLIP vs task models
| Model | Output | Use |
| YOLO/SSD/RetinaNet | boxes+classes | real-time detection |
| Faster/Mask R-CNN | proposals, boxes, masks | accurate detect/instance seg |
| U-Net, DeepLab | pixel class maps | semantic seg (medical, satellite) |
| SAM/SAM2 | promptable masks | zero-shot interactive seg |
| CLIP | image-text similarity | zero-shot recognition |
Semantic vs instance segmentation: semantic = one label/pixel, no instance split. Instance = separate mask per object instance.
Tracking algorithms
| Algorithm | Mechanism | Best for | Limitation |
| SORT | Kalman + IoU | fast, clear scenes | ID switches in crowding |
| DeepSORT | SORT + appearance embeddings | crowded, re-ID after occlusion | more compute |
| ByteTrack | high+low conf detections | unstable detections | tied to detector quality |
| BoT-SORT | motion+overlap+appearance | complex tracking | heavy pipeline |
Failures: ID switch (swap identities), fragmentation (one object → multiple tracks), occlusion failure (lost/reassigned ID).
Lecture 7 — Feature Detection
\[E_{m,n}(x,y) = \sum_{(u,v)\in W}\big[I(m+u,n+v)-I(m+x+u,n+y+v)\big]^2\]
\[F_{m,n} = \min_{(x,y)\in D} E_{m,n}(x,y)\]
Lecture 8 — Harris
\[M=\sum w(x,y)\begin{bmatrix}I_x^2&I_xI_y\\I_xI_y&I_y^2\end{bmatrix}\]
\[R=\det(M)-\alpha\,\text{trace}(M)^2\]
\[R_{ShiTomasi}=\min(\lambda_1,\lambda_2)\]
Lecture 9 — SIFT
\[D(x,y,\sigma)\approx(k-1)\sigma^2\nabla^2G(x,y,\sigma)*I(x,y)\]
\[m=\sqrt{(L_{x+1}-L_{x-1})^2+(L_{y+1}-L_{y-1})^2}\]
\[\text{ratio}=d_1/d_2 < \tau\]
Lecture 10 — Hough / K-Means / BoVW / SVM
\[x\cos\theta+y\sin\theta=\rho\]
\[J=\sum_i\sum_{x_n\in C_i}\|x_n-\mu_i\|_2^2\]
\[\text{idf}_k=\log(N/df_k)\]
\[\min_{w,b}\tfrac12\|w\|_2^2 \ \text{s.t.}\ y_i(w^Tx_i+b)\ge1\]
Lecture 11 — RANSAC / CNN
\[N=\left\lceil\frac{\log(1-p)}{\log(1-w^s)}\right\rceil\]
\[N_{out}=\left\lfloor\frac{N-F+2P}{S}\right\rfloor+1\]
\[\theta_{t+1}=\theta_t-\eta\nabla_\theta L\]
Lecture 12 — Detection / Metrics
\[\text{IoU}=\frac{|A\cap B|}{|A\cup B|}, \qquad \text{Dice}=\frac{2|P\cap G|}{|P|+|G|}\]
\[F1=\frac{2\cdot P\cdot R}{P+R}\]
\[s_k=\cos(v,t_k)\]
Key Comparisons (exam favorites)
| Pair | Quick answer |
| Moravec vs Harris | 4 discrete shifts/SSD vs continuous gradient structure tensor |
| Harris vs Shi-Tomasi | det−α·tr² (needs α) vs min(λ₁,λ₂) (no α, better for tracking) |
| Harris vs SIFT | rotation-invariant corners only vs scale+rotation invariant blobs+descriptor |
| DoG vs LoG | fast approx via Gaussian subtraction vs exact 2nd-derivative |
| Slope-intercept vs polar Hough | unbounded m, fails on vertical lines vs bounded θ∈[0,180°) |
| Clustering vs classification | unsupervised, arbitrary IDs vs supervised, predefined labels |
| Two-stage vs single-stage detector | region proposals then classify (accurate) vs direct prediction (fast/real-time) |
| Semantic vs instance segmentation | one label/pixel vs separate mask per object instance |
| SAM vs CLIP | promptable pixel masks, no labels vs image-text similarity, zero-shot labels |
| Flattening vs Global Average Pooling | huge vector, overfit risk vs compact, regularizing |
| Precision vs Recall | cost of false alarms vs cost of missed positives |
Answering Strategy
Conceptual questions
Define concept → explain why needed → connect to example/application.
Numerical questions
1. Identify inputs (pixel values, gradients, K, thresholds). 2. Apply formula step by step. 3. Check rounding/thresholding/classification needed. 4. State result in words (corner/edge/flat, inlier/outlier, detected/rejected).
4 guiding questions per method: 1) What problem does it solve? 2) What's the input/output? 3) Why better than the previous method? 4) What are its limitations/trade-offs?
Connections: Moravec→Harris (robustness). Harris→SIFT (scale invariance). SIFT descriptors→BoVW (fixed-length representation). BoVW histograms→SVM (classification needs fixed vectors). Classical pipeline→CNN (learned vs hand-crafted features). Modern models chosen by required output type (label/box/mask/keypoints/track ID).
Practice Assignment Revision — Lectures 7–12 (30 MCQs)
1. Robot needs stable points to match shelf corners across frames. Best image structure?
a) Flat regions b) Corners with strong multi-directional variation c) Uniform walls d) Global brightness
Answer: b — corners produce significant changes in multiple directions, easy to detect/match.
2. Moravec on plain white wall — expected SSD behavior?
a) Large in all dirs b) Large in one dir c) Small in all dirs d) Random
Answer: c — flat region, no intensity change on shift.
3. Moravec on vertical edge, shifted horizontally/vertically — response?
a) Large in all dirs, definitely corner b) Small in all dirs c) Large ⊥ to edge, small ∥ to edge d) Depends on histogram only
Answer: c — minimum SSD stays small along the edge, distinguishing edge from corner.
4. Moravec misclassifies a diagonal edge as a corner. Why?
a) Too many color channels b) Evaluates only limited discrete shift directions c) Wrong eigenvalues d) Uses learned CNN filter
Answer: b — only checks 4 discrete shifts, rotated/diagonal structures misread.
5. Harris structure tensor gives two large eigenvalues. Indicates?
a) Flat b) Edge c) Corner d) Saturated
Answer: c — strong variation in two independent directions = corner.
6. Flower field image: corners near petals/centers, few in smooth sky. Why?
a) Sky has stronger gradients b) Flower regions have multi-directional intensity variation c) Harris detects only blue pixels d) Smooth regions always give large positive R
Answer: b — sky has small gradients, response near zero.
7. R = det(M) − α·trace(M)². If R large and positive, region is most likely?
a) Corner b) Flat c) Pure edge d) Low-contrast
Answer: a — strong variation in both principal directions.
8. Student raises the Harris threshold significantly. Effect on detected corners?
a) More weak corners detected b) Detected points decrease c) All flat regions become corners d) Resolution increases
Answer: b — higher threshold keeps only stronger responses.
9. Same object far vs close-up. Harris corners don't match well. Why?
a) Harris not scale-invariant b) Harris can't detect corners c) Works only on binary images d) Doesn't use gradients
Answer: a — Harris is rotation-invariant but not scale-invariant; SIFT fixes this.
10. Logo must be recognized at different sizes. Which SIFT stage supports this directly?
a) Scale-space extrema detection b) Histogram equalization c) RGB normalization d) Alpha blending
Answer: a — searches extrema across scale, detects features at their most stable scale.
11. SIFT descriptor window aligned to dominant gradient orientation. Main purpose?
a) Reduce image size b) Improve rotation invariance c) Remove all noise d) Convert to grayscale
Answer: b — same structure produces similar descriptor under rotation.
12. SIFT descriptor: 16×16 neighborhood, 4×4 cells, 8-bin histogram each. Final length?
a) 32 b) 64 c) 128 d) 256
Answer: c — 4×4=16 cells × 8 bins = 128.
13. Road-lane detection needs straight lines even with missing parts. Best method?
a) K-means b) Hough Transform c) Dropout d) Global average pooling
Answer: b — Hough lets multiple edge points vote for the same line even with gaps.
14. Polar Hough accumulator: two parallel lines detected. Expected accumulator pattern?
a) One peak, same d and θ b) Two peaks, similar θ, different d c) No peaks d) Single intensity histogram
Answer: b — same orientation (θ) but different distance from origin (d/ρ).
15. Satellite K-means on intensity only — roads/rooftops merge. Why?
a) K-means can't use numerical features b) Regions may share similar brightness c) K-means needs labels d) Pixel count too small
Answer: b — intensity-only fails when objects share brightness; add color/spatial coords.
16. K-means with very small K. Likely result?
a) Over-segmented into tiny regions b) Different objects merged into same cluster c) Becomes supervised d) Image automatically sharper
Answer: b — too few clusters to represent all meaningful regions.
17. Variable number of SIFT descriptors per image, want SVM. Why need Bag of Visual Words?
a) Convert variable-length descriptors to fixed-length histograms b) Remove all support vectors c) Replace feature extraction d) Perform tracking
Answer: a — SVM needs fixed-length feature vectors.
18. BoVW vocabulary too small. Likely effect?
a) Different structures grouped together, reduced discriminative power b) Histograms too sparse and perfectly accurate c) Model becomes scale-invariant automatically d) SVM no longer needs training data
Answer: a — small vocabulary forces different descriptors into same words.
19. Panorama stitching: many wrong SIFT matches from repeated building windows. Method before homography estimation?
a) RANSAC b) Dropout c) Non-Maximum Suppression d) Global thresholding
Answer: a — RANSAC rejects geometrically inconsistent outliers.
20. RANSAC for homography. Minimum point correspondences needed?
a) 2 b) 3 c) 4 d) 8
Answer: c — homography has 8 DOF, each correspondence gives 2 constraints → 4 needed.
21. CNN classifying cats/dogs/horses/birds. Main difference from classical hand-designed filters?
a) CNN filters learned from data during training b) Always fixed before training c) Can't detect edges d) Used only after SVM
Answer: a — classical pipelines use hand-designed filters/descriptors; CNNs learn them.
22. CNN layer uses stride 2 instead of 1. Main effect?
a) Kernel moves farther each step, usually reduces output size b) Number of classes doubles c) Loss function removed d) Image color-inverted
Answer: a — larger stride skips more positions → smaller feature maps.
23. CNN performs great on training, poorly on new test images (different lighting). Problem?
a) Overfitting b) Perfect generalization c) Correct dropout use d) Under-segmentation
Answer: a — model learned training-specific patterns instead of general features.
24. Traffic system needs real-time bounding boxes + class labels for cars/buses/trucks/pedestrians. Best model family?
a) YOLO b) K-means c) SIFT only d) Harris only
Answer: a — single-stage detector, boxes+labels+confidence in one forward pass, real-time.
25. Detector produces several overlapping boxes around same pedestrian. Post-processing step to keep strongest?
a) Non-Maximum Suppression b) Histogram equalization c) K-means centroid update d) RANSAC homography
Answer: a — NMS removes duplicate overlapping detections, keeps highest-confidence box.
26. Medical system must mark exact tumor boundary, not just abnormal/normal. Best output?
a) One image-level label b) Histogram of visual words c) Segmentation mask d) Single Harris corner point
Answer: c — boundary marking requires pixel-level localization.
27. Dataset: 950 normal, 50 disease images. High accuracy but misses many disease cases. Metric to emphasize?
a) Recall b) File size c) Number of conv layers d) Image width
Answer: a — recall measures detected actual positives; missing disease cases dangerous → minimize false negatives.
28. Sports tracker swaps player identities when two players cross. Best fix?
a) Use only image classification b) DeepSORT with appearance features c) Increase Harris threshold d) Histogram equalization only
Answer: b — appearance features (beyond motion/position) better distinguish overlapping objects.
29. Quality inspection must separate each defective bottle with its own mask. Best output?
a) Classification label b) Detection bounding box only c) Instance segmentation mask d) Global histogram
Answer: c — instance segmentation separates each object instance with own pixel mask.
30. Team wants quick annotation: user clicks object, model produces mask. Best model?
a) SAM b) YOLO only c) SVM d) K-means only
Answer: a — SAM is promptable segmentation, generates masks from point/box prompts.