Current Research
Cross-City Generalization of SSL Backbones
How well do self-supervised vision backbones generalize across different cities? This study systematically evaluates I-JEPA, DINOv2, and MAE within the NAVSIM framework across four geographically diverse driving environments.
Problem Statement
Autonomous driving systems trained in one city often fail when deployed in another. Different traffic patterns, road layouts, weather conditions, and driving cultures create distribution shifts that expose the brittleness of learned planning policies.
Self-supervised learning (SSL) offers a promising direction: by learning general visual representations without task-specific labels, SSL backbones may capture features that transfer better across domains. However, this hypothesis has not been systematically tested for end-to-end driving planners.
Three-Phase Experimental Design
Single-City Training
Train on each city independently and evaluate across all four cities to establish per-city baselines and measure raw cross-city transfer.
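As a minimal sketch of what this phase implies, the snippet below enumerates the train/evaluate pairs for four cities. The function name `single_city_pairs` and the split into in-domain and cross-city pairs are illustrative assumptions, not part of the study's code.

```python
from itertools import product

# City names are taken from the study; everything else here is hypothetical.
CITIES = ["Boston", "Las Vegas", "Pittsburgh", "Singapore"]


def single_city_pairs(cities):
    """Return every (train_city, eval_city) pair, including in-domain pairs."""
    return list(product(cities, repeat=2))


pairs = single_city_pairs(CITIES)

# Diagonal entries (train == eval) act as per-city baselines;
# off-diagonal entries measure raw cross-city transfer.
in_domain = [(train, test) for train, test in pairs if train == test]
cross_city = [(train, test) for train, test in pairs if train != test]

print(len(pairs), len(in_domain), len(cross_city))  # 16 4 12
```

With four cities this gives a 4x4 transfer matrix per backbone: 4 in-domain baselines and 12 cross-city evaluations.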
Leave-One-City-Out
Train on three cities and evaluate on the held-out fourth. This tests generalization to a driving environment and traffic patterns never seen during training.
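The leave-one-city-out protocol can be sketched as a generator over folds. This is an illustrative outline under the assumption that each city is held out exactly once; `leave_one_city_out` is a hypothetical helper name.

```python
# City names are taken from the study; the helper below is hypothetical.
CITIES = ["Boston", "Las Vegas", "Pittsburgh", "Singapore"]


def leave_one_city_out(cities):
    """Yield (train_cities, held_out_city), holding out each city once."""
    for held_out in cities:
        train_cities = [city for city in cities if city != held_out]
        yield train_cities, held_out


folds = list(leave_one_city_out(CITIES))
for train_cities, held_out in folds:
    print(f"train on {train_cities}, evaluate on {held_out}")
```

Each fold trains on three cities and never sees the fourth, so four cities yield four folds per backbone.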
Data Efficiency Analysis
Vary the fraction of training data used (10%, 25%, 50%, 100%) to measure how each SSL backbone performs when training data is limited.
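One way to draw these subsets is sketched below. It assumes nested subsets (each smaller subset is contained in every larger one) drawn with a fixed seed, so that differences between fractions reflect data volume rather than which scenes were sampled. The source does not state how subsets are drawn; the nesting, the seed, and the 1,000 placeholder scene IDs are all assumptions for illustration.

```python
import random

# Fractions are taken from the study; the sampling scheme is an assumption.
FRACTIONS = [0.10, 0.25, 0.50, 1.00]


def nested_subsets(scene_ids, fractions, seed=0):
    """Shuffle once, then take prefixes so smaller subsets nest in larger ones."""
    rng = random.Random(seed)
    shuffled = list(scene_ids)
    rng.shuffle(shuffled)
    total = len(shuffled)
    return {
        fraction: shuffled[: max(1, round(fraction * total))]
        for fraction in fractions
    }


# Placeholder scene IDs; a real run would use the actual training scenes.
subsets = nested_subsets(range(1000), FRACTIONS)
for fraction, ids in subsets.items():
    print(f"{fraction:.0%}: {len(ids)} scenes")
```

Shuffling once and slicing prefixes keeps the 10% subset inside the 25% subset, and so on up to the full set.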
SSL Backbones Under Study
I-JEPA
ViT-H/14, 630M params
Joint-Embedding Predictive Architecture. Learns by predicting representations of masked image regions from visible context, without pixel reconstruction.
DINOv2
ViT-L/14, 300M params
Self-distillation approach that produces general-purpose visual features. Trained on a curated dataset of 142M images (LVD-142M) with both image-level and patch-level objectives.
MAE
ViT-H/14, 630M params
Masked Autoencoder. Masks a high fraction (75%) of image patches and learns representations by reconstructing the masked patches in pixel space.
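To make the 75% masking ratio concrete, the sketch below counts visible and masked patches for a ViT with 14-pixel patches. The 224x224 input resolution, the uniform random sampling, and the function name `random_patch_mask` are assumptions for illustration; this is not MAE's actual implementation.

```python
import random


def random_patch_mask(image_size=224, patch_size=14, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets by uniform random sampling."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    num_visible = int(num_patches * (1 - mask_ratio))

    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    visible = sorted(order[:num_visible])
    masked = sorted(order[num_visible:])
    return visible, masked


visible, masked = random_patch_mask()
print(len(visible), len(masked))  # 64 192
```

Under these assumptions the image is split into 16 x 16 = 256 patches, of which only 64 are shown to the encoder and 192 must be reconstructed.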
Evaluation Cities
Boston
~2,500 scenes
Dense urban, narrow streets
Las Vegas
~3,200 scenes
Wide boulevards, complex intersections
Pittsburgh
~2,100 scenes
Hilly terrain, bridges, irregular grids
Singapore
~1,800 scenes
Left-hand traffic, tropical conditions