Current Research

Cross-City Generalization of SSL Backbones

How well do self-supervised learning (SSL) vision backbones generalize across cities? This study systematically evaluates I-JEPA, DINOv2, and MAE within the NAVSIM framework across four geographically diverse driving environments: Boston, Las Vegas, Pittsburgh, and Singapore.

Problem Statement

Autonomous driving systems trained in one city often fail when deployed in another. Different traffic patterns, road layouts, weather conditions, and driving cultures create distribution shifts that expose the brittleness of learned planning policies.

Self-supervised learning offers a promising direction: by learning general visual representations without task-specific labels, SSL backbones may capture features that transfer better across domains. However, this hypothesis has not yet been systematically tested for end-to-end driving planning.

Three-Phase Experimental Design

Phase 1

Single-City Training

Train a separate model on each city and evaluate it on all four cities. Results where the training and evaluation city match establish per-city baselines; the remaining results measure raw cross-city transfer.
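The Phase 1 protocol amounts to a full train-city by evaluation-city grid. The following is a minimal sketch of how that grid could be enumerated; the function name `phase1_pairs` and the "in-city" / "cross-city" labels are illustrative assumptions, not part of NAVSIM.

```python
from itertools import product

CITIES = ["Boston", "Las Vegas", "Pittsburgh", "Singapore"]

def phase1_pairs(cities):
    """Return every (train_city, eval_city) pair, tagged as in-city or cross-city."""
    return [
        {"train": t, "eval": e, "kind": "in-city" if t == e else "cross-city"}
        for t, e in product(cities, repeat=2)
    ]

pairs = phase1_pairs(CITIES)
cross = [p for p in pairs if p["kind"] == "cross-city"]
print(len(pairs), len(cross))  # 16 12
```

With four cities this gives 16 evaluations per backbone: 4 in-city baselines and 12 cross-city transfer measurements.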

Phase 2

Leave-One-City-Out

Train on three cities and evaluate on the held-out fourth. This tests generalization to a driving environment and traffic patterns that were never seen during training.
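The leave-one-city-out folds can be sketched as follows. This is an illustrative enumeration only; the helper name `leave_one_city_out` is hypothetical and the actual data loading is out of scope.

```python
CITIES = ["Boston", "Las Vegas", "Pittsburgh", "Singapore"]

def leave_one_city_out(cities):
    """Yield (train_cities, held_out_city), holding out each city in turn."""
    for held_out in cities:
        train = [c for c in cities if c != held_out]
        yield train, held_out

folds = list(leave_one_city_out(CITIES))
for train, held_out in folds:
    print(f"train on {train} -> evaluate on {held_out}")
```

Each of the four folds trains on three cities, and the held-out city never appears in its own training set.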

Phase 3

Data Efficiency Analysis

Vary the fraction of training data used (10%, 25%, 50%, 100%) to measure how each SSL backbone performs under data-constrained conditions.
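One way to build the reduced training sets is sketched below. It assumes nested subsets (the 10% set is contained in the 25% set, and so on) drawn with a fixed seed, so that differences between fractions reflect data volume rather than which scenes were sampled. The source does not state the sampling scheme; the function name and the scene IDs are hypothetical.

```python
import random

FRACTIONS = [0.10, 0.25, 0.50, 1.00]

def nested_subsets(scene_ids, fractions, seed=0):
    """Shuffle once, then take prefixes so smaller subsets sit inside larger ones."""
    shuffled = list(scene_ids)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: round(f * len(shuffled))] for f in sorted(fractions)}

# Hypothetical scene identifiers, for illustration only.
scene_ids = [f"scene_{i:04d}" for i in range(2000)]
subsets = nested_subsets(scene_ids, FRACTIONS)
sizes = {f: len(s) for f, s in subsets.items()}
print(sizes)  # {0.1: 200, 0.25: 500, 0.5: 1000, 1.0: 2000}
```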

SSL Backbones Under Study

I-JEPA

ViT-H/14, 630M params

Joint-Embedding Predictive Architecture. Learns by predicting representations of masked image regions from visible context, without pixel reconstruction.

DINOv2

ViT-L/14, 300M params

Self-distillation approach producing versatile visual features. Trained on a curated dataset of 142M images (LVD-142M) with both image-level and patch-level objectives.

MAE

ViT-H/14, 630M params

Masked Autoencoder. Reconstructs masked patches in pixel space, learning representations from a reconstruction task with a high masking ratio (75%).
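To make the 75% masking ratio concrete, here is a rough sketch of per-image random patch masking as used during MAE pretraining. It assumes a 224x224 input, which with 14x14 patches gives a 16x16 grid; the input resolution and the helper `random_mask` are assumptions for illustration, and this step concerns pretraining only, not how the backbone is used for planning.

```python
import random

def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return (visible_idx, masked_idx) for one image under random masking."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

# Assumption: 224x224 input with 14x14 patches -> 16x16 patch grid.
grid = 224 // 14
visible, masked = random_mask(grid * grid)
print(len(visible), len(masked))  # 64 192
```

Under these assumptions the encoder sees only 64 of the 256 patches, and the decoder must reconstruct the pixels of the other 192.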

Evaluation Cities

Boston

~2,500 scenes

Dense urban, narrow streets

Las Vegas

~3,200 scenes

Wide boulevards, complex intersections

Pittsburgh

~2,100 scenes

Hilly terrain, bridges, irregular grids

Singapore

~1,800 scenes

Left-hand traffic, tropical conditions
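The approximate scene counts above imply uneven training pools across the experimental phases. The short calculation below works out the total and the size of each leave-one-city-out training pool; the numbers are the rounded counts listed above, so the results are approximate as well.

```python
# Approximate scene counts per city, as listed above.
SCENES = {"Boston": 2500, "Las Vegas": 3200, "Pittsburgh": 2100, "Singapore": 1800}

total = sum(SCENES.values())

# Training pool when each city is held out in turn.
loco_train_sizes = {city: total - n for city, n in SCENES.items()}

print(total)             # 9600
print(loco_train_sizes)  # {'Boston': 7100, 'Las Vegas': 6400, 'Pittsburgh': 7500, 'Singapore': 7800}
```

Holding out Las Vegas leaves the smallest training pool (about 6,400 scenes) and holding out Singapore the largest (about 7,800), which is worth keeping in mind when comparing held-out results across cities.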