Archive — Previous work (Phase 1: I-JEPA + MLP Planning)

View current research
Master's Thesis | NYU Tandon School of Engineering | 2025

Self-Supervised Learning for
Autonomous Vehicle Planning

Leveraging I-JEPA Vision Representations for Data-Efficient Motion Planning

Abstract

This project demonstrates a novel approach to autonomous vehicle trajectory planning that achieves state-of-the-art performance using 90% less labeled training data. By leveraging self-supervised learning with pre-trained vision models, we show that effective planning can be learned with minimal supervision.

The system processes camera inputs to predict safe, efficient trajectories for autonomous vehicles navigating complex urban environments. The key innovation is the use of pre-trained visual representations that already understand driving scenes, requiring only a lightweight planning head to be trained on a small subset of labeled data.

Key Results

  • Best score: 82.06% PDM (I-JEPA + multi-camera fusion, 3 views)
  • TransFuser comparison: 81.88% PDM (TransFuser + I-JEPA backbone, fully trainable)
  • Improvement: +56 points over the Constant Velocity baseline (25.88% PDM)

System Architecture

End-to-end pipeline from multi-camera input to trajectory prediction

[Architecture diagram] Three camera views (Left L0, Front F0, Right R0; 1920×1208) are resized to 224×224 and normalized, then encoded by a shared I-JEPA ViT-H/14 encoder (630M parameters, self-supervised pretrained on ImageNet, 50% of layers fine-tuned), producing 3 × 1280-dim features. The features are concatenated (3840-dim) and pooled to 1280-dim, combined with an 8-dim ego-status vector (position x, y; velocity; heading θ), and fed to an MLP planning head (1288 → 256 → 256 → T×3) that outputs 8 waypoints (x, y, heading) at 0.5s intervals. Training details: learning rate 1e-3 (MLP) and 3e-5 (encoder); batch size 10-12 per GPU across 16 L40s GPUs with DDP.

Technical Approach

I-JEPA Vision Encoder

  • Model: Image-based Joint-Embedding Predictive Architecture (I-JEPA)
  • Architecture: Vision Transformer (ViT-H/14) with 630M parameters
  • Pretraining: Self-supervised on ImageNet-1K using masked prediction
  • Output: 1280-dimensional visual features per image
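As a concrete illustration of this setup, the sketch below shows how a single camera frame could be turned into a 1280-dim feature vector. The `load_ijepa_vith14` loader is a hypothetical project-local helper (the official I-JEPA release ships ViT-H/14 weights but not this exact function), and mean-pooling the patch tokens is an assumption of this sketch.

```python
import torch
import torchvision.transforms as T

# Hypothetical project-local utility that loads the pretrained I-JEPA ViT-H/14
# target encoder from a checkpoint; not part of any public package.
from models import load_ijepa_vith14

preprocess = T.Compose([
    T.Resize((224, 224)),                      # match the encoder's input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

encoder = load_ijepa_vith14("checkpoints/ijepa_vith14.pth").eval()

@torch.no_grad()
def encode_image(pil_image):
    """Return a single 1280-dim feature vector for one camera frame."""
    x = preprocess(pil_image).unsqueeze(0)     # (1, 3, 224, 224)
    tokens = encoder(x)                        # assumed to return patch tokens (1, N, 1280)
    return tokens.mean(dim=1).squeeze(0)       # mean-pool patch tokens -> (1280,)
```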

Multi-Camera Fusion

  • Views: Left (L0), Front (F0), Right (R0) cameras
  • Strategy: Per-view feature fusion (encode each camera separately at 224×224)
  • Advantages: Avoids panoramic distortion, preserves native resolution
  • FOV Coverage: ~180° combined
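A minimal sketch of the per-view fusion described above, reusing `encode_image` from the encoder sketch. Reading the diagram's 3 × 1280 → 1280 pooling step as a mean over views is an assumption; a learned 3840 → 1280 projection would be an equally plausible interpretation.

```python
import torch

def fuse_views(feat_left: torch.Tensor,
               feat_front: torch.Tensor,
               feat_right: torch.Tensor) -> torch.Tensor:
    """Fuse per-view I-JEPA features into a single 1280-dim scene descriptor.

    Each input is a (1280,) vector from encode_image().  The mean over the
    three views is one plausible reading of the "3 x 1280 -> 1280 (pooled)"
    step and is an assumption of this sketch.
    """
    stacked = torch.stack([feat_left, feat_front, feat_right], dim=0)  # (3, 1280)
    return stacked.mean(dim=0)                                         # (1280,)
```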

Planning Head

  • Input: Concatenated visual features (1280-dim) + ego status (8-dim)
  • Network: 1288 → 256 → 256 → T×3 (waypoints)
  • Output: T=8 waypoints (x, y, heading) at 0.5s intervals
  • Parameters: ~475K trainable parameters
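A minimal PyTorch sketch of this head, following the 1288 → 256 → 256 → T×3 spec; the choice of ReLU activations and the absence of normalization or dropout layers are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class PlanningHead(nn.Module):
    """MLP mapping fused vision features + ego status to T future waypoints."""

    def __init__(self, feat_dim: int = 1280, ego_dim: int = 8, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + ego_dim, 256),   # 1288 -> 256
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * 3),          # (x, y, heading) per waypoint
        )

    def forward(self, scene_feat: torch.Tensor, ego_status: torch.Tensor) -> torch.Tensor:
        x = torch.cat([scene_feat, ego_status], dim=-1)   # (B, 1288)
        out = self.mlp(x)                                 # (B, horizon * 3)
        return out.view(-1, self.horizon, 3)              # (B, 8, 3) waypoints at 0.5s steps
```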

Training Configuration

Optimal Setup (B11)

  • Learning rate: 1e-3 (MLP), 3e-5 (encoder)
  • Trainable encoder layers: 50%
  • Epochs: 30
  • Loss: L1 (trajectory waypoints)
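A hedged sketch of how the B11 setup could be wired together: two optimizer parameter groups with the learning rates above and an L1 loss on the predicted waypoints. The use of AdamW and the helper names (`encode_batch`, `trainable_encoder_params`) are assumptions, not the project's actual code.

```python
import torch
import torch.nn.functional as F

# Two parameter groups matching the B11 learning rates; AdamW is an assumed choice.
optimizer = torch.optim.AdamW([
    {"params": planning_head.parameters(), "lr": 1e-3},   # MLP planning head
    {"params": trainable_encoder_params,   "lr": 3e-5},   # unfrozen 50% of encoder layers (hypothetical list)
])

def training_step(batch):
    images, ego, target_traj = batch              # target_traj: (B, 8, 3) ground-truth waypoints
    scene_feat = encode_batch(images)             # hypothetical batched encode + fuse of the 3 views
    pred_traj = planning_head(scene_feat, ego)    # (B, 8, 3) predicted waypoints
    return F.l1_loss(pred_traj, target_traj)      # L1 loss on (x, y, heading)
```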

Infrastructure

  • 4 nodes × 4 L40s GPUs (48GB VRAM)
  • PyTorch Lightning DDP
  • Mixed precision (FP16)
  • Batch size: 10-12 per GPU
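For reference, a minimal PyTorch Lightning Trainer matching this infrastructure description; the Lightning 2.x API is assumed, and the module and datamodule names in the commented call are illustrative.

```python
import lightning.pytorch as pl

# 4 nodes x 4 GPUs with DDP and FP16 mixed precision.  "16-mixed" is the
# Lightning 2.x spelling; in 1.x the equivalent setting is precision=16.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                 # GPUs per node
    num_nodes=4,
    strategy="ddp",
    precision="16-mixed",
    max_epochs=30,
    enable_checkpointing=True,
)

# trainer.fit(planning_module, datamodule=navsim_datamodule)  # illustrative names
```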

Key Findings

Multi-Camera Fusion Matters

Per-view fusion (L0+F0+R0) significantly outperforms single front camera:

  • Front camera only (B2): 79.78% PDM
  • 3-view fusion (B11): 82.06% PDM (+2.3 points)

Optimal Fine-Tuning Strategy

Sweet spot at 50% trainable encoder layers balances transfer learning with adaptation:

  • 0% trainable (frozen encoder): 72.46% PDM
  • 25% trainable: 79.98% PDM
  • 50% trainable: 82.06% PDM (best)
  • 100% trainable: 79.59% PDM
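One way the 50% setting could be implemented is to freeze the earliest half of the transformer blocks and leave the later half trainable. The sketch below assumes a timm-style ViT with a `blocks` attribute; the actual I-JEPA encoder implementation may name things differently.

```python
def set_trainable_fraction(encoder, fraction: float = 0.5):
    """Freeze the earliest (1 - fraction) of transformer blocks.

    Assumes the encoder exposes its transformer layers as `encoder.blocks`
    (a ModuleList), as in timm-style ViTs; adapt the attribute name as needed.
    """
    blocks = list(encoder.blocks)
    n_trainable = int(len(blocks) * fraction)   # e.g. 16 of ViT-H's 32 blocks
    cutoff = len(blocks) - n_trainable

    for p in encoder.parameters():
        p.requires_grad = False                 # start fully frozen
    for block in blocks[cutoff:]:               # unfreeze the last `fraction` of blocks
        for p in block.parameters():
            p.requires_grad = True
    return [p for p in encoder.parameters() if p.requires_grad]

# trainable_encoder_params = set_trainable_fraction(encoder, fraction=0.5)
```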

I-JEPA vs Other Vision Backbones

Comparison with TransFuser + different SSL backbones (100% trainable, 30 epochs):

  • TransFuser + I-JEPA (A14): 81.88% PDM
  • TransFuser baseline (A4): 81.68% PDM
  • TransFuser + DINOv2 (A16): 78.92% PDM
  • TransFuser + DINO (A15): 78.89% PDM

Simpler is Better

The lightweight I-JEPA + MLP approach (82.06% PDM) matches or exceeds the complex TransFuser architecture (81.88% PDM) while using significantly fewer parameters and requiring no LiDAR data. This demonstrates the effectiveness of strong visual representations for planning tasks.

Engineering Highlights

Scalable Training

Multi-node distributed training (DDP) across 4 nodes × 4 GPUs, with automatic checkpointing and recovery for production-grade reliability.

Performance Optimization

Mixed-precision training (FP16) for roughly a 2× speedup, optimized data pipelines, and efficient memory management for large-scale experiments.

Web-Based Demo

Interactive browser showcase with real-time visualization, supporting both cached replays and live inference for easy demonstration and evaluation.

MLOps Integration

Experiment tracking, model versioning, and deployment pipelines for reproducible research and production readiness.

Interactive Demo

Experience the planning system in action with real-time trajectory visualization and performance metrics

Launch Demo

This project demonstrates research capabilities in computer vision, deep learning, and autonomous systems.

Computer Vision · Self-Supervised Learning · Autonomous Vehicles · Deep Learning