Cosmos World Foundation Model Platform for Physical AI
NVIDIA (Prithvijit Chattopadhyay: Core Contributor)
ArXiv 2025
Best of AI & Best of CES Winner, CES 2025
[Project Page]
TL;DR, Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a
digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model
Platform to help developers build customized world models for their Physical AI setups. We position
a world foundation model as a general-purpose world model that can be fine-tuned into customized
world models for downstream applications. Our platform covers a video curation pipeline, pre-trained
world foundation models, examples of post-training of pre-trained world foundation models, and video
tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our
platform open-source and our models open-weight with permissive licenses.
SkyScenes: A Synthetic Dataset for
Aerial Scene Understanding
Sahil Khose*, Anisha Pal*, Aayushi Agarwal*, Deepanshi*,
Judy Hoffman, Prithvijit Chattopadhyay
ECCV 2024
TL;DR, We introduce SkyScenes, a large-scale synthetic dataset of densely annotated aerial images
captured from Unmanned Aerial Vehicle (UAV) perspectives. We carefully curate
SkyScenes images from CARLA to comprehensively capture diversity across layout (urban and rural maps),
weather conditions, times of day, pitch angles and altitudes with corresponding semantic, instance and
depth annotations. Through our experiments using SkyScenes, we show that (1) models trained on SkyScenes
generalize well to different real-world scenarios, (2) augmenting training on real images with SkyScenes
data can improve real-world performance, (3) controlled variations in SkyScenes can offer insights into
how models respond to changes in viewpoint conditions, and (4) additionally incorporating other sensor
modalities (depth) can improve aerial scene understanding.
AUGCAL: Improving Sim2Real Adaptation by Uncertainty Calibration
on Augmented Synthetic Images
Prithvijit Chattopadhyay, Bharat Goyal, Boglarka Ecsedi, Viraj Prabhu, Judy Hoffman
ICLR 2024
Workshop on Uncertainty Quantification for Computer Vision, ICCV 2023 (Extended Abstract)
TL;DR, Mispredictions made by Sim2Real adaptation methods on real data can often be attributed to
“miscalibration” – typically caused by overconfident predictions. We propose a simple patch,
AugCal, to improve uncertainty calibration of existing Sim2Real adaptation methods. Given a base Sim2Real
adaptation algorithm, at training time, AugCal involves replacing vanilla "Sim" images with strongly
augmented views (Aug intervention) and additionally optimizing for a training time calibration loss on
augmented "Sim" predictions (Cal intervention). Through our experiments, we empirically show the efficacy
of AugCal across multiple adaptation methods, backbones, tasks and shifts.
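To make the "Cal intervention" concrete, here is a minimal sketch of one possible batch-level calibration penalty: the absolute gap between mean confidence and accuracy. The TL;DR does not say which calibration loss AugCal uses, so this choice and the names `softmax` and `confidence_accuracy_gap` are assumptions for illustration only. It is plain Python rather than differentiable training code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_accuracy_gap(batch_logits, labels):
    """Absolute gap between mean top-class confidence and accuracy on a batch.

    Assumed, illustrative calibration penalty (not necessarily the loss used
    by AugCal). A value near 0 means confidence matches accuracy on average;
    a large value with high confidence indicates overconfidence.
    """
    confidences, correct = [], []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        confidences.append(probs[pred])
        correct.append(1.0 if pred == label else 0.0)
    n = len(labels)
    return abs(sum(correct) / n - sum(confidences) / n)
```

In a training loop, a term of this kind would be computed on predictions for the augmented "Sim" images and added to the base adaptation objective with some weight, which is also an assumption here.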
We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline
Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Prithvijit Chattopadhyay, Judy
Hoffman, Viraj Prabhu
TMLR 2024
TL;DR, Domain Adaptive Semantic Segmentation (DAS) seeks to adapt a model trained on images
from a labeled source domain to an unlabeled target domain. Unlike the traditional Image-DAS settings, a
few Video-DAS works have sought to additionally leverage the temporal signal present in videos on a
distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking. We
address this gap by conducting experiments that reveal that (1) even after carefully controlling for data
and model architecture, state-of-the-art Image-DAS methods (HRDA and HRDA+MIC) outperform Video-DAS
methods on established Video-DAS benchmarks (+14.5 mIoU on Viper→Cityscapes-Seq, +19.0 mIoU on
Synthia-Seq→Cityscapes-Seq), and (2) naive combinations of Image-DAS and Video-DAS techniques only lead
to marginal improvements across datasets.
Battle of the Backbones: A Large-Scale Comparison of Pretrained
Models across Computer Vision Tasks
Micah Goldblum*, Hossein Souri*, Renkun Ni, Manli Shu, Viraj Uday Prabhu, Gowthami Somepalli,
Prithvijit Chattopadhyay, Adrien Bardes, Mark Ibrahim, Judy Hoffman, Rama Chellappa, Andrew Gordon
Wilson, Tom Goldstein
NeurIPS Datasets and Benchmarks 2023
TL;DR, Most neural network based computer vision systems are built on a backbone,
a pretrained or randomly initialized feature extractor. Several years ago, the default option was an
ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of
backbones pretrained using various algorithms and datasets. We benchmark a diverse suite of pretrained
models across a diverse set of computer vision tasks ranging from classification to object detection to
generalization and more. Our Battle of the Backbones (BoB) sheds light on promising directions for the
community to advance computer vision by illuminating strengths and weaknesses of existing backbones through
comprehensive analysis conducted on 1500 training runs.
LANCE: Stress-testing Visual Models by Generating
Counterfactual Images
Viraj Prabhu,
Sriram Yenamandra,
Prithvijit Chattopadhyay,
Judy Hoffman
NeurIPS 2023
[project page]
TL;DR, We propose an automated algorithm to stress-test a trained visual model by
generating language-guided counterfactual test images (LANCE). Our method leverages
recent progress in large language modeling and text-based image editing to augment an IID test set with a
suite of diverse, realistic, and challenging test images without altering model weights.
Benchmarking Low-Shot Robustness to Natural Distribution Shifts
Aaditya Singh,
Kartik Sarangmath,
Prithvijit Chattopadhyay,
Judy Hoffman
ICCV 2023
TL;DR, Robustness to natural distribution shifts has seen remarkable progress
thanks to recent pre-training strategies combined with better fine-tuning methods. However, such
fine-tuning assumes access to large amounts of labelled data, and the extent to which the observations
hold when the amount of training data is not as high remains unknown. We address this gap by performing
the first in-depth study of robustness to various natural distribution shifts in different low-shot
regimes: spanning architectures, pre-trained initializations, and state-of-the-art robustness interventions.
PASTA: Proportional Amplitude Spectrum Training Augmentation for
Syn-to-Real Domain Generalization
Prithvijit Chattopadhyay*,
Kartik Sarangmath*,
Vivek Vijaykumar,
Judy Hoffman
ICCV 2023
TL;DR, PASTA is a simple and effective domain augmentation strategy to
improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA involves perturbing
the amplitude spectra of the synthetic images in the Fourier domain in a structured manner to generate
augmented views. For the tasks of semantic segmentation (GTAV→Real), object detection (Sim10K→Real), and
object recognition (VisDA-C Syn→Real), across a total of 5 syn-to-real shifts, we find that PASTA either
outperforms or is consistently competitive with more complex state-of-the-art methods while being
complementary to other generalization approaches.
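As a rough illustration of augmenting images through their amplitude spectra, the sketch below applies multiplicative Gaussian noise to the Fourier amplitude of an image while keeping the phase, with a noise scale that grows with spatial frequency. The function name `amplitude_perturb`, the parameters `alpha`, `k`, and `beta`, and the exact noise schedule are assumptions made for this example; the TL;DR only states that the perturbation is structured.

```python
import numpy as np


def amplitude_perturb(image, alpha=3.0, k=2.0, beta=0.25, rng=None):
    """Perturb the Fourier amplitude spectrum of an H x W x C float image in [0, 1].

    Illustrative sketch only: the frequency-dependent noise schedule below is
    an assumption, not a confirmed reproduction of the published method.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]

    # Per-channel 2D FFT, shifted so the zero frequency sits at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)

    # Normalized distance of every frequency bin from the zero frequency.
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    radius = np.sqrt(ys[:, None] ** 2 + xs[None, :] ** 2) / np.sqrt(h**2 + w**2)

    # Noise standard deviation increases with frequency (assumed schedule).
    sigma = (2.0 * alpha * radius) ** k + beta
    noise = rng.normal(1.0, sigma[..., None], size=amplitude.shape)

    # Recombine the perturbed amplitude with the original phase and invert.
    perturbed = amplitude * noise * np.exp(1j * phase)
    out = np.fft.ifft2(np.fft.ifftshift(perturbed, axes=(0, 1)), axes=(0, 1)).real
    return np.clip(out, 0.0, 1.0)
```

Setting `alpha=0.0` and `beta=0.0` makes the noise identically 1, so the function returns the input unchanged up to floating-point error, which is a convenient sanity check.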
RobustNav: Towards Benchmarking Robustness in Embodied Navigation
Prithvijit Chattopadhyay,
Judy Hoffman,
Roozbeh Mottaghi,
Ani Kembhavi
ICCV 2021
Oral presentation
[project page]
TL;DR, As an attempt towards assessing the robustness of embodied navigation agents, we
propose RobustNav, a framework to quantify the performance of embodied navigation agents when
exposed to a wide variety of visual – affecting RGB inputs – and dynamics – affecting transition
dynamics – corruptions. We find that standard end-to-end RL policies significantly underperform (or
fail) in the presence of visual or dynamics corruptions, warranting more research in this direction.
Likelihood Landscapes: A Unifying Principle Behind Many
Adversarial Defenses
Fu Lin,
Rohit Mittapalli,
Prithvijit Chattopadhyay,
Daniel Bolya,
Judy Hoffman
Adversarial Robustness in the Real World (AROW), ECCV 2020
NVIDIA Best Paper Runner Up
TL;DR, Convolutional Neural Networks (CNNs) have been shown to be vulnerable to
adversarial examples, which are known to lie in subspaces close to where normal data lies but are
not naturally occurring and have low probability. In this work, we investigate the potential effect
defense techniques have on the geometry of the likelihood landscape - likelihood of the input images
under the trained model. We first propose a way to visualize the likelihood landscape by leveraging
an energy-based model interpretation of discriminative classifiers. Then we introduce a measure to
quantify the flatness of the likelihood landscape. We observe that a subset of adversarial defense
techniques results in a similar effect of flattening the likelihood landscape. We further explore
directly regularizing towards a flat landscape for adversarial robustness.
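The energy-based reading of a classifier treats the log-sum-exp of its logits as an unnormalized log-likelihood of the input. The sketch below computes that quantity and a simple flatness proxy, the mean absolute change in it under small random input perturbations. The proxy and the names `unnormalized_log_likelihood` and `flatness_proxy` are assumptions for illustration; the flatness measure used in the paper is not specified in this summary.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def unnormalized_log_likelihood(logits_fn, x):
    """log p(x) up to an additive constant, using classifier logits as negative energies."""
    return log_sum_exp(logits_fn(x))


def flatness_proxy(logits_fn, x, radius=0.1, samples=32, seed=0):
    """Mean absolute change in unnormalized log-likelihood near x.

    Assumed, illustrative measure: smaller values mean a flatter landscape
    around x. `logits_fn` maps a list of floats to a list of class logits.
    """
    rng = random.Random(seed)
    base = unnormalized_log_likelihood(logits_fn, x)
    total = 0.0
    for _ in range(samples):
        neighbor = [xi + rng.uniform(-radius, radius) for xi in x]
        total += abs(unnormalized_log_likelihood(logits_fn, neighbor) - base)
    return total / samples
```

For a toy two-class model with logits `[s * x, -s * x]`, a larger slope `s` yields a larger proxy value at the same point, matching the intuition that steeper classifiers have less flat likelihood landscapes.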
Learning to Balance Specificity and Invariance for In and Out of
Domain Generalization
Prithvijit Chattopadhyay,
Yogesh Balaji,
Judy Hoffman
ECCV 2020
Visual Learning with Limited Labels (LwLL), CVPR 2020
TL;DR, We introduce Domain-specific Masks for Generalization, a model for improving both
in-domain and out-of-domain generalization performance. To produce a model which best generalizes
to both seen and unseen domains, we propose learning domain specific masks (encouraged
to learn a balance of domain-invariant and domain-specific features) enabling a model to
benefit from the predictive power of specialized features while retaining the universal
applicability of domain-invariant features. We demonstrate competitive performance compared to naive
baselines and state-of-the-art methods on both PACS and DomainNet.
Improving Generative Visual Dialog by Answering Diverse Questions
Vishvak Murahari,
Prithvijit Chattopadhyay,
Dhruv Batra,
Devi Parikh,
Abhishek Das
EMNLP 2019
Visual Question Answering and Dialog Workshop,
CVPR 2019
TL;DR, While generative visual dialog models trained with self-talk based RL perform
better at the associated downstream task, they suffer from repeated interactions -- resulting in
saturation in improvements as the number of rounds increases. To counter this, we devise a simple
auxiliary objective that incentivizes Q-Bot to ask diverse questions, thus reducing repetitions and
in turn enabling A-Bot to explore a larger state space during RL i.e., be exposed to more visual
concepts to talk about, and varied questions to answer.
IR-VIC: Unsupervised Discovery of Sub-goals for Transfer in RL
Nirbhay Modhe,
Prithvijit Chattopadhyay,
Mohit Sharma,
Abhishek Das,
Devi Parikh,
Dhruv Batra,
Ramakrishna Vedantam
IJCAI 2020
Workshop on Task Agnostic Reinforcement Learning (TARL), ICLR 2019
TL;DR, We propose a novel framework to identify sub-goals useful for exploration in sequential decision
making tasks under partial observability. We utilize the variational intrinsic control framework
(Gregor et al., 2016), which maximizes empowerment – the ability to reliably reach a diverse set of
states – and show how to identify sub-goals as states with
high necessary option information through an information theoretic regularizer. Despite being discovered
without explicit goal supervision, our sub-goals provide better exploration and sample complexity on
challenging grid-world navigation tasks compared to supervised counterparts in prior work.
EvalAI: Towards Better Evaluation Systems for AI Agents
Deshraj Yadav,
Rishabh Jain,
Harsh Agrawal,
Prithvijit Chattopadhyay,
Taranjeet Singh,
Akash Jain,
Shiv Baran Singh,
Stefan Lee,
Dhruv Batra
Workshop on AI Systems, SOSP 2019
TL;DR, We introduce EvalAI, an open source platform for evaluating and comparing machine
learning (ML) and artificial intelligence (AI) algorithms at scale. EvalAI is built to provide a
scalable solution to the research community to fulfill the critical need of evaluating machine
learning models and agents acting in an environment against annotations or with a human-in-the-loop.
This will help researchers, students, and data scientists to create, collaborate, and participate in
AI challenges organized around the globe.
Choose Your Neuron: Incorporating Domain-Knowledge through Neuron-Importance
Ramprasaath R. Selvaraju*,
Prithvijit Chattopadhyay*,
Mohamed Elhoseiny,
Tilak Sharma,
Dhruv Batra,
Devi Parikh,
Stefan Lee
ECCV 2018
Continual Learning Workshop, NeurIPS 2018
Visually Grounded Interaction and Language (ViGIL) Workshop, NeurIPS 2018
TL;DR, We introduce a simple, efficient zero-shot learning approach -- NIWT -- based on
the observation that individual neurons in CNNs have been shown to implicitly learn a dictionary of
semantically meaningful concepts (simple textures and shapes to whole or partial objects). NIWT
learns to map domain knowledge about "unseen" classes onto this dictionary of learned concepts and
optimizes for network parameters that can effectively combine these concepts - essentially learning
classifiers by discovering and composing learned semantic concepts in deep networks.
Do explanation modalities make VQA Models more predictable to a human?
Arjun Chandrasekaran*,
Viraj Prabhu*,
Deshraj Yadav*,
Prithvijit Chattopadhyay*,
Devi Parikh
EMNLP 2018
TL;DR, A rich line of research attempts to make deep neural networks more transparent by
generating human-interpretable 'explanations' of their decision process, especially for interactive
tasks like Visual Question Answering (VQA). In this work, we analyze if existing explanations indeed
make a VQA model -- its responses as well as failures -- more predictable to a human.
Evaluating Visual Conversational Agents via Cooperative Human-AI Games
Prithvijit Chattopadhyay*,
Deshraj Yadav*,
Viraj Prabhu,
Arjun Chandrasekaran,
Abhishek Das,
Stefan Lee,
Dhruv Batra,
Devi Parikh
HCOMP 2017
Oral presentation
TL;DR, We design a cooperative game - GuessWhich - to measure human-AI team performance in
the specific context of the AI being a visual conversational agent. GuessWhich involves live
interaction between the human and the AI and is designed to gauge the extent to which progress in
isolated metrics for AI (& AI-AI teams) transfers to human-AI collaborative scenarios.
It Takes Two to Tango: Towards Theory of AI's Mind
Arjun Chandrasekaran*,
Deshraj Yadav*,
Prithvijit Chattopadhyay*,
Viraj Prabhu*,
Devi Parikh
Chalearn Looking at People Workshop, CVPR 2017
TL;DR, To effectively leverage the progress in Artificial Intelligence (AI) to make our
lives more productive, it is important for humans and AI to work well together in a team. In this
work, we argue that for human-AI teams to be effective, in addition to making AI more accurate and
human-like, humans must also develop a theory of AI's mind (ToAIM) - get to know its strengths,
weaknesses, beliefs, and quirks.
Counting Everyday Objects in Everyday Scenes
Prithvijit Chattopadhyay*,
Ramakrishna Vedantam*,
Ramprasaath R. Selvaraju,
Dhruv Batra,
Devi Parikh
CVPR 2017
Spotlight presentation
TL;DR, We study the numerosity of object classes in natural, everyday images and build
dedicated models for counting designed to tackle the large variance in counts, appearances, and
scales of objects found in natural scenes. We propose a contextual counting approach inspired by the
phenomenon of subitizing - the ability of humans to make quick assessments of counts given a
perceptual signal, for small count values.