Prithvijit Chattopadhyay

Research Scientist, NVIDIA Cosmos Lab

Ph.D., Georgia Tech (2019–2024)

advised by Judy Hoffman

Rising Star Doctoral Student Award

M.S., Georgia Tech (2017–2019)

advised by Devi Parikh

M.S. Research Award

email · scholar · semantic scholar · github · cv · twitter · linkedin

About

I am a Research Scientist at NVIDIA Cosmos Lab, where I work on world foundation models: video-generation models for spatio-temporal forecasting and VLMs for understanding. My work spans model design, pre-training, data curation, and evaluation of large multi-modal foundation models.

I'm driven by challenging open-ended problems that stretch what I know.

My past work has spanned core computer vision (generalization, robustness, learning from limited supervision or synthetic data), the intersection of computer vision and language, and embodied AI.


I also actively review for top computer vision and machine learning conferences and workshops, and have accumulated a few reviewer awards along the way (ICCV 2025, CVPR 2023, CVPR 2022, CVPR 2021, ICLR 2022, MLRC 2021, ICML 2020, NeurIPS 2019, ICLR 2019, NeurIPS 2018).

Affiliations

DTU, 2012–2016
IIIT, Winter 2014
VT, 2016–2017
Georgia Tech, 2017–2024
Microsoft Research, Summer 2018
AI2, Summers 2020 and 2022
NVIDIA, 2024–Present


Research

(* indicates equal contribution)

Cosmos-Reason2
Huggingface · 2026
NVIDIA (Prithvijit Chattopadhyay: Core Contributor)
TL;DR, Cosmos-Reason2 is an open, state-of-the-art reasoning vision-language model for Physical AI. Building on Cosmos-Reason1, it enables robots and AI agents to perceive, reason about, and act in the physical world through long chain-of-thought reasoning grounded in physics and common sense. The model adds OCR support, 2D/3D point localization, set-of-mark understanding, and trajectory-coordinate generation for robot vision-language-action (VLA) pipelines, topping the Physical AI Bench and Physical Reasoning leaderboards as the #1 open model for visual understanding.
DDRL
ArXiv · 2025
Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, Stefano Ermon
TL;DR, DDRL (Data-regularized Diffusion Reinforcement Learning) is an RL framework for aligning diffusion models with human preferences that uses forward KL divergence to anchor the policy to an off-policy data distribution. By combining reward maximization with diffusion-loss regularization, DDRL avoids the reward hacking — quality degradation, over-stylization, reduced diversity — common in prior diffusion-RL methods. Validated with over a million H100 GPU hours and ten thousand double-blind human evaluations on high-resolution video generation, it powers post-training for the Cosmos-Predict2.5 video foundation models.
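To make the anchoring idea concrete, here is a minimal Python sketch (not the authors' implementation) of a data-regularized objective for diffusion RL: a reward-maximization term on model samples combined with the ordinary denoising loss on real data as a forward-KL-style anchor. `sample_from_model`, `reward_fn`, and `denoising_loss` are hypothetical placeholders.

```python
import torch

def ddrl_style_step(model, real_batch, reward_fn, sample_from_model, denoising_loss,
                    reg_weight=1.0):
    # Reward-maximization term on model samples (policy-gradient-style surrogate).
    samples, log_probs = sample_from_model(model)   # generated samples + their log-probs
    rewards = reward_fn(samples).detach()
    rl_loss = -(rewards * log_probs).mean()
    # Data regularizer: standard diffusion (denoising) loss on off-policy real data,
    # anchoring the policy to the data distribution and discouraging reward hacking.
    reg_loss = denoising_loss(model, real_batch)
    return rl_loss + reg_weight * reg_loss
```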
Cosmos-Predict2.5
ArXiv · 2025
NVIDIA (Prithvijit Chattopadhyay: Core Contributor)
TL;DR, Cosmos-Predict2.5 is a flow-based world foundation model that unifies Text2World, Image2World, and Video2World generation in a single model for Physical AI. Released in 2B and 14B variants, it is trained on a curated corpus of ~200M video clips with reinforcement-learning post-training for improved fidelity. The platform supports general world simulation, multi-view autonomous-driving rollouts, and action-conditioned robot rollouts, outperforming prior Cosmos-Predict releases across benchmark evaluations.
Cosmos-Embed1
Huggingface · 2025
NVIDIA (Prithvijit Chattopadhyay: Core Contributor)
TL;DR, Cosmos-Embed1 is a joint video-text embedder tailored for Physical AI. Joint video-text embeddings are critical for Physical AI development pipelines: they enable essential data curation tasks such as text-to-video search, inverse video search, semantic deduplication, and targeted filtering, and can also serve as conditioning representations for downstream Physical AI models. While existing video-text embedders perform well in general domains, they underperform substantially on Physical AI tasks; Cosmos-Embed1 is designed to bridge this gap.
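As a rough illustration of how a joint video-text embedder supports such curation tasks, here is a hedged Python sketch of text-to-video search and greedy semantic deduplication over precomputed embeddings; the embedding calls themselves are assumed to come from the model and are not shown.

```python
import torch
import torch.nn.functional as F

def text_to_video_search(text_emb: torch.Tensor, video_embs: torch.Tensor, k: int = 5):
    """text_emb: (D,), video_embs: (N, D); returns indices of the top-k matching clips."""
    sims = F.cosine_similarity(text_emb.unsqueeze(0), video_embs, dim=-1)
    return sims.topk(k).indices

def semantic_dedup(video_embs: torch.Tensor, threshold: float = 0.95):
    """Greedy deduplication: keep a clip only if no already-kept clip is too similar."""
    embs = F.normalize(video_embs, dim=-1)
    kept = []
    for i in range(embs.shape[0]):
        if all(embs[i] @ embs[j] < threshold for j in kept):
            kept.append(i)
    return kept
```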
Cosmos-Reason1
ArXiv · 2025
NVIDIA (Prithvijit Chattopadhyay: Core Contributor)
TL;DR, Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as the post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements.
Cosmos-Predict1
ArXiv · 2025
NVIDIA (Prithvijit Chattopadhyay: Core Contributor)
Best of AI & Best of CES Winner, CES 2025
TL;DR, Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses.
SkyScenes
ECCV · 2024
Sahil Khose*, Anisha Pal*, Aayushi Agarwal*, Deepanshi*, Judy Hoffman, Prithvijit Chattopadhyay
TL;DR, We introduce SkyScenes, a large-scale synthetic dataset of densely annotated aerial images captured from Unmanned Aerial Vehicle (UAV) perspectives. We carefully curate SkyScenes images from CARLA to comprehensively capture diversity across layout (urban and rural maps), weather conditions, times of day, pitch angles and altitudes with corresponding semantic, instance and depth annotations. Through our experiments using SkyScenes, we show that (1) Models trained on SkyScenes generalize well to different real-world scenarios, (2) augmenting training on real images with SkyScenes data can improve real-world performance, (3) controlled variations in SkyScenes can offer insights into how models respond to changes in viewpoint conditions, and (4) additionally incorporating other sensor modalities (depth) can improve aerial scene understanding.
AUGCAL
ICLR · 2024
Prithvijit Chattopadhyay, Bharat Goyal, Boglarka Ecsedi, Viraj Prabhu, Judy Hoffman
Workshop on Uncertainty Quantification for Computer Vision, ICCV 2023 (Extended Abstract)
TL;DR, Mispredictions made by Sim2Real adaptation methods on real data can often be attributed to "miscalibration" — often caused by overconfident predictions. We propose a simple patch, AugCal, to improve uncertainty calibration of existing Sim2Real adaptation methods. Given a base Sim2Real adaptation algorithm, at training time, AugCal involves replacing vanilla "Sim" images with strongly augmented views (Aug intervention) and additionally optimizing for a training time calibration loss on augmented "Sim" predictions (Cal intervention). Through our experiments, we empirically show the efficacy of AugCal across multiple adaptation methods, backbones, tasks and shifts.
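A minimal sketch of the two interventions in Python, assuming a classification-style setup: the batch consists of strongly augmented "Sim" views (Aug), and a training-time calibration penalty is added on the augmented predictions (Cal). The specific calibration term below (gap between mean confidence and accuracy) is an illustrative choice, not necessarily the loss used in the paper.

```python
import torch
import torch.nn.functional as F

def augcal_loss(model, sim_images_aug, sim_labels, cal_weight=0.1):
    """sim_images_aug: strongly augmented sim batch; sim_labels: ground-truth labels."""
    logits = model(sim_images_aug)
    task_loss = F.cross_entropy(logits, sim_labels)      # standard supervised term on Aug views
    probs = logits.softmax(dim=-1)
    conf, preds = probs.max(dim=-1)
    acc = (preds == sim_labels).float()
    cal_loss = (conf.mean() - acc.mean()).abs()          # confidence vs. accuracy gap (Cal term)
    return task_loss + cal_weight * cal_loss
```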
Video DA
TMLR · 2024
Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Prithvijit Chattopadhyay, Judy Hoffman, Viraj Prabhu
TL;DR, Domain Adaptive Semantic Segmentation (DAS) seeks to adapt a model trained on images from a labeled source domain to an unlabeled target domain. Beyond the traditional Image-DAS setting, a few Video-DAS works have sought to additionally leverage the temporal signal present in videos, but on a distinct set of benchmarks with minimal cross-benchmarking against Image-DAS. We address this gap by conducting experiments that reveal that (1) even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods (HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS benchmarks (+14.5 mIoU on Viper→Cityscapes-Seq, +19.0 mIoU on Synthia-Seq→Cityscapes-Seq), and (2) naive combinations of Image-DAS and Video-DAS techniques only lead to marginal improvements across datasets.
Battle of the Backbones
NeurIPS Datasets and Benchmarks · 2023
Micah Goldblum*, Hossein Souri*, Renkun Ni, Manli Shu, Viraj Uday Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Adrien Bardes, Mark Ibrahim, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein
TL;DR, Most neural-network-based computer vision systems are built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. We benchmark a diverse suite of pretrained models across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Our Battle of Backbones (BoB) sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weaknesses of existing backbones through a comprehensive analysis conducted on 1500 training runs.
LANCE
NeurIPS · 2023
Viraj Prabhu, Sriram Yenamandra, Prithvijit Chattopadhyay, Judy Hoffman
TL;DR, We propose an automated algorithm to stress-test a trained visual model by generating language-guided counterfactual test images (LANCE). Our method leverages recent progress in large language modeling and text-based image editing to augment an IID test set with a suite of diverse, realistic, and challenging test images without altering model weights.
Low-Shot Robustness
ICCV · 2023
Aaditya Singh, Kartik Sarangmath, Prithvijit Chattopadhyay, Judy Hoffman
TL;DR, Robustness to natural distribution shifts has seen remarkable progress thanks to recent pre-training strategies combined with better fine-tuning methods. However, such fine-tuning assumes access to large amounts of labelled data, and it remains unknown to what extent these observations hold when training data is limited. We address this gap by performing the first in-depth study of robustness to various natural distribution shifts in different low-shot regimes: spanning datasets, architectures, pre-trained initializations, and state-of-the-art robustness interventions.
PASTA
ICCV · 2023
Prithvijit Chattopadhyay*, Kartik Sarangmath*, Vivek Vijaykumar, Judy Hoffman
TL;DR, PASTA is a simple and effective frequency domain augmentation strategy to improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA involves perturbing the amplitude spectra of the synthetic images in the Fourier domain in a structured manner to generate augmented views. For the tasks of semantic segmentation (GTAV→Real), object detection (Sim10K→Real), and object recognition (VisDA-C Syn→Real), across a total of 5 syn-to-real shifts, we find that PASTA either outperforms or is consistently competitive with more complex state-of-the-art methods while being complementary to other generalization approaches.
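The sketch below illustrates a PASTA-style augmentation in Python with NumPy: compute the 2D FFT of a synthetic image, apply multiplicative jitter to the amplitude spectrum while keeping the phase, and invert. The particular frequency-dependent scaling (jitter growing with spatial frequency) and the parameter names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pasta_like_augment(img, alpha=3.0, beta=0.25, k=1.0, rng=None):
    """img: float array of shape (H, W, C) with values in [0, 1]."""
    rng = rng or np.random.default_rng()
    out = np.empty_like(img)
    H, W = img.shape[:2]
    # Radial spatial-frequency map, normalized to [0, 1].
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    radius = radius / radius.max()
    for c in range(img.shape[2]):
        spec = np.fft.fft2(img[..., c])
        amp, phase = np.abs(spec), np.angle(spec)
        # Multiplicative Gaussian jitter on the amplitude; higher frequencies get
        # larger perturbations (an assumed form of the "structured" perturbation).
        sigma = beta + alpha * radius ** k
        jitter = 1.0 + sigma * rng.standard_normal(amp.shape)
        out[..., c] = np.real(np.fft.ifft2(amp * jitter * np.exp(1j * phase)))
    return np.clip(out, 0.0, 1.0)

# Usage: augmented = pasta_like_augment(image)  # image: (H, W, 3) float in [0, 1]
```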
RobustNav
ICCV · 2021
Prithvijit Chattopadhyay, Judy Hoffman, Roozbeh Mottaghi, Ani Kembhavi
Oral Presentation
TL;DR, As an attempt towards assessing the robustness of embodied navigation agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of visual corruptions (affecting RGB inputs) and dynamics corruptions (affecting transition dynamics). We find that standard end-to-end RL policies significantly underperform (or fail) in the presence of visual or dynamics corruptions, warranting more research in this direction.
Likelihood Landscapes
Adversarial Robustness in the Real World (AROW), ECCV · 2020
Fu Lin, Rohit Mittapalli, Prithvijit Chattopadhyay, Daniel Bolya, Judy Hoffman
NVIDIA Best Paper Runner Up
TL;DR, Convolutional Neural Networks (CNNs) have been shown to be vulnerable to adversarial examples, which are known to lie in subspaces close to where normal data lies but are not naturally occurring and have low probability. In this work, we investigate the potential effect defense techniques have on the geometry of the likelihood landscape — the likelihood of input images under the trained model. We first propose a way to visualize the likelihood landscape by leveraging an energy-based model interpretation of discriminative classifiers. Then we introduce a measure to quantify the flatness of the likelihood landscape. We observe that a subset of adversarial defense techniques results in a similar effect of flattening the likelihood landscape. We further explore directly regularizing towards a flat landscape for adversarial robustness.
DMG
ECCV · 2020
Prithvijit Chattopadhyay, Yogesh Balaji, Judy Hoffman
Visual Learning with Limited Labels (LwLL), CVPR 2020
TL;DR, We introduce Domain-specific Masks for Generalization, a model for improving both in-domain and out-of-domain generalization performance. To produce a model which best generalizes to both seen and unseen domains, we propose learning domain specific masks (encouraged to learn a balance of domain-invariant and domain-specific features) enabling a model to benefit from the predictive power of specialized features while retaining the universal applicability of domain-invariant features. We demonstrate competitive performance compared to naive baselines and state-of-the-art methods on both PACS and DomainNet.
Diverse Visual Dialog
EMNLP · 2019
Vishvak Murahari, Prithvijit Chattopadhyay, Dhruv Batra, Devi Parikh, Abhishek Das
Visual Question Answering and Dialog Workshop, CVPR 2019
TL;DR, While generative visual dialog models trained with self-talk based RL perform better at the associated downstream task, they suffer from repeated interactions — resulting in saturation in improvements as the number of rounds increases. To counter this, we devise a simple auxiliary objective that incentivizes Q-Bot to ask diverse questions, thus reducing repetitions and in turn enabling A-Bot to explore a larger state space during RL, i.e., be exposed to more visual concepts to talk about and varied questions to answer.
IR-VIC
IJCAI · 2020
Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra, Ramakrishna Vedantam
Workshop on Task Agnostic Reinforcement Learning (TARL), ICLR 2019
TL;DR, We propose a novel framework to identify subgoals useful for exploration in sequential decision making tasks under partial observability. We utilize the variational intrinsic control framework (Gregor et al., 2016) which maximizes empowerment — the ability to reliably reach a diverse set of states — and show how to identify sub-goals as states with high necessary option information through an information theoretic regularizer. Despite being discovered without explicit goal supervision, our subgoals provide better exploration and sample complexity on challenging grid-world navigation tasks compared to supervised counterparts in prior work.
EvalAI
Workshop on AI Systems, SOSP · 2019
Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, Dhruv Batra
TL;DR, We introduce EvalAI, an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale. EvalAI is built to provide a scalable solution to the research community to fulfill the critical need of evaluating machine learning models and agents acting in an environment against annotations or with a human-in-the-loop. This will help researchers, students, and data scientists to create, collaborate, and participate in AI challenges organized around the globe.
NIWT
ECCV · 2018
Ramprasaath R. Selvaraju*, Prithvijit Chattopadhyay*, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee
Continual Learning Workshop, NeurIPS 2018 · Visually Grounded Interaction and Language (ViGIL) Workshop, NeurIPS 2018
TL;DR, We introduce a simple, efficient zero-shot learning approach — NIWT — based on the observation that individual neurons in CNNs have been shown to implicitly learn a dictionary of semantically meaningful concepts (simple textures and shapes to whole or partial objects). NIWT learns to map domain knowledge about "unseen" classes onto this dictionary of learned concepts and optimizes for network parameters that can effectively combine these concepts — essentially learning classifiers by discovering and composing learned semantic concepts in deep networks.
Visual Explanations
EMNLP · 2018
Arjun Chandrasekaran*, Viraj Prabhu*, Deshraj Yadav*, Prithvijit Chattopadhyay*, Devi Parikh
TL;DR, A rich line of research attempts to make deep neural networks more transparent by generating human-interpretable 'explanations' of their decision process, especially for interactive tasks like Visual Question Answering (VQA). In this work, we analyze if existing explanations indeed make a VQA model — its responses as well as failures — more predictable to a human.
GuessWhich
HCOMP · 2017
Prithvijit Chattopadhyay*, Deshraj Yadav*, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, Devi Parikh
Oral Presentation
TL;DR, We design a cooperative game — GuessWhich — to measure human-AI team performance in the specific context of the AI being a visual conversational agent. GuessWhich involves live interaction between the human and the AI and is designed to gauge the extent to which progress in isolated metrics for AI (& AI-AI teams) transfers to human-AI collaborative scenarios.
ToAIM
Chalearn Looking at People Workshop, CVPR · 2017
Arjun Chandrasekaran*, Deshraj Yadav*, Prithvijit Chattopadhyay*, Viraj Prabhu*, Devi Parikh
TL;DR, To effectively leverage the progress in Artificial Intelligence (AI) to make our lives more productive, it is important for humans and AI to work well together in a team. In this work, we argue that for human-AI teams to be effective, in addition to making AI more accurate and human-like, humans must also develop a theory of AI's mind (ToAIM) — get to know its strengths, weaknesses, beliefs, and quirks.
Counting
CVPR · 2017
Prithvijit Chattopadhyay*, Ramakrishna Vedantam*, Ramprasaath R. Selvaraju, Dhruv Batra, Devi Parikh
Spotlight Presentation
TL;DR, We study the numerosity of object classes in natural, everyday images and build dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. We propose a contextual counting approach inspired by the phenomenon of subitizing, the ability of humans to make quick assessments of small counts given a perceptual signal.


Projects

Investigating Visual Dialog Models for Goal-Driven Self-Talk
2019
Prithvijit Chattopadhyay (advised by Devi Parikh)
Exploring Weak-Supervision and Generative Models for Semantic Segmentation
2018
Prithvijit Chattopadhyay, Ramprasaath R. Selvaraju, Viraj Prabhu
DTU AUV: Autonomous Underwater Vehicle
2012-2016
Prithvijit Chattopadhyay (Acoustics & Control Systems Department), co-authored with DTU AUV members

Theses

Harnessing Synthetic Data for Robust and Reliable Vision
Ph.D. in Computer Science, Georgia Tech, 2024
Evaluating Visual Conversational Agents in the Context of Human-AI Cooperative Games
M.S. in Computer Science (Machine Learning specialization), Georgia Tech, 2017–2019