NutritionVerse-Synth: Synthetic Generation of Food Scenes for Dietary Intake Estimation

¹University of Waterloo, ²Waterloo Artificial Intelligence Institute

Abstract

Manually tracking nutritional intake via food diaries is error-prone and burdensome. Automated computer vision techniques show promise for dietary monitoring but require large and diverse food image datasets. To address this need, we introduce NutritionVerse-Synth (NV-Synth), a large-scale synthetic food image dataset. NV-Synth contains 84,984 photorealistic meal images rendered from 7,082 dynamically plated 3D scenes. Each scene is captured from 12 viewpoints and comes with perfect ground-truth annotations: RGB images, depth maps, semantic, instance, and amodal segmentation masks, bounding boxes, and detailed nutritional information per food item. We demonstrate the diversity of NV-Synth across foods, compositions, viewpoints, and lighting. As the largest open-source synthetic food dataset, NV-Synth highlights the value of physics-based simulation for scalable and controllable generation of diverse photorealistic meal images. Such generation helps overcome data limitations and drive advances in automated dietary assessment using computer vision. The source code for our data generation framework is also publicly available.

NV-Synth Pipeline Overview

NV-Synth Data Generation Framework

Our data generation framework uses the Isaac Sim physics engine from NVIDIA Omniverse for GPU-accelerated simulation, together with high-resolution 3D food models from the NutritionVerse-3D dataset. To support both scalable, domain-randomized data generation and flexible customization of scene compositions, we provide two methods for generating synthetic meal images: dynamic plating and procedural plating.
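The two goals named above, randomized generation and customized composition, can be illustrated with a minimal sketch in plain Python. It does not use Isaac Sim or the released framework code. The `Placement` record, the function names, the plate radius, and the food identifiers are all hypothetical, and treating one function as analogous to dynamic plating and the other to procedural plating is our assumption, not something the text states.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: one food item and its pose on the plate."""
    item: str       # identifier of a 3D food model (illustrative names only)
    x: float        # metres from the plate centre
    y: float
    yaw_deg: float  # rotation about the vertical axis


def random_scene(catalog, n_items, plate_radius, rng):
    """Randomized composition: sample items and poses inside the plate."""
    placements = []
    for _ in range(n_items):
        item = rng.choice(catalog)
        # sqrt of a uniform sample gives uniform density over the disc
        r = plate_radius * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        placements.append(
            Placement(item, r * math.cos(theta), r * math.sin(theta),
                      rng.uniform(0.0, 360.0))
        )
    return placements


def custom_scene(spec):
    """Customized composition: the caller fixes every item and pose."""
    return [Placement(item, x, y, yaw) for item, x, y, yaw in spec]


# Usage: a seeded generator makes the randomized scene reproducible.
catalog = ["apple", "chicken_wing", "sushi_roll"]  # illustrative names
scene_a = random_scene(catalog, 3, 0.12, random.Random(0))
scene_b = custom_scene([("apple", 0.0, 0.0, 0.0), ("sushi_roll", 0.05, 0.0, 90.0)])
```

In a physics-based pipeline the sampled poses would only be starting points, since the simulator settles the items before rendering; this sketch stops at the sampling step.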

[Figure panels: Dynamic Plating; Procedural Plating]


NV-Synth Dataset

The NV-Synth dataset features randomly composed food scenes that are dynamically plated and captured from multiple angles. Each scene has automatically generated, perfect ground-truth annotations to enable downstream training of nutrition prediction models. Each scene also carries rich nutritional annotations, including mass, calorie, carbohydrate, fat, and protein content, along with food item labels indicating the ingredients present in each dish.
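As a rough illustration of how the per-item nutritional annotations might be consumed, the sketch below defines a hypothetical record holding the fields listed above and sums them into scene-level totals. The class name, field names, and units (grams, kilocalories) are assumptions and do not reflect the dataset's actual file format. The numbers in the usage example are placeholders, not values from NV-Synth.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ItemAnnotation:
    """Hypothetical per-item record; field names and units are assumed."""
    label: str              # food item label
    mass_g: float
    calories_kcal: float
    carbohydrates_g: float
    fats_g: float
    protein_g: float


def scene_totals(items):
    """Sum every numeric nutrition field over the items in one scene."""
    numeric = [f.name for f in fields(ItemAnnotation) if f.name != "label"]
    return {name: sum(getattr(item, name) for item in items) for name in numeric}


# Usage with placeholder values (not taken from the dataset).
items = [
    ItemAnnotation("apple", 100.0, 52.0, 14.0, 0.0, 0.5),
    ItemAnnotation("rice", 50.0, 65.0, 14.0, 0.0, 1.5),
]
totals = scene_totals(items)
```

Scene-level totals of this kind are a natural regression target for an intake estimation model, while the per-item records pair with the instance masks.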

[Figure panels: Multiple Viewpoints; Diverse Compositions; Multimodal Annotations]


[Figure panels: Class Distribution; Nutritional Information]


Synthetic Meal Image Generation