DreamFusion: Text-to-3D using 2D Diffusion

Authors anonymized

Abstract

Recent breakthroughs in text-to-image synthesis have been driven by diffusion models trained on billions of image-text pairs. Adapting this approach to 3D synthesis would require large-scale datasets of labeled 3D assets and efficient architectures for denoising 3D data, neither of which currently exists. In this work, we circumvent these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis. We introduce a loss based on probability density distillation that enables the use of a 2D diffusion model as a prior for optimization of a parametric image generator. Using this loss in a DeepDream-like procedure, we optimize a randomly-initialized 3D model (a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low loss. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment. Our approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors.

Given a caption, DreamFusion generates relightable 3D objects with high-fidelity appearance, depth, and normals. Objects are represented as a Neural Radiance Field and leverage a pretrained text-to-image diffusion prior such as Imagen.

Example generated objects

DreamFusion generates objects and scenes from diverse captions.

A teddy bear pushing a shopping cart full of fruits and vegetables.
a sliced loaf of fresh bread.
a zoomed out DSLR photo of Sydney opera house, aerial view.

Composing objects into a scene


Mesh exports

Our generated NeRF models can be exported to meshes using an algorithm based on marching cubes for easy integration into 3D renderers or modeling software.
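As a rough sketch of what such an export might look like, the snippet below samples a density field on a regular grid and runs marching cubes on it. The `query_density` function is a hypothetical stand-in for evaluating the trained NeRF's density at 3D points, and the resolution and threshold values are illustrative rather than the settings used in our pipeline.

```python
# Minimal sketch: convert a NeRF density field to a mesh via marching cubes.
# `query_density` is a hypothetical function mapping (N, 3) points -> (N,) densities.
import numpy as np
from skimage import measure

def export_mesh(query_density, resolution=256, threshold=25.0, path="mesh.obj"):
    # Sample the density field on a regular grid covering scene bounds [-1, 1]^3.
    coords = np.linspace(-1.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(coords, coords, coords, indexing="ij"), axis=-1)
    density = query_density(grid.reshape(-1, 3)).reshape(
        resolution, resolution, resolution)

    # Extract the isosurface where density crosses the threshold.
    verts, faces, _, _ = measure.marching_cubes(density, level=threshold)

    # Map voxel indices back to world coordinates and write a simple OBJ file.
    verts = verts / (resolution - 1) * 2.0 - 1.0
    with open(path, "w") as f:
        for v in verts:
            f.write(f"v {v[0]} {v[1]} {v[2]}\n")
        for face in faces + 1:  # OBJ indices are 1-based
            f.write(f"f {face[0]} {face[1]} {face[2]}\n")
```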

[...] a frog wearing a sweater

an iridescent metal scorpion

[...] a classic Packard car

[...] Sydney opera house, aerial view

[...] a delicious croissant

[...] a humanoid robot holding a human brain


How does DreamFusion work?

Given a caption, DreamFusion uses a text-to-image generative model called Imagen to optimize a 3D scene. We propose Score Distillation Sampling (SDS), a way to generate samples from a diffusion model by optimizing a loss function. SDS allows us to optimize samples in an arbitrary parameter space, such as a 3D space, as long as we can map back to images differentiably. We use a 3D scene parameterization similar to Neural Radiance Fields, or NeRFs, to define this differentiable mapping. SDS alone produces reasonable scene appearance, but DreamFusion adds additional regularizers and optimization strategies to improve geometry. The resulting trained NeRFs are coherent, with high-quality normals, surface geometry and depth, and are relightable with a Lambertian shading model.
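To make the SDS update concrete, here is a minimal sketch of a single optimization step. It assumes a differentiable `render(theta, camera)` that produces an image from the NeRF parameters, a frozen diffusion model exposing hypothetical `alpha_bar(t)` and `predict_noise(...)` helpers, and an optimizer over `theta`; these names and the hyperparameters are assumptions for illustration, not the exact interface used in DreamFusion.

```python
# Sketch of one Score Distillation Sampling (SDS) step in PyTorch.
# grad_theta L_SDS = E[ w(t) * (eps_hat - eps) * d(image)/d(theta) ]
import torch

def sds_step(theta, camera, text_emb, diffusion, render, optimizer,
             t_min=0.02, t_max=0.98, guidance_scale=100.0):
    x = render(theta, camera)                       # rendered image, differentiable w.r.t. theta
    t = torch.empty(1).uniform_(t_min, t_max)       # random diffusion timestep
    alpha_bar = diffusion.alpha_bar(t)              # noise-schedule term (assumed helper)
    eps = torch.randn_like(x)
    x_t = torch.sqrt(alpha_bar) * x + torch.sqrt(1.0 - alpha_bar) * eps  # noised render

    with torch.no_grad():                           # the diffusion model stays frozen
        eps_hat = diffusion.predict_noise(x_t, t, text_emb, guidance_scale)

    w = 1.0 - alpha_bar                             # timestep weighting w(t)
    grad = w * (eps_hat - eps)                      # SDS gradient w.r.t. the rendered image
    optimizer.zero_grad()
    x.backward(gradient=grad)                       # backprop through the renderer only
    optimizer.step()
```

The key property is that the gradient never flows through the diffusion model itself: the frozen image prior only scores the noised render, and the chain rule passes that score back through the differentiable renderer to the NeRF parameters.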


Citation

Anonymous. DreamFusion: Text-to-3D using 2D Diffusion. OpenReview, 2022.

@article{anon2022dreamfusion,
  author = {Anonymous},
  title = {DreamFusion: Text-to-3D using 2D Diffusion},
  journal = {OpenReview},
  year = {2022},
}