Description

[00:00] Intro to L3DG for 3D modeling

[00:32] Solving room-sized 3D scene complexity

[01:36] VQ-VAE compresses 3D Gaussian representation

[02:41] Generative sparse transposed convolution

[03:20] Latent diffusion for scene generation

[04:30] Visual improvements over baselines

[05:14] Scalability challenges for room-sized scenes

[06:13] Spherical harmonics for view dependence

[06:58] RGB and perceptual loss in training

[07:59] L1 and SSIM for 3D Gaussian optimization

[08:55] Training pipeline overview

[09:59] Densification in 3D Gaussian optimization

[10:49] Hyperparameter selection impact

[11:45] Future research directions

[12:42] Implementation optimization potential

[13:29] Comparison with GANs and diffusion methods

[14:27] Sparse grid representation trade-offs

[15:26] Evaluation datasets

[16:20] Chamfer distance for geometric analysis

[17:13] Applications

Authors: Barbara Roessle, Norman Müller, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, Angela Dai, Matthias Nießner

Affiliations: Technical University of Munich, Meta Reality Labs Zurich

Abstract: We propose L3DG, the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation. This enables effective generative 3D modeling, scaling to generation of entire room-scale scenes which can be very efficiently rendered. To enable effective synthesis of 3D Gaussians, we propose a latent diffusion formulation, operating in a compressed latent space of 3D Gaussians. This compressed latent space is learned by a vector-quantized variational autoencoder (VQ-VAE), for which we employ a sparse convolutional architecture to efficiently operate on room-scale scenes. This way, the complexity of the costly generation process via diffusion is substantially reduced, allowing higher detail on object-level generation, as well as scalability to large scenes. By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time. We demonstrate that our approach significantly improves visual quality over prior work on unconditional object-level radiance field synthesis and showcase its applicability to room-scale scene generation.
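
For readers who want a concrete picture of the compression stage the abstract describes, the sketch below shows the vector-quantization step at the core of a VQ-VAE, applied to latent features such as one per occupied sparse voxel of a scene. It is a minimal PyTorch illustration under assumed shapes, codebook size, and loss weights, not the authors' released implementation.

```python
# Minimal sketch of VQ-VAE vector quantization over per-voxel latent features.
# Shapes, codebook size, and the commitment weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z_e: torch.Tensor):
        # z_e: (N, code_dim) encoder outputs, e.g. one latent per occupied sparse voxel
        d = torch.cdist(z_e, self.codebook.weight)   # (N, num_codes) pairwise distances
        idx = d.argmin(dim=1)                        # nearest codebook entry per latent
        z_q = self.codebook(idx)                     # quantized latents
        # Codebook loss + commitment loss (standard VQ-VAE objective)
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator so gradients flow back to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss

# Usage: quantize latents for 10k occupied voxels of a room-scale scene
z_e = torch.randn(10_000, 64)
z_q, idx, vq_loss = VectorQuantizer()(z_e)
```

The straight-through estimator lets encoder gradients bypass the non-differentiable nearest-neighbor lookup, which is what makes end-to-end training of a compressed latent space practical before a diffusion model is trained on top of it, as in the two-stage setup the abstract outlines.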

Link: https://arxiv.org/abs/2410.13530