Overview

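stable-dreamfusion is a PyTorch implementation of the text-to-3D model DreamFusion, powered by the Stable Diffusion text-to-2D model. The project is a work in progress and differs from the paper in many ways. The current generation quality is not comparable to the results in the original paper, and many prompts still fail! Since the Imagen model is not publicly available, we use Stable Diffusion to replace it (implementation from diffusers). Unlike Imagen, Stable Diffusion is a latent diffusion model: it diffuses in a latent space rather than in the original image space. The loss therefore also has to propagate back through the encoder part of the VAE, which adds extra time cost to training. We use a multi-resolution grid encoder to implement the NeRF backbone (implementation from torch-ngp), which enables much faster rendering.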

Image-to-3D

[video mp4="https://oss.ccopyright.net.cn/2023/06/20230614023007381.mp4"][/video]

Text-to-3D

[video mp4="https://oss.ccopyright.net.cn/2023/06/20230614023012471.mp4"][/video]

Installation

[hidecontent type="logged" desc="隐藏内容：登录后可查看"]

git clone https://github.com/ashawkey/stable-dreamfusion.git
cd stable-dreamfusion

Optional: create a Python virtual environment

To avoid Python package conflicts, we recommend using a virtual environment, for example conda or venv:

python -m venv venv_stable-dreamfusion
source venv_stable-dreamfusion/bin/activate # you need to repeat this step for every new terminal
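
If you are unsure whether the environment is active in the current terminal, a quick check from Python can help. The snippet below is a minimal sketch and not part of the repository; the function name in_virtualenv is our own. It detects venv/virtualenv environments only, so an active conda environment is not reported by this check.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```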

Install with pip

pip install -r requirements.txt

Download pretrained models

To use image-conditioned 3D generation, you need to download some pretrained checkpoints manually:

Zero-1-to-3 for the diffusion backend. We use 105000.ckpt by default; the file name is hard-coded in guidance/zero123_utils.py.
```
cd pretrained/zero123
wget https://huggingface.co/cvlab/zero123-weights/resolve/main/105000.ckpt
```

Omnidata for depth and normal prediction. These checkpoints are hard-coded in preprocess_image.py.

mkdir pretrained/omnidata
cd pretrained/omnidata
# assume gdown is installed
gdown '1Jrh-bRnJEjyMCS7f-WsaFlccfPjJPPHI&confirm=t' # omnidata_dpt_depth_v2.ckpt
gdown '1wNxVO4vVbDEMEpnAi_jwQObf2MFodcBR&confirm=t' # omnidata_dpt_normal_v2.ckpt
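
Because the checkpoint locations are hard-coded, a misplaced file usually only shows up as an error at run time. The following is a minimal sketch, not part of the repository, that lists which of the files named above are missing. It assumes each download snippet was run from the repository root and that gdown saved the files under the names given in the comments; the helper name missing_checkpoints is our own.

```python
from pathlib import Path

# Relative paths taken from the download commands above.
EXPECTED_CHECKPOINTS = [
    "pretrained/zero123/105000.ckpt",
    "pretrained/omnidata/omnidata_dpt_depth_v2.ckpt",
    "pretrained/omnidata/omnidata_dpt_normal_v2.ckpt",
]


def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("missing checkpoints:")
        for rel in missing:
            print("  " + rel)
    else:
        print("all checkpoints found")
```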

To use DeepFloyd-IF, you need to accept the usage conditions on Hugging Face and log in from the command line with huggingface-cli login.

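For DMTet, we ship the pre-generated 32/64/128 resolution tetrahedral grids under tets. A 256-resolution grid is distributed separately; see the project README for the link.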

Build extensions (optional)

By default, we use load to build the extensions at runtime. We also provide a setup.py to build each extension:

cd stable-dreamfusion

# install all extension modules
bash scripts/install_ext.sh

# if you want to install manually, here is an example:
pip install ./raymarching # install to python path (you still need the raymarching/ folder, since this only installs the built extension.)

Taichi backend (optional)

Use the Taichi backend for Instant-NGP. It achieves performance comparable to the CUDA implementation without requiring a CUDA build. Install Taichi with pip:

pip install -i https://pypi.taichi.graphics/simple/ taichi-nightly

Troubleshooting:

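We assume the latest version of every dependency. If you run into a problem with a specific dependency, try upgrading it first (e.g., pip install -U diffusers). If the problem persists, a bug report is appreciated!
[F glutil.cpp:338] eglInitialize() failed Aborted (core dumped): this usually indicates a problem with the OpenGL installation. Try reinstalling the Nvidia driver, or use nvidia-docker as suggested in #131 if you are on a headless server.
TypeError: xxx_forward(): incompatible function arguments: this happens when we have updated the CUDA source and you previously installed the extensions with setup.py. Try reinstalling the corresponding extension (e.g., pip install ./gridencoder).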
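
When reporting a problem it helps to include the versions that are actually installed. The snippet below is a small sketch using only the standard library; the package list is illustrative and should be adjusted to match requirements.txt, and installed_versions is a name we made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Illustrative package names; not an exhaustive list of dependencies.
    for name, version in installed_versions(["torch", "diffusers"]).items():
        print(f"{name}: {version or 'not installed'}")
```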

Tested environments

Ubuntu 22 with torch 1.12 and CUDA 11.6 on a V100.

Usage

The first run takes some time because the CUDA extensions have to be compiled.

#### stable-dreamfusion setting

### Instant-NGP NeRF Backbone
# + faster rendering speed
# + less GPU memory (~16G)
# - need to build CUDA extensions (a CUDA-free Taichi backend is available)

## train with text prompt (with the default settings)
# `-O` equals `--cuda_ray --fp16`
# `--cuda_ray` enables instant-ngp-like occupancy grid based acceleration.
python main.py --text "a hamburger" --workspace trial -O

# reduce stable-diffusion memory usage with `--vram_O`
# enable various vram savings (https://huggingface.co/docs/diffusers/optimization/fp16).
python main.py --text "a hamburger" --workspace trial -O --vram_O

# You can collect arguments in a file. You can override arguments by specifying them after `--file`. Note that quoted strings can't be loaded from .args files...
python main.py --file scripts/res64.args --workspace trial_awesome_hamburger --text "a photo of an awesome hamburger"

# use CUDA-free Taichi backend with `--backbone grid_taichi`
python3 main.py --text "a hamburger" --workspace trial -O --backbone grid_taichi

# choose stable-diffusion version (support 1.5, 2.0 and 2.1, default is 2.1 now)
python main.py --text "a hamburger" --workspace trial -O --sd_version 1.5

# use a custom stable-diffusion checkpoint from hugging face:
python main.py --text "a hamburger" --workspace trial -O --hf_key andite/anything-v4.0

# use DeepFloyd-IF for guidance (experimental):
python main.py --text "a hamburger" --workspace trial -O --IF
python main.py --text "a hamburger" --workspace trial -O --IF --vram_O # requires ~24G GPU memory

# we also support negative text prompt now:
python main.py --text "a rose" --negative "red" --workspace trial -O

## after the training is finished:
# test (exporting 360 degree video)
python main.py --workspace trial -O --test
# also save a mesh (with obj, mtl, and png texture)
python main.py --workspace trial -O --test --save_mesh
# test with a GUI (free view control!)
python main.py --workspace trial -O --test --gui
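
The --file behaviour described above (arguments loaded from a file, overridable by anything placed after --file) can be illustrated with a short sketch. This is one plausible way such a mechanism could work, not the code in main.py: the demo parser, its defaults, and the function names are hypothetical. It assumes the file is split on whitespace, which would be one explanation for why quoted strings cannot be loaded from .args files.

```python
import argparse


def load_args_file(path):
    """Read whitespace-separated arguments from a text file.

    A plain split() does not understand shell quoting, so a quoted
    multi-word value would be broken into separate tokens.
    """
    with open(path, encoding="utf-8") as f:
        return f.read().split()


def build_demo_parser():
    """A tiny stand-in parser; option defaults here are placeholders."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--text", default=None)
    parser.add_argument("--workspace", default="workspace")
    parser.add_argument("--iters", type=int, default=1000)
    return parser


def parse_with_file(parser, argv):
    """Replace every `--file PATH` pair with the arguments stored in PATH."""
    expanded = []
    i = 0
    while i < len(argv):
        if argv[i] == "--file" and i + 1 < len(argv):
            expanded.extend(load_args_file(argv[i + 1]))
            i += 2
        else:
            expanded.append(argv[i])
            i += 1
    # argparse keeps the last value it sees for an option, so arguments
    # given after --file override those loaded from the file.
    return parser.parse_args(expanded)
```

With a file containing --iters 5000 --workspace from_file, calling parse_with_file(build_demo_parser(), ["--file", path, "--workspace", "trial"]) yields iters 5000 and workspace trial, because the later command-line value wins.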

### Vanilla NeRF backbone
# + pure pytorch, no need to build extensions!
# - slow rendering speed
# - more GPU memory

## train
# `-O2` equals `--backbone vanilla`
python main.py --text "a hotdog" --workspace trial2 -O2

# if CUDA OOM, try to reduce NeRF sampling steps (--num_steps and --upsample_steps)
python main.py --text "a hotdog" --workspace trial2 -O2 --num_steps 64 --upsample_steps 0

## test
python main.py --workspace trial2 -O2 --test
python main.py --workspace trial2 -O2 --test --save_mesh
python main.py --workspace trial2 -O2 --test --gui # not recommended, FPS will be low.
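
The -O and -O2 shorthands described in the comments above are simple flag bundles. The sketch below shows how such an expansion can be written with argparse; it is a reduced illustration rather than the option handling in main.py, and the default backbone name "grid" is an assumption on our part.

```python
import argparse


def parse_options(argv):
    """Parse a reduced option set and expand the -O / -O2 shorthands."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-O", action="store_true", help="equals --cuda_ray --fp16")
    parser.add_argument("-O2", action="store_true", help="equals --backbone vanilla")
    parser.add_argument("--cuda_ray", action="store_true")
    parser.add_argument("--fp16", action="store_true")
    # "grid" as the default name is assumed for this sketch.
    parser.add_argument("--backbone", default="grid")
    opt = parser.parse_args(argv)

    if opt.O:
        # -O bundles occupancy-grid acceleration and half precision.
        opt.cuda_ray = True
        opt.fp16 = True
    if opt.O2:
        # -O2 selects the pure-PyTorch backbone.
        opt.backbone = "vanilla"
    return opt


if __name__ == "__main__":
    print(parse_options(["-O"]))
    print(parse_options(["-O2"]))
```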

### DMTet finetuning

## use --dmtet and --init_with <nerf checkpoint> to finetune the mesh at higher resolution
python main.py -O --text "a hamburger" --workspace trial_dmtet --dmtet --iters 5000 --init_with trial/checkpoints/df.pth

## init dmtet with a mesh to generate texture
# requires cubvh: pip install git+https://github.com/ashawkey/cubvh
# remove --lock_geo to also finetune geometry, but performance may be bad.
python main.py -O --text "a white bunny with red eyes" --workspace trial_dmtet_mesh --dmtet --iters 5000 --init_with ./data/bunny.obj --lock_geo

## test & export the mesh
python main.py -O --text "a hamburger" --workspace trial_dmtet --dmtet --iters 5000 --test --save_mesh

## gui to visualize dmtet
python main.py -O --text "a hamburger" --workspace trial_dmtet --dmtet --iters 5000 --test --gui
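
The NeRF stage and the DMTet finetuning stage are linked only through the checkpoint path passed to --init_with. As a sketch of how the two commands above fit together, the helper below builds both argument lists from one prompt. The function name and its defaults are our own; the flags and the checkpoints/df.pth location are copied from the commands in this section.

```python
import shlex


def two_stage_commands(prompt, nerf_workspace="trial",
                       dmtet_workspace="trial_dmtet", iters=5000):
    """Build the NeRF training command and the DMTet finetuning command."""
    nerf_cmd = ["python", "main.py", "--text", prompt,
                "--workspace", nerf_workspace, "-O"]
    # The DMTet stage starts from the checkpoint written by the NeRF stage.
    checkpoint = f"{nerf_workspace}/checkpoints/df.pth"
    dmtet_cmd = [
        "python", "main.py", "-O", "--text", prompt,
        "--workspace", dmtet_workspace, "--dmtet",
        "--iters", str(iters), "--init_with", checkpoint,
    ]
    return nerf_cmd, dmtet_cmd


if __name__ == "__main__":
    for cmd in two_stage_commands("a hamburger"):
        # shlex.join quotes the prompt so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

Each returned list can also be passed directly to subprocess.run(cmd, check=True) to run the stages one after the other.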

### Image-conditioned 3D Generation

## preprocess input image
# note: the results of image-to-3D depend on zero-1-to-3's capability. For best performance, the input image should contain a single front-facing object, have a square aspect ratio, and be under 1024 pixels in resolution. Check the examples under ./data.
# this will export `<image>_rgba.png`, `<image>_depth.png`, and `<image>_normal.png` to the directory containing the input image.
python preprocess_image.py <image>.png
python preprocess_image.py <image>.png --border_ratio 0.4 # increase border_ratio if the center object appears too large and results are unsatisfying.
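
To confirm that preprocessing produced everything the later steps need, the naming rule above can be turned into a small helper. This is a sketch based only on the file names stated in the comment; expected_outputs is a hypothetical name and the example path is a placeholder.

```python
from pathlib import Path


def expected_outputs(image_path):
    """Return the three files preprocess_image.py is described as writing.

    The outputs sit next to the input image and share its stem.
    """
    src = Path(image_path)
    return [src.with_name(f"{src.stem}_{suffix}.png")
            for suffix in ("rgba", "depth", "normal")]


if __name__ == "__main__":
    # "data/example.png" is a placeholder path for illustration.
    for path in expected_outputs("data/example.png"):
        print(path, "exists" if path.is_file() else "missing")
```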

## zero123 train
# pass in the processed <image>_rgba.png by --image and do NOT pass in --text to enable zero-1-to-3 backend.
python main.py -O --image <image>_rgba.png --workspace trial_image --iters 5000

# if the image is not exactly front-view (elevation = 0), adjust default_polar (we use polar from 0 to 180 to represent elevation from 90 to -90)
python main.py -O --image <image>_rgba.png --workspace trial_image --iters 5000 --default_polar 80

# by default we leverage monocular depth estimation to aid image-to-3d, but if you find the depth estimation inaccurate and harmful to the results, turn it off by:
python main.py -O --image <image>_rgba.png --workspace trial_image --iters 5000 --lambda_depth 0

python main.py -O --image <image>_rgba.png --workspace trial_image_dmtet --dmtet --init_with trial_image/checkpoints/df.pth
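
The polar convention mentioned above (polar 0 to 180 representing elevation 90 to -90) is a linear mapping, polar = 90 - elevation. The sketch below spells it out; the function names are ours, and the linear form is inferred from the stated endpoints. Under this mapping, --default_polar 80 corresponds to a camera about 10 degrees above the horizontal.

```python
def elevation_to_polar(elevation_deg):
    """Convert elevation in degrees (+90 top, -90 bottom) to a polar angle."""
    if not -90.0 <= elevation_deg <= 90.0:
        raise ValueError("elevation must lie in [-90, 90] degrees")
    return 90.0 - elevation_deg


def polar_to_elevation(polar_deg):
    """Convert a polar angle in degrees (0 top, 180 bottom) to elevation."""
    if not 0.0 <= polar_deg <= 180.0:
        raise ValueError("polar angle must lie in [0, 180] degrees")
    return 90.0 - polar_deg


if __name__ == "__main__":
    for elevation in (90, 10, 0, -90):
        print(f"elevation {elevation:>4} deg -> polar {elevation_to_polar(elevation)} deg")
```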

## zero123 with multiple images
python main.py -O --image_config config/<config>.csv --workspace trial_image --iters 5000

## render <num> images per batch (default 1)
python main.py -O --image_config config/<config>.csv --workspace trial_image --iters 5000 --batch_size 4

# providing both --text and --image enables stable-diffusion backend (similar to make-it-3d)
python main.py -O --image hamburger_rgba.png --text "a DSLR photo of a delicious hamburger" --workspace trial_image_text --iters 5000

python main.py -O --image hamburger_rgba.png --text "a DSLR photo of a delicious hamburger" --workspace trial_image_text_dmtet --dmtet --init_with trial_image_text/checkpoints/df.pth

## test / visualize
python main.py -O --image <image>_rgba.png --workspace trial_image_dmtet --dmtet --test --save_mesh
python main.py -O --image <image>_rgba.png --workspace trial_image_dmtet --dmtet --test --gui

### Debugging

# Can save guidance images for debugging purposes. These get saved in trial_hamburger/guidance.
# Warning: this slows down training considerably and consumes lots of disk space!
python main.py --text "a hamburger" --workspace trial_hamburger -O --vram_O --save_guidance --save_guidance_interval 5 # save every 5 steps

For example commands, check scripts.

For advanced tips and other development notes, see Advanced Tips.

Important Notice

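This project is a work in progress and differs from the paper in many ways. The current generation quality is not comparable to the results in the original paper, and many prompts still fail!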

Notable differences from the paper

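Since the Imagen model is not publicly available, we use Stable Diffusion to replace it (implementation from diffusers). Unlike Imagen, Stable Diffusion is a latent diffusion model: it diffuses in a latent space rather than in the original image space. The loss therefore also has to propagate back through the encoder part of the VAE, which adds extra time cost to training.
We use a multi-resolution grid encoder to implement the NeRF backbone (implementation from torch-ngp), which enables much faster rendering (about 10 FPS at 800x800).
We use the Adan optimizer by default.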
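
To make the first difference concrete, the score distillation gradient can be written with an explicit encoder term. This is a sketch in our own notation, not a formula quoted from the repository: g(θ) denotes the NeRF renderer, E the VAE encoder, z_t the noised latent at timestep t, ε̂_φ the noise predicted by Stable Diffusion for the prompt y, and w(t) a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(z_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial z}{\partial x}\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta), \quad z = \mathcal{E}(x)
```

With a pixel-space model such as Imagen, the factor ∂z/∂x is absent. Here it requires a backward pass through the encoder for every training step, which is where the extra time cost comes from.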

[/hidecontent]