stable-diffusion: a latent text-to-image diffusion model

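Overview

Stable Diffusion is a latent text-to-image diffusion model.

Stable Diffusion was made possible thanks to a collaboration with Stability AI and Runway.

Similar to Google's Imagen, the model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB VRAM.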
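
As a rough sanity check on the "relatively lightweight" claim, the parameter counts above can be turned into a weight-memory estimate. The sketch below assumes 4 bytes per parameter for full precision and 2 bytes for half precision. It counts only the UNet and text-encoder weights quoted above and ignores the autoencoder, activations, and framework overhead, so it is a lower bound rather than a measurement.

```python
# Back-of-the-envelope weight memory for the parameter counts quoted above.
# Assumption: 4 bytes/param (float32) or 2 bytes/param (float16).

UNET_PARAMS = 860_000_000
TEXT_ENCODER_PARAMS = 123_000_000


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


total_params = UNET_PARAMS + TEXT_ENCODER_PARAMS
fp32_gib = weight_memory_gib(total_params, 4)
fp16_gib = weight_memory_gib(total_params, 2)

print(f"float32 weights: {fp32_gib:.2f} GiB")
print(f"float16 weights: {fp16_gib:.2f} GiB")
```

The gap between this estimate and the 10GB figure is plausibly taken up by activations during sampling and the parts not counted here.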

Requirements

A suitable conda environment named ldm can be created and activated with:

conda env create -f environment.yaml
conda activate ldm

You can also update an existing latent diffusion environment by running

conda install pytorch torchvision -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
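
After installing, it can help to confirm that the packages named above are importable before running any script. The following is a minimal sketch using only the standard library. The module list is an assumption derived from the install commands; in particular, it assumes the invisible-watermark package is imported under the name imwatermark.

```python
# Check that the packages named in the install commands can be imported.
# Assumption: invisible-watermark installs under the module name "imwatermark".
from importlib.util import find_spec

REQUIRED_MODULES = ["torch", "torchvision", "transformers", "diffusers", "imwatermark"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```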

[hidecontent type="logged" desc="隐藏内容：登录后可查看"]

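Stable Diffusion v1

Stable Diffusion v1 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 860M UNet and a CLIP ViT-L/14 text encoder for the diffusion model. The model was pretrained on 256x256 images and then finetuned on 512x512 images.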
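
To make the downsampling factor concrete, the sketch below computes the shape of the latent that the diffusion model operates on for a given image size. The helper name is hypothetical, and the default of 4 latent channels is an assumption (the sampling script exposes it as the --C option).

```python
def latent_shape(height: int, width: int, channels: int = 4, factor: int = 8):
    """Return (channels, latent_height, latent_width) for an image size.

    Assumption: channels=4 latent channels; factor=8 is the autoencoder's
    downsampling factor described in the text.
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the factor")
    return (channels, height // factor, width // factor)


print(latent_shape(512, 512))  # finetuning resolution
print(latent_shape(256, 256))  # pretraining resolution
```

Under these assumptions, a 512x512 image corresponds to a 64x64 latent, so the UNet works on a grid with 64 times fewer spatial positions than the pixel grid.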

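Note: Stable Diffusion v1 is a general text-to-image diffusion model and therefore mirrors biases and (mis-)conceptions that are present in its training data. Details on the training procedure and data, as well as the intended use of the model, can be found in the corresponding model card.

The weights are available via the CompVis organization at Hugging Face under a license which contains specific use-based restrictions to prevent misuse and harm as informed by the model card, but otherwise remains permissive. While commercial use is permitted under the terms of the license, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations, since there are known limitations and biases of the weights, and research on safe and ethical deployment of general text-to-image models is an ongoing effort. The weights are research artifacts.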

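Weights

We currently provide the following checkpoints:

sd-v1-1.ckpt: 237k steps at resolution 256x256 on laion2B-en. 194k steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).
sd-v1-2.ckpt: Resumed from sd-v1-1.ckpt. 515k steps at resolution 512x512 on laion-aesthetics v2 5+ (a subset of laion2B-en with estimated aesthetics score > 5.0, additionally filtered to images with an original size >= 512x512 and an estimated watermark probability < 0.5. The watermark estimate is from the LAION-5B metadata; the aesthetics score is estimated using the LAION-Aesthetics Predictor V2).
sd-v1-3.ckpt: Resumed from sd-v1-2.ckpt. 195k steps at resolution 512x512 on "laion-aesthetics v2 5+", with 10% dropping of the text conditioning to improve classifier-free guidance sampling.
sd-v1-4.ckpt: Resumed from sd-v1-2.ckpt. 225k steps at resolution 512x512 on "laion-aesthetics v2 5+", with 10% dropping of the text conditioning to improve classifier-free guidance sampling.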
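
The "10% dropping of the text conditioning" mentioned for the last two checkpoints can be pictured as follows. This is an illustrative sketch, not the training code: the function name is hypothetical, and replacing a dropped caption with the empty string is an assumption about how the unconditional case is represented.

```python
import random


def drop_text_conditioning(captions, drop_prob=0.1, rng=None):
    """Replace each caption with an empty prompt with probability drop_prob.

    Assumption: the empty string stands in for "no conditioning".
    """
    if rng is None:
        rng = random.Random()
    return ["" if rng.random() < drop_prob else caption for caption in captions]


batch = ["a photo of a cat", "a painting of a ship", "a city at night"]
print(drop_text_conditioning(batch, drop_prob=0.1, rng=random.Random(0)))
```

Training on a mix of conditioned and unconditioned examples is what lets a single model produce both the conditional and the unconditional noise estimate that classifier-free guidance combines at sampling time.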

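Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of the checkpoints.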

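Text-to-Image with Stable Diffusion

Stable Diffusion is a latent diffusion model conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder. We provide a reference script for sampling, but there also exists a diffusers integration, where we expect to see more active community development.

Reference Sampling Script

We provide a reference sampling script, which incorporates

a Safety Checker Module, to reduce the probability of explicit outputs,
an invisible watermarking of the outputs, to help viewers identify the images as machine-generated.

After obtaining the stable-diffusion-v1-*-original weights, link them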

mkdir -p models/ldm/stable-diffusion-v1/
ln -s <path/to/model.ckpt> models/ldm/stable-diffusion-v1/model.ckpt

and sample with

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms

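By default, this uses a guidance scale of --scale 7.5, Katherine Crowson's implementation of the PLMS sampler, and renders images of size 512x512 (which it was trained on) in 50 steps. All supported arguments are listed below (type python scripts/txt2img.py --help).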

usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA]
                  [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT]
                  [--seed SEED] [--precision {full,autocast}]

optional arguments:
  -h, --help            show this help message and exit
  --prompt [PROMPT]     the prompt to render
  --outdir [OUTDIR]     dir to write results to
  --skip_grid           do not save a grid, only individual samples. Helpful when evaluating lots of samples
  --skip_save           do not save individual samples. For speed measurements.
  --ddim_steps DDIM_STEPS
                        number of ddim sampling steps
  --plms                use plms sampling
  --laion400m           uses the LAION400M model
  --fixed_code          if enabled, uses the same starting code across samples
  --ddim_eta DDIM_ETA   ddim eta (eta=0.0 corresponds to deterministic sampling
  --n_iter N_ITER       sample this often
  --H H                 image height, in pixel space
  --W W                 image width, in pixel space
  --C C                 latent channels
  --f F                 downsampling factor
  --n_samples N_SAMPLES
                        how many samples to produce for each given prompt. A.k.a. batch size
  --n_rows N_ROWS       rows in the grid (default: n_samples)
  --scale SCALE         unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
  --from-file FROM_FILE
                        if specified, load prompts from this file
  --config CONFIG       path to config which constructs model
  --ckpt CKPT           path to checkpoint of model
  --seed SEED           the seed (for reproducible sampling)
  --precision {full,autocast}
                        evaluate at this precision
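
The formula in the --scale help text can be written out directly. The sketch below applies it element-wise to two toy noise predictions stored as plain Python lists. The function name and the toy values are illustrative; in the real sampler the two predictions are tensors produced by the UNet for an empty prompt and for the actual prompt.

```python
def guided_eps(eps_uncond, eps_cond, scale):
    """Classifier-free guidance, element-wise:
    eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_empty = [0.0, 1.0]  # toy prediction for the empty prompt
eps_prompt = [1.0, 3.0]  # toy prediction for the actual prompt

print(guided_eps(eps_empty, eps_prompt, 0.0))  # unconditional prediction only
print(guided_eps(eps_empty, eps_prompt, 1.0))  # conditional prediction only
print(guided_eps(eps_empty, eps_prompt, 7.5))  # extrapolates past the conditional one
```

A scale of 1.0 reproduces the conditional prediction, while larger values push the result further in the direction that separates the conditional from the unconditional prediction.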

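Note: The inference config for all v1 versions is designed to be used with EMA-only checkpoints. For this reason use_ema=False is set in the configuration; otherwise the code will try to switch from non-EMA to EMA weights. If you want to examine the effect of EMA vs. no EMA, we provide "full" checkpoints which contain both types of weights. For these, use_ema=False will load and use the non-EMA weights.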
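
For context on the EMA versus non-EMA distinction: EMA weights are an exponential moving average of the weights seen during training. The toy sketch below illustrates the update rule on plain Python floats. The function name and the decay value are illustrative assumptions, not values taken from the training configuration.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One exponential-moving-average step over a flat list of weights.

    Assumption: decay=0.999 is an illustrative value only.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


raw = [0.0]  # a single toy weight, as updated by the optimizer
ema = [0.0]  # its moving average
for step in range(1, 4):
    raw = [float(step)]  # pretend the optimizer moved the weight
    ema = ema_update(ema, raw, decay=0.5)
    print(step, raw, ema)
```

The averaged copy lags behind the raw weights and smooths out step-to-step noise, which is why the two sets of weights in a "full" checkpoint can produce different samples.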

Diffusers Integration

A simple way to download and sample Stable Diffusion is by using the diffusers library:

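# Make sure you're logged in with `huggingface-cli login`.
from torch import autocast
from diffusers import StableDiffusionPipeline

# Download the v1-4 weights from the Hugging Face Hub and move the pipeline to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
# Run inference under automatic mixed precision.
with autocast("cuda"):
    # Later diffusers releases expose the result as `.images[0]`;
    # adjust this line if indexing with ["sample"] fails.
    image = pipe(prompt)["sample"][0]

image.save("astronaut_rides_horse.png")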

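Image Modification with Stable Diffusion

By using a diffusion-denoising mechanism as first proposed by SDEdit, the model can be used for different tasks such as text-guided image-to-image translation and upscaling. Similar to the txt2img sampling script, we provide a script to perform image modification with Stable Diffusion.

The following describes an example where a rough sketch made in Pinta is converted into a detailed artwork.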

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8

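Here, strength is a value between 0.0 and 1.0 that controls the amount of noise that is added to the input image. Values that approach 1.0 allow for lots of variations but will also produce images that are not semantically consistent with the input.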
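
One way to read the strength parameter is as the fraction of the sampling schedule that is used to noise the input before it is denoised again. The sketch below is written under that assumption: the number of noising steps is taken to be the integer part of strength times the number of sampling steps. The helper name and the default of 50 steps are illustrative, not taken from the script.

```python
def encoding_steps(strength: float, ddim_steps: int = 50) -> int:
    """Number of schedule steps used to noise the input image.

    Assumption: steps = int(strength * ddim_steps), with strength in [0.0, 1.0].
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return int(strength * ddim_steps)


for s in (0.0, 0.5, 0.8, 1.0):
    print(f"strength={s}: noise for {encoding_steps(s)} of 50 steps")
```

Under this reading, a strength of 0.8 leaves relatively little of the original sketch intact, which matches the observation that high values trade fidelity to the input for variation.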

[/hidecontent]