stanford_alpaca: An Instruction-Following LLaMA Model

Overview

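The current Alpaca model is fine-tuned from a 7B LLaMA model [1] on 52K instruction-following data generated by the techniques in the Self-Instruct [2] paper, with some modifications that we discuss in the Data Generation Process section below. In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite [2].

Alpaca is still under development, and there are many limitations that have to be addressed. Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless. We therefore encourage users to be cautious when interacting with Alpaca, and to report any concerning behavior to help improve the safety and ethical considerations of the model.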

Data Release

[hidecontent type="logged" desc="隐藏内容：登录后可查看"]

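alpaca_data.json contains the 52K instruction-following data we used for fine-tuning the Alpaca model. This JSON file is a list of dictionaries, and each dictionary contains the following fields:

instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
input: str, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
output: str, the answer to the instruction as generated by text-davinci-003.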
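
As a rough illustration of this layout, the sketch below checks that each record carries the three string fields and counts how many records have a non-empty input. The function name summarize and the two sample records are made up for illustration; they are not taken from alpaca_data.json.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def summarize(records):
    """Validate the three string fields and count records with a non-empty input."""
    for i, rec in enumerate(records):
        for field in REQUIRED_FIELDS:
            if not isinstance(rec.get(field), str):
                raise ValueError(f"record {i}: field {field!r} is missing or not a string")
    with_input = sum(1 for rec in records if rec["input"].strip())
    return {"total": len(records), "with_input": with_input}


# Hypothetical sample in the same shape as alpaca_data.json (not real dataset entries).
sample = json.loads("""
[
  {"instruction": "Name three primary colors.", "input": "", "output": "Red, blue, and yellow."},
  {"instruction": "Summarize the following article", "input": "Some article text.", "output": "A short summary."}
]
""")

print(summarize(sample))
```

To run the same check on the real file, load it with json.load on an opened alpaca_data.json and pass the resulting list to summarize.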

We used the following prompts for fine-tuning the Alpaca model:

For examples with a non-empty input field:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:

For examples with an empty input field:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:

During inference (e.g., for the web demo), we use the user instruction with an empty input field (the second option).
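
The choice between the two templates can be sketched as a small helper. This is a minimal sketch, not the project's training code: the names PROMPT_WITH_INPUT, PROMPT_NO_INPUT, and build_prompt are hypothetical, and the exact blank-line placement is assumed from the templates as printed above.

```python
# Template text copied from the two prompts above; whitespace is an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(example):
    """Pick the template based on whether the example has a non-empty input."""
    if example.get("input", "").strip():
        return PROMPT_WITH_INPUT.format_map(example)
    return PROMPT_NO_INPUT.format_map(example)


# Inference-style usage: a user instruction with an empty input field.
print(build_prompt({"instruction": "Name three primary colors.", "input": ""}))
```

At training time the output field would be appended after the "### Response:" marker; at inference time the model is expected to continue from that marker.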

Data Generation Process

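We built on the data generation pipeline from Self-Instruct and made the following modifications:

We used text-davinci-003 to generate the instruction data instead of davinci.
We wrote a new prompt (prompt.txt) that explicitly gives the requirements of instruction generation to text-davinci-003. Note: there is a slight error in the prompt we used, and future users should incorporate the edit in #24.
We adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
We simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
We only generated a single instance for each instruction, instead of 2 to 3 instances as in [2].

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, we also found our 52K generated data to be much more diverse than the data released by Self-Instruct.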

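Fine-tuning

We fine-tune our models using standard Hugging Face training code. We fine-tuned both LLaMA-7B and LLaMA-13B; the command below shows the hyperparameters for the 7B run.

To reproduce our fine-tuning runs for LLaMA, first install the requirements

pip install -r requirements.txt

Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode. We were able to reproduce a model of similar quality to the one we hosted in our demo with the following command using Python 3.10. Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following the instructions in the PR), and <your_output_dir> with where you want to store your outputs.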

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
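
The --learning_rate, --warmup_ratio, and --lr_scheduler_type flags above together describe a learning-rate curve. The sketch below traces that curve under the usual reading of a cosine schedule with warmup: a linear ramp over the first 3% of steps, then a half-cosine decay to zero. The function name lr_at_step and the total_steps value are hypothetical, and the trainer's exact step rounding may differ.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed run length for illustration only.
total_steps = 1000
for step in (0, 15, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps))
```

The rate peaks at the base value right after warmup and falls to zero at the final step.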

The same script also works for OPT fine-tuning. Here is an example for fine-tuning OPT-6.7B

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path "facebook/opt-6.7b" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True

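Note that the given training script is meant to be simple and easy to use, and is not particularly optimized. To run on more GPUs, you may prefer to turn down gradient_accumulation_steps to keep a global batch size of 128. The global batch size has not been tested for optimality.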
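
The global batch size is the product of the GPU count, the per-device batch size, and the gradient accumulation steps (4 x 4 x 8 = 128 in the commands above). A small sketch for picking gradient_accumulation_steps on other GPU counts; the helper name grad_accum_steps is hypothetical:

```python
def grad_accum_steps(num_gpus, per_device_batch_size, global_batch_size=128):
    """Return the accumulation steps that keep the global batch size fixed."""
    per_step = num_gpus * per_device_batch_size
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"{global_batch_size} is not divisible by {num_gpus} GPUs x "
            f"{per_device_batch_size} per device"
        )
    return global_batch_size // per_step


for gpus in (4, 8, 16, 32):
    print(gpus, "GPUs ->", grad_accum_steps(gpus, 4), "accumulation steps")
```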

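Addressing OOM

Naively, fine-tuning a 7B model requires about 7 x 4 x 4 = 112 GB of VRAM. The commands given above enable parameter sharding, so no redundant model copy is stored on any GPU. If you would like to further reduce the memory footprint, here are some options:

Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of longer runtime.

In our experience, DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here is an example that uses DeepSpeed stage-3 with 4 GPUs, with both parameter and optimizer offload: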

pip install deepspeed
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True

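The DeepSpeed library also provides some helpful functions to estimate memory usage.

LoRA fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112 GB to about 7 x 4 = 28 GB. We may release our re-implementation of this in the future, but for now the peft codebase can be a useful resource.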
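
The two back-of-the-envelope figures in this section can be reproduced with a short calculation. The reading of the factors is an assumption, not something stated above: 4 bytes per fp32 value, and four model-sized tensors for full fine-tuning (weights, gradients, and two Adam moment buffers), versus roughly one model-sized tensor (the frozen weights) for LoRA, with the small adapter matrices ignored. Activations are not counted, and the helper name naive_finetune_gb is hypothetical.

```python
def naive_finetune_gb(params_billion, bytes_per_value=4, model_sized_copies=4):
    """Rough VRAM estimate in GB: parameters (billions) x bytes per value x copies."""
    return params_billion * bytes_per_value * model_sized_copies


full = naive_finetune_gb(7)                        # weights + grads + 2 Adam buffers (assumed)
lora = naive_finetune_gb(7, model_sized_copies=1)  # frozen weights only (assumed)
print("full fine-tuning:", full, "GB")
print("LoRA:", lora, "GB")
```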

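Recovering Alpaca Weights

The weight diff between Alpaca-7B and LLaMA-7B is available on Hugging Face (the link is given in step 2 below). To recover the original Alpaca-7B weights, follow these steps: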

1. Convert Meta's released weights into huggingface format. Follow this guide:
    https://huggingface.co/docs/transformers/main/model_doc/llama
2. Make sure you cloned the released weight diff into your local machine. The weight diff is located at:
    https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
3. Run this function with the correct paths. E.g.,
    python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>

Once step 3 completes, you should have a directory with the recovered weights, from which you can load the model as follows

import transformers
alpaca_model = transformers.AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
alpaca_tokenizer = transformers.AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
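
Conceptually, recovering from a weight diff means adding the diff back onto the base weights, parameter by parameter. The toy sketch below illustrates that idea with plain Python lists, assuming the diff stores tuned minus raw element-wise. It is not the weight_diff.py script: the function names, the parameter name layer.weight, and the values are all made up, and the real script works on full model checkpoints.

```python
def make_diff(raw, tuned):
    """Element-wise tuned - raw for every named parameter."""
    return {name: [t - r for t, r in zip(tuned[name], raw[name])] for name in tuned}


def recover(raw, diff):
    """Element-wise raw + diff, which gives back the tuned parameters."""
    return {name: [d + r for d, r in zip(diff[name], raw[name])] for name in diff}


# Hypothetical one-parameter "models" for illustration.
raw = {"layer.weight": [1.0, 2.0]}
tuned = {"layer.weight": [1.5, 1.75]}

diff = make_diff(raw, tuned)
print(diff)
print(recover(raw, diff) == tuned)
```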

[/hidecontent]