llama.cpp DeepSeek R1 Deployment

前言

本教程将介绍如何使用 llama.cpp 部署 DeepSeek R1 模型，首先是 32B 蒸馏版本，然后会尝试部署 671B 完整版本。以下是一些基础信息：

Ubuntu 20.04 amd64
Docker
Nvidia Container Toolkit
CUDA: 12.4
llama.cpp: b4743

llama.cpp 最大的优点就是兼容性好，不挑设备。vllm 虽然性能更强大，吞吐量高，但是必须 Ampere GPU 或者更高才能运行同样的模型，在较旧的设备上问题很多。

需要管理员权限的环境配置

由于我使用的计算服务器没有外网，需要通过代理联网，所以 sudo 命令均带有-E 参数，以保留环境变量中的代理设置。

安装 Docker

建议 Ubuntu 20.04 或者更高，有必要的话先更新系统。目前 18.04 似乎也能运行，但是需要一点特殊处理。以下是官方文档

https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository

# Add Docker's official GPG key:
sudo -E apt-get update
sudo -E apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo -E curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo -E apt-get update
sudo -E apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

配置 Rootless Docker Daemon

以下是官方文档

https://docs.docker.com/engine/security/rootless/#prerequisites

sudo apt-get install -y dbus-user-session
sudo apt-get install -y uidmap
sudo apt-get install -y systemd-container
sudo apt-get install -y docker-ce-rootless-extras

这一步可选，关闭 Root Docker Daemon

sudo systemctl disable --now docker.service docker.socket
sudo rm /var/run/docker.sock

Ubuntu 18.04 的 slirp4netns 配置

Rootless Docker Daemon 需要一个用户态的网络支持，这需要用到slirp4netns。但是，该软件仅在 Ubuntu 19.10 后才能随 apt 安装，我们需要手动下载。

# 注意修改版本，推荐使用最新版
wget https://github.com/rootless-containers/slirp4netns/releases/download/v1.3.2/slirp4netns-x86_64
sudo mv slirp4netns-x86_64 /usr/local/bin/slirp4netns

安装或更新 Nvidia Driver 和 CUDA

根据 llama.cpp 的需求，推荐使用 12.4 的 CUDA，可以这样下载：

wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run

接下来安装它，全部默认选项即可，选择Install

sudo sh cuda_12.4.1_550.54.15_linux.run

安装过程不会输出过程信息到终端，而是直接打印到文件，可以到以下文件看到安装过程的输出：

/var/log/nvidia-installer.log
/var/log/cuda-installer.log

安装 Nvidia Container Toolkit

以下是官方文档

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo -E apt-get update
sudo -E apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place

一般用户的环境配置

配置和启动 Rootless Docker Daemon

首先，需要在配置后重新登录

接下来，使用 Docker 提供的脚本生成配置文件

dockerd-rootless-setuptool.sh install

修改.bashrc，添加环境变量

export PATH=/usr/bin:$PATH
export DOCKER_HOST=unix:///run/user/$UID/docker.sock

为 Docker Daemon 配置代理

这一步可选，因为前面提到的网络原因，docker pull等操作需要配置代理服务器。打开$HOME/.config/systemd/user/docker.service，添加代理服务器

[Service]
Environment=PATH=...
Environment="HTTP_PROXY=???"
Environment="HTTPS_PROXY=???"
Environment="NO_PROXY=10.10.10.10,*.my.registry.com"
...

接下来重新启动 Rootless Docker Daemon

systemctl --user daemon-reload
systemctl --user restart docker

配置和启动 Nvidia Container Toolkit

配置 Nvidia CTK 运行时

nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker

测试配置是否成功

应该和 nvidia-smi 的效果一样，看到所有的 GPU，注意不需要 sudo 或者添加用户到 docker 用户组

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

部署 DeepSeek-R1-Distill-Qwen-32B

初步测试

测试配置：4 x 2080Ti 11GB

下载模型

pip install huggingface_hub
mkdir -p models/models--unsloth--DeepSeek-R1-Distill-Qwen-32B-GGUF

HF_ENDPOINT=https://hf-mirror.com \
  huggingface-cli download \
  unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF \
  DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf \
  --local-dir models/models--unsloth--DeepSeek-R1-Distill-Qwen-32B-GGUF

拉取镜像，也可以在 run 的时候自动拉取。

docker pull ghcr.io/ggml-org/llama.cpp:server-cuda

启动服务器

docker run \
  --rm --gpus all -p 8080:8000 -v ./models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/models--unsloth--DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf \
  --port 8000 --host 0.0.0.0 -t 8 -c 32768 -n 8192 --n-gpu-layers 512

资源占用：

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:3B:00.0 Off |                  N/A |
| 31%   35C    P2             63W /  250W |   10092MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:5E:00.0 Off |                  N/A |
| 31%   37C    P2             60W /  250W |    9486MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:AF:00.0 Off |                  N/A |
| 31%   37C    P2             43W /  250W |    9486MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:D8:00.0 Off |                  N/A |
| 31%   36C    P2             62W /  250W |    9922MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

推理速度

prompt eval time =     544.32 ms /   233 tokens (    2.34 ms per token,   428.06 tokens per second)
       eval time =   16309.46 ms /   346 tokens (   47.14 ms per token,    21.21 tokens per second)
      total time =   16853.77 ms /   579 tokens

由于我使用的服务器是双路 CPU，没有 NVLink，存在 NUMA 问题，所以通讯速度很差。这个时候，如果使用-sm row参数，会导致速度降低到 15tokens/s。

进一步压榨性能

通过以下参数，可以进一步压榨 GPU 的能力：

-fa：使用 Flash Attention 可节约动态内存，提升推理速度
-ctk q8_0 -ctv q8_0：使用量化 KV Cache ，进一步节约内存
-c 128000：增加上下文窗口长度，允许模型进行更长的对话，处理更长的输入内容
--n-gpu-layers 512：尽可能多地将模型参数放入显存，可以提升推理速度

docker run \
  --rm --gpus all -p 8080:8000 -v ./models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/models--unsloth--DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf \
  --port 8000 --host 0.0.0.0 -t 8 --n-gpu-layers 512 -c 128000 -n 8192 -fa -ctk q8_0 -ctv q8_0

资源占用：

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:3B:00.0 Off |                  N/A |
| 31%   33C    P8             21W /  250W |   10847MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:5E:00.0 Off |                  N/A |
| 31%   36C    P8             18W /  250W |    9479MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:AF:00.0 Off |                  N/A |
| 31%   34C    P8              1W /  250W |    9479MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:D8:00.0 Off |                  N/A |
| 31%   35C    P8             22W /  250W |    9947MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

推理速度

prompt eval time =     490.96 ms /   232 tokens (    2.12 ms per token,   472.54 tokens per second)
       eval time =   11767.64 ms /   254 tokens (   46.33 ms per token,    21.58 tokens per second)
      total time =   12258.60 ms /   486 tokens

使用更好的 GPU 测试

刚好手头有 A6000，看一下比较新的 GPU 会有多高的吞吐量。

docker run \
  --rm --gpus all -p 8080:8000 -v ./models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/models--unsloth--DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf \
  --port 8000 --host 0.0.0.0 -t 8 --n-gpu-layers 512 -c 131072 -n 8192 -fa -ctk q8_0 -ctv q8_0

资源占用：

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8             64W /  300W |   36649MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

推理速度

prompt eval time =     271.75 ms /   233 tokens (    1.17 ms per token,   857.42 tokens per second)
       eval time =    7982.01 ms /   239 tokens (   33.40 ms per token,    29.94 tokens per second)
      total time =    8253.75 ms /   472 tokens

A6000 恐怖如斯，2080Ti 拼尽全力无法战胜。

部署 DeepSeek-R1-UD-IQ1_S

初步测试

测试配置：4 x V100 32GB

这个配置对于 671B 的模型来说非常丐，属于非常极限刚好能运行，推理速度很慢。

下载模型，模型是分成三个文件的，这是因为 git-lfs 的限制为 50GB

HF_ENDPOINT=https://hf-mirror.com \
  huggingface-cli download \
  unsloth/DeepSeek-R1-GGUF \
  --include "*UD-IQ1_S*" \
  --local-dir models/DeepSeek-R1-UD-IQ1_S

启动服务器，llama.cpp 会自动加载剩余的两个模型分片文件

docker run \
  --rm --gpus all -p 8080:8000 -v ./models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --port 8000 --host 0.0.0.0 -t 8 -c 8192 -n 4096 \
  --n-gpu-layers 40 -ts 14,16,16,16 -ctk q8_0

这里的参数与先前有所不同：

无-fa：DeepSeek-R1 模型使用 MLA，与 Flash Attention 不兼容
-ctk q8_0：对于 MLA 模型，只有 k 可以量化
--n-gpu-layers 40：只保留 40 层在显存中
-ts 14,16,16,16：DeepSeek-R1 默认状态下，GPU 0 的负载会稍高，通过调节切分参数，可以平衡显存占用

资源占用：

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB           On  |   00000000:00:05.0 Off |                    0 |
| N/A   41C    P0             55W /  300W |   30042MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  |   00000000:00:06.0 Off |                    0 |
| N/A   39C    P0             57W /  300W |   29576MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  |   00000000:00:07.0 Off |                    0 |
| N/A   41C    P0             54W /  300W |   29576MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  |   00000000:00:08.0 Off |                    0 |
| N/A   40C    P0             55W /  300W |   29576MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

推理速度

prompt eval time =   13482.15 ms /   226 tokens (   59.66 ms per token,    16.76 tokens per second)
       eval time =   87209.41 ms /   270 tokens (  323.00 ms per token,     3.10 tokens per second)
      total time =  100691.56 ms /   496 tokens

总结

llama.cpp 对于老旧硬件的支持非常好，像 V100，2080Ti 这样比较老旧的 GPU，也能运行最新的模型。
然而，671B 的模型还是太大了，为了能运行在 V100 上，只能将一部分参数放在 CPU 内存，这进一步降低了推理速度。这个部署尝试更像是一种极限运动，缺乏实用性。个人认为：

解码速度需要达到 10token/s 才能算是比较流畅
上下文窗口 8192，才能满足联网查询和深度思考的要求

如果需要自己部署，还是推荐使用 DeepSeek-R1-Distill-Qwen-32B-GGUF。在 Q4 量化下，使用一个 RTX A6000 就能进行推理，并享受到完整的 128k 长上下文，是比较实用的方案。

OMT: 估计部署模型的硬件需求

既然都读到这里了，我们干脆做一些分析，看看 llama.cpp 部署某个模型时，至少需要什么样的硬件。这里主要是围绕内存来计算，因为这会决定模型能否运行起来。

llama.cpp 部署后的模型，其主要有 4 个部分的内存需求：

模型参数（model buffer）：最大的一部分。
KV Cache（KV buffer）：较大的一部分，和上下文的长度成正比。
激活（compute buffer）：会受到-ub，-fa等参数的影响，但一般比模型参数低一个数量级。
CUDA Runtime：大约 300MB 每 GPU，会受到 CUDA 版本的影响。若占用非常多，建议更新 CUDA 版本。

模型参数的估计

偷懒的方法：直接使用.gguf模型文件的大小代表模型大小。虽然文件中还带有 tokenizer 等其他信息，但一般远远小于模型参数的大小。

精确的方法：使用reader.py读取文件，并根据每种量化类型的比特数，计算模型大小。

def read_gguf_file(gguf_file_path):
    """
    Reads and prints key-value pairs and tensor information from a GGUF file in an improved format.

    Parameters:
    - gguf_file_path: Path to the GGUF file.
    """

    reader = GGUFReader(gguf_file_path)

    # List all key-value pairs in a columnized format
    print("Key-Value Pairs:") # noqa: NP100
    max_key_length = max(len(key) for key in reader.fields.keys())
    for key, field in reader.fields.items():
        value = field.parts[field.data[0]]
        print(f"{key:{max_key_length}} : {value}") # noqa: NP100
    print("----") # noqa: NP100

    # List all tensors
    print("Tensors:") # noqa: NP100
    tensor_info_format = "{:<30} | Shape: {:<15} | Size: {:<12} | Quantization: {}"
    print(tensor_info_format.format("Tensor Name", "Shape", "Size", "Quantization")) # noqa: NP100
    print("-" * 80) # noqa: NP100
    data = []
    for tensor in reader.tensors:
        shape_str = "x".join(map(str, tensor.shape))
        size_str = str(tensor.n_elements)
        quantization_str = tensor.tensor_type.name
        data.append((tensor.name, shape_str, tensor.n_elements, quantization_str))
        print(tensor_info_format.format(tensor.name, shape_str, size_str, quantization_str)) # noqa: NP100
    df = pd.DataFrame(data, columns=["name", "shape", "size", "quantization"])
    df_sum = df.groupby("quantization")[["size"]].sum().reset_index()

    quantization_bits = {
        "F64": 64.0,
        "I64": 64.0,
        "F32": 32.0,
        "I32": 32.0,
        "F16": 16.0,
        "BF16": 16.0,
        "I16": 16.0,
        "Q8_0": 8.0,
        "Q8_1": 8.0,
        "Q8_K": 8.0,
        "I8": 8.0,
        "Q6_K": 6.5625,
        "Q5_0": 5.0,
        "Q5_1": 5.0,
        "Q5_K": 5.5,
        "Q4_0": 4.0,
        "Q4_1": 4.0,
        "Q4_K": 4.5,
        "Q3_K": 3.4375,
        "Q2_K": 2.5625,
        "IQ4_NL": 4.0,
        "IQ4_XS": 4.25,
        "IQ3_S": 3.44,
        "IQ3_XXS": 3.06,
        "IQ2_XXS": 2.06,
        "IQ2_S": 2.5,
        "IQ2_XS": 2.31,
        "IQ1_S": 1.56,
        "IQ1_M": 1.75
    }

    df_sum["bits"] = df_sum["quantization"].apply(lambda x: quantization_bits[x])
    print(df_sum)
    total_size = (df_sum["bits"] * df_sum["size"]).sum() / 8 / 1024 / 1024
    print(f"Total Size:\n{total_size / 1024:.2f} GiB")
    print(f"Total Size:\n{total_size:.2f} MiB")

使用多个设备时，模型参数大致会被均匀分到各个设备上，第一个 GPU 会稍多，但是也可以用-ts参数调整配比。

对于先前测试的 32B 模型，llama.cpp的内存理论值为18,926.01 MiB，实际测得如下：

load_tensors:        CUDA0 model buffer size =  4844.72 MiB
load_tensors:        CUDA1 model buffer size =  4366.53 MiB
load_tensors:        CUDA2 model buffer size =  4366.53 MiB
load_tensors:        CUDA3 model buffer size =  4930.57 MiB
load_tensors:   CPU_Mapped model buffer size =   417.66 MiB

KV Cache 的估计

KV Cache 的内存占用可以通过以下公式估算：

$$ n_{layer} * n_{head\_kv} * (n_{embd\_head\_k} + n_{embd\_head\_v}) * (n_{ctx} + n_{predict}) * n_{bits} / 8 $$

$n_{layer}$：模型层数
$n_{head\_kv}$：每个头的维度
$n_{embd\_head\_k}$：键的维度
$n_{embd\_head\_v}$：值的维度
$n_{ctx}$：上下文长度，注意这是所有 query 共享的大小
$n_{predict}$：输出的最大长度
$n_{bits}$：量化位数

对于先前测试的 32B 模型，llama.cpp的内存理论值为17,024.00 MiB，实际测得如下：

llama_kv_cache_init:      CUDA0 KV buffer size =  4515.62 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  4250.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =  4250.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =  3984.38 MiB

根据模型大小选择设备

对于先前测试的 32B 模型，假设所有层全部放入 GPU，并且满足 131,072 的最大上下文长度（$n_{ctx} + n_{predict}$），我们可以估计出：

模型参数需要18.48 GiB
KV Cache 参数需要16 GiB
每个 GPU 扣除1.3 GiB用于激活和 CUDA runtime

于是便有了这几种配置：

GPU 型号	显存	数量	数量（32k 上下文）	数量（8k 上下文）
RTX A6000	48GiB	1	1	1
Tesla V100	32GiB	2	1	1
RTX 3090	24GiB	2	1	1
RTX 2080Ti 魔改版	22GiB	2	1	1
RTX 2080Ti	11GiB	4	3	2

需要指出的是，GPU 数量越多，吞吐量的损失就越大，因为 llama.cpp 并不能利用到 NVLink（甚至除了 V100 和 A6000，其他型号也并无 NVLink）。