nv-tensorRT-LLM

  1. Quick start with the official demo
  2. tensorRT-LLM
  3. Environment setup
    1. Python environment
    2. Docker environment
    3. Testing the Phi-4 service

Quick start with the official demo

Use TensorRT-LLM (trt-llm) to accelerate large-model inference.

Commonly used acceleration options today include vLLM, Ollama, and TensorRT-LLM.

tensorRT-LLM

Installing on Linux via pip — TensorRT-LLM

Chinese documentation:

在 Linux 上安装 — TensorRT-LLM

Environment setup

Python environment

The local machines have a Tesla T4 GPU and an A100 GPU.

Several problems came up on the Tesla T4: the card is too old, so some features in newer releases cannot be used, and the process core dumps.

So all subsequent experiments were run on the A100 machine.

Python 3.10 or Python 3.12 was used.

# Optional step: Only required for NVIDIA Blackwell GPUs and SBSA platform
uv pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

sudo apt-get -y install libopenmpi-dev
uv pip install --upgrade pip setuptools 
uv pip install --extra-index-url https://pypi.nvidia.com/ tensorrt-llm

The package is also published on PyPI, but the latest releases live on the pypi.nvidia.com index, hence the --extra-index-url flag.
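
After installing, a quick import check confirms that the wheel and the CUDA runtime are visible (a minimal sketch; the version strings will differ on your machine):

import torch
import tensorrt_llm

# Both imports should succeed if the wheel matches the installed CUDA toolkit.
print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("tensorrt_llm:", tensorrt_llm.__version__)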

An error came up when importing tensorrt_llm:

(.tensorRT-llm) (base) azureuser@torch-train-1:~/lishun_ws/nv-triton-inference-server/tensorRT-LLM$ python main.py
/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Traceback (most recent call last):
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/main.py", line 1, in <module>
    from tensorrt_llm import LLM, SamplingParams
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/__init__.py", line 36, in <module>
    import tensorrt_llm.functional as functional
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/functional.py", line 29, in <module>
    from . import graph_rewriting as gw
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/network.py", line 31, in <module>
    from tensorrt_llm.module import Module
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/_common.py", line 40, in <module>
    from .plugin import _load_plugin_lib
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/plugin/__init__.py", line 15, in <module>
    from .plugin import (TRT_LLM_PLUGIN_NAMESPACE, PluginConfig, _load_plugin_lib,
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/plugin/plugin.py", line 28, in <module>
    from .._ipc_utils import IpcMemory, can_access_peer
  File "/home/azureuser/lishun_ws/nv-triton-inference-server/tensorRT-LLM/.tensorRT-llm/lib/python3.12/site-packages/tensorrt_llm/_ipc_utils.py", line 20, in <module>
    from cuda import cuda, cudart
ImportError: cannot import name 'cuda' from 'cuda' (unknown location)

Solution

Install the cuda-python release that matches the CUDA toolkit. For example, this project uses cu128, so install cuda-python==12.8:
uv pip install cuda-python==12.8
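
To verify the fix, the exact import that failed in the traceback should now succeed (a minimal check):

# This is the import from tensorrt_llm/_ipc_utils.py that previously failed.
from cuda import cuda, cudart  # noqa: F401

print("cuda-python bindings imported successfully")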

The Python environment never stabilized: inference kept failing with all sorts of odd errors, so in the end it was easier to fall back to the Docker image.

Later troubleshooting showed that the GPU is simply too old: recent TensorRT-LLM releases no longer support inference on this card.

Inference demo

from tensorrt_llm import LLM, SamplingParams


def main():

    # Model could accept HF model name, a path to local HF model,
    # or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create the sampling params.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == '__main__':
    main()
  • Run

Because the GPU is relatively old, the following variables are supposed to be set to avoid the core dump; in actual testing it still core dumps.

export TRTLLM_DISABLE_FMHA=1
export TORCH_CUDA_ARCH_LIST=7.5
python main.py
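
To confirm that the card itself is the limitation, print its compute capability: the Tesla T4 is SM 7.5 (Turing), the A100 is SM 8.0, and recent TensorRT-LLM releases mainly target the newer architectures. A minimal sketch:

import torch

# List each visible GPU with its compute capability (SM version).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (SM {major}.{minor})")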

Docker environment

docker run --ipc host --gpus all -it nvcr.io/nvidia/tensorrt-llm/release

--ipc host: the container shares the host's IPC (inter-process communication) namespace, which is useful for deep-learning workloads that rely on fast inter-process communication.

Start the service

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
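
Loading the model takes a while before the server accepts requests. A small readiness probe (a sketch, assuming the default port 8000 and the OpenAI-compatible /v1/models route; adjust if your deployment differs):

import time

import requests

# Poll the OpenAI-compatible model listing until the server answers.
for _ in range(120):
    try:
        r = requests.get("http://localhost:8000/v1/models", timeout=2)
        if r.status_code == 200:
            print("server is ready:", r.json())
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(5)
else:
    print("server did not become ready in time")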

Client side

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 32,
        "temperature": 0
    }'

Example output

{"id":"chatcmpl3317bfedef0f42fcb6bac75d05ef60e9",
"object":"chat.completion",
"created":1758265112,
"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"choices":[
    {"index":0,"message":
        {"role":"assistant",
        "content":"New York is a city in the northeastern United States, located on the eastern coast of the state of New York.",
        "reasoning_content":null,
        "tool_calls":[]
        },
        "logprobs":null,
        "finish_reason":"stop",
        "stop_reason":null,
        "disaggregated_params":null
    }
        ],
    "usage":{"prompt_tokens":43,"total_tokens":69,"completion_tokens":26}
}
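
The same request can also be sent from Python. Since trtllm-serve exposes an OpenAI-compatible API, here is a sketch using the openai client package (assumed to be installed, with the server on the default port 8000):

from openai import OpenAI

# Point the OpenAI client at the local trtllm-serve endpoint; the API key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)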


A strange issue:

In the local (non-Docker) environment the command gets stuck at

rank 0 using MpiPoolSession to spawn MPI processes

but in the Docker environment it stalls briefly and then recovers.


Concurrency testing

  1. Using ApacheBench
echo '{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages":[{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
  "max_tokens": 32,
  "temperature": 0
}' > payload.json

ab -n 1000 -c 1000 -p payload.json -T "application/json" http://localhost:8000/v1/chat/completions

Results:

(screenshot of the ab output)

  2. Using wrk
sudo apt-get install -y wrk
  • post.lua
wrk.method = "POST"
wrk.body   = [[{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages":[{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
  "max_tokens": 32,
  "temperature": 0
}]]
wrk.headers["Content-Type"] = "application/json"

Run

wrk -t12 -c1000 -d30s -s post.lua http://localhost:8000/v1/chat/completions

Results:

Running 30s test @ http://localhost:8000/v1/chat/completions
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.93s    35.39ms   1.99s    75.00%
    Req/Sec    65.99     73.76   480.00     75.20%
  5865 requests in 30.06s, 3.59MB read
  Socket errors: connect 0, read 0, write 0, timeout 5677
Requests/sec:    195.09
Transfer/sec:    122.31KB
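
Besides ab and wrk, a short asyncio script gives finer control over the payload and per-request latency. A sketch using aiohttp (assumed installed), sending the same chat request at a configurable concurrency:

import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    "max_tokens": 32,
    "temperature": 0,
}


async def one_request(session):
    # Time a single POST to the chat completions endpoint.
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()
        return time.perf_counter() - start, resp.status


async def main(total=200, concurrency=50):
    # Limit concurrent connections via the connector and fire all requests at once.
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        results = await asyncio.gather(*(one_request(session) for _ in range(total)))
    latencies = [lat for lat, status in results if status == 200]
    print(f"ok: {len(latencies)}/{total}, "
          f"avg latency: {sum(latencies) / len(latencies):.2f}s, "
          f"max: {max(latencies):.2f}s")


if __name__ == "__main__":
    asyncio.run(main())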

The officially recommended benchmarking approach:

TensorRT-LLM Benchmarking — TensorRT-LLM

Testing the Phi-4 service

The list of currently supported models; pick the docs version that matches your installation:

Supported Models — TensorRT-LLM

trtllm-serve "microsoft/Phi-4"

(screenshot of the trtllm-serve output for microsoft/Phi-4)

github