Usage notes

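The CPU and the GPU are separate hardware units. Ordinary code runs on the CPU by default, and parameters and tensors are stored in main memory (the CPU side) by default.
To run on the GPU, the model first has to be deployed there: the model and its parameters must be stored in video memory (VRAM), after which operations on them execute on the GPU. The two things to move are the following: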

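To place the model on the GPU, set the device and dtype when loading it: .from_pretrained(model_name, device_map="cuda", torch_dtype=torch.float16). In float16 (2 bytes per parameter) the weights of an 8B model take about 16 GB of VRAM; in float32 they would take about 32 GB. The card used here is an RTX 3090 with 24 GB of VRAM, so it can only hold the model at reduced precision, which is why half precision (torch_dtype=torch.float16) is specified at load time.
Besides the model, the tensors that are finally fed into it must also be copied from main memory (CPU) to VRAM (GPU): tokenizer(prompt, return_tensors="pt").to("cuda")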
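The 16 GB figure above can be checked with simple arithmetic. The following is a rough sketch, not a measurement: it counts only the weights (activations and the KV cache need extra room), it rounds the parameter count to 8e9, and the helper names weight_memory_gb and fits_in_vram are made up for this illustration.

```python
# Rough estimate of the VRAM needed for model weights at different precisions.
# Assumptions: 8e9 parameters (rounded), decimal gigabytes, weights only.

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params, dtype):
    """Return the approximate size of the weights in GB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_in_vram(num_params, dtype, vram_gb):
    """Return True if the weights alone fit in the given amount of VRAM."""
    return weight_memory_gb(num_params, dtype) <= vram_gb


NUM_PARAMS = 8e9   # assumption: "8B" taken as 8 billion parameters
VRAM_GB = 24       # the 24 GB card mentioned in the text

for dtype in BYTES_PER_PARAM:
    need = weight_memory_gb(NUM_PARAMS, dtype)
    ok = fits_in_vram(NUM_PARAMS, dtype, VRAM_GB)
    print(f"{dtype}: ~{need:.0f} GB of weights, fits in {VRAM_GB} GB: {ok}")
```

Under these assumptions only the float16 weights fit on a 24 GB card, which matches the choice of torch_dtype=torch.float16 made here.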

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model
model_name = "/data1/tf_data/model/llama3/Meta-Llama-3-8B-Instruct"
# device_map="cuda" places the model on the GPU; float16 halves the memory footprint
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

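# Text generation function
def generate_response(prompt):
    # Tokenize the input and copy the resulting tensors to the GPU
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate the reply. max_length caps the total number of tokens
    # (prompt included); num_return_sequences is the number of replies.
    outputs = model.generate(**inputs, max_length=200, num_return_sequences=1)

    # Decode the output tokens back into text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response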

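# Wrapper that turns the user input into a chat-formatted prompt
def preprocess_input(user_text):
    messages = [
        {"role": "system", "content": "hello,You are a helpful human assistant!"},
        {"role": "user", "content": user_text},
    ]
    # tokenize must be False so that a string is returned (True returns token IDs,
    # which generate_response would then tokenize again). Otherwise the reply differs
    # wildly from what is expected: in testing, the model answered with pinyin and
    # an English translation of the sentence ┑(￣Д ￣)┍.
    text = tokenizer.apply_chat_template(messages,
            tokenize=False,
            add_generation_prompt=True)
    return text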

# Example conversation
user_input = "介绍一下中国"  # "Tell me about China"
response = generate_response(preprocess_input(user_input))
print("Model reply:", response)

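With the GPU, model loading takes somewhat longer (CPU: 2 s vs GPU: 6 s), but inference time drops sharply (CPU: 200 s vs GPU: 5 s).
For the same task, using the GPU cut the run time from about 5 min to 5.1 s (including model deployment and loading).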
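The post does not show how these timings were taken. One possible way to collect such numbers is sketched below; the timed helper and the fake_generate stand-in are hypothetical and exist only so that the snippet runs without a GPU or a model.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_generate(prompt):
    # Hypothetical stand-in for generate_response so the sketch is self-contained.
    time.sleep(0.05)
    return "echo: " + prompt


reply, seconds = timed(fake_generate, "hello")
print(f"reply={reply!r}, took {seconds:.2f} s")
```

In the real script, the same helper could wrap the from_pretrained call and generate_response separately to obtain a load time and an inference time. CUDA work is launched asynchronously, so when timing individual GPU operations it may be necessary to call torch.cuda.synchronize() before reading the clock.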