LocalAI version:
185ab93 local build
Environment, CPU architecture, OS, and Version:
Intel i9-10850K CPU @ 3.60GHz, RTX 3090, Ubuntu 20.04
Linux Jiminthebox 5.15.0-113-generic 123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Describe the bug
When running llama.cpp through LocalAI, the generation eval time is significantly higher than when running the same model with pure llama.cpp (see Logs below).
To Reproduce
Pure llama.cpp
./llama-cli -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n -1 -c 0 -ngl 999 -t $(nproc) -p "Print all ascii code." --color --mlock --batch-size 512
LocalAI
./local-ai --debug=true # with CUDA build

name: llama3-8b-instruct-Q4_K_M
context_size: 8192
threads: 20
f16: true
mmap: true
mmlock: false
no_kv_offloading: false
low_vram: false
backend: llama-cpp
cuda: true
gpu_layers: 999
parameters:
  model: Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
rope_scaling: linear
stopwords:
- <|im_end|>
- <dummy32000>
- <|eot_id|>
- <|end_of_text|>
template:
  chat: |
    <|begin_of_text|>{{.Input }}
    <|start_header_id|>assistant<|end_header_id|>
  chat_message: |
    <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>
    {{ if .FunctionCall -}}
    Function call:
    {{ else if eq .RoleName "tool" -}}
    Function response:
    {{ end -}}
    {{ if .Content -}}
    {{.Content -}}
    {{ else if .FunctionCall -}}
    {{ toJson .FunctionCall -}}
    {{ end -}}
    <|eot_id|>
  completion: |
    {{.Input}}
  function: |
    <|start_header_id|>system<|end_header_id|>
    You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
    <tools>
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    </tools>
    Use the following pydantic model json schema for each tool call you will make:
    {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Function call:
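The request that produced the LocalAI timings below is not shown here; a chat completion call along these lines (the default port 8080 and reusing the prompt from the llama-cli run are assumptions) should exercise the same model:

# hypothetical example request; port and prompt are assumed, not taken from the original run
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b-instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Print all ascii code."}]
  }'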
Expected behavior
The generation eval time when using LocalAI should be comparable to that of running llama.cpp directly on the same hardware.
Logs
Pure llama.cpp
llama_print_timings: load time = 718.94 ms
llama_print_timings: sample time = 112.66 ms / 1516 runs ( 0.07 ms per token, 13457.01 tokens per second)
llama_print_timings: prompt eval time = 14.48 ms / 6 tokens ( 2.41 ms per token, 414.36 tokens per second)
llama_print_timings: eval time = 13200.77 ms / 1515 runs ( 8.71 ms per token, 114.77 tokens per second)
llama_print_timings: total time = 14213.27 ms / 1521 tokens
LocalAI
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time = 49.55 ms / 19 tokens ( 2.61 ms per token, 383.46 tokens per second)","slot_id":0,"task_id":0,"t_prompt_processing":49.549,"num_prompt_tokens_processed":19,"t_token":2.607842105263158,"n_tokens_second":383.45879836121816}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time = 23612.37 ms / 1392 runs ( 16.96 ms per token, 58.95 tokens per second)","slot_id":0,"task_id":0,"t_token_generation":23612.366,"n_decoded":1392,"t_token":16.962906609195404,"n_tokens_second":58.952160914327685}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":351,"message":" total time = 23661.92 ms","slot_id":0,"task_id":0,"t_prompt_processing":49.549,"t_token_generation":23612.366,"t_total":23661.915}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"update_slots","line":1596,"message":"slot released","slot_id":0,"task_id":0,"n_ctx":8192,"n_past":1410,"n_system_tokens":0,"n_cache_tokens":1411,"truncated":false}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"update_slots","line":1549,"message":"all slots are idle and system prompt is empty, clear the KV cache"}
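For a quick side-by-side: the per-token generation times reported above are 8.71 ms with pure llama.cpp versus 16.96 ms through LocalAI, i.e. generation is roughly 1.9x slower on the same machine:

# ratio of per-token generation time, LocalAI vs. pure llama.cpp (from the logs above)
awk 'BEGIN { printf "%.2f\n", 16.96 / 8.71 }'   # -> 1.95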
Additional context
Thank you for creating such a great project.
I am not sure how to achieve the same speed. Any assistance would be greatly appreciated.