
generation eval time is slower than llama-cli in pure llama.cpp #2780

Description

@JimHeo

LocalAI version:

185ab93 local build

Environment, CPU architecture, OS, and Version:

Intel i9-10850K CPU @ 3.60GHz, RTX 3090, Ubuntu 20.04
Linux Jiminthebox 5.15.0-113-generic 123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug

When running the llama.cpp backend through LocalAI, generation eval time is significantly slower than running the same model with llama-cli in pure llama.cpp: roughly 59 tokens per second versus 115 tokens per second on the same hardware (see the logs below).

To Reproduce

Pure llama.cpp

./llama-cli -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n -1 -c 0 -ngl 999 -t $(nproc) -p "Print all ascii code." --color --mlock --batch-size 512

LocalAI

./local-ai --debug=true # with CUDA build

Model configuration:
name: llama3-8b-instruct-Q4_K_M
context_size: 8192
threads: 20
f16: true
mmap: true
mmlock: false
no_kv_offloading: false
low_vram: false
backend: llama-cpp
cuda: true
gpu_layers: 999
parameters:
  model: Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
rope_scaling: linear
stopwords:
- <|im_end|>
- <dummy32000>
- <|eot_id|>
- <|end_of_text|>
template:
  chat: |
    <|begin_of_text|>{{.Input }}
    <|start_header_id|>assistant<|end_header_id|>
  chat_message: |
    <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>

    {{ if .FunctionCall -}}
    Function call:
    {{ else if eq .RoleName "tool" -}}
    Function response:
    {{ end -}}
    {{ if .Content -}}
    {{.Content -}}
    {{ else if .FunctionCall -}}
    {{ toJson .FunctionCall -}}
    {{ end -}}
    <|eot_id|>
  completion: |
    {{.Input}}
  function: |
    <|start_header_id|>system<|end_header_id|>

    You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
    <tools>
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    </tools>
    Use the following pydantic model json schema for each tool call you will make:
    {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Function call:
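
For reference, generation through LocalAI goes over its OpenAI-compatible chat endpoint; a request along the following lines (assuming the default port 8080 and the model name from the config above) should reproduce the timings in the logs:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3-8b-instruct-Q4_K_M",
        "messages": [{"role": "user", "content": "Print all ascii code."}]
      }'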

Expected behavior

Generation eval time when using LocalAI should be comparable to running llama.cpp directly, given that the hardware is identical.

Logs

Pure llama.cpp

llama_print_timings:        load time =     718.94 ms
llama_print_timings:      sample time =     112.66 ms /  1516 runs   (    0.07 ms per token, 13457.01 tokens per second)
llama_print_timings: prompt eval time =      14.48 ms /     6 tokens (    2.41 ms per token,   414.36 tokens per second)
llama_print_timings:        eval time =   13200.77 ms /  1515 runs   (    8.71 ms per token,   114.77 tokens per second)
llama_print_timings:       total time =   14213.27 ms /  1521 tokens

LocalAI

11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time     =      49.55 ms /    19 tokens (    2.61 ms per token,   383.46 tokens per second)","slot_id":0,"task_id":0,"t_prompt_processing":49.549,"num_prompt_tokens_processed":19,"t_token":2.607842105263158,"n_tokens_second":383.45879836121816}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time =   23612.37 ms /  1392 runs   (   16.96 ms per token,    58.95 tokens per second)","slot_id":0,"task_id":0,"t_token_generation":23612.366,"n_decoded":1392,"t_token":16.962906609195404,"n_tokens_second":58.952160914327685}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":351,"message":"          total time =   23661.92 ms","slot_id":0,"task_id":0,"t_prompt_processing":49.549,"t_token_generation":23612.366,"t_total":23661.915}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"update_slots","line":1596,"message":"slot released","slot_id":0,"task_id":0,"n_ctx":8192,"n_past":1410,"n_system_tokens":0,"n_cache_tokens":1411,"truncated":false}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"update_slots","line":1549,"message":"all slots are idle and system prompt is empty, clear the KV cache"}

Additional context

Thank you for creating such a great project.
I am not sure how to achieve the same speed. Any assistance would be greatly appreciated.
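
One variable that differs between the two runs is memory handling: the llama-cli invocation uses --mlock, while the LocalAI config sets mmap: true and mmlock: false. Whether this explains the gap is unclear, but it can be ruled out with a minimal tweak using only fields already present in the config above:

mmap: false    # the llama-cli run locked the model in RAM instead of relying on mmap
mmlock: true   # mirrors --mlock in the llama-cli command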
