
Add NVIDIA support to Inference Plugin#132388

Merged
DonalEvans merged 95 commits into elastic:main from Jan-Kazlouski-elastic:feature/nvidia-integration
Dec 4, 2025

Conversation

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic commented Aug 4, 2025

This PR creates a new NVIDIA inference provider integration that allows the following task types to be executed through the inference API with the nvidia provider:

  • text_embedding
  • completion (both streaming and non-streaming)
  • chat_completion
  • rerank

Changes were tested locally against the following models:

  • nvidia/llama-3.2-nv-embedqa-1b-v2 & nvidia/nvclip (text_embedding)
  • microsoft/phi-3-mini-128k-instruct (completion and chat_completion)
  • nv-rerank-qa-mistral-4b:1 (rerank)

Useful doc links:
https://docs.api.nvidia.com/nim/reference/llm-apis
https://docs.api.nvidia.com/nim/reference/retrieval-apis
https://docs.api.nvidia.com/nim/reference/nvidia-llama-3_2-nemoretriever-500m-rerank-v2-infer

model_id is mandatory because it determines which model handles the request.
url is optional because there are default values for the endpoint of each task type.
Most embeddings models require an input_type parameter. It can be provided in task_settings, along with the truncate parameter.

During chat_completion testing, the response to a function call did not return any tool_call usage info. If one of the models does return it, it will be handled by the existing OpenAI logic.
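The settings resolution described above can be sketched as follows. This is a hypothetical Python illustration (the actual implementation is Java inside the inference plugin); the default URLs are the ones shown in the examples below:

```python
# Per-task-type default endpoint URLs, taken from the examples in this PR description.
DEFAULT_URLS = {
    "text_embedding": "https://integrate.api.nvidia.com/v1/embeddings",
    "completion": "https://integrate.api.nvidia.com/v1/chat/completions",
    "chat_completion": "https://integrate.api.nvidia.com/v1/chat/completions",
    "rerank": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking",
}

def resolve_settings(task_type, service_settings, task_settings=None):
    # model_id is mandatory; url falls back to the task type's default.
    if "model_id" not in service_settings:
        raise ValueError("model_id is mandatory")
    resolved = dict(service_settings)
    resolved.setdefault("url", DEFAULT_URLS[task_type])
    return {"service_settings": resolved, "task_settings": task_settings or {}}
```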

EMBEDDINGS
Create embeddings endpoint (mandatory input_type)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "url": "https://integrate.api.nvidia.com/v1/embeddings",
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2"
    },
    "task_settings":{
        "input_type": "ingest"
    }
}
RS
{
    "inference_id": "nvidia-text-embedding-1",
    "task_type": "text_embedding",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "rate_limit": {
            "requests_per_minute": 3000
        },
        "dimensions": 2048,
        "similarity": "dot_product"
    },
    "task_settings": {
        "input_type": "ingest"
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Create embeddings endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2"
    },
    "task_settings": {
        "input_type": "ingest"
    }
}
RS
{
    "inference_id": "nvidia-text-embedding-1",
    "task_type": "text_embedding",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "rate_limit": {
            "requests_per_minute": 3000
        },
        "dimensions": 2048,
        "similarity": "dot_product"
    },
    "task_settings": {
        "input_type": "ingest"
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Perform embeddings (input_type is taken from task_settings on endpoint creation)
RQ
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ]
}
RS
{
    "text_embedding": [
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        },
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        }
    ]
}
Perform embeddings (with task_settings)
RQ
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ],
    "task_settings":{
        "input_type": "ingest",
        "truncate": "start"
    }
}
RS
{
    "text_embedding": [
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        },
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        }
    ]
}
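The override behavior shown here, where task_settings supplied at inference time take precedence over those stored on the endpoint, amounts to a simple map merge. This sketch is an assumption about the semantics, not code from the PR:

```python
def merge_task_settings(endpoint_settings, request_settings=None):
    # Start from the endpoint-level task_settings and let request-level
    # settings override matching keys.
    merged = dict(endpoint_settings)
    merged.update(request_settings or {})
    return merged
```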
Create embeddings endpoint (without task_settings with input_type)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nvidia/nvclip"
    }
}
RS
{
    "inference_id": "nvidia-text-embedding-2",
    "task_type": "text_embedding",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nvidia/nvclip",
        "rate_limit": {
            "requests_per_minute": 3000
        },
        "dimensions": 1024,
        "similarity": "dot_product"
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Perform embeddings (no task_settings)
RQ
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ]
}
RS
{
    "text_embedding": [
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        },
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        }
    ]
}
Create embeddings endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nvidia/nvclip"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/embeddings] for request from inference entity id [nvidia-text-embedding-3] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/embeddings] for request from inference entity id [nvidia-text-embedding-3] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
COMPLETION
Create completion endpoint
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-completion",
    "task_type": "completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform non-streaming completion
RQ
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS
{
    "completion": [
        {
            "result": "This line uses a simile to describe the sky over a seaport. The color of a sky depicted \"as if it's television tuned to a dead channel\" suggests a grey or dull and potentially gloomy atmosphere, with the imagery evoking the lifeless imitation of color that would occur when a TV is not receiving any signals—the sky is as uninteresting and lifeless as static on a screen. The scene described could represent an underwhelming day, perhaps signifying a drab or dismal mood or atmosphere. It could also suggest a sense of tranquility or stillness, as the lack of activity (\"dead channel\") is mirrored by the monotonous sky."
        }
    ]
}
Perform streaming completion
RQ
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS
event: message
data: {"completion":[{"delta":"This"}]}

event: message
data: {"completion":[{"delta":" line"},{"delta":" uses"},{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" sim"},{"delta":"ile"}]}

event: message
data: {"completion":[{"delta":" to"},{"delta":" describe"},{"delta":" the"}]}

event: message
data: {"completion":[{"delta":" sky"},{"delta":" over"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" se"}]}

event: message
data: {"completion":[{"delta":"ap"},{"delta":"ort"}]}

event: message
data: {"completion":[{"delta":"."},{"delta":" The"}]}

event: message
data: {"completion":[{"delta":" color"},{"delta":" of"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" sky"},{"delta":" dep"}]}

event: message
data: {"completion":[{"delta":"icted"}]}

event: message
data: {"completion":[{"delta":" \""},{"delta":"as"}]}

event: message
data: {"completion":[{"delta":" if"},{"delta":" it"}]}

event: message
data: {"completion":[{"delta":"'"}]}

event: message
data: {"completion":[{"delta":"s"},{"delta":" television"}]}

event: message
data: {"completion":[{"delta":" tun"},{"delta":"ed"}]}

event: message
data: {"completion":[{"delta":" to"},{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" dead"}]}

event: message
data: {"completion":[{"delta":" channel"},{"delta":"\""}]}

event: message
data: {"completion":[{"delta":" suggests"}]}

event: message
data: {"completion":[{"delta":" a"},{"delta":" grey"}]}

event: message
data: {"completion":[{"delta":" or"},{"delta":" d"}]}

event: message
data: {"completion":[{"delta":"ull"}]}

event: message
data: {"completion":[{"delta":" and"},{"delta":" potentially"},{"delta":" glo"},{"delta":"omy"}]}

event: message
data: {"completion":[{"delta":" atmosphere"},{"delta":","}]}

event: message
data: {"completion":[{"delta":" with"}]}

event: message
data: {"completion":[{"delta":" the"},{"delta":" imag"}]}

event: message
data: {"completion":[{"delta":"ery"}]}

event: message
data: {"completion":[{"delta":" ev"}]}

event: message
data: {"completion":[{"delta":"oking"},{"delta":" the"},{"delta":" lif"}]}

event: message
data: {"completion":[{"delta":"eless"}]}

event: message
data: {"completion":[{"delta":" im"},{"delta":"itation"}]}

event: message
data: {"completion":[{"delta":" of"},{"delta":" color"}]}

event: message
data: {"completion":[{"delta":" that"},{"delta":" would"}]}

event: message
data: {"completion":[{"delta":" occur"}]}

event: message
data: {"completion":[{"delta":" when"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" TV"}]}

event: message
data: {"completion":[{"delta":" is"}]}

event: message
data: {"completion":[{"delta":" not"}]}

event: message
data: {"completion":[{"delta":" receiving"}]}

event: message
data: {"completion":[{"delta":" any"}]}

event: message
data: {"completion":[{"delta":" signals"}]}

event: message
data: {"completion":[{"delta":"—"}]}

event: message
data: {"completion":[{"delta":"the"}]}

event: message
data: {"completion":[{"delta":" sky"}]}

event: message
data: {"completion":[{"delta":" is"}]}

event: message
data: {"completion":[{"delta":" as"}]}

event: message
data: {"completion":[{"delta":" un"}]}

event: message
data: {"completion":[{"delta":"inter"}]}

event: message
data: {"completion":[{"delta":"est"}]}

event: message
data: {"completion":[{"delta":"ing"}]}

event: message
data: {"completion":[{"delta":" and"}]}

event: message
data: {"completion":[{"delta":" lif"}]}

event: message
data: {"completion":[{"delta":"eless"}]}

event: message
data: {"completion":[{"delta":" as"}]}

event: message
data: {"completion":[{"delta":" static"}]}

event: message
data: {"completion":[{"delta":" on"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" screen"}]}

event: message
data: {"completion":[{"delta":"."}]}

event: message
data: {"completion":[{"delta":" The"}]}

event: message
data: {"completion":[{"delta":" scene"}]}

event: message
data: {"completion":[{"delta":" described"}]}

event: message
data: {"completion":[{"delta":" could"}]}

event: message
data: {"completion":[{"delta":" represent"}]}

event: message
data: {"completion":[{"delta":" an"}]}

event: message
data: {"completion":[{"delta":" under"}]}

event: message
data: {"completion":[{"delta":"wh"}]}

event: message
data: {"completion":[{"delta":"el"}]}

event: message
data: {"completion":[{"delta":"ming"}]}

event: message
data: {"completion":[{"delta":" day"}]}

event: message
data: {"completion":[{"delta":","}]}

event: message
data: {"completion":[{"delta":" perhaps"}]}

event: message
data: {"completion":[{"delta":" sign"}]}

event: message
data: {"completion":[{"delta":"ifying"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" d"}]}

event: message
data: {"completion":[{"delta":"rab"}]}

event: message
data: {"completion":[{"delta":" or"}]}

event: message
data: {"completion":[{"delta":" dis"}]}

event: message
data: {"completion":[{"delta":"mal"}]}

event: message
data: {"completion":[{"delta":" m"}]}

event: message
data: {"completion":[{"delta":"ood"}]}

event: message
data: {"completion":[{"delta":" or"}]}

event: message
data: {"completion":[{"delta":" atmosphere"}]}

event: message
data: {"completion":[{"delta":"."}]}

event: message
data: {"completion":[{"delta":" It"}]}

event: message
data: {"completion":[{"delta":" could"}]}

event: message
data: {"completion":[{"delta":" also"}]}

event: message
data: {"completion":[{"delta":" suggest"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" sense"}]}

event: message
data: {"completion":[{"delta":" of"}]}

event: message
data: {"completion":[{"delta":" tran"}]}

event: message
data: {"completion":[{"delta":"qu"}]}

event: message
data: {"completion":[{"delta":"ility"}]}

event: message
data: {"completion":[{"delta":" or"}]}

event: message
data: {"completion":[{"delta":" still"}]}

event: message
data: {"completion":[{"delta":"ness"}]}

event: message
data: {"completion":[{"delta":","}]}

event: message
data: {"completion":[{"delta":" as"}]}

event: message
data: {"completion":[{"delta":" the"}]}

event: message
data: {"completion":[{"delta":" lack"}]}

event: message
data: {"completion":[{"delta":" of"}]}

event: message
data: {"completion":[{"delta":" activity"}]}

event: message
data: {"completion":[{"delta":" (\""}]}

event: message
data: {"completion":[{"delta":"de"}]}

event: message
data: {"completion":[{"delta":"ad"}]}

event: message
data: {"completion":[{"delta":" channel"}]}

event: message
data: {"completion":[{"delta":"\")"}]}

event: message
data: {"completion":[{"delta":" is"}]}

event: message
data: {"completion":[{"delta":" mirror"}]}

event: message
data: {"completion":[{"delta":"ed"}]}

event: message
data: {"completion":[{"delta":" by"}]}

event: message
data: {"completion":[{"delta":" the"}]}

event: message
data: {"completion":[{"delta":" monot"}]}

event: message
data: {"completion":[{"delta":"on"}]}

event: message
data: {"completion":[{"delta":"ous"}]}

event: message
data: {"completion":[{"delta":" sky"}]}

event: message
data: {"completion":[{"delta":"."}]}

event: message
data: [DONE]
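For reference, the streamed deltas above can be reassembled on the client side with a small SSE parser. This is a hypothetical helper, not part of the PR:

```python
import json

def concat_deltas(sse_body):
    """Concatenate completion deltas from an SSE response body until [DONE]."""
    parts = []
    for line in sse_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip "event: message" lines and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        for chunk in json.loads(payload).get("completion", []):
            parts.append(chunk.get("delta", ""))
    return "".join(parts)
```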

Create completion endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions23123] for request from inference entity id [nvidia-completion] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions23123] for request from inference entity id [nvidia-completion] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
Create completion endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-completion-2",
    "task_type": "completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
CHAT COMPLETION
Create chat completion endpoint
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-chat-completion",
    "task_type": "chat_completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform basic chat completion
RQ
{
    "model": "microsoft/phi-3-mini-128k-instruct",
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_completion_tokens": 10
}
RS
event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"role":"assistant"},"index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk"}

event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"content":"Deep"},"index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk"}

event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"content":","},"index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk"}

event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"content":" which"},"finish_reason":"length","index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":8,"total_tokens":18}}

event: message
data: [DONE]


Create chat completion endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions123",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "unified_chat_completion_exception",
                "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions123] for request from inference entity id [nvidia-chat-completion] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "unified_chat_completion_exception",
            "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions123] for request from inference entity id [nvidia-chat-completion] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
Create chat completion endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-chat-completion-3",
    "task_type": "chat_completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
RERANK
Create rerank endpoint
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking",
        "api_key": "api_key",
        "model_id": "nv-rerank-qa-mistral-4b:1"
    }
}
RS
{
    "inference_id": "nvidia-rerank",
    "task_type": "rerank",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nv-rerank-qa-mistral-4b:1",
        "url": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform rerank
RQ
{
    "input": [
        "luke",
        "like",
        "leia",
        "chewy",
        "r2d2",
        "star",
        "wars"
    ],
    "query": "star wars main character"
}
RS
{
    "rerank": [
        {
            "index": 0,
            "relevance_score": -7.8710938
        },
        {
            "index": 5,
            "relevance_score": -8.2578125
        },
        {
            "index": 2,
            "relevance_score": -9.5390625
        },
        {
            "index": 4,
            "relevance_score": -11.2578125
        },
        {
            "index": 6,
            "relevance_score": -11.53125
        },
        {
            "index": 3,
            "relevance_score": -12.3671875
        },
        {
            "index": 1,
            "relevance_score": -12.46875
        }
    ]
}
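Since the rerank response refers to documents only by index, a caller typically joins the scores back to the original inputs. A sketch with a hypothetical helper, using the data above:

```python
def attach_documents(inputs, rerank_results):
    # relevance_score values here are raw model scores and can be negative;
    # only the relative ordering matters. Results arrive already sorted.
    return [(inputs[r["index"]], r["relevance_score"]) for r in rerank_results]

docs = ["luke", "like", "leia", "chewy", "r2d2", "star", "wars"]
results = [
    {"index": 0, "relevance_score": -7.8710938},
    {"index": 5, "relevance_score": -8.2578125},
    {"index": 2, "relevance_score": -9.5390625},
]
ranked = attach_documents(docs, results)
```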
Create rerank endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking123",
        "api_key": "api_key",
        "model_id": "nv-rerank-qa-mistral-4b:1"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking123] for request from inference entity id [nvidia-rerank] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking123] for request from inference entity id [nvidia-rerank] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
Create rerank endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nv-rerank-qa-mistral-4b:1"
    }
}
RS
{
    "inference_id": "nvidia-rerank-1",
    "task_type": "rerank",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nv-rerank-qa-mistral-4b:1",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform rerank (With default URL)
RQ
{
    "input": [
        "luke",
        "like",
        "leia",
        "chewy",
        "r2d2",
        "star",
        "wars"
    ],
    "query": "star wars main character"
}
RS
{
    "rerank": [
        {
            "index": 0,
            "relevance_score": -7.8710938
        },
        {
            "index": 5,
            "relevance_score": -8.2578125
        },
        {
            "index": 2,
            "relevance_score": -9.5390625
        },
        {
            "index": 4,
            "relevance_score": -11.2578125
        },
        {
            "index": 6,
            "relevance_score": -11.53125
        },
        {
            "index": 3,
            "relevance_score": -12.3671875
        },
        {
            "index": 1,
            "relevance_score": -12.46875
        }
    ]
}
@elasticsearchmachine added the needs:triage (Requires assignment of a team area label), v9.2.0, and external-contributor (Pull request authored by a developer outside the Elasticsearch team) labels on Aug 4, 2025
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic marked this pull request as draft August 4, 2025 10:01
@gareth-ellis added the :ml (Machine learning), Team:ML (Meta label for the ML team), and >enhancement labels on Aug 15, 2025
@pxsalehi removed the needs:triage (Requires assignment of a team area label) label on Sep 24, 2025
@Jan-Kazlouski-elastic (Contributor Author)

The changes proposed in #132388 (comment) are done, with improvements and reuse of existing code. Thanks.

@Jan-Kazlouski-elastic (Contributor Author)

Hi @DonalEvans
Your comments are addressed and the PR can be reviewed again.

@DonalEvans (Contributor)

> I tested with a different model and got the same error message, meaning those messages are shared across models. Fixed. Now we recognize 413 responses, and 400 responses with "exceeds maximum allowed token size" in the error message.

The AzureOpenAI, Llama and OpenShiftAI integrations all also extend OpenAiResponseHandler for handling text embedding responses, so I wonder if there is a similar issue with those integrations, or with the OpenAI integration itself, for text embeddings. I don't have accounts with any of those providers, so if you do have access to them, would you be able to check, please?

@DonalEvans (Contributor) left a comment:

Sorry for the late request for this change, but would it be possible to convert the unit tests added in this PR that are extending AbstractWireSerializingTestCase to extend AbstractBWCWireSerializationTestCase instead? While there are no backwards compatibility concerns with the classes now since they're brand new, it's good to have the tests set up to catch any that might be introduced in future. The change should be simple, just implementing the mutateInstanceForVersion() method to return the instance unchanged in each test class.

Comment on lines 35 to 37
protected void checkForFailureStatusCode(Request request, HttpResult result) throws RetryException {
super.checkForFailureStatusCode(request, result);
}
Contributor:

Does this method need to be overridden? It just calls the super method without any additional logic. I think it would be better to make the super method public instead of overriding the method to give access to it. It's also an option to move the OpenAiResponseHandler class into the org.elasticsearch.xpack.inference.common package, similar to what was done with the Truncation enum, since it's used by 9 different integrations at this point.

Contributor Author:

OpenAiResponseHandler is too widely used for me to be comfortable moving it as part of this integration; I'd rather do that in a separate PR. I changed the accessibility of checkForFailureStatusCode instead.

"param": null,
"code": null
}
"error": "Input length 18432 exceeds maximum allowed token size 8192"
Contributor:

For this specific test case, where we're checking that a 413 status leads to truncating the input, it might be better to use an error message that doesn't match the one we check for, to confirm that it's the status code that causes the truncation rather than the error message.

Contributor Author:

Fixed.

"param": null,
"code": null
}
"error": "Input length 18432 exceeds maximum allowed token size 8192"
Contributor:

We should be testing both that text embedding requests get truncated when they see this message and that completion requests get truncated when they get a 413 status or see the "Please reduce your prompt; or completion length." message. Could you add tests for the latter two cases please?

Contributor Author:

The issue is that we don't perform truncation logic for any chat completion requests, for any integration, including Nvidia.

If a 413 error, or a 400 with the matching message, is received, the request is retried three times without changing the input; if the same errors come back, the error is returned to the customer.
Truncating completion requests seems off to me, because it would change the meaning of the input, but retrying when we know the model cannot return a successful response is not good either.

I added logic that throws the error right away, without retries, for completions when a ContentTooLarge error is received. That seems to make more sense.
Let me know your opinion on that.
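The detection rule discussed in this thread (a 413 status, or a 400 whose message mentions the token limit) can be sketched as follows. The real implementation lives in the Java response handlers, so this Python is only illustrative:

```python
# Marker string quoted in this review thread; treated here as an assumption
# about the provider's wording.
TOKEN_LIMIT_MARKER = "exceeds maximum allowed token size"

def is_content_too_large(status_code, error_message):
    # A 413 always means the payload was too large; a 400 counts only when
    # the provider's error message mentions the token limit.
    if status_code == 413:
        return True
    return status_code == 400 and TOKEN_LIMIT_MARKER in (error_message or "")
```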

new NvidiaEmbeddingsServiceSettings(
MODEL_VALUE,
ServiceUtils.createOptionalUri(null),
createOptionalUri(null),
Contributor:

Rather than using null as the URI here, it might be better to pass in the expected default URI. We end up asserting the same thing in either case, but it makes the test clearer in terms of what the expected value actually is instead of having it hidden inside the logic in the constructor.

Contributor Author:

Good thinking. Changed to the default value.

Comment on lines 30 to 31
private static final InputType INPUT_TYPE_EXPEDIA_VALUE = InputType.INGEST;
private static final Truncation TRUNCATE_EXPEDIA_VALUE = Truncation.START;
Contributor:

Some more "EXPEDIA" instead of "ELASTIC" here.

Contributor Author:

Thanks. Fixed now.

@Jan-Kazlouski-elastic (Contributor Author)

> Sorry for the late request for this change, but would it be possible to convert the unit tests added in this PR that are extending AbstractWireSerializingTestCase to extend AbstractBWCWireSerializationTestCase instead? While there are no backwards compatibility concerns with the classes now since they're brand new, it's good to have the tests set up to catch any that might be introduced in future. The change should be simple, just implementing the mutateInstanceForVersion() method to return the instance unchanged in each test class.

Done. The embeddings tests now extend AbstractBWCWireSerializationTestCase instead of AbstractWireSerializingTestCase.

@Jan-Kazlouski-elastic (Contributor Author)

> I tested with a different model and got the same error message, meaning those messages are shared across models. Fixed. Now we recognize 413 responses, and 400 responses with "exceeds maximum allowed token size" in the error message.

> The AzureOpenAI, Llama and OpenShiftAI integrations all also extend OpenAiResponseHandler for handling text embedding responses, so I wonder if there is a similar issue with those integrations, or with the OpenAI integration itself, for text embeddings. I don't have accounts with any of those providers, so if you do have access to them, would you be able to check, please?

Unfortunately, we won't be able to test OpenShift AI because the environment was taken away from us and is now used for other purposes. I will check the other providers.

@Jan-Kazlouski-elastic (Contributor Author)

Hello @DonalEvans
Your comments are addressed. The PR is ready to be reviewed once more.

@DonalEvans (Contributor)

The failures in the serverless checks are due to being out of date with the base branch; merging main should resolve them. Once all the checks are passing, I'll merge this PR.

@Jan-Kazlouski-elastic (Contributor Author)

@DonalEvans I added the specification and merged the main branch.

@DonalEvans DonalEvans merged commit 539645d into elastic:main Dec 4, 2025
36 checks passed

Labels

>enhancement · external-contributor (Pull request authored by a developer outside the Elasticsearch team) · :ml (Machine learning) · Team:ML (Meta label for the ML team) · v9.3.0

5 participants