
Add NVIDIA support to Inference Plugin#132388

Merged
DonalEvans merged 95 commits into elastic:main from Jan-Kazlouski-elastic:feature/nvidia-integration
Dec 4, 2025

Conversation

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic commented Aug 4, 2025

This PR creates a new NVIDIA inference provider integration that allows the following task types to be executed through the inference API with the nvidia provider:

  • text_embedding
  • completion (both streaming and non-streaming)
  • chat_completion
  • rerank

Changes were tested locally against the following models:

  • nvidia/llama-3.2-nv-embedqa-1b-v2 & nvidia/nvclip (text_embedding)
  • microsoft/phi-3-mini-128k-instruct (completion and chat_completion)
  • nv-rerank-qa-mistral-4b:1 (rerank)

Useful doc links:
https://docs.api.nvidia.com/nim/reference/llm-apis
https://docs.api.nvidia.com/nim/reference/retrieval-apis
https://docs.api.nvidia.com/nim/reference/nvidia-llama-3_2-nemoretriever-500m-rerank-v2-infer

model_id is mandatory because it determines which model handles the request.
url is optional because there are default values for the endpoint of each task type.
Most embeddings models require an input_type parameter. It can be provided in task_settings, along with the truncate parameter.

During chat_completion testing, the response to a function call did not return any tool_call usage info. If one of the models does return it, it will be handled by the existing OpenAI logic.
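The settings resolution described above can be sketched as follows. This is a hypothetical Python illustration (the actual implementation is Java inside the inference plugin); the default URLs are the ones shown in the examples below:

```python
# Per-task-type default endpoint URLs, taken from the examples in this PR description.
DEFAULT_URLS = {
    "text_embedding": "https://integrate.api.nvidia.com/v1/embeddings",
    "completion": "https://integrate.api.nvidia.com/v1/chat/completions",
    "chat_completion": "https://integrate.api.nvidia.com/v1/chat/completions",
    "rerank": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking",
}

def resolve_settings(task_type, service_settings, task_settings=None):
    # model_id is mandatory; url falls back to the task type's default.
    if "model_id" not in service_settings:
        raise ValueError("model_id is mandatory")
    resolved = dict(service_settings)
    resolved.setdefault("url", DEFAULT_URLS[task_type])
    return {"service_settings": resolved, "task_settings": task_settings or {}}
```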

EMBEDDINGS
Create embeddings endpoint (mandatory input_type)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "url": "https://integrate.api.nvidia.com/v1/embeddings",
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2"
    },
    "task_settings":{
        "input_type": "ingest"
    }
}
RS
{
    "inference_id": "nvidia-text-embedding-1",
    "task_type": "text_embedding",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "rate_limit": {
            "requests_per_minute": 3000
        },
        "dimensions": 2048,
        "similarity": "dot_product"
    },
    "task_settings": {
        "input_type": "ingest"
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Create embeddings endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2"
    },
    "task_settings": {
        "input_type": "ingest"
    }
}
RS
{
    "inference_id": "nvidia-text-embedding-1",
    "task_type": "text_embedding",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "rate_limit": {
            "requests_per_minute": 3000
        },
        "dimensions": 2048,
        "similarity": "dot_product"
    },
    "task_settings": {
        "input_type": "ingest"
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Perform embeddings (input_type is taken from task_settings on endpoint creation)
RQ
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ]
}
RS
{
    "text_embedding": [
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        },
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        }
    ]
}
Perform embeddings (with task_settings)
RQ
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ],
    "task_settings":{
        "input_type": "ingest",
        "truncate": "start"
    }
}
RS
{
    "text_embedding": [
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        },
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        }
    ]
}
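The override behavior shown here, where task_settings supplied at inference time take precedence over those stored on the endpoint, amounts to a simple map merge. This sketch is an assumption about the semantics, not code from the PR:

```python
def merge_task_settings(endpoint_settings, request_settings=None):
    # Start from the endpoint-level task_settings and let request-level
    # settings override matching keys.
    merged = dict(endpoint_settings)
    merged.update(request_settings or {})
    return merged
```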
Create embeddings endpoint (without task_settings with input_type)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nvidia/nvclip"
    }
}
RS
{
    "inference_id": "nvidia-text-embedding-2",
    "task_type": "text_embedding",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nvidia/nvclip",
        "rate_limit": {
            "requests_per_minute": 3000
        },
        "dimensions": 1024,
        "similarity": "dot_product"
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Perform embeddings (no task_settings)
RQ
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ]
}
RS
{
    "text_embedding": [
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        },
        {
            "embedding": [
                -0.014701843,
                1.5926361E-4
            ]
        }
    ]
}
Create embeddings endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nvidia/nvclip"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/embeddings] for request from inference entity id [nvidia-text-embedding-3] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/embeddings] for request from inference entity id [nvidia-text-embedding-3] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
COMPLETION
Create completion endpoint
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-completion",
    "task_type": "completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform non-streaming completion
RQ
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS
{
    "completion": [
        {
            "result": "This line uses a simile to describe the sky over a seaport. The color of a sky depicted \"as if it's television tuned to a dead channel\" suggests a grey or dull and potentially gloomy atmosphere, with the imagery evoking the lifeless imitation of color that would occur when a TV is not receiving any signals—the sky is as uninteresting and lifeless as static on a screen. The scene described could represent an underwhelming day, perhaps signifying a drab or dismal mood or atmosphere. It could also suggest a sense of tranquility or stillness, as the lack of activity (\"dead channel\") is mirrored by the monotonous sky."
        }
    ]
}
Perform streaming completion
RQ
{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}
RS
event: message
data: {"completion":[{"delta":"This"}]}

event: message
data: {"completion":[{"delta":" line"},{"delta":" uses"},{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" sim"},{"delta":"ile"}]}

event: message
data: {"completion":[{"delta":" to"},{"delta":" describe"},{"delta":" the"}]}

event: message
data: {"completion":[{"delta":" sky"},{"delta":" over"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" se"}]}

event: message
data: {"completion":[{"delta":"ap"},{"delta":"ort"}]}

event: message
data: {"completion":[{"delta":"."},{"delta":" The"}]}

event: message
data: {"completion":[{"delta":" color"},{"delta":" of"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" sky"},{"delta":" dep"}]}

event: message
data: {"completion":[{"delta":"icted"}]}

event: message
data: {"completion":[{"delta":" \""},{"delta":"as"}]}

event: message
data: {"completion":[{"delta":" if"},{"delta":" it"}]}

event: message
data: {"completion":[{"delta":"'"}]}

event: message
data: {"completion":[{"delta":"s"},{"delta":" television"}]}

event: message
data: {"completion":[{"delta":" tun"},{"delta":"ed"}]}

event: message
data: {"completion":[{"delta":" to"},{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" dead"}]}

event: message
data: {"completion":[{"delta":" channel"},{"delta":"\""}]}

event: message
data: {"completion":[{"delta":" suggests"}]}

event: message
data: {"completion":[{"delta":" a"},{"delta":" grey"}]}

event: message
data: {"completion":[{"delta":" or"},{"delta":" d"}]}

event: message
data: {"completion":[{"delta":"ull"}]}

event: message
data: {"completion":[{"delta":" and"},{"delta":" potentially"},{"delta":" glo"},{"delta":"omy"}]}

event: message
data: {"completion":[{"delta":" atmosphere"},{"delta":","}]}

event: message
data: {"completion":[{"delta":" with"}]}

event: message
data: {"completion":[{"delta":" the"},{"delta":" imag"}]}

event: message
data: {"completion":[{"delta":"ery"}]}

event: message
data: {"completion":[{"delta":" ev"}]}

event: message
data: {"completion":[{"delta":"oking"},{"delta":" the"},{"delta":" lif"}]}

event: message
data: {"completion":[{"delta":"eless"}]}

event: message
data: {"completion":[{"delta":" im"},{"delta":"itation"}]}

event: message
data: {"completion":[{"delta":" of"},{"delta":" color"}]}

event: message
data: {"completion":[{"delta":" that"},{"delta":" would"}]}

event: message
data: {"completion":[{"delta":" occur"}]}

event: message
data: {"completion":[{"delta":" when"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" TV"}]}

event: message
data: {"completion":[{"delta":" is"}]}

event: message
data: {"completion":[{"delta":" not"}]}

event: message
data: {"completion":[{"delta":" receiving"}]}

event: message
data: {"completion":[{"delta":" any"}]}

event: message
data: {"completion":[{"delta":" signals"}]}

event: message
data: {"completion":[{"delta":"—"}]}

event: message
data: {"completion":[{"delta":"the"}]}

event: message
data: {"completion":[{"delta":" sky"}]}

event: message
data: {"completion":[{"delta":" is"}]}

event: message
data: {"completion":[{"delta":" as"}]}

event: message
data: {"completion":[{"delta":" un"}]}

event: message
data: {"completion":[{"delta":"inter"}]}

event: message
data: {"completion":[{"delta":"est"}]}

event: message
data: {"completion":[{"delta":"ing"}]}

event: message
data: {"completion":[{"delta":" and"}]}

event: message
data: {"completion":[{"delta":" lif"}]}

event: message
data: {"completion":[{"delta":"eless"}]}

event: message
data: {"completion":[{"delta":" as"}]}

event: message
data: {"completion":[{"delta":" static"}]}

event: message
data: {"completion":[{"delta":" on"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" screen"}]}

event: message
data: {"completion":[{"delta":"."}]}

event: message
data: {"completion":[{"delta":" The"}]}

event: message
data: {"completion":[{"delta":" scene"}]}

event: message
data: {"completion":[{"delta":" described"}]}

event: message
data: {"completion":[{"delta":" could"}]}

event: message
data: {"completion":[{"delta":" represent"}]}

event: message
data: {"completion":[{"delta":" an"}]}

event: message
data: {"completion":[{"delta":" under"}]}

event: message
data: {"completion":[{"delta":"wh"}]}

event: message
data: {"completion":[{"delta":"el"}]}

event: message
data: {"completion":[{"delta":"ming"}]}

event: message
data: {"completion":[{"delta":" day"}]}

event: message
data: {"completion":[{"delta":","}]}

event: message
data: {"completion":[{"delta":" perhaps"}]}

event: message
data: {"completion":[{"delta":" sign"}]}

event: message
data: {"completion":[{"delta":"ifying"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" d"}]}

event: message
data: {"completion":[{"delta":"rab"}]}

event: message
data: {"completion":[{"delta":" or"}]}

event: message
data: {"completion":[{"delta":" dis"}]}

event: message
data: {"completion":[{"delta":"mal"}]}

event: message
data: {"completion":[{"delta":" m"}]}

event: message
data: {"completion":[{"delta":"ood"}]}

event: message
data: {"completion":[{"delta":" or"}]}

event: message
data: {"completion":[{"delta":" atmosphere"}]}

event: message
data: {"completion":[{"delta":"."}]}

event: message
data: {"completion":[{"delta":" It"}]}

event: message
data: {"completion":[{"delta":" could"}]}

event: message
data: {"completion":[{"delta":" also"}]}

event: message
data: {"completion":[{"delta":" suggest"}]}

event: message
data: {"completion":[{"delta":" a"}]}

event: message
data: {"completion":[{"delta":" sense"}]}

event: message
data: {"completion":[{"delta":" of"}]}

event: message
data: {"completion":[{"delta":" tran"}]}

event: message
data: {"completion":[{"delta":"qu"}]}

event: message
data: {"completion":[{"delta":"ility"}]}

event: message
data: {"completion":[{"delta":" or"}]}

event: message
data: {"completion":[{"delta":" still"}]}

event: message
data: {"completion":[{"delta":"ness"}]}

event: message
data: {"completion":[{"delta":","}]}

event: message
data: {"completion":[{"delta":" as"}]}

event: message
data: {"completion":[{"delta":" the"}]}

event: message
data: {"completion":[{"delta":" lack"}]}

event: message
data: {"completion":[{"delta":" of"}]}

event: message
data: {"completion":[{"delta":" activity"}]}

event: message
data: {"completion":[{"delta":" (\""}]}

event: message
data: {"completion":[{"delta":"de"}]}

event: message
data: {"completion":[{"delta":"ad"}]}

event: message
data: {"completion":[{"delta":" channel"}]}

event: message
data: {"completion":[{"delta":"\")"}]}

event: message
data: {"completion":[{"delta":" is"}]}

event: message
data: {"completion":[{"delta":" mirror"}]}

event: message
data: {"completion":[{"delta":"ed"}]}

event: message
data: {"completion":[{"delta":" by"}]}

event: message
data: {"completion":[{"delta":" the"}]}

event: message
data: {"completion":[{"delta":" monot"}]}

event: message
data: {"completion":[{"delta":"on"}]}

event: message
data: {"completion":[{"delta":"ous"}]}

event: message
data: {"completion":[{"delta":" sky"}]}

event: message
data: {"completion":[{"delta":"."}]}

event: message
data: [DONE]
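For reference, the streamed deltas above can be reassembled on the client side with a small SSE parser. This is a hypothetical helper, not part of the PR:

```python
import json

def concat_deltas(sse_body):
    """Concatenate completion deltas from an SSE response body until [DONE]."""
    parts = []
    for line in sse_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip "event: message" lines and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        for chunk in json.loads(payload).get("completion", []):
            parts.append(chunk.get("delta", ""))
    return "".join(parts)
```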

Create completion endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions23123] for request from inference entity id [nvidia-completion] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions23123] for request from inference entity id [nvidia-completion] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
Create completion endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-completion-2",
    "task_type": "completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
CHAT COMPLETION
Create chat completion endpoint
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-chat-completion",
    "task_type": "chat_completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform basic chat completion
RQ
{
    "model": "microsoft/phi-3-mini-128k-instruct",
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_completion_tokens": 10
}
RS
event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"role":"assistant"},"index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk"}

event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"content":"Deep"},"index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk"}

event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"content":","},"index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk"}

event: message
data: {"id":"cmpl-230722dd37674bb091b29b3e5c615abc","choices":[{"delta":{"content":" which"},"finish_reason":"length","index":0}],"model":"microsoft/phi-3-mini-128k-instruct","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":8,"total_tokens":18}}

event: message
data: [DONE]


Create chat completion endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions123",
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "unified_chat_completion_exception",
                "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions123] for request from inference entity id [nvidia-chat-completion] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "unified_chat_completion_exception",
            "reason": "Resource not found at [https://integrate.api.nvidia.com/v1/chat/completions123] for request from inference entity id [nvidia-chat-completion] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
Create chat completion endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "microsoft/phi-3-mini-128k-instruct"
    }
}
RS
{
    "inference_id": "nvidia-chat-completion-3",
    "task_type": "chat_completion",
    "service": "nvidia",
    "service_settings": {
        "model_id": "microsoft/phi-3-mini-128k-instruct",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
RERANK
Create rerank endpoint
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking",
        "api_key": "api_key",
        "model_id": "nv-rerank-qa-mistral-4b:1"
    }
}
RS
{
    "inference_id": "nvidia-rerank",
    "task_type": "rerank",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nv-rerank-qa-mistral-4b:1",
        "url": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform rerank
RQ
{
    "input": [
        "luke",
        "like",
        "leia",
        "chewy",
        "r2d2",
        "star",
        "wars"
    ],
    "query": "star wars main character"
}
RS
{
    "rerank": [
        {
            "index": 0,
            "relevance_score": -7.8710938
        },
        {
            "index": 5,
            "relevance_score": -8.2578125
        },
        {
            "index": 2,
            "relevance_score": -9.5390625
        },
        {
            "index": 4,
            "relevance_score": -11.2578125
        },
        {
            "index": 6,
            "relevance_score": -11.53125
        },
        {
            "index": 3,
            "relevance_score": -12.3671875
        },
        {
            "index": 1,
            "relevance_score": -12.46875
        }
    ]
}
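Since the rerank response refers to documents only by index, a caller typically joins the scores back to the original inputs. A sketch with a hypothetical helper, using the data above:

```python
def attach_documents(inputs, rerank_results):
    # relevance_score values here are raw model scores and can be negative;
    # only the relative ordering matters. Results arrive already sorted.
    return [(inputs[r["index"]], r["relevance_score"]) for r in rerank_results]

docs = ["luke", "like", "leia", "chewy", "r2d2", "star", "wars"]
results = [
    {"index": 0, "relevance_score": -7.8710938},
    {"index": 5, "relevance_score": -8.2578125},
    {"index": 2, "relevance_score": -9.5390625},
]
ranked = attach_documents(docs, results)
```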
Create rerank endpoint (Not Found error)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "url": "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking123",
        "api_key": "api_key",
        "model_id": "nv-rerank-qa-mistral-4b:1"
    }
}
RS
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Resource not found at [https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking123] for request from inference entity id [nvidia-rerank] status [404]. Error message: [404 page not found\n]"
            }
        ],
        "type": "status_exception",
        "reason": "Could not complete inference endpoint creation as validation call to service threw an exception.",
        "caused_by": {
            "type": "status_exception",
            "reason": "Resource not found at [https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking123] for request from inference entity id [nvidia-rerank] status [404]. Error message: [404 page not found\n]"
        }
    },
    "status": 400
}
Create rerank endpoint (Default URL)
RQ
{
    "service": "nvidia",
    "service_settings": {
        "api_key": "api_key",
        "model_id": "nv-rerank-qa-mistral-4b:1"
    }
}
RS
{
    "inference_id": "nvidia-rerank-1",
    "task_type": "rerank",
    "service": "nvidia",
    "service_settings": {
        "model_id": "nv-rerank-qa-mistral-4b:1",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}
Perform rerank (With default URL)
RQ
{
    "input": [
        "luke",
        "like",
        "leia",
        "chewy",
        "r2d2",
        "star",
        "wars"
    ],
    "query": "star wars main character"
}
RS
{
    "rerank": [
        {
            "index": 0,
            "relevance_score": -7.8710938
        },
        {
            "index": 5,
            "relevance_score": -8.2578125
        },
        {
            "index": 2,
            "relevance_score": -9.5390625
        },
        {
            "index": 4,
            "relevance_score": -11.2578125
        },
        {
            "index": 6,
            "relevance_score": -11.53125
        },
        {
            "index": 3,
            "relevance_score": -12.3671875
        },
        {
            "index": 1,
            "relevance_score": -12.46875
        }
    ]
}
@elasticsearchmachine added the needs:triage (Requires assignment of a team area label), v9.2.0, and external-contributor (Pull request authored by a developer outside the Elasticsearch team) labels on Aug 4, 2025
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic marked this pull request as draft August 4, 2025 10:01
@gareth-ellis added the :ml (Machine learning), Team:ML (Meta label for the ML team), and >enhancement labels on Aug 15, 2025
@pxsalehi removed the needs:triage (Requires assignment of a team area label) label on Sep 24, 2025
@Jan-Kazlouski-elastic (Contributor Author)

The changes proposed in #132388 (comment) are done, with improvements and reuse of existing code. Thanks.

@Jan-Kazlouski-elastic (Contributor Author)

Hi @DonalEvans
Your comments are addressed and the PR can be reviewed again.

@DonalEvans (Contributor)

> I tested with a different model and got the same error message, meaning those messages are shared across models. Fixed. Now we recognize 413 responses, and 400 responses with "exceeds maximum allowed token size" in the error message.

The AzureOpenAI, Llama and OpenShiftAI integrations all also extend OpenAiResponseHandler for handling text embedding responses, so I wonder if there is a similar issue with those integrations, or with the OpenAI integration itself, for text embeddings. I don't have accounts with any of those providers, so if you do have access to them, would you be able to check, please?

@DonalEvans (Contributor) left a comment:

Sorry for the late request for this change, but would it be possible to convert the unit tests added in this PR that are extending AbstractWireSerializingTestCase to extend AbstractBWCWireSerializationTestCase instead? While there are no backwards compatibility concerns with the classes now since they're brand new, it's good to have the tests set up to catch any that might be introduced in future. The change should be simple, just implementing the mutateInstanceForVersion() method to return the instance unchanged in each test class.

Comment on lines 35 to 37
protected void checkForFailureStatusCode(Request request, HttpResult result) throws RetryException {
super.checkForFailureStatusCode(request, result);
}
Contributor:

Does this method need to be overridden? It just calls the super method without any additional logic. I think it would be better to make the super method public instead of overriding the method to give access to it. It's also an option to move the OpenAiResponseHandler class into the org.elasticsearch.xpack.inference.common package, similar to what was done with the Truncation enum, since it's used by 9 different integrations at this point.

Contributor Author:

OpenAiResponseHandler is too widely used for me to be comfortable moving it as part of this integration; I'd rather do that in a separate PR. I changed the accessibility of checkForFailureStatusCode instead.

"param": null,
"code": null
}
"error": "Input length 18432 exceeds maximum allowed token size 8192"
Contributor:

For this specific test case, where we're checking that a 413 status leads to truncating the input, it might be better to use an error message that doesn't match the one we check for, to confirm that it's the status code that causes the truncation rather than the error message.

Contributor Author:

Fixed.

"param": null,
"code": null
}
"error": "Input length 18432 exceeds maximum allowed token size 8192"
Contributor:

We should be testing both that text embedding requests get truncated when they see this message and that completion requests get truncated when they get a 413 status or see the "Please reduce your prompt; or completion length." message. Could you add tests for the latter two cases please?

Contributor Author:

The issue is that we don't perform truncation logic for any chat completion requests, for any integration, including Nvidia.

If a 413 error, or a 400 with the matching message, is received, the request is retried three times without changing the input; if the same errors come back, the error is returned to the customer.
Truncating completion requests seems off to me, because it would change the meaning of the input, but retrying when we know the model cannot return a successful response is not good either.

I added logic that throws the error right away, without retries, for completions when a ContentTooLarge error is received. That seems to make more sense.
Let me know your opinion on that.
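The detection rule discussed in this thread (a 413 status, or a 400 whose message mentions the token limit) can be sketched as follows. The real implementation lives in the Java response handlers, so this Python is only illustrative:

```python
# Marker string quoted in this review thread; treated here as an assumption
# about the provider's wording.
TOKEN_LIMIT_MARKER = "exceeds maximum allowed token size"

def is_content_too_large(status_code, error_message):
    # A 413 always means the payload was too large; a 400 counts only when
    # the provider's error message mentions the token limit.
    if status_code == 413:
        return True
    return status_code == 400 and TOKEN_LIMIT_MARKER in (error_message or "")
```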

new NvidiaEmbeddingsServiceSettings(
MODEL_VALUE,
ServiceUtils.createOptionalUri(null),
createOptionalUri(null),
Contributor:

Rather than using null as the URI here, it might be better to pass in the expected default URI. We end up asserting the same thing in either case, but it makes the test clearer in terms of what the expected value actually is instead of having it hidden inside the logic in the constructor.

Contributor Author:

Good thinking. Changed to the default value.

Comment on lines 30 to 31
private static final InputType INPUT_TYPE_EXPEDIA_VALUE = InputType.INGEST;
private static final Truncation TRUNCATE_EXPEDIA_VALUE = Truncation.START;
Contributor:

Some more "EXPEDIA" instead of "ELASTIC" here.

Contributor Author:

Thanks. Fixed now.

@Jan-Kazlouski-elastic (Contributor Author)

> Sorry for the late request for this change, but would it be possible to convert the unit tests added in this PR that are extending AbstractWireSerializingTestCase to extend AbstractBWCWireSerializationTestCase instead? While there are no backwards compatibility concerns with the classes now since they're brand new, it's good to have the tests set up to catch any that might be introduced in future. The change should be simple, just implementing the mutateInstanceForVersion() method to return the instance unchanged in each test class.

Done. The embeddings tests now extend AbstractBWCWireSerializationTestCase instead of AbstractWireSerializingTestCase.

@Jan-Kazlouski-elastic (Contributor Author)

> I tested with a different model and got the same error message, meaning those messages are shared across models. Fixed. Now we recognize 413 responses, and 400 responses with "exceeds maximum allowed token size" in the error message.

> The AzureOpenAI, Llama and OpenShiftAI integrations all also extend OpenAiResponseHandler for handling text embedding responses, so I wonder if there is a similar issue with those integrations, or with the OpenAI integration itself, for text embeddings. I don't have accounts with any of those providers, so if you do have access to them, would you be able to check, please?

Unfortunately, we won't be able to test OpenShift AI because the environment was taken away from us and is now used for other purposes. I will check the other providers.

@Jan-Kazlouski-elastic (Contributor Author)

Hello @DonalEvans
Your comments are addressed. The PR is ready to be reviewed once more.

@DonalEvans (Contributor)

The failures in the serverless checks are due to being out of date with the base branch; merging main should resolve them. Once all the checks are passing, I'll merge this PR.

@Jan-Kazlouski-elastic (Contributor Author)

@DonalEvans I added the specification and merged the main branch.

@DonalEvans DonalEvans merged commit 539645d into elastic:main Dec 4, 2025
36 checks passed

Labels

>enhancement · external-contributor (Pull request authored by a developer outside the Elasticsearch team) · :ml (Machine learning) · Team:ML (Meta label for the ML team) · v9.3.0

5 participants