Chat Completions
POST /v1/chat/completions
Query an LLM using text and, for vision-capable models, image data. This endpoint supports multiple models (up to 5, comma-separated) in the model field; an array of responses is returned for multi-model queries. Specify the appropriate request format depending on whether the model is text-based or vision-capable.
Refer to ChatCompletionRequest for text-based models and VisionChatCompletionRequest for vision-capable models for detailed request formats.
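A minimal sketch of a multi-model text request body, following the field descriptions below. The model names and base URL are placeholders; substitute models from your /models route and your actual API host.

```python
import json

# Sketch of a multi-model chat completion request body.
payload = {
    # Up to 5 models, comma-separated; an array of responses comes back.
    "model": "gpt-4o-mini,claude-3-haiku",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Hamlet in one sentence."},
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
# POST this body to /v1/chat/completions with your usual auth headers, e.g.:
#   requests.post(f"{BASE_URL}/v1/chat/completions", data=body, headers=headers)
print(len(payload["model"].split(",")))  # number of models queried
```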
Request
- application/json
Body
A chat completion query, which can include either text or image data depending on the model's capability. Refer to the appropriate schema for the model type.
- ChatCompletionRequest
- VisionChatCompletionRequest
messages object[]required
Possible values: [assistant, system, user]
Message role
Content of the message
Language model to use, comma-separated, up to 5 models. Check our /models route for a list of language models. The provider option cannot be used with multi-model requests.
AI service provider to use; omit provider and we will automatically use the most responsive provider. Optionally, you can include the provider with the model instead, such as model: provider/model.
Specify the name of the RAG tune or vector collection to be used for RAG tuning. This augments the language model query with information from the specified vector database.
Defines how we route your call when you do not specify a provider and multiple providers exist for that model. Options are price (cheapest), perf (multi-tiered performance, routing based on lowest latency for the prompt size), and perf_avg (average latency). The default is perf.
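A sketch of the routing-related fields described above. The field names "route" and "provider" are assumptions based on the descriptions; verify the exact keys against the schema.

```python
# Sketch of provider routing options (field names assumed from the docs).
payload = {
    "model": "llama-3-70b",  # single model: provider cannot be combined with multi-model
    "route": "price",        # "price", "perf", or "perf_avg"; default is "perf"
    # "provider": "together",  # or embed it in the model: "together/llama-3-70b"
    "messages": [{"role": "user", "content": "Hello"}],
}

# Multi-model ("," in model) and an explicit provider are mutually exclusive.
assert "provider" not in payload or "," not in payload["model"]
```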
Possible values: <= 2
Influences the randomness in the selection process of the next token. Lower values make the model more deterministic, while higher values increase diversity but might reduce coherence.
Possible values: <= 1
Controls the cumulative probability distribution cutoff, selecting the smallest set of tokens whose cumulative probability exceeds the threshold p. This focuses generation on more likely tokens, enhancing creativity and coherence.
Limits the selection pool to the top k most probable tokens. The probability distribution is then reranked among these k tokens, which helps in reducing randomness by eliminating the least likely options.
Possible values: >= -2 and <= 2
Adjusts the likelihood of a token's selection based on its previous occurrences, decreasing the chances of frequently selected tokens to promote diversity in the output.
Possible values: >= -2 and <= 2
Similar to frequency penalty, but it decreases the likelihood of tokens appearing again based on their presence, regardless of frequency, to encourage novel token selection.
Possible values: >= 1 and <= 2
Discourages the model from repeating the same words or phrases, enhancing the uniqueness and variety of the content generated.
Possible values: >= 1 and <= 5
Receive this many responses to your prompt; currently only works with OpenAI direct.
Possible values: >= 1 and <= 5
Used in beam search, it represents the number of sequences to keep at each step of generation. A larger beam size increases the chances of finding a more optimal sequence, but at the cost of computational resources and time. Only some models support this.
Possible values: >= 1
Max tokens used to generate a response
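The generation parameters above can be combined as in the sketch below, assuming the OpenAI-style names temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, n, and max_tokens; verify the exact names against the schema.

```python
# Sketch of a request using the documented generation parameters
# (parameter names are assumed OpenAI-style).
payload = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Write a haiku about rain."}],
    "temperature": 0.7,         # <= 2; lower = more deterministic
    "top_p": 0.9,               # <= 1; nucleus-sampling cutoff
    "top_k": 40,                # rerank among the 40 most probable tokens
    "frequency_penalty": 0.5,   # -2..2; penalize frequently selected tokens
    "presence_penalty": 0.2,    # -2..2; penalize tokens already present
    "repetition_penalty": 1.1,  # 1..2; discourage repeated words/phrases
    "n": 1,                     # 1..5 responses (OpenAI direct only)
    "max_tokens": 64,           # >= 1
}

# Range checks mirroring the documented bounds:
assert payload["temperature"] <= 2 and payload["top_p"] <= 1
assert -2 <= payload["frequency_penalty"] <= 2
```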
Return the response as an event stream (server-sent events).
tools object[]
Optional tools/functions. All models support tools, serviced by an OpenAI or Anthropic model of your choice. Function calls are handled in the OpenAI standard request/response format.
Possible values: [function]
function object
Name of the function to call
Description of the function
Parameters required by the function
Specifies how the tools are chosen. "none" means the model will not call any tool and instead generates a message; this is the default. "auto" means the model can pick between generating a message or calling one or more tools; this is how most developers use it. "required" means the model must call one or more tools.
Default value: gpt-4o-mini
Specifies the model to use for processing tools, pass any OpenAI or Anthropic model, gpt-4o-mini is default.
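A sketch of a tool definition in the OpenAI function-calling format described above. The "get_weather" function and its parameters are illustrative only, and "tools_model" is an assumed name for the tool-processing model field; confirm the key in the schema.

```python
# Sketch of a request with one tool, in OpenAI function-calling format.
payload = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",  # the only supported tool type
            "function": {
                "name": "get_weather",  # illustrative function name
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # "none" (default), "auto", or "required"
    "tools_model": "gpt-4o-mini",  # assumed field name; gpt-4o-mini is the default
}
```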
Possible values: [12, 13]
Integrity setting, can be 12 or 13; used to query and return the best of two answers (12) or the best of three answers (13).
Default value: gpt-4o
Specifies the model to use for integrity checks; currently only OpenAI models are supported.
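A sketch of the integrity options, assuming the field names "integrity" and "integrity_model" based on the descriptions above; check the schema for the exact keys.

```python
# Sketch of an integrity-checked request (field names assumed from the docs).
payload = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "integrity": 13,              # 12 = best of two answers, 13 = best of three
    "integrity_model": "gpt-4o",  # OpenAI models only; gpt-4o is the default
}

assert payload["integrity"] in (12, 13)
```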
Force request to be routed to the specified provider, otherwise request will be routed to the requested provider only if it is available
Vision Model to use, refer to /models route for compatible vision models
messages object[]required
Possible values: [user]
Role of the message sender
content object[]
Possible values: [text]
Textual content of the message
Possible values: [image_url]
image_url object
URL of the image
Possible values: [high]
Detail level of the image; only high is supported.
Possible values: >= 1
Max tokens used to generate a response
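A sketch of a vision request as described above: a single user message whose content mixes a text part and an image_url part. The model name and image URL are placeholders; pick a vision-capable model from /models.

```python
# Sketch of a VisionChatCompletionRequest body.
payload = {
    "model": "gpt-4o",  # placeholder; use a vision model from /models
    "messages": [
        {
            "role": "user",  # only "user" is allowed for vision messages
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg",  # placeholder URL
                        "detail": "high",  # only "high" is supported
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}
```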
Responses
- 200
Result of the Query
- application/json
- Schema
- Example (from schema)
Unique identifier for the completion.
Type of the returned object, usually set to chat.completion.
UTC timestamp of when the completion was created.
Provider of the AI service.
AI model used for generating the completion.
choices object[]required
Array of possible completion options generated by the model.
Index of the choice in the array.
message objectrequired
Role of the message, such as user or assistant.
Content of the message.
Log probabilities for the completion, can be null.
Reason why the model stopped generating text.
usage objectrequired
Number of tokens used in the prompt.
Number of tokens generated in the completion.
Total number of tokens used in both prompt and completion.
Number of characters used in the prompt.
Number of characters in the response.
Cost associated with the computation of the completion.
Latency in milliseconds for the completion to be generated.
Unique fingerprint of the system configuration used.
streaming object
Object detailing streaming data chunks if stream is true.
Type of the streamed object, usually chat.completion.chunk.
chunks object[]
Array of data chunks streamed.
Content of the streamed data chunk.
Index of the data chunk in the stream.
Reason for the finish of this particular chunk.
Tokens, characters, latency, and cost for this query.
{
"id": "string",
"object": "string",
"created": 0,
"provider": "string",
"model": "string",
"choices": [
{
"index": 0,
"message": {
"role": "string",
"content": "string"
},
"logprobs": {},
"finish_reason": "string"
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 0,
"total_tokens": 0,
"prompt_characters": 0,
"response_characters": 0,
"cost": 0,
"latency_ms": 0
},
"system_fingerprint": "string",
"streaming": {
"type": "string",
"chunks": [
{
"content": "string",
"index": 0,
"finish_reason": "string",
"usage": "string"
}
]
}
}
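A sketch of reading the key fields from a non-streaming response shaped like the schema example above. The "response" dict here is a hand-built stand-in for the parsed JSON body.

```python
# Stand-in for a parsed /v1/chat/completions response body.
response = {
    "id": "cmpl-123",
    "object": "chat.completion",
    "model": "gpt-4o-mini",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "completion_tokens": 2,
        "total_tokens": 7,
        "cost": 0.0001,
        "latency_ms": 412,
    },
}

# The generated text lives in choices[0].message.content;
# token accounting and cost live under usage.
answer = response["choices"][0]["message"]["content"]
tokens = response["usage"]["total_tokens"]
print(answer, tokens)  # Hello! 7
```

For a multi-model query, expect an array of such objects rather than a single one, as noted at the top of this page.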