Chat completions

POST /v1/chat/completions

Given a list of messages forming a conversation, the model generates a response. See available models at this pricing table.

To successfully run an inference request, it is mandatory to enter a personal access token (e.g. flp_XXX) value in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this variable and visit here to generate your token.

Request

Header Parameters

X-Friendli-Team string

ID of team to run requests as (optional parameter).

application/json

Body

model stringrequired

Code of the model to use. See available model list.

messages object[]required

A list of messages comprising the conversation so far.

Array [

role stringrequired

Possible values: [system, user, assistant, tool]

The role of the messages author.

system
user
assistant
tool

content stringrequired

The content of system message.

name string

The name for the participant to distinguish between participants with the same role.

content string

The content of assistant message. Required unless tool_calls is specified.

name string

The name for the participant to distinguish between participants with the same role.

tool_calls object[]

Array [

id stringrequired

The ID of tool call.

type stringrequired

Possible values: [function]

The type of tool call.

function objectrequired

The function specification

name stringrequired

The name of function

arguments stringrequired

The arguments of function in JSON schema format to call the function.

]

frequency_penalty numbernullable

Number between -2.0 and 2.0. Positive values penalizes tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.

presence_penalty numbernullable

Number between -2.0 and 2.0. Positive values penalizes tokens that have been sampled at least once in the existing text.

repetition_penalty numbernullable

Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). should be greater than or equal to 1.0. 1.0 means no penalty. see keskar et al., 2019 for more details. this is similar to Hugging Face's repetition_penalty argument.

max_tokens integernullable

The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.

min_tokens integernullable

The minimum number of tokens to generate. default value is 0. this is similar to hugging face's min_new_tokens argument.

n integernullable

The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.

stop string[]nullable

When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. Defaults to empty list.

stream booleannullable

Whether to stream generation result. When set true, each token will be sent as server-sent events once generated.

temperature numbernullable

Sampling temperature. Smaller temperature makes the generation result closer to greedy, argmax (i.e., top_k = 1) sampling. defaults to 1.0. this is similar to hugging face's temperature argument.

top_p numbernullable

Tokens comprising the top top_p probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's top_p argument.

top_k integernullable

The number of highest probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means that the API does not apply top-k filtering. This is similar to Hugging Face's top_k argument.

timeout_microseconds integernullable

Request timeout. Gives the HTTP 429 Too Many Requests response status code. Default behavior is no timeout.

seed integer[]nullable

Seed to control random procedure. If nothing is given, random seed is used for sampling, and return the seed along with the generated result. When using the n argument, you can pass a list of seed values to control all of the independent generations.

eos_token integer[]nullable

A list of endpoint sentence tokens.

tools object[]nullable

A list of tools the model may call. Currently, only functions are supported as a tool. A maximum of 128 functions is supported. Use this to provide a list of functions the model may generate JSON inputs for.

Array [

type stringrequired

Possible values: [function]

The type of the tool. Currently, only function is supported.

function objectrequired

description stringnullable

A description of what the function does, used by the model to choose when and how to call the function.

name stringrequired

The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

parameters objectrequired

The parameters the functions accepts, described as a JSON Schema object. To represent a function with no parameters, use the value {"type": "object", "properties": {}}.

]

tool_choice stringobjectnullable

Determines the tool calling behavior of the model. When set to none, the model will bypass tool execution and generate a response directly. In auto mode (the default), the model dynamically decides whether to call a tool or respond with a message. Alternatively, setting required ensures that the model invokes at least one tool before responding to the user. You can also specify a particular tool by {"type": "function", "function": {"name": "my_function"}}.

parallel_tool_calls booleannullable

Whether to enable parallel function calling.

response_format TextResponseFormatnullable

The enforced format of the model's output.

Note that the content of the output message may be truncated if it exceeds the max_tokens. You can check this by verifying that the finish_reason of the output message is "length".

Important You must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., You are an API generating a valid JSON as output.). Otherwise, the model may result in an unending stream of whitespace or other characters.

type stringrequired

Possible values: [text, json_object, regex]

Type of the response format.

schema string

The schema of the output. For { "type": "json_object" }, schema should be a serialized string of JSON schema. For { "type": "regex" }, schema should be a regex pattern.

Caveat For the JSON object type, recursive definitions are not supported. Optional properties are also not supported; all properties of { "type": "object" } are generated regardless of whether they are listed in the required field. For the regex type, lookaheads/lookbehinds (e.g., \a, \z, ^, $, (?=), (?!), (?<=...), (?<!...)) are not supported. Group specials (e.g., \w, \W, \d, \D, \s, \S) do not support non-ASCII characters. Unicode escape patterns (e.g., \N, \p, \P) are not supported. Additionally, conditional matching ((?() and back-references can cause inefficiency.

Responses

Successfully generated a chat response. When streaming mode is used (i.e., stream option is set to true), the response is in MIME type text/event-stream. Otherwise, the content type is application/json.

application/json
text/event-stream

Schema
Example (No Streaming)

Schema

choices object[]

Array [

index integer

The index of the choice in the list of generated choices.

message object

role string

Role of the generated message author, in this case assistant.

content string

The contents of the assistant message.

tool_calls object[]

Array [

id string

The ID of the tool call.

type string

Possible values: [function]

The type of the tool

function object

name string

The name of the function to call.

arguments string

The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON.

]

finish_reason string

Termination condition of the generation. stop means the API returned the full chat completion generated by the model without running into any limits. length means the generation exceeded max_tokens or the conversation exceeded the max context length. tool_calls means the API has generated tool calls.

]

usage object

prompt_tokens integer

Number of tokens in the prompt.

completion_tokens integer

Number of tokens in the generated completion.

total_tokens integer

Total number of tokens used in the request (prompt_tokens + completion_tokens).

created integer

The Unix timestamp (in seconds) for when the generation completed.

No streaming example

{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\n\nHello there, how may I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

Schema
Example (Streaming)

Schema

choices object[]

Array [

index integer

The index of the choice in the list of generated choices.

delta object

role string

Role of the generated message author, in this case assistant.

content string

The contents of the assistant message.

tool_calls object

index integer

The index of tool call being generated.

id string

The ID of the tool call.

type string

Possible values: [function]

The type of the tool

function object

name string

The name of the function to call.

arguments string

finish_reason stringnullable

]

created integer

The Unix timestamp (in seconds) for when the token sampled.

Streaming example

data: {"created":1694268190,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"created":1694268190,"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"created":1694268190,"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

....

data: {"created":1694268190,"choices":[{"index":0,"delta":{"content":" today"},"finish_reason":null}]}

data: {"created":1694268190,"choices":[{"index":0,"delta":{"content":"?"},"finish_reason":null}]}

data: {"created":1694268190,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

[DONE]

Chat completions

/v1/chat/completions

Request​

Header Parameters

Body

Responses​

Request

Responses