
Chat completions



Given a list of messages comprising a conversation, the model generates a response. See the available models in the pricing table.


Header Parameters

    X-Friendli-Team string

    ID of the team to run the request as.


Body Parameters

    model stringrequired

    Code of the model to use. See the available model list.

    messages object[]required

    A list of messages comprising the conversation so far.

  • Array [
  • content stringrequired

    The contents of the message.

    role stringrequired

    The role of the message's author.

  • ]
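For example, a multi-turn conversation is expressed as an ordered list of role/content objects (a minimal sketch; the message contents are illustrative):

```python
# Build the `messages` array for a multi-turn conversation.
# Each entry carries the author's `role` and the message `content`.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "How many people live there?"},
]

# Every element must provide both required fields.
assert all({"role", "content"} <= set(m) for m in messages)
```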
  • frequency_penalty numbernullable

    Number between -2.0 and 2.0. Positive values penalize tokens that have already been sampled, in proportion to their frequency in the preceding text. This penalty reduces the model's tendency to repeat identical lines verbatim.

    presence_penalty numbernullable

    Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text.

    max_tokens integernullable

    The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.

    n integernullable

    The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.

    stop string[]nullable

    When one of the stop phrases appears in the generation result, the API stops generation. The stop phrases are excluded from the result. Defaults to an empty list.

    stream booleannullable

    Whether to stream the generation result. When set to true, each token is sent as a server-sent event as soon as it is generated.
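A streamed response arrives as `data:` lines terminated by a `[DONE]` sentinel. The sketch below parses such lines offline; the event payload shape (a `choices`/`delta` object) is an assumption modeled on the common OpenAI-style framing, so consult the streaming response schema for the exact fields:

```python
import json

# Hypothetical server-sent-event lines as they might arrive when
# `stream` is true; the payload shape here is an assumption.
raw_events = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]

def collect_tokens(lines):
    """Concatenate streamed content chunks, stopping at the [DONE] sentinel."""
    text = []
    for line in lines:
        payload = line.removeprefix("data: ")
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        text.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(text)
```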

    temperature numbernullable

    Sampling temperature. A smaller temperature makes the generation result closer to greedy, argmax (i.e., top_k = 1) sampling. Defaults to 1.0. This is similar to Hugging Face's temperature argument.

    top_p numbernullable

    Tokens comprising the top top_p probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's top_p argument.

    timeout_microseconds integernullable

    Request timeout in microseconds. When the timeout expires, the API responds with the HTTP 429 Too Many Requests status code. By default, there is no timeout.

    response_format object

    The enforced format of the model's output.

    type stringrequired

    Possible values: [text, json_object, regex]

    Type of the response format.

    schema string

    The schema of the output. For { "type": "json_object" }, schema should be a serialized JSON schema string. For { "type": "regex" }, schema should be a regex pattern.

    Caveat

  • For the JSON object type, recursive definitions are not supported. Optional properties are also not supported: all properties of { "type": "object" } are generated regardless of whether they are listed in the required field.
  • For the regex type, lookaheads/lookbehinds (e.g., \a, \z, ^, $, (?=), (?!), (?<=...), (?<!...)) are not supported. Group specials (e.g., \w, \W, \d, \D, \s, \S) do not support non-ASCII characters. Unicode escape patterns (e.g., \N, \p, \P) are not supported. Additionally, conditional matching ((?(...)) and back-references can cause inefficiency.
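Putting the parameters above together, a request body might be assembled as follows. This is a sketch, not a definitive client: the model code is a placeholder, and the endpoint URL and token in the comment must be substituted with values from your deployment.

```python
import json

# Sketch of a request body using the parameters documented above.
body = {
    "model": "<model-code>",  # placeholder; see the available model list
    "messages": [
        {"role": "user", "content": "List one U.S. state as JSON."}
    ],
    "max_tokens": 200,
    "temperature": 0.5,
    "top_p": 0.9,
    "response_format": {
        "type": "json_object",
        # For json_object, `schema` must be a *serialized* JSON-schema string.
        "schema": json.dumps({
            "type": "object",
            "properties": {"state": {"type": "string"}},
        }),
    },
}

# The body would then be sent as JSON, e.g. with the `requests` package
# (endpoint and credentials are placeholders):
# requests.post("https://<your-endpoint>/v1/chat/completions",
#               headers={"Authorization": "Bearer <token>",
#                        "X-Friendli-Team": "<team-id>"},
#               json=body)
payload = json.dumps(body)
```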


Response

Successfully generated a chat response. When streaming mode is used (i.e., the stream option is set to true), the response has MIME type text/event-stream. Otherwise, the content type is application/json.

    choices object[]
  • Array [
  • index integer

    The index of the choice in the list of generated choices.

    message object
    role string

    Role of the generated message's author; in this case, assistant.

    content string

    The contents of the assistant message.

    finish_reason string

    Termination condition of the generation. stop means the API returned the full chat completion generated by the model without running into any limits. length means the generation exceeded max_tokens or the conversation exceeded the max context length.

  • ]
  • usage object
    prompt_tokens integer

    Number of tokens in the prompt.

    completion_tokens integer

    Number of tokens in the generated completion.

    total_tokens integer

    Total number of tokens used in the request (prompt_tokens + completion_tokens).

    created integer

    The Unix timestamp (in seconds) for when the generation completed.
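A non-streaming response can be unpacked as follows. The field values here are illustrative samples following the schema above, not real model output:

```python
import json

# Illustrative (not real) response body following the schema above.
sample = json.loads("""
{
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello there!"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 10, "completion_tokens": 3, "total_tokens": 13},
  "created": 1700000000
}
""")

answer = sample["choices"][0]["message"]["content"]
usage = sample["usage"]

# total_tokens is the sum of prompt and completion tokens.
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]

# A finish_reason of "stop" means the model completed without hitting limits.
finished_cleanly = sample["choices"][0]["finish_reason"] == "stop"
```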