Rate Limits

Friendli Serverless Endpoints enforce rate limits on requests. These limits cap the amount of work that can be submitted within a given time window, so that shared resources are used fairly and efficiently. Rate limits are expressed using three metrics:

  • RPM (Requests per Minute): The maximum number of requests allowed per minute.
  • TPM (Tokens per Minute): The maximum estimated number of tokens processed per minute, which reflects the computational load.
  • SPM (Steps per Minute): The maximum number of inference steps allowed per minute.

RPM applies to all types of generation models, while TPM applies only to text generation models and SPM applies only to image generation models. Rate limit information is included in the response headers as follows:

  • In all responses
    • X-RateLimit-Limit-Requests
    • X-RateLimit-Remaining-Requests
    • X-RateLimit-Reset-Requests
  • In text generation responses
    • X-RateLimit-Limit-Tokens
    • X-RateLimit-Remaining-Tokens
    • X-RateLimit-Reset-Tokens
  • In image generation responses
    • X-RateLimit-Limit-Steps
    • X-RateLimit-Remaining-Steps
    • X-RateLimit-Reset-Steps
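
As a minimal sketch of how a client might collect these headers, the Python function below groups them by metric. The function name `read_rate_limit_headers` is hypothetical, and the values are kept as raw strings because this page does not specify their format (for example, whether a Reset value is a duration or a timestamp).

```python
from typing import Dict, Mapping

# Header name parts taken from the list above.
METRICS = ("Requests", "Tokens", "Steps")
FIELDS = ("Limit", "Remaining", "Reset")


def read_rate_limit_headers(headers: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group the X-RateLimit-* response headers by metric.

    Values stay as raw strings: their exact format is an assumption
    we deliberately avoid making here.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    result: Dict[str, Dict[str, str]] = {}
    for metric in METRICS:
        values = {}
        for field in FIELDS:
            key = f"x-ratelimit-{field}-{metric}".lower()
            if key in lowered:
                values[field.lower()] = lowered[key]
        if values:  # skip metrics the response did not report
            result[metric.lower()] = values
    return result
```

Passing the headers mapping of any HTTP response object to this function should yield a `requests` entry for every response, plus a `tokens` or `steps` entry depending on the model type.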

The rate limits that apply depend on the user's subscription plan, with higher-tier plans subject to fewer restrictions. The following table lists the rate limits for each plan:

Plan          RPM         TPM         SPM
Enterprise    No limit    No limit    No limit

The metrics are measured per team across all models.
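
Because usage is counted per team across all models, a client that calls several models may want to share one request budget among them. The sketch below is a hypothetical client-side helper (`TeamRequestBudget` is not part of any Friendli SDK). It assumes a rolling 60-second window, which may not match how the service actually counts, and it only sees requests from the current process, so the response headers remain the authoritative source.

```python
import time
from collections import deque


class TeamRequestBudget:
    """Client-side rolling-window request counter shared by all models.

    Assumption: the limit is treated as a rolling window. The service's
    real accounting may differ.
    """

    def __init__(self, max_requests, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests      # hypothetical RPM for your plan
        self.window_seconds = window_seconds
        self.clock = clock                    # injectable for testing
        self._sent = deque()                  # timestamps of recent requests

    def try_acquire(self):
        """Return True and record a request if the budget allows it."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False
```

Call `try_acquire()` on a single shared instance before each request, regardless of which model is targeted, and delay the request when it returns `False`.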