Rate Limits

Friendli Serverless Endpoints enforce rate limits on requests. These limits cap the number of requests made within a given time window so that resources are used fairly and efficiently. Rate limits are expressed using two metrics:

RPM (Requests per Minute): The maximum number of requests allowed per minute.
TPM (Tokens per Minute): The maximum estimated number of tokens processed per minute, which reflects the computational load.

info

RPM applies to all types of generation models, while TPM applies only to text generation models. Rate limit information is returned in the following response headers:

In all responses:
- X-RateLimit-Limit-Requests
- X-RateLimit-Remaining-Requests
- X-RateLimit-Reset-Requests
In text generation responses:
- X-RateLimit-Limit-Tokens
- X-RateLimit-Remaining-Tokens
- X-RateLimit-Reset-Tokens
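
The headers listed above can be read from any HTTP response. The following is a minimal sketch, not an official client: the function name `read_rate_limits` is hypothetical, and it assumes that the limit and remaining values are plain integers. The format of the reset values is not specified here, so they are returned as raw strings.

```python
from typing import Mapping, Optional


def _to_int(value: Optional[str]) -> Optional[int]:
    """Return value as an int, or None if it is missing or not a plain integer."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


def read_rate_limits(headers: Mapping[str, str]) -> dict:
    """Collect the X-RateLimit-* headers into a nested dict (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    result = {}
    for kind in ("requests", "tokens"):
        result[kind] = {
            # Assumption: these two values are plain integers.
            "limit": _to_int(lowered.get(f"x-ratelimit-limit-{kind}")),
            "remaining": _to_int(lowered.get(f"x-ratelimit-remaining-{kind}")),
            # The reset format is not documented here; keep the raw string.
            "reset": lowered.get(f"x-ratelimit-reset-{kind}"),
        }
    return result
```

For responses from models other than text generation models, the token headers are absent, so the `tokens` entries come back as `None`.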

The rate limits that apply depend on the user's subscription plan, with higher-tier plans having higher limits. The following table lists the rate limits for each plan:

Plan	RPM	TPM
Trial	10	10K
Basic	5K	50K
Enterprise	No limit	No limit

info

Both RPM and TPM are measured per team, across all models combined.
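
Because the limits are counted per minute and shared across a team, a client may want to throttle itself before sending requests. The sketch below tracks both budgets over a sliding 60-second window. It is an illustration under assumptions: the class `MinuteBudget` is hypothetical, the server's exact windowing behavior is not documented here, and the token count per request is the caller's own estimate.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class MinuteBudget:
    """Client-side tracker for an RPM and TPM budget (hypothetical helper)."""

    def __init__(self, rpm: int, tpm: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        # Each entry is (timestamp, estimated_tokens) for one accepted request.
        self.events: Deque[Tuple[float, int]] = deque()

    def _prune(self, now: float) -> None:
        # Drop requests that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def try_acquire(self, estimated_tokens: int) -> bool:
        """Record the request and return True if it fits both budgets."""
        now = self.clock()
        self._prune(now)
        used_tokens = sum(tokens for _, tokens in self.events)
        if len(self.events) + 1 > self.rpm:
            return False
        if used_tokens + estimated_tokens > self.tpm:
            return False
        self.events.append((now, estimated_tokens))
        return True


# Example: the Trial plan from the table, reading 10K as 10,000 tokens.
trial_budget = MinuteBudget(rpm=10, tpm=10_000)
```

A single shared instance per team matches how the limits are counted. When `try_acquire` returns `False`, the caller can wait and retry rather than send a request that is likely to be rejected.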