Rate Limits

Friendli Serverless Endpoints enforce rate limits that cap how much a client can request within a given timeframe, which keeps resource use balanced and efficient. The limits are expressed using three metrics:

- RPM (Requests per Minute): The maximum number of requests allowed per minute.
- TPM (Tokens per Minute): The maximum estimated number of tokens processed per minute, which reflects the computational load.
- SPM (Steps per Minute): The maximum number of inference steps allowed per minute.

Note: RPM applies to all types of generation models, while TPM applies only to text generation models and SPM applies only to image generation models.

Rate limit information is included in the response headers as follows:

In all responses:
- X-RateLimit-Limit-Requests
- X-RateLimit-Remaining-Requests
- X-RateLimit-Reset-Requests

In text generation responses:
- X-RateLimit-Limit-Tokens
- X-RateLimit-Remaining-Tokens
- X-RateLimit-Reset-Tokens

In image generation responses:
- X-RateLimit-Limit-Steps
- X-RateLimit-Remaining-Steps
- X-RateLimit-Reset-Steps
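
As an illustration, the headers listed above could be collected from a response with a small helper like the one below. This is a minimal sketch: `parse_rate_limit_headers` is a hypothetical name, it accepts any mapping of header names to values (such as the headers object of an HTTP client), and it keeps every value as a raw string because the value formats, in particular the format of the Reset headers, are not specified here.

```python
from typing import Dict, Mapping

# Metric and field names taken from the header list above.
METRICS = ("Requests", "Tokens", "Steps")
FIELDS = ("Limit", "Remaining", "Reset")


def parse_rate_limit_headers(headers: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group X-RateLimit-* header values by metric (hypothetical helper).

    Values are returned as raw strings; their formats are an open
    assumption, so no numeric conversion is attempted.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    parsed: Dict[str, Dict[str, str]] = {}
    for metric in METRICS:
        found: Dict[str, str] = {}
        for field in FIELDS:
            value = lowered.get(f"x-ratelimit-{field}-{metric}".lower())
            if value is not None:
                found[field.lower()] = value
        # Only report metrics that the response actually carried.
        if found:
            parsed[metric.lower()] = found
    return parsed
```

A text generation response would be expected to yield `requests` and `tokens` entries, and an image generation response `requests` and `steps` entries; metrics whose headers are absent are simply omitted from the result.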

The rate limits that apply depend on the user's subscription plan, with higher-tier plans having higher limits. The following table lists the rate limits for each plan:

Plan	RPM	TPM	SPM
Trial	10	10K	1K
Basic	5K	50K	3K
Enterprise	No limit	No limit	No limit

Note: The metrics are measured per team, across all models.
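
To stay under a plan's per-minute limits, a client might track its own usage before sending requests. The sketch below is one possible approach, not a description of how the service counts usage: `MinuteBudget` is a hypothetical class, and it assumes a rolling 60-second window, which is not stated here. The same class can guard RPM (cost of 1 per request), TPM (cost equal to the estimated tokens), or SPM (cost equal to the inference steps).

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class MinuteBudget:
    """Client-side guard for a per-minute budget (hypothetical sketch).

    Assumes a rolling 60-second window; the service's actual
    accounting may differ.
    """

    def __init__(self, limit: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, cost)
        self.used = 0

    def _expire(self, now: float) -> None:
        # Drop events that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60.0:
            _, cost = self.events.popleft()
            self.used -= cost

    def try_spend(self, cost: int = 1) -> bool:
        """Record the cost and return True if it fits in the budget, else return False."""
        now = self.clock()
        self._expire(now)
        if self.used + cost > self.limit:
            return False
        self.events.append((now, cost))
        self.used += cost
        return True
```

For example, `MinuteBudget(limit=10)` would mirror the Trial plan's RPM from the table. Because the metrics are measured per team across all models, a tracker running in a single process cannot see usage from other team members or other clients, so the response headers remain the authoritative source for remaining quota.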