Efficient Model Inference with FP8
Introduction
FP8 is an 8-bit floating-point format that improves machine learning inference efficiency with minimal accuracy loss. Because FP8 offers a higher dynamic range than INT8, it is better suited for quantizing both weights and activations, which increases throughput and reduces latency while preserving output quality. This guide provides a comprehensive overview of how to serve FP8 models.
FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.
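To make the dynamic-range claim above concrete, the sketch below enumerates every finite non-negative value of the E4M3 encoding used in this guide (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7, with the all-ones bit pattern reserved for NaN). This is an independent illustration, not part of the Friendli toolchain.

```python
def e4m3_values():
    """All finite, non-negative FP8 E4M3 values (4-bit exponent, 3-bit mantissa, bias 7)."""
    vals = set()
    for exp in range(16):          # 4-bit exponent field
        for man in range(8):       # 3-bit mantissa field
            if exp == 15 and man == 7:
                continue           # this bit pattern encodes NaN, not a number
            if exp == 0:           # subnormals: no implicit leading 1
                vals.add((man / 8) * 2 ** (1 - 7))
            else:                  # normals: implicit leading 1
                vals.add((1 + man / 8) * 2 ** (exp - 7))
    return sorted(vals)

vals = e4m3_values()
print(max(vals))                      # 448.0 -- far beyond INT8's maximum of 127
print(min(v for v in vals if v > 0))  # 0.001953125 (2**-9)
```

The wide spread between the smallest and largest representable magnitudes is what lets FP8 cover activation distributions that INT8's fixed range of [-128, 127] handles poorly.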
Prerequisites
If you'd like to use FP8 models that have already been pre-converted by FriendliAI, you can bypass these steps and jump straight to Serving FP8 Models in Hugging Face Hub.
Before converting your models to FP8, ensure you have the following:
Install the friendli-client package with the mllib extra dependency to run the checkpoint conversion:

```sh
pip install "friendli-client[mllib]"
```
If you want to convert and serve your own model with FP8 format, prepare the original model checkpoint.
Converting Models to FP8
Quantization Configuration
You can describe the configuration for the FP8 quantization in a YAML file (e.g., quant_config.yaml).
Here’s a simplified example of what this file might contain:
```yaml
mode: fp8
device: cuda
seed: 42
quant_dtype: fp8_e4m3
offload: true
calibration_dataset:
  path_or_name: abisee/cnn_dailymail:3.0.0
  format: json
  split: train
  lookup_column_name: article
  num_samples: 512
  max_length: 512
```
This configuration specifies the quantization mode, the device to perform the quantization on, and details about the calibration dataset used for the quantization process.
To apply FP8 quantization, set the mode field to fp8 and the quant_dtype field to fp8_e4m3. If you want to use a different dataset for calibration, replace the values under calibration_dataset accordingly.
For more information about each field, refer to this guide.
For now, we only support E4M3 (4-bit exponent and 3-bit mantissa) encoding format.
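As a quick sanity check before running the conversion, you can verify that a config file sets the two required FP8 fields. The sketch below uses a deliberately tiny parser that handles only the flat two-level layout of the example above; for anything more complex, use a real YAML library such as PyYAML. This check is illustrative, not part of the Friendli toolchain.

```python
def parse_simple_yaml(text):
    """Parse the flat two-level key/value layout used by the example config."""
    config, section = {}, None
    for line in text.splitlines():
        if not line.strip() or line.strip().startswith("#"):
            continue
        indented = line.startswith("  ")
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if indented and section is not None:
            config[section][key] = value     # nested key under the open section
        elif value == "":
            section = key                    # a bare "key:" opens a nested section
            config[key] = {}
        else:
            config[key] = value
            section = None
    return config

example = """\
mode: fp8
device: cuda
seed: 42
quant_dtype: fp8_e4m3
offload: true
calibration_dataset:
  path_or_name: abisee/cnn_dailymail:3.0.0
  format: json
  split: train
  lookup_column_name: article
  num_samples: 512
  max_length: 512
"""

config = parse_simple_yaml(example)
# FP8 quantization requires exactly these two settings:
assert config["mode"] == "fp8"
assert config["quant_dtype"] == "fp8_e4m3"
print(config["calibration_dataset"]["num_samples"])  # 512
```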
Running Conversion
The conversion process involves quantizing your existing model to the FP8 format. This process is facilitated by specifying the model's path, the desired output directory, and the quantization configuration. Below is an example command that illustrates this conversion process:
```sh
friendli model convert \
  --model-name-or-path $MODEL_NAME_OR_PATH \
  --output-dir $OUTPUT_DIR \
  --data-type fp16 \
  --output-ckpt-file-type safetensors \
  --quantize \
  --quant-config-file ./quant_config.yaml
```
When the model checkpoint is successfully converted to FP8 format, the following files will be created in $OUTPUT_DIR:

```
config.json
model.safetensors
special_tokens_map.json
tokenizer_config.json
tokenizer.json
```
If the size of the model exceeds 10GB, multiple sharded checkpoints are generated as follows, instead of a single model.safetensors:

```
model-00001-of-00005.safetensors
model-00002-of-00005.safetensors
model-00003-of-00005.safetensors
model-00004-of-00005.safetensors
model-00005-of-00005.safetensors
```
For more information about each option of friendli model convert, refer to this guide.
Search Optimal Policy
To serve FP8 models efficiently, you must first run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.
Serving FP8 Models
Once you have prepared the FP8 model checkpoint and the policy file, you are ready to create a serving endpoint.
```sh
# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""   # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""            # GPUs (e.g., '"device=0,1"')

# $POLICY_DIR should contain the policy file generated by the policy search.
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v $OUTPUT_DIR:/model \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name /model \
  --algo-policy-dir /policy
```
This command serves the FP8 model on port 8000, making it accessible for inference requests.
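Once the container is up, you can send inference requests to port 8000. The endpoint path and payload shape below are assumptions based on an OpenAI-compatible completions API; adjust them to whatever API your container version actually exposes.

```python
import json
from urllib import request

# Hypothetical completion request; prompt and parameters are placeholders.
payload = {
    "prompt": "What is FP8 quantization?",
    "max_tokens": 128,
}
req = request.Request(
    "http://localhost:8000/v1/completions",   # assumed endpoint path
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request against a running endpoint:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
print(req.full_url)
```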
Serving FP8 Models in Hugging Face Hub
FriendliAI provides pre-converted FP8 models through Hugging Face Hub. You don't need to go through the conversion process for the models available in the Hub:
FriendliAI/Llama-2-7b-chat-hf-fp8
FriendliAI/Llama-2-13b-chat-hf-fp8
FriendliAI/Llama-2-70b-chat-hf-fp8
FriendliAI/Mistral-7B-Instruct-v0.2-fp8
FriendliAI/Meta-Llama-3-8B-fp8
FriendliAI/Meta-Llama-3-8B-Instruct-fp8
FriendliAI/Meta-Llama-3-70B-fp8
FriendliAI/Meta-Llama-3-70B-Instruct-fp8
Example: FriendliAI/Llama-2-13b-chat-hf-fp8
```sh
# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""   # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""            # GPUs (e.g., '"device=0,1"')

# $POLICY_DIR should contain the policy file generated by the policy search.
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name FriendliAI/Llama-2-13b-chat-hf-fp8 \
  --algo-policy-dir /policy
```