Efficient Model Inference with FP8

Introduction

FP8 is an 8-bit floating-point format that improves model inference efficiency with minimal loss of accuracy. FP8 offers a higher dynamic range than INT8, making it better suited for quantizing both weights and activations. The result is increased throughput and reduced latency while maintaining high output quality. This guide provides a comprehensive overview of how to serve FP8 models.
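
To get a feel for the dynamic-range difference, the optional check below prints the largest representable values of the E4M3 FP8 variant (the encoding used later in this guide) and INT8. It assumes the third-party ml_dtypes and numpy packages, which are not required anywhere else in this guide:

# Optional sanity check: compare FP8 (E4M3) and INT8 dynamic range.
# Assumes: pip install ml_dtypes numpy (not a requirement of this guide).
python - <<'EOF'
import numpy as np
from ml_dtypes import finfo, float8_e4m3fn

print("FP8 E4M3 max:", finfo(float8_e4m3fn).max)   # 448.0
print("INT8 max:", np.iinfo(np.int8).max)          # 127
EOF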

info

FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.

Prerequisites

info

If you'd like to use FP8 models that have already been pre-converted by FriendliAI, you can bypass these steps and jump straight to Serving FP8 Models in Hugging Face Hub.

Before converting your models to FP8, ensure you have the following:

  • Install the friendli-client package with the mllib extra, which is required to run the checkpoint conversion.

    pip install "friendli-client[mllib]"
  • If you want to convert and serve your own model in FP8 format, prepare the original model checkpoint (for example, by downloading it as shown below).
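
A typical way to obtain an original checkpoint is to download it from Hugging Face Hub with huggingface-cli (installed alongside the huggingface_hub package). The model name below is only a placeholder; substitute the model you plan to convert:

# Hypothetical example: fetch an original (non-quantized) checkpoint.
# Gated models additionally require `huggingface-cli login`.
huggingface-cli download meta-llama/Llama-2-13b-chat-hf --local-dir ./Llama-2-13b-chat-hf
export MODEL_NAME_OR_PATH=./Llama-2-13b-chat-hf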

Converting Models to FP8

Quantization Configuration

You can describe the configuration for FP8 quantization in a YAML file (e.g., quant_config.yaml). Here’s a simplified example of what this file might contain:

quant_config.yaml
mode: fp8
device: cuda
seed: 42
quant_dtype: fp8_e4m3
offload: true
calibration_dataset:
  path_or_name: abisee/cnn_dailymail:3.0.0
  format: json
  split: train
  lookup_column_name: article
  num_samples: 512
  max_length: 512

This configuration specifies the quantization mode, the device on which to run the quantization, and details of the calibration dataset used during quantization. To apply FP8 quantization, set the mode and quant_dtype fields to fp8 and fp8_e4m3, respectively. If you want to calibrate with a different dataset, replace the values under calibration_dataset. For more information about each field, refer to this guide.
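
If you do swap in your own calibration dataset, it is worth confirming that the column named in lookup_column_name exists and contains the text you expect. The optional check below uses the datasets package, which is an assumption rather than a requirement of friendli-client:

# Optional: inspect the calibration dataset and its lookup column.
# Assumes: pip install datasets
python - <<'EOF'
from datasets import load_dataset

ds = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")
print(ds.column_names)           # should include "article"
print(ds[0]["article"][:200])    # preview one calibration sample
EOF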

info

For now, we only support E4M3 (4-bit exponent and 3-bit mantissa) encoding format.

Running Conversion

The conversion process quantizes your existing model to the FP8 format. You run it by specifying the model's path, the desired output directory, and the quantization configuration. Below is an example command that illustrates this conversion process:

friendli model convert \
  --model-name-or-path $MODEL_NAME_OR_PATH \
  --output-dir $OUTPUT_DIR \
  --data-type fp16 \
  --output-ckpt-file-type safetensors \
  --quantize \
  --quant-config-file ./quant_config.yaml

When the model checkpoint is successfully converted to FP8 format, the following files will be created in $OUTPUT_DIR:

  • config.json
  • model.safetensors
  • special_tokens_map.json
  • tokenizer_config.json
  • tokenizer.json

info

If the size of the model exceeds 10GB, multiple sharded checkpoints are generated as follows instead of a single model.safetensors.

  • model-00001-of-00005.safetensors
  • model-00002-of-00005.safetensors
  • model-00003-of-00005.safetensors
  • model-00004-of-00005.safetensors
  • model-00005-of-00005.safetensors

For more information about each option of friendli model convert, refer to this guide.
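
A quick way to confirm that the conversion succeeded is to list the output directory and check for the files above:

# Sanity check: the converted checkpoint and tokenizer files should be present.
ls -lh "$OUTPUT_DIR"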

Search Optimal Policy

To serve FP8 models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.

Serving FP8 Models

Once you have prepared the FP8 model checkpoint and the policy file, you are ready to create a serving endpoint.

# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')

# Make sure you have run the policy search and that $POLICY_DIR contains the resulting policy file.
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v $OUTPUT_DIR:/model \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name /model \
  --algo-policy-dir /policy

This command serves the FP8 model on port 8000, making it accessible for inference requests.
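
Once the container is up, you can send a quick test request. The request shape below (an OpenAI-style completions payload on /v1/completions) is an assumption based on common serving setups; adjust it to the API your container version exposes:

# Hypothetical smoke test against the local endpoint.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello.", "max_tokens": 32}'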

Serving FP8 Models in Hugging Face Hub

FriendliAI provides pre-converted FP8 models through Hugging Face Hub. You don't need to go through the conversion process for the models available in the Hub:

Example: FriendliAI/Llama-2-13b-chat-hf-fp8
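
Since the command below mounts your local Hugging Face cache into the container, you can optionally pre-download the model so the container does not have to fetch it on startup. This uses the standard huggingface-cli tool; the container can also download the model by itself:

# Optional: warm the local Hugging Face cache before starting the container.
huggingface-cli download FriendliAI/Llama-2-13b-chat-hf-fp8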

# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')

# Make sure you have run the policy search and that $POLICY_DIR contains the resulting policy file.
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name FriendliAI/Llama-2-13b-chat-hf-fp8 \
  --algo-policy-dir /policy