Optimizing Inference with Policy Search
Introduction
For specialized cases, such as serving MoE models (e.g., Mixtral) or quantized models, inference performance can be further optimized through an execution policy search. This step is optional, but it is required to reach the full speed of Friendli Engine: running with the optimal policy can improve both throughput and latency by 1.5x to 2x. We therefore recommend skipping policy search for quick model testing, and running it before cost or latency analysis for a production service.
Policy search is effective only when serving (1) MoE models or (2) AWQ or FP8 quantized models. It has no effect otherwise.
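As a quick illustration, this rule can be sketched as a small shell helper that checks a model name for MoE or quantization markers. The name-matching heuristic is purely illustrative; a real deployment should check the model config and quantization scheme directly:

```shell
# Illustrative heuristic only: decide whether --search-policy is worth
# enabling, per the rule above (MoE models, or AWQ/FP8-quantized models).
needs_policy_search() {
  case "$1" in
    *Mixtral*|*MoE*|*AWQ*|*FP8*) return 0 ;;
    *) return 1 ;;
  esac
}

needs_policy_search "TheBloke/Llama-2-13B-chat-AWQ" && echo "search recommended"
needs_policy_search "meta-llama/Llama-2-13b-chat-hf" || echo "search unnecessary"
```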
Running Policy Search
You can run policy search by adding the following options to the launch command of Friendli Container.
Option | Type | Summary | Default |
---|---|---|---|
`--algo-policy-dir` | TEXT | Path to the directory where the searched optimal policy file is saved. | - |
`--search-policy` | BOOLEAN | Runs policy search to find the best Friendli execution policy for the given configuration (model type, GPU, NVIDIA driver version, quantization scheme, etc.). | - |
Example: Llama 2 13B Chat AWQ
For example, you can start the policy search for TheBloke/Llama-2-13B-chat-AWQ model as follows:
export HF_MODEL_NAME="TheBloke/Llama-2-13B-chat-AWQ"
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION='"device=0"'
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--algo-policy-dir /policy \
--search-policy true
Example: Mixtral 8x7B Instruct (TP=4)
export HF_MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1"
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION='"device=0,1,2,3"'
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--num-devices 4 \
--algo-policy-dir /policy \
--search-policy true
Once the policy search is complete, a policy file will be created in `$POLICY_DIR`.
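Before launching the serving endpoint, it can be worth verifying that the search actually produced a file in the mounted directory. A minimal sketch follows; the exact policy file name is engine-internal, so a placeholder name is used here:

```shell
# Returns success if the given directory contains at least one file.
has_policy() {
  [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

POLICY_DIR="$(mktemp -d)"                      # stand-in for $PWD/policy
has_policy "$POLICY_DIR" || echo "no policy file yet in $POLICY_DIR"
touch "$POLICY_DIR/policy.yaml"                # placeholder; real name may differ
has_policy "$POLICY_DIR" && echo "policy present"
```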
It can take up to two hours to find the optimal policy for the Llama 2 13B model on an NVIDIA A100 80GB GPU. Elapsed and estimated remaining time are printed to stderr while the policy search runs.
Starting Serving Endpoint
When the policy file is prepared, you can start serving your model more efficiently by adding --algo-policy-dir
option to the launch command of Friendli Container.
Example: Llama 2 13B Chat AWQ
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--algo-policy-dir /policy
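Once the endpoint is up, you can send an inference request to it. The snippet below shows the call as a comment and only validates the request body; the `/v1/completions` path assumes an OpenAI-compatible API, so adjust it to match your deployment:

```shell
# Hypothetical request body (field names assume an OpenAI-compatible
# completions API; verify against your engine's API reference).
REQUEST_BODY='{"prompt": "Hello, my name is", "max_tokens": 32}'

# When the container reports it is ready, the request would be sent as:
#   curl -X POST http://localhost:8000/v1/completions \
#        -H "Content-Type: application/json" -d "$REQUEST_BODY"

# Sanity-check that the body is valid JSON before sending it.
echo "$REQUEST_BODY" | python3 -m json.tool > /dev/null && echo "valid request body"
```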
Example: Mixtral 8x7B Instruct (TP=4)
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--num-devices 4 \
--algo-policy-dir /policy
FAQ: When to Run Policy Search Again?
The execution policy depends on the following factors:
- Model
- GPU
- GPU count and parallelism degree (the values of the `--num-devices` and `--num-workers` options)
- NVIDIA driver major version
- Friendli Container version
You should run policy search again whenever any of these factors changes in your serving setup.
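One lightweight way to detect such a change is to hash these factors into a fingerprint and store it alongside the policy directory; if the fingerprint differs at launch time, re-run the search. All of the values below are illustrative placeholders (in practice you would read the GPU name and driver version from `nvidia-smi`):

```shell
# Placeholder values for the factors the execution policy depends on.
MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"
GPU_NAME="NVIDIA-A100-80GB"      # e.g. from: nvidia-smi --query-gpu=name
NUM_DEVICES=4                    # value of --num-devices
NUM_WORKERS=1                    # value of --num-workers
DRIVER_MAJOR=535                 # e.g. from: nvidia-smi --query-gpu=driver_version
ENGINE_VERSION="trial"           # Friendli Container image tag

# Hash all factors into a single fingerprint for easy comparison.
FINGERPRINT=$(printf '%s|%s|%s|%s|%s|%s' \
  "$MODEL" "$GPU_NAME" "$NUM_DEVICES" "$NUM_WORKERS" \
  "$DRIVER_MAJOR" "$ENGINE_VERSION" | sha256sum | cut -d' ' -f1)
echo "$FINGERPRINT"
```

If the stored fingerprint no longer matches the one computed at startup, discard the old policy file and run the search again.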