Serving AWQ models

Introduction

This tutorial walks you through serving Activation-Aware Weight Quantized (AWQ) models with Friendli Container.

Activation-Aware Weight Quantization

Activation-Aware Weight Quantization (AWQ) is a technique that optimizes neural networks for efficiency without compromising accuracy. Unlike traditional weight-only quantization methods that treat all weights equally, AWQ uses activation statistics collected on calibration data to identify the small fraction of salient weight channels and protect them during quantization, which preserves accuracy even at low bit widths. To learn more about AWQ, refer to this article.
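To make the mechanism concrete, here is a minimal PyTorch sketch of the core idea. It is an illustration only, not Friendli's or AutoAWQ's actual implementation: real AWQ searches a per-layer scaling exponent instead of fixing a square root, and quantizes weights in groups (typically 128 channels) rather than per output row.

```python
import torch

def awq_style_quantize(weight: torch.Tensor, act_sample: torch.Tensor, n_bits: int = 4):
    """Conceptual sketch: protect salient weight channels using activation statistics.

    weight:     (out_features, in_features) linear-layer weight
    act_sample: (num_tokens, in_features) activations from calibration data
    """
    # 1. Measure per-input-channel activation magnitude on calibration data.
    act_scale = act_sample.abs().mean(dim=0)                  # (in_features,)

    # 2. Scale up channels that see large activations so they lose less
    #    precision. Real AWQ searches the exponent alpha in act_scale**alpha;
    #    a square root (alpha = 0.5) is used here for simplicity.
    s = act_scale.clamp(min=1e-5).sqrt()
    w_scaled = weight * s                                     # broadcast over rows

    # 3. Plain symmetric round-to-nearest quantization of the scaled weights.
    q_max = 2 ** (n_bits - 1) - 1
    step = w_scaled.abs().amax(dim=1, keepdim=True) / q_max   # per-row step size
    w_q = (w_scaled / step).round().clamp(-q_max - 1, q_max)

    # 4. Dequantize and fold the channel scales back out. At inference time
    #    the inverse scales are fused into the preceding operator instead.
    return w_q * step / s
```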

Prerequisites

  1. Prepare Friendli Container: Before you begin, make sure you have signed up for Friendli Suite and have access to Friendli Container. You can use Friendli Container free of charge for 60 days.
  2. Prepare a Quantized Model Checkpoint: Friendli Container supports serving models that have been quantized with AutoAWQ. To quantize a model yourself, follow the instructions in the AutoAWQ repository (a minimal sketch follows this list). Alternatively, you can serve one of the pre-quantized AWQ models available on the Hugging Face Hub (e.g., TheBloke/Llama-2-13B-chat-AWQ, casperhansen/mixtral-instruct-awq).
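
For reference, quantizing a checkpoint with AutoAWQ typically looks like the sketch below. The model and output paths are placeholders, and the API may change across AutoAWQ versions, so treat the AutoAWQ repository's instructions as authoritative.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-13b-chat-hf"  # base model to quantize (placeholder)
quant_path = "llama-2-13b-chat-awq"            # output directory (placeholder)

# 4-bit weights with a group size of 128 is the common AWQ configuration.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run calibration and quantize the weights.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized checkpoint and tokenizer for serving.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```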

Step 1. Searching for the Optimal Policy

To serve AWQ models efficiently, you must first run a policy search, which explores candidate execution policies and selects the optimal one for your model and GPU. Learn how to run the policy search at Running Policy Search.
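
As a rough illustration, a policy search is launched by running the container with search enabled, for example by wrapping docker run in Python as below. The image tag, flag names, and paths here are assumptions made for this sketch; use the exact invocation from the Running Policy Search guide.

```python
import subprocess

# NOTE: the image tag and engine flags below are illustrative assumptions;
# the Running Policy Search guide has the authoritative invocation.
subprocess.run(
    [
        "docker", "run", "--gpus", "all", "-p", "8000:8000",
        "-v", "/path/to/policy:/policy",            # where the policy file is written
        "-e", "FRIENDLI_CONTAINER_SECRET",          # forwarded from the host environment
        "registry.friendli.ai/trial",               # Friendli Container image (assumed tag)
        "--hf-model-name", "TheBloke/Llama-2-13B-chat-AWQ",
        "--algo-policy-dir", "/policy",
        "--search-policy", "true",                  # enable the policy search pass
    ],
    check=True,
)
```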

Step 2. Running Friendli Container

Once the policy search completes, the resulting policy is compiled into a policy file, which you supply when creating serving endpoints. Learn how to run Friendli Container with the policy file at Starting Serving Endpoint.
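
Once the endpoint is up, you can send it inference requests. The sketch below assumes the container is listening on localhost:8000 and exposes an OpenAI-compatible completions route; adjust the host, port, and path to match your deployment as described in the linked guide.

```python
import requests

# Assumes the endpoint from Step 2 is reachable at localhost:8000 with an
# OpenAI-compatible completions route; adjust to your actual deployment.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Explain activation-aware weight quantization in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```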

Evaluation

Refer to this article for the benchmark accuracy numbers of AWQ-quantized models served on our platform.