QuickStart: Friendli Container Trial

Introduction

Friendli Container enables you to efficiently deploy LLMs of your choice on your infrastructure. With Friendli Container, you can perform high-speed LLM inferencing in a secure and private environment.

This tutorial will guide you through the process of running a Friendli Container for your LLM.

Prerequisites

  • Hardware Requirements: Friendli Container currently targets only the x86_64 architecture and supports NVIDIA GPUs, so prepare suitable GPUs and a compatible driver by referring to our required CUDA compatibility guide.
  • Software Requirements: Your machine should be able to run containers with the NVIDIA Container Toolkit. In this tutorial, we will use Docker as the container runtime and make use of Docker Compose. A quick way to verify that Docker can see your GPUs is shown at the end of this section.
  • Model Compatibility: If your model is in the safetensors format, which is compatible with Hugging Face transformers, you can serve it directly with the Friendli Container. Please check our Model library for a non-exhaustive list of supported models.

This tutorial assumes that your model of choice is uploaded to Hugging Face and you have access to it. If the model is gated or private, you need to prepare a Hugging Face Access Token.
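Before proceeding, you may want to confirm that containers on your machine can actually access the GPUs. A common check is to run nvidia-smi inside a CUDA base image (the image tag below is only an example; any recent nvidia/cuda tag works):

# If this prints your GPU table, the driver and NVIDIA Container Toolkit are set up correctly.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi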

Getting Access to Friendli Container

Activate your Free Trial

  1. Sign up for Friendli Suite.
  2. In the 'Friendli Container' section, click the 'Start free trial' button.

Now you can use Friendli Containers free of charge during the trial period.

Get Access to the Container Registry

A PAT (Personal Access Token) is a user credential required to log in to our container registry.

  1. Go to User Settings > Tokens and click 'Create token'.
  2. Save the token you just created.

Prepare your Container Secret

The container secret is a credential used to activate Friendli Container. You must pass it as an environment variable when running the container image.

  1. Go to Container > Container Secrets and click 'Create secret'.
  2. Save the secret you just created.
info

🔑 Secret Rotation

You can rotate the container secret for security reasons. If you rotate the container secret, a new secret is created and the previous secret is automatically revoked after 30 minutes.

Running Friendli Container

Pull the Friendli Container Image

  1. Log in to the container registry using the email address of your Friendli Suite account and your PAT (Personal Access Token). A variant that reads the token from stdin is shown after step 2.

    export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS"
    export FRIENDLI_PAT="YOUR PERSONAL ACCESS TOKEN"
    docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_PAT
  2. Pull the image.

    docker pull registry.friendli.ai/trial
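If you prefer not to pass the token on the command line in step 1, docker login can also read it from stdin:

# Reads the PAT from stdin instead of exposing it in the shell history or process list.
echo "$FRIENDLI_PAT" | docker login registry.friendli.ai -u "$FRIENDLI_EMAIL" --password-stdin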

Run Friendli Container with a Hugging Face Model

  1. Clone our container resource git repository.

    git clone https://github.com/friendliai/container-resource
    cd container-resource/quickstart/docker-compose
  2. Set up environment variables.

    export HF_MODEL_NAME="<...>"  # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
    export FRIENDLI_CONTAINER_SECRET="<...>"  # Friendli container secret

    If your model is private or gated, you also need to provide a Hugging Face Access Token.

    export HF_TOKEN="<...>"  # Hugging Face Access Token
  3. Launch the Friendli Container.

    docker compose up -d
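You can confirm that the engine came up correctly by checking the service status and following its logs with standard Compose commands:

# List the services defined in docker-compose.yaml and their current state.
docker compose ps

# Stream the engine logs; press Ctrl+C to stop following.
docker compose logs -f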
info

By default, the container listens for inference requests on TCP port 8000, and a Grafana service is available on TCP port 3000. You can change these ports using the following environment variables. For example, to use TCP port 8001 for inference and port 3001 for Grafana, run the commands below.

export FRIENDLI_PORT="8001"
export FRIENDLI_GRAFANA_PORT="3001"
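Compose reads these variables when it creates the services, so set them before running docker compose up -d (or re-run it after changing them). With the example ports above, an inference request would then target port 8001; the /v1/completions endpoint is described under Send Inference Requests below:

docker compose up -d

curl -X POST http://127.0.0.1:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello", "max_tokens": 8}'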
info

Even if the machine has multiple GPUs, the container will use only one of them, specifically the first GPU (device_ids: ['0']). You can edit docker-compose.yaml to change which GPU device the container uses.
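After editing docker-compose.yaml, you can double-check which device will be reserved by rendering the resolved Compose configuration (the grep pattern is just a convenience):

# Print the resolved Compose configuration and show the GPU reservation lines.
docker compose config | grep -B 2 -A 2 "device_ids"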

info

The downloaded Hugging Face model is cached in the $HOME/.cache/huggingface directory. You may want to clean up this directory after completing this tutorial.
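For example, you can check how much space the cache occupies and remove it once you are done:

# Show the size of the Hugging Face cache directory.
du -sh "$HOME/.cache/huggingface"

# Delete the cache. Note that this removes everything cached by Hugging Face libraries, not just this model.
rm -rf "$HOME/.cache/huggingface"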

Send Inference Requests

You can now send inference requests to the running container. For information on all parameters that can be used in an inference request, please refer to this document.

curl -X POST http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "What makes a good leader?", "max_tokens": 30, "stream": true}'
info

Chat completion requests work only if the model's tokenizer config contains a chat_template.
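If your model does include a chat_template, a chat request looks like the following. The /v1/chat/completions path assumes the container exposes an OpenAI-compatible chat endpoint; refer to the parameter document linked above for the authoritative request format:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What makes a good leader?"}], "max_tokens": 30}'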

Monitor using Grafana

Using your browser, open http://127.0.0.1:3000/d/friendli-engine and log in with the username admin and password admin. You can then access dashboards showing useful engine metrics.

Grafana Dashboard

info

If you cannot open a browser directly in the GPU machine where the Friendli Container is running, you can use SSH to forward requests from the browser running on your PC to the GPU machine.

# Change these variables to match your environment.
LOCAL_GRAFANA_PORT=3000 # The port number on your PC.
FRIENDLI_GRAFANA_PORT=3000 # The port number on the GPU machine.

ssh "$GPU_MACHINE_ADDRESS" -L "$LOCAL_GRAFANA_PORT:127.0.0.1:$FRIENDLI_GRAFANA_PORT"

where $GPU_MACHINE_ADDRESS should be replaced with the address of the GPU machine. You may also want to use the -l login_name or -p port options to connect to the GPU machine over SSH.
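For example, with a hypothetical login name and SSH port (replace both with your own values):

# "ubuntu" and "2222" are placeholders; substitute your SSH user and port.
ssh -l ubuntu -p 2222 "$GPU_MACHINE_ADDRESS" -L "$LOCAL_GRAFANA_PORT:127.0.0.1:$FRIENDLI_GRAFANA_PORT"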

Then using your browser on the PC, open http://127.0.0.1:$LOCAL_GRAFANA_PORT/d/friendli-engine.

Going Further

Congratulations! You can now serve your LLM of choice on your own hardware, with the power of the most efficient LLM serving engine on the planet. The following topics will help you go further in your AI endeavors.

  • Multi-GPU Serving: Although this tutorial is limited to using only one GPU, Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference. Check Multi-GPU Serving for more information.
  • Serving Multi-LoRA Models: You can deploy multiple customized LLMs without additional GPU resources. See Serving Multi-LoRA Models to learn how to launch the container with your adapters.
  • Serving AWQ Models: Running quantized models requires an additional step of execution policy search. See Serving AWQ Models to learn how to create an inference endpoint for AWQ models.
  • Serving MoE Models: Running MoE (Mixture of Experts) models requires an additional step of execution policy search. See Serving MoE Models to learn how to create an inference endpoint for MoE models.

If you are stuck or need help going through this tutorial, please ask for support by sending an email to Support.