Inference with gRPC

This guide walks you through running a gRPC inference server with Friendli Container and interacting with it through the friendli-client SDK.

Prerequisites

Install friendli-client to use the gRPC client SDK:

pip install friendli-client
info

Ensure you have the friendli-client SDK version 1.4.1 or higher installed.
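
You can confirm the installed version by querying the package metadata; here is a minimal check using Python's standard importlib.metadata:

from importlib.metadata import version

# Print the installed friendli-client version; expect 1.4.1 or higher.
print(version("friendli-client"))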

Starting the Friendli Container with gRPC

You can run the Friendli Container as a gRPC server for completions by adding the --grpc true option to the launch command. The server supports response-streaming gRPC, and you can send requests using our friendli-client SDK. To start the Friendli Container with gRPC support, use the following command:

# Fill the values of following variables.
export HF_MODEL_NAME="" # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')

docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--grpc true \
[LAUNCH_OPTIONS]
note

You can change the port of the server with the --web-server-port launch option. If you do, adjust the -p port mapping in the docker run command to match.

Sending Requests with the Client SDK

Here is how to use the friendli-client SDK to interact with the gRPC server. This example assumes that the gRPC server is running on 0.0.0.0:8000.

from friendli import Friendli

client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)

stream = client.completions.create(
    prompt="Explain what gRPC is.",
    stream=True,  # Should be True
    top_k=1,
)

for chunk in stream:
    print(chunk.text, end="", flush=True)

Properly Closing the Client

By default, the library closes the underlying HTTP and gRPC connections when the client is garbage-collected. You can close the Friendli or AsyncFriendli client explicitly by calling its .close() method, or use a context manager to ensure the connection is closed when exiting a with block.

from friendli import Friendli

client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)

with client:
    stream = client.completions.create(
        prompt="Explain what gRPC is.",
        stream=True,  # Should be True
        top_k=1,
        min_tokens=10,
    )

    for chunk in stream:
        print(chunk.text, end="", flush=True)
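
For asynchronous applications, the AsyncFriendli client works the same way. A minimal sketch, assuming AsyncFriendli mirrors the synchronous API with async context management and async iteration (verify against your installed SDK version):

import asyncio

from friendli import AsyncFriendli

client = AsyncFriendli(base_url="0.0.0.0:8000", use_grpc=True)

async def main():
    # The async context manager closes the gRPC connection on exit.
    async with client:
        stream = await client.completions.create(
            prompt="Explain what gRPC is.",
            stream=True,  # Should be True
            top_k=1,
        )
        async for chunk in stream:
            print(chunk.text, end="", flush=True)

asyncio.run(main())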