Inference with gRPC
This guide walks you through running a gRPC inference server with Friendli Container and interacting with it through the friendli-client SDK.
Prerequisites
Install the friendli-client SDK to use the gRPC client:

```sh
pip install friendli-client
```

Ensure you have friendli-client version 1.4.1 or higher installed.
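To check which version is installed, you can query package metadata from Python. This is a generic sketch using the standard library, not a friendli-client API:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def client_version() -> Optional[str]:
    # Return the installed friendli-client version, or None if it is absent.
    try:
        return version("friendli-client")
    except PackageNotFoundError:
        return None

print(client_version())
```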
Starting the Friendli Container with gRPC
You can run the Friendli Container with a gRPC server for completions by adding the --grpc true option to the command arguments. The server supports response-streaming gRPC, and you can send requests with the friendli-client SDK.
To start the Friendli Container with gRPC support, use the following command:
```sh
# Fill in the values of the following variables.
export HF_MODEL_NAME=""              # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""   # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""            # GPUs (e.g., '"device=0,1"')

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name $HF_MODEL_NAME \
  --grpc true \
  [LAUNCH_OPTIONS]
```
You can change the port of the server with the --web-server-port argument.
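For example, assuming the same variables as above, a sketch that serves on port 8080 instead (note that the -p port mapping changes to match):

```sh
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name $HF_MODEL_NAME \
  --grpc true \
  --web-server-port 8080
```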
Sending Requests with the Client SDK
Here is how to use the friendli-client SDK to interact with the gRPC server. This example assumes that the gRPC server is running on 0.0.0.0:8000.
- Default
- Async
```python
from friendli import Friendli

client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)

stream = client.completions.create(
    prompt="Explain what gRPC is.",
    stream=True,  # Should be True
    top_k=1,
)

for chunk in stream:
    print(chunk.text, end="", flush=True)
```
For asynchronous operations, use the following code snippet:
```python
import asyncio
from friendli import AsyncFriendli

client = AsyncFriendli(base_url="0.0.0.0:8000", use_grpc=True)

async def run():
    stream = await client.completions.create(
        prompt="Explain what gRPC is.",
        stream=True,  # Should be True
        top_k=1,
    )
    async for chunk in stream:
        print(chunk.text, end="", flush=True)

asyncio.run(run())
```
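Each streamed chunk exposes a text field, as the loops above show. If you need the full completion rather than incremental output, you can accumulate the chunks into one string. A minimal self-contained sketch of that pattern, where the Chunk stub stands in for the objects a live stream would yield:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Stand-in for a streamed response chunk carrying a piece of text.
    text: str

def collect(stream) -> str:
    # Concatenate the text of every streamed chunk into the full completion.
    return "".join(chunk.text for chunk in stream)

# Simulated stream in place of a live gRPC server.
fake_stream = iter([Chunk("gRPC is "), Chunk("an RPC framework.")])
print(collect(fake_stream))  # gRPC is an RPC framework.
```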
Properly Closing the Client
By default, the library closes the underlying HTTP and gRPC connections when the client is garbage-collected. You can manually close the Friendli or AsyncFriendli client with the .close() method, or use a context manager to ensure the connection is closed when exiting a with block.
- Default
- Async
```python
from friendli import Friendli

client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)

with client:
    stream = client.completions.create(
        prompt="Explain what gRPC is.",
        stream=True,  # Should be True
        top_k=1,
        min_tokens=10,
    )
    for chunk in stream:
        print(chunk.text, end="", flush=True)
```
```python
import asyncio
from friendli import AsyncFriendli

client = AsyncFriendli(base_url="0.0.0.0:8000", use_grpc=True)

async def run():
    async with client:
        stream = await client.completions.create(
            prompt="Explain what gRPC is.",
            stream=True,  # Should be True
            top_k=1,
        )
        async for chunk in stream:
            print(chunk.text, end="", flush=True)

asyncio.run(run())
```