Inference with gRPC
This guide walks you through running a gRPC inference server with Friendli Container and interacting with it through the friendli-client SDK.
Prerequisites
Install the friendli-client SDK to use the gRPC client:

```sh
pip install friendli-client
```

Ensure you have friendli-client version 1.4.1 or higher installed.
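To check which version is installed, you can query package metadata from Python. This is a generic sketch using the standard library, not a friendli-client API:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def client_version() -> Optional[str]:
    # Return the installed friendli-client version, or None if it is absent.
    try:
        return version("friendli-client")
    except PackageNotFoundError:
        return None

print(client_version())
```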
Starting the Friendli Container with gRPC
You can run the Friendli Container with a gRPC server for completions by adding the --grpc true option to the command arguments. The server supports response-streaming gRPC, and you can send requests with the friendli-client SDK.
To start the Friendli Container with gRPC support, use the following command:
```sh
# Fill in the values of the following variables.
export HF_MODEL_NAME=""              # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""   # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""            # GPUs (e.g., '"device=0,1"')

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name $HF_MODEL_NAME \
  --grpc true \
  [LAUNCH_OPTIONS]
```
You can change the port of the server with the --web-server-port argument.
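For example, assuming the same variables as above, a sketch that serves on port 8080 instead (note that the -p port mapping changes to match):

```sh
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name $HF_MODEL_NAME \
  --grpc true \
  --web-server-port 8080
```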
Sending Requests with the Client SDK
Here is how to use the friendli-client SDK to interact with the gRPC server. This example assumes that the gRPC server is running on 0.0.0.0:8000.
- Default
- Async
```python
from friendli import Friendli

client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)

stream = client.completions.create(
    prompt="Explain what gRPC is.",
    stream=True,  # Should be True
    top_k=1,
)

for chunk in stream:
    print(chunk.text, end="", flush=True)
```
For asynchronous operations, use the following code snippet:
```python
import asyncio
from friendli import AsyncFriendli

client = AsyncFriendli(base_url="0.0.0.0:8000", use_grpc=True)

async def run():
    stream = await client.completions.create(
        prompt="Explain what gRPC is.",
        stream=True,  # Should be True
        top_k=1,
    )
    async for chunk in stream:
        print(chunk.text, end="", flush=True)

asyncio.run(run())
```
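Each streamed chunk exposes a text field, as the loops above show. If you need the full completion rather than incremental output, you can accumulate the chunks into one string. A minimal self-contained sketch of that pattern, where the Chunk stub stands in for the objects a live stream would yield:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Stand-in for a streamed response chunk carrying a piece of text.
    text: str

def collect(stream) -> str:
    # Concatenate the text of every streamed chunk into the full completion.
    return "".join(chunk.text for chunk in stream)

# Simulated stream in place of a live gRPC server.
fake_stream = iter([Chunk("gRPC is "), Chunk("an RPC framework.")])
print(collect(fake_stream))  # gRPC is an RPC framework.
```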
Properly Closing the Client
By default, the library closes the underlying HTTP and gRPC connections when the client is garbage-collected. You can manually close the Friendli or AsyncFriendli client with the .close() method, or use a context manager to ensure the connection is closed when exiting a with block.
- Default
- Async
```python
from friendli import Friendli

client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)

with client:
    stream = client.completions.create(
        prompt="Explain what gRPC is.",
        stream=True,  # Should be True
        top_k=1,
        min_tokens=10,
    )
    for chunk in stream:
        print(chunk.text, end="", flush=True)
```
```python
import asyncio
from friendli import AsyncFriendli

client = AsyncFriendli(base_url="0.0.0.0:8000", use_grpc=True)

async def run():
    async with client:
        stream = await client.completions.create(
            prompt="Explain what gRPC is.",
            stream=True,  # Should be True
            top_k=1,
        )
        async for chunk in stream:
            print(chunk.text, end="", flush=True)

asyncio.run(run())
```