rules

A shared tag with AI prompts and code snippets

From workspace: Nvidia

Team: NVIDIA NIM

Total snippets: 13


Tool Calling

The Llama Stack API supports tool calling, allowing the model to interact with external functions. Unlike the OpenAI API, the Llama Stack API only supports the tool choices "auto", "required", or None.

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage, ToolDefinition, ToolParamDefinition
from llama_models.llama3.api.datatypes import SamplingParams, ...
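The snippet above is truncated. As a rough sketch of what a tool-enabled request can look like, assuming ToolDefinition exposes tool_name/description/parameters fields, ToolParamDefinition exposes param_type/description/required fields, and the user-defined InferenceClient from inference.py exposes a chat_completion(request) method (the server address, model name, and tool are placeholders):

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import (
    ChatCompletionRequest, UserMessage, ToolDefinition, ToolParamDefinition,
)
from llama_models.llama3.api.datatypes import SamplingParams

def chat_with_tools():
    # Hypothetical address; point this at your running inference server.
    client = InferenceClient("http://localhost:5000")

    # Assumed field names; check the ToolDefinition / ToolParamDefinition
    # classes shipped with your llama_toolchain version.
    weather_tool = ToolDefinition(
        tool_name="get_weather",
        description="Look up the current weather for a city",
        parameters={
            "city": ToolParamDefinition(
                param_type="str",
                description="City name, e.g. 'Portland'",
                required=True,
            ),
        },
    )

    request = ChatCompletionRequest(
        model="meta/llama3-8b-instruct",      # placeholder model name
        messages=[UserMessage(content="What is the weather in Portland?")],
        sampling_params=SamplingParams(),
        tools=[weather_tool],
        tool_choice="auto",                   # "auto", "required", or omit (None)
        stream=False,
    )
    # process_chat_completion is assumed to handle/print the response.
    process_chat_completion(client.chat_completion(request))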

Streaming Responses

For streaming responses, use the same structure:

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def stream_chat():
    client = ...
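Since the snippet is truncated, here is a hedged sketch of the streaming loop, again assuming the user-defined InferenceClient exposes a chat_completion(request) method that yields chunks when stream=True (address and model name are placeholders):

from inference import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def stream_chat():
    client = InferenceClient("http://localhost:5000")   # hypothetical address
    request = ChatCompletionRequest(
        model="meta/llama3-8b-instruct",                 # placeholder model name
        messages=[UserMessage(content="Tell me a short story.")],
        sampling_params=SamplingParams(),
        stream=True,                                     # the only change vs. the basic example
    )
    # With stream=True the client is assumed to yield response chunks
    # rather than returning a single response object.
    for chunk in client.chat_completion(request):
        print(chunk, end="", flush=True)

stream_chat()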

Basic Usage

Use these common components in the following basic usage example:

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def chat():
    client = ...
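The example above is cut off; a minimal hedged sketch of the non-streaming call, under the same assumptions about the InferenceClient interface (address and model name are placeholders):

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def chat():
    # Hypothetical server address and model name; adjust for your deployment.
    client = InferenceClient("http://localhost:5000")
    request = ChatCompletionRequest(
        model="meta/llama3-8b-instruct",
        messages=[UserMessage(content="Write a haiku about GPUs.")],
        sampling_params=SamplingParams(),
        stream=False,
    )
    # process_chat_completion is assumed to print or return the final text.
    process_chat_completion(client.chat_completion(request))

chat()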

Common Components

The following example stores common components in the file inference.py. This file contains the InferenceClient class and utility functions that are used across different examples. Here’s the content of inference.py:

import json
from typing import Union, Generator
import requests
from llama_toolchain.apis.inference import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionResponseStreamChunk
)

class InferenceClient:
    def ...
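The file is truncated above. As a rough idea of the shape such a client can take (not the exact NVIDIA example; the endpoint path, pydantic-style serialization, and response handling are all assumptions), a minimal version might look like:

import json
from typing import Union, Generator

import requests
from llama_toolchain.apis.inference import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionResponseStreamChunk,
)


class InferenceClient:
    """Minimal sketch of a chat-completion client (assumed endpoint path)."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def chat_completion(
        self, request: ChatCompletionRequest
    ) -> Union[ChatCompletionResponse, Generator[ChatCompletionResponseStreamChunk, None, None]]:
        # The /inference/chat_completion route is an assumption; use whatever
        # route your inference server actually exposes.
        url = f"{self.base_url}/inference/chat_completion"
        if request.stream:
            return self._stream(url, request)
        # Assumes pydantic v1-style .json() serialization on the request model.
        resp = requests.post(url, json=json.loads(request.json()))
        resp.raise_for_status()
        return ChatCompletionResponse(**resp.json())

    def _stream(self, url: str, request: ChatCompletionRequest):
        with requests.post(url, json=json.loads(request.json()), stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:
                    yield ChatCompletionResponseStreamChunk(**json.loads(line))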

ContextVar (_decoding_state)

This function is called once at the beginning, before the (chat) completion is generated. It is the responsibility of the user code to maintain and fetch the state thereafter. For example, the state may be set in the form of a Python ContextVar:

from contextvars import ContextVar
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from typing import Union

my_context_var = ContextVar("my_state")

async def set_custom_guided_decoding_parameters(request: ...
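A small self-contained illustration of the ContextVar pattern itself (standard library only): a value set early in a request's context is visible to later code running in that same context, while concurrent requests keep their own copies.

import asyncio
from contextvars import ContextVar

my_context_var: ContextVar[str] = ContextVar("my_state", default="unset")


async def set_state(value: str) -> None:
    # Called once at the start of a (chat) completion.
    my_context_var.set(value)


async def read_state() -> str:
    # Later code running in the same context sees the value.
    return my_context_var.get()


async def handle_request(value: str) -> str:
    await set_state(value)
    return await read_state()


async def main():
    # Each task created by gather gets its own copy of the context,
    # so concurrent requests do not clobber each other's state.
    results = await asyncio.gather(
        handle_request("request-a"),
        handle_request("request-b"),
    )
    print(results)  # ['request-a', 'request-b']


asyncio.run(main())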

Maintaining custom state

NIM can store and maintain custom state for use in a custom backend. To enable this, define a function set_custom_guided_decoding_parameters in backend.py:

from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from typing import Union

async def set_custom_guided_decoding_parameters(request: Union[ChatCompletionRequest, CompletionRequest]) -> None:
    # Set state
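A hedged completion of the stub above, storing a value from the incoming request in a ContextVar (as in the previous example) so it can be read later within the same request; what you store, and where you read it back, is up to your backend.

from contextvars import ContextVar
from typing import Union

from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest

my_context_var: ContextVar[dict] = ContextVar("my_state")


async def set_custom_guided_decoding_parameters(
    request: Union[ChatCompletionRequest, CompletionRequest]
) -> None:
    # Set state: stash whatever the rest of the backend needs for this request.
    # Storing the model name here is purely illustrative.
    my_context_var.set({"model": request.model})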

Statement level postprocessing

NIM also supports statement-level postprocessing for the custom backend. To enable this, define a function get_guided_decoding_constrained_generator with exactly the following signature; it updates the final response generator.

from vllm.entrypoints.openai.protocol import ChatCompletionResponse, CompletionResponse, ErrorResponse
from fastapi import Request
from typing import Union, AsyncGenerator

async def get_guided_decoding_constrained_generator(
    response: ...
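The full argument list is elided above and must be copied exactly from the NIM documentation. Purely to illustrate the underlying idea of wrapping the response stream, a hypothetical async-generator helper that rewrites each chunk of text might look like this (the name and arguments here are illustrative, not the NIM signature):

from typing import AsyncGenerator


async def rewrite_stream(
    source: AsyncGenerator[str, None],
) -> AsyncGenerator[str, None]:
    # Hypothetical helper: consume the original response generator and
    # yield postprocessed chunks, e.g. redacting a marker string.
    async for chunk in source:
        yield chunk.replace("SECRET", "[redacted]")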

Custom decoding test

Note that the function name, argument names, and argument type hints must exactly match the signatures shown. The following example backend.py file contains a custom logits processor that only outputs the response string Custom decoding test.

from typing import List, Optional, Union
import torch
from transformers import PreTrainedTokenizer
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from vllm.sampling_params import LogitsProcessor

RESPONSE = ...
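The file is truncated above. As a sketch of what such a forced-output logits processor can look like (the class name is illustrative, and the wiring back into set_custom_guided_decoding_parameters is omitted):

from typing import List

import torch
from transformers import PreTrainedTokenizer

RESPONSE = "Custom decoding test"


class ForcedResponseLogitsProcessor:
    """Force the model to emit RESPONSE token by token, then stop."""

    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.forced_ids: List[int] = tokenizer.encode(RESPONSE, add_special_tokens=False)
        self.eos_id: int = tokenizer.eos_token_id

    def __call__(self, token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
        # vLLM calls a logits processor with the output tokens generated so far
        # and the raw logits for the next position; return modified logits.
        idx = len(token_ids)
        target = self.forced_ids[idx] if idx < len(self.forced_ids) else self.eos_id
        # Mask out every token except the forced one (or EOS once finished).
        mask = torch.full_like(logits, float("-inf"))
        mask[target] = 0.0
        return logits + mask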

Custom guided decoding backend specifications

To launch the custom guided decoding backend, you must provide the name of a directory that contains a single backend.py file plus any Python wheel (*.whl) files required as additional dependencies, including transitive dependencies, that are not already included in NIM. The directory structure should look something like the following:

custom_backends/my-custom-backend
|___ backend.py
|___ my_dep_-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Specifying a custom backend at runtime

NIM_GUIDED_DECODING_BACKEND sets the default backend. You can also specify the guided decoding backend per request by setting backend=my-custom-backend, where my-custom-backend is the name of a subdirectory in NIM_CUSTOM_GUIDED_DECODING_BACKENDS that holds a custom backend definition, as outlined in the backend specifications section. Example query:

{ "model": "my-model", "prompt": "My prompt", "top_p": 1, "n": 1, "frequency_penalty": 1.0, "stream": false, "max_tokens": 15, "backend": "my-custom-backend" }

Loading the custom guided decoding backends

The custom guided decoding backend directory needs to be mounted into the container at runtime. In addition, the following environment variables need to be set:

NIM_TRUST_CUSTOM_CODE=1
NIM_CUSTOM_GUIDED_DECODING_BACKENDS=/path/to/mounted/custom/backends/directory
NIM_GUIDED_DECODING_BACKEND=/name/of/subdirectory/in/NIM_CUSTOM_GUIDED_DECODING_BACKENDS

Requests will use NIM_GUIDED_DECODING_BACKEND by default. To launch the container, use the following command:

docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -v /local/path/to/custom/backends:/custom-backends \
    -u $(id -u) \
    ...

Launch NIM

Example Docker Command

The following command requires that you have stored a temporary AWS token in /home/usr/.aws and stored your CA certificate in /etc.

docker run --rm --runtime=nvidia --gpus=all -p 8000:8000 \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v /home/usr/.aws:/tmp/.aws \
    -e AWS_SHARED_CREDENTIALS_FILE=/tmp/.aws/credentials \
    -e AWS_PROFILE=default \
    -e AWS_REGION=us-east-1 \
    ...

When to use

In scenarios where more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, implementing a key-value cache could substantially improve inference speed. This approach leverages a high degree of...
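For example, in this hedged sketch against the OpenAI-compatible endpoint from the launch example, two requests share a long instruction block and differ only in the last line; this is exactly the case where caching the key-value states of the shared prefix pays off, since that prefix only has to be prefilled once.

import requests

# A long, fixed preamble shared verbatim by every request (the cacheable part).
SHARED_PREFIX = (
    "You are a support assistant for Acme Inc. Follow the policy below...\n"
    # ...imagine several thousand tokens of identical policy text here...
)

def ask(question: str) -> str:
    # Only `question` changes between calls; the prefix is identical, so a
    # key-value/prefix cache can reuse its computation across requests.
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={"model": "my-model", "prompt": SHARED_PREFIX + question, "max_tokens": 64},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(ask("How do I reset my password?"))
print(ask("How do I close my account?"))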