A shared folder with AI prompts and code snippets
From workspace: Nvidia
Team: NVIDIA NIM
Total snippets: 4
The Llama Stack API supports tool calling, allowing the model to interact with external functions. Unlike the OpenAI API, the Llama Stack API only supports the tool choices "auto", "required", or None.
from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage, ToolDefinition, ToolParamDefinition
from llama_models.llama3.api.datatypes import SamplingParams,...
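A minimal sketch of how a complete tool-calling request might look. The endpoint URL, model id, and the chat_completion method name are illustrative assumptions (inference.py is shown only in truncated form below), and the ToolDefinition/ToolParamDefinition field names follow the llama_toolchain API:

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import (
    ChatCompletionRequest, UserMessage, ToolDefinition, ToolParamDefinition
)
from llama_models.llama3.api.datatypes import SamplingParams

def chat_with_tools():
    # Assumed constructor argument; the real signature lives in inference.py.
    client = InferenceClient("http://localhost:8000")

    # Describe one external function the model is allowed to call.
    get_weather = ToolDefinition(
        tool_name="get_weather",
        description="Look up the current weather for a city",
        parameters={
            "city": ToolParamDefinition(
                param_type="str",
                description="Name of the city",
                required=True,
            )
        },
    )

    request = ChatCompletionRequest(
        model="meta/llama-3.1-8b-instruct",  # illustrative model id
        messages=[UserMessage(content="What is the weather in Berlin?")],
        sampling_params=SamplingParams(max_tokens=256),
        tools=[get_weather],
        tool_choice="auto",  # one of "auto", "required", or None
        stream=False,
    )
    process_chat_completion(client.chat_completion(request))

if __name__ == "__main__":
    chat_with_tools()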
For streaming responses, use the same structure:
from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def stream_chat():
    client =...
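A hedged sketch of how stream_chat might be completed, assuming the same InferenceClient and that process_chat_completion can handle individual stream chunks; the endpoint, model id, and method name are assumptions:

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def stream_chat():
    client = InferenceClient("http://localhost:8000")  # assumed endpoint

    request = ChatCompletionRequest(
        model="meta/llama-3.1-8b-instruct",  # illustrative model id
        messages=[UserMessage(content="Write a short poem about GPUs.")],
        sampling_params=SamplingParams(max_tokens=128),
        stream=True,  # ask for incremental chunks instead of one response
    )

    # With stream=True the client is assumed to return an iterator of
    # ChatCompletionResponseStreamChunk objects.
    for chunk in client.chat_completion(request):
        process_chat_completion(chunk)

if __name__ == "__main__":
    stream_chat()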
The following basic usage example uses the common components from inference.py (shown at the end of this section):
from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def chat():
    client =...
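How the non-streaming chat function might be completed, again treating the endpoint, model id, and chat_completion method as assumptions:

from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams

def chat():
    client = InferenceClient("http://localhost:8000")  # assumed endpoint

    request = ChatCompletionRequest(
        model="meta/llama-3.1-8b-instruct",  # illustrative model id
        messages=[UserMessage(content="Explain NVIDIA NIM in one sentence.")],
        sampling_params=SamplingParams(max_tokens=256),
        stream=False,
    )

    # process_chat_completion is assumed to extract and print the reply.
    process_chat_completion(client.chat_completion(request))

if __name__ == "__main__":
    chat()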
The common components used by the examples above are stored in the file inference.py. This file contains the InferenceClient class and utility functions shared across the different examples. Here's the content of inference.py:
import json
from typing import Union, Generator

import requests

from llama_toolchain.apis.inference import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionResponseStreamChunk
)

class InferenceClient:
    def...
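A hedged reconstruction of the truncated class: a thin HTTP wrapper that POSTs the request to the server and either returns one parsed response or yields parsed stream chunks. The /inference/chat_completion route, the payload shape, the line-delimited streaming format, and the process_chat_completion helper are assumptions, not the documented wire protocol:

import json
from typing import Union, Generator

import requests

from llama_toolchain.apis.inference import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionResponseStreamChunk
)

class InferenceClient:
    def __init__(self, base_url: str):
        # Base URL of the inference server, e.g. "http://localhost:8000".
        self.base_url = base_url

    def chat_completion(
        self, request: ChatCompletionRequest
    ) -> Union[ChatCompletionResponse, Generator]:
        response = requests.post(
            f"{self.base_url}/inference/chat_completion",  # assumed route
            json=json.loads(request.json()),  # assumes a pydantic model
            stream=request.stream,
        )
        response.raise_for_status()
        if request.stream:
            # Assumed line-delimited JSON: one chunk per non-empty line.
            return (
                ChatCompletionResponseStreamChunk(**json.loads(line))
                for line in response.iter_lines()
                if line
            )
        return ChatCompletionResponse(**response.json())

def process_chat_completion(result) -> None:
    # Minimal assumed helper: print a full response or a single chunk.
    print(result)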