KV Cache Reuse with NVIDIA NIM for LLMs

Launch NIM

Example Docker Command

The following command requires that you have stored a temporary AWS token in /home/usr/.aws and your CA certificate in /etc.

```bash
docker run --rm --runtime=nvidia --gpus=all -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v /home/usr/.aws:/tmp/.aws \
  -e AWS_SHARED_CREDENTIALS_FILE=/tmp/.aws/credentials \
  -e AWS_PROFILE=default \
  -e AWS_REGION=us-east-1 \
  ...
```
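The command above is truncated. As a minimal sketch of what a complete invocation might look like: the `NIM_ENABLE_KV_CACHE_REUSE=1` environment variable and the container image tag below are assumptions based on NVIDIA's NIM documentation pattern, so verify both against the docs for the NIM version you run.

```bash
# Sketch of a full launch command. NIM_ENABLE_KV_CACHE_REUSE=1 is assumed to
# enable KV cache (prefix) reuse; the image tag is a placeholder -- check
# both against your NIM documentation before use.
export NGC_API_KEY="<your-ngc-api-key>"

docker run --rm --runtime=nvidia --gpus=all -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v /home/usr/.aws:/tmp/.aws \
  -e AWS_SHARED_CREDENTIALS_FILE=/tmp/.aws/credentials \
  -e AWS_PROFILE=default \
  -e AWS_REGION=us-east-1 \
  -e NIM_ENABLE_KV_CACHE_REUSE=1 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```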

Script to demonstrate the speedup

```python
import time
import requests
import json

# Define your model endpoint URL
API_URL = "http://0.0.0.0:8000/v1/chat/completions"

# Function to send a request to the API and return the response time
def send_request(model, messages, max_tokens=15):
    ...
```
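The snippet is cut off above. The following is a self-contained sketch of the same measurement under stated assumptions: the model name is a placeholder for whatever your NIM instance serves, and the shared prefix is synthetic. It times a first (cold) request, then a second request that repeats most of the prompt, which is where KV cache reuse should show up as a shorter time to first byte.

```python
import time
import requests

# Endpoint exposed by the NIM container started above
API_URL = "http://0.0.0.0:8000/v1/chat/completions"
MODEL = "meta/llama-3.1-8b-instruct"  # placeholder; use the model your NIM serves

def send_request(model, messages, max_tokens=15):
    """Send one chat completion request and return the elapsed wall-clock time."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    start = time.time()
    response = requests.post(API_URL, json=payload)
    response.raise_for_status()
    return time.time() - start

# A long shared prefix (e.g. a system prompt) followed by a short, varying question.
shared_prefix = "You are a helpful assistant. " + "Here is some reference text. " * 200

first = send_request(MODEL, [
    {"role": "system", "content": shared_prefix},
    {"role": "user", "content": "Summarize the reference text in one sentence."},
])

second = send_request(MODEL, [
    {"role": "system", "content": shared_prefix},
    {"role": "user", "content": "What is the main topic of the reference text?"},
])

print(f"First request (cold cache):  {first:.3f}s")
print(f"Second request (warm cache): {second:.3f}s")
```

The second request should complete noticeably faster when KV cache reuse is enabled, because the server only needs to run prefill on the short user turn rather than the entire prompt.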

When to use

In scenarios where more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, enabling KV cache reuse can substantially improve inference speed. This approach exploits the high degree of overlap between prompts: the key-value pairs already computed for the shared prefix are reused instead of being recomputed on every request. For example, a chatbot that prepends the same 3,000-token system prompt to every query only needs to run prefill on the short user turn once the prefix is cached.
