A shared folder with AI prompts and code snippets
From workspace: Nvidia
Team: Main
Total snippets: 3
Streams token-by-token responses and logs timing and throughput
import time
import random

start_time = time.time()
tokens_generated = 0
for val in client.stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)
total_time = time.time() - start_time
print(f"\n--- Generated...
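The truncated summary line presumably reports token count and throughput. A minimal, self-contained sketch of the same timing loop, using a stand-in generator in place of the real `client.stream` (which comes from the Triton client snippet below):

```python
import time

def fake_stream(prompt):
    # Stand-in for client.stream(prompt): yields tokens one at a time.
    for tok in ["Hello", ", ", "world", "!"]:
        yield tok

start_time = time.time()
tokens_generated = 0
pieces = []
for val in fake_stream("Say hello"):
    tokens_generated += 1
    pieces.append(val)
total_time = time.time() - start_time

# Guard against a zero elapsed time on very fast runs.
throughput = tokens_generated / total_time if total_time > 0 else float("inf")
output = "".join(pieces)
```

The per-token `flush=True` in the original matters for interactive use: without it, stdout buffering can hold back partial lines until the stream finishes.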
Creates the streaming inference client for the Triton TRT-LLM endpoint
from langchain_nvidia_trt_llms import TritonTensorRTLLM

triton_url = "llm:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
    ...
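The snippet is cut off, so the remaining keys are unknown. A runnable sketch of just the parameters shown above; the commented-out constructor call is an assumption about how `TritonTensorRTLLM` consumes the dict and is not confirmed by the snippet:

```python
triton_url = "llm:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",   # Triton "ensemble" model combining tokenizer + TRT-LLM engine
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
}

# top_k=1 with beam_width=1 is effectively greedy decoding:
# the single highest-probability token is chosen at every step.
assert pload['top_k'] == 1 and pload['beam_width'] == 1

# Assumed usage (requires langchain_nvidia_trt_llms):
# client = TritonTensorRTLLM(**pload)
```

Port 8001 is Triton's default gRPC port, which is what the streaming client typically talks to (8000 serves HTTP).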
Builds the full LLM prompt using system message, context, and question
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as...
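The system prompt text is truncated above. A sketch of how the template would be filled with `str.format`; the context and question strings here are illustrative stand-ins, not part of the original snippet:

```python
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)

# Stand-in values for illustration only.
prompt = LLAMA_PROMPT_TEMPLATE.format(
    system_prompt="You are a helpful, respectful and honest assistant.",
    context="Triton serves the ensemble model at llm:8001.",
    question="Which endpoint serves the model?",
)
```

The `<s>`/`</s>` and `[INST]`/`<<SYS>>` markers follow the Llama 2 chat format; because the template uses literal `{}` placeholders, any braces inside the context or question would need escaping (`{{`, `}}`) before formatting.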