A shared folder with AI prompts and code snippets
From workspace: Nvidia
Team: Main
Total snippets: 8
Demonstrates how to query the simplified RAG chain over the Triton documentation with multiple sequential questions, showcasing contextual follow-up.
# Question 1: What is Triton?
query = "What is Triton?"
result = qa({"question": query})
print(result.get("answer"))

# Question 2: Ask about ONNX support
query = "Does Triton support ONNX?"
result = qa({"question": query})
print(result.get("answer"))
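If the qa chain was built with the ConversationBufferMemory shown in the later snippets, the accumulated chat history can be inspected between turns to confirm that follow-up questions are answered in context. This is a hedged illustration, not part of the original snippet, and it assumes the chain carries a memory object:

# Hypothetical check: assumes qa was constructed with
# ConversationBufferMemory(memory_key="chat_history", return_messages=True).
for message in qa.memory.chat_memory.messages:
    print(f"{message.type}: {message.content}")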
Demonstrates a simpler retrieval-augmented generation (RAG) chain that uses a single LLM (LLaMA2 70B) both to condense follow-up questions and to generate answers over the retrieved documents.
llm = ChatNVIDIA(model="llama2_70b", temperature=0.1, max_tokens=1000, top_p=1.0) qa_prompt = QA_PROMPT doc_chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_PROMPT) qa = ConversationalRetrievalChain.from_llm( llm=llm, ...
Follow up with a conversational question referencing prior context.
query = "But why?" result = qa({"question": query}) print(result.get("answer"))
Ask a more specific technical question about what interfaces Triton supports.
query = "What interfaces does Triton support?" result = qa({"question": query}) print(result.get("answer"))
Ask a general question about what Triton is.
query = "What is Triton?" result = qa({"question": query}) print(result.get("answer"))
Build a ConversationalRetrievalChain that uses two LLMs: Llama2 to condense follow-up questions and Mixtral (8x7B) to generate answers over the retrieved documents.
llm = ChatNVIDIA(model="llama2_70b") memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True) question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT) chat = ChatNVIDIA(model="mixtral_8x7b",...
Load vector data from a local FAISS store using the NVIDIA embedding model.
# Load the previously embedded documents from the local FAISS store
embedding_path = "embed/"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model)
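Depending on the installed LangChain version, FAISS.load_local may also require an explicit opt-in to pickle deserialization; if your version raises an error on the call above, the variant below should work (only load stores you created yourself):

docsearch = FAISS.load_local(
    folder_path=embedding_path,
    embeddings=embedding_model,
    allow_dangerous_deserialization=True,  # required by newer langchain-community releases
)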
Create the NVIDIAEmbeddings model for use in loading or generating vector data for RAG.
# Build the FAISS store (helper defined elsewhere in this workspace)
create_embeddings()
# Embedding model used when loading or querying the store
embedding_model = NVIDIAEmbeddings(model="nvolveqa_40k")
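create_embeddings() is a workspace helper rather than a library function, so its body is not shown here. A minimal sketch of what such a helper could do, assuming the documents have already been split into text chunks (the texts argument and dest_embed_dir default are hypothetical names):

from langchain_community.vectorstores import FAISS  # langchain.vectorstores on older versions
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

def create_embeddings(texts, dest_embed_dir="embed/"):
    # Hypothetical sketch: embed pre-chunked text with the NVIDIA embedding model
    # and persist a FAISS index that FAISS.load_local can read back later.
    embeddings = NVIDIAEmbeddings(model="nvolveqa_40k")
    store = FAISS.from_texts(texts, embedding=embeddings)
    store.save_local(folder_path=dest_embed_dir)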