A shared folder with AI prompts and code snippets
From workspace: Nvidia
Team: Main
Total snippets: 15
Send a user query to the query engine and stream the response while measuring elapsed time.
import time

start_time = time.time()
response = query_engine.query("what is the context length of llama2?")
response.print_response_stream()
print(f"\n--- {time.time() - start_time} seconds ---")
Build a query engine from the vector index, assigning a custom prompt template and enabling streaming.
query_engine = index.as_query_engine(text_qa_template=qa_template, streaming=True)
Generate nodes from parsed documents and insert them into the vector index.
import time

start_time = time.time()
nodes = node_parser.get_nodes_from_documents(documents)
index.insert_nodes(nodes)
print(f"--- {time.time() - start_time} seconds ---")
Connect to a Milvus vector store, wire it into a storage context, and build a vector index for inserting parsed nodes.
from llama_index import VectorStoreIndex
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import MilvusVectorStore

vector_store = MilvusVectorStore(uri="http://milvus:19538", dim=1024,...
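The snippet above is cut off; a minimal sketch of how this wiring typically continues, assuming an initially empty index that the node-insertion snippet then fills (the overwrite flag is an assumption):

vector_store = MilvusVectorStore(uri="http://milvus:19538", dim=1024, overwrite=False)  # overwrite flag is an assumption
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Start empty; nodes are added later via index.insert_nodes(...), and the
# globally set service context (see the snippet elsewhere in this folder)
# supplies the LLM and embedding model.
index = VectorStoreIndex([], storage_context=storage_context)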
Set the global service context in LlamaIndex to avoid passing it manually in each call.
from llama_index import set_global_service_context

set_global_service_context(service_context)
Bundle LLM, embed model, node parser, and prompt helper into a LlamaIndex ServiceContext.
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
    prompt_helper=prompt_helper
)
Define a LangChain-compatible HuggingFace embedding model and wrap it for use in LlamaIndex.
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

# Running the model on CPU as we want to conserve GPU memory.
# In the production deployment (API server shown as part of the 5th notebook...
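The embedding snippet is truncated; a minimal sketch of the wrapping it describes, assuming the same intfloat/e5-large-v2 model used by the text splitter snippet (which also matches the Milvus dim=1024):

hf_embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",  # assumption: same model as the text splitter snippet
    model_kwargs={"device": "cpu"},     # keep embeddings on CPU to conserve GPU memory
)
embed_model = LangchainEmbedding(hf_embeddings)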
Initialize PromptHelper from LlamaIndex to manage context window, output tokens, and chunking ratio.
from llama_index import PromptHelper

prompt_helper = PromptHelper(
    context_window=4096,
    num_output=256,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=None
)
Set up a token-based text splitter using SentenceTransformers and initialize a NodeParser for LlamaIndex.
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from llama_index.node_parser import LangchainNodeParser

TEXT_SPLITTER_MODEL = "intfloat/e5-large-v2"
TEXT_SPLITTER_TOKENS_PER_CHUNK = 510
TEXT_SPLITTER_CHUNK_OVERLAP =...
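This snippet is truncated as well; a hedged sketch of how those constants are usually fed into the splitter and parser (the chunk-overlap value is an assumption, since the original is cut off):

TEXT_SPLITTER_CHUNK_OVERLAP = 200  # assumption: original value not visible
text_splitter = SentenceTransformersTokenTextSplitter(
    model_name=TEXT_SPLITTER_MODEL,
    tokens_per_chunk=TEXT_SPLITTER_TOKENS_PER_CHUNK,
    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,
)
node_parser = LangchainNodeParser(text_splitter)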
Sample console log output showing NLTK data packages being downloaded and verified.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is...
Instantiate the loader, read the document, and measure the processing time.
import time

loader = UnstructuredReader()
start_time = time.time()
documents = loader.load_data(file="llama2_paper.pdf")
print(f"--- {time.time() - start_time} seconds ---")
Import the PDF loader module from Llama Hub.
from llama_hub.file.unstructured.base import UnstructuredReader
Download the Llama2 paper PDF from arXiv so it can be loaded with LlamaIndex’s UnstructuredReader from the Llama Hub and converted into a format ready for embedding.
! wget -O "llama2_paper.pdf" -nc --user-agent="Mozilla" https://arxiv.org/pdf/2307.09288.pdf
Create a structured prompt template for Llama2 using context and user question, formatted for use in LlamaIndex.
from llama_index import Prompt

LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
    ...
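The template string is cut off after the system instruction; a sketch of how a Llama-2 chat-style template is typically closed and wrapped into the qa_template consumed by the streaming query engine (placeholder names follow LlamaIndex's text-QA convention; the exact wording is an assumption):

LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
    "<</SYS>>"
    "Context: {context_str}"          # assumption: {context_str}/{query_str} are LlamaIndex's standard placeholders
    "Question: {query_str} [/INST]"
)
qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)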
Custom integration of TensorRT-LLM with LangChain and LlamaIndex using the LangChainLLM wrapper.
from triton_trt_llm import TensorRTLLM
from llama_index.llms import LangChainLLM

trtllm = TensorRTLLM(server_url="llm:8001", model_name="ensemble", tokens=500)
llm = LangChainLLM(llm=trtllm)
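The wrapped llm here is the same object the ServiceContext snippet bundles with the embed model, node parser, and prompt helper, so the Triton-hosted TensorRT-LLM endpoint ends up serving every completion call issued by the query engine.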