A shared folder with AI prompts and code snippets
From workspace: Nvidia
Team: Main
Total snippets: 14
Clears out any cached vector store data (for a fresh run).
!rm -rf data_*
Streams the LLM response for the given query and logs the time taken.
import time

start_time = time.time()
response = query_engine.query(query)
response.print_response_stream()
print(f"\n--- {time.time() - start_time} seconds ---")
Define a custom user query for the LLM to respond to.
query = "How do I setup a weaviate vector db? Give me a code sample please."
Creates an AutoMergingRetriever over the vector index and builds a streaming query engine that post-processes retrieved nodes with the previously defined token limiter.
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12),
    storage_context=storage_context
)
query_engine = ...
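The snippet is cut off at the query engine construction. A minimal sketch of the missing part, assuming the legacy RetrieverQueryEngine.from_args helper and the LimitRetrievedNodesLength post-processor from the next snippet, could look like this:
# Sketch: wire the retriever into a streaming query engine
# (assumes LimitRetrievedNodesLength from the post-processor snippet).
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[LimitRetrievedNodesLength(limit=2500)],
    streaming=True,
)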
Custom post-processor that caps the total length of the retrieved nodes at a maximum token limit (2500 by default).
from typing import Callable, Optional
from llama_index.utils import globals_helper, get_tokenizer
from llama_index.schema import MetadataMode

class LimitRetrievedNodesLength:
    def __init__(self, limit: int = 2500, tokenizer: ...
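The class body is truncated. A minimal sketch of one possible implementation, assuming the default LlamaIndex tokenizer from globals_helper and a postprocess_nodes() hook, is shown here:
class LimitRetrievedNodesLength:
    # Sketch: keep nodes in retrieval order until the token budget is spent.
    def __init__(self, limit: int = 2500, tokenizer: Optional[Callable] = None):
        self._tokenizer = tokenizer or globals_helper.tokenizer
        self.limit = limit

    def postprocess_nodes(self, nodes, query_bundle=None):
        included_nodes = []
        current_length = 0
        for node in nodes:
            text = node.node.get_content(metadata_mode=MetadataMode.LLM)
            current_length += len(self._tokenizer(text))
            if current_length > self.limit:
                break
            included_nodes.append(node)
        return included_nodes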
For each documentation directory, parse nodes using load_markdown_docs() with hierarchical=True, store parent nodes in SimpleDocumentStore, and persist leaf nodes in VectorStoreIndex.
from llama_index import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.storage.docstore import ...
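The snippet stops at the imports. A minimal sketch of the per-directory indexing loop, assuming load_markdown_docs() returns (nodes, leaf_nodes) when hierarchical=True and using hypothetical data_* persist directories, might look like this:
from llama_index.storage.docstore import SimpleDocumentStore

# Sketch: build and persist one hierarchical index per documentation folder.
for directory, description in docs_directories.items():
    nodes, leaf_nodes = load_markdown_docs(directory, hierarchical=True)

    # Parent nodes live in the docstore so AutoMergingRetriever can merge up to them.
    docstore = SimpleDocumentStore()
    docstore.add_documents(nodes)
    storage_context = StorageContext.from_defaults(docstore=docstore)

    # Only the leaf nodes are embedded and persisted in the vector index.
    index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
    index.storage_context.persist(persist_dir=f"./data_{directory.split('/')[-1]}")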
Define documentation folders and load documents using load_markdown_docs() with hierarchical parsing enabled.
docs_directories = {
    "./llama_docs_bot/docs/community": "Useful for information on community integrations with other libraries, vector dbs, and frameworks.",
    "./llama_docs_bot/docs/core_modules/agent_modules": "Useful for information on...
Load markdown docs from a directory and parse them into nodes with LlamaIndex, either hierarchically or with a simple flat parser. Hierarchical chunking produces a parent/child node structure.
from llama_index import SimpleDirectoryReader, Document
from llama_index.node_parser import HierarchicalNodeParser, SimpleNodeParser, get_leaf_nodes
from llama_index.schema import MetadataMode
from ...
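The function body is truncated. A minimal sketch of load_markdown_docs(), assuming the text_splitter_ids and text_splitter_map defined in the splitter snippet below and a (nodes, leaf_nodes) return value for the hierarchical case, could look like this:
def load_markdown_docs(filepath, hierarchical=True):
    # Sketch: read all markdown files under the directory.
    documents = SimpleDirectoryReader(
        input_dir=filepath,
        required_exts=[".md"],
        recursive=True,
    ).load_data()

    if hierarchical:
        # Parent/child chunking with the 1024- and 510-token splitters.
        parser = HierarchicalNodeParser.from_defaults(
            node_parser_ids=text_splitter_ids,
            node_parser_map=text_splitter_map,
        )
        nodes = parser.get_nodes_from_documents(documents)
        return nodes, get_leaf_nodes(nodes)

    # Flat chunking: a single level of nodes.
    parser = SimpleNodeParser.from_defaults()
    return parser.get_nodes_from_documents(documents)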
Clone the official llama_docs_bot GitHub repo, which contains the sample documentation used for chat.
!git clone https://github.com/run-llama/llama_docs_bot.git
Split text into multiple levels of nodes using a token splitter map with different chunk sizes.
from llama_index.text_splitter import TokenTextSplitter

text_splitter_ids = ["1024", "510"]
text_splitter_map = {}
for ids in text_splitter_ids:
    text_splitter_map[ids] = TokenTextSplitter(
        chunk_size=int(ids),
        ...
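The splitter constructor is cut off. A minimal sketch of the remaining arguments, assuming a chunk_overlap of 200 tokens (an assumption, not taken from the original), is below:
# Sketch: one TokenTextSplitter per hierarchy level, keyed by chunk size.
for ids in text_splitter_ids:
    text_splitter_map[ids] = TokenTextSplitter(
        chunk_size=int(ids),
        chunk_overlap=200,  # assumed overlap; tune for your documents
    )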
Create a ServiceContext instance with the LLM and embedding model, and set it globally for LlamaIndex.
from llama_index import ServiceContext, set_global_service_context

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model
)
set_global_service_context(service_context)
Load a HuggingFace embedding model for inference on CPU and wrap it for use with LlamaIndex.
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
hf_embeddings = HuggingFaceEmbeddings(
    ...
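The constructor arguments are cut off. A minimal sketch of the rest, assuming a sentence-transformers model name that is not taken from the original, is shown here:
# Sketch: finish the embedding setup and wrap it for LlamaIndex.
hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # assumed model; substitute your own
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
embed_model = LangchainEmbedding(hf_embeddings)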
Define a structured prompt using Llama 2's instruction syntax and wrap it as a LlamaIndex prompt template.
from llama_index import Prompt

LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
    ...
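The template string is truncated. A minimal sketch of a complete version, assuming the standard {context_str} and {query_str} placeholders used by LlamaIndex text-QA templates, could read as follows; the wrapped template would typically be passed as text_qa_template when building the query engine:
# Sketch: close the system block and add the context/question slots.
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "Use the following context to answer the user's question. If you don't know the answer, "
    "just say that you don't know, don't try to make up an answer."
    "<</SYS>>"
    "\n\nContext:\n{context_str}\n\nQuestion: {query_str} [/INST]"
)
qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)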
Connect to a TensorRT-LLM model served by Triton through a custom LangChain wrapper, then wrap it with LangChainLLM for LlamaIndex compatibility.
from triton_trt_llm import TensorRTLLM
from llama_index.llms import LangChainLLM

trtllm = TensorRTLLM(server_url="llm:8001", model_name="ensemble", tokens=500)
llm = LangChainLLM(llm=trtllm)