RAG implementation with Llama2-13B-chat LLM Model

Uğur Özker
Mar 4, 2024


Welcome to our second laboratory study on LLMs. In this study I will discuss Retrieval-Augmented Generation (RAG). RAG is among the most powerful and popular of the complex but productive generative AI topics, because it brings together many concepts: the overall flow, the chosen vector database, and the selection and wiring of the LLM model. In short, the approach starts by collecting many documents or other unstructured data, turning them into searchable content, and indexing them in a vector database. The prompt produced by the user is then run against the vector database to find the most relevant content. Once the relevant content is selected, the LLM is queried to generate an output grounded in that content, and an answer appropriate to the user's input is returned. I hope it will be useful to those who are interested in the subject :)

Throughout the article we will work through the numbered parts above together, step by step, so the image at the top can be considered the road map of our study.

Introduction

The dataset used in this study is a subset of nq_open, an open-source question-answering dataset based on content from Wikipedia. The selected subset contains the gold-standard passages that answer the queries in the dataset, allowing retrieval quality to be evaluated. The original version of the dataset contains roughly 3000 pages of web content and 600 question-and-answer examples. The responses were modified for fluency by IBM Research, producing an end-to-end retrieval-and-response dataset that focuses on answering abstract, longer-form questions.

Since we use GPU resources in the IBM Research environment and our usage is limited, we selected 300 pages (10% of the web page content) and 15 questions as examples.

RAG Document & Q&A Dataset Samples

Sample Documents Data Set and Format:

Sample Q&A Data Set and Format:

The id field in the question-and-answer dataset refers to the id field in the document dataset that we use as the knowledge base. We need this link to measure the model's success, because otherwise we cannot tell how closely the generated answers relate to the content. Being able to compute a success rate with a relatively small amount of data is very important: once that works, you can expand the content and build the same model on unstructured data without additional data preparation.

A value of -1 means the question is not related to any content in the documents; such questions are kept aside and excluded from the calculation.

Hardware Specs

For hardware, we use a Jupyter notebook running in the cloud-based research environment, because the language model we will use requires a dedicated A100 80 GB GPU card. For this reason, we run our Jupyter notebook environment on the following resources.

The model we use in this study is Llama2-13b-chat, which is open source and published by Meta. You can allocate fewer resources for models with a lower parameter count; conversely, if you want to use very large models such as Llama2-70b, you may need a minimum of 4 A100 80 GB GPUs. This need is expected to grow as new models are released; for example, we will need 8 identical GPU resources for Llama 3.

Step 1 — Ingest & Prepare

Customer data is transformed and/or enriched to make it suitable for model extension. Transformations can include simple format conversions, such as converting PDF documents to text, or more complex ones, such as converting complex table structures into if-then style expressions. Enrichment may include expanding common abbreviations, adding metadata such as currency information, and other additions that increase the relevance of search results. In this study we read data from TSV and XLSX files; for a company's terabytes of data, we recommend using a data ingestion tool with strong parallel computing capabilities.
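
As a minimal illustration of the ingestion step, the sketch below loads the two files with pandas; the file names and column layout are assumptions, so adapt them to your own dataset.

```python
# Minimal ingestion sketch; file names and column names (id, title, text,
# question, answers) are illustrative assumptions, not the lab's actual files.
import pandas as pd

# Knowledge-base passages from a tab-separated file
documents_df = pd.read_csv("documents.tsv", sep="\t")

# Question/answer pairs from an Excel sheet (requires openpyxl)
qa_df = pd.read_excel("questions.xlsx")

print(len(documents_df), documents_df.columns.tolist())
print(len(qa_df), qa_df.columns.tolist())
```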

We use standard Python libraries in this study; the three most important are genai, langchain and milvus.

Milvus is the service we use as the vector database; in the code we create a temporary instance with a local server. For production work it is critical that this database runs on an orchestration platform such as OpenShift, because at much larger scale, with many simultaneous queries and huge piles of documents, uncontrollably growing network traffic and embedding files can become a bottleneck.

genai is the library we use to connect to our account in the IBM Research environment and consume our resources there.

langchain provides ease of use by chaining different components, such as the vector database, the LLM model and the prompt, together at a single point so they can be integrated with each other easily.

Step 2— Generate Embeddings

An embedding model is used to convert the source data into a series of vectors that represent the words in the client data. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. The embeddings are stored as passages (called chunks) of client data, think of sub-sections or paragraphs, to make it easier to find information.

Below is a simple function that converts texts into embeddings. We need a sentence-transformers model for the text-to-vector conversion, since it provides the transformer-based neural network that produces the embeddings.
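
A possible shape for that helper, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (the model actually used in the lab may differ), could look like this:

```python
# Sketch of the text-to-embedding helper; the model name is an assumption.
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional vectors

def encode_texts(texts):
    """Convert a list of strings into dense vectors (lists of floats)."""
    return embedding_model.encode(texts, show_progress_bar=False).tolist()

# Example usage with an nq_open-style question
query_vector = encode_texts(["who won the world cup in 2018"])[0]
```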

A standalone Milvus vector database is created and set up on localhost, sharing the Jupyter notebook kernel's resources.
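
A sketch of that setup, assuming the Milvus Lite (milvus) and pymilvus packages; in production you would instead connect to a Milvus cluster running on OpenShift:

```python
# Start an embedded Milvus server that shares the notebook kernel's resources,
# then open a pymilvus connection to it. The package choice is an assumption.
from milvus import default_server
from pymilvus import connections

default_server.start()
connections.connect("default", host="127.0.0.1", port=default_server.listen_port)
```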

In this step the collection is created. You can think of a collection as a table in a classical RDBMS. The collection is created to match the data we extract from the documents' content.
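
A possible schema, with field names and the 384-dimensional vector size chosen to match the sketches above (both are assumptions):

```python
# Collection schema sketch: one row per chunk, with the source document id,
# title, raw text and the embedding vector.
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType

fields = [
    FieldSchema(name="chunk_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="doc_id", dtype=DataType.INT64),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]
schema = CollectionSchema(fields, description="RAG knowledge-base chunks")
collection = Collection(name="rag_documents", schema=schema)
```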

The documents are split into chunks, embedded, and are now ready to be written to Milvus under a new index. The critical point here is the chunk size and overlap parameters: you should set them according to your content, because incorrect values may yield low-quality or confusing vector search results. The best approach is to determine the ideal chunk values by testing step by step with your own data.
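
One way to do the chunking, using langchain's RecursiveCharacterTextSplitter with illustrative chunk_size/chunk_overlap values (tune them against your own content), building on the documents_df and encode_texts sketches above:

```python
# Split each document into overlapping chunks and embed them.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

chunks, chunk_meta = [], []
for _, row in documents_df.iterrows():
    for piece in splitter.split_text(row["text"]):
        chunks.append(piece)
        chunk_meta.append({"doc_id": row["id"], "title": row["title"]})

chunk_vectors = encode_texts(chunks)
```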

Step 3— Save Embeddings

Insert the embeddings, texts, titles and question ids into the Milvus DB, then create an index on the vector field. With the chunk parameters we defined, the content was split into 17 embeddings and made ready for use under a new index tag.
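
The insert and index creation could look roughly like this, continuing the sketches above (column order follows the assumed schema; the auto-generated primary key is not inserted):

```python
# Write the chunk rows into Milvus, then build an HNSW index on the vector field.
collection.insert([
    [m["doc_id"] for m in chunk_meta],
    [m["title"] for m in chunk_meta],
    chunks,
    chunk_vectors,
])
collection.flush()

collection.create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "L2",
                  "params": {"M": 16, "efConstruction": 200}},
)
collection.load()  # load the collection into memory so it can be searched
```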

Most of the vector index types supported by Milvus use approximate nearest neighbors search (ANNS). Compared with accurate retrieval, which is usually very time-consuming, the core idea of ANNS is no longer limited to returning the most accurate result, but only searching for neighbors of the target. ANNS improves retrieval efficiency by sacrificing accuracy within an acceptable range.

According to the implementation methods, the ANNS vector index can be divided into four categories:

  • Tree-based index
  • Graph-based index (e.g., HNSW): generally used for high-speed queries that require as high a recall rate as possible, at the cost of large memory consumption. HNSW offers two metric types: Euclidean distance (L2) and inner product (IP)
  • Hash-based index
  • Quantization-based index

If you need larger vector fields and hope to get good results with less resource consumption, it is especially useful to use the RHNSW_SQ index type, which combines the graph-based and quantization approaches. RHNSW_SQ is an improved indexing algorithm based on HNSW: it performs scalar quantization on the vector data on top of HNSW, significantly reducing storage consumption.
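
If your Milvus deployment still supports it (RHNSW_SQ existed in older Milvus releases and may not be available in recent 2.x versions), switching is just a matter of changing the index parameters; the values below are illustrative:

```python
# Illustrative RHNSW_SQ parameters; verify that the index type is supported by
# your Milvus version before using it.
rhnsw_sq_index = {"index_type": "RHNSW_SQ", "metric_type": "L2",
                  "params": {"M": 16, "efConstruction": 200}}
```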

Step 4— Query

The input token limit depends on the selected generative model's maximum sequence length. The total number of input tokens in the RAG prompt must not exceed the model's maximum sequence length minus the number of desired output tokens. The number of paragraphs retrieved as context therefore directly affects the number of tokens in the prompt.

Llama2-13b-chat model token limit: 4096

Our dataset's maximum input token count: 3995

Since 4096 is greater than or equal to 3995, the prompt fits: the dataset's input token count must not exceed 4096.
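
A simple way to enforce this budget, assuming the Hugging Face tokenizer for Llama-2 is available (the model repository is gated, so any compatible tokenizer works for an estimate):

```python
# Token-budget check sketch; model name and the 4096 limit come from the text above.
from transformers import AutoTokenizer

MODEL_MAX_SEQ_LEN = 4096
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

def fits_in_context(prompt, max_new_tokens=100):
    """Return True if the prompt tokens plus the desired output fit the model limit."""
    return len(tokenizer.encode(prompt)) + max_new_tokens <= MODEL_MAX_SEQ_LEN
```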

We statically select question number 2 from our question list to query the vector dataset stored in Milvus.

Step 5— Search Relevant Information for Query

The function below returns the paragraphs most similar to the question. This determines which embedded chunks and content we send as input to the LLM to generate new content. If the right input cannot be found here, the chance of the LLM producing an acceptable answer is close to zero. The two points we emphasized earlier are therefore fundamental: first, define the chunk parameters correctly; second, use the right indexing algorithm and metric type. Of course, all of this assumes your documents are semantically distinct, consistent and free of true duplicates.
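
A sketch of such a retrieval helper, reusing the encode_texts function and the collection defined earlier (field names and search parameters are assumptions):

```python
# Embed the question and run an approximate nearest-neighbour search in Milvus.
def search_relevant_passages(question, top_k=4):
    results = collection.search(
        data=encode_texts([question]),
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"ef": 64}},
        limit=top_k,
        output_fields=["doc_id", "title", "text"],
    )
    # Return (distance, doc_id, text) tuples, closest first
    return [(hit.distance, hit.entity.get("doc_id"), hit.entity.get("text"))
            for hit in results[0]]
```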

As you can see from the results, the four closest matches all point to the same document. The content with id 200 is the clear choice, with no real alternative, at vector distances between 1.190740 and 1.277924.

A distance of 1.19 indicates a very accurate search result for us.

Step 6— Send Query + Prompt + Enhanced Context

`prompt_template` is a function to create a prompt from the given context and question. Changing the prompt will sometimes result in much more appropriate answers (or it may degrade the quality significantly). The prompt template below is most appropriate for short-form extractive use cases.

`make_prompt` includes a script to truncate the context length provided as an input in case the total token inputs exceed the model’s limit. The paragraphs with the largest distance are truncated first. This functionality is helpful in case the embedded passages are not of the same size.
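
The two helpers could be sketched as follows; the exact prompt wording and truncation policy are assumptions based on the description above, and fits_in_context is the token check from Step 4:

```python
def prompt_template(context, question):
    """Short-form extractive prompt built from the retrieved context."""
    return ("Answer the question using only the context below. "
            "If the answer is not in the context, say you do not know.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

def make_prompt(passages, question, max_new_tokens=100):
    """Drop the most distant passages until the prompt fits the token limit."""
    kept = list(passages)  # (distance, doc_id, text) tuples, closest first
    while kept:
        prompt = prompt_template("\n\n".join(text for _, _, text in kept), question)
        if fits_in_context(prompt, max_new_tokens):
            return prompt
        kept.pop()  # remove the passage with the largest distance
    return prompt_template("", question)
```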

Step 7— Return Human Like Response from Llama2-13b-chat LLM
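
A hedged sketch of the generation call, assuming the older Model/GenerateParams interface of the ibm-generative-ai (genai) package that was current at the time of writing; newer releases expose a Client object instead, so adjust accordingly:

```python
# Generation call sketch; credentials come from environment variables, and the
# question/column names are assumptions carried over from the earlier sketches.
import os
from genai.credentials import Credentials
from genai.model import Model
from genai.schemas import GenerateParams

creds = Credentials(api_key=os.environ["GENAI_KEY"], api_endpoint=os.environ.get("GENAI_API"))
params = GenerateParams(decoding_method="greedy", max_new_tokens=100, min_new_tokens=1)
llm = Model("meta-llama/llama-2-13b-chat", params=params, credentials=creds)

question = qa_df["question"][2]  # the statically selected question from Step 4
prompt = make_prompt(search_relevant_passages(question), question)
print(llm.generate([prompt])[0].generated_text)
```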

Step 8— Evaluate RAG performance on your data (Bonus)

This step requires having a test dataset that includes for each question:

  • The indexes of the passage(s) that contain the answer, i.e. the gold-standard passages (if the question is answerable by the knowledge base)
  • The question's gold-standard answer (this can be short or long-form)

We will now run the RAG pipeline on the given questions.

Evaluate Retrieval quality

There are many ways to compute retrieval quality, namely how well the retrieved documents contain the information relevant to the question being asked. Here we focus on success at a given number of returns (also known as recall at given levels): given a fixed number of returned documents (e.g., 1, 3, 5), is the question's answer contained in them? The scores increase with the recall level.
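
A minimal recall@k sketch under the assumption that, for each question, we have the list of retrieved document ids and the list of gold-standard ids (with [-1] marking unanswerable questions); all_retrieved_ids and all_gold_ids are hypothetical names:

```python
# recall@k: is at least one gold-standard passage among the first k retrieved?
def recall_at_k(retrieved_ids_per_question, gold_ids_per_question, k):
    hits, answerable = 0, 0
    for retrieved, gold in zip(retrieved_ids_per_question, gold_ids_per_question):
        if gold == [-1]:          # unanswerable: excluded from the calculation
            continue
        answerable += 1
        hits += any(g in retrieved[:k] for g in gold)
    return hits / answerable if answerable else 0.0

for k in (1, 3, 5):
    print(f"recall@{k}:", recall_at_k(all_retrieved_ids, all_gold_ids, k))
```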

Evaluate answered and unanswered questions

The following table breaks the count of question/answer pairs by whether the test dataset has an answer (rows) and the RAG model returned an answer (columns).
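
Assuming a results DataFrame with boolean columns has_gold_answer and model_answered (hypothetical names), the breakdown is a one-liner with pandas:

```python
import pandas as pd

# Rows: does the test set contain an answer? Columns: did the RAG model answer?
print(pd.crosstab(results_df["has_gold_answer"], results_df["model_answered"]))
```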

Complex evaluation of retrieval quality and generated answers

We will leverage unitxt metrics to evaluate the system in a more robust, complex way. Please refer to this document to see the full explanation of the metrics.

Context Relevance: This is a reference-less metric gauging the relevance of the retrieved texts to answering the user question. The metric range is [0, 1], where higher is better.

Context Correctness: This is a reference-based metric reflecting the rank of the ground-truth text in the retrieved texts. The metric range is [0, 1] where higher is better.

Faithfulness: This is a reference-less metric gauging the groundedness of the generated answer in the retrieved texts. The metric range is [0, 1], where higher is better.

We calculate a separate accuracy score for each prompt we test; taking the average of all of them gives the overall score of the study, shown below.

Answer Reward: This is a reference-less metric that predicts which generated answer is better judged by a human, given a question. The metric range is [0, 1], where higher is better.

Answer Correctness: This is a reference-based metric gauging the similarity between the generated answer to a gold answer. The metric range is [0, 1], where higher is better.

As a result, the accuracy rate across all answerable questions is 39.3%. This is not a truly reliable number for our evaluation, because we ran it with only 15 questions, many of which were not relevant to the actual document dataset. The ideal scenario would be to repeat this with a properly prepared test set, built by uploading a large number of documents and correctly labelling the matching relevant ids; then we could see acceptable results.

Conclusion & Appreciation

We carried out this study together with my teammates Bengü Sanem Pazvant, Ceyda Hamurcu, Seray Boynuyoğun and Merve Özmen from the IBM Client Engineering team. I would like to thank all my teammates for their contributions and efforts.
