RAG
Notes
What is it
The problem we face when using public pre-trained LLMs is that they know nothing about our private data sources, but training your own LLM is usually not affordable for most companies/people.
RAG is a way to boost an LLM's ability to understand and answer questions about your private data.
How to create your own RAG
1. prepare your private database
First, you need to prepare your own private data in a database.
a. Start by chunking your private data into meaningful sentences/paragraphs
b. Convert them into numerical representations (vectors); this is done by an embedding language model.
c. Store them in a vector database, a kind of database that's specialized in comparing/matching vectors.
An example of a database record
{
  "id": "chunk_1",
  "vector": [0.15, -0.46, 0.3, 0.2026, ...], // hundreds or thousands of numbers representing the meaning of the text chunk
  "text": "Phone bill reimbursement 101. You need to first create an account in ...",
  "metadata": {
    "source": "www.wiki.company.com/internal/expense/reimbursement",
    "page": 5
  }
}
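Steps a–c can be sketched as a small ingestion pipeline. Everything here is hypothetical: embed() is a toy word-hashing stand-in for a real embedding model, chunking is a naive paragraph split, and the "database" is just a Python list of records shaped like the example above.

```python
import hashlib

def embed(text, dim=8):
    # Toy embedding: hash each word into one of `dim` buckets, count,
    # then normalize to unit length. A real system calls an embedding model.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def chunk(document):
    # Naive chunking: split on blank lines (real pipelines are smarter).
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def ingest(document, source):
    # Produce records shaped like the example above: id, vector, text, metadata.
    return [
        {
            "id": f"chunk_{i + 1}",
            "vector": embed(text),
            "text": text,
            "metadata": {"source": source},
        }
        for i, text in enumerate(chunk(document))
    ]

doc = "Phone bill reimbursement 101.\n\nYou need to first create an account."
records = ingest(doc, "www.wiki.company.com/internal/expense/reimbursement")
```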
2. Retrieve relevant information when a user queries
For example, when a user asks “How can I expense my phone bills?”, it will be converted to a vector like [1.5, 2.7, -1.02 …].
The vector will then be used to find and retrieve the relevant information in the vector database.
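A minimal brute-force retrieval sketch, with a made-up in-memory database and hand-picked vectors; the dot product serves as the similarity score (on unit-length vectors it equals cosine similarity).

```python
# Made-up records standing in for a real vector database.
database = [
    {"id": "chunk_1", "vector": [0.9, 0.1, 0.0],
     "text": "Phone bill reimbursement 101. You need to first create an account in ..."},
    {"id": "chunk_2", "vector": [0.0, 0.2, 0.9],
     "text": "Office seating map for the 5th floor."},
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vector, top_k=1):
    # Rank every record by similarity to the query and keep the best top_k.
    ranked = sorted(database, key=lambda r: dot(query_vector, r["vector"]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical vector for "How can I expense my phone bills?"
results = retrieve([0.8, 0.2, 0.1])
```

Real vector databases avoid this full scan with an index, but the ranking idea is the same.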
3. Talking to the LLM
The original prompt will be augmented, becoming something like
"UserQuestion": "How can I expense my phone bills?"
"Context": "Phone bill reimbursement 101. You need to first create an account in ..." // your private data
The LLM will then be able to answer your question using data it has never been trained on.
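A sketch of this augmentation step; call_llm is a hypothetical stand-in for whatever chat/completions API you actually use.

```python
def build_prompt(question, retrieved_chunks):
    # Paste the retrieved private text in front of the user's question.
    context = "\n\n".join(chunk["text"] for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

chunks = [{"text": "Phone bill reimbursement 101. You need to first create an account in ..."}]
prompt = build_prompt("How can I expense my phone bills?", chunks)
# The augmented prompt is then sent to the model, e.g. call_llm(prompt).
```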
Other topics
Can the data in the database get stale?
Yes. It's a challenging but well-known problem for any task that requires preprocessing, for example keeping Google's search index up to date.
Usually people just do an async update every once in a while, depending on how fresh you want your data to be.
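One simple way to decide when a source needs re-embedding, sketched with hypothetical names and timestamps:

```python
import time

last_refreshed = {}  # source URL -> epoch seconds of the last embedding run

def needs_refresh(source, last_modified, max_age=24 * 3600, now=None):
    # Re-embed if the source changed after our last run, or if the data is
    # older than the freshness budget we chose (default: one day).
    now = time.time() if now is None else now
    refreshed = last_refreshed.get(source, 0.0)
    return last_modified > refreshed or now - refreshed > max_age
```

An async job can loop over all sources periodically, calling this check and re-running ingestion for anything stale.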
How to find relevant information given a query’s vector
Apparently we want to find the few chunks that are most similar to our query.
To measure the similarity between 2 vectors,
Cosine similarity is the standard way in RAG (it measures the angle between the vectors).
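Cosine similarity in a few lines: the dot product of two vectors divided by the product of their lengths.

```python
import math

def cosine_similarity(a, b):
    # 1.0 = same direction (same meaning), 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0: identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal
```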
Scanning through the whole database is going to be slow.
A vector index is the key to making it fast enough for real-time apps.
The core idea is to group the similar vectors together.
Hierarchical Navigable Small World (HNSW) is one of the most popular solutions.
A graph of nodes is built. Each node (vector) is connected to its most similar nodes.
- A query can start at a random or a designated entry node
- Check if any of the neighbors are closer to the target (query)
- Move to the closest such neighbor
- Repeat until no improvement
Instead of scanning everything, we walk towards the target.
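The walk above can be sketched on a single layer (real HNSW adds a hierarchy of layers to pick good entry points); the graph and vectors here are made up for illustration.

```python
import math

# Toy 2-D vectors and a hand-built neighbor graph.
vectors = {
    "A": [0.0, 0.0], "B": [1.0, 0.0], "C": [2.0, 0.5],
    "D": [3.0, 1.0], "E": [0.5, 2.0],
}
neighbors = {
    "A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C"], "E": ["A"],
}

def greedy_search(query, entry="A"):
    current = entry
    while True:
        # Check if any neighbor is closer to the query than where we are.
        best = min(neighbors[current],
                   key=lambda n: math.dist(vectors[n], query))
        if math.dist(vectors[best], query) < math.dist(vectors[current], query):
            current = best   # move towards the target
        else:
            return current   # no improvement: stop

closest = greedy_search([2.8, 0.9])  # walks A -> B -> C -> D
```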
How is it different from fine-tuning?
Fine-tuning changes the weights in the LLM. It requires retraining or continued training, which can be costly.
But it also has some advantages over RAG; for example, you can change the weights so that the LLM always answers like a lawyer.
There are trade-offs, but choose RAG when
- Updates frequently: training is too costly
- Factual accuracy is important: RAG helps by grounding answers in real, retrieved text
- Traceability: this is actually really important; you have real data in your database, so you can trace the source of the LLM's answer.
- Fast iteration: change your database records and the answer from LLM will be different immediately.
Summary
So basically the LLM in a RAG architecture doesn't change at all; its job is still generating outputs given inputs.
There's just one extra step before that to add more context to the LLM's input.