# Understanding BBC News Q&A with Advanced RAG and Microsoft Phi3

In this blog, we will build question answering over a news data feed.
The blog has 2 parts:
* `Conceptual`
* `Implementation details`, which walks through the code and links to the full notebook

Feel free to read both parts, or at least the `Conceptual` section.

For this we use the [BBC News Dataset](https://www.kaggle.com/datasets/gpreda/bbc-news/versions/801). It is a `self updating dataset` that is refreshed daily.

We will learn Simple and Advanced RAG [**Retrieval Augmented Generation**] using a small language model, **Phi3 mini 128K instruct**, through this blog.

We will ask questions like `What is the news in Ukraine`, and the application will provide the **answers** using this technique.

The Phi-3-Mini-128K-Instruct is a `3.8 billion-parameter`, lightweight, state-of-the-art open model trained using the Phi-3 datasets. In comparison, GPT-4 is widely reported (though not officially confirmed) to have more than a trillion parameters, and the smallest Llama 3 model has 8 billion. Models such as Phi-3 are popularly known as **SLM** [`Small Language Models`], while the likes of GPT-4 and GPT-3.5 Turbo are known as **LLM** [`Large Language Models`].

 
The concept of **Word Embeddings** is used widely in this blog.
The TensorFlow [word embeddings documentation](https://www.tensorflow.org/tutorials/text/word_embeddings) describes them as follows:

> Word embeddings give us a way to use an **efficient, dense** representation in which similar words have a similar encoding.   

> Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify).   

> Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer).  

> It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.        

![Word Embeddings](https://i.imgur.com/tPWCPQ6.png)
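To make "similar words have a similar encoding" concrete, here is a minimal sketch that compares hand-made vectors with cosine similarity, the same measure the Qdrant collection later in this post is configured with. The three 4-dimensional vectors are illustrative assumptions, not output of a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 4-dimensional "embeddings" (illustrative values, not model output).
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.8, 0.9, 0.1, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0. A real encoder learns such vectors from data instead of having them written by hand.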


RAG has 3 major components

1. Ingestion
    
2. Querying

3. Generation 

<hr/>

## Ingestion

<hr/>

For ingestion, the key steps are:

1. Read the data source
    
2. Split the text into manageable chunks
    
3. Convert the chunks into embeddings, that is, turn each piece of text into an array of numbers
    
4. Store the embeddings in a vector database
    
5. Store metadata such as the filename, the text, and other relevant fields in the vector database alongside the embeddings

![Ingestion](https://i.imgur.com/vcccw0V.png)
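Step 2 of the ingestion list can be sketched as a simple word-based splitter with overlap. This is only one possible chunking strategy; the function name `chunk_words` and its parameters are hypothetical. The implementation later in this post encodes each news title as a whole, so it does not need chunking.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Twelve placeholder words: w0 ... w11
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some words twice.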
   

<hr/>

## Query the data using Simple RAG

<hr/>

The query stage needs 3 main components:

1. `Orchestrating application`, which coordinates the interactions between the other components: the user, the vector database, and the language model
    
2. Vector database, which stores the embeddings and their metadata
    
3. Language model, which generates the answer once it has been given the **contextual** information
    

<hr/>

## Data Flow of a Simple RAG

<hr/>

1. The user inputs the question. Example: `What is the news in Ukraine`
    
2. The orchestrating application uses an **encoder** to transform the text into an embedding. We have used the `all-MiniLM-L6-v2` model from Sentence Transformers as the encoder
    
3. The embedding is searched for in the vector database. In this case we have used **Qdrant** as the vector database
    
4. Search results are obtained from the vector database. We get the top K results; the number of results is configurable
    
5. A consolidated text, popularly called the **context**, is prepared from the search results. In our implementation this is done by concatenating the search results
    
6. This context is sent to the language model, which generates an answer grounded in the context. In the implementation we have used a small language model, **Phi3**
    
![Simple RAG](https://i.imgur.com/b2xtcFG.png)
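The data flow above can be sketched end to end in plain Python. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for the real encoder, a sorted list stands in for the vector database, and the four documents are made-up headlines. The final prompt is what would be handed to the language model in step 6.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k):
    """Steps 2-4: embed the query, score every document, keep the top k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_context(results):
    """Step 5: concatenate the search results into one context string."""
    return "\n---\n".join(results)

# Made-up headlines used only for this illustration
documents = [
    "Ukraine war: latest news from the front line",
    "Premier League: weekend football results",
    "Ukraine grain exports resume after deal",
    "Stock markets close higher",
]

question = "What is the news in Ukraine"
top_docs = retrieve(question, documents, k=2)
context = build_context(top_docs)
prompt = f"Answer based on context:\n\n{context}\n\n{question}"  # input for step 6
print(prompt)
```

Only the two Ukraine headlines end up in the context, so the language model never sees the unrelated football and stock market items.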

<hr/>

## Data Flow of an Advanced RAG

<hr/>

The steps remain the same as in Simple RAG, except for the following:

`Step 4` - Search results are obtained from the vector database, but this time we fetch the top K2 results, where K2 is larger than K. The number of results is configurable.

`Step 4A` - The K2 results are passed to a new type of block known as the **cross-encoder**, which scores each result against the query and keeps only the results most relevant to it. This smaller set can be the top K results.

![Advanced RAG](https://i.imgur.com/EM41f5k.png)
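The retrieve-then-rerank idea of `Step 4` and `Step 4A` can be sketched with two toy scoring functions. Both scorers are assumptions made for illustration: `first_stage_score` stands in for the vector search and `rerank_score` stands in for a cross-encoder, which in the real implementation is a neural model. The documents are made up.

```python
def first_stage_score(query, doc):
    """Cheap retrieval score: number of distinct shared words (stand-in for vector search)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank_score(query, doc):
    """Stand-in for a cross-encoder: scores the (query, doc) pair jointly.
    Here: the share of the document's words that also occur in the query."""
    q_words = set(query.lower().split())
    d_words = doc.lower().split()
    return sum(w in q_words for w in d_words) / max(len(d_words), 1)

def retrieve_and_rerank(query, documents, k2, k):
    # Step 4: take a generous candidate set of size K2 from the first stage.
    candidates = sorted(documents, key=lambda d: first_stage_score(query, d), reverse=True)[:k2]
    # Step 4A: rescore each (query, candidate) pair and keep the best K.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k]

# Made-up documents used only for this illustration
documents = [
    "the weather in the north is mild and the south is wet",
    "news in ukraine",
    "the latest football news is in",
    "stock markets fall",
]

query = "what is the news in ukraine"
first_stage_best = max(documents, key=lambda d: first_stage_score(query, d))
top = retrieve_and_rerank(query, documents, k2=3, k=1)
print("first stage best:", first_stage_best)
print("after reranking: ", top[0])
```

The first stage alone prefers the football headline because it shares many common words with the query. The second stage, which looks at the query and each candidate together, promotes the Ukraine headline. Fetching K2 candidates first gives the reranker a chance to rescue results that the cheap score ranked too low.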

<hr/>

## Implementation details

<hr/>

For this implementation, we have used the following:

1. Dataset - **BBC News** dataset
    
2. Vector Database - Qdrant. We have used the in-memory version of Qdrant for demonstration
    
3. Language Model - Small language model `Phi3`
    
4. Orchestrator application - Kaggle notebook
    
## Setup   

### Install the python libraries

```plaintext
! pip install -U qdrant-client --quiet
! pip install -U sentence-transformers --quiet
```

### Imports

```plaintext
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer,CrossEncoder
```

### Sentence Transformer Encoder

Instantiate the sentence transformer encoder

```plaintext
encoder = SentenceTransformer("all-MiniLM-L6-v2")
```

### Create the Qdrant Collection

We are creating:

* An in-memory Qdrant collection
    
* The collection name is `BBC`
    
* The vector size is the embedding dimension of the encoder. In this case, the dimension is `384`
    
* The distance metric is `cosine`
    

```plaintext
qdrant = QdrantClient(":memory:")

qdrant.recreate_collection(
    collection_name="BBC",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)
```
## Data Ingestion  

### Read the Dataset

Read the BBC News Dataset

```plaintext
import pandas as pd

LIMIT = 500  # number of rows to ingest for this demonstration
df = pd.read_csv("/kaggle/input/bbc-news/bbc_news.csv")
docs = df[:LIMIT]
```
![BBC News Dataset Rows](https://i.imgur.com/ySY37Kx.png)

### Upload the documents into Qdrant

```plaintext
%%capture --no-display
import uuid
qdrant.upload_points(
    collection_name="BBC",
    points=[
        models.PointStruct(
            id=str(uuid.uuid4()), 
            vector=encoder.encode(row[1]["title"]),
            payload={ "title":row[1]["title"] ,
                     "description":row[1]["description"] }
        )
        for row in docs.iterrows()
    ],
)
```
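Because `uuid.uuid4()` produces a fresh random ID on every call, running the upload cell a second time against an existing collection stores the same rows again under new IDs. One possible alternative, not used in the original notebook, is to derive the ID from the content so that a repeated upload overwrites the existing point. The helper `stable_point_id` below is hypothetical and assumes that titles are unique.

```python
import uuid

def stable_point_id(title):
    """Derive a repeatable UUID from the title (hypothetical helper, assumes unique titles)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, title))

first = stable_point_id("Ukraine war: latest updates")
second = stable_point_id("Ukraine war: latest updates")
other = stable_point_id("Premier League results")

print(first == second)  # same title, same ID
print(first == other)   # different title, different ID
```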

### Verify the documents have been uploaded into Qdrant

```plaintext
qdrant.count(
    collection_name="BBC",
    exact=True,
)
```

If you have reached this point, congratulations 👌. You now understand **Data Ingestion into Qdrant**.


## Query the Qdrant database


### Query for the user

```plaintext
query_string = "Describe the news for Ukraine"
```

### Search Qdrant for the query

For searching, note how we convert the user input into an embedding:

`encoder.encode(query_string).tolist()`

```plaintext
hits = qdrant.search(
    collection_name="BBC",
    query_vector=encoder.encode(query_string).tolist(),
    limit=35,
)

for hit in hits:
    print(hit.payload, "score:", hit.score)
```

### Refine the result with the CrossEncoder

We now refine the Qdrant results with the CrossEncoder.

In our implementation we retrieved K2 = 35 results from Qdrant. We use the cross-encoder `cross-encoder/ms-marco-MiniLM-L-6-v2` to rescore them and keep the top K = 5 results.

```plaintext
CROSSENCODER_MODEL_NAME = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
RANKER_RESULTS_LIMIT = 5  # K: number of results kept after re-ranking

user_input = query_string

# Collect the description text of every hit returned by Qdrant
contexts_list = [result.payload['description'] for result in hits]

# Score each (query, description) pair jointly with the cross-encoder
cross_encoder = CrossEncoder(CROSSENCODER_MODEL_NAME)
cross_inp = [[user_input, text] for text in contexts_list]
cross_scores = cross_encoder.predict(cross_inp)

# Pair each score with its text, then sort from most to least relevant
cross_scores_text = [
    {'score': score, 'text': text}
    for score, text in zip(cross_scores, contexts_list)
]
hits_selected = sorted(cross_scores_text, key=lambda x: x['score'], reverse=True)

# Keep the top K results; `hits` now holds dicts with 'score' and 'text'
hits = hits_selected[:RANKER_RESULTS_LIMIT]
```

### Create the context

We create the Context for RAG using the search results

```plaintext
contexts =""
for i in range(len(hits)):
    contexts  +=  hits[i]['text']+"\n---\n"
```

If you have reached this point, congratulations 👌 👌 again. You now understand **Getting Results from Qdrant [Vector Database]**.

<hr/>

## Generate the answer with the Small Language Model

<hr/>
Now that we have the context from the vector database, Qdrant, we send it to our small language model **Phi3**.

Specifically, we use the **microsoft/Phi-3-mini-128k-instruct** model.

From the Hugging Face model card 

> The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. This dataset includes both synthetic data and filtered publicly available website data, with an emphasis on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family with the Mini version in two variants 4K and 128K which is the context length (in tokens) that it can support.

From the [Microsoft blog](https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/)

> Thanks to their smaller size, Phi-3 models can be used in compute-limited inference environments. Phi-3-mini, in particular, can be used on-device, especially when further optimized with ONNX Runtime for cross-platform availability. The smaller size of Phi-3 models also makes fine-tuning or customization easier and more affordable. In addition, their lower computational needs make them a lower cost option with much better latency. The longer context window enables taking in and reasoning over large text content—documents, web pages, code, and more. Phi-3-mini demonstrates strong reasoning and logic capabilities, making it a good candidate for analytical tasks. 


```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
```

### Create the prompt
The prompt is created with 2 components:
* Context, which we created in the section `Create the context`
* User input, which is the question asked by the user

```
prompt = f"""Answer based on context:\n\n{contexts}\n\n{user_input}"""
```

### Create the message template
```
messages = [
     {"role": "user", "content": prompt},
]
```

### Generate the message

```
%%time
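# Render the chat messages with the model's chat template and tokenize them;
# add_generation_prompt=True appends the marker that starts the assistant turn
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
model_inputs = model_inputs.to('cuda')
# Sample up to 1000 new tokens; the output also contains the prompt tokens
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)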
```

### Print the answer

```
print(decoded[0].split("<|assistant|>")[-1].split("<|end|>")[0])
```
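The decoded text contains the whole conversation, including the prompt and the special tokens `<|user|>`, `<|assistant|>` and `<|end|>` of the Phi-3 chat format, which is why the answer is cut out with two `split` calls. The sketch below wraps the same logic in a helper and runs it on a hand-written string, so it works without loading the model. The name `extract_answer`, the added `strip()` and the sample text are assumptions for illustration, not real model output.

```python
def extract_answer(decoded_text):
    """Return the text between the last <|assistant|> marker and the next <|end|>."""
    after_assistant = decoded_text.split("<|assistant|>")[-1]
    # strip() is an addition to the one-liner above: it removes the newline after the marker
    return after_assistant.split("<|end|>")[0].strip()

# Hand-written example of a decoded sequence (not real model output)
sample_decoded = (
    "<|user|>\nAnswer based on context:\n\nSome context\n\n"
    "Describe the news for Ukraine<|end|>\n"
    "<|assistant|>\nA short summary of the context.<|end|>"
)

answer = extract_answer(sample_decoded)
print(answer)
```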

## Code
The code can be found in the **Kaggle** notebook 
[BBC NEWS Advanced RAG PHI3](https://www.kaggle.com/code/ambarish/bbc-news-advanced-rag-phi3)