BYU Student Author: @Kiya_Smith
Reviewers: @Keanu_Gauthier, @James_Behling
Estimated Time to Solve: 2.5 hours
We provide the solution to this challenge using:
- Excel
- Python
- ChatGPT
Overview
You are an analyst at Deloitte with a side task: build a basic retrieval-augmented generation (RAG) model that works as a chatbot so that other Deloitte employees can ask basic accounting-related questions. The model should draw from a provided Excel dataset of common questions and their answers. The dataset contains enough information to answer those exact questions as well as closely related ones. The work is to be done in Python. This is an ambitious challenge with a lot of factors to consider. The instructions will help you build the best RAG model you can using a free but limited API key from a site called Cohere. Try to do as much as you can without looking at the solution, but if you struggle with this challenge, use ChatGPT or our solution to help guide you through the code.
*This challenge uses Cohere’s free API, which is limited to 1,000 calls per month with a trial key.
Instructions
- Get your Cohere API key: follow this link and create a free account.
- Go to the dashboard on the left and choose “API Keys”, then make sure you have access to your free trial key. You will use this key in step 8, when you set your API key.
- Download the dataset provided.
- Open a new Jupyter notebook file.
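- Install the needed packages (pandas also needs openpyxl to read .xlsx files, so install it too if read_excel reports it missing):
a. !pip3 install openai faiss-cpu pandas numpy
b. !pip3 install cohere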
- Create a new cell and import the necessary packages:
a. import os
b. import faiss
c. import numpy as np
d. import pandas as pd
e. import cohere
- Make a function to load the dataset from an Excel file (use the absolute path instead of os.getcwd()).
- Set your API key by creating a variable called “COHERE_API_KEY” and setting it to your trial key. (Note: Hardcoding API keys is generally not advisable for security reasons, but for this assignment it’s acceptable since you are only using a free trial.) Then, on a new line, write: co = cohere.Client(COHERE_API_KEY)
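If you would rather not paste the key into the notebook, one possible alternative is to read it from an environment variable. This is a minimal sketch and not part of the original solution: the helper name `load_api_key` is hypothetical, and it assumes you exported `COHERE_API_KEY` before starting Jupyter.

```python
import os


def load_api_key(env_var="COHERE_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your trial key.")
    return key


# Usage, commented out so this cell runs even without the cohere package:
# COHERE_API_KEY = load_api_key()
# co = cohere.Client(COHERE_API_KEY)
```

Hardcoding the key as described in the step works just as well for this assignment; the sketch only matters if you plan to share the notebook.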
- Create a function called “get_embedding” that gets text embeddings using Cohere’s API.
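One way such a function might look is sketched below. It is not the posted solution: the client is passed in as an argument (instead of using the global `co`) so the sketch can run offline, and the model name `embed-english-v3.0` and the `input_type` value are assumptions you may need to change for your account. `FakeClient` is a hypothetical stand-in for `cohere.Client`.

```python
from types import SimpleNamespace

import numpy as np


def get_embedding(text, client, model="embed-english-v3.0", input_type="search_document"):
    """Embed one string and return it as a 1-D float32 vector."""
    # The model name and input_type are assumptions, not taken from the challenge.
    response = client.embed(texts=[text], model=model, input_type=input_type)
    return np.asarray(response.embeddings[0], dtype="float32")


class FakeClient:
    """Offline stand-in for cohere.Client so the sketch runs without a key."""

    def embed(self, texts, model, input_type):
        # Returns made-up 3-dimensional vectors; a real model returns many more values.
        return SimpleNamespace(embeddings=[[float(len(t)), 1.0, 0.0] for t in texts])


vector = get_embedding("What is an audit?", FakeClient())
print(vector.shape, vector.dtype)
```

With a real key you would pass `co` in place of `FakeClient()`. When embedding a user's question rather than a dataset row, `input_type="search_query"` is the matching setting for the v3 embedding models.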
- Load the dataset from the given Excel file (if you want, print the dataset to make sure it was read in properly).
- Generate embeddings by using np.array and the get_embedding function.
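Because the trial key is limited to 1,000 calls per month, it can help to embed several rows per API call instead of one call per row. The sketch below shows that variation, which differs from the per-row approach the step describes. The function name `embed_texts`, the `batch_size` value, and the model name are assumptions, and `CountingClient` is a hypothetical offline stand-in for `cohere.Client`.

```python
from types import SimpleNamespace

import numpy as np


def embed_texts(texts, client, model="embed-english-v3.0", batch_size=32):
    """Embed a list of strings in batches and return an (n, d) float32 array."""
    rows = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        response = client.embed(texts=batch, model=model, input_type="search_document")
        rows.extend(response.embeddings)  # one vector per input text, in order
    return np.array(rows, dtype="float32")  # FAISS expects float32 input


class CountingClient:
    """Offline stand-in that counts how many API calls would have been made."""

    def __init__(self):
        self.calls = 0

    def embed(self, texts, model, input_type):
        self.calls += 1
        return SimpleNamespace(embeddings=[[float(len(t)), 1.0] for t in texts])


client = CountingClient()
matrix = embed_texts(["a", "bb", "ccc", "dddd", "eeeee"], client, batch_size=2)
print(matrix.shape, client.calls)
```

Five texts with a batch size of 2 use three calls here instead of five. The resulting 2-D array has one row per dataset entry, which is the shape the FAISS steps work with.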
- Store embeddings in FAISS (this step is tricky; leverage resources like AI or our solution if needed).
a. Get the embedding vector size.
b. Create an index that uses the L2 distance (for example, faiss.IndexFlatL2). (Note: The L₂ distance, also known as Euclidean distance, measures how far apart two points are in space. In vector search (like FAISS), it measures how far a query embedding is from each stored document embedding: the smaller the L₂ distance, the more similar the vectors are.)
c. Add the embeddings to the FAISS index.
- Store IDs for retrieval (using iloc).
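To make the L2 search concrete, here is a NumPy-only stand-in for what a flat L2 index computes. It is a sketch for intuition, not the FAISS code itself: in the notebook the equivalent calls would be `faiss.IndexFlatL2(dimension)`, `index.add(embeddings)`, and `index.search(queries, k)`, all of which expect 2-D float32 arrays. The toy vectors and the `l2_search` name are hypothetical.

```python
import numpy as np


def l2_search(embeddings, query, top_k=1):
    """Brute-force nearest neighbours by squared L2 distance (NumPy stand-in)."""
    diffs = embeddings - query              # broadcast the query against every row
    distances = np.sum(diffs ** 2, axis=1)  # one squared distance per stored row
    order = np.argsort(distances)[:top_k]   # smallest distance first
    return distances[order], order


# Hypothetical toy data: three stored rows with 2-dimensional embeddings.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype="float32")
row_ids = [0, 1, 2]  # positional IDs; with a DataFrame, df.iloc[row_id] fetches the row

dimension = embeddings.shape[1]  # the embedding vector size
query = np.array([0.9, 0.1], dtype="float32")
distances, positions = l2_search(embeddings, query, top_k=1)
best_row = row_ids[positions[0]]
print(dimension, best_row, distances)
```

The index only returns positions, so keeping the row IDs in the same order as the embeddings is what lets you map a search hit back to its question and answer.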
- Create a function called initialize_faiss to set up FAISS with cosine similarity.
a. Make sure you normalize all embeddings.
b. Use inner product instead of L2.
- Create a function called retrieve_relevant_texts that enforces a strict similarity threshold (set top_k=1 and similarity_threshold=0.91). Make sure your function only accepts results that are highly similar.
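The reason for normalizing before using the inner product is that, for unit-length vectors, the inner product equals the cosine similarity. The sketch below shows this with NumPy only, as a stand-in for `faiss.normalize_L2` plus `faiss.IndexFlatIP`. Unlike the function the step asks for, this version takes an already-computed query vector rather than the query string, and the toy vectors and row labels are hypothetical.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale every row to unit length so inner products become cosine similarities."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)  # guard against all-zero rows


def retrieve_relevant_texts(query_vec, doc_vecs, texts, top_k=1, similarity_threshold=0.91):
    """Return up to top_k texts whose cosine similarity meets the threshold."""
    docs = normalize_rows(np.asarray(doc_vecs, dtype="float32"))
    query = normalize_rows(np.asarray(query_vec, dtype="float32").reshape(1, -1))[0]
    scores = docs @ query                   # cosine similarity per stored row
    best = np.argsort(-scores)[:top_k]      # highest similarity first
    return [texts[i] for i in best if scores[i] >= similarity_threshold]


# Hypothetical toy data standing in for dataset rows and their embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0]]
texts = ["row about audits", "row about materiality"]

close_match = retrieve_relevant_texts([0.95, 0.05], doc_vecs, texts)
no_match = retrieve_relevant_texts([1.0, 1.0], doc_vecs, texts)
print(close_match, no_match)
```

The second query sits halfway between both rows (cosine similarity of about 0.71), so the 0.91 threshold rejects it and the function returns an empty list. That empty list is what lets the chatbot answer "I don't know" for off-topic questions.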
- Create a function called generate_rag_response that retrieves relevant texts and uses Cohere to generate a response strictly from the dataset.
a. Make sure the function has if statements that return responses for questions that are not in the dataset, or that prompt the user to enter questions related to the dataset.
b. It is important that if the question does not have an answer in the dataset, the function returns a response such as “I don’t know. This information is not available in the dataset.”
c. Use a stronger prompt restriction to block out outside information. I passed max_tokens=100 and gave this prompt:
“You are an AI assistant that STRICTLY answers questions using the dataset provided below.
DO NOT use any general knowledge. DO NOT make up answers.
Your response must ONLY use the dataset.
Only answer questions that have to do with the information provided in the dataset.
If there is no relevant answer, only respond: “I don’t know. This information is not available in the dataset.”
Do NOT assume or guess anything beyond what is explicitly in the dataset.
Do not expand more on the information. Stick to solely the dataset
Dataset Information:
{context}
Question: {query}
Answer:”
d. Then make sure you return response.generations[0].text.strip()
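The control flow described in this step can be sketched as follows. This is an illustration under assumptions, not the posted solution: the retrieval and generation steps are passed in as functions (`retrieve` and `generate` are hypothetical parameter names) so the logic can be tried without an API key, and the prompt is an abridged version of the one quoted above.

```python
FALLBACK = "I don't know. This information is not available in the dataset."

# Abridged version of the restrictive prompt; {context} and {query} are filled in later.
PROMPT_TEMPLATE = """You are an AI assistant that STRICTLY answers questions using the dataset provided below.
DO NOT use any general knowledge. DO NOT make up answers.
If there is no relevant answer, only respond: "I don't know. This information is not available in the dataset."
Dataset Information:
{context}
Question: {query}
Answer:"""


def generate_rag_response(query, retrieve, generate):
    """Answer from retrieved rows only; fall back when nothing relevant is found."""
    query = query.strip()
    if not query:
        return "Please enter a question related to the dataset."
    matches = retrieve(query)
    if not matches:
        return FALLBACK  # skip the API call entirely for off-topic questions
    prompt = PROMPT_TEMPLATE.format(context="\n".join(matches), query=query)
    return generate(prompt).strip()


# Hypothetical offline stand-ins for the retrieval function and the Cohere call.
def fake_retrieve(query):
    return ["(dataset row about audits)"] if "audit" in query.lower() else []


def fake_generate(prompt):
    return "  (model output would appear here)  "


print(generate_rag_response("What color is the sky?", fake_retrieve, fake_generate))
print(generate_rag_response("What is an audit?", fake_retrieve, fake_generate))
```

In the notebook, `generate` would wrap the Cohere call whose result the step reads as `response.generations[0].text`. Returning the fallback before calling the model also saves calls against the monthly limit.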
- Now create a function for your chatbot. It should include a simple chat loop for interacting with the STRICT dataset-based RAG model.
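A chat loop of this kind might look like the sketch below. The function name `chat`, the exit words, and the greeting text are assumptions. The `read` and `write` parameters default to `input` and `print`, and exist only so the loop can be demonstrated with scripted input.

```python
def chat(respond, read=input, write=print, exit_words=("exit", "quit")):
    """Simple loop: read a question, print the bot's reply, stop on an exit word."""
    write("Ask an accounting question (type 'exit' to stop).")
    while True:
        try:
            user_text = read("You: ").strip()
        except EOFError:
            break  # input stream closed, e.g. the notebook kernel was interrupted
        if user_text.lower() in exit_words:
            write("Goodbye!")
            break
        if not user_text:
            continue  # ignore empty lines instead of calling the API
        write(f"Bot: {respond(user_text)}")


# Scripted demonstration so this cell does not wait for keyboard input.
scripted = iter(["What is an audit?", "", "exit"])
transcript = []
chat(lambda q: f"(answer for: {q})", read=lambda _prompt: next(scripted), write=transcript.append)
print(transcript)
```

To run it interactively, call `chat` with a function that maps a question to the RAG response and leave `read` and `write` at their defaults.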
- Run the chatbot.
- For the solution, copy and paste your answers to these 5 questions:
a. What is an audit?
b. What color is the sky? (this should not generate any answer related to “blue”)
c. What is materiality?
d. What is SOX?
e. Why are internal controls important?
Data Files
Suggestions and Hints
The solution provides a text file of the correct code; you will only need to make sure you have the correct file path and API key.
Your code will not run without an API key or the correct file path for the dataset. Make sure those are entered correctly.
ChatGPT is a great resource for troubleshooting errors in your code.
Troubleshooting:
The use of a free API has some limitations. The code works but is not 100% hallucination-free. If you know you got a hallucination, re-ask the same question and you will most likely get the right response. With a better, paid API, there should be fewer hallucinations. The code is also a little tricky to work with, so if the chatbot doesn’t reply immediately, press Enter again and the answer should appear.
Solution
My solution:
- What is an audit?
Potential Answer: “Audit is a process where external auditors review a company’s financial statements and operations to ensure accuracy and transparency. They work to verify that the company’s statements are correct and provide insights to improve internal controls and processes going forward.”
- What color is the sky? (this should not generate any answer related to “blue”)
Potential Answer: I cannot answer this with the current information, you would need to ask me about any specific item in the dataset. I can help with information about accounting
Potential Answer: I don’t know, but the answer seems to be blue. (This is a mix of a hallucination and the code working effectively. Because it begins with “I don’t know,” this would be an acceptable answer, but the first answer is what we are looking for.)
- What is materiality?
Potential Answer: Materiality focuses on the significance of a financial statement’s account balance or disclosure. It represents whether a critical information could affect stakeholders’ understanding of an entity’s financial position and results.
- What is SOX?
Potential Answer: SOX, which stands for Sarbanes-Oxley Act, are compliance regulations that provide guidelines for companies to follow in order to prevent fraud. The regulations require publicly traded companies to implement internal controls and financial reporting standards. The penalty for noncompliance with SOX includes fines, loss of exchange listing, and criminal charges.
- Why are internal controls important?
Potential Answer: Internal controls are essential because they establish guidelines and procedures for proper conduct and prevent malicious intent. They ensure compliance with laws, regulations, and ethical standards, and help maintain transparency and accountability in a company. This is vital for organizations, particularly those involving public trading, to maintain credibility and avoid significant consequences.
Challenge255_PYTHON_RAG_Solution_Image.docx
Solution Video: Challenge 255|TECHNOLOGY – Description