Introduction

For this project, I was hired by a company, Maespro, to develop a chatbot that lets users ask questions about the Quebec Construction Code (1000+ pages). What set this project apart from the chatbots I had built before was the requirement to cite the source of each answer (document name + page) and to highlight the relevant passage directly in the source document.

To carry out this project, I used React for the interface, Python Flask for the backend, MongoDB Atlas for the database and vector search, and OpenAI (GPT) for generating the responses.

Using OCR, I read the pages of the document, vectorized them, and stored them in the database. When a user submits a query, a vector search finds the most relevant pages, and those pages are passed to GPT as context to generate a response (retrieval-augmented generation, or RAG). This architecture is not limited to the Quebec Construction Code: other documents can be queried simply by adding them to the database. To highlight the relevant information in the document, I used PyMuPDF and GPT.
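To make the flow above more concrete, here is a minimal sketch of the retrieval, prompt-building, and highlighting steps. It is not the project's actual code. The page store is an in-memory list standing in for MongoDB Atlas, the embeddings are made-up three-dimensional vectors, the page texts are placeholders, and every name (PAGES, top_k_pages, build_prompt, highlight_passage) is hypothetical. The highlighting helper assumes the passage to highlight has already been chosen, for example by GPT, and that PyMuPDF is installed.

```python
import math

# Toy page store. In the real project these records would live in MongoDB Atlas
# and the embeddings would come from an embedding model (values here are made up).
PAGES = [
    {"doc": "Quebec Construction Code", "page": 46,
     "text": "Placeholder text for page 46.", "embedding": [0.9, 0.1, 0.0]},
    {"doc": "Quebec Construction Code", "page": 12,
     "text": "Placeholder text for page 12.", "embedding": [0.5, 0.5, 0.0]},
    {"doc": "Quebec Construction Code", "page": 300,
     "text": "Placeholder text for page 300.", "embedding": [0.0, 0.2, 0.9]},
]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_pages(query_embedding, pages, k=2):
    """Return the k pages whose embeddings are closest to the query."""
    ranked = sorted(
        pages,
        key=lambda p: cosine_similarity(query_embedding, p["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, pages):
    """Assemble a prompt that labels each page so the answer can cite it."""
    context = "\n\n".join(
        f"[{p['doc']}, page {p['page']}]\n{p['text']}" for p in pages
    )
    return (
        "Answer the question using only the pages below, "
        "and cite the document name and page number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


def highlight_passage(pdf_path, page_number, passage, out_path):
    """Highlight every occurrence of `passage` on a 1-based page number."""
    import fitz  # PyMuPDF, imported here so the rest runs without it

    doc = fitz.open(pdf_path)
    page = doc[page_number - 1]          # PyMuPDF pages are 0-based
    rects = page.search_for(passage)     # rectangles around each match
    if rects:
        page.add_highlight_annot(rects)
    doc.save(out_path)
    doc.close()
    return len(rects)
```

In this sketch, calling top_k_pages with a query embedding returns the closest pages, build_prompt turns them into a context block tagged with document name and page number, and the resulting string would be sent to the language model. Labeling each page in the prompt is what makes it possible for the answer to name its source.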

Result

[Screenshot: the chatbot's answer, the cited source, and the source page with the relevant passage highlighted]

Here, I asked the chatbot what direct ventilation is. The chatbot generated a response, provided the source (“Quebec Construction Code, page 46”), and displayed the relevant page with the information highlighted.

Conclusion

One challenge with language models is knowing where their information comes from. Being able to see the source of an answer is what lets a user verify that the answer is accurate. This project allowed me to develop a chatbot whose answers can be checked against the source document, which makes it suitable for use in a professional context. I was able to implement a complete RAG architecture, and because the solution can work with a large number of documents, it can be adapted for different companies with very different types of data.