Tech

Chatbot Creation: Answering Questions with YouTube Data

YouTube Home Page as Chatbot Creation Introduction
Written by
Santiago Sarachu
Published on
September 9, 2024

After our article on how to create an AI Chatbot, it was only natural that we build one ourselves to show you how it works. Now, with the growth of generative AI and natural language models, chatbots are becoming increasingly popular for their ability to answer complex questions by leveraging large datasets. With the vast amount of information available on platforms like YouTube, creating a chatbot that can provide insightful responses based on video content is possible. In this article, we'll walk you through building a simple chatbot that answers questions using data extracted from YouTube videos. We’ll explore the tools and techniques that make this possible, and by the end, you'll have a basic understanding of how to set up a similar system yourself.

We were inspired by the idea presented in https://www.youtube.com/watch?v=1Rpn4lrshlo 

The idea behind creating a chatbot is to make a large language model an expert on a particular theme. When interacting with models like ChatGPT, it’s not uncommon to encounter responses with hallucinations, outdated information, lack of source information, etc. This is where RAG emerges as an efficient solution for these problems.

Retrieval Augmented Generation (RAG) for Chatbot Creation

Retrieval Augmented Generation (RAG) essentially consists of 2 phases:

  • Information Retrieval: In this phase, given an input query, a search is performed within a knowledge base to locate information that could be relevant for responding.
  • Content generation: The relevant information is incorporated as context into the original query, and this enriched prompt is sent to a generative model to produce the final response.

In this article, we’ll use the RAG technique to make a large language model focus only on information provided from a YouTube channel and generate answers based on that information.

The creation of the chatbot

Tools to Use

  • OpenAI Language Models: These models are the backbone of the chatbot, enabling it to understand queries and generate text-based responses. Feeding the model with relevant context from YouTube videos can produce accurate and contextually appropriate answers.
  • Embedding Models: These models convert text into numerical vectors, known as embeddings. Embeddings capture the semantic meaning of the text, allowing the chatbot to find and retrieve the most relevant information when a question is asked.
  • Chroma Database: Chroma stores and manages the embeddings generated from the video transcriptions. It enables efficient searching and retrieval of relevant data chunks, which are then passed to the language model for generating responses.
  • Whisper: Whisper is an automatic speech recognition (ASR) system from OpenAI trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It enables transcription in multiple languages, as well as translation from those languages into English. In this case, we will use it to convert the spoken content of YouTube videos into text, which the chatbot then processes and uses to generate answers.

Generating the data

For this project, we chose the Fall of Civilizations podcast channel on YouTube. This podcast explores the collapse of different societies through history, making it an interesting source for generating detailed and informative responses. The videos are rich in narrative and analysis, providing substantial text to work with.

Once we’ve picked the channel, it’s time to download and transcribe the audio with Whisper.

We now have the transcriptions of all the podcast episodes. How do we provide them as context for the language model?

We have to convert the text into numerical values using an embedding model. Since the texts of all the episodes are quite long, we'll need to find a reasonable way to split them into smaller chunks.

We divided the transcriptions into chunks of 5000 characters, which is approximately 10 minutes of video, an amount of time we consider appropriate for providing sufficient context.

Once the text is divided into chunks, the next step is to convert these chunks into embeddings using an embedding model. These embeddings enable the chatbot to find the most relevant information in response to a query. To do that, we used Chroma, an open-source vector store for storing and retrieving vector embeddings. For the embedding model, we used “text-embedding-ada-002” from OpenAI.

Search for relevant chunks

When a user asks a question to the chatbot, the first thing the chatbot does is search for relevant chunks in the Chroma database. This returns the parts of the videos that relate the most to the user’s question. Then, this information is provided as context for the LLM, enabling it to generate an answer based on that information.

Chroma has a function called similarity_search_with_score, which has a parameter to set the number of results to return. We set it to retrieve the top 10 most similar chunks relative to a user's query. When a user interacts with our chatbot, this function returns 10 video segments most likely to contain relevant information. Then, these selected video segments are provided to the llm, which synthesizes the information to generate an appropriate response.

Design the prompt

Once the relevant text chunks are retrieved from the Chroma database, the next step is to format this information so that the language model can use it to generate an answer. We must construct a prompt to communicate the question and provide sufficient context from the retrieved text chunks.

Typically, the prompt includes a brief introduction or background followed by the specific question you want the model to answer. The relevant text chunks are included as context to help the model generate a more accurate and informed response.

As an example of a possible prompt, we used the following:

Asking the LLM

With the prompt designed and the relevant context in place, the next step is to query the language model. This involves sending the constructed prompt to the model and receiving a generated response. The LLM uses the context provided to understand the question and generate a coherent and contextually accurate answer.

We use the OpenAI API to get the large language model (gpt-4-0125-preview), and then make the query. The model then processes the input and returns a text response. After that, we use that output and the relevant chunks as context to build the JSON we deliver to the front end.

Recommendations for a better chatbot

  • Chunking the documents: 
    • Embedding overly large or excessively small text chunks may lead to suboptimal outcomes. Therefore, identifying the optimal chunk size for documents within the corpus is crucial to ensuring the accuracy and relevance of the retrieved results.
    • There is no one-size-fits-all ”best” strategy, only the most appropriate one for a particular context.
  • Embedding model selection:some text
    • There are different embedding models that vary in price, input length, task for which they were made, etc. It’s a good practice to try some of them and find the one that works best.
  • Vector databases: 
    • There are several databases which can be used and different metrics to compute similarity. Each one has its pros and cons and differs in aspects such as whether they are paid or open source, cloud options, and more.
  • Transform queries:
    • Consider expanding user queries into multiple variations.
    • Implement query rewriting to improve the alignment between the search terms and the content, ensuring more precise and relevant search results.

Conclusion

Creating a chatbot that can answer questions using YouTube data is a powerful demonstration of how modern AI tools can be leveraged to make vast amounts of information more accessible and interactive. Following the steps outlined in this guide, you can build a functional chatbot capable of providing responses based on information provided. If you want to check out the demo of the case we just saw, click here.

This process showcases the potential of retrieval-augmented generation (RAG), where external data enhance large language models to provide more accurate and contextually relevant answers. While this guide focused on a specific use case with the Fall of Civilizations podcast, the principles and techniques can be applied to various topics and content sources.

As AI and machine learning continue to evolve, the ability to integrate and interact with diverse data sources will only grow, opening up new possibilities for creating intelligent systems that can understand and respond to complex queries. Whether you're a developer looking to build your chatbot or simply interested in the potential of AI, the techniques covered here provide a solid foundation to start exploring this exciting field.