A summer of opportunities: my summer as an intern at Unbabel.
A few months before this summer, I started looking for a summer position where I could do something I loved while maximizing what I would learn.
After a few interview calls and some back-and-forth emails with an amazing recruiter from Unbabel (hi Maria Bernardo!), it was decided. I would be working as a summer intern for 2 months in the Portal Team at Unbabel.
🧐 | So, what now?
I was going to build a Slack chatbot for one of the internal channels. One of Unbabel’s products, the Portal, has a help-related Slack channel where internal users can ask anything they want. As many questions were repeated over long periods, an FAQ was created.
With this, the idea was to create a chatbot that could understand what users are asking and then retrieve the most appropriate answer from the FAQ (and the FAQ only). I would need to create the backend for the bot, handle all the integrations with the Slack API, build everything NLP- and ML-related, and I would also have the opportunity to create a frontend for managing some of the model’s configurations. This was a true full-stack project: backend, data science / artificial intelligence / NLP, and frontend.
And, yeah, that’s it! So, buckle up 🚀 and get ready for a deep dive into how we created the bot.
🧠 | The master plan:
With the help of my supervisor (Lawrence de Almeida, an amazingly supportive person with an unusually mature and constructive outlook on things), we created a first sketch of how we would build the bot.
Looks easy, right…? 😰 Well, whether it was easy or not, I’m still not sure, but it was an amazingly creative and fun journey, that’s for sure!
🏅 | How’s our gold? Data, I mean!
With the cliché that data is gold in mind, we needed to understand our data: what it can give us and what its limitations are.
The FAQ, like any other FAQ, is very direct and its questions and answers are as concise as possible. We have 50 entries. Data is scarce.
Oh, but wait. We found a gold bucket!
I also manually retrieved 150 questions that were asked in the Slack channel over one year. Each of these questions was then manually categorized.
As these questions were asked by humans, they are very different from the FAQ entries: usually more detailed and less direct. And while the FAQ entries are, on average, quite short, these prompts are much longer.
🤓 | Conclusions, conclusions, and more conclusions!!
With our data in hand, we applied some standard NLP pre-processing and ran it through a typical data science pipeline. In the end, we were only able to find one interesting pattern:
After removing the stop-words, there is a distinct group of words for each category, like a fingerprint for each one.
For the General category, the words design, portal, access, customer, and user appear far more frequently than any others. This is valuable knowledge that we are going to use.
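Extracting those per-category fingerprints can be sketched roughly as follows. Everything here is illustrative: the stop-word list is a tiny stand-in for a real one (e.g. NLTK’s), and the sample entries and categories are made up, not the actual FAQ.

```python
from collections import Counter

# Tiny stop-word list for illustration; a real pipeline would use
# NLTK's or spaCy's full stop-word lists.
STOP_WORDS = {"the", "a", "an", "is", "to", "i", "how", "do", "can",
              "my", "in", "for", "as"}

def category_fingerprints(entries, top_n=5):
    """Return the most frequent non-stop-words per category.

    `entries` is a list of (category, text) pairs.
    """
    counts = {}
    for category, text in entries:
        words = [w.strip("?.,!").lower() for w in text.split()]
        words = [w for w in words if w and w not in STOP_WORDS]
        counts.setdefault(category, Counter()).update(words)
    return {cat: [w for w, _ in c.most_common(top_n)]
            for cat, c in counts.items()}

# Hypothetical FAQ-style entries (not the real data):
entries = [
    ("General", "How do I access the portal as a customer?"),
    ("General", "Can a user change the portal design?"),
    ("Billing", "How do I download an invoice for my subscription?"),
]
fingerprints = category_fingerprints(entries)
```

With the toy entries above, “portal” dominates the General fingerprint, which is exactly the kind of signal we found in the real data.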
🤖 | Defining the use cases:
Regarding the use cases and bot workflow, since I’m more interested in telling you about the technical part, the only important thing for you to know is that the bot is always listening to every message sent on the Slack channel. If a message matches certain rules, we try to retrieve the best answer and show it to the user.
👨💻 | The fun begins. Let’s create the bot:
To create the bot, our idea is quite simple. First, let’s identify whether the input is 1) indeed a question and 2) relevant in our context. Here, our context is the ‘Portal’ context; however, we can generalize it to anything tech-related.
1 | Question classification
To create this question classifier, which outputs ‘yes’ if the input is a question and relevant, or ‘no’ otherwise, we had to collect more data.
I present to you the TechQA dataset! It has the same structure as our FAQ, minus the category. As we want to detect whether a given input is a question and whether it is relevant (our context is tech-related), TechQA is a perfect match!
After testing multiple approaches, either by experimenting with different pre-processing methods or by using different AI models, this was the pipeline that worked the best for us, reaching very impressive results:
📖 | Heuristics, heuristics, and maybe even some more heuristics!
This is where the fun part begins! No matter which experiment we ran, the results were quite bad. Do you remember the results we obtained when analysing our data? We found that, for every category, some keywords stand out.
We use that information to improve the results given by the model. For each word in the input sentence, if there is a match with any of the keywords that stood out in the categories, we add more weight to the probability that the result is ‘yes’ (is a question and relevant). This helps the ‘relevant’ part of the detector.
For the ‘question’ part of the detector, we also help the model with some common NLP question-detection patterns, like checking whether the sentence ends in a question mark or starts with “How can…”.
For our domain, it was only with these two helper methods that our results became good.
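A minimal sketch of how those two heuristics might be layered on top of the model’s output. The keyword set, boost value, and question patterns here are all assumptions for illustration; the real weights and rules were tuned by hand.

```python
import re

# Hypothetical context keywords and boost value (illustrative only).
CONTEXT_KEYWORDS = {"design", "portal", "access", "customer", "user"}
BOOST = 0.15  # weight added per heuristic hit (assumed value)

# A few common question-opening patterns plus the question-mark check.
QUESTION_PATTERNS = [
    r"\?\s*$",  # sentence ends with a question mark
    r"^(how can|how do|what|why|where|can i|is there)\b",
]

def heuristic_score(sentence, base_probability):
    """Adjust a model's P('yes') with keyword and question heuristics.

    `base_probability` is the classifier's original probability that the
    input is a relevant question.
    """
    score = base_probability
    words = {w.strip("?.,!").lower() for w in sentence.split()}
    # Relevance heuristic: boost for each context keyword present.
    score += BOOST * len(words & CONTEXT_KEYWORDS)
    # Question heuristic: boost if the sentence looks like a question.
    lowered = sentence.lower().strip()
    if any(re.search(p, lowered) for p in QUESTION_PATTERNS):
        score += BOOST
    return min(score, 1.0)
```

For example, “How can I access the portal?” picks up two keyword hits plus the question-pattern boost on top of whatever the model said.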
2 | Question Categorization:
Now that we can classify whether we should answer an input or not, the next step is to give the correct answer to the user. To do so, we first narrow our search field: instead of retrieving the best answer from the entire FAQ, we first identify the category (remember them?) that the question belongs to.
Regarding the training and testing data, the questions manually retrieved from the Slack channel and the FAQ entries are both categorized, so together they make up the data for the categorizer.
In the image above you can also see some of the experiments we ran, like RASA and Semantic Hashing. They all yielded poor results. It was only when we moved to Multinomial Naive Bayes with (guess what?) a Heuristics Engine that the results improved.
Essentially, the classifier returns a probability vector of the question belonging to each of the categories in the FAQ. Then (with the Heuristics Engine), for each word in the question, if the word belongs to any category’s keywords, we add a small value to that category. The values to be added were set manually, so we now have a table of Keyword-Category-Weight entries. Without the heuristics, the accuracy was around 50%; with them, almost 90%. We finally have good results! 🤩
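That Keyword-Category-Weight step might look something like this. The table entries and weights below are invented for illustration; the real values were set manually, and the input probability vector would come from something like MultinomialNB’s `predict_proba`.

```python
# Hypothetical Keyword-Category-Weight table (illustrative values only;
# the real weights were set manually).
KEYWORD_WEIGHTS = {
    ("portal", "General"): 0.10,
    ("access", "General"): 0.08,
    ("invoice", "Billing"): 0.12,
}

def apply_heuristics(question, probabilities):
    """Add keyword weights to a per-category probability vector.

    `probabilities` maps each category to the classifier's probability
    for that category.
    """
    adjusted = dict(probabilities)
    for word in question.lower().split():
        word = word.strip("?.,!")
        for (keyword, category), weight in KEYWORD_WEIGHTS.items():
            if word == keyword:
                adjusted[category] += weight
    # Re-normalize so the adjusted scores still sum to 1.
    total = sum(adjusted.values())
    return {cat: p / total for cat, p in adjusted.items()}
```

So a question mentioning “portal” and “access” gets nudged towards General even when the base classifier is on the fence.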
3 | Question-Answer Pairing
Now comes the fun part 🥳! We are now ready to retrieve the most appropriate answer for the user’s inputs.
To do so, we are using the Haystack Python library. With Haystack, using cosine similarity as the measure of similarity between sentences, we run a pre-trained sentence transformer available on Hugging Face: all-MiniLM-L6-v2. This model maps sentences to 384-dimensional vectors, from which we can calculate the similarity to each of the FAQ entries and retrieve the most similar one.
When searching for the best answer, we search both within the predicted category and across the entire FAQ (as the predicted category might be wrong). Afterwards, we return an average of the two results.
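A stripped-down sketch of that retrieval step, with one possible reading of the averaging scheme: each FAQ entry’s score averages its full-FAQ similarity with its in-category similarity (zero when it’s outside the predicted category). The toy 3-d embeddings stand in for the 384-dimensional all-MiniLM-L6-v2 vectors, and the real system does this through Haystack rather than by hand.

```python
import math

def cosine_similarity(a, b):
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_answer(query_vec, faq, predicted_category):
    """Return the FAQ answer with the best averaged score.

    `faq` is a list of (category, embedding, answer) tuples.  Entries in
    the predicted category get their similarity counted twice (the
    "category search"); the rest only get the global-FAQ similarity.
    """
    best = None
    for category, embedding, answer in faq:
        sim = cosine_similarity(query_vec, embedding)
        in_category = sim if category == predicted_category else 0.0
        score = (sim + in_category) / 2
        if best is None or score > best[0]:
            best = (score, answer)
    return best[1]
```

Note how a very strong match outside the predicted category can still win, which is the point of also searching the whole FAQ.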
🤯 | Wow, this is getting a bit long… Now what?
As stated earlier, data is gold and, in this case, it is our biggest limitation. If a user asks for something not present in the FAQ, the result won’t make sense. That’s an intrinsic limitation we can’t do anything about.
To overcome, or at least ease, this limitation, we introduced a thumbs-up and thumbs-down voting system. After some months of use, we will be able to see which answers performed poorly and either tweak what we can (the keywords and their weights) and/or add new FAQ entries.
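The voting loop could be as simple as the in-memory sketch below. The class name, thresholds, and storage are all assumptions (the real system would persist votes somewhere durable); the point is just flagging entries whose thumbs-up share drops too low.

```python
from collections import defaultdict

class FeedbackTracker:
    """Track 👍/👎 votes per FAQ answer to spot poorly performing entries.

    Minimal in-memory sketch; a real system would persist the votes.
    """

    def __init__(self):
        self.votes = defaultdict(lambda: {"up": 0, "down": 0})

    def record(self, faq_id, thumbs_up):
        """Record one vote for the given FAQ entry."""
        self.votes[faq_id]["up" if thumbs_up else "down"] += 1

    def low_rated(self, min_votes=5, threshold=0.5):
        """FAQ entries whose share of 👍 votes falls below `threshold`."""
        flagged = []
        for faq_id, v in self.votes.items():
            total = v["up"] + v["down"]
            if total >= min_votes and v["up"] / total < threshold:
                flagged.append(faq_id)
        return flagged
```

After a few months, `low_rated()` gives the shortlist of answers whose keywords and weights need tweaking, or which need new FAQ entries.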
Regarding the deployment phase, everything was done by Mathieu, a backend developer in the Portal team who helped me tirelessly. I only prepared the Docker containers. The real hard work (a lot of it, believe me) was done by him… A big thank you!
✌️ | One more thing.
Oh, there’s a mini frontend… Wait, what? As I also love design and frontend development, I created a frontend with two main objectives: visualizing the FAQ and editing keywords. It was built entirely with Vue 3 and Tailwind CSS and connects directly to the AI models’ backend (which, in turn, handles the connections to the database, stored in an AWS S3 bucket).
You can’t see them, but there are some lovely animations and transitions on the sticky headers 🥹.
😱 | Ok, is this all?
This might sound like a cliché, but I learned a lot during this summer internship. I created an end-to-end system with two backends (one for Slack and one for the AI models), learned more about data science and AI in the real world, implemented a mini frontend in Vue 3 (which I knew nothing about before), and integrated it all by creating Docker containers and using Docker Compose. I also spent a month working on an internal frontend project with another summer intern, following all the best practices and rules the team uses, which taught me a great deal about frontend development.
It was a great journey and opportunity to work for a bit over two months in a company with a demanding environment, in a team like the one I was in, while having the privilege of learning so much about areas I deeply care about and love.
The entire experience was further improved by the help, companionship, and wisdom of five people: Lawrence de Almeida, Maria Bernardo, Mathieu Giquel, Nuno Infante, and Luís Bernardo. A special thank you to Lawrence for all the dedication he has shown while supervising my work.
Not that anyone asked, but, in case someone is wondering: I wholeheartedly recommend applying for a summer internship at Unbabel. Whatever your area of interest, frontend, ML, or anything else, you are going to love it!
Until next time, Unbabel. 👋