ML Model to Detect Toxic Content on Quora

Case Study: Machine Learning Solution for Online Forum

May 29, 2024
Posted by Akvelon

Business Need

Quora, an online forum where individuals can connect with a community and ask and answer questions, wanted to identify toxic content in online conversations and needed help preventing insincere questions from being posted on its site.

An insincere question was defined as a statement that has been disguised as a question and contains one or more of the following elements:

Non-neutral tone: has an exaggerated tone or is rhetorical
Disparaging or inflammatory: discriminates against any class of people (gender, race, religion, etc.)
False information
Sexual or inappropriate content

Quora posed this problem as a competition. Participants were tasked with creating models that could identify and flag insincere questions to help Quora ensure the safety and well-being of its users.

Solution

To solve this problem, Danylo Kosmin, a Machine Learning/Data Scientist, chose to use deep neural networks (including recurrent neural networks such as LSTM and GRU) together with Natural Language Processing.

Danylo took the following steps to create a solution:

Exploration of data
- Calculated meta-features such as the number of unique words, punctuation marks, and characters in each text to better understand the data. Based on these statistics, he chose the vocabulary size for training and selected the preprocessing methods.
Searching data
- Searched for the most common misspellings and typos and created a list of special characters (fig. 1). This list was used for data cleaning.
Working with data
- Removed all stop words from the texts to investigate the most frequent words and n-grams (2-grams, 3-grams, 4-grams) in insincere questions. Danylo then set the maximum sequence length to 72 words and built the vocabulary from the 120,000 most frequent words. Next, he built a baseline preprocessing pipeline (text cleaning, number cleaning, misspelling correction, filling NA values, etc.). He also tried other feature engineering methods and approaches, such as stacking/blending different embeddings and creating statistical features, but they did not improve his score.
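The data preparation steps above can be illustrated with a minimal sketch in plain Python. It is not Danylo's actual code: the sequence length of 72 and the vocabulary size of 120,000 come from the text, while the function names, the regex tokenizer, and the reserved ids for padding and unknown words are assumptions made for illustration.

```python
import re
import string
from collections import Counter

MAX_LEN = 72          # maximum number of words per sequence (from the text)
MAX_WORDS = 120_000   # vocabulary size (from the text)
PAD, UNK = 0, 1       # hypothetical reserved ids for padding and unknown words


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters, digits and underscores.
    return re.findall(r"\w+", text.lower())


def meta_features(text):
    """Per-question statistics of the kind used for data exploration."""
    tokens = tokenize(text)
    return {
        "n_words": len(tokens),
        "n_unique_words": len(set(tokens)),
        "n_chars": len(text),
        "n_punctuation": sum(1 for ch in text if ch in string.punctuation),
    }


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(texts, max_words=MAX_WORDS):
    """Map the max_words most frequent tokens to ids starting at 2."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, max_len=MAX_LEN):
    """Turn a text into a fixed-length list of ids: truncate, then pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

Counting the output of `ngrams` over the insincere questions with a `Counter` would give the most frequent 2-, 3-, and 4-grams; `encode` produces the fixed-length input that an embedding layer expects.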

Danylo’s baseline was a bidirectional LSTM model with an embedding layer, implemented in PyTorch. From there, he tuned hyperparameters and experimented with different recurrent neural network (RNN) structures and layers. He took the following steps in his approach:

Defined the embedding layer using GloVe and Paragram pre-trained embeddings
Added a bidirectional LSTM with 128 features in the hidden state. The LSTM output was connected to an attention layer and to a bidirectional GRU with 256 features in the hidden state
Applied average and max 1D pooling to the GRU output. The LSTM, GRU, AvgPool, and MaxPool outputs were concatenated and passed to a fully connected layer, followed by batch normalization, dropout, and a linear output layer.
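The architecture described above might look roughly like the following PyTorch sketch. The hidden sizes (128 for the LSTM, 256 for the GRU) come from the text; everything else is an assumption: the class names `Attention` and `InsincereNet`, the simple attention pooling, the use of the GRU's final hidden states as its summary, the embedding dimension of 300, the fully connected width of 64, and the dropout rate. The embedding here is randomly initialized; in the described solution it would hold the pre-trained GloVe and Paragram weights.

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    """Assumed attention pooling: one learned weight per time step, then a weighted sum."""

    def __init__(self, feature_dim):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, x):                                   # x: (batch, seq_len, feature_dim)
        weights = torch.softmax(self.score(torch.tanh(x)), dim=1)  # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)                     # (batch, feature_dim)


class InsincereNet(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, lstm_hidden=128,
                 gru_hidden=256, fc_dim=64, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, bidirectional=True, batch_first=True)
        self.attention = Attention(2 * lstm_hidden)
        self.gru = nn.GRU(2 * lstm_hidden, gru_hidden, bidirectional=True, batch_first=True)
        # LSTM attention (2*128) + GRU last state, avg pool, max pool (3 * 2*256)
        concat_dim = 2 * lstm_hidden + 3 * (2 * gru_hidden)
        self.fc = nn.Linear(concat_dim, fc_dim)
        self.bn = nn.BatchNorm1d(fc_dim)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(fc_dim, 1)

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids)
        h_lstm, _ = self.lstm(emb)                          # (batch, seq_len, 256)
        h_gru, h_n = self.gru(h_lstm)                       # (batch, seq_len, 512)
        lstm_att = self.attention(h_lstm)                   # attention over LSTM outputs
        gru_last = torch.cat([h_n[-2], h_n[-1]], dim=1)     # forward + backward final states
        avg_pool = h_gru.mean(dim=1)                        # average pooling over time
        max_pool = h_gru.max(dim=1).values                  # max pooling over time
        features = torch.cat([lstm_att, gru_last, avg_pool, max_pool], dim=1)
        hidden = self.dropout(self.bn(torch.relu(self.fc(features))))
        return self.out(hidden).squeeze(1)                  # raw logits, one per question
```

The model returns raw logits, so a binary cross-entropy loss would be applied with `nn.BCEWithLogitsLoss`, and probabilities for thresholding would be obtained with `torch.sigmoid(logits)`.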

He used binary cross-entropy as the loss function for the network. The model was trained with 4-fold cross-validation (KFold), with 5 epochs for each fold. The threshold for binarization was selected by checking the F-score on the validation set.
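Threshold selection of this kind can be sketched as a simple search over candidate values. The following is an illustration only, assuming the F-score in question is F1 and that candidates range from 0.10 to 0.90 in steps of 0.01; the function names and the candidate grid are hypothetical and not taken from the original solution.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = insincere), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates=None):
    """Return (threshold, f1) for the candidate that maximizes F1 on validation data."""
    if candidates is None:
        candidates = [i / 100 for i in range(10, 91)]   # assumed grid: 0.10 .. 0.90
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if p > t else 0 for p in probs]      # binarize predicted probabilities
        score = f1_score(y_true, preds)
        if score > best_f1:                             # keep the first best threshold
            best_t, best_f1 = t, score
    return best_t, best_f1
```

Here `probs` would be the model's sigmoid outputs on the validation fold, and the chosen threshold would then be applied to the test predictions.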

Benefits and Results

In the first stage of the competition, 4,037 teams participated and Danylo’s model took 151st place on the public leaderboard. This qualified him to compete in the second stage, in which only 1,401 teams remained. In the second stage, his model finished in the top 3% and earned a silver medal.

Technology Used

Natural Language Processing, Machine Learning, neural networks (including recurrent neural networks like LSTM and GRU)
