Business Need
Quora, an online conversation forum where individuals can connect with an online community to ask and answer questions, wanted a way to identify toxic content in online conversations and needed help preventing insincere questions from being posted on its site.
An insincere question was defined as a statement that has been disguised as a question and contains one or more of the following elements:
- Non-neutral tone: has an exaggerated tone or is rhetorical
- Disparaging or inflammatory: discriminates against any class of people (gender, race, religion, etc.)
- False information
- Sexual or inappropriate content
Participants in the resulting competition were tasked with creating models that could identify and flag insincere questions, helping Quora ensure the safety and well-being of its users.
Solution
To solve this problem, Danylo Kosmin, a Machine Learning/Data Scientist, chose to use deep neural networks (including recurrent networks such as LSTM and GRU) together with Natural Language Processing.
Danylo took the steps below to create a solution:
- Exploration of data
- Calculated metafeatures, such as the number of unique words, punctuation marks, and characters in the text, for better data understanding. Based on these statistics, he chose the number of words for the training vocabulary and selected preprocessing methods.
- Searching data
- He searched for the most common misspellings and typos and created a list of special characters (fig. 1). This list was used for data cleaning.
- Working with data
- All stop words were removed from the texts to investigate the most frequent insincere words and n-grams (2-grams, 3-grams, 4-grams). After this, Danylo set 72 as the maximum number of words in a sequence and kept the 120,000 most frequent words as the vocabulary. He then built a preprocessing baseline (text cleaning, number cleaning, misspelling correction, filling NA values, etc.); a sketch of these steps follows this list. He also tried different feature engineering methods and other approaches, such as stacking/blending different embeddings and creating statistical features, but they did not improve his score.
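The write-up does not include Danylo's code, but a minimal sketch of the metafeature and cleaning steps above might look like the following, assuming a pandas DataFrame with a `question_text` column (as in the Quora dataset). The helper names and the short misspelling list are illustrative only:

```python
import re
import pandas as pd

PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def add_metafeatures(df: pd.DataFrame) -> pd.DataFrame:
    """Compute simple statistics used to understand the data."""
    df["num_chars"] = df["question_text"].str.len()
    df["num_words"] = df["question_text"].str.split().str.len()
    df["num_unique_words"] = df["question_text"].apply(
        lambda t: len(set(t.lower().split()))
    )
    df["num_punct"] = df["question_text"].apply(
        lambda t: sum(ch in PUNCT for ch in t)
    )
    return df

# Illustrative cleaning baseline: isolate special characters so they become
# separate tokens, replace long numbers with a placeholder, and correct a
# few example misspellings (the real lists were much longer).
MISSPELLS = {"qoura": "quora", "whta": "what", "pokémon": "pokemon"}

def clean_text(text: str) -> str:
    for ch in PUNCT:
        text = text.replace(ch, f" {ch} ")
    text = re.sub(r"\d{5,}", "#####", text)
    return " ".join(MISSPELLS.get(w.lower(), w) for w in text.split())
```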
Danylo’s baseline was a bidirectional LSTM model with an embedding layer, implemented in PyTorch. From there, he began selecting and tuning hyperparameters and experimenting with recurrent neural network (RNN) structures and layers. He took the following steps in his approach:
- Initialized the embedding layer with GloVe and Paragram pre-trained embeddings
- Built a bidirectional LSTM with 128 features in the hidden state, connected to an attention layer and a bidirectional GRU with 256 features in the hidden state
- Applied average and max 1D pooling to the GRU outputs. The LSTM, GRU, average-pool, and max-pool results were concatenated and fed into a fully connected layer, followed by batch normalization and dropout, and finally an output linear layer (see the sketch after this list)
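A minimal PyTorch sketch consistent with this description might look like the code below. The attention implementation, the fully connected layer size (64), the dropout rate, and the exact wiring of the concatenated features are assumptions, since the write-up does not specify them:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Simple attention over RNN timesteps (assumed variant)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, seq, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)            # (batch, dim)

class InsincereClassifier(nn.Module):
    def __init__(self, embedding_matrix):          # (vocab_size, emb_dim) array
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float32), freeze=True
        )
        emb_dim = self.embedding.embedding_dim
        # BiLSTM with 128 hidden features -> 256-dim outputs
        self.lstm = nn.LSTM(emb_dim, 128, bidirectional=True, batch_first=True)
        # BiGRU with 256 hidden features -> 512-dim outputs
        self.gru = nn.GRU(256, 256, bidirectional=True, batch_first=True)
        self.lstm_attn = Attention(256)
        self.gru_attn = Attention(512)
        # Concatenate LSTM attention (256), GRU attention (512),
        # GRU average pool (512), and GRU max pool (512).
        self.fc = nn.Linear(256 + 512 + 512 + 512, 64)
        self.bn = nn.BatchNorm1d(64)
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(64, 1)

    def forward(self, x):                          # x: (batch, seq) word ids
        emb = self.embedding(x)
        lstm_out, _ = self.lstm(emb)
        gru_out, _ = self.gru(lstm_out)
        avg_pool = gru_out.mean(dim=1)
        max_pool, _ = gru_out.max(dim=1)
        feats = torch.cat(
            [self.lstm_attn(lstm_out), self.gru_attn(gru_out), avg_pool, max_pool],
            dim=1,
        )
        h = self.dropout(self.bn(torch.relu(self.fc(feats))))
        return self.out(h)                         # raw logit
```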
He used binary cross-entropy as the loss function for the network. The final model was trained with 4 KFold splits and 5 epochs per fold, and the threshold for binarization was selected by checking the F-score on the validation set.
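The threshold search can be sketched as follows. Since the model above outputs raw logits, `torch.nn.BCEWithLogitsLoss` is the natural PyTorch choice for binary cross-entropy, and validation probabilities would come from applying a sigmoid to the logits; the search range and step below are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Return the binarization threshold that maximizes F1 on validation data."""
    thresholds = np.arange(0.10, 0.51, 0.01)
    scores = [f1_score(y_true, (y_prob > t).astype(int)) for t in thresholds]
    return float(thresholds[int(np.argmax(scores))])

# Tiny synthetic check: probabilities above ~0.3 correspond to positives here,
# so the search should land near that value.
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)
y_true = (y_prob > 0.3).astype(int)
print(best_threshold(y_true, y_prob))  # ~0.30
```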
Benefits and Results
In the first stage of the competition, 4,037 teams participated and Danylo’s model took 151st place on the public leaderboard. This qualified him for the second stage, which included only 1,401 teams. In the second stage, his model placed in the top 3% and earned a silver medal.
Technology Used
Natural Language Processing, Machine Learning, Neural Networks (including recurrent neural networks such as LSTM and GRU)