Recently, Akvelon Machine Learning Engineer/Data Scientist Danylo Kosmin participated in a Kaggle competition and won a silver medal, placing in the top 3% of 4,037 teams. The subject of this code competition was “Quora Insincere Questions Classification,” which called on Kagglers to help Quora, an online conversation forum where individuals can connect with a community to ask and answer questions. Quora wanted to find a way to identify toxic content in online conversations and needed help preventing insincere questions from being posted on its site. An insincere question was defined as a statement disguised as a question that contains one or more of the following elements:
- Non-neutral tone: has an exaggerated tone or is rhetorical
- Disparaging or inflammatory: discriminates against any class of people (gender, race, religion, etc.)
- False information
- Sexual or inappropriate content
To solve this problem, Danylo chose to utilize deep neural networks, including recurrent architectures such as LSTM and GRU, together with Natural Language Processing (NLP) techniques.
Task
Participants in this competition were tasked with creating models that could identify and flag insincere questions to help Quora ensure the safety and well-being of its users. Quora already has methods in place to address this problem, but wanted to explore a more scalable approach. For this competition, the evaluation metric was the F1 score.
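For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall); it is a natural choice here because insincere questions make up only a small fraction of the data, so plain accuracy would reward a model that flags nothing at all.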
Approach
Participants utilized embeddings: mappings from discrete objects (such as words) to vectors of real numbers. External data sources were not allowed for this competition, but Quora provided several word embeddings along with the dataset that could be used in the models (a loading sketch follows the list):
- GoogleNews-vectors-negative300
- glove.840B.300d
- paragram_300_sl999
- wiki-news-300d-1M
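As an illustration, here is a minimal sketch of how one of these files (for example, glove.840B.300d) can be loaded into an embedding matrix aligned with a model vocabulary. The file path, the `word_index` mapping, and the vocabulary limit are assumptions for illustration, not details taken from Danylo’s code.

```python
import numpy as np

def load_embedding_matrix(path, word_index, embed_dim=300, max_words=120_000):
    """Build an embedding matrix for the model vocabulary from a
    pre-trained embedding file (one token plus its vector per line)."""
    vectors = {}
    with open(path, encoding="utf8", errors="ignore") as f:
        for line in f:
            parts = line.rstrip().rsplit(" ", embed_dim)
            if len(parts) == embed_dim + 1:       # skip malformed lines
                vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Rows stay zero for words missing from the pre-trained file.
    matrix = np.zeros((max_words + 1, embed_dim), dtype="float32")
    for word, idx in word_index.items():
        if idx <= max_words and word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Hypothetical usage, where word_index maps tokens to integer ids:
# embedding_matrix = load_embedding_matrix("glove.840B.300d.txt", word_index)
```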
With access to those data sources, Danylo took the steps below to create a solution.
Exploration of data
Danylo calculated metafeatures, such as the number of unique words, punctuation marks, and characters in each text, to better understand the data. Based on these statistics, he chose the number of words for the training vocabulary and selected his preprocessing methods.
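A minimal sketch of this kind of metafeature computation with pandas follows; the `question_text` column name matches the competition data, while the helper itself is illustrative.

```python
import string
import pandas as pd

def add_metafeatures(df, text_col="question_text"):
    """Attach simple statistics that help characterize each question."""
    df["num_words"] = df[text_col].str.split().str.len()
    df["num_unique_words"] = df[text_col].apply(lambda t: len(set(t.split())))
    df["num_chars"] = df[text_col].str.len()
    df["num_punctuation"] = df[text_col].apply(
        lambda t: sum(ch in string.punctuation for ch in t))
    return df

# train = add_metafeatures(pd.read_csv("train.csv"))
```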
Searching the data
He searched for the most common misspellings and typos and created a list of special characters (fig. 1) to use for data cleaning.
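What such cleaning might look like is sketched below; the special-character list and misspelling map are short illustrative stand-ins for the full lists assembled from the data.

```python
import re

# Illustrative entries only; the real lists were built by inspecting
# the most frequent typos and special characters in the dataset.
SPECIAL_CHARS = "“”‘’´`—–…\u200b"
MISSPELLINGS = {"qoura": "quora", "whta": "what", "colour": "color"}

def clean_text(text):
    """Replace special characters with spaces and fix known misspellings."""
    text = re.sub(f"[{re.escape(SPECIAL_CHARS)}]", " ", text)
    words = [MISSPELLINGS.get(w.lower(), w) for w in text.split()]
    return " ".join(words)
```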
Working with the data
All stop words were removed from the texts to investigate the most frequent insincere words and n-grams (2-grams, 3-grams, and 4-grams). After this, Danylo set 72 as the maximum number of words in a sequence and kept the 120,000 most frequent words as the vocabulary. He then built a preprocessing baseline (text cleaning, number cleaning, misspelling correction, filling NA values, etc.). He also tried different feature engineering methods and other approaches, such as stacking/blending different embeddings and creating statistical features, but they did not improve his score.
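The tokenization step under these settings might look like the sketch below; the Keras tokenizer was a common choice in competition kernels at the time, though the write-up does not name the exact tool.

```python
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 120_000  # most frequent words kept in the vocabulary
MAX_LEN = 72         # maximum number of words per sequence

train = pd.read_csv("train.csv")            # competition data (assumed path)
texts = train["question_text"].fillna("")   # fill NA values, as described above

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)
X_train = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)
```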
Baseline and Network Model
Danylo’s baseline was a bidirectional LSTM model with an embedding layer, built using PyTorch. From there, he began selecting and tuning different hyperparameters and experimenting with recurrent neural network (RNN) structures and layers. He took the following steps in his approach (a code sketch of the resulting network follows the list):
- Defined the embedding layer using the GloVe and Paragram pre-trained embeddings
- Built a bidirectional LSTM with 128 features in the hidden state, connected to an attention layer and to a bidirectional GRU with 256 features in the hidden state
- Applied average and max 1D pooling to the GRU outputs. The LSTM attention output and the GRU AvgPool and MaxPool results were concatenated and passed through a fully connected layer, followed by batch normalization and dropout, before the output linear layer
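Below is a minimal PyTorch sketch of this architecture. The layer widths (128 and 256) follow the description above; the attention implementation, the fully connected width, and the dropout rate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Simple additive attention over RNN time steps (the exact attention
    used in the solution is not specified, so this is an assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (batch, seq, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)         # (batch, dim)

class InsincereNet(nn.Module):
    def __init__(self, embedding_matrix):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float32), freeze=True)
        embed_dim = self.embedding.embedding_dim
        self.lstm = nn.LSTM(embed_dim, 128, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(256, 256, bidirectional=True, batch_first=True)
        self.attention = Attention(256)         # over the bi-LSTM output
        self.fc = nn.Linear(256 + 512 + 512, 64)
        self.bn = nn.BatchNorm1d(64)
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(64, 1)

    def forward(self, x):
        emb = self.embedding(x)                 # (batch, seq, embed_dim)
        lstm_out, _ = self.lstm(emb)            # (batch, seq, 256)
        gru_out, _ = self.gru(lstm_out)         # (batch, seq, 512)
        att = self.attention(lstm_out)          # (batch, 256)
        avg_pool = gru_out.mean(dim=1)          # (batch, 512)
        max_pool, _ = gru_out.max(dim=1)        # (batch, 512)
        h = torch.cat([att, avg_pool, max_pool], dim=1)
        h = self.dropout(self.bn(torch.relu(self.fc(h))))
        return self.out(h)                      # raw logit for BCEWithLogitsLoss
```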
He used binary cross-entropy as the loss function for the network. The final model used 4 KFold splits, with 5 epochs of training on each fold. The threshold for binarizing predictions was selected by checking the F1 score on the validation set (see the sketch below).
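A sketch of the cross-validation and threshold search is shown below; `train_model` is a hypothetical helper standing in for the 5-epoch training loop with BCE loss.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def best_threshold(y_true, y_prob):
    """Pick the binarization threshold that maximizes F1 on validation data."""
    thresholds = np.arange(0.1, 0.6, 0.01)
    scores = [f1_score(y_true, y_prob > t) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

# Hypothetical outer loop over the 4 folds:
# kf = StratifiedKFold(n_splits=4, shuffle=True)
# for train_idx, val_idx in kf.split(X_train, y_train):
#     model = InsincereNet(embedding_matrix)
#     val_prob = train_model(model, train_idx, val_idx)   # assumed helper
#     print("fold threshold:", best_threshold(y_train[val_idx], val_prob))
```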
Results
In the first stage of the competition, 4,037 teams participated, and Danylo’s model took 151st place on the public leaderboard. This qualified him to compete in the second stage against only 1,401 teams. In the second stage, his model finished in the top 3% and earned a silver medal!
Danylo is an important member of the Akvelon team and has been instrumental in the creation of multiple projects, from the Meeting Summarizer to the AI Fitness App.
Thank you for all of your hard work, Danylo!