Recently, Akvelon Machine Learning Engineer/Data Scientist Danylo Kosmin participated in a Kaggle competition and won a silver medal, placing in the top 3% of 4,037 teams. The subject of this code competition was “Quora Insincere Questions Classification,” which called on Kagglers to help Quora, an online conversation forum where individuals can connect with a community to ask and answer questions. Quora wanted to find a way to identify toxic content in online conversations and needed help preventing insincere questions from being posted on its site. An insincere question was defined as a statement disguised as a question that contains one or more of the following elements:
- Non-neutral tone: has an exaggerated tone or is rhetorical
- Disparaging or inflammatory: discriminates against any class of people (gender, race, religion, etc.)
- False information
- Sexual or inappropriate content
To solve this problem, Danylo chose to utilize deep neural networks, including recurrent architectures such as LSTM and GRU, together with Natural Language Processing (NLP) techniques.
Task
Participants in this competition were tasked with creating models that could identify and flag insincere questions to help Quora ensure the safety and well-being of its users. Quora already has methods in place to address this problem, but wanted to explore a more scalable approach. For this competition, the evaluation metric was the F1 score.
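For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall); it is a natural choice here because insincere questions make up only a small fraction of the data, so plain accuracy would reward a model that flags nothing at all.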
Approach
Participants utilized embeddings: mappings from discrete objects (such as words) to vectors of real numbers. External data sources were not allowed for this competition, but Quora provided several word embeddings along with the dataset that could be used in the models (a loading sketch follows the list):
- GoogleNews-vectors-negative300
- glove.840B.300d
- paragram_300_sl999
- wiki-news-300d-1M
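As an illustration, here is a minimal sketch of how one of these files (for example, glove.840B.300d) can be loaded into an embedding matrix aligned with a model vocabulary. The file path, the `word_index` mapping, and the vocabulary limit are assumptions for illustration, not details taken from Danylo’s code.

```python
import numpy as np

def load_embedding_matrix(path, word_index, embed_dim=300, max_words=120_000):
    """Build an embedding matrix for the model vocabulary from a
    pre-trained embedding file (one token plus its vector per line)."""
    vectors = {}
    with open(path, encoding="utf8", errors="ignore") as f:
        for line in f:
            parts = line.rstrip().rsplit(" ", embed_dim)
            if len(parts) == embed_dim + 1:       # skip malformed lines
                vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Rows stay zero for words missing from the pre-trained file.
    matrix = np.zeros((max_words + 1, embed_dim), dtype="float32")
    for word, idx in word_index.items():
        if idx <= max_words and word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Hypothetical usage, where word_index maps tokens to integer ids:
# embedding_matrix = load_embedding_matrix("glove.840B.300d.txt", word_index)
```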
With access to those data sources, Danylo took the steps below to create a solution.
Exploration of data
Danylo calculated metafeatures, such as the number of unique words, punctuation marks, and characters in each text, to better understand the data. Based on these statistics, he chose the number of words for the training vocabulary and selected his preprocessing methods.
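A minimal sketch of this kind of metafeature computation with pandas follows; the `question_text` column name matches the competition data, while the helper itself is illustrative.

```python
import string
import pandas as pd

def add_metafeatures(df, text_col="question_text"):
    """Attach simple statistics that help characterize each question."""
    df["num_words"] = df[text_col].str.split().str.len()
    df["num_unique_words"] = df[text_col].apply(lambda t: len(set(t.split())))
    df["num_chars"] = df[text_col].str.len()
    df["num_punctuation"] = df[text_col].apply(
        lambda t: sum(ch in string.punctuation for ch in t))
    return df

# train = add_metafeatures(pd.read_csv("train.csv"))
```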
Searching the data
He searched for the most common misspellings and typos and created a list of special characters (fig. 1) to use for data cleaning.
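What such cleaning might look like is sketched below; the special-character list and misspelling map are short illustrative stand-ins for the full lists assembled from the data.

```python
import re

# Illustrative entries only; the real lists were built by inspecting
# the most frequent typos and special characters in the dataset.
SPECIAL_CHARS = "“”‘’´`—–…\u200b"
MISSPELLINGS = {"qoura": "quora", "whta": "what", "colour": "color"}

def clean_text(text):
    """Replace special characters with spaces and fix known misspellings."""
    text = re.sub(f"[{re.escape(SPECIAL_CHARS)}]", " ", text)
    words = [MISSPELLINGS.get(w.lower(), w) for w in text.split()]
    return " ".join(words)
```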
Working with the data
All stop words were removed from the texts to investigate the most frequent insincere words and n-grams (2-grams, 3-grams, and 4-grams). After this, Danylo set 72 as the maximum number of words in a sequence and kept the 120,000 most frequent words as the vocabulary. He then built a preprocessing baseline (text cleaning, number cleaning, misspelling correction, filling NA values, etc.). He also tried different feature engineering methods and other approaches, such as stacking/blending different embeddings and creating statistical features, but they did not improve his score.
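The tokenization step under these settings might look like the sketch below; the Keras tokenizer was a common choice in competition kernels at the time, though the write-up does not name the exact tool.

```python
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 120_000  # most frequent words kept in the vocabulary
MAX_LEN = 72         # maximum number of words per sequence

train = pd.read_csv("train.csv")            # competition data (assumed path)
texts = train["question_text"].fillna("")   # fill NA values, as described above

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)
X_train = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)
```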
Baseline and Network Model
Danylo’s baseline was a bidirectional LSTM model with an embedding layer, built using PyTorch. From there, he began selecting and tuning different hyperparameters and experimenting with recurrent neural network (RNN) structures and layers. He took the following steps in his approach (a code sketch of the resulting network follows the list):
- Defined the embedding layer using the GloVe and Paragram pre-trained embeddings
- Built a bidirectional LSTM with 128 features in the hidden state, connected to an attention layer and to a bidirectional GRU with 256 features in the hidden state
- Applied average and max 1D pooling to the GRU outputs. The LSTM attention output and the GRU AvgPool and MaxPool results were concatenated and passed through a fully connected layer, followed by batch normalization and dropout, before the output linear layer
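Below is a minimal PyTorch sketch of this architecture. The layer widths (128 and 256) follow the description above; the attention implementation, the fully connected width, and the dropout rate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Simple additive attention over RNN time steps (the exact attention
    used in the solution is not specified, so this is an assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (batch, seq, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)         # (batch, dim)

class InsincereNet(nn.Module):
    def __init__(self, embedding_matrix):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float32), freeze=True)
        embed_dim = self.embedding.embedding_dim
        self.lstm = nn.LSTM(embed_dim, 128, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(256, 256, bidirectional=True, batch_first=True)
        self.attention = Attention(256)         # over the bi-LSTM output
        self.fc = nn.Linear(256 + 512 + 512, 64)
        self.bn = nn.BatchNorm1d(64)
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(64, 1)

    def forward(self, x):
        emb = self.embedding(x)                 # (batch, seq, embed_dim)
        lstm_out, _ = self.lstm(emb)            # (batch, seq, 256)
        gru_out, _ = self.gru(lstm_out)         # (batch, seq, 512)
        att = self.attention(lstm_out)          # (batch, 256)
        avg_pool = gru_out.mean(dim=1)          # (batch, 512)
        max_pool, _ = gru_out.max(dim=1)        # (batch, 512)
        h = torch.cat([att, avg_pool, max_pool], dim=1)
        h = self.dropout(self.bn(torch.relu(self.fc(h))))
        return self.out(h)                      # raw logit for BCEWithLogitsLoss
```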
He used binary cross-entropy as the loss function for the network. The final model used 4 KFold splits, with 5 epochs of training on each fold. The threshold for binarizing predictions was selected by checking the F1 score on the validation set (see the sketch below).
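A sketch of the cross-validation and threshold search is shown below; `train_model` is a hypothetical helper standing in for the 5-epoch training loop with BCE loss.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def best_threshold(y_true, y_prob):
    """Pick the binarization threshold that maximizes F1 on validation data."""
    thresholds = np.arange(0.1, 0.6, 0.01)
    scores = [f1_score(y_true, y_prob > t) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

# Hypothetical outer loop over the 4 folds:
# kf = StratifiedKFold(n_splits=4, shuffle=True)
# for train_idx, val_idx in kf.split(X_train, y_train):
#     model = InsincereNet(embedding_matrix)
#     val_prob = train_model(model, train_idx, val_idx)   # assumed helper
#     print("fold threshold:", best_threshold(y_train[val_idx], val_prob))
```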
Results
In the first stage of the competition, 4,037 teams participated, and Danylo’s model took 151st place on the public leaderboard. This qualified him to compete in the second stage against only 1,401 teams. In the second stage, his model finished in the top 3% and earned a silver medal!
Danylo is an important member of the Akvelon team and has been instrumental in the creation of multiple projects, from the Meeting Summarizer to the AI Fitness App.
Thank you for all of your hard work, Danylo!