Introduction
We recently told you about our AI-powered Code Documentation Extension. While we were developing it, the following idea occurred to us:
We have a tool that generates a comment for a selected piece of code. What if we have a comment and need to find a piece of code for it? Shall we create a tool for this case too?
So we started investigating this opportunity. In this article, we will discuss how we developed the tool and how it works.
Data Collection
Data collection is one of the most important parts of a machine learning project. In our project, this step was divided into the following parts:
- Problem definition
- Selection of suitable tools and libraries
- CLI development
- Creating a pipeline for collecting code
- Storage
- Scaling
To cut a long story short, we needed to collect as many code snippets (functions) and comments as possible, written in different programming languages. After much discussion, we decided to create a CLI utility that takes a programming language as input and returns trending repositories as output. This approach has a limitation: the trending list gives only 25 repositories per language, and some of them can be very large. We later worked around this by writing scripts that scrape curated lists such as Awesome Python, Awesome C++, and Awesome Go. We also implemented a GitHub search feature, which lets us collect repositories for different languages efficiently by filtering on several parameters.
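As an illustration of the GitHub search step, here is a minimal sketch that builds a request URL for GitHub's public repository search endpoint. The function name, the star threshold, and the choice of sorting by stars are our assumptions for the example; the article does not say which search parameters the real CLI uses.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_repo_search_url(language, min_stars=100, per_page=100, page=1):
    """Build a search URL for popular repositories written in one language.

    The filters here (language + minimum stars) are hypothetical examples
    of the "several parameters" mentioned in the text.
    """
    # GitHub search qualifiers are space-separated inside the `q` parameter.
    query = f"language:{language} stars:>={min_stars}"
    params = {
        "q": query,
        "sort": "stars",       # most starred repositories first
        "order": "desc",
        "per_page": per_page,  # the API allows at most 100 results per page
        "page": page,
    }
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_repo_search_url("python", min_stars=500))
```

Paging through the results with the `page` parameter is one way to get past the 25-repository limit of the trending list.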
As an additional validation step, we check which languages each GitHub repository contains. This is easy to do with the PyGithub library. The check also lets us extract more data from each repository, because a repository often contains several languages (for example, performance-critical Python extensions are often written in C or C++).
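The language check can be sketched as a small filter over the language breakdown that GitHub reports for a repository, which is a mapping from language name to size in bytes (PyGithub exposes it through `Repository.get_languages()`). The function name and the 5% threshold below are hypothetical and only illustrate the idea.

```python
def languages_to_parse(language_bytes, supported, min_share=0.05):
    """Pick the supported languages that make up a meaningful share of a repo.

    language_bytes: dict such as {"Python": 80000, "C": 15000}
    supported:      set of languages the parser has grammars for
    min_share:      hypothetical cut-off; languages below it are ignored
    """
    total = sum(language_bytes.values())
    if total == 0:
        return []
    picked = [
        (lang, size)
        for lang, size in language_bytes.items()
        if lang in supported and size / total >= min_share
    ]
    # Largest language first, so the main language is parsed first.
    picked.sort(key=lambda item: item[1], reverse=True)
    return [lang for lang, _ in picked]


if __name__ == "__main__":
    breakdown = {"Python": 80000, "C": 15000, "C++": 4500, "Shell": 500}
    print(languages_to_parse(breakdown, {"Python", "C", "C++"}))
```

With a filter like this, a repository found through a Python search can also contribute C functions to the dataset.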
To parse the source code, we used Tree-Sitter. This library is written in Rust and C and is very popular for this kind of task. As a bonus, it also supports syntax highlighting. One of its best-known users is the Neovim editor. Support for a new language is added through a grammar, which is distributed as a separate repository or library.
How it works
First, we collected repository URLs from different sources, aggregated them by language, and validated them where possible. Then we took the resulting CSV files, or the output of the other CLI subprogram that collects repositories, and ran the main parsing program. The main parsing program downloaded each repository to a dedicated folder. After that, Tree-Sitter ran on every file whose extension matched the programming language. Each “code_snippet – comment” pair that was found was added to a TSV file.
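The parse-and-store step can be sketched in a few lines. Note the substitution: to keep the example dependency-free, it uses Python's built-in `ast` module instead of Tree-Sitter, so it handles Python files only and treats docstrings as the comments. The function names and the two-column TSV layout are assumptions, not the actual pipeline.

```python
import ast
import csv
from pathlib import Path


def extract_pairs(source):
    """Return (code_snippet, comment) pairs for every documented function."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            comment = ast.get_docstring(node)
            # Note: the snippet text still includes the docstring itself.
            snippet = ast.get_source_segment(source, node)
            if comment and snippet:
                pairs.append((snippet, comment))
    return pairs


def collect_repository(repo_dir, out_path, extension=".py"):
    """Walk a downloaded repository and append its pairs to a TSV file."""
    written = 0
    with open(out_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for path in sorted(Path(repo_dir).rglob(f"*{extension}")):
            if not path.is_file():
                continue
            try:
                pairs = extract_pairs(path.read_text(encoding="utf-8"))
            except (SyntaxError, ValueError):
                continue  # skip files that cannot be decoded or parsed
            for snippet, comment in pairs:
                writer.writerow([snippet, comment])
                written += 1
    return written
```

Calling `collect_repository("path/to/repo", "pairs.tsv")` appends one row per documented function and returns the number of rows written. Multi-line snippets are quoted by the `csv` writer, so each pair stays in a single logical row.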
We deployed this entire setup on a virtual machine. To be clear, what we built was strictly an alpha version and was not designed to scale, because at the beginning we only needed to collect a couple of hundred thousand records for each language. Even so, the results turned out to be more than good enough for us with this deployment. In the future, we plan to move away from files and implement a more suitable, scalable architecture.
The Model
General description
Currently, we are working on an end-to-end model that takes a code – comment pair as input and outputs a similarity score for that pair. At inference time, we use the model to compute embeddings for all code examples in the knowledge base and save them. When a new comment arrives, we compute its embedding and then calculate the cosine similarity between that embedding and every entry in the code index (all saved embeddings). This lets us find the code closest to the new comment.
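The lookup against the code index can be sketched as follows. The embeddings are tiny hand-made vectors so that the example runs without a model; in practice they would come from the trained encoder. The function names and the `(snippet, embedding)` index layout are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_a * norm_b)


def search(query_embedding, code_index, top_k=3):
    """Rank saved code embeddings by similarity to the query embedding.

    code_index: list of (snippet, embedding) tuples computed ahead of time.
    Returns up to top_k (score, snippet) tuples, best match first.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), snippet)
        for snippet, embedding in code_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" for illustration only.
    index = [
        ("readFile(path)", [1.0, 0.0, 0.0]),
        ("sortList(items)", [0.0, 1.0, 0.0]),
        ("httpGet(url)", [0.7, 0.7, 0.0]),
    ]
    for score, snippet in search([0.9, 0.1, 0.0], index, top_k=2):
        print(f"{score:.3f}  {snippet}")
```

Because the code embeddings are computed once and saved, only the comment has to be encoded at query time. For a large index, a vectorized or approximate nearest-neighbor search would likely replace the linear scan shown here.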
Training the model
- Generating code – comment – similarity triples
- Negative examples (code – comment pairs with zero or close to zero similarity) are added automatically during training. We also experimented with other negative sampling techniques, such as k-NN search and random selection, but they produced worse results
- As the backbone model, we use RoBERTa Large with either Ranking Loss or Cosine Similarity Loss. Training with ranking loss takes more time, but it ultimately gives better results
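To make the two objectives more concrete, here is a simplified sketch in plain Python. It rests on assumptions: the article does not give the exact loss formulas, so we show one common reading in which the cosine similarity loss is a squared error against the target score, and the ranking loss is a softmax cross-entropy where every other code snippet in the batch acts as a negative. The function names and the `scale` value are hypothetical.

```python
import math


def cosine_similarity_loss(predicted_similarity, target_similarity):
    """Squared error between the predicted cosine score and the label."""
    return (predicted_similarity - target_similarity) ** 2


def in_batch_ranking_loss(similarity_matrix, scale=20.0):
    """Cross-entropy over each row of a comment-by-code similarity matrix.

    similarity_matrix[i][j] is the cosine similarity between comment i and
    code snippet j. The diagonal holds the true pairs; all other entries in
    a row serve as negatives (assumed setup, not confirmed by the article).
    """
    losses = []
    for i, row in enumerate(similarity_matrix):
        logits = [scale * s for s in row]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    well_ranked = [[0.9, 0.1], [0.2, 0.8]]
    badly_ranked = [[0.1, 0.9], [0.8, 0.2]]
    print(in_batch_ranking_loss(well_ranked))   # small loss
    print(in_batch_ranking_loss(badly_ranked))  # much larger loss
```

Under this reading, the ranking loss compares each comment against every code snippet in the batch rather than scoring one pair at a time.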
Code Search
The workflow is as follows: you enter your comment as a search query and receive a list of code snippets, ordered from the most relevant to the least relevant.
For now, Code Search works with Java.
How it works
The search query you enter is sent to the backend, where it is preprocessed and prepared for the model. The model then returns the most relevant code snippets to the server, and finally they are displayed to the user.
Conclusion
Machine learning can simplify software development in many different ways. The latest achievements in ML can help developers not only with creating comments, but also with searching for the necessary code snippets. Our code search tool has proven this in practice.
Try our AI-powered code searching tool here.
This article was written by Sofya Gruzdeva,
Project Manager at Akvelon