Introduction
Many developers consider writing documentation a poor use of their time. They recognize the importance of well-documented code, yet rarely commit much time to it. Instead, programmers argue, “Oh, come on, good code documents itself,” and move on to more crucial work. With this mindset, the codebase grows while only a few people actually understand how it works, because the code is barely documented. When new developers join the project and look at the code, they are left bewildered. A good comment brings clarity, knowledge, and context to the code.
Recent developments in artificial intelligence and machine learning open new opportunities for developers. Code inference and autocomplete models can understand and even write code with little to no user involvement: GitHub Copilot generates code based on what the user wants to implement, and code summarization models return elaborate descriptions of existing code. The results are astonishing.
Here at Akvelon, we set out to help developers maintain their code and solve the documentation problem once and for all with the help of AI.
The Model
During the experiments, we tried out a few preprocessing techniques:
- Augmentations that mask function names, arguments, and constants (see the sketch after this list)
- Removing noisy samples from the dataset based on the variance of out-of-fold Levenshtein scores and on out-of-distribution estimation
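To make the first technique concrete, here is a minimal sketch of the masking augmentation, assuming a Python snippet as input. The regexes and mask tokens are illustrative, not the exact ones we used in training; masking argument names would work the same way.

```python
# A minimal sketch of the masking augmentation (illustrative tokens and regexes).
import re

def mask_snippet(code: str) -> str:
    # Mask the function name declared with "def"
    code = re.sub(r"(?<=\bdef )\w+", "<FUNC>", code)
    # Mask string literals
    code = re.sub(r"(['\"]).*?\1", "<STR>", code)
    # Mask numeric constants
    code = re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", code)
    return code

print(mask_snippet("def add_tax(price):\n    return price * 1.2  # 'VAT'"))
# def <FUNC>(price):
#     return price * <NUM>  # <STR>
```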
Preprocessing improved the scores of the separate per-language models, but the metrics of the general model remained the same. We therefore decided to train one model for all programming languages: it allowed for better generalization and a larger training dataset.
We used the T5-Small seq2seq architecture, which consists of encoder and decoder blocks built from self-attention, layer normalization, and feed-forward layers. It takes a sequence of tokens as input and outputs a sequence of tokens.
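As a rough sketch of what such a seq2seq model looks like in practice, the snippet below loads the generic `t5-small` checkpoint from Hugging Face `transformers` and asks it to summarize a code snippet. Our production model is a fine-tuned variant, so the prompt format and generation settings here are assumptions.

```python
# A minimal sketch of seq2seq summarization with a T5-Small checkpoint.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer("summarize: " + snippet, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=48, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```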
The main idea is that a self-attention layer aggregates more information than RNN layers: it lets the model see the entire input sequence at once rather than one token at a time (as LSTM and GRU layers do).
In self-attention, we compute the dot product of the Query matrix and the transposed Key matrix, divide it by the square root of the dimension of the key vectors, and take the softmax of the result. Finally, we multiply this attention matrix by the Value matrix to get the output.
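The same computation takes only a few lines of NumPy; this is a worked sketch with arbitrary shapes chosen for illustration.

```python
# Scaled dot-product attention as described above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot product of Query and transposed Key
    weights = softmax(scores, axis=-1)   # softmax over the keys
    return weights @ V                   # weighted sum of the Values

Q = np.random.randn(5, 64)  # 5 query tokens, key dimension 64
K = np.random.randn(5, 64)
V = np.random.randn(5, 64)
print(attention(Q, K, V).shape)  # (5, 64)
```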
We trained T5-Small (60M parameters) first and chose Adafactor, a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. We combined a moderate batch size of 16 with 256 gradient accumulation steps, which gave us a virtual batch size of 4,096 and kept training stable. After around 100 epochs of training, we were able to incorporate the model into the day-to-day programming experience.
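Here is a minimal sketch of that training setup, assuming PyTorch and the Adafactor implementation from `transformers`. The dummy data and every hyperparameter other than the batch size of 16 and the 256 accumulation steps are illustrative.

```python
# A minimal sketch of Adafactor training with gradient accumulation.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = Adafactor(model.parameters(), lr=None, scale_parameter=True,
                      relative_step=True, warmup_init=True)

BATCH_SIZE, ACCUMULATION_STEPS = 16, 256  # 16 * 256 = virtual batch of 4,096

# Dummy tokenized data standing in for the real (code, comment) pairs.
dummy = TensorDataset(torch.randint(0, 32000, (64, 128)),  # input_ids
                      torch.randint(0, 32000, (64, 32)))   # labels
loader = DataLoader(dummy, batch_size=BATCH_SIZE)

optimizer.zero_grad()
for step, (input_ids, labels) in enumerate(loader):
    loss = model(input_ids=input_ids, labels=labels).loss / ACCUMULATION_STEPS
    loss.backward()
    if (step + 1) % ACCUMULATION_STEPS == 0:  # step only on the virtual batch
        optimizer.step()
        optimizer.zero_grad()
```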
Auto Comment
And here it is – Auto Comment, a Visual Studio Code extension that uses AI to generate comments for your code!
Well-documented code is just a few clicks away! Select the snippet you want a comment for, press Shift-Ctrl-/, and the comment will be generated.
More examples of Auto Comment in action:
- Python: the fetch_news() function gets news headlines from the BBC News Service and prints them out
- TypeScript: between() returns a new date that lies between the “from” and “to” dates
- Java: getPersonById() fetches and returns a user with a particular id
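For illustration, here is roughly what the Python example could look like, with a comment in the style Auto Comment generates sitting right above the function. The function body and feed URL are our own reconstruction and only illustrate the idea.

```python
# An illustrative reconstruction of the Python example; the feed URL is an assumption.
import urllib.request
import xml.etree.ElementTree as ET

# Gets news headlines from the BBC News Service and prints them out.
def fetch_news():
    url = "https://feeds.bbci.co.uk/news/rss.xml"
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())
    for item in root.iter("item"):
        print(item.findtext("title"))

fetch_news()
```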
How does it work?
Magic, isn’t it? The dark sorcery is performed by the Auto Comment Visual Studio Code extension and the model’s backend. Now let us show you how it all works under the hood:
The Visual Studio Code extension talks to a server that hosts the code-processing model. The extension sends the selected code snippet to the backend, where it is preprocessed and prepared for the model. The model generates the most relevant comment, the server sends it back to the extension, and, finally, the extension pastes the comment right above the selected snippet.
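To give a feel for this request/response flow, here is a minimal backend sketch assuming a Flask service. The /comment route, the payload fields, and the generate_comment() helper are hypothetical and stand in for Akvelon's actual preprocessing and model call.

```python
# A hypothetical backend sketch of the snippet-in, comment-out flow.
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_comment(snippet: str, language: str) -> str:
    # Placeholder for the preprocessing and seq2seq model call described above.
    return "TODO: model-generated comment"

@app.route("/comment", methods=["POST"])
def comment():
    payload = request.get_json()
    text = generate_comment(payload["snippet"], payload.get("language", "python"))
    # The extension receives this response and pastes the comment above the snippet.
    return jsonify({"comment": text})

if __name__ == "__main__":
    app.run(port=8000)
```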
Currently, the extension supports the following list of languages:
- JavaScript
- Python
- TypeScript
- Java
- C#
- PHP
- Go
- Ruby
Don’t see your favorite language? Worry not! We’re working hard on adding more languages to the service. Let us know which languages you would like to see in Auto Comment here!
Data policy
Here we would like to point out that our service does not store any data the extension sends over. We use the snippets solely to provide you with comments.
Conclusion
Well-documented, clear code is an important part of any successful project. Harnessing the latest advancements in artificial intelligence, we were able to help developers worldwide maintain their code and bring clarity to it with automated, insightful comments!
Want to try it for yourself? Our Auto Comment is available here:
Auto Comment – Visual Studio Marketplace
We are constantly improving the extension and appreciate your feedback! Please leave your comments, questions, and language suggestions here.
This project was completed by Ilya Polishchuk, Team Lead; Lev Mizgirev, Software Developer; Yury Bolkonsky, ML Developer; Maxim Kuznetsov, Dev Lead; and others.