Clustering Java Code

In today's rapidly evolving software landscape, engineers need to keep up with the latest technologies and programming languages. Java, a versatile and widely used language, is a top choice for many professionals.

In this article, we explore Akvelon's extensive expertise in ML-related projects and provide insights for engineers looking to learn Java.

We collected a large number of Java code snippets from GitHub repositories and clustered them. Our approach uses machine learning techniques such as BERT embeddings and BERTopic for cluster analysis.

We also demonstrate how these techniques can be applied to source code analysis with UniXcoder.

This comprehensive guide aims to equip engineers with the knowledge to enhance their understanding of Java and inspire them to create innovative projects leveraging the power of this versatile programming language.

GitHub stores all kinds of code, from large-scale business solutions to beginners’ projects.

What would happen if we downloaded a ton of random Java code from GitHub and then tried to categorize it? This would essentially let us look at the entire Java world from a bird’s-eye view.

Let’s find out!

The Data

We downloaded 16 GB of Java code from GitHub, then divided it into code snippets of two kinds:

Method snippets: Java classes were parsed (using the tree-sitter library) to extract methods. We then filtered out duplicates and methods shorter than 5 lines
Class snippets: Each one contains an entire Java file
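
The filtering step for method snippets can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline code: it assumes the methods have already been extracted as strings, that "too short" means fewer than 5 non-blank lines, and that duplicates are detected by comparing whitespace-normalized text. The function name filter_methods is hypothetical.

```python
import hashlib


def filter_methods(methods, min_lines=5):
    """Drop methods that are too short or that duplicate an earlier method."""
    seen = set()
    kept = []
    for source in methods:
        # Count only non-blank lines (an assumption about the length rule).
        non_blank = [line for line in source.splitlines() if line.strip()]
        if len(non_blank) < min_lines:
            continue
        # Collapse all whitespace so that re-indented copies hash identically.
        normalized = " ".join(source.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(source)
    return kept
```

Hashing the normalized text keeps memory usage modest even for hundreds of thousands of methods, because only the digests are stored for comparison.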

We are going to cluster both kinds of snippets separately. This will yield different results: method clusters are going to be more specific, whereas class clusters should cover entire technological domains.

BERTopic

BERTopic is a Python clustering library that will help us to analyze our data.

It takes a list of string documents and calculates the corresponding embeddings using a BERT model (if custom embeddings are not provided). It then reduces the dimensionality of the embeddings with UMAP and performs clustering using the HDBSCAN algorithm. The most important feature of HDBSCAN is that it’s hierarchical: it also produces a binary tree of the resulting clusters.
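
A minimal sketch of this pipeline follows, assuming the bertopic, umap-learn, and hdbscan packages are installed. The UMAP and HDBSCAN settings mirror BERTopic's documented defaults. The function name cluster_documents and its parameters are our own illustration, not the exact configuration used in this study.

```python
def cluster_documents(docs, embeddings=None, min_cluster_size=10):
    """Fit BERTopic on `docs`.

    `embeddings` is an optional NumPy array with one row per document.
    """
    # Third-party imports live inside the function so that this module can be
    # imported even where the packages are not installed.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # Step 1: reduce the dimensionality of the embeddings.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
    # Step 2: hierarchical density-based clustering of the reduced vectors.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    # When `embeddings` is None, BERTopic computes them with its default
    # embedding model.
    topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)
    return topic_model, topics
```

After fitting, topic_model.get_topic_info() lists the clusters with their sizes and names, topic_model.get_topic(0) returns the top keywords of cluster 0, and topic_model.visualize_hierarchy() draws the hierarchy plot.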

With BERTopic, we can easily get valuable insights on clustering results such as:

The list of clusters sorted by size. Each cluster is named by joining its top keywords
A cluster hierarchy plot
The top keywords associated with each cluster, which are the words in the cluster's documents with the highest TF-IDF scores (BERTopic uses a class-based variant, c-TF-IDF, so these are words that occur in the cluster's documents far more often than in the rest of the corpus)
Some representative documents of each cluster
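
To make the keyword ranking concrete, here is a simplified, standard-library sketch of class-based TF-IDF scoring. It assumes each cluster is a list of whitespace-tokenized documents and scores a term by its relative frequency within the cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. BERTopic's real implementation differs in details such as vectorization and normalization, so treat top_keywords (a hypothetical name) as an illustration only.

```python
import math
from collections import Counter


def top_keywords(clusters, top_n=3):
    """Map each cluster id to its `top_n` highest-scoring terms.

    `clusters` maps a cluster id to a non-empty list of documents, each a
    string of whitespace-separated words.
    """
    term_counts = {
        cid: Counter(word for doc in docs for word in doc.split())
        for cid, docs in clusters.items()
    }
    # Term frequencies across all clusters.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(term_counts)

    result = {}
    for cid, counts in term_counts.items():
        cluster_len = sum(counts.values())
        scores = {
            term: (count / cluster_len) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        result[cid] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return result
```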

UniXcoder

The documents we’re going to pass to BERTopic are code snippets. There are two options: either let BERTopic calculate the embeddings itself (using a BERT model), or pass in our own custom embeddings. Since we are dealing with source code rather than natural language, it is better to use an encoding model designed specifically for code and pass its embeddings to BERTopic.

One of the latest such models is UniXcoder, released in July 2022. The most important feature of UniXcoder is that it first builds the syntax tree of the input code and then flattens the tree into a token sequence. All the details can be found in this paper.

In our research, we only use UniXcoder's encoding capabilities (encoding a code snippet into an embedding). However, you can do some other cool stuff with it too. For instance:

It can handle code comments as well. If you encode a code snippet and its corresponding comment, the two embeddings should be somewhat similar
In decoder mode, UniXcoder can autocomplete source code. However, please keep in mind that the model itself is quite small, so don’t expect miracles from it
In encoder-decoder mode, the model can perform mask prediction (the predicted span can even be more than one token long)
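
As a rough sketch of the encoding step, the snippet below mean-pools the last hidden states of the publicly available microsoft/unixcoder-base checkpoint, loaded through the Hugging Face transformers library. This is a simplification under stated assumptions: the official UniXcoder repository ships its own wrapper class that also prepends a mode token, and the names embed_snippets and cosine_similarity are ours. The cosine similarity helper is what you would use to compare the embedding of a method with the embedding of its comment.

```python
import math


def embed_snippets(snippets, model_name="microsoft/unixcoder-base", max_length=512):
    """Return one embedding (a list of floats) per input string."""
    # Heavy third-party imports are deferred until the function is called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        snippets,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)
    # Average the token vectors, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The list returned by embed_snippets can be converted to a NumPy array and passed to BERTopic as custom embeddings.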

Documents Preprocessing

BERTopic is designed to work with natural language documents, whereas source code has a different structure that can break BERTopic's keyword extraction. Before passing documents into BERTopic, we turned them into sequences of English words (our tokenization algorithm also splits camelCase identifiers into separate words).

For example, the following method:

    public ArrayList<Player> AddPlayer(Player p){
        this.players.add(p);
        return this.players;
    }

turns into the sequence “public array list player add player player p this players add p return this players”
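
One way to implement this conversion is a single regular expression that extracts alphabetic runs and splits them at camelCase boundaries. The sketch below is our reconstruction from the example above, not the original tokenizer. The name code_to_words is hypothetical, and digits and punctuation are simply dropped.

```python
import re

# Either a run of capitals not followed by a lowercase letter (an acronym such
# as "HTML"), or an optional capital followed by lowercase letters ("Player").
WORD_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def code_to_words(code):
    """Turn source code into a lowercase, space-separated word sequence."""
    return " ".join(word.lower() for word in WORD_PATTERN.findall(code))
```

Because the pattern treats a run of capitals as one word, an identifier such as HTMLParser becomes "html parser" rather than a sequence of single letters.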

Some vocabulary is very typical of code documents but not that frequent in English overall: words like public, return, and string. To keep these words from dominating the keyword lists, we tokenized all the code snippets and removed the words that appear in the top-100 frequency list.
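
A minimal sketch of this filtering is shown here, assuming the frequency list is computed over the whole tokenized corpus (the text does not say which list was used) and that documents are whitespace-separated word sequences. The function name remove_frequent_words is illustrative.

```python
from collections import Counter


def remove_frequent_words(docs, top_n=100):
    """Remove the `top_n` most frequent corpus words from every document."""
    counts = Counter(word for doc in docs for word in doc.split())
    frequent = {word for word, _ in counts.most_common(top_n)}
    return [
        " ".join(word for word in doc.split() if word not in frequent)
        for doc in docs
    ]
```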

Method Clustering Results

We have collected 600,000 Java methods to cluster.

There are 2 ways to control the number of clusters in BERTopic:

Directly specifying the nr_topics argument
Setting the min_topic_size argument. We use this option, with min_topic_size = 550

Almost half of all the methods (291,597) were identified as “outliers” (not belonging to any cluster).

The chart below shows the cluster size distribution. We expected it to follow Zipf's law (cluster size roughly inversely proportional to rank), but in reality it diverges from it quite a lot.
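
The outlier share and the size distribution can both be derived from the topic assignments. The sketch below assumes the list of topic ids returned by BERTopic, where -1 marks outliers, and compares the ranked cluster sizes with an idealized Zipf curve in which the cluster at rank r has size s1 / r (s1 being the largest cluster). The helper names are hypothetical.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents that were not assigned to any cluster."""
    return sum(1 for topic in topics if topic == -1) / len(topics)


def ranked_cluster_sizes(topics):
    """Cluster sizes in descending order, with outliers excluded."""
    counts = Counter(topic for topic in topics if topic != -1)
    return sorted(counts.values(), reverse=True)


def zipf_expected(sizes):
    """Sizes an ideal Zipf distribution would give for the same largest cluster."""
    return [sizes[0] / rank for rank in range(1, len(sizes) + 1)]
```

For the method clustering, the outlier share is 291,597 of 600,000, or about 0.49. Plotting ranked_cluster_sizes against zipf_expected on a log-log scale makes any divergence easy to see.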

With BERTopic, we can visualize statistics on the top keywords in the clusters.

Let’s see what we can find in these clusters:

Cluster 0 has methods that use classes from the org.apache.ignite package
Cluster 1 is related to graphics for Android apps
Cluster 2 contains unit tests
Cluster 3 includes code for Minecraft mod development
Cluster 4 has methods for working with string values and byte-by-byte shift
Cluster 5 is a network cluster. It contains methods that use TCP/UDP protocols
Cluster 6 contains component tests and integration tests
Cluster 7 contains verifications that use classes from org.compiere.model.PO. Another interesting detail about this cluster is that its code is written in a C++ style
Cluster 8 is again related to Android development. Here, you can find code that handles events of the ActionEvent type
Cluster 9 is the Spanish code related to HTML. The code here aims to restore the document and write it to the database
Cluster 10 is again related to Android, but here the code uses the com.android.dx.rop package

The general conclusion is that most of the clusters include methods that are used for solving some very specific sorts of tasks. Oftentimes, understanding the meaning of the cluster requires some special knowledge.

Some other interesting facts:

The fourth-largest cluster (Cluster 3) often includes Minecraft modding code
There are “language” clusters with identifiers/comments written predominantly in languages other than English:
- Cluster 9 is clearly Portuguese
- Cluster 43 is quite interesting, as it mixes Indonesian/Malay, Polish, and Dutch
There is some machine learning code written in Java (Cluster 22). However, it is quite rare, so it does not appear near the top of the list

The hierarchy plot is also provided by the BERTopic library. It arranges all the clusters into a binary tree with the clusters as leaves: the more similar two clusters are, the smaller the distance to their lowest common ancestor.

Class Clustering Results

Let’s now switch to clustering entire Java classes.

We have collected about 86,000 classes and set the minimum cluster size to be 100, resulting in 65 clusters with almost 30,000 classes falling into the outliers category.

Cluster sizes somewhat follow Zipf’s law.

Let’s now make sense of some clusters that we got:

The two largest clusters (0 and 1) contain classes that have a license header. The algorithm somehow separates those sets of classes though: the first cluster clearly has more code that is “Licensed to the Apache Software Foundation”
Cluster 2 is the Android development cluster
Cluster 3 contains Minecraft modding; it’s a bit funny that it ranks that high on the list
Cluster 4 mostly contains business logic
Cluster 5 groups together UI classes that make use of Swing components
Cluster 6 is the “cracking the coding interview” cluster. It contains solutions to algorithmic problems. Such code is usually written under time pressure, which is why some common “I have no time to think about it” variable names appear in the keyword list
Cluster 7 is the Spanish cluster. It groups classes written by Spanish speakers who chose to name code entities in their native language

Some thoughts based on the clusters hierarchical structure:

The Spanish cluster is really close to the Swing cluster. Perhaps it means that Swing is especially popular among Spanish-speaking developers. Maybe Swing is widely used in programming classes in Spanish-speaking countries, which would also explain the tendency of those developers to name identifiers in their native language instead of in English
The closest cluster to the JavaRush one is the SQL cluster. Perhaps it means that SQL takes up a large portion of JavaRush classes
Clusters 6, 32, 63, 3, 22, and 45 (all quite close to each other) represent the “very mathy block”: the block involves a lot of geometry, graph theory, and NLP. Funnily enough, it also contains the Minecraft cluster
The top of the plot (clusters 20-14) is quite distant from the rest of the clusters. However, there are some really big and important clusters within this group: two license clusters, the Android cluster, two Eclipse clusters, and one big business logic cluster

Conclusion

In conclusion, applying BERTopic and UniXcoder to Java code on GitHub has provided valuable insights into the landscape of Java development. The clustering results have uncovered the diverse range of applications, technologies, and coding practices within the Java community. By sharing this information publicly, we hope to achieve several benefits, including:

Enhancing the knowledge of the developer community: Sharing our approach and findings contributes to the broader developer community's knowledge and inspires others to experiment with similar techniques in their work. This can lead to further innovations and improvements in the field.
Demonstrating the capabilities of advanced clustering techniques for code analysis: By demonstrating our ability to analyze massive code repositories and extract meaningful insights, we can showcase our expertise as software development engineers. This can help build trust and credibility with potential clients, who may be more inclined to work with us, knowing that we use cutting-edge techniques to provide better solutions.
Identifying correlations between different domains or the popularity of certain technologies within specific groups of developers: The patterns uncovered in this study, such as the proximity of the Spanish and Swing clusters, point to such correlations and can guide further work that benefits the entire Java community.

Overall, this research demonstrates the potential of using advanced clustering techniques for code analysis and contributes expertise to the broader developer community. By sharing our approach and findings, we hope to inspire others to experiment with similar techniques and foster collaboration and innovation in the industry.

This article was written by
