
How to Optimize Your Machine Learning Process with Data Version Control (DVC)

What is Data Version Control (DVC)?

DVC is a version control system for data that covers model storage, experiment tracking, and metrics reporting. It can also be used to track changes in workflows and store them in a centralized repository. DVC allows users to track changes in their data and keep those versions in their own data storage. It also gives users the ability to automate their workflows and ensure data security.

At Akvelon, we use DVC to manage our data and ensure the success of our projects. We have benefited from using it to track changes in our data, store it in a centralized repository, and automate our workflows.

Benefits of Using DVC

Many companies are turning to DVC to take advantage of benefits like these:

  • DVC enables users to accurately synchronize their dataset and code versions, allowing for parallel work with different dataset versions and machine learning approaches
  • DVC helps to automate or reduce data processing steps
  • DVC provides experiment tracking and easy comparison of metrics
  • Its straightforward implementation makes DVC easy to use in CI/CD pipelines

Akvelon’s Guide for Using DVC

Akvelon has developed a comprehensive guide for using DVC. Our guide covers everything from setting up a centralized repository to automating workflows and ensuring data security. To begin working with DVC, you first need a repository and storage. Here are some of the key steps in our guide:

  1. Set up centralized storage for your data. This will enable you to easily track changes in your data and keep every version of it in storage you control.
  2. Once you have set up centralized storage, you need a Git repository to begin versioning your data. This will allow you to easily track changes in your data, store it in different versions, and share it between team members. The following command will initialize DVC in your repository:
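A minimal sketch of this step, assuming the project directory is (or is about to become) a Git repository:

    git init        # only needed if the directory is not yet a Git repository
    dvc init        # creates the .dvc/ directory with DVC's configuration
    git commit -m "Initialize DVC"
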
  3. The next step is to configure DVC with the storage you use; DVC will refer to it as "mystorage" through the endpoint URL. If you use S3, you need to define a bucket name, as shown in our example. Use the "--local" option for the credentials; it keeps them out of the repository and prevents leaking access to your storage and data. Alternatively, you can provide them as CI/CD variables. All you need to configure is the "endpointurl", "access_key_id", and "secret_access_key":
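A sketch of such a configuration, with a placeholder bucket name and endpoint URL (substitute your own values):

    # register an S3-compatible remote named "mystorage" and make it the default
    dvc remote add -d mystorage s3://mybucket/dvc-storage
    dvc remote modify mystorage endpointurl https://s3.example.com

    # --local writes credentials to .dvc/config.local, which is not committed to Git
    dvc remote modify --local mystorage access_key_id "$AWS_ACCESS_KEY_ID"
    dvc remote modify --local mystorage secret_access_key "$AWS_SECRET_ACCESS_KEY"
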

The main commands for working with data are "add", "push", and "pull"; they map directly onto the corresponding Git commands. Moreover, just like in Git, you can specify a file or directory, or update all files that are under tracking.
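For example (the file paths here are only illustrative):

    dvc add data/train.csv            # start tracking a file; creates data/train.csv.dvc
    git add data/train.csv.dvc data/.gitignore
    git commit -m "Track training data with DVC"

    dvc push                          # upload the tracked data to remote storage
    dvc pull                          # download the data referenced by the current revision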

  4. Next, you should establish automated workflows, which are essential for smooth and reliable development. DVC enables you to easily automate your workflows, providing fast development and quick access to results. By default, DVC collects anonymized usage analytics, so to avoid any data leaving your environment you can set:
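Specifically, the corresponding DVC configuration option is:

    dvc config core.analytics false    # turn off DVC's anonymized usage analytics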

Then, add your credentials to the repository's CI/CD variables, and don't forget to use the Masked or Protected settings for them. It is also possible to use DVC without Git for production; just be sure to initialize it with the flag:
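The flag in question is presumably "--no-scm", which lets DVC operate in a directory that is not a Git repository:

    dvc init --no-scm    # initialize DVC without Git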

DVC provides an experiment management tool. To run a basic experiment, you need to configure the dvc.yaml file:
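A minimal sketch of such a dvc.yaml with a single training stage (the script name, data paths, parameter names, and output paths are placeholders, not a real project layout):

    stages:
      train:
        cmd: python train.py            # placeholder training script
        deps:
          - train.py
          - data/train.csv
        params:
          - train.learning_rate         # read from params.yaml
        outs:
          - models/model.pkl
        metrics:
          - metrics.json:
              cache: false              # keep the metrics file in Git rather than the DVC cache
        plots:
          - plots/loss.csv:
              cache: false

The stage can then be executed with "dvc exp run" (or "dvc repro").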

After running the experiment, push updates to DVC and Git. DVC will automatically track plots, metrics, outputs, and dependencies. 
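Continuing the sketch above (the file names are the same placeholders):

    dvc push                                   # upload models, plots, and other outputs to remote storage
    git add dvc.yaml dvc.lock metrics.json
    git commit -m "Update experiment results"
    git push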

  5. Now you can easily track changes in your data, models, and metrics. DVC allows you to version datasets and use different checkpoints of models, which leads to a more transparent workflow and easily comparable results.
  6. It is also crucial to protect your data. DVC makes it easier to ensure data security by protecting data from unauthorized access.
  7. Finally, it is important for a successful project to review your experiments and metrics, which can be done directly with DVC.

There are several options for viewing metrics and experiment results:

  • "dvc metrics show --md":

| Path         | fscore  | precision | recall  | roc_auc |
|--------------|---------|-----------|---------|---------|
| metrics.json | 0.76667 | 0.76667   | 0.76667 | 0.93061 |

  • "dvc metrics diff --md" with option --all shows all metrics. Even with non-zero change, you can look through the list of experiments that have already been completed
  • "dvc exp show --md" shows the experiment’s results and params
  • "dvc plots" store your values, such as loss, metrics, etc. in json/yaml/csv file, and you can plot them and the difference between loss from different branches

Machine Learning and AI Models with DVC

At Akvelon, we have seen the tremendous benefits of using DVC for machine learning and AI models. DVC enables us to easily track changes in our data, store it in a centralized repository, synchronize the code base with models and datasets, keep information about results up to date, and automate our workflows. It also allows us to ensure data security, making sure that our data is safe from unauthorized access.

Using DVC for machine learning and AI models gives organizations the ability to track changes in their data, keep it versioned in their own storage, and ensure data security. Furthermore, it enables organizations to automate their workflow, thus providing fast development, secure data exchange, and quick access to results.

Centralized Store and Data Versioning with DVC

At Akvelon, we use DVC to set up a centralized store for our data and to version it. A centralized store allows us to easily track changes in our data and keep it in storage we control, while versioning lets us maintain different versions of the data for each project. Anyone with the appropriate access rights can find the metrics and models and reproduce the experiments.

Using DVC for centralized storage and data versioning gives organizations the ability to track changes in their data and keep it versioned in their own storage. It also allows organizations to easily automate their workflow and ensure data security.

Using DVC with CI/CD Pipelines and Report Metrics

For the production environment, it is important that all the necessary models (which are quite large in machine learning) and configuration files are loaded quickly and only once, so that you do not have to load them again if the container goes down or the connection is interrupted. This makes DVC a reliable tool for serving models.

In our case, CI/CD pipelines are arranged as follows: we have our own GitLab with configured runners in K8s, and we use S3 as data storage, so each project has its own bucket. DVC is cleanly integrated into this architecture and helps to maintain the continuity of the entire deployment process. It also helps to keep the code and model up to date.
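As an illustration only, a stripped-down job of this kind might look roughly like this (the job name, image, remote name, paths, and variable names are placeholders rather than our actual configuration):

    # .gitlab-ci.yml (fragment)
    serve-model:
      image: python:3.10
      script:
        - pip install "dvc[s3]"
        # credentials come from masked CI/CD variables
        - dvc remote modify --local mystorage access_key_id "$AWS_ACCESS_KEY_ID"
        - dvc remote modify --local mystorage secret_access_key "$AWS_SECRET_ACCESS_KEY"
        - dvc pull models/model.pkl       # fetch only the model needed for serving
        - python serve.py                 # placeholder for the serving entrypoint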

Ensuring Data Security with DVC

At Akvelon, we use DVC to integrate our data with cloud services. Integrating DVC with cloud services allows us to easily track changes in our data and store it in a centralized repository. It also allows us to automate our workflows and ensure data security. 

Basically, security depends on the settings of the repository and the storage itself; DVC works with most existing storage services, such as AWS S3, Microsoft Azure Blob Storage, Google Drive, and Google Cloud Storage. DVC also allows for checking secondary parameters, such as time zones.

All of this allows DVC, as a layer between the storage and the developer, to save the security settings configured for the project.

DVC for Project Success

Many engineers agree that DVC is quite easy to use, and it is rare for those who have previously worked with Git to have any problems with it. In our projects that use DVC, our C#, Java, and Python developers have confirmed that they had no difficulties with it.

Additionally, DVC facilitates the interaction and transfer of models and datasets, both within and between teams. It also ensures the synchronization of data and code, resulting in reproducibility and easy launching of models and continuation of experiments.

Conclusion

Many companies, ourselves included, have enjoyed the tremendous benefits of leveraging DVC. It allows users to track changes in their data, store it in a centralized repository, and automate their workflows. It also allows users to ensure data security, making sure that their data is safe from unauthorized access.

We hope this guide has helped you to understand the benefits of using DVC and the steps you can take to leverage it to ensure the success of your projects. If you have any questions about DVC or need help using it, please don’t hesitate to contact us. We’d be happy to help you get started.

For more insights on our experience with Iterative.ai products, check out this article that we recently shared.

This article was written by

Ayagoz Mussabayeva, Data Scientist at Akvelon


Denis Nosov, Project Manager at Akvelon
