PHI Data De-Identification

Our client is a technology start-up founded by UPMC, one of the largest healthcare providers in the US, to examine de-identified PHI for analytics. They provide services dedicated to removing patients' PHI from different medical data sources under HIPAA compliance rules.


Medical Documents Recognition

Team size

9 members

Duration

6 months

Business Need

As a startup, our client lacked the in-house technical and engineering expertise to translate their ideas into reality. They aimed to build the first version of the framework as early as possible, as they intended to enter a niche market with a unique product.

There were no ready-made solutions for de-identifying patients' PHI across multiple data types using modern technologies, so the solution had to be built from scratch.

At the beginning of the project, our client had a basic idea of the results they expected and general business requirements, but no technical description of the project and no codebase.

The main challenge was to be first to market with a commercial-grade product. This involved creating a technical plan, a solution architecture design, and a development roadmap under tight deadlines.

They needed a team of experienced developers able to create a competitive product in full compliance with business needs. The main requirements for development were:

  • The ability to de-identify PHI patient data from CSV, DICOM, and .txt files.
  • The ability to process large volumes of data with high performance (1 GB of CSV data per hour).
  • Flexible scalability to meet future customer requirements.
  • A serverless architecture.
  • The ability to train existing ML models in production using real patient data.

It was also necessary to support product delivery pipelines via CI/CD, including building new framework versions, deploying environments, and running automated tests.


The Akvelon team started with business requirement analysis and used it to compose a roadmap of all the features needed for the MVP. Together with the client, we wrote epics and split them into user stories, followed by planning and estimation.

The Akvelon team was responsible for the delivery process, with our manager driving development, including all Scrum and Agile ceremonies.

As our team was tasked with building a SaaS product, AWS was chosen as the core platform because it fully met the security, scalability, and serverless requirements. Spark was chosen as the main data processing engine. To find PHI, the team used the spaCy NLP library, which ships with pre-trained ML models and supports retraining existing models and adding new ones.
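The entity-based approach can be sketched as follows: a spaCy pipeline tags entities in the text, and the matching spans are replaced with placeholders. This is a minimal illustration only; the label set, placeholder format, and helper names below are assumptions, not the client's actual implementation.

```python
# Assumed subset of HIPAA identifier categories, expressed as spaCy NER labels.
PHI_LABELS = {"PERSON", "GPE", "DATE"}

def mask_spans(text, spans):
    """Replace (start_char, end_char, label) spans with [LABEL] placeholders.
    Spans are applied right-to-left so earlier character offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

def redact(text, nlp, labels=PHI_LABELS):
    """Run a loaded spaCy pipeline over the text and mask matching entities."""
    doc = nlp(text)
    spans = [(e.start_char, e.end_char, e.label_)
             for e in doc.ents if e.label_ in labels]
    return mask_spans(text, spans)

# Usage (requires a spaCy model, e.g. `python -m spacy download en_core_web_sm`):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# redact("John Smith visited Pittsburgh on March 3, 2021.", nlp)
```

Masking right-to-left is the key detail: replacing spans from the end of the string keeps the earlier offsets reported by the model valid.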

Several pipelines were developed in AWS Batch for processing the various data types:

  • The Akvelon team built a CSV data processing pipeline that removes PHI from any CSV file. This makes it possible to work with any database exported to CSV: the pipeline sanitizes the data by removing all personal information.
  • The .txt pipeline handles various documents and notes (e.g., doctor notes and free text). Processed TXT files are placed into a separate folder with their PHI anonymized.
  • DICOM files, which store patient scans, were processed similarly to .txt files: the pipeline checked and removed PHI from metadata descriptions, while a computer vision module identified and de-identified PHI embedded in the images themselves.
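The core transformation of the CSV pipeline can be sketched as below. This is an illustrative single-machine version using Python's csv module; in the actual solution the equivalent logic ran on Spark for throughput, and the column names and regex rule here are assumptions for demonstration.

```python
import csv
import io
import re

# Hypothetical direct-identifier columns to drop entirely, plus an SSN-like
# pattern scrubbed from the remaining free-text cells.
DROP_COLUMNS = {"name", "ssn", "phone"}
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def sanitize_csv(src):
    """Return a copy of the CSV text with identifier columns removed
    and SSN-like values masked in the columns that remain."""
    reader = csv.DictReader(io.StringIO(src))
    kept = [c for c in reader.fieldnames if c.lower() not in DROP_COLUMNS]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept)
    writer.writeheader()
    for row in reader:
        writer.writerow({c: SSN_RE.sub("[SSN]", row[c]) for c in kept})
    return out.getvalue()
```

The same two-step shape (drop known identifier columns, then pattern-scrub the rest) maps naturally onto Spark column selection and UDFs when scaled out.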

We built a separate NLP module based on spaCy, combining standard pre-trained models with custom ML models trained on an open Harvard medical dataset, to identify PHI in text. This module was used for processing all data types.

To make the system easy to use for data analysts and honest brokers, the Akvelon team created a desktop application (Windows, macOS) that runs data processing without requiring access to the AWS Console.

To make it easy to deploy, destroy, and modify the AWS infrastructure, the team developed a CLI designed and structured similarly to the AWS CLI.
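An AWS-CLI-style tool of this kind is typically organized as a command with subcommands. The sketch below shows that shape with argparse; the tool name, the `--env` flag, and everything beyond the deploy/destroy/modify verbs are assumptions.

```python
import argparse

def build_parser():
    """Build an AWS-CLI-style command/subcommand parser for infra management."""
    parser = argparse.ArgumentParser(
        prog="deid-infra",  # hypothetical tool name
        description="Deploy, destroy, or modify the de-identification AWS stack.",
    )
    sub = parser.add_subparsers(dest="command", required=True)
    for cmd in ("deploy", "destroy", "modify"):
        p = sub.add_parser(cmd, help=f"{cmd} the AWS infrastructure")
        p.add_argument("--env", default="dev",
                       help="target environment (hypothetical flag)")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.env)
```

With this layout, invocations read like `deid-infra deploy --env prod`, mirroring the `aws <service> <action>` structure users already know.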

Licensing services built by our team prevent access to the system by users without the required permissions or license.

Business Impact

Within 6 months, the Akvelon team built from scratch, documented, and delivered a ready-to-use system that fully met the client's requirements. The software de-identifies major PHI entities such as names, addresses, SSNs, and telephone numbers with a high level of accuracy. Together with the client, we created a product backlog and defined further product development strategies.

Technology Used

Cloud: AWS Batch, AWS ECR, AWS S3, Aurora, AWS Lambda, API Gateway
Data processing: Spark, PySpark, Python, Scala, spaCy
CI/CD: Terraform, GitHub, GitHub Actions

Let’s discuss your idea!