Case Studies

De-Identification of Patient Data


How De-Identification of Patient Data Works

Healthcare data de-identification is handled by a specialized NLP module that combines spaCy’s standard pre-trained models with custom ML models. These models are refined on a Harvard medical dataset and supplemented with open medical data so that they can identify PHI in text and support comprehensive processing across all data types.

To make the system more accessible, Akvelon’s team developed a desktop application for Windows and macOS that lets data analysts and honest brokers process data without navigating the AWS Console. We also created a CLI tool that mirrors the design and structure of the AWS CLI to simplify deploying, destroying, and modifying the AWS infrastructure. In addition, a licensing service restricts system access to authorized users, maintaining strict control over who can process and handle sensitive information. The system targets the following categories of identifiers:

  • Personal Identifiers (names, SSN, medical record numbers, etc.)
  • Contact information
  • Geographical identifiers
  • Vehicle and device identifiers
  • Digital identifiers
  • Biometric identifiers
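
As a rough illustration of how identifiers in the categories above can be located and masked in free text, here is a minimal Python sketch. It uses a handful of regular expressions as a simplified stand-in for the spaCy and custom ML models described earlier, so it is not the production approach. The pattern set, the `redact` function, and the placeholder labels are all assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration only. A real system would rely on
# trained NER models and far broader rule coverage than these four regexes.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def redact(text):
    """Replace every pattern match with a [LABEL] placeholder.

    Returns the redacted text and a list of (label, matched_text) pairs
    so that the caller can audit what was removed.
    """
    findings = []
    for label, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
        text = pattern.sub(f"[{label}]", text)
    return text, findings


if __name__ == "__main__":
    note = "Patient SSN 123-45-6789, MRN: 00123456, phone 555-123-4567."
    redacted, found = redact(note)
    print(redacted)
    print(found)
```

Returning the findings alongside the redacted text is a design choice for this sketch: it gives a reviewer, such as an honest broker, a way to check what was masked.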


The Process of Product Development

Initially, our client outlined their expectations and business needs but needed assistance developing a technical plan. The goal was to quickly develop a market-leading product for healthcare data de-identification, which required a detailed technical strategy, architecture design, and a development roadmap. Our highly skilled engineering team was tasked with building, documenting, and delivering a system that used ML-powered methods to de-identify personal data, all within six months, to meet our client’s requirements. This project required deep expertise in PHI management and HIPAA compliance, as well as experience handling the DICOM medical imaging format.

Technology stack

Cloud: AWS Batch, AWS ECR, AWS S3, Aurora, AWS Lambda, and API Gateway

Data processing: Spark, PySpark, Python, Scala, and spaCy

CI/CD: Terraform, GitHub, and GitHub Actions


Business Impact

Tasked with the de-identification of patient data, the solution created by Akvelon’s team stands at the intersection of HIPAA compliance, machine learning, and seamless data processing. This initiative isn’t just about building software; it is about pioneering a secure, scalable, and intelligent platform capable of de-identifying PHI across multiple data formats, thus setting a new benchmark in medical documentation management.

Aimed at healthcare data de-identification, our solution assists with:

  • De-identification of PHI from CSV, DICOM, and .txt files, ensuring compliance with HIPAA regulations.
  • High-performance data processing: 1 GB of CSV data can be processed per hour.
  • The flexibility to enhance machine learning models in production, using patient data to improve accuracy and reliability.
  • Cost efficiency achieved through a serverless architecture that offers seamless scalability to meet evolving customer requirements.
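
To make the CSV case above more concrete, the following Python sketch shows one simple way column-level de-identification could look. It is an assumption-laden illustration and not the system's actual pipeline, which is built on Spark and ML models. The column names (`name`, `ssn`, `phone`, `patient_id`), the `deidentify_csv` function, and the surrogate ID format are all hypothetical.

```python
import csv
import io

# Hypothetical column names, chosen only for this example.
DROP_COLUMNS = {"name", "ssn", "phone"}   # direct identifiers: removed entirely
SURROGATE_COLUMN = "patient_id"           # replaced with an unrelated surrogate key


def deidentify_csv(src, dst):
    """Stream rows from src to dst, dropping identifier columns.

    Each distinct patient_id is mapped to a sequential surrogate (P000001, ...)
    that is not derived from the original value. Returns the row count.
    """
    reader = csv.DictReader(src)
    kept = [col for col in reader.fieldnames if col not in DROP_COLUMNS]
    writer = csv.DictWriter(dst, fieldnames=kept)
    writer.writeheader()

    surrogates = {}
    rows = 0
    for row in reader:
        out = {col: row[col] for col in kept}
        if SURROGATE_COLUMN in out:
            original = row[SURROGATE_COLUMN]
            if original not in surrogates:
                surrogates[original] = f"P{len(surrogates) + 1:06d}"
            out[SURROGATE_COLUMN] = surrogates[original]
        writer.writerow(out)
        rows += 1
    return rows


if __name__ == "__main__":
    source = io.StringIO(
        "patient_id,name,ssn,diagnosis\n"
        "A17,Jane Doe,123-45-6789,asthma\n"
    )
    target = io.StringIO()
    deidentify_csv(source, target)
    print(target.getvalue())
```

Rows are processed one at a time, so memory use stays flat regardless of file size. The sketch keeps the same surrogate for repeated occurrences of a patient ID, so records for one patient remain linkable after the identifiers are removed.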


Have a question?

What is the de-identification of PHI?

De-identification of protected health information (PHI) is the process of removing from a data set the specific identifiers that could disclose a patient’s identity. The identifiers that must be removed are defined in the Code of Federal Regulations, under the HIPAA Privacy Rule (45 CFR 164.514).

Why is healthcare data de-identification important for healthtech companies?

By removing personal identifiers from health data, companies can explore and develop new treatments and technologies while safeguarding individual privacy and adhering to legal standards, thus fostering a trustworthy environment for digital health advancements.

What organizations are subject to HIPAA compliance?

A broad spectrum of organizations and individuals in healthcare and related services are required to protect patient data in compliance with HIPAA’s privacy and security regulations. In general, they fall into two categories: Covered Entities (CEs) and Business Associates (BAs). CEs include health plans, healthcare clearinghouses, and healthcare providers that conduct electronic transactions. BAs are entities that perform functions or activities involving the use or disclosure of protected health information (PHI) on behalf of a covered entity, or that provide services to one.