De-Identification of Patient Data
Enable seamless and scalable ML-powered processing of large data volumes, ensuring the automatic de-identification of patient data across multiple file formats. Prioritize compliance with HIPAA regulations, focusing on the specifics of patient data protection through advanced de-identification techniques.
A data processing pipeline that removes PHI from any CSV file.
Files are processed to anonymize PHI and are then sorted into a separate folder.
Files are cleared of PHI through a unified pipeline and a computer vision module.
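As a rough illustration of the CSV flow described above (mask the PHI, then place the cleaned file in a separate folder), here is a minimal Python sketch. It is not the production pipeline: the column names in PHI_COLUMNS, the "[REDACTED]" placeholder, and the function name deidentify_csv are hypothetical, and it assumes PHI columns are known in advance rather than detected by ML models.

```python
import csv
from pathlib import Path

# Hypothetical column names treated as PHI (an assumption for this sketch).
PHI_COLUMNS = {"name", "ssn", "phone"}
REDACTED = "[REDACTED]"


def deidentify_csv(src: Path, out_dir: Path) -> Path:
    """Stream src row by row, mask PHI columns, and write the copy into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / src.name
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames or []
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # Replace values in PHI columns; copy every other value unchanged.
            masked = {
                col: REDACTED if col.lower() in PHI_COLUMNS else row[col]
                for col in fieldnames
            }
            writer.writerow(masked)
    return dst
```

Calling deidentify_csv(Path("incoming/patients.csv"), Path("deidentified")) would write the masked copy to deidentified/patients.csv and leave the source file untouched. Reading and writing one row at a time keeps memory use flat regardless of file size.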
Industry Specifics
Medical document recognition in healthcare must comply with HIPAA standards.
Algorithm-powered
Increased accuracy is ensured by continuous training of existing ML models.
Results
We developed a solution that processes medical documentation in CSV format at a rate of 1 GB per hour.
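To put the stated rate in perspective, 1 GB per hour is roughly 0.28 MB per second. The short Python sketch below estimates processing time for a dataset of a given size. The function name estimated_hours and the workers parameter are hypothetical, and the assumption that throughput scales linearly with parallel workers is ours for illustration, not a measured result.

```python
def estimated_hours(dataset_gb: float, gb_per_hour: float = 1.0, workers: int = 1) -> float:
    """Estimate processing time, assuming throughput scales linearly with workers."""
    if dataset_gb < 0 or gb_per_hour <= 0 or workers < 1:
        raise ValueError("size must be non-negative; rate and workers must be positive")
    return dataset_gb / (gb_per_hour * workers)


print(estimated_hours(24))             # 24.0 hours on a single worker
print(estimated_hours(24, workers=4))  # 6.0 hours if scaling were perfectly linear
```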
How Does De-Identification of Patient Data Work?
Healthcare data de-identification is handled by a specialized NLP module that combines spaCy's standard pre-trained models with custom ML models. These models are refined using a Harvard medical dataset and supplemented with open medical data to identify PHI within text, ensuring comprehensive processing across all data types.
To enhance user accessibility, Akvelon’s team developed a desktop application compatible with Windows and macOS, enabling data analysts and honest brokers to process data without needing to navigate the AWS Console. Additionally, we created a CLI tool that mirrors the design and structure of the AWS CLI to simplify the deployment, destruction, and modification of AWS infrastructure. Moreover, a licensing service was established to restrict system access exclusively to authorized users, thereby maintaining strict control over who can process and handle sensitive information, including:
- Personal Identifiers (names, SSN, medical record numbers, etc.)
- Contact information
- Geographical identifiers
- Vehicle and device identifiers
- Digital identifiers
- Biometric identifiers
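A few of the identifier categories above can be illustrated with a small rule-based sketch in Python. This is deliberately not the NLP approach described earlier: it swaps in plain regular expressions, which only catch identifiers with a fixed shape (an SSN, a phone number, an email address, an IPv4 address). The patterns, labels, and the function name redact are assumptions made for this example; names, addresses, and other free-text PHI would need the trained models.

```python
import re

# Illustrative patterns only; real PHI detection needs far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                 # personal identifier
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),       # contact information
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),    # contact information
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),            # digital identifier
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("SSN 123-45-6789, call 555-123-4567 or mail jane.doe@example.org from 10.0.0.1"))
# SSN [SSN], call [PHONE] or mail [EMAIL] from [IP]
```

Replacing matches with a category label rather than deleting them keeps the text readable and shows reviewers what kind of identifier was removed.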
The Process of Product Development
Initially, our client outlined their expectations and business needs but needed assistance developing a technical plan. The goal was to quickly develop a market-leading product for healthcare data de-identification, which required a detailed technical strategy, an architecture design, and a development roadmap. Our highly skilled engineering team was tasked with building, documenting, and delivering a system that used ML-powered methods to de-identify personal data, all within six months, to meet our client’s requirements. This project required deep expertise in PHI management and HIPAA compliance, as well as experience with the DICOM medical imaging format.
Technology stack
Cloud: AWS Batch, AWS ECR, AWS S3, Aurora, AWS Lambda, and API Gateway
Data processing: Spark, PySpark, Python, Scala, and spaCy
CI/CD: Terraform, GitHub, and GitHub Actions
Business Impact
Tasked with the de-identification of patient data, the solution created by Akvelon’s team stands at the intersection of HIPAA compliance, machine learning, and seamless data processing. This initiative isn't just about building software; it is about pioneering a secure, scalable, and intelligent platform capable of de-identifying PHI across multiple data formats, thus setting a new benchmark in medical documentation management.
Aimed at healthcare data de-identification, our solution assists with:
- De-identification of PHI from CSV, DICOM, and .txt files, ensuring compliance with HIPAA regulations.
- High-performance data processing capabilities: 1 GB of CSV data can be processed per hour.
- The flexibility to enhance machine learning models in production, using patient data to improve accuracy and reliability.
- Cost efficiency achieved through a serverless architecture that offers seamless scalability to meet evolving customer requirements.
Have a question?
De-identification of protected health information (PHI) is the process of removing specific identifiers that might disclose a patient’s identity from a data set. The identifiers that must be removed are specified by the HIPAA Privacy Rule in the Code of Federal Regulations (45 CFR 164.514).
By removing personal identifiers from health data, companies can explore and develop new treatments and technologies while safeguarding individual privacy and adhering to legal standards, thus fostering a trustworthy environment for digital health advancements.
A broad spectrum of organizations and individuals in healthcare and related services is required to protect patient data in compliance with HIPAA's privacy and security regulations. In general, they fall into two categories: Covered Entities (CEs) and Business Associates (BAs). CEs include health plans, healthcare clearinghouses, and healthcare providers that conduct transactions electronically. BAs are entities that perform functions or activities involving the use or disclosure of protected health information (PHI) on behalf of, or provide services to, a covered entity.