Open-Source Software to Aid Cancer Researchers

Written By
Stacy Lawrence
Mar 8, 2006

Records of cancer patients nationwide may soon be networked for researchers to access, and now a new study has found a way to de-identify individual records, making the project more feasible.

De-identifying electronic medical records so they can be used for research purposes is a must for health care institutions anxious to remain compliant with HIPAA (Health Insurance Portability and Accountability Act). But this is often a costly and labor-intensive process.

In the study, conducted by Harvard University researchers on an open-source software program designed to scrub identifying information from patient records, 19 identifiers were removed from every patient's record, including patient, institution and physician names as well as addresses, dates and medical record numbers.
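To make the idea of scrubbing concrete, here is a minimal Python sketch of pattern-based de-identification. It is not the Harvard program's method, which the article does not describe; the patterns, the `scrub` function, the placeholder tags and the sample report text are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; not taken from the Harvard tool.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("MRN", re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE)),
]


def scrub(text, known_names=()):
    """Replace matched identifiers with bracketed placeholder tags."""
    # Pattern-based identifiers: dates and medical record numbers.
    for tag, pattern in PATTERNS:
        text = pattern.sub(f"[{tag}]", text)
    # Dictionary-based identifiers: names supplied by the caller.
    for name in known_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text


# Invented sample text, not from any real report.
report = "Patient Jane Doe, MRN: 1234567, biopsy received 03/08/2006."
print(scrub(report, known_names=["Jane Doe"]))
```

An exact-match name list like this one would not catch a misspelled name, which hints at why real reports are harder to scrub than a toy example suggests.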

The study focused specifically on pathology reports, in response to a project by the National Cancer Institute. This project has successfully demonstrated a prototype of a Web-based, searchable, peer-to-peer network for identifying and locating pathologic tissue samples at various institutions by searching information contained within pathology reports.

The intent of the project is to create and demonstrate software that automates the de-identification of the patient records likely to be made available via the National Cancer Institute network.

Existing “scrubbing” software is either proprietary or offers only a partial solution, removing just one type of patient information. After creating and refining their software, the researchers used it to process 1,800 new pathology reports.

Each report in the Harvard study was reviewed manually before and after de-identification to catalog all identifiers and note those that were not removed.

About seven out of 10 of these reports contained identifiers in the body of the report, for a total of 3,499 individual identifiers.

The program successfully removed more than 98 percent of these. Only 19 HIPAA-specified identifiers, mainly consult accession numbers and misspelled names, were missed.

Of the 41 non-HIPAA identifiers missed, the majority were partial institutional addresses and ages. Outside consultation case reports typically contained several identifiers and were reportedly the most challenging to de-identify comprehensively.
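The reported figures can be checked with a little arithmetic. The sketch below assumes, which the article does not state outright, that the 3,499 total covers both HIPAA and non-HIPAA identifiers and that the 19 and 41 misses together account for everything the program failed to remove. The variable names are illustrative.

```python
# Figures quoted in the article.
total_identifiers = 3499
missed_hipaa = 19
missed_non_hipaa = 41

# Assumption: the two missed counts together are all the misses.
missed = missed_hipaa + missed_non_hipaa
removed = total_identifiers - missed
removal_rate = removed / total_identifiers

print(f"{removed} of {total_identifiers} removed ({removal_rate:.1%})")
```

Under that assumption the result is consistent with the "more than 98 percent" figure quoted above.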

Performance varied across the three institutions that participated in the study, and the researchers argued that this highlights the need for site-specific customization, which can be accomplished with their tool.

A PDF version of the full report is available here for download.
