Label Extraction and AI for Digital Pathology

Tissue-based studies generate large amounts of histology data containing important biological information in the form of imagery and metadata. Digital pathology slides are identified by text and barcode labels, while older technologies used printed or handwritten labels for specimen labeling. The Label Extraction Solution uses state-of-the-art OCR technologies, image processing, and AI to read, understand, and store label data from digital pathology slides. Combined with manual validation of the extracted data, this yields a highly automated process that reduces the time needed to search for and find slides. The extracted label text is translated into a structured data format and stored in a database with search capabilities. The solution has saved pathologists significant time and effort by avoiding repeat sample orders, providing quick access to historic data, and improving accuracy.

Features of Digital Pathology


This platform performs the archival and retrieval of metadata using a standard data structure.

Decision Support

This program supports determinations, judgments, and courses of action for solving decision-making problems.

Data Harmonization

Standard structured datasets help to identify outliers and trends.

Quality Control

Easy search and access of all the datasets support further research and analytical activities.

Remote Viewing

Easy search and access of all the datasets support further research and analytical activities.

Case Study :: Digital Pathology & AI
Digital pathology driving medical breakthroughs and a new set of challenges

Digital pathology is rapidly becoming the new standard of care, thanks to recent approvals from the Food and Drug Administration (FDA) for applications such as primary disease diagnosis.

Regarded as the bridge between science and medicine, pathology involves teams of scientists and medical staff studying biological samples to understand drivers of illnesses and diseases. It plays an important role in investigating the effects of new drugs as well as understanding the characteristics of viruses and bacteria.

The use of modern technology in pathology has not only improved the efficiency, precision, and granularity with which biological specimens can be captured; it also enables a more automated way to extract, analyze, and store the metadata associated with specimens. Computer algorithms can drastically accelerate our ability as a species to identify, prevent, and treat factors in our environment that can harm us.

“This convergence of advanced imaging, automation, and powerful analytics like natural language processing (NLP), machine learning, and artificial intelligence (AI) in healthcare and life sciences organizations are bringing together the tools needed for scientists and clinicians to unlock medical breakthroughs at a pace like never before,” says David Dimond, Chief Innovation Officer Global Healthcare & Life Sciences at Dell Technologies.

Advancements in digital pathology have, however, created a new set of challenges as researchers begin to adopt new digital workflows. The shift to digital workflows means that specimens captured using older technologies must be converted into a digital format in order to remain useful.

Aventior’s experience in developing and implementing technology for companies in the life sciences industry made us a perfect fit to solve the challenges faced by a Massachusetts-based biotechnology company developing gene therapies for severe genetic disorders and cancer.

Over the years, the company had collected tens of thousands of biological specimens as part of its research. Tissue-based studies generate large amounts of histology data containing biological information in the form of imagery and metadata. These digital pathology slides are labeled using text for their identification, and older technologies used printed or handwritten labels for specimen labeling.

As a result, it becomes virtually impossible for pathology teams to quickly search for and find the specimens they are looking for, or to categorize them based on tissue, disease, markers, and other attributes. Certain specimens can be extremely difficult to come by, and the ability to access archived specimens is critical in this field.

Aventior worked with the client to address this challenge by developing a solution that utilized a combination of techniques that would allow them to efficiently capture, store, and access biological specimens for analysis.

AI used in label extraction for digital pathology

With tens of thousands of slides stored in various formats and identified by handwritten labels, extracting and organizing the information efficiently would require not only a series of steps but also the use of image processing techniques, artificial intelligence (AI), and optical character recognition (OCR).

Based on the requirements at hand, Aventior developed a solution that would include four key steps:

  1. The first step involves extracting label text from files stored in the MIRAX file format. The platform scans all the data files (.dat) to find the one associated with the label. That label data file is then converted to a raw PNG image.
  2. These raw PNG files are then processed for better text extraction by image rotation, image enhancement, and image thresholding modules. The platform would also perform morphological transformations such as erosion and dilation on the text if needed.
  3. The text from the processed image would then be extracted using OCR techniques.
  4. The extracted text would then be added to the user-defined structured data format and stored in a database with search capabilities, where manual validation may be conducted to verify the quality of the extracted data.
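
The thresholding and morphological transformations named in the second step can be illustrated with a minimal, dependency-free sketch. This is not Aventior's implementation: the function names, the 3x3 window, and the cutoff of 128 are assumptions chosen for illustration, and a production pipeline would more likely rely on an image processing library.

```python
from typing import List

# A grayscale image as rows of pixel values, 0 (black) to 255 (white).
Image = List[List[int]]


def threshold(image: Image, cutoff: int = 128) -> Image:
    """Binarize: pixels darker than the cutoff become ink (1), the rest background (0)."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]


def _window(binary: Image, r: int, c: int) -> List[int]:
    """Values in the 3x3 window around (r, c); cells outside the image count as background."""
    rows, cols = len(binary), len(binary[0])
    return [
        binary[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0
        for rr in (r - 1, r, r + 1)
        for cc in (c - 1, c, c + 1)
    ]


def erode(binary: Image) -> Image:
    """Keep a pixel as ink only if its whole 3x3 window is ink (removes small specks)."""
    return [
        [min(_window(binary, r, c)) for c in range(len(binary[0]))]
        for r in range(len(binary))
    ]


def dilate(binary: Image) -> Image:
    """Mark a pixel as ink if any pixel in its 3x3 window is ink (thickens strokes)."""
    return [
        [max(_window(binary, r, c)) for c in range(len(binary[0]))]
        for r in range(len(binary))
    ]
```

Applying `dilate(erode(threshold(image)))` removes isolated dark specks while restoring the shape of larger strokes, which is one common reason to run erosion and dilation before OCR.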

A proof of concept was first developed to test the approach. The test involved the processing of 1000 slides with an Excel file as the output. Once validation of the platform had been completed, additional functionality was developed to support the output of the data directly to an SQL database.
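
As a rough illustration of writing extracted labels to a searchable SQL database, the sketch below uses Python's built-in `sqlite3` module. The table name, the columns (`study_id`, `tissue`, `stain`), and the `validated` flag are hypothetical; the source says only that the structure is user defined and that manual validation may follow.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabelRecord:
    # Hypothetical fields: the actual structure is user defined.
    slide_file: str
    study_id: str
    tissue: str
    stain: str
    validated: bool = False  # set to True once a person has reviewed the record


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the label database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS slide_labels (
               slide_file TEXT PRIMARY KEY,
               study_id   TEXT,
               tissue     TEXT,
               stain      TEXT,
               validated  INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def save(conn: sqlite3.Connection, rec: LabelRecord) -> None:
    """Insert a record, replacing any earlier row for the same slide file."""
    conn.execute(
        "INSERT OR REPLACE INTO slide_labels VALUES (?, ?, ?, ?, ?)",
        (rec.slide_file, rec.study_id, rec.tissue, rec.stain, int(rec.validated)),
    )
    conn.commit()


def search(
    conn: sqlite3.Connection,
    tissue: Optional[str] = None,
    stain: Optional[str] = None,
) -> List[LabelRecord]:
    """Return records matching the given filters (case-insensitive), ordered by file name."""
    clauses, params = [], []
    if tissue is not None:
        clauses.append("tissue = ? COLLATE NOCASE")
        params.append(tissue)
    if stain is not None:
        clauses.append("stain = ? COLLATE NOCASE")
        params.append(stain)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    rows = conn.execute(
        "SELECT slide_file, study_id, tissue, stain, validated FROM slide_labels"
        + where
        + " ORDER BY slide_file",
        params,
    ).fetchall()
    return [LabelRecord(r[0], r[1], r[2], r[3], bool(r[4])) for r in rows]
```

With records saved this way, a call such as `search(conn, tissue="liver")` returns every matching slide regardless of how the tissue name was capitalized on the label.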

Using this approach, Aventior was able to design, develop, and launch the platform for the client within a couple of months.

More value and efficiency from existing specimens

Aventior’s AI-based automated label extraction platform quickly allowed the client to enhance their research capabilities. Pathology teams saved time by no longer having to manually review specimen slides, which meant more time could be spent analyzing data and gaining important insights.

Processing times were reduced by 80% compared to the manual process.

In the manual process, information would be stored in disparate locations, making it difficult to identify outliers and trends. With the platform, data is now stored in a harmonized form, enabling teams to find and analyze data faster.

Data harmonization helps users to identify outliers and trends in a much more efficient manner.

Another benefit that emerged was improved accessibility to the data, as researchers no longer had to be in the same physical location as the slides to access the information. This allowed pathologists to extract additional value from the archived specimen data, as information could be shared more rapidly across a wider range of applications.

Easy search and access of the datasets support further research and analytical activities.

At Aventior, we believe the use of AI in healthcare will continue to accelerate medical breakthroughs. As more healthcare centers adopt digital pathology, one can only imagine what new capabilities will be unlocked in the years to come. That is why our company is committed to supporting the adoption of AI and machine learning in the healthcare and life science industries.