
How Natural Language Processing (NLP) is used to extract information from clinical notes
Veena Pulicharla
September 28, 2020

Electronic health records (EHR) have been widely viewed to have great potential to advance clinical research and healthcare delivery. With increasing deployment of EHR systems, many institutions have vast amounts of clinical data that can contribute to diverse and valuable uses.

On the other hand, free-text data in narrative clinical documents remains relatively under-utilized, because it is harder to extract information from free text accurately and efficiently. To achieve the goal of “meaningful use”, transforming routinely generated EHR data into actionable knowledge requires systematic approaches.

The Predera clinical app addresses this by replacing manual methods with automated computational extraction. Its clinical note parser identifies and extracts clinical concepts and clinical codes from free-text clinical documents such as discharge summaries.

Natural Language Processing (NLP):

Natural language processing is a subfield of artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

NLP studies the structure and rules of natural language and builds systems that derive meaning from text, solving problems such as text classification and text extraction. Named entity recognition is an NLP task that identifies entities in text.

Named Entity Recognition (NER):

Named entity recognition, also known as entity identification or entity extraction, is an information extraction technique that identifies named entities in text and classifies them into predefined categories. Entities can be names, organizations, quantities, percentages and more.

Extracting entities is very useful for analyzing unstructured text. When we read a text, we naturally recognize entities such as people, locations and values. For example, in the sentence

 “He has cough but there is no evidence of fever and patient used aspirin 100 mg p.o. q.d. for one day”

we can identify several types of entities:

Problem -  “cough”, “fever”

Drug - “aspirin”

Route - “p.o”

Strength - “100 mg”

Frequency - “q.d”

Dosage - “one day”

There are many ways to extract entities from text. Common techniques are lexicon-based, rule-based, machine-learning-based, and hybrids of rule-based and machine learning approaches. Here we use machine learning and hybrid techniques.
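As a rough illustration of the lexicon-based and rule-based techniques, the following sketch tags a lightly normalized copy of the example sentence with a handful of regular expressions and a three-word drug lexicon. The patterns, the lexicon and the function name `extract_entities` are assumptions made for this example; they are not the rules used by the clinical parser.

```python
import re

# Hypothetical hand-written rules; real systems need far broader coverage.
PATTERNS = {
    "Strength": re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.IGNORECASE),
    "Route": re.compile(r"\bp\.o\.?|\bi\.v\.?", re.IGNORECASE),
    "Frequency": re.compile(r"\bq\.d\.?|\bb\.i\.d\.?|\bt\.i\.d\.?", re.IGNORECASE),
}

# Hypothetical lexicon of known drug names (lower-cased).
DRUG_LEXICON = {"aspirin", "levofloxacin", "diltiazem"}


def extract_entities(text):
    """Return (label, matched_text, start, end) tuples ordered by position."""
    entities = []
    # Rule-based pass: one regular expression per entity type.
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((label, match.group(), match.start(), match.end()))
    # Lexicon-based pass: look every alphabetic word up in the drug list.
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in DRUG_LEXICON:
            entities.append(("Drug", match.group(), match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity[2])


sentence = ("He has cough but there is no evidence of fever and "
            "patient used aspirin 100 mg p.o. q.d. for one day")
for entity in extract_entities(sentence):
    print(entity)
```

Rules like these are easy to read and debug, but every new abbreviation or drug name has to be added by hand, which is one reason to move to learned models.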

Machine-learning-based systems learn to recognize entities in text from the examples they have seen during training. To build an extractor, the model must be fed a large volume of annotated training data so that it can learn what an entity is. The model improves over time as it learns from new examples.

The hybrid approach combines rule-based and machine-learning-based systems. It consists of a model trained on annotated examples, which is then fine-tuned with a series of handcrafted rules to improve its performance.
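To give a feel for what annotated training data can look like, here is a minimal sketch that turns labelled phrases into character-offset annotations. The `(text, {"entities": [(start, end, label)]})` layout follows the convention used by spaCy's v2 training examples; that the clinical parser's models were trained on this exact layout is an assumption, and `annotate` is a hypothetical helper.

```python
def annotate(text, labelled_phrases):
    """Build one training example with (start, end, label) character spans."""
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)  # first occurrence only, enough for this sketch
        if start == -1:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


sentence = ("He has cough but there is no evidence of fever and "
            "patient used aspirin 100 mg p.o. q.d. for one day")
example = annotate(sentence, [
    ("cough", "Problem"),
    ("fever", "Problem"),
    ("aspirin", "Drug"),
    ("100 mg", "Strength"),
    ("p.o", "Route"),
    ("q.d", "Frequency"),
    ("one day", "Dosage"),
])
print(example[1]["entities"])
```

A training set is simply a long list of such examples, usually produced by human annotators rather than by string search.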

The clinical parser app is an information extraction application that uses natural language processing techniques. The parser identifies clinical concepts such as diseases, drugs, procedures and medication details, detects negative context, and splits notes into sections. For every extracted concept, the parser provides the concept's classification, its location in the text and its clinical code. It also extracts medications and medication-related concepts.

To extract concepts, the clinical parser uses machine learning models trained for named entity recognition on clinical documents. Under the hood, it relies on these models and on open source libraries. Each text document is first pre-processed: noise and unnecessary patterns are cleaned from the text, which is then sent to the respective models. Each feature of the clinical parser is exposed as an API endpoint.

The clinical parser app also provides clinical mapping for extracted concepts. Use of standard clinical terminology allows accurate information to be shared or exchanged across different departments and facilities, and also supports interoperability of systems, which results in better patient outcomes and operational and financial benefits to healthcare providers. Currently the clinical parser supports the SNOMED CT, RxNorm and MedDRA clinical ontologies; ICD-9, ICD-10 and LOINC will be added soon.


  • Section Splitter

Clinical reports and EHR data are most often divided into sections, each of which provides additional details about the patient's condition, such as laboratory tests, past medical history or reason for diagnosis. Depending on the application, we may be interested in only one or a few sections of the whole clinical report.

For a given document, the clinical parser provides a section splitter API endpoint that splits the clinical document based on section titles.
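The idea of splitting on section titles can be sketched in a few lines. This example assumes that a title is a short capitalized line ending in a colon; the regular expression, the sample note and the function name `split_sections` are illustrative and do not describe the parser's actual endpoint.

```python
import re

# Assumption: a section title is a short line of letters and spaces ending in ":".
SECTION_TITLE = re.compile(r"^([A-Z][A-Za-z /]{2,50}):[ \t]*$", re.MULTILINE)


def split_sections(note):
    """Map each lower-cased section title to the text that follows it."""
    sections = {}
    matches = list(SECTION_TITLE.finditer(note))
    for i, match in enumerate(matches):
        # A section runs until the next title, or to the end of the note.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[match.group(1).strip().lower()] = note[match.end():end].strip()
    return sections


sample_note = """History of Present Illness:
He has cough but there is no evidence of fever.

Discharge Medications:
Levofloxacin 500 mg p.o. q.d. for a seven day course.
"""

for title, body in split_sections(sample_note).items():
    print(f"{title} -> {body}")
```

Once a note is split this way, downstream steps can be restricted to the sections that matter, for example running medication extraction only on the discharge medications.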

  • Medications

In any clinical report or medical bill, one of the main things we look at is medications. Much of the information around medications is very useful but hard to capture from free text: drug name, dosage, route of administration, frequency, strength, form of medication and so on.

The clinical parser identifies and extracts all of this information. It uses the medaCy open source library, which is built on the spaCy framework, trained on clinical notes to identify medications and medication-related concepts.

  • Clinical Concepts

The clinical parser provides concept classification: it identifies concepts and assigns each extracted concept a category such as problem (disease condition), treatment or test (laboratory test). For clinical concept extraction, it uses Clinical NER (CliNER), a named entity recognition model trained on MIMIC II.

  • Negation

Negative and uncertain findings are common in clinical reports, but distinguishing them from positive findings is hard in information extraction. Physicians write about disease conditions and symptoms that a patient has and has not experienced, so identifying findings is of little help without knowing whether their context is positive or negative. The clinical parser identifies negative context and returns the negation words. It uses the NegEx algorithm and a scispaCy (scientific spaCy) model to identify negation.
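The core idea behind NegEx-style negation detection is to look for a trigger phrase shortly before a concept. The sketch below is a heavily simplified version of that idea: it only handles triggers that precede the concept, and the trigger list, the five-token window and the function name `find_negation` are assumptions for illustration, not the parser's configuration.

```python
import re

# Illustrative subset of negation triggers.
NEGATION_TRIGGERS = ["no evidence of", "negative", "denies", "without", "no"]
# Words assumed to end the scope of a preceding trigger.
SCOPE_TERMINATORS = {"but", "however", "although"}
# Assumed maximum number of tokens allowed between trigger and concept.
WINDOW = 5


def find_negation(sentence, concept):
    """Return the closest in-scope trigger before `concept`, or None."""
    lowered = sentence.lower()
    position = lowered.find(concept.lower())
    if position == -1:
        return None
    before = lowered[:position]
    best = None  # (end offset of trigger, trigger text)
    for trigger in NEGATION_TRIGGERS:
        for match in re.finditer(r"\b" + re.escape(trigger) + r"\b", before):
            between = before[match.end():].split()
            in_scope = (len(between) <= WINDOW
                        and not SCOPE_TERMINATORS & set(between))
            if in_scope and (best is None or match.end() > best[0]):
                best = (match.end(), trigger)
    return best[1] if best else None


sentence = ("She did have another episode on the medical floor of chest pain, "
            "which showed no evidence of EKG changes and negative troponin.")
for concept in ("chest pain", "EKG changes", "troponin"):
    print(concept, "->", find_negation(sentence, concept))
```

The full NegEx algorithm also covers triggers that follow the concept and pseudo-negations such as "no increase", which this sketch ignores.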

  • Medical ontologies

As EHRs have become widely used, a standard terminology is required to share information across multiple systems. Coding terminologies provide a way to map EHR data so that it can be shared and used easily. One of the main features of the clinical parser is that it returns clinical ontology code mappings: it gives codes from the SNOMED CT, RxNorm and MedDRA clinical terminology dictionaries. It uses cTAKES to extract SNOMED CT codes. cTAKES is an open source natural language processing system used mainly for clinical analysis and information extraction.
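At its simplest, code mapping is a dictionary lookup from a recognized term to a terminology code. The sketch below uses only the three SNOMED CT codes that appear in the sample output later in this post. Real mapping, as done by cTAKES, relies on full terminology dictionaries and much more careful matching, so treat the lookup table and `map_to_codes` as a toy assumption.

```python
import re

# Toy lookup table; the codes are the ones shown in the sample output of this post.
SNOMED_LOOKUP = {
    "chest pain": "29857009",
    "pain": "22253000",
    "chest": "51185008",
}


def map_to_codes(text, lookup=SNOMED_LOOKUP):
    """Return a code entry for every word sequence found in the lookup table."""
    tokens = re.findall(r"[a-z]+", text.lower())
    longest = max(len(term.split()) for term in lookup)
    results = []
    # Try longer phrases first so multi-word terms are listed before their parts.
    for size in range(longest, 0, -1):
        for i in range(len(tokens) - size + 1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in lookup:
                results.append({"code": lookup[phrase], "text": phrase})
    return results


sentence = ("She did have another episode on the medical floor of chest pain, "
            "which showed no evidence of EKG changes and negative troponin.")
for entry in map_to_codes(sentence):
    print(entry)
```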

  • Existing solutions

For the scenarios described above, clinical parsers like this one offer a way to structure complex, unstructured healthcare data. There are many such clinical tools and clinical support services on the market today.

Amazon Comprehend Medical: Comprehend Medical is a HIPAA-eligible named entity recognition and relationship extraction service from AWS, built on state-of-the-art deep learning models. Currently, Comprehend Medical performs NER in five medical categories: Anatomy, Medical Condition, Medication, Protected Health Information, and Test, Treatment and Procedure. Additionally, the service provides relationship extraction for detected entities as well as contextual information such as negation and temporality in the form of traits.

Lexigram: Lexigram is a proprietary application that provides healthcare knowledge and data extraction APIs. It is similar to Predera's clinical app: it extracts clinical entities such as drugs, diseases and symptoms, along with contextual information.

Talix: Talix is another tool for handling unstructured clinical data. It tries to identify clinical concepts semantically using Natural Language Understanding (NLU), a subset of NLP. It identifies semantic types, i.e. clinical entities, and extracts their medical codes.

Why does the clinical parser matter?

The clinical parser helps developers and researchers take advantage of the whole of the unstructured free text and supports them in making consistent, well-informed decisions. It can play a very important role in the clinical, pharma and pharmacovigilance domains.

Take the example of tagging an adverse event. Adverse events are reported in many ways, such as by phone, mail and fax. These reports are reviewed by pharma agents, who have to check manually for symptoms and disease conditions and also work out whether each condition appears in a positive or negative context. This is a very tedious process, and all of it can be captured by the parser.

Extraction of clinical information supports clinical decision making and improves the quality of care. Valuable insights remain locked in unstructured medical records, such as scanned documents in PDF format, which, while human readable, present a major obstacle to the automation and analytics required. More than a billion medical records are created every year, and the clinical and financial insights contained in these records are required by an average of 20+ roles and processes downstream of record generation. Currently, healthcare providers need an army of professionals to read, understand and extract healthcare data from the flood of clinical documents generated every day, but success has been elusive.

Predera Clinical app examples:

Medications extraction

Sample Input : Levofloxacin 500 mg p.o. q.d. for a seven day course to be completed on [**2118-6-21**].

Output :

{
  "Medications": [
    {
      "drug": "Levofloxacin",
      "route": "p.o",
      "strength": "500 mg",
      "frequency": "q.d.",
      "dosage": "seven day"
    }
  ]
}




Concept classification

Sample Input :  Continued on aspirin Imdur, and diltiazem for rate control per her outpatient regimen. change in EKG.

Output :



[
  {
    "category": "treatment",
    "concept": "aspirin imdur,",
    "end_index": 3,
    "line_number": 1,
    "start_index": 2
  },
  {
    "category": "treatment",
    "concept": "diltiazem",
    "end_index": 5,
    "line_number": 1,
    "start_index": 5
  },
  {
    "category": "test",
    "concept": "ekg",
    "end_index": 15,
    "line_number": 1,
    "start_index": 15
  }
]
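The `start_index` and `end_index` values in this output are consistent with zero-based positions of whitespace-separated tokens on the line, with an inclusive end. The following sketch reproduces them. `token_span` is a hypothetical helper, and stripping trailing punctuation before comparing is our assumption rather than documented parser behavior.

```python
def _clean(token):
    """Lower-case a token and drop surrounding punctuation before comparing."""
    return token.lower().strip(".,;:")


def token_span(line, concept):
    """Return (start_index, end_index) of `concept` among whitespace tokens."""
    tokens = [_clean(token) for token in line.split()]
    target = [_clean(token) for token in concept.split()]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return i, i + len(target) - 1  # end index is inclusive
    return None


line = ("Continued on aspirin Imdur, and diltiazem for rate control "
        "per her outpatient regimen. change in EKG.")
for concept in ("aspirin imdur,", "diltiazem", "ekg"):
    print(concept, token_span(line, concept))
```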



Medical codes

Input : She did have another episode on the medical floor of chest pain, which showed no evidence of EKG changes and negative troponin.

Output :


{
  "disorders": None,
  "medications": [],
  "procedures": [],
  "snomed_codes": [
    {
      "code": "29857009",
      "text": "Chest Pain"
    },
    {
      "code": "22253000",
      "text": "Pain"
    },
    {
      "code": "51185008",
      "text": "Chest"
    }
  ],
  "symptoms": [
    "chest pain"
  ]
}





term: "EKG changes", start index: 93, end index: 104

term: "troponin", start index: 118, end index: 126
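The start and end indexes reported for these two terms are consistent with zero-based character offsets into the input sentence, with an exclusive end, which is the same convention as Python string slicing. A quick check, where `char_span` is a helper name of our own:

```python
def char_span(text, term):
    """Return (start, end) character offsets of the first occurrence of `term`."""
    start = text.find(term)
    if start == -1:
        return None
    return start, start + len(term)  # end is exclusive, as in text[start:end]


sentence = ("She did have another episode on the medical floor of chest pain, "
            "which showed no evidence of EKG changes and negative troponin.")
for term in ("EKG changes", "troponin"):
    start, end = char_span(sentence, term)
    print(term, start, end, repr(sentence[start:end]))
```

Because the offsets are plain character positions, a consumer of the output can highlight each term in the original note with a simple slice.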




We hope you found our blog post informative. If you have any project inquiries or would like to discuss your data and analytics needs, please don't hesitate to contact us. We're here to help! Thank you for reading.