Part of my research involved a lot of work with medical discharge records. These documents needed to be de-identified, meaning stripped of any PHI (Protected Health Information) such as names, dates, locations, and ages. The format specified for these documents was an XML-based one that a colleague referred to as "Standoff Annotations", because one element in the XML document contains the raw text (within a CDATA section) and another element contains the annotations themselves in the form of "PHI tags".

This is an example of how a Standoff Annotation looks in my editor (using an Emacs mode I created, of course):

[Screenshot: i2b2-mode in Emacs]
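
For readers who prefer text to a screenshot, here is a minimal sketch of the idea using only Python's standard library. The element and attribute names in the sample (deIdi2b2, TEXT, TAGS, start, end, text, TYPE) are assumptions made for illustration and may not match the exact schema, and parse_standoff and is_valid are hypothetical helpers, not part of my tools. The validity check shown is just one plausible notion of "valid": every tag's offsets must point at the text it claims to cover.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element and attribute names are
# assumptions for illustration, not a guaranteed match for the real schema.
SAMPLE = """<deIdi2b2>
<TEXT><![CDATA[Mr. Smith was admitted on 2014-03-02.]]></TEXT>
<TAGS>
<NAME id="P0" start="4" end="9" text="Smith" TYPE="PATIENT"/>
<DATE id="P1" start="26" end="36" text="2014-03-02" TYPE="DATE"/>
</TAGS>
</deIdi2b2>
"""


def parse_standoff(xml_string):
    """Split a standoff document into its raw text and a list of tag dicts."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT")  # the parser unwraps the CDATA section
    tags = []
    for el in root.find("TAGS"):
        tags.append({
            "name": el.tag,
            "type": el.get("TYPE"),
            "start": int(el.get("start")),
            "end": int(el.get("end")),
            "text": el.get("text"),
        })
    return text, tags


def is_valid(text, tags):
    """One plausible validity check: offsets must match the annotated text."""
    return all(text[t["start"]:t["end"]] == t["text"] for t in tags)


text, tags = parse_standoff(SAMPLE)
print(is_valid(text, tags))
```

The point of the "standoff" layout is visible here: the raw text is never modified, and each annotation refers to it purely by character offsets.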

Here's a list of common tasks required when working with large sets (read: thousands) of these documents:

  • Tokenizing, or operating on a "sliding window" of tokens
  • Evaluating F1 measures of one set of documents versus another (test vs train)
  • Creating custom rules to define something, e.g. forcing every match of the regex (19|20)[0-9]{2} to be a DATE tag
  • Determining if a StandoffAnnotation is actually valid
  • Retrieving all of the valid StandoffAnnotation objects from a filesystem
  • Determining if a document has overlapping PHI
  • Determining what PHI exists at a certain offset or range of offsets
  • Retrieving PHI that match a certain set of criteria from one or more documents
  • Retrieving a set of tokens from a document with the pertinent PHI associated with a token
  • Remapping tag metadata across multiple documents, e.g. changing all DATE tags of TYPE "year" to TYPE "month"
  • Converting documents from StandoffAnnotation to an inline XML format
  • Converting back into StandoffAnnotation, and converting to and from third-party document formats
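
To make a few of these tasks concrete, here is a minimal plain-Python sketch of the sliding window, the regex rule, the offset lookup, the overlap check, and an exact-match F1. It deliberately does not use the i2b2tools API: the PHI tuple and every function name below are hypothetical stand-ins chosen for illustration, and the F1 shown counts a tag as correct only on an exact match of name, offsets, and text, which is just one of several possible matching criteria.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a PHI tag; this is not the i2b2tools class.
PHI = namedtuple("PHI", ["name", "start", "end", "text"])

# The year regex from the list above. Note it has no word boundaries,
# so it would also match four digits inside a longer number.
YEAR_RE = re.compile(r"(19|20)[0-9]{2}")


def sliding_windows(tokens, size):
    """Return every run of `size` consecutive tokens."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def tag_years(text):
    """Custom rule: every match of the year regex becomes a DATE tag."""
    return [PHI("DATE", m.start(), m.end(), m.group(0))
            for m in YEAR_RE.finditer(text)]


def phi_in_range(tags, start, end):
    """Return the tags that intersect the half-open range [start, end)."""
    return [t for t in tags if t.start < end and start < t.end]


def has_overlap(tags):
    """True if any two tags share at least one character offset."""
    ordered = sorted(tags, key=lambda t: (t.start, t.end))
    # After sorting by start, any overlap shows up between neighbours.
    return any(a.end > b.start for a, b in zip(ordered, ordered[1:]))


def f1_score(gold, predicted):
    """Exact-match F1 between a gold set and a predicted set of tags."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "Seen in 1998 and again in 2014."
tags = tag_years(text)
print(tags)
print(phi_in_range(tags, 10, 11))
print(has_overlap(tags))
```

Working with half-open [start, end) ranges keeps these helpers consistent with Python slicing, so text[tag.start:tag.end] always gives back the annotated string.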

The new set of tools I created lets you do all of the tasks listed above with just a few lines of code. It has a well-documented API with examples, comes with a base set of unit tests, and is licensed under Apache 2.0, so feel free to fork, contribute, or just use it for your own research.

i2b2tools repository

i2b2tools README