Data.gov.uk provides sets of open data – various statistics regarding government and public services. DataArt launched an internal R&D project to create a tool that would help people to use this information.
We took a set of data listing prescriptions issued by medical centers. The set was made up of 35 .zip files, one for each month from August 2010 to June 2013. Monthly updates by data.gov.uk allow to keep the application up-to-date.
Each .zip file had 3 .csv files with:
medical centers information, including name, address and zip code;
BNF (British National Formulary) codes and names;
complete statistics for a month, about 1Gb in size, containing medical center code, prescription code (not only for medications, but for things like bandages as well), prescription names (abbreviated and shortened down to 40 symbols and thus in need of extra processing), the number of issued prescriptions, and price.
Our aim was to visualize summary statistics for selected filters and show them against a map of England and along with a key. The filters include time range, prescription form (i.e. tablets, drops, injections, etc.), a BNF section upon which the prescriptions are listed; and also geofilters, such as region or medical center.
To fulfil this, we set ourselves four major tasks.
Find a map of England and its regions and display it in svg format.
Our team decided against using google/yandex maps, choosing to work with vector graphics from the start and utilize d3 (java script library) and svg.
Bind medical centers to the coordinates on the map as well as to the regions.
The medical centers location can be extracted from the zip codes provided in the original files. We used geonames.org services to get coordinates from the zip codes. It turned out, however, that quite a few provided zip codes were outdated, which meant we had to update the information manually.
Process medications information.
The thing is that while all prescriptions stats were easily accessible, the information regarding medications and BNF was scarce and hard to process.
In essence, BNF is just a book divided into sections. Prescriptions are split into groups according to these sections as well as by medication active ingredient.
Surprisingly, the full list of prescriptions included certain sections that are non-existent in BNF. Mostly those were non-medications, e.g. swabs, sterile dressings, etc.
We had to pull together diversified informations (not just from data.gov.uk) manually to form a directory tree with sections, subsections and so on, down to active ingredient breakdown.
Also we had to manually process shortened and abbreviated prescription names to turn them into readable ones. For some reason this information was hard to find in the first place.
Performance, or getting search results fast.
We decided to utilize a MongoDB database. The number of entries (i.e. a medical center issuing a prescription on a certain date for a certain price) grew to 350 million. If you keep all of them in a single collection, the index size takes up to 16 GB. We had only 4Gb RAM available on the machine, thus we needed to find a way to divide the data so that the indices would fit into the RAM and so we could perform searches.
Our team found the solution in mongoDB’s functionality. We took the original set and divided it into several collections with smaller sets of keys and a nested data structure so that they are not part of the indices, but help to group search results.
For instance, from the original three keys (medCenterId, prescriptionId, date) we built a collection with the “medCenterId, date” primary key, in which each entry has an array of all prescriptions, e.g.:
We got a total of 700K of such entries (instead of 350 m) that we could use to apply the medical center filter with subsequent data aggregation to get summary statistics.
A similar procedure was done for prescriptions and other filter parameters. As a result we now have a number of collections (which means an increase in size on HDD), but we can get results almost instantly.
To find out more about the Data Visualization tool’s capabilities or to see it in action, visit this page.