
Field notes: Transforming Canadian procurement data to OCDS format

In this post, lead developer Sarah Bird documents the process of converting an existing contracting dataset into the draft standard.

With the 0.3.3 release of the data standard, we now have an initial look at the release fields and record-merging strategy for the Open Contracting Data Standard.

As part of testing out our work, we are reformatting existing published data in the Open Contracting Data Standard format to help us weed out any easy-to-spot problems. Of course, many issues will remain, but this lets us do at least one round of triage and has been a very useful process.

This blog post describes in detail the tools used to do the mapping and how to use them. Most of the tools were created at the EuroPython sprint, and we're very grateful to all the developers who volunteered there – more details here.

It is worth noting that most of these steps are not needed if a publisher is publishing data in the Open Contracting Data Standard (OCDS). In this case, only the steps that make a compiled and a versioned record might be useful.

The other steps are only useful if you have data in a different format and want to put it into the open contracting data standard format.

Sample Data

We are using csv downloads of tender notice, award, and contract data from buyandsell.gc.ca that were downloaded in August 2014. The sample data used is available on github here, but to get accurate contracting data for the Canadian Government, please go directly to https://buyandsell.gc.ca/

Overview

The basic steps taken were:

  1. Pre-process the contract history data to include a reference that can be used as the ocid to link all three data sets together.

  2. Make a map between the OCDS fields and the buyandsell.gc.ca fields in each of the three data sets.

  3. Use the mapper library to take the csv data and turn it into json, based on the maps we made in the previous steps.

  4. Validate the individual release files.

  5. Use the compiler library to create the most basic record package.

  6. Build compiledReleases & versionedReleases using the jsonmerge library.

Preprocess

OCDS introduces the idea of the ocid: a unique ID that follows the data through all phases of the contracting process – from planning, through tender and award, to completion. As buyandsell.gc.ca isn't publishing in OCDS format, its data doesn't carry an ocid. There is a common identifier between the notice and award data, but not the contract data, although we can infer it for a subset of the contract data. This ipython notebook shows how we got the ocid onto the contract data so that we could build open contracting records for all the data.

In addition there were some small pieces of data cleanup we had to carry out (e.g. removing duplicates where duplicates weren’t allowed).

As these processes are unique to this particular dataset, we won't focus on the specifics; many datasets are likely to need some kind of cleaning and preprocessing, but the details will always be different.
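
For a flavour of what this preprocessing can look like, here is a minimal sketch using pandas. The file name, column names, and inference rule are all made up for illustration – the real logic lives in the ipython notebook linked above:

import pandas as pd

# Load the raw contract history csv (file and column names are hypothetical).
contracts = pd.read_csv('contract_history.csv')

# Remove exact duplicate rows, which aren't allowed downstream.
contracts = contracts.drop_duplicates()

# Infer the shared reference number from the contract number where possible,
# e.g. by extracting a common prefix. Rows where nothing can be inferred are
# left without an ocid.
contracts['reference_number'] = contracts['contract_number'].str.extract(
    r'^([A-Z]{2}-[^-]+-\d{3}-\d+)', expand=False)

contracts.to_csv('contracts_with_ocid.csv', index=False)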

Mapping

There are two stages to making the map between the OCDS fields and the buyandsell fields:

  1. Making an empty map

  2. Populating the map

First we use the open-contracting branch of the json-schema-random library – https://github.com/open-contracting/json-schema-random/tree/opencontracting.

The json-schema-random library originally takes a schema and makes up dummy json data that meets that schema, so that you can test your schema. Our branch has a couple of fixes (which are being merged back into the original library) and extra features. In particular, we can now generate a single empty json document that conforms to the OCDS schema:

$ ./cli.js ../standard/standard/schema/release-schema.json --no-additional --no-random > blank-ocds-release.json

This gives us a json document that looks like this:

{
    "uri": "",
    "publishedDate": "",
    "releases": [
        {
            "ocid": "",
            "releaseID": "",
            "releaseDate": "",
            ...

The blank maps can be seen here: https://github.com/open-contracting/sample-data/tree/master/buyandsell/mappings/blank

Next, we take that map and for each csv specify which fields in the csv correspond to which ocds fields. This gives us something like:

 "releases": [{
	"ocid": "reference_number",
	"releaseDate": "constant:2014-07-26",
	"releaseTag": "constant:tenderNotice",
	"language": null,
	"buyer": {
		"id": {
			"name": "end_user_entity",

This map is going to be used by the mapper library and so we can use features of that library such as specifying a constant by using the prefix “constant:” — See here.

Note: if a field is not available in the data, we must specify null for the mapper to produce useful output.

Also note that the mapper only handles things inside releases; the data package items, like publisher or publishedDate, cannot be populated by the mapper. Whatever values are in there will be transferred to the new file verbatim. For example:

Under releases, the field ocid will be populated with the csv column “reference_number” and the buyer name with data from the column “end-user-entity_en”. However, the publisher name will be populated with the literal value “Buyandsell.gc.ca”; the mapper will not attempt to look up a column called “Buyandsell.gc.ca”. If, under publisher name, we had put “constant:Buyandsell.gc.ca”, then the whole value “constant:Buyandsell.gc.ca” would have been mapped into the json. Within the releases, by contrast, the syntax is respected: given the value “constant:2014-07-26” for the releaseDate, the value in the json will be “2014-07-26”. It becomes very clear how this works once you give it a try, so do some trial runs of your mapping to make sure everything is working as expected before running it against your whole dataset (which can take some time).
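
As a toy illustration of those rules within releases (this is not the mapper library's actual code, just a sketch of the semantics described above):

def apply_field(spec, row):
    """Resolve one mapping value against a csv row."""
    if spec is None:
        return None  # explicit null: the field is not available in the data
    if spec.startswith('constant:'):
        return spec[len('constant:'):]  # a constant, with the prefix stripped
    return row[spec]  # otherwise, treat the value as a csv column name

row = {'reference_number': 'PW-$AO-002-23619'}
print(apply_field('reference_number', row))       # PW-$AO-002-23619
print(apply_field('constant:tenderNotice', row))  # tenderNotice
print(apply_field(None, row))                     # None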

The mapper library can create a lot of unnecessary json if we’re not careful so we can remove any fields that are unused by the data to save space and time. This is particularly useful when processing the releases into records.

Build sets of releases

Now that we have the maps, we can use the mapper library to build the open contracting releases. Download the github repo; as long as you have python 2.7 on your machine, it should run, as there are no external dependencies. We can call the mapper from the command line to build our json releases. For example, to build tenders_releases.json from the mapping canada_tenders_20140916.json and the csv tpsgc-pwgsc_ao-t_e-notices_english.csv, we would run the following command:

$ python mapper.py --csv-file path/to/tpsgc-pwgsc_ao-t_e-notices_english.csv --mapping-file path/to/canada_tenders_20140916.json > path/to/tenders_releases.json

We can do this for each set of data so that we have tenders_releases.json, awards_releases.json, and contracts_releases.json. These have been compressed and are available here.

After making the three releases, we can validate them to check we have no errors before moving forward.

Validate the releases

For now, validation can be done with any json schema validation library that supports Draft v4 of the standard. Open Contracting has also put a validator online here, which can be used to validate the data.

You can either paste in json, upload a file, or specify a url. The validator accepts gzipped json as well as plain json.
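
If you would rather validate locally, here is a minimal sketch using the Python jsonschema library (the schema path follows the standard repository layout used earlier; the release file path is a placeholder):

import json

from jsonschema import Draft4Validator

with open('../standard/standard/schema/release-schema.json') as f:
    schema = json.load(f)

with open('tenders_releases.json') as f:
    data = json.load(f)

# Report every validation error, rather than stopping at the first one.
for error in Draft4Validator(schema).iter_errors(data):
    print(error.message)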

For example, to check the validity of the sample data:

  1. Go to: https://review.standard.open-contracting.org
  2. Paste the tenders_releases url into the url box: https://github.com/open-contracting/sample-data/blob/master/buyandsell/ocds_data/tenders_releases.json.gz?raw=true
  3. Select “Release” under Schema
  4. Press submit (note, you may need to wait a couple of minutes to see the result for large files)

Compile the record

Now that we are confident we have valid releases, we can compile the record. The record is designed to make it easier to discover the current state of a contracting process.

First we do a simple compile where we assemble a record. At this point we just assemble the simplest record package possible. That is, a list of records, one for each unique ocid, where each record contains the identifiers of the releases that relate to that ocid, e.g.:

  "records": [ 
    { 
      "ocid": "PW-$AO-002-23619", 
      "releases": [ 
        { 
          "scheme": "", 
          "name": "Buyandsell.gc.ca-20140918-fd97b811-fade-47a1-b18e-eacfa5df27e8", 
          "uri": "file:///path/to//awards_releases.json", 
          "uid": "Buyandsell.gc.ca-20140918-fd97b811-fade-47a1-b18e-eacfa5df27e8" 
        }, 
        { 
          "scheme": "", 
          "name": "Buyandsell.gc.ca-2014-07-26-a73a2cad-8f83-4fa0-92f5-d94bce02a9dc", 
          "uri": "file:///path/to//contracts_releases.json", 
          "uid": "Buyandsell.gc.ca-2014-07-26-a73a2cad-8f83-4fa0-92f5-d94bce02a9dc" 
        }, 
        { 
          "scheme": "", 
          "name": "Buyandsell.gc.ca-2014-07-26-75f818d2-33b9-4904-8170-08a540d91fac", 
          "uri": "file:///path/to//contracts_releases.json", 
          "uid": "Buyandsell.gc.ca-2014-07-26-75f818d2-33b9-4904-8170-08a540d91fac" 
        } 
      ] 
    }, ...

To do this compilation we use the compiler library, and then need to make a couple of tweaks ourselves. To compile the releases into a file records_linked.json:

$ python compile.py linked \
    file:///path/to/tenders_releases.json \
    file:///path/to/awards_releases.json \
    file:///path/to/contracts_releases.json > records_linked.json

You can see the compiler documentation for more information, but the linked keyword after compile.py inserts just the identifiers of the releases, rather than embedding the releases in full. Now that we have done this we can validate it, but the validator will tell us that it's missing some key fields for the data package.

So we manually open up the json file and add in these required properties to give us a valid record package.
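
That manual step can also be scripted; a sketch, assuming publisher, publishedDate, and uri are the missing required fields (the values here are placeholders):

import json

with open('records_linked.json') as f:
    package = json.load(f)

# Fill in the data package metadata that the compiler does not set for us.
package['uri'] = 'https://example.com/records_linked.json'  # placeholder
package['publisher'] = {'name': 'Buyandsell.gc.ca'}
package['publishedDate'] = '2014-09-18'

with open('records_linked.json', 'w') as f:
    json.dump(package, f, indent=2)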

With a little examination, we can see how many individual releases we have for a given contracting process in the data we examined (note this is not representative of the Canadian data, as we had to guess about identifiers; it is just an example of what could be done).
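
One way to do that examination is to count the release references on each record of the package we just built, e.g.:

import json
from collections import Counter

with open('records_linked.json') as f:
    package = json.load(f)

# How many records have 1, 2, 3, ... linked releases?
counts = Counter(len(record['releases']) for record in package['records'])
for n, how_many in sorted(counts.items()):
    print('{0} release(s): {1} records'.format(n, how_many))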

We can see that most records have more than one release – which makes sense – we would hope to see at least one release from each of the tender, award, and contract history data sets.

Optionally, add the compiledRelease and versionedRelease to the record

At this point we have our releases and our record which links the releases together. This is the minimum amount of data required to be open contracting compliant.

However, two important fields remain for the record: the compiledRelease and the versionedRelease. The compiledRelease gives a single release containing the latest information drawn from all the releases, while the versionedRelease records the history of changes to each field.

This process is demonstrated in the ipython notebook.

The process is the same for the compiledRelease and versionedRelease.

The jsonmerge library has a merge function that takes a base document, a head document, and a schema specifying the merge strategies.

So, for each release specified in the releases list of a record, we run the merge function, updating the base each time, to arrive at a final compiled or versioned release. The only difference between the compiled and versioned release is the schema that is used to do the merge, as the versioned merge has a different set of merge strategies.

In our case, the record only contains the identifiers of each release, so we must also look up the original release to get hold of the original data. You can see how we did it in the ipython notebook, but we won't dwell on it, as there are better ways to do it if you were doing this over and over again – for example, putting the data into a database and looking it up there.
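
For illustration only, a simple in-memory version of that lookup – index every release by its releaseID up front (the record references this identifier as uid):

import json

# Build an index from releaseID to the full release, across all three files.
release_index = {}
for path in ('tenders_releases.json', 'awards_releases.json',
             'contracts_releases.json'):
    with open(path) as f:
        for release in json.load(f)['releases']:
            release_index[release['releaseID']] = release

def lookup_release(uid):
    return release_index[uid]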

But the code to use the jsonmerge library is probably more useful, so I'll show it here. Note that we use the open-contracting ocds branch of jsonmerge, which specifies our own mergeStrategy, ocdsVersion.

import json
import jsonmerge

# Get the release schema – used to make the compiledRelease.
with open('../../../standard/standard/schema/release-schema.json', 'rb') as f:
    release_schema = json.loads(f.read())

# For each record in the package.
for record in records:
    # Start with an empty base.
    base = None

    # For each release referenced by the record...
    for release in record['releases']:
        # ...look up the full release by its identifier.
        full_release = lookup_release(release['uid'])

        # Use jsonmerge to merge the base document with the full release, as
        # specified by the release schema. The result becomes the new base
        # for the next time through this loop.
        base = jsonmerge.merge(base, full_release, release_schema)

    record.update({
        "compiledRelease": base
    })

The method is exactly the same for the versionedRelease; we just use the versioned release schema instead of the normal release schema.
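
In code, the only change is the schema that is loaded and merged against (the versioned schema's file name here is illustrative):

# Same loop as above, but merging against the versioned release schema.
with open('../../../standard/standard/schema/versioned-release-schema.json', 'rb') as f:
    versioned_schema = json.loads(f.read())

for record in records:
    base = None
    for release in record['releases']:
        base = jsonmerge.merge(base, lookup_release(release['uid']),
                               versioned_schema)
    record.update({"versionedRelease": base})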

Important limitations of the method outlined above

In the method outlined above, the compiledRelease is made by looping through the releases in each record and, where there's a new value, updating the compiledRelease. This means that the order of the releases in the record is very important. In the compiling step we did, the compiler did not order the releases in any particular way, and in the merging step, we did not check the releaseDate. As it turns out, we don't have a releaseDate for each release anyway, as it wasn't in the original data. So, once again, this is a nice proof of concept, but it should not be taken as a correct representation of the Buyandsell.gc.ca data.
