Linking the what to the who: matching contracting and company data
Note: we may be updating as we work with the data.
Today, the ICIJ is releasing the full database of company names included in the Panama Papers. We’ve been thinking about approaches to look at this data alongside open contracting data. We’d like to better understand if and what connections exist between public contracts, governments’ number one corruption risk, and offshore companies.
Some interesting stories involving potential corruption in public contracts are emerging. Among those implicated in the Petrobras scandal are 50 officials who owned offshore companies in tax havens, according to the recent Panama Papers leaks. Two firms connected to a Senegalese minister signed contracts worth about $35 million for consulting and advisory services relating to the port in Dakar between 2008 and 2012. In Bosnia, questionable pharma deals involve millions being laundered.
The more data we have, the more we can interlink, relate, and connect. Unfortunately, there is little truly open data on public contracts, and what there is for example in Europe is often lacking lots of information.
This blog post explores some of the practical steps that might involve, focussing on the contracts dataset.
Our friends at OpenCorporates will be looking at the beneficial owners behind the company data. (We’ve been supporting the campaign for a global register of this kind of information, the use case is ever more clear)
We hope it will help you in joining us in discovering these connections. We will be updating as we start looking at the data that becomes available. Let us know what you find!
(1) Finding and preparing a contracts datasets
First, you will need to get hold of a dataset on public contracting. Our open contracting worldwide map links to a number of data sources, including an increasing number available in the Open Contracting Data Standard format. A good sense for how open the procurement data in your country is provided by Open Knowledge’s Open Data Index, which also includes links to where you can find the data. The Web Foundation’s Open Data Barometer is another great resource on open data in your country.
Once you’ve found some data, there is a choice ahead. Do you want to search for individual collections? Or run the whole dataset to look for matches between contracts and offshore companies?
(2a) The single search route: seek and explore
With a list of contracts, and the search tools provided by the ICIJ (or other third parties) you can look for contracts you are interested in, or that have been involved in for example corruption allegations before, and then search for the companies mentioned in the Panama Papers dataset.
Individual contracts are easier to come by in the official databases by governments. From there, even starting with a Google search, or checking the winning company’s name on OpenCorporates can give you some leads.
Fun fact: when checking against names of the owners of companies, make sure to look for children and romantic partners as well (whether officially married or not). This list provides a good idea for general red flags for corruption. The World Bank has done a report that looks at the most common corruption schemes in projects financed by it but can be applied more broadly. And the OECD Principles for Integrity in Public Procurement provides a compolete picture of managing risks in public procurement.
(2b) The bulk data route: preparing your data
Warning, it will get a bit more technical, so some experience with data is needed.
If you want to run a bulk analysis, then regardless of the data source you use, there will likely be a few steps involve in preparing the data for analysis.
Checking the data
Firstly, you need to check whether the data includes company information. In the best case, a publisher will have provided:
- The legal registered name of a company;
- The name it is known as in the contract; and
- An official identifier (e.g. company number or tax ID).
However, in practice, some or all of this data may be missing. You will need at least a company name to be able to carry out matching against other company lists.
It is also important to check your understanding of what the data represents. It this contract awards? Is this signed contracts? When is the data from and how reliable is it?
Formatting the data
You will probably find it most useful to turn your data into a simple CSV file with a row for each contract award or signed contract, including the company awarded the contract and then other information about the contract.
If you have data in Open Contracting Data Standard (OCDS) JSON format, then you can use the OCDS Validator tool to do this (as long as each individual file is under 10 Mb or so).
Enter the URL or upload the file at http://standard.open-contracting.org/validator/
Once the file is validated, select the ‘convert to spreadsheet’ option.
You can then download the Excel file. You will find a list of suppliers mentioned in Awards in the ‘suppliers’ tab. Other information is split across the other tabs. This is to allow one-to-many relationships to be captured.
(Note: This is correct as of May 2016. Updates the validator may change the spreadsheet output slightly in Q3/4 2016).
(For large files, or working offline, the python flatten-tool that drives this conversion can be used if you are familiar with using python scripts)
You can run analysis directly against the supplier list, and then look back to link this to contract information later, but you may want to prepare a more detailed spreadsheet with one row for all the key information you want in advance.
With the OCDS spreadsheet format, you can use lookup formulas to bring all the information you want into one sheet (there are other approaches to doing this, but here we focus on those that are possible just using spreadsheet software).
For example, in a spreadsheet that looks something like the example below, with a tab for releases and suppliers, a VLOOKUP formula can pull content from one sheet into the other.
Assuming the data you have has one row for each contracting process in the releases sheet (some datasets which have been publishing real-time data may have rows for each change in the contracting process, and in these cases, a more sophisticated approach to join data could be required), then you can lookup/join on the ‘id’ column.
For example, if ‘id’ is column B, then in the suppliers tab, add the formula below in row 2 in a spare column:
This checks for the value of cell ‘B2’ in suppliers (the release ID), in column B of the releases sheet, and then picks the value of the 55th column in the range (column BD) to display, to give us in this case, the buyer identifier alongside the supplier.
At the end of this process, you should be able to produce a spreadsheet something like this:
|Contract ID (ocid)||Tender date||Buyer||Contract name||Awarded company||Tender value||Award value||Contract value|
This should be the basis for your next steps.
(3) Matching against the company list
Now this is where it will get interesting: matching your contract dataset to the company dataset from the Panama Papers.
Essentially what you are looking for is cases where a company that has been involved in public contracting matches against a company mentioned in the Pamana papers. This forms the basis for deeper investigation to understand the nature of the relationships.
At the time of writing, we’re not yet sure quite how the ICIJ database will be structured, but there are a couple of ways of looking for matches:
- On company names — looking for cases where a company in the contracting list, matches a company in the Panama Papers list.
This is likely to generate some false positives (matches which are not really matches — perhaps due to two entirely different companies with similar name the same names, or with the same name, but in different countries), and will also in all likelihood generate some false negatives (missed matches, due to different spellings and formatting of names).
- On secondary information — such as addresses, or even common shareholders and directors. These can give clues to relationships that are otherwise hard to find from the name of companies alone. Verifying any of these relationships will require careful investigation.
Our power-tool of choice for this sort of activity is the open source OpenRefine application. Download and install a copy, and then you can load in CSV data, and get OpenRefine to run a series of transformations and matching operations on the data.
For example, OpenCorporates provide a OpenRefine Reconciliation API which is documented here lets you run through a list of companies, and check for matches in the OpenCorporates database. With a few extra steps you can then fetch data in from OpenCorporates against each company, such as directorship data, if this is available.
Reconciliation uses fuzzy matching. This is important to avoid too many false negatives. For example, with direct matching, if one dataset talks about ‘Acme Ltd’ and another talks about ‘Acme Limited’, a computer would miss that these might be the same thing. However, with a fuzzy matching approach, there is a greater chance of the match being spotted.
ReconcileCSV is a great tool from Open Knowledge labs which helps create fuzzy matching reconciliation tools against an arbitrary CSV file.
With the data available from https://offshoreleaks.icij.org/about/download we downloadad a copy of ReconcileCSV and used the command:
java -Xmx2g -jar reconcile-csv-0.1.2.jar data/nodesNW.csv searchField_ Unique_ID
to set-up a reconciliation service against the data that would help us initially identify matches.
(4) Now, track down the beneficial owner of the company
And finally, you have the contract, you have the company name mentioned in the Panama Papers. Now you will likely go on another treasure hunt, tracking down who owns this company (We’ve written about the need for a global register of beneficial ownership).
We didn’t say it was easy. But it is important. It’s money that is stolen from the citizen, money for medicine, schools and roads.
Over the next few days, groups across the world will be holding hackathons looking at this data: hopefully that will also contribute to the emergence of further helper tools and HowTos to guide engagement with the data. We’d love to hear your stories of how you have explored the data.