This is a guest post co-authored by Patrick Nsukami (Dakar Linux User Group), Pierre Chrzanowski (Open Knowledge Foundation France) and Tangui Morlier (Regards Citoyens).
As part of the “Tandem Dakar Paris” a Digital Week organized in May by the French Institute, Jokkolabs (Dakar) and La Cantine (Paris), people from Senegalese and French Open Data communities joint their forces during a workshop in Dakar to explore Senegal Public Data. During our exploration, we have been pleased to find that Public Procurement Data were regularly published on the Senegalese Authority for Public Procurement website. But the information were released in PDF files and therefore difficult to analyze. So we decided to extract the data into machine readable and open format. In a less than one day, 4800 lines of public procurements awarded between years 2012 and 2013 have been extracted into a single CSV file. It was then easy to upload the file onto Open Spending, making the data easily browsable. Here is how we did it…
Downloading PDF files
The Senegalese Authority for Public Procurement releases public procurement data as PDF files on a quarterly basis
The files contain information on each public procurement contract awarded during this period of time, including :
- amount of the procurement;
- date of the procurement has been awarded;
- public authority who delivered the procurement;
- private or public entity who received the procurement (the beneficiary);
- type of procurement: service, good, etc.
We choose to download files from the first quarter of 2012 to the first quarter of 2013 (the last one available).
One of the tricky part was then to extract data from the PDF files. To transform a PDF file to a CSV file, it was important to split the task into small ones, and follow the KISS principle (Keep It Simple and Stupid). That’s what we did, see it here :
Task #1 – Split the PDF files : First step was to split the PDF into one-page-only files that would be easier to handle by scripts. This means to go from one document, containing lots of pages, for example 100 pages, to 100 documents of 1 page each. We used pdftk a free and opensource command lines tool, to perform this operation.
Task #2 – PDF to XML : Second step was to turn the tables within the PDF files to an XML document. It is easier to parse an XML document (the structure) than a PDF one. For this task, we used pdftohtml
Task #3 – XML to CSV : then we wrote a script in perl (but you can do it in python, ruby, etc.) that turned the XML document to a CSV document. Here is the XML to CSV for 1 page script
The script creates the columns and the lines we defined. When we were satisfied with the result for the first page, we wrote another script to apply the first script to all the XML pages (a bash script). Here is the XML to CSV bash script
Task #4 – Retrieve public procurement data only : The result was a CSV file containing all the information from the tables, including the data on Public Procurements but not only. The final step was to retrieve only data on “Public Procurement”. This has been done by parsing the XML file and retrieving only lines with a specific content. The Retrieve script is here
Task #5 – Run all the scripts : finally, we wrote a bash script to perform all the tasks in a row :
- split pdf
- turn pdf pages to xml ones
- convert XML documents to CSV ones
- retrieve all the lines on “Public Procurement”
Eventually, we got a pretty good first version of the csv file with all data on public procurements inside. Next step was to ensure the quality of the dataset and prepare it to be uploaded onto Open Spending.
Cleaning and preparing data
When extracting data from a PDF file, most of the time your csv comes with conversion errors such as misplaced or divided cells. It is important to correct those errors before going any further. Your dataset is a combination of different inputs from different public authorities : you must also ensure that every reference in the dataset has a unique name (public authority, company, etc.). This will facilitate the analysis. Finally, for uploading the dataset onto Open Spending, you must also ensure the quality of the data such as avoiding empty cells, or converting amounts and dates in the appropriate formats. All this is part of the cleaning process.
To perform those operations we used only 3 kind of tools, linux shell tools, a spreadsheet application LibreOffice Calc (but you can also use Excel) and Open Refine (formerly known as Google Refine).
Linux shell tools such as awk, grep or wc that help to identify the error and check if they are correctly corrected. They help to aggregate data and be sure that corrections needed were done globally. A text editor can also be used for this step.
Open Refine is a powerful application to help you quickly clean your dataset. We mostly used Refine for its “cluster” feature, which helps you find groups of different cell values that might be alternative representations of the same thing, and then let you attribute a unique name to those cells. We used the “cluster” tool once we were sure that the cells contained the correct information.
And we used LibreOffice Calc for all other operations: aggregate divided cells; convert amounts in the right currency; fill in empty cells with a default value.
Here is a summary of the tasks performed during the cleaning process :
- Fill in empty cells with default values;
- Concatenate values that are divided into several cells;
- Attribute a unique value to same entities (public authorities, companies);
- Convert all money amount into local currency “Francs CFA”;
- Convert date format from DD/MM/YYYY to YYYY-MM-DD (standard for Open Spending).
Uploading data onto Open Spending
Once the csv has been cleaned, we published the dataset on NosDonnées.fr the French Open Data Hub, under the Odbl licence. The file was then available for all in an open and machine readable format.
Last part was to upload and configure the dataset to be available on Open Spending :
Step #1 – Import and validate the dataset: assign a name, an identifier, a country, language, a currency and a description to your dataset. Then specify the url of your dataset. The file must be available on the web, this is why we published it first on NosDonnées.fr.
Step #2 – Create a model for your dataset: once the file has been validated by Open Spending, you are invited to attribute dimensions related to your spreadsheet, i.e defining which field designate what. The mandatory fields are the field date, the field amount, the field “to” which is the entity who received the contract, and the field “from” which is the “entity” offering the contract. Fields “to” and “from” must be compounds dimensions, however the model will not be valid. Make also sure you have an unique ID for each entry. In our case we had to assign an unique code for each procurement row since it was not present in the original file. You can also specify which dimensions you want to be able to browse by in the Open Spending application. These are the facets of your model. We choose “public authorities”, “beneficiaries” and “type of procurements” and “procurements titles” as facets for our model. Once the model was ready, we saved it. Then, Open Spending started loading the entire dataset. In case of errors during the loading process, you can always come back to your dataset, do a bit of refine and reload it. After some trials and errors, the result was quite nice to watch :
Step #3 – Create a view: Dataset was now browsable. Last step (which is not mandatory) was to create specific views or visualisation. On Open Spending you can choose among “Table of aggregates”, “Bubble tree” and “Tree Map”. In our case we wanted to see a list of beneficiaries sorted by who received the most. So we choose “table of aggregates” with a view aggregated by beneficiaries. Once the view was created, we could embed it in our website like this:
So that’s all ! Senegal Public Procurements are now available on Open Spending and easily browsable. We hope this article will be helpful for your exploration into other country’s public procurements or PDF files.