Medicine or not medicine? Semantic search and AI classification techniques for medicine price comparison in Brazil
The problem
In short: We don’t know if an item is a medicine or not
Over 15 million Brazilians lack access to essential medicines, partly due to inefficient public procurement. Non-standardized processes, lack of transparency, and unstructured data obstruct effective decision-making and oversight, enabling overpricing and corruption.
The new Public Procurement Law (14.133/2021) mandated the creation of the National Public Procurement Portal (PNCP), which centralizes procurement by national and subnational entities. To promote more efficient procurement in the health sector, the law requires procuring entities to consult a “Health Price Bank,” a specialized database that provides reference prices for medicines and health products based on historical procurements, when making purchases for the universal health system.
But monitoring medicine prices using the PNCP is currently impossible. Around 12 million purchase items were published on the portal in 2024 alone, but there is no way to systematically identify which of those items are medicines.
Why AI
In short: Because we have too much non-standardized data but not enough time!
Currently, the description of public procurement items (especially medicines) in Brazil does not always follow a standard or catalog. This makes it difficult and time-consuming to compare prices across similar items, and manual price comparisons would be costly. It takes a human about 90 minutes to manually identify 100 purchased items by looking up descriptions in a catalog. Using AI, this time can be reduced to seconds.
The solution
In short: Use an LLM-based embedding model to classify an item as a medicine or not by comparing its description to an existing medicines catalog
We matched procurement item descriptions from the public procurement system to entries in the goods and services catalog, using the open contracting data published in the PNCP and the Sistema de Catálogo de Materiais e Serviços do Governo Federal (CATMAT/CATSER) – Brazil’s federal catalog. In the PNCP, descriptions of purchased goods and services appear as free text in the item fields associated with tenders and awards. We used an LLM-based embedding model to associate each medicine-related PNCP item with a CATMAT entry (identified by its catalog code, codeBr), based on the similarity between the descriptions, and then applied a similarity threshold to decide whether the item should be classified as a medicine.
The project, Medicamentos Transparentes, was developed with the support of OCP’s Lift impact accelerator program by a coalition of partners, including the civil society organization Transparência Brasil and the Controladoria-Geral da União (CGU). The coalition collaborated with the Secretariat of Management and Innovation (SEGES), part of the Ministry of Management and Innovation in Public Services, which is responsible for the federal e-procurement system and manages the PNCP.
The results
In short: In tests, the model identified most medicine purchases correctly
The model’s precision, recall, and overall accuracy can be calculated from its predictions against manually labeled data. In our tests, using a sample of 1,000 manually labeled medicines, the model achieved 98% accuracy in predicting whether a given item is a medication or not and 86% accuracy in predicting the medicine’s codeBr (its national catalog code).
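For readers who want to reproduce this kind of evaluation, the three metrics can be computed directly from confusion-matrix counts. This is a minimal sketch; the labels below are toy data, not the project's actual evaluation set.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for two parallel 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels: 1 = medicine, 0 = not a medicine
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
acc, prec, rec = binary_metrics(y_true, y_pred)
```

Precision here answers "of the items flagged as medicines, how many really are?", while recall answers "of the actual medicines, how many did we catch?".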
The detailed technical approach
This section describes the technical approach we used to develop the solution. As outlined in our previous blog post, A gentle introduction to applying AI in procurement, we will describe the task, select the method, narrow the task, understand the input, and select the model.
- 1. Describing the task
In our case, the task involves looking for item descriptions that are similar to entries in an existing medicines catalog: this is sentence similarity. We want to compare a PNCP item description to catalog descriptions, assign a similarity score, and then use that score to classify the item as a medicine or not.
- 2. Selecting the method
For sentence similarity, the most appropriate method was converting medicine descriptions from the PNCP into vector embeddings using a Large Language Model (LLM) and comparing them against all catalog items using cosine similarity. Because CATMAT uses a controlled vocabulary to describe catalog items – with each category of goods having a defined set of terms and attributes – we can use this structure to make the search more efficient. In the production pipeline, we first perform a lexical pre-filter to detect active ingredients from the CATMAT medicine group in the item description. We then compare each item’s embedding only against CATMAT entries sharing those same active ingredients, selecting the closest match and its catalog code (codeBr).
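The mechanics of the cosine-similarity comparison can be sketched with plain NumPy, using tiny toy vectors in place of the real model's embeddings (which typically have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(query_vec, catalog_matrix):
    """Cosine similarity between one query vector and each catalog row."""
    q = query_vec / np.linalg.norm(query_vec)
    m = catalog_matrix / np.linalg.norm(catalog_matrix, axis=1, keepdims=True)
    return m @ q

# Toy 3-dimensional "embeddings" standing in for real model output.
item_embedding = np.array([1.0, 0.0, 1.0])   # a PNCP item description
catalog_embeddings = np.array([
    [1.0, 0.1, 0.9],   # e.g. a matching CATMAT medicine entry
    [0.0, 1.0, 0.0],   # an unrelated catalog entry
])

scores = cosine_similarity(item_embedding, catalog_embeddings)
best = int(np.argmax(scores))   # index of the closest catalog entry
```

A score near 1.0 means the two descriptions point in almost the same direction in embedding space; unrelated descriptions score near 0.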
- 3. Narrowing the task
Under this formulation, the task is not to interpret every procurement item published in the PNCP, nor to search the full catalog for every item, but to perform a constrained retrieval task focused on medicine identification. Items whose closest catalog match has a similarity score of 0.5 or higher are classified as medicines.
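The threshold decision reduces to a small function. This is a sketch: the 0.5 cut-off is the one stated above, but the dictionary shape of the result is illustrative, not the project's actual output format.

```python
MEDICINE_THRESHOLD = 0.5  # similarity threshold used by the project

def classify_item(best_score, best_code):
    """Classify an item as a medicine when its closest catalog match
    reaches the similarity threshold; attach the matched codeBr."""
    if best_score >= MEDICINE_THRESHOLD:
        return {"is_medicine": True, "codeBr": best_code}
    return {"is_medicine": False, "codeBr": None}
```

For example, `classify_item(0.93, "267203")` flags the item as a medicine with that codeBr, while `classify_item(0.21, "267203")` rejects the match.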
- 4. Understanding the input
Our input was free-text item descriptions in Portuguese from the PNCP. These descriptions are often short, inconsistent, and noisy: they may omit accents, abbreviate dosage and presentation, or combine product name, concentrations, and packaging information in a single field.
For example, a catalog description such as “267203 – Dipirona Sódica, Dosagem: 500 MG – Comprimido” (codeBr 267203 at CATMAT) may appear in PNCP as “Dipirona S. 500 mg comp.” In this case, the lexical pre-filter first links the PNCP description to the CATMAT group for “dipirona sódica”, and the embedding model then compares it only with CATMAT entries in that group, identifying the catalog description as the closest match because both refer to the same active ingredient, strength, and dosage form.
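A minimal sketch of such a lexical pre-filter is shown below. The `CATMAT_GROUPS` lookup is invented for illustration; the point is that normalization strips accents and lowercases text, so the ingredient name "dipirona" is found even when the PNCP text abbreviates "Sódica" to "S." and omits accents.

```python
import unicodedata

def normalize(text):
    """Lowercase and strip accents so 'Sódica' matches 'sodica'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Hypothetical lookup: active-ingredient name -> CATMAT entries in its group.
CATMAT_GROUPS = {
    "dipirona": ["267203 - Dipirona Sódica, Dosagem: 500 MG - Comprimido"],
    "paracetamol": ["268001 - Paracetamol, Dosagem: 750 MG - Comprimido"],
}

def prefilter(item_description):
    """Return CATMAT entries whose active ingredient appears in the text."""
    text = normalize(item_description)
    matches = []
    for ingredient, entries in CATMAT_GROUPS.items():
        if ingredient in text:
            matches.extend(entries)
    return matches

candidates = prefilter("Dipirona S. 500 mg comp.")
```

Only the returned candidates are then scored with the embedding model, which keeps the semantic search fast and focused.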
In contrast, the reference catalog is highly standardized. CATMAT descriptions include structured attributes and identifiers, such as the Padrão Descritivo de Material (PDM), which in this context broadly corresponds to the main active ingredient, and codeBr. This makes CATMAT a suitable target for semantic matching, but it also requires the model to bridge the gap between noisy procurement text and curated catalog entries.
- 5. Selecting the model
We used Snowflake/snowflake-arctic-embed-l-v2.0, loaded through Sentence Transformers, because it is a multilingual embedding model optimized for semantic retrieval. This matters for our use case because the pipeline treats PNCP item descriptions as queries (as a search input) and CATMAT descriptions as documents (the reference entries to be retrieved). The model is used in a zero-shot setting, meaning that we apply the pretrained model directly, without fine-tuning or retraining it on our own procurement data.
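A minimal sketch of this query/document setup, assuming a recent version of the Sentence Transformers library (the Arctic Embed v2.0 model card defines a "query" prompt for retrieval-style inputs; running this downloads the model weights):

```python
from sentence_transformers import SentenceTransformer

# Load the multilingual embedding model (downloads weights on first use).
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

# PNCP item descriptions are encoded as queries, using the model's
# retrieval prompt.
item_vecs = model.encode(["Dipirona S. 500 mg comp."], prompt_name="query")

# CATMAT catalog entries are encoded as plain documents (no prompt).
catalog_vecs = model.encode(
    ["Dipirona Sódica, Dosagem: 500 MG - Comprimido"]
)

# Cosine similarity between queries and documents.
scores = model.similarity(item_vecs, catalog_vecs)
```

Because the model is used zero-shot, this is the whole setup: no training loop, labels, or fine-tuning step is involved.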
The implementation also stores the catalog embeddings in a cache file, avoiding the need to recalculate them on every run. The embedding model and cache location can be changed if needed. More details are available in the classifier README, and the source code is available in the Transparencia-Brasil/cesta-de-precos-pncp repository on GitHub.
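The caching idea can be sketched as follows; the file layout and the stand-in embedding function are illustrative, not the repository's actual implementation:

```python
import os
import tempfile
import numpy as np

def load_or_compute_embeddings(descriptions, embed_fn, cache_path):
    """Return catalog embeddings, reusing a cached copy when one exists."""
    if os.path.exists(cache_path):
        return np.load(cache_path)["embeddings"]
    embeddings = embed_fn(descriptions)          # the expensive model call
    np.savez(cache_path, embeddings=embeddings)  # persist for future runs
    return embeddings

# Demo with a stand-in embedding function; the real pipeline would pass
# the embedding model's encode method here.
calls = []
def fake_embed(texts):
    calls.append(len(texts))
    return np.random.rand(len(texts), 4)

cache = os.path.join(tempfile.mkdtemp(), "catalog_embeddings.npz")
first = load_or_compute_embeddings(["desc A", "desc B"], fake_embed, cache)
second = load_or_compute_embeddings(["desc A", "desc B"], fake_embed, cache)
# The second call reads the cache, so fake_embed ran only once.
```

Since the catalog changes far less often than the procurement data, caching its embeddings saves the bulk of the compute on each run.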
Ready to try it out?
This project demonstrates how open contracting data and AI techniques can be combined to create quality data for better procurement decisions. The approach is replicable: any procurement system with descriptions of awarded items and a goods and services catalog can benefit from a similar classification model. Whether the goal is to buy medicines better, compare prices, or improve procurement in another sector, if you are interested in applying these techniques in your context, reach out to us.
A big thanks also to Transparência Brasil team members Talita Lôbo and Luiz Fonseca, who worked on the project!