Hi, I have been looking at the problem of multiple activities being published with the same activity-identifier. There are about 1,062,408 activities in IATI data, but only 1,020,102 unique identifiers (see https://analytics.codeforiati.org/activities.html), which means that around 42,000 activities use an activity-identifier that is already in use.

I think this causes a problem because tools such as the datastore and d-portal just take one of the activities with each ID and discard the rest. This means that key tools are missing roughly 42,000 activities and not showing the full picture.

As an additional aspect, see this activity from d-portal: https://d-portal.org/q.xml?aid=1301-4.0000 which has the same activity-id as this activity from the datastore: https://iatidatastore.iatistandard.org/search/activity?q=(title_narrative:"1301-4.0000" OR description_narrative:"1301-4.0000" OR iati_identifier:"1301-4.0000" OR transaction_description_narrative:"1301-4.0000")&wt=xslt&tr=activity-xml.xsl&rows=1 - but when you look, they are different activities, with different dates, different budgets, etc. This means that the same search with different tools will return different answers, which seems a fundamental flaw.
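
To make the concern concrete, here is a minimal sketch of how indexing by identifier could drop activities and give different answers in different tools. It is only an illustration: the records, the `XX-1-...` identifiers, and the two helper functions are made up, and I am not claiming this is how the datastore or d-portal actually work internally.

```python
# Hypothetical records: two different activities published under the same identifier.
activities = [
    {"iati_identifier": "XX-1-0001", "title": "Water project", "start": "2015-01-01"},
    {"iati_identifier": "XX-1-0001", "title": "Road project", "start": "2019-06-01"},
    {"iati_identifier": "XX-1-0002", "title": "Health project", "start": "2020-03-01"},
]


def index_keep_first(records):
    """Assumed behaviour of one tool: the first activity seen for an ID wins."""
    index = {}
    for record in records:
        index.setdefault(record["iati_identifier"], record)
    return index


def index_keep_last(records):
    """Assumed behaviour of another tool: the last activity seen for an ID wins."""
    index = {}
    for record in records:
        index[record["iati_identifier"]] = record
    return index


tool_a = index_keep_first(activities)
tool_b = index_keep_last(activities)

# Three activities were published, but each index only holds two.
print(len(activities), len(tool_a), len(tool_b))

# The same identifier resolves to a different activity in each tool.
print(tool_a["XX-1-0001"]["title"])
print(tool_b["XX-1-0001"]["title"])
```

Either way one activity silently disappears, and which one survives depends on an arbitrary choice such as load order.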

How can this situation be resolved? Are files with duplicate IDs marked as invalid by the validator and rejected (it seems not, or they would not be in DSv2)? Can the tools be adjusted to also return the duplicate activities? Is there a process to reach out to publishers to let them know that their data is not actually available because it uses duplicate activity-IDs?

Here is my quick scope of the publishers with over 100 duplicate activity-ids in their currently active projects.

And for those wanting the detail, here is the link to get the dataset from datasette.codeforiati.org - beware, the 36 MB request might time out in your browser: https://bit.ly/3u4s0o7

You can also include the non-active projects by sorting by the duplicates column here: https://analytics.codeforiati.org/identifiers.html - the main offenders seem similar.


No comment from any of the publishers with thousands of duplicate activities (Switzerland, Japan, France, Italy, Canada...)? Are you not worried that your activities are not visible to users, or is there something wrong in the analysis (could well be!)?

See here (sort by the duplicates column) https://analytics.codeforiati.org/identifiers.html

Are files with duplicate activities rejected by the validator?

Amy Silcock

Hi matmaxgeds, thanks for this.

We reached out to all publishers with duplicated IDs last week using the IATI Dashboard's list: http://dashboard.iatistandard.org/identifiers.html 

We've got a 30% response rate so far and will keep following up with publishers.

If you're a publisher who's got duplicated activity IDs and we haven't contacted you, please do comment on this post or e-mail the tech team at support@iatistandard.org.




That's awesome (and quite a chunk of chasing work!) - thanks

Do you know if this is something that publishers could get notified of via the validator? Presumably it is a pretty serious rule break.

Amy Silcock

Thanks Matt. Currently the validator checks at a per-file level, so if there are duplicated IDs within a file these get flagged.

It doesn't check a file against all available IDs published to IATI. Often duplicates are spread between files. It's a possible feature, one we'd need to work out the logic and process for.
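
As a rough illustration of the difference between the two checks, here is a minimal Python sketch. It is not the validator's actual logic: the function names, file names, and sample XML are hypothetical, and it assumes each file is a plain `iati-activities` document with one `iati-identifier` per `iati-activity`.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def identifiers_in(xml_text):
    """Return the iati-identifier of every iati-activity in one file."""
    root = ET.fromstring(xml_text)
    return [
        activity.findtext("iati-identifier", default="").strip()
        for activity in root.iter("iati-activity")
    ]


def find_duplicates(files):
    """files maps a file name to its XML text.

    Returns (within, across):
      within: file name -> identifiers repeated inside that one file
      across: identifier -> sorted names of the files it appears in (2 or more)
    """
    within = {}
    seen_in = defaultdict(set)
    for name, xml_text in files.items():
        counts = Counter(identifiers_in(xml_text))
        repeated = sorted(i for i, n in counts.items() if n > 1)
        if repeated:
            within[name] = repeated
        for identifier in counts:
            seen_in[identifier].add(name)
    across = {i: sorted(names) for i, names in seen_in.items() if len(names) > 1}
    return within, across


# Hypothetical sample data: A-1 is repeated inside one file, A-2 is spread over two.
sample_files = {
    "file_a.xml": (
        "<iati-activities>"
        "<iati-activity><iati-identifier>A-1</iati-identifier></iati-activity>"
        "<iati-activity><iati-identifier>A-1</iati-identifier></iati-activity>"
        "<iati-activity><iati-identifier>A-2</iati-identifier></iati-activity>"
        "</iati-activities>"
    ),
    "file_b.xml": (
        "<iati-activities>"
        "<iati-activity><iati-identifier>A-2</iati-identifier></iati-activity>"
        "<iati-activity><iati-identifier>A-3</iati-identifier></iati-activity>"
        "</iati-activities>"
    ),
}

within, across = find_duplicates(sample_files)
print(within)
print(across)
```

A per-file check only finds what ends up in `within`; the `across` result needs an index of identifiers from every published file, which is where the extra logic and process would come in.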

