Discussion

IATI Datastore - what data should go in?

Steven Flower • 29 January 2019

As part of the build of the new IATI datastore, there’s an important point for our community to consider: what goes into the datastore?

A common response might be “all published IATI data, surely?”, but I wanted to offer an alternative, which I think others will support.

In short, I propose that the IATI datastore initially pulls in data that is:

  1. Valid against the IATI schema, and

  2. Published under an open licence, and

  3. Using version 2.0x of the IATI standard

To be clear: this does not encompass all current published data. So - why limit ourselves? Here are three reasons:

1 - Schema validation is something we are very used to

Those of us who grapple with data validation around IATI know it can often mean many things. The term “compliance with IATI” is often heard, but not universally agreed.

However, we have a very simple mechanism to help us: the IATI schema. The schema defines exactly how data should be ordered and structured: it’s the minimum level of validation one should pass.

The online IATI validator has always provided a means to test data against the schema. It’s true that there are a range of further “validation” tests one could make, including a host of rulesets and codelist checking - and even extending to the coherence of organisation identifiers. However, to get to a basic level, we should begin by working with data that passes this initial schema inspection.

My argument here is simple: if we start to support data that is not valid against the schema, why have a schema? Or even - what support are we giving to data users, if we supply invalid data?
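To make the “ordered and structured” point concrete, here is a minimal sketch of the kind of ordering rule the schema enforces. This is illustrative only: the element list is a made-up subset, and real validation means running the full XSD from the IATI-Schemas repository (e.g. with lxml).

```python
import xml.etree.ElementTree as ET

# Illustrative subset of the element sequence the activity schema requires;
# the real check is full XSD validation against the published IATI schemas.
CANONICAL_ORDER = ["iati-identifier", "reporting-org", "title", "description"]

def in_schema_order(activity: ET.Element) -> bool:
    """True if the activity's recognised children appear in canonical order."""
    positions = [CANONICAL_ORDER.index(child.tag)
                 for child in activity if child.tag in CANONICAL_ORDER]
    return positions == sorted(positions)

good = ET.fromstring(
    "<iati-activity><iati-identifier>XX-1</iati-identifier>"
    "<title>Example</title></iati-activity>")

# Well-formed XML, but the elements are in the wrong order:
bad = ET.fromstring(
    "<iati-activity><title>Example</title>"
    "<iati-identifier>XX-1</iati-identifier></iati-activity>")

print(in_schema_order(good))  # True
print(in_schema_order(bad))   # False
```

Ordering is only one of several things the XSD enforces (element names, cardinality and datatypes are others), but it illustrates why well-formed XML can still be schema-invalid.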

2 - Data licencing supports data use

You might be surprised to see mention of data licencing in this proposal, as it can often be something that is added at the last moment of publication, whilst someone sets up their IATI Registry account. However, appropriate data licencing is an absolute must if we are to support people to successfully use IATI data.

In fact, we should really consider the datastore as a data user! In this light, it needs to be clear that it can access data under a licence that is permissive and open. When data is published under a restrictive licence, it cannot be reused (as that is what the licence says!).

My challenge: Why would we support the datastore to use data that has no licence for reuse?

3 - Version 1 of the standard is deprecated

The TAG meeting in Kathmandu supported the decision of IATI members to deprecate version 1 of the standard, meaning that from June 2019 publishers using it will no longer be supported.

Whilst it’s technically possible to convert data from version 1 to version 2 of the IATI standard, this would take up limited resource on the datastore project that we could deploy elsewhere.

My rationale: to get the support of the new datastore, organisations need to supply data in a version that is actively supported by the initiative.

Support for our principles?

A common thread between these three conditions is mutual support. We all want to support our data standard and data users via the datastore project. To do this, we must ensure that we respect the core protocols we have around the schemas, licences and versions for our standard. Given that the datastore represents a renewed focus on data quality and use, I can’t imagine a scenario where we would actively go against these.

Of course, there are currently a range of publishing organisations that would be omitted from the datastore, owing to failed schema tests, restrictive licencing and/or use of unsupported versions. However, we should be careful not to start citing examples in order to find reasons for relaxing these criteria. I do believe this is a relatively low bar for entry - and that our community and tech team can provide positive support to those that need to remedy their data.

What next? I’m hoping those active in our community can support these principles, so that we can in turn endorse the data that makes its way into the datastore. Maybe respond with a quick nod of approval to get us moving on…

My guess (based on informal discussions - see below) is that the first two principles are very agreeable, whilst there’s a dilemma about use of version 1 data. That seems fine - and is a reason for my separating v1 into a new point.

After this, we can start to extend our discussions around data validity, compliance and quality in other, more advanced, ways. But, I do hope colleagues are able to step back and agree that this initial benchmark is for the betterment of the initiative.

Disclosure: prior to posting this I quickly discussed these ideas with [~522], [~571], [~471], [~379], [~336], [~469] & [~479] - more as a valued and accessible sounding board than a definitive answer (but thanks, nevertheless!)

Comments (52)

matmaxgeds

Hi [~433] - can you recommend a way to work out what this would mean in terms of what data the datastore would/wouldn’t return under the three rules you propose? Ideally this would be an IATI xml file that contains the data that would be rejected so I can scan it and make sure that I won’t miss what is being dropped. That seems the best way for me to be able to say something sensible/evidence based rather than just debating the principles.

Andy Lulham

Hi [~350],

The following table shows the datasets that would not be pulled in according to this proposal (based on registry data from 2019-01-30.) The reason for exclusion is also listed. Clicking the reasons will show the list of datasets excluded for that reason.

Reason for exclusion                | Dataset count | Activity count
Invalid XML                         | 272           | -
Version 1.0x                        | 1,157         | 126,352
Invalid against (v2.0x) IATI schema | 762           | 100,716
Closed license                      | 51            | 411
License not specified               | 1,092         | 21,819

Note:

  • The groups above are mutually exclusive by definition – datasets can’t be in multiple groups.
  • Schema validation was performed at activity level (as suggested by Steven Flower). I found roughly 14,000 valid activities in invalid datasets. The datastore could exclude invalid activities, rather than invalid datasets.
  • It’s unclear whether datasets with an unspecified license would be excluded or not, so you can maybe ignore those ones.
Herman van Loon

[~433] [~350] [~336] Concerning missing license info: since IATI is open data, publications without licence should be considered open by default.

Concerning closed licenses: should they even be allowed on the registry? The whole IATI effort is about sharing open data. Attribution licences should not pose a problem: since IATI supports the ‘reporting organisation’ element, all IATI data can be attributed to the publisher.

A last thought about licences: shouldn’t we consider the datastore as IATI infrastructure instead of an IATI data use application?


Andy Lulham

Flagging this from earlier in this thread:

andylolz:
  • Schema validation was performed at activity level (as suggested by Steven Flower). I found roughly 14,000 valid activities in invalid datasets. The datastore could exclude invalid activities, rather than invalid datasets.

I checked again today, and the number of schema-valid activities in schema-invalid datasets is now 74,752. It’s possible to validate at activity level and still provide access to raw XML, by excluding the invalid activities.
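As a rough sketch of the activity-level approach (the names and the toy validity rule here are illustrative, not the datastore’s actual code), excluding invalid activities while keeping the rest of a dataset could look like:

```python
import xml.etree.ElementTree as ET

def filter_valid_activities(xml_bytes: bytes, is_valid) -> bytes:
    """Drop invalid <iati-activity> elements, keep the rest of the file.

    `is_valid` stands in for a real per-activity schema check (e.g. lxml
    against the IATI activity schema); here it is injected so the sketch
    stays self-contained.
    """
    root = ET.fromstring(xml_bytes)
    for activity in list(root.findall("iati-activity")):
        if not is_valid(activity):
            root.remove(activity)
    return ET.tostring(root)

dataset = (b'<iati-activities version="2.03">'
           b'<iati-activity><iati-identifier>XX-1</iati-identifier></iati-activity>'
           b'<iati-activity><bad-element/></iati-activity>'
           b'</iati-activities>')

# Toy validity rule for the demo: an activity must carry an identifier.
kept = filter_valid_activities(
    dataset, lambda a: a.find("iati-identifier") is not None)
print(kept.count(b"<iati-activity>"))  # 1
```

The filtered document stays valid XML, so raw-XML access still works - the invalid activities are simply absent.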

Steven Flower

Thanks [~336] - really very useful

[~350] does this data answer your question? I think you also meant some kind of function of the datastore containing the excluded activities - but the stats ^^ are useful context for us.

matmaxgeds

[~336] huge thanks
[~433] - yes, I think it does. In Somalia we are building an aid management system that will allow users to use IATI data via the datastore - if these changes were implemented, we would lose access to e.g. SIDA data: http://preview.iatistandard.org/index.php?url=http%3A//iati.openaid.se/xml/SO.xml - which answers my question about how significant this is…significant.

So from my side, I think the list of changes/principles is excellent. For me the problems to solve are:

  1. IATI datastore data no longer = IATI registry data - this means that it will no longer be good enough to do research and share data giving the source as ‘IATI 2019-01-21’; you will now have to specify that it is from the datastore, and that because of XYZ, ABC are excluded - pretty confusing for readers, but key, because the amount dropped would make a significant difference, especially for some publishers.
  2. I presume that this change would need the same approval as a shift from 2.03 to 3.01 - what would the process for that be? And I presume this process should include a period where all offending publishers were contacted and helped to work through the (typically minor) tweaks needed to pass the tests? But who are we going to assign the time to do this - assuming we are talking several hundred publishers - and what would the cutoff be - 80, 90% of activities made compliant?
  3. Putting the two above together, why not just apply this at the registry level - remove links to all files that do not pass the test and benefit from these principles throughout the whole IATI ecosystem, not just one small part. That would also do it at the moment of publishing which is by far the easiest stage to have a conversation with the actual person responsible for publishing, and would give far more leverage, if it is just on the datastore, then they can confirm to their boss/funder that they are publishing to IATI and not worry further.
  4. I am worried about a situation where, for those funders (e.g. Netherlands) that require IATI publishing, this move removes their recipients’ required IATI data from the datastore - meaning those funders no longer use the datastore as their way of checking, and the datastore further loses the critical mass of IATI data use that it needs to exist.

In summary, I think this is a great idea (it would help the data use side hugely), so good in fact that it should be applied at the registry level, and the secretariat should dedicate resources to bring it about in a way that supports publishers, and those users/systems that currently use the datastore - not just have them as collateral damage of a good step forward.

Andy Lulham
matmaxgeds:

if these changes were implemented, we would lose access to e.g SIDA data

I meant to add a note about Sida – their data has been offline for a couple of days, hence why they’re in the “invalid XML” category. It appears to be back up now. It’s v2.01, openly licensed and valid, so it would be included.

Herman:

Concerning missing license info: since IATI is open data, publications without licence should be considered open by default.

Yes! I’m hopeful that’s the case. If so, then that’s great.

Yohanna Loucheur

These are very sensible principles to uphold for all the reasons outlined in Steven’s post.

No objection from Canada on #1 and #2 given how central this is to the whole IATI standard endeavour. Also agree with suggestions that not specifying a license should be considered open by default.

I understand the concerns about #3, but ultimately we need to move in this direction, for the same reasons that we have to deprecate 1.0x. Matt, you raise valid concerns about losing access to some data, but I don’t see this having as much of an impact as you anticipate. The few remaining active publishers using 1.0x are preparing to move to 2.0x. I can’t speak on their behalf, but it seems unlikely that the UK or Netherlands would accept a data file published in a deprecated version of the standard.

So the main issue for principle #3 would be files published in the past by now-inactive publishers - and there are a lot of them. I doubt it is used in aid management systems, as partner countries tend to focus on current and future data. Still, this older data can have tremendous value for some users e.g. evaluations, audits, historical trends, etc. If we were to concentrate on this specific use case, could we perhaps find solutions to maintain some form of access to 1.0x data?

Herman van Loon
YohannaLoucheur:

I understand the concerns about #3, but ultimately we need to move in this direction, for the same reasons that we have to deprecate 1.0x. Matt, you raise valid concerns about losing access to some data, but I don’t see this having as much of an impact as you anticipate. The few remaining active publishers using 1.0x are preparing to move to 2.0x. I can’t speak on their behalf, but it seems unlikely that the UK or Netherlands would accept a data file published in a deprecated version of the standard.

The Netherlands IATI reporting guidelines require that publishers use IATI version 2.02 or higher. We are technically still processing 1.x IATI files though. Since 1.x, from an information-content point of view, is largely a subset of 2.x, continuing to process 1.x was in our case a very small technical effort, because we chose to skip processing of 1.x elements which are deprecated in 2.x (e.g. some location elements).

My concern with not processing 1.x anymore is that the datastore can no longer be considered an authoritative source of IATI data, since relevant data would be missing. The decision whether or not to process 1.x could i.m.o. depend on two criteria:

  1. the number of active publishers who will not have migrated to 2.x by 30 June 2019 (an active publisher being defined as one who publishes at least once each year);
  2. the technical effort to additionally process 1.x data, excluding the deprecated 1.x data elements
Andy Lulham
bill_anderson:
  • I suspect most closed or missing licences are oversight, not deliberate. A job for Tech Team and community to address.

^^ Agreed / cool. Step #1 is this ticket, which would stem the tide of “license unspecified” data.

David_Megginson:

I agree about the reuse problem. That’s why I’d have the data excluded from common queries by default, and included only when the user explicitly opted in (e.g. “Include non-open data” option in the UI, or “&license=nonopen” in the API).

It sounds like there’s appetite for removing the option to publish closed IATI data going forward (FWIW I support this). Plus the number of activities published with a closed license is really small (see table above). If the option of a closed license were to be removed, I doubt it would be worth special-casing closed data in the datastore API.


Bill Anderson
  • The datastore will do a one-off load of non-active* Version 1 activities.

  • I suspect most closed or missing licences are oversight, not deliberate. A job for Tech Team and community to address.

  • Personally I would load all valid non-active Version 1 activities irrespective of licence.

  • I also personally agree with [~471] that any data discoverable via the registry is de facto open. Publishing to an open data standard and insisting on licence restrictions (other than attribution) I would imagine to be legally questionable.

(* I agree with [~471]’s Twitter definition of active, meaning publishing at least once a year. So all publishers who haven’t published (or refreshed) anything in the last year are non-active.)

David Megginson

As I mentioned on Twitter, we ignore Postel’s fundamental law of the Internet – “be conservative in what you send, [but] liberal in what you accept” – at our own peril. If there’s any reasonable way we can keep accepting v.1 IATI from active reporters, then it might not be a bad idea to do so.

As for messy licenses, just as Wikipedia isn’t running out of paper, the IATI Datastore won’t be running out of index cards. Let’s take in as much data as we can, from anyone who wants to provide it, then we can flag “bad” data to exclude from the headline reports, leading indicators, and visualisation dashboards (so that there’s still a consequence to not being open).

D

Steven Flower

Thanks everyone for the detailed, considered and useful answers. It’s like a Technical Advisory Group!

Allow me for a moment to sit on my TAG chair cushion and undertake my duties. In amongst all these exciting conversations and (potential) tangents, I think this is where we are:

  1. On schema validation - I see no objection.
  2. On open licences - we seem to also agree on the principle, but see a contradiction in how an open data standard can accommodate closed licences.
  3. On 1.0x, we seem less ready to “reject” that data - but think the deprecation of v1 should mean active publishers will make a plan to migrate to v2

There are a few tasks coming from this, it seems:

  • clarifying our guidance on closed licences
  • understanding why/how the Registry would allow them
  • thinking through how the Registry might apply some / all of these principles
  • considering how we make available / archive “non-active” version 1 publishers
  • understanding our position on limiting data, in an unlimited data world

But - as we break for the weekend (think of it as a coffee break in this energetic meeting we’re having, but with the chance to get some actual fresh air) I’m hoping this is an adequate summary of where we are at.

SJohns

Good summary - going back to Andy’s original summary of the activities affected, would any exclusions be based on excluding activities rather than whole data files? Just thinking of the 600+ CSO publishers, some of whom have old activities going back to 2011 that won’t meet these criteria, but are part of the same datafile as newer activities that will be 2.0x and meet the criteria. They are not going to have the resources to go back and update older activities. And as many donors now link the payment of funds to the publication of data - it could be a real risk to them to have their datafile pulled completely. What would be the best advice you can give a CSO in advance of these changes?

Steven Flower

Hi [~373] - thanks, it’s a very valid question

In terms of a specific file having a mix of 1.0x and 2.0x activities within it, I don’t think this is actually possible. The version attribute is only applicable at the <iati-activities> element, not <iati-activity>, so it can only be declared once per file. It used to be different (in version 1.0x) - but this was changed in the move to 2.01 (see changelog). [~522] [~453] do you agree?
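A quick illustration of that point, using a tiny hand-made file (not real published data):

```python
import xml.etree.ElementTree as ET

# From 2.01 onwards, @version is declared once, on the <iati-activities>
# root element, so a file carries exactly one version for all activities.
doc = ET.fromstring(
    '<iati-activities version="2.03">'
    '<iati-activity/><iati-activity/>'
    '</iati-activities>')

print(doc.get("version"))                       # 2.03
print(any("version" in a.attrib for a in doc))  # False: no per-activity version
```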

However, the point still remains that it could be possible to publish a file with a mix of valid and invalid activities (in the same version). I think [~336] did some stats on this too…

Andy Lulham
SJohns:

They are not going to have the resources to go back and update older activities. And as many donors now link the payment of funds to the publication of data - it could be a real risk to them to have their datafile pulled completely. What would be the best advice you can give a CSO in advance of these changes?

[~373]: pragmatically, I’d suggest any publisher that can’t go back and update old v1.0x data should ensure all new activities are created in a brand new v2.03 activity file. This means all future data will be “datastore compliant”. And perhaps at some point, the old v1.0x data could be one-off converted.

stevieflow:

it could be possible to publish a file with a mix of valid and invalid activities (in the same version). I think [~336] did some stats on this too…

That’s true – in the stats above, schema validation was performed at activity level (i.e. rather than validate each dataset, I validated each activity.) So in practice this means the “activity count” is a count of invalid activities, rather than a count of all activities inside invalid datasets.

SJohns

[~433] [~336] thanks for clarifying. I was really thinking of this scenario - older activities that are poorer quality within the same datafile as activities of good quality, but I didn’t express it well!! So just thinking about Andy’s suggestion - how would it work for AidStream users?

Aidstream users (who are using the full version) should click on the button to upgrade to version 2.03 and then continue to add in their data for the current activities. If they have older, closed activities on AidStream that are poor quality (data missing/incomplete), then they can convert them to draft activities in AidStream by editing them. This means that when they publish the datafile, only the current activities will show up in a datafile that is tagged as iati-activities version=“2.03”.

This 2.03 datafile should (if no other issues) get pulled through to the new database without the older activities, which will no longer be publicly available. This should not therefore impact their funding (because the current activities are published) but will shorten their track record.

Then if an organisation has extra resources, they can go back and fix the older files if they want to show a longer track record.

For organisations with a smaller number of activities, this will be feasible to do. For organisations that use AidStream to publish many activities, for multiple donors, it’s going to be a headache, so the more time and warning you can give, the better.

Unfortunately, as soon as funders link an open, public good like IATI to withdrawing funding which an organisation receives to run their programmes (which vulnerable people depend on), it gets a lot more complicated than just excluding data and telling organisations to update it as and when.


matmaxgeds

Slightly off-topic, but does IATI give a ‘lifespan’ estimate when new versions of the standard are created? I.e., with an operating system, updates are guaranteed for X number of years. It seems like it might be helpful/standard practice to say with a new version that it will not be deprecated (dropped from the registry/core tools) for X years, or until X date?

Steven Flower

Thanks again [~373]

I think we are into some of the implementation details, based on the agreement of the principles above.

[~336] would it be possible to share your Twitter feedback in a new thread, where we can discuss this in a dedicated space? [~373] by no means am I saying we should ignore this - but I want to keep this thread to our shared three principles. Just in the same way we have a new discussion on follow-ups for licencing, we should detail the support needed for AidStream publishers in a concentrated channel.

Steven Flower

Hi everyone

I’m just flagging that our technical advice to the [~453] & partners via @siemvaessen looks to be a clear line on the datastore initially ingesting data that is:

  • Valid to the relevant schema
  • Openly licenced
  • Version 2.0x (but actively checking valid/open 1.0x data alongside this)

As we can see, there are follow ups and actions elsewhere, but I wanted to thank everyone for their input here, and pass onto [~571] in terms of implementation of the datastore. Thanks!

Andy Lulham
stevieflow:

[…] the datastore initially ingesting data that is:

  • […]
  • Version 2.0x (but actively checking valid/open 1.0x data alongside this)

Could you expand on this, [~433]? It’s unclear what this would mean for v1.0x publishers.

Thanks

Bill Anderson

See ^^. It has already been agreed that …

bill_anderson:

The datastore will do a one-off load of non-active* Version 1 activities.

The DS spec does not require deprecated versions of the standard to be processed.

It was decided, pragmatically, that although DS will come online before V1.0 is deprecated, we are talking about a couple of months and it was not worth the effort to complicate the load.

Herman van Loon

Then hopefully there will be no active 1.x data publishers anymore after the deprecation date in June this year.

Herman van Loon
stevieflow:

Version 2.0x (but actively checking valid/open 1.0x data alongside this)

[~433] I agree with [~336]: it needs to be very clear what is going to happen with valid 1.0x data. I would expect the DS to process this data.

Agree?

Steven Flower

Thanks all

It’s useful for us to reaffirm our role as a community here. We’re giving technical advice (the TA of TAG!) to the datastore project. We’re not in a position to project manage the datastore, for example. For this reason, it’ll be great to hear a progress update from the [~453].

In terms of the discussion we’ve had so far on 1.0x, then apologies if I left that vague. My understanding is that we’d leave the door open for valid 1.0x data, but that other factors instigated by the [~453] may mean it becomes less of an issue:

  • Existing active 1.0x publishers shift to 2.0x before June
  • There’s a process in place to support the Aidstream users, who may have a mix of versions
Mark Brough

Jumping on this thread a little late – I think it would be great to ensure that the needs of the ultimate users of the data are factored in here. There are currently some big donors publishing v1.x data to IATI (see below). It would be really unfortunate if the data from these organisations, which is currently available through the Datastore, became no longer available.

I don’t really understand the suggestion of loading all v1.x data into the Datastore once, and then never again – I would assume the development time required would be more or less the same, and presenting out of date data to users from a Datastore that is supposed to update nightly would arguably be misleading. Perhaps a better approach would be to gracefully degrade for older versions – i.e., trying to load v1.x data on a “best efforts” basis, but not focusing development or maintenance time on this.

Here are few suggestions about how to avoid losing access to all these organisations’ data:

  1. IATI tech team works intensively with priority organisations to support/encourage them to begin publishing v2.x data. I would argue that prioritisation should be based primarily on size of organisation.
  2. If there are still a large number of organisations (especially large ones) publishing v1.x data, then have a policy of gracefully degrading for no longer supported versions.
  3. The Datastore importer could perhaps use something like [~336]’s v1.x to v2.03 transformer where possible to simplify the import process.

IATI Dashboard - Versions

v1.03

  • AFD (France)
  • AsDB
  • Finland
  • France (Ministry of Foreign Affairs)
  • Switzerland (SDC)
  • UNOPS

v1.04

  • European Commission (FPI)

v1.05

  • Germany (Environment Ministry)
  • Climate Investment Funds
  • New Zealand
  • The Global Fund to Fight AIDS, Tuberculosis and Malaria
Bill Anderson
markbrough:

I don’t really understand the suggestion of loading all v1.x data into the Datastore once, and then never again

The TAG consensus to deprecate v1 in June was based on the realistic expectation (based on the ongoing work of the Tech Team) that all big publishers will upgrade. Your Suggestion 1 has been going on for some time.

Mark Brough

[~522] that’s great to hear, then perhaps we can just revisit this question around June, once we know how much progress these publishers have made.

IATI Technical Team

Hi all,

Please note that a new topic has been created which outlines the technical team’s plans for version 1 files in the context of the new Datastore: Version 1 Files in DataStore

Bill Anderson

I would like to reopen this discussion.

I am pulling together data on resource flows going to Kenya in response to COVID-19. I know that the World Bank has given Kenya a $50m loan, yet because the file containing that activity fails schema validation, it doesn’t appear in the datastore.

This can’t be right. In the interests of providing users with the maximum amount of usable data, we surely need to change the validation and datastore ingestion guidelines to operate at the activity level, not the file level.

Mark Brough

I agree with [~522] on this – I also noticed recently that some files were not validating and therefore not entering the datastore just because some elements were ordered incorrectly.

I would go a couple of steps further than Bill and suggest:

  1. relaxing the strong requirement for every file to pass schema validation, in favour of a weaker “best efforts” attempt to import every activity (even if that activity fails validation), with alerts when particular activities could not be imported. For example, having elements in the wrong order shouldn’t present a major issue to importing data.
  2. making more visible (and actively contacting publishers) when datasets/activities fail validation, or cannot be downloaded (e.g. files were accidentally moved around on the server and become inaccessible through the registered URLs). Perhaps some combination of IATI Canary and the IATI Dashboard could be used for this.
matmaxgeds

Seems to me like this is another unsolved discussion of (huge simplification here to make a point):

  1. IATI is primarily a data interface standard…therefore defend the standard working as it should…do not allow validator rejected files in the DSv2
    vs
  2. IATI is a transparency initiative…so show more data, even if it makes it harder to operate the standard…

Of course the two are linked, and do not exclude each-other, but I think it would help us to take a lot of decisions if there was a clearer answer to the first question…theory of change…what have we learnt from 10 years of the original Theory of Change etc

Rolf Kleef

Funny, we actually fixed the sequence of elements within an activity as part of the upgrade to 2.01…

A standard is for both producers and consumers, to make the exchange of information easier. We try to make it easier for producers by offering a schema (and hopefully a ruleset) that you can use to check your data before publishing. All with the intent to make it easier for more data consumers to use what is published.

The idea that the datastore just tries its best to process “anything” as a solution is shifting the problem from the producer to the consumer. And it basically says: don’t try to develop your own IATI-consuming application, and feel free to publish just about anything.

We need to fix this by making data quality part of the publisher’s process. And so it needs to be adequately resourced and prioritised. Bombarding a technical department with emails won’t change a thing until management sees that this is a problem. It helps if they see that their data is missing.

This is what’s happening with grantees of the Dutch government: programme staff get called by grant managers that they are missing from the dashboard, and need to fix their data.

If an organisation like the World Bank is able to regularly update ~500MB across nearly 150 files, they should be able to do a simple schema validation step as part of their QA when publishing.

If it’s a matter of ordering the elements in the right way, I’d be happy to work with them on a simple XSLT script to do just that.
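
For what it's worth, the same reordering fix could equally be sketched in Python rather than XSLT. A hypothetical illustration, where ELEMENT_ORDER is a simplified subset of the real 2.0x sequence:

```python
# Sort each activity's children into the schema's declared sequence.
# ELEMENT_ORDER is a simplified, illustrative subset of the real order.
import xml.etree.ElementTree as ET

ELEMENT_ORDER = ["iati-identifier", "reporting-org", "title",
                 "description", "activity-status"]  # simplified

def reorder(activity):
    """Stable-sort children into schema order; unknown tags go last."""
    children = list(activity)
    for child in children:
        activity.remove(child)
    children.sort(key=lambda c: (ELEMENT_ORDER.index(c.tag)
                                 if c.tag in ELEMENT_ORDER
                                 else len(ELEMENT_ORDER)))
    activity.extend(children)

activity = ET.fromstring(
    "<iati-activity>"
    "<title>Backwards</title>"
    "<iati-identifier>XM-EX-3</iati-identifier>"
    "</iati-activity>")
reorder(activity)
print([c.tag for c in activity])  # ['iati-identifier', 'title']
```

Because the sort is stable, repeated elements (multiple sectors, transactions, etc.) keep their relative order.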

But I assume their technical staff is already well aware of this.

My guess is: it’s not a priority, you can be #2 in the Aid Transparency Index even though you publish schema-invalid files. And the IATI Secretariat is happy to push data consumers to accept your data, you don’t even have to do that yourself.

To echo Matt:

  • Is IATI still a data standard to make it easier to exchange information between all kinds of parties?
  • Or is it a database offered by DI to please some users, and we don’t care that the EU, USAID, governments, multilateral NGO networks, project management software platforms, etc, also need or want to exchange information between systems?

Making sure you have schema-valid XML has been solved over 20 years ago. We need to push publishers to make that part of their production system. So that we can move on to including business rules compliance as well. And discuss actual business rules as part of the standard, instead of still being stuck on this basic level.

Bill Anderson
rolfkleef:

A standard is for both producers and consumers

Agree. But you go on to focus solely on the supply side.

As a consumer I’m not interested in what producers should or shouldn’t be capable of doing. I just want the data. I’m not bothered whether my banana is malformed so long as it is edible.

I’m quite happy to hold my hand up and admit that for the best part of ten years I was part of a machinery (and community) that paid insufficient attention to users outside of our immediate supply chain. Now that I’m on the other side of the fence things look different …

rolfkleef:

it’s not a priority, you can be #2 in the Aid Transparency Index even though you publish schema-invalid files. And the IATI Secretariat is happy to push data consumers to accept your data, you don’t even have to do that yourself.

This kind of (much-used) argument is fundamentally flawed. Improving data quality and maximising the use of what currently exists are two very separate ideas that actually reinforce each other.

Bill Anderson
markbrough:

I think part of the issue is that 2.01 made it much easier to fail schema validation by requiring elements to be in a particular order (it did this in order to make certain fields “mandatory”, which I think was the wrong way of enforcing compliance). I think that was a mistake.

I agree. I’m not an xml expert, but isn’t there another way of checking mandatory fields without ordinality?
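
(There is one, for illustration: XSD’s `xs:all` compositor requires child elements to be present without constraining their order. A toy fragment, not the real IATI schema:)

```xml
<!-- Hypothetical fragment: both children are mandatory, in any order -->
<xs:complexType name="iati-activity-type">
  <xs:all>
    <xs:element name="iati-identifier" type="xs:string"/>
    <xs:element name="title" type="xs:string"/>
  </xs:all>
</xs:complexType>
```

The catch is that in XSD 1.0 each element inside `xs:all` is limited to maxOccurs="1", which sits badly with IATI’s many repeatable elements; XSD 1.1 lifts that restriction, so the trade-off at the time of 2.01 was real.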

Herman van Loon

Agree with you Mark: data order should be irrelevant. Not so sure about other examples such as missing transaction type codes, currency codes, etc. Those activities should i.m.o. not be in the data-store (these are the really inedible rotten bananas).

Accepting files with schema errors and doing ‘activity level’ validation only, would make file level schema validation unnecessary.

But then the question is how you are going to do ‘activity level’ validation. When this is bound to only validating and cleaning IATI data in the data-store ingestion process, it would mean that every existing IATI XML consuming application would be forced to make use of the data-store if it wanted to use validated IATI data.

The data-store will become the only source of validated IATI data since validating the raw IATI data of the publisher against the XSD will lose its meaning. This is i.m.o. only acceptable if the data-store would also provide fully IATI standard compliant XML output (which validates against the XSD), with just the erroneous activities being removed.

Mark Brough

Thanks for this discussion!

rolfkleef:

The idea that the datastore just tries its best to process “anything” as a solution is shifting the problem from the producer to the consumer

I think part of the issue is that 2.01 made it much easier to fail schema validation by requiring elements to be in a particular order (it did this in order to make certain fields “mandatory”, which I think was the wrong way of enforcing compliance). I think that was a mistake. That didn’t matter that much before now, because everyone could continue to use data even though it failed validation, but obviously it would begin to make much more of a difference if we stick to this approach.

I don’t think making it impossible to access schema-invalid data through the IATI Datastore shifts any problem from a consumer to a producer. At the moment, it just makes it much more difficult for the consumer to access the data (even if it’s just a question of one element in one activity in one file being in the wrong order). If publishers quickly resolved data validation issues, that would be fine. However, the evidence suggests that around 10% of publishers have invalid files, and the number has remained fairly stable for the last three years – see these charts.

As various people have mentioned, one way of squaring this circle might be for publishers to be automatically notified (or politely bombarded) when their data fails validation.

If you’re a publisher reading this thread – you can sign up for alerts from IATI Canary!

matmaxgeds

From a user perspective, if there is activity-level validation and some activities are dropped for failing it, it would be great to have some unmissable notification that the activities shown are not all of what the publisher intended to publish - because quite often users see the existence of data for a publisher as implying a comprehensive dataset from that publisher.

Bill Anderson

[~471] I disagree with much of your approach. As an increasingly heavy user** (and champion) of IATI data I want access to as much usable data as possible. That’s what I expect from the datastore. Being told that I can’t have access because the datastore is on a mission to produce ‘pure’ data won’t wash.

In my particular use case my biggest problem is ensuring that both geographic and sector percentage splits add up to 100 so that I can do reliable transaction arithmetic. There’s a whole load of other QA issues that I’m not the slightest bit interested in. I would rather deal with my particular problem myself if I know that the Datastore is doing as little as possible to get in the way.

This has got nothing to do with letting the publisher off the hook. That’s got nothing to do with me (or the datastore). If we can get useful information based on usable data in front of a lot of people (not just those responsible for supply chain accountability) the incentives for publishers to improve their data will far outweigh our moral sticks.

(** For the record I have nothing to do with the supply side or governance any more)

IATI Technical Team

Thanks to everyone for engaging with this conversation.

The Tech Team is currently focussing on ensuring that the original TOR are delivered as we prepare for launch. Any consideration of requirements outside of the current TOR will be led by the Tech Team after launch; we will engage with the community to make sure this complex issue in particular is fully explored. In the meantime, the Tech Team is contacting publishers with schema invalid files this week to urge them to address their issues.

Thanks again for all your good input above; we look forward to discussing this further after we launch.

Rory Scott

I understand [~453] in the above, so this is in no way to try to contradict that timeline. I’m putting it here for posterity when this conversation re-opens after the initial launch.

I’m in agreement with [~522], [~491] and others on this. I am routinely asked why IATI doesn’t line up with official statistics, or why expected IATI data is missing, or why we can’t trust IATI data. In my opinion it isn’t satisfactory to say that a publisher fluffed one number in a dataset of thousands of activities and therefore all of that data is inaccessible, and that this is by design.

The removal of arbitrary, valid data will undermine trust in IATI, frustrate both users and publishers, and exacerbate existing narratives about the viability of the entire corpus of data we work to produce.

Regarding [~471]’s concern about the onus moving away from the publisher: I understand this principle, but at the moment we need pragmatism. I would be happy with a number of measures to re-establish that onus that don’t involve removing access to valuable data.

For example:

  1. We could institutionalise IATI Canary, making use of it a basic participation prerequisite for engagement.
  2. We could take this further by publishing the response times to data validation issues, and possibly push for this to be a relevant metric in future transparency evaluations such as the Aid Transparency Index or the IATI Dashboard
  3. We could include a flag within the XML to denote validity, and put garish, unsightly banners on relevant D-Portal pages or other presentation sites to make it clear that there are validation issues.
  4. We could celebrate the rapid engagement with and resolution of data validation issues in newsletters and official communications (if the publisher consents).
  5. We could have a public ‘caution’ list of publishers with invalid data.

I’m not seriously suggesting all of these, and some of them might seem extreme, but for me they are all sensible* compared to removing an unknown quantity of valid data from the one official data store.

*To add some numbers to this sentiment (see workings here):

  • There are currently ~982k activities.
  • If we take the publisher stats and add an activity to file ratio value, we can see that the top 25 publishers by number of activities published account for ~814k activities, about 82.89% of the total.
  • These activities are split amongst 2,234 files (meaning a total activity-to-file ratio of 364 among them).

The median activity-to-file ratio among them is 530, and the arithmetic mean is 1,657. This is because of our top five activity-to-file publishers:

  • GlobalGiving.org
  • UN Pooled Funds
  • Norad - Norwegian Agency for Development Cooperation
  • Food and Agriculture Organization of the United Nations (FAO)
  • The Global Alliance for Improved Nutrition

Together these five account for 38,000 activities spread between 5 files.

Going back to our top 25 publishers by activity count, it’s fairly clear that one validation error in any of these publishers will mean a serious loss of valid IATI data.

If GlobalGiving have one missing sector field or other schema error, we could lose nearly 2% of all IATI data pertaining to nearly 10,000 organisations.

EDIT: changing the sector example as per [~421]’s correction.

Josh Stanley
rory_scott:

If GlobalGiving have one sector percent that doesn’t add up to 100, we could lose nearly 2% of all IATI data pertaining to nearly 10,000 organisations.

Just to be clear, data quality issues such as sector percentages not adding up to 100 will not prevent the dataset from being ingested by the Datastore, as this is a Standard rule (a must), rather than something that is dictated by the Schema.

David Megginson

[~471] - I don’t think we’re far off: we both agree that the datastore shouldn’t try to ingest an iati-activity if it’s malformed. We also both agree that there should eventually be a feedback mechanism to let data providers know when the datastore does not ingest one of their activities because it’s malformed (and explain why), though I also acknowledge that this last part might be a new work item that needs to be triaged and scheduled.

The difference is over whether one malformed activity in an iati-activities package should cause the datastore to reject (e.g.) 999 other, well-formed activities in the same package, assuming that the error doesn’t affect parsing outside of the activity element.

Since the grouping of activities is non-semantic (the datastore discards the grouping and stores activities individually in any case), I think that would be an overreaction. OTOH, if there is an error at the top level (e.g. in the attributes on iati-activities), that might justify a wholesale rejection, because we can’t be sure that we’re applying those attributes to the individual activities correctly.

Herman van Loon

[~491]: agree that when we have activity-level validation we can skip the bad activities. But only if there is an active feedback mechanism. Not ‘eventually’ as you are suggesting, but from the moment activity-level validation is implemented in the data store ingestion process.

Agree with [~350] that we should define which kinds of errors lead to:

  • processing the individual activity even if there is an error in the content of an activity (e.g. a non-existing sector code)
  • skipping the whole activity
  • rejecting all of a specific publisher’s files (if more than x% of the activities have errors, malformed XML, or XSD errors).
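
A sketch of how those three tiers might be wired together, with the error classification assumed as given and MAX_ERROR_RATE a made-up 10% cutoff (the acceptable threshold being exactly the open question):

```python
# Three-tier triage sketch: content errors are flagged but imported,
# schema errors drop only the offending activity, and a file whose
# schema-error rate exceeds a cutoff is rejected outright.
MAX_ERROR_RATE = 0.10  # hypothetical cutoff; the real threshold is undecided

def triage(activities):
    """activities: list of (id, status), where status is one of
    'ok', 'content-error', or 'schema-error'."""
    schema_errors = [a for a, s in activities if s == "schema-error"]
    if len(schema_errors) / len(activities) > MAX_ERROR_RATE:
        return "reject-file", []
    imported = [(a, s) for a, s in activities if s != "schema-error"]
    return "partial-import", imported

# 13 activities, 1 schema error (~7.7%): below the cutoff, so the file
# is imported minus the bad activity; the content error is kept but flagged.
action, kept = triage([("A1", "ok"), ("A2", "content-error"),
                       ("A3", "schema-error")] +
                      [("A%d" % i, "ok") for i in range(4, 14)])
print(action, len(kept))  # partial-import 12
```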

The question remains, of course, what is acceptable as a maximum number of errors (0.1%, 1%, 10%?).

About the reservation [~350] has about e-mail feedback to the publisher: e-mail feedback is by no means a panacea. It is i.m.o. better to engage with a publisher directly, but this rarely happens (see also the recent Catalpa report). For lack of a better feedback mechanism, I would rather have a working e-mail feedback process than nothing at all.

Herman van Loon

[~479] and [~522] I understand the need to have as much data available as possible. As mentioned before I am not against having an activity level schema validation for ingesting XML data in the data-store as such, provided there is an active feedback mechanism to the publisher (active meaning that no action is required from the publisher to get informed about the data quality issues).

[~479] proposes a number of interesting feedback mechanisms, to which I would like to add one: sending an e-mail to the e-mail address provided at activity level (iati-activities/iati-activity/contact-info/email), or, if there is no such e-mail address, sending it to the contact e-mail address stored in the registry.

I object to any solution which would silently skip activities being processed without any notification to the user or the publisher of the data. Users will be kept in the dark about the completeness of the data and publishers will be kept in the dark about the quality problems in their data.

One last thought: if a large publisher has just a few tiny errors in the many activities published, why not simply contact that publisher and ask to correct the problem. I.m.o. it is this lack of active engagement of data users and publishers that causes a great deal of these problems.

matmaxgeds

Given we are considering this, please can we properly document it in the rules/IATI approach etc so that it is not just an ad-hoc decision:

  1. Spell out the logic, so that it can apply in other situations e.g. we are establishing the concept that in core IATI tools, where there is a conflict between availability of data, and XYZ, then we prioritise availability of data - I suspect if we apply this lens more widely, we might find a lot more ‘tweaks’ to make to IATI tools e.g. because IATI does not replicate several key codelists, thousands of activities do not have machine readable sector or result narratives. Therefore if we follow the logic we are doing for activities in the datastore, we are establishing the primacy of user access to data and so IATI should replicate many more codelists that it currently doesn’t?
  2. What do we think about transactions - should we really ditch a whole activity if just one transaction fails to validate?
  3. It seems like it is also not sensible to ditch a whole activity if e.g. one result element fails validation
  4. There are different types of failing validation - please can we be clear that ‘fail’ in this case applies just to the ‘must’ elements of the rules, or have I misunderstood this?

For what it is worth, I think that we shouldn’t do this - if a major publisher cannot promptly fix a problem with their data that causes thousands of activities to disappear, I think we should not be trying to fix this via the datastore. I do not think that ‘they will get an email’ is a serious proposition, at least until we have proof that it works - I have previously tested the response rate to IATI emails, with mixed results. Also, surely major publishers should themselves be checking e.g. that their activities are available in IATI tools - again, if they are not doing that, I think IATI has a problem that needs a solution that lies outside the datastore.

Finally, maybe it would also be better to work this through with actual cases - do we have other examples in addition to the World Bank case (and what has happened in that situation)? Is it also not worth seeing what happens when the validator is launched? We spoke multiple times about all being aware that once implemented it will cause a huge increase in rejections - presumably we agreed then that we were ok with this and expected that, after a short period of rapid fixes, everything would settle down?

Andy Lulham
bill_anderson:
  • I suspect most closed or missing licences are oversight, not deliberate. A job for Tech Team and community to address.

^^ Agreed / cool. Step #1 is this ticket, which would stem the tide of “license unspecified” data.

David_Megginson:

I agree about the reuse problem. That’s why I’d have the data excluded from common queries by default, and included only when the user explicitly opted in (e.g. “Include non-open data” option in the UI, or “&license=nonopen” in the API).

It sounds like there’s appetite for removing the option to publish closed IATI data going forward (FWIW I support this). Plus the number of activities published with a closed license is really small (see table above). If the option of a closed license were to be removed, I doubt it would be worth special casing for closed data in the datastore API.
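
To make David’s default concrete, a hedged sketch of the filter (the licence codes and the include_nonopen flag are illustrative, not the Datastore’s real API):

```python
# Default-exclude sketch: non-open datasets are filtered out unless the
# caller explicitly opts in, mirroring the proposed "&license=nonopen"
# query parameter. Licence codes here are an illustrative subset only.
OPEN_LICENCES = {"cc-by", "cc-zero", "odc-pddl"}  # illustrative

def filter_datasets(datasets, include_nonopen=False):
    """Return open-licensed datasets; include the rest only on opt-in."""
    if include_nonopen:
        return list(datasets)
    return [d for d in datasets if d["licence"] in OPEN_LICENCES]

datasets = [{"id": "pub-a", "licence": "cc-by"},
            {"id": "pub-b", "licence": "notspecified"}]

print([d["id"] for d in filter_datasets(datasets)])
# ['pub-a']
print([d["id"] for d in filter_datasets(datasets, include_nonopen=True)])
# ['pub-a', 'pub-b']
```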

