Seeking community input on structured JSON

Share this page

The IATI Technical Team is seeking community input on a structured JSON format for the IATI Datastore API.

Background

While the IATI Standard is structured hierarchically, the latest IATI Datastore first flattens the hierarchical IATI XML data into individual fields in JSON to be processed by our search engine, Solr. This enables fast searching and querying of data, but loses some of the hierarchical context of the original XML.

For example, the value element of a transaction element gets flattened into the JSON field `transaction_value`. While this flat structure is ideal for searching the data and exporting it to flat files (e.g. CSV and Excel), it does introduce some ambiguity in the interpretation for elements that can have arbitrary numbers of optional child elements. For example, an activity may contain two result elements that both have a title, but only one has a document-link element. This would be represented in the flattened JSON as:

“result_title_narrative”: [“Result one”, “Result two”],
“result_document_link_title_narrative”: “A document”

In the example above, the flat structure enables Solr to search the text and subelements of the result elements, but it does not allow a user of the flattened JSON to tell which result element in particular has the document-link. Our current solution to this ambiguity is to direct users of the IATI Datastore API to export the activities that result from a query in their original IATI XML format, and then convert it to JSON if desired.

Proposal

In order to make development around the Datastore API more accessible to developers that may be unfamiliar with XML, we’re exploring doing the conversion from XML to JSON, and storing those results in our data lake for fast access. Since there is no JSON schema for the IATI Standard, there are multiple different ways to make this conversion, and so we are seeking input from the community on what format would be the most useful.

The attached file contains a random export of 10 activities in a prototype JSON format that would be the response body for a planned endpoint of https://api.iatistandard.org/datastore/activity/iati_json. Note that we would only support the /activity endpoint for structured JSON.

Questions

Would this format enable you or your organisation to more readily use the IATI Datastore API?
Is the structure logical and easy to understand?
What changes would you like to see made to the structure?
Do you presently use the existing flat JSON format on the Datastore Search, and would you be interested in having access to the structured JSON format via the Datastore Search as well?

Please provide your feedback to the above questions as a comment to this post, or directly to code@iatistandard.org by October 28, 2022.

Thank you in advance for your time,

IATI Technical Team

Files

Prototype structured JSON API output105.78 KB

Comments (11)

Sylvan Ridderinkhof • 1 year ago

Has Solr's "Nested Documents" feature been considered to accomplish this same goal? And if so, what were the reasons for discarding that approach?

From a practical perspective it makes sense to make sure the relation between items in a multi-valued field is kept alive, but should there really be a completely new endpoint to accomplish this goal? Will searching, similar to the regular activity core, be available, or will the users have to either "relearn how to search" or will there just be a link to the json format for each activity?

IATI Technical Team • 1 year ago

Hi Sylvan,

Thanks for taking a look at this. Users of the Datastore API won't be required to use the new endpoint or change how they execute a query. This format is stored in an unindexed field called "iati_json," and the new endpoint simply adds "fl=iati_json:[json]&wt=json" to any query for convenience. It's possible to add this new field to any previous activity-level query. All the previous fields remain indexed as they were before to enable searching.

From the SOLR documentation (https://solr.apache.org/guide/8_1/indexing-nested-documents.html), you can see the complexity inherent in nested documents. Not only would each child element need to be ingested into SOLR under a separate document schema created to mirror the IATI Standard hierarchy, but then each child element would require an individual ID, and it would put additional, unnecessary indexing stresses on the system. By providing the structured JSON in the way we've selected, it's a full and complete representation of the original XML without the need for any additional system complexity.

Mark Brough • 1 year ago

Thanks for sharing this for consultation, IATI Technical Team !

It would be great to hear from users who would like to access IATI XML data as structured JSON. For me, this brings us back to a question we discussed in the DUWG earlier this year -- what is the objective of the Datastore? We discussed three broad options:

1. Other systems (big chunks – bulk) - e.g. country AIMS
2. Front-end websites (small pieces) - e.g. publisher websites / visualisations
3. Analysts (spreadsheets) - e.g. journalists / CSOs / officials trying to do analysis across the data, for example through pivot tables and PowerBI

JSON feels to me particularly useful for other front-end websites, but I think in that case you might want smaller extracts of data available, and also particular aggregations. But would be great to understand if there's a particular use case you're thinking of here -- I guess it's more like the first option above?

I think a key question to think about is around portability across different datastores. Even though it can be a bit more complex to work with, I prefer to work with IATI XML data rather than some other transformation of it because it makes it much easier to switch to a different datastore if you need to (as the IATI-XML output should obviously be consistent across all datastores). For this reason, a long (!) time ago we collectively worked on an API conventions / standards document:

https://docs.google.com/document/d/1gxvmYZSDXBTSMAU16bxfFd-hn1lYVY1c2ol…

Datastore v1 and its successor Datastore Classic both follow that convention (AFAIK).

In any case, I think it would be good to try to standardise the JSON output of Datastore v3 with one of the other datastores currently in place, such as Datastore Classic or IATI.Cloud, to the extent that makes sense.

Finally: is there a roadmap for Datastore v3, and would it be possible to share that? It would be useful to understand plans going forward.

Many thanks!

IATI Technical Team • 1 year ago

Hi Mark,

Thanks for sharing your thoughts and that history on API conventions.

Ideally, this JSON serialization could better enable back-end systems, front-end websites, and any analysts with tools that happen to work with JSON. So far we’ve heard comments from one publisher government and one non-publisher NGO both building front-end dashboards that have asked for this output format. Given that our current JSON serialization of IATI was structured solely for indexing reasons, we thought it would be important to provide a JSON structure that is a more faithful representation of the standard. It also more closely aligns with the principles laid out in the conventions document you shared.

From our recent testing of the IATI.cloud API, we have not been able to retrieve JSON structured like IATI XML. Regarding comparisons with the JSON output from Datastore Classic, I think you'll find that our previously attached prototype JSON is functionally very similar. The only notable differences are that most child elements have been structured as arrays (to allow for consistent accessing of child elements where multiples may occur), that we've included some XPath notation in the naming of keys to indicate when a key comes from an attribute (with an "@"), and that we refer to text by the XPath function "text()". In the attached image, the same activity is shown; exported in JSON from the Datastore Classic on the left, and the Datastore V3 on the right.

For accessing the sector code for the first sector, for example, Datastore Classic syntax would be “result['iati-activities'][0]['iati-activity']['sector']['code']”, where Datastore V3 syntax would be “docs[0]['iati_json’]['iati-activity'][0]['sector'][0]['@code']”. By serializing elements as arrays, it makes accessing sub-elements more consistent across activities that have varying numbers of sub-elements. You don’t need to check whether the sector key contains an array or an object before processing it, it will always contain an array.

We would be interested in hearing your thoughts regarding these differences. At this early stage, we’re open to making any changes to the serialized JSON structure, like dropping the “@” from attributes or rendering “text()” as “text” instead, if that would improve the utility for most users. So far we have not heard any feedback regarding compatibility between this JSON structure and past Datastores, but that’s something we can consider if there’s a desire for it.

As for a product road map for the Datastore, we’re hoping to write a light-touch product road map as part of our transition planning over the next few months. We felt that providing structured JSON was relatively low-hanging fruit, that would enable more developers to make use of the Datastore while we continue to work on transition planning and the product road map.

As always, thanks for your time in looking into this proposal.

Mark Brough • 1 year ago

IATI Technical Team Many thanks for your reply!

You make a very good point re: consistently serialising elements as arrays. I think we should look into adjusting Datastore Classic so that it consistently outputs elements as arrays. I have added an issue here:
https://github.com/codeforIATI/iati-datastore/issues/381

As discussed there, Datastore Classic uses the *xmltodict* library to convert IATI XML data to JSON format on the fly. It looks like you can request it to output arrays for particular elements by passing the *force_list* parameter. It appears that a design decision was taken when DSv1 was built to not include the "@" symbol in front of attributes and to call the text node "text" rather than "#text" (as is the default in xmltodict). In the context of the IATI Standard I think those things probably make sense and help readability of the output.

I also wonder whether only *some* elements should be output as arrays (those elements that can occur more than once). For example, *< activity-status >* can only occur once, so I think putting that output into an array probably doesn't make sense, and it will be easier for users to access that data as an object instead.

IATI Technical Team • 1 year ago

Hi Mark,

If this is useful, here is the code written in the develop branch of the refresher in order to nest the JSON: https://github.com/IATI/refresher/blob/develop/src/library/lakify.py#L1…

We wanted to try and keep the code lightweight, generalizable, and let the structure be driven by the XML. So actually every element is serialized as an array, and the only keys that get serialized as strings are attributes and 'text()' nodes. In those cases, XML forbids having more than one attribute with the same key, and element text can always be serialized as one string.

Steven Flower • 1 year ago

Hi IATI Technical Team

Many thanks for this - very interesting. We've worked on quite a few data standards (eg: Open Contracting; 360Giving; Beneficial Ownership Data Standard) that have JSON as the canonical format, so we were very keen on this. Apologies for not getting a message here before last week, but my colleague David tried to post this, which I think is of use:
=========

For https://iati-tables.codeforiati.org/ we also did a conversion to JSON first.

We used the https://pypi.org/project/xmlschema/ library to do the conversion. It supports many XML to JSON conversions https://xmlschema.readthedocs.io/en/latest/converters.html but I found the default the best. Here is the example data converted using it https://gist.github.com/kindly/092556a65a4ee8b9b72c98f685eddaf6.

The main benefit of using this library is that it is schema aware, so it knows when items should be lists or not. This makes the output a lot clearer.

Here is a notebook showing how to do the conversion and other examples of conversions including what flatten-tool uses https://deepnote.com/workspace/Open-Data-Services-Cooperative-46a54078-….

David Raznick • 1 year ago

I found a Rust library that did the conversion fairly well, like the xmlschema library but faster and does not care about ordering. It is a lot like xml2dict.

So I wrote a python wrapper for it specifically for iati XML.

https://github.com/kindly/iati2json

It is now on pypi.

I image it should be a lot faster then any python option. Though I have not tested that yet.

Here is the example output.

https://github.com/kindly/iati2json/blob/main/example/example.json

Not tested very well yet, should probably run it agaist the whole registry.

If there is interest in making a JSON output format that we can all agree on then I would be happy to improve this. The advantage of having it in rust too is that we could make binding for other languages.

David Raznick • 1 year ago

I just improved

https://github.com/kindly/iati2json

It manages to process the whole registry in about 12 mins on my computer and have added testing, works for orgs too.

Mark Brough • 1 year ago

Hey David Raznick that sounds very impressive performance! I am not familiar with Rust, but from a brief glance at the code, I guess this would need to be updated if there were any changes to the IATI Standard. If that's the case, would it not make sense to just use Python's xml2dict instead, given that a lot of IATI code is already written in Python, and this seems like a very stable library? I am not sure what the performance difference would be though...

David Raznick • 1 year ago

Mark Brough That has changed now, the list in the code is just used if you do not supply the schema.

https://github.com/kindly/iati2json#custom-iati-schemas

Lots of the code is python:
https://github.com/kindly/iati2json/blob/main/iati2json/__init__.py

It uses the python xmlschema libarary to work out where the arrays are. So you could supply the correct version of the schema for the data and it caches the result for next use.

Yes this could be done with xml2dict I think, the force_list option is not well documented,
https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L302
I got confused how to use it for selecting that objects should be lists based on the path.

The rust library was clearer and was why I decided to it with that.

I did a test with xml2dict (without arrays) and it would not be too much slower, so a pure python implementation could make sense.

Please log in or sign up to comment.