Background
While the IATI Standard is structured hierarchically, the latest IATI Datastore first flattens the hierarchical IATI XML data into individual fields in JSON to be processed by our search engine, Solr. This enables fast searching and querying of data, but loses some of the hierarchical context of the original XML.
For example, the value element of a transaction element gets flattened into the JSON field `transaction_value`. While this flat structure is ideal for searching the data and exporting it to flat files (e.g. CSV and Excel), it does introduce some ambiguity in the interpretation for elements that can have arbitrary numbers of optional child elements. For example, an activity may contain two result elements that both have a title, but only one has a document-link element. This would be represented in the flattened JSON as:
"result_title_narrative": ["Result one", "Result two"],
"result_document_link_title_narrative": "A document"
In the example above, the flat structure enables Solr to search the text and subelements of the result elements, but it does not allow a user of the flattened JSON to tell which result element in particular has the document-link. Our current solution to this ambiguity is to direct users of the IATI Datastore API to export the activities that result from a query in their original IATI XML format, and then convert it to JSON if desired.
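The ambiguity can be reproduced with a minimal sketch, assuming a simplified in-memory view of the two result elements. The `flatten` helper and the sample data are hypothetical illustrations, not the Datastore's actual indexing code, and for simplicity every flattened field is collected into a list.

```python
def flatten(results):
    """Collect values per flattened field name, discarding which result each came from."""
    flat = {}
    for result in results:
        for field, value in result.items():
            flat.setdefault(field, []).append(value)
    return flat


# Hypothetical case A: the document-link belongs to the second result.
results = [
    {"result_title_narrative": "Result one"},
    {
        "result_title_narrative": "Result two",
        "result_document_link_title_narrative": "A document",
    },
]

# Hypothetical case B: the document-link belongs to the first result.
swapped = [
    {
        "result_title_narrative": "Result one",
        "result_document_link_title_narrative": "A document",
    },
    {"result_title_narrative": "Result two"},
]

flat = flatten(results)
print(flat)

# Both hierarchies flatten to the same fields, so the owner of the
# document-link cannot be recovered from the flat form alone.
print(flatten(swapped) == flat)
```

Because both inputs produce identical flat output, a consumer of the flat JSON has no way to tell the two cases apart.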
Proposal
To make development around the Datastore API more accessible to developers who may be unfamiliar with XML, we're exploring doing the conversion from XML to JSON ourselves and storing the results in our data lake for fast access. Since there is no JSON schema for the IATI Standard, there are several ways to make this conversion, so we are seeking input from the community on which format would be the most useful.
The attached file contains a random export of 10 activities in a prototype JSON format that would be the response body for a planned endpoint of https://api.iatistandard.org/datastore/activity/iati_json. Note that we would only support the /activity endpoint for structured JSON.
Questions
- Would this format enable you or your organisation to more readily use the IATI Datastore API?
- Is the structure logical and easy to understand?
- What changes would you like to see made to the structure?
- Do you presently use the existing flat JSON format on the Datastore Search, and would you be interested in having access to the structured JSON format via the Datastore Search as well?
Please provide your feedback on the above questions as a comment on this post, or directly to code@iatistandard.org, by October 28, 2022.
Thank you in advance for your time,
IATI Technical Team
Hi Mark,
Thanks for sharing your thoughts and that history on API conventions.
Ideally, this JSON serialization could better enable back-end systems, front-end websites, and any analysts with tools that happen to work with JSON. So far we've heard from one publisher government and one non-publisher NGO, both building front-end dashboards, that have asked for this output format. Given that our current JSON serialization of IATI was structured solely for indexing reasons, we thought it would be important to provide a JSON structure that is a more faithful representation of the standard. It also more closely aligns with the principles laid out in the conventions document you shared.
From our recent testing of the IATI.cloud API, we have not been able to retrieve JSON structured like IATI XML. Regarding comparisons with the JSON output from Datastore Classic, I think you'll find that our previously attached prototype JSON is functionally very similar. The only notable differences are that most child elements have been structured as arrays (to allow for consistent accessing of child elements where multiples may occur), that we've included some XPath notation in the naming of keys to indicate when a key comes from an attribute (with an "@"), and that we refer to text by the XPath function "text()". The attached image shows the same activity exported as JSON from Datastore Classic on the left and from Datastore V3 on the right.
For accessing the sector code for the first sector, for example, Datastore Classic syntax would be "result['iati-activities'][0]['iati-activity']['sector']['code']", where Datastore V3 syntax would be "docs[0]['iati_json']['iati-activity'][0]['sector'][0]['@code']". Serializing elements as arrays makes accessing sub-elements more consistent across activities that have varying numbers of sub-elements. You don't need to check whether the sector key contains an array or an object before processing it, because it will always contain an array.
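As a rough sketch of the difference for client code, assume one serialization emits a single sector as an object and multiple sectors as an array, while the other always emits an array with "@"-prefixed attribute keys. The sample dictionaries, the sector codes, and the `as_list` helper are hypothetical and only illustrate the access pattern.

```python
def as_list(value):
    """Wrap a lone object in a list so callers can always iterate."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hypothetical "object or array" style: the shape depends on the count.
single = {"sector": {"code": "11110"}}
multiple = {"sector": [{"code": "11110"}, {"code": "12220"}]}

# The consumer has to normalise before reading the codes.
classic_codes = [
    [sector["code"] for sector in as_list(activity.get("sector"))]
    for activity in (single, multiple)
]

# Hypothetical "always an array" style with "@" marking attributes.
v3_single = {"sector": [{"@code": "11110"}]}
v3_multiple = {"sector": [{"@code": "11110"}, {"@code": "12220"}]}

# No shape check is needed; the same loop works for one or many sectors.
v3_codes = [
    [sector["@code"] for sector in activity.get("sector", [])]
    for activity in (v3_single, v3_multiple)
]

print(classic_codes)
print(v3_codes)
```

Both approaches yield the same codes; the always-array form simply moves the normalisation step out of every consumer.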
We would be interested in hearing your thoughts regarding these differences. At this early stage, we’re open to making any changes to the serialized JSON structure, like dropping the “@” from attributes or rendering “text()” as “text” instead, if that would improve the utility for most users. So far we have not heard any feedback regarding compatibility between this JSON structure and past Datastores, but that’s something we can consider if there’s a desire for it.
As for a product road map for the Datastore, we're hoping to write a light-touch product road map as part of our transition planning over the next few months. We felt that providing structured JSON was relatively low-hanging fruit that would enable more developers to make use of the Datastore while we continue to work on transition planning and the product road map.
As always, thanks for your time in looking into this proposal.
IATI Technical Team
Many thanks for your reply!
You make a very good point re: consistently serialising elements as arrays. I think we should look into adjusting Datastore Classic so that it consistently outputs elements as arrays. I have added an issue here:
https://github.com/codeforIATI/iati-datastore/issues/381
As discussed there, Datastore Classic uses the *xmltodict* library to convert IATI XML data to JSON format on the fly. It looks like you can request it to output arrays for particular elements by passing the *force_list* parameter. It appears that a design decision was taken when DSv1 was built to not include the "@" symbol in front of attributes and to call the text node "text" rather than "#text" (as is the default in xmltodict). In the context of the IATI Standard I think those things probably make sense and help readability of the output.
I also wonder whether only *some* elements should be output as arrays (those elements that can occur more than once). For example, *<activity-status>* can only occur once, so I think putting that output into an array probably doesn't make sense, and it will be easier for users to access that data as an object instead.
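A minimal sketch of the "arrays only for repeatable elements" idea, using Python's standard `xml.etree.ElementTree` as a stand-in rather than xmltodict. The `REPEATABLE` set, the `element_to_dict` helper, and the sample XML are assumptions for illustration; a real list of repeatable elements would have to be derived from the IATI schema. The `attr_prefix` and `text_key` parameters show how the "@" and "text()" naming choices could be toggled.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical subset of elements that may occur more than once.
REPEATABLE = {"sector", "narrative"}


def element_to_dict(elem, attr_prefix="", text_key="text"):
    """Convert an element to a dict, using lists only for REPEATABLE tags."""
    node = {attr_prefix + name: value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node[text_key] = text
    for child in elem:
        converted = element_to_dict(child, attr_prefix, text_key)
        if child.tag in REPEATABLE:
            node.setdefault(child.tag, []).append(converted)
        else:
            # Assumed to occur at most once; a second occurrence would overwrite.
            node[child.tag] = converted
    return node


sample_xml = """
<iati-activity>
  <activity-status code="2"/>
  <sector code="11110"/>
  <title><narrative>Example activity</narrative></title>
</iati-activity>
"""

# Plain attribute names and a "text" key.
activity = element_to_dict(ET.fromstring(sample_xml))
print(json.dumps(activity, indent=2))

# XPath-flavoured keys: "@" for attributes and "text()" for text nodes.
xpath_style = element_to_dict(ET.fromstring(sample_xml), "@", "text()")
print(json.dumps(xpath_style, indent=2))
```

With this approach `activity["activity-status"]["code"]` is read directly, while `activity["sector"]` is always a list even when only one sector is present.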