Services or raw data? Invalidate your cache, stupid!
Imagine you depend on Open Government Data and want to include it in your workflow. Do you prefer web services (like WFS or Esri’s Feature Services) or do you want the raw data? I have had several discussions on this topic, both on Twitter and offline. In the following, I summarize my thoughts for future reference1.
What is Open Government Data anyway?
I like to think of the use of Open Government Data (or Open Data) as using layers of services2:
- Raw data contains the actual measurements or atomic information. It is usually served as one block in a basic file format (CSV, Excel or, even worse, unstructured PDF), may be zipped, and is delivered via websites, FTP or, shockingly, traditional media such as DVD.
- Individual features are usually served by an API, like WFS or a proprietary JSON API. The features may also be requested in bulk.
- Aggregations or topics are based on the individual features and provide higher-level abstractions, like a summary of the population of all cantons, based on raw population numbers.
- Presentations may be reports, websites or even apps, which present the aggregated data in a form that is structured as a response to a problem.
It’s easy to see that raw data is the basis for all of the subsequent layers. If the raw data changes, so do the derived features, aggregations, presentations and possibly the decisions that have been informed by the data (products).
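To make the dependency between the layers concrete, here is a minimal Python sketch. The layer names and the `DEPENDENTS` graph are hypothetical and simply mirror the list above; a real pipeline would have many datasets per layer.

```python
from collections import deque

# Hypothetical dependency graph mirroring the four layers described above:
# each key lists the layers that are derived directly from it.
DEPENDENTS = {
    "raw_data": ["features"],
    "features": ["aggregations"],
    "aggregations": ["presentations"],
    "presentations": [],
}


def stale_layers(changed, dependents):
    """Return every layer downstream of `changed`, in breadth-first order."""
    stale = []
    queue = deque(dependents.get(changed, []))
    while queue:
        layer = queue.popleft()
        if layer not in stale:
            stale.append(layer)
            queue.extend(dependents.get(layer, []))
    return stale


# A change in the raw data invalidates everything built on top of it.
print(stale_layers("raw_data", DEPENDENTS))
```

A change further up the chain, say in the aggregations, only invalidates the presentations.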
Consuming Open Data – It’s about invalidating your cache
In the long term, one of the biggest challenges of Open Data is not to have the raw data publicly available, but to provide and consume up-to-date and high-quality services based on the original data. As a consumer, you have to keep in mind that everything is a cache, as long as you are not the owner of the data. And arguably, handling caches is a hard thing to tackle3: You are in a dependency chain and you are not in control of the whole chain. Keep in mind that this lack of control over the dependency chain applies to both the provider and the consumer of third-party data.
The Open Knowledge Foundation says: „Our Mission is a World of Frictionless Data“. In my opinion, the challenge of keeping up-to-date is one of the long-term frictions we have to overcome. By the way, other frictions are astutely described in Tyler Bell’s write-up on using 3rd party data.
Invalidating your cache means being responsible with data: If you use raw data for your services or apps, you have the responsibility to make sure your data is up-to-date for your service (or otherwise clearly mark the source and the compilation date of your content).
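As a rough illustration of what "marking the source and the compilation date" could look like in code, here is a small Python sketch. The class `CachedDataset`, its fields and the example URL are assumptions made for this illustration, not part of any particular library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CachedDataset:
    """A local copy of third-party data, treated explicitly as a cache."""

    source_url: str          # where the raw data was obtained
    retrieved_at: datetime   # when this local copy was compiled
    max_age: timedelta       # how long we are willing to trust the copy

    def is_stale(self, now):
        """True if the copy is older than the maximum age we accept."""
        return now - self.retrieved_at > self.max_age

    def attribution(self):
        """A label that can be shown next to any product built on the data."""
        return f"Source: {self.source_url} (compiled {self.retrieved_at:%Y-%m-%d})"


# Hypothetical dataset, retrieved at the start of 2014 and trusted for 180 days.
population = CachedDataset(
    source_url="https://example.org/population.csv",
    retrieved_at=datetime(2014, 1, 1, tzinfo=timezone.utc),
    max_age=timedelta(days=180),
)
print(population.attribution())
print(population.is_stale(datetime.now(timezone.utc)))
```

A fixed maximum age is the crudest possible policy. It does not tell you whether the original data actually changed, only that it is time to check.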
What should be offered, services or raw data?
But looking at this from the other end: what should the government provide? Raw data for sure. If it is part of their mission, they should also provide at least feature-level and aggregation services. For example, geodata is used within many governmental processes. So, an agency has to provide services like maps or cadastral information in order to make sure others can fulfill their tasks, like issuing building permits. On the other hand, there are cases where services or apps are not part of the government’s mission and they should not build them (example: location analysis for real estate).
If the dataset is huge and only a small part of it changes, data services are preferable to raw data downloads, since it is easier to invalidate your local data cache using up-to-date data services. The turn-around from changes at the local government to changes within your workflow is generally faster than with raw data downloads (which may only be available every six months, if at all).
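One common way to invalidate a local copy against a web service is an HTTP conditional request. The following Python sketch uses only the standard library and assumes that the service sends an `ETag` header, which not every data service does. The function names are made up for this example.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def conditional_request(url: str, etag: Optional[str]) -> urllib.request.Request:
    """Build a GET request that asks the server to answer only if the data changed."""
    request = urllib.request.Request(url)
    if etag:
        # The server compares this value with the current version of the resource.
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url: str, etag: Optional[str]) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new_etag), or (None, etag) if the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(conditional_request(url, etag), timeout=30) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Our cached copy is still valid, nothing to download.
            return None, etag
        raise
```

Called periodically with the last known `ETag`, `fetch_if_changed` only transfers data when the provider has published a new version.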
In Switzerland, the recently launched Open Data government portal is a great step towards frictionless data. It is a common landing point to get data that can be referenced using permalinks. However, most of the data is in Excel format with unstructured metadata (some of it zipped) or, even worse, in PDFs. And it’s still raw data.
The canton of Zurich experimentally started an approach towards services with its geodata: Currently, there are 8 WFS services, which can be accessed directly with common GIS tools, or in more popular formats like JSON with HSR’s GeoConverter.
Update 2014-04-03: On opendata.admin.ch, there is now a WFS service category (currently from canton Zürich). If you are looking for services for the city of Zürich, there is a catalog with 90+ datasets and services in various formats.
Update 2016-09-29: Good news: The national open data platform has been released at opendata.swiss. Among other organizations, SBB and both the canton and city of Zürich have joined the national portal. And some organizations even provide their data both as a download and as a service – great!
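For readers who have not used WFS before: a feature request is an ordinary HTTP GET with a handful of standard parameters. The Python sketch below builds such a URL for WFS 2.0.0. The endpoint and the layer name are placeholders, not real services of the canton of Zurich, and the accepted `outputFormat` values depend on the server (check its GetCapabilities response).

```python
from typing import Optional
from urllib.parse import urlencode


def wfs_getfeature_url(endpoint: str, type_name: str, count: int = 100,
                       output_format: Optional[str] = None) -> str:
    """Build a WFS 2.0.0 GetFeature URL using key-value-pair encoding."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,  # WFS 1.x uses "typeName" instead
        "count": count,          # WFS 1.x uses "maxFeatures" instead
    }
    if output_format:
        # Server-specific; some servers can return JSON, others only GML.
        params["outputFormat"] = output_format
    return f"{endpoint}?{urlencode(params)}"


# Placeholder endpoint and layer name, for illustration only.
print(wfs_getfeature_url("https://example.org/wfs", "buildings", count=10))
```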
What can you do as an Open Data consumer?
As said above, both the provider and the consumer should act responsibly, since they are both part of the unavoidable dependency chain. As a consumer of data, you can do many things, for example:
- Provide exact attribution with exact reference and timestamp.
- Make sure that you get notified when the original data changes.
- Make sure your workflow keeps your product up-to-date, for example by following Mike Bostock’s data compilation approach.
- As a community: Think of an ethos of data consumers, which reminds everyone in the dependency chain to act responsibly.
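When the provider offers no change notifications, the points above can be approximated by comparing checksums of the raw download. The Python sketch below is one possible way to do this. The file layout and the function names are assumptions, and it is not a description of Mike Bostock’s approach.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of the downloaded raw file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_rebuild(raw_file: Path, state_file: Path) -> bool:
    """True if raw_file differs from the checksum recorded at the last build."""
    if not state_file.exists():
        return True  # never built before
    recorded = json.loads(state_file.read_text())
    return recorded.get("sha256") != sha256_of(raw_file)


def record_build(raw_file: Path, state_file: Path, source_url: str) -> None:
    """Store the exact reference and timestamp alongside the checksum."""
    state = {
        "source": source_url,
        "sha256": sha256_of(raw_file),
        "built_at": datetime.now(timezone.utc).isoformat(),
    }
    state_file.write_text(json.dumps(state, indent=2))
```

A scheduled job would download the raw file, call `needs_rebuild`, regenerate the product only when it returns true, and then call `record_build`. The state file doubles as the attribution record with source and timestamp.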
Did I miss something? Probably lots of things. Let me know by pinging me on Twitter: @ping13 or by commenting below.
- An earlier version of this article was published on my personal blog. ↩
- You might be reminded of the disputable DIKW pyramid: Wisdom > Knowledge > Information > Data. ↩
- There is a variant of the „two-hard-things“ saying, namely: There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors (Source). ↩