Data | Social Sustainability Data Observatory

Survey Harmonization

Mon, 05 Jul 2021 08:00:00 +0000

We provide retrospective, ex post, and ex ante survey harmonization to our partners.

The aim of retrospective survey harmonization is to pool data from pre-existing surveys made with a similar methodology at different points in time and in different countries or territories. Ex post survey harmonization is, in a way, a passive form of pooling research funding, because you can use information from surveys that were conducted at somebody else’s expense.

The Arab Barometer surveys do not have a consolidated codebook, but our retroharmonize software created one and pooled data from three years, collected in many countries, on various public policy issues.

The aim of ex ante survey harmonization is to maximize the value from future retrospective harmonization; in a way, it is an active form of pooling research funding, because you benefit from money spent on related open governmental and open science survey programs.

In this example we designed a survey that is representative of music professionals so that it can be compared with large-sample, national surveys on living conditions and attitudes, and with occupational groups. Nationally representative surveys do not question enough musicians to allow such specific use; musician-only surveys do not allow comparison.

retroharmonize is a peer-reviewed, scientific statistical software that allows the programmatic retrospective harmonization of surveys, such as the last 35 years of all Eurobarometer microdata, or all Afrobarometer microdata. Eurobarometer grew out of certain CEE member states’ need for comparable data about their music and audiovisual sectors. We commissioned surveys following ESSNet-Culture guidelines and combined our survey data with open access European microdata-level surveys.
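As a rough illustration of what programmatic retrospective harmonization involves, the Python sketch below maps differently coded answers from two survey waves onto one common coding before pooling them. It is not the retroharmonize API (that package is written in R). The wave names, the answer codes, the country codes, and the CROSSWALK table are all hypothetical.

```python
# Hypothetical crosswalk: each wave codes the same trust question differently.
CROSSWALK = {
    "wave_2016": {"1": "trust", "2": "do_not_trust", "9": "missing"},
    "wave_2018": {
        "Tend to trust": "trust",
        "Tend not to trust": "do_not_trust",
        "DK": "missing",
    },
}


def harmonize(wave, raw_value):
    """Return the harmonized label, or 'unmapped' if the code is unknown."""
    return CROSSWALK[wave].get(str(raw_value).strip(), "unmapped")


def pool(responses):
    """Pool (wave, country, raw_value) records into one harmonized list."""
    return [
        {"wave": wave, "country": country, "trust": harmonize(wave, value)}
        for (wave, country, value) in responses
    ]


pooled = pool([
    ("wave_2016", "TN", 1),
    ("wave_2016", "JO", 9),
    ("wave_2018", "TN", "Tend not to trust"),
    ("wave_2018", "JO", "refused"),
])
for row in pooled:
    print(row)
```

Answers that the crosswalk does not cover are kept and flagged as unmapped rather than silently dropped, so they can be reviewed by a curator.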

regions solves the problems caused by Europe’s shifting regional boundaries. These boundaries have changed in several thousand places over the last twenty years, so member states’ and Eurostat’s regional statistics are not comparable over more than two to three years. The software validates and, where possible, recodes regional codes across typologies from NUTS1999 to the not-yet-used NUTS2021, opening up vast, valuable, untapped data sources that can be used for longitudinal analysis or for panel analysis far more precise than what national data alone would allow. It was originally designed in a research project at IViR at the University of Amsterdam to understand the geographical dynamics of book piracy. Because of the needs this software fills, it had 700 users in the first month after publication. It is particularly useful for recoding old surveys, because regional boundaries in Europe change several hundred times in each decade.
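The validate-then-recode idea can be sketched in a few lines of Python. This is not the regions package (which is written in R), only a minimal illustration under stated assumptions: the RECODE and VALID_NEW tables below are small, illustrative examples and not an authoritative NUTS correspondence table, and the function name recode_region is hypothetical.

```python
# Illustrative correspondence table from an older to a newer regional coding.
# The entries are examples for this sketch, not an authoritative NUTS table.
RECODE = {
    "FR24": "FRB0",  # a region whose code changed between typologies
    "HU10": None,    # a region that was split: no one-to-one successor
}
# Codes assumed to be valid in the target typology (illustrative subset).
VALID_NEW = {"FRB0", "DE21", "HU11", "HU12"}


def recode_region(code):
    """Return (new_code, status) for one regional code."""
    if code in VALID_NEW:
        return code, "valid"        # already valid in the target typology
    if code in RECODE:
        new_code = RECODE[code]
        if new_code is None:
            return None, "split"    # cannot be recoded without extra data
        return new_code, "recoded"
    return None, "invalid"          # unknown in both typologies


for old_code in ["DE21", "FR24", "HU10", "XX99"]:
    print(old_code, recode_region(old_code))
```

Returning a status next to the code keeps the "where possible" cases visible: a split region is reported as such instead of being assigned to an arbitrary successor.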

Metadata

Tue, 01 Jun 2021 11:00:00 +0000

Our observatory has a new data API which allows access to our daily refreshing open data. You can access the API via api.greendeal.dataobservatory.eu

All the data and the metadata are available as open data, without database use restrictions, under the ODbL license. However, the metadata contents are not finalized yet. We are currently working on a solution that applies the FAIR Guiding Principles for scientific data management and stewardship, and fulfills the mandatory requirements of the Dublin Core metadata standard as well as the mandatory and most of the recommended requirements of DataCite. These changes will be effective before 1 July 2021.

The Competition Data Observatory temporarily shares an API with the Economy Data Observatory, which serves as an incubator for similar economy-oriented reproducible research resources.

api.greendeal.dataobservatory.eu descriptive metadata

Descriptive Metadata


Identifier	An unambiguous reference to the resource within a given context (Dublin Core item). Several identifiers are allowed, and we will use several of them.
Creator	The main researchers involved in producing the data, or the authors of the publication, in priority order. To supply multiple creators, repeat this property. (Extends the Dublin Core with multiple authors, and legal persons, and adds affiliation data.)
Title	A name given to the resource. Extends Dublin Core with alternative title, subtitle, translated Title, and other title(s).
Publisher	The name of the entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource. This property will be used to formulate the citation, so consider the prominence of the role. For software, use Publisher for the code repository. (Dublin Core item.)
Publication Year	The year when the data was or will be made publicly available.
Resource Type	We publish Datasets, Images, Reports, and Data Papers. (Dublin Core item with controlled vocabulary.)

Recommended for discovery

The Recommended (R) properties are optional, but strongly recommended for interoperability.


Subject	The topic of the resource. (Dublin Core item.)
Contributor	The institution or person responsible for collecting, managing, distributing, or otherwise contributing to the development of the resource. (Extends the Dublin Core with multiple authors, and legal persons, and adds affiliation data.) When applicable, we add Distributor (of the datasets and images), Contact Person, Data Collector, Data Curator, Data Manager, Hosting Institution, Producer (for images), Project Manager, Researcher, Research Group, Rightsholder, Sponsor, and Supervisor.
Date	A point or period of time associated with an event in the lifecycle of the resource. Besides the Dublin Core minimum, we add Collected, Created, Issued, Updated, and, if necessary, Withdrawn dates to our datasets.
Related Identifier	An identifier or identifiers of related resources.
Rights	We give SPDX License List standards rights description with URLs to the actual license. (Dublin Core item: Rights Management)
Description	Recommended for discovery. (Dublin Core item.)
GeoLocation	Similar to Dublin Core item Coverage

The Subject property: we need to set standard coding schemas for each observatory.
The Contributor property distinguishes the following roles:
- DataCurator: the curator of the dataset, who sets the mandatory properties.
- DataManager: the person who keeps the dataset up-to-date.
- ContactPerson: the person who can be contacted for reuse requests or bug reports.
The Date property contains the following dates, which are set automatically by the dataobservatory R package:
- Updated: when the dataset was updated.
- EarliestObservation: the earliest observation that is not backcasted, estimated, or imputed.
- LatestObservation: the latest observation that is not backcasted, estimated, or imputed.
- UpdatedatSource: when the raw data source was last updated.
The GeoLocation is automatically created by the dataobservatory R package.
The Description property has optional elements, which we adopted for the observatories as follows:
- The Abstract is a short, textual description; we try to automate its creation as much as possible, but some curatorial input is necessary.
- In the TechnicalInfo sub-field, we automatically record the output of utils::sessionInfo() for computational reproducibility. This is created automatically by the dataobservatory R package.
- In the Other sub-field, we record the keywords for structuring the observatory.
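The automatic Date fields described above can be illustrated with a short Python sketch. It assumes that EarliestObservation and LatestObservation are taken only from observations that were actually observed, that is, not backcasted, estimated, or imputed. The record layout, the status labels, and the function name observation_dates are hypothetical; the dataobservatory R package may implement this differently.

```python
from datetime import date


def observation_dates(observations):
    """Earliest and latest dates among actually observed values only."""
    actual = [o["time"] for o in observations if o["status"] == "actual"]
    if not actual:
        return None, None
    return min(actual), max(actual)


# Hypothetical annual series with a status label on every observation.
observations = [
    {"time": date(2015, 1, 1), "value": 10.2, "status": "backcasted"},
    {"time": date(2016, 1, 1), "value": 10.9, "status": "actual"},
    {"time": date(2017, 1, 1), "value": 11.4, "status": "imputed"},
    {"time": date(2018, 1, 1), "value": 11.8, "status": "actual"},
    {"time": date(2019, 1, 1), "value": 12.1, "status": "estimated"},
]

earliest, latest = observation_dates(observations)
date_metadata = {
    "Updated": date.today().isoformat(),
    "EarliestObservation": earliest.isoformat(),
    "LatestObservation": latest.isoformat(),
}
print(date_metadata)
```

In this example the backcasted 2015 value and the estimated 2019 value do not extend the reported observation range.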

Optional

The Optional (O) properties are optional and provide richer description. For findability they are not so important, but to create a web service, they are essential. In the mandatory and recommended fields, we are following other metadata standards and codelists, but in the optional fields we have to build up our own system for the observatories.


Language	A language of the resource. (Dublin Core item.)
Alternative Identifier	An identifier or identifiers other than the primary Identifier applied to the resource being registered.
Size	We give the CSV, downloadable dataset size in bytes.
Format	We give file format information. We mainly use CSV and JSON, and occasionally rds and SPSS types. (Dublin Core item.)
Version	The version number of the resource.
Rights	We give SPDX License List standards rights description with URLs to the actual license. (Dublin Core item: Rights Management)
Funding Reference	We provide the funding reference information when applicable. This is usually mandatory with public funds.
Related Item	We give information about our observatory partners’ related research products, awards, and grants (also a Dublin Core item, as Relation). We particularly include source information when the dataset is derived from another resource (which is a Dublin Core item).

For Language, we only use English (eng) at the moment.
By default, we do not use the Alternative Identifier property. We will use it when the same dataset appears in several observatories.
The Size property is measured in bytes for the CSV representation of the dataset. During creation, the software writes a temporary CSV file to check that the dataset has no writing problems, and measures its size.
The Version property needs further work. For a daily refreshing API we need to find an applicable versioning system.
The Funding reference will contain information for donors, sponsors, and co-financing partners.
Our default setting for Rights is the CC-BY-NC-SA-4.0 license, and we provide a URI for the license document.
In the RelatedItem we give information about:
- The original (raw) data source.
- Methodological bibliography reference, when needed.
- The open-source statistical software code that processed the data.

Administrative (Processing) Metadata

Like with diamonds, it is better to know the history of a dataset, too. Our administrative metadata contains codelists that follow the SDMX statistical metadata standards, and similarly structured information about the processing history of the dataset.

api.greendeal.dataobservatory.eu processing metadata

For further reference, see The codebook Class.


Observation Status	SDMX Code list for Observation Status 2.2 (CL_OBS_STATUS), such as actual, missing, imputed, etc. values.
Method	If the value is estimated, we provide modelling information.
Unit	We provide the measurement unit of the data (when applicable.)
Frequency	SDMX Code list for Frequency 2.1 (CL_FREQ) frequency values
Codelist	Euro-SDMX codelist entries for the observational units, such as sex, etc.
Imputation	Code list for imputation methods (CL_IMPUT_METH) imputation values
Estimation	The estimation methodology of data that we calculated, together with citation information and URI to the actual processing code
Related Item	We give information about the software code that processed the data (both Dublin Core and DataCite compliant.)

See an example in The codebook Class article of the dataobservatory R package.
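To make the processing metadata more concrete, the Python sketch below attaches an observation status, a unit, and a method to a single value and rejects incomplete records. It is only an illustration: the four status letters are a small subset written in the style of CL_OBS_STATUS and should be checked against the official SDMX code list, and the function name annotate and the record layout are hypothetical.

```python
# Small subset of observation-status codes in the style of CL_OBS_STATUS.
# Assumption: check the official SDMX code list before relying on these.
OBS_STATUS = {"A": "actual", "E": "estimated", "I": "imputed", "M": "missing"}


def annotate(value, status, unit=None, method=None):
    """Return one observation with its processing metadata attached."""
    if status not in OBS_STATUS:
        raise ValueError(f"unknown observation status: {status!r}")
    if status == "E" and not method:
        # Estimated values must come with modelling information.
        raise ValueError("estimated values need a Method description")
    return {
        "value": value,
        "obs_status": status,
        "obs_status_label": OBS_STATUS[status],
        "unit": unit,
        "method": method,
    }


observed = annotate(11.8, "A", unit="PC")
estimated = annotate(12.1, "E", unit="PC", method="linear trend forecast")
print(observed)
print(estimated)
```

Refusing an estimated value without a method mirrors the rule in the table above that modelling information is provided whenever a value is estimated.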

Data Sharing

Sun, 16 May 2021 00:00:00 +0000

We would like to actively encourage the sharing of data assets.

Open Data

Sun, 16 May 2021 00:00:00 +0000

Many countries in the world allow access to a vast array of information, such as documents under freedom of information requests, statistics, and datasets. In the European Union, most taxpayer-financed data in government administration, transport, or meteorology, for example, can usually be re-used. More and more scientific output is expected to be reviewable and reproducible, which implies open access.

What’s the Problem with Open Data?

How We Add Value?

Is There Value in It?
If it’s money on the street, why is nobody picking it up?

Datasets Should Work Together to Give Information
Data is only potential information, raw and unprocessed.

What’s the Problem with Open Data?

“Data is stuff. It is raw, unprocessed, possibly even untouched by human hands, unviewed by human eyes, un-thought-about by human minds.” [1]

Most open data cannot be just “downloaded.”
Often, you need to put more than $100 worth of work into processing, validating, and documenting a dataset that is worth $100. But you can share this investment with our data observatories.
Open data almost always lacks documentation and clear references for validating whether the data is reliable and uncorrupted. This is why we always start with reprocessing and redocumenting.

Our review of about 80 EU, UN and OECD data observatories reveals that most of them do not use these organizations’ open data; instead, they use various, often poorly processed, proprietary sources.

How We Add Value?

We believe that even such generally trusted data sources as Eurostat often need to be reprocessed, because various legal and political constraints do not allow the common European statistical services to provide optimal quality data – for example, on the regional and city levels.
With rOpenGov and other partners, we are creating open-source statistical software in R to re-process these heterogeneous and low-quality data into tidy statistical indicators, and to automatically validate and document them.
Metadata is a potentially informative data record about a potentially informative dataset. We are carefully documenting and releasing administrative, processing, and descriptive metadata, following international metadata standards, to make our data easy to find and easy to use for data analysts.
We are automatically creating depositions and authoritative copies marked with an individual digital object identifier (DOI) to maintain data integrity.

Is There Value in Open Data?

A well-known story tells of a finance professor and a student who come across a $100 bill lying on the ground. As the student stops to pick it up, the professor says, “Don’t bother—if it were really a $100 bill, it wouldn’t be there.”

But this is not the case with open data. Often, you need to put more than $100 into processing, validating, and documenting a dataset that is worth $100.

In the EU, open data is governed by the Directive on open data and the re-use of public sector information, in short the Open Data Directive (EU) 2019/1024. It entered into force on 16 July 2019. It replaces the Public Sector Information Directive, also known as the PSI Directive, which dated from 2003 and was amended in 2013.

Open Data is potentially useful data that can potentially replace costlier or hard to get data sources to build information. It is analogous to potential energy: work is required to release it. We build automated systems that reduce this work and increase the likelihood that open data will offer the best value for money.

Most open data is not publicly accessible; it is available only upon request. Our real curatorial advantage is that we know where it is and how to get this request processed.
Most European open data comes from tax authorities, meteorological offices, managers of transport infrastructure, and other governmental bodies whose data needs are very different from yours. Their data must be carefully evaluated, re-processed, and if necessary, imputed to be usable for your scientific, business or policy goals.
The use of open science data is problematic in different ways: usually understanding the data documentation requires domain-specific specialist knowledge. Open science data is even more scattered and difficult to access than technically open, but not public governmental data.

From Datasets to Data Integration, Data to Information

“Data is only potential information, raw and unprocessed, prior to anyone actually being informed by it.” [2]

We are building simple databases and supporting APIs that release the data without restrictions, in a tidy format that is easy to join with other data, or easy to join into databases, together with standardized metadata.
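What "easy to join" means in practice can be shown with a small Python sketch. It assumes a tidy layout in which every row is one observation identified by a region code and a time period; the column names geo and time, the two example indicators, and the function name join_on_keys are hypothetical and do not describe the actual API output.

```python
def join_on_keys(left, right, keys=("geo", "time")):
    """Inner-join two lists of tidy rows on their shared key columns."""
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined


# Two hypothetical indicators in tidy form: one observation per row.
population = [
    {"geo": "NL", "time": 2020, "population": 17.4},
    {"geo": "BE", "time": 2020, "population": 11.5},
]
concerts = [
    {"geo": "NL", "time": 2020, "concerts": 120},
    {"geo": "HU", "time": 2020, "concerts": 80},
]

combined = join_on_keys(population, concerts)
print(combined)
```

Because both tables share the same key columns, combining them needs no manual matching of region names or reshaping of columns.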

Our service flow and value chain

FAQ

Why Downloading Does Not Work?

Most open data is not available on the internet.
If it is available, it is not in a form that you can easily import into a spreadsheet application like Excel or OpenOffice, or into a statistical application like SPSS or STATA.
Even the data quality of trusted web sources, like the Eurostat website, can be very low. Eurostat just publishes what it gets from governments, and often has no mandate to fix errors. The data is full of missing information and, in the case of regional statistics, faulty region codes and region names that make it impossible to match your data or place them on a map.
Converting between euros and millions of euros, correctly translating dollars to euros, or pounds to kilograms requires plenty of work. This is a very error-prone process when done by humans.
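The unit problem mentioned in the last point is a good candidate for automation. The Python sketch below keeps all conversion factors in one table so that each conversion is applied the same way every time. The unit labels (such as MIO_EUR) and the function name to_target are hypothetical, and the exchange rate is a parameter that the caller must supply; no real rate is assumed here.

```python
# Fixed conversion factors; the unit labels are illustrative.
UNIT_FACTORS = {
    ("EUR", "EUR"): 1.0,
    ("THS_EUR", "EUR"): 1_000.0,      # thousands of euros to euros
    ("MIO_EUR", "EUR"): 1_000_000.0,  # millions of euros to euros
    ("LB", "KG"): 0.45359237,         # avoirdupois pounds to kilograms
}


def to_target(value, unit, target, usd_per_eur=None):
    """Convert value from unit to target, or raise if no rule exists."""
    if (unit, target) in UNIT_FACTORS:
        return value * UNIT_FACTORS[(unit, target)]
    if (unit, target) == ("USD", "EUR"):
        if usd_per_eur is None:
            raise ValueError("an exchange rate is required for USD to EUR")
        return value / usd_per_eur
    raise ValueError(f"no conversion from {unit} to {target}")


print(to_target(2.5, "MIO_EUR", "EUR"))
print(to_target(10, "LB", "KG"))
# The rate below is a made-up number for the example, not a real quote.
print(to_target(120.0, "USD", "EUR", usd_per_eur=1.2))
```

Unknown unit pairs raise an error instead of passing through unchanged, which is the failure that is hardest to spot when conversions are done by hand.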

Can Open Data be Used in Machine Learning and AI?

Most public and open data sources have many missing observations; machine learning models usually cannot handle missingness. These points must be carefully imputed with approximations, which can be very challenging when the data has a geographical dimension.
Removing missing values makes samples extremely biased, and your model will learn from omissions, not information.
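A very small Python sketch can show the difference between dropping and imputing missing points. It uses linear interpolation along a time series as a deliberately simple stand-in for a real imputation model, and it does not address the harder geographical case mentioned above. The function name interpolate and the example series are hypothetical.

```python
def interpolate(series):
    """Fill interior gaps (None) linearly and flag every point."""
    values = list(series)
    flags = ["actual" if v is not None else "missing" for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            values[i] = values[left] + step * (i - left)
            flags[i] = "imputed"
    return values, flags


series = [10.0, None, None, 16.0, None]
filled, flags = interpolate(series)
dropped = [v for v in series if v is not None]

print(filled)
print(flags)
print(dropped)
```

Dropping keeps only two of the five periods, while interpolation fills the interior gap and labels the filled points as imputed. The trailing gap is left missing, because extrapolating beyond the last observed value would need a forecasting model rather than interpolation.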

Photo Credits

The What’s the Problem with Open Data? illustration is a photo by Cristina Gottardi.
The How We Add Value? illustration is a photo by Nana Smirnova.
The Is There Value Left in It? illustration is a photo by Imelda.
The Datasets Should Work Together to Give Information illustration is a photo by Lucas Santos.

Footnote References

[1] Pomerantz, Jeffrey. 2015. “Metadata.” MIT Press Essential Knowledge series. Cambridge, Massachusetts; London, England: The MIT Press.

[2] Pomerantz, Jeffrey. 2015. “Metadata.” MIT Press Essential Knowledge series. Cambridge, Massachusetts; London, England: The MIT Press.