CASSIS Interoperability - Metadata
Coordination Action for the integration of Solar System Infrastructures and Science

Administrative Metadata

Administrative Metadata is used to locate and access resources within the system. Two of the main elements are the resource registry and a means of authenticating and authorizing users.

In principle, the Administrative Metadata could and should be domain independent. This would simplify interoperability with Virtual Observatory projects from related communities; however, the management of Administrative Metadata across multiple collaborative environments still needs to be discussed.

Data Management

File Formats

It is not appropriate to force all varieties of data into a single file type: some file types suit certain kinds of data better than others, and research infrastructures and VOs should be able to accommodate all the types of data that are available.

However, many existing file formats share a problem: they do not properly define what the parameters in the file represent. For example, the FITS standard is widely used in many astronomical fields but is actually quite a loose standard. Beyond a small set of mandatory structural keywords, nothing requires the inclusion of specific header records or prescribes the names that should be used, and it is difficult to associate records that provide details about a parameter with the parameter itself. Also, variation in the names used for a parameter can cause problems for analysis software even though the names are really just synonyms. Similar problems exist in other file types unless standards are followed.
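The synonym problem can be illustrated with a minimal sketch. It assumes a header is represented as a plain Python dictionary, and the synonym table is invented for illustration; neither the table nor the function name comes from the FITS standard or from any project discussed here.

```python
# Hypothetical synonym table: these mappings are illustrative assumptions,
# not something defined by the FITS standard.
SYNONYMS = {
    "DATE_OBS": "DATE-OBS",
    "OBS-DATE": "DATE-OBS",
    "EXPOSURE": "EXPTIME",
}


def normalize_header(header):
    """Return a copy of header with known synonyms mapped to one canonical name.

    Keywords that are not in the synonym table are kept unchanged,
    so no information is dropped.
    """
    normalized = {}
    for key, value in header.items():
        canonical = SYNONYMS.get(key.strip().upper(), key)
        normalized[canonical] = value
    return normalized


# Two headers that describe the same observation with different keyword names.
file_a = {"DATE_OBS": "2010-06-01T12:00:00", "EXPOSURE": 2.5}
file_b = {"DATE-OBS": "2010-06-01T12:00:00", "EXPTIME": 2.5}

print(normalize_header(file_a) == normalize_header(file_b))
```

A table like this has to be maintained by hand for every data provider, which is why agreeing on names at the point where the file is written is preferable.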

VOTable, a relatively new text-based (XML) format developed by the IVOA, addresses this: in addition to the name, type and units of the parameter, the FIELD records contain UCD and utype attributes. Using these (the utype is derived from a data model), it is possible to define unambiguously what a parameter means and, if done correctly, it should be much easier for an external person to pick up and use the data.
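As a rough sketch of what this looks like in practice, the fragment below reads the FIELD records of a small VOTable-like document using only the Python standard library. The table content is invented: the utype values are hypothetical placeholders rather than entries from a real data model, and the XML namespace that real VOTable documents declare is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET

# Illustrative document only. The utype values are hypothetical, and a real
# VOTable would declare the IVOA XML namespace on the root element.
VOTABLE = """<VOTABLE version="1.2">
  <RESOURCE>
    <TABLE name="example">
      <FIELD name="time" datatype="char" arraysize="*"
             ucd="time.epoch" utype="hypothetical:Observation.time"/>
      <FIELD name="b_total" datatype="float" unit="nT"
             ucd="phys.magField" utype="hypothetical:Observation.magneticField"/>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def describe_fields(xml_text):
    """Collect the descriptive attributes of every FIELD element."""
    root = ET.fromstring(xml_text)
    fields = []
    for field in root.iter("FIELD"):
        fields.append({
            "name": field.get("name"),
            "datatype": field.get("datatype"),
            "unit": field.get("unit"),      # None when the FIELD has no unit
            "ucd": field.get("ucd"),
            "utype": field.get("utype"),
        })
    return fields


for f in describe_fields(VOTABLE):
    print(f["name"], f["unit"], f["ucd"], f["utype"])
```

Software that keys on the ucd or utype rather than on the column name is unaffected when two providers choose different names for the same quantity.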

Data Provenance

If there are multiple copies of the data, it is essential to understand which copy is the master and which are slaves. The master should always have the most recent and most complete copy of a given dataset. The system should then provide the means to track which data are available on other sites and which version of processing they represent.
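One way to picture this kind of tracking is the sketch below. It assumes each copy of a dataset is recorded with its site, an integer processing version and a master flag; the record layout, the site names and the function name are all hypothetical and are not taken from any existing system.

```python
from dataclasses import dataclass


@dataclass
class DatasetCopy:
    """One copy of a dataset held at one site (hypothetical record layout)."""
    site: str
    dataset: str
    processing_version: int
    is_master: bool = False


def stale_copies(copies):
    """Return the slave copies whose processing version lags behind the master.

    All records are assumed to describe the same dataset, and exactly one
    of them must be flagged as the master.
    """
    masters = [c for c in copies if c.is_master]
    if len(masters) != 1:
        raise ValueError("exactly one master copy is expected per dataset")
    master = masters[0]
    return [
        c for c in copies
        if not c.is_master and c.processing_version < master.processing_version
    ]


copies = [
    DatasetCopy("site-a", "example-dataset", 3, is_master=True),
    DatasetCopy("site-b", "example-dataset", 3),
    DatasetCopy("site-c", "example-dataset", 2),
]

for copy in stale_copies(copies):
    print(copy.site, "holds processing version", copy.processing_version)
```

Refusing to proceed when there is no single master reflects the requirement that the master and slave roles be understood before any comparison is made.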

Note that this issue has not been handled properly (if at all) in the past and should be addressed for future datasets.

Note also that what is meant by provenance in the particle physics community and in our domain seems to differ. The European Data Grid was designed around the needs of the LHC at CERN and in that context they are used to extracting clumps of data for analysis that they wish to keep track of; in our context we are talking about entire archives. Unfortunately a lot of the work that has been done in this area has been on behalf of the particle physics community and it is flavoured by their needs.

Resource Management

Resource Description and Registry

The Registry describes resources that are available through the system and how to use them. Exactly what this means is not clear – different projects are interpreting the need for a registry in different ways and are placing different types of information in their registries.

For example, the HELIO and VAMDC projects are funded under the same call of the FP7 Capacities programme and use the same technology based on IVOA standards (developed by K. Benson at UCL-MSSL who works on both projects) but the information included is different. We need to compare the two to each other and with IVOA standards.

Even within a project, opinions differ on the type of information that should be included: how much detail should be in the registry, and at what point should the individual service be consulted?

While this might at first glance appear to be an issue that only relates to the projects individually, if the desire is to share resources between projects across domain boundaries then having a standard set of metadata that can be exchanged is a major issue.

Authentication and Authorization

The purpose of Authentication and Authorization is to maintain user identity and control access to certain resources.

In order to ensure that users are not discouraged from using a research infrastructure, it should be possible to use most capabilities without needing to authenticate or even provide an identity. If a user wishes only to return to pick up sets of results that were generated earlier then a temporary identity (possibly supported by cookies) should be sufficient. But, if the user wishes to store data or user preferences on a semi-permanent basis, or use processing capabilities, then the system needs to know who they are – i.e. their identity.

For a few activities the user also needs to be authorized; these are limited to executing user-defined code (which could endanger the system) and storing user defined material (which could have inappropriate content).
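The graded access described above can be summarised in a short sketch. The level names and the action names are assumptions made for illustration; they are not part of any particular authentication technology.

```python
from enum import IntEnum


class Access(IntEnum):
    """Increasing levels of user identification (hypothetical names)."""
    ANONYMOUS = 0    # no identity at all
    TEMPORARY = 1    # temporary identity, for example supported by a cookie
    IDENTIFIED = 2   # the system knows who the user is
    AUTHORIZED = 3   # identified and explicitly granted permission


# Minimum level needed for each activity; the action names are illustrative.
REQUIRED = {
    "search": Access.ANONYMOUS,
    "retrieve_earlier_results": Access.TEMPORARY,
    "store_preferences": Access.IDENTIFIED,
    "use_processing": Access.IDENTIFIED,
    "run_user_code": Access.AUTHORIZED,
    "store_user_material": Access.AUTHORIZED,
}


def is_allowed(action, level):
    """Return True if a user at the given level may perform the action."""
    return level >= REQUIRED[action]


print(is_allowed("search", Access.ANONYMOUS))
print(is_allowed("run_user_code", Access.IDENTIFIED))
```

Ordering the levels lets every check reduce to a single comparison, and keeps most capabilities available at the lowest level so that users are not discouraged.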

The metadata used to describe Authentication and Authorization needs to be domain independent, and this should be relatively easy to achieve. There are a limited number of well-established techniques, and a project would normally adopt one of them rather than try to develop its own; this is the sensible course, since the steps required to get a new type of authentication token accepted worldwide are substantial.