Coordination Action for the integration of Solar System Infrastructures and Science

Guidelines for Data Archives

Following a simple set of rules makes it easier to integrate a data archive into any virtual observatory (VO). The rules below are derived from those proposed by the IAU Div. II Working Group (Sun and heliosphere) on International Data Access.

Although the Working Group concentrates on access to solar and heliospheric data, the rules have been expressed as generically as possible and they have relevance to any archive and any VO – we urge data providers to follow them as far as possible.

The rules fall into two groups:

Issues in the first group should primarily be decided by the providers and accommodated by the VO.

Although those in the second group are also in the province of the providers, following a few simple rules can make a large difference to how easily the required observations can be found by the VO and supplied to the scientist.

As much data as practical should be made available. From an analysis standpoint, a regular cadence of at least six observations per hour is desirable; this makes it possible to track the general evolution of phenomena, although rapid changes would be missed.
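
As an illustration of the cadence guideline, the following minimal sketch reports the hours in which fewer observations were made than a chosen threshold. The function name and the default threshold of six per hour are assumptions made for this example; observation times are taken to be plain Python datetime values.

```python
from collections import Counter
from datetime import datetime, timedelta


def hours_below_cadence(times, min_per_hour=6):
    """Return the clock hours, between the first and last observation,
    that contain fewer than min_per_hour observations."""
    times = list(times)
    if not times:
        return []
    # Count observations per clock hour (timestamps truncated to the hour).
    counts = Counter(t.replace(minute=0, second=0, microsecond=0) for t in times)
    hour, last = min(counts), max(counts)
    sparse = []
    while hour <= last:
        # A Counter returns 0 for hours with no observations at all.
        if counts[hour] < min_per_hour:
            sparse.append(hour)
        hour += timedelta(hours=1)
    return sparse
```

Hours with no observations at all are reported as well, since they are counted as zero.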

This document is open for discussion; please contact us if you have any comments.

Access Method:

The protocol used for the interface into a data archive is not critical – a virtual observatory should be able to handle whatever protocol the data provider adopts.

Not all data providers can provide the same level of support – in this context, EGSO developed the concept of resource-rich and resource-poor providers:

  • Resource-rich providers – e.g. data centres – should be able to respond to requests through a simple interface. For resource-rich providers, how the data are stored is an internal issue; catalogues can be used to determine the exact access path, etc.
  • For resource-poor providers, if the VO needs to find the data by itself, logically named files within a hierarchical directory structure are desirable – see below. Standard access options include FTP, HTTP, Web Service, etc. – the first two potentially require the least effort by the provider.

File Formats:

As the volume of available data increases, and the number of data sets grows, it is becoming increasingly important that the data be ready for use – i.e. calibrated – although this is by no means obligatory. A virtual observatory should be able to support the use of data in any format, although some file formats are more useful than others.

For quick-look purposes simple image files are adequate – e.g. JPEG, PNG, GIF, etc. – but the lack of metadata associated with such formats makes it difficult to use them for serious research.

If the objective is to compare data from different instruments, files with formats that can contain fully formed metadata are strongly preferred – e.g. FITS, CDF or equivalent.

If the data in a file are not processed to a high level, then appropriate software and calibration files must be provided if the data need to be "manipulated" before use.

File Names & Metadata:

There are no hard and fast rules on the file names but the name needs to be sufficiently unique that:
  1. The type and origin of the file can easily be identified, and
  2. It can exist without causing confusion when removed from the context of where it is normally stored (on the source archive system)
Ideally the name should identify the "date & time" that the observations were made and the "observatory & instrument" that made them – an indication of the type of observation can also be useful.
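
A minimal sketch of one way to build and parse such names is given below. The pattern used (observatory_instrument_type_YYYYMMDD_HHMMSS.ext) and the helper names are hypothetical and chosen only for illustration; they do not reproduce the SOHO convention or any other established standard.

```python
import re
from datetime import datetime

# Hypothetical pattern: observatory_instrument_type_YYYYMMDD_HHMMSS.ext
NAME_RE = re.compile(
    r"^(?P<observatory>[a-z0-9]+)_(?P<instrument>[a-z0-9]+)_(?P<obstype>[a-z0-9]+)"
    r"_(?P<stamp>\d{8}_\d{6})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(observatory, instrument, obstype, when, ext="fits"):
    """Build a lower-case file name that carries origin, type and time."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{observatory}_{instrument}_{obstype}_{stamp}.{ext}".lower()


def parse_name(name):
    """Recover the fields from a name produced by build_name."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed pattern: {name!r}")
    fields = match.groupdict()
    # Replace the raw time stamp with a datetime under the key "when".
    fields["when"] = datetime.strptime(fields.pop("stamp"), "%Y%m%d_%H%M%S")
    return fields
```

Because every field is carried in the name itself, the file can still be identified after it has been copied away from the source archive.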

The "date & time" need not be a full specification; some kind of sequential numbering might be sufficient. However, if file naming is not based on time, a catalogue or simple translation table is needed to allow the VO to select the appropriate file.
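
The translation table can be very simple. The sketch below assumes a table of (observation start time, file name) pairs sorted by time; the table contents, file names and function name are invented for the example.

```python
import bisect
from datetime import datetime

# Hypothetical translation table for sequentially numbered files,
# sorted by the start time of each observation.
TABLE = [
    (datetime(2011, 1, 15, 9, 0), "img_0001.fits"),
    (datetime(2011, 1, 15, 9, 10), "img_0002.fits"),
    (datetime(2011, 1, 15, 9, 20), "img_0003.fits"),
]


def files_in_range(table, start, end):
    """Return the names of files whose start time lies in [start, end]."""
    times = [t for t, _ in table]
    lo = bisect.bisect_left(times, start)   # first entry at or after start
    hi = bisect.bisect_right(times, end)    # one past the last entry at or before end
    return [name for _, name in table[lo:hi]]
```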

The SOHO mission developed a "convention" for the names of files in its summary and synoptic databases – see Naming Convention for Files (SOHO with BBSO extensions). A simpler convention might be sufficient, but this provides a gold standard for how things can be done.

Note that the information contained in the file name is not enough when the data are to be used for analysis; it is essential that all files contain good metadata describing in detail how the observations were made. It is also important that the metadata are properly formed – if they are not it may be impossible to use the data in some circumstances.

Again a "convention" was established during the time of SOHO – see Solarsoft Standard.

Directory Structure within the Archive:

A hierarchical structure to the data directories makes it easier to find files and is strongly preferred. This is essential for resource-poor providers and is also beneficial for a data centre.

Ideally the directory structure should be a tree based on date (and time?):

The number of directory levels depends on the number of files generated by the instrument. If only one file is produced per day, the number of levels of subdirectories can be reduced.
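
A minimal sketch of a date-based directory tree with a configurable depth follows. The year/month/day layout, the root path and the function name are assumptions made for this example, not a layout prescribed by this document.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_dir(root, day, levels=3):
    """Return the directory for a given day.

    levels=3 gives root/YYYY/MM/DD, levels=2 gives root/YYYY/MM and
    levels=1 gives root/YYYY (assumed layout, for illustration only).
    """
    if not 1 <= levels <= 3:
        raise ValueError("levels must be 1, 2 or 3")
    parts = [f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}"]
    return PurePosixPath(root, *parts[:levels])
```

An instrument that produces many files per day might use all three levels, while one that produces a single file per day could stop at the year or month level.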

On Unix-based archives, if the directory structure is different from the one suggested above, it is possible to map to a more compliant structure using symbolic links without having to reorder the data themselves. The mapped directory structure can then be presented to the external interface.
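
The mapping with symbolic links could be sketched as follows. The example assumes that each file name contains the observation date as an eight-digit YYYYMMDD run and that the target view is a year/month/day tree; both the name pattern and the function name are hypothetical.

```python
import os
import re
from pathlib import Path

# Assumption: the first run of eight digits in a file name is YYYYMMDD.
DATE_RE = re.compile(r"(\d{4})(\d{2})(\d{2})")


def link_by_date(source_dir, view_dir):
    """Create view_dir/YYYY/MM/DD/<name> symbolic links to the files in
    source_dir, leaving the original files where they are."""
    source_dir = Path(source_dir).resolve()
    view_dir = Path(view_dir)
    linked = []
    for path in sorted(source_dir.iterdir()):
        match = DATE_RE.search(path.name)
        if not path.is_file() or match is None:
            continue  # skip subdirectories and files without a date
        target_dir = view_dir.joinpath(*match.groups())
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / path.name
        if not os.path.lexists(link):  # do not overwrite an existing entry
            os.symlink(path, link)
        linked.append(link)
    return linked
```

Only the link tree needs to be exposed to the external interface; the data files themselves are not moved or copied.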

Summary of Observations

It can greatly simplify access if the archive maintains a summary of the observations that have been made.

If an observing log is available, a VO can determine what observations are available without needing to search the archive directories looking for files.

  • The observing log should contain the minimum information that is required in the file metadata although repeated information (such as the observatory name and location?) could be abstracted into a header section.
  • The observing log should also contain information explaining why there are gaps in data coverage because of operational reasons, bad weather, etc. (?? assume that day/night and radiation belts can be calculated?; eclipse season?)
If it is not possible for an archive to hold all its observations on-line, the observing log can be used by a VO to identify that suitable observations have been made – a request for the required data can then be generated. This route could also be used to advertise the existence of proprietary data so that other users at least know that the observations exist.
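
A minimal sketch of such an observing log and of how a VO might query it is shown below. The log layout (comment lines for the header section, followed by comma-separated entries with a status and a note explaining any gap), the column names and the sample values are all hypothetical.

```python
import csv
from datetime import datetime

# Hypothetical log: repeated information sits in "#" header lines,
# and gaps in coverage are recorded together with their reason.
LOG_TEXT = """\
# observatory: Example Observatory
# instrument: cam1
start,end,status,note
2011-01-15T09:00:00,2011-01-15T12:00:00,observed,
2011-01-15T12:00:00,2011-01-15T15:00:00,gap,bad weather
2011-01-15T15:00:00,2011-01-15T17:00:00,observed,
"""


def read_log(text):
    """Split the log into a header dictionary and a list of entries."""
    header, data_lines = {}, []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")
            header[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    rows = []
    for row in csv.DictReader(data_lines):
        row["start"] = datetime.fromisoformat(row["start"])
        row["end"] = datetime.fromisoformat(row["end"])
        rows.append(row)
    return header, rows


def coverage(rows, start, end):
    """Return the log entries that overlap the interval from start to end."""
    return [r for r in rows if r["start"] < end and r["end"] > start]
```

A query for an interval then returns both the observations made and any recorded gaps, without the archive directories having to be searched.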

Revised Jan 2011, RDB