Coordination Action for the integration of Solar System Infrastructures and Science

File Management

Assuming that the data can be made easier to use across domain boundaries by improving the quality of the Observational Metadata, the next task is to make them more accessible. Following a few simple rules in the way files are named and in the directory structure used to hold them can make a significant difference to how easily they can be accessed.

File Naming and Storage

There are no hard and fast rules for file names, but each name should be distinctive enough that the file can be stored outside its "native environment", i.e. the location where it is normally held on the system of the group that generated it. In other words, the file should not cause confusion when it is removed from the context in which it is normally stored.

Ideally, the file name should also identify the type and origin of the file, the nature of the data and when the observations were made; it might also indicate whether an image represents a partial or the full field of view of the instrument.
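As an illustration only, the naming guidance above could be sketched as follows. The field order, the separator, the time-stamp layout and the example values ("obsx", "imgr", "l1") are all assumptions made for this sketch; they are not a convention defined by this document.

```python
from datetime import datetime, timezone


def build_file_name(origin, instrument, data_type, obs_start, full_fov,
                    extension="fits"):
    """Compose a self-describing file name (hypothetical layout).

    The name carries the origin, the instrument, the nature of the data,
    the observation start time and whether the full field of view is covered.
    """
    stamp = obs_start.strftime("%Y%m%dT%H%M%S")   # when the observation was made
    fov = "full" if full_fov else "part"          # full or partial field of view
    parts = [origin.lower(), instrument.lower(), data_type.lower(), stamp, fov]
    return "_".join(parts) + "." + extension


# Hypothetical example values, not taken from any real archive.
example_name = build_file_name(
    "obsx", "imgr", "l1",
    datetime(2010, 3, 14, 9, 26, 53, tzinfo=timezone.utc),
    full_fov=True,
)
print(example_name)  # obsx_imgr_l1_20100314T092653_full.fits
```

A name built this way remains interpretable when the file is copied away from the directory it was originally stored in.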

Although not strictly metadata in itself, the structure of the archive can make a lot of difference to how easily the data can be accessed. If the data are held in a hierarchical structure based on date and time, it is much simpler to create metadata that can be used to find the required data. Storage with this type of structure is essential for resource-poor providers and would be beneficial for data centres.
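A minimal sketch of a date-based hierarchy is given below. It assumes a year/month/day layout under a single archive root; the root "/archive" and the function names are hypothetical and chosen only for this example.

```python
from datetime import date, datetime, timedelta
from pathlib import PurePosixPath


def archive_path(root, obs_start, file_name):
    """Place a file under root/YYYY/MM/DD according to its observation start."""
    return PurePosixPath(root) / f"{obs_start:%Y/%m/%d}" / file_name


def day_directories(root, first_day, last_day):
    """List the day directories that cover an inclusive range of dates."""
    dirs = []
    day = first_day
    while day <= last_day:
        dirs.append(PurePosixPath(root) / f"{day:%Y/%m/%d}")
        day += timedelta(days=1)
    return dirs


# Hypothetical usage: where a file goes, and which directories to search.
target = archive_path("/archive", datetime(2010, 3, 14, 9, 26, 53), "example.fits")
search = day_directories("/archive", date(2010, 2, 27), date(2010, 3, 1))
```

Because the directory can be derived from the observation time alone, a search over a time interval only needs to visit the directories returned by `day_directories`, without any index maintained by the provider.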

A summary of observations that have been made, or Observational Catalogue, is useful and simplifies access. It is particularly useful if not all the observations in the archive are available on-line.
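The role of such a catalogue can be sketched as a list of entries that record each observation interval and whether the file is held on-line. The entry fields and the function name below are assumptions made for illustration; this document does not specify a catalogue format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CatalogueEntry:
    """One observation in a hypothetical Observational Catalogue."""
    file_name: str
    obs_start: datetime
    obs_end: datetime
    online: bool  # False if the file must be requested from off-line storage


def find_observations(catalogue, start, end):
    """Return entries whose observation interval overlaps [start, end)."""
    return [e for e in catalogue if e.obs_start < end and e.obs_end > start]


# Hypothetical catalogue with one on-line and one off-line observation.
catalogue = [
    CatalogueEntry("a.fits", datetime(2010, 3, 14, 9), datetime(2010, 3, 14, 10), True),
    CatalogueEntry("b.fits", datetime(2010, 3, 14, 11), datetime(2010, 3, 14, 12), False),
]
hits = find_observations(catalogue, datetime(2010, 3, 14, 9, 30), datetime(2010, 3, 14, 11, 30))
offline = [e.file_name for e in hits if not e.online]
```

A query against the catalogue shows that an observation exists even when the file itself is not on-line, so the user knows what to request.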

Data Provenance

If there are multiple copies of the data, it is essential to understand which copy is the master and which are the slaves. The master should always hold the most recent and most complete copy of a given dataset. The system should then provide the means to track which data are available on other sites and which version of processing they represent.
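One way to picture this tracking is a small record per copy, as in the sketch below. The record fields, the use of a file count as a measure of completeness and the site names are assumptions made for this example; they do not describe an existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCopy:
    """A copy of one dataset held at one site (hypothetical record)."""
    site: str
    is_master: bool
    processing_version: int
    file_count: int  # assumed here as a simple measure of completeness


def find_master(copies):
    """Return the single master copy; fail if the master is ambiguous."""
    masters = [c for c in copies if c.is_master]
    if len(masters) != 1:
        raise ValueError(f"expected exactly one master, found {len(masters)}")
    return masters[0]


def stale_sites(copies):
    """List sites whose copy is older or less complete than the master."""
    master = find_master(copies)
    return [
        c.site for c in copies
        if not c.is_master
        and (c.processing_version < master.processing_version
             or c.file_count < master.file_count)
    ]


# Hypothetical sites and values.
copies = [
    DatasetCopy("site-a", True, 3, 1000),
    DatasetCopy("site-b", False, 3, 1000),
    DatasetCopy("site-c", False, 2, 1000),
    DatasetCopy("site-d", False, 3, 950),
]
behind = stale_sites(copies)
```

Requiring exactly one master makes the ambiguity described above an explicit error rather than a silent inconsistency.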

Note that this issue has not been handled properly (if at all) in the past and should be addressed for future datasets.

Note also that the meaning of provenance in the particle physics community seems to differ from its meaning in our domain. The European Data Grid was designed around the needs of the LHC at CERN; in that context users are accustomed to extracting clumps of data for analysis and wish to keep track of those extracts, whereas in our context we are talking about entire archives. Unfortunately, much of the work that has been done in this area was carried out on behalf of the particle physics community and is flavoured by its needs.