How to use the Data Directory Cataloger¶
This page explains how the content of the Directories is created and updated.
We will use the directory /shared/eresearch as an example, because all HPC users can read most of the contents of that directory.
Overview¶
The Data Directory Cataloger (the DDC) helps your research group to manage its data directories
on the HPC. It also helps eResearch to know what the data is, and who manages that data. We use
the DDC program to help us to manage our own directories under /shared.
Your research group can start documenting your own data by placing a small text file named README.yaml in some of your directories. This file stores metadata about the contents of that directory. Typical metadata that you would store might be “Title”, “Description”, “Data Manager” and “Disposal Date”.
See the General References for external links that explain what metadata and YAML are.
When the DDC program is run on a directory, it looks for README.yaml files in the immediate subdirectories. From those README.yaml files it reads the metadata, and outputs a single Markdown document listing each subdirectory and summarising the metadata in its README.yaml file. This Markdown document can then be easily transformed into an HTML page, which provides a single point of information about the contents of the directories. See the General References for an external link that explains what Markdown is.
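The process described above can be sketched in a few lines of Python. This is only an illustration and not the actual DDC code: the function names `read_metadata` and `catalog` are hypothetical, and the parser assumes that every README.yaml contains only simple `Field: value` lines like the examples on this page, so it is not a full YAML parser.

```python
from pathlib import Path


def read_metadata(readme: Path) -> dict[str, str]:
    """Parse simple 'Field: value' lines (an assumption; not full YAML)."""
    metadata = {}
    for line in readme.read_text(encoding="utf-8").splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


def catalog(top: Path, fields: list[str]) -> str:
    """Return a Markdown table with one row per immediate subdirectory."""
    lines = [
        f"# Directory {top}",
        "",
        "| Directory | " + " | ".join(fields) + " |",
        "|" + " --- |" * (len(fields) + 1),
    ]
    for subdir in sorted(p for p in top.iterdir() if p.is_dir()):
        readme = subdir / "README.yaml"
        # A subdirectory without a README.yaml gets an empty row.
        metadata = read_metadata(readme) if readme.is_file() else {}
        cells = [metadata.get(field, "") for field in fields]
        lines.append(f"| {subdir.name} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

For example, `print(catalog(Path("/shared/eresearch"), ["Title", "Data Manager"]))` would print a Markdown table with one row for each subdirectory of /shared/eresearch.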
The real advantages of the program are realised when multiple directories are cataloged and the set of pages is combined into a website like this one.
You don’t need a README.yaml file in every directory, only in the important, top-level directories.
What a README.yaml File Looks Like¶
Example 1: /shared/eresearch/pbs_job_examples/README.yaml
Title: HPC Examples for PBS Job Submission
Description: Contains some simple examples of how to submit PBS jobs.
Data Manager: Mike Lake
Earliest possible disposal date: 2025
Example 2: /shared/eresearch/pbs_manuals/README.yaml
Title: Copies of the PBS Manuals for Users
Description: Contains copies of the PBS manuals which users can download.
Data Manager: Mike Lake
Earliest possible disposal date: 2025
You can see that in each directory the README.yaml file contains metadata fields and metadata values which describe the contents of the directory. If you are logged into the HPC you can see these files by running these commands:
$ cat /shared/eresearch/pbs_job_examples/README.yaml
and
$ cat /shared/eresearch/pbs_manuals/README.yaml
The DDC program has extracted that metadata and created a web page for the /shared/eresearch directory. Have a look and you will find this same metadata in the table on that page here: Directory /shared/eresearch
Note: The README.yaml files should contain the same metadata fields for each directory at the same level. The idea is that the directories at a given level are in some way related and so would have the same required metadata fields. The DDC program looks at all the fields, and if any README.yaml file is missing a field that the others have, it flags a warning that a metadata field might be missing.
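The check described in this note can be sketched as follows. This is a minimal illustration, not the DDC's own code: the function name `find_missing_fields` is hypothetical, and it assumes the metadata for each directory has already been read into a dictionary.

```python
def find_missing_fields(
    metadata_by_dir: dict[str, dict[str, str]],
) -> dict[str, list[str]]:
    """Map each directory to the fields it lacks but a sibling has."""
    # Collect every field name used by any directory at this level.
    all_fields = set()
    for metadata in metadata_by_dir.values():
        all_fields.update(metadata)

    warnings = {}
    for directory, metadata in metadata_by_dir.items():
        missing = sorted(all_fields - set(metadata))
        if missing:
            warnings[directory] = missing
    return warnings


# Hypothetical data: the second directory has no "Data Manager" field.
example = {
    "pbs_job_examples": {"Title": "HPC Examples", "Data Manager": "Mike Lake"},
    "pbs_manuals": {"Title": "Copies of the PBS Manuals for Users"},
}
print(find_missing_fields(example))
```

A directory appears in the result only if it lacks a field that at least one other directory at the same level provides.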
Sections in the Directory Pages¶
The Directory pages are listed on the left-hand side under “Directories”. The “Summary” section of a directory page summarises some of the metadata fields. It always shows the directory that each README.yaml file was found in. Not all README metadata fields are shown, only the fields that are likely to be useful to users. You can add other metadata fields to the README.yaml files if you wish.
The “Metadata Information” section contains a list of all the metadata fields found in the README.yaml files for this level of directories. This section is more likely to be used by administrators wanting to know what metadata can be found in the README.yaml files. You can have any number of fields, but it’s best to keep them few and simple. Just make sure that each directory at the same level has the same fields.
A “Metadata Warnings” section will be shown at the top of the page if there are README.yaml files that are possibly missing a metadata field, or if a directory is missing a README.yaml file.
Remember, if you are logged into the HPC, you can always use the cat command to print the README.yaml file to the screen and see all the metadata, i.e.
$ cat /path/to/README.yaml
Automatic Updating of this Site¶
Each evening the HPC runs the Data Directory Cataloger program, which recreates these web pages. So if you update a README.yaml file, the new content will automatically appear in the directory pages once the program has run.
At the bottom of each directory page will be the date and time of the last update.
Other Metadata Fields One Could Use¶
- Title
- Description
- “Data Manager” or “Maintainer”
- Provenance e.g. Downloaded from xxx on 2020.01.01
- RDMP link
- “Minimum retention period” or “Earliest possible disposal date”
The “RDMP link” would be a link to the Research Data Management Plan for this data in UTS Stash.
Suggestions for Your Group’s Shared Directory¶
As a good example we will use the Climate Change Cluster’s shared directory. Their site is here: C3 Data Directory Catalog Home
The Climate Change Cluster group has a private directory called /shared/c3. Only users of the C3 group can access this directory, so you might not be able to look inside, but here is how they have set up their content.
They have 5 directories under /shared/c3:
apps <== contains applications
archives <== contains archived projects
bio_db <== contains common bioinformatics databases
instruments <== contains information on their lab instruments
projects <== contains 97 project directories
Each of those directories has a README.yaml file describing the contents and who manages that data. Each of the README.yaml files contains the same metadata fields.
When the DDC program runs, it combines that metadata into the page that you can view here: Directory /shared/c3
Disallowed Characters¶
There are some characters that you should not include in the README.yaml files.
These characters are: < > { } ( ) ;
If those characters are included then some of those characters may be removed
from the text. The right hand side “Table of contents” for that page will also
show a link “Metadata Warnings” and the bottom of the page will show:
Metadata Warnings
The following README.yaml files contained at least one of the disallowed characters: < > { } ( ) ;
This might be a bit inconvenient but it helps to ensure the security of users browsing the site.
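If you want to check your own text before saving it in a README.yaml file, a small Python sketch such as the following could help. The names `DISALLOWED`, `has_disallowed` and `strip_disallowed` are hypothetical and are not part of the DDC. This sketch removes every disallowed character, which is an assumption, as the DDC itself may remove only some of them.

```python
# The disallowed characters listed above: < > { } ( ) ;
DISALLOWED = "<>{}();"


def has_disallowed(text: str) -> bool:
    """Return True if the text contains any disallowed character."""
    return any(ch in DISALLOWED for ch in text)


def strip_disallowed(text: str) -> str:
    """Return the text with all disallowed characters removed."""
    return text.translate(str.maketrans("", "", DISALLOWED))


description = "Contains copies of the PBS manuals (PDF); see <docs>"
if has_disallowed(description):
    print(strip_disallowed(description))
```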
More Detailed Help¶
The Python code for the Data Directory Cataloger is open source and is available from the GitHub source repository here: https://github.com/UTS-eResearch/data_directory_cataloger
Have a look at the Usage Doc in the Git repository for this software if you wish to:
- Look at multiple README.yaml files.
- Create multiple new README.yaml files in a directory.
- Add a field to multiple README.yaml files.
- Remove a field from multiple README.yaml files.
- Modify a field in multiple README.yaml files.
The FAQ, in the Git repository, covers some common questions.
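As a rough illustration of one such task, adding a field to multiple README.yaml files, here is a minimal Python sketch. It is not taken from the Usage Doc and may differ from the method described there. The function name `add_field` is hypothetical, and the sketch assumes the files contain only simple `Field: value` lines.

```python
from pathlib import Path


def add_field(top: Path, field: str, value: str) -> list[Path]:
    """Append 'field: value' to each README.yaml one level below top.

    Files that already contain the field are left unchanged.
    Returns the list of files that were modified.
    """
    changed = []
    for readme in sorted(top.glob("*/README.yaml")):
        text = readme.read_text(encoding="utf-8")
        present = False
        for line in text.splitlines():
            key, sep, _ = line.partition(":")
            if sep and key.strip() == field:
                present = True
                break
        if present:
            continue
        # Make sure the new field starts on its own line.
        if text and not text.endswith("\n"):
            text += "\n"
        readme.write_text(text + f"{field}: {value}\n", encoding="utf-8")
        changed.append(readme)
    return changed
```

Try a sketch like this on a copy of your directories first, since it rewrites the files in place.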
General References¶
What is metadata?: https://en.wikipedia.org/wiki/Metadata
What is YAML?: https://hitchdev.com/strictyaml/what-is-yaml/
What is Markdown?: https://www.markdownguide.org/getting-started/
Mike Lake
October 2022