Data management#

A challenging aspect of tutorial development is how to manage dataset dependencies. Ideally, the data you use will be publicly accessible and permanent for reproducibility. In this section we present practical guidelines for effective data sharing for Hackweek Tutorials and Projects.

Important

Remember, a hackweek tutorial is learning-oriented and should guide participants through a step-wise process with a meaningful outcome. If you typically work with large datasets, consider designing your tutorial to work with a small subset (~10 MB) that still enables your learning objectives to be met.

Computational resource considerations#

In order for tutorial notebooks to be executable on widely available public computing infrastructure, we recommend targeting limited computational requirements such as a 2-core CPU, 8 GB of RAM, and 10 GB of disk space (at the time of writing).

Guidelines for Tutorials#

Try to use the smallest amount of data possible for your tutorial. If your tutorial starts with downloading data from a remote location, keep in mind that it may take longer than usual if hundreds of participants are accessing the same datasets simultaneously. Below we provide recommendations for common data volumes.

< 10 MB#

If your tutorial just needs a small image, or tabular data like a .csv file, go ahead and add it to the repository along with your tutorial code.
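Once the file is committed alongside your notebook, participants can load it with a relative path and no download step. A minimal sketch, assuming a hypothetical `data/example.csv` in the tutorial repository:

```python
import pandas as pd

# Load a small dataset committed to the tutorial repository.
# "data/example.csv" is a hypothetical path relative to this notebook.
df = pd.read_csv("data/example.csv")
print(df.head())
```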

10 - 100 MB#

You can create a separate repository on GitHub to publicly host your tutorial dataset. Per GitHub repository limits, individual files must be under 100 MB.

Note

If using a subset be sure to capture data provenance, for example by including a script that you used to access the original full-sized dataset from the data provider.
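A short script checked into the repository can document exactly how the subset was produced. A minimal sketch, assuming a hypothetical xarray-readable source file; the filename, coordinate names, and bounds are placeholders:

```python
import xarray as xr

# Provenance script: documents how the tutorial subset was derived.
# "full-dataset.nc" is a hypothetical file obtained from the data provider;
# coordinate names and bounds below are placeholders.
ds = xr.open_dataset("full-dataset.nc")

# Keep only a small spatial and temporal slice that still meets
# the tutorial's learning objectives
subset = ds.sel(time="2020-01", lat=slice(45, 50), lon=slice(-125, -120))

# Save the ~10 MB subset to commit alongside the tutorial code
subset.to_netcdf("data/tutorial-subset.nc")
```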

GitHub Release artifacts#

Generally it is not advisable to store binary files in GitHub repositories. Even if you make only a small change to a file, an entire new copy is saved in the revision history, so the size of the repository quickly becomes unwieldy.

GitHub Releases are a feature of GitHub repositories that archive a snapshot of files in your repository in addition to other auxiliary files. According to official GitHub documentation:

You can create a release to package software, along with release notes and links to binary files, for other people to use.

At the time of writing, each file included in a release must be under 2 GiB. So storing tutorial data as files attached to a GitHub release of tutorial code can work well to keep code and associated data together.

Note

The GitHub Command Line Interface (CLI) provides a convenient method for downloading release data: https://cli.github.com/manual/gh_release_download
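If you prefer to stay in Python, release assets can also be fetched from their public download URL. A minimal sketch using pooch, where the owner, repository, tag, and filename in the URL are hypothetical placeholders:

```python
import pooch

# Download a file attached to a GitHub release.
# Every part of this URL is a hypothetical placeholder.
fname = pooch.retrieve(
    url="https://github.com/example-org/example-tutorial/releases/download/v1.0/tutorial-data.zip",
    known_hash=None,  # ideally pin a sha256 hash for reproducibility
)
print(fname)  # local path to the cached download
```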

> 100 MB#

Stream from URLs#

Increasingly there are ways to load remote data in a streaming fashion, which allows you to avoid downloads and storage altogether! Essentially, this means using software that can read URLs instead of local file paths, as demonstrated at the beginning of this tutorial.
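For example, fsspec can open a file directly over HTTP so that analysis libraries read it as if it were local. A minimal sketch, with a hypothetical URL:

```python
import fsspec
import pandas as pd

# fsspec opens the remote file over HTTP; pandas then reads it
# as if it were a local file. The URL is a hypothetical placeholder.
with fsspec.open("https://example.com/data/observations.csv") as f:
    df = pd.read_csv(f)

print(df.head())
```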

Note

Software that can read URLs still ultimately must download data! The data will either be held in RAM or written to a temporary file on your hard drive, so be aware that you are still constrained by your local computing resources.

Warning

Check that scheduled server downtime for maintenance isn’t planned during your presentation! Also, be aware that URLs can be changed at any time by the data provider.

Use Zenodo.org#

Another approach is to upload your data to Zenodo, which at the time of writing has a standard 50 GB per-record limit (https://library.cfa.harvard.edu/data-archiving-and-sharing).

Note

fatiando/pooch is a nice Python utility for fetching data from Zenodo.
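A minimal sketch of fetching a file from a Zenodo record with pooch; the DOI and filename are hypothetical placeholders:

```python
import pooch

# pooch resolves the DOI to the Zenodo record and downloads the named file.
# The DOI and filename are hypothetical placeholders.
fname = pooch.retrieve(
    url="doi:10.5281/zenodo.1234567/tutorial-data.nc",
    known_hash=None,  # ideally pin a sha256 hash for reproducibility
)
```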

Data permanence considerations#

Be aware that GitHub repositories can be deleted at any time by repository owners. For guaranteed long-term (10+ years) hosting of a tutorial dataset that receives a citable Digital Object Identifier (DOI), you can use Zenodo.org. You can easily link a GitHub repository with Zenodo.

Guidelines for Projects#

JupyterHub Data Sharing#

During a hackweek, teams often want to share data with each other for collaborative analysis. In contrast to tutorial datasets, which are usually hand-picked, project data is dynamic and changes over time. By using a JupyterHub during a hackweek, participants can take advantage of networked storage drives and pre-configured Cloud Object Storage.
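On hubs with pre-configured object storage credentials, teams can exchange files through a shared bucket. A minimal sketch using s3fs; the bucket name and paths are hypothetical and depend entirely on your hub's configuration:

```python
import s3fs

# Connect using credentials pre-configured on the JupyterHub.
# The bucket and key names are hypothetical placeholders.
fs = s3fs.S3FileSystem()
fs.put("results/team-analysis.csv", "example-hackweek-scratch/my-team/team-analysis.csv")

# Teammates can then list and download the shared files
print(fs.ls("example-hackweek-scratch/my-team"))
```

On hubs that instead provide a shared network drive, simply copying files into the shared directory accomplishes the same thing.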

Note

JupyterHubs do not always have the same configuration, but we encourage you to review this guide from 2i2c which explains options for JupyterHub storage (https://docs.2i2c.org/user/topics/data/)