GIS Data Curator/ Specialist
by Michele Tobias – March 29, 2018
When thinking about storing and sharing digital geospatial data for the long term, we need to think about how to ensure the data remain usable in the future. Data formats that are popular and easy to read at one point in time may later be rendered unreadable by changes in software and updates to the format definitions. Data can also come in proprietary formats that can only be opened by a particular piece of software. Proprietary formats pose problems not only for data longevity (how do you ensure access to that particular software version later?) but also for access to the data if a potential user cannot install or afford the software needed to open the format. Fortunately for the geospatial data professional, format exchange libraries like GDAL, which provide methods for converting one digital format into another, help ensure data stay readable in open and known formats. There are also steps we can take as researchers to ensure data are usable for the foreseeable future.
The Open Archival Information System (OAIS) provides a framework for archiving data and ensuring long-term availability and usability of data, but how do we apply this framework specifically to geospatial data? We’ll go over the concepts of interest to researchers and data producers.
There are two important components to storing data. First, the data must be in a format that is likely to be usable for several years to come. Second, the data need metadata. Metadata is documentation about the data. It should include contact information for the data’s producer, information about the methods used to create the data, and any other information a person would reasonably need to know to use the data properly.
Data contained in a single file makes exchange easier. Consider using GeoJSON (.geojson, vector), GeoPackage (vector or raster), or GeoTIFF (raster) for sending data to others. Because each of these formats keeps a dataset in a single file, they are easier to move between computers. As with storage, the second and equally important part of data exchange is including metadata.
Geospatial science is one of the few academic disciplines that teaches students about metadata as part of their core learning (Library Science is another). It will not surprise geospatial professionals that, ideally, data stored for long-term use or sent to another user should be accompanied by metadata. Metadata should contain the information a person might need to use the data effectively. For example, you might include the name of the person who made the data, which organization they work for, how to contact them, a summary of methods or a citation for a paper that explains the methods, and data facts like the projection and the size of raster cells or the minimum mapping unit for digitized vector data. Geospatial metadata is typically stored in an .xml file (structured plain text, with tags similar to html), but you could also store the necessary information in a text file called README.txt. The advantage of using the .xml standard for geospatial data is that graphical user interface-based GIS programs can read the metadata and display it in the layer properties for easy reading.
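As a minimal sketch of the README.txt option, the Python below formats a handful of metadata fields as plain "Label: value" lines and can save them next to a dataset. The field labels, the example values, and the function names build_readme and write_readme are all hypothetical, chosen only to mirror the kinds of information listed above; they do not follow any formal metadata standard.

```python
from pathlib import Path


def build_readme(metadata):
    """Format metadata fields as one 'Label: value' line per field."""
    lines = [f"{label}: {value}" for label, value in metadata.items()]
    return "\n".join(lines) + "\n"


def write_readme(data_dir, metadata):
    """Write README.txt into data_dir and return the path to the new file."""
    readme_path = Path(data_dir) / "README.txt"
    readme_path.write_text(build_readme(metadata), encoding="utf-8")
    return readme_path


# Hypothetical example values; replace them with the facts about your own data.
example_metadata = {
    "Creator": "Jane Example",
    "Organization": "Example University",
    "Contact": "jane@example.org",
    "Methods": "Digitized from aerial imagery; see the accompanying report.",
    "Projection": "WGS 84 (EPSG:4326)",
    "Minimum mapping unit": "0.5 hectares",
}

# Preview the text that would be saved alongside the data.
print(build_readme(example_metadata))
```

Calling write_readme("my_dataset", example_metadata) would save the same text as my_dataset/README.txt, assuming that folder already exists. A plain-text file like this can be read without GIS software, at the cost of not appearing in a GIS program's layer properties the way .xml metadata does.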
GeoPackage was designed as an exchange format, but it also functions well for storage because it has been incorporated into the GDAL library and can be opened by all the common GIS programs. GeoPackage can contain raster or vector data, but it is not a good choice for all raster data. Because it stores rasters as either .png (for data with an alpha or transparency channel) or .jpg, GeoPackage can only store three data bands (plus an alpha band for .png).
GeoJSON stores vector data in structured, human-readable text. Vertex locations are stored as longitude/latitude coordinates in decimal degrees. As a storage format, this is ideal because minimal technology would be needed to recover the data in the event of massive technology failure. Because the files are text-based (rather than binary), simple text comparison programs can be used to find the differences between files. GeoJSON is a good exchange format for open source GIS programs, but it can be tricky to use in the ESRI suite of software.
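To illustrate the point about text comparison, here is a small sketch that uses only the Python standard library. It builds two versions of a one-point GeoJSON feature collection and prints a line-by-line diff. The site name, the coordinates, the file labels, and the helper functions point_feature and to_geojson_text are made up for the example.

```python
import difflib
import json


def point_feature(name, lon, lat):
    """Build a GeoJSON point feature; positions are ordered [longitude, latitude]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }


def to_geojson_text(features):
    """Serialize features as an indented FeatureCollection with stable key order."""
    collection = {"type": "FeatureCollection", "features": features}
    return json.dumps(collection, indent=2, sort_keys=True)


# Two versions of the same hypothetical dataset; only the longitude differs.
original = to_geojson_text([point_feature("Site A", -121.75, 38.54)])
revised = to_geojson_text([point_feature("Site A", -121.76, 38.54)])

# A unified diff reports only the lines that changed between the two versions.
diff_lines = list(
    difflib.unified_diff(
        original.splitlines(),
        revised.splitlines(),
        fromfile="original.geojson",
        tofile="revised.geojson",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Writing the files with consistent indentation and key order, as to_geojson_text does here, keeps the diff limited to real changes in the data. A file written on a single line would show up as one large changed line instead.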
GeoTIFF stores raster data. It can store an unlimited number of bands and supports “no data” cells. Both of these features are advantages over GeoPackage raster data. GeoTIFF is a good exchange format because it can be natively loaded in all of the common GIS programs.