by Michele Tobias — March 29, 2018

Geospatial Data for Storage & Exchange Guide

When thinking about storing and sharing digital geospatial data for the long term, we need to think about how to ensure data remain usable in the future.  Data formats that are popular and easy to read at one point in time may later be rendered unreadable by changes in software and updates to the format definitions.  Data can also come in proprietary formats that can only be opened by a particular piece of software.  Proprietary formats pose problems not only for data longevity (how do you ensure access to that particular software version later?) but also for access to the data if a potential user cannot install or afford the software needed to open the format.  Fortunately for the geospatial data professional, format exchange libraries like GDAL provide methods for converting one digital format into another, which helps keep data readable in open and known formats.  There are also steps we can take as researchers to ensure data are usable for the foreseeable future.

The Open Archival Information System (OAIS) provides a framework for archiving data and ensuring long-term availability and usability of data, but how do we apply this framework specifically to geospatial data?  We’ll go over the concepts of interest to researchers and data producers.

Michele Tobias

Data Management
GIS Data Curator/ Specialist

mmtobias@ucdavis.edu

(530) 752‑7532


Storage Formats

There are two important components to storing data.  First, the data must be in a format that is likely to be usable for several years to come.  Second, the data need metadata.  Metadata is documentation about the data.  It should include contact information for the data’s producer, information about the methods used to create the data, and any other information a person would reasonably need to know to use the data properly.

Exchange Formats

Data contained in a single file makes exchange easier.  Consider using .geojson (vector), geopackage (vector or raster), or geotiff (raster) for sending data to others.  Their single file format makes them an easier option for moving between computers.  As with storage, the second and equally important part of data exchange is to include metadata.

Geospatial science is one of the few academic disciplines that teaches students about metadata as a part of their core learning (Library Science is another).  It will not surprise geospatial professionals to know that, ideally, data stored for long-term use or sent to another user should be accompanied by metadata.  Metadata should contain information that a person using the data might need to know to use the data effectively.  For example, you might include the name of the person who made the data, which organization they work for, how to contact the person, a summary of methods or a citation for a paper that explains the methods, and data facts like the size of raster cells, the minimum mapping unit for digitized vector data, and the projection.  Geospatial data typically stores metadata in an .xml file (structured plain text, with tags similar to html), but you could also store the necessary information in a text file called README.txt.  The advantage of using the .xml standard for geospatial data is that graphical user interface-based GIS programs can read the metadata and display it in the layer properties for easy reading.
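As a minimal sketch of the README.txt approach, the following Python writes the kinds of fields listed above to a plain-text file using only the standard library.  The field names, the sample values, and the helper names `format_readme` and `write_readme` are illustrative assumptions and do not follow any metadata standard; standards-compliant .xml metadata requires the full FGDC schema.

```python
from pathlib import Path

def format_readme(fields):
    """Render metadata fields as 'Name: value' lines of plain text."""
    return "".join(f"{name}: {value}\n" for name, value in fields.items())

def write_readme(folder, fields):
    """Write README.txt into folder, next to the data it describes."""
    readme = Path(folder) / "README.txt"
    readme.write_text(format_readme(fields), encoding="utf-8")
    return readme

# Hypothetical placeholder values; replace them with facts about your own data.
example = {
    "Producer": "Jane Doe",
    "Organization": "Example University",
    "Contact": "jane.doe@example.edu",
    "Methods": "Digitized from 2017 aerial imagery; see the project report.",
    "Minimum mapping unit": "0.5 hectares",
    "Projection": "EPSG:3310 (NAD83 / California Albers)",
}
print(format_readme(example), end="")
```

Calling `write_readme("my_data_folder", example)` would place the file alongside the data so the two travel together.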

The USGS has a helpful collection of information and software related to metadata.  The Federal Geographic Data Committee (FGDC) sets standards for metadata.

Suggested Data formats:

Geopackage was designed as an exchange format, but also functions well for storage because it has been incorporated into the GDAL library and can be opened by all the common GIS programs.  Geopackage can contain raster or vector data.  Geopackage is not a good choice for all raster data.  Because it stores rasters as either a .png (for data with an alpha or transparency channel) or .jpg, geopackage can only store three data bands (plus alpha for .png).
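One reason a geopackage is recoverable without GIS software is that the single file is an SQLite database.  The sketch below lists the layers in a geopackage using only Python's standard library.  It assumes the file follows the GeoPackage specification's `gpkg_contents` table; `list_layers` is a hypothetical helper name, and the file is opened read-only so that stored data cannot be altered by accident.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

def list_layers(gpkg_path):
    """Return (table_name, data_type) pairs from the GeoPackage contents table."""
    # Open read-only via an SQLite URI so a missing file raises an error
    # instead of being silently created.
    uri = Path(gpkg_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
        ).fetchall()
    return rows

# Example use (the file name is a placeholder):
# for name, kind in list_layers("my_data.gpkg"):
#     print(name, kind)   # kind is typically "features" (vector) or "tiles" (raster)
```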

Geojson stores vector data in structured human-readable text.  Vertex locations are stored as coordinates in decimal degrees, written in longitude, latitude order.  As a storage format, this is ideal because minimal technology would be needed to recover the data in the event of massive technology failure.  Because the files are text-based (rather than binary), simple text comparison programs can be used to determine differences between files.  Geojson is a good exchange format for open source GIS programs, but can be tricky to use in the ESRI suite of software.
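To illustrate the point about text comparison, here is a small sketch that builds two versions of a point feature and compares them with Python's standard `difflib` module.  The coordinates, property values, and helper names are made up for illustration; note that the coordinate pair is written longitude first, then latitude.

```python
import difflib
import json

def point_feature(lon, lat, properties):
    """Build a GeoJSON Feature for a single point (longitude first, then latitude)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }

def to_text(feature):
    """Serialize with stable key order and indentation so text diffs are meaningful."""
    return json.dumps(feature, indent=2, sort_keys=True)

def diff_features(old, new):
    """Return the unified-diff lines between the text forms of two features."""
    return list(difflib.unified_diff(
        to_text(old).splitlines(), to_text(new).splitlines(),
        fromfile="old.geojson", tofile="new.geojson", lineterm="",
    ))

# Illustrative values only.
old = point_feature(-121.74, 38.54, {"name": "Site A"})
new = point_feature(-121.74, 38.54, {"name": "Site A (resurveyed)"})
for line in diff_features(old, new):
    print(line)
```

Only the changed `name` line appears with `-` and `+` markers in the output, which is the kind of comparison a binary format would not allow without specialized software.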

Geotiff stores raster data.  It is not limited to three bands and it supports “no data” cells; both of these features are an advantage over geopackage raster data.  Geotiff is a good exchange format because it can be natively loaded in all of the common GIS programs.

File formats to avoid for storage and exchange

  1. Formats that contain no data:
    1. Graphical user interface map project files, such as ArcMap’s .mxd files or QGIS’ .qgs files. These are files that store information about which data you want to use in a map and how to style it. They don’t contain any of the data itself, but rather pointers to where to find the data on your file system.
    2. Style Files such as ESRI’s Layer files (.lyr). These files contain information about how to style data, but don’t contain the data itself.
  2. Proprietary Formats are data formats that can only be opened by software from a particular company. ESRI’s Layer Package (.lpkx) format is an extreme example of a proprietary format.  It can only be opened (at the time of writing this article) by the ArcGIS Pro software, and not by other software developed by the parent company, ESRI.  ArcInfo Interchange Format (.e00) files require processing before they can be used in any software.  These formats are not bad for their intended use, but they do not make good storage or exchange formats because ensuring access to the data is difficult.
  3. Formats Requiring Multiple Files: Shapefile is a common format for vector data, but because Shapefiles require three or more files to be stored together on a computer, it is easy to misplace or lose some of the necessary files. ArcInfo Coverage is another example of a format requiring multiple files.
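If you do receive or inherit a Shapefile, a quick check for missing companion files can catch the problem early.  The sketch below is one way to do this with Python's standard library; `missing_parts` is a hypothetical helper name.  It treats .shp, .shx, and .dbf as the required parts and .prj (the projection definition) as recommended, and the match is case-sensitive on case-sensitive file systems.

```python
from pathlib import Path

REQUIRED = (".shp", ".shx", ".dbf")   # mandatory parts of a Shapefile
RECOMMENDED = (".prj",)               # projection definition; optional but usually needed

def missing_parts(shp_path, extensions=REQUIRED + RECOMMENDED):
    """Return the extensions whose companion file is absent next to shp_path."""
    base = Path(shp_path)
    return [ext for ext in extensions if not base.with_suffix(ext).exists()]

# Example use (the file name is a placeholder):
# print(missing_parts("roads.shp"))   # an empty list means all listed parts are present
```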