Hack 68. Convert Geospatial Data Between Different Formats

The wide range of arcane GIS data formats can be confusing and difficult to manage, but with the right tools, you can make quick work of them.

So you found some GIS data about an area of interest: now what do you do with it? In many cases, the next step is to try to import it into a GIS application, such as GRASS or Quantum GIS. But what if you want to convert it into a more useful format first? Or what if you just want to find out what format the data is already in, what coordinate system it uses, or what region it covers?

Before we address these questions, it might help to quickly review the basics of geospatial data formats. There are two kinds of geospatial data: vector and raster. Much like a digital photo, raster data takes the form of a matrix of rows and columns of rectangular pixels, or cells. In the same way that each pixel of a digital photo has some value describing its color, each cell of a raster layer has some value associated with it that represents some detail about the worldperhaps elevation, population, land-use patterns, rainfall, or ecological criteria, like rainfall or biodiversity. Raster layers are specified with extents, which are the spatial coordinates the layer represents, and spatial resolution, which indicates how much of the surface of the planet each cell corresponds to. Each layer of raster data in a given data set is referred to as a band, and a single data set can contain multiple bands.

GIS gurus like to say that "raster is faster, but vector is corrector." Vector data is made up of points, lines, and polygons. It's typically used to represent individual locations, such as water, rail and road networks, political boundaries, park areas, municipal districts, military reservations, and so on.

Of course, be it raster or vector, most geospatial data comes with a bit of metadata, which describes everything we need to know to use the data, including the extents, resolution or scale, data type (e.g., integer or floating point), coordinate system and datum, and more. Sometimes the metadata is stored in the same file or files as the data itself; more often it is stored in a separate set of files, which are sometimes referred to as world files or projection files.

Different GIS formats abound, many of them legacy or proprietary formatssuch as ESRI's Ungenerate, or E00, formatwhich have incomplete or nonexistent free/open source implementations. We'd suggest that the canonical forms for open geodata are GeoTIFF for raster data and ESRI Shapefile for vector data, with the latter remaining the standard until GML stabilizes and is popularized.

For most of our open source geodata conversion desires, we need look no further than Frank Warmerdam's amazing Geospatial Data Abstraction Library, or GDAL. The GDAL library and utilities deal exclusively with raster data formats. The OGR library and utilities, which are shipped as part of the GDAL package, are useful for managing vector data. (The acronym OGR doesn't actually stand for anything.) Together, GDAL and OGR form a sort of Swiss Army Knife for GIS data. The GDAL package is available in source and binary forms at http://remotesensing.org/gdal/. You can also find it in Debian APT under gdal-bin.

6.6.1. Converting Raster Data Between Different Formats

To illustrate use of GDAL, we'll convert a digital elevation model that we got from the U.S. Geological Survey's Seamless Data Distribution System [Hack #67] to GeoTIFF format. We selected a region of San Francisco and then opted to download the 1/3" National Elevation Data for the city. The data comes in ArcGrid format, in a ZIP file with an eight-digit filename. If we unzip this file and look inside, there's quite a lot of stuff there:

$ unzip 79314780.zip
$ cd 79314780
$ ls -R
79314780 METADATA.dbf METADATA.shp info
79314780.aux METADATA.prj METADATA.shx meta1.html
 
./79314780:
dblbnd.adf hdr.adf prj.adf sta.adf w001001.adf w001001x.adf
 
./info:
arc.dir arc0000.dat arc0000.nit arc0001.dat arc0001.nit

We can use the gdalinfo utility distributed with GDAL to examine the metadata properties of this data set. Ordinarily, gdalinfo is only run on a single file, but the important part of this complex arrangement of files is the directory named 79314780, so we'll run it on that:

$ gdalinfo 79314780
Driver: AIG/Arc/Info Binary Grid
Size is 570, 416
Coordinate System is:
GEOGCS["NAD83",
 DATUM["North_American_Datum_1983",
 ...
 AUTHORITY["EPSG","4269"]]
Origin = (-122.517015,37.812947)
Pixel Size = (0.00027777,-0.00027777)
Corner Coordinates:
Upper Left (-122.5170148, 37.8129466) 
Lower Left (-122.5170148, 37.6973942) 
Upper Right (-122.3586859, 37.8129466) 
Lower Right (-122.3586859, 37.6973942) 
Center (-122.4378504, 37.7551704) 
Band 1 Block=285x4 Type=Float32, ColorInterp=Undefined
 Min=-2.201 Max=284.040

There's a ton of metadata here, which we excerpted earlier. The key things to note are that the data is in Arc/Info Binary Grid format, 570 pixels wide and 416 high, referenced to geographic coordinates (i.e., latitude and longitude) against the NAD 1983 datum. The origin and resolution (i.e., pixel size) are given, and from these figures the coordinates of the corners are collected. This data set consists of one band representing elevation above sea level, presumablyin 32-bit floating-point representation, ranging from about -2 (meters, as it happens) to 284.

Now that we know a little bit about this data set, we can use another GDAL utility, gdal_translate, to convert our elevation to GeoTIFF, possibly in order to import it into another application:

$ gdal_translate -of GTiff 79314780 sf_dem.tif 
Input file size is 570, 416
0...10...20...30...40...50...60...70...80...90...100 - done.

If you run gdalinfo on the resulting GeoTIFF, you'll see that the essential details are preserved. The single output file contains all of the data, and most of the metadata, of the original zipped ArcGrid file. gdal_translate can read and write a variety of formats; run it without any options to see which ones your version supports.

6.6.2. Clipping and Warping Raster Data

gdal_translate has some other tricks up its sleeve. For example, if we want to extract just the portion of the digital elevation model that covers the Presidio, a national park in the northwest corner of San Francisco, we can use the "projection window" parameter of gdal_translate to clip out that bit and save it to a separate file. (Note that, as usual, longitude comes before latitude because it's the x-coordinate.)

$ gdal_translate -projwin -122.4862 37.8116 -122.4462 37.7860 sf_dem.tif presidio.tif
Input file size is 570, 416
Computed -srcwin 111 4 144 92 from projected window.
0...10...20...30...40...50...60...70...80...90...100 - done.

If you don't specify an output format with -of, gdal_translate assumes you want a GeoTIFF by default.

Another common problem you find when working with geodata is that you're using one coordinate system or projection in your GISsay, Universal Transverse Mercatorwhile the data you have is in another projection or coordinate system, such as latitude and longitude. Unfortunately, the situation is a bit like apples and oranges, so the new data needs to be converted to your existing projection. This process is referred to as reprojecting or warping the raster data, and, quite logically, the tool for the task is called gdalwarp. If we want to warp our DEM of the Presidio into a UTM projection in order to, say, combine it with a UTM topo layer, we might use gdalwarp as follows:

$ gdalwarp -t_srs "+proj=utm +zone=10" presidio.tif presidio_utm.tif
0...10...20...30...40...50...60...70...80...90...100 - done.

The -t_srs parameter stands for "target spatial reference system," which is quite a mouthful, but it just refers to the projection you want your data warped into. gdalwarp also supports a -s_srs option, short for "source spatial reference system," which allows you to specify the projection and/or coordinate system that your data originates in. You usually won't need to bother specifying a source projection, but it's handy to have when dealing with raster data that doesn't have projection metadata directly attached. The projection parameters for gdalwarp use the same format as the PROJ.4 toolkit [Hack #26] and are documented more thoroughly on the PROJ.4 homepage (http://remotesensing.org/proj/).

gdal_translate and gdalwarp have plenty of other options, so don't hesitate to read their manpages!

6.6.3. Once More from the Top, with Vectors

That's all well and good for raster data, but what about vector data? Fortunately, the OGR half of the GDAL toolkit features a counterpart to gdalinfo called ogrinfo. We can use ogrinfo to determine the features of a given vector data set as follows:

$ unzip tgr06075.zip
$ ls TGR*
TGR06075.MET TGR06075.RT4 TGR06075.RT7 TGR06075.RTC TGR06075.RTI TGR06075.RTS
TGR06075.RT1 TGR06075.RT5 TGR06075.RT8 TGR06075.RTE TGR06075.RTP TGR06075.RTT
TGR06075.RT2 TGR06075.RT6 TGR06075.RTA TGR06075.RTH TGR06075.RTR TGR06075.RTZ
$ ogrinfo TGR06075.RT1
ERROR 4: Tiger Driver doesn't support update.
Had to open data source read-only.
INFO: Open of `TGR06075.RT1'
using driver `TIGER' successful.
1: CompleteChain (Line String)
2: AltName (None)
3: FeatureIds (None)
4: ZipCodes (None)
5: Landmarks (Point)
...

Here, we take the ZIP file from the Census Bureau's download site containing the TIGER/Line data for San Francisco and unpack it into the current directory. As you can see, the 2003 TIGER/Line data set contains 17 different record types, plus a metadata file. ogrinfo shows the name of each layer, along with the geometry type. If we give ogrinfo a specific layer name, it returns the metadata about that particular layer:

$ ogrinfo -ro -so TGR06075.RT1 CompleteChain
Layer name: CompleteChain
Geometry: Line String
Feature Count: 16355
Extent: (-123.173825, 37.639830) - (-122.281780, 37.929824)
Layer SRS WKT:
GEOGCS["NAD83",
 DATUM["North_American_Datum_1983",
 SPHEROID["GRS 1980",6378137,298.257222101]]]
MODULE: String (8.0)
TLID: Integer (10.0)
SIDE1: Integer (1.0)
SOURCE: String (1.0)
FEDIRP: String (2.0)
FENAME: String (30.0)
FETYPE: String (4.0)
FEDIRS: String (2.0)
CFCC: String (3.0)
...

The -ro option in ogrinfo stands for "read-only" and is there mostly just to spare the warning about TIGER/Line being read-only. More importantly, the -so option stands for "summary only"; try running ogrinfo on the TIGER/Line CompleteChain layer without it! The layer metadata more or less speaks for itself. The layer consists of line stringsas opposed to polygons or pointsand the GEOGCS["NAD83", ...] bit tells us that the coordinate system is latitude and longitude, using the NAD 1983 datum. After the metadata properties, we get a list of data attributes associated with each geometric feature in the layer. So far, so good.

You might be wondering what a geometry type of "None" signifies. In this case, it represents data in TIGER/Line that isn't attributable to any particular geometric feature. OGR can extract the values to a DBase-type .dbf file, but that's about it.

Now, let's extract the CompleteChain layerwhich is, in this case, all the line segments of various kinds in San Franciscointo a more portable ESRI Shapefile, possibly for import into another GIS application.

$ ogr2ogr -f "ESRI Shapefile" san_fran.shp TGR06075.RT1 CompleteChain
$ ls san_fran.*
san_fran.dbf san_fran.prj san_fran.shp san_fran.shx

Here, we've used OGR's vector-wielding counterpart to gdal_translate, a utility called ogr2ogr. The -f option tells ogr2ogr in what format we expect the output; otherwise, the order of arguments is reversed from gdal_translate, in that the target file is put first and the data source second. This is primarily so that a list of layers to be converted can be given afterward. (We only specified one layer, but ogr2ogr permits multiple layers to be converted at once.) The output is an ESRI Shapefileactually, a collection of files which together can be imported into another application. As you might expect, you can run ogrinfo on the resulting shapefile to see that the data is indeed converted. You can run ogr2ogr without any arguments to see which formats your build supports.

6.6.4. Points and Lines and Polygons, Oh My

Naturally, ogr2ogr also offers spatial clipping of vector data, which can be accessed via the -spat parameter. Somewhat more interesting is the option to filter based on attribute values, using an SQL-like syntax, with the -where parameter. For example, the following generates a shapefile of interstate freeways in San Francisco, by filtering the "feature name" attribute for routes beginning with the prefix I-:

$ ogr2ogr -where 'fename like "I-%"' sf_freeways.shp TGR06075.RT1 CompleteChain

The following example, perhaps slightly more useful, extracts just the street features from all the other gobbledygook in TIGER/Line, by matching features with Census Feature Classification Codes beginning with the letter A:

$ ogr2ogr -where 'cfcc like "A%"' sf_streets.shp TGR06075.RT1 CompleteChain

As you might have guessed, if the -f option to ogr2ogr is omitted, output to ESRI Shapefile is assumed.

Most basic SQL comparison and logical operators (e.g., =, <>, and, or, etc.) are supported in the -where clause. Finally, should you ever find it necessary, ogr2ogr will happily reproject your vector data sets, using the same good old -t_srs and -s_srs options supported by gdalwarp. Of course, ogr2ogr has a lot of other options, so be sure to peruse the manpage if you need to munge some vector data and you're not sure if OGR can do it.

One of the most frequent causes of the failure of two geospatial data sets in the same projection to properly line up with each other is a difference in geodetic datum, which is the definition of the precise origin of the spatial coordinate system. There are hundreds of standard geodetic datums, with NAD27, NAD83, and WGS84 being the most common you'll encounter.

GDAL can fix datum mismatches. For example, the following bit of code translates the NAD27-referenced hydrographic data from [Hack #64] to NAD83, to make it match up with the underlying raster layer:

$ ogr2ogr -s_srs "+proj=UTM +zone=12 +datum=NAD27" 
 -t_srs "+proj=UTM +zone=12 +datum=NAD83"  
hydro_nad83.shp hydro_main.shp

Similarly, the following command, which uses the gdalwarp utility, converts a raster in GeoTIFF format from NAD83 to NAD27:

$ gdalwarp -s_srs "+proj=UTM +zone=12 +datum=NAD83" -t_srs 
"+proj=UTM +zone=12  +datum=NAD27" doq_nad83.tif doq_nad27.tif

As a side note, WGS84 and NAD83 are identical to the eighth decimal place, so you usually don't need to convert between them.

The GDAL and OGR utilities can take a little getting used to, but they make a lot of things possible and even easy that would have been difficult or impossible otherwise. What's more, many an open source GIS application depends on the GDAL and OGR libraries for GIS data handling.

Mapping Your Life

Mapping Your Neighborhood

Mapping Your World

Mapping (on) the Web

Mapping with Gadgets

Mapping on Your Desktop

Names and Places

Building the Geospatial Web

Mapping with Other People