Hack 71. Turn Your Tracklogs into ESRI Shapefiles

Converting your GPS tracklogs into a standard vector format allows you to import them into any GIS or mapping application automatically.

Once you've gotten your GPS tracklogs into GRASS as site data [Hack #70], something you'll notice almost right away is that, when you zoom in close, your tracklog points appear as, well, individual points, and not as the tracks you might expect. The reason for this hardly needs explanation. Your GPS receiver's tracklog is really just a sample of your location over time, rather than a continuous stream. When you import that data into GRASS as point data, you are telling GRASS that you want to see it as a set of points, rather than as the connected arc that (approximately) describes your path through time and space. What's more, as site data in GRASS, it's relatively difficult to get that data back out and display it in a GIS browser, such as Quantum GIS or Mapserver.

One neat answer to this whole set of problems is ESRI's Shapefile format, an openly published vector data format that has become a sort of lingua franca for GIS vector data. Shapefiles offer a way of portably grouping and storing points, sets of points, arcs, and polygons. Frank Warmerdam's shapelib project offers a good, free C API for reading and writing shapefiles, so most GIS and mapping applications that speak vector data will speak ESRI's shapefile format. We'll hack up a quick fifty-line Perl script to turn your gpstrans tracklogs into shapefiles, so that they can be exported to other applications as line data, rather than as point data.

As it happens, the term shapefile itself is a bit of a misnomer: a shapefile actually consists of three or four separate files, usually distributed together in a ZIP archive or tarball. The .shp file contains the actual vector data. Typically, you will find a .shx file that contains index data about the shapefile. Sometimes there's a .prj file that describes the projection of the vector data. Finally, you'll often also see a .dbf file, which is (no lie!) a DBase/FoxPro database that contains quantitative data about the features described in the shapefile, such as population or land area. You can actually browse this .dbf file in Excel or Gnumeric!
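If you're curious what's inside the .shp file itself, ESRI's published spec makes the fixed 100-byte header easy to decode; the main gotcha is that it mixes big-endian and little-endian fields. Here's a quick Python sketch (illustrative only; the hack itself never parses .shp files by hand, and the function name is ours):

```python
import struct

def parse_shp_header(header):
    """Parse the fixed 100-byte header of a .shp file (per the ESRI spec)."""
    # Bytes 0-3: magic file code 9994, big-endian.
    # Bytes 24-27: total file length in 16-bit words, big-endian.
    file_code, length_words = struct.unpack(">i20xi", header[:28])
    # Bytes 28-31: version (1000); bytes 32-35: shape type -- little-endian.
    version, shape_type = struct.unpack("<ii", header[28:36])
    return file_code, length_words, version, shape_type

# Build a synthetic header: an otherwise-empty file of shape type 3 (PolyLine).
header = struct.pack(">i20xi", 9994, 50) + struct.pack("<ii", 1000, 3) + b"\0" * 64
print(parse_shp_header(header))  # (9994, 50, 1000, 3)
```

Shape type 3 is the same PolyLine code that track2shp.pl assigns to $shp->{Shapetype} later in this hack.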

We'll use Geo::Shapelib, which is a thin layer of Perl wrapped around shapelib, to generate all three of these files from your tracklog.

You can get Geo::Shapelib from the CPAN using the following command as the superuser from a *NIX shell:

# perl -MCPAN -e 'install Geo::Shapelib'

In this script, which we'll call track2shp.pl, we'll also use the very handy Time::Piece module to parse the dates of the tracklog entries and do calculations with them. You can get Time::Piece from the CPAN.

Here's a sample bit of tracklog as output by gpstrans -dt:

Format: DDD UTC Offset: -8.00 hrs Datum[100]: WGS 84
T 02/07/2004 11:40:07 37.7396250 -122.4198818
T 02/07/2004 11:41:20 37.7395391 -122.4198604
T 02/07/2004 11:47:56 37.7394533 -122.4197745
T 02/07/2004 11:48:13 37.7394962 -122.4199033
T 02/07/2004 12:00:50 37.7393031 -122.4196029
T 02/07/2004 12:06:54 37.7394533 -122.4198604

As you can see, the gpstrans output format is very simple. There's a header line with some metadata, followed by several lines of data, each containing four tab-separated fields: the data type (i.e., T for tracklog), the timestamp, and a latitude/longitude pair. Obviously, the output from other GPS data download programs will look different, but the principle is the same. Extending the following code to work with the output of GARNIX and other programs is left as an exercise for the reader.
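Though the hack itself is written in Perl, the parsing step is easy to see in a quick Python sketch (the function name and layout are ours, for illustration only). It splits each line on tabs and parses the timestamp field with the same %m/%d/%Y %H:%M:%S format the Perl script uses:

```python
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M:%S"

def parse_track_line(line):
    """Split one gpstrans -dt line into (type, timestamp, lat, long).

    Returns None for lines that don't parse -- e.g. the metadata header."""
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4 or fields[0] != "T":
        return None
    try:
        when = datetime.strptime(fields[1], TIME_FORMAT)
    except ValueError:
        return None
    return fields[0], when, float(fields[2]), float(fields[3])

rec = parse_track_line("T\t02/07/2004 11:40:07\t37.7396250\t-122.4198818")
print(rec[1].isoformat(), rec[2], rec[3])
# 2004-02-07T11:40:07 37.739625 -122.4198818
```

Returning None for unparseable lines mirrors the Perl script's trick of skipping any line whose timestamp fails to parse, which conveniently discards the header line for free.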

On the Mapping Hacks web site, you'll find a GPX-based version of track2shp.pl under http://mappinghacks.com/code/.


6.9.1. The Code

#!/usr/bin/perl

use strict;
use warnings;
use Geo::Shapelib;
use Time::Piece;
use constant TIME_THRESHOLD => 10; # minutes
use constant TIME_FORMAT => "%m/%d/%Y %H:%M:%S";

my ($in, $out) = @ARGV;
die "Usage: $0 <tracklog> <shapefile>\n" unless $in and $out;

my $shp = new Geo::Shapelib;
$shp->{Name} = $out;
$shp->{Shapetype} = 3; # PolyLine, according to Geo::Shapelib source
$shp->{FieldNames} = ['ID', 'Date'];
$shp->{FieldTypes} = ['Integer', 'String:16'];

open TRACK, "<", $in or die "Can't open $in: $!";

my $id = 0;
my (@vertices, $previous);

sub add_shape {
    return unless @vertices > 2;
    push @{$shp->{Shapes}}, {
        SHPType   => 3, # PolyLine
        ShapeId   => $id,
        NVertices => scalar @vertices,
        Vertices  => [@vertices]
    };
    push @{$shp->{ShapeRecords}}, [$id++, $previous->ymd];
    @vertices = ( );
}

while (<TRACK>) {
    chomp;
    my ($type, $date, $lat, $long) = split /\t/, $_, 4;
    my $now = eval { Time::Piece->strptime($date, TIME_FORMAT) };
    next unless $now;

    if (@vertices) {
        add_shape( ) if $now - $previous > TIME_THRESHOLD * 60;
    }

    push @vertices, [$long, $lat]; # note, long before lat!
    $previous = $now;
}

add_shape( );
$shp->save($out);

In its broadest outline, track2shp.pl reads each line from the given tracklog, splitting the line up into its component fields and throwing away the lines it can't parse. The script builds up a set of vertices from the tracklog points, storing them together as needed in a single PolyLine shape. (PolyLine is just ESRI's term for a series of connected line segments.) Finally, all of the accumulated PolyLines are dumped to the shapefile(s).

6.9.2. Running the Code

Assuming the script is marked as executable and is in your current directory, you can run it as follows:

$ gpstrans -dt > tracklog.txt
$ ./track2shp.pl tracklog.txt tracklog.shp
$ ls
track2shp.pl tracklog.dbf tracklog.shp tracklog.shx tracklog.txt

As you can see, you should have three new files in the same directory: the .shp file containing the shapes, the .shx index file, and the .dbf datafile.

Figure 6-31 shows a sample map of a trip we made through Los Angeles, stopping off in Hollywood to pick up friends and then visiting Venice Beach, before dropping our friends off and continuing north on Interstate 5. This map was made with about five minutes of work in Quantum GIS, using the U.S. state borders and major roads layers from http://nationalatlas.gov and, of course, a GPS track processed with gpstrans and track2shp.pl.

Figure 6-31. A Shapefile made from a trip through Los Angeles


6.9.3. Understanding the Code

We have glossed over some of the details of this script. For example, what's this TIME_THRESHOLD business? Well, at first blush, we might consider it easiest to simply whack the whole tracklog together into a single PolyLine and call it done. The problem with this is that GPS reception frequently suffers dropouts of varying kinds, leading to missing data points. Sometimes dropouts are momentary, perhaps the result of your receiver simply losing its satellite lock for a minute or two as you drove past an inopportunely placed hill or building. In this case, we can probably just draw a straight line between the two known-good points on either side, and call it close enough.

But what if you lose reception for more than a few minutes, as can easily happen in urban areas? What if you shut your GPS receiver off and don't turn it back on until many days or many kilometers later? In these circumstances, it would be downright inaccurate and even misleading to simply interpolate a straight line between the two otherwise correct points on either side. This is where the TIME_THRESHOLD value comes in: track2shp.pl keeps track of how much time has elapsed between successive log entries. If the elapsed time is more than 10 minutes, then track2shp.pl assumes that a dropout (accidental or otherwise) has occurred and treats it as a signal to start a new PolyLine. In this fashion, we can tweak the TIME_THRESHOLD setting to group tracklogs of different trips together in a single shapefile, while treating them as individual features for the sake of mapping or GIS use.
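Stated language-neutrally, the gap-splitting rule is simple: walk the fixes in time order, and start a new segment whenever two successive timestamps are more than the threshold apart. A small Python sketch of that same rule (the names are ours, not from track2shp.pl):

```python
TIME_THRESHOLD = 10 * 60  # seconds, matching the script's 10 minutes

def split_segments(points, threshold=TIME_THRESHOLD):
    """Group (timestamp_seconds, lon, lat) fixes into track segments,
    starting a new segment when the gap between fixes exceeds threshold."""
    segments = []
    for t, lon, lat in points:
        if not segments or t - segments[-1][-1][0] > threshold:
            segments.append([])  # dropout (or receiver off): new PolyLine
        segments[-1].append((t, lon, lat))
    return segments

fixes = [(0, -122.42, 37.74), (60, -122.42, 37.74),
         (2000, -122.41, 37.73), (2100, -122.41, 37.73)]
print([len(s) for s in split_segments(fixes)])  # [2, 2]
```

The 1,940-second gap in the middle of the sample data exceeds the 600-second threshold, so the four fixes come out as two separate two-point segments, just as the Perl script would emit two separate PolyLines.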

There are a few other things worth mentioning about this script. First, note the TIME_FORMAT constant defined at the top. This string is passed to Time::Piece to tell it how to parse the timestamp in each tracklog entry. If you need to alter it for any reason, refer to the manual page for strptime(3) on your system for the syntax. One could make this "Do What I Mean" in more and different circumstances by replacing Time::Piece with Date::Parse from the CPAN, which would be more flexible but far less efficient.

Finally, note the use of the $shp->{ShapeRecords} array, which allows us to associate other data values with each feature in the shapefile. The data fields are defined by $shp->{FieldNames} and $shp->{FieldTypes}, about which you can read more in perldoc Geo::Shapelib. The important thing is that Geo::Shapelib stores the values from $shp->{ShapeRecords} in the resulting .dbf file, which you can read with most modern spreadsheet programs, such as Gnumeric or Excel. Load the .dbf file produced by track2shp.pl in one of these programs, and you should see two columns. The first is an arbitrary numeric identifier for each feature that track2shp.pl found in the tracklog, and the second is the date on which the last point in that feature was recorded, as returned by the call to $previous->ymd. Obviously, with a little experimentation, you can add more fields and put all kinds of other stuff in here, such as the start and stop times of each individual track, the time elapsed, distance traveled, average speed, and more.
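For the curious, the .dbf file holding these records is an ordinary dBase III table, and even its binary header is simple enough to pick apart by hand. Here's a hedged Python sketch (function name ours) that parses the fixed 32-byte header prefix of a synthetic file; a real .dbf continues with a descriptor for each column:

```python
import struct

def parse_dbf_header(header):
    """Parse the fixed 32-byte prefix of a dBase III (.dbf) header."""
    # Byte 0: version; bytes 1-3: date of last update as YY MM DD
    # (years since 1900); bytes 4-7: record count (uint32, little-endian);
    # bytes 8-9: header length; bytes 10-11: record length.
    version, yy, mm, dd, nrecords, header_len, record_len = struct.unpack(
        "<4BIHH20x", header[:32])
    return version, (1900 + yy, mm, dd), nrecords, header_len, record_len

# Synthetic header: 6 records, last updated 2004-02-07.
hdr = struct.pack("<4BIHH20x", 3, 104, 2, 7, 6, 97, 21)
print(parse_dbf_header(hdr))  # (3, (2004, 2, 7), 6, 97, 21)
```

The record count at bytes 4-7 is why a spreadsheet program knows exactly how many rows (one per tracklog feature, in our case) to display when you open the file.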

You can now import and visualize your tracklog in a plethora of mapping and GIS applications, including GRASS, Quantum GIS, Thuban, Mapserver, Manifold, and even ArcView.

6.9.4. Hacking the Hack

There is one other weak point remaining in this hack, which is the possibility of spurious data in the GPS tracklog. In one real-life example, a drive from San Francisco to Oakland on the lower deck of the Bay Bridge caused the GPS reception to cut out completely, with the exception of a single reading placed off the coast of Baja California. If we treat this point as valid, then it appears as if the automobile suddenly teleported from one end of the bridge out to the Tropic of Cancer over the Pacific Ocean and then back to the other end of the bridge, all in a matter of minutes! This effect is shown in Figure 6-32 by the two red lines shooting off to the south of the map.

Figure 6-32. A sudden, unexplained jaunt to Baja California

Presumably, the GPS signals bouncing around inside the steel infrastructure of the bridge resulted in an accumulation of multipath error in the receiver, causing the sudden and obviously false data point. How can we detect and eliminate these errors from our shapefiles? One way might be to calculate the mean velocity between successive tracklog points. If the calculated velocity is in excess of any speed a human being is likely to be traveling near the surface of the earth (say, the 800 km/h cruising speed of your average jet airliner), then we can safely assume that the next data point is spurious and ought to be ignored.
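As a sketch of that rule of thumb, here's a Python version (names ours) that substitutes the standard haversine great-circle formula for Geo::Distance; at these scales, either is close enough:

```python
from math import radians, sin, cos, asin, sqrt

SPEED_THRESHOLD = 800.0  # km/h -- faster than any plausible surface travel

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (mean earth radius 6371 km)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def is_spurious(prev, cur, threshold=SPEED_THRESHOLD):
    """prev and cur are (timestamp_seconds, lat, lon) fixes."""
    dt = cur[0] - prev[0]
    if dt <= 0:
        return True  # out-of-order or duplicate timestamp: distrust it
    km = haversine_km(prev[1], prev[2], cur[1], cur[2])
    return km / dt > threshold / 3600.0  # compare in km/s, as the hack does

# A jump from San Francisco to Baja California in two minutes is not driving.
print(is_spurious((0, 37.79, -122.39), (120, 23.0, -110.0)))    # True
print(is_spurious((0, 37.79, -122.39), (120, 37.80, -122.39)))  # False
```

The Baja jump works out to roughly 2,000 km in two minutes, wildly over the threshold, while a one-hundredth-of-a-degree hop up the peninsula is an unremarkable 33 km/h.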

In order to calculate velocity, we need measures of time, which are already in the tracklog, and distance, which for reasons described in [Hack #27] can be a bit tricky for short hops. For the sake of expediency (this is a hack, after all), we will rely on Geo::Distance, which will suffice insofar as we are not after exact distances so much as a rule of thumb with which to judge the validity of our collected position data. Geo::Distance can be obtained from the CPAN.

First, make the following additions near the top of track2shp.pl:

use strict;
use warnings;
use Geo::Shapelib;
use Geo::Distance ':all';
use Time::Piece;
use constant SPEED_THRESHOLD => 1000; # km/h
use constant TIME_THRESHOLD => 10; # minutes
use constant TIME_FORMAT => "%m/%d/%Y %H:%M:%S";

Next, in the main loop, add the distance calculation:

    if (@vertices) {
        my $d = distance_calc( kilometer => $vertices[-1], [$long, $lat] );
        next if $d / ($now - $previous) > SPEED_THRESHOLD / 3600;
        add_shape( ) if $now - $previous > TIME_THRESHOLD * 60;
    }

For the call to distance_calc(), we pass the units we want back, the previous point in the vertex list, and the current point we want to test. If the result divided by the time difference is greater than our threshold speed (scaled here to km/s for calculation purposes), then the next data point is assumed to be spurious and is skipped. With four extra lines of code, we have extended track2shp.pl to filter out bad data caused by intermittent GPS reception. Figure 6-33 shows the resulting shapefile.

Figure 6-33. The same shapefile, corrected

Although the line interpolated across the Bay in place of actual data doesn't quite overlap perfectly with the vector feature matching the bridge, it certainly comes a lot closer than our previous version did!

This filtering method has other potential drawbacks. Should the first data point collected be spurious and the rest valid, the script will make the mistake of assuming the subsequent points are wrong. Also, as mentioned earlier, the results returned from Geo::Distance might be inaccurate for very small distances, which could conceivably trigger false positives in exceptional circumstances. Meanwhile, track2shp.pl is starting to accumulate a few configurable settings that good software design dictates should be turned into command-line options, rather than source code constants. Of course, if we solved all of these remaining issues, we might find ourselves with something a lot closer to a software product than a simple hack!





Mapping Hacks: Tips & Tools for Electronic Cartography
ISBN: 0596007035
Year: 2004
Pages: 172