One of my customers recently needed their website to know the distance between people. It did not have to be exact so we based it off of zip code latitudes/longitudes (which we got indirectly from the US Census Bureau) and some fuzzy trigonometry. The first step was to find a way to put the zip code information into a database. The newest one (circa 2004) that was easily workable came from http://www.boutell.com/zipcodes/. We created a table so that we could import the data.
CREATE TABLE zipcodes (
zip varchar(5) NOT NULL default '',
city varchar(64) default NULL,
state varchar(2) default NULL,
latitude float NOT NULL default '0',
longitude float NOT NULL default '0',
timezone int(11) default NULL,
dst tinyint(1) NOT NULL default '0',
PRIMARY KEY (zip)
)
This will allow for a direct import of the CSV data with one small annoyance. The .csv file has blank lines in it. You can remove those by hand. I used this command in Vim until it didn’t find any more blank lines:
:%s/\r\r/\n/
For some reason Vim will find “\r” but if you tell it to put a “\n” into a replacement it does not handle things right. I haven’t taken the time to figure out the reason yet. As a refresher, "" represents the linefeed character and "" is the carriage return. Windows uses both, Mac uses “\r” exclusively and Linux generally uses only “\n.” If you did try to import before removing those blank lines, you will want to truncate (empty) the database before importing again. MySQL should stop you automatically if you made the zip code a primary key, as I did, but you do not want to duplicate this information. After saving, I imported the data into the database by using phpMyAdmin. You can also use the SQL command LOAD DATA LOCAL INFILE. Once it has been imported, we needed a way to track a point in space. Even though there is little support for it in some third-party programs (such as phpMyAdmin), it turns out that MySQL does have a POINT data type which is specifically designed for this purpose. You need to alter the structure of the table using this command:
ALTER TABLE `zipcodes` ADD `location` POINT NOT NULL
The next part gets into the realm of math. If you have ever taken a look at 3D programming, speed is squeezed out of the code by performing as many calculations as possible before they are needed. One popular trick, at least in the past, has been to create sine and cosine lookup tables. Instead of having to calculate these numbers you can instruct the program to find their value by going to the right location in an array. This allowed some 3D games to run smoothly on my old 80 Mhz computer. Distance calculations in three dimensions (as our world requires for larger distances) could benefit from this same speed increase. It would be bad to overload the server too much with these calculations that are going to be common. After considering the possibilities, I decided that the best option was to calculate ahead of time the distance from latitude and longitude 0. If we save both the latitudinal and longitudinal distances, we can use Pythagoras’ theorem (a^2 = b^2 + c^3) to find distances between two of those points. Now the most processor-intensive calculation is a single square root for each distance calculated, which will speed up the routine calculations significantly. Some people who have not done this have ended up with 15 second SQL queries, which is unacceptable. There is another trick later that will give us even more speed. I visited MeridianWorldData.com and borrowed their calculations. The first step was to boil them down because they were designed to calculate from two specific coordinates and I wanted one of them to have both an x and y coordinate of 0. This is what I ended up with:
UPDATE `zipcodes` SET location = GeomFromText(CONCAT('POINT(', IF(latitude < 0, -1, 1) * 3963.0 * ACOS(COS(latitude/57.2958) * COS(longitude/57.2958 - longitude/57.2958)), ' ', IF(longitude < 0, -1, 1) * 3963.0 * ACOS(SIN(latitude/57.2958) * SIN(latitude/57.2958) + COS(latitude/57.2958) * COS(latitude/57.2958) * COS(longitude/57.2958)), ')'))
Notice that the numbers can be negative in the database. The reason for this is to prevent our calculations from matching American zip codes to Asian cities if this is ever expanded. I feel bad about using GeomFromText() because that adds processing overhead and I am a huge fan of optimization, but I was not able to get this to work without it and the command is only run when creating (or updating) the database. Now it is time to test this with a random zip code in New York city:
SELECT ROUND(X(location)) AS x, ROUND(Y(location)) AS y FROM zipcodes WHERE zip = 10285
This returns an X value of 2816 and a Y value of -3755. That looks good so far. The comment on this blog entry provides the best reference I have found so far for running the distance calculations. If you take the X and Y values from the last SELECT statement, you can find the distance between them and every zip code in the entire database (with the nearest ones first) by using this command:
SELECT city, state, zip, ROUND(GLength(LineStringFromWKB(LineString(AsBinary(location), POINT(2816, -3755))))) As distance FROM `zipcodes` ORDER BY distance ASC
The query reports that Jersey City is only a couple miles away, which is mostly accurate (MapQuest gives different distances because you have to find a bridge). Since we are using zip code areas and not exact addresses anyway, a little variation is alright. You would want to use different calculations if you needed distances to be accurate to within a centimeter anyway. You might also notice the speed on this query. For me it required roughly 0.1 seconds which is a whole lot faster than if we had not created the POINT data but it is still too long. There are a couple things we can still do to speed this up without hard-coding the distances (which would limit us to searches only from one location). One speed enhancement would be to index the POINT data:
ALTER TABLE `zipcodes` ADD SPATIAL INDEX (location)
That modification brought it down one or two tenths of a second per search but we can take even more advantage of the speed by creating boundaries for our query. If we decided that we wanted the zip codes within 10 miles, we could use “WHERE distance < 10” but that requires us to calculate distance for every row in the table (which we are already doing). If we can avoid doing that then we can shave off more math calculations. What if we asked for a box? A query that asks for any zip codes within 60 miles of the latitude and 60 miles of the longitude would look like this:
SELECT city, state, zip, ROUND(GLength(LineStringFromWKB(LineString(AsBinary(location), POINT(2816, -3755))))) As distance FROM `zipcodes` WHERE X(location) > 2816 - 60 AND X(location) < 2816 + 60 AND Y(location) > -3755 - 60 AND Y(location) < -3755 + 60 ORDER BY distance ASC
That cut the time in half for a query on my server, which puts it somewhere around 0.04 seconds each. That isn’t bad. The difference in speed is now minuscule if you delete the POINT index but there is still a small difference. I’ve chosen to keep it. You will probably notice that some of the values returned are as much as 85 miles from our starting point. That is okay. We can silently discard any rows with a distance above 60 when displaying the data to the users. We know general distances now. Let’s find out how far it is to a specific location, such as Miami, Florida. This is all that it takes to check the distance to one zip code:
SELECT city, state, zip, ROUND(GLength(LineStringFromWKB(LineString(AsBinary(location), POINT(2817, -3754))))) As distance FROM `zipcodes` WHERE zip = 33125
It says 1548. Mapquest reports 1292. For Memphis, TN (37544) we get 1198 and MapQuest says 1099. Portland, OR (97232) says 1529 and MapQuest says 2895. Yeah, it isn’t exact but it is close enough for small distances. Feel free to adapt this to your needs. Hopefully these SQL functions will give you a few ideas if nothing else and I will probably refigure some of the math as time goes on. I’d also be interested in hearing what you are doing.