• Geolocation

PostgreSQL Geo Queries Made Easy

Paul Dmitriev
23 Apr 2020
29
geo location image

Being a mobile developer you’d never imagine having to work with geospatial indexes and geolocation queries; real life has a different opinion though. In one of my recent projects, a dating app, we needed to be able to find users that were physically near each other. To accomplish this I summoned some of my old backend tricks and I wanted to share the finding with you.

There are several approaches for geolocation queries that vary in speed and accuracy, so let’s talk about these options. This post is aimed towards the mid-level developers, but hardcore geeks would also find interesting comparisons and approaches in the end of an article.

PostgreSQL image

Routine tasks like “find all app users near city X” or “sort something by distance from current user” are really common requests. In one of the legacy projects I’ve been supporting, this task was solved in a pretty straightforward way. They simply used the following query to select coordinates fitting a necessary square

WHERE lat > 10 AND lat < 20 AND lng > 30 AND lng < 45

and then additionally filtered all founded results in Python script, using distance calculation. I’m sure you can imagine the performance of such an approach and the amount of edge conditions it had. Of course, all edge conditions could be resolved and carefully scaffolded with unit tests, but let’s look at how this could be done on a DB level. And no, we’re not doing that in order to shift the responsibility onto DBA.

The last time this strategy was used was during the development of our secure dating startup Zin; where geolocation was a critical feature. We needed to find all posts that were close to a user and sort them by distance. After researching it appeared we had 3 options (more precisely two and a half), which we’ll cover.

For clarification, I’ll say that we needed “big circle distance”, without terrain consideration. Also, a possible error was pretty big, so the main criteria for us was implementation simplicity. Looking ahead I can say that precision appeared to be bigger than we expected, and I’ll show that in practice.

Book a strategy session

Get actionable insights for your product

    Success!

    We’ll reach out to schedule a call

    Test data

    As our DA contains some sensitive data, I’ll use two other datasets to provide examples. First one, it’s a big table of cities and their coordinates taken from http://download.geonames.org/export/dump/. It’s a bit bloated, as it contains not only cities, but their districts, notable landmarks, and other stuff, but as the main goal here is an overall amount of data. So, here is the structure of the first table.

    CREATE TABLE cities (
        id integer DEFAULT nextval('public.cities_seq'::regclass) NOT NULL,
        name character varying NOT NULL,
        region character varying NOT NULL,
        lat double precision NOT NULL,
        lng double precision NOT NULL
    );
    
    ALTER TABLE ONLY cities
        ADD CONSTRAINT cities_pkey PRIMARY KEY (id);
    
    CREATE INDEX cities_name_low_idx ON cities USING btree
    (lower((name)::text));
    
    CREATE INDEX cities_location_idx ON cities USING gist (ll_to_earth(lat, lng));

    It’s pretty simple: city name, code of the country it belongs and a pair of attributes to store latitude and longitude. Additionally, we’re building two indices: BTREE for fast search of city by name and the one we need for location queries. It uses GIST (huge thanks to PostgreSQL team for this magnificent mechanism) and ll_to_earth function that we’ll discuss later on. 

    Usually, we have two ways of storing information: use a pair of numbers (usually float8) in different columns or use one of the domain-specific types that we’re discussing later (point, earth, geometry). There is no viable difference in performance, so separate columns are more user options, but of course, it depends on the particular requirements of the project. 

    I won’t post the helper Python scripts I used for data collection and DB population, they are really straightforward. If you’d like to take a sneak peek at them, here is a GitHub repo: https://github.com/cleg/GeoReqTest

    After transferring the whole dataset to PostgreSQL, let’s check that we have a big enough table to play with.

    SELECT COUNT(*), pg_size_pretty(pg_relation_size('cities')) AS data, pg_size_pretty(pg_indexes_size('cities')) AS idxs 
    FROM cities;

     

    count     | data   |  idxs
    ----------+--------+---------
     12005938 | 848 MB | 1836 MB
    (1 row)
    
    Time: 2073.919 ms (00:02.074)

     

    The second dataset I created for a more visual demonstration of complicated queries contains two much smaller tables. First, cities_min contains the biggest cities of Montenegro (15 in total), the second one, named pizzerias, contains, you got it, list of local pizzerias (172 total) that I’ve got from FourSquare. This scraping script (pun intended) is also available in the abovementioned repo.

    Now that we have the intro out of the way, let’s proceed to the main topic: discussing the options we have.

    PostGIS

    This is the “hard way” we wanted to avoid, as is hinted in the title. But I’ll be covering it as we need to know why it’s not the best choice for us.

    Of course, this method has some really viable advantages: it’s the most versatile and precise, it has no edge conditions and it allows changing the spatial reference identifier. 

    But that comes with a big cost, starting with tons of heavy dependencies (simple installation via Homebrew requires installing more than 30 additional formulas, including mastodons like GCC, Qt, boost, etc.) and ending with overall complexity (non-standard data types, 5000+ rows of additional tables and views with dictionaries). Also, it’s slower than other methods, not by a big margin, but anyway, why spend resources for the precision we don’t need?

    Luckily, the core team of PostgreSQL developers created a really convenient extension, containing the solution we used. Installation is really simple:

    CREATE EXTENSION earthdistance CASCADE;

    This extension gives two options for working with coordinates.

    <@> operator

    I’m not sure how to call this syntactic chimera (maybe “eye operator”?), but its essence is simple: it calculates that “big circle distance” I mentioned before; between two points on Earth’s surface. To represent coordinates it uses built-in point type, but pay attention that coordinates are represented as point(lng, lat) with longitude in the first place. Its counter-intuitive, as in speech we’re used to naming  latitude first, but longitude is closer to X-axis by its nature, so it goes in the first place. The second peculiarity to be aware of is that this operator returns results in statute miles, so cases when you’ll need to multiply or divide by 1609, will be pretty numerous if you live outside of US/UK. 

    The advantages of this method are pretty obvious: it’s simple, and it uses basic datatypes of PostgreSQL. But it has a lot  of drawbacks: precision (Earth considered being ideal sphere), edge conditions around the poles and 180 meridian, and bad indexing support. On  small datasets and cases with far from edge conditions, this method is doing its job on an acceptable level. 

    Let’s see an example: we need to find places within 10 km from city Bar (yes, it’s a real name) having the following coordinates: 42.1° N, 19.1° E

    SELECT *, (point(lng, lat) <@> point(19.1, 42.1)) * 1609.344 AS distance
    FROM
     cities
    WHERE
        (point(lng, lat) <@> point(19.1, 42.1)) < (10000 / 1609.344)
    ORDER BY
     distance;

     

        id    |       name                  | lat      | lng      | region | distance
    ----------+-----------------------------+----------+----------+--------+--------------------
     27696013 | Bjelisi                     | 42.10083 | 19.10417 | ME     | 356.2025188423806
     27699128 | Topolica                    | 42.09833 | 19.09417 | ME     | 515.6034584176577
     27696061 | Bar                         | 42.0937  | 19.09841 | ME     | 712.7044816072382
     27708312 | King Nikola's Palace        | 42.10062 | 19.09129 | ME     | 721.903811067934
     27708268 | Princess                    | 42.10158 | 19.09132 | ME     | 737.3598549548324
     27708234 | Bar Port                    | 42.09701 | 19.09132 | ME     | 789.5619694327141
     27708184 | Castello                    | 42.10701 | 19.09623 | ME     | 839.2352057529451
     27699126 | Popovici                    | 42.09194 | 19.10528 | ME     | 996.5017600203865
     27708313 | Antivari Pier (historical)  | 42.09708 | 19.08834 | ME     | 1015.3313902853366
     27695933 | Rikavac                     | 42.09278 | 19.08972 | ME     | 1167.8829329656053
    
    (232 rows)
    
    Time: 2844.305 ms (00:02.844)

    So, the query is pretty simple, but execution time looks pitiful, even without using EXPLAIN, obvious that here works plain SCAN. Of course, 3 seconds for 12M+ scan looks impressive, but it’s not usable for production. Usually, to optimize such queries we can use <-> operator that calculates “straight line” distance between points. It works with GIST indexes, and later we can refine results with <@>, but it’s not saving us from edge conditions anyway.

    Do you know that feeling when the author of a book or an article pushes you towards the solution they consider “the most correct”? It’s the right time to recall it, as we’re close to the third (or, more precisely, second with a half, as it’s also declared in the earth distance extension) method.

    earth_distance function and its little helpers

    First of all, I’d like to thank the PostgreSQL development team and all its contributors. I think they are some of the brightest developers on the planet, and making this DB opensource makes them also the most generous ones. Why did I say so? Let’s see. 

    As we’ve seen above, “2D way” (<@> and <-> operators) doesn’t work perfectly. So what can we do? Shift to 3D of course, and that’s what earth_distance can do. With the help of a few additional functions, we can “translate” any pair of latitude+longitude figures into a three-dimensional point in space and use it for a distance computation without edge conditions. As PostgreSQL doesn’t have a data type for 3D point, here they used `cube` data type to represents 3D cubes. Points are represented as zero-sized cubes, simple and efficient. Creators defined domain-specific type earth that is a simple descendant of the cube type, making some additional sanity checks (like “this point isn’t far from the planet’s surface). 

    First, let’s see some useful functions.

    • ll_to_earth(latitude, longitude) — translates latitude, longitude pair into earth type, calculating 3D point location
    • earth_distance(point1, point2) — computes the distance between two points in earth format
    • earth_box(point, radius) — creates a  bounding cube, sized to contain all the points that are not farther than radius meters from a given point. This cube is an approximation, but we can use it for search with @> operator (containment). This operator efficiently supports GIST indexes, but we need to refine the result

    Let’s see an example of a query similar to one we used for <@> demonstration.

    SELECT *, ROUND(earth_distance(ll_to_earth(42.1, 19.1), ll_to_earth(lat, lng))::NUMERIC, 2) AS distance
    FROM
     cities
    WHERE
     earth_box(ll_to_earth (42.1, 19.1), 10000) @> ll_to_earth (lat, lng)
     AND earth_distance(ll_to_earth (42.1, 19.1), ll_to_earth (lat, lng)) < 10000
    ORDER BY
     distance;

     

        id    |       name                  | lat      | lng      | region | distance
    ----------+-----------------------------+----------+----------+--------+----------
     27696013 | Bjelisi                     | 42.10083 | 19.10417 | ME     | 356.60
     27699128 | Topolica                    | 42.09833 | 19.09417 | ME     | 516.18
     27696061 | Bar                         | 42.0937  | 19.09841 | ME     | 713.51
     27708312 | King Nikola's Palace        | 42.10062 | 19.09129 | ME     | 722.72
     27708268 | Princess                    | 42.10158 | 19.09132 | ME     | 738.19
     27708234 | Bar Port                    | 42.09701 | 19.09132 | ME     | 790.45
     27708184 | Castello                    | 42.10701 | 19.09623 | ME     | 840.18
     27699126 | Popovici                    | 42.09194 | 19.10528 | ME     | 997.62
     27708313 | Antivari Pier (historical)  | 42.09708 | 19.08834 | ME     | 1016.48
     27695933 | Rikavac                     | 42.09278 | 19.08972 | ME     | 1169.20
    
    (232 rows)
    
    Time: 102.084 ms

    As you can see, performance is much better now. Even on 2016 MacBook Pro it took about 100 ms to process 12M+ of records. That’s what we can use in production code. Let’s double-check our query with EXPLAIN.

    Gather Merge  (cost=45002.12..45441.98 rows=3770 width=69)
    
       Workers Planned: 2
    
       ->  Sort  (cost=44002.10..44006.81 rows=1885 width=69)
    
             Sort Key: (round((sec_to_gc(cube_distance('(4471920.234086516, 1548541.2222539173, 4276093.607391206)'::cube, (ll_to_earth(lat, lng))::cube)))::numeric, 2))
    
             ->  Parallel Bitmap Heap Scan on cities  (cost=651.34..43899.55 rows=1885 width=69)
    
                   Recheck Cond: ('(4461920.235110745, 1538541.2232781458, 4266093.608415434),(4481920.233062288, 1558541.2212296887, 4286093.606366977)'::cube @> (ll_to_earth(lat, lng))::cube)
    
                   Filter: (sec_to_gc(cube_distance('(4471920.234086516, 1548541.2222539173, 4276093.607391206)'::cube, (ll_to_earth(lat, lng))::cube)) < '10000'::double precision)
    
                   ->  Bitmap Index Scan on cities_location_idx  (cost=0.00..650.21 rows=13572 width=0)
    
                         Index Cond: ((ll_to_earth(lat, lng))::cube <@ '(4461920.235110745, 1538541.2232781458, 4266093.608415434),(4481920.233062288, 1558541.2212296887, 4286093.606366977)'::cube)

     

    Index scan + refinement gives us a good result. So, if your DB is big, it’s always a nice idea to narrow query bounds. Situations where you need to search the whole globe are really rare, most often we limit our search with some particular region (city, country, state). 

    You may have wondered, what is the order of inaccuracy we got with “approximation” methods? At least I was very curious about that. It’s really simple to find out. 

    SELECT name,
        ST_Distance(ST_MakePoint(lng, lat)::geography, ST_MakePoint(19.1, 42.1)::geography) AS postgis_distance,
        earth_distance(ll_to_earth(42.1, 19.1), ll_to_earth (lat, lng)) AS earth_distance,
        (point(lng, lat) <@> point(19.1, 42.1)) * 1609.344 AS point_distance
    FROM
     cities
    WHERE
     earth_box(ll_to_earth (42.1, 19.1), 10000) @> ll_to_earth (lat, lng)
     AND earth_distance(ll_to_earth (42.1, 19.1), ll_to_earth (lat, lng)) < 10000
    ORDER BY
     earth_distance;

     

                name             | postgis_distance |   earth_distance   | point_distance
    -----------------------------+------------------+--------------------+--------------------
     Bjelisi                     | 357.05152825     | 356.6040157475758  |  356.2025188423806
     Topolica                    | 516.71294939     | 516.1846255398445  |  515.6034584176577
     Bar                         | 712.02799946     | 713.5078129379874  |  712.7044816072382
     King Nikola's Palace        | 723.77941534     | 722.717511506365   |  721.903811067934
     Princess                    | 739.14564214     | 738.1909768134576  |  737.3598549548324
     Bar Port                    | 791.12180279     | 790.4519313791742  |  789.5619694327141
     Castello                    | 838.76185601     | 840.1811573389859  |  839.2352057529451
     Popovici                    | 996.13740888     | 997.6249760322966  |  996.5017600203865
     Antivari Pier (historical)  | 1017.61930422    | 1016.4758302850378 | 1015.3313902853366
     Rikavac                     | 1168.91272962    | 1169.199322822384  | 1167.8829329656053
    
     Knjezdovo                   | 9751.02688274    | 9743.710964250973  |  9732.740617250734
     Rt Sapavica                 | 9776.31845857    | 9772.941912657994  |  9761.938654824216
     Livrsanik                   | 9790.74772202    | 9781.953343427898  |  9770.939939713666
     Kabeta                      | 9803.27894257    | 9791.177885397401  |  9780.154095863507
     Pecurice                    | 9825.29031086    | 9833.21849793169   |  9822.147375291843
     Dedici                      | 9825.38102702    | 9836.570130079455  |  9825.495233870668
     Kozjak                      | 9860.53074801    | 9882.053049767033  |  9870.926944792083
     Livari                      | 9899.30877216    | 9888.084255774567  |   9876.95136032431
     Ocas                        | 9877.66304883    | 9893.775540351273  |  9882.636237141522
     Kamisora                    | 9932.59217935    | 9926.356311450303  |  9915.180325874655
     Lunje                       | 9955.46356913    | 9951.533852256645  |  9940.329519539344
     Pepici                      | 9950.93625773    | 9971.099451151647  |  9959.873089720966
     Bajraktari                  | 9993.82232318    | 9992.844754025287  |  9981.593909774985
    (228 rows)
    
    Time: 500.176 ms
    

    As you can see, the error is about 30 meters for 10 km in the worst cases, which is more than acceptable in most scenarios. Actually, we were expecting a much larger error, and it was still acceptable for us. 

    Let’s summarise information about the “cubes method”. Advantages: speed, simplicity, possibility to “redefine” planet’s radius (helpful for non-Earth calculations of to re-define units of measurement). And only one disadvantage is left: precision is still a bit lower than the  PostGIS one; as Earth still considered spherical. But as we’ve seen before, the error is on an acceptable level for most applications. 

    More complex queries

     Now that we have learned the basics we can discuss more complicated queries. I will show a pair of examples, using a “smaller” dataset with pizzerias.

    For example, what if we needed to find the farthest pizzerias and detect what cities they belong to. We’ll assume that pizzeria “belongs” to a city with the center nearest to the venue location. Of course, SQL has dozens of ways to express this query, but I’ll use a version that has the most straightforward logic of search. 

    WITH pizzerias AS (
        SELECT
            name,
            ll_to_earth(lat, lng) as point,
            earth_distance(ll_to_earth(lat, lng), ll_to_earth(42.1, 19.1)) AS distance
        FROM pizzerias
        ORDER BY
            distance
            DESC
        LIMIT 25
    )
    SELECT c.name AS city, p.name, p.distance
        FROM pizzerias p,
             LATERAL (SELECT name FROM cities_min c
                      ORDER BY earth_distance(ll_to_earth(c.lat, c.lng), p.point) LIMIT 1) c;

     

         city     |   name                    | distance
    --------------+---------------------------+--------------------
     Pljevlja     | Brut                      | 141584.33308274313
     Zabljak      | Balkan                    | 117451.9785288297
     Zabljak      | Caffe pizzeria Balkan     | 117435.32995569201
     Zabljak      | Lazar                     | 117308.9999099511
     Bijelo Polje | cezar                     | 116663.29228130652
     Bijelo Polje | Orao                      | 116642.62285195319
     Bijelo Polje | Pierre                    | 116580.71776744149
     Bijelo Polje | Padrino                   | 104133.00996894344
     Kolasin      | Padrino Pizzerua          | 103643.11705954213
     Kolasin      | Pizzeria Kapri            | 87709.3205690596
     Niksic       | Chaplin                   | 76339.67053537675
     Niksic       | Pizzeria Garfield         | 76139.35825221038
     Niksic       | Stari Nikšić              | 76118.70681853226
     Niksic       | Piazza Restaurant & Caffe | 76110.42052261815
     Niksic       | K2                        | 75899.71658365549 
     Niksic       | Šef                       | 75850.41193677587
     Herceg Novi  | Grifonee                  | 62681.009366742284
     Herceg Novi  | Nautilus                  | 62597.163060010935
     Herceg Novi  | Caffe Pizzeria Biblioteka | 62586.73859525146
     Herceg Novi  | Forte                     | 61781.28207372379
     Herceg Novi  | Atelier                   | 61018.18974275154
     Herceg Novi  | Popaj                     | 60869.123535469735
     Herceg Novi  | Cafe Pizzeria Trg         | 60721.08839248005
     Herceg Novi  | Caffe Pizzeria 5          | 60702.71868706536
     Herceg Novi  | Pic Nic                   | 60417.41134618965
    (25 rows)
    
    Time: 8.487 ms

    First, we select the farthest venues, using the distance sort in a descending order. Then we simply have to “traverse” this data, selecting the matching city for each row.

    One more example, let’s find the most “pizza-loving cities”, assuming that such cities will have the biggest count of pizzerias in a 10 km radius from the center. A more logical approach would  be to calculate the ratio of pizzerias counts to cities population, but I didn’t want to search for demographic data, so we’ll just count absolute value. 

    SELECT c.name, count(cp)
        FROM cities_min c,
             LATERAL (SELECT "name"
                      FROM pizzerias p
                      WHERE earth_distance(ll_to_earth(p.lat, p.lng), ll_to_earth(c.lat, c.lng)) < 10000
             ) AS cp
        GROUP BY c.name
        ORDER BY count(cp) DESC;

     

         name     | count
    --------------+-------
     Podgorica    | 46
     Tivat        | 36
     Kotor        | 31
     Budva        | 23
     Bar          | 19
     Herceg Novi  | 16
     Ulcinj       | 16
     Petrovac     | 11
     Niksic       | 6
     Cetinje      | 6
     Danilovgrad  | 3
     Bijelo Polje | 3
     Zabljak      | 3
     Pljevlja     | 1
     Kolasin      | 1
    (15 rows)
    
    Time: 65.703 ms

    The results speak for themselves. Besides the capital (with the biggest population), coastal cities like pizza much more than mountain areas. Even trendy ski resorts don’t skew the data very much.

    2.0 PostgreSQL image

    Conclusion

    • PostgreSQL has two “built-in” ways of dealing with geolocation data: <@> operator and earth_distance functions family
    • It’s better to use the second one, but they can be combined if necessary
    • PostGIS is the most versatile and precise method, but it’s the slowest and the most complex one. So in the vast majority of cases, you can use a simpler method
    • Narrowing your query to some bounding area is always a great idea from performance standpoint

    Book a strategy session

    Get actionable insights for your product

      Success!

      We’ll reach out to schedule a call