- Artificial Intelligence
- Machine learning
- Artificial Intelligence
- Machine learning
Being a mobile developer you’d never imagine having to work with geospatial indexes and geolocation queries; real life has a different opinion though. In one of my recent projects, a dating app, we needed to be able to find users that were physically near each other. To accomplish this I summoned some of my old backend tricks and I wanted to share the finding with you.
There are several approaches for geolocation queries that vary in speed and accuracy, so let’s talk about these options. This post is aimed towards the mid-level developers, but hardcore geeks would also find interesting comparisons and approaches in the end of an article.
Routine tasks like “find all app users near city X” or “sort something by distance from current user” are really common requests. In one of the legacy projects I’ve been supporting, this task was solved in a pretty straightforward way. They simply used the following query to select coordinates fitting a necessary square
WHERE lat > 10 AND lat < 20 AND lng > 30 AND lng < 45
and then additionally filtered all founded results in Python script, using distance calculation. I’m sure you can imagine the performance of such an approach and the amount of edge conditions it had. Of course, all edge conditions could be resolved and carefully scaffolded with unit tests, but let’s look at how this could be done on a DB level. And no, we’re not doing that in order to shift the responsibility onto DBA.
The last time this strategy was used was during the development of our secure dating startup Zin; where geolocation was a critical feature. We needed to find all posts that were close to a user and sort them by distance. After researching it appeared we had 3 options (more precisely two and a half), which we’ll cover.
For clarification, I’ll say that we needed “big circle distance”, without terrain consideration. Also, a possible error was pretty big, so the main criteria for us was implementation simplicity. Looking ahead I can say that precision appeared to be bigger than we expected, and I’ll show that in practice.
Get actionable insights for your product
As our DA contains some sensitive data, I’ll use two other datasets to provide examples. First one, it’s a big table of cities and their coordinates taken from http://download.geonames.org/export/dump/. It’s a bit bloated, as it contains not only cities, but their districts, notable landmarks, and other stuff, but as the main goal here is an overall amount of data. So, here is the structure of the first table.
CREATE TABLE cities ( id integer DEFAULT nextval('public.cities_seq'::regclass) NOT NULL, name character varying NOT NULL, region character varying NOT NULL, lat double precision NOT NULL, lng double precision NOT NULL ); ALTER TABLE ONLY cities ADD CONSTRAINT cities_pkey PRIMARY KEY (id); CREATE INDEX cities_name_low_idx ON cities USING btree (lower((name)::text)); CREATE INDEX cities_location_idx ON cities USING gist (ll_to_earth(lat, lng));
It’s pretty simple: city name, code of the country it belongs and a pair of attributes to store latitude and longitude. Additionally, we’re building two indices: BTREE for fast search of city by name and the one we need for location queries. It uses GIST (huge thanks to PostgreSQL team for this magnificent mechanism) and ll_to_earth function that we’ll discuss later on.
Usually, we have two ways of storing information: use a pair of numbers (usually float8) in different columns or use one of the domain-specific types that we’re discussing later (point, earth, geometry). There is no viable difference in performance, so separate columns are more user options, but of course, it depends on the particular requirements of the project.
I won’t post the helper Python scripts I used for data collection and DB population, they are really straightforward. If you’d like to take a sneak peek at them, here is a GitHub repo: https://github.com/cleg/GeoReqTest
After transferring the whole dataset to PostgreSQL, let’s check that we have a big enough table to play with.
SELECT COUNT(*), pg_size_pretty(pg_relation_size('cities')) AS data, pg_size_pretty(pg_indexes_size('cities')) AS idxs FROM cities;
count | data | idxs ----------+--------+--------- 12005938 | 848 MB | 1836 MB (1 row) Time: 2073.919 ms (00:02.074)
The second dataset I created for a more visual demonstration of complicated queries contains two much smaller tables. First, cities_min contains the biggest cities of Montenegro (15 in total), the second one, named pizzerias, contains, you got it, list of local pizzerias (172 total) that I’ve got from FourSquare. This scraping script (pun intended) is also available in the abovementioned repo.
Now that we have the intro out of the way, let’s proceed to the main topic: discussing the options we have.
This is the “hard way” we wanted to avoid, as is hinted in the title. But I’ll be covering it as we need to know why it’s not the best choice for us.
Of course, this method has some really viable advantages: it’s the most versatile and precise, it has no edge conditions and it allows changing the spatial reference identifier.
But that comes with a big cost, starting with tons of heavy dependencies (simple installation via Homebrew requires installing more than 30 additional formulas, including mastodons like GCC, Qt, boost, etc.) and ending with overall complexity (non-standard data types, 5000+ rows of additional tables and views with dictionaries). Also, it’s slower than other methods, not by a big margin, but anyway, why spend resources for the precision we don’t need?
Luckily, the core team of PostgreSQL developers created a really convenient extension, containing the solution we used. Installation is really simple:
CREATE EXTENSION earthdistance CASCADE;
This extension gives two options for working with coordinates.
I’m not sure how to call this syntactic chimera (maybe “eye operator”?), but its essence is simple: it calculates that “big circle distance” I mentioned before; between two points on Earth’s surface. To represent coordinates it uses built-in point type, but pay attention that coordinates are represented as point(lng, lat) with longitude in the first place. Its counter-intuitive, as in speech we’re used to naming latitude first, but longitude is closer to X-axis by its nature, so it goes in the first place. The second peculiarity to be aware of is that this operator returns results in statute miles, so cases when you’ll need to multiply or divide by 1609, will be pretty numerous if you live outside of US/UK.
The advantages of this method are pretty obvious: it’s simple, and it uses basic datatypes of PostgreSQL. But it has a lot of drawbacks: precision (Earth considered being ideal sphere), edge conditions around the poles and 180 meridian, and bad indexing support. On small datasets and cases with far from edge conditions, this method is doing its job on an acceptable level.
Let’s see an example: we need to find places within 10 km from city Bar (yes, it’s a real name) having the following coordinates: 42.1° N, 19.1° E
SELECT *, (point(lng, lat) <@> point(19.1, 42.1)) * 1609.344 AS distance FROM cities WHERE (point(lng, lat) <@> point(19.1, 42.1)) < (10000 / 1609.344) ORDER BY distance;
id | name | lat | lng | region | distance ----------+-----------------------------+----------+----------+--------+-------------------- 27696013 | Bjelisi | 42.10083 | 19.10417 | ME | 356.2025188423806 27699128 | Topolica | 42.09833 | 19.09417 | ME | 515.6034584176577 27696061 | Bar | 42.0937 | 19.09841 | ME | 712.7044816072382 27708312 | King Nikola's Palace | 42.10062 | 19.09129 | ME | 721.903811067934 27708268 | Princess | 42.10158 | 19.09132 | ME | 737.3598549548324 27708234 | Bar Port | 42.09701 | 19.09132 | ME | 789.5619694327141 27708184 | Castello | 42.10701 | 19.09623 | ME | 839.2352057529451 27699126 | Popovici | 42.09194 | 19.10528 | ME | 996.5017600203865 27708313 | Antivari Pier (historical) | 42.09708 | 19.08834 | ME | 1015.3313902853366 27695933 | Rikavac | 42.09278 | 19.08972 | ME | 1167.8829329656053 … (232 rows) Time: 2844.305 ms (00:02.844)
So, the query is pretty simple, but execution time looks pitiful, even without using EXPLAIN, obvious that here works plain SCAN. Of course, 3 seconds for 12M+ scan looks impressive, but it’s not usable for production. Usually, to optimize such queries we can use <-> operator that calculates “straight line” distance between points. It works with GIST indexes, and later we can refine results with <@>, but it’s not saving us from edge conditions anyway.
Do you know that feeling when the author of a book or an article pushes you towards the solution they consider “the most correct”? It’s the right time to recall it, as we’re close to the third (or, more precisely, second with a half, as it’s also declared in the earth distance extension) method.
First of all, I’d like to thank the PostgreSQL development team and all its contributors. I think they are some of the brightest developers on the planet, and making this DB opensource makes them also the most generous ones. Why did I say so? Let’s see.
As we’ve seen above, “2D way” (<@> and <-> operators) doesn’t work perfectly. So what can we do? Shift to 3D of course, and that’s what earth_distance can do. With the help of a few additional functions, we can “translate” any pair of latitude+longitude figures into a three-dimensional point in space and use it for a distance computation without edge conditions. As PostgreSQL doesn’t have a data type for 3D point, here they used `cube` data type to represents 3D cubes. Points are represented as zero-sized cubes, simple and efficient. Creators defined domain-specific type earth that is a simple descendant of the cube type, making some additional sanity checks (like “this point isn’t far from the planet’s surface).
First, let’s see some useful functions.
Let’s see an example of a query similar to one we used for <@> demonstration.
SELECT *, ROUND(earth_distance(ll_to_earth(42.1, 19.1), ll_to_earth(lat, lng))::NUMERIC, 2) AS distance FROM cities WHERE earth_box(ll_to_earth (42.1, 19.1), 10000) @> ll_to_earth (lat, lng) AND earth_distance(ll_to_earth (42.1, 19.1), ll_to_earth (lat, lng)) < 10000 ORDER BY distance;
id | name | lat | lng | region | distance ----------+-----------------------------+----------+----------+--------+---------- 27696013 | Bjelisi | 42.10083 | 19.10417 | ME | 356.60 27699128 | Topolica | 42.09833 | 19.09417 | ME | 516.18 27696061 | Bar | 42.0937 | 19.09841 | ME | 713.51 27708312 | King Nikola's Palace | 42.10062 | 19.09129 | ME | 722.72 27708268 | Princess | 42.10158 | 19.09132 | ME | 738.19 27708234 | Bar Port | 42.09701 | 19.09132 | ME | 790.45 27708184 | Castello | 42.10701 | 19.09623 | ME | 840.18 27699126 | Popovici | 42.09194 | 19.10528 | ME | 997.62 27708313 | Antivari Pier (historical) | 42.09708 | 19.08834 | ME | 1016.48 27695933 | Rikavac | 42.09278 | 19.08972 | ME | 1169.20 … (232 rows) Time: 102.084 ms
As you can see, performance is much better now. Even on 2016 MacBook Pro it took about 100 ms to process 12M+ of records. That’s what we can use in production code. Let’s double-check our query with EXPLAIN.
Gather Merge (cost=45002.12..45441.98 rows=3770 width=69) Workers Planned: 2 -> Sort (cost=44002.10..44006.81 rows=1885 width=69) Sort Key: (round((sec_to_gc(cube_distance('(4471920.234086516, 1548541.2222539173, 4276093.607391206)'::cube, (ll_to_earth(lat, lng))::cube)))::numeric, 2)) -> Parallel Bitmap Heap Scan on cities (cost=651.34..43899.55 rows=1885 width=69) Recheck Cond: ('(4461920.235110745, 1538541.2232781458, 4266093.608415434),(4481920.233062288, 1558541.2212296887, 4286093.606366977)'::cube @> (ll_to_earth(lat, lng))::cube) Filter: (sec_to_gc(cube_distance('(4471920.234086516, 1548541.2222539173, 4276093.607391206)'::cube, (ll_to_earth(lat, lng))::cube)) < '10000'::double precision) -> Bitmap Index Scan on cities_location_idx (cost=0.00..650.21 rows=13572 width=0) Index Cond: ((ll_to_earth(lat, lng))::cube <@ '(4461920.235110745, 1538541.2232781458, 4266093.608415434),(4481920.233062288, 1558541.2212296887, 4286093.606366977)'::cube)
Index scan + refinement gives us a good result. So, if your DB is big, it’s always a nice idea to narrow query bounds. Situations where you need to search the whole globe are really rare, most often we limit our search with some particular region (city, country, state).
You may have wondered, what is the order of inaccuracy we got with “approximation” methods? At least I was very curious about that. It’s really simple to find out.
SELECT name, ST_Distance(ST_MakePoint(lng, lat)::geography, ST_MakePoint(19.1, 42.1)::geography) AS postgis_distance, earth_distance(ll_to_earth(42.1, 19.1), ll_to_earth (lat, lng)) AS earth_distance, (point(lng, lat) <@> point(19.1, 42.1)) * 1609.344 AS point_distance FROM cities WHERE earth_box(ll_to_earth (42.1, 19.1), 10000) @> ll_to_earth (lat, lng) AND earth_distance(ll_to_earth (42.1, 19.1), ll_to_earth (lat, lng)) < 10000 ORDER BY earth_distance;
name | postgis_distance | earth_distance | point_distance -----------------------------+------------------+--------------------+-------------------- Bjelisi | 357.05152825 | 356.6040157475758 | 356.2025188423806 Topolica | 516.71294939 | 516.1846255398445 | 515.6034584176577 Bar | 712.02799946 | 713.5078129379874 | 712.7044816072382 King Nikola's Palace | 723.77941534 | 722.717511506365 | 721.903811067934 Princess | 739.14564214 | 738.1909768134576 | 737.3598549548324 Bar Port | 791.12180279 | 790.4519313791742 | 789.5619694327141 Castello | 838.76185601 | 840.1811573389859 | 839.2352057529451 Popovici | 996.13740888 | 997.6249760322966 | 996.5017600203865 Antivari Pier (historical) | 1017.61930422 | 1016.4758302850378 | 1015.3313902853366 Rikavac | 1168.91272962 | 1169.199322822384 | 1167.8829329656053 … Knjezdovo | 9751.02688274 | 9743.710964250973 | 9732.740617250734 Rt Sapavica | 9776.31845857 | 9772.941912657994 | 9761.938654824216 Livrsanik | 9790.74772202 | 9781.953343427898 | 9770.939939713666 Kabeta | 9803.27894257 | 9791.177885397401 | 9780.154095863507 Pecurice | 9825.29031086 | 9833.21849793169 | 9822.147375291843 Dedici | 9825.38102702 | 9836.570130079455 | 9825.495233870668 Kozjak | 9860.53074801 | 9882.053049767033 | 9870.926944792083 Livari | 9899.30877216 | 9888.084255774567 | 9876.95136032431 Ocas | 9877.66304883 | 9893.775540351273 | 9882.636237141522 Kamisora | 9932.59217935 | 9926.356311450303 | 9915.180325874655 Lunje | 9955.46356913 | 9951.533852256645 | 9940.329519539344 Pepici | 9950.93625773 | 9971.099451151647 | 9959.873089720966 Bajraktari | 9993.82232318 | 9992.844754025287 | 9981.593909774985 (228 rows) Time: 500.176 ms
As you can see, the error is about 30 meters for 10 km in the worst cases, which is more than acceptable in most scenarios. Actually, we were expecting a much larger error, and it was still acceptable for us.
Let’s summarise information about the “cubes method”. Advantages: speed, simplicity, possibility to “redefine” planet’s radius (helpful for non-Earth calculations of to re-define units of measurement). And only one disadvantage is left: precision is still a bit lower than the PostGIS one; as Earth still considered spherical. But as we’ve seen before, the error is on an acceptable level for most applications.
Now that we have learned the basics we can discuss more complicated queries. I will show a pair of examples, using a “smaller” dataset with pizzerias.
For example, what if we needed to find the farthest pizzerias and detect what cities they belong to. We’ll assume that pizzeria “belongs” to a city with the center nearest to the venue location. Of course, SQL has dozens of ways to express this query, but I’ll use a version that has the most straightforward logic of search.
WITH pizzerias AS ( SELECT name, ll_to_earth(lat, lng) as point, earth_distance(ll_to_earth(lat, lng), ll_to_earth(42.1, 19.1)) AS distance FROM pizzerias ORDER BY distance DESC LIMIT 25 ) SELECT c.name AS city, p.name, p.distance FROM pizzerias p, LATERAL (SELECT name FROM cities_min c ORDER BY earth_distance(ll_to_earth(c.lat, c.lng), p.point) LIMIT 1) c;
city | name | distance --------------+---------------------------+-------------------- Pljevlja | Brut | 141584.33308274313 Zabljak | Balkan | 117451.9785288297 Zabljak | Caffe pizzeria Balkan | 117435.32995569201 Zabljak | Lazar | 117308.9999099511 Bijelo Polje | cezar | 116663.29228130652 Bijelo Polje | Orao | 116642.62285195319 Bijelo Polje | Pierre | 116580.71776744149 Bijelo Polje | Padrino | 104133.00996894344 Kolasin | Padrino Pizzerua | 103643.11705954213 Kolasin | Pizzeria Kapri | 87709.3205690596 Niksic | Chaplin | 76339.67053537675 Niksic | Pizzeria Garfield | 76139.35825221038 Niksic | Stari Nikšić | 76118.70681853226 Niksic | Piazza Restaurant & Caffe | 76110.42052261815 Niksic | K2 | 75899.71658365549 Niksic | Šef | 75850.41193677587 Herceg Novi | Grifonee | 62681.009366742284 Herceg Novi | Nautilus | 62597.163060010935 Herceg Novi | Caffe Pizzeria Biblioteka | 62586.73859525146 Herceg Novi | Forte | 61781.28207372379 Herceg Novi | Atelier | 61018.18974275154 Herceg Novi | Popaj | 60869.123535469735 Herceg Novi | Cafe Pizzeria Trg | 60721.08839248005 Herceg Novi | Caffe Pizzeria 5 | 60702.71868706536 Herceg Novi | Pic Nic | 60417.41134618965 (25 rows) Time: 8.487 ms
First, we select the farthest venues, using the distance sort in a descending order. Then we simply have to “traverse” this data, selecting the matching city for each row.
One more example, let’s find the most “pizza-loving cities”, assuming that such cities will have the biggest count of pizzerias in a 10 km radius from the center. A more logical approach would be to calculate the ratio of pizzerias counts to cities population, but I didn’t want to search for demographic data, so we’ll just count absolute value.
SELECT c.name, count(cp) FROM cities_min c, LATERAL (SELECT "name" FROM pizzerias p WHERE earth_distance(ll_to_earth(p.lat, p.lng), ll_to_earth(c.lat, c.lng)) < 10000 ) AS cp GROUP BY c.name ORDER BY count(cp) DESC;
name | count --------------+------- Podgorica | 46 Tivat | 36 Kotor | 31 Budva | 23 Bar | 19 Herceg Novi | 16 Ulcinj | 16 Petrovac | 11 Niksic | 6 Cetinje | 6 Danilovgrad | 3 Bijelo Polje | 3 Zabljak | 3 Pljevlja | 1 Kolasin | 1 (15 rows) Time: 65.703 ms
The results speak for themselves. Besides the capital (with the biggest population), coastal cities like pizza much more than mountain areas. Even trendy ski resorts don’t skew the data very much.
Get actionable insights for your product