Distribution of Pokémon in Munich

After the launch of Pokémon Go at the beginning of July, everyone seemed to be on the lookout for Pokémon. While the hunt itself became tedious after a week or two, I found the idea of looking at the data behind the game a lot more interesting. To that end, I forked the very popular Pokémon Go Map project and added some logic to collect and store the data in a MySQL database.

I then used Tableau to build some fancy visualizations of the dataset at hand. All the data I refer to later can be found here.

TL;DR: Did analytics and visualizations on Pokémon - see Interactive Tableau Sheet

The Data

The data collected by the script is pretty raw and simple: essentially a single table with ~206k rows, each one representing the appearance of a single Pokémon. The area I scanned is best described as a hexagon with a “radius” (center to corner) of around 3.65km (=34.6km²) around the center of Munich, Germany. This already gives us some interesting statistics (within a certain margin of error of course - see my thoughts regarding the quality of the data below), e.g. a Pokémon appearance rate of 124 Pokémon / km² / h.
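As a quick sanity check of these figures, here is a minimal Python sketch. It assumes the “radius” is the center-to-corner distance of a regular hexagon and that the rate was computed as appearances / (area × hours); the scan duration is not stated in the text, so the value derived here is only what the quoted numbers imply.

```python
import math

# Figures quoted in the text above.
CIRCUMRADIUS_KM = 3.65        # "radius" of the scanned hexagon
TOTAL_APPEARANCES = 206_000   # ~206k rows
RATE_PER_KM2_H = 124          # quoted appearance rate


def hexagon_area(circumradius: float) -> float:
    """Area of a regular hexagon given its center-to-corner distance."""
    return 3 * math.sqrt(3) / 2 * circumradius ** 2


area_km2 = hexagon_area(CIRCUMRADIUS_KM)

# Assumption: rate = appearances / (area * hours), solved for hours.
implied_hours = TOTAL_APPEARANCES / (area_km2 * RATE_PER_KM2_H)

print(f"area: {area_km2:.1f} km²")
print(f"implied scan time: {implied_hours:.0f} h")
```

The area matches the 34.6km² quoted above, and the implied effective scan time comes out at roughly two days.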

Raw Pokémon data

This raw data is already interesting on its own, since it contains three relevant dimensions: the ID of the Pokémon as well as the time and location of its appearance. However, a more extensive and human-friendly representation needs more data, which I - of course - found in the depths of the internet in the form of an unbelievable number of projects aiming to bring Pokémon into a structured, queryable format. This allowed me to extract additional data such as details, evolution levels and types and store them in my database as well.

Using some simple SQL statements, I created a single SQL table that applies a few changes to the original data:

  • Added the type of the Pokémon (id + human-friendly name)
  • Added groupId and groupOrder, which enable clustering on a finer level by putting Pokémon into logical groups, e.g. Bulbasaur and all its evolutions have groupId=1 and the respective orders 1, 2, 3. The screenshot below illustrates this.
  • Added the German and English name of the Pokémon
  • Replaced the complicated alphanumeric encounter_id with a numeric encounterId
  • Omitted spawnpoint_id, which seemed to be only random, non-repeating IDs
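
To make the groupId / groupOrder idea more concrete, here is a small Python sketch. It assumes that the Pokémon of one evolution chain have consecutive IDs; the groupId 5 for the Weedle chain and the helper name group_orders are hypothetical and only serve as an illustration.

```python
# Hypothetical excerpt of the details table: pokemonId -> groupId.
# IDs 1-3 are Bulbasaur, Ivysaur, Venusaur; 13-15 are Weedle, Kakuna, Beedrill.
GROUP_OF = {1: 1, 2: 1, 3: 1, 13: 5, 14: 5, 15: 5}


def group_orders(group_of):
    """Return pokemonId -> position within its group (1 = base form).

    Assumes consecutive IDs within a group, so the position is the
    offset from the lowest ID of that group.
    """
    group_start = {}
    for pokemon_id, group_id in group_of.items():
        current = group_start.get(group_id, pokemon_id)
        group_start[group_id] = min(current, pokemon_id)
    return {
        pokemon_id: pokemon_id - group_start[group_id] + 1
        for pokemon_id, group_id in group_of.items()
    }


orders = group_orders(GROUP_OF)
print(orders[1], orders[2], orders[3])     # Bulbasaur chain
print(orders[13], orders[14], orders[15])  # Weedle chain
```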

We can now use this table and join it with the other tables to create the final result that we’ll use for the visualizations. The following query will give us everything we need - I’m sure it can be further optimized by proper indexes, temporary tables and SQL Voodoo, but since it completes in less than 20 seconds, I didn’t really bother.

aggregation.sql
SELECT
    p.enc_id AS encounterId,
    p.pokemon_id AS pokemonId,
    pd.name_de AS nameDE,
    pd.name_en AS nameEN,
    pd.groupId,
    -- Position within the evolution chain. IDs 1-3 form the first group,
    -- which starts at id 1, so the modulo trick would not work for them.
    IF(pd.id < 4, pd.id, (pd.id % pp.groupStartsAt) + 1) AS groupOrder,
    pt.type_id AS typeId,
    t.type AS pokemonType,
    p.latitude AS lat,
    p.longitude AS lng,
    UNIX_TIMESTAMP(p.disappear_time) AS disappearsAt
FROM
    -- Number the raw encounters chronologically to get a numeric encounterId
    (SELECT
        pp.*, @rownum:=@rownum + 1 AS enc_id
    FROM
        pokemon pp, (SELECT @rownum:=0) r
    WHERE
        pp.disappear_time > '2016-07-26 00:00:00'
    ORDER BY pp.disappear_time ASC) AS p
        JOIN
    (pokemon_types pt,
    pokemon_details pd,
    types t,
    -- Lowest Pokémon id of each group; MIN() keeps the query valid
    -- when ONLY_FULL_GROUP_BY is enabled
    (SELECT
        groupId, MIN(id) AS groupStartsAt
    FROM
        pokemon_details
    GROUP BY groupId) AS pp)
    ON (p.pokemon_id = pt.pokemon_id
        AND p.pokemon_id = pd.id
        AND t.type_id = pt.type_id
        AND pp.groupId = pd.groupId)
ORDER BY encounterId, disappearsAt ASC;

Enhanced Data

Adding the type of a Pokémon increased the number of entries from 205k to 322k, since a Pokémon can have multiple types and is then listed once for each type, e.g. Pidgey in the screenshot below. We will account for this later in Tableau by making sure we use type as a dimension in our visualizations.

Enhanced Pokémon data
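
The row multiplication caused by the type join can be reproduced with a tiny Python sketch. The three sample encounters and the helper names are made up for illustration; only the Pokémon IDs and their types are real. The point is that counting rows overstates the number of appearances, while counting distinct encounterIds does not.

```python
# Hypothetical sample: (encounterId, pokemonId); 16 = Pidgey, 19 = Rattata.
encounters = [(1, 16), (2, 19), (3, 16)]

# Pidgey has two types, Rattata only one.
types_of = {16: ["Normal", "Flying"], 19: ["Normal"]}


def join_with_types(encounters, types_of):
    """Emit one row per (encounter, type), like the SQL join does."""
    return [
        (enc_id, pokemon_id, pokemon_type)
        for enc_id, pokemon_id in encounters
        for pokemon_type in types_of[pokemon_id]
    ]


def count_encounters(rows):
    """Count appearances without double counting multi-type Pokémon."""
    return len({enc_id for enc_id, _, _ in rows})


rows = join_with_types(encounters, types_of)
print(len(rows), count_encounters(rows))  # prints "5 3"
```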

Preprocessing

Tableau proved to be surprisingly clumsy in handling CSV as import data, making it hard for me to bring the data into the desired form. After a few unsatisfying attempts to properly import the CSV with Tableau, I decided to do the preprocessing in Excel and then use this sheet as the data source for the import into Tableau. This went significantly smoother than importing from CSV and gave me nice data to work with. The only things that remained to be adjusted manually were:

  • Transforming the Unix Time of disappear to a proper date using the following formula: DATEADD('hour',-8,(Date("1/1/1970") + ([disappears]/86400)))
  • Changing the type of ids from continuous to discrete numbers
  • Telling Tableau that the decimals are actually lat / long coordinates
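
For readers who prefer to do the time conversion outside of Tableau, here is a rough Python equivalent of the formula above. The -8 hour shift is copied as-is from the Tableau expression; the function name disappear_date is an assumption of this sketch.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
SHIFT_HOURS = -8  # taken verbatim from the Tableau formula


def disappear_date(unix_seconds: int) -> datetime:
    """Mirror DATEADD('hour', -8, Date("1/1/1970") + seconds / 86400)."""
    # seconds / 86400 is a number of days added to the epoch
    days_since_epoch = timedelta(days=unix_seconds / 86400)
    return EPOCH + days_since_epoch + timedelta(hours=SHIFT_HOURS)


print(disappear_date(86400))  # one day after the epoch, shifted by -8 hours
```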

Quality of the data

The available data had a few flaws that we should talk about before continuing.

Missing Data

While collecting the data, I paused the script multiple times for a short while. During these pauses, no data was collected at all. This is not a problem if we display the data on a map, but it distorts the visualization in time-based charts.

Missing data
Distorted visualization
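
Such pauses can be located in the data before charting. The following Python sketch flags every pair of consecutive timestamps that lie further apart than a threshold; the 10 minute default and the function name find_gaps are assumptions, not something the collection script provides.

```python
def find_gaps(timestamps, max_gap_seconds=600):
    """Return (start, end) pairs where no Pokémon was recorded for longer
    than max_gap_seconds. Timestamps are Unix seconds (disappearsAt)."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap_seconds
    ]


# Hypothetical sample: steady data, a pause of about an hour, then data again.
sample = [0, 60, 120, 4000, 4060]
print(find_gaps(sample))  # [(120, 4000)]
```

The detected intervals could then be excluded or marked in time-based charts.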

Changing search radius

In addition to stopping the script several times, I also changed its parameters multiple times, e.g. increasing or decreasing the search radius. This had a direct influence on the number of Pokémon found in a certain interval. As a result, absolute numbers are not comparable across different timespans.

Total appearances per hour

Geographical distribution

Because of the way the algorithm works (simulating a Pokémon Go player running in an endless spiral from the inside to the outside and then starting over from the center again), the likelihood of missing the appearance of a Pokémon increases the further you move away from the center. We can see this by using the cluster function of Tableau with the number of appearances of a Pokémon as its only parameter. This gives us three clusters (low=red, medium=yellow, high=blue) for the number of total appearances.

Cluster

This clustering is also influenced by the change in search radius, but in my opinion mostly caused by the increased probability of missing a Pokémon in the areas further away from the center.

disappearTime

The data we use when aggregating over time is not actually the time when the Pokémon appeared on the map, but rather the time when it will disappear. This is again due to the way the script works: we do not observe the actual time of appearance ourselves, and the Pokémon Go API does not tell us when a certain Pokémon appeared either. However, since Niantic tells us the disappear time of each Pokémon, we simply assume that each Pokémon stays on the map for the same duration and can thus use the disappear time for our timeline, especially when looking for interesting patterns rather than concrete predictions.

Visualization

The Tableau Sheet I created is available here and you can try out everything I’ll describe here - which you should absolutely do.

Count per type

This visualization shows all the Pokémon of a certain type (e.g. Bug, Fire, ...) on a map. Pokémon belonging to the same group (e.g. Weedle -> Kakuna -> Beedrill) have the same base color. The size of the marker on the map indicates the total number of appearances of this particular Pokémon. If you analyze the data, different patterns emerge. For Pokémon of type Bug, for example, you can see below that all three Pokémon of the Weedle -> Kakuna -> Beedrill evolution chain are found in exactly the same places, just with decreasing probability.

Bug-Pokémon found on map
Weedle found on map
Kakuna found on map
Beedrill found on map

Each pattern is interesting on its own and there’s not enough time to talk about each one here, but one I found particularly interesting is the distribution pattern of the rare Dragon Pokémon, which seems to be aligned with Munich’s heavily frequented Altstadtring and the river Isar.

Dragons found on map

Count per type and evolution

This is probably one of the neatest visualizations, since it shows a lot of information in a visually very appealing way. It’s basically the same concept as the one we already talked about, but instead of using just one map, we use a grid of maps, where the X-axis is the evolution (aka groupOrder) of the Pokémon and the Y-axis is the type. Again, I really encourage you to look at the interactive Tableau Sheet yourself - you can do it all in your browser and apply your own filters / constraints.

Map grid

Spawns over time

That’s a cool one, too. It shows the spawns of all Pokémon on a timeline (in minutes), grouped by Pokémon groups. Each pixel-wide line marks the spawn time (actually disappearsAt - see above) of a Pokémon, color coded by group and pokemonId. You can see that there was barely a minute in which no Pidgey spawned somewhere in the area we scanned. You can also see that spawn times seem to be distributed fairly evenly, although the frequency decreases the higher a Pokémon’s evolution level is.

Spawn over time

Statistical analysis

Another interesting way of looking at the available data is taking coordinates as what they actually are: decimal numbers. This allows us to apply statistical metrics like average or median to them and get valid lat/long coordinates back, which we can then visualize again, e.g. by showing both the average and median spawn points of each Pokémon on our map. The chart shows a higher density of spawn points in the center for the average compared to the median. This reflects what we saw in the previous visualizations as well: more Pokémon spawned in the center than towards the borders of the scanned area.

Median and average spawns of Pokémon
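
Computing such center points is straightforward, since latitude and longitude are aggregated independently. Here is a minimal Python sketch; the three sample coordinates and the function name center_points are made up for illustration.

```python
from statistics import mean, median


def center_points(points):
    """Return ((mean_lat, mean_lng), (median_lat, median_lng)) for a
    list of (lat, lng) spawn points of a single Pokémon."""
    lats = [lat for lat, _ in points]
    lngs = [lng for _, lng in points]
    return (mean(lats), mean(lngs)), (median(lats), median(lngs))


# Hypothetical spawn points of one Pokémon around Munich.
spawns = [(48.10, 11.50), (48.14, 11.58), (48.15, 11.60)]
average_point, median_point = center_points(spawns)
print(average_point, median_point)
```

Averaging raw coordinates like this is only reasonable because the scanned area is small; over large distances the curvature of the earth would have to be taken into account.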

We can also look at metrics like the standard deviation of the spawn points of each Pokémon. This shows how spread out the individual spawn points are around the average spawn point of each Pokémon, meaning that the further a Pokémon is to the right/top of the plot, the higher its deviation is. The plot shows the deviation in meters, which at such short distances can be derived fairly accurately from the lat/long differences without having to account for the curvature of the earth.
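
One way to get from degrees to meters is the flat-earth approximation sketched below in Python. The constants are assumptions of this sketch, not values taken from the Tableau sheet: roughly 111,320 m per degree of latitude, and a reference latitude of about 48.137° for the center of Munich to scale the longitude. Whether the original plot used the population or the sample standard deviation is not stated; the sketch uses the population variant.

```python
import math
from statistics import pstdev

METERS_PER_DEGREE_LAT = 111_320  # rough value, assumption
REFERENCE_LAT_DEG = 48.137       # approximate center of Munich, assumption


def spread_in_meters(points):
    """Return (north/south, east/west) standard deviation in meters
    for a list of (lat, lng) spawn points."""
    lat_sd = pstdev([lat for lat, _ in points])
    lng_sd = pstdev([lng for _, lng in points])
    # One degree of longitude shrinks with the cosine of the latitude.
    meters_per_degree_lng = METERS_PER_DEGREE_LAT * math.cos(
        math.radians(REFERENCE_LAT_DEG)
    )
    return lat_sd * METERS_PER_DEGREE_LAT, lng_sd * meters_per_degree_lng


# Two hypothetical spawn points, 0.02° apart in latitude only.
north_south, east_west = spread_in_meters([(48.13, 11.57), (48.15, 11.57)])
print(round(north_south), round(east_west))
```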

When we look at the median of the standard deviations, we can claim (I know: this is very simplified and not really significant) that the average Pokémon appears 1.8km north/south and 1.4km east/west from its arithmetical center.

I also tried to find some clusters in the standard deviation plot, but neither type nor group nor groupId resulted in any significant clusters. The clustering in the screenshot below is solely based on the standard deviation of lat and long, which obviously finds (meaningless) clusters.

Median and average spawns of Pokémon

Conclusion

There’s no real conclusion, but we nonetheless found out a few interesting things about where, when and how Pokémon appear in Pokémon Go.

  • Spawn rate and ratio seem to be constant and are not changing over time.
  • Related Pokémon mostly spawn in the same areas, although the more highly evolved they are, the less frequently they appear.
  • Pokémon tend to appear near places that match their type, e.g. Water Pokémon are mostly found near rivers (the same applies to Ice and - weirdly enough - Fire Pokémon).
  • 95.6% of all recorded Pokémon are of the lowest evolution level (1), 4.1% have evolved to the second level and only 0.32% have reached evolution level 3.
  • There’s a Pidgey and Rattata epidemic in Munich, with each accounting for 14% of all Pokémon.
  • 28 Pokémon never showed up in Munich, 123 out of 151 however did.
  • There is a relation between evolution level and spawn rate/probability (not really surprising), but there is no statistically significant relation between evolution level and spread over the map.

I really enjoyed putting this together and hope you found it as fascinating as I did :).

Edit, November 2016: Niantic has effectively locked out all crawlers and bots by now, so no fresh data can be collected.