Working with geodata in Python

In this blog post we explain how to work efficiently with geodata in Python.

Geodata can represent different kinds of objects – the most important are the following three (see the sketch after the list).

  • Points (e.g. individual addresses or measuring points)
  • Lines (e.g. routes)
  • Polygons (e.g. countries or postcode areas)
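
As a rough illustration (our own minimal sketch, not tied to any particular library), these three object types can be represented with plain Python tuples and lists:

# A point as a (latitude, longitude) tuple
point = (52.5172, 13.3978)

# A line as an ordered list of points, e.g. a route
line = [(52.5172, 13.3978), (52.5163, 13.3777), (52.5186, 13.3762)]

# A polygon as a list of points whose first and last entries coincide
polygon = [(52.51, 13.39), (52.53, 13.39), (52.53, 13.42), (52.51, 13.42), (52.51, 13.39)]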

In addition to coordinate formats, geodata can also be stored as addresses.

We’ll cover the following topics in this post:

  • Geocoding
  • Visualization
  • Distance calculation and regression on geodata

Geocoding

Prerequisites

The libraries required for this section can be installed from PyPI as follows (here for Python 3):

pip3 install geopy
pip3 install ratelimit
pip3 install tqdm # for progress bars

Geocoding with GeoPy and Nominatim

Geocoding refers to the conversion of addresses into coordinates and, vice versa, the conversion of coordinates into the corresponding address (reverse geocoding).

There are a number of freely available geocoding APIs that are suitable for smaller use cases, e.g. geocoding fewer than 10,000 points per day.

In most cases it is not necessary to call the APIs manually. geopy is an excellent Python library for (among other things) geocoding and reverse geocoding that supports many APIs. In this example we use the Nominatim API, which is based on OpenStreetMap (OSM) data. The OSM data is subject to the Open Database License (ODbL).

A single address is geocoded in geopy with minimal effort:

from geopy.geocoders import Nominatim
gc = Nominatim(user_agent="fintu-blog-geocoding-python")
gc.geocode("Unter den Linden 1, Berlin")
# Location(Kommandantenhaus, 1, Unter den Linden, Spandauer Vorstadt, Mitte, Berlin, 10117, Deutschland, (52.51720765, 13.3978343993255, 0.0))
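
Reverse geocoding works analogously via the reverse method (a quick sketch; the exact address returned depends on the current OSM data):

gc.reverse("52.51720765, 13.3978343993255")
# Location object with the address closest to the given coordinates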

The Nominatim API does not strictly require an address consisting of street, house number, and city; it also knows many business addresses and points of interest. We use the API to create a list of the 114 largest football stadiums in Germany with coordinates – which we will display on a map in the next section.

The Terms of Use of the Nominatim API allow a maximum rate of one call per second. We therefore throttle our API calls with the ratelimit library.

import pandas as pd
from ratelimit import limits, sleep_and_retry

# Progress bar for Pandas apply
from tqdm import tqdm
tqdm.pandas()

# Import stadiums
df = pd.read_csv("./data/stadien.csv")

@sleep_and_retry
@limits(calls=1, period=1)  # at most one call per second
def rate_limited_geocode(query):
    return gc.geocode(query)

def geocode(row):
    lookup_query = row["Name"] + ", " + row["Stadt"]
    lookup_result = rate_limited_geocode(lookup_query)
    return lookup_result

df["geo_location"] = df.progress_apply(geocode, axis=1)

The table now contains the complete addresses and geo-coordinates of the stadiums.

df.loc[0,"geo_location"]
# Location(Signal Iduna Park, 50, Strobelallee, Brünninghausen, Dortmund, Regierungsbezirk Arnsberg, Nordrhein-Westfalen, 44139, Deutschland, (51.49266925, 7.45083008801226, 0.0))
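
The returned Location objects expose the coordinates as attributes, so latitude and longitude can be pulled into separate columns if needed (a small sketch; the column names lat and lon are our own choice):

df["lat"] = df["geo_location"].apply(lambda loc: loc.latitude if loc else None)
df["lon"] = df["geo_location"].apply(lambda loc: loc.longitude if loc else None)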

Some of the addresses could not be geocoded – we ignore them for now.

df[df.geo_location.isnull()].shape[0]
# 9
df = df[~df.geo_location.isnull()]

For geocoding more than a few thousand addresses per day or for large batch jobs, it is recommended to set up your own Nominatim server. Instructions can be found here.

Visualization

Prerequisites

In this section we use Jupyter Notebooks / Jupyter Lab. Instructions for installing Jupyter can be found here. The required libraries can be installed from PyPI as follows (here for Python 3):

pip3 install ipyleaflet
jupyter nbextension enable --py --sys-prefix ipyleaflet # can be skipped for notebook 5.3 and above

# IPython widgets must be enabled for interactive maps
jupyter nbextension enable --py --sys-prefix widgetsnbextension

# For Jupyter Lab, an additional extension is required
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install jupyter-leaflet

Now you can create interactive maps in Jupyter / Jupyter Lab:

from ipyleaflet import Map
m = Map(center=(52.204793, 0.121558), zoom=4, scroll_wheel_zoom=True)
m

Interactive maps with leaflet.js and ipyleaflet

We can now create a map of the geocoded football stadiums.

from ipyleaflet import Map, Marker, MarkerCluster
m = Map(center=(50.721283, 9.056725), zoom=5, scroll_wheel_zoom=True)

def create_marker(row):
    loc = row["geo_location"]
    name = row["Name"]
    lat_lon = (loc.latitude, loc.longitude)
    return Marker(location=lat_lon,
                    draggable=False,
                    title=name)

markers = df.apply(create_marker, axis=1)
marker_cluster = MarkerCluster(markers=tuple(markers.values))
m.add_layer(marker_cluster)

m

For better visibility we switch the background to a plainer, more muted basemap. We also replace the markers with circles whose radius represents the capacity (number of seats) of the stadium.

from ipyleaflet import Map, basemaps, basemap_to_tiles, CircleMarker, LayerGroup

background = basemap_to_tiles(basemaps.OpenStreetMap.DE)
m = Map(layers=(background, ),
        center=(50.721283, 9.056725),
        zoom=5,
        scroll_wheel_zoom=True)

biggest_stadium_size = df["Plätze"].max()

def create_marker(row):
    loc = row["geo_location"]
    name = row["Name"]
    size = int((row["Plätze"]/biggest_stadium_size)*15)
    lat_lon = (loc.latitude, loc.longitude)
    return CircleMarker(location=lat_lon,
                    draggable=False,
                    title=name,
                    radius=size,
                    weight=1)

markers = df.apply(create_marker, axis=1)
layer_group = LayerGroup(layers=tuple(markers.values))
m.add_layer(layer_group)

m
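
Depending on your ipyleaflet version, the finished map can also be exported as a standalone HTML page via the save method (a hedged sketch; the filename is arbitrary):

m.save("stadiums.html", title="Football stadiums in Germany")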

Distance calculation and regression on geodata

Prerequisites

In this section we again use geopy and, for the k-nearest-neighbors regression, the scikit-learn library.

pip3 install geopy
pip3 install scikit-learn

K-nearest-neighbors regression on geodata

In this example we use a data set of 74,219 geocoded measurements of water pumps in Tanzania (source: DrivenData). The (partly missing or incorrect) geocodings have already been cleaned up and filled in.

df = pd.read_csv("./data/tanzania.csv", low_memory=False)
df.shape[0]
# 74219
df.latitude.isnull().sum() + df.longitude.isnull().sum()
# 0

However, the column gps_height, which gives the altitude of the measuring point, still contains missing values for about a third of all measuring points.

df.gps_height.isnull().sum()
# 25637

We use a weighted k-nearest-neighbors regression on the columns latitude and longitude to estimate the missing values. Note that the missing gps_height values have a strongly regional distribution (see the map below), which tends to hurt the result of our regression.
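
For intuition: with weights="distance", each of the k neighbors contributes to the prediction in inverse proportion to its distance. A minimal sketch of this weighting with hypothetical values (not taken from the data set):

import numpy as np

# Hypothetical distances (km) to three neighbors and their gps_height values
d = np.array([1.0, 2.0, 4.0])
h = np.array([300.0, 500.0, 900.0])

# Inverse-distance weighting: closer neighbors dominate the prediction
w = 1.0 / d
prediction = np.sum(w * h) / np.sum(w)
# (300/1 + 500/2 + 900/4) / (1 + 0.5 + 0.25) ≈ 442.9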

import numpy as np
from ipyleaflet import Map, basemaps, basemap_to_tiles, CircleMarker, LayerGroup

# For performance reasons we use 1,000 randomly selected measurements
df_sel = df.sample(n=1000, random_state=2000)

background = basemap_to_tiles(basemaps.OpenStreetMap.BlackAndWhite)
m = Map(layers=(background, ),
        center=(-6.57729778901325, 35.11030912399293),
        zoom=5,
        scroll_wheel_zoom=True)

def create_marker(row):
    lat_lon = (row["latitude"], row["longitude"])
    is_missing = np.isnan(row["gps_height"])
    if is_missing:
        color = "#FF0000"
    else:
        color = "#0033FF"
    return CircleMarker(location=lat_lon,
                    draggable=False,
                    fill_color=color,
                    fill_opacity=0.5,
                    radius=5,
                    stroke=False)

markers = df_sel.apply(create_marker, axis=1)
layer_group = LayerGroup(layers=tuple(markers.values))
m.add_layer(layer_group)

m

The red circles mark measuring points with a missing gps_height value; the blue circles mark measuring points where gps_height is present.

The weighted k-nearest-neighbors regression shows a very good result on the available data.

from geopy.distance import great_circle
from sklearn import neighbors
from sklearn.model_selection import train_test_split

# n_neighbors regression parameter
n_neighbors = 15

# Splitting of data according to the presence of `gps_height`
df_regr = df[~df.gps_height.isnull()][["latitude", "longitude", "gps_height"]]

# Splitting into training/test sets
df_train, df_test = train_test_split(df_regr, test_size=0.2, random_state=42)

# Splitting into independent and dependent variables
def split_x_y(df):
    X = df[["latitude", "longitude"]].to_numpy()
    y = df["gps_height"].to_numpy()
    return X, y
X_train, y_train = split_x_y(df_train)
X_test, y_test = split_x_y(df_test)

# Defining the regression
regr = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights="distance", n_jobs=-1)

# Training
regr.fit(X_train, y_train)

# Calculation of regression R²
regr.score(X_test, y_test)
# 0.9977175260422764

The algorithm calculates the distance between two points as the Euclidean (L2) distance in the dimensions longitude and latitude. However, we have not yet taken into account that …

  • … a degree of longitude and a degree of latitude correspond to different actual distances (in km)
  • … the conversion factor for the longitude depends on the latitude

This can mislead the algorithm when selecting the “nearest neighbors” and lead to worse results; the further apart the points are, the greater the effect. Even though in our example the variance in latitude is low, the points are quite close to each other, and our algorithm with k=15 is very robust against uncertainties in the distance, we try to improve the regression by calculating the “real” distance and weighting by it. This is particularly useful because of the regional distribution of points with missing gps_height values mentioned above.
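
To get a feeling for the magnitudes involved, we can compare degree distances with geopy’s great_circle (approximate values, since great_circle assumes a spherical earth):

from geopy.distance import great_circle

great_circle((0, 0), (1, 0)).km         # 1° of latitude: ~111.2 km
great_circle((0, 0), (0, 1)).km         # 1° of longitude at the equator: ~111.2 km
great_circle((52.5, 13), (52.5, 14)).km # 1° of longitude at Berlin's latitude: ~67.7 km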

from geopy.distance import great_circle
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

# n_neighbors regression parameter
n_neighbors = 15

# Definition of distance function
def dist_km(a, b):
    return great_circle(a, b).kilometers

# Splitting of data according to the presence of `gps_height`
df_regr = df[~df.gps_height.isnull()][["latitude", "longitude", "gps_height"]]

# Splitting into independent and dependent variables
X = df_regr[["latitude", "longitude"]].to_numpy()
y = df_regr["gps_height"].to_numpy()

# Defining the regression
regr = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights="distance", n_jobs=-1)
regr_realdist = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights="distance", metric=dist_km, n_jobs=-1)

# Calculating the precision of the regression via cross-validation (this can take a few minutes!)
score = cross_val_score(regr, X, y, cv=5, verbose=2)
score_realdist = cross_val_score(regr_realdist, X, y, cv=5, verbose=2)

df_scores = pd.DataFrame(np.array([score, score_realdist]).T, columns=["lat/lon metric", "km metric"])
%matplotlib inline
df_scores.boxplot()

The test shows that for the “dense” points with a gps_height value present, it makes no significant difference whether we use the Euclidean distance on longitude and latitude or the actual distance in kilometres.

However, since the points whose gps_height value we want to estimate are sometimes very far from the points used to train the algorithm, we nevertheless use the actual distance to estimate the missing values.

# Training of the algorithm on the full available data set
regr_realdist.fit(X, y)

df_missing = df[df.gps_height.isnull()][["id","latitude","longitude"]]
X_missing = df_missing[["latitude", "longitude"]].to_numpy()
y_missing = regr_realdist.predict(X_missing)
df_missing["gps_height"] = y_missing

df_not_missing = df[~df.gps_height.isnull()]
df_imputation = pd.concat([df_not_missing,df_missing])[["id","gps_height"]]

df_imputed = df.copy()
del df_imputed["gps_height"]
df_imputed = df_imputed.join(df_imputation.set_index("id"), on="id")
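
As a final sanity check (assuming the id column is unique), no missing gps_height values should remain after the join:

df_imputed.gps_height.isnull().sum()
# 0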