In this blog post we explain how to work efficiently with geodata in Python.
Geodata can represent different types of objects – the three most important are the following.
- Points (e.g. individual addresses or measuring points)
- Lines (e.g. routes)
- Polygons (e.g. countries or postcode areas)
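Before reaching for specialized libraries, these three object types can be sketched as plain Python data structures (a minimal illustration using tuples and lists; the coordinates are made up):

```python
# A point: a single (latitude, longitude) pair
point = (52.5170, 13.3978)

# A line: an ordered sequence of points, e.g. a route
line = [(52.5170, 13.3978), (52.5200, 13.4050), (52.5230, 13.4130)]

# A polygon: a closed ring of points (first and last point coincide)
polygon = [(52.51, 13.39), (52.51, 13.41), (52.53, 13.41),
           (52.53, 13.39), (52.51, 13.39)]

# The ring is closed:
assert polygon[0] == polygon[-1]
```

Libraries such as shapely or GeoPandas provide richer types for these objects; in this post we work almost exclusively with points.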
In addition to coordinate formats, geodata can also be stored as addresses.
We’ll explain in this post:
- how addresses can be converted to coordinates (latitude and longitude) (geocoding),
- how points and polygons can be displayed on an interactive map (visualization),
- and how distances can be calculated and a k-nearest-neighbors regression can be performed on geodata (distance calculation and regression on geodata).
The libraries required for this section can be installed from PyPI as follows (here for Python 3):

```
pip3 install geopy
pip3 install ratelimit
pip3 install tqdm  # for progress bars
```
Geocoding with GeoPy and Nominatim
Geocoding refers to the conversion of addresses into coordinates and, vice versa, the conversion of coordinates into the corresponding address (reverse geocoding).
There are a number of freely available geocoding APIs that are suitable for smaller use cases, e.g. geocoding of less than 10,000 points per day.
In most cases it is not necessary to call the APIs manually.
geopy is an excellent Python library for (among others) geocoding and reverse geocoding that supports many APIs. In this example we use the Nominatim API, which is based on OpenStreetMap (OSM) data. The OSM data is subject to the Open Database License (ODbL).
A single address is geocoded in geopy with minimal effort:
```python
from geopy.geocoders import Nominatim

gc = Nominatim(user_agent="fintu-blog-geocoding-python")
gc.geocode("Unter den Linden 1, Berlin")
# Location(Kommandantenhaus, 1, Unter den Linden, Spandauer Vorstadt, Mitte, Berlin, 10117, Deutschland, (52.51720765, 13.3978343993255, 0.0))
```
The Nominatim API does not necessarily require an address consisting of street, house number, and city, but also knows many business addresses and points of interest. We use the API to create a list of the 114 largest football stadiums in Germany with coordinates – which we present in the next section on a map.
```python
import pandas as pd
from ratelimit import limits, sleep_and_retry

# Progress bar for Pandas apply
from tqdm import tqdm
tqdm.pandas()

# Import stadiums
df = pd.read_csv("./data/stadien.csv")

# Limit requests to max. 1 call per second (Nominatim usage policy)
@sleep_and_retry
@limits(calls=1, period=1)
def rate_limited_geocode(query):
    return gc.geocode(query)

def geocode(row):
    lookup_query = row["Name"] + ", " + row["Stadt"]
    lookup_result = rate_limited_geocode(lookup_query)
    return lookup_result

df["geo_location"] = df.progress_apply(geocode, axis=1)
```
The table now contains the complete addresses and geo-coordinates of the stadiums.
```python
df.loc[0, "geo_location"]
# Location(Signal Iduna Park, 50, Strobelallee, Brünninghausen, Dortmund, Regierungsbezirk Arnsberg, Nordrhein-Westfalen, 44139, Deutschland, (51.49266925, 7.45083008801226, 0.0))
```
Some of the addresses could not be geocoded – we ignore them for now.
```python
df[df.geo_location.isnull()].shape[0]
# 9
df = df[~df.geo_location.isnull()]
```
For geocoding more than a few thousand addresses per day, or for large geocoding batch jobs, it is recommended to set up your own Nominatim server. Instructions can be found here.
Interactive maps with leaflet.js and ipyleaflet

In this section we use Jupyter Notebooks / Jupyter Lab. Instructions for installing Jupyter can be found here. The required libraries can be installed from PyPI as follows (here for Python 3):

```
pip3 install ipyleaflet
jupyter nbextension enable --py --sys-prefix ipyleaflet  # can be skipped for notebook 5.3 and above
# IPython widgets must be enabled for interactive maps
jupyter nbextension enable --py --sys-prefix widgetsnbextension
# For Jupyter Lab, an additional extension is required
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install jupyter-leaflet
```

Now you can create interactive maps in Jupyter / Jupyter Lab:

```python
from ipyleaflet import Map

m = Map(center=(52.204793, 360.121558), zoom=4, scroll_wheel_zoom=True)
m
```

We can now create a map of the geocoded football stadiums.
```python
from ipyleaflet import Map, Marker, MarkerCluster

m = Map(center=(50.721283, 9.056725), zoom=5, scroll_wheel_zoom=True)

def create_marker(row):
    loc = row["geo_location"]
    name = row["Name"]
    lat_lon = (loc.latitude, loc.longitude)
    return Marker(location=lat_lon, draggable=False, title=name)

markers = df.apply(create_marker, axis=1)
marker_cluster = MarkerCluster(markers=tuple(markers.values))
m.add_layer(marker_cluster)
m
```
For better visibility we switch the background to a more muted basemap. We also replace the markers with circles whose radius represents the capacity (number of seats) of the stadium.
```python
from ipyleaflet import Map, basemaps, basemap_to_tiles, CircleMarker, LayerGroup

background = basemap_to_tiles(basemaps.OpenStreetMap.DE)
m = Map(layers=(background,), center=(50.721283, 9.056725), zoom=5, scroll_wheel_zoom=True)

biggest_stadium_size = df["Plätze"].max()

def create_marker(row):
    loc = row["geo_location"]
    name = row["Name"]
    size = int((row["Plätze"] / biggest_stadium_size) * 15)
    lat_lon = (loc.latitude, loc.longitude)
    return CircleMarker(location=lat_lon, draggable=False, title=name, radius=size, weight=1)

markers = df.apply(create_marker, axis=1)
layer_group = LayerGroup(layers=tuple(markers.values))
m.add_layer(layer_group)
m
```
Distance calculation and regression on geodata
In this section we again use geopy and, for the k-nearest-neighbors regression, scikit-learn. Both can be installed from PyPI as follows (here for Python 3):

```
pip3 install geopy
pip3 install scikit-learn
```
K-nearest-neighbors regression on geodata
In this example we use a data set of 74,219 geocoded measurements of water pumps in Tanzania (source: DrivenData). The (partly missing or wrong) geocodings have already been cleaned up and added.
```python
df = pd.read_csv("./data/tanzania.csv", low_memory=False)
df.shape[0]
# 74219
df.latitude.isnull().sum() + df.longitude.isnull().sum()
# 0
```
However, the column `gps_height`, which specifies the height of the measuring point, still contains missing values for about a third of all measuring points.

```python
df.gps_height.isnull().sum()
# 25637
```
We use a weighted k-nearest-neighbors regression on the columns latitude and longitude to estimate the missing values. Note that the missing values for `gps_height` have a very regional distribution, which tends to have a negative effect on the result of our regression.
```python
import numpy as np
from ipyleaflet import Map, basemaps, basemap_to_tiles, CircleMarker, LayerGroup

# For performance reasons we use 1000 randomly selected measurements
df_sel = df.sample(n=1000, random_state=2000)

background = basemap_to_tiles(basemaps.OpenStreetMap.BlackAndWhite)
m = Map(layers=(background,), center=(-6.57729778901325, 35.11030912399293),
        zoom=5, scroll_wheel_zoom=True)

def create_marker(row):
    lat_lon = (row["latitude"], row["longitude"])
    is_missing = np.isnan(row["gps_height"])
    color = "#FF0000" if is_missing else "#0033FF"
    return CircleMarker(location=lat_lon, draggable=False, fill_color=color,
                        fill_opacity=0.5, radius=5, stroke=False)

markers = df_sel.apply(create_marker, axis=1)
layer_group = LayerGroup(layers=tuple(markers.values))
m.add_layer(layer_group)
m
```
The red circles mark measuring points with a missing `gps_height` value; the blue circles mark measuring points where `gps_height` is present.
The weighted k-nearest-neighbors regression shows a very good result on the available data.
```python
from sklearn import neighbors
from sklearn.model_selection import train_test_split

# n_neighbors regression parameter
n_neighbors = 15

# Restrict the data to rows where `gps_height` is present
df_regr = df[~df.gps_height.isnull()][["latitude", "longitude", "gps_height"]]

# Splitting into training/test sets
df_train, df_test = train_test_split(df_regr, test_size=0.2, random_state=42)

# Splitting into independent and dependent variables
def split_x_y(df):
    X = df[["latitude", "longitude"]].to_numpy()
    y = df["gps_height"].to_numpy()
    return X, y

X_train, y_train = split_x_y(df_train)
X_test, y_test = split_x_y(df_test)

# Defining the regression
regr = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights="distance", n_jobs=-1)

# Training
regr.fit(X_train, y_train)

# Calculation of the regression R²
regr.score(X_test, y_test)
# 0.9977175260422764
```
The algorithm calculates the distance between two points as Euclidean (L2) distance in the dimensions longitude and latitude. However, we have not yet taken into account that …
- … one degree of longitude and one degree of latitude correspond to different actual distances (in km),
- … and the actual distance covered by one degree of longitude depends on the latitude.
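The size of both effects is easy to quantify with the haversine formula (a self-contained sketch using only the standard library; the Earth radius of 6371 km is the usual spherical approximation):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of latitude is roughly constant everywhere:
haversine_km(0.0, 0.0, 1.0, 0.0)    # ≈ 111.2 km
haversine_km(52.0, 0.0, 53.0, 0.0)  # ≈ 111.2 km

# One degree of longitude shrinks with increasing latitude:
haversine_km(0.0, 0.0, 0.0, 1.0)    # ≈ 111.2 km at the equator
haversine_km(52.0, 0.0, 52.0, 1.0)  # ≈ 68.5 km at 52° north
```

A naive Euclidean metric on degrees therefore treats two points one degree apart in longitude as just as close as two points one degree apart in latitude, although the real distances can differ considerably.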
This can interfere with the algorithm's selection of the "nearest neighbors" and lead to worse results; the further apart the points are, the greater the effect. Even though in our example the variance in latitude is low, the points are quite close together, and our algorithm with k=15 is very robust against uncertainties in the distance, we try to improve the regression by calculating the "real" distance and weighting by it. This is particularly useful due to the regional distribution of points with missing `gps_height` values mentioned above.
```python
from geopy.distance import great_circle
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

# n_neighbors regression parameter
n_neighbors = 15

# Definition of the distance function
def dist_km(a, b):
    return great_circle(a, b).kilometers

# Restrict the data to rows where `gps_height` is present
df_regr = df[~df.gps_height.isnull()][["latitude", "longitude", "gps_height"]]

# Splitting into independent and dependent variables
X = df_regr[["latitude", "longitude"]].to_numpy()
y = df_regr["gps_height"].to_numpy()

# Defining the regressions
regr = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights="distance", n_jobs=-1)
regr_realdist = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights="distance",
                                              metric=dist_km, n_jobs=-1)

# Calculating the precision of the regression via cross-validation (this can take a few minutes!)
score = cross_val_score(regr, X, y, cv=5, verbose=2)
score_realdist = cross_val_score(regr_realdist, X, y, cv=5, verbose=2)

df_scores = pd.DataFrame({"Lat/lon metric": score, "km metric": score_realdist})

%matplotlib inline
df_scores.boxplot()
```
The test shows that for the "dense" points with a `gps_height` value present, it makes no significant difference whether the Euclidean distance on longitude and latitude or the actual distance in kilometres is used. However, since the points whose `gps_height` value we want to estimate are sometimes very far from the points used to train the algorithm, we nevertheless use the actual distance to estimate the missing values.
```python
# Training of the algorithm on the full available data set
regr_realdist.fit(X, y)

df_missing = df[df.gps_height.isnull()][["id", "latitude", "longitude"]]
X_missing = df_missing[["latitude", "longitude"]].to_numpy()
y_missing = regr_realdist.predict(X_missing)
df_missing["gps_height"] = y_missing

df_not_missing = df[~df.gps_height.isnull()]
df_imputation = pd.concat([df_not_missing, df_missing])[["id", "gps_height"]]

df_imputed = df.copy()
del df_imputed["gps_height"]
df_imputed = df_imputed.join(df_imputation.set_index("id"), on="id")
```