Plotting Shapefile Data Using Geopandas, Bokeh and Streamlit in Python.

I was recently introduced to geospatial data in python. It’s represented in .shp files, in the same way any other form of data is represented in say .csv files. However, each line in a .shp file corresponds to either a polygon, a line, or a point. A polygon can represents certain shapes, so in the given context of maps and geospatial data, a polygon could act as a country, or perhaps an ocean and so on. Lines are used to represent boundaries, roads, railway lines etc. Points are used to represent cities, landmarks, features of interest etc. This was very much new to me, so I found it fascinating to see how any sort of map can be broken down into polygons, lines, and points then played around with using code. I was also introduced to streamlit, which provides, say, an alternative to Jupyter Notebooks but with more interaction and in my opinion, better visual appeal. I think one distinct advantage Jupyter Notebook has is compartmentalisation, and how good code and markdown look next to each other, whereas streamlit seems to be more visually appealing. However, one big advantage streamlit has is the fact that it is operated by the command line, making it much more efficient for a person like me who’s very much comfortable with typing out commands and running stuff, rather than dragging my pointer around to click objects.

I used the geopandas library for dealing with shapefile data. It’s incredibly efficient and sort of extends native pandas commands to shapefile data, making everything much easier to work with. One thing I didn’t like was how streamlit didn’t have inbuilt functionality to view GeoDataFrames, so essentially that means I have to output geospatial data using the st.write() method, and that just results in some ugly, green-colored output, very much unlike the clean, tabular output you get when you use st.write() for displaying dataframes. It’s also a bit surprising how st.dataframe() doesn’t extend to a GeoDataFrame, but eh, it works for now.

Bokeh is new for me, I decided to start out with making plots using inbuilt geopandas and matplotlib functionality rather than move straight to Bokeh. Hence, in this post, I’ll be going through how I made and annotated some maps using geopandas, then extended that to Bokeh to make my code much more efficient. A huge advantage Bokeh brings to the table is that it can be used to store plots, so no going back to earlier cells to find a plot. Just output it to an html file and write some code to save the contents of that html file, you’re then good to go.

The very first thing I did was create a very basic plot showing Germany filled with a blue color. Here’s a small piece of code that accomplishes that.

This simple takes the geodata for the entire world, in low resolution. I then select the data specific to Germany, plot it using geopandas, turn plot axes off just to make it look better visually, and output a plot to my streamlit notebook (do I call it that?). Here’s the result

As you can see, it’s a very simple plot showing the nation state of Germany. Now, I thought I’d extend this, make it look better, and annotate it with the names of some cities and the capital, Berlin. The very first thing to do here is to get some data for each city, their latitudes and longitudes, to be specific. You can either import that via a shapefile, which will have each location as a point. However, I manually inputted 6 cities and their data from an internet source to a dataframe, and then used that to annotate my figure, which I’ll be talking about now.

The dataframe at the top contains my city position data. Down below, I’m creating another GeoDataFrame that holds each cities data as a point object. You’ll notice that I’ve used the points_from_xy() method while creating the city dataframe. points_from_xy() wraps around the point() method. You can view it as equivalent to [point(x,y) for x, y in zip(df.Longitude, df.Latitude]. It works really well and removes the need to have a for loop, making, making my code much more efficient. I’ve then plotted the same map as above, except with a white fill and a black outline (better looking imo). After that, I’ve gone over each point using a for loop and added a label, which is the City name (stored as just “Name”). I’ve also increased the size of the marker for Berlin, given that its the capital. The last step is just adding a red marker to indicate each city’s position. Note that st.pyplot() is a streamlit method that outputs any figures we might have. Here is the output of the code above.

I think this looks much better.

Now, I decided to plot a map showing all the main railway lines in India on top of a blank map of India, and output this to a new .html page as a bokeh plot.

As you can see the, the code for this is very simple. I’ve firstly plotted the map of India using the get_path functionality in geopandas. Then, for the sake of visbility, axes lines have been turned off. Then, I’ve read the railway shapefiles, which consists entirely of line objects, and plotted it using the pandas_bokeh library, outputting to an html file. Here’s the result

I find working with geospatial data to be terribly useful, and very much exciting. I think that’s partly because I love data science; playing around with data and modelling it, and working with geospatial data opens up an entire realm of possibilities. I’d describe the experience of making my first shapefile plot as something akin to working for the first time with time series data. Much like time, it adds another dimension to what one can do. In the coming weeks, I’ll be blogging more often, hopefully once every week, on geospatial data specifically. The contents of this post are not even an introduction to what can be accomplished using geospatial data. Next point of exploration, for me, is going to be how to depict elevations, and use color density to indicate population density, average cost of living, gdp etc. One immediate application that comes to mind is making a map that uses color to reflect how much covid-19 has impacted a country, based on not just cumulative confirmed cases, but also factors like healthcare expenditure, economic downturn, unemployment etc. I think it’ll be interesting.

You can find the entire code here.