Background
Objective
Import Libraries
Import Data
Data Pre-processing
Map Ride Share Data
(Bonus): Additional Analysis
As a regular user of ride sharing services, I thought it would be interesting to dive into my own ride sharing behaviour and see if I could use Python to plot my travel history on a folium map.
Let's get started!
This document will outline the following:
By the end of the document, you'll be able to have a cool data visualization about your own ride sharing usage history, which should look something like:
from IPython.display import Image
Image("myridehistory.png")
First thing we need to do is import all of the libraries / modules we'll need to visualize our data on a map.
import pandas as pd
import numpy as np
import requests
from geopy.geocoders import Nominatim # get lat / long of an address
import folium # Mapping application
from folium import plugins # Used to plot routes on a map
import openrouteservice # Used to get lat / longs in between starting and ending lat / long
from openrouteservice import convert
import geopy.distance # Calculate distance between starting and ending points
%matplotlib inline
To get my ride sharing data, I used the following Chrome extension: https://ummjackson.github.io/uber-data-extractor/
This data extractor will scrape and export your ride sharing history into a csv.
# import csv into a dataframe
ride_data = pd.read_csv('trip-history-final.csv')
# information about my ride sharing dataset
ride_data.info()
pd.set_option('display.max_columns', 500)
# Let's look at the first 5 rows of data
ride_data.head()
What we need to do in order to be able to plot ride history:
Given that the majority of my rides are in Toronto, let's filter the ride sharing data set to just rides in Toronto.
# Filter data on city = Toronto and store it in a df
toronto_rides = ride_data[ride_data['city'] == 'Toronto'].copy()
# View the data to see that we correctly filtered it.
toronto_rides.head()
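Before filtering, it can help to confirm which city actually dominates the export. Here is a minimal sketch using only the standard library. It assumes the CSV has a `city` column, as used above; the sample rows and the `count_rides_by_city` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the exported CSV (only the `city` column matters here).
SAMPLE_CSV = """city,price
Toronto,CA$12.50
Toronto,CA$8.75
Montreal,CA$15.00
"""

def count_rides_by_city(csv_text):
    """Return a Counter mapping each city to its number of rides."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["city"] for row in reader)

city_counts = count_rides_by_city(SAMPLE_CSV)
print(city_counts.most_common())
```

On the real dataframe, `ride_data['city'].value_counts()` should give the same kind of summary.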
The first thing we need to do is clean our starting and ending address so that it is in the following format:
street + ', ' + city + ', ' + country
# Let's first split on ',' and create columns for the street, other, postal code and country parts of the address.
# We'll start with the starting address. expand=True returns one column per piece (this assumes each address has four comma-separated parts).
toronto_rides[['start_street','start_other','start_postal','start_country']] = toronto_rides['start_address'].str.split(',', n=3, expand=True)
# Now let's concatenate the relevant columns into a clean starting address we can use to send to geopy
toronto_rides['start_address_io'] = toronto_rides['start_street'].map(str) + ', ' + toronto_rides['city'] + ', ' + toronto_rides['start_country']
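To see what the split-and-concatenate step does to a single address, here is a small sketch in plain Python. The `clean_address` helper and the sample address are hypothetical, and the sketch assumes the raw address has four comma-separated parts (street, other, postal code, country), as the column names above suggest.

```python
def clean_address(raw_address, city):
    """Build 'street, city, country' from a raw comma-separated address."""
    # Split into at most four pieces and trim the whitespace left after each comma.
    parts = [part.strip() for part in raw_address.split(",", 3)]
    if len(parts) < 4:
        return None  # unexpected shape; leave it for manual inspection
    street, country = parts[0], parts[3]
    return f"{street}, {city}, {country}"

example = clean_address("123 Example St, Toronto, ON M4C 1A1, Canada", "Toronto")
print(example)
```

Returning None for addresses that do not match the expected shape makes them easy to spot and filter out before geocoding.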
Now let's clean the end address using the same approach.
# Next up, cleaning the ending address of the trip.
toronto_rides[['end_street','end_other','end_postal','end_country']] = toronto_rides['end_address'].str.split(',', n=3, expand=True)
# Now let's concatenate the relevant columns into a clean ending address we can use to send to geopy
toronto_rides['end_address_io'] = toronto_rides['end_street'].map(str) + ', ' + toronto_rides['city'] + ', ' + toronto_rides['end_country']
Let's take a look at the new table.
# Updated table with clean addresses - looks good.
toronto_rides.head()
Next let's create a function we can use to acquire lat / longs from geopy using both the cleaned starting and ending addresses.
def get_latlong(df, address):
    '''
    Background:
    A simple function to query and return lat / longs from a pre-cleaned list of addresses.
    This uses geopy. To install, run pip install geopy in your terminal.
    Input:
    1. df: The dataframe we wish to use
    2. address: The address column in the df used when fetching the lat / long
    Output:
    1. lat: a list of latitudes based on the addresses provided.
    2. long: a list of longitudes based on the addresses provided.
    '''
    # create empty lat / long lists to append our fetched data to.
    lat = []
    long = []
    # create the geocoder once and reuse it for every row
    geolocator = Nominatim(user_agent="my_ride_history_agent", timeout=100)
    # iterate through the dataframe
    for index, row in df.iterrows():
        # determine location based on address
        location = geolocator.geocode(row[address])
        # Some addresses may not be readable, so let's append None if geopy can't resolve them
        if location is not None:
            # append each lat and long to their respective lists.
            lat.append(location.latitude)
            long.append(location.longitude)
            #print(location.latitude)
            #print(location.longitude)
        else:
            # append None if geopy can't properly process. We could try and fix this later.
            lat.append(None)
            long.append(None)
    # Let's see what we've processed
    print(lat)
    print(long)
    # return lists of latitudes and longitudes
    return lat, long
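Ride histories often repeat the same addresses (home, work), so one optional refinement is to geocode each distinct address only once. The sketch below is an assumption-laden illustration, not part of the original notebook: `geocode_unique` and `fake_geocode` are hypothetical names, and the geocoder is passed in as a plain callable so the example runs offline.

```python
def geocode_unique(addresses, geocode_fn):
    """Geocode each distinct address once; return (lat, long) lists in input order.

    geocode_fn is a hypothetical callable that takes an address string and
    returns a (lat, long) tuple, or None when the address cannot be resolved.
    """
    cache = {}
    lat, long = [], []
    for address in addresses:
        # Only call the geocoder the first time we see an address.
        if address not in cache:
            cache[address] = geocode_fn(address)
        result = cache[address]
        lat.append(result[0] if result is not None else None)
        long.append(result[1] if result is not None else None)
    return lat, long

# Stand-in for a real geocoder (made-up coordinates) so the sketch runs offline.
calls = []
def fake_geocode(address):
    calls.append(address)
    known = {"1 Example St, Toronto, Canada": (43.65, -79.38)}
    return known.get(address)

lats, longs = geocode_unique(
    ["1 Example St, Toronto, Canada", "Nowhere", "1 Example St, Toronto, Canada"],
    fake_geocode,
)
print(lats, longs, len(calls))
```

With geopy, the callable could be a small wrapper around `geolocator.geocode` that returns `(location.latitude, location.longitude)` or None.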
Now that we have our function, let's use it to return lat / longs for our starting addresses.
# get latitudes and longitudes for starting addresses
start_lat,start_long = get_latlong(toronto_rides,'start_address_io')
Let's add these starting lat longs to our toronto_rides df
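# Assign the lists directly so values are matched by position. Wrapping them in a new DataFrame
# would align on index labels, which no longer run 0..n-1 after filtering, and leave misplaced or NaN values.
toronto_rides['start_lat'] = start_lat
toronto_rides['start_long'] = start_long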
# Let's see our lat / longs
toronto_rides.head()
Now let's repeat the above steps and get lat / longs for our ending addresses.
# get latitudes and longitudes for ending addresses
end_lat,end_long = get_latlong(toronto_rides,'end_address_io')
# Let's add the ending lat longs to our toronto_rides df
# (lists are assigned directly so values are matched by position, not by index label)
toronto_rides['end_lat'] = end_lat
toronto_rides['end_long'] = end_long
Let's view our updated toronto_rides df with starting and ending latitudes and longitudes.
toronto_rides.head()
In order to plot our trips on a map, every row needs a starting and ending latitude and longitude. Rows that are missing any of these will be removed from our df.
# remove rows where any starting or ending lat / long is missing
toronto_rides = toronto_rides.dropna(subset=['start_lat','start_long','end_lat','end_long'])
# I noticed that some rows had a NaN start_address_io or end_address_io but still received a lat / long. Let's remove those rows too.
toronto_rides = toronto_rides.dropna(subset=['start_address_io','end_address_io'])
toronto_rides.head()
toronto_rides.info()
It looks like I have 43 trips to work with and visualize.
def get_distance(df):
    # straight-line distance (in km) between the start and end point of each trip
    distance_travelled = []
    for index, row in df.iterrows():
        coords_1 = (row['start_lat'], row['start_long'])
        coords_2 = (row['end_lat'], row['end_long'])
        # geodesic replaces vincenty, which was removed in geopy 2.0
        distance_km = geopy.distance.geodesic(coords_1, coords_2).km
        distance_travelled.append(distance_km)
    return distance_travelled
distance_travelled = get_distance(toronto_rides)
len(distance_travelled)
# assign the list directly so distances are matched to rows by position rather than by index label
toronto_rides['distance'] = distance_travelled
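The distance computed here is the straight-line distance between the two points, not the distance driven. As a rough cross-check that needs no extra libraries, the haversine formula gives the great-circle distance on a sphere. This is a different and slightly less accurate technique than geopy's ellipsoidal calculation; the `haversine_km` helper and the mean Earth radius constant below are assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed spherical model)

def haversine_km(coord_1, coord_2):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    lat1, lon1 = map(math.radians, coord_1)
    lat2, lon2 = map(math.radians, coord_2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude along the equator is roughly 111 km.
one_degree = haversine_km((0.0, 0.0), (0.0, 1.0))
print(round(one_degree, 2))
```

For trips within a single city, the two methods should agree closely, so a large gap between them usually points to a bad coordinate.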
# drop the first 3 characters of the price (presumably the currency prefix), then convert to a number; unparseable values become NaN
toronto_rides['price_clean'] = pd.to_numeric(toronto_rides['price'].str[3:], errors='coerce')
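Slicing off the first three characters only works when every price has a prefix of exactly that length. As a hedged alternative, a regular expression can pull out the numeric part regardless of prefix. The `parse_price` helper and the sample price strings are hypothetical, since the exact price format in the export is not shown here.

```python
import re

def parse_price(raw_price):
    """Return the numeric part of a price string as a float, or None if absent."""
    if not isinstance(raw_price, str):
        return None  # e.g. NaN or None for rides without a price
    # Remove thousands separators, then take the first number in the string.
    match = re.search(r"\d+(?:\.\d+)?", raw_price.replace(",", ""))
    return float(match.group()) if match else None

print(parse_price("CA$12.50"))
print(parse_price("Cancelled"))
```

It could be applied to the column with `toronto_rides['price'].apply(parse_price)`.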
Ok, now on to the fun stuff. Let's start visualizing our ride history on a map.
Couple things we need to do:
In order to plot the paths, we'll need to query the openrouteservice api to return all of the lat / longs in between the starting and ending points. You'll need to register and generate an api key before you can use this yourself.
To do this efficiently, let's create a helper function. For more details about folium, you can visit this link: http://python-visualization.github.io/folium/quickstart.html
def generate_map(map_location, map_style, start_lat_col, start_long_col, start_color, end_lat_col, end_long_col, end_color):
    """
    Background:
    This function will return a folium map with starting and ending trip location markers.
    Note: it reads the trips from the global toronto_rides dataframe.
    Inputs:
    map_location: This is where you want to set the default location for the map. Format: [lat_value,long_value]
    map_style: The style of map you want to render. I am using "cartodbpositron" style.
    start_lat_col: Column where your trip starting latitude points are.
    start_long_col: Column where your trip starting longitude points are.
    start_color: The color of the starting circle you want to render on the folium map.
    end_lat_col: Column where your trip ending latitude points are.
    end_long_col: Column where your trip ending longitude points are.
    end_color: The color of the ending circle you want to render on the folium map.
    Outputs:
    folium_map: This is the folium map we created.
    """
    # generate a new map
    folium_map = folium.Map(location=map_location,
                            zoom_start=11,
                            tiles=map_style)
    # for each row in the data, add a circle marker
    for index, row in toronto_rides.iterrows():
        # add starting location markers to the map
        folium.CircleMarker(location=(row[start_lat_col],
                                      row[start_long_col]),
                            color=start_color,
                            radius=5,
                            weight=1,
                            fill=True).add_to(folium_map)
        # add end location markers to the map
        folium.CircleMarker(location=(row[end_lat_col],
                                      row[end_long_col]),
                            color=end_color,
                            radius=5,
                            weight=1,
                            fill=True).add_to(folium_map)
    return folium_map
# Let's add the starting and ending lat longs to the folium map using the function we just built.
generate_map([43.6813629, -79.315015],"cartodbpositron","start_lat","start_long",'#0A8A9F',"end_lat","end_long",'#f68e56')
Let's create a function which will access the openrouteservice api and return all of the latitudes and longitudes between our starting and ending trip points.
Important Note: The openrouteservice api exports are reversed (long, lat) vs. what folium uses (lat,long). We'll need to reverse the lat / longs before we can plot the coordinates on a folium map.
For more information about the openrouteservice api please visit:
https://github.com/GIScience/openrouteservice-py
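The coordinate flip described above is easy to get wrong, so here is a tiny standalone sketch of it. The `to_lat_long` helper and the sample coordinates are made up for illustration.

```python
def to_lat_long(long_lat_pairs):
    """Convert a sequence of (long, lat) pairs into a list of (lat, long) tuples."""
    return [(lat, long) for long, lat in long_lat_pairs]

# Made-up route in (long, lat) order, as a routing service would return it.
route_long_lat = [[-79.38, 43.65], [-79.39, 43.66]]
route_lat_long = to_lat_long(route_long_lat)
print(route_lat_long)
```

A quick sanity check for Toronto data: latitudes should be around 43 and longitudes around -79. If they appear the other way round, the route has not been reversed.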
def get_paths(df):
    """
    Background:
    This function will return all of the paths / routes in latitudes and longitudes in between our starting and ending trip points.
    Inputs:
    df: The dataframe you wish to pass in.
    Outputs:
    path_list: A list of lat long tuples for each trip.
    """
    path_list = []
    # Specify your personal API key. The client is created once and reused for every trip.
    client = openrouteservice.Client(key='{{INSERT_YOUR_KEY_HERE}}')
    for index, row in df.iterrows():
        # I included try / except as a precaution in case a route cannot be generated (for example, an extremely long path); those trips are skipped.
        # I noticed this exception when I accidentally generated a lat / long for no address. Be aware of this and remove such rows prior to using this function.
        try:
            # coordinates of the trips living within specific table columns, in (long, lat) order.
            coords = ((row['start_long'], row['start_lat']), (row['end_long'], row['end_lat']))
            geometry = client.directions(coords)['routes'][0]['geometry']
            decoded = convert.decode_polyline(geometry)
            # We need to reverse the long / lat output from results so that we can graph lat / long
            reverse = [(y, x) for x, y in decoded['coordinates']]
            # Append each route to the path_list list
            path_list.append(reverse)
            # confirmation of each route being processed. Feel free to comment out.
            print(index)
        except Exception as error:
            # report the skipped trip instead of failing silently
            print(index, 'skipped:', error)
    return path_list
# Let's store all of the path lat long data into routes. We'll pass this data into another function below.
routes = get_paths(toronto_rides)
len(routes)
It looks like the openrouteservice api returned 39 routes. We'll have to investigate why a few did not return.
toronto_rides.info()
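One way to investigate the missing routes is to record which trips failed and why, rather than discarding the error. The sketch below shows the idea in isolation. `collect_routes` and `fake_route` are hypothetical names, and the routing call is replaced by a stand-in function so the example runs without an API key.

```python
def collect_routes(trips, route_fn):
    """Return (routes, failures), both keyed by trip id.

    trips maps a trip id to a (start, end) pair of coordinates.
    route_fn is a hypothetical callable that returns a path or raises on failure.
    """
    routes, failures = {}, {}
    for trip_id, (start, end) in trips.items():
        try:
            routes[trip_id] = route_fn(start, end)
        except Exception as error:
            # Keep the reason so failed trips can be inspected afterwards.
            failures[trip_id] = repr(error)
    return routes, failures

# Stand-in for the routing API: fails when the start and end are the same point.
def fake_route(start, end):
    if start == end:
        raise ValueError("start and end are identical")
    return [start, end]

trips = {
    0: ((43.65, -79.38), (43.70, -79.40)),
    1: ((43.65, -79.38), (43.65, -79.38)),
}
routes_by_trip, failed_trips = collect_routes(trips, fake_route)
print(failed_trips)
```

Keying the results by trip id also keeps each route linked to its row, which a plain list of routes loses as soon as one trip is skipped.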
Nice! We now have all of the lat / long paths in between all of our starting and ending points. Let's create a function to plot these paths on a folium map.
def plot_paths(paths, map_data):
    """
    Background:
    This function will take all of the paths generated from get_paths(df) and add them to our folium map.
    Input:
    paths: Our list of paths we generated from get_paths(df)
    map_data: Our map we generated using the generate_map() function
    Output:
    map_data: Our map with all of our routes plotted!
    """
    # Loop through all of our paths and add each of them to our map.
    for path in paths:
        folium.PolyLine(
            path,
            weight=1,
            color='#0A8A9F'
        ).add_to(map_data)
    return map_data
Alright, we have all of our points and paths. Let's call a couple of functions and generate our final map!
map_data = generate_map([43.6813629, -79.315015],"cartodbpositron","start_lat","start_long",'#0A8A9F',"end_lat","end_long",'#f68e56')
plot_paths(routes, map_data)
Success! Now all of our routes are plotted on a map. \o/
As a bonus, let's take a look at the following:
For this analysis, we'll just look at toronto_rides in its current form as I'd like to compare cost to distance travelled.
toronto_rides['price_clean'].mean()
toronto_rides['price_clean'].plot(kind='hist')
It appears that some calculations of distance were NaN, so we'll need to filter these out.
toronto_analysis = toronto_rides[toronto_rides['distance']<1000].copy()
len(toronto_analysis)
It appears only 30 distances were calculated using geopy.
toronto_analysis['distance'].plot(kind='hist')
toronto_analysis['distance'].mean()
from numpy.polynomial.polynomial import polyfit
import matplotlib.pyplot as plt
# Sample data
x = toronto_analysis['price_clean']
y = toronto_analysis['distance']
# Fit with polyfit
b, m = polyfit(x, y, 1)
plt.plot(x, y, '.')
plt.plot(x, b + m * x, '-')
plt.title("Price vs. Distance")
plt.xlabel("price")
plt.ylabel("distance")
plt.show()
As distance increases, so does price, but keep in mind this is only among 30 trips. It would be much better to have a larger dataset. Looking at this was just for fun.
toronto_analysis['price_clean'].corr(toronto_analysis['distance'])
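By default, pandas' `corr` computes the Pearson correlation coefficient. To make that number less of a black box, here is a plain-Python sketch of the same calculation. The `pearson_r` helper and the sample prices and distances are made up for illustration and are not taken from the ride data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the shared 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up prices and distances, for illustration only.
sample_prices = [8.0, 12.0, 15.0, 22.0]
sample_distances = [2.1, 4.0, 4.8, 9.5]
print(pearson_r(sample_prices, sample_distances))
```

A value close to 1 means price and distance rise together almost linearly, while a value near 0 means there is little linear relationship.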