Manhattan Taxi Demand Prediction

Time Series Forecasting to predict the demand for yellow taxis in one of the busiest boroughs of NYC.

Sanjay Chouhan
11 min read · Oct 10, 2021

Please check out the code on GitHub.

Try https://nbviewer.org/ if you are not able to load the notebook on GitHub.

Table of contents

  1. Introduction
  2. Business Objectives and Constraints
  3. Data Acquisition
  4. Data Formatting
  5. Business Metrics
  6. Data Cleaning
  7. Data Overview
  8. Feature Extraction
  9. Data Pre-processing
  10. Training
    I. Linear Regression
    II. XGBoost Regressor
    III. LSTM (Long Short Term Memory)
  11. Test Data Generation and Pre-processing
  12. Testing
  13. Final Thoughts
  14. References

Introduction

The New York City Taxi and Limousine Commission (TLC)¹, created in 1971, is the agency responsible for licensing and regulating New York City's medallion (yellow) taxis and various other for-hire vehicles. The yellow taxis are the most iconic taxis of NYC and are widely recognized outside the USA too.

Yellow medallion taxis are traditionally hailed by signaling to a driver who is on duty and seeking a passenger (street hail), but now they may also be hailed using e-hail apps.

NYC has five boroughs: the Bronx, Brooklyn, Manhattan, Queens, and Staten Island. Yellow taxis are the only vehicles permitted to respond to a street hail from a passenger in all five boroughs.

Here, I will try out a few models to predict the demand for yellow taxis in Manhattan only. Manhattan is the busiest borough of NYC by a huge margin. I started out with the yellow taxi data for the whole of NYC, but later decided to build a model for Manhattan alone. I will explain why later in this blog.

Business Objectives and Constraints

  1. Interpretability is not required.
  2. We want latency to be less than 2 seconds.
  3. We want the predicted demand to be as close to the actual demand as possible.

Data Acquisition

We have collected the data from the TLC page on the official NYC site¹. There you will find data split by month and taxi type, going back to 2009.

I have downloaded data only for the years 2018 and 2019. You might wonder: why not the 2020 data? Because 2020 had the COVID pandemic, which disrupted everything. No machine learning model will work under such high uncertainty.

These kinds of events occur very rarely (like once in a hundred years). In statistics, there is a term for them: Black Swan events.

The data on the TLC page is very detailed. There are many columns, such as location ID, pickup time, drop-off time, passenger count, trip cost, speed, etc. But for our problem, we only need the location ID and the pickup time. So instead of downloading all the data, which would be huge for my mobile connection, I used Google Colab as an intermediary and downloaded only the relevant columns. 😎
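For illustration, here is a minimal pandas sketch of the idea. It assumes the yellow-taxi column names tpep_pickup_datetime and PULocationID and a placeholder file path; it is not the exact code I ran in Colab.

```python
import pandas as pd

# Hypothetical path (or URL) of one month of yellow-taxi data;
# substitute the actual file from the TLC trip record page.
SRC = "yellow_tripdata_2018-01.csv"

# Read only the two columns we actually need, which keeps the
# memory footprint (and the download) small.
cols = ["tpep_pickup_datetime", "PULocationID"]
jan_2018 = pd.read_csv(SRC, usecols=cols, parse_dates=["tpep_pickup_datetime"])

print(jan_2018.shape)
```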

Data Formatting

The data come in a separate CSV file for every month, so I combined them. I have used the data from January 2018 to April 2019 for training (training and cross-validation).

We have 132,757,101 pickups from 264 unique locations in this 16-month period. The TLC has divided the whole of NYC into 266 regions, with location IDs ranging from 1 to 266.

Using Google Maps, I checked the travel duration between a few random locations and their neighbouring locations, which turns out to be about 13 minutes on average. So I decided to predict the demand every 15 minutes, which means we divide a day into 96 parts.

Now we need to create the training data from the raw data. For each location, we count the number of pickups in each 15-minute time interval.

After formatting, the training data looks like this:

Formatted data.

The first timestamp, 1514782800, corresponds to 2018-01-01 00:00, the second timestamp, 1514783700, corresponds to 2018-01-01 00:15, and so on.

Let's see how the counts were calculated, taking the first row as an example. The location ID is 1 and the timestamp is 1514782800, so I counted the number of pickups from location ID 1 between 2018-01-01 00:00 and 2018-01-01 00:14.
Similarly, for the next row, I counted the number of pickups from location ID 1 between 2018-01-01 00:15 and 2018-01-01 00:29.

The location ID and the timestamp are the input variables, and the count is the output variable. So the task is: given a location ID and a time (a multiple of 15 minutes), predict the number of pickups in that 15-minute interval.

Formatting the data was a time-consuming process. However, I was able to use some neat hacks that sped up the process slightly.
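As a rough sketch of the binning step (not the exact code I used, and glossing over time zones), something like the following pandas pipeline, reusing the column names assumed above, would produce the formatted table:

```python
import pandas as pd

# `raw` is the concatenated Jan 2018 - Apr 2019 data with the two
# columns read in the acquisition step.
raw = jan_2018  # in practice: pd.concat of all 16 monthly frames

# Floor every pickup time to its 15-minute bucket, then count
# pickups per (location, bucket).
raw["time_bin"] = raw["tpep_pickup_datetime"].dt.floor("15min")
counts = (raw.groupby(["PULocationID", "time_bin"])
             .size()
             .rename("count"))

# Re-index every location onto the full 15-minute grid so that
# intervals with zero pickups appear explicitly as 0.
grid = pd.date_range("2018-01-01 00:00", "2019-04-30 23:45", freq="15min")
full_index = pd.MultiIndex.from_product(
    [raw["PULocationID"].unique(), grid],
    names=["PULocationID", "time_bin"])
counts = counts.reindex(full_index, fill_value=0).reset_index()

# Unix timestamps, as in the formatted table shown above.
counts["timestamp"] = ((counts["time_bin"] - pd.Timestamp("1970-01-01"))
                       // pd.Timedelta("1s"))
```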

Business Metrics

  1. We will use MAE (mean absolute error) as the primary performance measure.
  2. Along with MAE, we also need to check the average of the output. Why? Because MAE alone is not enough: an MAE of 10 is good if the average output is 100 or 500 (it is problem-specific), but it is not good if the average output is also 10.
  3. We will also track MAE%, which is MAE/average(y), a combination of the above two.
  4. RMSE is also a good measure, which we will track.
  5. I have also checked the R² score as a sanity check.

Note that we have not used MAPE, because many of the outputs are zero or very small, and MAPE blows up when many of the actual values are zero.
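For reference, a small helper along these lines, using scikit-learn, can compute all of the metrics tracked here (a sketch, not necessarily the exact implementation in the notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true, y_pred):
    """Print the metrics tracked in this project."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae_pct = mae / np.mean(y_true)      # MAE% = MAE / average(y)
    r2 = r2_score(y_true, y_pred)
    print(f"MAE       => {mae}")
    print(f"MAE%      => {mae_pct}")
    print(f"R squared => {r2}")
    print(f"RMSE      => {rmse}")
```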

Data Cleaning

Smoothing

The counts are very jagged. To make them more suitable for training, we have to smooth them a little.

For smoothing, we use the counts of five time intervals (t-2, t-1, t, t+1, t+2). The formula is: x_smooth[t] = 0.05*x[t-2] + 0.20*x[t-1] + 0.50*x[t] + 0.20*x[t+1] + 0.05*x[t+2].

Smoothing of counts.

The blue line represents the original counts for a day. The green line represents the smoothed counts. We will do the smoothing only for the training data.

For testing, I will not perform smoothing, just to get an idea of real-world performance.
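One possible implementation of this weighted smoothing per location, assuming the counts DataFrame built in the formatting step; how the edges (the first and last two intervals of each location) are handled here is my own assumption:

```python
import numpy as np
import pandas as pd

# Weights for t-2, t-1, t, t+1, t+2.
WEIGHTS = np.array([0.05, 0.20, 0.50, 0.20, 0.05])

def smooth(series: pd.Series) -> np.ndarray:
    """Centered weighted smoothing of one location's counts.
    The two values at each edge are kept unchanged."""
    x = series.to_numpy(dtype=float)
    out = np.convolve(x, WEIGHTS, mode="same")
    out[:2], out[-2:] = x[:2], x[-2:]
    return out

# `counts` is the formatted (location, 15-minute interval, count) frame,
# sorted by time within each location.
counts = counts.sort_values(["PULocationID", "time_bin"])
counts["count_smooth"] = (counts.groupby("PULocationID")["count"]
                                .transform(smooth))
```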

Only for Manhattan

We will use only the Manhattan data. Manhattan is by far the busiest borough, and its data is very different from the other areas, which have far fewer pickups.

Out of 132 million pickups, Manhattan has 120 million pickups. The other four boroughs have only 12 million pickups.

In a case like this, instead of building a single model, we can create separate models to get better performance. However, here I will create a model for Manhattan only.

In our training data, we have 68 locations that belong to Manhattan.

Data Overview

  • We have two features, location and timestamp.
  • The count is the output label.
  • We have 3,165,536 data points.
One day’s data.
A week's data.
One year’s data.
  • We can certainly see a daily pattern.
  • But I am surprised that we do not see any clear upward trend over the year.
  • We have 120,342,142.55 pickups in total. It's not a whole number because of the smoothing.
  • We have a minimum of 0 pickups and a maximum of 659.9 pickups.
PDF of Counts
  • Out of the ~3.17 million data points, approximately 9.50% have counts = 0 and approximately 15.55% have counts < 1.
  • 25% of the data have counts ≤ 3.6 and only 10% of the data have counts≥102.7.
  • The pickup counts are highly skewed.

Feature Extraction

At any given time, we have access to all previous data.
Based on the counts of previous time intervals, we can create some very good features.

  • t-1: pickup counts at last time interval.
  • t-2: pickup counts at second last time interval.
  • t-3: pickup counts at third last time interval.
  • t-4: pickup counts at fourth last time interval.
  • t-5: pickup counts at fifth last time interval.
  • dt0: pickup counts at same time on previous day.
  • dt-1: pickup counts at t-1 time on previous day.
  • dt-2: pickup counts at t-2 time on previous day.
  • dt+1: pickup counts at t+1 time on previous day.
  • dt+2: pickup counts at t+2 time on previous day.
  • tdiff1: which is equal to [t-1]-[t-2]
  • tdiff2: which is equal to [t-2]-[t-3]
  • tdiff3: which is equal to [t-3]-[t-4]
  • tdiff4: which is equal to [t-4]-[t-5]
  • ty1: which is equal to [t-1]-[dt-1]
  • ty2: which is equal to [t-2]-[dt-2]
  • avg_y: which is equal to ([dt-2]+[dt-1]+[dt0]+[dt+1]+[dt+2])/5
  • simple_MA: which is equal to ([t-5] + [t-4] + [t-3] + [t-2] + [t-1]) / 5
  • weighted_MA: which is equal to (5*[t-1] + 4*[t-2] + 3*[t-3] + 2*[t-4] + 1*[t-5]) / 15
  • exponential_MA: which is equal to alpha * { [t-1] + (1-alpha)*[t-2] + ((1-alpha)²)*[t-3] + ((1-alpha)³)*[t-4] + ((1-alpha)⁴)*[t-5] } where alpha = 0.333
  • week_day: Between 1 and 7 where it is 1 for Monday and 7 for Sunday.

We have created 21 new features. So now we have a total of 23 features.
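To give a flavour of how these can be computed, here is a partial sketch that builds a handful of the lag features with pandas shift, assuming the counts DataFrame from the formatting and smoothing steps (the column names are assumptions):

```python
import pandas as pd

DAY = 96  # number of 15-minute intervals in a day

def add_lag_features(g: pd.DataFrame) -> pd.DataFrame:
    """Build a few of the features listed above for one location.
    `g` must be sorted by time; training uses the smoothed counts."""
    g = g.copy()
    c = g["count_smooth"]
    for k in range(1, 6):                     # t-1 ... t-5
        g[f"t-{k}"] = c.shift(k)
    for k in (-2, -1, 0, 1, 2):               # dt-2 ... dt+2 (previous day)
        name = "dt0" if k == 0 else f"dt{k:+d}"
        g[name] = c.shift(DAY - k)
    g["tdiff1"] = g["t-1"] - g["t-2"]
    g["simple_MA"] = (g["t-1"] + g["t-2"] + g["t-3"] + g["t-4"] + g["t-5"]) / 5
    g["week_day"] = g["time_bin"].dt.dayofweek + 1   # 1 = Monday ... 7 = Sunday
    return g

features = (counts.groupby("PULocationID", group_keys=False)
                  .apply(add_lag_features))
```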

Data Pre-processing

First of all, we need to remove the data points for which some of the features cannot be calculated. That means removing the first 98 rows for each location, since dt-2 looks back 96 + 2 = 98 intervals.

Then we need to create the train and cross-validation data. Since the data has a temporal nature, we can't do a random split; we need to divide it based on time.

We have used the whole 2018’s data for training. And the first four months of 2019 as cross-validation data.

Note that I have not performed any normalization or standardization operation.
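A minimal sketch of this time-based split, assuming the feature DataFrame built above (the exact column names are assumptions):

```python
# Drop the first rows of each location, where the lag features are NaN.
features = features.dropna()

# Time-based split: all of 2018 for training, Jan-Apr 2019 for cross-validation.
train = features[features["time_bin"] < "2019-01-01"]
cv    = features[features["time_bin"] >= "2019-01-01"]

target = "count_smooth"
feature_cols = [c for c in features.columns
                if c not in (target, "count", "time_bin")]
X_train, y_train = train[feature_cols], train[target]
X_cv,    y_cv    = cv[feature_cols],    cv[target]
```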

Training

Linear Regression

I trained a simple Linear Regression model without any regularization or hyper-param tuning.
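Something along these lines, with scikit-learn and the report() helper sketched in the metrics section:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

# Evaluate on the cross-validation data.
report(y_cv, lr.predict(X_cv))
```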

Performance:

Linear Regression performance.

The MAPE on the whole cross-validation data is absurd; it's in the trillions.
But if we calculate MAPE only for the non-zero outputs, it's 0.1365. That's why we cannot rely on MAPE when the actual outputs are zero or very small.

The MAE is 2.23. To judge whether that is good, we need to know the average actual output. The average actual CV output is 34.75, which means the MAE is sensible. The MAE% is 6.60%, which is good.

RMSE also looks good. And the R² is very close to 1.

Feature Importance:

Feature Importance for Linear Regression.

These are the top 10 important features for the Linear Regression model in descending order. All of them seem very intuitive.

XGBoost Regressor

To tune the XGBoost Regressor, I have used hyperopt, which uses a Bayesian optimisation technique.

After tuning, the best hyper-parameters were {'learning_rate': 0.09, 'max_depth': 8, 'n_estimators': 100, 'reg_lambda': 1.0}, which reduce the MAE to 2.15.
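Here is a sketch of how such a hyperopt/TPE search might be wired up; the search space and number of evaluations are my own guesses, not necessarily what I actually ran:

```python
from hyperopt import fmin, tpe, hp, space_eval, STATUS_OK
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Search space (an assumption for illustration).
space = {
    "learning_rate": hp.quniform("learning_rate", 0.01, 0.30, 0.01),
    "max_depth":     hp.choice("max_depth", [4, 6, 8, 10]),
    "n_estimators":  hp.choice("n_estimators", [100, 200, 400]),
    "reg_lambda":    hp.uniform("reg_lambda", 0.0, 2.0),
}

def objective(params):
    # Fit on the training split, score MAE on the cross-validation split.
    model = XGBRegressor(n_jobs=-1, **params)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_cv, model.predict(X_cv))
    return {"loss": mae, "status": STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(space_eval(space, best))
```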

Performance:

XGBoost Regressor performance.

The performance is very similar to the previous model. There isn’t any significant improvement.

Feature Importance:

Feature Importance for XGBoost Regressor

These are the top 10 important features for the XGBoost Regressor model in descending order.

An interesting thing to note here is that t-5 is getting more importance than t-2, t-3, and t-4.

The location feature appearing in the top ten was also a surprise to me, considering that we have more informative features. But since it is there, it means the counts do differ based on the location.

LSTM (Long Short Term Memory)

When it comes to sequential data, whether text or time series, the LSTM has been the default choice.

The LSTM is a deep learning model that improves on the traditional RNN (Recurrent Neural Network). The RNN suffers from a short-term memory problem, which the LSTM solves.

The LSTM will not use the traditional features that we have been using till now.

For the LSTM, we will provide the counts of the last 100 time intervals as the input sequence, and hopefully the LSTM will recognize the pattern. So we need to create new data.
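A simple sliding-window routine like the following (a sketch, using the 100-interval window mentioned above) can build those sequences per location:

```python
import numpy as np

SEQ_LEN = 100  # the last 100 intervals form the input sequence

def make_sequences(series: np.ndarray, seq_len: int = SEQ_LEN):
    """Turn one location's count series into (X, y) pairs, where each X
    is the previous `seq_len` counts and y is the next count."""
    X, y = [], []
    for i in range(seq_len, len(series)):
        X.append(series[i - seq_len:i])
        y.append(series[i])
    # Shape (n, seq_len, 1), as expected by Keras LSTM layers.
    return np.array(X)[..., np.newaxis], np.array(y)
```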

The structure of the LSTM-based deep learning model looks like this:

LSTM based model.

We have two LSTM layers, one after another. Each has 64 ReLU activation units. The last layer is a dense layer with only one ReLU activation unit.

I have used Adam optimizer which will minimize the MAE loss.

I have also added an early stopping callback which will monitor the validation loss and will stop the training if it’s not decreasing.
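Putting the pieces described above together, the model might be defined in Keras roughly as follows; the patience, batch size, and the X_seq_*/y_seq_* variable names are assumptions:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Two stacked LSTM layers with 64 ReLU units each, then a single ReLU output.
model = Sequential([
    LSTM(64, activation="relu", return_sequences=True, input_shape=(100, 1)),
    LSTM(64, activation="relu"),
    Dense(1, activation="relu"),
])
model.compile(optimizer="adam", loss="mae")

early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)  # patience is a guess

# X_seq_train / y_seq_train etc. are the sequence datasets built with
# make_sequences() above (placeholder names).
model.fit(X_seq_train, y_seq_train,
          validation_data=(X_seq_cv, y_seq_cv),
          epochs=50, batch_size=512,        # batch size is an assumption
          callbacks=[early_stop])
```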

After 9 epochs we converged to a sensible value.

Training.

Performance:

Performance of LSTM based model.

The performance is worse than the other two models. We might be able to improve with more fine-tuning or increasing the sequence length.

Test Data Generation and Pre-processing

So far, we have used only the training data. For testing, we need to perform all the steps we did earlier on the raw test data as well.

The raw test data covers the last 8 months of 2019.

In this section, I have performed all the data formatting, feature extraction, and featurization steps.

Testing

Keep in mind that we have not done smoothing for the test data. So the performance on test data will certainly be worse than before.

The average of the output, avg(y_test), is 31.25.

---- Performance of Linear Regression ----
MAE       => 7.227731813183053
MAE%      => 0.23128834949868637
R squared => 0.9094240989153082
RMSE      => 11.84154255439212

---- Performance of XGBoost Regressor ----
MAE       => 6.574487879610602
MAE%      => 0.21038445943729467
R squared => 0.9225697752186679
RMSE      => 10.948565185380145

---- Performance of LSTM ----
MAE       => 6.34815704342306
MAE%      => 0.2033560382199119
R squared => 0.92908681177385
RMSE      => 10.465103391641659

The performance is significantly worse than before. But surprisingly, the LSTM is performing better than the other two models. This suggests the LSTM-based model is more robust and generalizes better than the others.

Final Thoughts

We could tweak things further to get better performance: maybe try training without smoothing, come up with some new features, or try stacking different models.

A few interesting features that we could try are "is a holiday or special day", "weather type", "season", "day or night", etc.

Note that I have not used classical time series forecasting models like Holt-Winters or ARIMA, because on their own they would not perform as well as the current models. However, we could certainly benefit from them in a stacking ensemble.

References

  1. NYC Taxi and Limousine Commission (TLC) [https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page]
  2. Applied Roots [https://www.appliedroots.com/]

Thanks for reading the blog! You can reach me through my LinkedIn account and can view my other works on my personal website.

Also 👏 if you liked it. And keep reading, keep learning.
