Project Predictive Analytics: New York City Taxi Ride Duration Prediction


Context


New York City taxi rides form the core of the traffic in the city of New York. The many rides taken every day by New Yorkers can give us a good picture of traffic times, road blockages, and so on. A typical taxi company faces the common problem of assigning cabs to passengers efficiently so that the service is hassle-free. One of the main issues is predicting the duration of the current ride, so that the company can predict when the cab will be free for the next trip. The data set contains information about taxi trips in New York City, including their duration. Different techniques will be implemented to get insights into the data and determine how the different variables relate to the trip duration.


Objective



Dataset


The trips table has the following fields

The following steps will be taken:

Installing the featuretools library

Note: If !pip install featuretools doesn't work, please install it from the Anaconda prompt by running the following command:

  conda install -c conda-forge featuretools==0.27.0

Importing libraries

Load the Datasets

Display first five rows

Display info of the dataset

Check the number of unique values in the dataset.

Question 1: Check summary statistics of the dataset (1 Mark)

The median and 75th percentile of the passenger count are 1 and 2, respectively. The average trip distance is 2.73 km, with a maximum of 502.8 km and a minimum of 0. The average trip duration is 797.7 seconds.
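As a rough sketch of how such summary statistics can be read off with pandas, the snippet below runs describe() on a tiny made-up table. The values are invented for illustration and are not taken from the actual dataset; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy stand-in for the trips table (values are invented for illustration).
trips = pd.DataFrame({
    "passenger_count": [1, 1, 2, 1, 5],
    "trip_distance": [0.0, 1.2, 2.5, 3.1, 10.4],
    "trip_duration": [0, 420, 780, 900, 2400],
})

# describe() reports count, mean, std, min, quartiles and max per column.
summary = trips.describe()
print(summary.loc[["mean", "min", "50%", "75%", "max"]])
```

The "50%" row is the median and the "75%" row is the 75th percentile, which is where statements like the ones above come from.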

Checking for the rows for which trip_distance is 0

Replacing the 0 values with the median of the trip distance
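A minimal sketch of this replacement on toy data is shown below. It assumes the median is computed over the non-zero distances only; the notebook may instead have used the median of the whole column, so treat that choice as an assumption.

```python
import pandas as pd

# Toy data: two trips have a recorded distance of 0.
trips = pd.DataFrame({"trip_distance": [0.0, 1.0, 2.0, 0.0, 4.0, 3.0]})

# Assumption: take the median over non-zero distances only.
nonzero = trips.loc[trips["trip_distance"] > 0, "trip_distance"]
median_distance = nonzero.median()

# Overwrite the zero entries with that median.
trips.loc[trips["trip_distance"] == 0, "trip_distance"] = median_distance
print(trips)
```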

Checking for the rows for which trip_duration is 0

Question 2: Univariate Analysis

Question 2.1: Build histograms for numerical columns (1 Mark)

Pickup latitude and dropoff latitude are nearly normally distributed, while pickup longitude and dropoff longitude are slightly right-skewed. Trip distance seems to have some outliers, which we can investigate with a boxplot. Trip duration is also right-skewed.

Clipping the outliers of trip distance to 50
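One way to do this clipping is with pandas' Series.clip, sketched here on invented values. Distances above 50 are capped at 50 and everything else is left unchanged.

```python
import pandas as pd

# Toy distances, including two extreme values.
trips = pd.DataFrame({"trip_distance": [0.8, 2.7, 49.9, 75.0, 502.8]})

# Cap every distance at 50; smaller values are untouched.
trips["trip_distance"] = trips["trip_distance"].clip(upper=50)
print(trips)
```

Clipping keeps the rows in the data, unlike dropping them, so the number of trips does not change.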

Question 2.2 Plotting countplot for Passenger_count (1 Mark)

Write your answers here:

Question 2.3 Plotting countplot for pickup_neighborhood and dropoff_neighborhood (2 Marks)

Most pickups and dropoffs are in areas AD, AA, A, and D, which implies these are the busiest areas in the city. Areas like AI, AE, and AQ see very few pickups and dropoffs.

Bivariate analysis

Plot a scatter plot for trip distance and trip duration

Step 2: Prepare the Data

Let's create entities and relationships. The three entities in this data are

This data has the following relationships

In <a href="https://www.featuretools.com/">featuretools</a> (an automated feature engineering software package), we specify the list of entities and relationships as follows:

Question 3: Define entities and relationships for the Deep Feature Synthesis (2 Marks)

Next, we specify the cutoff time for each instance of the target_entity, in this case trips. This timestamp represents the last time at which data can be used for calculating features by DFS. In this scenario, that is the pickup time, because we would like to predict the duration using only data available before the trip starts.

For the purposes of the case study, we choose to only select trips that started after January 12th, 2016.
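The cutoff-time table described above can be sketched in plain pandas as one row per trip with its id and pickup time. The toy rows and the id column name are assumptions for illustration, and "after January 12th" is interpreted here as strictly later than midnight at the start of that day.

```python
import pandas as pd

# Toy trips table (invented rows); "id" is an assumed name for the trip key.
trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pickup_datetime": pd.to_datetime([
        "2016-01-05 08:15",
        "2016-01-11 23:00",
        "2016-01-12 09:30",
        "2016-02-01 18:45",
    ]),
})

# The cutoff time of a trip is its pickup time: features for that trip
# may only use data observed up to this moment.
cutoff_time = trips[["id", "pickup_datetime"]]

# Keep only trips that started after the chosen date.
cutoff_time = cutoff_time[cutoff_time["pickup_datetime"] > pd.Timestamp("2016-01-12")]
print(cutoff_time)
```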

Step 3: Create baseline features using Deep Feature Synthesis

Instead of manually creating features, such as "month of pickup datetime", we can let DFS come up with them automatically. It does this by

Create transform features using transform primitives

As we described in the video, features fall into two major categories: transform and aggregate. In featuretools, we can create transform features by specifying transform primitives. Below we specify a transform primitive called weekend and here is what it does:

In this specific data, there are two datetime columns, pickup_datetime and dropoff_datetime. The tool automatically creates features using the primitive and these two columns, as shown below.
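To make the idea concrete, here is a plain-pandas sketch of what a weekend transform computes for each datetime column. This is not the featuretools API; it only mimics the kind of feature (and the IS_WEEKEND(...) style of feature name) that the primitive produces, on invented rows.

```python
import pandas as pd

# Toy trips: a Friday-night ride ending on Saturday, a Saturday ride, a Monday ride.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2016-01-15 23:30", "2016-01-16 10:00", "2016-01-18 08:00"]
    ),
    "dropoff_datetime": pd.to_datetime(
        ["2016-01-16 00:10", "2016-01-16 10:25", "2016-01-18 08:20"]
    ),
})

# dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend.
for col in ["pickup_datetime", "dropoff_datetime"]:
    trips[f"IS_WEEKEND({col})"] = trips[col].dt.dayofweek >= 5

print(trips)
```

One primitive applied to two datetime columns yields two features, which is why the number of generated features grows quickly as more primitives are added.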

Question 4: Creating a baseline model with only 1 transform primitive (10 Marks)

Question 4.1: Define the transform primitive for weekend and define features using DFS

If you're interested in DFS parameters such as ignore_variables, you can learn more about them in the featuretools documentation.

Here are the features created.

Now let's compute the features.

Question: 4.2 Compute features and define feature matrix

Build the Model

To build a model, we

Transforming the duration variable using sqrt and log
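A small sketch of the two transformations with NumPy, on invented durations in seconds. Both transforms compress the long right tail of the duration distribution, and both can be inverted to get predictions back in seconds.

```python
import numpy as np

# Toy trip durations in seconds (invented values).
duration = np.array([60.0, 400.0, 900.0, 3600.0])

sqrt_duration = np.sqrt(duration)  # square-root scale
log_duration = np.log(duration)    # natural-log scale (requires duration > 0)

# Inverting the transforms recovers the original seconds.
from_sqrt = sqrt_duration ** 2
from_log = np.exp(log_duration)
print(sqrt_duration, log_duration)
```

Note that the log transform is undefined for a duration of 0, which is one reason to inspect zero-duration rows before transforming.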

Splitting the data into train and test

Defining a function to check the performance of the model
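A minimal sketch of such a function is given below, computing the three metrics quoted in the results (R-squared, RMSE and MAE) with NumPy. The function name regression_metrics is hypothetical and the notebook's own helper may look different.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R-squared, RMSE and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
print(metrics)
```

Calling the same function on the train and test predictions side by side makes overfitting visible as a gap between the two sets of scores.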

Question 4.3 Build Linear regression using only weekend transform primitive

Check the performance of the model

The model gives an R-squared of only 0.56, with an RMSE of 6.56 and an MAE of ~5.02. The model is slightly overfitting.

Question 4.4 Building decision tree using only weekend transform primitive

Check the performance of the model

The model is overfitting heavily, with a train R2 of 0.92 against a test R2 of 0.6. This is common with decision trees. One solution is to prune the tree, so let's try pruning and see if the performance improves.

Question 4.5 Building Pruned decision tree using only weekend transform primitive
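The sketch below illustrates the effect of pruning with scikit-learn on synthetic data. It assumes pruning is done by limiting max_depth and min_samples_leaf (a form of pre-pruning); the notebook may have used other settings, such as cost-complexity pruning, and the data here is randomly generated rather than the taxi features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: the target grows with distance, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(400, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree can memorise the training data.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Assumed pruning settings: a shallow tree with a minimum leaf size.
pruned_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(
        name,
        "train R2:", round(model.score(X_train, y_train), 3),
        "test R2:", round(model.score(X_test, y_test), 3),
    )
```

The pruned tree gives up some training accuracy, and the quantity to watch is whether the gap between its train and test scores shrinks.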

Check the performance of the model

The pruned model performs better than both the baseline decision tree and the linear regression, with an R2 of ~0.70.

Question 4.6 Building Random Forest using only weekend transform primitive

Check the performance of the model

The score for the model with only 1 transform primitive is ~71%. This model performs a little better than the pruned decision tree model. The model is slightly overfitting.

Step 4: Adding more Transform Primitives and creating new model

Question 5: Create models with more transform primitives (10 Marks)

Question 5.1 Define more transform primitives and define features using DFS

Now let's compute the features.

Question: 5.2 Compute features and define feature matrix

Build the new models with more transform features

Question 5.3 Building Linear regression using more transform primitive

Check the performance of the model

The model gives an R-squared of 0.62, with an RMSE of 6.0 and an MAE of ~4.53. Model performance has improved over the previous linear regression model (R-squared of 0.56, RMSE of 6.56) by adding more transform primitives. The model is not overfitting and gives generalized results.

Question 5.4 Building Decision tree using more transform primitive

Check the performance of the model

The model is overfitting heavily, with a train R2 of 1 against a test R2 of 0.71. This is common with decision trees. One solution is to prune the tree, so let's try pruning and see if the performance improves.

Question 5.5 Building Pruned Decision tree using more transform primitive

Check the performance of the model

The model gives an R-squared of ~0.75, with an RMSE of 4.96 and an MAE of ~3.70. Model performance has improved by adding more transform features. The model is slightly overfitting.

Question 5.6 Building Random Forest using more transform primitive

Check the performance of the model

The score for the model with more transform primitives is ~75%.

Question: 5.7 Comment on how the modeling accuracy differs when including more transform features.

Compared to the previous model, the score has improved from ~71% to ~75%.

Step 5: Add Aggregation Primitives

Now let's add aggregation primitives. These primitives generate features for the parent entities pickup_neighborhoods and dropoff_neighborhoods and then add them to the trips entity, which is the entity for which we are trying to make predictions.
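The idea can be sketched in plain pandas: aggregate the child rows (trips) per parent (neighborhood), then join the aggregates back onto each trip. This is only an illustration on invented rows, not the featuretools API, and unlike DFS it ignores cutoff times, so every trip here sees aggregates computed over all trips. The feature names imitate the featuretools naming style.

```python
import pandas as pd

# Toy trips table (invented rows).
trips = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "pickup_neighborhood": ["AD", "AD", "AA", "AD", "AA"],
    "trip_distance": [1.0, 3.0, 2.0, 5.0, 6.0],
})

# Aggregate the trips per pickup neighborhood (the parent entity).
agg = (
    trips.groupby("pickup_neighborhood")["trip_distance"]
    .agg(["mean", "count"])
    .rename(columns={
        "mean": "pickup_neighborhoods.MEAN(trips.trip_distance)",
        "count": "pickup_neighborhoods.COUNT(trips)",
    })
    .reset_index()
)

# Bring the parent-level aggregates back onto each trip.
trips_with_agg = trips.merge(agg, on="pickup_neighborhood", how="left")
print(trips_with_agg)
```

Each aggregation primitive is applied per parent entity and per eligible column, which is why this step can add many features and lengthen training.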

Question 6: Create models with transform and aggregate primitives (10 Marks)

Question 6.1 Define more transform and aggregate primitives and define features using DFS

Question: 6.2 Compute features and define feature matrix

Build the new models with more transform and aggregate features

Question 6.3 Building Linear regression model with transform and aggregate primitive.

Check the performance of the model

The model gives an R-squared of 0.63, with an RMSE of 5.97 and an MAE of ~4.46. The model is not overfitting and gives generalized results.

Question 6.4 Building Decision tree with transform and aggregate primitive.

Check the performance of the model

The model is overfitting heavily, with a train R2 of 1 against a test R2 of 0.65. This is common with decision trees. One solution is to prune the tree, so let's try pruning and see if the performance improves.

Question 6.5 Building Pruned Decision tree with transform and aggregate primitive.

Check the performance of the model

The model gives an R-squared of ~0.75, with an RMSE of 4.97 and an MAE of ~3.7. Model performance has not improved by adding aggregate primitives. The model is overfitting and is not giving generalized results.

Question 6.6 Building Random Forest with transform and aggregate primitive.

Check the performance of the model

The model performance has not improved from ~0.75 with the addition of aggregation features.

Question 6.7 How do these aggregate transforms impact performance? How do they impact training time?

The modeling score has not improved much after adding aggregate primitives, and the training time increased significantly. This implies that adding more features is not always effective.

Based on the above 3 models, we can make predictions using model2, as it gives almost the same accuracy as model3 while taking much less time to train.

Question 7: What are some important features based on model2 and how can they affect the duration of the rides? (3 Marks)

Trip_Distance is the most important feature, which implies that the longer the trip distance, the longer the trip duration. The HOUR features of the pickup and dropoff times are also important, implying that trip duration is affected by the time of day at which the trip takes place. This makes sense, since some times of day have much heavier traffic. Features like dropoff_neighborhoods.longitude, pickup_neighborhoods.longitude, dropoff_neighborhoods.latitude, and pickup_neighborhoods.latitude indicate that trip duration is also affected by the pickup and dropoff locations.
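For reference, a ranking like this can be read from a fitted random forest's feature_importances_ attribute. The sketch below uses synthetic data in which distance is deliberately the dominant driver, so it only shows the mechanics and does not reproduce the notebook's results. The feature names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (invented): distance dominates, rush hour adds a little.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 20, n),
    "HOUR(pickup_datetime)": rng.integers(0, 24, n),
    "passenger_count": rng.integers(1, 6, n),
})
rush_hour = X["HOUR(pickup_datetime)"].isin([8, 9, 17, 18])
y = 120 * X["trip_distance"] + 300 * rush_hour + rng.normal(0, 60, n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances show how much the model relies on a feature, not a causal effect, so they are best read alongside domain reasoning like the interpretation above.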