# Installing the featuretools library

!pip install featuretools==0.27.0

Collecting featuretools==0.27.0
  Downloading featuretools-0.27.0-py3-none-any.whl (327 kB)
Requirement already satisfied: cloudpickle>=0.4.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (2.0.0)
Requirement already satisfied: dask[dataframe]>=2.12.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (2021.10.0)
Requirement already satisfied: distributed>=2.12.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (2021.10.0)
Requirement already satisfied: numpy>=1.16.6 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (1.20.3)
Requirement already satisfied: pandas<2.0.0,>=1.2.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (1.3.4)
Requirement already satisfied: scipy>=1.3.2 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (1.7.1)
Requirement already satisfied: psutil>=5.6.6 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (5.8.0)
Requirement already satisfied: click>=7.0.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (8.0.3)
Requirement already satisfied: pyyaml>=5.4 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (6.0)
Requirement already satisfied: tqdm>=4.32.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from featuretools==0.27.0) (4.62.3)
Requirement already satisfied: colorama in c:\users\dkhawaja\anaconda3\lib\site-packages (from click>=7.0.0->featuretools==0.27.0) (0.4.4)
Requirement already satisfied: fsspec>=0.6.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from dask[dataframe]>=2.12.0->featuretools==0.27.0) (2021.10.1)
Requirement already satisfied: packaging>=20.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from dask[dataframe]>=2.12.0->featuretools==0.27.0) (21.0)
Requirement already satisfied: partd>=0.3.10 in c:\users\dkhawaja\anaconda3\lib\site-packages (from dask[dataframe]>=2.12.0->featuretools==0.27.0) (1.2.0)
Requirement already satisfied: toolz>=0.8.2 in c:\users\dkhawaja\anaconda3\lib\site-packages (from dask[dataframe]>=2.12.0->featuretools==0.27.0) (0.11.1)
Requirement already satisfied: jinja2 in c:\users\dkhawaja\anaconda3\lib\site-packages (from distributed>=2.12.0->featuretools==0.27.0) (2.11.3)
Requirement already satisfied: msgpack>=0.6.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from distributed>=2.12.0->featuretools==0.27.0) (1.0.2)
Requirement already satisfied: setuptools in c:\users\dkhawaja\anaconda3\lib\site-packages (from distributed>=2.12.0->featuretools==0.27.0) (58.0.4)
Requirement already satisfied: tornado>=6.0.3 in c:\users\dkhawaja\anaconda3\lib\site-packages (from distributed>=2.12.0->featuretools==0.27.0) (6.1)
Requirement already satisfied: tblib>=1.6.0 in c:\users\dkhawaja\anaconda3\lib\site-packages (from distributed>=2.12.0->featuretools==0.27.0) (1.7.0)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in c:\users\dkhawaja\anaconda3\lib\site-packages (from distributed>=2.12.0->featuretools==0.27.0) (2.4.0)
Requirement already satisfied: zict>=0.1.3 in c:\users\dkhawaja\anaconda3\lib\site-packages (from distributed>=2.12.0->featuretools==0.27.0) (2.0.0)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\dkhawaja\anaconda3\lib\site-packages (from packaging>=20.0->dask[dataframe]>=2.12.0->featuretools==0.27.0) (3.0.4)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\dkhawaja\anaconda3\lib\site-packages (from pandas<2.0.0,>=1.2.0->featuretools==0.27.0) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in c:\users\dkhawaja\anaconda3\lib\site-packages (from pandas<2.0.0,>=1.2.0->featuretools==0.27.0) (2021.3)
Requirement already satisfied: locket in c:\users\dkhawaja\anaconda3\lib\site-packages\locket-0.2.1-py3.9.egg (from partd>=0.3.10->dask[dataframe]>=2.12.0->featuretools==0.27.0) (0.2.1)
Requirement already satisfied: six>=1.5 in c:\users\dkhawaja\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas<2.0.0,>=1.2.0->featuretools==0.27.0) (1.16.0)
Requirement already satisfied: heapdict in c:\users\dkhawaja\anaconda3\lib\site-packages (from zict>=0.1.3->distributed>=2.12.0->featuretools==0.27.0) (1.0.1)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\dkhawaja\anaconda3\lib\site-packages (from jinja2->distributed>=2.12.0->featuretools==0.27.0) (1.1.1)
Installing collected packages: featuretools
Successfully installed featuretools-0.27.0


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Featuretools for feature engineering
import featuretools as ft

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Importing the R-squared metric and the regression models used to make predictions
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

#importing primitives
from featuretools.primitives import (Minute, Hour, Day, Month,
                                     Weekday, IsWeekend, Count, Sum, Mean, Median, Std, Min, Max)

print(ft.__version__)
%load_ext autoreload
%autoreload 2

0.27.0


# set global random seed
np.random.seed(40)

# To load the dataset
def load_nyc_taxi_data():
    trips = pd.read_csv('trips.csv',
                        parse_dates=["pickup_datetime","dropoff_datetime"],
                        dtype={'vendor_id':"category",'passenger_count':'int64'},
                        encoding='utf-8')
    trips["payment_type"] = trips["payment_type"].apply(str)
    trips = trips.dropna(axis=0, how='any', subset=['trip_duration'])

    pickup_neighborhoods = pd.read_csv("pickup_neighborhoods.csv", encoding='utf-8')
    dropoff_neighborhoods = pd.read_csv("dropoff_neighborhoods.csv", encoding='utf-8')

    return trips, pickup_neighborhoods, dropoff_neighborhoods

# To preview the n rows that have the fewest missing values
def preview(df, n=5):
    """return n rows that have fewest number of nulls"""
    order = df.isnull().sum(axis=1).sort_values().head(n).index
    return df.loc[order]
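
The `preview` helper ranks rows by how many nulls they contain rather than simply taking the top of the frame. A minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

def preview(df, n=5):
    """Return the n rows that have the fewest nulls."""
    order = df.isnull().sum(axis=1).sort_values().head(n).index
    return df.loc[order]

# Hypothetical toy data: row 0 has two nulls, row 1 has none, row 2 has one
toy = pd.DataFrame({"a": [np.nan, 1.0, 2.0],
                    "b": [np.nan, 5.0, np.nan],
                    "c": [7, 8, 9]})

top2 = preview(toy, n=2)
print(top2.index.tolist())  # rows ordered from most complete to least complete
```

When no row has a null, every row ties at zero, so the order of the returned rows depends on how the sort resolves ties.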



#to compute features using automated feature engineering. 
def compute_features(features, cutoff_time):
    # shuffle so the encoded features are not all grouped at the front or back

    np.random.shuffle(features)
    feature_matrix = ft.calculate_feature_matrix(features,
                                                 cutoff_time=cutoff_time,
                                                 approximate='36d',
                                                 verbose=True)
    print("Finishing computing...")
    feature_matrix, features = ft.encode_features(feature_matrix, features,
                                                  to_encode=["pickup_neighborhood", "dropoff_neighborhood"],
                                                  include_unknown=False)
    return feature_matrix


#to generate train and test dataset
def get_train_test_fm(feature_matrix, percentage):
    nrows = feature_matrix.shape[0]
    head = int(nrows * percentage)
    tail = nrows-head
    X_train = feature_matrix.head(head)
    y_train = X_train['trip_duration']
    X_train = X_train.drop(['trip_duration'], axis=1)
    imp = SimpleImputer()
    X_train = imp.fit_transform(X_train)
    X_test = feature_matrix.tail(tail)
    y_test = X_test['trip_duration']
    X_test = X_test.drop(['trip_duration'], axis=1)
    X_test = imp.transform(X_test)

    return (X_train, y_train, X_test,y_test)
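
`get_train_test_fm` splits by row position, not at random, and fits the imputer on the training rows only. The sketch below replays those steps on a hypothetical four-row matrix (the `toy_fm` values are invented) to show that a missing value is filled with the training mean:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy feature matrix, assumed to be ordered by time already
toy_fm = pd.DataFrame({
    "trip_distance": [1.0, 2.0, np.nan, 4.0],
    "trip_duration": [300.0, 600.0, 900.0, 1200.0],
})

nrows = toy_fm.shape[0]
head = int(nrows * 0.75)                 # first 75% of rows become training data
train = toy_fm.head(head)
test = toy_fm.tail(nrows - head)

imp = SimpleImputer()                    # default strategy fills with the column mean
X_train = imp.fit_transform(train.drop(columns="trip_duration"))
X_test = imp.transform(test.drop(columns="trip_duration"))

print(X_train.ravel())  # the NaN is replaced by the mean of the training values
print(X_test.ravel())
```

Because the test rows are transformed with statistics learned from the training rows, nothing from the later trips leaks into the imputation.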



#to see the feature importance of variables in the final model
def feature_importances(model, feature_names, n=5):
    importances = model.feature_importances_
    zipped = sorted(zip(feature_names, importances), key=lambda x: -x[1])
    for i, f in enumerate(zipped[:n]):
        print("%d: Feature: %s, %.3f" % (i+1, f[0], f[1]))
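
As a quick check of how the ranking behaves, the same sort-and-print logic can be run on synthetic data where only one column drives the target. The data, the feature names, and the model settings below are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
# Hypothetical target: depends almost entirely on the first column
y = 10 * X[:, 0] + 0.1 * rng.rand(200)
names = ["signal", "noise_1", "noise_2"]

model = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
model.fit(X, y)

# Same ranking logic as feature_importances(): sort by importance, descending
ranked = sorted(zip(names, model.feature_importances_), key=lambda x: -x[1])
for i, (name, score) in enumerate(ranked, start=1):
    print("%d: Feature: %s, %.3f" % (i, name, score))
```

The importances of a fitted scikit-learn forest sum to 1, so each value can be read as a share of the total.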


trips, pickup_neighborhoods, dropoff_neighborhoods = load_nyc_taxi_data()
preview(trips, 10)


trips.head()


#checking the info of the dataset
trips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 974409 entries, 0 to 974408
Data columns (total 14 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   id                    974409 non-null  int64         
 1   vendor_id             974409 non-null  category      
 2   pickup_datetime       974409 non-null  datetime64[ns]
 3   dropoff_datetime      974409 non-null  datetime64[ns]
 4   passenger_count       974409 non-null  int64         
 5   trip_distance         974409 non-null  float64       
 6   pickup_longitude      974409 non-null  float64       
 7   pickup_latitude       974409 non-null  float64       
 8   dropoff_longitude     974409 non-null  float64       
 9   dropoff_latitude      974409 non-null  float64       
 10  payment_type          974409 non-null  object        
 11  trip_duration         974409 non-null  float64       
 12  pickup_neighborhood   974409 non-null  object        
 13  dropoff_neighborhood  974409 non-null  object        
dtypes: category(1), datetime64[ns](2), float64(6), int64(2), object(3)
memory usage: 137.3+ MB


# Check the number of unique values in each column
trips.nunique()

id                      974409
vendor_id                    2
pickup_datetime         939015
dropoff_datetime        938873
passenger_count              8
trip_distance             2503
pickup_longitude         20222
pickup_latitude          40692
dropoff_longitude        26127
dropoff_latitude         50077
payment_type                 4
trip_duration             3607
pickup_neighborhood         49
dropoff_neighborhood        49
dtype: int64


#checking the descriptive stats of the data
trips.describe()


#Checking the rows where trip distance is 0
trips[trips['trip_distance']==0]


trips['trip_distance']=trips['trip_distance'].replace(0,trips['trip_distance'].median())
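
Note that the median used for the replacement is taken over all rows, including the zeros being replaced. A small sketch with made-up distances shows the difference between that and a median over valid rows only (the alternative is shown for comparison, not as what the notebook does):

```python
import pandas as pd

# Hypothetical toy distances; two trips were recorded with distance 0
dist = pd.Series([0.0, 1.0, 2.0, 0.0, 3.0])

median_all = dist.median()                  # zeros included, as in the cell above
median_nonzero = dist[dist > 0].median()    # alternative: valid rows only

cleaned = dist.replace(0, median_all)
print(median_all, median_nonzero)
print(cleaned.tolist())
```

With only a small share of zero rows the two medians are likely to be close, but they are not the same quantity.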


trips[trips['trip_distance']==0].count()

id                      0
vendor_id               0
pickup_datetime         0
dropoff_datetime        0
passenger_count         0
trip_distance           0
pickup_longitude        0
pickup_latitude         0
dropoff_longitude       0
dropoff_latitude        0
payment_type            0
trip_duration           0
pickup_neighborhood     0
dropoff_neighborhood    0
dtype: int64


trips[trips['trip_duration']==0].head()


trips['trip_duration']=trips['trip_duration'].replace(0,trips['trip_duration'].median())


trips[trips['trip_duration']==0].count()

id                      0
vendor_id               0
pickup_datetime         0
dropoff_datetime        0
passenger_count         0
trip_distance           0
pickup_longitude        0
pickup_latitude         0
dropoff_longitude       0
dropoff_latitude        0
payment_type            0
trip_duration           0
pickup_neighborhood     0
dropoff_neighborhood    0
dtype: int64


trips.hist(figsize=(12,12))
plt.show()


sns.boxplot(x=trips['trip_distance'])
plt.show()


trips[trips['trip_distance']>100]


trips['trip_distance']=trips['trip_distance'].clip(trips['trip_distance'].min(),50)
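
The `clip` call caps extreme distances at 50 instead of dropping those rows. Since the lower bound is the column's own minimum, only the upper bound has any effect. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical toy distances, including one implausible outlier
dist = pd.Series([0.3, 2.5, 48.0, 502.8])

capped = dist.clip(lower=dist.min(), upper=50)
print(capped.tolist())  # only the value above 50 is changed

# Passing just the upper bound gives the same result
same = dist.clip(upper=50)
```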


sns.boxplot(x=trips['trip_distance'])
plt.show()


plt.figure(figsize=(20,5))
sns.countplot(x=trips.passenger_count)
plt.show()


trips.passenger_count.value_counts(normalize=True)

1    0.709334
2    0.143419
5    0.053409
3    0.041140
6    0.033334
4    0.019338
0    0.000025
9    0.000002
Name: passenger_count, dtype: float64


trips.pickup_neighborhood.value_counts().sort_values(ascending=False).plot(kind='bar' ,figsize=(20,8))

<AxesSubplot:>


trips.dropoff_neighborhood.value_counts().sort_values(ascending=False).plot(kind='bar' ,figsize=(20,8))

<AxesSubplot:>


pickup_neighborhoods.head()


sns.scatterplot(x=trips['trip_distance'], y=trips['trip_duration'])

<AxesSubplot:xlabel='trip_distance', ylabel='trip_duration'>


sns.countplot(x=trips['passenger_count'], hue=trips['payment_type'])

<AxesSubplot:xlabel='passenger_count', ylabel='count'>


entities = { 
            "trips": (trips, "id", 'pickup_datetime' ),
            "pickup_neighborhoods": (pickup_neighborhoods, "neighborhood_id"),
            "dropoff_neighborhoods": (dropoff_neighborhoods, "neighborhood_id")
           }

relationships = [("pickup_neighborhoods", "neighborhood_id", "trips", "pickup_neighborhood"),
                 ("dropoff_neighborhoods", "neighborhood_id", "trips", "dropoff_neighborhood")]


cutoff_time = trips[['id', 'pickup_datetime']]
cutoff_time = cutoff_time[cutoff_time['pickup_datetime'] > "2016-01-12"]
preview(cutoff_time, 10)
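
The cutoff table keeps one `(id, pickup_datetime)` pair per trip, and the filter compares the datetime column against a date string. The sketch below, on three invented timestamps, shows where that boundary falls. The string is interpreted as midnight, and the comparison is strict:

```python
import pandas as pd

# Hypothetical trips on either side of the 2016-01-12 boundary
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "pickup_datetime": pd.to_datetime([
        "2016-01-11 23:59:59",
        "2016-01-12 00:00:00",
        "2016-01-12 00:00:01",
    ]),
})

cutoff = toy[["id", "pickup_datetime"]]
cutoff = cutoff[cutoff["pickup_datetime"] > "2016-01-12"]
print(cutoff["id"].tolist())  # a trip at exactly midnight is excluded
```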


trans_primitives = [IsWeekend]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)


print ("Number of features: %d" % len(features))
features

Number of features: 13

[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>]


def compute_features(features, cutoff_time):
    # shuffle so the encoded features are not all grouped at the front or back

    np.random.shuffle(features)
    feature_matrix = ft.calculate_feature_matrix(features,
                                                 cutoff_time=cutoff_time,
                                                 approximate='36d',
                                                 verbose=True,entities=entities, relationships=relationships)
    print("Finishing computing...")
    feature_matrix, features = ft.encode_features(feature_matrix, features,
                                                  to_encode=["pickup_neighborhood", "dropoff_neighborhood"],
                                                  include_unknown=False)
    return feature_matrix
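
`ft.encode_features` turns the two neighborhood columns into indicator columns. The plain-pandas sketch below is an approximation of that idea, not the library's implementation: it keeps the most frequent values of a column and emits one boolean column per kept value. The helper name `one_hot_top_n`, the column naming, and the toy data are assumptions.

```python
import pandas as pd

def one_hot_top_n(series, top_n):
    """One-hot encode only the top_n most frequent values of a column."""
    top = series.value_counts().head(top_n).index
    return pd.DataFrame({f"{series.name} = {v}": (series == v) for v in top})

# Hypothetical neighborhood labels: A appears 3 times, B twice, C once
hoods = pd.Series(["A", "A", "A", "B", "B", "C"], name="pickup_neighborhood")

encoded = one_hot_top_n(hoods, top_n=2)
print(encoded.columns.tolist())
print(encoded.iloc[5].tolist())  # the row with the rare value C gets no indicator
```

With `include_unknown=False`, as in `compute_features`, rows whose value falls outside the kept set are simply all zeros instead of getting a separate "unknown" column.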


feature_matrix1 = compute_features(features, cutoff_time)

Elapsed: 00:05 | Progress: 100%|██████████
Finishing computing...


preview(feature_matrix1, 5)


feature_matrix1.shape

(920378, 31)


plt.hist(np.sqrt(trips['trip_duration']))

(array([  4566.,  35831., 163872., 249551., 225381., 149544.,  80472.,
         38481.,  18054.,   8657.]),
 array([ 1.        ,  6.90499792, 12.80999584, 18.71499376, 24.61999167,
        30.52498959, 36.42998751, 42.33498543, 48.23998335, 54.14498127,
        60.04997918]),
 <BarContainer object of 10 artists>)


plt.hist(np.log(trips['trip_duration']))

(array([1.81000e+02, 5.97000e+02, 7.44000e+02, 1.43900e+03, 2.86700e+03,
        2.00260e+04, 1.35785e+05, 3.69738e+05, 3.50815e+05, 9.22170e+04]),
 array([0.        , 0.81903544, 1.63807088, 2.45710632, 3.27614176,
        4.0951772 , 4.91421264, 5.73324808, 6.55228352, 7.37131896,
        8.1903544 ]),
 <BarContainer object of 10 artists>)


# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train, y_train, X_test, y_test = get_train_test_fm(feature_matrix1,.75)
y_train = np.sqrt(y_train)
y_test = np.sqrt(y_test)
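
Because the models are trained on the square root of `trip_duration`, their predictions and error metrics are in square-root-of-seconds units. A minimal sketch, using invented durations and invented predictions, of mapping predictions back to seconds and measuring the error there:

```python
import numpy as np

# Hypothetical trip durations in seconds and their transformed targets
durations = np.array([100.0, 400.0, 900.0])
y = np.sqrt(durations)                      # what the models are fitted on

# Hypothetical model output on the square-root scale
pred_sqrt = np.array([11.0, 19.0, 31.0])

pred_seconds = pred_sqrt ** 2               # undo the square root
rmse_sqrt = np.sqrt(np.mean((y - pred_sqrt) ** 2))
rmse_seconds = np.sqrt(np.mean((durations - pred_seconds) ** 2))
print(rmse_sqrt, rmse_seconds)
```

The two error values are not comparable with each other, so it helps to state which scale a reported RMSE or MAE refers to.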


#RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())

# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))
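
A small worked example of the two metrics, with made-up numbers, shows how RMSE weights a single large miss more heavily than MAE does:

```python
import numpy as np

def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())

def mae(predictions, targets):
    return np.mean(np.abs(targets - predictions))

# Hypothetical predictions: errors of 1, 0 and 4
preds = np.array([2.0, 4.0, 6.0])
targets = np.array([1.0, 4.0, 10.0])

r = rmse(preds, targets)   # sqrt((1 + 0 + 16) / 3)
m = mae(preds, targets)    # (1 + 0 + 4) / 3
print(r, m)
```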


# Model Performance on test and train data
def model_pref(model, x_train, x_test, y_train,y_test):

    # Insample Prediction
    y_pred_train = model.predict(x_train)
    y_observed_train = y_train

    # Prediction on test data
    y_pred_test = model.predict(x_test)
    y_observed_test = y_test

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                'RSquared':
                    [r2_score(y_observed_train,y_pred_train),
                    r2_score(y_observed_test,y_pred_test )
                    ],
                "RMSE": [
                    rmse(y_pred_train, y_observed_train),
                    rmse(y_pred_test, y_observed_test),
                ],
                "MAE": [
                    mae(y_pred_train, y_observed_train),
                    mae(y_pred_test, y_observed_test),
                ],
            }
        )
    )


#defining the model

lr1=LinearRegression()

#fitting the model
lr1.fit(X_train,y_train)

LinearRegression()


model_pref(lr1, X_train, X_test,y_train,y_test)

    Data  RSquared      RMSE       MAE
0  Train  0.576106  6.132169  4.736641
1   Test  0.555831  6.558806  5.024874


#define the model
dt=DecisionTreeRegressor()

#fit the model

dt.fit(X_train,y_train)

DecisionTreeRegressor()


model_pref(dt, X_train, X_test,y_train,y_test)

    Data  RSquared      RMSE       MAE
0  Train  0.917069  2.712331  1.518055
1   Test  0.597426  6.244153  4.638653


#define the model

#use max_depth=7
dt_pruned=DecisionTreeRegressor(max_depth=7)

#fit the model
dt_pruned.fit(X_train,y_train)

DecisionTreeRegressor(max_depth=7)


model_pref(dt_pruned, X_train, X_test,y_train,y_test)

    Data  RSquared      RMSE       MAE
0  Train  0.733766  4.859783  3.708634
1   Test  0.704190  5.352503  4.049448


#define the model

#using (n_estimators=60,max_depth=7)

rf=RandomForestRegressor(n_estimators=60,max_depth=7)


#fit the model

rf.fit(X_train,y_train)

RandomForestRegressor(max_depth=7, n_estimators=60)


model_pref(rf, X_train, X_test,y_train,y_test)

    Data  RSquared      RMSE       MAE
0  Train  0.738265  4.818549  3.676261
1   Test  0.708235  5.315778  4.019578


trans_primitives = [Minute, Hour, Day, Month, Weekday, IsWeekend]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)
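
The transform primitives passed to `ft.dfs` here each derive one value per datetime. As a rough plain-pandas equivalent (a sketch of what the features contain, not how featuretools computes them), using two invented timestamps:

```python
import pandas as pd

# Hypothetical pickup times: a Friday and a Saturday
ts = pd.Series(pd.to_datetime(["2016-01-01 00:00:19", "2016-01-02 13:45:00"]))

feats = pd.DataFrame({
    "MINUTE": ts.dt.minute,
    "HOUR": ts.dt.hour,
    "DAY": ts.dt.day,
    "MONTH": ts.dt.month,
    "WEEKDAY": ts.dt.weekday,            # Monday = 0 ... Sunday = 6
    "IS_WEEKEND": ts.dt.weekday >= 5,    # Saturday or Sunday
})
print(feats)
```

Each primitive is applied to both `pickup_datetime` and `dropoff_datetime`, which is why six primitives yield twelve datetime features.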


print ("Number of features: %d" % len(features))
features

Number of features: 23

[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>]


feature_matrix2 = compute_features(features, cutoff_time)

Elapsed: 00:06 | Progress: 100%|██████████
Finishing computing...


feature_matrix2.shape

(920378, 41)


feature_matrix2.head()


# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train2, y_train2, X_test2, y_test2 = get_train_test_fm(feature_matrix2,.75)
y_train2 = np.sqrt(y_train2)
y_test2 = np.sqrt(y_test2)


#defining the model

lr2=LinearRegression()

#fitting the model
lr2.fit(X_train2,y_train2)

LinearRegression()


model_pref(lr2, X_train2, X_test2,y_train2,y_test2)

    Data  RSquared      RMSE       MAE
0  Train  0.623501  5.779195  4.273127
1   Test  0.623651  6.037346  4.537582


#define the model
dt2=DecisionTreeRegressor()

#fit the model

dt2.fit(X_train2,y_train2)

DecisionTreeRegressor()


model_pref(dt2, X_train2, X_test2,y_train2,y_test2)

    Data  RSquared      RMSE       MAE
0  Train  1.000000  0.001479  0.000004
1   Test  0.713607  5.266615  3.789094


#define the model

#use max_depth=7
dt_pruned2=DecisionTreeRegressor(max_depth=7)

#fit the model
dt_pruned2.fit(X_train2,y_train2)

DecisionTreeRegressor(max_depth=7)


model_pref(dt_pruned2, X_train2, X_test2,y_train2,y_test2)

    Data  RSquared      RMSE       MAE
0  Train  0.768105  4.535566  3.423759
1   Test  0.745845  4.961350  3.698053


#define the model
#using (n_estimators=60,max_depth=7)

rf2 = RandomForestRegressor(n_estimators=60,max_depth=7)

#fit the model

rf2.fit(X_train2,y_train2)

RandomForestRegressor(max_depth=7, n_estimators=60)


model_pref(rf2, X_train2, X_test2,y_train2,y_test2)

    Data  RSquared      RMSE       MAE
0  Train  0.774642  4.471179  3.366930
1   Test  0.751791  4.902971  3.645133


trans_primitives = [Minute, Hour, Day, Month, Weekday, IsWeekend]
aggregation_primitives = [Count, Sum, Mean, Median, Std, Max, Min]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=aggregation_primitives,
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)
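
The aggregation primitives summarize all trips that belong to a neighborhood and attach the summary to each trip. A simplified plain-pandas sketch of that group-then-join pattern follows. It ignores the cutoff times that featuretools applies when computing the real matrix, and the toy data is invented:

```python
import pandas as pd

# Hypothetical trips in two pickup neighborhoods
trips_toy = pd.DataFrame({
    "pickup_neighborhood": ["A", "A", "B"],
    "trip_duration": [300.0, 500.0, 900.0],
})

# One row per neighborhood, one column per aggregation
agg = (trips_toy.groupby("pickup_neighborhood")["trip_duration"]
       .agg(["count", "sum", "mean", "median", "std", "min", "max"]))

# Attach the neighborhood-level statistics back onto each trip
joined = trips_toy.join(agg, on="pickup_neighborhood")
print(joined)
```

A neighborhood with a single trip has an undefined standard deviation, which is one way missing values can enter the feature matrix before imputation.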


print ("Number of features: %d" % len(features))
features

Number of features: 61

[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: trip_distance>,
 <Feature: payment_type>,
 <Feature: trip_duration>,
 <Feature: pickup_neighborhood>,
 <Feature: dropoff_neighborhood>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: pickup_neighborhoods.latitude>,
 <Feature: pickup_neighborhoods.longitude>,
 <Feature: dropoff_neighborhoods.latitude>,
 <Feature: dropoff_neighborhoods.longitude>,
 <Feature: pickup_neighborhoods.COUNT(trips)>,
 <Feature: pickup_neighborhoods.MAX(trips.passenger_count)>,
 <Feature: pickup_neighborhoods.MAX(trips.trip_distance)>,
 <Feature: pickup_neighborhoods.MAX(trips.trip_duration)>,
 <Feature: pickup_neighborhoods.MEAN(trips.passenger_count)>,
 <Feature: pickup_neighborhoods.MEAN(trips.trip_distance)>,
 <Feature: pickup_neighborhoods.MEAN(trips.trip_duration)>,
 <Feature: pickup_neighborhoods.MEDIAN(trips.passenger_count)>,
 <Feature: pickup_neighborhoods.MEDIAN(trips.trip_distance)>,
 <Feature: pickup_neighborhoods.MEDIAN(trips.trip_duration)>,
 <Feature: pickup_neighborhoods.MIN(trips.passenger_count)>,
 <Feature: pickup_neighborhoods.MIN(trips.trip_distance)>,
 <Feature: pickup_neighborhoods.MIN(trips.trip_duration)>,
 <Feature: pickup_neighborhoods.STD(trips.passenger_count)>,
 <Feature: pickup_neighborhoods.STD(trips.trip_distance)>,
 <Feature: pickup_neighborhoods.STD(trips.trip_duration)>,
 <Feature: pickup_neighborhoods.SUM(trips.passenger_count)>,
 <Feature: pickup_neighborhoods.SUM(trips.trip_distance)>,
 <Feature: pickup_neighborhoods.SUM(trips.trip_duration)>,
 <Feature: dropoff_neighborhoods.COUNT(trips)>,
 <Feature: dropoff_neighborhoods.MAX(trips.passenger_count)>,
 <Feature: dropoff_neighborhoods.MAX(trips.trip_distance)>,
 <Feature: dropoff_neighborhoods.MAX(trips.trip_duration)>,
 <Feature: dropoff_neighborhoods.MEAN(trips.passenger_count)>,
 <Feature: dropoff_neighborhoods.MEAN(trips.trip_distance)>,
 <Feature: dropoff_neighborhoods.MEAN(trips.trip_duration)>,
 <Feature: dropoff_neighborhoods.MEDIAN(trips.passenger_count)>,
 <Feature: dropoff_neighborhoods.MEDIAN(trips.trip_distance)>,
 <Feature: dropoff_neighborhoods.MEDIAN(trips.trip_duration)>,
 <Feature: dropoff_neighborhoods.MIN(trips.passenger_count)>,
 <Feature: dropoff_neighborhoods.MIN(trips.trip_distance)>,
 <Feature: dropoff_neighborhoods.MIN(trips.trip_duration)>,
 <Feature: dropoff_neighborhoods.STD(trips.passenger_count)>,
 <Feature: dropoff_neighborhoods.STD(trips.trip_distance)>,
 <Feature: dropoff_neighborhoods.STD(trips.trip_duration)>,
 <Feature: dropoff_neighborhoods.SUM(trips.passenger_count)>,
 <Feature: dropoff_neighborhoods.SUM(trips.trip_distance)>,
 <Feature: dropoff_neighborhoods.SUM(trips.trip_duration)>]


feature_matrix3 = compute_features(features, cutoff_time)

Elapsed: 00:15 | Progress: 100%|██████████
Finishing computing...


feature_matrix3.head()


# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train3, y_train3, X_test3, y_test3 = get_train_test_fm(feature_matrix3,.75)
y_train3 = np.sqrt(y_train3)
y_test3 = np.sqrt(y_test3)


#defining the model

lr3=LinearRegression()

#fitting the model
lr3.fit(X_train3,y_train3)

LinearRegression()


model_pref(lr3, X_train3, X_test3,y_train3,y_test3)

    Data  RSquared      RMSE       MAE
0  Train  0.643423  5.624220  4.134927
1   Test  0.631697  5.972457  4.464424


#define the model
dt3=DecisionTreeRegressor()

#fit the model

dt3.fit(X_train3,y_train3)

DecisionTreeRegressor()


model_pref(dt3, X_train3, X_test3,y_train3,y_test3)

    Data  RSquared      RMSE       MAE
0  Train  1.000000  0.001479  0.000004
1   Test  0.648706  5.832915  4.190501


#define the model

#use max_depth=7
dt_pruned3=DecisionTreeRegressor(max_depth=7)

#fit the model
dt_pruned3.fit(X_train3,y_train3)

DecisionTreeRegressor(max_depth=7)


model_pref(dt_pruned3, X_train3, X_test3,y_train3,y_test3)

    Data  RSquared      RMSE       MAE
0  Train  0.769212  4.524722  3.418452
1   Test  0.745722  4.962549  3.696122


#define the model
#using (n_estimators=60,max_depth=4)

rf3 = RandomForestRegressor(n_estimators=60,max_depth=4)

#fit the model

rf3.fit(X_train3,y_train3)

RandomForestRegressor(max_depth=4, n_estimators=60)


model_pref(rf3, X_train3, X_test3,y_train3,y_test3)

    Data  RSquared      RMSE       MAE
0  Train  0.712060  5.054010  3.872712
1   Test  0.684469  5.528041  4.190322


y_pred = rf2.predict(X_test2)
y_pred = y_pred**2 # undo the sqrt we took earlier
y_pred[5:]

array([ 528.98474395,  215.79232445,  663.81910876, ...,  182.07380292,
       1028.99455389, 1759.33222947])


feature_importances(rf2, feature_matrix2.drop(['trip_duration'],axis=1).columns, n=10)

1: Feature: trip_distance, 0.910
2: Feature: HOUR(dropoff_datetime), 0.036
3: Feature: HOUR(pickup_datetime), 0.019
4: Feature: dropoff_neighborhoods.latitude, 0.016
5: Feature: pickup_neighborhoods.longitude, 0.003
6: Feature: WEEKDAY(dropoff_datetime), 0.003
7: Feature: IS_WEEKEND(dropoff_datetime), 0.003
8: Feature: WEEKDAY(pickup_datetime), 0.003
9: Feature: vendor_id, 0.002
10: Feature: IS_WEEKEND(pickup_datetime), 0.002

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	payment_type	trip_duration	pickup_neighborhood	dropoff_neighborhood
0	0	2	2016-01-01 00:00:19	2016-01-01 00:06:31	3	1.32	-73.961258	40.796200	-73.950050	40.787312	2	372.0	AH	C
649598	679634	1	2016-04-30 11:45:59	2016-04-30 11:47:47	1	0.50	-73.994919	40.755226	-74.000351	40.747917	1	108.0	D	AG
649599	679635	2	2016-04-30 11:46:04	2016-04-30 11:47:41	2	0.33	-73.978935	40.777172	-73.981888	40.773136	2	97.0	AV	AV
649600	679636	2	2016-04-30 11:46:39	2016-04-30 11:58:02	1	1.78	-73.998207	40.745201	-73.990265	40.729023	2	683.0	AP	H
649601	679637	2	2016-04-30 11:46:44	2016-04-30 11:55:42	1	1.40	-73.987129	40.739429	-74.007370	40.743511	2	538.0	R	Q
649602	679638	2	2016-04-30 11:47:30	2016-04-30 11:54:00	1	1.12	-73.942375	40.790768	-73.952095	40.777145	2	390.0	J	AM
649603	679639	1	2016-04-30 11:47:38	2016-04-30 11:57:22	2	1.90	-73.960800	40.769920	-73.978966	40.785698	1	584.0	K	I
649604	679640	1	2016-04-30 11:47:49	2016-04-30 12:01:05	1	4.30	-74.013885	40.709515	-73.987213	40.722343	2	796.0	AU	AC
649605	679641	1	2016-04-30 11:48:17	2016-04-30 12:01:02	1	2.90	-73.975426	40.757584	-73.999016	40.722027	1	765.0	A	X
649606	679642	1	2016-04-30 11:49:44	2016-04-30 12:00:03	1	1.30	-73.989815	40.750454	-74.000473	40.762352	2	619.0	D	P

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	payment_type	trip_duration	pickup_neighborhood	dropoff_neighborhood
0	0	2	2016-01-01 00:00:19	2016-01-01 00:06:31	3	1.32	-73.961258	40.796200	-73.950050	40.787312	2	372.0	AH	C
1	1	2	2016-01-01 00:01:45	2016-01-01 00:27:38	1	13.70	-73.956169	40.707756	-73.939949	40.839558	1	1553.0	Z	S
2	2	1	2016-01-01 00:01:47	2016-01-01 00:21:51	2	5.30	-73.993103	40.752632	-73.953903	40.816540	2	1204.0	D	AL
3	3	2	2016-01-01 00:01:48	2016-01-01 00:16:06	1	7.19	-73.983009	40.731419	-73.930969	40.808460	2	858.0	AT	J
4	4	1	2016-01-01 00:02:49	2016-01-01 00:20:45	2	2.90	-74.004631	40.747234	-73.976395	40.777237	1	1076.0	AG	AV

	id	passenger_count	trip_distance	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	trip_duration
count	9.744090e+05	974409.000000	974409.000000	974409.000000	974409.000000	974409.000000	974409.000000	974409.000000
mean	5.096223e+05	1.664010	2.734356	-73.973275	40.752475	-73.972825	40.753046	797.702753
std	2.944916e+05	1.314975	3.307038	0.035702	0.026668	0.031348	0.029151	576.802176
min	0.000000e+00	0.000000	0.000000	-74.029846	40.630268	-74.029945	40.630009	0.000000
25%	2.545210e+05	1.000000	1.000000	-73.991058	40.739689	-73.990356	40.738792	389.000000
50%	5.093100e+05	1.000000	1.640000	-73.981178	40.755390	-73.979156	40.755650	646.000000
75%	7.647430e+05	2.000000	2.990000	-73.966888	40.768929	-73.962769	40.770454	1040.000000
max	1.020002e+06	9.000000	502.800000	-73.770508	40.849911	-73.770020	40.849998	3606.000000

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	payment_type	trip_duration	pickup_neighborhood	dropoff_neighborhood
852	880	1	2016-01-01 02:15:56	2016-01-01 02:16:17	1	0.0	-74.002586	40.750298	-74.002861	40.750446	2	21.0	AG	AG
1079	1116	1	2016-01-01 03:01:10	2016-01-01 03:03:26	1	0.0	-73.987831	40.728558	-73.988747	40.727280	3	136.0	H	H
1408	1455	2	2016-01-01 04:09:43	2016-01-01 04:10:48	1	0.0	-73.985893	40.763649	-73.985741	40.763672	2	65.0	AR	AR
1440	1488	1	2016-01-01 04:16:54	2016-01-01 04:16:57	1	0.0	-74.014198	40.709988	-74.014198	40.709988	3	3.0	AU	AU
1510	1558	1	2016-01-01 04:36:03	2016-01-01 04:36:16	1	0.0	-73.952507	40.817329	-73.952499	40.817322	2	13.0	AL	AL
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
972967	1018490	1	2016-06-30 19:09:44	2016-06-30 19:22:21	1	0.0	-73.945480	40.751400	-73.945496	40.751549	2	757.0	AN	AN
973384	1018928	2	2016-06-30 20:35:08	2016-06-30 20:35:10	1	0.0	-73.983864	40.693813	-73.983910	40.693817	1	2.0	AS	AS
973555	1019105	2	2016-06-30 21:13:50	2016-06-30 21:14:05	1	0.0	-74.008789	40.708740	-74.008659	40.708858	1	15.0	AU	AU
973607	1019159	2	2016-06-30 21:24:23	2016-06-30 21:37:40	1	0.0	-73.974510	40.778297	-73.977272	40.754047	1	797.0	I	AD
973898	1019464	2	2016-06-30 22:20:27	2016-06-30 22:43:11	2	0.0	-73.978920	40.688160	-73.992317	40.749359	1	1364.0	V	D

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	payment_type	pickup_neighborhood	dropoff_neighborhood
44446	46325	1	2016-01-10 00:48:55	2016-01-10 00:48:55	1	1.20	-73.968842	40.766972	-73.968842	40.766972	3	AK	AK
121544	126869	2	2016-01-26 00:07:47	2016-01-26 00:07:47	6	4.35	-73.986694	40.739815	-73.956139	40.732872	1	R	Z
142202	148598	1	2016-01-30 00:00:29	2016-01-30 00:00:29	1	1.64	-73.989578	40.743877	-73.989578	40.743877	2	AO	AO
172653	180414	1	2016-02-04 19:23:45	2016-02-04 19:23:45	1	0.40	-73.990868	40.751106	-73.990868	40.751106	2	D	D
173013	180795	1	2016-02-04 20:23:10	2016-02-04 20:23:10	1	0.80	-73.977661	40.752968	-73.977661	40.752968	2	AD	AD

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	payment_type	trip_duration	pickup_neighborhood	dropoff_neighborhood
171143	178815	1	2016-02-04 14:05:10	2016-02-04 14:56:37	1	156.2	-73.979149	40.765499	-73.782806	40.644009	1	3087.0	AR	G
248346	259490	1	2016-02-18 09:48:06	2016-02-18 09:50:27	1	501.4	-73.980087	40.782185	-73.981468	40.778519	2	141.0	I	AV
525084	548884	1	2016-04-07 21:19:03	2016-04-07 22:03:17	3	172.3	-73.783340	40.644176	-73.936028	40.737762	2	2654.0	G	AN
530340	554389	1	2016-04-08 19:19:32	2016-04-08 19:41:33	2	502.8	-73.995461	40.724884	-73.986099	40.762108	1	1321.0	X	AA
828650	867217	1	2016-06-02 21:30:17	2016-06-02 21:36:47	2	101.0	-73.961586	40.800968	-73.950165	40.802193	2	390.0	AH	J

	neighborhood_id	latitude	longitude
0	AH	40.804349	-73.961716
1	Z	40.715828	-73.954298
2	D	40.750179	-73.992557
3	AT	40.729670	-73.981693
4	AG	40.749843	-74.003458

	id	pickup_datetime
54031	56311	2016-01-12 00:00:25
667608	698423	2016-05-03 17:59:59
667609	698424	2016-05-03 18:00:52
667610	698425	2016-05-03 18:01:06
667611	698426	2016-05-03 18:01:11
667612	698427	2016-05-03 18:01:12
667613	698428	2016-05-03 18:01:12
667614	698429	2016-05-03 18:01:24
667615	698430	2016-05-03 18:01:36
667616	698431	2016-05-03 18:01:39

	dropoff_neighborhood = AD	dropoff_neighborhood = A	dropoff_neighborhood = AA	dropoff_neighborhood = D	dropoff_neighborhood = AR	dropoff_neighborhood = C	dropoff_neighborhood = O	dropoff_neighborhood = N	dropoff_neighborhood = AK	dropoff_neighborhood = AO	...	pickup_neighborhood = A	pickup_neighborhood = AR	pickup_neighborhood = AK	pickup_neighborhood = AO	pickup_neighborhood = N	pickup_neighborhood = O	pickup_neighborhood = R	dropoff_neighborhoods.latitude	dropoff_neighborhoods.longitude	IS_WEEKEND(pickup_datetime)
id
56311	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	40.721435	-73.998366	False
698423	True	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	40.752186	-73.976515	False
698424	False	False	False	False	False	False	True	False	False	False	...	False	False	True	False	False	False	False	40.775299	-73.960551	False
698425	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	40.793597	-73.969822	False
698426	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	40.766809	-73.956886	False

	WEEKDAY(pickup_datetime)	trip_duration	MINUTE(pickup_datetime)	DAY(dropoff_datetime)	payment_type	IS_WEEKEND(dropoff_datetime)	pickup_neighborhoods.latitude	DAY(pickup_datetime)	dropoff_neighborhood = AD	dropoff_neighborhood = A	...	HOUR(dropoff_datetime)	dropoff_neighborhoods.latitude	vendor_id	MINUTE(dropoff_datetime)	MONTH(dropoff_datetime)	dropoff_neighborhoods.longitude	trip_distance	MONTH(pickup_datetime)	WEEKDAY(dropoff_datetime)	HOUR(pickup_datetime)
id
56311	1	645.0	0	12	1	False	40.720245	12	False	False	...	0	40.721435	2	11	1	-73.998366	1.61	1	1	0
56312	1	1270.0	2	12	2	False	40.646194	12	False	False	...	0	40.715828	2	23	1	-73.954298	16.15	1	1	0
56313	1	207.0	2	12	1	False	40.818445	12	False	False	...	0	40.818445	1	5	1	-73.948046	0.80	1	1	0
56314	1	214.0	2	12	2	False	40.729652	12	False	False	...	0	40.742531	2	6	1	-73.977943	1.33	1	1	0
56315	1	570.0	3	12	1	False	40.793597	12	False	False	...	0	40.818445	2	13	1	-73.948046	2.35	1	1	0

Project Predictive Analytics: New York City Taxi Ride Duration Prediction

Context

Objective

Dataset

The following steps will be taken:

Installing the featuretools library

Note: If !pip install featuretools doesn't work, install the library from the Anaconda Prompt by typing the following command

Importing libraries

Load the Datasets

Display first five rows

Display info of the dataset

Check the number of unique values in the dataset

Question 1: Check summary statistics of the dataset (1 Mark)

Checking for the rows for which trip_distance is 0

Replacing the 0 values with the median of the trip distance
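
A minimal sketch of this step, assuming the trips table is a pandas DataFrame with a trip_distance column. The small DataFrame below is a made-up stand-in, and taking the median over the whole column (zeros included) is an assumption rather than something the notebook confirms.

```python
import pandas as pd

# Toy stand-in for the trips table; these values are illustrative only.
trips = pd.DataFrame({"trip_distance": [1.32, 0.0, 5.30, 0.0, 2.90]})

# Median of the column, computed with the zero rows still included (assumption).
median_distance = trips["trip_distance"].median()

# Overwrite only the rows whose recorded distance is exactly 0.
trips.loc[trips["trip_distance"] == 0, "trip_distance"] = median_distance
```

Using .loc with a boolean mask changes the affected rows in place and leaves every other row untouched.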

Checking for the rows for which trip_duration is 0

Question 2: Univariate Analysis

Question 2.1: Build histograms for numerical columns (1 Mark)

Clipping the outliers of trip distance to 50
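
One way to cap the distances, sketched with pandas Series.clip. The DataFrame is a small stand-in; 156.2 and 501.4 are two of the outlier distances shown in the output above, and the other values are made up.

```python
import pandas as pd

# Toy stand-in for the trips table, including two of the outlier distances.
trips = pd.DataFrame({"trip_distance": [1.32, 156.2, 501.4, 2.90, 50.0]})

# Cap every distance above 50 at 50; smaller values are left unchanged.
trips["trip_distance"] = trips["trip_distance"].clip(upper=50)
```

Clipping keeps the affected rows in the dataset instead of dropping them, so the row count stays the same.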

Question 2.2: Plotting a countplot for passenger_count (1 Mark)

Question 2.3: Plotting countplots for pickup_neighborhood and dropoff_neighborhood (2 Marks)

Bivariate analysis

Plot a scatter plot for trip distance and trip duration

Step 2: Prepare the Data

Question 3: Define entities and relationships for the Deep Feature Synthesis (2 Marks)

Step 3: Create baseline features using Deep Feature Synthesis

Question 4: Creating a baseline model with only 1 transform primitive (10 Marks)

Build the Model

Transforming the duration variable with sqrt and log
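
A short sketch of the two transforms using NumPy. The durations are taken from the first rows of the trips table shown above. Whether the notebook uses np.log or a variant such as np.log1p is not visible here, so np.log is an assumption, and it requires that no zero durations remain.

```python
import numpy as np
import pandas as pd

# Durations (in seconds) from the first five rows of the trips table.
durations = pd.Series([372.0, 1553.0, 1204.0, 858.0, 1076.0], name="trip_duration")

# Square-root and natural-log versions of the target.
sqrt_duration = np.sqrt(durations)
log_duration = np.log(durations)  # assumes every duration is strictly positive

# Predictions made on a transformed scale must be mapped back before reporting.
recovered_from_sqrt = sqrt_duration ** 2
recovered_from_log = np.exp(log_duration)
```

Both transforms compress the long right tail of the duration distribution. A model trained on the transformed target predicts on that scale, so its output has to be inverted before it can be compared with the real durations.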

Splitting the data into train and test

Defining a function to check the performance of the model
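
The notebook's own function is not shown in this export, so the helper below is only an illustrative sketch. The name regression_metrics and the choice of RMSE, MAE, and R-squared are assumptions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R-squared for a set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))

    # R-squared compares the model with always predicting the mean of y_true.
    # Assumes y_true is not constant, otherwise ss_tot is 0.
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "r2": r2}

# Example: a model that always predicts the mean gets an R-squared of 0.
scores = regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Calling the same helper on the train and test predictions gives a quick check for overfitting: a large gap between the two sets of scores suggests the model does not generalize.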

Question 4.3: Build Linear regression using only the weekend transform primitive

Check the performance of the model

Question 4.4: Building Decision tree using only the weekend transform primitive

Check the performance of the model

Question 4.5: Building Pruned Decision tree using only the weekend transform primitive

Check the performance of the model

Question 4.6: Building Random Forest using only the weekend transform primitive

Check the performance of the model

Step 4: Adding more Transform Primitives and creating new model

Question 5: Create models with more transform primitives (10 Marks)

Build the new models with more transform features

Question 5.3: Building Linear regression using more transform primitives

Check the performance of the model

Question 5.4: Building Decision tree using more transform primitives

Check the performance of the model

Question 5.5: Building Pruned Decision tree using more transform primitives

Check the performance of the model

Question 5.6: Building Random Forest using more transform primitives

Check the performance of the model

Step 5: Add Aggregation Primitives

Question 6: Create models with transform and aggregate primitives (10 Marks)

Build the new models with more transform and aggregate features

Question 6.3: Building Linear regression model with transform and aggregate primitives

Check the performance of the model

Question 6.4: Building Decision tree with transform and aggregate primitives

Check the performance of the model

Question 6.5: Building Pruned Decision tree with transform and aggregate primitives

Check the performance of the model

Question 6.6: Building Random Forest with transform and aggregate primitives

Check the performance of the model

Based on the three models above, we can make predictions using model2: it gives almost the same accuracy as model3, and its training time is shorter than model3's.

Question 7: What are some important features based on model2 and how can they affect the duration of the rides? (3 Marks)