In this segment, we examine a real-world example: we predict workers' wages using a linear combination of worker characteristics, and we assess the predictive performance of our prediction rules using the mean squared error (MSE), adjusted MSE, and R-squared, as well as the out-of-sample MSE and R-squared.
The data come from the March Supplement of the U.S. Current Population Survey for the year 2012. The sample is restricted to single (never married) workers whose education level is high school, some college, or college graduate. The sample size is approximately 4,000 (3,835 observations in the file used here).
The outcome variable Y is the hourly wage, and the X's are worker characteristics such as gender, experience, education, and geographical indicators.
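To make the evaluation criteria concrete, here is a minimal sketch of how in-sample, adjusted, and out-of-sample MSE and R-squared could be computed for a linear prediction rule. It runs on synthetic data, not on the wage file. The helper names (`fit_ols`, `mse`, `r_squared`), the simulated coefficients, and the 300/100 split are assumptions made for illustration. The adjustment factor n/(n - p), with p counting all columns including the intercept, is one common convention; other texts use slightly different degrees-of-freedom corrections.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(y, y_hat):
    """Mean squared prediction error."""
    return float(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    """1 - MSE / (mean squared deviation of y from its own sample mean)."""
    return 1.0 - mse(y, y_hat) / float(np.mean((y - np.mean(y)) ** 2))

# Synthetic data standing in for (experience, female) -> wage; all numbers are made up.
rng = np.random.default_rng(0)
n_total = 400
exper = rng.uniform(0, 35, n_total)
female = rng.integers(0, 2, n_total)
y = 10 + 0.3 * exper - 2.0 * female + rng.normal(0, 3, n_total)
X = np.column_stack([np.ones(n_total), exper, female])

# Hold out the last 100 rows for out-of-sample evaluation.
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

beta = fit_ols(X_train, y_train)
n, p = X_train.shape

# In-sample measures, and their degrees-of-freedom adjusted versions.
mse_in = mse(y_train, X_train @ beta)
r2_in = r_squared(y_train, X_train @ beta)
mse_adj = n / (n - p) * mse_in
r2_adj = 1.0 - n / (n - p) * (1.0 - r2_in)

# Out-of-sample measures use the coefficients fitted on the training rows only.
mse_out = mse(y_test, X_test @ beta)
r2_out = r_squared(y_test, X_test @ beta)

print(f"in-sample : MSE={mse_in:.3f}  R2={r2_in:.3f}")
print(f"adjusted  : MSE={mse_adj:.3f}  R2={r2_adj:.3f}")
print(f"out-sample: MSE={mse_out:.3f}  R2={r2_out:.3f}")
```

Because n/(n - p) is larger than one, the adjusted MSE is always at least the in-sample MSE and the adjusted R-squared is never above the in-sample R-squared. The out-of-sample R-squared here is measured against the mean of the held-out outcomes, which is also a choice, not the only option.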
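The dataset contains the following variables:
- female: 1 if the worker is female, 0 otherwise
- cg, sc, hsg: education indicators (college graduate, some college, high school graduate)
- mw, so, we, ne: regional indicators (Midwest, South, West, Northeast)
- exp1: experience; exp2 and exp3: rescaled squared and cubed experience
- wage: hourly wage (the outcome variable)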
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Load data
df = pd.read_csv('predicting_wages.csv')
# Show the first 5 rows of the dataset
df.head()
| | female | cg | sc | hsg | mw | so | we | ne | exp1 | exp2 | exp3 | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 33.0 | 10.89 | 35.937 | 11.659091 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 27.0 | 7.29 | 19.683 | 12.825000 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 13.0 | 1.69 | 2.197 | 5.777027 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2.0 | 0.04 | 0.008 | 12.468750 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 15.0 | 2.25 | 3.375 | 18.525000 |
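The printed rows suggest that exp2 and exp3 are not independent measurements but rescaled powers of exp1. The sketch below checks this on the five rows shown, copied by hand. The formulas exp1²/100 and exp1³/1000 are inferred from those values, not taken from any documentation of the file, so treat them as an assumption.

```python
import numpy as np
import pandas as pd

# Experience columns of the five rows printed by df.head(), typed in by hand.
head = pd.DataFrame({
    "exp1": [33.0, 27.0, 13.0, 2.0, 15.0],
    "exp2": [10.89, 7.29, 1.69, 0.04, 2.25],
    "exp3": [35.937, 19.683, 2.197, 0.008, 3.375],
})

# Assumed construction: polynomial terms in experience, rescaled to keep magnitudes small.
exp2_check = head["exp1"] ** 2 / 100
exp3_check = head["exp1"] ** 3 / 1000

matches = bool(
    np.allclose(head["exp2"], exp2_check)
    and np.allclose(head["exp3"], exp3_check)
)
print(matches)  # True for these five rows
```

If the same relation holds for the full file, a linear model in exp1, exp2, and exp3 amounts to a cubic polynomial in experience.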
# Checking info of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   female  3835 non-null   int64
 1   cg      3835 non-null   int64
 2   sc      3835 non-null   int64
 3   hsg     3835 non-null   int64
 4   mw      3835 non-null   int64
 5   so      3835 non-null   int64
 6   we      3835 non-null   int64
 7   ne      3835 non-null   int64
 8   exp1    3835 non-null   float64
 9   exp2    3835 non-null   float64
 10  exp3    3835 non-null   float64
 11  wage    3835 non-null   float64
dtypes: float64(4), int64(8)
memory usage: 359.7 KB
# Printing the summary statistics for the dataset
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| female | 3835.0 | 0.417992 | 0.493293 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| cg | 3835.0 | 0.376271 | 0.484513 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| sc | 3835.0 | 0.323859 | 0.468008 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| hsg | 3835.0 | 0.299870 | 0.458260 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| mw | 3835.0 | 0.287614 | 0.452709 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| so | 3835.0 | 0.243546 | 0.429278 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 |
| we | 3835.0 | 0.211734 | 0.408591 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 |
| ne | 3835.0 | 0.257106 | 0.437095 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 1.000000 |
| exp1 | 3835.0 | 13.353194 | 8.639348 | 2.000000 | 6.00000 | 11.000000 | 19.500000 | 35.000000 |
| exp2 | 3835.0 | 2.529267 | 2.910554 | 0.040000 | 0.36000 | 1.210000 | 3.802500 | 12.250000 |
| exp3 | 3835.0 | 5.812103 | 9.033207 | 0.008000 | 0.21600 | 1.331000 | 7.414875 | 42.875000 |
| wage | 3835.0 | 15.533356 | 13.518165 | 0.004275 | 9.61875 | 13.028571 | 17.812500 | 348.333017 |
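In the summary table, the means of cg, sc, and hsg add up to 1.000000, and so do the means of mw, so, we, and ne. This suggests that each group of indicators splits the sample into mutually exclusive, exhaustive categories. If that is the case, a linear model with an intercept cannot include every indicator of a group, because the columns would be perfectly collinear with the intercept. The sketch below shows one way to check the row-level property. The helper `is_partition` and the toy frame are hypothetical; on the real data the same call would be made with `df`.

```python
import pandas as pd

def is_partition(frame, columns):
    """True if every row has exactly one of the given 0/1 columns equal to 1."""
    return bool((frame[columns].sum(axis=1) == 1).all())

# Hypothetical toy rows that reuse the column names of the wage data.
toy = pd.DataFrame({
    "cg":  [0, 1, 0, 1],
    "sc":  [0, 0, 1, 0],
    "hsg": [1, 0, 0, 0],
    "mw":  [0, 0, 1, 0],
    "so":  [0, 1, 0, 0],
    "we":  [0, 0, 0, 1],
    "ne":  [1, 0, 0, 0],
})

education_ok = is_partition(toy, ["cg", "sc", "hsg"])
region_ok = is_partition(toy, ["mw", "so", "we", "ne"])
# A mixed group is not a partition: the second toy row has both cg and so set to 1.
mixed_ok = is_partition(toy, ["cg", "so"])

print(education_ok, region_ok, mixed_ok)
```

When a group passes this check, a common choice is to drop one category per group and treat it as the reference level.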
df[['exp1','exp2','exp3','wage']].boxplot(figsize=(20,10))
plt.show()
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x='exp1', y='wage', data=df)
<AxesSubplot:xlabel='exp1', ylabel='wage'>
cols=df.select_dtypes('int').columns.to_list()
cols
['female', 'cg', 'sc', 'hsg', 'mw', 'so', 'we', 'ne']
sns.kdeplot(data=df, x='wage', hue='female')
<AxesSubplot:xlabel='wage', ylabel='Density'>
# Wage density split by each indicator variable, one figure per indicator
for i in cols:
    sns.kdeplot(data=df, x='wage', hue=i)
    plt.show()
# Wage against experience, colored by each indicator variable
for i in cols:
    sns.scatterplot(x='exp1', y='wage', hue=i, data=df)
    plt.show()
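The plots above compare wages across the levels of each indicator visually. A numeric companion can make the same comparison easier to read off. The sketch below tabulates the mean, median, and count of wage for each level of each indicator. The function `wage_by_indicator` and the toy data are hypothetical; on the real data it would be called as `wage_by_indicator(df, cols)`.

```python
import pandas as pd

def wage_by_indicator(frame, indicators, outcome="wage"):
    """Mean, median, and count of `outcome` for each level of each 0/1 indicator."""
    pieces = []
    for col in indicators:
        stats = frame.groupby(col)[outcome].agg(["mean", "median", "count"])
        stats.index.name = "level"          # the 0/1 value of the indicator
        stats = stats.reset_index()
        stats.insert(0, "indicator", col)   # remember which indicator this block is for
        pieces.append(stats)
    return pd.concat(pieces, ignore_index=True)

# Hypothetical toy data with two indicators and an outcome.
toy = pd.DataFrame({
    "female": [0, 0, 1, 1, 1],
    "cg":     [0, 1, 0, 1, 1],
    "wage":   [10.0, 20.0, 12.0, 18.0, 24.0],
})

summary = wage_by_indicator(toy, ["female", "cg"])
print(summary)
```

Reporting the median next to the mean is useful here because the wage distribution has a long right tail, so a few very high wages can pull the group means apart.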