PREDICTING WAGES

In this segment, we examine a real-world example: we predict workers' wages using a linear combination of worker characteristics, and we assess the predictive performance of our prediction rules using the mean squared error (MSE), adjusted MSE, and R-squared, as well as the out-of-sample MSE and R-squared.

The data come from the March Supplement of the U.S. Current Population Survey for the year 2012. The sample focuses on single (never married) workers whose education level is high school graduate, some college, or college graduate. The sample size is approximately 4,000.

The outcome variable Y is an hourly wage, and the X’s are various characteristics of workers such as gender, experience, education, and geographical indicators.

Data Dictionary

The dataset contains the following variables:

  1. wage : hourly wage
  2. female : female dummy
  3. cg : college graduate dummy
  4. sc : some college dummy
  5. hsg : high school graduate dummy
  6. mw : Midwest dummy
  7. so : South dummy
  8. we : West dummy
  9. ne : Northeast dummy
  10. exp1 : experience (years)
  11. exp2 : experience squared (scaled as experience squared / 100)
  12. exp3 : experience cubed (scaled as experience cubed / 1000)
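
As a small illustration of the scaling used for exp2 and exp3, the sketch below builds both terms from a handful of made-up experience values. The input values are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical experience values in years (not taken from the dataset).
exp1 = np.array([0.0, 5.0, 10.0, 20.0])

# Scaled polynomial terms, following the data dictionary.
exp2 = exp1 ** 2 / 100    # experience squared / 100
exp3 = exp1 ** 3 / 1000   # experience cubed / 1000

print(np.column_stack([exp1, exp2, exp3]))
```

Dividing by 100 and 1000 keeps the three experience columns on comparable scales, which tends to make the fitted coefficients easier to read.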

Importing libraries

Checking the info of the dataset

Univariate Analysis

Checking the summary statistics of the dataset
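
The original code cell is not shown here. A minimal sketch of this step, assuming the data sit in a pandas DataFrame, could look like the following. The frame `df` is a synthetic stand-in with made-up values, used only so the example runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # synthetic sample size, not the real one

# Synthetic stand-in for three of the columns; all values are made up.
exp1 = rng.integers(0, 40, size=n).astype(float)
female = rng.integers(0, 2, size=n)
wage = 10 + 0.3 * exp1 - 2.0 * female + rng.normal(0, 5, size=n)
df = pd.DataFrame({"wage": wage, "female": female, "exp1": exp1})

# One row per variable: count, mean, std, min, quartiles, max.
summary = df.describe().T
print(summary[["count", "mean", "std", "min", "max"]])
```

For a dummy variable, the mean in this table is the share of observations with the dummy equal to 1.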

Bivariate Analysis

Let's first look at the relationship between experience and wages.

Now we make a list of the dummy columns and check their relationship with wage.

Now that we have completed our analysis, let's move on to modeling.
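
One way to carry out this check is to compare the mean wage between the two groups of each dummy. The sketch below does this on synthetic data. The column names follow the data dictionary, but the values and the wage equation are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000  # synthetic sample size

# Synthetic stand-in data: mutually exclusive education and region dummies.
edu = rng.integers(0, 3, size=n)      # 0 = hsg, 1 = sc, 2 = cg
region = rng.integers(0, 4, size=n)   # 0 = ne, 1 = mw, 2 = so, 3 = we
df = pd.DataFrame({
    "female": rng.integers(0, 2, size=n),
    "hsg": (edu == 0).astype(int),
    "sc": (edu == 1).astype(int),
    "cg": (edu == 2).astype(int),
    "ne": (region == 0).astype(int),
    "mw": (region == 1).astype(int),
    "so": (region == 2).astype(int),
    "we": (region == 3).astype(int),
})
# Made-up wage equation, used only to give the dummies something to explain.
df["wage"] = (12 + 2 * df["sc"] + 6 * df["cg"] - 2 * df["female"]
              + rng.normal(0, 5, size=n))

dummy_cols = ["female", "cg", "sc", "hsg", "mw", "so", "we", "ne"]

rows = []
for col in dummy_cols:
    means = df.groupby(col)["wage"].mean()  # mean wage for dummy = 0 and = 1
    rows.append({"dummy": col,
                 "mean_wage_0": means.loc[0],
                 "mean_wage_1": means.loc[1],
                 "gap": means.loc[1] - means.loc[0]})
gaps = pd.DataFrame(rows).set_index("dummy")
print(gaps.round(2))
```

The gap column is a raw difference in means, so it does not hold the other characteristics fixed. The regression models do that.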

Basic Model
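
The fitting code is not shown in this section. The sketch below fits a basic linear model by ordinary least squares using NumPy only, and computes the in-sample R-squared, adjusted R-squared, and adjusted MSE. It rests on several assumptions: the data are synthetic, hsg and ne are treated as the omitted reference categories (which would give p = 10 with an intercept), and the adjusted measures use the usual n / (n - p) degrees-of-freedom correction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000  # synthetic sample size

# Synthetic stand-in for the CPS variables; the coefficients are made up.
female = rng.integers(0, 2, size=n).astype(float)
edu = rng.integers(0, 3, size=n)      # 0 = hsg, 1 = sc, 2 = cg
region = rng.integers(0, 4, size=n)   # 0 = ne, 1 = mw, 2 = so, 3 = we
sc, cg = (edu == 1).astype(float), (edu == 2).astype(float)
mw, so, we = ((region == k).astype(float) for k in (1, 2, 3))
exp1 = rng.uniform(0, 40, size=n)
exp2, exp3 = exp1 ** 2 / 100, exp1 ** 3 / 1000
wage = (10 + 0.4 * exp1 - 0.6 * exp2 + 2 * sc + 6 * cg - 2 * female
        + rng.normal(0, 8, size=n))

# Design matrix: intercept plus nine regressors (hsg and ne omitted).
X = np.column_stack([np.ones(n), female, sc, cg, mw, so, we,
                     exp1, exp2, exp3])
p = X.shape[1]

# Ordinary least squares via a least-squares solve.
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
resid = wage - X @ beta

mse = np.mean(resid ** 2)                       # in-sample MSE
r2 = 1 - mse / np.var(wage)                     # in-sample R-squared
mse_adj = n / (n - p) * mse                     # degrees-of-freedom adjusted MSE
r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)       # adjusted R-squared

print(f"p={p}  R2={r2:.4f}  R2_adj={r2_adj:.4f}  MSE_adj={mse_adj:.3f}")
```

The adjustments penalize the in-sample fit for the number of estimated coefficients, so the adjusted R-squared is always below the raw one and the adjusted MSE is always above the raw one.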

Flexible model
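
The exact specification of the flexible model is not shown in this section. As an illustration only, the sketch below adds two-way interactions to a basic design: experience terms with the education and region dummies, and education with region. This choice of interactions is an assumption, and the number of columns it produces need not match the p reported for the flexible regression. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000  # synthetic sample size

# Synthetic stand-in for the CPS variables; the coefficients are made up.
female = rng.integers(0, 2, size=n).astype(float)
edu = rng.integers(0, 3, size=n)      # 0 = hsg, 1 = sc, 2 = cg
region = rng.integers(0, 4, size=n)   # 0 = ne, 1 = mw, 2 = so, 3 = we
sc, cg = (edu == 1).astype(float), (edu == 2).astype(float)
mw, so, we = ((region == k).astype(float) for k in (1, 2, 3))
exp1 = rng.uniform(0, 40, size=n)
exp2, exp3 = exp1 ** 2 / 100, exp1 ** 3 / 1000
wage = (10 + 0.4 * exp1 - 0.6 * exp2 + 2 * sc + 6 * cg - 2 * female
        + rng.normal(0, 8, size=n))


def fit_r2(X, y):
    """Return the in-sample R-squared of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - np.mean(resid ** 2) / np.var(y)


exper = [exp1, exp2, exp3]
educ = [sc, cg]
reg = [mw, so, we]

# Basic design: intercept, female, experience, education, region.
basic_cols = [np.ones(n), female] + exper + educ + reg

# Assumed interactions: experience x dummies, then education x region.
interactions = [e * d for e in exper for d in educ + reg]
interactions += [a * b for a in educ for b in reg]

X_basic = np.column_stack(basic_cols)
X_flex = np.column_stack(basic_cols + interactions)

r2_basic = fit_r2(X_basic, wage)
r2_flex = fit_r2(X_flex, wage)
print(X_basic.shape[1], round(r2_basic, 4))
print(X_flex.shape[1], round(r2_flex, 4))
```

Because the basic design is nested in the flexible one, the in-sample R-squared cannot decrease when the interactions are added. That is why the adjusted and out-of-sample measures are the more informative comparison.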

Given that p/n is quite small here, the sample linear regression should approximate the population linear regression quite well.

            p    R-squared_sample   R-squared_adj   MSE_adj
basic reg   10   0.0954             0.093           165.680
flexi reg   33   0.1039             0.096           165.118

We conclude that the performance of the basic and flexible models is about the same, with the flexible model being only slightly better (slightly higher R-squared and slightly lower MSE).

Basic Model on split data
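
The split used in the original code is not shown. The sketch below assumes a random 80/20 train/test split, fits the basic model on the training part, and evaluates the MSE and R-squared on the held-out part. The data are synthetic and the split ratio is an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000  # synthetic sample size

# Synthetic stand-in for the CPS variables; the coefficients are made up.
female = rng.integers(0, 2, size=n).astype(float)
edu = rng.integers(0, 3, size=n)      # 0 = hsg, 1 = sc, 2 = cg
region = rng.integers(0, 4, size=n)   # 0 = ne, 1 = mw, 2 = so, 3 = we
sc, cg = (edu == 1).astype(float), (edu == 2).astype(float)
mw, so, we = ((region == k).astype(float) for k in (1, 2, 3))
exp1 = rng.uniform(0, 40, size=n)
exp2, exp3 = exp1 ** 2 / 100, exp1 ** 3 / 1000
wage = (10 + 0.4 * exp1 - 0.6 * exp2 + 2 * sc + 6 * cg - 2 * female
        + rng.normal(0, 8, size=n))

X = np.column_stack([np.ones(n), female, sc, cg, mw, so, we,
                     exp1, exp2, exp3])

# Random split: 20% of the rows are held out for testing (assumed ratio).
idx = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Fit on the training rows only.
beta, *_ = np.linalg.lstsq(X[train_idx], wage[train_idx], rcond=None)

# Evaluate on rows the model has not seen.
pred = X[test_idx] @ beta
mse_test = np.mean((wage[test_idx] - pred) ** 2)
r2_test = 1 - mse_test / np.var(wage[test_idx])

print(f"MSE_test={mse_test:.3f}  R2_test={r2_test:.4f}")
```

The same steps apply to any other design matrix, so a more flexible specification can be evaluated by swapping in its columns for `X`.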

Flexible model on split data

Conclusion and recommendations

            p    R-squared_test   MSE_test
basic reg   10   0.1027           154.584
flexi reg   33   0.1046           154.260