# 61|PYTHON – Determining Cost Drivers with Regression

BYU Student Author: @Parker_Sherwood
Reviewers: @Jonathan_Weston, @Andrew
Estimated Time to Solve: Time 20 minutes

We provide the solution to this challenge using:

Overview

You have some experience with Python and regression analysis and are ready to put it to use. You know that something is statistically significant if the p-value is less than 0.05. You also know that “R Square” in the regression output denotes a high correlation for values between +/- 0.5 and +/- 1. Values between +/- 0.29 and 0 denote a weak or nonexistent relationship. The best drivers, as a result, will show a “R Square” above 0.5 (since we expect a positive relationship between a driver and its cost).

It’s time to find out just how good headcount is as a cost driver—and if there’s a better option.

Instructions

1. The overhead rate was calculated by summing the Manufacturing Overhead and dividing it by the sum of Headcount. Do you think this is the best way to determine the rate using headcount as a driver? Try another calculation to see if there’s an argument that \$1,000 is even the right number if using headcount.
Suggestions and Hints

Calculate the rate of manufacturing overhead/headcount for each production run, then average those. Is it the same as the initial estimate?

2. Now run a regression analysis on headcount (X variable) and manufacturing overhead (Y variable). What do you learn from the p-value? What do you learn from the R Square?
Suggestions and Hints

You can import stats from the scipy library and use “stats.linregress” to run the regression. The syntax calls for the x variable then the y variable. The regression output contains the following statistics: slope, intercept, r value, p value, and standard error. You can establish five variables to take on these values and set those 5 variables, separated by commas, equal to the linregress code line. In this case, you will need to square the r value to have the R Square value to analyze the data.

3. Now run a regression on direct labor hours (X variable) and direct labor (Y variable). Are the results what you would expect?
Suggestions and Hints

You should notice that direct labor hours and direct labor costs are perfectly correlated. This makes sense for a place that pays hourly wages.

4. Try at least one more regression to see if there’s a variable that is better than headcount at predicting manufacturing overhead costs. Put your R Square in the comments!
Suggestions and Hints

Try machine hours and manufacturing overhead.

5. Bonus: Use the regression outputs to estimate the overhead expense for each production run. You’ll need the intercept, the slope, and the driver for each production run. What else could you do to impress the production manager?
Suggestions and Hints

You may need to manually calculate the predicted values. Do this by multiplying the column containing your selected driver by the slope and then adding the intercept. Creating new columns in the dataframe may be useful as well.

Check Figures
• Step 1: Calculating the overhead rate as described in the hint should yield \$1,138.549.
• Step 2: The R Square value in the first regression should be around 0.01263. This means that this relationship is not very predictive.
• Step 3: The R Square value in the second regression should be 0.9999. This means that this relationship is very predictive.

Data Files

Solution

Solution Code
``````#import packages
import pandas as pd
from scipy import stats

#import data

#Step 1

print()

print()

#Step 2
#Run the regression and name the output

#Print the outputs. Notice that the R Square value is small (less than 0.5).
# Despite being statistically significant, this model is not very predictive.
print("Slope: ", hc_slope)
print("Intercept: ", hc_intercept)
print("R-squared: ", hc_r_value**2)
print("P-value: ", hc_p_value)
print("Standard error: ", hc_std_err)
print()

#Step 3
dl_slope, dl_intercept, dl_r_value, dl_p_value, dl_std_err = stats.linregress(df['Direct Labor Hours'], df['Direct Labor'])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print("DIRECT LABOR HOURS AND DIRECT LABOR")
print("Slope: ", dl_slope)
print("Intercept: ", dl_intercept)
print("R-squared: ", dl_r_value**2)
print("P-value: ", dl_p_value)
print("Standard error: ", dl_std_err)
print()

#Step 4
mh_slope, mh_intercept, mh_r_value, mh_p_value, mh_std_err = stats.linregress(df['Machine Hours'], df['Manufacturing Overhead'])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print("Slope: ", mh_slope)
print("Intercept: ", mh_intercept)
print("R-squared: ", mh_r_value**2)
print("P-value: ", mh_p_value)
print("Standard error: ", mh_std_err)
print()

#Step 5
df['predicted_overhead'] = mh_slope*df['Machine Hours'] + mh_intercept
print(f"The average difference between the predicted value and actual overhead is {df.differences.mean()}")
``````

Great job on this challenge!

Here was my code

pd.options.display.float_format = ‘{:,.2f}’.format

directory=os.getcwd()+“\”

#stats.linregress(data[‘’])

# Reg for MOH and headcount

r_value2=reg2.rvalue

p_value2=reg2.pvalue

r_value_sq2=float(r_value2)**2

print(f"the R Squared is {r_value_sq2}")

print(f"the p value is {p_value2}")

# reg for DLH and DL

reg1=stats.linregress(data[‘Direct Labor Hours’],data[‘Direct Labor’])

r_value1=reg1.rvalue

p_value1=reg1.pvalue

r_value_sq1=float(r_value1)**2

print(‘reg for DLH and DL’)

print(f"the R Squared is {r_value_sq1}")

print(f"the p value is {p_value1}")

# Reg for MOH and Machine hours

r_value3=reg3.rvalue

p_value3=reg3.pvalue

print( ‘Reg for MOH and Machine hours’)

r_value_sq3=float(r_value3)**2

print(f"the R Squared is {r_value_sq3}")

print(f"the p value is {p_value3}")

I don’t have any background in python but was able to follow along on the video and get it to work! this is my code:

#import packages
import pandas as pd
from scipy import stats

#import data

#Step 1

print()

print()

#Step 2
#Run the regression and name the output

#Print the outputs. Notice that the R Square value is small (less than 0.5).

# Despite being statistically significant, this model is not very predictive.

print("Slope: ", hc_slope)
print("Intercept: ", hc_intercept)
print("R-squared: ", hc_r_value**2)
print("P-value: ", hc_p_value)
print("Standard error: ", hc_std_err)
print()

#Step 3
dl_slope, dl_intercept, dl_r_value, dl_p_value, dl_std_err = stats.linregress(df[‘Direct Labor Hours’], df[‘Direct Labor’])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print(“DIRECT LABOR HOURS AND DIRECT LABOR”)
print("Slope: ", dl_slope)
print("Intercept: ", dl_intercept)
print("R-squared: ", dl_r_value**2)
print("P-value: ", dl_p_value)
print("Standard error: ", dl_std_err)
print()

#Step 4
mh_slope, mh_intercept, mh_r_value, mh_p_value, mh_std_err = stats.linregress(df[‘Machine Hours’], df[‘Manufacturing Overhead’])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print("Slope: ", mh_slope)
print("Intercept: ", mh_intercept)
print("R-squared: ", mh_r_value**2)
print("P-value: ", mh_p_value)
print("Standard error: ", mh_std_err)
print()

#Step 5
df[‘predicted_overhead’] = mh_slope*df[‘Machine Hours’] + mh_intercept
print(f"The average difference between the predicted value and actual overhead is {df.differences.mean()}")

I got this to work for me.

import os
import pandas as pd
from scipy import stats

pd.options.display.float_format = ‘{:,.2f}’.format

directory = os.getcwd()

# Function to perform regression analysis and print results

def perform_regression(x, y, x_label, y_label):
reg_result = stats.linregress(data, data[y])
r_squared = reg_result.rvalue ** 2
p_value = reg_result.pvalue

``````print(f"Reg for {y_label} and {x_label}")
print(f"The R Squared is {r_squared}")
print(f"The p-value is {p_value}")

return r_squared
``````

# Calculate overhead rate using direct labor hours

print(f"A more accurate estimate for Overhead/Direct Labor Hours would be {avg_dlh}")

# Perform regression analysis for direct labor hours and direct labor

r_squared_dlh = perform_regression(‘Direct Labor Hours’, ‘Direct Labor’, ‘DLH’, ‘DL’)

# Perform regression analysis for machine hours and manufacturing overhead

r_squared_machine_hours = perform_regression(‘Machine Hours’, ‘Manufacturing Overhead’, ‘Machine hours’, ‘MOH’)