61|PYTHON – Determining Cost Drivers with Regression

BYU Student Author: @Parker_Sherwood
Reviewers: @Jonathan_Weston, @Andrew
Estimated Time to Solve: Time 20 minutes

We provide the solution to this challenge using:

Need a program? Click here.

Overview
You work at Park’s Production, a simple manufacturing facility that makes one product. The facility has data on its first 500 production runs. The company doesn’t have much experience with managerial accounting, so they took a chance and hired you since you’ve had a class or two about the subject. At your most recent meeting with the production manager, you learned that the company is planning to allocate overhead based on headcount (at a rate of $1,000/head). You know that good cost drivers directly contribute to the cost being allocated. For example, headcount may be a good driver for a supervisor’s salary since supervisors are directly responsible for the people they oversee. Even though you’re new, you have your doubts about headcount being a good driver for all overhead costs such as depreciation, rent, supervisor salaries, etc.

You have some experience with Python and regression analysis and are ready to put it to use. You know that something is statistically significant if the p-value is less than 0.05. You also know that “R Square” in the regression output denotes a high correlation for values between +/- 0.5 and +/- 1. Values between +/- 0.29 and 0 denote a weak or nonexistent relationship. The best drivers, as a result, will show a “R Square” above 0.5 (since we expect a positive relationship between a driver and its cost).

It’s time to find out just how good headcount is as a cost driver—and if there’s a better option.

Instructions

  1. The overhead rate was calculated by summing the Manufacturing Overhead and dividing it by the sum of Headcount. Do you think this is the best way to determine the rate using headcount as a driver? Try another calculation to see if there’s an argument that $1,000 is even the right number if using headcount.
    Suggestions and Hints

    Calculate the rate of manufacturing overhead/headcount for each production run, then average those. Is it the same as the initial estimate?

  2. Now run a regression analysis on headcount (X variable) and manufacturing overhead (Y variable). What do you learn from the p-value? What do you learn from the R Square?
    Suggestions and Hints

    You can import stats from the scipy library and use “stats.linregress” to run the regression. The syntax calls for the x variable then the y variable. The regression output contains the following statistics: slope, intercept, r value, p value, and standard error. You can establish five variables to take on these values and set those 5 variables, separated by commas, equal to the linregress code line. In this case, you will need to square the r value to have the R Square value to analyze the data.

  3. Now run a regression on direct labor hours (X variable) and direct labor (Y variable). Are the results what you would expect?
    Suggestions and Hints

    You should notice that direct labor hours and direct labor costs are perfectly correlated. This makes sense for a place that pays hourly wages.

  4. Try at least one more regression to see if there’s a variable that is better than headcount at predicting manufacturing overhead costs. Put your R Square in the comments!
    Suggestions and Hints

    Try machine hours and manufacturing overhead.

  5. Bonus: Use the regression outputs to estimate the overhead expense for each production run. You’ll need the intercept, the slope, and the driver for each production run. What else could you do to impress the production manager?
    Suggestions and Hints

    You may need to manually calculate the predicted values. Do this by multiplying the column containing your selected driver by the slope and then adding the intercept. Creating new columns in the dataframe may be useful as well.

Check Figures
  • Step 1: Calculating the overhead rate as described in the hint should yield $1,138.549.
  • Step 2: The R Square value in the first regression should be around 0.01263. This means that this relationship is not very predictive.
  • Step 3: The R Square value in the second regression should be 0.9999. This means that this relationship is very predictive.

Data Files

Solution

Solution Code
#import packages 
import pandas as pd 
from scipy import stats 

#import data 
df = pd.read_csv(#Insert Filepath here) 

#Step 1 
total_overhead = df['Manufacturing Overhead'].sum() 
total_headcount = df['Headcount'].sum()
 
print(f"The driver estimate is {total_overhead/total_headcount}") 
print() 
 
df['overhead_per_head'] = df['Manufacturing Overhead']/df['Headcount'] 
print(f"The new estimate of the driver is {df['overhead_per_head'].mean()}") 
print() 

#Step 2 
#Run the regression and name the output 
hc_slope, hc_intercept, hc_r_value, hc_p_value, hc_std_err = stats.linregress(df['Headcount'], df['Manufacturing Overhead'])
 
#Print the outputs. Notice that the R Square value is small (less than 0.5).  
# Despite being statistically significant, this model is not very predictive. 
print("HEADCOUNT AND MANUFACTURING OVERHEAD") 
print("Slope: ", hc_slope) 
print("Intercept: ", hc_intercept) 
print("R-squared: ", hc_r_value**2) 
print("P-value: ", hc_p_value) 
print("Standard error: ", hc_std_err) 
print() 
 
#Step 3 
dl_slope, dl_intercept, dl_r_value, dl_p_value, dl_std_err = stats.linregress(df['Direct Labor Hours'], df['Direct Labor']) 
 
#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation. 
#This model is both predictive and statistically significant. 
print("DIRECT LABOR HOURS AND DIRECT LABOR") 
print("Slope: ", dl_slope) 
print("Intercept: ", dl_intercept) 
print("R-squared: ", dl_r_value**2) 
print("P-value: ", dl_p_value) 
print("Standard error: ", dl_std_err) 
print() 
 
#Step 4 
mh_slope, mh_intercept, mh_r_value, mh_p_value, mh_std_err = stats.linregress(df['Machine Hours'], df['Manufacturing Overhead']) 

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation. 
#This model is both predictive and statistically significant. 
print("MACHINE HOURS AND MANUFACTURING OVERHEAD") 
print("Slope: ", mh_slope) 
print("Intercept: ", mh_intercept) 
print("R-squared: ", mh_r_value**2) 
print("P-value: ", mh_p_value) 
print("Standard error: ", mh_std_err) 
print() 
 
#Step 5 
df['predicted_overhead'] = mh_slope*df['Machine Hours'] + mh_intercept 
df['differences'] = df['predicted_overhead']-df['Manufacturing Overhead'] 
print(f"The average difference between the predicted value and actual overhead is {df.differences.mean()}") 

Challenge61_Solution.txt
Solution Video: Challenge 61|PYTHON – Determining Cost Drivers with Regression

Great job on this challenge!

Here was my code

pd.options.display.float_format = ‘{:,.2f}’.format

directory=os.getcwd()+“\”

data=pd.read_csv(directory+“Challenge61_Data.csv”)

data[‘overhead_by_headcount’]=data[‘Manufacturing Overhead’]/data[‘Headcount’]

avg=data[‘overhead_by_headcount’].mean()

print(f’A more accurate estimate for Overhead/Headcount would be {avg}')

data.head()

#stats.linregress(data[‘’])

Reg for MOH and headcount

reg2=stats.linregress(data[‘Headcount’],data[‘Manufacturing Overhead’])

r_value2=reg2.rvalue

p_value2=reg2.pvalue

print(“Reg for MOH and headcount”)

r_value_sq2=float(r_value2)**2

print(f"the R Squared is {r_value_sq2}")

print(f"the p value is {p_value2}")

reg for DLH and DL

reg1=stats.linregress(data[‘Direct Labor Hours’],data[‘Direct Labor’])

r_value1=reg1.rvalue

p_value1=reg1.pvalue

r_value_sq1=float(r_value1)**2

print(‘reg for DLH and DL’)

print(f"the R Squared is {r_value_sq1}")

print(f"the p value is {p_value1}")

Reg for MOH and Machine hours

reg3=stats.linregress(data[‘Machine Hours’],data[‘Manufacturing Overhead’])

r_value3=reg3.rvalue

p_value3=reg3.pvalue

print( ‘Reg for MOH and Machine hours’)

r_value_sq3=float(r_value3)**2

print(f"the R Squared is {r_value_sq3}")

print(f"the p value is {p_value3}")

I don’t have any background in python but was able to follow along on the video and get it to work! this is my code:

#import packages
import pandas as pd
from scipy import stats

#import data
df = pd.read_csv(\“Mac\Home\Downloads\Challenge61_Data.csv”)

#Step 1
total_overhead = df[‘Manufacturing Overhead’].sum()
total_headcount = df[‘Headcount’].sum()

print(f"The driver estimate is {total_overhead/total_headcount}")
print()

df[‘overhead_per_head’] = df[‘Manufacturing Overhead’]/df[‘Headcount’]
print(f"The new estimate of the driver is {df[‘overhead_per_head’].mean()}")
print()

#Step 2
#Run the regression and name the output
hc_slope, hc_intercept, hc_r_value, hc_p_value, hc_std_err = stats.linregress(df[‘Headcount’], df[‘Manufacturing Overhead’])

#Print the outputs. Notice that the R Square value is small (less than 0.5).

Despite being statistically significant, this model is not very predictive.

print(“HEADCOUNT AND MANUFACTURING OVERHEAD”)
print("Slope: ", hc_slope)
print("Intercept: ", hc_intercept)
print("R-squared: ", hc_r_value**2)
print("P-value: ", hc_p_value)
print("Standard error: ", hc_std_err)
print()

#Step 3
dl_slope, dl_intercept, dl_r_value, dl_p_value, dl_std_err = stats.linregress(df[‘Direct Labor Hours’], df[‘Direct Labor’])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print(“DIRECT LABOR HOURS AND DIRECT LABOR”)
print("Slope: ", dl_slope)
print("Intercept: ", dl_intercept)
print("R-squared: ", dl_r_value**2)
print("P-value: ", dl_p_value)
print("Standard error: ", dl_std_err)
print()

#Step 4
mh_slope, mh_intercept, mh_r_value, mh_p_value, mh_std_err = stats.linregress(df[‘Machine Hours’], df[‘Manufacturing Overhead’])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print(“MACHINE HOURS AND MANUFACTURING OVERHEAD”)
print("Slope: ", mh_slope)
print("Intercept: ", mh_intercept)
print("R-squared: ", mh_r_value**2)
print("P-value: ", mh_p_value)
print("Standard error: ", mh_std_err)
print()

#Step 5
df[‘predicted_overhead’] = mh_slope*df[‘Machine Hours’] + mh_intercept
df[‘differences’] = df[‘predicted_overhead’]-df[‘Manufacturing Overhead’]
print(f"The average difference between the predicted value and actual overhead is {df.differences.mean()}")