61|PYTHON – Determining Cost Drivers with Regression

Parker_Sherwood · May 25, 2023, 5:19pm

BYU Student Author: @Parker_Sherwood
Reviewers: @Jonathan_Weston, @Andrew
Estimated Time to Solve: Time 20 minutes

We provide the solution to this challenge using:

Python (The challenge requires the pandas and scipy libraries.)
Go here to perform this challenge in Excel

Need a program? Click here.

Overview
You work at Park’s Production, a simple manufacturing facility that makes one product. The facility has data on its first 500 production runs. The company doesn’t have much experience with managerial accounting, so they took a chance and hired you since you’ve had a class or two about the subject. At your most recent meeting with the production manager, you learned that the company is planning to allocate overhead based on headcount (at a rate of $1,000/head). You know that good cost drivers directly contribute to the cost being allocated. For example, headcount may be a good driver for a supervisor’s salary since supervisors are directly responsible for the people they oversee. Even though you’re new, you have your doubts about headcount being a good driver for all overhead costs such as depreciation, rent, supervisor salaries, etc.

You have some experience with Python and regression analysis and are ready to put it to use. You know that something is statistically significant if the p-value is less than 0.05. You also know that “R Square” in the regression output denotes a high correlation for values between +/- 0.5 and +/- 1. Values between +/- 0.29 and 0 denote a weak or nonexistent relationship. The best drivers, as a result, will show a “R Square” above 0.5 (since we expect a positive relationship between a driver and its cost).

It’s time to find out just how good headcount is as a cost driver—and if there’s a better option.

Instructions

The overhead rate was calculated by summing the Manufacturing Overhead and dividing it by the sum of Headcount. Do you think this is the best way to determine the rate using headcount as a driver? Try another calculation to see if there’s an argument that $1,000 is even the right number if using headcount.

Suggestions and Hints

Calculate the rate of manufacturing overhead/headcount for each production run, then average those. Is it the same as the initial estimate?
Now run a regression analysis on headcount (X variable) and manufacturing overhead (Y variable). What do you learn from the p-value? What do you learn from the R Square?

Suggestions and Hints

You can import stats from the scipy library and use “stats.linregress” to run the regression. The syntax calls for the x variable then the y variable. The regression output contains the following statistics: slope, intercept, r value, p value, and standard error. You can establish five variables to take on these values and set those 5 variables, separated by commas, equal to the linregress code line. In this case, you will need to square the r value to have the R Square value to analyze the data.
Now run a regression on direct labor hours (X variable) and direct labor (Y variable). Are the results what you would expect?

Suggestions and Hints

You should notice that direct labor hours and direct labor costs are perfectly correlated. This makes sense for a place that pays hourly wages.
Try at least one more regression to see if there’s a variable that is better than headcount at predicting manufacturing overhead costs. Put your R Square in the comments!

Suggestions and Hints

Try machine hours and manufacturing overhead.
Bonus: Use the regression outputs to estimate the overhead expense for each production run. You’ll need the intercept, the slope, and the driver for each production run. What else could you do to impress the production manager?

Suggestions and Hints

You may need to manually calculate the predicted values. Do this by multiplying the column containing your selected driver by the slope and then adding the intercept. Creating new columns in the dataframe may be useful as well.

Check Figures

Step 1: Calculating the overhead rate as described in the hint should yield $1,138.549.
Step 2: The R Square value in the first regression should be around 0.01263. This means that this relationship is not very predictive.
Step 3: The R Square value in the second regression should be 0.9999. This means that this relationship is very predictive.

Data Files

Solution

Solution Code

#import packages 
import pandas as pd 
from scipy import stats 

#import data 
df = pd.read_csv(#Insert Filepath here) 

#Step 1 
total_overhead = df['Manufacturing Overhead'].sum() 
total_headcount = df['Headcount'].sum()
 
print(f"The driver estimate is {total_overhead/total_headcount}") 
print() 
 
df['overhead_per_head'] = df['Manufacturing Overhead']/df['Headcount'] 
print(f"The new estimate of the driver is {df['overhead_per_head'].mean()}") 
print() 

#Step 2 
#Run the regression and name the output 
hc_slope, hc_intercept, hc_r_value, hc_p_value, hc_std_err = stats.linregress(df['Headcount'], df['Manufacturing Overhead'])
 
#Print the outputs. Notice that the R Square value is small (less than 0.5).  
# Despite being statistically significant, this model is not very predictive. 
print("HEADCOUNT AND MANUFACTURING OVERHEAD") 
print("Slope: ", hc_slope) 
print("Intercept: ", hc_intercept) 
print("R-squared: ", hc_r_value**2) 
print("P-value: ", hc_p_value) 
print("Standard error: ", hc_std_err) 
print() 
 
#Step 3 
dl_slope, dl_intercept, dl_r_value, dl_p_value, dl_std_err = stats.linregress(df['Direct Labor Hours'], df['Direct Labor']) 
 
#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation. 
#This model is both predictive and statistically significant. 
print("DIRECT LABOR HOURS AND DIRECT LABOR") 
print("Slope: ", dl_slope) 
print("Intercept: ", dl_intercept) 
print("R-squared: ", dl_r_value**2) 
print("P-value: ", dl_p_value) 
print("Standard error: ", dl_std_err) 
print() 
 
#Step 4 
mh_slope, mh_intercept, mh_r_value, mh_p_value, mh_std_err = stats.linregress(df['Machine Hours'], df['Manufacturing Overhead']) 

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation. 
#This model is both predictive and statistically significant. 
print("MACHINE HOURS AND MANUFACTURING OVERHEAD") 
print("Slope: ", mh_slope) 
print("Intercept: ", mh_intercept) 
print("R-squared: ", mh_r_value**2) 
print("P-value: ", mh_p_value) 
print("Standard error: ", mh_std_err) 
print() 
 
#Step 5 
df['predicted_overhead'] = mh_slope*df['Machine Hours'] + mh_intercept 
df['differences'] = df['predicted_overhead']-df['Manufacturing Overhead'] 
print(f"The average difference between the predicted value and actual overhead is {df.differences.mean()}")

Challenge61_Solution.txt
Solution Video: Challenge 61|PYTHON – Determining Cost Drivers with Regression

Jim_Christensen · December 5, 2023, 7:25pm

Great job on this challenge!

Here was my code

pd.options.display.float_format = ‘{:,.2f}’.format

directory=os.getcwd()+“\”

data=pd.read_csv(directory+“Challenge61_Data.csv”)

data[‘overhead_by_headcount’]=data[‘Manufacturing Overhead’]/data[‘Headcount’]

avg=data[‘overhead_by_headcount’].mean()

print(f’A more accurate estimate for Overhead/Headcount would be {avg}')

data.head()

#stats.linregress(data[‘’])

Reg for MOH and headcount

reg2=stats.linregress(data[‘Headcount’],data[‘Manufacturing Overhead’])

r_value2=reg2.rvalue

p_value2=reg2.pvalue

print(“Reg for MOH and headcount”)

r_value_sq2=float(r_value2)**2

print(f"the R Squared is {r_value_sq2}")

print(f"the p value is {p_value2}")

reg for DLH and DL

reg1=stats.linregress(data[‘Direct Labor Hours’],data[‘Direct Labor’])

r_value1=reg1.rvalue

p_value1=reg1.pvalue

r_value_sq1=float(r_value1)**2

print(‘reg for DLH and DL’)

print(f"the R Squared is {r_value_sq1}")

print(f"the p value is {p_value1}")

Reg for MOH and Machine hours

reg3=stats.linregress(data[‘Machine Hours’],data[‘Manufacturing Overhead’])

r_value3=reg3.rvalue

p_value3=reg3.pvalue

print( ‘Reg for MOH and Machine hours’)

r_value_sq3=float(r_value3)**2

print(f"the R Squared is {r_value_sq3}")

print(f"the p value is {p_value3}")

Tyler_Coburn · December 5, 2023, 11:27pm

I don’t have any background in python but was able to follow along on the video and get it to work! this is my code:

#import packages
import pandas as pd
from scipy import stats

#import data
df = pd.read_csv(\“Mac\Home\Downloads\Challenge61_Data.csv”)

#Step 1
total_overhead = df[‘Manufacturing Overhead’].sum()
total_headcount = df[‘Headcount’].sum()

print(f"The driver estimate is {total_overhead/total_headcount}")
print()

df[‘overhead_per_head’] = df[‘Manufacturing Overhead’]/df[‘Headcount’]
print(f"The new estimate of the driver is {df[‘overhead_per_head’].mean()}")
print()

#Step 2
#Run the regression and name the output
hc_slope, hc_intercept, hc_r_value, hc_p_value, hc_std_err = stats.linregress(df[‘Headcount’], df[‘Manufacturing Overhead’])

#Print the outputs. Notice that the R Square value is small (less than 0.5).

Despite being statistically significant, this model is not very predictive.

print(“HEADCOUNT AND MANUFACTURING OVERHEAD”)
print("Slope: ", hc_slope)
print("Intercept: ", hc_intercept)
print("R-squared: ", hc_r_value**2)
print("P-value: ", hc_p_value)
print("Standard error: ", hc_std_err)
print()

#Step 3
dl_slope, dl_intercept, dl_r_value, dl_p_value, dl_std_err = stats.linregress(df[‘Direct Labor Hours’], df[‘Direct Labor’])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print(“DIRECT LABOR HOURS AND DIRECT LABOR”)
print("Slope: ", dl_slope)
print("Intercept: ", dl_intercept)
print("R-squared: ", dl_r_value**2)
print("P-value: ", dl_p_value)
print("Standard error: ", dl_std_err)
print()

#Step 4
mh_slope, mh_intercept, mh_r_value, mh_p_value, mh_std_err = stats.linregress(df[‘Machine Hours’], df[‘Manufacturing Overhead’])

#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print(“MACHINE HOURS AND MANUFACTURING OVERHEAD”)
print("Slope: ", mh_slope)
print("Intercept: ", mh_intercept)
print("R-squared: ", mh_r_value**2)
print("P-value: ", mh_p_value)
print("Standard error: ", mh_std_err)
print()

#Step 5
df[‘predicted_overhead’] = mh_slope*df[‘Machine Hours’] + mh_intercept
df[‘differences’] = df[‘predicted_overhead’]-df[‘Manufacturing Overhead’]
print(f"The average difference between the predicted value and actual overhead is {df.differences.mean()}")

McKay_Ballstaedt · December 11, 2023, 7:31pm

Parker_Sherwood:

You work at Park’s Production, a simple manufacturing facility that makes one product. The facility has data on its first 500 production runs. The company doesn’t have much experience with managerial accounting, so they took a chance and hired you since you’ve had a class or two about the subject. At your most recent meeting with the production manager, you learned that the company is planning to allocate overhead based on headcount (at a rate of $1,000/head). You know that good cost drivers directly contribute to the cost being allocated. For example, headcount may be a good driver for a supervisor’s salary since supervisors are directly responsible for the people they oversee. Even though you’re new, you have your doubts about headcount being a good driver for all overhead costs such as depreciation, rent, supervisor salaries, etc.

You have some experience with Python and regression analysis and are ready to put it to use. You know that something is statistically significant if the p-value is less than 0.05. You also know that “R Square” in the regression output denotes a high correlation for values between +/- 0.5 and +/- 1. Values between +/- 0.29 and 0 denote a weak or nonexistent relationship. The best drivers, as a result, will show a “R Square” above 0.5 (since we expect a positive relationship between a driver and its cost).

It’s time to find out just how good headcount is as a cost driver—and if there’s a better option.

Instructions

The overhead rate was calculated by summing the Manufacturing Overhead and dividing it by the sum of Headcount. Do you think this is the best way to determine the rate using headcount as a driver? Try another calculation to see if there’s an argument that $1,000 is even the right number if using headcount. Suggestions and Hints

Calculate the rate of manufacturing overhead/headcount for each production run, then average those. Is it the same as the initial estimate?

Now run a regression analysis on headcount (X variable) and manufacturing overhead (Y variable). What do you learn from the p-value? What do you learn from the R Square? Suggestions and Hints

You can import stats from the scipy library and use “stats.linregress” to run the regression. The syntax calls for the x variable then the y variable. The regression output contains the following statistics: slope, intercept, r value, p value, and standard error. You can establish five variables to take on these values and set those 5 variables, separated by commas, equal to the linregress code line. In this case, you will need to square the r value to have the R Square value to analyze the data.

Now run a regression on direct labor hours (X variable) and direct labor (Y variable). Are the results what you would expect? Suggestions and Hints

You should notice that direct labor hours and direct labor costs are perfectly correlated. This makes sense for a place that pays hourly wages.

Try at least one more regression to see if there’s a variable that is better than headcount at predicting manufacturing overhead costs. Put your R Square in the comments! Suggestions and Hints

Try machine hours and manufacturing overhead.

Bonus: Use the regression outputs to estimate the overhead expense for each production run. You’ll need the intercept, the slope, and the driver for each production run. What else could you do to impress the production manager?

I got this to work for me.

import os
import pandas as pd
from scipy import stats

pd.options.display.float_format = ‘{:,.2f}’.format

Read data from CSV

directory = os.getcwd()
data = pd.read_csv(os.path.join(directory, ‘Challenge61_Data.csv’))

Function to perform regression analysis and print results

def perform_regression(x, y, x_label, y_label):
reg_result = stats.linregress(data, data[y])
r_squared = reg_result.rvalue ** 2
p_value = reg_result.pvalue

print(f"Reg for {y_label} and {x_label}")
print(f"The R Squared is {r_squared}")
print(f"The p-value is {p_value}")

return r_squared

Calculate overhead rate using direct labor hours

data[‘overhead_by_dlh’] = data[‘Manufacturing Overhead’] / data[‘Direct Labor Hours’]
avg_dlh = data[‘overhead_by_dlh’].mean()
print(f"A more accurate estimate for Overhead/Direct Labor Hours would be {avg_dlh}")

Perform regression analysis for headcount and manufacturing overhead

r_squared_headcount = perform_regression(‘Headcount’, ‘Manufacturing Overhead’, ‘headcount’, ‘MOH’)

Perform regression analysis for direct labor hours and direct labor

r_squared_dlh = perform_regression(‘Direct Labor Hours’, ‘Direct Labor’, ‘DLH’, ‘DL’)

Perform regression analysis for machine hours and manufacturing overhead

r_squared_machine_hours = perform_regression(‘Machine Hours’, ‘Manufacturing Overhead’, ‘Machine hours’, ‘MOH’)

Bonus: Estimate overhead expense for each production run

intercept, slope, _, _, _ = stats.linregress(data[‘Headcount’], data[‘Manufacturing Overhead’])

data[‘estimated_overhead’] = intercept + slope * data[‘Headcount’]

Impress the production manager by displaying the estimated overhead for each production run

print(“Estimated Overhead for Each Production Run:”)
print(data[[‘Production Run’, ‘estimated_overhead’]])

Topic		Replies	Views
61\|EXCEL – Determining Cost Drivers with Regression Challenges excel , beginner , managerial , data-analysis , regression	61	1417	March 13, 2025
203\|PYTHON – COGS Costing Challenges intermediate , python , pandas , matplotlib	0	28	May 4, 2024
77\|PYTHON – Error Correction Challenges beginner , python , pandas , debugging	9	282	January 28, 2025
130\|PYTHON – Quick Calculation Challenges beginner , audit , python , pandas	3	126	June 27, 2024
230\|PYTHON – Data Visualization Seaborn Challenges intermediate , python , seaborn	0	13	May 22, 2024