*BYU Student Author*: @Parker_Sherwood

*Reviewers*: @Jonathan_Weston, @Andrew

*Estimated Time to Solve:* Time 20 minutes

We provide the solution to this challenge using:

- Python (The challenge requires the pandas and scipy libraries.)
- Go here to perform this challenge in Excel

Need a program? Click here.

**Overview**

You work at Park’s Production, a simple manufacturing facility that makes one product. The facility has data on its first 500 production runs. The company doesn’t have much experience with managerial accounting, so they took a chance and hired you since you’ve had a class or two about the subject. At your most recent meeting with the production manager, you learned that the company is planning to allocate overhead based on headcount (at a rate of $1,000/head). You know that good cost drivers directly contribute to the cost being allocated. For example, headcount may be a good driver for a supervisor’s salary since supervisors are directly responsible for the people they oversee. Even though you’re new, you have your doubts about headcount being a good driver for all overhead costs such as depreciation, rent, supervisor salaries, etc.

You have some experience with Python and regression analysis and are ready to put it to use. You know that something is statistically significant if the p-value is less than 0.05. You also know that “R Square” in the regression output denotes a high correlation for values between +/- 0.5 and +/- 1. Values between +/- 0.29 and 0 denote a weak or nonexistent relationship. The best drivers, as a result, will show a “R Square” above 0.5 (since we expect a positive relationship between a driver and its cost).

It’s time to find out just how good headcount is as a cost driver—and if there’s a better option.

**Instructions**

- The overhead rate was calculated by summing the Manufacturing Overhead and dividing it by the sum of Headcount. Do you think this is the best way to determine the rate using headcount as a driver? Try another calculation to see if there’s an argument that $1,000 is even the right number if using headcount.
## Suggestions and Hints

Calculate the rate of manufacturing overhead/headcount for each production run, then average those. Is it the same as the initial estimate?

- Now run a regression analysis on headcount (X variable) and manufacturing overhead (Y variable). What do you learn from the p-value? What do you learn from the R Square?
## Suggestions and Hints

You can import stats from the scipy library and use “stats.linregress” to run the regression. The syntax calls for the x variable then the y variable. The regression output contains the following statistics: slope, intercept, r value, p value, and standard error. You can establish five variables to take on these values and set those 5 variables, separated by commas, equal to the linregress code line. In this case, you will need to square the r value to have the R Square value to analyze the data.

- Now run a regression on direct labor hours (X variable) and direct labor (Y variable). Are the results what you would expect?
## Suggestions and Hints

You should notice that direct labor hours and direct labor costs are perfectly correlated. This makes sense for a place that pays hourly wages.

- Try at least one more regression to see if there’s a variable that is better than headcount at predicting manufacturing overhead costs. Put your R Square in the comments!
## Suggestions and Hints

Try machine hours and manufacturing overhead.

- Bonus: Use the regression outputs to estimate the overhead expense for each production run. You’ll need the intercept, the slope, and the driver for each production run. What else could you do to impress the production manager?
## Suggestions and Hints

You may need to manually calculate the predicted values. Do this by multiplying the column containing your selected driver by the slope and then adding the intercept. Creating new columns in the dataframe may be useful as well.

## Check Figures

- Step 1: Calculating the overhead rate as described in the hint should yield $1,138.549.
- Step 2: The R Square value in the first regression should be around 0.01263. This means that this relationship is not very predictive.
- Step 3: The R Square value in the second regression should be 0.9999. This means that this relationship is very predictive.

**Data Files**

**Solution**

## Solution Code

```
#import packages
import pandas as pd
from scipy import stats
#import data
df = pd.read_csv(#Insert Filepath here)
#Step 1
total_overhead = df['Manufacturing Overhead'].sum()
total_headcount = df['Headcount'].sum()
print(f"The driver estimate is {total_overhead/total_headcount}")
print()
df['overhead_per_head'] = df['Manufacturing Overhead']/df['Headcount']
print(f"The new estimate of the driver is {df['overhead_per_head'].mean()}")
print()
#Step 2
#Run the regression and name the output
hc_slope, hc_intercept, hc_r_value, hc_p_value, hc_std_err = stats.linregress(df['Headcount'], df['Manufacturing Overhead'])
#Print the outputs. Notice that the R Square value is small (less than 0.5).
# Despite being statistically significant, this model is not very predictive.
print("HEADCOUNT AND MANUFACTURING OVERHEAD")
print("Slope: ", hc_slope)
print("Intercept: ", hc_intercept)
print("R-squared: ", hc_r_value**2)
print("P-value: ", hc_p_value)
print("Standard error: ", hc_std_err)
print()
#Step 3
dl_slope, dl_intercept, dl_r_value, dl_p_value, dl_std_err = stats.linregress(df['Direct Labor Hours'], df['Direct Labor'])
#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print("DIRECT LABOR HOURS AND DIRECT LABOR")
print("Slope: ", dl_slope)
print("Intercept: ", dl_intercept)
print("R-squared: ", dl_r_value**2)
print("P-value: ", dl_p_value)
print("Standard error: ", dl_std_err)
print()
#Step 4
mh_slope, mh_intercept, mh_r_value, mh_p_value, mh_std_err = stats.linregress(df['Machine Hours'], df['Manufacturing Overhead'])
#Print the outputs. Notice that the R Square value is larger than 0.5, denoting a strong correlation.
#This model is both predictive and statistically significant.
print("MACHINE HOURS AND MANUFACTURING OVERHEAD")
print("Slope: ", mh_slope)
print("Intercept: ", mh_intercept)
print("R-squared: ", mh_r_value**2)
print("P-value: ", mh_p_value)
print("Standard error: ", mh_std_err)
print()
#Step 5
df['predicted_overhead'] = mh_slope*df['Machine Hours'] + mh_intercept
df['differences'] = df['predicted_overhead']-df['Manufacturing Overhead']
print(f"The average difference between the predicted value and actual overhead is {df.differences.mean()}")
```

Challenge61_Solution.txt

*Solution Video:* Challenge 61|PYTHON – Determining Cost Drivers with Regression