BYU Student Author: @Kyle_Nilsen
Reviewers: @Millie_K_B, @Trent_Barlow
Estimated Time to Solve: 45 Minutes
This is an intro challenge that is part of the Python Learning Path.
We provide the solution to this challenge using:
- Python
Need a program? Click here.
Overview
You are an accounting intern for Slice and Dice MegaCorp, Inc., a newly formed yet large-scale tech corporation. The accounting department has been in shambles as a result of processing the work for various periods across the previously different companies, each with their own bookkeeping methods. At long last, they have consolidated all of the employees, products, and sales into a single excel file, “Challenge161_Data.xlsx”, though each set of info is on its own sheet. You are given the task of consolidating all sheets into one comprehensive dataset to find meaningful measures of performance for the young company and decide the most efficient way of doing this is through Python programming with the Pandas package.
Instructions
All instructions necessary for completing this challenge are contained within the provided downloadable Jupyter notebook. However, here is a summarized list of the steps you will perform:
Part I
- Combine all sales data quarterly data into a single dataframe called data using pd.read_excel() and pd.concat(). Then load the other three Excel sheets into their own dataframes
Part II
- Merge data and customer and find the number of customers who have not made any purchases.
- Merge data and product and find the missing ProductID and fill in null values
- Create and apply a function to salesperson to convert a string variable to a float, then merge it with data to find a commission calculation.
- Perform data analysis with data manipulation and aggregate functions.
Data Files
Suggestions and Hints
The hint numbers provided here will match up to their respective problems in the Jupyter notebook.
Part I
1a. When using pd.read_excel for a file with multiple sheets, you will need to include the parameter sheet_name=’x’ after inputting the name of the file.
1b. The pd.concat function requires a list input to run. Either store your four sales data dataframes into a list, or use brackets around the four when calling the function. Also, use ignore_index=True as a second parameter so that the matching indices from the four quarters are reassigned and create no conflicts.
Part II
1a. Regarding any time where the “indicator” parameter is mentioned for joins, this can be included by saying “indicator=True”, which creates a new column called, by default, “_merge”. This column indicates how the two tables are related to each other based on the column you used for the “on” parameter.
1b. IMPORTANT: Modifying the same dataframe multiple times can result in errors arising if you do not keep track of what your most recent update changed. If this happens, you can work around the errors by importing again the original dataframe. For a specific example, if you ever get the error “Cannot use name of an existing column for indicator column,” this is because the “_merge” column was not deleted and is now preventing another merge from being created. Again, to fix this, rerun the import of the original data and any other cells preceding your current step. You can then include “del data[‘_merge’]” at the end of your current cell to prevent any of those same errors.
3a. To get the commission rate percentage, split the string and then pull from the newly generated list the correct number and multiply it by .01. Hidden here is the code for my function convertstring:
def convertstring(string):
string = string.split()
nowfloat = float(string[0])*.01
return nowfloat
4a. Use reset_index() to move the current index into a column. This will give new index values starting at 0 and move the previous index into a column so that it can then be called by slicing.
4b. You can include multiple groupby statements by separating the columns you wish to group by with a comma within the method call.
Solution