BYU Student Author: @Donovon
Reviewers: @MitchFrei @Brett_Lowe
Estimated Time to Solve: 30 Minutes
We provide the solution to this challenge using:
Need a program? Click here
You are a new audit associate with Hardy and Hardy, a CPA firm in based in Bayport. You’ve been assigned to work on the audit for Bayport’s premiere toy store, Vinny’s Vintage Vehicles. In the planning meeting for this audit, your senior asks if you’d feel comfortable looking at the sales data for this past year and looking for any discrepancies. The actual analysis of the data will be completed by a more experienced associate and all you need to do is find rows of data that contain errors. Your senior says that there are usually about 20 erroneous lines of data each year and gives you the following instructions to clarify details. She apologizes for the seemingly random rules to find erroneous rows but assures you that the data always follows the same format and these rules help to narrow down problems.
Definition of erroneous data:
- Erroneous data is that which does not conform to the description in the data dictionary
- Missing postal codes are not erroneous
- Duplicate order numbers are not erroneous
- Customer and contact names are never erroneous
- Duplicate rows are erroneous
- A row is erroneous if a value that should be in one field is in another
- A value is erroneous if it does not match the format of the other data. E.g. planes and cars in place of cars and planes
- Download the data and open it in Excel.
- Get an understanding of each of the fields in the table. The data dictionary can be useful in doing so.
- Use various data analytics techniques to find rows of data that don’t follow the rules outlined above. Be creative and think of ways that you could find one off errors in each field.
- Highlight each of the rows with issues and add a note so the next associate knows what you found.
- Feel free to look at the hints to get a few ideas of techniques you might use in your search or keep them hidden to let your creativity shine!
- Don’t forget to have fun!
Suggestions and Hints
There are exactly 20 erroneous rows of data.
Look for breaks in patterns for each field. It may be helpful to create a table. You can do this from the insert tab of excel. Use the table’s built in filtering to look through the values of each field for anything that seems wrong.
Consider adding new calculated fields to the table that you can use to show which groups of fields don’t go quite match the way they should.
I thought this was a great challenge that tested my abilities to find errors in data. I only found 15/20 in about 40 minutes of work, so I still have some room to improve! But here’s a screenshot of what I was able to find and some error notes.
I was unable to find all 20 errors without looking at the solution. After looking at all 20 errors, I seemed to struggle finding number errors.
I found most of my errors by creating a table and looking through the filter to see what fields didn’t fit. Looks like that turned out to be one of the hints as well! Definitely a good simple way to find errors, but a little time consuming I must say. I wonder if there is a more efficient and effective way to find those problems?? I also was able to find duplicate rows. I found 16/20 errors before peaking at the solution. It took me about an hour to solve. Here is my solution!
@Parker_Sherwood I got 15 as well, and I think they are the same 15 you have. I compared my 15 to the 20 you found, @Mike_Paulsin, and I don’t know why those 5 aren’t showing for me. For example, your note for the line that has sales of $3,566.94 says its a duplicate row, but I can’t find its duplicate in my solution.
I used table filters as @Donovon suggests in his hints. I also used an array formula to calculate whether a line is a duplicate. The formula (for the top two rows of data) is:
Of course, it only works when you sort the data set so that any duplicates would be positioned one on top of the other. The solution file marks both duplicates as erroneous, and I only marked one of each duplicate as erroneous, so that accounts for at least two of what my solution misses. I also identified the transaction with a quantity of 333, but I failed to mark it as erroneous because it reconciled with the amount of the sale. I should’ve looked at the deal size column and seen that it was supposed to be a “Medium” deal.
Well, this is a tough challenge, I might have to keep working on it, but I found 14 erroneous rows. I wasn’t sure when it came to the duplicated rows if I was supposed to include both rows or just one as erroneous because I thought one transaction for had been simply double marked and if I took out both rows, I would end up with less correct transactions if that makes any sense
I noticed I missed the state error and the country errors, I wasn’t sure of any way to check for proper entries besides knowing all the countries in the list or all the states and going through by hand in the filter. Anyone have any ideas for that?
I loved practicing filtering out the data and checking different entries to see check for corrections.
I underestimated this challenge big time! I was also missing a few of the errors at the end. My initial strategy was to use the “Advanced Filter” tool and get a list of all the unique values in a column. This helped me identify when something was in a column, but should not be. This idea only works for some of the errors though and cannot catch them all. Some of them would be hard to catch. Maybe Vinny’s Vintage Vehicles needs some better internal controls to ensure data integrity!
I’m going to be honest, I’m glad I did this challenge because I had no idea where to start looking for errors. Your hint on checking the filters helped a ton. To find the duplicates I first used the remove duplicates funciton found in the Table Design tab. Unfortunately, this doesn’t identify the rows just removes them. So then I decided to use =CONCAT() in conjuction with an If statement to find duplicates. It worked really nicely actually. That’s what I posted below. I was also able to find 16 of the 20 before checking the solution.
Awesome challenge. I’m sure I missed a couple but I was able to mark the vast majority of the erroneous data that needs to be cleaned so the table is more readable and useful. I found making the file into an excel table was extremely helpful and many of the errors could be found quite easily using the filters available. You can see what I did below.
Challenge56_Data.xlsx (367.8 KB)
This was a fun one! This really tested my attention to detail and interpreting a Data Dictionary. Unlike other challenges, I spent most of my time manually looking through the dataset. I didn’t automate very much. Below includes how I found the errors:
- Filtered individual columns to look for strange values
- Filtered individual columns by ascending/descending to reveal incorrect values
- Concatenated entire row into a single cell and conditional formatted for duplicates
- Created simple IF formula to verify sales amount (only for rows with a sales price less than 100). This is where I struggled the most. I didn’t figure out how to verify the total sales value for products that sold for more than 100.
Challenge56_Data.xlsx (860.7 KB)
Did you account for the fact that the “PRICEEACH” column capped out at 100? Based on the instructions, I interpreted this to mean that the “PRICEEACH” value is not reliable for items that cost more than 100. It looks like you calculated the sales price by multiplying “QUANTITYORDERED” by “PRICEEACH.”
From what I understand, that calculation only works for products that are sold of less than 100 each. This is where I struggled the most with the challenge–how do I verify the sales price for items that were sold for more than 100 each? Rylan, did you figure out a different way to calculate this?
This is a very open ended challenge, which I think leads to great practice and also it is nice because I can save it and work on it to keep practicing. My first thoughts were to look to the filters and to see what might have be entered erroneously… like if letter inputs were put into number specific columns (I uploaded the picture of what I found almost immediately.
My updated file was too large to upload here, but I was able to find 9 issues with the data. The most helpful change I made to said data was to insert it into a table. This allowed easier filtering of data, leading to find outliers in points of data (such as years in the invoice number column). I concatenated the first few columns of data and applied a conditional rule highlighting repeats of the concatenated values. This revealed a couple duplicate rows.
I found 14 errors but wasn’t sure how to isolate the other 6. I just put the list in a table a used the sorting ability to single out the errors.
Challenge56_Data.xlsx (429.4 KB)
I wasn’t able to find all the errors, but I found what I could with my limited knowledge of Excel.
Challenge56_Data excel.xlsx (346.1 KB)
I didn’t quite get all the errors either, but I went through and filtered every column and found several mistakes in the data!