BYU Student Author: @Jae
Reviewers: @IWillyerd
Estimated Time to Solve: 90 Minutes
We provide the solution to this challenge using:
- Python
Need a program? Click here.
Overview
Data Drifters is a local vintage toy vehicle store, selling vintage toy cars, motorcycles, ships, and planes. Recently Data Drifters has begun transitioning from their old third-party sales and inventory management system to a new in-house system that hopes to give them the ability to analyze their sales and inventory records better.
While exporting the current data, Rhys, the store manager noticed that the old system had exported its 13-column database into a combined single-column dataset with various delimiters. Being busy with managing the storefront, Rhys has hired you to come clean this dataset so that he can finish implementing the new system.
Instructions
Using the exported data file, break the single-column dataset into the original 13-columns with the appropriate values. Rhys has provided a data dictionary to help you understand what the data columns should include. Make sure to export your cleaned data into a file for Rhys named Clean_Data.csv.
Notes on cleaning the data:
- You do not need to include any leading/trailing symbols such as $ in the cleaned output
- Order Date should still be m/d/yyyy formatted
- Product Code should also begin with an βSβ
- Rhys expects to receive a few more of these exports before the new system is fully functional, so your code should be able to run without an issue on a different dataset
Suggestions and Hints
Breakdown of each column:
- ORDERNUMBER - first 5 numbers
- QUANTITYORDERED - qty precedes the $ sign
- PRICEEACH - number following $ (do not keep $)
- ORDERLINENUMBER - number following |
- SALES - number following $
- ORDERDATE - date
- STATUS - text after date
- QTR_ID - quarter of year 1-4
- MONTH_ID - same month as order date
- YEAR_ID - same year as order date
- PRODUCTLINE - text
- MSRP - number preceding S in product code
- PRODUCTCODE - Code beginning with S, follows S##_#### format (number of digits vary)
The following code will make sure your pandas dataframe is not truncated:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
The following line of code is a good way to practice your string manipulation on a single line of the data, replace the x with whatever row number you want:
line = df.iloc[:,0][x]
Code to check your Clean_Data.csv output file
Make sure that your Clean_Data.csv file does not include the index number of each row.
#Make sure both .csv files are in the same folder as your code file. Else, you will need to replace the string with your file path
import csv
# Read in your solution file (Ensure that the file name matches)
with open('Clean_Data.csv', 'r') as file1:
reader1 = csv.reader(file1)
data1 = list(reader1)
# Read in the Challenge Solution Check file
with open('Challenge66_Solution_Check.csv', 'r') as file2:
reader2 = csv.reader(file2)
data2 = list(reader2)
# Get the column names from the first row of the file
column_names = data1[0]
# Compare the number of rows and columns in each file
if len(data1) != len(data2) or len(data1[0]) != len(data2[0]):
print("Files do not match")
else:
# Compare the values in each row
for i in range(1, len(data1)):
for j in range(len(data1[0])):
if data1[i][j] != data2[i][j]:
print(f"Files do not match. Issue in column '{column_names[j]}'")
break
else:
continue
break
else:
print("Files match perfectly, great job!")
Bonus Questions
- Which products are selling under the MSRP?
- Which orders are still not shipped?
Data Files
Solution
Solution Code
The solution code assumes that the data file is in the same folder as the saved code file that is running the solution code.
import os
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
file_path = os.getcwd()
df = pd.read_csv(f'{file_path}//ChallengeM16_Data_Drifters.csv')
# Create a clean dataframe with the proper column headers
headers = df.columns[0].split(' ')
clean_df = pd.DataFrame(columns=headers)
# Define lists and data dictionary to be used to hold the cleaned data
data = {}
order_number = []
qty_ordered = []
price = []
order_line = []
sales = []
order_date = []
status = []
quarter = []
month_list = []
year_list = []
product_line = []
msrp_list = []
product_code = []
# Clean the data row by row
for row in df.iloc[:,0]:
# ORDERNUMBER
order_number.append(row[:5])
data['ORDERNUMBER'] = order_number
# QUANTITYORDERED
qty_ordered.append(row[5:7])
data['QUANTITYORDERED'] = qty_ordered
#PRICEEACH
price.append(row.split("$")[1].split("|")[0])
data['PRICEEACH'] = price
# ORDERLINENUMBER
order_line.append(row.split("|")[1].split("$")[0])
data['ORDERLINENUMBER'] = order_line
# SALES
sales.append(row.split("$")[2].split(" ")[0])
data['SALES'] = sales
# ORDERDATE
date_pos = row.find(" ")
year_pos = row.find("/",row.find("/")+1)
order_date.append(row[date_pos+1:year_pos+5])
data['ORDERDATE'] = order_date
# STATUS
if row.split("/")[0][-3] == " ":
month = int(row.split("/")[0][-2:])
else:
month = int(row.split("/")[0][-1])
month_pos = row.split("/")[2][4:].find(str(month))
if month == 1:
status.append(row.split("/")[2][4:][:month_pos])
else:
status.append(row.split("/")[2][4:][:month_pos-1])
data['STATUS'] = status
# QTR_ID
if month in [1,2,3]: Q = 1
elif month in [4,5,6]: Q = 2
elif month in [7,8,9]: Q = 3
else: Q = 4
quarter.append(Q)
data['QTR_ID'] = quarter
#MONTH_ID
month_list.append(month)
data['MONTH_ID'] = month_list
#YEAR_ID
year = row[year_pos+1:year_pos+5]
year_list.append(year)
data['YEAR_ID'] = year_list
#PRODUCTLINE
year_pos2 = row.find(year,row.find(year)+1)
_pos = row.find('_')
word_list = []
for char in row[year_pos2:_pos]:
if char.isalpha():
word_list.append(char)
word = ''.join(word_list)
product_line.append(word[:-1])
data['PRODUCTLINE'] = product_line
#MSRP
msrp = row.split("_")[0].split('S')[-2]
msrp_pos = msrp.find(word)
if msrp[msrp_pos-2] == 's':
msrp_list.append(msrp[msrp_pos-1:])
else:
msrp_list.append(msrp[msrp_pos-2:])
data['MSRP'] = msrp_list
#PRODUCTCODE
half1 = row.split("_")[0].split('S')[-1]
half2 = row.split("_")[1]
product_code.append(f'S{half1}_{half2}')
data['PRODUCTCODE'] = product_code
# Add the new data to the cleaned dataframe
clean_df = clean_df.assign(**data)
# Export the cleaned dataframe to a csv
clean_df.to_csv(f'{file_path}//Clean_Data.csv', index=False)
Solution Video: Challenge 66|PYTHON β Data Drifters