66|PYTHON – Data Drifters

BYU Student Author: @Jae
Reviewers: @IWillyerd
Estimated Time to Solve: 90 Minutes

We provide the solution to this challenge using:

  • Python

Need a program? Click here.

Overview
Data Drifters is a local vintage toy vehicle store, selling vintage toy cars, motorcycles, ships, and planes. Recently Data Drifters has begun transitioning from their old third-party sales and inventory management system to a new in-house system that hopes to give them the ability to analyze their sales and inventory records better.

While exporting the current data, Rhys, the store manager noticed that the old system had exported its 13-column database into a combined single-column dataset with various delimiters. Being busy with managing the storefront, Rhys has hired you to come clean this dataset so that he can finish implementing the new system.

Instructions
Using the exported data file, break the single-column dataset into the original 13-columns with the appropriate values. Rhys has provided a data dictionary to help you understand what the data columns should include. Make sure to export your cleaned data into a file for Rhys named Clean_Data.csv.

Notes on cleaning the data:

  • You do not need to include any leading/trailing symbols such as $ in the cleaned output
    • Order Date should still be m/d/yyyy formatted
  • Product Code should also begin with an β€˜S’
  • Rhys expects to receive a few more of these exports before the new system is fully functional, so your code should be able to run without an issue on a different dataset
Suggestions and Hints
Breakdown of each column:
  • ORDERNUMBER - first 5 numbers
  • QUANTITYORDERED - qty precedes the $ sign
  • PRICEEACH - number following $ (do not keep $)
  • ORDERLINENUMBER - number following |
  • SALES - number following $
  • ORDERDATE - date
  • STATUS - text after date
  • QTR_ID - quarter of year 1-4
  • MONTH_ID - same month as order date
  • YEAR_ID - same year as order date
  • PRODUCTLINE - text
  • MSRP - number preceding S in product code
  • PRODUCTCODE - Code beginning with S, follows S##_#### format (number of digits vary)

The following code will make sure your pandas dataframe is not truncated:

pd.set_option('display.max_colwidth', None) 
pd.set_option('display.max_columns', None) 

The following line of code is a good way to practice your string manipulation on a single line of the data, replace the x with whatever row number you want:

line = df.iloc[:,0][x] 
Code to check your Clean_Data.csv output file

Make sure that your Clean_Data.csv file does not include the index number of each row.

#Make sure both .csv files are in the same folder as your code file. Else, you will need to replace the string with your file path 
import csv  

# Read in your solution file (Ensure that the file name matches) 
with open('Clean_Data.csv', 'r') as file1: 
    reader1 = csv.reader(file1) 
    data1 = list(reader1) 

# Read in the Challenge Solution Check file 
with open('Challenge66_Solution_Check.csv', 'r') as file2: 
    reader2 = csv.reader(file2) 
    data2 = list(reader2) 

# Get the column names from the first row of the file 
column_names = data1[0] 
 
# Compare the number of rows and columns in each file 
if len(data1) != len(data2) or len(data1[0]) != len(data2[0]): 
    print("Files do not match") 
else: 
    # Compare the values in each row 
    for i in range(1, len(data1)): 
        for j in range(len(data1[0])): 
            if data1[i][j] != data2[i][j]: 
                print(f"Files do not match. Issue in column '{column_names[j]}'") 
                break 
        else: 
            continue 
        break 
    else: 
        print("Files match perfectly, great job!") 

Bonus Questions

  • Which products are selling under the MSRP?
  • Which orders are still not shipped?

Data Files

Solution

Solution Code

The solution code assumes that the data file is in the same folder as the saved code file that is running the solution code.

import os 
import pandas as pd 
pd.set_option('display.max_colwidth', None) 
pd.set_option('display.max_columns', None) 

file_path = os.getcwd() 

df = pd.read_csv(f'{file_path}//ChallengeM16_Data_Drifters.csv') 

# Create a clean dataframe with the proper column headers 
headers = df.columns[0].split(' ') 
clean_df = pd.DataFrame(columns=headers) 

# Define lists and data dictionary to be used to hold the cleaned data 
data = {} 
order_number = [] 
qty_ordered = [] 
price = [] 
order_line = [] 
sales = [] 
order_date = [] 
status = [] 
quarter = [] 
month_list = [] 
year_list = [] 
product_line = [] 
msrp_list = [] 
product_code = [] 

# Clean the data row by row 
for row in df.iloc[:,0]: 
    # ORDERNUMBER 
    order_number.append(row[:5]) 
    data['ORDERNUMBER'] = order_number  

    # QUANTITYORDERED 
    qty_ordered.append(row[5:7]) 
    data['QUANTITYORDERED'] = qty_ordered  

    #PRICEEACH 
    price.append(row.split("$")[1].split("|")[0]) 
    data['PRICEEACH'] = price  

    # ORDERLINENUMBER 
    order_line.append(row.split("|")[1].split("$")[0]) 
    data['ORDERLINENUMBER'] = order_line  

    # SALES 
    sales.append(row.split("$")[2].split(" ")[0]) 
    data['SALES'] = sales  

    # ORDERDATE 
    date_pos = row.find(" ") 
    year_pos = row.find("/",row.find("/")+1) 
    order_date.append(row[date_pos+1:year_pos+5]) 
    data['ORDERDATE'] = order_date  

    # STATUS 
    if row.split("/")[0][-3] == " ": 
        month = int(row.split("/")[0][-2:]) 
    else: 
        month = int(row.split("/")[0][-1]) 
    month_pos = row.split("/")[2][4:].find(str(month)) 
    if month == 1: 
        status.append(row.split("/")[2][4:][:month_pos]) 
    else: 
        status.append(row.split("/")[2][4:][:month_pos-1]) 
    data['STATUS'] = status  

    # QTR_ID 
    if month in [1,2,3]: Q = 1 
    elif month in [4,5,6]: Q = 2 
    elif month in [7,8,9]: Q = 3 
    else: Q = 4 
    quarter.append(Q) 
    data['QTR_ID'] = quarter 

    #MONTH_ID 
    month_list.append(month) 
    data['MONTH_ID'] = month_list 

    #YEAR_ID 
    year = row[year_pos+1:year_pos+5] 
    year_list.append(year) 
    data['YEAR_ID'] = year_list 

    #PRODUCTLINE 
    year_pos2 = row.find(year,row.find(year)+1) 
    _pos = row.find('_') 
    word_list = [] 
    for char in row[year_pos2:_pos]: 
        if char.isalpha(): 
            word_list.append(char) 
    word = ''.join(word_list) 
    product_line.append(word[:-1]) 
    data['PRODUCTLINE'] = product_line 

    #MSRP 
    msrp = row.split("_")[0].split('S')[-2] 
    msrp_pos = msrp.find(word) 
    if msrp[msrp_pos-2] == 's': 
        msrp_list.append(msrp[msrp_pos-1:]) 
    else: 
        msrp_list.append(msrp[msrp_pos-2:]) 
    data['MSRP'] = msrp_list  

    #PRODUCTCODE 
    half1 = row.split("_")[0].split('S')[-1] 
    half2 = row.split("_")[1] 
    product_code.append(f'S{half1}_{half2}') 
    data['PRODUCTCODE'] = product_code      

# Add the new data to the cleaned dataframe 
clean_df = clean_df.assign(**data)  

# Export the cleaned dataframe to a csv 
clean_df.to_csv(f'{file_path}//Clean_Data.csv', index=False) 

Solution Video: Challenge 66|PYTHON – Data Drifters

Very challenging! I used regex to parse each row. Here’s my code:

import pandas as pd
import re

#Read data
data = pd.read_csv('C:\\Users\\Mitch.Frei\\OneDrive - Connor Group, Inc\\Documents\\Python\\Challenge66_Data.csv')

#Parse the column names on ' ' and create clean data with new headers
headers = data.columns.tolist()
headers = headers[0].split(' ')
clean_data = pd.DataFrame(columns=headers)
col_count = len(headers)

#Set regex pattern with groupings
pattern = r'(\d{5})(\d+)\$(\d+\.?\d+?)\|(\d+)\$(\d+\.?\d+?)\s(\d+\/\d+\/\d+)(\D+)(\d{1})(\d{1,2})(\d{4})(\D+)(\d+)(S\d+\_\d+)'

#For each row in the dataset, parse the data according to the regex pattern and add row to clean data
for index, row in data.iterrows():
    row_data = data.iloc[index].values[0]
    new_row = re.split(pattern, row_data)[1:col_count+1]
    clean_data.loc[len(clean_data)] = new_row

#Convert column data types
clean_data[['ORDERNUMBER', 'STATUS', 'PRODUCTLINE', 'PRODUCTCODE']] = clean_data[['ORDERNUMBER', 'STATUS', 'PRODUCTLINE', 'PRODUCTCODE']].astype(str)
clean_data[['PRICEEACH', 'SALES']] = clean_data[['PRICEEACH', 'SALES']].astype(float)
clean_data[['ORDERNUMBER', 'QTR_ID', 'MONTH_ID', 'YEAR_ID', 'MSRP']] = clean_data[['ORDERNUMBER', 'QTR_ID', 'MONTH_ID', 'YEAR_ID', 'MSRP']].astype(int)
clean_data['ORDERDATE'] = pd.to_datetime(clean_data['ORDERDATE'], format='%m/%d/%Y')

#Export clean data to csv
clean_data.to_csv('C:\\Users\\Mitch.Frei\\OneDrive - Connor Group, Inc\\Documents\\Python\\Clean_Data.csv', index=False)

#Find which products are selling below the MSRP
Under_MSRP = clean_data[clean_data['PRICEEACH'] < clean_data['MSRP']]
Under_MSRP = Under_MSRP['PRODUCTLINE'].drop_duplicates().reset_index(drop=True)
print(Under_MSRP)

#Find which orders have not been shipped
Not_Shipped = clean_data[clean_data['STATUS'] != 'Shipped']['ORDERNUMBER'].reset_index(drop=True)
print(Not_Shipped)
2 Likes

Here’s how I did it. It was a bit of work getting started, then I realized regex would be easiest. I did this all in a google colab file. Now I did put in the code linked below into chatGPT and it did find a way to streamline what I have into an even more concise way, which was cool. None the less it was a good refresher on how to use regex and work with data.

Summary
import pandas as pd

path = '/content/Challenge66_Data.csv'
data = pd.read_csv(path)

data.rename(columns={'ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE': 'mixed'}, inplace=True)

data['mixed'] = data['mixed'].astype(str)

data[['ordernumber', 'mixed']] = data['mixed'].str.split('$', 1, expand=True)
data['ORDERNUMBER'] = data['ordernumber'].str.slice(0, 5)
data['QUANTITYORDERED'] = data['ordernumber'].str.slice(5)
data = data.drop(columns=['ordernumber'])

data[['PRICEEACH', 'mixed']] = data['mixed'].str.split('|', 1, expand=True)

data[['ORDERLINENUMBER', 'mixed']] = data['mixed'].str.split('$', 1, expand=True)

data[['SALES', 'mixed']] = data['mixed'].str.split(' ', 1, expand=True)

data['ORDERDATE'] = data['mixed'].str.extract(r'(\d{1,2}/\d{1,2}/\d{4})')
data['mixed'] = data['mixed'].str.extract(r'(\d{1,2}/\d{1,2}/\d{4})(.*)')[1]

data['STATUS'] = data['mixed'].str.extract(r'([A-Za-z ]+)')
data['mixed'] = data['mixed'].str.extract(r'([A-Za-z ]+)(.*)')[1]

data['QTR_ID'] = data['mixed'].str.extract(r'(\d{1})')
data['mixed'] = data['mixed'].str.extract(r'(\d{1})(.*)')[1]

data['MONTH_ID'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})')[0]
data['YEAR_ID'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})')[1]
data['mixed'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})(.*)')[2]

data['PRODUCTLINE'] = data['mixed'].str.extract(r'([A-Za-z]+)')
data['mixed'] = data['mixed'].str.extract(r'([A-Za-z]+)(.*)')[1]

data['MSRP'] = data['mixed'].str.extract(r'(\d+)([A-Za-z].*)')[0]
data['PRODUCTCODE'] = data['mixed'].str.extract(r'(\d+)([A-Za-z].*)')[1]

data = data.drop(columns=['mixed'])

data