66|PYTHON – Data Drifters

Jae · June 14, 2023, 1:15am

BYU Student Author: @Jae
Reviewers: @IWillyerd
Estimated Time to Solve: 90 Minutes

We provide the solution to this challenge using:

Python

Need a program? Click here.

Overview
Data Drifters is a local vintage toy vehicle store, selling vintage toy cars, motorcycles, ships, and planes. Recently Data Drifters has begun transitioning from their old third-party sales and inventory management system to a new in-house system that hopes to give them the ability to analyze their sales and inventory records better.

While exporting the current data, Rhys, the store manager noticed that the old system had exported its 13-column database into a combined single-column dataset with various delimiters. Being busy with managing the storefront, Rhys has hired you to come clean this dataset so that he can finish implementing the new system.

Instructions
Using the exported data file, break the single-column dataset into the original 13-columns with the appropriate values. Rhys has provided a data dictionary to help you understand what the data columns should include. Make sure to export your cleaned data into a file for Rhys named Clean_Data.csv.

Notes on cleaning the data:

You do not need to include any leading/trailing symbols such as $ in the cleaned output
- Order Date should still be m/d/yyyy formatted
Product Code should also begin with an ‘S’
Rhys expects to receive a few more of these exports before the new system is fully functional, so your code should be able to run without an issue on a different dataset

Suggestions and Hints

Breakdown of each column:

ORDERNUMBER - first 5 numbers
QUANTITYORDERED - qty precedes the $ sign
PRICEEACH - number following $ (do not keep $)
ORDERLINENUMBER - number following |
SALES - number following $
ORDERDATE - date
STATUS - text after date
QTR_ID - quarter of year 1-4
MONTH_ID - same month as order date
YEAR_ID - same year as order date
PRODUCTLINE - text
MSRP - number preceding S in product code
PRODUCTCODE - Code beginning with S, follows S##_#### format (number of digits vary)

The following code will make sure your pandas dataframe is not truncated:

pd.set_option('display.max_colwidth', None) 
pd.set_option('display.max_columns', None)

The following line of code is a good way to practice your string manipulation on a single line of the data, replace the x with whatever row number you want:

line = df.iloc[:,0][x]

Code to check your Clean_Data.csv output file

Make sure that your Clean_Data.csv file does not include the index number of each row.

#Make sure both .csv files are in the same folder as your code file. Else, you will need to replace the string with your file path 
import csv  

# Read in your solution file (Ensure that the file name matches) 
with open('Clean_Data.csv', 'r') as file1: 
    reader1 = csv.reader(file1) 
    data1 = list(reader1) 

# Read in the Challenge Solution Check file 
with open('Challenge66_Solution_Check.csv', 'r') as file2: 
    reader2 = csv.reader(file2) 
    data2 = list(reader2) 

# Get the column names from the first row of the file 
column_names = data1[0] 
 
# Compare the number of rows and columns in each file 
if len(data1) != len(data2) or len(data1[0]) != len(data2[0]): 
    print("Files do not match") 
else: 
    # Compare the values in each row 
    for i in range(1, len(data1)): 
        for j in range(len(data1[0])): 
            if data1[i][j] != data2[i][j]: 
                print(f"Files do not match. Issue in column '{column_names[j]}'") 
                break 
        else: 
            continue 
        break 
    else: 
        print("Files match perfectly, great job!")

Bonus Questions

Which products are selling under the MSRP?
Which orders are still not shipped?

Data Files

Solution

Solution Code

The solution code assumes that the data file is in the same folder as the saved code file that is running the solution code.

import os 
import pandas as pd 
pd.set_option('display.max_colwidth', None) 
pd.set_option('display.max_columns', None) 

file_path = os.getcwd() 

df = pd.read_csv(f'{file_path}//ChallengeM16_Data_Drifters.csv') 

# Create a clean dataframe with the proper column headers 
headers = df.columns[0].split(' ') 
clean_df = pd.DataFrame(columns=headers) 

# Define lists and data dictionary to be used to hold the cleaned data 
data = {} 
order_number = [] 
qty_ordered = [] 
price = [] 
order_line = [] 
sales = [] 
order_date = [] 
status = [] 
quarter = [] 
month_list = [] 
year_list = [] 
product_line = [] 
msrp_list = [] 
product_code = [] 

# Clean the data row by row 
for row in df.iloc[:,0]: 
    # ORDERNUMBER 
    order_number.append(row[:5]) 
    data['ORDERNUMBER'] = order_number  

    # QUANTITYORDERED 
    qty_ordered.append(row[5:7]) 
    data['QUANTITYORDERED'] = qty_ordered  

    #PRICEEACH 
    price.append(row.split("$")[1].split("|")[0]) 
    data['PRICEEACH'] = price  

    # ORDERLINENUMBER 
    order_line.append(row.split("|")[1].split("$")[0]) 
    data['ORDERLINENUMBER'] = order_line  

    # SALES 
    sales.append(row.split("$")[2].split(" ")[0]) 
    data['SALES'] = sales  

    # ORDERDATE 
    date_pos = row.find(" ") 
    year_pos = row.find("/",row.find("/")+1) 
    order_date.append(row[date_pos+1:year_pos+5]) 
    data['ORDERDATE'] = order_date  

    # STATUS 
    if row.split("/")[0][-3] == " ": 
        month = int(row.split("/")[0][-2:]) 
    else: 
        month = int(row.split("/")[0][-1]) 
    month_pos = row.split("/")[2][4:].find(str(month)) 
    if month == 1: 
        status.append(row.split("/")[2][4:][:month_pos]) 
    else: 
        status.append(row.split("/")[2][4:][:month_pos-1]) 
    data['STATUS'] = status  

    # QTR_ID 
    if month in [1,2,3]: Q = 1 
    elif month in [4,5,6]: Q = 2 
    elif month in [7,8,9]: Q = 3 
    else: Q = 4 
    quarter.append(Q) 
    data['QTR_ID'] = quarter 

    #MONTH_ID 
    month_list.append(month) 
    data['MONTH_ID'] = month_list 

    #YEAR_ID 
    year = row[year_pos+1:year_pos+5] 
    year_list.append(year) 
    data['YEAR_ID'] = year_list 

    #PRODUCTLINE 
    year_pos2 = row.find(year,row.find(year)+1) 
    _pos = row.find('_') 
    word_list = [] 
    for char in row[year_pos2:_pos]: 
        if char.isalpha(): 
            word_list.append(char) 
    word = ''.join(word_list) 
    product_line.append(word[:-1]) 
    data['PRODUCTLINE'] = product_line 

    #MSRP 
    msrp = row.split("_")[0].split('S')[-2] 
    msrp_pos = msrp.find(word) 
    if msrp[msrp_pos-2] == 's': 
        msrp_list.append(msrp[msrp_pos-1:]) 
    else: 
        msrp_list.append(msrp[msrp_pos-2:]) 
    data['MSRP'] = msrp_list  

    #PRODUCTCODE 
    half1 = row.split("_")[0].split('S')[-1] 
    half2 = row.split("_")[1] 
    product_code.append(f'S{half1}_{half2}') 
    data['PRODUCTCODE'] = product_code      

# Add the new data to the cleaned dataframe 
clean_df = clean_df.assign(**data)  

# Export the cleaned dataframe to a csv 
clean_df.to_csv(f'{file_path}//Clean_Data.csv', index=False)

Solution Video: Challenge 66|PYTHON – Data Drifters

MitchFrei · October 5, 2023, 5:57pm

Very challenging! I used regex to parse each row. Here’s my code:

import pandas as pd
import re

#Read data
data = pd.read_csv('C:\\Users\\Mitch.Frei\\OneDrive - Connor Group, Inc\\Documents\\Python\\Challenge66_Data.csv')

#Parse the column names on ' ' and create clean data with new headers
headers = data.columns.tolist()
headers = headers[0].split(' ')
clean_data = pd.DataFrame(columns=headers)
col_count = len(headers)

#Set regex pattern with groupings
pattern = r'(\d{5})(\d+)\$(\d+\.?\d+?)\|(\d+)\$(\d+\.?\d+?)\s(\d+\/\d+\/\d+)(\D+)(\d{1})(\d{1,2})(\d{4})(\D+)(\d+)(S\d+\_\d+)'

#For each row in the dataset, parse the data according to the regex pattern and add row to clean data
for index, row in data.iterrows():
    row_data = data.iloc[index].values[0]
    new_row = re.split(pattern, row_data)[1:col_count+1]
    clean_data.loc[len(clean_data)] = new_row

#Convert column data types
clean_data[['ORDERNUMBER', 'STATUS', 'PRODUCTLINE', 'PRODUCTCODE']] = clean_data[['ORDERNUMBER', 'STATUS', 'PRODUCTLINE', 'PRODUCTCODE']].astype(str)
clean_data[['PRICEEACH', 'SALES']] = clean_data[['PRICEEACH', 'SALES']].astype(float)
clean_data[['ORDERNUMBER', 'QTR_ID', 'MONTH_ID', 'YEAR_ID', 'MSRP']] = clean_data[['ORDERNUMBER', 'QTR_ID', 'MONTH_ID', 'YEAR_ID', 'MSRP']].astype(int)
clean_data['ORDERDATE'] = pd.to_datetime(clean_data['ORDERDATE'], format='%m/%d/%Y')

#Export clean data to csv
clean_data.to_csv('C:\\Users\\Mitch.Frei\\OneDrive - Connor Group, Inc\\Documents\\Python\\Clean_Data.csv', index=False)

#Find which products are selling below the MSRP
Under_MSRP = clean_data[clean_data['PRICEEACH'] < clean_data['MSRP']]
Under_MSRP = Under_MSRP['PRODUCTLINE'].drop_duplicates().reset_index(drop=True)
print(Under_MSRP)

#Find which orders have not been shipped
Not_Shipped = clean_data[clean_data['STATUS'] != 'Shipped']['ORDERNUMBER'].reset_index(drop=True)
print(Not_Shipped)

koleton_murray · December 6, 2023, 4:55am

Here’s how I did it. It was a bit of work getting started, then I realized regex would be easiest. I did this all in a google colab file. Now I did put in the code linked below into chatGPT and it did find a way to streamline what I have into an even more concise way, which was cool. None the less it was a good refresher on how to use regex and work with data.

Summary

import pandas as pd

path = '/content/Challenge66_Data.csv'
data = pd.read_csv(path)

data.rename(columns={'ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE': 'mixed'}, inplace=True)

data['mixed'] = data['mixed'].astype(str)

data[['ordernumber', 'mixed']] = data['mixed'].str.split('$', 1, expand=True)
data['ORDERNUMBER'] = data['ordernumber'].str.slice(0, 5)
data['QUANTITYORDERED'] = data['ordernumber'].str.slice(5)
data = data.drop(columns=['ordernumber'])

data[['PRICEEACH', 'mixed']] = data['mixed'].str.split('|', 1, expand=True)

data[['ORDERLINENUMBER', 'mixed']] = data['mixed'].str.split('$', 1, expand=True)

data[['SALES', 'mixed']] = data['mixed'].str.split(' ', 1, expand=True)

data['ORDERDATE'] = data['mixed'].str.extract(r'(\d{1,2}/\d{1,2}/\d{4})')
data['mixed'] = data['mixed'].str.extract(r'(\d{1,2}/\d{1,2}/\d{4})(.*)')[1]

data['STATUS'] = data['mixed'].str.extract(r'([A-Za-z ]+)')
data['mixed'] = data['mixed'].str.extract(r'([A-Za-z ]+)(.*)')[1]

data['QTR_ID'] = data['mixed'].str.extract(r'(\d{1})')
data['mixed'] = data['mixed'].str.extract(r'(\d{1})(.*)')[1]

data['MONTH_ID'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})')[0]
data['YEAR_ID'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})')[1]
data['mixed'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})(.*)')[2]

data['PRODUCTLINE'] = data['mixed'].str.extract(r'([A-Za-z]+)')
data['mixed'] = data['mixed'].str.extract(r'([A-Za-z]+)(.*)')[1]

data['MSRP'] = data['mixed'].str.extract(r'(\d+)([A-Za-z].*)')[0]
data['PRODUCTCODE'] = data['mixed'].str.extract(r'(\d+)([A-Za-z].*)')[1]

data = data.drop(columns=['mixed'])

data

Jack_Carpenter · December 11, 2023, 11:16pm

This was definitely challenging, but I was able to find some of the solution. Not for beginners haha, but this is what I was able to come up with!

import pandas as pd

from io import StringIO

csv_file = StringIO(“Challenge66_Data.csv”)

df = pd.read_csv(csv_file, sep=’ ', skipinitialspace=True)

df[[‘ORDERNUMBER’, ‘QUANTITYORDERED’]] = df[‘ORDERNUMBER’].str.split(‘$’, n=1, expand=True)

df[‘PRICEEACH’] = df[‘PRICEEACH’].apply(lambda x: x.split(‘$’)[1] if isinstance(x, str) and ‘$’ in x else x)

df[[‘ORDERLINENUMBER’, ‘PRICEEACH’]] = df[‘PRICEEACH’].str.split(’ ', n=1, expand=True)

df[[‘SALES’, ‘ORDERDATE’]] = df[‘SALES’].str.split(‘Shipped’, n=1, expand=True)

df[‘QUANTITYORDERED’] = pd.to_numeric(df[‘QUANTITYORDERED’])

df[‘PRICEEACH’] = pd.to_numeric(df[‘PRICEEACH’])

df[‘ORDERLINENUMBER’] = pd.to_numeric(df[‘ORDERLINENUMBER’])

df[‘ORDERDATE’] = pd.to_datetime(df[‘ORDERDATE’])

df[‘QTR_ID’] = pd.to_numeric(df[‘QTR_ID’])

df[‘MONTH_ID’] = pd.to_numeric(df[‘MONTH_ID’])

df[‘YEAR_ID’] = pd.to_numeric(df[‘YEAR_ID’])

df[‘MSRP’] = pd.to_numeric(df[‘MSRP’])

df.head()

Adam_Kelley · December 15, 2023, 6:48am

Clean_Data.csv (152.0 KB)

Benjamin_Lau · January 8, 2024, 11:17pm

This is a very challenging one! But I like the practice I got with regex and pandas! Thanks Jae.
The following is the code I used and the output csv file.

import pandas as pd
import os
pd.options.display.float_format = '{:,.2f}'.format # Display option to include 2 decimal places on all floats
pd.set_option('display.max_columns', None) # Display option to show all columns
pd.set_option('display.max_rows', None) # Display option to show all rows

df = pd.read_csv("Challenge66_Data.csv", sep=" ")

df["PRICEEACH"] = df["ORDERNUMBER"].str.extract("(\$[0-9]+\.?[0-9]*)", expand=True)
df["PRICEEACH"] = df["PRICEEACH"].str.replace("$","")
df["PRICEEACH"] = df["PRICEEACH"].astype(float)

df["ORDERLINENUMBER"] = df["ORDERNUMBER"].str.extract("(\|[0-9]+)", expand=True)
df["ORDERLINENUMBER"] = df["ORDERLINENUMBER"].str.replace("|","")
df["ORDERLINENUMBER"] = df["ORDERLINENUMBER"].astype(float)

df["SALES"] = df["ORDERNUMBER"].str.extract("(\$[0-9|\.]+$)", expand=True)
df["SALES"] = df["SALES"].str.replace("$","")
df["SALES"] = df["SALES"].astype(float)

df["ORDERDATE"] = df["QUANTITYORDERED"].str.extract("([0-9]+/[0-9]+/[0-9]+)", expand=True)
df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"]).dt.strftime('%m/%d/%Y')

df["STATUS"] = df["QUANTITYORDERED"].str.extract("([A-Z|a-z]+)", expand=True)

df["QTR_ID"] = df["QUANTITYORDERED"].str.extract("([A-Z|a-z]\d)", expand=True)
df["QTR_ID"] = df["QTR_ID"].str.extract("(\d)", expand=True)

df["MONTH_ID"] = df["QUANTITYORDERED"].str.extract("(\d+2\d+[A-Z])", expand=True)
df["MONTH_ID"] = df["MONTH_ID"].str[1:].str.extract("(\d+2)", expand=True)
df["MONTH_ID"] = df["MONTH_ID"].str[:-1]

df["YEAR_ID"] = df["QUANTITYORDERED"].str.extract("(\d+2\d+[A-Z])", expand=True)
df["YEAR_ID"] = df["YEAR_ID"].str[-5:-1]

df["PRODUCTLINE"] = df["QUANTITYORDERED"].str.extract("(\d+2\d+[A-Z|a-z]+)", expand=True)
df["PRODUCTLINE"] = df["PRODUCTLINE"].str.extract("([A-Z|a-z]+)", expand=True)

df["MSRP"] = df["QUANTITYORDERED"].str.extract("([A-Z|a-z]\d+S)", expand=True)
df["MSRP"] = df["MSRP"].str.extract("(\d+)", expand=True)

df["PRODUCTCODE"] = df["QUANTITYORDERED"].str.extract("(S\d+_\d+)")

df["QUANTITYORDERED"] = df["ORDERNUMBER"].str.extract("(\d+\$)")
df["QUANTITYORDERED"] = df["QUANTITYORDERED"].str.replace("$", "")
df["QUANTITYORDERED"] = df["QUANTITYORDERED"].str[5:]
df["QUANTITYORDERED"] = df["QUANTITYORDERED"].astype(int)

df["ORDERNUMBER"] = df["ORDERNUMBER"].str[:5]
display(df.info())
display(df.head())

df.to_csv(f'{os.getcwd()}//Clean_Data.csv', index=False)

Clean_Data.csv (192.2 KB)

Topic		Replies	Views
196\|PYTHON&GENAI – GenAI JEs Challenges python , pandas , xlswriter , genai	0	41	May 3, 2024
70\|PYTHON – Dart Gun Rates Challenges intermediate , python , finance , pandas , html , dataframe , datetime , exchange-rate	1	187	November 5, 2023
130\|PYTHON – Quick Calculation Challenges beginner , audit , python , pandas	3	126	June 27, 2024
203\|PYTHON – COGS Costing Challenges intermediate , python , pandas , matplotlib	0	28	May 4, 2024
65\|PYTHON – Investment Investigator Challenges intermediate , python , finance , pandas , stocks	0	216	May 27, 2023

66|PYTHON – Data Drifters

Related topics