66|PYTHON – Data Drifters

BYU Student Author: @Jae
Reviewers: @IWillyerd
Estimated Time to Solve: 90 Minutes

We provide the solution to this challenge using:

  • Python

Need a program? Click here.

Data Drifters is a local vintage toy vehicle store, selling vintage toy cars, motorcycles, ships, and planes. Recently Data Drifters has begun transitioning from their old third-party sales and inventory management system to a new in-house system that hopes to give them the ability to analyze their sales and inventory records better.

While exporting the current data, Rhys, the store manager noticed that the old system had exported its 13-column database into a combined single-column dataset with various delimiters. Being busy with managing the storefront, Rhys has hired you to come clean this dataset so that he can finish implementing the new system.

Using the exported data file, break the single-column dataset into the original 13-columns with the appropriate values. Rhys has provided a data dictionary to help you understand what the data columns should include. Make sure to export your cleaned data into a file for Rhys named Clean_Data.csv.

Notes on cleaning the data:

  • You do not need to include any leading/trailing symbols such as $ in the cleaned output
    • Order Date should still be m/d/yyyy formatted
  • Product Code should also begin with an ‘S’
  • Rhys expects to receive a few more of these exports before the new system is fully functional, so your code should be able to run without an issue on a different dataset
Suggestions and Hints
Breakdown of each column:
  • ORDERNUMBER - first 5 numbers
  • QUANTITYORDERED - qty precedes the $ sign
  • PRICEEACH - number following $ (do not keep $)
  • ORDERLINENUMBER - number following |
  • SALES - number following $
  • ORDERDATE - date
  • STATUS - text after date
  • QTR_ID - quarter of year 1-4
  • MONTH_ID - same month as order date
  • YEAR_ID - same year as order date
  • PRODUCTLINE - text
  • MSRP - number preceding S in product code
  • PRODUCTCODE - Code beginning with S, follows S##_#### format (number of digits vary)

The following code will make sure your pandas dataframe is not truncated:

pd.set_option('display.max_colwidth', None) 
pd.set_option('display.max_columns', None) 

The following line of code is a good way to practice your string manipulation on a single line of the data, replace the x with whatever row number you want:

line = df.iloc[:,0][x] 
Code to check your Clean_Data.csv output file

Make sure that your Clean_Data.csv file does not include the index number of each row.

#Make sure both .csv files are in the same folder as your code file. Else, you will need to replace the string with your file path 
import csv  

# Read in your solution file (Ensure that the file name matches) 
with open('Clean_Data.csv', 'r') as file1: 
    reader1 = csv.reader(file1) 
    data1 = list(reader1) 

# Read in the Challenge Solution Check file 
with open('Challenge66_Solution_Check.csv', 'r') as file2: 
    reader2 = csv.reader(file2) 
    data2 = list(reader2) 

# Get the column names from the first row of the file 
column_names = data1[0] 
# Compare the number of rows and columns in each file 
if len(data1) != len(data2) or len(data1[0]) != len(data2[0]): 
    print("Files do not match") 
    # Compare the values in each row 
    for i in range(1, len(data1)): 
        for j in range(len(data1[0])): 
            if data1[i][j] != data2[i][j]: 
                print(f"Files do not match. Issue in column '{column_names[j]}'") 
        print("Files match perfectly, great job!") 

Bonus Questions

  • Which products are selling under the MSRP?
  • Which orders are still not shipped?

Data Files


Solution Code

The solution code assumes that the data file is in the same folder as the saved code file that is running the solution code.

import os 
import pandas as pd 
pd.set_option('display.max_colwidth', None) 
pd.set_option('display.max_columns', None) 

file_path = os.getcwd() 

df = pd.read_csv(f'{file_path}//ChallengeM16_Data_Drifters.csv') 

# Create a clean dataframe with the proper column headers 
headers = df.columns[0].split(' ') 
clean_df = pd.DataFrame(columns=headers) 

# Define lists and data dictionary to be used to hold the cleaned data 
data = {} 
order_number = [] 
qty_ordered = [] 
price = [] 
order_line = [] 
sales = [] 
order_date = [] 
status = [] 
quarter = [] 
month_list = [] 
year_list = [] 
product_line = [] 
msrp_list = [] 
product_code = [] 

# Clean the data row by row 
for row in df.iloc[:,0]: 
    data['ORDERNUMBER'] = order_number  

    data['QUANTITYORDERED'] = qty_ordered  

    data['PRICEEACH'] = price  

    data['ORDERLINENUMBER'] = order_line  

    # SALES 
    sales.append(row.split("$")[2].split(" ")[0]) 
    data['SALES'] = sales  

    date_pos = row.find(" ") 
    year_pos = row.find("/",row.find("/")+1) 
    data['ORDERDATE'] = order_date  

    # STATUS 
    if row.split("/")[0][-3] == " ": 
        month = int(row.split("/")[0][-2:]) 
        month = int(row.split("/")[0][-1]) 
    month_pos = row.split("/")[2][4:].find(str(month)) 
    if month == 1: 
    data['STATUS'] = status  

    # QTR_ID 
    if month in [1,2,3]: Q = 1 
    elif month in [4,5,6]: Q = 2 
    elif month in [7,8,9]: Q = 3 
    else: Q = 4 
    data['QTR_ID'] = quarter 

    data['MONTH_ID'] = month_list 

    year = row[year_pos+1:year_pos+5] 
    data['YEAR_ID'] = year_list 

    year_pos2 = row.find(year,row.find(year)+1) 
    _pos = row.find('_') 
    word_list = [] 
    for char in row[year_pos2:_pos]: 
        if char.isalpha(): 
    word = ''.join(word_list) 
    data['PRODUCTLINE'] = product_line 

    msrp = row.split("_")[0].split('S')[-2] 
    msrp_pos = msrp.find(word) 
    if msrp[msrp_pos-2] == 's': 
    data['MSRP'] = msrp_list  

    half1 = row.split("_")[0].split('S')[-1] 
    half2 = row.split("_")[1] 
    data['PRODUCTCODE'] = product_code      

# Add the new data to the cleaned dataframe 
clean_df = clean_df.assign(**data)  

# Export the cleaned dataframe to a csv 
clean_df.to_csv(f'{file_path}//Clean_Data.csv', index=False) 

Solution Video: Challenge 66|PYTHON – Data Drifters

Very challenging! I used regex to parse each row. Here’s my code:

import pandas as pd
import re

#Read data
data = pd.read_csv('C:\\Users\\Mitch.Frei\\OneDrive - Connor Group, Inc\\Documents\\Python\\Challenge66_Data.csv')

#Parse the column names on ' ' and create clean data with new headers
headers = data.columns.tolist()
headers = headers[0].split(' ')
clean_data = pd.DataFrame(columns=headers)
col_count = len(headers)

#Set regex pattern with groupings
pattern = r'(\d{5})(\d+)\$(\d+\.?\d+?)\|(\d+)\$(\d+\.?\d+?)\s(\d+\/\d+\/\d+)(\D+)(\d{1})(\d{1,2})(\d{4})(\D+)(\d+)(S\d+\_\d+)'

#For each row in the dataset, parse the data according to the regex pattern and add row to clean data
for index, row in data.iterrows():
    row_data = data.iloc[index].values[0]
    new_row = re.split(pattern, row_data)[1:col_count+1]
    clean_data.loc[len(clean_data)] = new_row

#Convert column data types
clean_data[['PRICEEACH', 'SALES']] = clean_data[['PRICEEACH', 'SALES']].astype(float)
clean_data[['ORDERNUMBER', 'QTR_ID', 'MONTH_ID', 'YEAR_ID', 'MSRP']] = clean_data[['ORDERNUMBER', 'QTR_ID', 'MONTH_ID', 'YEAR_ID', 'MSRP']].astype(int)
clean_data['ORDERDATE'] = pd.to_datetime(clean_data['ORDERDATE'], format='%m/%d/%Y')

#Export clean data to csv
clean_data.to_csv('C:\\Users\\Mitch.Frei\\OneDrive - Connor Group, Inc\\Documents\\Python\\Clean_Data.csv', index=False)

#Find which products are selling below the MSRP
Under_MSRP = clean_data[clean_data['PRICEEACH'] < clean_data['MSRP']]
Under_MSRP = Under_MSRP['PRODUCTLINE'].drop_duplicates().reset_index(drop=True)

#Find which orders have not been shipped
Not_Shipped = clean_data[clean_data['STATUS'] != 'Shipped']['ORDERNUMBER'].reset_index(drop=True)

Here’s how I did it. It was a bit of work getting started, then I realized regex would be easiest. I did this all in a google colab file. Now I did put in the code linked below into chatGPT and it did find a way to streamline what I have into an even more concise way, which was cool. None the less it was a good refresher on how to use regex and work with data.

import pandas as pd

path = '/content/Challenge66_Data.csv'
data = pd.read_csv(path)


data['mixed'] = data['mixed'].astype(str)

data[['ordernumber', 'mixed']] = data['mixed'].str.split('$', 1, expand=True)
data['ORDERNUMBER'] = data['ordernumber'].str.slice(0, 5)
data['QUANTITYORDERED'] = data['ordernumber'].str.slice(5)
data = data.drop(columns=['ordernumber'])

data[['PRICEEACH', 'mixed']] = data['mixed'].str.split('|', 1, expand=True)

data[['ORDERLINENUMBER', 'mixed']] = data['mixed'].str.split('$', 1, expand=True)

data[['SALES', 'mixed']] = data['mixed'].str.split(' ', 1, expand=True)

data['ORDERDATE'] = data['mixed'].str.extract(r'(\d{1,2}/\d{1,2}/\d{4})')
data['mixed'] = data['mixed'].str.extract(r'(\d{1,2}/\d{1,2}/\d{4})(.*)')[1]

data['STATUS'] = data['mixed'].str.extract(r'([A-Za-z ]+)')
data['mixed'] = data['mixed'].str.extract(r'([A-Za-z ]+)(.*)')[1]

data['QTR_ID'] = data['mixed'].str.extract(r'(\d{1})')
data['mixed'] = data['mixed'].str.extract(r'(\d{1})(.*)')[1]

data['MONTH_ID'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})')[0]
data['YEAR_ID'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})')[1]
data['mixed'] = data['mixed'].str.extract(r'(\d{1,2})(\d{4})(.*)')[2]

data['PRODUCTLINE'] = data['mixed'].str.extract(r'([A-Za-z]+)')
data['mixed'] = data['mixed'].str.extract(r'([A-Za-z]+)(.*)')[1]

data['MSRP'] = data['mixed'].str.extract(r'(\d+)([A-Za-z].*)')[0]
data['PRODUCTCODE'] = data['mixed'].str.extract(r'(\d+)([A-Za-z].*)')[1]

data = data.drop(columns=['mixed'])


This was definitely challenging, but I was able to find some of the solution. Not for beginners haha, but this is what I was able to come up with!

import pandas as pd

from io import StringIO

csv_file = StringIO(“Challenge66_Data.csv”)

df = pd.read_csv(csv_file, sep=’ ', skipinitialspace=True)

df[[‘ORDERNUMBER’, ‘QUANTITYORDERED’]] = df[‘ORDERNUMBER’].str.split(‘$’, n=1, expand=True)

df[‘PRICEEACH’] = df[‘PRICEEACH’].apply(lambda x: x.split(‘$’)[1] if isinstance(x, str) and ‘$’ in x else x)

df[[‘ORDERLINENUMBER’, ‘PRICEEACH’]] = df[‘PRICEEACH’].str.split(’ ', n=1, expand=True)

df[[‘SALES’, ‘ORDERDATE’]] = df[‘SALES’].str.split(‘Shipped’, n=1, expand=True)

df[‘QUANTITYORDERED’] = pd.to_numeric(df[‘QUANTITYORDERED’])

df[‘PRICEEACH’] = pd.to_numeric(df[‘PRICEEACH’])

df[‘ORDERLINENUMBER’] = pd.to_numeric(df[‘ORDERLINENUMBER’])

df[‘ORDERDATE’] = pd.to_datetime(df[‘ORDERDATE’])

df[‘QTR_ID’] = pd.to_numeric(df[‘QTR_ID’])

df[‘MONTH_ID’] = pd.to_numeric(df[‘MONTH_ID’])

df[‘YEAR_ID’] = pd.to_numeric(df[‘YEAR_ID’])

df[‘MSRP’] = pd.to_numeric(df[‘MSRP’])


Clean_Data.csv (152.0 KB)

This is a very challenging one! But I like the practice I got with regex and pandas! Thanks Jae.
The following is the code I used and the output csv file.

import pandas as pd
import os
pd.options.display.float_format = '{:,.2f}'.format # Display option to include 2 decimal places on all floats
pd.set_option('display.max_columns', None) # Display option to show all columns
pd.set_option('display.max_rows', None) # Display option to show all rows

df = pd.read_csv("Challenge66_Data.csv", sep=" ")

df["PRICEEACH"] = df["ORDERNUMBER"].str.extract("(\$[0-9]+\.?[0-9]*)", expand=True)
df["PRICEEACH"] = df["PRICEEACH"].str.replace("$","")
df["PRICEEACH"] = df["PRICEEACH"].astype(float)

df["ORDERLINENUMBER"] = df["ORDERNUMBER"].str.extract("(\|[0-9]+)", expand=True)
df["ORDERLINENUMBER"] = df["ORDERLINENUMBER"].str.replace("|","")

df["SALES"] = df["ORDERNUMBER"].str.extract("(\$[0-9|\.]+$)", expand=True)
df["SALES"] = df["SALES"].str.replace("$","")
df["SALES"] = df["SALES"].astype(float)

df["ORDERDATE"] = df["QUANTITYORDERED"].str.extract("([0-9]+/[0-9]+/[0-9]+)", expand=True)
df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"]).dt.strftime('%m/%d/%Y')

df["STATUS"] = df["QUANTITYORDERED"].str.extract("([A-Z|a-z]+)", expand=True)

df["QTR_ID"] = df["QUANTITYORDERED"].str.extract("([A-Z|a-z]\d)", expand=True)
df["QTR_ID"] = df["QTR_ID"].str.extract("(\d)", expand=True)

df["MONTH_ID"] = df["QUANTITYORDERED"].str.extract("(\d+2\d+[A-Z])", expand=True)
df["MONTH_ID"] = df["MONTH_ID"].str[1:].str.extract("(\d+2)", expand=True)
df["MONTH_ID"] = df["MONTH_ID"].str[:-1]

df["YEAR_ID"] = df["QUANTITYORDERED"].str.extract("(\d+2\d+[A-Z])", expand=True)
df["YEAR_ID"] = df["YEAR_ID"].str[-5:-1]

df["PRODUCTLINE"] = df["QUANTITYORDERED"].str.extract("(\d+2\d+[A-Z|a-z]+)", expand=True)
df["PRODUCTLINE"] = df["PRODUCTLINE"].str.extract("([A-Z|a-z]+)", expand=True)

df["MSRP"] = df["QUANTITYORDERED"].str.extract("([A-Z|a-z]\d+S)", expand=True)
df["MSRP"] = df["MSRP"].str.extract("(\d+)", expand=True)

df["PRODUCTCODE"] = df["QUANTITYORDERED"].str.extract("(S\d+_\d+)")

df["QUANTITYORDERED"] = df["ORDERNUMBER"].str.extract("(\d+\$)")
df["QUANTITYORDERED"] = df["QUANTITYORDERED"].str.replace("$", "")

df["ORDERNUMBER"] = df["ORDERNUMBER"].str[:5]

df.to_csv(f'{os.getcwd()}//Clean_Data.csv', index=False) 

Clean_Data.csv (192.2 KB)