154|PYTHON – Fraudster Finder

BYU Student Author: @Trent_Barlow
Reviewers: @Kyle_nilsen, @Millie_K_B
Estimated Time to Solve: 20 Minutes

This is an intro challenge that is part of the Python Learning Path.

We provide the solution to this challenge using:

  • Python

Need a program? Click here.

Overview
As a beginner Python programmer working for an audit firm, you’ve been tasked with checking some employee transactions for fraud. Because it is more engaging, and the client is a visual learner, your boss has asked you to do so using matplolib. You have already written some code to extract the data from the csv file where it is stored, create a master dataframe, as well as split it into three separate dataframes by type of transaction. Open up the Jupyter Notebook file, and get to work on the visualizations!

Instructions

  1. Create a matplolib boxplot visualization using the Amount column of the df.
  2. Now that we can see the outliers, let’s find out what transactions they are. First calculate the upper bound of the boxplot. That can be calculated as the 3rd quartile times 150% of the interquartile range. We have some code below to calculate q3 and the iqr for you. After you have the upper bound, use it to create a new dataframe called bad_apples by filtering df to only those transactions which have an amount above the upper bound. That will give us a dataframe with only the outliers we see above.
  3. Create a variable called manager_counts, then use that variable to create a bar chart that shows the number of transactions approved by each manager. Set the color to #AD1B02.
  4. Make a a new variable called “average_amounts” using “df.groupby()”. Then use that to create a bar graph below which shows the average amount approved by each Manager. Set the color to #86BC25
  5. Filter the df to only those instances in which Manager D was the approver.
  6. At the beginning, we split our data into three dfs: df_credit_card, df_reimbursement, and df_invoice. Make a histogram for each of these to see how well the transactions form around the bell curve. Use 20 bins, ensure the edgecolor is set to white, and color is set to #051141. The first one is done for you.
  7. Make another histogram, with 24 bins, range=[0,24]. I got you started down below by extracting the hour component of Time in the df. Ensure edgecolor is set to #333333, and color is set to #ffe600.

Data Files

Solution

Step by Step Solution
  1. #Create a box plot to visualize outliers in the “Amount” column
    plt.boxplot(df[‘Amount’], vert=False)
    plt.title(‘Box Plot of Transaction Amounts’)
    plt.xlabel(‘Transaction Amount’)
    plt.show()

  2. #Calculate the interquartile range (IQR)
    q1 = df[‘Amount’].quantile(0.25)
    q3 = df[‘Amount’].quantile(0.75)
    iqr = q3 - q1

    #Calculate the upper bound of outliers (Upper Fence)
    upper_bound = q3 + 1.5 * iqr
    bad_apples = df[df[‘Amount’]>upper_bound]
    bad_apples

  3. #Count the number of transactions approved by each manager
    manager_counts = df[‘Approver’].value_counts()

    #Create a bar chart to visualize the results
    plt.bar(manager_counts.index,manager_counts.values, color= “#AD1B02”)
    plt.title(‘Number of Transactions Approved by Each Manager’)
    plt.xlabel(‘Manager’)
    plt.ylabel(‘Number of Transactions’)
    plt.show()

  4. average_amounts = df.groupby(‘Approver’)[‘Amount’].mean()

    #Now, let’s plot this information
    plt.bar_label(plt.bar(average_amounts.index, average_amounts.values, color = “#86BC25”)) # Adjust padding as needed

    plt.title(‘Average Transaction Amount Approved by Each Manager’)
    plt.xlabel(‘Manager’)
    plt.ylabel(‘Average Transaction Amount’)
    plt.show()

  5. Manager_D = df[df[‘Approver’]==‘Manager D’]

  6. #Reimbursement
    plt.figure(figsize=(10, 6))
    plt.hist(df_reimbursement[‘Amount’], bins=20, color=‘#051141’, edgecolor=‘white’)

    plt.title(‘Distribution of Reimbursement Amounts’)
    plt.xlabel(‘Amount’)
    plt.ylabel(‘Frequency’)
    plt.show()

    #CC
    plt.figure(figsize=(10, 6))
    plt.hist(df_credit_card[‘Amount’], bins=20, color=‘#051141’, edgecolor=‘white’)

    plt.title(‘Distribution of Credit Card Amounts’)
    plt.xlabel(‘Amount’)
    plt.ylabel(‘Frequency’)
    plt.show()

    #Invoice
    plt.figure(figsize=(10, 6))
    plt.hist(df_credit_card[‘Amount’], bins=20, color=‘#051141’, edgecolor=‘white’)

    plt.title(‘Distribution of Invoice Amounts’)
    plt.xlabel(‘Amount’)
    plt.ylabel(‘Frequency’)
    plt.show()

Challenge154_Solution.ipynb
Solution Video: Challenge 154|PYTHON – Fraudster Finder