234|PYTHON – Echo Empire’s Ambitious Enterprise

BYU Student Author: @Sterling_Lane
Reviewers: @Millie_K_B, @Trent_Barlow
Estimated Time to Solve: 60 Minutes

We provide the solution to this challenge using:

  • Python

Overview
Welcome to Echo Empire, the newest music streaming app poised to revolutionize the industry. As the lead data scientist for the Echo Empire systems group, you play a pivotal role. The Marketing Director has reached out to your team, expressing a desire to enhance the platform’s content strategy significantly. Echo Empire aims to better serve its consumer base by offering content other music streaming companies cannot. The team has provided you with raw data exported directly from Spotify, which includes information about the songs on the platform, and seeks your advice on how Echo Empire can distinguish its product offerings. What elements of music drive popularity among customers? Your task is to utilize your experience in Python and univariate statistics to answer these questions from the Marketing team, helping Echo Empire establish itself as the preeminent streaming magnate.

Instructions

  1. Load the starting dataset into a Python data frame. Inspect the different columns of data to get familiar with the data and check out the different data types present. Refer to the Data Dictionary for more specific information about what each column of data represents.
  2. The data contains duration in terms of milliseconds. Make this more readable by converting these values to minutes. Round your minutes output to two decimal places and rename the column to “duration_minutes”.
  3. What percentage of these songs are classified as pop? Save this to a variable called pop_percentage in percentage form rounded to two decimals (like this: 99.23%). A song can have multiple genres. If one of the genres is pop, then that song should be classified as a pop song for the purposes of this question.
    a. Is pop the most common genre? Get a list of unique genres in the dataset and get a percentage for each of them. Use the code in the previous question to define a function to help you do this.
  4. Calculate some statistics about some of the numeric variables the Marketing Director is specifically curious about: popularity, danceability, and tempo.
    a. Export a .csv file containing the mean, median, 1st quartile, and 3rd quartile of these variables that you can provide to the Marketing Director. The four statistics should be the column names and the numeric variable names should be the row labels.
  5. Finally, let’s answer the Marketing Director’s most pressing question: Do people really care about explicit lyrics? The Marketing Director is considering having Echo Empire market itself as only hosting non-explicit songs but wants to make sure that a song’s popularity truly goes down if it is explicit before a decision of that magnitude is made.
    a. Create a graph that can help answer the Marketing Director’s question: a dual line chart. Use Matplotlib to create a line chart with Year on the X axis and Average Popularity on the Y axis. Ensure the chart contains two lines depicting the average popularity of explicit songs and average popularity of non-explicit songs. Add a legend so it’s easy to tell which line is which. (You’ll notice that only one song was released in 1998 in the dataset and that song was explicit. Because of this, make sure your X axis starts at 1999 and ends in 2020. You may need to adjust your explicit popularities to make sure that the year 1998 is not included.)
    b. Do you think this data supports the Marketing Director’s inclination to host only non-explicit songs? Why or why not?
  6. Determine whether the difference you observed in your graph above is statistically significant using an independent samples t-test. This test compares two independent groups of values to determine whether the difference between their means is significant. Run the test between the list of popularities for explicit songs and the list for non-explicit songs, making sure you use the actual popularity values and not the yearly averages you used in the graph. Remember, a p-value less than 0.05 indicates that the difference is significant.
    a. Does this change your mind about your findings in the previous step? Why or why not?
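
As a rough illustration of steps 2 through 4, the sketch below runs on a tiny made-up data frame rather than the challenge data. The column names duration_ms and genre, the helper genre_percentage, and the output file name are assumptions made for this example; check the Data Dictionary for the real column names.

```python
import pandas as pd

# Made-up rows for illustration only; this is not the challenge dataset.
# The column names duration_ms and genre are assumptions.
songs = pd.DataFrame({
    "duration_ms": [210000, 185500, 240250, 199999],
    "genre": ["pop, rock", "hip hop", "pop", "rock, metal"],
    "popularity": [70, 55, 80, 40],
    "danceability": [0.70, 0.85, 0.60, 0.45],
    "tempo": [120.0, 95.0, 128.0, 140.0],
})

# Step 2: milliseconds -> minutes (60,000 ms per minute), two decimals, rename.
songs["duration_ms"] = (songs["duration_ms"] / 60000).round(2)
songs = songs.rename(columns={"duration_ms": "duration_minutes"})


def genre_percentage(df, genre):
    """Percent of songs whose comma-separated genre list contains `genre`.

    Hypothetical helper written for this sketch.
    """
    has_genre = df["genre"].apply(lambda cell: genre in cell.split(", "))
    return round(has_genre.mean() * 100, 2)


# Step 3: share of songs tagged pop, formatted like "99.23%".
pop_percentage = f"{genre_percentage(songs, 'pop'):.2f}%"

# Step 3a: every unique genre with its own percentage.
unique_genres = sorted({g for cell in songs["genre"] for g in cell.split(", ")})
genre_shares = {g: genre_percentage(songs, g) for g in unique_genres}

# Step 4: statistics as columns, variable names as row labels.
cols = ["popularity", "danceability", "tempo"]
summary = pd.DataFrame({
    "mean": songs[cols].mean(),
    "median": songs[cols].median(),
    "1st quartile": songs[cols].quantile(0.25),
    "3rd quartile": songs[cols].quantile(0.75),
})
# summary.to_csv("summary_statistics.csv")  # hypothetical file name
```

Splitting on ", " before testing membership matches whole genre names, so a genre that merely contains the letters "pop" would not be counted as pop.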

Data Files

Suggestions and Hints
  • Creating a function to convert a number from milliseconds to minutes could be useful, though that’s not required.
  • The string method ‘.split()’ may be helpful for solving the multiple genres per song conundrum. Make sure to split on a comma and a space if you take this approach.
  • To specify the row label value in a data frame for question 4, use the index parameter in the pd.DataFrame function.
  • To find the mean, median, or quartiles of a series, call the corresponding method on it. For example, series.mean() returns the mean and series.quantile(.25) returns the 1st quartile.
  • To get the X values for the line chart, one option is to call set() on the year column of the original data frame. Keep in mind that a set cannot be plotted directly, so convert it to a sorted list first.
  • One way to prepare the Y values is to split the data frame into two data frames, one containing only explicit songs and one containing only non-explicit songs, and then use .groupby() on each to get the average popularity by year.
  • To run an independent samples t-test, import the right module with the line of code below. You may need to pip install scipy if you don’t have it installed already. Use this as a starting point for researching how to run the test in Python.
from scipy.stats import ttest_ind 
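
The last three hints can be pieced together roughly as follows. This is a sketch on made-up rows, not the challenge data, and the column names year, explicit, and popularity are assumptions; the real names are in the Data Dictionary.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Made-up rows for illustration only; column names are assumptions.
songs = pd.DataFrame({
    "year": [1998, 1999, 1999, 2000, 2000, 2001, 2001, 1999, 2000, 2001],
    "explicit": [True, True, False, True, False, True, False, False, True, False],
    "popularity": [60, 70, 65, 72, 58, 68, 61, 63, 75, 59],
})

# For the chart, keep only 1999-2020 so the lone 1998 song is excluded.
chart_data = songs[songs["year"].between(1999, 2020)]

# X values: a set removes duplicate years, sorted() turns it into a list.
years = sorted(set(chart_data["year"]))

# Y values: split by the explicit flag, then average popularity per year.
explicit_by_year = chart_data[chart_data["explicit"]].groupby("year")["popularity"].mean()
clean_by_year = chart_data[~chart_data["explicit"]].groupby("year")["popularity"].mean()

# For the t-test, compare the individual popularity values, not the averages.
explicit_pop = songs.loc[songs["explicit"], "popularity"]
clean_pop = songs.loc[~songs["explicit"], "popularity"]
t_stat, p_value = ttest_ind(explicit_pop, clean_pop)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Each yearly series can then be passed to Matplotlib's plt.plot() with its own label before calling plt.legend(). Note that ttest_ind assumes equal variances in the two groups by default; passing equal_var=False runs Welch's t-test instead, which may be worth considering if the groups differ a lot in spread.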

Solution