137|PYTHON – Sooper Looper

BYU Student Author: @Nate
Reviewers: @Jonathan_Weston, @Donovon, @MitchFrei
Estimated Time to Solve: 20 Minutes

We provide the solution to this challenge using:

  • Python

Need a program? Click here.

The investing world moves fast. A matter of seconds often means the difference between beating the market and getting a beating. It’s not uncommon for funds to buy expensive real estate close to stock exchanges to minimize the time it takes for data to travel from the exchange’s server to their company’s computers.

Clearly, this is an industry where efficiency counts, a fact you are keenly aware of as a senior analyst on the quant fund at Sundance Valley Advisors. For months, your team has been working on a complex machine learning algorithm in Python to use quality, value, and momentum factors to take positions in equity securities. Your main job on this project is to maximize the efficiency (i.e., minimize the runtime) of the algorithm.

Recently you assigned Jerry, the team’s intern, to write a very small portion of the code. The segment should perform some simple cleaning of pandas DataFrames containing daily stock prices. Unfortunately, Jerry isn’t a great Pythonista, and his code is slow. Very slow. The finished algorithm needs to be able to process millions of datapoints to make financial decisions in less than a minute, but it takes Jerry’s short script nearly an hour to clean the short, 123,700 record practice dataset you gave him (stockData.csv).

Now, you must fix Jerry’s code (contained in sooperLooper.ipynb) so that it can process millions of datapoints in a matter of seconds.

Jerry’s code performs the following cleaning functions:

  1. Converts the date column to a datetime object.
  2. Calculates the return from open to close of each stock each day.
  3. Converts the volume column to be an integer value.
  4. Designates the price fluctuation for each stock each day as high, medium, or low.
  5. Links each stock ticker to the company’s name (found in a separate DataFrame).

You quickly notice that Jerry relies heavily on looping through the stock price DataFrame to perform these functions. However, you know that looping is extremely inefficient, so you want to avoid this at all costs.

Simply put, your task is to fix Jerry’s code so that it runs as quickly as possible. Using the time counter included in the code, see how quickly you can get the script to clean and output the data from the stockData.csv practice set. Comment below with a screenshot of how quickly your code ran, and remember, seconds count!

Data Files

Getting Started

While Jetty’s code is pretty bad, all of his calculations and assumptions are correct. Feel free to use his formula for calculating daily returns and his cutoffs for ‘High’, ‘Medium’, and ‘Low’ volatility stocks. Also, he correctly assumed that all values in the ‘Volume’ column are given in millions, so you don’t have to account for valeus given in the thousands.

If you are new to working with pandas dataframes, this resource from w3Schools can help you get acquainted.

Suggestions and Hints
Suggestion #1: Pandas Datetime Objects

Dates can be imported directly from files as datetime objects or converted after importing. Use parse_dates=[‘date_col’] when reading in a csv, or pd.to_datetime to convert later in the code.

Suggestion #2: Vector Calculations

Simple calculations can easily be performed on entire DataFrame columns without looping using vector calculations. A simple example would be: df[‘col_3’] = df[‘col_1’] * df[‘col_2’]

Suggestion #3: Applying Functions to DataFrame Columns

You can write functions to do more complex calculations or string manipulations. To apply a function to an entire DataFrame column use the syntax: df[‘col’].apply(function_name)

Suggestion #4: Conditional Expressions

The popular python package numpy contains functionality that allows conditional assignments to be applied to an entire DataFrame column. These conditional expressions take the general format `df[‘col’] = np.where({insert expression}, value_if_true, value_if_false)’

Suggestion #5: Merging DataFrames

Merging two DataFrames is a quick way to add needed columns to an analysis. To do this, use pd.merge(). Make sure you know what arguments you need to pass for this method to work.


Solution Code
from datetime import datetime 
import pandas as pd, numpy as np, time, os 
start_time = time.time() #Record start time  

#Get current directory 
def get_dir(): 
dir = os.getcwd() 
if '/' in dir: 
delimiter = '/' #Mac users 
delimiter = '\\' #PC users 
return dir,delimiter 
dir,delimiter = get_dir() 

#Convert date to datetime object 
stocks_df = pd.read_csv(f'{dir}{delimiter}stockData.csv',parse_dates=['Date'])  
companies_df = pd.read_csv(f'{dir}{delimiter}companyNames.csv') 

#Calculate daily return 
stocks_df['Return'] = round((stocks_df['Close']/stocks_df['Open'])-1,4) 

#Change volume to numbers in millions 
def convert_volume(volume): 
volume = volume.replace('M','') 
volume = float(volume) * 1000000 
return int(volume) 
stocks_df['Volume'] = stocks_df['Volume'].apply(convert_volume) #Call function 

#Note high volatility stocks 
stocks_df['Volatility'] = np.where(((stocks_df['High'] / stocks_df['Low'])-1)>.05, 
'High',np.where(((stocks_df['High'] / stocks_df['Low'])-1)<=.025,'Low','Medium')) 

#Link tickers to names 
stocks_df = pd.merge(stocks_df,companies_df,how='left',on='Ticker',) 

#Export report to Excel 

end_time = time.time() #Record end time 
elapsed_time = end_time - start_time #Calculate elapsed time in seconds 
print("Elapsed time: {:.2f} seconds".format(elapsed_time)) #Print elapsed time 

Solution Video: Challenge 137|PYTHON – Sooper Looper