BYU Student Author: @Nate
Reviewers: @Jonathan_Weston, @Donovon, @MitchFrei
Estimated Time to Solve: 20 Minutes
We provide the solution to this challenge using:
- Python
Need a program? Click here.
Overview
The investing world moves fast. A matter of seconds often means the difference between beating the market and getting a beating. It’s not uncommon for funds to buy expensive real estate close to stock exchanges to minimize the time it takes for data to travel from the exchange’s server to their company’s computers.
Clearly, this is an industry where efficiency counts, a fact you are keenly aware of as a senior analyst on the quant fund at Sundance Valley Advisors. For months, your team has been working on a complex machine learning algorithm in Python to use quality, value, and momentum factors to take positions in equity securities. Your main job on this project is to maximize the efficiency (i.e., minimize the runtime) of the algorithm.
Recently you assigned Jerry, the team’s intern, to write a very small portion of the code. The segment should perform some simple cleaning of pandas DataFrames containing daily stock prices. Unfortunately, Jerry isn’t a great Pythonista, and his code is slow. Very slow. The finished algorithm needs to be able to process millions of datapoints to make financial decisions in less than a minute, but it takes Jerry’s short script nearly an hour to clean the short, 123,700 record practice dataset you gave him (stockData.csv
).
Now, you must fix Jerry’s code (contained in sooperLooper.ipynb
) so that it can process millions of datapoints in a matter of seconds.
Instructions
Jerry’s code performs the following cleaning functions:
- Converts the date column to a datetime object.
- Calculates the return from open to close of each stock each day.
- Converts the volume column to be an integer value.
- Designates the price fluctuation for each stock each day as high, medium, or low.
- Links each stock ticker to the company’s name (found in a separate DataFrame).
You quickly notice that Jerry relies heavily on looping through the stock price DataFrame to perform these functions. However, you know that looping is extremely inefficient, so you want to avoid this at all costs.
Simply put, your task is to fix Jerry’s code so that it runs as quickly as possible. Using the time counter included in the code, see how quickly you can get the script to clean and output the data from the stockData.csv
practice set. Comment below with a screenshot of how quickly your code ran, and remember, seconds count!
Data Files
- Challenge137_Sooper_Looper.ipynb
- Challenge137_Stock_Data.csv
- Challenge137_Company_Names.csv
- Challenge137_Data_Dictionary.docx
Getting Started
While Jetty’s code is pretty bad, all of his calculations and assumptions are correct. Feel free to use his formula for calculating daily returns and his cutoffs for ‘High’, ‘Medium’, and ‘Low’ volatility stocks. Also, he correctly assumed that all values in the ‘Volume’ column are given in millions, so you don’t have to account for valeus given in the thousands.
If you are new to working with pandas dataframes, this resource from w3Schools can help you get acquainted.
Suggestions and Hints
Suggestion #1: Pandas Datetime Objects
Dates can be imported directly from files as datetime objects or converted after importing. Use parse_dates=[‘date_col’]
when reading in a csv, or pd.to_datetime
to convert later in the code.
Suggestion #2: Vector Calculations
Simple calculations can easily be performed on entire DataFrame columns without looping using vector calculations. A simple example would be: df[‘col_3’] = df[‘col_1’] * df[‘col_2’]
Suggestion #3: Applying Functions to DataFrame Columns
You can write functions to do more complex calculations or string manipulations. To apply a function to an entire DataFrame column use the syntax: df[‘col’].apply(function_name)
Suggestion #4: Conditional Expressions
The popular python package numpy
contains functionality that allows conditional assignments to be applied to an entire DataFrame column. These conditional expressions take the general format `df[‘col’] = np.where({insert expression}, value_if_true, value_if_false)’
Suggestion #5: Merging DataFrames
Merging two DataFrames is a quick way to add needed columns to an analysis. To do this, use pd.merge()
. Make sure you know what arguments you need to pass for this method to work.
Solution
Solution Code
from datetime import datetime
import pandas as pd, numpy as np, time, os
start_time = time.time() #Record start time
#Get current directory
def get_dir():
dir = os.getcwd()
if '/' in dir:
delimiter = '/' #Mac users
else:
delimiter = '\\' #PC users
return dir,delimiter
dir,delimiter = get_dir()
#Convert date to datetime object
stocks_df = pd.read_csv(f'{dir}{delimiter}stockData.csv',parse_dates=['Date'])
companies_df = pd.read_csv(f'{dir}{delimiter}companyNames.csv')
#Calculate daily return
stocks_df['Return'] = round((stocks_df['Close']/stocks_df['Open'])-1,4)
#Change volume to numbers in millions
def convert_volume(volume):
volume = volume.replace('M','')
volume = float(volume) * 1000000
return int(volume)
stocks_df['Volume'] = stocks_df['Volume'].apply(convert_volume) #Call function
#Note high volatility stocks
stocks_df['Volatility'] = np.where(((stocks_df['High'] / stocks_df['Low'])-1)>.05,
'High',np.where(((stocks_df['High'] / stocks_df['Low'])-1)<=.025,'Low','Medium'))
#Link tickers to names
stocks_df = pd.merge(stocks_df,companies_df,how='left',on='Ticker',)
#Export report to Excel
stocks_df.to_excel('dailyReturns.xlsx',index=None)
display(stocks_df)
end_time = time.time() #Record end time
elapsed_time = end_time - start_time #Calculate elapsed time in seconds
print("Elapsed time: {:.2f} seconds".format(elapsed_time)) #Print elapsed time
Challenge137_Solution.ipynb
Challenge137_Stock_Returns.xlsx
Solution Video: Challenge 137|PYTHON – Sooper Looper