137|PYTHON – Sooper Looper

BYU Student Author: @Nate
Reviewers: @Jonathan_Weston, @Donovon, @MitchFrei
Estimated Time to Solve: 20 Minutes

We provide the solution to this challenge using:

• Python

Overview
The investing world moves fast. A matter of seconds often means the difference between beating the market and getting a beating. It’s not uncommon for funds to buy expensive real estate close to stock exchanges to minimize the time it takes for data to travel from the exchange’s server to their company’s computers.

Clearly, this is an industry where efficiency counts, a fact you are keenly aware of as a senior analyst on the quant fund at Sundance Valley Advisors. For months, your team has been working on a complex machine learning algorithm in Python to use quality, value, and momentum factors to take positions in equity securities. Your main job on this project is to maximize the efficiency (i.e., minimize the runtime) of the algorithm.

Recently you assigned Jerry, the team’s intern, to write a very small portion of the code. The segment should perform some simple cleaning of pandas DataFrames containing daily stock prices. Unfortunately, Jerry isn’t a great Pythonista, and his code is slow. Very slow. The finished algorithm needs to be able to process millions of datapoints to make financial decisions in less than a minute, but it takes Jerry’s short script nearly an hour to clean the short, 123,700 record practice dataset you gave him (`stockData.csv`).

Now, you must fix Jerry’s code (contained in `sooperLooper.ipynb`) so that it can process millions of datapoints in a matter of seconds.

Instructions
Jerry’s code performs the following cleaning functions:

1. Converts the date column to a datetime object.
2. Calculates the return from open to close of each stock each day.
3. Converts the volume column to be an integer value.
4. Designates the price fluctuation for each stock each day as high, medium, or low.
5. Links each stock ticker to the company’s name (found in a separate DataFrame).

You quickly notice that Jerry relies heavily on looping through the stock price DataFrame to perform these functions. However, you know that looping is extremely inefficient, so you want to avoid this at all costs.

Simply put, your task is to fix Jerry’s code so that it runs as quickly as possible. Using the time counter included in the code, see how quickly you can get the script to clean and output the data from the `stockData.csv` practice set. Comment below with a screenshot of how quickly your code ran, and remember, seconds count!

Data Files

Getting Started

While Jetty’s code is pretty bad, all of his calculations and assumptions are correct. Feel free to use his formula for calculating daily returns and his cutoffs for ‘High’, ‘Medium’, and ‘Low’ volatility stocks. Also, he correctly assumed that all values in the ‘Volume’ column are given in millions, so you don’t have to account for valeus given in the thousands.

If you are new to working with pandas dataframes, this resource from w3Schools can help you get acquainted.

Suggestions and Hints
Suggestion #1: Pandas Datetime Objects

Dates can be imported directly from files as datetime objects or converted after importing. Use `parse_dates=[‘date_col’]` when reading in a csv, or `pd.to_datetime` to convert later in the code.

Suggestion #2: Vector Calculations

Simple calculations can easily be performed on entire DataFrame columns without looping using vector calculations. A simple example would be: `df[‘col_3’] = df[‘col_1’] * df[‘col_2’]`

Suggestion #3: Applying Functions to DataFrame Columns

You can write functions to do more complex calculations or string manipulations. To apply a function to an entire DataFrame column use the syntax: `df[‘col’].apply(function_name)`

Suggestion #4: Conditional Expressions

The popular python package `numpy` contains functionality that allows conditional assignments to be applied to an entire DataFrame column. These conditional expressions take the general format `df[‘col’] = np.where({insert expression}, value_if_true, value_if_false)’

Suggestion #5: Merging DataFrames

Merging two DataFrames is a quick way to add needed columns to an analysis. To do this, use `pd.merge()`. Make sure you know what arguments you need to pass for this method to work.

Solution

Solution Code
``````from datetime import datetime
import pandas as pd, numpy as np, time, os
start_time = time.time() #Record start time

#Get current directory
def get_dir():
dir = os.getcwd()
if '/' in dir:
delimiter = '/' #Mac users
else:
delimiter = '\\' #PC users
return dir,delimiter
dir,delimiter = get_dir()

#Convert date to datetime object

#Calculate daily return
stocks_df['Return'] = round((stocks_df['Close']/stocks_df['Open'])-1,4)

#Change volume to numbers in millions
def convert_volume(volume):
volume = volume.replace('M','')
volume = float(volume) * 1000000
return int(volume)
stocks_df['Volume'] = stocks_df['Volume'].apply(convert_volume) #Call function

#Note high volatility stocks
stocks_df['Volatility'] = np.where(((stocks_df['High'] / stocks_df['Low'])-1)>.05,
'High',np.where(((stocks_df['High'] / stocks_df['Low'])-1)<=.025,'Low','Medium'))