30|PYTHON – SEC and ASC

BYU Student Author: @Nate
Reviewers: @Mike_Paulsin, @Brett_Lowe
Estimated Time to Solve: 20 - 30 Minutes

We provide the solution to this challenge using:

  • Python

Need a program? Click here.

Overview
The Securities and Exchange Commission (SEC) regularly corresponds with companies after reviewing their financial statements and other disclosure documents. These letters may raise questions, concerns, or requests for additional information related to a company’s implementation of the standards found within the Accounting Standards Codification (ASC).

As accounting professionals, we help clients navigate the complexities of GAAP and ensure compliance with the latest standards, so knowing which ASC topics are under the greatest SEC scrutiny can enable us to better advise our clients on these issues. Fortunately, a background in data analytics can save us time spent searching through comment letters to identify these trends.

Instructions
In this challenge, you’ll write a Python script that parses multiple SEC inquiries to recognize explicit mentions of ASC topics, identifying which are most commonly mentioned.

Included below is a folder containing all 12 comment letters sent by the SEC in the first quarter of 2023 that reference the ASC. All mentions of ASC topics follow the format: ASC ### (e.g., ASC 321). Extract and count these mentions, grouped by topic.

A file containing the numbers and names of all ASC topics is also included. Use this file to add names to the topics you extracted from the SEC comment letters. Export or print a simple report of your mentioned topics count. Are you surprised which standards receive the most scrutiny?

Suggestions and Hints

Many mentions refer to specific subsections (e.g., ASC 321-10-35-2). For this challenge you only need to extract the first three digits, which indicate the topic area.
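For example, a capture group pulls out just the topic number (a minimal sketch; the exact pattern you use is up to you):

import re

# Capture only the three-digit topic number from a full subsection reference
print(re.findall(r'ASC\s+(\d{3})', 'See ASC 321-10-35-2.')) #['321']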

The native Python function open(), paired with the methods .readlines() or .read(), lets you read .txt files into string variables.
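For instance (letter1.txt is a hypothetical file name, assumed to sit in the working directory):

text = open('letter1.txt').read() #read the whole file into a single string
lines = open('letter1.txt').readlines() #or read it into a list of one string per line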

Python’s native operating system interfacing package can help you reference directories and iterate through files in a folder using the following methods (see the sketch after this list):

  • os.getcwd()
  • os.listdir()
    Import this package with import os
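For example, a minimal sketch that lists the files in the challenge’s Uploads folder (adjust the path separator for your operating system):

import os

directory = os.getcwd() + '/Uploads' #path to the folder holding the comment letters
for file_name in os.listdir(directory): #iterate through the files in the folder
    print(file_name)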

Python’s native Regex package contains the following methods that can help identify patterns in strings (see the example after this list):

  • re.findall() - creates a list of all substrings matching the specified pattern within a string
  • re.sub() - replaces all instances of the specified pattern within a string with a specified string
  • re.match() - returns a match object if the specified pattern matches at the beginning of a string; otherwise, returns None
    Import this package with import re
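A quick sketch of all three methods on an illustrative string:

import re

text = 'We note ASC 606 and\n   ASC 842.'
print(re.findall(r'ASC\s+\d{3}', text)) #['ASC 606', 'ASC 842']
print(re.sub(r'\n\s*', ' ', text)) #'We note ASC 606 and ASC 842.'
print(re.match(r'ASC', text)) #None, because the string does not start with 'ASC'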

Data Files

Solution

Solution Code
import os #import packages 
import re 
import pandas as pd 

#x = '\\' #for Windows users 
x = '/' #for Mac users 

directory = os.getcwd() + f'{x}Uploads' #save folder holding SEC correspondence letters 
pattern = r'ASC\s+\d{3}' #save regex pattern to search for references to ASC topics 
topics_dict = {} #initialize dictionary to save mentions of each topic 

for upload in os.listdir(directory): #iterate through files in 'Uploads' folder 
  text = open(f'{directory}{x}{upload}').read() #save contents of letter to 'text' 
  text = re.sub(r'\n\s*',r' ',text) #replace newline and extra whitespace characters with single spaces 
  asc_topics = re.findall(pattern,text,re.IGNORECASE) #save all matches for 'pattern' to 'asc_topics' list 
  for topic in asc_topics: #iterate through all matches in 'asc_topics' list 
    if topic in topics_dict: #check for previously mentioned topics 
      topics_dict[topic] += 1 #add an additional mention to existing topic in dictionary 
    else: 
      topics_dict[topic] = 1 #add new topic to dictionary with 1 mention 

df = pd.DataFrame(data=topics_dict.items(),columns=['Topic','Mentions']) #create dataframe with mentioned topics 

df.sort_values(by=['Mentions'],ascending=False,inplace=True) #sort dataframe with most-mentioned topics first 

topics_list = open(os.getcwd() + f'{x}ASC Topics.txt').readlines() #create list of all topics in the ASC 

for topic in topics_list: #iterate through all topics in the ASC 
  for i, row in df.iterrows(): #iterate through all rows of the mentioned topics dataframe 
    if re.match(row['Topic'][-3:],topic): #check if topic from list matches mentioned topic 
      topic = re.sub(r'\n|\d{3}','',topic) #delete the newline and topic number from the topic name 
      df.loc[i,'Topic'] = df.loc[i,'Topic'] + topic #concatenate topic name with topic number and save to 'df' 

print(df) #display 'df' for viewing 

df.to_excel('SEC Topics of Interest.xlsx',index=False) #export 'df' to Excel report 

Challenge30_Solution.txt
Solution Video: Challenge 30|PYTHON – SEC and ASC

I was looking for an excuse to learn Python’s regex capabilities, and this was a good challenge to do that with! I relied heavily on regex and pandas to accomplish this challenge. I took a different approach than the solution presented, though there are undoubtedly some similarities upon my review of it.

I first decided to prepare two dictionaries: one using the reference file to connect topic names to the topic IDs and another to connect a reference count (starting at 0) to each topic ID. I converted the topic reference table to a df for later. My biggest difference from the solution was that I decided to define a function that would grab a list of topic IDs from any given letter and apply that function to each file dynamically. I also utilized regex grouping to easily isolate the topic ID from the ASC reference. See my defined function below (based on a predefined file directory location):

def get_topics(file):
    letter = open(directory+file).read()
    # Regex pattern to check for, with a group isolating the three-digit topic number
    pattern = r'ASC\s+(\d{3})'
    # findall returns just the captured group for each match, giving a list of topic IDs
    topic_ids = re.findall(pattern, letter, re.IGNORECASE)
    return topic_ids

From there, I looped through each comment letter and applied my function to each one, using the resulting list of topics to update the mention counts in the dictionary. I converted my completed topics count dictionary into a dataframe and then used the pd.merge method to merge the Topic Descriptions to it. I noticed the solution had concatenated the topic ID to the name, so my output does differ a little bit there. My full solution is below.

# Import necessary modules for analysis
import pandas as pd
import os
import re
# Access Text File to Edit / Create Blank Dictionaries
topics_list = open('ASC Topics.txt').readlines()
topics_dict = {}
topic_descriptions = {}
# Iterate through each item in the list and add to dictionary
for i, topic in enumerate(topics_list):
    # Ignore lines that don't contain topics
    if i > 1:
        # Remove surrounding whitespace, including the trailing newline
        topic = topic.strip()
        # Split out topic numbers and topic descriptions
        number = topic.split(' — ')[0]
        description = topic.split(' — ')[1]
        # Initialize reference count before analysis
        count = 0
        # Prepare Topic Reference Count Dictionary
        topics_dict[number] = count
        # Prepare Topic Description Dictionary to link to IDs
        topic_descriptions[number] = description
# Create Reference Frame for Topic IDs and Topic Descriptions
df_topic = pd.DataFrame(data = topic_descriptions.items(), columns = ['Topic ID', 'Description'])
# Define function to easily pull out all references of ASC Topics
def get_topics(file):
    letter = open(directory+file).read()
    # Regex pattern to check for, with a group isolating the three-digit topic number
    pattern = r'ASC\s+(\d{3})'
    # findall returns just the captured group for each match, giving a list of topic IDs
    topic_ids = re.findall(pattern, letter, re.IGNORECASE)
    return topic_ids
# Direct computer to list of comment letter files (based on how I set up my file configuration)
directory = os.getcwd()+'\\Uploads\\'
files = os.listdir(directory)
# Collect Number of References to each ASC reference
for file in files:
    # Get list of topics mentioned in letter using premade function
    topic_ids = get_topics(file)
    for topic in topic_ids:
        if topic in topics_dict:
            # Increment reference count by 1 if topic exists
            topics_dict[topic] += 1
        else:
            # Create new topic if topic doesn't exist already
            topics_dict[topic] = 1
# Create Topics Count Dataframe
df_counts = pd.DataFrame(data = topics_dict.items(), columns = ['Topic ID', 'Mentions'])
df_counts.sort_values(by = ['Mentions'], ascending = False, inplace = True)
# Merge Counts Dataframe with Topics Dataframe
topic_counts = pd.merge(df_counts, df_topic, on = 'Topic ID')
# Rearrange columns
topic_counts = topic_counts[['Topic ID', 'Description', 'Mentions']]
# Export to Excel
topic_counts.to_excel('ASC_References.xlsx', index=False)    

I worked on this and figured it out as I went, since I’m still learning Python. What I came up with is really close to the solution; I left a few things out, but it still worked on my end.

import os
import re
import pandas as pd

x = '\\'

directory = os.getcwd() + f'{x}Uploads'
pattern = r'ASC\s+\d{3}'
topics_dict = {}

for upload in os.listdir(directory):
    text = open(f'{directory}{x}{upload}').read()
    text = re.sub(r'\n\s*', r' ', text)
    asc_topics = re.findall(pattern, text, re.IGNORECASE)
    for topic in asc_topics:
        if topic in topics_dict:
            topics_dict[topic] += 1
        else:
            topics_dict[topic] = 1

df = pd.DataFrame(data=topics_dict.items(), columns=['Topic', 'Mentions'])
df.sort_values(by=['Mentions'], ascending=False, inplace=True)

topics_list = open(os.getcwd() + f'{x}ASC Topics.txt').readlines()

for topic in topics_list:
    for i, row in df.iterrows():
        if re.match(row['Topic'][-3:], topic):
            topic = re.sub(r'\n|\d{3}', '', topic)
            df.loc[i, 'Topic'] = df.loc[i, 'Topic'] + topic

print(df)
df.to_excel('SEC Topics of Interest.xlsx', index=False)

Great challenge! Such a great exercise to navigate a bit of regex and glob. I’m not surprised that ASC 606 is the standard that receives the most scrutiny. Revenue is arguably the most important account in any company’s financials! I hope I can apply these skills in my job this summer.

Here is my solution:

(screenshot of solution code)

Good refresher on lists, glob, and regex. I’ve never used regex in Python, but it’s very useful. I learned about a cool functionality I’d never seen before: [*set(my_list)], which unpacks a set back into a list of all the unique values in my_list. Here is my code:
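As a quick illustration of that unpacking trick (made-up values, not the poster’s code):

mentions = ['ASC 606', 'ASC 842', 'ASC 606']
unique_topics = [*set(mentions)] #unpack the set of unique values back into a list
print(unique_topics) #e.g., ['ASC 842', 'ASC 606'] - set order is not guaranteed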

It doesn’t really surprise me that Revenue and Business Combinations are hot topics given the complexity of both issues.

Great challenge! I think this would be super useful for seeing which topics come up most often. Here is the solution I came up with:

import os
import re
import pandas as pd

data_dir = os.getcwd() + '\\Uploads'
regex_pattern = r'ASC\s+\d{3}'
topics = {}

for file_name in os.listdir(data_dir):
    file_content = open(f'{data_dir}\\{file_name}').read()
    file_content = re.sub(r'\n\s*', r' ', file_content)
    matches = re.findall(regex_pattern, file_content, re.IGNORECASE)
    for match in matches:
        if match in topics:
            topics[match] += 1
        else:
            topics[match] = 1

df = pd.DataFrame(data=topics.items(), columns=['Topic', 'Mentions'])
df.sort_values(by=['Mentions'], ascending=False, inplace=True)

asc_topics = open(os.getcwd() + '\\ASC Topics.txt').readlines()
for asc_topic in asc_topics:
    for i, row in df.iterrows():
        if re.match(row['Topic'][-3:], asc_topic):
            df.loc[i, 'Topic'] = df.loc[i, 'Topic'] + re.sub(r'\n|\d{3}', '', asc_topic)

print(df)
df.to_excel('SEC Topics of Interest.xlsx')

Great challenge! As a small refactoring, os.path.join() is helpful for building paths on any operating system, so you would not need to worry about Windows vs. Mac:

import os 
os.path.join(directory, upload)

Another idea is to use pathlib:

from pathlib import Path
text = (Path(directory) / upload).read_text()

I created a bar chart with matplotlib to display my results!
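In case it helps anyone recreate that, here is a minimal sketch of such a chart (not the poster’s actual code; it assumes a df with 'Topic' and 'Mentions' columns like the solutions above build):

import matplotlib.pyplot as plt

df_sorted = df.sort_values(by='Mentions') #least-mentioned first so the top bar is the most-mentioned
plt.barh(df_sorted['Topic'], df_sorted['Mentions']) #horizontal bar chart of mention counts
plt.xlabel('Mentions')
plt.title('ASC Topics Referenced in SEC Comment Letters')
plt.tight_layout()
plt.show()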
