BYU Student Author: @Nate
Reviewers: @Erick_Sizilio, @Jae
Estimated Time to Solve: 60 Minutes
We provide the solution to this challenge using:
- Python
Overview
Congrats! You just found your dream job, and you couldn’t be more excited. For years, you’ve been wanting to get promoted from assistant regional manager to assistant to the regional manager. Well, last week you interviewed with a prestigious paper manufacturer based in Scranton, PA, and the regional manager Michael just called—you got the job!
Michael has already emailed you the employment contract, and you are excited to read through it. However, wanting to save paper and not having time to read all 4 pages, you decide to take a quick hour to write a Python program that can identify the obligations and benefits outlined for you and your prospective employer, Dunder Mifflin, in your employment contract.
You also know that nearly 84% of employees never read their entire employment contracts, so if your program is any good, you might be able to license it out for a healthy profit.
Instructions
In an hour or less, write a script that can identify obligations and benefits outlined for different parties in a contract.
To help you, I have provided your employment contract with Dunder Mifflin in .pdf, .docx, and .txt formats. Pick one or more formats that your script will accept as input from a user.
I have also provided a file containing a list of 50 common words that indicate obligations or benefits in contracts. This file can help you search for and return instances of these words along with the sentences containing them. Output these sentences to a report that you can review to avoid reading the whole contract.
See what obligations and benefits your script finds in the Dunder Mifflin contract and comment your favorites below!
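As a rough starting point, the search described above might be sketched as follows. The function name find_flagged_sentences, the sample words, and the sample sentences are all made up for illustration; they are not taken from the challenge files, and the real word list may need different handling.

```python
import re

def find_flagged_sentences(sentences, words):
    """Return (word, sentence) pairs where the word appears as a whole word.

    Hypothetical helper for illustration only.
    """
    hits = []
    for sentence in sentences:
        for word in words:
            # re.escape treats the word literally; \b avoids partial-word matches
            if re.search(rf'\b{re.escape(word)}\b', sentence, re.IGNORECASE):
                hits.append((word, sentence))
    return hits

# Made-up sample data (assumption: not from the provided contract or word list)
sample_words = ['shall', 'must', 'is entitled to']
sample_sentences = [
    'Employee shall report to the Regional Manager.',
    'The office is located in Scranton, PA.',
    'Employee is entitled to two weeks of paid vacation.',
]

hits = find_flagged_sentences(sample_sentences, sample_words)
for word, sentence in hits:
    print(f'[{word}] {sentence}')
```

Pairing each sentence with the word that triggered it makes the report easier to skim, at the cost of listing a sentence more than once when it contains several flagged words.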
Data Files
- Challenge93_Employment_Agreement.pdf
- Challenge93_Employment_Agreement.docx
- Challenge93_Employment_Agreement.txt
- Challenge93_Contract_Words.txt
Suggestions and Hints
I recommend using Python's re (regular expression) module to parse the contract text and find instances of the words in the contract_words.txt file. Make sure that you return entire sentences so that you don't miss obligations or benefits.
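One way to get whole sentences back is to split the text on the whitespace that follows sentence-ending punctuation. The sketch below is a minimal version of that idea, using made-up sample text. It assumes sentences end in '.', '!', or '?' (optionally followed by a closing quote), so abbreviations such as "a.m." or "Inc." would be split incorrectly.

```python
import re

# Split on whitespace that follows ., !, or ? (optionally followed by a quote).
# Each lookbehind is fixed-width, which Python's re module requires.
SENTENCE_BOUNDARY = re.compile(r'''(?:(?<=[.!?])|(?<=[.!?]['"]))\s+''')

def split_into_sentences(text):
    """Collapse runs of whitespace, then split into sentences."""
    text = re.sub(r'\s+', ' ', text).strip()
    return SENTENCE_BOUNDARY.split(text)

# Made-up sample text (assumption: not from the provided contract)
sample = ('The Employee shall sell paper.\n'
          'The Employer will pay a salary.  '
          'Michael said "Welcome aboard!" Is lunch provided?')

sentences = split_into_sentences(sample)
for s in sentences:
    print(s)
```

Collapsing whitespace first matters because text extracted from PDF or Word files often contains line breaks in the middle of a sentence.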
Solution
Solution Code
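from pdfminer.high_level import extract_text
import docx
import re
import sys

path = input('Input the full file path of your contract.').strip() # Accept file path
# Read list of contract words, skipping blank lines and trailing newlines
# (the provided file is named Challenge93_Contract_Words.txt; rename it or update this path)
with open('contract_words.txt', 'r') as file:
    regex_list = [line.strip() for line in file if line.strip()]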
# Extract text from .pdf files
def pdf_extract(path):
    with open(path, 'rb') as pdf_file: # Open the PDF file (closed automatically)
        text = extract_text(pdf_file) # Extract text from the PDF
    text = text.replace('\n',' ') # Replace line breaks with spaces so adjacent words don't fuse
    return text
# Extract text from .docx files
def docx_extract(path):
    doc = docx.Document(path) # Open the Word document
    # Extract text from the Word document
    text = ''
    for paragraph in doc.paragraphs:
        text += paragraph.text + ' '
    return text
# Read text from .txt files
def txt_extract(path):
    with open(path, 'r') as file:
        text = file.read() # Read the contents of the file
    text = re.sub(r'\n|\xa0|\t',' ',text)
    return text
# Split text into list of sentences
def split_sentences(text):
    text = re.sub(r'\s+',r' ',text)
    sentence_splits = re.compile(r"""
        # Split sentences on whitespace between them.
        (?:               # Group for two positive lookbehinds.
          (?<=[.!?])      # Either an end of sentence punct,
        | (?<=[.!?]['"])  # or end of sentence punct and quote.
        )                 # End group of two positive lookbehinds.
        \s+               # Split on whitespace between sentences.
        """,
        re.IGNORECASE | re.VERBOSE)
    text_list = sentence_splits.split(text)
    return text_list
# Allow user to reference .pdf, .docx, or .txt files
path_lower = path.lower() # Match extensions case-insensitively (e.g. '.PDF')
if path_lower.endswith('.pdf'):
    text = pdf_extract(path)
elif path_lower.endswith('.docx'):
    text = docx_extract(path)
elif path_lower.endswith('.txt'):
    text = txt_extract(path)
else:
    # End program if file type unsupported
    print('The path you have entered is not valid.\nPlease enter a file with a .pdf, .docx, or .txt extension.')
    sys.exit()
# Accept titles of parties in contract
parties = input('List the titles of all parties for whom you want an overview. Separate parties with commas.')
parties_list = parties.split(',')
text_list = split_sentences(text) # Split text into sentences
# Create an output word doc with title
doc = docx.Document()
doc.add_paragraph('CONTRACT OVERVIEW:\n')
# Find obligations/benefits of each party
for i, party in enumerate(parties_list):
    # Clean party title
    party = party.strip().lower()
    parties_list[i] = party
    # Check for party title in contract
    while party not in text.lower():
        # Prompt user if party title not found; clean the new title the same way
        party = input(f'"{party}" is not referenced in this contract, please provide a new title for this party, or type "END" to cancel.').strip().lower()
        if party == 'end':
            sys.exit()
        parties_list[i] = party
    output_list = [] # Create list of sentences with obligations/benefits
    doc.add_paragraph(f'Obligations and Benefits of {party.upper()}:') # Add subtitle for party
    # Iterate through sentences in contract
    for t in text_list:
        # Iterate through contract words
        for r in regex_list:
            r = r.strip()
            if not r:
                continue # Skip blank lines in the word list
            # Find party title followed by contract word and append matches to output list
            # (re.escape keeps characters such as '.' in the title from acting as regex syntax)
            if re.search(f'{re.escape(party)}\\s+{r}',t,re.IGNORECASE):
                output_list.append(t)
    # Drop duplicates in output list while keeping the contract's sentence order
    output_sentences = list(dict.fromkeys(output_list))
    # Add sentences to word doc
    for o in output_sentences:
        doc.add_paragraph(f'{o}\n')
    # If no obligations found, write 'None'
    if len(output_sentences) == 0:
        doc.add_paragraph('None\n')
doc.save('Contract Overview.docx') # Save word doc
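The matching step in the solution looks for a party title immediately followed by a contract word. A possible variation, sketched below, compiles one pattern per party instead of looping over every word for every sentence. The helper name build_party_pattern and the sample words and sentences are made up for illustration, and the sketch assumes the word list holds plain words or small regex fragments that are safe to join with '|'.

```python
import re

def build_party_pattern(party, words):
    """Compile a pattern matching the party title followed by any contract word.

    Hypothetical helper; the words are treated as regex fragments.
    """
    alternatives = '|'.join(words)
    # re.escape makes the party title literal; \s+ allows any spacing in between
    return re.compile(rf'\b{re.escape(party)}\s+(?:{alternatives})\b', re.IGNORECASE)

# Made-up sample data (assumption: not from the provided contract or word list)
sample_words = ['shall', 'will', 'is entitled to']
sample_sentences = [
    'Employee shall sell paper.',
    'Employer will pay Employee a salary.',
    'The office is in Scranton.',
]

results = {}
for party in ['employee', 'employer']:
    pattern = build_party_pattern(party, sample_words)
    results[party] = [s for s in sample_sentences if pattern.search(s)]

for party, matched in results.items():
    print(party.upper(), matched)
```

Like the solution, this only catches sentences where the contract word directly follows the party title, so a phrase such as "Employee, at all times, shall" would be missed. Whether to loosen that depends on how much noise you are willing to accept in the report.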
Challenge93_Solution.ipynb
Solution Video: Challenge 93|PYTHON – Dream Job?