93|PYTHON – Dream Job?

BYU Student Author: @Nate
Reviewers: @Erick_Sizilio, @Jae
Estimated Time to Solve: 60 Minutes

We provide the solution to this challenge using:

  • Python


Overview
Congrats! You just found your dream job, and you couldn’t be more excited. For years, you’ve been wanting to get promoted from assistant regional manager to assistant to the regional manager. Well, last week you interviewed with a prestigious paper manufacturer based in Scranton, PA, and the regional manager Michael just called—you got the job!

Michael has already emailed you the employment contract, and you are excited to read through it. However, wanting to save paper and not having time to read all 4 pages, you decide to take a quick hour to write a Python program that can identify the obligations and benefits outlined for you and your prospective employer, Dunder Mifflin, in your employment contract.

You also know that nearly 84% of employees never read their entire employment contracts, so if your program is any good, you might be able to license it out for a healthy profit.

Instructions
In an hour or less, write a script that can identify obligations and benefits outlined for different parties in a contract.

To help you, I have provided your employment contract with Dunder Mifflin as a .pdf, .docx, and .txt file. Pick one or more formats that your script will accept as input from a user.
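One way to handle the choice of input format is to branch on the file extension before reading anything. The sketch below is only an illustration: the helper name `detect_format` and its return values are assumptions, not part of the provided files.

```python
from pathlib import Path

# Formats the challenge provides. The helper name and the return values
# are assumptions made for illustration.
SUPPORTED = {'.pdf': 'pdf', '.docx': 'docx', '.txt': 'txt'}

def detect_format(path):
    """Return 'pdf', 'docx', or 'txt' based on the file extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive: 'Contract.PDF' works
    if suffix not in SUPPORTED:
        raise ValueError(f'Unsupported file type: {suffix or "(none)"}')
    return SUPPORTED[suffix]

print(detect_format('contract.txt'))
print(detect_format('Contract.PDF'))
```

Checking the extension up front lets the script reject an unsupported file with a clear message before any parsing library is called.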

I have also provided a file containing a list of 50 common words that indicate obligations or benefits in contracts. This file can help you search for and return instances of these words along with the sentences containing them. Output these sentences to a report that you can review to avoid reading the whole contract.

See what obligations and benefits your script finds in the Dunder Mifflin contract and comment your favorites below!

Data Files

Suggestions and Hints

I recommend using Python’s built-in re (regular expression) module to search the contract text for instances of the words in the contract_words.txt file. Make sure that you return entire sentences so that you don’t miss obligations or benefits.
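The hint above could be sketched roughly as follows. The sample text and the three words are placeholders made up for this example; they are not the contents of contract_words.txt or of the actual contract.

```python
import re

# Placeholder inputs: the real word list lives in contract_words.txt and the
# real text comes from the contract. These samples are made up.
words = ['shall', 'must', 'is entitled to']
text = ('The Employee shall report to the Regional Manager. '
        'Lunch is at noon. '
        'The Employee is entitled to ten days of paid leave.')

# Split after ., !, or ? followed by whitespace (a simple heuristic that
# will also split after abbreviations such as "Inc. ").
sentences = re.split(r'(?<=[.!?])\s+', text)

# One pattern for all words; \b keeps 'must' from matching 'mustard'.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b',
    re.IGNORECASE)

# Keep whole sentences, not just the matched words.
matches = [s for s in sentences if pattern.search(s)]
for s in matches:
    print(s)
```

Splitting into sentences first and then filtering means each hit comes back with its full context, which is the point of returning entire sentences.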

Solution

Solution Code
from pdfminer.high_level import extract_text 
import docx, re, sys 

path = input('Input the full file path of your contract: ').strip() # Accept file path, ignoring stray whitespace 

# Read list of contract words  
with open('contract_words.txt', 'r') as file: 
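    # Strip newlines and skip blank lines (a blank entry would match every party mention) 
    regex_list = [line.strip() for line in file if line.strip()] 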

# Extract text from .pdf files 
def pdf_extract(path): 
    with open(path, 'rb') as pdf_file: # Open the PDF file; it closes automatically 
        text = extract_text(pdf_file) # Extract text from the PDF 
    text = text.replace('\n',' ') # Replace line breaks with spaces so adjacent words don't merge 
    return text 

# Extract text from .docx files 
def docx_extract(path): 
    doc = docx.Document(path) # Open the Word document 
    # Extract text from the Word document 
    text = '' 
    for paragraph in doc.paragraphs: 
        text += paragraph.text + ' ' 
    return text

# Read text from .txt files 
def txt_extract(path): 
    with open(path, 'r') as file: 
        text = file.read() # Read the contents of the file 
    text = re.sub(r'\n|\xa0|\t',' ',text) 
    return text 

# Split text into list of sentences 
def split_sentences(text): 
    text = re.sub(r'\s+',r' ',text) 
    sentence_splits = re.compile(r""" 
        # Split sentences on whitespace between them. 
        (?: # Group for two positive lookbehinds. 
        (?<=[.!?]) # Either an end of sentence punct, 
        | (?<=[.!?]['"]) # or end of sentence punct and quote. 
        ) # End group of two positive lookbehinds. 
        \s+ # Split on whitespace between sentences. 
        """,  
        re.IGNORECASE | re.VERBOSE) 
    text_list = sentence_splits.split(text)

    return text_list 
 
# Allow user to reference .pdf, .docx, or .txt files 
if path.lower().endswith('.pdf'): 
    text = pdf_extract(path) 
elif path.lower().endswith('.docx'): 
    text = docx_extract(path) 
elif path.lower().endswith('.txt'): 
    text = txt_extract(path) 
else: 
    # End program if file type unsupported 
    print('The path you have entered is not valid.\nPlease enter a file with a .pdf, .docx, or .txt extension.') 
    sys.exit() 

# Accept titles of parties in contract 
parties = input('List the titles of all parties for whom you want an overview. Separate parties with commas.') 
parties_list = parties.split(',')  

text_list = split_sentences(text) # Split text into sentences  

# Create an output word doc with title 
doc = docx.Document() 
doc.add_paragraph('CONTRACT OVERVIEW:\n')  

# Find obligations/benefits of each party 
for i, party in enumerate(parties_list): 
    # Clean party title 
    party = party.strip().lower()  
    parties_list[i] = party 
    # Check for party title in contract 
    while party not in text.lower(): 
        # Prompts user if party title not found 
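        # Clean the new title the same way so the check above compares lowercase to lowercase 
        parties_list[i] = input(f'"{party}" is not referenced in this contract. Please provide a new title for this party, or type "END" to cancel. ').strip().lower() 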
        if parties_list[i].lower() == 'end': 
            sys.exit() 
        party = parties_list[i]  

    output_list = [] # Create list of sentences with obligations/benefits 
    doc.add_paragraph(f'Obligations and Benefits of {party.upper()}:') # Add subtitle for party 
    # Iterate through sentences in contract 
    for t in text_list: 
        # Iterate through contract words 
        for r in regex_list: 
            r = r.replace('\n','') 
            # Find contract word in sentence and append matches to output list 
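            # re.escape keeps punctuation in a party title (e.g. "Inc.") from being read as regex syntax 
            if re.search(f'{re.escape(party)}\\s+{r}',t,re.IGNORECASE): 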
                output_list.append(t) 
    # Drop duplicates in output list while keeping the order they appear in the contract 
    output_set = list(dict.fromkeys(output_list)) 
    # Add sentences to Word doc 
    for o in output_set: 
        doc.add_paragraph(f'{o}\n') 
    # if no obligations found, type 'None' 
    if len(output_set) == 0: 
        doc.add_paragraph('None\n')  

doc.save('Contract Overview.docx') # Save word doc 
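
The matching step in the solution above looks for a party title followed directly by a contract word. The sketch below isolates that idea on made-up sentences to show what it catches and what it misses. The helper name `mentions_obligation` and the sample inputs are assumptions for illustration only.

```python
import re

def mentions_obligation(sentence, party, word_pattern):
    """True if `party` is followed by whitespace and then `word_pattern`."""
    # re.escape guards against party titles containing regex characters,
    # e.g. "Dunder Mifflin, Inc."; word_pattern is treated as a regex.
    return re.search(rf'{re.escape(party)}\s+{word_pattern}', sentence,
                     re.IGNORECASE) is not None

# Made-up sentences and words, not taken from the actual contract files.
print(mentions_obligation('The Employee shall wear a tie.',
                          'employee', 'shall'))
print(mentions_obligation('The Employee, at all times, shall wear a tie.',
                          'employee', 'shall'))
```

Because the pattern requires the contract word to come immediately after the party title, a sentence with words or punctuation in between is not reported. That keeps the report focused on the named party, at the cost of possibly skipping some relevant sentences.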

Challenge93_Solution.ipynb
Solution Video: Challenge 93|PYTHON – Dream Job?