31|PYTHON – PDF Merger

BYU Student Author: @Jonathan_Weston
Reviewers: @Hyrum, @Parker_Sherwood
Estimated Time to Solve: 45 Minutes

We provide the solution to this challenge using:

  • Python

Overview
You are a tax accountant at a small office who prepares tax forms for individuals. Currently, you have three clients: Gertrude, Jason, and MacDonald. Gertrude is old and can’t see. Her children still live with her, and she has hired an in-house assistant to help her get around. Jason is a single guy who has started his own business. MacDonald works on a farm with his children, and they also rent out a barn for people to stable their horses.

You have already gathered the tax forms that each client will need, but your filing software needs all of the tax forms to be merged into one file for each client. Rather than paying for expensive PDF editing software or using free but time-consuming websites, you have chosen to write a program in Python that will merge them automatically.

DISCLAIMER: These are not instructions on how to prepare or file taxes. This is simply an example of how to use Python to merge PDF files.

Instructions
You can write the code however you want, but it must have the following components:

  1. Use an input statement to ask the user for the file path of the Challenge PDF Merger folder.
  2. The program should merge the tax forms in each client’s folder and create a “Merged Tax Forms.pdf”. There should be three created files in total, one for each client. The solution for this challenge uses the pypdf library to merge the tax forms. Look under Suggestions and Hints to see how to install that library and the other libraries used for this challenge.
  3. The program should place the newly created “Merged Tax Forms.pdf” into each client’s folder.
  4. If that file already exists in a client’s folder, the program should not append the new file to the old one or leave a duplicate.
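
Requirement 4 is easy to miss: if an earlier run left a “Merged Tax Forms.pdf” in a client’s folder, a plain search for *.pdf will pick it up as an input. Below is a minimal sketch of one way to avoid that, using only the standard library. The helper name list_source_pdfs and the constant OUTPUT_NAME are hypothetical and are not part of the author’s solution.

```python
from pathlib import Path

# Assumed output file name, taken from the challenge text.
OUTPUT_NAME = "Merged Tax Forms.pdf"

def list_source_pdfs(client_folder):
    """Return the client's PDFs in sorted order, skipping an earlier merged output."""
    folder = Path(client_folder)
    return sorted(
        pdf for pdf in folder.glob("*.pdf")
        if pdf.name != OUTPUT_NAME  # never feed the old result back into the merge
    )
```

Filtering the input list is one option; deleting the old file before listing the folder, as the hints suggest, works as well.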

More detailed steps and instructions can be found under Suggestions and Hints if you want more guidance on how to create this program.

Data Files

Suggestions and Hints

The pypdf library contains several modules and tools you can use to edit PDFs. In a notebook cell, try running the following to install the library:

  • pip install pypdf
  • If you get an error, try running this instead:
  • pip install pypdf --user

I was able to create this program with the following functions and modules:

  • from pypdf import PdfMerger
  • from glob import glob
  • import shutil
  • import os
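
One caveat about the first import: newer pypdf releases deprecated PdfMerger in favor of PdfWriter, and pypdf 5.0 removed it, so from pypdf import PdfMerger may fail depending on the installed version. The sketch below shows one possible fallback. The helper name get_merger_class is hypothetical, and the sketch assumes that whichever class is returned offers append, write, and close.

```python
def get_merger_class():
    """Return a pypdf class usable for merging, or None if none is available."""
    try:
        import pypdf
    except ImportError:
        return None  # pypdf is not installed in this environment

    # Prefer PdfWriter when it supports append(), as it does in newer releases.
    writer = getattr(pypdf, "PdfWriter", None)
    if writer is not None and hasattr(writer, "append"):
        return writer

    # Otherwise fall back to PdfMerger, which older releases still ship.
    return getattr(pypdf, "PdfMerger", None)
```

With this in place, merger = get_merger_class()() could stand in for merger = PdfMerger() in the code shown in this thread.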

Here are more specific instructions for your code if you need them:

  1. Create a variable for the path to the Challenge PDF Merger folder and one for the directory where your Python notebook runs. Include an os.makedirs statement in case a path does not exist yet.
  2. Create a list of the client folders within the Challenge folder.
  3. Loop through each client folder and build the path to it.
  4. Delete the Merged Tax Forms.pdf if it already exists. This prevents the program from appending to the pre-existing file.
  5. Create a list of the tax forms within the client folder.
  6. Append the PDF files using the PdfMerger class:
#Append the pdf files
merger = PdfMerger()
for pdf in pdfs:
    merger.append(pdf)

  7. Name the file “Merged Tax Forms.pdf” and then close the file.
  8. Move the file from your working directory back into the client’s folder.
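
The steps above can be sketched end to end as follows. This is only one possible arrangement, not the author’s solution: it writes the merged file straight into the client’s folder, so the working-directory and move steps are not needed. The names merge_client_folders, merger_factory, and OUTPUT_NAME are hypothetical, and the default branch assumes a pypdf version that still provides PdfMerger.

```python
from pathlib import Path

OUTPUT_NAME = "Merged Tax Forms.pdf"

def merge_client_folders(challenge_folder, merger_factory=None):
    """Merge each client subfolder's PDFs into OUTPUT_NAME inside that subfolder."""
    if merger_factory is None:
        # Assumption: the installed pypdf still ships PdfMerger.
        from pypdf import PdfMerger
        merger_factory = PdfMerger

    created = []
    for client_dir in sorted(Path(challenge_folder).iterdir()):
        if not client_dir.is_dir():
            continue  # ignore loose files sitting next to the client folders

        target = client_dir / OUTPUT_NAME
        # Remove an earlier result first so it is neither merged again nor duplicated.
        if target.exists():
            target.unlink()

        pdfs = sorted(client_dir.glob("*.pdf"))
        if not pdfs:
            continue  # nothing to merge for this client

        merger = merger_factory()
        for pdf in pdfs:
            merger.append(str(pdf))
        merger.write(str(target))
        merger.close()
        created.append(target)
    return created
```

Calling merge_client_folders(input("Challenge folder: ")) would run it interactively. Passing a different class as merger_factory is just a convenience for swapping in another merger with the same three methods.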

Solution

I enjoyed working on this challenge! I had never used some of these packages before, but it was very interesting and useful. I can see myself using something like this to clean up my personal or work files. Thanks for helping me learn something new! Here’s a look at my solution, which is not as dynamic as the author’s but still accomplishes the challenge.

def combine_files():
    #import necessary packages
    import glob
    import shutil
    import os
    from pypdf import PdfMerger
    
    #Retrieve file path information
    directory = os.getcwd()+"\\"
    beginning_filepath = input("Please enter the filepath of the 'Challenge PDF Merger' folder:  ")

    #Ensure that the file path ends with valid slash characters
    if beginning_filepath[-1]!="\\":
        beginning_filepath+="\\"
    
    #Create a list of customers
    cust_list = ['Gertrude', 'Jason', 'MacDonald']

    #Iterate through the list
    for cust in cust_list:
        #Create the directory to the customer file
        cust_dir = beginning_filepath+cust+"\\"
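        #List the customer's pdfs, skipping a merged file left over from an earlier run
        file_list = [f for f in glob.glob(cust_dir+"*.pdf") if not f.endswith("Merged Tax Forms.pdf")]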
        merger = PdfMerger()
        #Iterate through all files in a customer's folder, appending each one
        for file_path in file_list:
            merger.append(file_path)
        #Create and save the new merged pdf
        merger.write("Merged Tax Forms.pdf")
        merger.close()
        #Move the new pdf
        shutil.move(directory + "Merged Tax Forms.pdf", cust_dir + "Merged Tax Forms.pdf")
    
#Run the function
combine_files()
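
The hard-coded "\\" separators tie this solution to Windows. Here is a small sketch of the same path-building step with os.path.join, which picks the separator for the current operating system and copes with a trailing separator on the input. The helper name client_pdf_pattern is hypothetical.

```python
import os

def client_pdf_pattern(base_folder, client_name):
    """Build the glob pattern for one client's PDFs without hard-coding a separator."""
    return os.path.join(base_folder, client_name, "*.pdf")
```

The result can be passed to glob.glob in place of cust_dir+"*.pdf", which would also make the check for a trailing slash unnecessary.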

Amazing. I am so happy whenever I can make the computer do something that I can do but don’t want to. Few things compare to it.

def combine_tax_forms():
    from pypdf import PdfMerger
    from glob import glob
    import os
    import shutil

    # Get all directories
    pdf_directory = input('input the file path for the challenge folder:\n') + '\\'
    cwd = os.getcwd()+'\\'
    os.makedirs(os.path.dirname(cwd), exist_ok=True)
    
    # Create client list
    clients = glob(pdf_directory+'*')
#     print(clients)
    
    # Create pdf files list for each client
    for person in clients:
        client_dir = person+'\\'
        print(client_dir)
        
        # Check for Created file and remove if already exists,
        # before listing the folder, so the old merged file is not merged again
        if os.path.isfile(client_dir + 'Merged Tax Forms.pdf'):
            os.remove(client_dir + 'Merged Tax Forms.pdf')

        pdfs = glob(client_dir+'*.pdf')

        # Merge pdfs
        merger = PdfMerger()
    
        for file in pdfs:
            merger.append(file)

        # Name new merged file
        merger.write('Merged Tax Forms.pdf')
        merger.close()

        # Move new file to client folder
        shutil.move(cwd+'Merged Tax Forms.pdf',client_dir+'Merged Tax Forms.pdf')
    
combine_tax_forms()
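
One thing to watch with glob(pdf_directory+'*'): it returns every entry in the challenge folder, so a loose file sitting next to the client folders would be treated as a client. Below is a sketch of a filter that keeps only subdirectories. The helper name list_client_folders is hypothetical.

```python
import os
from glob import glob

def list_client_folders(challenge_folder):
    """Return only the subdirectories of the challenge folder, sorted by name."""
    entries = glob(os.path.join(challenge_folder, "*"))
    return sorted(entry for entry in entries if os.path.isdir(entry))
```

Its result could replace the clients list in the loop above.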

I’ve never used the glob library (though I probably should), but I wanted to see if there was a way I could do this just by using the os library that Python offers. Here’s my solution code:

import os
from pypdf import PdfMerger

path = input('Please input the path for the folder:') #Get folder path from user

#Create delimiter for Mac vs Windows machine
if '/' in path:
    delimiter = '/'
else:
    delimiter = '\\'

client_list = []

#Fill list with client names
for folder in os.listdir(path):
    client_list.append(folder)

#Create merged PDF for all clients
for client in client_list:
    client_path = path + delimiter + client

    #Get list of all the tax form files for the client
    tax_forms = []
    for file_name in os.listdir(client_path):
        if file_name.endswith(".pdf") and "Merged Tax Forms.pdf" not in file_name:
            tax_forms.append(os.path.join(client_path, file_name))

    #Merge tax form files into one file
    merger = PdfMerger()
    for pdf in tax_forms:
        merger.append(pdf)
    merged_file_name = os.path.join(client_path, "Merged Tax Forms.pdf")
    merger.write(merged_file_name)
    merger.close()

    print(f"{client}'s tax forms have been merged! Good job, you!")

I had never used the pypdf library before either, but with some help from Google and ChatGPT (which is awesome, as we all know) I was able to get my code to be as efficient as I could make it. It isn’t as efficient as using glob, though, so that’s definitely a library I will have to start teaching myself! I’ll be on the lookout for more challenges on here that use it!