42|PYTHON – Inventory Check Part 2

BYU Student Author: @Jonathan_Weston
Reviewers: @IWillyerd, @klayton, @Alex_Garrett
Estimated Time to Solve: 45 Minutes

We provide the solution to this challenge using:

  • Python

Need a program? Click here.

This is the second part of the Inventory Check Challenge. Click here to participate in the challenge Inventory Check Part 1.

Overview
You are the lead auditor for Innovatech Publishing. Over the past couple of weeks, you have been leading the inventory count for different departments. In the previous challenge, you used regex to validate information provided by staff after they counted different departments. Today, your goal is to verify that the barcodes on merchandise on store shelves match the information recorded by audit staff. You have an Excel table with the inventory count information and a folder of pictures of barcodes for each sampled item. To accomplish this task, you plan to use python and an image-recognition library to ensure that the barcode in the picture matches the barcode value listed in the Excel file.

Instructions
The libraries you will need to import for this challenge are as follows:

import pandas as pd 
import pytesseract 

Python-tesseract is an optical character recognition tool for python. It will recognize and read the text embedded in images.

Disclaimer: TechHub.Training is only using the pytesseract library for a specific use case and does not endorse or receive any compensation from pytesseract or its developers. Any references to the pytesseract library are solely for the purpose of describing the use case and should not be construed as an endorsement or affiliation with the library or its developers.

Additionally, you will need to install the tesseract executable in order to enable the library. This is a little different from a normal pip install. This challenge was created on windows and will walk through the installation for that software. macOS requires a different installation process using MacPorts or Homebrew, which will not be covered in this challenge.

  1. Download the Tessearct installer for Windows here.
  2. Additional installation instructions for macOS can be found here.
  3. Open tesseract-ocr-setup executable.
  4. When you get to the “Choose Components” screen, the only items that should have check marks are ScrollView, Training tools, Shortcuts creation, and Language data. Installing the additional language data will dramatically increase the size of this application, and it is unnecessary for the use of this challenge.
  5. On the next screen, it will ask you to choose the Install Location. Leave the file path unchanged but copy the path so that you can find it quickly in your file explorer in the next steps.
  6. Install tesseract.
  7. A new folder named Tesseract_OCR should be in your Program Files folder in your PC. You can paste the file path you copied in step 5 into your file explorer to jump to it.
  8. Now that the executable is downloaded, you can import the library into python by typing:
    !pip install pytesseract 
    
    Now that pytesseract is installed, you can import the library and the executable. To do this, you will need to call both the package and set the file path for the executable (illustrated below). Inside the Tesseract-OCR folder is an executable called “tesseract.exe”, which is found at the file path that you copied in step 7. The file path to this executable is what you’ll need to assign to call it into python. Here is an example of how to call in the library and executable into your python program:
    #Import pytesseract and set the location of the executable in your computer 
    
    import pytesseract 
    pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe' 
    
  9. You are now ready to use pytesseract to extract values within python. Make sure you have downloaded and extracted the zip folder and the Challenge42_Inventory_Check_Part2.xlsx file. Within the Excel workbook, you’ll find information regarding the inventory count, and file path to the evidence for each row sample. Your python program will use each of those file paths to find the matching picture of a barcode and make sure that the code in the picture matches the code in the Excel workbook.

Your code should follow these next steps to complete the challenge:

  1. Import the libraries listed at the beginning of the instructions, import pytesseract, and set the location of the tesseract executable. (Completed in step 8)
  2. Create a variable named “folder_directory. Set the variable equal to the file path to the Challenge folder. The file path for the “folder_directory” variable should end with “\Challenge42_Innovatech_Publishing_Part2\”. This way, you can append it directly to the value in the “Support FilePath Ext.” column of the Excel file.
  3. Read in the excel file as a data frame and make the UPC column a string type
  4. Make a new column called “PNG_Path” that combines your folder_directory variable with the Support FilePath Extension from the data frame
  5. Initialize an empty list. This is where you will store the PNG values extracted using tesseract.
  6. pytesseract will extract the string values of the .png file given in the argument. You can assign that value to a variable using the following code:
    value = pytesseract.image_to_string(PNG_Path).strip()
    
  7. Iterate through each value of the PNG_Path column in the data frame and use that as the “PNG_Path” argument for the pytesseract.image_to_string module.
  8. Append the values taken from each .png file to the list you made previously.
  9. Append the list to the data frame as a new column.
  10. Create a new data frame that contains only the rows where the UPC number didn’t match the value extracted from the barcode.
  11. Print the new data frame to an excel file.

Data Files
Provide description if necessary. Change Extension to whatever the file extension should be: csv, xlsx, etc.

Suggestions and Hints

When using backslashes, python treats those as special characters. To treat it as a normal backslash, you will have to type two backslashes. An easy way to enter your file paths into pyton without fixing every backslash is to use an input statement.

You can append a list as a column to a data frame with this code:

df = df.assign(Column_Name=List_Name)

Your new dataframe should have two rows where the UPC didn’t match the PNG value: BBB-001 and CCC-001

Solution

Solution Code
pip install pytesseract 
import pandas as pd 

#Import pytesseract and set the location of the executible in your computer 

import pytesseract 
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe' 

# #Get the filepath to the Challenge folder 

folder_directory = input('Please enter the file path for the Client Audit Folder.')+'\\' 

#Read in the excel file as a dataframe and make the UPC column a string type 

file_path = input('Plase enter the file path for the Excel file.')+'\\' 

df = pd.read_excel(file_path+'ChallengeX_Inventory_Check_Part2.xlsx') 

df['UPC'] = df['UPC'].astype(str) 

#Make a new column that combines your folder_directory with the Support FilePath Ext from the dataframe 

df['PNG_Path'] = folder_directory+df['Support_Filepath_Ext'] 

#Initialize an empty list 

PNG_Values = [] 

#Iterate through each PNG_Path and use that as the argument for the pytesseract.image_to_string module. 

for i,row in df.iterrows(): 

    path = row['PNG_Path'] 

    value = pytesseract.image_to_string(path).strip()

    #Append the values taken from each .png and append them to the list you made previously 

    PNG_Values.append(value) 

#Append your list to the dataframe 

df = df.assign(PNG_Value=PNG_Values) 

#Create a new dataframe that contains only the rows where the UPC number didn't match the value extracted from the barcode 

df2=df[df['UPC']!=df['PNG_Value']] 

#Print the dataframe to an excel file 

df2.to_excel(file_path+'ChallengeX_Solution.xlsx',index=None) 

Challenge42_Solution.ipynb
Challenge42_Solution.xlsx
Solution Video: Challenge 42|PYTHON – Inventory Check Part 2

Hi! I was trying to do this challenge, and the Challenge42_Inventory_Check_Part2 file isn’t actually the dataset needed to complete the problem. It’s the solution file. I spent some time learning how tesseract works anyway, so it wasn’t a total loss!