232|ALTERYX – Normalization Station

BYU Student Author: @Kyle_Nilsen
Reviewers: @Carter_Lee, @Dalling_Gardner
Estimated Time to Solve: 40 Minutes

We provide the solution to this challenge using:

  • Alteryx

Need a program? Click here.

Overview
You are considering applying for a high-achieving accounting PhD program. To ensure your success, you decide to perform some preliminary investigation into what research topics and methodologies have the highest concentration of publications in the professional world. You stumble on a flat file that includes much of this information, however, the data is not normalized. Because of this, it will be inefficient to navigate the data within your Access database. To facilitate your research, you decide to take the time to normalize the data within Alteryx by creating the necessary tables and relationships. Your goal is ultimately to produce a recreation of the flat file with updated IDs for each significant component.

Instructions

  1. Download the Authors_Data.xlsx and Challenge232_NormalizationStation.yxmd files below.
  2. Open the Alteryx workflow and upload the correct file path to the input tool (this is the Authors_Data file you just downloaded).
  3. Once this file is input, run the flow to load the data into the environment. Once here, take a moment to explore and analyze the data. You will notice there are empty, labeled containers in the flow. These are the “tables” you will create for the normalized relationships. You might find it helpful to draw out the tables by hand before trying to design them in Alteryx.
  4. Create third normal form tables for Method, Topic, Journal, and University. After doing this, update the flat file with the new IDs.
  5. Create second normal form tables for Article and Author. After doing this, update the flat file with the new IDs.
  6. Create second normal form tables for ArticleMethod, ArticleTopic, and AuthorArticle. These tables are all concatenated primary keys, so you will not give them unique IDs or use them to update the flat file.
  7. Output the normalized, updated dataset to a csv file called “PublishingsNormalized.csv”

Data Files

Suggestions and Hints
  1. In order to build out the normalization effectively, you need to understand the relationships between each of the columns in the flat file and assign only what is necessary to each table (hence, first, second, third normal form). If you have not done any normalization before, you can read about it here Database Normalization – Normal Forms 1nf 2nf 3nf Table Examples
  2. The first (and most useful) step for each table is to first determine what elements from the flat file are relevant and using a select tool to filter out the irrelevant columns. Then, each table follows a similar pattern of finding unique values and assigning them a unique ID.
  3. The University table is one of the exceptions to the above. Authors have both a “PhdUnivName” and a “WorkUnivName.” Do one select tool for each, then use a union tool, and then you can find the unique values before assigning each a unique ID.
  4. Use a join tool to merge the IDs back into the flat file’s data. You should use multiple in succession before moving on to the next section (i.e., between steps 4 and 5 in the instructions).
  5. This is the relationship diagram I created for the tables:

Solution