Wrangling Inconsistent Data: How to Use Automation to Create Governance

143 views
0

Great minds think differently, not alike. The way you perceive and file data in your brain is going to be largely different than anyone else. Physical data is no different. Data created by and entered by an individual will be individualized.

This is one of the reasons data governance is so critical, as Sarah Ashby recently discussed. This individualized data allows a margin of “human error” to potentially be included in your data. Until data governance or consistency is established, how do you automate and consolidate data that may contain inconsistency?

Combining data

Functions allow us to perform mathematical, statistical, and logical operations in Excel. However, a set of arguments or specific values are required. Functions work well with the concept of data governance and automation with computer output data, but functions are not as user-friendly when factoring in “human error.”

One of the most common internal requests or questions I get asked is “Can you combine this data?” In most instances, the person making the request has exhausted most of the options they can think of short of manually merging multiple sheets, rows, and columns of data. Manually merging this type of data can take hours or even days to complete. I am always glad when asked before they try to manually complete these merges. The sooner you look into alternatives or use your resources, such as data analysts like me, the more time can be saved.

Define an absolute point of reference

Although there are functions in Excel specifically created to assist in merging data, that is not always the more convenient way, especially with larger sets and multiple sheets. I am going to walk you through a relatively user-friendly way to go about merging your data, but first, we need one thing: an absolute point of reference.

I would define an absolute point of reference in data as a unique common denominator among the sets of data, this could be an account number, an individualized ID, an employee ID, a unique location code, etc. Basically, you need an identifier that can be found in both sets of data. An example would be if my ID was 003 I can merge information from multiple sheets using this point of reference because I am the only person with this 003 ID number.

Using Power Query

Once you have found your point of reference, you can use Power Query to merge multiple sets of data. Microsoft defines Power Query as “a data transformation and data preparation engine. Power Query comes with a graphical interface for getting data from sources and a Power Query Editor for applying transformations.” Although this sounds intimidating, I guarantee it is much more time-efficient than manually merging all of your data. Power Query can be utilized for more than just merging data, but in this case, we are going to explore its data merging functionality.

Power Query is also known as “Get and Transfer Data.” This is what it is labeled as in Microsoft Excel under the Data tab. Using Power Query you can import data you have created, connect data from an external source such as a website, or shape and transform your data. The data can be refreshed once it is set up and could potentially save hours of time recreating and/or merging spreadsheets. Basically, once merged, you can add on to or add new data to your worksheet and hit a button to refresh all the data and how it’s merged.

How to merge

When merging multiple sheets of data using Power Query for the first time, I’d recommend adding both of your sheets into one Excel workbook. Although merging through Power Query can be done without this step, it will help to get all of your data in one place for practice. Next, you will need to create a connection between all of the sheets you want to use in the merge. This gives Power Query access to the tables you want to work in and allows you to organize and merge the data later.

These connections are created in the Data tab, in the Get and Transform Data section. When on the sheet to which you would like to add a connection, click the “From Table/Range” option, prompting you to verify the range of your first set of data. Once confirmed, Power Query Editor will automatically pop up. We are only creating a connection and not changing any data here so you can rename your table if you would like, but you will want to close and load to a connection by selecting the drop-down on the top left of the editor under “close and load,” you will select the “Close & Load To…” option.

When selecting how you want to view this data, select “Only Create Connection.” You have created a connection to your first set of data, you can repeat this process for all of the sheets you would like to use in the merge.

Fuzzy matches account for human error

Once you have your connections created, you can right-click your data tables in the Queries and Connections pane and choose merge. Select your tables from the dropdowns, then click on the columns that contain the data’s point of reference. The bottom of the pop-up will show you how many matches were found, but we can get the data to match a little bit closer by using the fuzzy match capabilities. Fuzzy matches allow for “human error” in your data and can find the next best option by looking at the data and using logic to find the next best option. Where functions can find an exact value, fuzzy matches can find exact matches, matched with different cases, misspelled words, missing leading or following words, and much more depending on the similarity threshold you choose.

Some other factors that make Power Query fuzzy matches more user-friendly are the ability to limit the number of matches found, set the similarity intensity (defaulted to 80%), mix cases, and the ability to choose the different types of joining. Do you want to keep sheet 1 data only? Sheet 2? Do you want all data, both matched and not matched? Only point of reference columns or all columns? There are so many variations, it truly gives the best possible customization located in Excel.

Edit your data

After selecting how you want your data merged you can move on to the Power Query Editor. In the Power Query Editor you can name your new merged sheet. You can also expand the column from your chosen sheets. Delete the obsolete columns, filter out data you don’t need, move columns, and set validations and functions to columns of data. Excel does not allow you to drag and drop and reorganize data easily, but Power Query Editor gives you more freedom when re-organizing your data.

After you have established your new document, by closing and loading it to a table, you can then add or replace data on your other connection sheets and refresh. Using the refresh button will change all the data but recreate the same steps you completed when creating your original merge. The ultimate form of automation, there is no need to create it again and again.

Take control of your data

I find using Power Query especially helpful when merging multiple sheets of large sets of data containing individualized entries. Utilizing Power Query to process your data could allow you to merge your data and save you time recreating information and helping correct and establish data governance.

Once you know how to work through Power Query, you will have the know-how to upload data and statistics from the web, from other files, and from various other locations! Power Query is extremely user-friendly and there are many resources and how-to’s available for its many applications. The “fuzzy” match feature and Power Query in general are one of the most innovative automation tools in Excel. You can feel like an Excel expert just by familiarizing yourself with this feature!

Author

Your email address will not be published. Required fields are marked *