Creating a new Excel ‘Witbal’ Function (Enhancing Fuzzy Match)

Analytics and Automation
  • Pulkit Saraf
  • 23-11-2020
Creating a new Excel ‘Witbal’ Function (Enhancing Fuzzy Match)

I improved fuzzy logic for my specific scenario!

At work, we were cleaning 1000s of addresses received from client for an analysis. The first time we did it manually so as to get a sense of the data.

We ran a concatenation of account name, account address, city and state through a Google Maps API (this if i remember correctly) to get an accurate listing. Sometimes the listings were good, sometimes same and other times, worse than the original input.

Four employees would then spend a week trying to manually check each record to decide between original address or the Google API enriched address for better information quality.

We used fuzzy match for easy initial filtering. If there’s a high match, we’d keep the address line with higher character count. That covered around 10% of the records. Rest 90% we checked manually.

In doing so, I noticed something. Fuzzy is good for strings with large character count such as 300 or more maybe. In short, higher the character count, the better the fuzzy matching. But it falls surprisingly short when covering strings with lesser character count such as in an address line.


Let’s take an example.

In the following scenario, fuzzy match will return an extremely low matching rate.

Example 1: Client given address: A1 Building, India

Google API enriched address: A1 Building, B2 Street, C3 Block, D4 State, India

Since fuzzy match will mark this as low match %, this record will fall into manual check bucket for the employee. However, this is not something that should require human intervention. So, I decided to automate this particular situation and built the Which IThe Better Address Line (witbal) function.

It takes two addresses as input and outputs the richer address.

Here’s how it works

I score each address based on unique words in the address line.

This is different than just finding character count because that’s not always the right answer. (Example 2: Client address: “A1 Building , B2 State, India” has higher character count than Google enriched address: “A1, B2, C3” but we would like to pick the latter)

Step 1: Build a repository of frequently repeated non unique address words based on your data — street, road, building, floor, apartment, zone etc. These are generic address words.

Step 2: Calculate all words common to both the address lines after removing the generic words. These are common words.

Step 3: Calculate the score using the following formula:

Witbal Score = (Total word count) – (Generic words) – (Common words)

If the Witbal function is able to see a significant difference in the scores of the two address, it will return the one with the higher score and ultimately richer address line.

Let’s apply this back to the two examples we discussed earlier.

Verification

Example 1: Witbal score for client address is [3–2–1] = 0. Witbal score for google enriched address is [9–5–1] = 3. Hence, even when fuzzy returns a low score, Witbal will confidently (3 > 0) return the second address line.

Example 2: Witbal score for client address is [5–3–2] = 0. Witbal score for google enriched address is [3–0–2] = 1. Witbal again returns the richer address correctly when both fuzzy and character count will have marked the address for manual check.

With the use of the Witbal function, I ended up saving 2 employees’ full time and 3 days from the other 2 employees’ time. In total, savings of 100+ hours.

CODE

This is an old version, so it may require slight modifications. Also, RemoveDupesDict function used in the code was copied from stack overflow.

txt1 = address1
txt2 = address2

arr1 = Array(“;”, “ “, “.”, “-”, “:”, “_”, “,”)
j = 0
For j = LBound(arr1) To UBound(arr1)
 txt1 = Replace(txt1, arr1(j), “$$”)
 txt2 = Replace(txt2, arr1(j), “$$”)
Next j

splitadd1 = Split(txt1, “$$”)
splitadd2 = Split(txt2, “$$”)

sdadd1 = RemoveDupesDict(splitadd1)
sdadd2 = RemoveDupesDict(splitadd2)

k = 0
For loop1 = LBound(sdadd1) To UBound(sdadd1)
For loop2 = LBound(sdadd2) To UBound(sdadd2)
If sdadd1(loop1) = sdadd2(loop2) Then
k = k + 1
End If
Next loop2
Next loop1

arrcom = Array(“NO”, “BUILDING”, “NUMBER”, “COUNTY”, “DISTRICT”, “ROAD”, “Rd”, “STREET”, “PARK”, “ZONE”, “FLOOR”)
com1 = 0
com2 = 0

For loop1 = LBound(sdadd1) To UBound(sdadd1)
For loop2 = LBound(arrcom) To UBound(arrcom)
If sdadd1(loop1) = arrcom(loop2) Then
com1 = com1 + 1
End If
Next loop2
Next loop1

For loop1 = LBound(sdadd2) To UBound(sdadd2)
For loop2 = LBound(arrcom) To UBound(arrcom)
If sdadd2(loop1) = arrcom(loop2) Then
com2 = com2 + 1
End If
Next loop2
Next loop1

score1 = UBound(sdadd1) — LBound(sdadd1) + 1 — k — com1
score1 = UBound(sdadd2) — LBound(sdadd2) + 1 — k — com2

If score2 > score1 Then
witba = address2
ElseIf score1 > score2 Then
witba = address1
ElseIf score1 = score2 Then
witba = “same score”
End If

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
arrcom = Array(“NO”, “BUILDING”, “NUMBER”, “COUNTY”, “DISTRICT”, “ROAD”, “Rd”, “STREET”, “PARK”, “ZONE”, “FLOOR”)
com1 = 0
com2 = 0

For loop1 = LBound(sdadd1) To UBound(sdadd1)
For loop2 = LBound(arrcom) To UBound(arrcom)
If sdadd1(loop1) = arrcom(loop2) Then
com1 = com1 + 1
End If
Next loop2
Next loop1

For loop1 = LBound(sdadd2) To UBound(sdadd2)
For loop2 = LBound(arrcom) To UBound(arrcom)
If sdadd2(loop1) = arrcom(loop2) Then
com2 = com2 + 1
End If
Next loop2
Next loop1

score1 = UBound(sdadd1) — LBound(sdadd1) + 1 — (k + 1) — com1
score2 = UBound(sdadd2) — LBound(sdadd2) + 1 — (k + 1) — com2

If score2 > score1 Then
witba = address2
ElseIf score1 > score2 Then
witba = address1
ElseIf score1 = score2 Then
witba = “same score”
End If