A Simple Algorithm To Compute How Similar Strings Are In Python

Liu Zuo Lin
5 min read · Nov 19, 2021

In Python, we can test whether two strings match exactly using the equality operator ==:

print("apple" == "apple") # True
print("apple" == "orange") # False

However, things get more complicated when we want to find strings that are similar but not identical:

print("apple" == "apples") # False
print("orange" == "ornage") # False
print("pear" == "paer") # False

As humans, we can tell that "apple" and "apples" are very similar, that "orange" and "ornage" are very similar (likely a typo), and that the same goes for "pear" and "paer" (another typo). Unfortunately, == only tells us that the strings are not equal; it gives no indication of how close they are.

words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]

Given a list of words, we want to check which words are closer to one another (we’d expect "apple", "apples" and "snapple" to be pretty close!). In this article, we’ll explore a simple method to do this.

Cosine Similarity Formula

cos(a, b) = dot(a, b) / magnitude(a) / magnitude(b)

Here, a and b are vectors, dot is their dot product, and magnitude is the length of a vector. Don’t worry if these sound foreign to…
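To make the formula concrete, here is a minimal sketch of how it could be applied to strings. The text above does not show how a string becomes a vector, so this sketch assumes one common choice: counting how often each character appears. The helper names char_vector and cosine_similarity are illustrative and not taken from the article.

```python
import math
from collections import Counter


def char_vector(word):
    """Map a word to a sparse vector of character counts (an assumed encoding)."""
    return Counter(word)


def dot(a, b):
    """Dot product of two sparse count vectors."""
    # Counter returns 0 for characters that are missing from b.
    return sum(count * b[ch] for ch, count in a.items())


def magnitude(a):
    """Euclidean length of a sparse count vector."""
    return math.sqrt(sum(count * count for count in a.values()))


def cosine_similarity(word_a, word_b):
    """cos(a, b) = dot(a, b) / magnitude(a) / magnitude(b)."""
    a, b = char_vector(word_a), char_vector(word_b)
    if not a or not b:
        # An empty string has zero magnitude; treat it as having no similarity.
        return 0.0
    return dot(a, b) / magnitude(a) / magnitude(b)


words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]
for word in words:
    print(f"apple vs {word}: {cosine_similarity('apple', word):.3f}")
```

With this encoding, "apples" and "snapple" score much closer to "apple" than "orange" does, and "zzzzz" scores 0 because it shares no characters. One caveat of plain character counts is that letter order is ignored, so "orange" and "ornage" come out as identical; whether that is desirable depends on the use case.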

