A Simple Algorithm To Compute How Similar Strings Are In Python
In Python, we can test whether two strings are an exact match using the equality operator ==:
print("apple" == "apple") # True
print("apple" == "orange") # False
However, things get more complicated when we want to find strings that are close but not identical:
print("apple" == "apples") # False
print("orange" == "ornage") # False
print("pear" == "paer") # False
As humans, we can tell that "apple" and "apples" are very similar, that "orange" and "ornage" are very similar (likely a typo), and that the same goes for "pear" and "paer" (another typo). Unfortunately, == only tells us that these strings are not identical; it says nothing about how close they are.
words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]
Given a list of words, we want to check which words are closer to one another (we’re expecting "apple", "apples" and "snapple" to be pretty close!). In this article, we’ll explore a simple method to do this.
Cosine Similarity Formula
cos(a, b) = dot(a, b) / magnitude(a) / magnitude(b)
a and b are vectors, dot refers to the dot product, and magnitude refers to the magnitude of the vector. Don’t worry if these sound foreign to…
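To make the formula concrete, here is a minimal sketch of it in plain Python, working on lists of numbers. The function names dot, magnitude and cosine_similarity simply mirror the terms in the formula. Returning 0.0 for a zero-length vector is an assumption made here to avoid dividing by zero. How a string is turned into a vector is a separate step that this sketch does not cover.

```python
import math

def dot(a, b):
    """Dot product: multiply matching elements, then add up the results."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    """Magnitude (length) of a vector: square root of its dot product with itself."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """cos(a, b) = dot(a, b) / magnitude(a) / magnitude(b)"""
    mag_a, mag_b = magnitude(a), magnitude(b)
    if mag_a == 0 or mag_b == 0:
        # Assumption: a zero vector is treated as having no similarity to anything.
        return 0.0
    return dot(a, b) / mag_a / mag_b

print(cosine_similarity([1, 0], [1, 0]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (perpendicular)
```

For vectors with no negative entries, the result falls between 0 and 1, and the closer it is to 1, the more alike the two vectors are.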