ngram_frequency_similarity

ngram_frequency_similarity(string1, string2, n)

Description

Calculates the n-gram frequency similarity between two strings.

The n-gram comparison algorithm measures the similarity between two strings by analyzing their subsequences of n consecutive characters, called n-grams. It involves the following steps:

1. N-gram Extraction: Divide each input string into overlapping sequences of n characters.
2. Counting N-grams: Count the occurrences of each unique n-gram in both strings.
3. Calculating Similarity: Compare the n-gram counts of the two strings and compute a similarity score. Here, the score is the cosine similarity of the two sets of counts.
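The three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the deployed implementation: the function names `ngrams` and `ngram_cosine_similarity` are hypothetical, the upper-casing is an assumption based on the upper-case 4-grams shown in this page's example, and returning 0.0 for strings shorter than n is also an assumption.

```python
from collections import Counter
from math import sqrt


def ngrams(text: str, n: int) -> list[str]:
    """Step 1: split text into overlapping n-character sequences."""
    # Assumption: upper-case first, mirroring the upper-case 4-grams in the docs.
    text = text.upper()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_cosine_similarity(string1: str, string2: str, n: int) -> float:
    # Step 2: count the occurrences of each unique n-gram in both strings.
    counts1 = Counter(ngrams(string1, n))
    counts2 = Counter(ngrams(string2, n))
    # Step 3: cosine similarity between the two count vectors.
    dot = sum(count * counts2[gram] for gram, count in counts1.items())
    norm1 = sqrt(sum(c * c for c in counts1.values()))
    norm2 = sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        # Assumption: a string shorter than n has no n-grams, so return 0.0.
        return 0.0
    return dot / (norm1 * norm2)


print(round(ngram_cosine_similarity("hello world", "world hello", 2), 2))  # 0.8
```

Rounded to two decimals, this sketch gives 0.8 for "hello world" versus "world hello" with n=2, the same value as the first example on this page.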

The above description is taken from Yassine EL KHAL's article.

Example of n-grams: the sentence `Lorem ipsum dolor sit amet` yields the following 4-grams: ['LORE', 'OREM', 'REM ', 'EM I', 'M IP', ' IPS', 'IPSU', ...]
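To reproduce this list locally, a sliding window over the upper-cased sentence is enough. The helper name `char_ngrams` is hypothetical, and the upper-casing is inferred from the sample output rather than confirmed by the source.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    # Assumption: upper-case the input, as the sample 4-grams suggest.
    text = text.upper()
    # A string of length L has L - n + 1 overlapping n-grams (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = char_ngrams("Lorem ipsum dolor sit amet", 4)
print(grams[:7])   # ['LORE', 'OREM', 'REM ', 'EM I', 'M IP', ' IPS', 'IPSU']
print(len(grams))  # 23 (the sentence has 26 characters)
```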

The returned similarity score lies between 0 and 1, where 1 indicates the highest similarity.
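Written out, and assuming the cosine is taken over raw n-gram counts, the score has the form below. The symbols are notation introduced here for illustration: $c_1(g)$ and $c_2(g)$ are the number of times n-gram $g$ occurs in each string, and $G$ is the set of n-grams occurring in either string.

```latex
\[
\operatorname{sim}(s_1, s_2) =
\frac{\sum_{g \in G} c_1(g)\, c_2(g)}
     {\sqrt{\sum_{g \in G} c_1(g)^2}\;\sqrt{\sum_{g \in G} c_2(g)^2}}
\]
```

Because counts are never negative, the score cannot fall below 0. It is 0 when the strings share no n-gram and 1 when their count vectors are proportional, as for identical strings. For "hello world" and "world hello" with n=2, each string has 10 distinct bigrams occurring once and 8 of them are shared, so the score is 8 / (√10 · √10) = 0.8.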

Usage

Call or Deploy ngram_frequency_similarity?

Call ngram_frequency_similarity directly

The easiest way to use bigfunctions

The ngram_frequency_similarity function is deployed in 39 public datasets, one for each of the 39 BigQuery regions.
Anyone can call it: just copy and paste the examples below into your BigQuery console. It just works!
(Use the dataset in the same region as your own datasets; otherwise you may get a "function not found" error.)

Public BigFunctions Datasets

Region	Dataset
`eu`	`bigfunctions.eu`
`us`	`bigfunctions.us`
`europe-west1`	`bigfunctions.europe_west1`
`asia-east1`	`bigfunctions.asia_east1`
...	...

Deploy ngram_frequency_similarity in your project

Why deploy?

You may prefer to deploy ngram_frequency_similarity in your own project to build and manage your own catalog of functions.
This is particularly useful if you want to create private functions (for example calling your internal APIs).
Get started by reading the framework page.

Deployment

The ngram_frequency_similarity function can be deployed with:

pip install bigfunctions
bigfun get ngram_frequency_similarity
bigfun deploy ngram_frequency_similarity

Examples

1. Calculate n-gram frequency similarity between two simple strings with n=2

Queries for the EU, US, and europe-west1 regions:

select bigfunctions.eu.ngram_frequency_similarity("hello world", "world hello", 2)

select bigfunctions.us.ngram_frequency_similarity("hello world", "world hello", 2)

select bigfunctions.europe_west1.ngram_frequency_similarity("hello world", "world hello", 2)

+------------+
| similarity |
+------------+
| 0.8        |
+------------+

2. Calculate n-gram frequency similarity between two phrases with n=3

Queries for the EU, US, and europe-west1 regions:

select bigfunctions.eu.ngram_frequency_similarity("The quick brown fox", "The quick brown dog", 3)

select bigfunctions.us.ngram_frequency_similarity("The quick brown fox", "The quick brown dog", 3)

select bigfunctions.europe_west1.ngram_frequency_similarity("The quick brown fox", "The quick brown dog", 3)

+------------+
| similarity |
+------------+
| 0.82       |
+------------+

3. Calculate n-gram frequency similarity between two sentences with n=4

Queries for the EU, US, and europe-west1 regions:

select bigfunctions.eu.ngram_frequency_similarity("Lorem ipsum dolor sit amet, consectetur adipiscing elit.", "Lorem ipsum dolor sit amet, consectetur adipiscing.", 4)

select bigfunctions.us.ngram_frequency_similarity("Lorem ipsum dolor sit amet, consectetur adipiscing elit.", "Lorem ipsum dolor sit amet, consectetur adipiscing.", 4)

select bigfunctions.europe_west1.ngram_frequency_similarity("Lorem ipsum dolor sit amet, consectetur adipiscing elit.", "Lorem ipsum dolor sit amet, consectetur adipiscing.", 4)

+------------+
| similarity |
+------------+
| 0.93       |
+------------+

Use cases

This ngram_frequency_similarity function is useful for several text analysis and data matching tasks where you want to determine how similar two strings are based on the sequences of characters they contain. Here are a few use cases:

1. Plagiarism Detection: Compare student submissions or documents to identify potential plagiarism by calculating the n-gram similarity. A high similarity score could indicate copied content.

2. Duplicate Detection: Identify duplicate records in a database, even if they have slight variations in wording or spelling. For example, finding near-identical product descriptions or customer addresses.

3. Fuzzy Matching: Match records that are not exactly the same but are similar enough to be considered a potential match. This is useful in situations where data entry errors or variations in naming conventions might exist. Examples include:
   * Matching customer names from different sources.
   * Matching product names across different retailers.
   * Finding similar articles or news stories.

4. Recommendation Systems: Suggest related products or content based on the similarity of their descriptions or titles. If two products have a high n-gram similarity, they might be relevant to the same customer.

5. Spell Checking/Auto-Correction: Suggest possible corrections for misspelled words by finding words with high n-gram similarity to the incorrect input.

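6. Information Retrieval: Improve search relevance by identifying documents whose text is close to a search query at the character level, even if the exact words do not match (for example, because of misspellings or different word forms). Note that n-gram similarity compares character sequences, not meaning.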

7. Text Classification: Group similar texts together based on their n-gram profiles. This could be used to categorize documents, emails, or social media posts.

Example Scenario (Fuzzy Matching):

Imagine an e-commerce site that wants to prevent duplicate product listings. A seller might try to list a "Samsung Galaxy S23" slightly differently, like "Samsung Galaxy S23 Smartphone" or "New Samsung Galaxy S23". By using ngram_frequency_similarity with an appropriate n value, the system can detect these near-duplicates and flag them for review, even though the strings aren't identical. This prevents redundant listings and ensures data quality.
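A minimal local sketch of this scenario follows. Everything specific in it is an assumption made for illustration: `similarity` is a stand-in for the function (cosine similarity over upper-cased character n-gram counts), the extra "Apple iPhone 15" listing is invented sample data, and the choices n=3 and a 0.7 threshold are hypothetical values that would need tuning on real listings.

```python
from collections import Counter
from math import sqrt


def similarity(a: str, b: str, n: int) -> float:
    """Local stand-in: cosine similarity between character n-gram counts."""
    def counts(text: str) -> Counter:
        text = text.upper()  # assumption: case-insensitive comparison
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    ca, cb = counts(a), counts(b)
    dot = sum(v * cb[g] for g, v in ca.items())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


EXISTING_TITLE = "Samsung Galaxy S23"
NEW_LISTINGS = [
    "Samsung Galaxy S23 Smartphone",
    "New Samsung Galaxy S23",
    "Apple iPhone 15",  # invented, unrelated listing for contrast
]
THRESHOLD = 0.7  # hypothetical cut-off; tune it on real data


def flag_near_duplicates(existing, candidates, n=3, threshold=THRESHOLD):
    # Keep the candidates whose similarity to the existing title reaches the threshold.
    return [title for title in candidates if similarity(existing, title, n) >= threshold]


print(flag_near_duplicates(EXISTING_TITLE, NEW_LISTINGS))
# ['Samsung Galaxy S23 Smartphone', 'New Samsung Galaxy S23']
```

With these settings the two Samsung variants score roughly 0.77 and 0.89 against the existing title, while the unrelated listing scores 0. A lower threshold catches more variants at the cost of more false positives, so flagged listings are best sent to review rather than rejected automatically.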

Need help or Found a bug?

Get help using ngram_frequency_similarity

The community can help! Join the conversation on Slack.

We also provide professional support.

Report a bug about ngram_frequency_similarity

If the function does not work as expected, please

report a bug so that it can be improved,
or open a discussion with the community on Slack.

We also provide professional support.
