ngram_frequency_similarity
Call or Deploy ngram_frequency_similarity?
✅ You can call this ngram_frequency_similarity bigfunction directly from your Google Cloud Project (no install required).
- This ngram_frequency_similarity function is deployed in the bigfunctions GCP project in 39 datasets, one for each of the 39 BigQuery regions. Use the dataset in the same region as your own datasets (otherwise you may get a function-not-found error).
- The function is public, so it can be called by anyone. Just copy and paste the examples below into your BigQuery console. It just works!
- You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions. This is particularly useful if you want to create private functions (for example calling your internal APIs). Discover the framework
Public BigFunctions Datasets:
Region | Dataset |
---|---|
eu | bigfunctions.eu |
us | bigfunctions.us |
europe-west1 | bigfunctions.europe_west1 |
asia-east1 | bigfunctions.asia_east1 |
... | ... |
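As a small illustration of picking the dataset that matches your region, the sketch below builds a fully qualified function name from a region string. The helper name `bigfunctions_dataset` is hypothetical, and the naming rule (lowercase the region, replace hyphens with underscores) is only inferred from the four rows listed above, so check it against the full dataset list before relying on it.

```python
def bigfunctions_dataset(region: str) -> str:
    """Guess the public dataset for a BigQuery region.

    Assumption: datasets follow the pattern seen in the table above,
    i.e. 'europe-west1' -> 'bigfunctions.europe_west1'.
    """
    return "bigfunctions." + region.strip().lower().replace("-", "_")


def qualified_function(region: str, function_name: str) -> str:
    """Return the dataset-qualified function name to use in a query."""
    return f"{bigfunctions_dataset(region)}.{function_name}"


print(qualified_function("europe-west1", "ngram_frequency_similarity"))
```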
Description
Signature
ngram_frequency_similarity(string1, string2, n)
Description
Calculates n-gram similarity between two strings
The n-gram comparison algorithm measures the similarity between two strings by analyzing their subsequences of n consecutive characters, called n-grams. It involves the following steps:
- N-gram Extraction: Divide each input string into overlapping sequences of n characters.
- Counting N-grams: Count the occurrences of each unique n-gram in both strings.
- Calculating Similarity: Compare the n-gram counts between the two strings and compute a similarity score. Here, the score is the cosine similarity of the two sets of n-gram counts.
The above description is taken from Yassine EL KHAL's article.
Example of n-grams: the sentence Lorem ipsum dolor sit amet gives the following 4-grams: ['LORE', 'OREM', 'REM ', 'EM I', 'M IP', ' IPS', 'IPSU', ...]
The returned similarity score is between 0 and 1, where 1 indicates the highest similarity.
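The three steps described above can be sketched locally in Python. This is a minimal illustration, not the deployed function's source: the names `ngrams` and `ngram_cosine_similarity` are hypothetical, upper-casing is an assumption inferred from the 4-gram example, and returning 0 for strings shorter than n is also an assumption.

```python
import math
from collections import Counter


def ngrams(text: str, n: int) -> list[str]:
    """Step 1: split text into overlapping sequences of n characters.

    Assumption: text is upper-cased first, as the 4-gram example suggests.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    text = text.upper()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_cosine_similarity(string1: str, string2: str, n: int) -> float:
    """Steps 2 and 3: count n-grams, then take the cosine of the counts."""
    counts1 = Counter(ngrams(string1, n))
    counts2 = Counter(ngrams(string2, n))
    # Assumption: a string shorter than n has no n-grams, so return 0.
    if not counts1 or not counts2:
        return 0.0
    # Dot product over shared n-grams (a Counter returns 0 for missing keys).
    dot = sum(count * counts2[gram] for gram, count in counts1.items())
    norm1 = sum(c * c for c in counts1.values())
    norm2 = sum(c * c for c in counts2.values())
    return dot / math.sqrt(norm1 * norm2)


print(round(ngram_cosine_similarity("hello world", "world hello", 2), 2))  # 0.8
print(round(ngram_cosine_similarity("The quick brown fox", "The quick brown dog", 3), 2))  # 0.82
```

With n=2, 'hello world' and 'world hello' each yield 10 distinct 2-grams and share 8 of them, so the cosine is 8 / 10 = 0.8, which agrees with the first example on this page.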
Examples
1. Calculate n-gram frequency similarity between two simple strings with n=2
select bigfunctions.eu.ngram_frequency_similarity('hello world', 'world hello', 2)
select bigfunctions.us.ngram_frequency_similarity('hello world', 'world hello', 2)
select bigfunctions.europe_west1.ngram_frequency_similarity('hello world', 'world hello', 2)
+------------+
| similarity |
+------------+
| 0.8        |
+------------+
2. Calculate n-gram frequency similarity between two phrases with n=3
select bigfunctions.eu.ngram_frequency_similarity('The quick brown fox', 'The quick brown dog', 3)
select bigfunctions.us.ngram_frequency_similarity('The quick brown fox', 'The quick brown dog', 3)
select bigfunctions.europe_west1.ngram_frequency_similarity('The quick brown fox', 'The quick brown dog', 3)
+------------+
| similarity |
+------------+
| 0.82       |
+------------+
3. Calculate n-gram frequency similarity between two sentences with n=4
select bigfunctions.eu.ngram_frequency_similarity('Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 'Lorem ipsum dolor sit amet, consectetur adipiscing.', 4)
select bigfunctions.us.ngram_frequency_similarity('Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 'Lorem ipsum dolor sit amet, consectetur adipiscing.', 4)
select bigfunctions.europe_west1.ngram_frequency_similarity('Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 'Lorem ipsum dolor sit amet, consectetur adipiscing.', 4)
+------------+
| similarity |
+------------+
| 0.93       |
+------------+
Need help using ngram_frequency_similarity?
The community can help! Join the conversation on Slack.
For professional support, don't hesitate to chat with us.
Found a bug using ngram_frequency_similarity?
If the function does not work as expected, please:
- report a bug so that it can be improved, or
- open a discussion with the community on Slack.
For professional support, don't hesitate to chat with us.
Use cases
This ngram_frequency_similarity function is useful for several text analysis and data matching tasks where you want to determine how similar two strings are based on the sequences of characters they contain. Here are a few use cases:
1. Plagiarism Detection: Compare student submissions or documents to identify potential plagiarism by calculating the n-gram similarity. A high similarity score could indicate copied content.
2. Duplicate Detection: Identify duplicate records in a database, even if they have slight variations in wording or spelling. For example, finding near-identical product descriptions or customer addresses.
3. Fuzzy Matching: Match records that are not exactly the same but are similar enough to be considered a potential match. This is useful in situations where data entry errors or variations in naming conventions might exist. Examples include:
   - Matching customer names from different sources.
   - Matching product names across different retailers.
   - Finding similar articles or news stories.
4. Recommendation Systems: Suggest related products or content based on the similarity of their descriptions or titles. If two products have a high n-gram similarity, they might be relevant to the same customer.
5. Spell Checking/Auto-Correction: Suggest possible corrections for misspelled words by finding words with high n-gram similarity to the incorrect input.
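6. Information Retrieval: Improve search relevance by identifying documents whose text is close to a search query at the character level, even if the exact words differ (for example because of typos or word variants). Note that n-gram similarity compares spelling, not meaning.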
7. Text Classification: Group similar texts together based on their n-gram profiles. This could be used to categorize documents, emails, or social media posts.
Example Scenario (Fuzzy Matching):
Imagine an e-commerce site that wants to prevent duplicate product listings. A seller might try to list a "Samsung Galaxy S23" slightly differently, like "Samsung Galaxy S23 Smartphone" or "New Samsung Galaxy S23". By using ngram_frequency_similarity with an appropriate n value, the system can detect these near-duplicates and flag them for review, even though the strings aren't identical. This prevents redundant listings and ensures data quality.
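The scenario above can be sketched as follows. This is a local illustration only: `similarity` is a hypothetical stand-in for the cosine similarity of case-folded n-gram counts (an assumption about the deployed function's behavior), and both n=3 and the 0.7 threshold are arbitrary values chosen for the example, not recommendations from the source.

```python
import math
from collections import Counter


def profile(text: str, n: int) -> Counter:
    """Count the overlapping n-grams of the upper-cased text."""
    text = text.upper()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def similarity(a: str, b: str, n: int) -> float:
    """Hypothetical local stand-in: cosine similarity of n-gram counts."""
    ca, cb = profile(a, n), profile(b, n)
    if not ca or not cb:
        return 0.0
    dot = sum(count * cb[gram] for gram, count in ca.items())
    norm_a = sum(c * c for c in ca.values())
    norm_b = sum(c * c for c in cb.values())
    return dot / math.sqrt(norm_a * norm_b)


def find_near_duplicates(existing, candidates, n=3, threshold=0.7):
    """Return the candidate titles whose score reaches the threshold."""
    return [c for c in candidates if similarity(existing, c, n) >= threshold]


new_listings = [
    "Samsung Galaxy S23 Smartphone",
    "New Samsung Galaxy S23",
    "Apple iPhone 15",
]
flagged = find_near_duplicates("Samsung Galaxy S23", new_listings)
print(flagged)  # ['Samsung Galaxy S23 Smartphone', 'New Samsung Galaxy S23']
```

In practice the threshold would need tuning on real listings: a lower value flags more candidates for review, and a higher value misses more near-duplicates.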
Spread the word
BigFunctions is fully open-source. Help make it a success by spreading the word!