rare_values¶
rare_values(values, frequency_threshold)
Description¶
Returns rare_values
among array of values
This function computes the frequency of each value in values
array and returns the values which frequency is stricly below the given frequency_threshold
.
By returning rare values, this function can be used for anomaly detection in a variety of use cases.
Examples¶
Call or Deploy rare_values
?
Call rare_values
directly
The easiest way to use bigfunctions
rare_values
function is deployed in 39 public datasets for all of the 39 BigQuery regions.- It can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
- (You need to use the dataset in the same region as your datasets otherwise you may have a function not found error)
Public BigFunctions Datasets
Region | Dataset |
---|---|
eu |
bigfunctions.eu |
us |
bigfunctions.us |
europe-west1 |
bigfunctions.europe_west1 |
asia-east1 |
bigfunctions.asia_east1 |
... | ... |
Deploy rare_values
in your project
Why deploy?
- You may prefer to deploy
rare_values
in your own project to build and manage your own catalog of functions. - This is particularly useful if you want to create private functions (for example calling your internal APIs).
- Get started by reading the framework page
Deployment
rare_values
function can be deployed with:
pip install bigfunctions
bigfun get rare_values
bigfun deploy rare_values
Detect rare strings in an array of strings with a frequency_threshold
of 0.2.
cherry
appears once for an array of 6 elements so its frequency
is 1 / 6 ~= 0.167 < 0.2.
It is a rare string compared to the frequency_threshold
.
select bigfunctions.eu.rare_values(['apple', 'apple', 'banana', 'banana', 'banana', 'cherry'], 0.2)
select bigfunctions.us.rare_values(['apple', 'apple', 'banana', 'banana', 'banana', 'cherry'], 0.2)
select bigfunctions.europe_west1.rare_values(['apple', 'apple', 'banana', 'banana', 'banana', 'cherry'], 0.2)
+-------------+
| rare_values |
+-------------+
| ["cherry"] |
+-------------+
Need help or Found a bug using rare_values
?
Get help using rare_values
The community can help! Engage the conversation on Slack
We also provide professional suppport.
Report a bug about rare_values
If the function does not work as expected, please
- report a bug so that it can be improved.
- or open the discussion with the community on Slack.
We also provide professional suppport.
Use cases¶
Let's say you have a dataset of e-commerce transactions and you want to identify potentially fraudulent orders based on unusual shipping addresses. You could use the rare_values
function to find addresses that appear infrequently.
Scenario:
You have a table orders
with a column shipping_city
. Most orders are shipped to common cities, but fraudulent orders might be shipped to less common locations.
Query:
SELECT
shipping_city
FROM
`your-project.your_dataset.orders`
WHERE
shipping_city IN (
SELECT
*
FROM
UNARRAY(
bigfunctions.us.rare_values(
(
SELECT
ARRAY_AGG(shipping_city)
FROM
`your-project.your_dataset.orders`
),
0.01
)
)
)
Explanation:
SELECT ARRAY_AGG(shipping_city) FROM your-project.your_dataset.orders
: This subquery aggregates all theshipping_city
values into a single array.bigfunctions.us.rare_values(... , 0.01)
: This calls therare_values
function with the array of cities and afrequency_threshold
of 0.01. This means any city that appears in less than 1% of the orders will be considered "rare".SELECT * FROM UNARRAY(...)
: This unnests the array of rare values returned byrare_values
into individual rows.WHERE shipping_city IN (...)
: This filters the originalorders
table to only include rows where theshipping_city
is present in the list of rare cities.
Result:
The query will return a list of shipping_city
values that are considered rare based on the defined threshold. You can then further investigate these orders to determine if they are potentially fraudulent.
Other Use Cases:
- Product Anomaly Detection: Identify rarely purchased products, which could indicate data entry errors, discontinued items, or sudden changes in demand.
- User Behavior Analysis: Find users with uncommon activity patterns, which could be a sign of bots or malicious actors.
- Error Detection in Logs: Identify rare error messages in system logs, which might point to new or infrequent bugs.
- Spam Detection: Find rare words or phrases used in emails or messages, which could indicate spam or phishing attempts.
By adjusting the frequency_threshold
, you can fine-tune the sensitivity of the rare value detection to suit your specific needs.
Spread the word!¶
BigFunctions is fully open-source. Help make it a success by spreading the word!