Skip to content

rare_values

rare_values(values, frequency_threshold)

Description

Returns rare_values among array of values

This function computes the frequency of each value in values array and returns the values which frequency is stricly below the given frequency_threshold.

By returning rare values, this function can be used for anomaly detection in a variety of use cases.

Examples

Call or Deploy rare_values ?
Call rare_values directly

The easiest way to use bigfunctions

  • rare_values function is deployed in 39 public datasets for all of the 39 BigQuery regions.
  • It can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
  • (You need to use the dataset in the same region as your datasets otherwise you may have a function not found error)

Public BigFunctions Datasets

Region Dataset
eu bigfunctions.eu
us bigfunctions.us
europe-west1 bigfunctions.europe_west1
asia-east1 bigfunctions.asia_east1
... ...
Deploy rare_values in your project

Why deploy?

  • You may prefer to deploy rare_values in your own project to build and manage your own catalog of functions.
  • This is particularly useful if you want to create private functions (for example calling your internal APIs).
  • Get started by reading the framework page

Deployment

rare_values function can be deployed with:

pip install bigfunctions
bigfun get rare_values
bigfun deploy rare_values

Detect rare strings in an array of strings with a frequency_threshold of 0.2. cherry appears once for an array of 6 elements so its frequency is 1 / 6 ~= 0.167 < 0.2. It is a rare string compared to the frequency_threshold.

select bigfunctions.eu.rare_values(['apple', 'apple', 'banana', 'banana', 'banana', 'cherry'], 0.2)
select bigfunctions.us.rare_values(['apple', 'apple', 'banana', 'banana', 'banana', 'cherry'], 0.2)
select bigfunctions.europe_west1.rare_values(['apple', 'apple', 'banana', 'banana', 'banana', 'cherry'], 0.2)
+-------------+
| rare_values |
+-------------+
| ["cherry"]  |
+-------------+

Need help or Found a bug using rare_values?
Get help using rare_values

The community can help! Engage the conversation on Slack

We also provide professional suppport.

Report a bug about rare_values

If the function does not work as expected, please

  • report a bug so that it can be improved.
  • or open the discussion with the community on Slack.

We also provide professional suppport.

Use cases

Let's say you have a dataset of e-commerce transactions and you want to identify potentially fraudulent orders based on unusual shipping addresses. You could use the rare_values function to find addresses that appear infrequently.

Scenario:

You have a table orders with a column shipping_city. Most orders are shipped to common cities, but fraudulent orders might be shipped to less common locations.

Query:

SELECT
    shipping_city
  FROM
    `your-project.your_dataset.orders`
  WHERE
    shipping_city IN (
      SELECT
          *
        FROM
          UNARRAY(
            bigfunctions.us.rare_values(
              (
                SELECT
                    ARRAY_AGG(shipping_city)
                  FROM
                    `your-project.your_dataset.orders`
              ),
              0.01
            )
          )
    )

Explanation:

  1. SELECT ARRAY_AGG(shipping_city) FROM your-project.your_dataset.orders: This subquery aggregates all the shipping_city values into a single array.
  2. bigfunctions.us.rare_values(... , 0.01): This calls the rare_values function with the array of cities and a frequency_threshold of 0.01. This means any city that appears in less than 1% of the orders will be considered "rare".
  3. SELECT * FROM UNARRAY(...): This unnests the array of rare values returned by rare_values into individual rows.
  4. WHERE shipping_city IN (...): This filters the original orders table to only include rows where the shipping_city is present in the list of rare cities.

Result:

The query will return a list of shipping_city values that are considered rare based on the defined threshold. You can then further investigate these orders to determine if they are potentially fraudulent.

Other Use Cases:

  • Product Anomaly Detection: Identify rarely purchased products, which could indicate data entry errors, discontinued items, or sudden changes in demand.
  • User Behavior Analysis: Find users with uncommon activity patterns, which could be a sign of bots or malicious actors.
  • Error Detection in Logs: Identify rare error messages in system logs, which might point to new or infrequent bugs.
  • Spam Detection: Find rare words or phrases used in emails or messages, which could indicate spam or phishing attempts.

By adjusting the frequency_threshold, you can fine-tune the sensitivity of the rare value detection to suit your specific needs.

Spread the word!

BigFunctions is fully open-source. Help make it a success by spreading the word!

Share on Add a on