z_scores¶
Call or Deploy z_scores?
✅ You can call this z_scores bigfunction directly from your Google Cloud Project (no install required).
- This z_scores function is deployed in the bigfunctions GCP project in 39 datasets, one for each of the 39 BigQuery regions. Use the dataset in the same region as your own datasets (otherwise you may get a "function not found" error).
- The function is public, so anyone can call it. Just copy and paste the examples below into your BigQuery console. It just works!
- You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions. This is particularly useful if you want to create private functions (for example, functions that call your internal APIs). Discover the framework
Public BigFunctions Datasets:
Region | Dataset |
---|---|
eu | bigfunctions.eu |
us | bigfunctions.us |
europe-west1 | bigfunctions.europe_west1 |
asia-east1 | bigfunctions.asia_east1 |
... | ... |
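The rows listed above suggest a naming pattern: the dataset name is the region name with hyphens replaced by underscores, prefixed with `bigfunctions.`. The sketch below encodes that pattern. The helper name `dataset_for_region` is hypothetical, and the rule is inferred only from the rows shown here, so check the full dataset list before relying on it for other regions.

```python
def dataset_for_region(region: str) -> str:
    """Build the public dataset name for a BigQuery region.

    Assumption: the pattern seen in the table (lowercase region name,
    hyphens replaced by underscores) holds for every region.
    """
    return "bigfunctions." + region.lower().replace("-", "_")


for region in ["eu", "us", "europe-west1", "asia-east1"]:
    print(region, "->", dataset_for_region(region))
```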
Description¶
Signature
z_scores(arr)
Description
Compute the z-score of each value in the arr array.
The z-score of a value is the number of standard deviations by which it lies above or below the mean of the array.
Examples¶
select bigfunctions.eu.z_scores([1, 2, 3, 4, 5])
select bigfunctions.us.z_scores([1, 2, 3, 4, 5])
select bigfunctions.europe_west1.z_scores([1, 2, 3, 4, 5])
+-----------------------------------+
| z_scores |
+-----------------------------------+
| [-1.414, -0.707, 0, 0.707, 1.414] |
+-----------------------------------+
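To check the arithmetic behind the output above outside BigQuery, here is a minimal Python sketch of the z-score definition. It assumes the population standard deviation (dividing by n rather than n - 1). That choice is inferred from the documented output, not from the function's source, and the local function name `z_scores` is only an illustration.

```python
import math


def z_scores(values):
    """Return the z-score of each value in `values`."""
    n = len(values)
    mean = sum(values) / n
    # Population variance divides by n (not n - 1). This is an assumption
    # inferred from the documented example output.
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


print([round(z, 3) for z in z_scores([1, 2, 3, 4, 5])])
# [-1.414, -0.707, 0.0, 0.707, 1.414]
```

With the sample standard deviation (dividing by n - 1) the first value would come out near -1.265 instead, which is why the population form is assumed here. The sketch does not guard against an array whose values are all equal, where the standard deviation is zero.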
Need help using z_scores?
The community can help! Join the conversation on Slack.
For professional support, don't hesitate to chat with us.
Found a bug using z_scores?
If the function does not work as expected, please
- report a bug so that it can be improved,
- or open a discussion with the community on Slack.
For professional support, don't hesitate to chat with us.
Use cases¶
A use case for the z_scores function is to identify outliers in a dataset. Let's imagine you have a table of website session durations in seconds:
CREATE OR REPLACE TABLE `your_project.your_dataset.session_durations` AS
SELECT * FROM UNNEST([
10, 25, 30, 35, 40, 45, 50, 55, 60, 300, 65, 70, 75, 80, 85
]) AS session_duration;
You suspect that the session duration of 300 seconds is an outlier. You can use z_scores to confirm this:
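-- Aggregate once, then unnest so each row pairs a duration with its own z-score.
WITH all_sessions AS (
  SELECT ARRAY_AGG(session_duration) AS durations
  FROM `your_project.your_dataset.session_durations`
)
SELECT
  durations[OFFSET(i)] AS session_duration,
  z_score
FROM all_sessions,
  UNNEST(bigfunctions.your_region.z_scores(durations)) AS z_score WITH OFFSET AS i;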
Replace your_region with your BigQuery region (e.g., us, eu, us_central1).
This query calculates the z-score for each session duration. The 300-second session has a z-score of roughly 3.5, while every other session falls within about ±1, so it clears the common outlier cutoffs of 2 or 3 standard deviations. You can then filter on the z-score to remove or further investigate such sessions.
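The outlier check can also be reproduced locally on the same 15 durations. The sketch below is an illustration rather than the function's implementation: it assumes the population standard deviation, and the `THRESHOLD` value of 3 is a hypothetical cutoff chosen for the example.

```python
import math

durations = [10, 25, 30, 35, 40, 45, 50, 55, 60, 300, 65, 70, 75, 80, 85]

mean = sum(durations) / len(durations)
# Population standard deviation (assumed; the page does not state which
# variant z_scores uses).
std = math.sqrt(sum((d - mean) ** 2 for d in durations) / len(durations))

THRESHOLD = 3.0  # hypothetical cutoff; 2 and 3 are both common choices

# Keep only the durations whose absolute z-score exceeds the cutoff.
outliers = [d for d in durations if abs((d - mean) / std) > THRESHOLD]
print(outliers)  # [300]
```

A lower cutoff such as 2 gives the same result on this data, because no other session is more than about one standard deviation from the mean.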
Other use cases include:
- Standardizing data: Transforming data to have a mean of 0 and a standard deviation of 1, useful for comparing variables measured on different scales.
- Anomaly detection: Similar to outlier detection, but in a time-series context, identifying unusual fluctuations in metrics.
- Machine learning preprocessing: Many machine learning algorithms benefit from standardized input data.
- Ranking and scoring: Z-scores can provide a relative ranking of items based on their performance compared to the average. For example, ranking students based on their test scores.
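The standardization use case in the list above can be illustrated with a short Python sketch. It uses a hypothetical sample and assumes the population standard deviation. It shows that standardized values have mean 0 and standard deviation 1, and that changing the unit of measurement leaves the z-scores unchanged, which is what makes variables on different scales comparable.

```python
import math


def z_scores(values):
    """Z-scores using the population standard deviation (an assumption)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


seconds = [30.0, 60.0, 90.0, 120.0]   # hypothetical sample
minutes = [s / 60 for s in seconds]   # same data in a different unit

z_sec = z_scores(seconds)
z_min = z_scores(minutes)

# Standardized values have mean 0 and population standard deviation 1.
mean_z = sum(z_sec) / len(z_sec)
std_z = math.sqrt(sum((z - mean_z) ** 2 for z in z_sec) / len(z_sec))
print(math.isclose(mean_z, 0.0, abs_tol=1e-9), math.isclose(std_z, 1.0))

# The unit change does not affect the z-scores (up to floating-point error).
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_sec, z_min))
print(same)
```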
Remember to use the bigfunctions dataset that matches the BigQuery region where your data resides.
Spread the word¶
BigFunctions is fully open-source. Help make it a success by spreading the word!