A Framework for BigQuery Functions
Build a catalog of BigQuery functions using the BigFunctions framework.

Discover the framework
A YAML Standard
Each function is defined in a yaml file (with its author, description, arguments, examples, code, etc.).
The yaml files are used to test and deploy the functions and to generate a documentation website (such as this website).
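As an illustrative sketch (the function below is hypothetical; see the YAML Syntax section further down for the full field reference), such a yaml file looks like:

```yaml
# bigfunctions/say_hello.yaml — hypothetical example for illustration only
type: function_sql
author: Jane Doe
description: |
  Says hello to someone
arguments:
  - name: first_name
    type: string
output:
  name: greeting
  type: string
examples:
  - description: Basic usage
    arguments:
      - "'Paul'"
    output: "Hello Paul"
code: |
  (
    SELECT 'Hello ' || first_name
  )
```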
A Command Line Interface
The bigfun CLI is installable with one pip install and enables you to:
- get the yaml file of a public function
- test the function
- deploy it
- generate a documentation website (such as this website)
A Documentation Website
The command line interface generates the catalog of functions available in your company, with use cases and examples.
Foster self-service for your data people!
Get Started!
The bigfun CLI (command-line interface) facilitates BigFunctions development, testing, deployment, documentation and monitoring.
1. Install bigfun 🛠️
pip install bigfunctions
2. Use bigfun 🔥
$ bigfun --help
Usage: bigfun [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
deploy Deploy BIGFUNCTION
docs Generate, serve and publish documentation
get Download BIGFUNCTION yaml file from unytics/bigfunctions...
test Test BIGFUNCTION
3. Create your first function 👷
Functions are defined as yaml files under the bigfunctions folder. To create your first function locally, the easiest way is to download an existing yaml file from the unytics/bigfunctions GitHub repo.
For instance, to download is_email_valid.yaml into the bigfunctions folder, run:
bigfun get is_email_valid
You can then update the file to suit your needs.
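For example, you might adjust the metadata or add an example of your own. The excerpt below is a sketch for illustration only, not the actual content of is_email_valid.yaml:

```yaml
# bigfunctions/is_email_valid.yaml — illustrative excerpt, not the real file
author: Your Name
description: |
  Returns true if `email` is a valid email address
examples:
  - description: A valid email
    arguments:
      - "'paul.marcombes@unytics.io'"
    output: "true"
```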
4. Deploy your first function 👨‍💻
- Make sure the gcloud command is installed on your computer.
- Activate the application-default account with gcloud auth application-default login. A browser window should open, and you should be prompted to log into your Google account. Once you've done that, bigfun will use your OAuth'd credentials to connect to BigQuery through the BigQuery Python client!
- Get or create a DATASET where you have permission to edit data and where the function will be deployed.
- The DATASET must belong to a PROJECT in which you have permission to run BigQuery queries.
You can now deploy the function is_email_valid defined in the bigfunctions/is_email_valid.yaml file by running:
bigfun deploy is_email_valid
The first time you run this command it will ask for PROJECT and DATASET.
Your inputs will be written to a config.yaml file in the current directory so that you won't be asked again (unless you delete the entries in config.yaml). You can also override this config at deploy time: bigfun deploy is_email_valid --project=PROJECT --dataset=DATASET.
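As a sketch, the generated config.yaml could look like the following (the key names shown are assumptions for illustration and the values are placeholders; check the file written on your machine for the exact entries):

```yaml
# config.yaml — sketch only; key names are assumptions, values are placeholders
project: your-gcp-project   # PROJECT entered at first deploy
dataset: your_dataset       # DATASET entered at first deploy
```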
Test it with 👀:
select PROJECT.DATASET.is_email_valid('paul.marcombes@unytics.io')
5. Deploy your first javascript function which depends on npm packages 👽
To deploy a javascript function which depends on npm packages, there are additional requirements on top of the ones above.
- You will need to install each npm package on your machine and bundle it into one file. For that, you need to install nodejs.
- The bundled js file will be uploaded into a cloud storage bucket in which you must have write access. The bucket name must be provided in the config.yaml file in a variable named bucket_js_dependencies. Users of your functions must have read access to the bucket.
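For instance, the corresponding entry in config.yaml could look like this sketch (the bucket name is a placeholder):

```yaml
# config.yaml
bucket_js_dependencies: your-bucket-name   # placeholder: bucket holding the bundled js dependencies
```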
You can now deploy the function render_template defined in the bigfunctions/render_template.yaml file by running:
bigfun deploy render_template
Test it with 👀:
select PROJECT.DATASET.render_template('Hello {{ user }}', json '{"user": "James"}')
6. Deploy your first remote function ⚡️
To deploy a remote function (e.g. a python function), there are additional requirements on top of the ones of the "Deploy your first function" section.
- A Cloud Run service will be deployed to host the code (as seen here), so you must have permission to deploy a Cloud Run service in your project PROJECT. The gcloud CLI will be used directly to deploy the service (using gcloud run deploy), so make sure you are logged in with gcloud by calling gcloud auth login. A browser window should also open, and you should be prompted to log into your Google account. WARNING: you read correctly, you have to authenticate twice. Once for the BigQuery Python client (to deploy any function, including remote functions, as seen above) and once now to use gcloud (to deploy a Cloud Run service).
- A BigQuery Remote Connection will be created to link BigQuery with the Cloud Run service. You therefore need permission to create a remote connection. The BigQuery Connection Admin or BigQuery Admin roles have this permission.
- A service account will be automatically created by Google along with the BigQuery Remote Connection. BigQuery will use this service account of the remote connection to invoke the Cloud Run service. You therefore must have permission to authorize this service account to invoke the Cloud Run service. This permission is provided by the roles/run.admin role.
You can now deploy the function faker defined in the bigfunctions/faker.yaml file by running:
bigfun deploy faker
Test it with 👀:
select PROJECT.DATASET.faker("name", "it_IT")
7. Host your Documentation on GitHub Pages
💡 Note: If you want to host your documentation on GitLab, please check this link.
Steps to Host Your Documentation on GitHub Pages
- Create a new repository on GitHub.
- Initialize Git in your local project:
git init
- Add the remote repository
git remote add origin <repository-url>
- Generate the documentation
bigfun docs generate
- (Optional) Preview the documentation before publishing
bigfun docs serve
Then open http://localhost:8000 in your browser.
- Add, commit, and push your changes
git add .
git commit -m "Add documentation"
git push origin main
- Deploy to GitHub Pages
mkdocs gh-deploy --force
- Access your hosted documentation
https://<your-github-username>.github.io/<repository-name>/
YAML Syntax
```yaml
type: function_sql #(1)!
author: John Doe #(2)!
description: | #(3)!
  Multiplies a number by a factor
  (example function for documentation purposes)
arguments: #(4)!
  - name: num
    type: float64
  - name: factor
    type: float64
output: #(5)!
  name: product
  type: float64
examples: #(6)!
  - description: Basic multiplication
    arguments:
      - 5
      - 3
    output: 15
  - description: Decimal multiplication
    arguments:
      - 2.5
      - 4
    output: 10.0
code: | #(7)!
  (
    SELECT num * factor
  )
```
1. type
   (Required) Function category declaration.
2. author
   (Optional) Function creator/maintainer identifier.
3. description
   (Required) Clear explanation of the function's purpose and behavior.
4. arguments
   (Required) List of input arguments with BigQuery-compatible types. Argument structure:
   - name: a valid identifier (snake_case recommended). Examples: user_id, transaction_amount
   - type: a BigQuery-supported data type (BOOL | INT64 | FLOAT64 | STRING | JSON | DATE | TIMESTAMP). See the BigQuery Data Types Reference.
5. output
   (Required) Definition of the function's return value structure. Output structure:
   - name: identifier for the return value (snake_case recommended). Examples: result, total_amount
   - type: a BigQuery-compatible data type (BOOL | INT64 | FLOAT64 | STRING | JSON | DATE | TIMESTAMP). See the BigQuery Data Types Reference.
   Example:
     output:
       name: final_price
       type: FLOAT64
       description: Total amount after applying discounts and taxes
6. examples
   (Required) List of practical usage demonstrations for the function. Key elements:
   - description: context explanation
   - arguments: input values
   - output: expected result
7. code
   (Required) SQL query implementation for the function's logic.
```yaml
type: function_py #(1)!
author: John Doe #(2)!
description: | #(3)!
  Generates a personalized greeting message
  Combines first and last name with a welcome phrase
arguments: #(4)!
  - name: first_name
    type: string
  - name: last_name
    type: string
output: #(5)!
  name: greeting
  type: string
examples: #(6)!
  - description: Basic usage
    arguments:
      - "'John'"
      - "'Doe'"
    output: "Hello John Doe"
  - description: Different name
    arguments:
      - "'Marie'"
      - "'Curie'"
    output: "Hello Marie Curie"
init_code: | #(7)!
  # Pre-imported modules (executed once)
  import requests  # Example dependency
code: | #(9)!
  return f"Hello {first_name} {last_name}"
requirements: | #(10)!
  # External libraries needed
  numpy==1.24.2
  requests>=2.28.1
dockerfile: #(8)!
  image: python:3.9-slim # Base image
  apt_packages: # System dependencies
    - libgomp1
  additional_commands: |
    # Additional setup commands
    RUN pip install --upgrade pip
secrets: #(14)!
  - name: API_KEY
    description: External service authentication
    documentation_link: https://example.com/api-docs
max_batching_rows: 1 #(11)!
quotas: #(12)!
  max_rows_per_user_per_day: 10000000 # Daily user quota
  max_rows_per_query: 2 # Per-query limit
cloud_run: #(13)!
  memory: 2Gi
  concurrency: 80 # Max concurrent requests/instance
  cpu: 2 # vCPU count
```
1. type
   (Required) Function category declaration.
2. author
   (Optional) Function creator/maintainer identifier.
3. description
   (Required) Clear explanation of the function's purpose and behavior.
4. arguments
   (Required) List of input arguments with types. Argument structure:
   - name: a valid Python identifier (snake_case recommended). Examples: user_id, transaction_amount
   - type: a data type from the allowed set (BOOL | STRING | JSON | INT64 | FLOAT64). See the Python Type Hints Documentation.
5. output
   (Required) Output structure (name, type).
6. examples
   (Required) List of usage examples (description, arguments, output).
7. init_code
   (Optional) Initialization code executed once during container startup, before any function invocation. Example:
     # Pre-load expensive dependencies
     import requests                    # HTTP client
     import numpy as np                 # Numerical computations
     from google.cloud import bigquery  # GCP integration

     # Initialize shared resources
     client = bigquery.Client()
     model = load_ml_model("gs://bucket/model.pkl")  # One-time model loading
   Key use cases:
   - Pre-importing expensive modules to reduce per-request latency
   - Initializing database connections/pools
   - Loading ML models or configuration files
   - Setting up shared caches or global variables
8. dockerfile
   (Optional) Custom Docker container configuration for function packaging. By default the uv python image is used. Configurable elements:
   - image: base Docker image (e.g. python:3.9-slim). Recommendation: use specific version tags (e.g. python:3.9.18-slim).
   - apt_packages: system packages to install, e.g.:
       apt_packages:
         - libgomp1   # OpenMP support
         - libpq-dev  # PostgreSQL bindings
   - additional_commands: custom build commands (executed in order), e.g.:
       RUN pip install --upgrade pip
   ⚠️ Important notes:
   - Prefer official images for security
   - Don't modify the default EXPOSE 8080
9. code
   (Required) Python function implementation containing the core business logic. Key considerations:
   - Arguments defined in the arguments section of the yaml are available here in the code.
   - Dependencies must be declared in the requirements section.
10. requirements
    (Optional) Python packages required by the function, following requirements.txt syntax (see the Python Packaging Documentation). Format:
      package1==1.2.3
      package2>=4.5.6
      package3  # Comment explaining purpose
    Example:
      numpy==1.24.2
      requests>=2.28.1
      google-cloud-storage  # For cloud integration
11. max_batching_rows
    (Optional) You can specify max_batching_rows as the maximum number of rows in each HTTP request, to avoid Cloud Run function timeouts. If you specify max_batching_rows, BigQuery determines the number of rows in a batch, up to the max_batching_rows limit. If not specified, BigQuery determines the number of rows to batch automatically. Documentation.
12. quotas
    (Optional) Resource limits to prevent abuse and ensure system stability:
    - max_rows_per_query: maximum number of rows in a query using the function.
    - max_rows_per_user_per_day: maximum number of rows per day per user in queries using the function.
13. cloud_run
    (Optional) Cloud Run configuration: scaling, compute resources, and deployment settings for your Cloud Run service. All arguments from the official Cloud Run documentation are supported (we replaced - by _ in argument names for convention). Example configuration:
      # Service account (defaults to the compute engine service account of your project)
      service_account: XXXXXXXXX-compute@developer.gserviceaccount.com
      # Allocated memory per instance (valid: 128Mi to 32Gi, in 64Mi increments)
      memory: 512Mi
      # Number of allocated CPUs per instance (default: 1)
      cpu: 1
      # Maximum concurrent requests per instance
      concurrency: 8  # Set to 1 for strict isolation
      # Maximum request duration (e.g. 300s = 5 minutes)
      timeout: 300s
      # Environment variables (format: KEY1=value1,KEY2=value2)
      set_env_vars: DEBUG=true,MAX_RETRIES=3
      # Minimum number of running instances (avoids cold starts)
      min_instances: 1
      # Maximum number of instances allowed
      max_instances: 100
14. secrets
    (Optional) To be documented.
❓ FAQ
How to correctly highlight sql, python and javascript code in yaml files?
In yaml files, multiline strings are highlighted as strings by default.
That makes the code field hard to read (with all the code in the same string color).
To correctly highlight the code according to its python / javascript / sql syntax,
you can install the YAML Embedded Languages VSCode extension.
How to define specific parameters for the cloud run service of python functions?
In yaml files you can add a cloud_run field with cloud run parameters.
Any argument of the gcloud run deploy command can be put under the cloud_run field.
You can see an example here.
You can also put the same config in your config.yaml file to define default values (useful for defining a default service account for functions).
The arguments defined in config.yaml will be overridden by the arguments (if defined) in the function yaml files.
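As an illustrative sketch, a config.yaml with default cloud run values could look like the following (the service account and values are placeholders; this assumes the defaults live under a cloud_run key, mirroring the field used in function yaml files):

```yaml
# config.yaml — default cloud run parameters for deployed python functions (sketch)
cloud_run:
  service_account: my-functions@your-project.iam.gserviceaccount.com  # placeholder
  memory: 512Mi
  max_instances: 10
```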
How to change the cloud run service account for python functions?
By default, your default compute service account is used when deploying cloud run. To change that, see the previous FAQ item, which shows how to define specific parameters for cloud run.
How to generate a key pair for encryption / decryption of secrets contained in arguments?
In order not to pass secrets in plain text in function arguments, bigfunctions provides a mechanism to encrypt a secret on the documentation page of a function (for example here). Only the given function will be able to decrypt it for the given users.
For this to work you need to:
- Generate a key pair for encryption / decryption by running bigfun config generate-key-pair-for-secrets.
  - The public key (used for encryption on the website) will be stored in your config.yaml and used when you generate your website.
  - The private key (used for decryption by the function) will be printed on the console.
- Store the private key in a secret named bigfunctions_private_key in the Google Secret Manager of the project where you deploy the function.
- Give the service account of the function the Secret Accessor role on the private key.
The deployed function will automatically download the private key and decrypt any encrypted secret in arguments tagged as secrets (and check secrets were encrypted for this function and for the user who calls it).