A Framework for BigQuery Functions
Build a catalog of BigQuery functions using the BigFunctions framework.

Discover the framework
A YAML Standard
Each function is defined in a yaml file (with its author, description, arguments, examples, code, etc.).
The yaml files are used to test and deploy the functions and to generate a documentation website (such as this website).
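As an illustrative sketch (the function below is hypothetical; see the YAML Syntax section further down for the full field reference), such a yaml file looks like:

```yaml
# bigfunctions/say_hello.yaml — hypothetical example for illustration only
type: function_sql
author: Jane Doe
description: |
  Says hello to someone
arguments:
  - name: first_name
    type: string
output:
  name: greeting
  type: string
examples:
  - description: Basic usage
    arguments:
      - "'Paul'"
    output: "Hello Paul"
code: |
  (
    SELECT 'Hello ' || first_name
  )
```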
A Command Line Interface
The bigfun CLI is installable with one pip install and enables you to:
- get the yaml file of a public function
- test the function
- deploy it
- generate a documentation website (such as this website)
A Documentation Website
The command line interface generates the catalog of functions available in your company, with use cases and examples.
Foster self-service for your data people!
Get Started!
The bigfun CLI (command-line interface) facilitates BigFunctions development, testing, deployment, documentation and monitoring.
1. Install bigfun 🛠️
pip install bigfunctions
2. Use bigfun 🔥
$ bigfun --help
Usage: bigfun [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
deploy Deploy BIGFUNCTION
docs Generate, serve and publish documentation
get Download BIGFUNCTION yaml file from unytics/bigfunctions...
test Test BIGFUNCTION
3. Create your first function 👷
Functions are defined as yaml files under the bigfunctions folder. To create your first function locally, the easiest way is to download an existing yaml file from the unytics/bigfunctions GitHub repo.
For instance, to download is_email_valid.yaml into the bigfunctions folder, run:
bigfun get is_email_valid
You can then update the file to suit your needs.
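For example, you might adjust the metadata or add an example of your own. The excerpt below is a sketch for illustration only, not the actual content of is_email_valid.yaml:

```yaml
# bigfunctions/is_email_valid.yaml — illustrative excerpt, not the real file
author: Your Name
description: |
  Returns true if `email` is a valid email address
examples:
  - description: A valid email
    arguments:
      - "'paul.marcombes@unytics.io'"
    output: "true"
```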
4. Deploy your first function 👨‍💻
- Make sure the gcloud command is installed on your computer.
- Activate the application-default account with gcloud auth application-default login. A browser window should open, and you should be prompted to log into your Google account. Once you've done that, bigfun will use your OAuth'd credentials to connect to BigQuery through the BigQuery Python client!
- Get or create a DATASET where you have permission to edit data and where the function will be deployed.
- The DATASET must belong to a PROJECT in which you have permission to run BigQuery queries.
You can now deploy the function is_email_valid defined in the bigfunctions/is_email_valid.yaml file by running:
bigfun deploy is_email_valid
The first time you run this command it will ask for PROJECT and DATASET.
Your inputs will be written to a config.yaml file in the current directory so that you won't be asked again (unless you delete the entries in config.yaml). You can also override this config at deploy time: bigfun deploy is_email_valid --project=PROJECT --dataset=DATASET.
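As a sketch, the generated config.yaml could look like the following (the key names shown are assumptions for illustration and the values are placeholders; check the file written on your machine for the exact entries):

```yaml
# config.yaml — sketch only; key names are assumptions, values are placeholders
project: your-gcp-project   # PROJECT entered at first deploy
dataset: your_dataset       # DATASET entered at first deploy
```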
Test it with 👀:
select PROJECT.DATASET.is_email_valid('paul.marcombes@unytics.io')
5. Deploy your first javascript function which depends on npm packages 👽
To deploy a javascript function which depends on npm packages, there are additional requirements on top of the ones above.
- You will need to install each npm package on your machine and bundle it into one file. For that, you need to install nodejs.
- The bundled js file will be uploaded into a cloud storage bucket in which you must have write access. The bucket name must be provided in the config.yaml file in a variable named bucket_js_dependencies. Users of your functions must have read access to the bucket.
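For instance, the corresponding entry in config.yaml could look like this sketch (the bucket name is a placeholder):

```yaml
# config.yaml
bucket_js_dependencies: your-bucket-name   # placeholder: bucket holding the bundled js dependencies
```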
You can now deploy the function render_template defined in the bigfunctions/render_template.yaml file by running:
bigfun deploy render_template
Test it with 👀:
select PROJECT.DATASET.render_template('Hello {{ user }}', json '{"user": "James"}')
6. Deploy your first remote function ⚡️
To deploy a remote function (e.g. a python function), there are additional requirements on top of the ones of the "Deploy your first function" section.
- A Cloud Run service will be deployed to host the code (as seen here), so you must have permission to deploy a Cloud Run service in your project PROJECT. The gcloud CLI will be used directly to deploy the service (using gcloud run deploy), so make sure you are logged in with gcloud by calling gcloud auth login. A browser window should also open, and you should be prompted to log into your Google account. WARNING: you read correctly, you have to authenticate twice. Once for the BigQuery Python client (to deploy any function, including remote functions, as seen above) and once now to use gcloud (to deploy a Cloud Run service).
- A BigQuery Remote Connection will be created to link BigQuery with the Cloud Run service. You therefore need permission to create a remote connection. The BigQuery Connection Admin or BigQuery Admin roles have this permission.
- A service account will be automatically created by Google along with the BigQuery Remote Connection. BigQuery will use this service account of the remote connection to invoke the Cloud Run service. You therefore must have permission to authorize this service account to invoke the Cloud Run service. This permission is provided by the roles/run.admin role.
You can now deploy the function faker defined in the bigfunctions/faker.yaml file by running:
bigfun deploy faker
Test it with 👀:
select PROJECT.DATASET.faker("name", "it_IT")
7. Host your Documentation on GitHub Pages
💡 Note: If you want to host your documentation on GitLab, please check this link.
Steps to Host Your Documentation on GitHub Pages
- Create a new repository on GitHub.
- Initialize Git in your local project:
git init
- Add the remote repository
git remote add origin <repository-url>
- Generate the documentation
bigfun docs generate
- (Optional) Preview the documentation before publishing
bigfun docs serve
Then open http://localhost:8000 in your browser.
- Add, commit, and push your changes
git add .
git commit -m "Add documentation"
git push origin main
- Deploy to GitHub Pages
mkdocs gh-deploy --force
- Access your hosted documentation
https://<your-github-username>.github.io/<repository-name>/
YAML Syntax
```yaml
type: function_sql #(1)!
author: John Doe #(2)!
description: | #(3)!
  Multiplies a number by a factor
  (example function for documentation purposes)
arguments: #(4)!
  - name: num
    type: float64
  - name: factor
    type: float64
output: #(5)!
  name: product
  type: float64
examples: #(6)!
  - description: Basic multiplication
    arguments:
      - 5
      - 3
    output: 15
  - description: Decimal multiplication
    arguments:
      - 2.5
      - 4
    output: 10.0
code: | #(7)!
  (
    SELECT num * factor
  )
```
1. type
   (Required) Function category declaration.
2. author
   (Optional) Function creator/maintainer identifier.
3. description
   (Required) Clear explanation of the function's purpose and behavior.
4. arguments
   (Required) List of input arguments with BigQuery-compatible types. Argument structure:
   - name: a valid identifier (snake_case recommended). Examples: user_id, transaction_amount
   - type: a BigQuery-supported data type (BOOL | INT64 | FLOAT64 | STRING | JSON | DATE | TIMESTAMP). See the BigQuery Data Types Reference.
5. output
   (Required) Definition of the function's return value structure. Output structure:
   - name: identifier for the return value (snake_case recommended). Examples: result, total_amount
   - type: a BigQuery-compatible data type (BOOL | INT64 | FLOAT64 | STRING | JSON | DATE | TIMESTAMP). See the BigQuery Data Types Reference.
   Example:
     output:
       name: final_price
       type: FLOAT64
       description: Total amount after applying discounts and taxes
6. examples
   (Required) List of practical usage demonstrations for the function. Key elements:
   - description: context explanation
   - arguments: input values
   - output: expected result
7. code
   (Required) SQL query implementation for the function's logic.
```yaml
type: function_py #(1)!
author: John Doe #(2)!
description: | #(3)!
  Generates a personalized greeting message
  Combines first and last name with a welcome phrase
arguments: #(4)!
  - name: first_name
    type: string
  - name: last_name
    type: string
output: #(5)!
  name: greeting
  type: string
examples: #(6)!
  - description: Basic usage
    arguments:
      - "'John'"
      - "'Doe'"
    output: "Hello John Doe"
  - description: Different name
    arguments:
      - "'Marie'"
      - "'Curie'"
    output: "Hello Marie Curie"
init_code: | #(7)!
  # Pre-imported modules (executed once)
  import requests  # Example dependency
code: | #(9)!
  return f"Hello {first_name} {last_name}"
requirements: | #(10)!
  # External libraries needed
  numpy==1.24.2
  requests>=2.28.1
dockerfile: #(8)!
  image: python:3.9-slim # Base image
  apt_packages: # System dependencies
    - libgomp1
  additional_commands: |
    # Additional setup commands
    RUN pip install --upgrade pip
secrets: #(14)!
  - name: API_KEY
    description: External service authentication
    documentation_link: https://example.com/api-docs
max_batching_rows: 1 #(11)!
quotas: #(12)!
  max_rows_per_user_per_day: 10000000 # Daily user quota
  max_rows_per_query: 2 # Per-query limit
cloud_run: #(13)!
  memory: 2Gi
  concurrency: 80 # Max concurrent requests/instance
  cpu: 2 # vCPU count
```
1. type
   (Required) Function category declaration.
2. author
   (Optional) Function creator/maintainer identifier.
3. description
   (Required) Clear explanation of the function's purpose and behavior.
4. arguments
   (Required) List of input arguments with types. Argument structure:
   - name: a valid Python identifier (snake_case recommended). Examples: user_id, transaction_amount
   - type: a data type from the allowed set (BOOL | STRING | JSON | INT64 | FLOAT64). See the Python Type Hints Documentation.
5. output
   (Required) Output structure (name, type).
6. examples
   (Required) List of usage examples (description, arguments, output).
7. init_code
   (Optional) Initialization code executed once during container startup, before any function invocation. Example:
     # Pre-load expensive dependencies
     import requests                    # HTTP client
     import numpy as np                 # Numerical computations
     from google.cloud import bigquery  # GCP integration

     # Initialize shared resources
     client = bigquery.Client()
     model = load_ml_model("gs://bucket/model.pkl")  # One-time model loading
   Key use cases:
   - Pre-importing expensive modules to reduce per-request latency
   - Initializing database connections/pools
   - Loading ML models or configuration files
   - Setting up shared caches or global variables
8. dockerfile
   (Optional) Custom Docker container configuration for function packaging. By default the uv python image is used. Configurable elements:
   - image: base Docker image (e.g. python:3.9-slim). Recommendation: use specific version tags (e.g. python:3.9.18-slim).
   - apt_packages: system packages to install, e.g.:
       apt_packages:
         - libgomp1   # OpenMP support
         - libpq-dev  # PostgreSQL bindings
   - additional_commands: custom build commands (executed in order), e.g.:
       RUN pip install --upgrade pip
   ⚠️ Important notes:
   - Prefer official images for security
   - Don't modify the default EXPOSE 8080
9. code
   (Required) Python function implementation containing the core business logic. Key considerations:
   - Arguments defined in the arguments section of the yaml are available here in the code.
   - Dependencies must be declared in the requirements section.
10. requirements
    (Optional) Python packages required by the function, following requirements.txt syntax (see the Python Packaging Documentation). Format:
      package1==1.2.3
      package2>=4.5.6
      package3  # Comment explaining purpose
    Example:
      numpy==1.24.2
      requests>=2.28.1
      google-cloud-storage  # For cloud integration
11. max_batching_rows
    (Optional) You can specify max_batching_rows as the maximum number of rows in each HTTP request, to avoid Cloud Run function timeouts. If you specify max_batching_rows, BigQuery determines the number of rows in a batch, up to the max_batching_rows limit. If not specified, BigQuery determines the number of rows to batch automatically. Documentation.
12. quotas
    (Optional) Resource limits to prevent abuse and ensure system stability:
    - max_rows_per_query: maximum number of rows in a query using the function.
    - max_rows_per_user_per_day: maximum number of rows per day per user in queries using the function.
13. cloud_run
    (Optional) Cloud Run configuration: scaling, compute resources, and deployment settings for your Cloud Run service. All arguments from the official Cloud Run documentation are supported (we replaced - by _ in argument names for convention). Example configuration:
      # Service account (defaults to the compute engine service account of your project)
      service_account: XXXXXXXXX-compute@developer.gserviceaccount.com
      # Allocated memory per instance (valid: 128Mi to 32Gi, in 64Mi increments)
      memory: 512Mi
      # Number of allocated CPUs per instance (default: 1)
      cpu: 1
      # Maximum concurrent requests per instance
      concurrency: 8  # Set to 1 for strict isolation
      # Maximum request duration (e.g. 300s = 5 minutes)
      timeout: 300s
      # Environment variables (format: KEY1=value1,KEY2=value2)
      set_env_vars: DEBUG=true,MAX_RETRIES=3
      # Minimum number of running instances (avoids cold starts)
      min_instances: 1
      # Maximum number of instances allowed
      max_instances: 100
14. secrets
    (Optional) To be documented.
❓ FAQ
How to correctly highlight sql, python and javascript code in yaml files?
In yaml files, multiline strings are highlighted as strings by default.
That makes the code field hard to read (with all the code in the same string color).
To correctly highlight the code according to its python / javascript / sql syntax,
you can install the YAML Embedded Languages VSCode extension.
How to define specific parameters for the cloud run service of python functions?
In yaml files you can add a cloud_run field with cloud run parameters.
Any argument of the gcloud run deploy command can be put under the cloud_run field.
You can see an example here.
You can also put the same config in your config.yaml file to define default values (useful for defining a default service account for functions).
The arguments defined in config.yaml will be overridden by the arguments (if defined) in the function yaml files.
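As an illustrative sketch, a config.yaml with default cloud run values could look like the following (the service account and values are placeholders; this assumes the defaults live under a cloud_run key, mirroring the field used in function yaml files):

```yaml
# config.yaml — default cloud run parameters for deployed python functions (sketch)
cloud_run:
  service_account: my-functions@your-project.iam.gserviceaccount.com  # placeholder
  memory: 512Mi
  max_instances: 10
```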
How to change the cloud run service account for python functions?
By default, your default compute service account is used when deploying cloud run. To change that, see the previous FAQ item, which shows how to define specific parameters for cloud run.
How to generate a key pair for encryption / decryption of secrets contained in arguments?
In order not to pass secrets in plain text in function arguments, bigfunctions provides a mechanism to encrypt a secret on the documentation page of a function (for example here). Only the given function will be able to decrypt it for the given users.
For this to work you need to:
- Generate a key pair for encryption / decryption by running bigfun config generate-key-pair-for-secrets.
  - The public key (used for encryption on the website) will be stored in your config.yaml and used when you generate your website.
  - The private key (used for decryption by the function) will be printed on the console.
- Store the private key in a secret named bigfunctions_private_key in the Google Secret Manager of the project where you deploy the function.
- Give the service account of the function the Secret Accessor role on the private key.
The deployed function will automatically download the private key and decrypt any encrypted secret in arguments tagged as secrets (and check secrets were encrypted for this function and for the user who calls it).