load_file¶
load_file(url, file_type, destination_table, options)
Description¶
Downloads a web file into destination_table.
This function extends the load_file_into_temp_dataset function.
Each call:
- Creates a new temporary dataset, accessible only to you, in the bigfunctions project.
- Downloads the file data into a table within this temporary dataset.
- Copies the table into destination_table, then deletes it from the temporary dataset.
File data is downloaded using ibis with DuckDB. Available file_type values are:
- csv : doc
- json : doc
- parquet : doc
- delta : doc
- geo : doc (this uses GDAL under the hood and enables you to also read .xls, .xlsx, .shp, ...)
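For orientation, here is a minimal call sketch. The URL and the option value are placeholders, not real data; see the examples below for working invocations.
call bigfunctions.eu.load_file(
  'https://example.com/data.csv',         -- url: placeholder file location
  'csv',                                  -- file_type: one of the values above
  'your_project.your_dataset.my_table',   -- destination_table
  '{"delim": ","}'                        -- options: a JSON string, or null
);
select * from bigfunction_result;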
Examples¶
Call or Deploy load_file?
Call load_file directly
The easiest way to use BigFunctions: load_file is deployed in 39 public datasets, one for each of the 39 BigQuery regions. It can be called by anyone. Just copy / paste the examples below in your BigQuery console. It just works!
(You need to use the dataset in the same region as your datasets, otherwise you may get a "function not found" error. The examples below call bigfunctions.eu; swap in your region's dataset.)
Public BigFunctions Datasets

| Region | Dataset |
|---|---|
| eu | bigfunctions.eu |
| us | bigfunctions.us |
| europe-west1 | bigfunctions.europe_west1 |
| asia-east1 | bigfunctions.asia_east1 |
| ... | ... |
Deploy load_file in your project
Why deploy?
- You may prefer to deploy load_file in your own project to build and manage your own catalog of functions.
- This is particularly useful if you want to create private functions (for example, calling your internal APIs).
- Get started by reading the framework page.
Deployment
The load_file function can be deployed with:
pip install bigfunctions
bigfun get load_file
bigfun deploy load_file
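Once deployed, call the function from your own project instead of the public datasets. A sketch, assuming you deployed it into your_project.your_dataset (the URL is a placeholder):
call your_project.your_dataset.load_file(
'https://example.com/data.csv',  -- placeholder url
'csv', 'your_project.your_dataset.my_table', null
);
select * from bigfunction_result;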
1. load random csv
call bigfunctions.eu.load_file(
'https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv',
'csv', 'your_project.your_dataset.random_sales', null
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
2. load json - french departements
call bigfunctions.eu.load_file(
'https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region',
'json', 'your_project.your_dataset.dim_french_departements', null
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
3. load json as string without records/struct inference (from DuckDB & BigQuery) - (cf. issue #171)
call bigfunctions.eu.load_file(
'https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region',
'json', 'your_project.your_dataset.dim_french_departements', '{"records": false, "columns": {"json": "string"}}'
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
4. load parquet on Google Cloud Storage
call bigfunctions.eu.load_file(
'gs://bike-sharing-history/toulouse/jcdecaux/2024/Feb.parquet',
'parquet', 'your_project.your_dataset.station_status', null
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
5. load xls or xlsx
call bigfunctions.eu.load_file(
'https://github.com/AntoineGiraud/dbt_hypermarche/raw/refs/heads/main/input/Hypermarche.xlsx',
'geo', 'your_project.your_dataset.hypermarche_retours', '{"layer":"Retours", "open_options": ["HEADERS=FORCE"]}'
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
6. load a tricky french csv
call bigfunctions.eu.load_file(
'https://www.data.gouv.fr/fr/datasets/r/323af5b8-7831-445b-9a46-d4da140b61b6',
'csv',
'your_project.your_dataset.dim_french_postalcodes',
'''{
"columns": {
"code_commune_insee": "VARCHAR",
"nom_commune_insee": "VARCHAR",
"code_postal": "VARCHAR",
"lb_acheminement": "VARCHAR",
"ligne_5": "VARCHAR"
},
"delim": ";",
"skip": 1
}'''
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
Need help or found a bug using load_file?
Get help using load_file
The community can help! Engage the conversation on Slack.
We also provide professional support.
Report a bug about load_file
If the function does not work as expected, please:
- report a bug so that it can be improved,
- or open the discussion with the community on Slack.
We also provide professional support.
Use cases¶
The load_file function is useful for quickly loading data from various web-based file formats directly into a BigQuery table. Here's a breakdown of potential use cases, categorized by data type and source:
1. CSV Data:
- Public Datasets: Loading publicly available datasets in CSV format, like government data or research data. Example: Loading census data or economic indicators from a government website.
- Web APIs: Some APIs return data in CSV format. This function can be used to directly ingest that data into BigQuery. Example: A marketing API providing campaign performance data.
- GitHub/GitLab: Loading data directly from CSV files stored in repositories like GitHub or GitLab. This is helpful for sharing data within teams or for reproducible research. Example: Loading a training dataset for a machine learning model.
2. JSON Data:
- REST APIs: Many REST APIs return data in JSON format. load_file simplifies the process of ingesting this data into BigQuery without intermediate processing. Example: Loading product information from an e-commerce API.
- GeoJSON Data: Loading geospatial data in GeoJSON format. Example: Loading geographic boundaries of cities or countries.
- Configuration Files: Loading configuration data from JSON files hosted online.
3. Parquet/Delta Lake Data:
- Data Lakes: Accessing and loading data directly from data lakes stored on cloud storage platforms like Google Cloud Storage. This is efficient for large datasets, as Parquet and Delta Lake are optimized for analytical queries. Example: Loading historical sales data from a data lake (a delta sketch follows below).
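None of the examples above covers the delta file_type, so here is a sketch mirroring the parquet example; the bucket path and table names are placeholders:
call bigfunctions.eu.load_file(
'gs://your-bucket/path/to/delta_table',  -- placeholder Delta Lake table location
'delta', 'your_project.your_dataset.sales_history', null
);
select * from bigfunction_result;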
4. Excel/Shapefiles (via 'geo' file_type):
- Legacy Data: Loading data from legacy systems that often store data in Excel or Shapefile formats. Example: Loading customer data from an older CRM system.
- GIS Data: Loading geospatial data from shapefiles. Example: Loading data on road networks or land parcels.
5. General Web Files:
- Automated Data Ingestion: Regularly loading data from a web source as part of an automated data pipeline. Example: Daily updates of stock prices (see the scheduling sketch after this list).
- Ad-hoc Data Analysis: Quickly loading data from a web source for exploratory data analysis. Example: Analyzing a competitor's publicly available product catalog.
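As a sketch of such a pipeline, the call below could be run on a schedule (for example with BigQuery scheduled queries); the URL and table name are placeholders:
call bigfunctions.eu.load_file(
'https://example.com/daily_prices.csv',  -- placeholder: a source refreshed daily
'csv', 'your_project.your_dataset.daily_prices', null
);
select * from bigfunction_result;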
Key Advantages of using load_file:
- Simplicity: Reduces the need for complex ETL pipelines for simple data loading tasks.
- Speed: Directly loads data into BigQuery, bypassing intermediate steps.
- Flexibility: Supports various file formats and sources.
- Accessibility: Makes web-based data easily accessible for analysis within BigQuery.
Example Scenario:
A marketing analyst needs to analyze the performance of their recent social media campaigns. The social media platform provides an API that returns campaign data in CSV format. Instead of manually downloading the CSV file, processing it, and then uploading it to BigQuery, the analyst can use the load_file function to directly load the data from the API endpoint into a BigQuery table, saving time and effort.
Spread the word!¶
BigFunctions is fully open-source. Help make it a success by spreading the word!