load_file

Call or Deploy load_file?

✅ You can call this load_file bigfunction directly from your Google Cloud Project (no install required).

  • This load_file function is deployed in the bigfunctions GCP project, in 39 datasets covering all 39 BigQuery regions. You must use the dataset in the same region as your data (otherwise you may get a function not found error).
  • The function is public, so it can be called by anyone. Just copy/paste the examples below into your BigQuery console. It just works!
  • You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions --> read Getting Started. This is particularly useful if you want to create private functions (for example, calling your internal APIs).
  • For any question or difficulties, please read Getting Started.
  • Found a bug? Please raise an issue here

Public BigFunctions datasets follow this naming pattern:

Region         Dataset
-------------  -------------------------
eu             bigfunctions.eu
us             bigfunctions.us
europe-west1   bigfunctions.europe_west1
asia-east1     bigfunctions.asia_east1
...            ...

Signature

load_file(url, file_type, destination_table, options)

Description

Downloads a web file into destination_table.

This function extends the load_file_into_temp_dataset function.

Each call:

  1. Creates a new temporary dataset, accessible only to you, in the bigfunctions project.
  2. Downloads the file data into a table within this temporary dataset.
  3. Copies the table into destination_table, then deletes it from the temporary dataset.
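
After a successful call the loaded data is immediately queryable in destination_table. Below is a minimal sketch of the full flow; the file URL and table names are placeholders:

call bigfunctions.eu.load_file(
    'https://example.com/data.csv',  -- hypothetical file URL
    'csv', 'your_project.your_dataset.my_table', null
  );
select * from bigfunction_result;  -- returns the load status ('ok' on success)
-- the temporary dataset has been cleaned up; the data now lives in the destination table
select count(*) as nb_rows from your_project.your_dataset.my_table;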

File data is downloaded using ibis with the DuckDB backend. Available file_type values are:

  • csv : doc
  • json : doc
  • parquet : doc
  • delta : doc
  • geo : doc. (this uses GDAL under the hood and enables you to also read .xls, .xlsx, .shp, ...)
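
Note that no delta example appears below; a call would presumably mirror the parquet one in example 4. A sketch, with a hypothetical Delta Lake table path:

call bigfunctions.eu.load_file(
    'gs://your-bucket/your_delta_table',  -- hypothetical Delta Lake table location
    'delta', 'your_project.your_dataset.your_table', null
  );
select * from bigfunction_result;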

Examples

1. load random csv

call bigfunctions.eu.load_file(
    'https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv', 
    'csv', 'your_project.your_dataset.random_sales', null
  );
select * from bigfunction_result;

+--------+
| status |
+--------+
|   ok   |
+--------+


2. load json - french departements

call bigfunctions.eu.load_file(
    'https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region', 
    'json', 'your_project.your_dataset.dim_french_departements', null
  );
select * from bigfunction_result;

+--------+
| status |
+--------+
|   ok   |
+--------+


3. load json as string without records/struct inference (from DuckDB & BigQuery) (cf. issue #171)

call bigfunctions.eu.load_file(
    'https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region', 
    'json', 'your_project.your_dataset.dim_french_departements', '{"records": false, "columns": {"json": "string"}}'
  );
select * from bigfunction_result;

+--------+
| status |
+--------+
|   ok   |
+--------+
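
The raw JSON string loaded this way can then be parsed with BigQuery's JSON functions. A sketch, assuming the column is named json as in the options above and that the whole API response (a JSON array) lands in a single row; the exact JSON paths depend on how the reader splits the document into rows:

select json_value(json, '$[0].nom') as first_departement
from your_project.your_dataset.dim_french_departements;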


4. load parquet on Google Cloud Storage

call bigfunctions.eu.load_file(
    'gs://bike-sharing-history/toulouse/jcdecaux/2024/Feb.parquet', 
    'parquet', 'your_project.your_dataset.station_status', null
  );
select * from bigfunction_result;

+--------+
| status |
+--------+
|   ok   |
+--------+


5. load xls or xlsx

call bigfunctions.eu.load_file(
    'https://github.com/AntoineGiraud/dbt_hypermarche/raw/refs/heads/main/input/Hypermarche.xlsx', 
    'geo', 'your_project.your_dataset.hypermarche_retours', '{"layer":"Retours", "open_options": ["HEADERS=FORCE"]}'
  );
select * from bigfunction_result;

+--------+
| status |
+--------+
|   ok   |
+--------+


6. load a tricky french csv

call bigfunctions.eu.load_file(
    'https://www.data.gouv.fr/fr/datasets/r/323af5b8-7831-445b-9a46-d4da140b61b6', 
    'csv', 
    'your_project.your_dataset.dim_french_postalcodes', 
  '''{
    "columns": {
        "code_commune_insee": "VARCHAR",
        "nom_commune_insee": "VARCHAR",
        "code_postal": "VARCHAR",
        "lb_acheminement": "VARCHAR",
        "ligne_5": "VARCHAR"
    },
    "delim": ";",
    "skip": 1
  }'''
  );
select * from bigfunction_result;

+--------+
| status |
+--------+
|   ok   |
+--------+


Use cases

The load_file function is useful for quickly loading data from various web-based file formats directly into a BigQuery table. Here's a breakdown of potential use cases categorized by data type and source:

1. CSV Data:

  • Public Datasets: Loading publicly available datasets in CSV format, like government data or research data. Example: Loading census data or economic indicators from a government website.
  • Web APIs: Some APIs return data in CSV format. This function can be used to directly ingest that data into BigQuery. Example: A marketing API providing campaign performance data.
  • GitHub/GitLab: Loading data directly from CSV files stored in repositories like GitHub or GitLab. This is helpful for sharing data within teams or for reproducible research. Example: Loading a training dataset for a machine learning model.

2. JSON Data:

  • REST APIs: Many REST APIs return data in JSON format. load_file simplifies the process of ingesting this data into BigQuery without intermediate processing. Example: Loading product information from an e-commerce API.
  • GeoJSON Data: Loading geospatial data in GeoJSON format. Example: Loading geographic boundaries of cities or countries.
  • Configuration Files: Loading configuration data from JSON files hosted online.

3. Parquet/Delta Lake Data:

  • Data Lakes: Accessing and loading data directly from data lakes stored on cloud storage platforms like Google Cloud Storage. This is efficient for large datasets as Parquet and Delta Lake are optimized for analytical queries. Example: Loading historical sales data from a data lake.

4. Excel/Shapefiles (via 'geo' file_type):

  • Legacy Data: Loading data from legacy systems that often store data in Excel or Shapefile formats. Example: Loading customer data from an older CRM system.
  • GIS Data: Loading geospatial data from shapefiles. Example: Loading data on road networks or land parcels.

5. General Web Files:

  • Automated Data Ingestion: Regularly loading data from a web source as part of an automated data pipeline, as sketched after this list. Example: Daily updates of stock prices.
  • Ad-hoc Data Analysis: Quickly loading data from a web source for exploratory data analysis. Example: Analyzing a competitor's publicly available product catalog.
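
For the automated-ingestion case, the call can be wrapped in a BigQuery scheduled query. A sketch; the source URL and table names are hypothetical:

-- schedule this script to run daily in BigQuery
call bigfunctions.eu.load_file(
    'https://example.com/daily_stock_prices.csv',  -- hypothetical source URL
    'csv', 'your_project.your_dataset.stock_prices', null
  );
select * from bigfunction_result;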

Key Advantages of using load_file:

  • Simplicity: Reduces the need for complex ETL pipelines for simple data loading tasks.
  • Speed: Directly loads data into BigQuery, bypassing intermediate steps.
  • Flexibility: Supports various file formats and sources.
  • Accessibility: Makes web-based data easily accessible for analysis within BigQuery.

Example Scenario:

A marketing analyst needs to analyze the performance of their recent social media campaigns. The social media platform provides an API that returns campaign data in CSV format. Instead of manually downloading the CSV file, processing it, and then uploading it to BigQuery, the analyst can use the load_file function to directly load the data from the API endpoint into a BigQuery table, saving time and effort.
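
In this scenario, the whole workflow reduces to a single call. A sketch; the API endpoint and table names are hypothetical:

call bigfunctions.eu.load_file(
    'https://api.example-social-platform.com/campaigns/export?format=csv',  -- hypothetical API endpoint
    'csv', 'your_project.your_dataset.campaign_performance', null
  );
select * from bigfunction_result;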