
load_file_into_temp_dataset

load_file_into_temp_dataset(url, file_type, options)

Description

Downloads a web file into a temporary dataset in the bigfunctions project.


Each call to this function creates a new temporary dataset which:

  • contains the destination_table with the file data.
  • is accessible only to you (the caller) and the function. You have permission to read the data, delete the tables, and delete the dataset.
  • has a limited lifetime: the default expiration is 1 hour, so every table created in it is automatically deleted after 1 hour. Empty datasets are periodically removed.
  • has a random name.
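
Because you have read and delete permissions on the temporary dataset, you can query the returned table and clean it up yourself before it expires. A minimal sketch (the dataset name below is illustrative; use the destination_table value the function returns):

-- query the loaded file data (replace with your returned destination_table)
select * from bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data limit 10;

-- optionally drop the table before it expires
drop table bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data;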

File data is downloaded using ibis with DuckDB. Available file_type values are:

  • csv : doc
  • json : doc
  • parquet : doc
  • delta : doc
  • geo : doc. (this uses GDAL under the hood and also enables you to read .xls, .xlsx, .shp, ...)

Usage

Call or Deploy load_file_into_temp_dataset?
Call load_file_into_temp_dataset directly

The easiest way to use bigfunctions

  • load_file_into_temp_dataset function is deployed in 39 public datasets for all of the 39 BigQuery regions.
  • It can be called by anyone. Just copy/paste the examples below into your BigQuery console. It just works!
  • (You must use the dataset in the same region as your own datasets; otherwise you may get a "function not found" error.)

Public BigFunctions Datasets

Region         Dataset
eu             bigfunctions.eu
us             bigfunctions.us
europe-west1   bigfunctions.europe_west1
asia-east1     bigfunctions.asia_east1
...            ...
Deploy load_file_into_temp_dataset in your project

Why deploy?

  • You may prefer to deploy load_file_into_temp_dataset in your own project to build and manage your own catalog of functions.
  • This is particularly useful if you want to create private functions (for example calling your internal APIs).
  • Get started by reading the framework page

Deployment

The load_file_into_temp_dataset function can be deployed with:

pip install bigfunctions
bigfun get load_file_into_temp_dataset
bigfun deploy load_file_into_temp_dataset
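
Once deployed, call it from wherever it lives in your project. A minimal sketch (your_project, your_dataset, and the URL are placeholders):

select your_project.your_dataset.load_file_into_temp_dataset("https://example.com/data.csv", "csv", null)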

Examples

1. load a sample csv

select bigfunctions.eu.load_file_into_temp_dataset("https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv", "csv", null)
select bigfunctions.us.load_file_into_temp_dataset("https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv", "csv", null)
select bigfunctions.europe_west1.load_file_into_temp_dataset("https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv", "csv", null)
+------------------------------------------------------------------+
| destination_table                                                |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+
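
The function returns the fully qualified name of the created table as a string, so a natural follow-up is to query it dynamically with BigQuery scripting. A minimal sketch reusing the URL from the example above (the scripting around the call is ordinary BigQuery, not part of the function):

declare destination_table string;

-- load the file and capture the returned table name
set destination_table = (
  select bigfunctions.eu.load_file_into_temp_dataset(
    "https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv",
    "csv", null)
);

-- query the freshly loaded data
execute immediate format("select * from `%s` limit 10", destination_table);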

2. load json - french departements

select bigfunctions.eu.load_file_into_temp_dataset("https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region", "json", null)
select bigfunctions.us.load_file_into_temp_dataset("https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region", "json", null)
select bigfunctions.europe_west1.load_file_into_temp_dataset("https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region", "json", null)
+------------------------------------------------------------------+
| destination_table                                                |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+

3. load parquet on Google Cloud Storage

select bigfunctions.eu.load_file_into_temp_dataset("gs://bike-sharing-history/toulouse/jcdecaux/2024/Feb.parquet", "parquet", null)
select bigfunctions.us.load_file_into_temp_dataset("gs://bike-sharing-history/toulouse/jcdecaux/2024/Feb.parquet", "parquet", null)
select bigfunctions.europe_west1.load_file_into_temp_dataset("gs://bike-sharing-history/toulouse/jcdecaux/2024/Feb.parquet", "parquet", null)
+------------------------------------------------------------------+
| destination_table                                                |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+

4. load xls or xlsx

select bigfunctions.eu.load_file_into_temp_dataset("https://github.com/AntoineGiraud/dbt_hypermarche/raw/refs/heads/main/input/Hypermarche.xlsx", "geo", "{\"layer\":\"Retours\", \"open_options\": [\"HEADERS=FORCE\"]}")
select bigfunctions.us.load_file_into_temp_dataset("https://github.com/AntoineGiraud/dbt_hypermarche/raw/refs/heads/main/input/Hypermarche.xlsx", "geo", "{\"layer\":\"Retours\", \"open_options\": [\"HEADERS=FORCE\"]}")
select bigfunctions.europe_west1.load_file_into_temp_dataset("https://github.com/AntoineGiraud/dbt_hypermarche/raw/refs/heads/main/input/Hypermarche.xlsx", "geo", "{\"layer\":\"Retours\", \"open_options\": [\"HEADERS=FORCE\"]}")
+------------------------------------------------------------------+
| destination_table                                                |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+

5. load french tricky csv

select bigfunctions.eu.load_file_into_temp_dataset("https://www.data.gouv.fr/fr/datasets/r/323af5b8-7831-445b-9a46-d4da140b61b6", "csv", 
      '''{
        "columns": {
            "code_commune_insee": "VARCHAR",
            "nom_commune_insee": "VARCHAR",
            "code_postal": "VARCHAR",
            "lb_acheminement": "VARCHAR",
            "ligne_5": "VARCHAR"
        },
        "delim": ";",
        "skip": 1
      }'''
      )
select bigfunctions.us.load_file_into_temp_dataset("https://www.data.gouv.fr/fr/datasets/r/323af5b8-7831-445b-9a46-d4da140b61b6", "csv", 
      '''{
        "columns": {
            "code_commune_insee": "VARCHAR",
            "nom_commune_insee": "VARCHAR",
            "code_postal": "VARCHAR",
            "lb_acheminement": "VARCHAR",
            "ligne_5": "VARCHAR"
        },
        "delim": ";",
        "skip": 1
      }'''
      )
select bigfunctions.europe_west1.load_file_into_temp_dataset("https://www.data.gouv.fr/fr/datasets/r/323af5b8-7831-445b-9a46-d4da140b61b6", "csv", 
      '''{
        "columns": {
            "code_commune_insee": "VARCHAR",
            "nom_commune_insee": "VARCHAR",
            "code_postal": "VARCHAR",
            "lb_acheminement": "VARCHAR",
            "ligne_5": "VARCHAR"
        },
        "delim": ";",
        "skip": 1
      }'''
      )
+------------------------------------------------------------------+
| destination_table                                                |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+

Use cases

This function is useful for quickly loading data from various online sources directly into BigQuery for analysis without needing to manually download, format, and upload the data. Here are a few specific use cases:

1. Data Exploration and Prototyping:

  • You find a dataset on a public repository (like GitHub) or a government data portal and want to quickly explore it in BigQuery. load_file_into_temp_dataset lets you load the data directly, with no intermediate steps. This is perfect for initial data analysis and prototyping before deciding to store the data permanently.

2. Ad-hoc Analysis of Public Data:

  • You need to analyze some publicly available data, such as weather data, stock prices, or social media trends, for a one-time report or analysis. You can use this function to load the data on demand without storing it permanently.

3. ETL Pipelines with Dynamic Data Sources:

  • You're building an ETL pipeline that needs to process data from various sources that are updated frequently. load_file_into_temp_dataset can be integrated into your pipeline to dynamically load data from different URLs as needed. This is especially helpful when dealing with data sources that don't have a stable schema or format.

4. Data Enrichment:

  • You have a dataset in BigQuery and need to enrich it with external data, such as geographic information, currency exchange rates, or product catalogs. You can use this function to load the external data into a temporary table and then join it with your existing table (see the sketch after this list).

5. Sharing Data Snippets:

  • You want to share a small dataset with a colleague or client without giving them access to your entire data warehouse. Load the data into a temporary dataset using this function and then grant them temporary access. This offers a secure and convenient way to share data snippets.
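
For the enrichment case (4), a minimal sketch: load a reference file (reusing the French departements URL from example 2) and join it with one of your own tables. The table my_project.my_dataset.sales and the join columns are hypothetical:

declare ref_table string;

-- load the reference data and capture the returned table name
set ref_table = (
  select bigfunctions.eu.load_file_into_temp_dataset(
    "https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region",
    "json", null)
);

-- join it with your own table (table and column names are hypothetical)
execute immediate format("""
  select s.*, d.nom as departement_name
  from my_project.my_dataset.sales as s
  join `%s` as d on s.departement_code = d.code
""", ref_table);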

Example: Analyzing Tweet Sentiment from a Public API:

Imagine an API that returns tweet data in JSON format. You want to analyze the sentiment of tweets related to a specific hashtag.

  1. Call the API to retrieve the tweets. The API might offer a download link or allow you to stream the data directly.
  2. Use load_file_into_temp_dataset within a BigQuery query to load the JSON data from the API's URL.
  3. Apply BigQuery's text processing functions to analyze the sentiment of the tweets in the temporary table.
  4. Generate your report or visualization directly from the results.

This avoids the need to download the JSON file, create a table schema, and manually load the data, significantly speeding up your analysis. The temporary dataset automatically cleans itself up, simplifying data management.
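
A rough sketch of steps 2 to 4 (the API URL is hypothetical, the text column name is assumed, and the keyword lists are only a naive stand-in for real sentiment analysis):

declare tweets_table string;

-- step 2: load the JSON response into a temp table (hypothetical API URL)
set tweets_table = (
  select bigfunctions.eu.load_file_into_temp_dataset(
    "https://api.example.com/tweets?hashtag=bigquery", "json", null)
);

-- steps 3 and 4: naive keyword-based sentiment counts
execute immediate format("""
  select
    countif(regexp_contains(lower(text), r'great|love|awesome')) as positive_tweets,
    countif(regexp_contains(lower(text), r'bad|hate|awful')) as negative_tweets
  from `%s`
""", tweets_table);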


Need help or Found a bug?
Get help using load_file_into_temp_dataset

The community can help! Join the conversation on Slack.

We also provide professional support.

Report a bug about load_file_into_temp_dataset

If the function does not work as expected, please

  • report a bug so that it can be improved.
  • or open a discussion with the community on Slack.

We also provide professional support.


Show your ❤ by adding a ⭐ on GitHub.