load_file_into_temp_dataset¶
load_file_into_temp_dataset(url, file_type, options)
Description¶
Downloads a web file into a temporary dataset in the bigfunctions project.
Each call to this function creates a new temporary dataset which:
- contains the destination_table with the file data (see the sketch after the file-type list below for how to read it),
- is accessible only to you (the caller) and to the function: you have permission to read the data, delete the tables, and delete the dataset,
- has a limited lifetime: the default expiration time is 1h, so every table created is automatically deleted after 1h (empty datasets are periodically removed),
- has a random name.
File data is downloaded using ibis with DuckDB. Available file_type values are:
- csv: doc
- json: doc
- parquet: doc
- delta: doc
- geo: doc (this uses GDAL under the hood and also enables you to read .xls, .xlsx, .shp, ...)
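Because the destination table name is generated at run time, you typically capture the returned string and query it with dynamic SQL. Here is a minimal sketch (the URL is a placeholder; it assumes the function returns the fully qualified table name as a STRING, as the example outputs below suggest):

```sql
-- Sketch: capture the returned table name, query it, then clean up.
-- The URL is hypothetical; replace it with a real file.
DECLARE destination_table STRING;

SET destination_table = (
  SELECT bigfunctions.eu.load_file_into_temp_dataset(
    'https://example.com/data.csv', 'csv', null)
);

-- The table name is only known at run time, so use dynamic SQL to read it.
EXECUTE IMMEDIATE FORMAT('SELECT * FROM `%s` LIMIT 10', destination_table);

-- You have permission to delete the table once you are done.
EXECUTE IMMEDIATE FORMAT('DROP TABLE `%s`', destination_table);
```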
Usage¶
Call or Deploy load_file_into_temp_dataset?

Call load_file_into_temp_dataset directly

The easiest way to use BigFunctions: load_file_into_temp_dataset is deployed in 39 public datasets, one for each of the 39 BigQuery regions.
- It can be called by anyone. Just copy/paste the examples below into your BigQuery console: it just works!
- Use the dataset in the same region as your own datasets; otherwise you may get a "function not found" error.
Public BigFunctions Datasets:

| Region | Dataset |
|---|---|
| eu | bigfunctions.eu |
| us | bigfunctions.us |
| europe-west1 | bigfunctions.europe_west1 |
| asia-east1 | bigfunctions.asia_east1 |
| ... | ... |
Deploy load_file_into_temp_dataset in your project

Why deploy?
- You may prefer to deploy load_file_into_temp_dataset in your own project to build and manage your own catalog of functions.
- This is particularly useful if you want to create private functions (for example, functions calling your internal APIs).
- Get started by reading the framework page.

Deployment

The load_file_into_temp_dataset function can be deployed with:
pip install bigfunctions
bigfun get load_file_into_temp_dataset
bigfun deploy load_file_into_temp_dataset
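Once deployed, the function is called like the public ones, but from your own project (a sketch; your_project and your_dataset are placeholders for wherever you deployed it):

```sql
-- your_project.your_dataset is hypothetical: use your deployment target.
select your_project.your_dataset.load_file_into_temp_dataset(
  'https://example.com/data.csv', 'csv', null)
```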
Examples¶
The examples below call the eu dataset; replace bigfunctions.eu with the dataset matching your region (see the table above).
1. Load a random CSV
select bigfunctions.eu.load_file_into_temp_dataset("https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv", "csv", null)
+------------------------------------------------------------------+
| destination_table |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+
2. Load JSON: French départements
select bigfunctions.eu.load_file_into_temp_dataset("https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region", "json", null)
+------------------------------------------------------------------+
| destination_table |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+
3. Load a Parquet file from Google Cloud Storage
select bigfunctions.eu.load_file_into_temp_dataset("gs://bike-sharing-history/toulouse/jcdecaux/2024/Feb.parquet", "parquet", null)
+------------------------------------------------------------------+
| destination_table |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+
4. Load .xls or .xlsx
select bigfunctions.eu.load_file_into_temp_dataset("https://github.com/AntoineGiraud/dbt_hypermarche/raw/refs/heads/main/input/Hypermarche.xlsx", "geo", "{\"layer\":\"Retours\", \"open_options\": [\"HEADERS=FORCE\"]}")
+------------------------------------------------------------------+
| destination_table |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+
5. Load a tricky French CSV
select bigfunctions.eu.load_file_into_temp_dataset("https://www.data.gouv.fr/fr/datasets/r/323af5b8-7831-445b-9a46-d4da140b61b6", "csv",
'''{
"columns": {
"code_commune_insee": "VARCHAR",
"nom_commune_insee": "VARCHAR",
"code_postal": "VARCHAR",
"lb_acheminement": "VARCHAR",
"ligne_5": "VARCHAR"
},
"delim": ";",
"skip": 1
}'''
)
+------------------------------------------------------------------+
| destination_table |
+------------------------------------------------------------------+
| bigfunctions.temp_6bdb75ca_7f72_4f1f_b46a_6ca59f7f66ac.file_data |
+------------------------------------------------------------------+
Use cases¶
This function is useful for quickly loading data from various online sources directly into BigQuery for analysis without needing to manually download, format, and upload the data. Here are a few specific use cases:
1. Data Exploration and Prototyping:
- You find a dataset on a public repository (like GitHub) or a government data portal and want to explore it quickly in BigQuery. load_file_into_temp_dataset lets you load the data directly, without intermediate steps. This is perfect for initial data analysis and prototyping before deciding to store the data permanently.
2. Ad-hoc Analysis of Public Data:
- You need to analyze some publicly available data, such as weather data, stock prices, or social media trends, for a one-time report or analysis. You can use this function to load the data on demand without storing it permanently.
3. ETL Pipelines with Dynamic Data Sources:
- You're building an ETL pipeline that needs to process data from various sources that are updated frequently. load_file_into_temp_dataset can be integrated into your pipeline to dynamically load data from different URLs as needed. This is especially helpful when dealing with data sources that don't have a stable schema or format.
4. Data Enrichment:
- You have a dataset in BigQuery and need to enrich it with external data, such as geographic information, currency exchange rates, or product catalogs. You can use this function to load the external data into a temporary table and then join it with your existing table (see the sketch after this list).
5. Sharing Data Snippets:
- You want to share a small dataset with a colleague or client without giving them access to your entire data warehouse. Load the data into a temporary dataset using this function and then grant them temporary access. This offers a secure and convenient way to share data snippets.
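As an illustration of use case 4, the enrichment pattern might look like the following sketch (the sales table, the exchange-rates URL, and the join key are all hypothetical):

```sql
DECLARE destination_table STRING;

-- Load external reference data (hypothetical URL) into a temp table.
SET destination_table = (
  SELECT bigfunctions.eu.load_file_into_temp_dataset(
    'https://example.com/exchange_rates.csv', 'csv', null)
);

-- Join it with an existing table via dynamic SQL, since the temp
-- table name is only known at run time (table names are hypothetical).
EXECUTE IMMEDIATE FORMAT('''
  CREATE OR REPLACE TABLE my_project.my_dataset.sales_enriched AS
  SELECT s.*, r.rate
  FROM my_project.my_dataset.sales AS s
  LEFT JOIN `%s` AS r
    ON s.currency = r.currency
''', destination_table);
```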
Example: Analyzing Tweet Sentiment from a Public API

Imagine an API that returns tweet data in JSON format, and you want to analyze the sentiment of tweets related to a specific hashtag.
- Call the API to retrieve the tweets. The API might offer a download link or allow you to stream the data directly.
- Use load_file_into_temp_dataset within a BigQuery query to load the JSON data from the API's URL.
- Apply BigQuery's text processing functions to analyze the sentiment of the tweets in the temporary table.
- Generate your report or visualization directly from the results.
This avoids the need to download the JSON file, create a table schema, and manually load the data, significantly speeding up your analysis. The temporary dataset automatically cleans itself up, simplifying data management.
Need help or found a bug?

Get help using load_file_into_temp_dataset
The community can help! Join the conversation on Slack.
We also provide professional support.

Report a bug about load_file_into_temp_dataset
If the function does not work as expected, please:
- report a bug so that it can be improved,
- or open a discussion with the community on Slack.
We also provide professional support.