load_file¶
load_file(url, file_type, destination_table, options)
Description¶
Downloads a web file into destination_table.
This function extends the load_file_into_temp_dataset function.
Each call:
- Creates a new temporary dataset, accessible only to you, in the bigfunctions project.
- Downloads the file data into a table within this temporary dataset.
- Copies the table into destination_table, then deletes it from the temporary dataset.
File data is downloaded using ibis with DuckDB. Available file_type values are:
- csv : doc
- json : doc
- parquet : doc
- delta : doc
- geo : doc (this uses GDAL under the hood and enables you to also read .xls, .xlsx, .shp, ...)
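For orientation, here is a minimal call sketch. The URL and the option value are placeholders, not real data; see the examples below for working invocations.
call bigfunctions.eu.load_file(
  'https://example.com/data.csv',         -- url: placeholder file location
  'csv',                                  -- file_type: one of the values above
  'your_project.your_dataset.my_table',   -- destination_table
  '{"delim": ","}'                        -- options: a JSON string, or null
);
select * from bigfunction_result;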
Examples¶
Call or Deploy load_file?
Call load_file directly
The easiest way to use BigFunctions: load_file is deployed in 39 public datasets, one for each of the 39 BigQuery regions. It can be called by anyone. Just copy / paste the examples below in your BigQuery console. It just works!
(You need to use the dataset in the same region as your datasets, otherwise you may get a "function not found" error. The examples below call bigfunctions.eu; swap in your region's dataset.)
Public BigFunctions Datasets

| Region | Dataset |
|---|---|
| eu | bigfunctions.eu |
| us | bigfunctions.us |
| europe-west1 | bigfunctions.europe_west1 |
| asia-east1 | bigfunctions.asia_east1 |
| ... | ... |
Deploy load_file in your project
Why deploy?
- You may prefer to deploy load_file in your own project to build and manage your own catalog of functions.
- This is particularly useful if you want to create private functions (for example, calling your internal APIs).
- Get started by reading the framework page.
Deployment
The load_file function can be deployed with:
pip install bigfunctions
bigfun get load_file
bigfun deploy load_file
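Once deployed, call the function from your own project instead of the public datasets. A sketch, assuming you deployed it into your_project.your_dataset (the URL is a placeholder):
call your_project.your_dataset.load_file(
'https://example.com/data.csv',  -- placeholder url
'csv', 'your_project.your_dataset.my_table', null
);
select * from bigfunction_result;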
1. load random csv
call bigfunctions.eu.load_file(
'https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv',
'csv', 'your_project.your_dataset.random_sales', null
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
2. load json - french departements
call bigfunctions.eu.load_file(
'https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region',
'json', 'your_project.your_dataset.dim_french_departements', null
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
3. load json as string without records/struct inference (from DuckDB & BigQuery) - (cf. issue #171)
call bigfunctions.eu.load_file(
'https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region',
'json', 'your_project.your_dataset.dim_french_departements', '{"records": false, "columns": {"json": "string"}}'
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
4. load parquet on Google Cloud Storage
call bigfunctions.eu.load_file(
'gs://bike-sharing-history/toulouse/jcdecaux/2024/Feb.parquet',
'parquet', 'your_project.your_dataset.station_status', null
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
5. load xls or xlsx
call bigfunctions.eu.load_file(
'https://github.com/AntoineGiraud/dbt_hypermarche/raw/refs/heads/main/input/Hypermarche.xlsx',
'geo', 'your_project.your_dataset.hypermarche_retours', '{"layer":"Retours", "open_options": ["HEADERS=FORCE"]}'
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
6. load a tricky french csv
call bigfunctions.eu.load_file(
'https://www.data.gouv.fr/fr/datasets/r/323af5b8-7831-445b-9a46-d4da140b61b6',
'csv',
'your_project.your_dataset.dim_french_postalcodes',
'''{
"columns": {
"code_commune_insee": "VARCHAR",
"nom_commune_insee": "VARCHAR",
"code_postal": "VARCHAR",
"lb_acheminement": "VARCHAR",
"ligne_5": "VARCHAR"
},
"delim": ";",
"skip": 1
}'''
);
select * from bigfunction_result;
+--------+
| status |
+--------+
|   ok   |
+--------+
Need help or found a bug using load_file?
Get help using load_file
The community can help! Engage the conversation on Slack.
We also provide professional support.
Report a bug about load_file
If the function does not work as expected, please:
- report a bug so that it can be improved,
- or open the discussion with the community on Slack.
We also provide professional support.
Use cases¶
The load_file function is useful for quickly loading data from various web-based file formats directly into a BigQuery table. Here's a breakdown of potential use cases, categorized by data type and source:
1. CSV Data:
- Public Datasets: Loading publicly available datasets in CSV format, like government data or research data. Example: Loading census data or economic indicators from a government website.
- Web APIs: Some APIs return data in CSV format. This function can be used to directly ingest that data into BigQuery. Example: A marketing API providing campaign performance data.
- GitHub/GitLab: Loading data directly from CSV files stored in repositories like GitHub or GitLab. This is helpful for sharing data within teams or for reproducible research. Example: Loading a training dataset for a machine learning model.
2. JSON Data:
- REST APIs: Many REST APIs return data in JSON format. load_file simplifies the process of ingesting this data into BigQuery without intermediate processing. Example: Loading product information from an e-commerce API.
- GeoJSON Data: Loading geospatial data in GeoJSON format. Example: Loading geographic boundaries of cities or countries.
- Configuration Files: Loading configuration data from JSON files hosted online.
3. Parquet/Delta Lake Data:
- Data Lakes: Accessing and loading data directly from data lakes stored on cloud storage platforms like Google Cloud Storage. This is efficient for large datasets, as Parquet and Delta Lake are optimized for analytical queries. Example: Loading historical sales data from a data lake (a delta sketch follows below).
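None of the examples above covers the delta file_type, so here is a sketch mirroring the parquet example; the bucket path and table names are placeholders:
call bigfunctions.eu.load_file(
'gs://your-bucket/path/to/delta_table',  -- placeholder Delta Lake table location
'delta', 'your_project.your_dataset.sales_history', null
);
select * from bigfunction_result;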
4. Excel/Shapefiles (via 'geo' file_type):
- Legacy Data: Loading data from legacy systems that often store data in Excel or Shapefile formats. Example: Loading customer data from an older CRM system.
- GIS Data: Loading geospatial data from shapefiles. Example: Loading data on road networks or land parcels.
5. General Web Files:
- Automated Data Ingestion: Regularly loading data from a web source as part of an automated data pipeline. Example: Daily updates of stock prices (see the scheduling sketch after this list).
- Ad-hoc Data Analysis: Quickly loading data from a web source for exploratory data analysis. Example: Analyzing a competitor's publicly available product catalog.
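As a sketch of such a pipeline, the call below could be run on a schedule (for example with BigQuery scheduled queries); the URL and table name are placeholders:
call bigfunctions.eu.load_file(
'https://example.com/daily_prices.csv',  -- placeholder: a source refreshed daily
'csv', 'your_project.your_dataset.daily_prices', null
);
select * from bigfunction_result;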
Key Advantages of using load_file:
- Simplicity: Reduces the need for complex ETL pipelines for simple data loading tasks.
- Speed: Directly loads data into BigQuery, bypassing intermediate steps.
- Flexibility: Supports various file formats and sources.
- Accessibility: Makes web-based data easily accessible for analysis within BigQuery.
Example Scenario:
A marketing analyst needs to analyze the performance of their recent social media campaigns. The social media platform provides an API that returns campaign data in CSV format. Instead of manually downloading the CSV file, processing it, and then uploading it to BigQuery, the analyst can use the load_file function to directly load the data from the API endpoint into a BigQuery table, saving time and effort.
Spread the word!¶
BigFunctions is fully open-source. Help make it a success by spreading the word!