get_github_data¶
get_github_data(public_repo, destination_dataset, streams)
Description¶
Get data from public_repo
into destination_dataset
(using GitHub Airbyte Connector with Airbyte-Serverless)
Data is appended in raw format in tables (one table per stream) into
destination_dataset
. When supported by the stream, data is extracted incrementally (next execution will only retrieve new rows).You must create the
destination_dataset
and givedataEditor
access to[email protected]
before calling this function. You can do this by executing:-- Create Destination Dataset create schema `your_project.your_dataset`; -- Grant Access to Destination Dataset grant `roles/bigquery.dataEditor` on schema `your_project.your_dataset` to 'serviceAccount:[email protected]';
While it's running (or after) you can explore logs in table
your_project.your_dataset._airbyte_logs
Examples¶
Call or Deploy get_github_data
?
Call get_github_data
directly
The easiest way to use bigfunctions
get_github_data
function is deployed in 39 public datasets for all of the 39 BigQuery regions.- It can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
- (You need to use the dataset in the same region as your datasets otherwise you may have a function not found error)
Public BigFunctions Datasets
Region | Dataset |
---|---|
eu |
bigfunctions.eu |
us |
bigfunctions.us |
europe-west1 |
bigfunctions.europe_west1 |
asia-east1 |
bigfunctions.asia_east1 |
... | ... |
Deploy get_github_data
in your project
Why deploy?
- You may prefer to deploy
get_github_data
in your own project to build and manage your own catalog of functions. - This is particularly useful if you want to create private functions (for example calling your internal APIs).
- Get started by reading the framework page
Deployment
get_github_data
function can be deployed with:
pip install bigfunctions
bigfun get get_github_data
bigfun deploy get_github_data
Requirements
get_github_data
uses the following secrets. Get them by reading the documentation link and store them in Google Secret Manager in the project where you deploy the function (and give Accessor role to the service account of the function):
name | description | documentation to get the secret |
---|---|---|
github_personal_access_token |
GitHub Personal Access Token for public repositories | doc |
1. Get stargazers from airbytehq/airbyte repository
select bigfunctions.eu.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers")
select bigfunctions.us.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers")
select bigfunctions.europe_west1.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers")
+--------+
| result |
+--------+
| ok |
+--------+
2. Get stargazers AND releases from airbytehq/airbyte repository
select bigfunctions.eu.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers, releases")
select bigfunctions.us.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers, releases")
select bigfunctions.europe_west1.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers, releases")
+--------+
| result |
+--------+
| ok |
+--------+
3. To get a list of available streams, let streams
param to null
select bigfunctions.eu.get_github_data("airbytehq/airbyte", "your_project.your_dataset", null)
select bigfunctions.us.get_github_data("airbytehq/airbyte", "your_project.your_dataset", null)
select bigfunctions.europe_west1.get_github_data("airbytehq/airbyte", "your_project.your_dataset", null)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| issue_timeline_events,assignees,branches,collaborators,comments,commit_comment_reactions,commit_comments,commits,contributor_activity,deployments,events,issue_comment_reactions,issue_events,issue_labels,issue_milestones,issue_reactions,issues,organizations,project_cards,project_columns,projects,pull_request_comment_reactions,pull_request_commits,pull_request_stats,projects_v2,pull_requests,releases,repositories,review_comments,reviews,stargazers,tags,teams,team_members,users,workflows,workflow_runs,workflow_jobs,team_memberships |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Need help or Found a bug using get_github_data
?
Get help using get_github_data
The community can help! Engage the conversation on Slack
We also provide professional suppport.
Report a bug about get_github_data
If the function does not work as expected, please
- report a bug so that it can be improved.
- or open the discussion with the community on Slack.
We also provide professional suppport.
Use cases¶
A data analyst wants to analyze the activity on the dbt-labs/dbt-core
GitHub repository. They are particularly interested in tracking the number of stargazers over time, identifying key contributors through commits, and understanding the release history. They can use the get_github_data
BigQuery function to import this data directly into BigQuery for analysis.
Here's how they would use the function:
- Prepare the destination dataset:
-- Create the dataset
CREATE SCHEMA `my_project.github_data`;
-- Grant access to the BigFunctions service account
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my_project.github_data`
TO 'serviceAccount:[email protected]';
- Import the data using the function:
SELECT bigfunctions.us.get_github_data('dbt-labs/dbt-core', 'my_project.github_data', 'stargazers, commits, releases');
This call will import data for the specified streams (stargazers
, commits
, and releases
) into tables within the my_project.github_data
dataset. For instance, the stargazer data will likely be in a table named my_project.github_data.stargazers
.
- Analyze the data in BigQuery:
Now the analyst can use standard SQL queries to analyze the imported data. For example:
-- Track stargazer growth over time
SELECT DATE(starred_at) AS star_date, COUNT(*) AS num_stars
FROM `my_project.github_data.stargazers`
GROUP BY star_date
ORDER BY star_date;
-- Identify top contributors
SELECT author.login, COUNT(*) AS num_commits
FROM `my_project.github_data.commits`
GROUP BY author.login
ORDER BY num_commits DESC;
-- Explore release history
SELECT name, tag_name, published_at
FROM `my_project.github_data.releases`
ORDER BY published_at DESC;
This use case demonstrates how the get_github_data
function simplifies the process of importing and analyzing GitHub repository data within BigQuery, enabling data-driven insights into project activity and community engagement.
Spread the word!¶
BigFunctions is fully open-source. Help make it a success by spreading the word!