Skip to content

get_github_data

get_github_data(public_repo, destination_dataset, streams)

Description

Get data from public_repo into destination_dataset (using GitHub Airbyte Connector with Airbyte-Serverless)

Data is appended in raw format in tables (one table per stream) into destination_dataset. When supported by the stream, data is extracted incrementally (next execution will only retrieve new rows).

You must create the destination_dataset and give dataEditor access to [email protected] before calling this function. You can do this by executing:

-- Create Destination Dataset
create schema `your_project.your_dataset`;

-- Grant Access to Destination Dataset
grant `roles/bigquery.dataEditor`
on schema `your_project.your_dataset`
to 'serviceAccount:[email protected]';

While it's running (or after) you can explore logs in table your_project.your_dataset._airbyte_logs

Examples

Call or Deploy get_github_data ?
Call get_github_data directly

The easiest way to use bigfunctions

  • get_github_data function is deployed in 39 public datasets for all of the 39 BigQuery regions.
  • It can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
  • (You need to use the dataset in the same region as your datasets otherwise you may have a function not found error)

Public BigFunctions Datasets

Region Dataset
eu bigfunctions.eu
us bigfunctions.us
europe-west1 bigfunctions.europe_west1
asia-east1 bigfunctions.asia_east1
... ...
Deploy get_github_data in your project

Why deploy?

  • You may prefer to deploy get_github_data in your own project to build and manage your own catalog of functions.
  • This is particularly useful if you want to create private functions (for example calling your internal APIs).
  • Get started by reading the framework page

Deployment

get_github_data function can be deployed with:

pip install bigfunctions
bigfun get get_github_data
bigfun deploy get_github_data

Requirements

get_github_data uses the following secrets. Get them by reading the documentation link and store them in Google Secret Manager in the project where you deploy the function (and give Accessor role to the service account of the function):

name description documentation to get the secret
github_personal_access_token GitHub Personal Access Token for public repositories doc

1. Get stargazers from airbytehq/airbyte repository

select bigfunctions.eu.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers")
select bigfunctions.us.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers")
select bigfunctions.europe_west1.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers")
+--------+
| result |
+--------+
| ok     |
+--------+

2. Get stargazers AND releases from airbytehq/airbyte repository

select bigfunctions.eu.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers, releases")
select bigfunctions.us.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers, releases")
select bigfunctions.europe_west1.get_github_data("airbytehq/airbyte", "your_project.your_dataset", "stargazers, releases")
+--------+
| result |
+--------+
| ok     |
+--------+

3. To get a list of available streams, let streams param to null

select bigfunctions.eu.get_github_data("airbytehq/airbyte", "your_project.your_dataset", null)
select bigfunctions.us.get_github_data("airbytehq/airbyte", "your_project.your_dataset", null)
select bigfunctions.europe_west1.get_github_data("airbytehq/airbyte", "your_project.your_dataset", null)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| issue_timeline_events,assignees,branches,collaborators,comments,commit_comment_reactions,commit_comments,commits,contributor_activity,deployments,events,issue_comment_reactions,issue_events,issue_labels,issue_milestones,issue_reactions,issues,organizations,project_cards,project_columns,projects,pull_request_comment_reactions,pull_request_commits,pull_request_stats,projects_v2,pull_requests,releases,repositories,review_comments,reviews,stargazers,tags,teams,team_members,users,workflows,workflow_runs,workflow_jobs,team_memberships |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Need help or Found a bug using get_github_data?
Get help using get_github_data

The community can help! Engage the conversation on Slack

We also provide professional suppport.

Report a bug about get_github_data

If the function does not work as expected, please

  • report a bug so that it can be improved.
  • or open the discussion with the community on Slack.

We also provide professional suppport.

Use cases

A data analyst wants to analyze the activity on the dbt-labs/dbt-core GitHub repository. They are particularly interested in tracking the number of stargazers over time, identifying key contributors through commits, and understanding the release history. They can use the get_github_data BigQuery function to import this data directly into BigQuery for analysis.

Here's how they would use the function:

  1. Prepare the destination dataset:
-- Create the dataset
CREATE SCHEMA `my_project.github_data`;

-- Grant access to the BigFunctions service account
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my_project.github_data`
TO 'serviceAccount:[email protected]';
  1. Import the data using the function:
SELECT bigfunctions.us.get_github_data('dbt-labs/dbt-core', 'my_project.github_data', 'stargazers, commits, releases');

This call will import data for the specified streams (stargazers, commits, and releases) into tables within the my_project.github_data dataset. For instance, the stargazer data will likely be in a table named my_project.github_data.stargazers.

  1. Analyze the data in BigQuery:

Now the analyst can use standard SQL queries to analyze the imported data. For example:

-- Track stargazer growth over time
SELECT DATE(starred_at) AS star_date, COUNT(*) AS num_stars
FROM `my_project.github_data.stargazers`
GROUP BY star_date
ORDER BY star_date;

-- Identify top contributors
SELECT author.login, COUNT(*) AS num_commits
FROM `my_project.github_data.commits`
GROUP BY author.login
ORDER BY num_commits DESC;

-- Explore release history
SELECT name, tag_name, published_at
FROM `my_project.github_data.releases`
ORDER BY published_at DESC;

This use case demonstrates how the get_github_data function simplifies the process of importing and analyzing GitHub repository data within BigQuery, enabling data-driven insights into project activity and community engagement.

Spread the word!

BigFunctions is fully open-source. Help make it a success by spreading the word!

Share on Add a on