Reference

stscraper is a Python interface to the GitHub API

Key features:

  • utilize multiple API keys to speed up scraping

  • transparently handle pagination and minor network errors

Installation

pip install --user --upgrade strudel.scraper

Usage

The main way to use this module is through GitHubAPI objects.

import stscraper as scraper
import pandas as pd

gh_api = scraper.GitHubAPI("token1,token2,...")

# repo_issues is a generator that can be used
# to instantiate a pandas dataframe
issues = pd.DataFrame(gh_api.repo_issues('cmustrudel/strudel.scraper'))

Tokens can be provided either at class instantiation or through an environment variable:

# somewhere in ~/.bashrc
export GITHUB_API_TOKENS='comma-separated list of tokens'
# later, in some Python file:
gh_api = scraper.GitHubAPI()  # tokens from the environment var will be used

If no keys were passed at class instantiation and the GITHUB_API_TOKENS environment variable is not defined, stscraper will also check the GITHUB_TOKEN environment variable. This variable is created by the GitHub Actions runner and is also used by the hub utility.
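
As a minimal sketch of this fallback (the token value below is a placeholder), the same GitHubAPI() call works whether the tokens come from GITHUB_API_TOKENS or from GITHUB_TOKEN:

import os
import stscraper as scraper

# Placeholder token; in practice GITHUB_TOKEN is set by the GitHub Actions
# runner or exported manually. It is only consulted when no tokens are passed
# explicitly and GITHUB_API_TOKENS is not defined.
os.environ['GITHUB_TOKEN'] = '<your token>'

gh_api = scraper.GitHubAPI()  # falls back to GITHUB_TOKEN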

REST (v3) API

class stscraper.GitHubAPI(*args, **kwargs)[source]

This is a convenience class to pool GitHub v3 API keys and update their limits after every request. Actual work is done by outside classes, such as _IssueIterator and _CommitIterator.

all_repos()[source]

Get all GitHub repositories

all_users()[source]

Get all GitHub users

issue_comments(repo, issue_id)[source]

Get comments on an issue or a pull request. Note that for pull requests this method returns only general comments on the pull request, not review comments attached to specific code. Use review_comments() to get those instead.
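
For example, to collect both kinds of comments on a pull request (a sketch; the pull request number is hypothetical and gh_api is a GitHubAPI instance as in the Usage section):

# general (conversation) comments on the pull request
general = list(gh_api.issue_comments('cmustrudel/strudel.scraper', 1))
# review comments attached to specific code in the same pull request
code_comments = list(gh_api.review_comments('cmustrudel/strudel.scraper', 1))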

issue_events(repo, issue_no)[source]

Get issue events. This includes state changes, references, labels etc.

org_members(org)[source]

Get public organization members. Note that if you are a member of the organization, you’ll get everybody.

org_repos(org)[source]

Get organization repositories

static project_exists(repo_slug)[source]

Check if the project exists. This is a slightly cheaper alternative to getting repository info. It does not use API keys.
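
A small sketch of using it to gate heavier calls (no tokens are needed for the existence check itself):

# static method, so it can be called without instantiating the class
if scraper.GitHubAPI.project_exists('cmustrudel/strudel.scraper'):
    info = gh_api.repo_info('cmustrudel/strudel.scraper')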

pull_request_commits(repo, pr_id)[source]

Get commits in a pull request. pr_id is the visible pull request number, not internal GitHub id.

repo_commit(repo_slug, commit_hash)[source]

Get details for a single commit.

repo_commits(repo_slug)[source]

Get all repository commits. Note that the GitHub API might ignore some merge commits.

repo_contributors(repo_slug)[source]

Get a timeline of up to 100 top project contributors

Suggested use:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     GitHubAPI().repo_contributors(repo_slug)).set_index('user')
>>> df.columns = pd.to_datetime(df.columns, unit='s')
>>> df
          2018-08-19  2018-08-26    ...    2020-07-12  2020-07-19
user                                ...
user2589           3           0    ...             0           0
...

repo_info(repo_slug)[source]

Get repository info

repo_issue_comments(repo_slug)[source]

Get all comments in all issues and pull requests, both open and closed.

repo_issue_events(repo_slug)[source]

Get all events in all issues and pull requests, both open and closed.

repo_issues(repo_slug)[source]

Get repository issues (not including pull requests)

repo_labels(repo_slug)[source]

Get a tuple of repository labels. Labels are issue tags used by maintainers.

>>> GitHubAPI().repo_labels('pandas-dev/pandas')[:5]
('2/3 Compat', '32bit', 'API - Consistency', 'API Design', 'Admin')

repo_pulls(repo_slug)[source]

Get all repository pull requests. Unlike the issues API, this method will return information specific to pull requests, such as head SHAs and branch names.
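
As with repo_issues() in the Usage section, the generator can be fed directly into a pandas DataFrame (a sketch, not part of the library's examples):

import pandas as pd
pulls = pd.DataFrame(gh_api.repo_pulls('cmustrudel/strudel.scraper'))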

repo_topics(repo_slug)[source]

Get a tuple of repository topics. Topics are “keywords” assigned by the repository owner.

>>> GitHubAPI().repo_topics('pandas-dev/pandas')
('data-analysis', 'pandas', 'flexible', 'alignment', 'python')

review_comments(repo, pr_id)[source]

Get pull request comments related to specific code. This will not return general comments; see issue_comments() for those.

user_info(username)[source]

Get user info: name, location, blog, etc.

user_orgs(username)[source]

Get user organization membership. Usually includes only public memberships, but for yourself you get non-public as well.

user_repos(username)[source]

Get list of user repositories

GraphQL (v4) API

class stscraper.GitHubAPIv4(*args, **kwargs)[source]

An interface to GitHub v4 GraphQL API.

Due to the nature of the GraphQL API, this class does not provide a specific set of methods. Instead, you are expected to write your own queries, and this class will help you with pagination and network timeouts.

Basic usage:

>>> api = GitHubAPIv4('github_api_tokens')
>>> api('''query ($user: String!) {
...       user(login:$user) {
...         login, name
...       }}''', user='user2589')
{'login': 'user2589', 'name': None}
>>> list(api('''query ($user: String!, $cursor: String) {
...       user(login: $user) {
...         followers(first:100, after:$cursor) {
...           nodes { login }
...           pageInfo{endCursor, hasNextPage}
...     }}}''', user='user2589'))
[{'login': 'ArefMq'},
 # ... more users
 {'login': 'neoascetic'}]

In the first case, it will return a dictionary of user attributes. In the second case, it will return a generator of objects, handling pagination transparently.

It looks a little bit like magic, but it is not. Here is how it works:

  • first, it will parse the query and try to figure out the first object that
    has multiple fields; in the first query, it is user. In the second, it is
    user.followers. The query parser is pretty naive and is expected to fail on
    non-trivial queries. In this case, you will need to explicitly specify which
    object you want to retrieve. In the example below, we explicitly tell
    scraper the path to the return object in the second positional argument:

    >>> api('...some query..',
    ...     ('repository', 'defaultBranchRef', 'target', 'history'),
    ...     owner='CMUSTRUDEL', repo='strudel.scraper')
    
  • then, it will check if there is a pageInfo object in this object. If there
    is not, it will simply return the content of this object; this is what
    happened with the first query. If there IS a pagination object, it
    indicates that pagination is needed, and the content of nodes or edges will
    be returned instead.
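
For instance, here is a sketch of a paginated commit-history query using the explicit path from the example above. The query text is illustrative and written against the public GitHub GraphQL schema; only the path tuple and the keyword arguments come from the example:

api = GitHubAPIv4('github_api_tokens')
commits = api(
    '''query ($owner: String!, $repo: String!, $cursor: String) {
         repository(owner: $owner, name: $repo) {
           defaultBranchRef { target { ... on Commit {
             history(first: 100, after: $cursor) {
               nodes { oid, messageHeadline }
               pageInfo { endCursor, hasNextPage }
             }}}}}}''',
    ('repository', 'defaultBranchRef', 'target', 'history'),
    owner='CMUSTRUDEL', repo='strudel.scraper')
# since the query contains pageInfo, this yields one dict per commit,
# with pagination handled transparently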