Reference

This module provides an interface to the “unofficial GitHub API”, i.e. data that is available in the GitHub user interface but not through the official API.

This includes:

  • user contributions timeline (all repositories contributed to,
    organizations joined publicly, created repos, reported issues, etc.). There is no official API for this, and public datasets like GHTorrent do not report some of these events.
  • user contribution stats (just the number of contributions per year).
    You can get the same information from GHTorrent, but this method takes only one HTTP request and is thus much faster.
  • weekly contributor stats for a project
    (number of Lines Of Code contributed per week by the top 100 contributors since the beginning of the project). LOC information is not available via the official API, and similar stats for commits take multiple requests via the official API.

class stgithub.Scraper[source]

A class to access the “unofficial GitHub API”.

Note

This “unofficial API” is rate limited, just like the official one. The limit is 40 requests per 80 seconds, and some calls take multiple requests. For example, parsing a user activity timeline typically takes a couple of minutes. Use this “API” with caution, as it can be extremely slow.
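
The limit above can be respected on the caller's side with a small sliding-window throttle. The sketch below is only an illustration: the class name SlidingWindowThrottle and its parameters are hypothetical and not part of stgithub, and the defaults simply mirror the 40 requests per 80 seconds quoted in the note.

```python
import time
from collections import deque


class SlidingWindowThrottle:
    """Allow at most `max_calls` calls within any `period` seconds.

    Hypothetical helper, not part of stgithub.
    """

    def __init__(self, max_calls=40, period=80.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable, so the logic can be tested
        self._sleep = sleep
        self._stamps = deque()   # times of the most recent calls

    def wait(self):
        """Block until one more call fits; return the seconds waited."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        delay = 0.0
        if len(self._stamps) >= self.max_calls:
            # The window is full: wait until the oldest call expires.
            delay = self.period - (now - self._stamps[0])
            self._sleep(delay)
            self._stamps.popleft()
        self._stamps.append(self._clock())
        return delay
```

Calling wait() on one shared instance right before each request keeps a long-running loop under the limit. Whether Scraper already throttles its own requests is not stated here, so treat this as an optional outer guard.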

full_user_activity_timeline(user, start=None, to=None)[source]

Get a list of public user contributions, by month by repository.

Parameters:
  • user (str) – GitHub login of the user to get activity for.
  • start (str) – date to start with, e.g. ‘2017-01’ or ‘2017-01-01’. datetime objects should also work.
  • to (str) – upper bound of the date range to parse, in the same format as start. Note: the day defaults to 1, i.e. ‘2017-01’ is interpreted as the 1st of January 2017.
Yields:

Dict[str, int] – A generator of activity dictionaries. Each dict has the fields month, a %Y-%m formatted month, and repo, a repository slug. The other fields indicate the number of contributions of a given type:

  • commits: number of commits.
  • issues: number of reported issues.
  • reviews: number of reviewed pull requests.
    GitHub counts any commented pull request as reviewed, also ignoring any code comments.
  • pull_requests: number of pull requests submitted.
  • created_repository: can only be 1.
  • joined_org: can only be 1.
    The repository slug in this case is the GitHub org name.
  • private: all contributions in private repositories combined,
    if user enabled anonymous reporting of private activities. The repository slug in this case is an empty string.

The output of this method is suitable for a pd.DataFrame constructor:

>>> pd.DataFrame(
...     Scraper().full_user_activity_timeline('user2589'))
     commits   ...     reviews
...
111      NaN   ...         NaN
112      NaN   ...         NaN
113      1.0   ...         NaN
<BLANKLINE>
[114 rows x 9 columns]

It is even better to index on month+repo and replace NaNs:

>>> pd.DataFrame(
...     Scraper().full_user_activity_timeline('user2589')
... ).set_index(['month', 'repo']).fillna(0).astype(int)
                                         commits   ...     reviews
month   repo                                       ...
...
2012-05 user2589/minicms                      11   ...           0
2011-09 alsoicode/django-admin-sortable        0   ...           0
2011-08 user2589/django-rosetta                0   ...           0
        mbi/django-rosetta                     0   ...           0
2005-03 user2589/schooligan                    1   ...           0
<BLANKLINE>
[114 rows x 7 columns]
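
When pandas is not available, the same dictionaries can be aggregated with the standard library. The sketch below sums every contribution counter per repository slug. The records are made-up samples shaped like the dictionaries described above, and totals_by_repo is a hypothetical helper, not part of stgithub.

```python
from collections import defaultdict

# Made-up sample records; real ones would come from
# Scraper().full_user_activity_timeline(<login>).
records = [
    {"month": "2012-05", "repo": "alice/demo", "commits": 11, "issues": 2},
    {"month": "2012-06", "repo": "alice/demo", "commits": 3},
    {"month": "2012-06", "repo": "", "private": 4},
]


def totals_by_repo(activity):
    """Sum every contribution counter per repository slug."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in activity:
        for field, count in record.items():
            if field in ("month", "repo"):
                continue  # identifying fields, not counters
            totals[record["repo"]][field] += count
    return {repo: dict(counts) for repo, counts in totals.items()}


summary = totals_by_repo(records)
print(summary)
```

Note that private contributions end up under the empty-string slug, as described in the field list above.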

links_to_recent_user_activity(user)

Get user events as a generator of 2-tuples: (date, link).

Events include commits, issues, and ref creation (tags/branches). Internally, this method uses the Atom feed. The result includes up to a couple of months of activity; sometimes it also misses up to one month of recent events.

Note

This method is known to return incomplete data. Proceed with caution.

Parameters:

user (str) – The GitHub login of the user.

Yields:

Tuple[str, str] – (<%Y-%m-%d date>, link to the activity)

It seems that this feed only includes tags and commits:

>>> list(Scraper().links_to_recent_user_activity('user2589'))  # doctest: +SKIP
[('2018-12-01', '/user2589/Q/tree/master'),
 ('2018-12-01',
  '/user2589/Q/commit/9184f20f939a70e3930ef762cc83906220433fc8'),
 ('2018-11-20', '/user2589/TAC_Github/tree/master'),
 ...]
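
To check which kinds of events the feed actually returned, the links can be grouped by the path segment that follows the repository slug. This is a sketch that assumes the /<owner>/<repo>/<kind>/... layout seen in the sample output above; link_kind is a hypothetical helper, and the events list repeats only the three entries shown there.

```python
from collections import Counter


def link_kind(link):
    """Return the path segment after '/<owner>/<repo>/', or 'unknown'."""
    parts = link.strip("/").split("/")
    return parts[2] if len(parts) > 2 else "unknown"


# The three entries visible in the sample output above.
events = [
    ("2018-12-01", "/user2589/Q/tree/master"),
    ("2018-12-01",
     "/user2589/Q/commit/9184f20f939a70e3930ef762cc83906220433fc8"),
    ("2018-11-20", "/user2589/TAC_Github/tree/master"),
]

kinds = Counter(link_kind(link) for _, link in events)
print(kinds)
```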

project_contributor_stats(repo_slug)[source]

Get weekly commit stats for the top 100 contributors over the project history.

Parameters:
  • repo_slug (str) – <owner_login>/<repo_name>
Returns:

A list of the top 100 contributors in the repo, with their logins, total number of commits, and weekly contribution counts as number of lines added, changed or deleted. Note that weeks start on Sunday and are represented by a Unix timestamp.

Return type:

list

>>> Scraper().project_contributor_stats('pandas-dev/pandas') # doctest: +SKIP
[{u'author': {u'avatar': u'https://avatars0.githubusercontent.com/...',
   u'hovercard_url': u'/hovercards?user_id=1435085',
   u'id': 1435085,
   u'login': u'blbradley',
   u'path': u'/blbradley'},
  u'total': 8,
  u'weeks': [{u'a': 0, u'c': 0, u'd': 0, u'w': 1249171200},
   {u'a': 0, u'c': 0, u'd': 0, u'w': 1249776000},
   {u'a': 0, u'c': 0, u'd': 0, u'w': 1250380800},
...
}]
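
The week timestamps can be turned into calendar dates with the standard library. The sketch below does that and also sums line counts for one contributor. It rests on assumptions: the key meanings (a for additions, d for deletions) are taken from the official repository statistics API, which uses the same key names, and may not match this endpoint exactly. The contributor record is a made-up sample, and week_start and lines_touched are hypothetical helpers.

```python
from datetime import datetime, timezone


def week_start(timestamp):
    """Convert a week's Unix timestamp to a UTC calendar date."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date()


def lines_touched(contributor):
    """Sum the 'a' and 'd' counters over all weeks of one contributor.

    Assumes 'a' = lines added and 'd' = lines deleted.
    """
    return sum(week["a"] + week["d"] for week in contributor["weeks"])


# Made-up record in the shape shown above; only the first timestamp
# is taken from the sample output.
contributor = {
    "author": {"login": "example-user"},
    "total": 1,
    "weeks": [
        {"a": 0, "c": 0, "d": 0, "w": 1249171200},
        {"a": 12, "c": 1, "d": 3, "w": 1249776000},
    ],
}

first_week = week_start(contributor["weeks"][0]["w"])
print(first_week.isoformat(), first_week.strftime("%A"))
print(lines_touched(contributor))
```

The first timestamp in the sample output corresponds to 2009-08-02, which is a Sunday, consistent with the note about week boundaries.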

user_daily_contrib_num(user, year)[source]

Get the number of daily contributions of a GitHub user in a given year. This method returns the data behind the white and green grid on the profile page.

Parameters:
  • user (str) – The GitHub login of the user to get stats for.
  • year (int) – The year to get contributions for.
Returns:

A dictionary with keys being %Y-%m-%d formatted dates and values being the number of contributions. This method does not differentiate between types of contributions, i.e. each value is a sum of commits, issues, submitted and reviewed pull requests, etc.

Return type:

dict

>>> Scraper().user_daily_contrib_num('user2589', 2018)
{'2018-01-01': 0,
 '2018-01-02': 15,
 ...
 '2018-12-31': 0}
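
Because the keys are %Y-%m-%d strings, the daily counts are easy to roll up by month or for the whole year. The sketch below uses only the three entries visible in the output above; monthly_totals is a hypothetical helper, not part of stgithub.

```python
from collections import Counter


def monthly_totals(daily):
    """Sum daily contribution counts per %Y-%m month."""
    totals = Counter()
    for day, count in daily.items():
        totals[day[:7]] += count  # 'YYYY-MM-DD'[:7] == 'YYYY-MM'
    return dict(totals)


# Only the entries visible in the sample output above.
daily = {"2018-01-01": 0, "2018-01-02": 15, "2018-12-31": 0}

by_month = monthly_totals(daily)
year_total = sum(daily.values())
active_days = sum(1 for count in daily.values() if count > 0)
print(by_month, year_total, active_days)
```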