Sunday, January 19, 2014

Big Data is easy

Big Data is very easy. :)
1. Read http://adambard.com/blog/top-github-languages-for-2013-so-far/
2. Check http://www.githubarchive.org/
3. Check https://developers.google.com/bigquery/
4. Read https://github.com/igrigorik/githubarchive.org/tree/master/bigquery
5. Change Adam's query to
SELECT repository_language, count(repository_language) AS repos_by_lang
FROM [githubarchive:github.timeline]
WHERE repository_fork == "false"
AND type == "CreateEvent"
AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('2013-01-01 00:00:00')
AND PARSE_UTC_USEC(repository_created_at) < PARSE_UTC_USEC('2014-01-01 00:00:00')
GROUP BY repository_language
ORDER BY repos_by_lang DESC
LIMIT 100
6. Run it on BigQuery - Query complete (2.3s elapsed, 6.80 GB processed)
7. PROFIT:
Apart from GitHub's language detection errors (for example, most projects that include jQuery are detected as JavaScript), it would be interesting to find the top languages by number or size of commits. I don't think that's very easy to do, though: GitHub doesn't report commits, only pushes... OK, let's do it for pushes:

SELECT repository_language, count(repository_language) as pushes
FROM [githubarchive:github.timeline]
WHERE type="PushEvent"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2013-01-01 00:00:00')
AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2014-01-01 00:00:00')
GROUP BY repository_language
ORDER BY pushes DESC
LIMIT 100
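
If the WHERE / GROUP BY / ORDER BY combination in the push query above looks opaque, here is a minimal Python sketch of the same logic running on a handful of in-memory rows. The sample rows and the function name pushes_by_language are made up for illustration; only the field names (type, repository_language, created_at) come from the query. It is a rough model of what the query computes, not of how BigQuery executes it.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; the keys mirror the columns used in the query.
events = [
    {"type": "PushEvent", "repository_language": "JavaScript", "created_at": "2013-03-01 10:00:00"},
    {"type": "PushEvent", "repository_language": "Python", "created_at": "2013-06-15 08:30:00"},
    {"type": "PushEvent", "repository_language": "JavaScript", "created_at": "2013-12-31 23:59:59"},
    {"type": "PushEvent", "repository_language": "Ruby", "created_at": "2012-12-31 23:59:59"},
    {"type": "CreateEvent", "repository_language": "Python", "created_at": "2013-02-02 12:00:00"},
    {"type": "PushEvent", "repository_language": None, "created_at": "2013-05-05 05:05:05"},
]

START = datetime(2013, 1, 1)  # inclusive lower bound, as in the query
END = datetime(2014, 1, 1)    # exclusive upper bound, as in the query


def pushes_by_language(rows, limit=100):
    """Count 2013 push events per language, most pushes first."""
    counts = Counter()
    for row in rows:
        created = datetime.strptime(row["created_at"], "%Y-%m-%d %H:%M:%S")
        # WHERE type = "PushEvent" AND created_at within 2013
        if row["type"] != "PushEvent" or not (START <= created < END):
            continue
        lang = row["repository_language"]
        # COUNT(column) ignores NULLs; this sketch simply skips such rows,
        # so it does not produce the empty-language group that SQL would.
        if lang is None:
            continue
        counts[lang] += 1  # GROUP BY repository_language
    # ORDER BY pushes DESC LIMIT 100
    return counts.most_common(limit)


if __name__ == "__main__":
    for language, pushes in pushes_by_language(events):
        print(language, pushes)
```

The repository query from step 5 has the same shape: swap the type filter for "CreateEvent", add the fork check, and compare repository_created_at instead of created_at.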


The results are almost the same.
Anyway, now you know that Big Data is not scary at all :)