TL;DR: scraping a huge site full of Python blog posts, parsing it, and getting some interesting data about Python projects across the web. Check out the final result: Python-Station.
Edit: feel free to talk about it on HN or on Reddit.
About Planet Python
I found out about Planet Python not too long ago, and got instantly hooked!
For those who aren’t familiar with Planet Python:
“This is an aggregate of Python news from a growing number of developers.” – The Hitchhiker's Guide to Python
The cool thing about PP is that it collects RSS feeds from more than 700 Python blogs.
Finding Planet Python’s (Lost) history:
I went on a hunt across the web – trying to find the (lost) history of PP.
The first stop was asking on Reddit – which (sadly) didn’t really help.
Then I found this quite neat website – which contains data for only a couple of months.
And finally, I hit the Jackpot – A full history of planet python!
The first post was more than 10 years ago – which means the site has more than 10 years of Python blog posts!
What to do with the Data?
Ok.
So now that we've found PP's history – what should we do with it?
Wouldn't it be cool to find out which GitHub repos were featured on PP over the years, and also get some additional insights on each repo?
Using Planet Python history to find GitHub projects
Now that we have a goal, all we need to do is start executing.
The main workflow is (quite) simple:
- Download a web page (full of different blog posts)
- Parse the relevant posts from the page.
- Find all GitHub references in each post.
- Use external resources to enrich the data.
- Wrap up!
During the process I used the file system as a local cache; it's not a must, but it does make life a lot easier.
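To give an idea of what I mean, here's a minimal sketch of that kind of file-system caching (the helper name and cache paths are just illustrative, not the ones from my code):

import json
import os


def cached_step(cache_path, compute):
    """Return cached JSON from cache_path if it exists, otherwise compute and store it."""
    if os.path.exists(cache_path):  # reuse the output of a previous run
        with open(cache_path) as f:
            return json.load(f)
    result = compute()  # run the expensive step only on a cache miss
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result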
Step 1: Download a web page (full of different blog posts)
The idea is to iterate over all pages in the site and make each page into a JSON-line.
At first, I started playing with Scrapy*, but I decided it was overkill and ended up using requests.
with open("planet_python_history.jl", "w") as f: # open the output file for i in itertools.count(1): # start iterate over the pages url = "https://aggape.de/feeds/view/53/?page={}".format(i) logging.info("fetching {}".format(url)) page_data = requests.get(url, verify=False).text # get the page if is_last_page(page_data): # validate it's not the last page print("finished!!") break f.write(json.dumps(page_data) + "\n") # write page as jsonline
#This is only the gist of my code, and in the full version, I also added a retry mechanism**.
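is_last_page isn't shown above; one possible sketch, assuming a page past the end simply contains no post items (the same div.item elements parsed in step 2):

from bs4 import BeautifulSoup


def is_last_page(page_data):
    """Assume we've gone past the last page when a page contains no post items."""
    soup = BeautifulSoup(page_data, "html.parser")
    return len(soup.find_all("div", class_="item")) == 0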
* Scrapy:
Although I didn't end up using Scrapy, if you are into scraping I really suggest you give it a try.
It handles a lot of things for you (multiple spiders, async, pipelining etc) and it’s really battle-tested.
Why would someone use Scrapy instead of just crawling with requests or urllib2?
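If you've never seen Scrapy, a spider looks roughly like this (the CSS selectors here are made up for illustration):

import scrapy


class PlanetPythonHistorySpider(scrapy.Spider):
    name = "planet_python_history"
    start_urls = ["https://aggape.de/feeds/view/53/?page=1"]

    def parse(self, response):
        # yield each post item on the page as one scraped record
        for post in response.css("div.item"):
            yield {"html": post.get()}
        # follow the "next page" link if there is one (selector is a guess)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)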
** Fault tolerance:
In this project I have some basic assumptions:
- I monitor the running script every few minutes (so no need to add alerts).
- In case of a failure to download a page (status code not 200) – retry (sketched below).
- In case the site is down, I would see it when I monitor the script (again, no need for an alert).
- What happens if the site changes (new HTML structure)?
After downloading the history once, I can just start scraping Planet Python directly (no need to use the history site) and combine it with my local data.
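The retry itself can be a small wrapper around requests.get; a minimal sketch (the retry count and wait time are illustrative, not what my full version uses):

import logging
import time

import requests


def get_with_retry(url, retries=3, wait_seconds=5):
    """Fetch a URL, retrying on non-200 responses or connection errors."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, verify=False)
            if response.status_code == 200:
                return response.text
            logging.warning("got status %s for %s", response.status_code, url)
        except requests.RequestException as e:
            logging.warning("attempt %s failed for %s: %s", attempt, url, e)
        time.sleep(wait_seconds)  # back off before retrying
    raise RuntimeError("failed to fetch {} after {} retries".format(url, retries))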
Step 2: Parse the relevant posts from the page.
Like I said, I didn't go with Scrapy (even though it has parsing abilities); I ended up using BeautifulSoup as a parser instead.
The cool thing about BeautifulSoup is that it offers a full toolset for XML/HTML parsing.
import json

from bs4 import BeautifulSoup
from tqdm import tqdm


def transform_pages_into_posts(input_file, output_file):
    posts = []
    with open(input_file) as f:
        pages = [json.loads(line) for line in f]  # load all pages
    for page in tqdm(pages):  # iterate over all of the pages in the site
        soup = BeautifulSoup(page, 'html.parser')
        page_posts = soup.find_all("div", class_="item")
        for raw_post in page_posts:  # iterate over all posts in a single page
            clean_post = raw_post_to_clean_post(raw_post)  # clean the post
            if clean_post:
                posts.append(clean_post)
    with open(output_file, "w+") as f:  # save output
        json.dump(posts, f, sort_keys=True, indent=4)
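raw_post_to_clean_post isn't shown here; a rough sketch of what such a helper could look like, assuming each post div has a title link, a date element, and outgoing links (the selectors are guesses; the field names match what step 3 expects):

def raw_post_to_clean_post(raw_post):
    """Turn a BeautifulSoup post div into a plain dict, or None if it can't be parsed."""
    title_link = raw_post.find("a")  # assumption: the first link points at the blog post
    date_tag = raw_post.find("span", class_="date")  # assumed class name
    if not title_link or not title_link.get("href") or not date_tag:
        return None
    out_urls = {a["href"]: {} for a in raw_post.find_all("a", href=True)}  # all outgoing links
    return {
        "url": title_link["href"],
        "title": title_link.get_text(strip=True),
        "created": date_tag.get_text(strip=True),
        "out_urls": out_urls,
    }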
Step 3: Finding all GitHub references in each post
After parsing the posts and finding all URLs, the job became quite easy.
def extract_github_projects_from_posts(input_file, output_file):
    with open(input_file) as f:
        data = json.load(f)
    extracted_github_projects = {}
    for post in data:
        for project_url, project_data in post.get("out_urls").items():  # get all links in the post
            # check to see if we already saw this repo
            project_aggregated_data = extracted_github_projects.get(project_url, project_data)
            # find the latest mention of the repo
            project_aggregated_data["last_mention"] = max(project_data.get("last_mention", ""), post["created"])
            project_aggregated_data["blogs_ref"] = project_aggregated_data.get("blogs_ref", []) \
                + [post["url"]]  # update refs
            extracted_github_projects[project_url] = project_aggregated_data
    with open(output_file, "w+") as f:
        json.dump(extracted_github_projects, f, sort_keys=True, indent=4)
#This is only the gist of my code.
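For context, recognizing which of a post's links are actually GitHub repos mostly comes down to URL inspection; a minimal sketch of one way to do it (not the exact logic of my full version):

from urllib.parse import urlparse


def to_github_repo(url):
    """Return 'owner/repo' if the URL points at a GitHub repository, else None."""
    parsed = urlparse(url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:  # e.g. a user profile or github.com itself
        return None
    return "{}/{}".format(parts[0], parts[1])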
Step 4: Use external resources to enrich the data
I used multiple APIs to enrich the data on each repo. When working on a small-scale project (like this one), I think it's always easier to use the REST API directly (as it's official and always up to date) rather than some SDK (which most of the time forces you to read the official API docs anyway to understand it).
This time was no different, and I used requests (against the REST APIs) in most cases.
The one exception was Reddit – I didn't like the official docs, plus the Python SDK (PRAW) was great.
- GitHub API – to find if the repo is Python, its stars/forks, etc. (a minimal example of this call follows the list).
- PRAW – the Reddit Python API, to find if the repo was featured on Reddit.
- Hacker News API – to find if the repo was featured on Hacker News.
- GitHub Python trending – to find if the repo is currently trending on GitHub (not really an API – I just scraped it).
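As an example of the enrichment calls, fetching basic metadata for a single repo from the GitHub REST API is just another requests call; a minimal sketch, assuming the repo is identified as "owner/name" (unauthenticated here, so heavily rate-limited):

import requests


def fetch_github_repo_data(full_name):
    """Fetch basic repo metadata from the GitHub REST API (e.g. full_name='pallets/flask')."""
    response = requests.get("https://api.github.com/repos/{}".format(full_name))
    if response.status_code != 200:  # deleted/renamed repo, or rate limit hit
        return None
    repo = response.json()
    return {
        "language": repo.get("language"),  # used to keep only Python repos
        "stars": repo.get("stargazers_count"),
        "forks": repo.get("forks_count"),
        "created": repo.get("created_at"),
        "description": repo.get("description"),
    }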
def enrich_github_projects(input_file, output_file):
    with open(input_file) as f:
        raw_github_projects = json.load(f)
    github_projects = enrich_and_filter_using_github(raw_github_projects)
    add_clean_fields(github_projects)
    add_is_trending_to_projects(github_projects)
    add_was_on_hn_to_projects(github_projects)
    add_was_on_reddit_to_projects(github_projects)
    with open(output_file, "w+") as f:
        json.dump(list(github_projects.values()), f, indent=4)
#This is only the gist of my code, and in the full version, I have a lot of parsing + downloading tricks.
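For the Hacker News check, one option is the Algolia-powered HN Search API; a minimal sketch (this may not be exactly the API call my code uses):

import requests


def was_on_hackernews(repo_url):
    """Return True if a Hacker News story appears to link to this repo URL."""
    response = requests.get("https://hn.algolia.com/api/v1/search",
                            params={"query": repo_url})
    hits = response.json().get("hits", [])
    return any(hit.get("url") and repo_url in hit["url"] for hit in hits)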
Step 5: Wrap Up
In the end, all that was left was to combine everything into a fully functional pipeline:
def main(pages_to_download, files_prefix, pipeline_steps_dir):
    download_posts(output_file=file_for_raw_pages)
    transform_pages_into_posts(input_file=file_for_raw_pages,
                               output_file=file_for_posts)
    extract_github_projects_from_posts(input_file=file_for_posts,
                                       output_file=file_for_raw_github_projects)
    enrich_github_projects(input_file=file_for_raw_github_projects,
                           output_file=file_for_pipeline_output)
#This is only the gist of my code, and in the full version, I use Click to get more control over the pipeline.
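A minimal sketch of what a Click wrapper around main might look like (option names mirror the signature above; the defaults are made up):

import click


@click.command()
@click.option("--pages-to-download", default=None, type=int,
              help="Limit how many pages to fetch (default: all of them).")
@click.option("--files-prefix", default="planet_python",
              help="Prefix for the intermediate files of each pipeline step.")
@click.option("--pipeline-steps-dir", default="pipeline_steps",
              help="Directory where the intermediate files are written.")
def cli(pages_to_download, files_prefix, pipeline_steps_dir):
    """Run the full Planet Python -> GitHub projects pipeline."""
    main(pages_to_download, files_prefix, pipeline_steps_dir)


if __name__ == "__main__":
    cli()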
Step 6: Final result
Summary
It was my first experience at real-world scraping, and I had a lot of fun!
For the source code behind Python-Station, check out my GitHub repos:
In the future, I wish to improve the run time by using asyncio and maybe add some other enhancements.
Putting Python aside, learning Vue.js and GitHub Pages (in order to make the site live) was also fun and quite different from what I normally do.
Stay tuned for my next blog post about hosting a free self-updating site using Vue.js + GitHub Pages and CircleCI!
I hope you enjoyed it – feel free to comment 🙂