Using asyncio to download hackernews (at 1300 requests/sec)!

tl;dr: a super-fast and easy HN client, based on Python 3.6+

As this is my first blog post, I’ll focus on downloading a set of Hackernews posts and user data to use as the data set for the machine learning exercises we’ll look at in future posts. 

We’ll start by exploring the different options for creating a local copy of HN’s data (or of any other site), and will continue to improve and optimize our solution.
All of the benchmarking was done on my laptop –
a MacBook Pro (16 GB RAM, i7, 100 Mb/sec download speed).

Setup: 
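Everything below assumes Python 3.6+ with aiohttp installed (pip install aiohttp); the semi-official client in the next section additionally needs pip install haxor.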

 

Official API

We’ll start by looking at the official Hacker News API.
While it appears quite neat and simple (and supports Firebase out of the box),
there isn’t a publicly available Python SDK.
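Still, the API is just a handful of Firebase REST endpoints, so it’s easy to call directly; a minimal sketch using the documented maxitem, item, and user endpoints with requests:

    import requests

    BASE_URL = "https://hacker-news.firebaseio.com/v0"

    # The largest item id assigned so far; every id at or below it is an item.
    max_item = requests.get(f"{BASE_URL}/maxitem.json").json()

    # Fetch that item, plus one user profile.
    item = requests.get(f"{BASE_URL}/item/{max_item}.json").json()
    user = requests.get(f"{BASE_URL}/user/pg.json").json()
    print(item, user)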

Semi-official (Most popular)

After a quick Google search, I learned about haxor, which appeared to be a simple and effective tool.
After playing with it a bit I wrote the following script:
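Roughly along these lines, using haxor’s documented get_max_item and get_item calls (my reconstruction, not the original script):

    from hackernews import HackerNews  # pip install haxor

    hn = HackerNews()

    # Fetch the 10 newest items, one blocking HTTP round-trip each.
    max_item = hn.get_max_item()
    for item_id in range(max_item - 9, max_item + 1):
        print(hn.get_item(item_id))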

Using the time command to measure the performance, fetching just 10 posts took about 10.6 seconds.


As hn.get_max_item() was around 13,000,000 when I checked, it would have taken me (13,000,000/10) * 10.6 = 13,780,000 seconds (or roughly 3,800 hours) to download all the posts (not including user data).

Using asyncio – first try (Naive)

At this stage, I decided to try using Python 3’s asyncio and aiohttp in lieu of attempting to improve the haxor code (using requests + sessions + threads).
I based my initial work on this great post: Making 1 million requests with python-aiohttp.
I won’t delve into how asyncio works here, so I definitely recommend that post as a starting point.
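A minimal sketch of that naive approach (one coroutine per item id; the helper names are mine):

    import asyncio

    import aiohttp

    BASE_URL = "https://hacker-news.firebaseio.com/v0"


    async def fetch_item(session, item_id):
        # One GET per item; the coroutine yields while waiting on I/O.
        async with session.get(f"{BASE_URL}/item/{item_id}.json") as response:
            return await response.json()


    async def fetch_items(item_ids):
        async with aiohttp.ClientSession() as session:
            # One future per item id: fine for small batches, but the number
            # of pending coroutines grows linearly with the number of requests.
            futures = [fetch_item(session, item_id) for item_id in item_ids]
            return await asyncio.gather(*futures)


    if __name__ == "__main__":
        loop = asyncio.get_event_loop()
        items = loop.run_until_complete(fetch_items(range(1, 1001)))
        print(f"downloaded {len(items)} items")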


Sample response:
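(Shape per the official API docs; the values here are illustrative, not my actual output.)

    {
        "by": "dhouston",
        "descendants": 71,
        "id": 8863,
        "kids": [8952, 9224, 8917],   # truncated
        "score": 111,
        "time": 1175714200,           # Unix epoch
        "title": "My YC app: Dropbox - Throw away your USB drive",
        "type": "story",
        "url": "http://www.getdropbox.com/u/2/screencast.html",
    }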

And the performance? Right off the bat we got a speed boost. But can we improve it further? Sure we can!

asyncio – second try (Consumer/Producer)

But first, let’s try to understand the limitations with the naive solution.

  1. The number of open coroutines scales linearly with the number of requests (if we are fetching 100,000 posts we’ll generate 100,000 futures) – not scalable.
  2. The number of open TCP connections is 20 by default – too low.
  3. When we reach high rates (1,000 reqs/sec) we sometimes fail to process the response due to server slowdowns – we need some retry mechanism.

In order to solve all of these problems we’ll use a different approach: producer -> queue -> consumers.

The design is simple: we have n consumers reading from the queue and one producer filling it. This way there are only n open coroutines, no matter how many requests we have.

We will also use a dead-letter queue (DLQ), which will store failed requests.

Let’s start with the consumer code:
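A minimal reconstruction (assume queue and dlq are asyncio.Queues and session is a shared aiohttp.ClientSession; the names are mine):

    import asyncio


    async def consumer(queue, dlq, session, responses):
        while True:
            url = await queue.get()
            try:
                async with session.get(url) as response:
                    responses.append(await response.json())
            except Exception:
                # Failed request: park the URL in the dead-letter queue,
                # back off for a moment, then grab the next URL.
                await dlq.put(url)
                await asyncio.sleep(0.1)
            finally:
                queue.task_done()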

The consumer reads a URL from the main queue -> performs an HTTP request -> inserts the response into the responses array.
In case a consumer encounters a problem it writes the URL to the DLQ, sleeps, and reads the next URL from the queue.

The producer iterates over the iterable and puts its items into the queue. If the queue is full, the producer blocks until a consumer reads messages off the queue.
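A matching sketch, using the same hypothetical names:

    async def produce(queue, urls):
        for url in urls:
            # put() suspends the producer while the queue is full,
            # which gives us backpressure for free.
            await queue.put(url)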

 

Step by step, the main coroutine (see the full sketch under "Full code" below) does the following:

  • Initialize the queues and the consumers
  • Fetch the id of the last item from HN
  • Initialize the producer – the producer finishes only after all the URLs are inserted into the queue
  • Wait for the task_done call for all the URLs (this means we had at least one attempt per URL)
  • Wait for the task_done call for all URLs that entered the DLQ (a message can be processed more than once)
  • Remove all the consumers (no need for them, as we already got all the posts)

 Full code:
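A complete sketch that wires the pieces together (queue sizes, consumer counts, and helper names are my own choices; the real implementation lives in the repo linked in the summary):

    import asyncio

    import aiohttp

    BASE_URL = "https://hacker-news.firebaseio.com/v0"


    async def consumer(queue, dlq, session, responses):
        while True:
            url = await queue.get()
            try:
                async with session.get(url) as response:
                    responses.append(await response.json())
            except Exception:
                # Failed request: park the URL in the DLQ, back off, move on.
                await dlq.put(url)
                await asyncio.sleep(0.1)
            finally:
                queue.task_done()


    async def produce(queue, urls):
        for url in urls:
            # put() suspends while the queue is full (backpressure).
            await queue.put(url)


    async def crawl(num_consumers=100):
        main_queue = asyncio.Queue(maxsize=1000)
        dlq = asyncio.Queue(maxsize=1000)
        responses = []
        async with aiohttp.ClientSession() as session:
            # Initialize the queues and the consumers; most consumers drain
            # the main queue, a handful retry failures from the DLQ.
            workers = [asyncio.ensure_future(consumer(main_queue, dlq, session, responses))
                       for _ in range(num_consumers)]
            workers += [asyncio.ensure_future(consumer(dlq, dlq, session, responses))
                        for _ in range(max(1, num_consumers // 10))]

            # Fetch the id of the last (newest) item on HN.
            async with session.get(f"{BASE_URL}/maxitem.json") as response:
                max_item = await response.json()

            # The producer returns only once every URL has been queued.
            urls = (f"{BASE_URL}/item/{i}.json" for i in range(1, max_item + 1))
            await produce(main_queue, urls)

            # Wait for task_done on every URL (at least one attempt each)...
            await main_queue.join()
            # ...and on every URL that entered the DLQ (may be retried repeatedly).
            await dlq.join()

            # All posts fetched: remove the consumers.
            for worker in workers:
                worker.cancel()
            await asyncio.gather(*workers, return_exceptions=True)
        return responses


    if __name__ == "__main__":
        loop = asyncio.get_event_loop()
        posts = loop.run_until_complete(crawl())
        print(f"downloaded {len(posts)} items")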

 

Using this code we can reach 1,000-2,000 requests per second (single thread), and a total run time of 13 minutes, 11 seconds! Not bad for a single-threaded client :)

Summary

I pushed the full wrapper for the Hackernews API we saw here to GitHub: asyncio-hn.
Feel free to ask any questions (and of course, any improvements will be most appreciated).
And lastly – stay tuned for using (basic) Sanic and the hackernews wrapper for some neat Python 3 combination 🙂

  • Ocl@y

    Please take a look at uvloop (2-3x faster than asyncio) if you want speed.

    • Hi,
      I tested uvloop before writing this post and must say it didn’t seem to make much of a difference.
      I added the following to the example:
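      A minimal sketch of that change, using uvloop’s documented opt-in (install the event loop policy before the loop is created):

          import asyncio

          import uvloop

          # Tell asyncio to build uvloop event loops instead of the default ones.
          asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())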

      And for a total of 100K requests the diff was 2 seconds.