Saturday, May 23, 2015

Multi-threading and Web Requests


I'm writing a tool that needs to make over 200 API calls using the requests library in Python.  The function does little more than request information and append it to a list.  The current run time results look like this:
[Finished in 209.3s]
I think I can do better.   The main delay comes from each request having to wait for the previous one to finish, i.e. the requests run sequentially.  This happens because a single Python script runs as a single thread, processing things one at a time.
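
Roughly speaking, the slow version looks like this (the endpoint and names here are simplified placeholders, not my actual code):

import requests

def fetch_all(item_ids):
    results = []
    for item_id in item_ids:
        # Each request blocks until the previous one has finished.
        response = requests.get('https://api.example.com/widgets/%s' % item_id)
        results.append(response.json())
    return results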

Running multiple threads would resolve this problem.  A quick search for Python multithreading points me to the Pool class in the multiprocessing library.  Using a pool of workers, I can run each web request concurrently.
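
The obvious way to wire this up is to hand pool.map a bound method of the class doing the work, roughly like this (names simplified for illustration):

from multiprocessing import Pool

class ApiFetcher(object):
    def fetch(self, item_id):
        # ... make one web request and return the result ...
        pass

    def fetch_all(self, item_ids):
        pool = Pool(processes=32)
        # pool.map must pickle self.fetch to ship it to the worker processes.
        return pool.map(self.fetch, item_ids)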

My first attempt at implementing a pool results in the following error:
Can't pickle
Stack Overflow had an answer pointing out that in order to pass the work around to multiple workers, the job for each worker needs to be converted into a standard format (serialized, or 'pickled').  The serializer used by this library doesn't understand how to serialize an instance method, so we'll help it out:
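
The recipe, as commonly given, registers a pair of functions with Python 2's copy_reg module, teaching pickle how to take a bound method apart and put it back together:

import copy_reg
import types

def _pickle_method(method):
    # Reduce a bound method to (function name, instance, class).
    return _unpickle_method, (method.im_func.__name__,
                              method.im_self, method.im_class)

def _unpickle_method(func_name, obj, cls):
    # Walk the MRO to find the function, then re-bind it to the instance.
    for klass in cls.mro():
        if func_name in klass.__dict__:
            func = klass.__dict__[func_name]
            break
    return func.__get__(obj, cls)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)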
Placing the above snippet in the module that defines my class teaches the pickler how to handle instance methods.

The next hurdle was the need to pass multiple values into a function called by a Pool instance.  While there seem to be many suggested ways of doing this, I found building a helper function to be the easiest.
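
A minimal version of such a helper might look like this (the worker name fetch is a placeholder):

def fetch_helper(args):
    # pool.map passes a single argument per job, so the helper accepts
    # one tuple and unpacks it into the call to the real function.
    return fetch(*args)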


This helper function takes the tuple of arguments handed to it and calls the "real" function, passing the tuple values in.  Putting everything together looks something like this:
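
Here is a sketch of how everything might fit together (class, endpoint, and argument names are illustrative, and the copy_reg registration above is assumed to live in the same module):

from multiprocessing import Pool

import requests

class ApiFetcher(object):
    def __init__(self, base_url):
        self.base_url = base_url

    def fetch(self, endpoint, item_id):
        # The function doing the real work: one API call per invocation.
        response = requests.get('%s/%s/%s' % (self.base_url, endpoint, item_id))
        return response.json()

    def fetch_helper(self, args):
        # Receives one tuple from pool.map and unpacks it for fetch().
        return self.fetch(*args)

    def fetch_all(self, item_ids):
        pool = Pool(processes=32)
        # One tuple of arguments per call of the worker function.
        payload = [('widgets', item_id) for item_id in item_ids]
        try:
            return pool.map(self.fetch_helper, payload)
        finally:
            pool.close()
            pool.join()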


In the end, pool.map is handed the helper function to call and a payload.  The payload is a list of tuples, each containing the arguments needed for one call of the function doing the actual work.  The end result:
[Finished in 16.3s]
Running with 32 processes, the task took only 16.3 seconds.  I'd say that's a solid improvement.
