Thursday, May 14, 2015

Beautiful Webscraping

I'll be working out of the Stack Exchange London office for the next 2 weeks. Getting there will be a 6-hour flight, and it is unlikely to have an internet connection. On long trips without internet, I like to work on problems from Project Euler. I'd rather not manually download every problem to my git repo, but with a little bit of Python I can automate it.

A problem on Project Euler is presented on a simple page, sampled below:

Looking at the page, we can see that the URL is easily manipulated.  Replacing the number at the end of the URL should get the corresponding problem.  Using the requests library and a for loop should quickly hit every problem page.
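
As a rough illustration of that loop, here is a minimal sketch. The URL pattern in BASE_URL is my assumption about how the problem pages are addressed, and problem_url and fetch_problem_pages are hypothetical helper names, so adjust them to match what you see in your browser.

```python
# Assumed URL pattern: the problem number is the last part of the URL.
BASE_URL = "https://projecteuler.net/problem={}"


def problem_url(number):
    """Build the URL for a single problem page (hypothetical helper)."""
    return BASE_URL.format(number)


def fetch_problem_pages(first, last):
    """Yield (number, html) for each problem from first to last, inclusive."""
    # Imported here so problem_url() can be used without requests installed.
    import requests

    for number in range(first, last + 1):
        response = requests.get(problem_url(number), timeout=10)
        response.raise_for_status()  # stop on a 404 or server error
        yield number, response.text
```

Calling fetch_problem_pages(1, 10) would request the first ten problem pages one at a time. A short pause between requests is a polite addition when looping over every problem.
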
After hitting each page, I need to grab the relevant content. Looking at the page's HTML, it's clear that the problem text is stored in a div tag with the class "problem_content".
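
To make the extraction step concrete, here is a small sketch of pulling that div out with BeautifulSoup. The function name extract_problem_text and the sample HTML string are made up for illustration; only the "problem_content" class comes from the page itself.

```python
from bs4 import BeautifulSoup


def extract_problem_text(html):
    """Return the text of the problem_content div, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="problem_content")
    if content is None:
        return None
    return content.get_text().strip()


# Hypothetical stand-in for a downloaded problem page.
sample_html = (
    "<html><body><h2>Some title</h2>"
    '<div class="problem_content">'
    "<p>Find the sum of all the multiples of 3 or 5 below 1000.</p>"
    "</div></body></html>"
)

print(extract_problem_text(sample_html))
```

Returning None when the div is absent makes it easy to skip pages that did not load as expected instead of crashing halfway through a long run.
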

BeautifulSoup can parse the HTML returned by requests and spit out just the problem text. I've included my script below. The relevant variables should be easy enough to change if anyone would like to include it in their own repo.