I recently did a project for which I had to collect massive amounts of text data. I chose to use Project Gutenberg, a library of tens of thousands of free ebooks. It's a great resource for training ML algorithms and doing NLP.

To avoid wasting time downloading each book manually, I created a scraper that can be run from the command line. I'll link the GitHub repo at the bottom, but first I want to discuss the main technologies I used to build the scraper and how to run the program.

How it was built

I built my scraper in Python using BeautifulSoup and requests. The same approach can be repurposed to build scrapers for other sites.

When building a scraper, you'll need to know the base URL from which you want to pull data. For me, it was https://www.gutenberg.org. You'll also need to know the taxonomy of your website. For Project Gutenberg, each book is assigned a unique ID, which is part of the URL for that book.

link for ebook with id 61979

Once you know the URL structure for your site, it's easy to get multiple assets. In my case, navigating to a given book was a matter of appending "/ebooks/{book_id}" to the base URL.
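As a quick sketch of the idea (the `book_url` helper name is mine, not from the repo), building a book's URL is just string formatting:

```python
BASE_URL = "https://www.gutenberg.org"

def book_url(book_id):
    """Build the landing-page URL for a Project Gutenberg book ID."""
    return f"{BASE_URL}/ebooks/{book_id}"

# book_url(61979) -> "https://www.gutenberg.org/ebooks/61979"
```

Looping this helper over a range of IDs gives you the URL for every book you want to fetch.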

Python's requests library will do most of the rest of the work. Using requests.Session(), we can grab the data from a page. You can also use requests.get(), but a session persists cookies and connection settings across requests, which is helpful when you're making many of them.
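Here's a minimal sketch of that pattern, pairing a Session with BeautifulSoup to parse the response. The helper names (`get_page`, `extract_title`) are mine for illustration, not functions from the repo:

```python
import requests
from bs4 import BeautifulSoup

def get_page(url, session=None):
    """Fetch a page, reusing a requests.Session if given so cookies persist."""
    http = session or requests
    resp = http.get(url)
    resp.raise_for_status()  # surface HTTP errors early
    return resp.text

def extract_title(html):
    """Pull the <title> text out of raw HTML with BeautifulSoup."""
    return BeautifulSoup(html, "html.parser").title.get_text(strip=True)

# Usage:
# with requests.Session() as session:
#     html = get_page("https://www.gutenberg.org/ebooks/61979", session)
#     print(extract_title(html))
```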

def download_book(self, booktitle, data_link):
    # Body reconstructed for readability: fetch the book's plain-text
    # file at data_link and write it to disk
    try:
        response = self.session.get(data_link)
        filename = booktitle + ".txt"
        with open(filename, "w") as file:
            file.write(response.text)
    except Exception:
        print("<--- ERROR DOWNLOADING %s --->" % booktitle)
method to download book from Project Gutenberg

How to run

Running the scraper is just a matter of downloading the code and specifying the necessary parameters.

arguments for gutenberg_scraper

Sample run:

downloading book with ID 11

Link to GitHub repository: https://github.com/kpully/gutenberg_scraper