This article gives an overview of scraping various websites with Python, plus some JavaScript at the end.
Python Libraries
- Beautiful Soup
Small & quick library. Built on top of parsers such as lxml and html5lib.
- Scrapy – Scraping Framework
An open source and collaborative framework for extracting the data you need from websites.
Scrapy is a big, object-oriented library with several features. Its documentation is very detailed and describes various scraping cases you might come across.
It’s easy to add proxy support, or to choose how the scraped data is exported, for example:
- JSON, XML, CSV,
- FTP,
- Cloud-storage (e.g. Amazon S3).
A core feature is the extensibility of Scrapy through middlewares, either the available ones or your own custom components.
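By contrast, Beautiful Soup covers the simple end of the spectrum: parsing static HTML you already have in hand. A minimal sketch (the markup and class names are invented for the example; `bs4` must be installed):

```python
# Parse a static HTML snippet with Beautiful Soup and pull out two fields.
# The HTML and selectors below are invented placeholders for illustration.
from bs4 import BeautifulSoup

html = """
<html><body>
  <article>
    <h1 class="title">Scraping with Python</h1>
    <p class="body">Hello from a static page.</p>
  </article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; lxml also works
title = soup.select_one("article .title").get_text(strip=True)
body = soup.select_one("article .body").get_text(strip=True)
print(title)  # Scraping with Python
print(body)   # Hello from a static page.
```

This is all Beautiful Soup does; crawling, scheduling, and exporting are left to you, which is exactly the gap Scrapy fills.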
But what if you want to scrape dynamically generated content off web apps with no public API?
Sometimes websites also make parsing harder by obfuscating their structure, moving their DOM content around, and so on.
Enter Reverse Engineering
Reproducing requests to scrape data
Read:
- Scrapy Documentation: Reproducing requests
- Article on Reverse Engineering an API
- Postman Documentation
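The core idea can be sketched with the standard library alone: find the underlying API call in the browser’s network tab, then replay it with the same parameters and headers. The endpoint and header values below are hypothetical placeholders.

```python
# A minimal sketch of replaying an XHR request found in the browser's
# network tab. The endpoint, parameters, and header values are
# hypothetical placeholders; copy the real ones from the recorded request.
import urllib.parse
import urllib.request

API_URL = "https://example.com/api/v2/products"  # hypothetical endpoint

params = urllib.parse.urlencode({"page": 1, "per_page": 50})
request = urllib.request.Request(
    f"{API_URL}?{params}",
    headers={
        # Many endpoints reject requests without a realistic User-Agent
        # or the X-Requested-With marker the site's own JavaScript sends.
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    },
)
print(request.full_url)

# Sending it reproduces what the browser's JavaScript does:
# import json
# with urllib.request.urlopen(request) as response:
#     data = json.load(response)
```

Tools like Postman make this inspect-and-replay loop interactive before you commit the request to code.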
Useful Tools:
Scraping with DOM extraction in JavaScript
Read:
- DOM Scraping into Data Layer & Custom JS Variables
- Efficient selection of DOM Elements for Data Extraction.
Libraries
- JSdom
A JavaScript implementation of various web standards, for use with Node.js
const { JSDOM } = require("jsdom");
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent); // "Hello world"
- Surgeon JS
Powerful, declarative DOM extraction expression evaluator.
articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
  imageUrl:
  - select img
  - read attribute src
  summary:
  - select ".body p:first-child"
  - read property innerHTML
  - format text
  title:
  - select .title
  - read property textContent
pageName:
- select .body
- read property innerHTML
Sometimes it’s really hard to reproduce certain requests.
What then?
Enter headless browser scraping
Note: Scraping with headless browsers is slower. It makes more sense when you want to gather and index whole web apps, not extract specific pieces of content.
With headless rendering, you can execute pages just like in a real browser, including interactions (logging in, moving the mouse, clicking element X, typing in Y, etc.).
Using Splash with Python
Splash is a fast and lightweight headless browser built with Python 3, Twisted, and QT5. It is usable through its HTTP API.
Splash is already integrated with Scrapy via the scrapy-splash plugin, available on GitHub.
Alternative headless browsers
- Firefox headless mode
Usable with Selenium + Python, or the language of your choice.
- Puppeteer (Chrome + Node.js)
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol.
- Scrapy + Selenium
Scrapy middleware to handle JavaScript pages using Selenium.