Data Extraction for Data Science – Scraping HTML & JavaScript WebApps with Python and Scrapy

This article gives an overview of scraping various websites with Python, with some JavaScript at the end.

Python Libraries

  • Beautiful Soup
    A small, quick library that works on top of parsers such as lxml and html5lib.
  • Scrapy – Scraping Framework
    An open source and collaborative framework for extracting the data you need from websites.

    Scrapy is a big, object-oriented library with several features. Its documentation is very detailed and describes various scraping cases you might come across.

    It’s easy to add support for proxies, or to choose how the scraped data is exported, mainly:
  • as JSON, XML, or CSV files,
  • to an FTP server,
  • to cloud storage (e.g. Amazon S3).

    A core feature of Scrapy is its extensibility: you can plug in existing middlewares or write your own custom ones.
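
The proxy, export, and middleware points above can be sketched as plain Scrapy settings plus a small downloader middleware. This is a minimal sketch, not a complete project: the module path `myproject.middlewares`, the class name `StaticProxyMiddleware`, the proxy address, and the FTP and S3 targets are all hypothetical. It relies on Scrapy's `FEEDS` and `DOWNLOADER_MIDDLEWARES` settings and on the convention that a request's `meta["proxy"]` key selects the proxy.

```python
# settings.py (sketch): every URI, credential, and path below is a placeholder.

# Feed exports: each key is an output URI, each value picks the format.
FEEDS = {
    "items.json": {"format": "json"},
    "ftp://user:password@ftp.example.com/exports/items.csv": {"format": "csv"},
    "s3://example-bucket/exports/items.xml": {"format": "xml"},
}

# Register the custom middleware below; 350 places it ahead of Scrapy's
# built-in HttpProxyMiddleware (priority 750) for outgoing requests.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.StaticProxyMiddleware": 350,
}


# middlewares.py (sketch): a downloader middleware is a plain class that
# Scrapy calls through hook methods such as process_request.
class StaticProxyMiddleware:
    def __init__(self, proxy_url="http://proxy.example.com:8080"):
        self.proxy_url = proxy_url  # hypothetical proxy address

    def process_request(self, request, spider):
        # Route the request through the proxy unless one is already set.
        request.meta.setdefault("proxy", self.proxy_url)
        return None  # None tells Scrapy to keep processing the request
```

In a real project the two parts would live in separate files, and the middleware path in `DOWNLOADER_MIDDLEWARES` would point at wherever the class is actually defined.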

But what if you want to scrape dynamically generated content off web apps with no public API?

Sometimes websites also make parsing hard by obfuscating their structure, moving their DOM content around, and so on.

Enter Reverse Engineering

Reproducing requests to scrape data
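
Reproducing a request usually means finding the JSON call a web app makes in the browser's network tab and sending the same call yourself. Here is a minimal sketch using only the standard library; the endpoint `https://example.com/api/items`, the `page` parameter, the header values, and the `items`/`title` response fields are all assumptions for illustration, not a real API.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, as it might appear in the browser's network tab.
API_URL = "https://example.com/api/items"


def build_request(page):
    """Build a request that mimics the web app's own background call."""
    query = urllib.parse.urlencode({"page": page})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            "Accept": "application/json",
            "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
            # Many web apps send this header with their background requests.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def parse_titles(payload):
    """Extract item titles from a JSON body of the assumed shape."""
    data = json.loads(payload)
    return [item["title"] for item in data.get("items", [])]


def fetch_titles(page):
    """Send the request and parse the response (needs network access)."""
    with urllib.request.urlopen(build_request(page), timeout=10) as response:
        return parse_titles(response.read().decode("utf-8"))
```

Keeping request building and response parsing in separate functions makes it easy to adjust headers or parameters when the site changes its call, without touching the parsing logic.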

Read:

Useful Tools:

Scraping with DOM extraction in JavaScript

Read:

Libraries

  • jsdom
    A JavaScript implementation of various web standards, for use with Node.js

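const { JSDOM } = require("jsdom");

// Parse an HTML string into a DOM that can be queried like a browser document.
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);

console.log(dom.window.document.querySelector("p").textContent); // "Hello world"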
  • Surgeon JS
    Powerful, declarative DOM extraction expression evaluator.
articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
  imageUrl:
  - select img
  - read attribute src
  summary:
  - select ".body p:first-child"
  - read property innerHTML
  - format text
  title:
  - select .title
  - read property textContent
pageName:
- select .body
- read property innerHTML

Sometimes it’s really hard to reproduce certain requests.
What then?

Enter headless browser scraping

Note: Scraping with headless browsers is slower. It makes more sense when you want to gather and index whole web apps rather than specific pieces of content.

With headless rendering, the page runs just as it would in a real browser, including interactions (log in, move the mouse, click element X, type in field Y, etc.).

Using Splash with Python

Splash is a fast and lightweight headless browser built with Python 3, Twisted, and Qt 5. It is controlled through its HTTP API.

Splash is already integrated with Scrapy, and there is a GitHub repo you can use.
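
As a rough sketch of the HTTP API, the snippet below builds a call to Splash's `render.html` endpoint, which returns the page's HTML after JavaScript has run. It assumes a Splash instance listening on `http://localhost:8050` (the default port); the target URL and the helper names are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed address of a locally running Splash instance.
SPLASH_URL = "http://localhost:8050"


def render_url(target, wait=0.5):
    """Build a render.html URL asking Splash to load `target` and wait."""
    query = urllib.parse.urlencode({"url": target, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"


def fetch_rendered_html(target, wait=0.5):
    """Return the rendered HTML (requires a running Splash instance)."""
    with urllib.request.urlopen(render_url(target, wait), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

The `wait` argument gives the page time to finish its JavaScript before Splash takes the snapshot; the returned HTML can then be handed to any of the parsers mentioned earlier.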

Alternative headless browsers

By Dustin Simmons

Dustin Simmons is a Data Engineer who runs las-inc.com.
