Scraping Dynamic Websites with R and Chromote

video
r
Author

James Balamuta

Published

October 29, 2024

While teaching web scraping this week, a student challenged me to scrape weather data from Windy.com. This led to creating a video demonstrating how to handle both static and JavaScript-heavy websites using R and Chromote.

Video

In the video, I walk through:

  1. Scraping the static https://www.r-project.org website (a quick {rvest} sketch follows this list);
  2. Retrieving the weather data from the dynamic https://windy.com website; and,
  3. Cleaning up and summarizing the extracted data.
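
For the static case, {rvest} on its own is enough. Here’s a minimal sketch of what that might look like; the .sidebar selector simply mirrors the one used in the Quick Example below:

library(rvest)

# Read the static page once; no browser automation needed
page <- read_html("https://www.r-project.org")

# Pull out the sidebar text with a CSS selector
page |>
  html_element(".sidebar") |>
  html_text2()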

Code

You can find the code used in the video in the following GitHub repository:

https://github.com/coatless-videos/chromote-web-scraping

Why Chromote?

Traditional web scraping methods fail spectacularly when a page’s content is dynamically generated by JavaScript. The trick to getting around this is to wait for the JavaScript to finish rendering the content before extracting it. Chromote solves this by letting you control a Chrome or Chromium-based web browser directly from R. This allows you to:

  • Wait for JavaScript content to load on the page (see the sketch after this list);
  • Interact with dynamic elements; and then
  • Capture the fully rendered page.
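
For instance, here’s a minimal sketch of the waiting pattern with {chromote}; the #forecast-table selector is purely hypothetical and stands in for whatever element the page’s JavaScript eventually renders:

library(chromote)

b <- ChromoteSession$new()

# Navigate and wait for the initial page load
b$Page$navigate("https://www.windy.com")
b$Page$loadEventFired()

# Poll until a (hypothetical) dynamically rendered element appears,
# giving the page's JavaScript time to finish its work
for (i in 1:20) {
  found <- b$Runtime$evaluate(
    "document.querySelector('#forecast-table') !== null",
    returnByValue = TRUE
  )$result$value
  if (isTRUE(found)) break
  Sys.sleep(0.5)
}

b$close()

From there, the same session can query the DOM just as in the Quick Example below.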

While {RSelenium} is another popular option for scraping dynamic websites, I ran into notable issues using it for this task because Selenium Docker images lack support for ARM/M-series Macs. In these environments, {chromote} provides a more reliable alternative.

Quick Example

Here’s a minimal example of launching Chrome, navigating to a page, and extracting the content matched by a specific CSS selector:

library(chromote)
library(rvest)

# Specify a website
url <- "https://www.r-project.org"

# Specify a CSS selector to retrieve
selector <- ".sidebar"

# Start a new Chrome session
chrome_session <- ChromoteSession$new()

# Open a browser remotely controlled from R
chrome_session$view()

# Navigate to a website and wait for the page's load event to fire
chrome_session$Page$navigate(url)
chrome_session$Page$loadEventFired()

# Find the root document node
root <- chrome_session$DOM$getDocument()

# Find the element matching the CSS selector using DOM methods
node <- chrome_session$DOM$querySelector(
  nodeId = root$root$nodeId,
  selector = selector
)

# Get the HTML content of the node (returned as a list with $outerHTML)
html <- chrome_session$DOM$getOuterHTML(
  nodeId = node$nodeId
)

# Close the browser session
chrome_session$close()

From there, the rendered HTML is available as a string in html$outerHTML, which we can hand to {rvest} or any other HTML parsing library to further extract or navigate the desired content.
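
For example, here’s a minimal sketch of parsing the captured markup with {rvest}, using the sidebar’s links as an illustrative target:

# Parse the captured HTML and pull out the sidebar link text
read_html(html$outerHTML) |>
  html_elements("a") |>
  html_text2()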

Fin

Hopefully, this video and code help you get started scraping dynamic websites directly from R. Plus, it goes to show that R can be used for more than just data analysis!

If you have any questions or suggestions, feel free to leave a comment on the video or reach out to me on socials!