While I was teaching web scraping this week, a student challenged me to scrape weather data from Windy.com. That challenge led to a video demonstrating how to handle both static and JavaScript-heavy websites using R and Chromote.
Video
In the video, I walk through:
- Scraping the static https://www.r-project.org website;
- Retrieving the weather data from the dynamic https://windy.com website; and,
- Cleaning up and summarizing the extracted data.
Code
You can find the code used in the video in the following GitHub repository:
Why Chromote?
Traditional web scraping methods fail spectacularly when the content of a webpage is generated dynamically by JavaScript. The trick to getting around this is to wait for the JavaScript to finish rendering before extracting the content. Chromote solves this by letting you control a Chrome or Chromium-based browser directly from R. This allows you to:
- Wait for JavaScript content to load on the page (see the sketch after this list);
- Interact with dynamic elements; and then
- Capture the fully rendered page.
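As a minimal sketch of that first waiting step, you can poll the page until the JavaScript-rendered element you need actually exists. The URL and the #content selector below are placeholders for illustration, not taken from the video:

library(chromote)

# Start a session and open a page (placeholder URL)
session <- ChromoteSession$new()
session$Page$navigate("https://example.com")
session$Page$loadEventFired()

# Poll until the JavaScript-rendered element exists, or give up after ~10 seconds
for (i in seq_len(20)) {
  found <- session$Runtime$evaluate(
    "document.querySelector('#content') !== null"
  )$result$value
  if (isTRUE(found)) break
  Sys.sleep(0.5)
}

# The element (if found) can now be extracted with DOM methods, as shown below
session$close()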
While {RSelenium} is another popular option for scraping dynamic websites, I ran into notable issues using it for this project because Selenium Docker images are not available for ARM/M-series Macs. In those environments, {chromote} is a more reliable alternative.
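If you want to confirm which browser {chromote} will drive on your machine (it launches a locally installed Chrome or Chromium, no Docker required), the package ships a small helper:

library(chromote)

# Path to the Chrome/Chromium binary that chromote will launch
find_chrome()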
Quick Example
Here’s a minimal example of launching Chrome, navigating to a page, and extracting the content of a specific CSS selector:
library(chromote)
library(rvest)

# Specify a website
url <- "https://www.r-project.org"

# Specify a CSS selector to retrieve
selector <- ".sidebar"

# Start a new Chrome session
chrome_session <- ChromoteSession$new()

# Open a browser remotely controlled from R
chrome_session$view()

# Navigate to a website
chrome_session$Page$navigate(url)
chrome_session$Page$loadEventFired()

# Find the root document node
root <- chrome_session$DOM$getDocument()

# Find the element using DOM methods
node <- chrome_session$DOM$querySelector(
  nodeId = root$root$nodeId,
  selector = selector
)

# Get the HTML content of the node
html <- chrome_session$DOM$getOuterHTML(
  nodeId = node$nodeId
)

# Close the browser session
chrome_session$close()
From there, we can use the HTML stored in the html variable with {rvest} or any other HTML parsing library to further extract or navigate the desired content.
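For example, here is a small follow-on sketch with {rvest}; pulling the link targets out of the sidebar is just for illustration:

library(rvest)

# Parse the HTML string returned by getOuterHTML()
sidebar <- read_html(html$outerHTML)

# For example, pull the link targets out of the sidebar
sidebar |>
  html_elements("a") |>
  html_attr("href")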
Fin
Hopefully, this video and code help you get started scraping dynamic websites directly from R. Plus, it goes to show that R can be used for more than just data analysis!
If you have any questions or suggestions, feel free to leave a comment on the video or reach out to me on socials!