Headless Web Scraping with Playwright and JavaScript

Stasbytes
Mar 21, 2024

Introduction

Web scraping is the process of extracting data from websites. That data has many applications, including research, content aggregation and forecasting. Every industry relies on data, which is why web scraping is a valuable skill. Collection can be done manually by copying and pasting, but because that is tedious and inefficient, the process is usually automated with scripts and tools.

There are numerous programming languages used by developers all over the world, and each has its own tools for simplifying the data collection process.

Scraping should be done carefully, because it can put significant strain on a website's servers. Some sites offer APIs to supply data directly to users; others publish guidelines (a robots.txt file, for example) that tell scrapers what they may and may not access. Make sure to follow these rules, or you could be barred from accessing the website.

For this tutorial, we will use Playwright, one of the most popular browser automation libraries in the JavaScript ecosystem. We will scrape a website that is free to practice on and save the extracted data to a CSV file for easy analysis later.

Introduction to Playwright

Playwright is a great fit for web scraping. It supports headless execution, which means the browser runs in the background without a visible window, so you can carry on with other work while your scrape is running.

Playwright also offers strong automation capabilities, letting a script interact with a web page the way a real user would: clicking buttons, filling out forms, following links and navigating between pages. Its API simplifies the entire process, and if you run into errors, there is an active community where help is easy to find. All of this combines to make Playwright great for web scraping tasks.

Web Scraping

Setting up the Environment

Our first task is to install the required libraries. Playwright is a JavaScript library, so this tutorial assumes Node.js and npm are already installed on your system. Run the following command in your terminal:
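npm install playwright cheerio csv-writer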

This line installs the playwright, cheerio and csv-writer packages. Cheerio is a library that lets you parse HTML and XML documents, while csv-writer simplifies the process of writing CSV files. If this is your first Playwright project, you may also need to download its browser binaries by running npx playwright install chromium.

Importing the Libraries

Next, we create a file called scrape.js, where our scraping code will live.
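At the top of the file, the imports look something like this:

const { chromium } = require('playwright');
const cheerio = require('cheerio');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;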

The lines above import the modules we need for the scraping process and for writing the extracted data to a CSV file.

The Scraping Function

Getting into the nitty-gritty of the scraping, we create a function to perform the task.
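Here is a sketch of the opening of that function. The target URL is an assumption (webscraper.io's free e-commerce test site matches the div.thumbnail markup used below), so substitute whichever practice site you are working with:

async function test() {
  // Launch Chromium; headless: false keeps the browser window visible
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();

  // Assumed target: a site that is free to practice scraping on
  await page.goto('https://webscraper.io/test-sites/e-commerce/allinone');

  // Grab the rendered HTML and load it into Cheerio as $
  const html = await page.content();
  const $ = cheerio.load(html);

  // Each product card on this site sits inside a div.thumbnail element
  const items = $('div.thumbnail');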

In the block of code above, we define a function named test. Inside it, we launch a Chromium browser with the headless option set to false so the browser window stays visible (handy for debugging; flip it to true for a truly headless run). We then create a new page and navigate it to the URL we want to scrape. Next, we retrieve the page's HTML content and load it into Cheerio, assigning it to the $ variable so we can query the content.

We select all div.thumbnail elements on the page and store them in the items variable, so that we can extract specific data from each individual element.

Extracting Specific Data

With the page's product cards selected, we can now pull out the fields we need.
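Continuing inside test(), here is a sketch of the extraction loop. The .title, .price and .description selectors are assumptions based on the test site's markup, so adjust them to fit your target:

  const data = [];

  for (const item of items.toArray()) {
    const obj = {};

    // Title and price as shown on the listing page
    obj.title = $(item).find('.title').text().trim();
    obj.price = $(item).find('.price').text().trim();

    // The card's relative link, joined to the base URL, points at
    // the product page that holds the description
    const link = $(item).find('.title').attr('href');
    const productUrl = 'https://webscraper.io' + link;

    // Navigate to the product page and read its HTML
    const productPageResponse = await page.goto(productUrl);
    const productPageHtml = await productPageResponse.text();
    const $product = cheerio.load(productPageHtml);
    obj.description = $product('.description').text().trim();

    data.push(obj);
  }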

First, we initialize an empty data array to hold the extracted information. A loop then iterates over each item in the items collection, and inside it an object called obj is created to store the extracted data for the current item.

The script first extracts the title and price text from the listing page, then reads the href attribute, which is saved as link. With a little string manipulation, that relative link is joined to the site's base URL to form the address of the product page containing the description we need.

Once that's done, page.goto navigates to that URL and the response is stored in productPageResponse; its text() method retrieves the HTML content, which is saved to productPageHtml. That HTML is then loaded with Cheerio and the description text is extracted.

At this point, the extracted title, price and description are assigned to their respective properties of the obj object, which is then pushed into the data array, and the loop repeats until all of the items have been processed.

Closing out

Now that we have scraped the data, we conclude the process by closing the browser and writing the extracted data into CSV format.
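Still inside test(), a sketch of the final block:

  await browser.close();

  // Configure the CSV output: data.csv with Title, Price and Description columns
  const csvWriter = createCsvWriter({
    path: 'data.csv',
    header: [
      { id: 'title', title: 'Title' },
      { id: 'price', title: 'Price' },
      { id: 'description', title: 'Description' },
    ],
  });

  await csvWriter.writeRecords(data);
  console.log('The CSV file was written successfully');
}

test();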

The final block of code closes the browser and then uses csvWriter to create a CSV file called data.csv with the headers Title, Price and Description.

csvWriter.writeRecords(data) is then used to save the extracted data from the data array to the CSV file. We log a message to the console to indicate when this process is completed.

Finally, we call the test() function to execute the process. When it finishes, data.csv should contain one row per product, with its title, price and description.

Conclusion

The code used in this post is available on Pastebin. In this tutorial, we looked at how to scrape a website using Playwright and Cheerio, using their automation and parsing capabilities to select elements, extract the necessary information and navigate to additional pages to retrieve further details. Finally, the data was stored in a CSV file for integration with other applications or for analysis.

Playwright can do far more than web scraping. Whether you need to extract data, automate tasks or simulate user interactions, Playwright is ready and available for you. Good luck, and happy scraping!
