Web Scraping with Node.js Using Puppeteer

Stasbytes
5 min read · Jan 31, 2024


Introduction

Introduction to Web Scraping

Web scraping is a method of extracting information from websites. It is done automatically, using software tools or scripts to retrieve the relevant information for immediate use or for storage for later. The explosion and importance of data are widely known, yet accessing data can be difficult because it is rarely in an easily extractable format, and not every website offers an API that provides its data. All of these factors contribute to the importance of web scraping.

A common difficulty when scraping the web is dynamic content: content that changes based on user interaction with the website, such as clicking, scrolling the page, or typing into a text box.

Introduction to Node.js and Puppeteer

To make this process easier, we’ll talk about web scraping with Node.js and Puppeteer. What are those, you might ask? Node.js is a JavaScript runtime that works on a variety of operating systems. It enables you to execute JavaScript code outside of a web browser, which lets developers create command-line tools, and it can also be used for server-side programming.

Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the DevTools Protocol. Puppeteer runs in headless mode by default, but can be configured to run in full (“headful”) Chrome/Chromium. Headless mode means running the browser without its visible user interface, and it is often used for projects that require browser automation.

Puppeteer is a powerful tool because it allows you to perform tasks such as:

- Capture screenshots and PDFs.

- Create an automated testing environment using the latest JavaScript and browser features.

- Diagnose your website’s performance issues and much more.

This article introduces the basics of web scraping with Puppeteer, so we’ll scrape a test site on a web scraping practice platform. That will give you an excellent background for using Puppeteer in future projects.

Getting Started with Web Scraping

Setting up your environment

Running this command in the terminal will install Puppeteer on your device.
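```bash
npm install puppeteer
```

By default, installing Puppeteer this way also downloads a compatible build of Chromium for the library to control.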

Starting with Puppeteer
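Begin your script with the import:

```javascript
const puppeteer = require('puppeteer');
```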

This line imports the Puppeteer library.

Creating the Web Scraping Function

First, create a JavaScript file in which the code will be written.
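A sketch of what this first block might look like, building on the import above (the URL here is a placeholder; replace it with the test site you want to scrape):

```javascript
async function scrapeTableData() {
  // Launch a headless browser instance (headless is Puppeteer's default)
  const browser = await puppeteer.launch();

  // Create a new page (tab) in the browser
  const page = await browser.newPage();

  // Navigate to the page containing the table
  // (placeholder URL -- substitute the site you are scraping)
  await page.goto('https://example.com/tables');

  // Wait until the table is visible, ensuring it has fully loaded
  await page.waitForSelector('table', { visible: true });

  // ...data extraction continues in the next section...
```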

This block of code goes through the initial steps of scraping the data. It starts by declaring an asynchronous function named scrapeTableData, which will be used to scrape the required data. An asynchronous function allows your program to start a long-running operation while remaining responsive to other work, rather than blocking until the first task is completed. Once the function’s work is done, it returns the result.

Next, the code launches a headless browser instance using Puppeteer’s launch method. After launching the browser, the code creates a new page within it, represented by a Page object.

Once this page is created, the code navigates to the specified URL using the page.goto method. Subsequently, the code instructs the page to wait until the table to be scraped becomes visible. This ensures that the table loads completely before proceeding with scraping the data.

Getting Data from the Table
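Continuing inside scrapeTableData, the extraction step might look along these lines:

```javascript
  // Run code inside the page's browser context to pull out the table contents
  const { headers, data } = await page.evaluate(() => {
    // Select the first table on the page
    const table = document.querySelector('table');

    // Collect every row (tr) in the table
    const rows = table.querySelectorAll('tr');

    // Headers: the trimmed text of each th element in the first row
    const headers = Array.from(rows[0].querySelectorAll('th')).map(
      (th) => th.textContent.trim()
    );

    // Data: for every row after the first, the trimmed text of each td cell
    const data = Array.from(rows)
      .slice(1)
      .map((row) =>
        Array.from(row.querySelectorAll('td')).map((td) => td.textContent.trim())
      );

    return { headers, data };
  });
```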

The above chunk of code extracts data from the table. First, the program calls page.evaluate(), which runs code inside the page’s browser context. document.querySelector(‘table’) is then used to select the page’s first table element. The querySelectorAll() method is then used to select all tr (table row) elements and return a collection of row elements.

The next step is to extract the headers from the table element. The code takes the th (table header) elements from the first row (rows[0]). After extraction, Array.from() is used to transform the header elements into an array. map() is then combined with .textContent.trim() to extract only the text content of each header cell.

For the table rows, the code converts the row collection to an array, skips the header row with .slice(1), and then uses map() combined with .textContent.trim() to extract only the text content of each cell.

The return statement passes an object containing the extracted table headers and row data out of the page.evaluate() call.

Presenting the Extracted Data
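The function might then finish like this:

```javascript
  // Print the extracted headers and rows to the console
  console.log(headers);
  console.log(data);

  // Close the headless browser instance
  await browser.close();
}
```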

The above lines display the extracted table headers and data on the console and then close the headless browser instance that was created.

Using the Scrape Function
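At the bottom of the file, the function is invoked:

```javascript
scrapeTableData();
```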

This line calls the function that was created and initiates the scraping process.

Running the Script
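Assuming you saved the file as scrapeTable.js (the filename is just an example), run it from the terminal with:

```bash
node scrapeTable.js
```

If everything works, the extracted headers and rows will be printed to the console.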

Overall, this script uses Puppeteer to launch a headless browser, navigate to a web page, wait for a specific element to be visible, and then extract the table headers and data from the page using JavaScript executed in the browser context.

Conclusion

Puppeteer is an extremely deep tool, and all of its functionality cannot be covered in a single article, but this one has given you a proper introduction and background to the library. You can explore the documentation to see other examples and more use cases. Make sure to respect the rules of web scraping and check each website’s robots.txt file, which governs what may be scraped on the site. Best of luck and happy scraping!
