Introduction to Web Scraping
Web scraping is a method of extracting information from websites. It is done automatically, using software tools or scripts to retrieve relevant information either for immediate use or for storage and later use. Data has become explosively abundant and important, yet accessing it can be difficult because it is rarely published in an easily extractable format, and not every website offers an API that exposes its data. All of these factors make web scraping valuable.
A common difficulty when scraping the web is dynamic content: content that changes based on user interaction with the website, such as clicking, scrolling the page, or typing into a text box.
Introduction to Node.js and Puppeteer
To make this process easier, we'll look at web scraping with Node.js and Puppeteer. What are those, you might ask? Node.js is a JavaScript runtime environment that works on a variety of operating systems. It lets you execute JavaScript code outside of a web browser, which enables developers to create command-line tools and also to write server-side programs.
Puppeteer is a Node.js library that provides a high-level API to control Chrome/Chromium over the DevTools Protocol. Puppeteer runs in headless mode by default, but it can be configured to run in full ("headful") Chrome/Chromium. Headless mode means running the browser without its graphical user interface, and it is often used for projects that require browser automation.
Puppeteer is a powerful tool because it allows you to perform tasks such as:
- Capture screenshots and PDFs of pages.
- Create an automated testing environment using the latest JavaScript and browser features.
- Diagnose your website's performance issues, and much more.
This article introduces the basics of web scraping with Puppeteer. We'll scrape a table from a test site designed for practicing web scraping, which will give you an excellent foundation for using Puppeteer in future projects.
Getting Started with Web Scraping
Setting Up Your Environment
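Assuming Node.js and npm are already installed, a typical setup creates a project folder (the name below is just an example) and installs Puppeteer, which also downloads a compatible build of Chromium:

```bash
mkdir table-scraper && cd table-scraper
npm init -y
npm install puppeteer
```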
Starting with Puppeteer
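To confirm that everything is wired up, you can try a minimal script (the filename and URL here are placeholders) that launches a headless browser and saves a screenshot:

```javascript
// screenshot-test.js - a quick sanity check (any filename works)
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
```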
Creating the Web Scraping Function
First, create a JavaScript file in which the code will be written.
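A minimal sketch of this first block might look like the following; the URL is a placeholder for the test site, and the rest of the function is filled in over the next sections:

```javascript
const puppeteer = require('puppeteer');

async function scrapeTableData() {
  // Launch a browser instance; Puppeteer runs headless by default.
  const browser = await puppeteer.launch();

  // Open a new tab, represented by a Page object.
  const page = await browser.newPage();

  // Navigate to the page that holds the table (placeholder URL).
  await page.goto('https://example.com/table');

  // Wait until the table is actually visible before scraping it.
  await page.waitForSelector('table', { visible: true });
```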
This block of code goes through the initial steps of scraping the data. It starts by declaring an asynchronous function named scrapeTableData, which will be used to scrape the required data. An asynchronous function lets your program start a long-running operation while staying responsive to other work, rather than blocking until that operation completes; when the operation finishes, the function resolves with the result.
Next, the code launches a headless browser instance using Puppeteer's launch function. After launching the browser, the code creates a new page within it, represented by a Page object.
Once this page is created, the code navigates to the specified URL using the page.goto method. It then instructs the page to wait until the table to be scraped becomes visible, which ensures that the table has loaded completely before the data is scraped.
Getting Data from the Table
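Continuing inside scrapeTableData, a sketch of the extraction step could look like this:

```javascript
  // Run a function inside the browser context to read the DOM directly.
  const tableData = await page.evaluate(() => {
    // Select the first table on the page, then collect all of its rows.
    const table = document.querySelector('table');
    const rows = table.querySelectorAll('tr');

    // The first row holds the th header cells; turn them into an
    // array of trimmed strings.
    const headers = Array.from(rows[0].querySelectorAll('th')).map(
      (th) => th.textContent.trim()
    );

    // Every remaining row holds td data cells; map each row to an
    // array of trimmed cell values.
    const data = Array.from(rows)
      .slice(1)
      .map((row) =>
        Array.from(row.querySelectorAll('td')).map((td) => td.textContent.trim())
      );

    // Hand both pieces back to the Node.js side.
    return { headers, data };
  });
```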
The above chunk of code extracts the data from the table. First, the program calls page.evaluate to run a function inside the page itself. document.querySelector('table') is used to select the page's first table element, and the querySelectorAll() method then selects all tr (table row) elements and returns a collection of row elements.
The next step is to extract the headers from the table element. The code takes the th (table header) elements from the first row (rows[0]); after extraction, Array.from() is used to transform the header items into an array, and map() is combined with textContent.trim() to keep only the text content of each header cell.
For the table rows, the code iterates over each row element except the first (.slice(1)) and converts each row's cells to an array. map() is again combined with textContent.trim() to extract only the text content of each cell.
Finally, the return statement passes an object containing the extracted table headers and row data out of the page.evaluate() callback.
Presenting the Extracted Data
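One simple way to finish the function is to print the results and close the browser; console.table is a handy built-in for displaying rows of data:

```javascript
  // Print the extracted headers and rows, then shut the browser down.
  console.log('Headers:', tableData.headers);
  console.table(tableData.data);

  await browser.close();
  return tableData;
}
```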
Using the Scrape Function
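With the function defined, calling it is straightforward; since it is asynchronous, it returns a promise:

```javascript
scrapeTableData()
  .then((result) => console.log(`Scraped ${result.data.length} rows.`))
  .catch((error) => console.error('Scraping failed:', error));
```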
Running the JavaScript Script
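If the file is saved as, say, scrapeTable.js (any filename works), run it from the terminal with Node.js:

```bash
node scrapeTable.js
```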
Overall, this script uses Puppeteer to launch a headless browser, navigate to a web page, wait for a specific element to be visible, and then extract the table headers and data from the page using JavaScript executed in the browser context.
Conclusion
Puppeteer is an extremely deep tool, and all of its functionality cannot be covered in a single article, but this article has given you a proper introduction to and background on the library. You can explore the documentation to see other examples and more use cases. Make sure to respect the rules of web scraping and check each website's robots.txt file, which sets out the rules for scraping the site. Best of luck, and happy scraping!