Web Scraping with Java and Selenium

Stasbytes
5 min read · Feb 28, 2024


Introduction

Web scraping is a technique for extracting data from websites. Every industry relies heavily on data for research, analysis, and decision making, and scraping automates the collection of that data from webpages, typically in HTML format.

Scraping can consume a significant amount of a website’s server resources, so before starting a project, it is important to carefully review each site’s guidelines on scraping. Some websites provide APIs so you don’t have to scrape at all, but you must often extract the data yourself to fit your specific use case.

This article aims to help you understand web scraping with Java and Selenium. We will scrape a job listing website and extract relevant information from it.

Java is a widely popular programming language thanks to its strong community support and extensive range of libraries. Those traits, together with its first-class Selenium support for browser automation, make it well suited to web scraping.

Setting Up The Environment

Selenium is a browser automation tool best known for testing, and it is equally useful for web scraping. Several features make it an essential part of your toolkit:

- Browser Interaction: Selenium can be programmed to interact with a webpage, simulating real user behavior such as clicking links and filling forms, and it can take screenshots along the way.

- Multiple Programming Languages Support: Selenium supports many programming languages, including Python, Java, C#, and Ruby. This enables developers to use it in their preferred language.

- Range: Selenium offers broad control over the browser. Through the WebDriver API you can access DOM elements, handle alerts, and manage cookies. It is a very useful tool.

Selenium works with a variety of browsers, including Chrome, Edge, Firefox, Safari, and Internet Explorer; you simply install the matching driver that lets Selenium control the browser. In this article we will use Chrome, so you will need to install ChromeDriver, a tool that acts as the bridge between Selenium and the browser.

With ChromeDriver installed, we set up the Maven environment. To get the dependencies used for the program, you can check out this Pastebin.
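For reference, the core dependency entry looks like the following; the version shown is illustrative, so check Maven Central for the current selenium-java release.

```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <!-- Illustrative version; use the latest release -->
    <version>4.18.1</version>
</dependency>
```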

Web Scraping

Importing the Required Libraries

These are the libraries used throughout the web scraping program.
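A minimal set of imports for this kind of program looks like this:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.io.FileWriter;
import java.io.IOException;
import java.time.Duration;
import java.util.List;
```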

Navigating to the Website

This block of code sets up ChromeDriver, launches the Chrome browser, navigates to the URL we want to scrape, and prints a confirmation that the website is open.
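A sketch of that step; the driver path and URL are placeholders for your own setup:

```java
// Point Selenium at the ChromeDriver binary (path is a placeholder).
System.setProperty("webdriver.chrome.driver", "/path/to/chromedriver");
WebDriver driver = new ChromeDriver();

// Placeholder URL; substitute the job listing site you want to scrape.
driver.get("https://example.com/jobs");
System.out.println("Website is open: " + driver.getTitle());
```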

Getting the Job Listings

After opening the website, we inspect the page to find the elements that contain the job listings; they match the CSS selector .card-content. Next, we create a StringBuilder called csvData. A StringBuilder efficiently builds one large string from many smaller appends. We then add a header row to the CSV data to serve as the column names.
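In code, that step looks roughly like this; the header names simply mirror the fields we extract later:

```java
// Every job card on the page matches the .card-content selector.
List<WebElement> jobs = driver.findElements(By.cssSelector(".card-content"));

// Build the CSV contents in memory, starting with a header row.
StringBuilder csvData = new StringBuilder();
csvData.append("Title,Company,Location,Description\n");
```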

Extracting Relevant Information

This code iterates through the list of jobs on the page. For each job it retrieves the title, company name, and location by locating their elements with CSS selectors. It then finds the Apply link in the posting and clicks it, which lets us scrape the additional information that appears after the click.
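A sketch of the loop; the child selectors (.title, .company, .location) and the Apply link text are assumptions about this particular page’s markup, so inspect the page to confirm them:

```java
for (WebElement job : jobs) {
    // Assumed child selectors; adjust to match the actual page structure.
    String title = job.findElement(By.cssSelector(".title")).getText();
    String company = job.findElement(By.cssSelector(".company")).getText();
    String location = job.findElement(By.cssSelector(".location")).getText();

    // Open the job's detail page in a new window.
    job.findElement(By.linkText("Apply")).click();

    // Window switching and description extraction follow below.
}
```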

Getting the Data from the Apply Link

This block of code switches the WebDriver to the new window that appears after clicking the Apply link. It iterates through all open window handles, and once it has switched to the new window it waits up to 10 seconds for the job description element to be present. Once the element appears, its text content is extracted.
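That logic, sketched out; the .content selector for the description is an assumption about the detail page’s markup:

```java
// Remember the listings window so we can return to it later.
String mainWindow = driver.getWindowHandle();

// Switch to whichever window handle is not the main one: the Apply page.
for (String handle : driver.getWindowHandles()) {
    if (!handle.equals(mainWindow)) {
        driver.switchTo().window(handle);
    }
}

// Wait up to 10 seconds for the description element to appear.
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
WebElement descriptionElement = wait.until(
        ExpectedConditions.presenceOfElementLocated(By.cssSelector(".content")));
String description = descriptionElement.getText();
```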

Next, it appends the extracted job details to the csvData StringBuilder in CSV format, closes the Apply window, and returns the driver to the main window. This repeats for each listing: extract, append, close, and move on to the next job.
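Roughly, continuing inside the loop; the quoting guards against commas and line breaks inside the description:

```java
// Quote each field and double embedded quotes so the CSV stays valid.
csvData.append('"').append(title.replace("\"", "\"\"")).append("\",\"")
       .append(company.replace("\"", "\"\"")).append("\",\"")
       .append(location.replace("\"", "\"\"")).append("\",\"")
       .append(description.replace("\"", "\"\"")).append("\"\n");

driver.close();                        // close the Apply window
driver.switchTo().window(mainWindow);  // back to the job listings
```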

Finalizing the Process

Once the last listing is processed, the driver returns to the main window. The code then saves the scraped job data to a CSV file named job_data.csv and prints a message to the console if the write succeeds. Finally, the code quits the driver and closes the browser window.
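A sketch of that final step:

```java
// Write the accumulated rows to disk and shut the browser down.
try (FileWriter writer = new FileWriter("job_data.csv")) {
    writer.write(csvData.toString());
    System.out.println("Job data saved to job_data.csv");
} catch (IOException e) {
    e.printStackTrace();
}
driver.quit();
```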

The Scraped Data

The final CSV file contains one row per job listing, with columns for the title, company, location, and description.

Conclusion

This article walked through automating the scraping of job data from a webpage, extracting the relevant job details, and saving them to a CSV file. You can access the code used here. Thank you, and happy scraping!
