Web scraping is the process of extracting data from websites. It is usually done with tools or scripts, because website information is typically delivered as HTML, which must be parsed before the data can be used immediately or stored for later.
Parsing is analyzing HTML code to identify certain elements and then extracting important information from them.
Java is a great programming language to use for web scraping for various reasons such as:
- Libraries: It has a great ecosystem of libraries and frameworks that make web scraping a straightforward process. Some examples are Jsoup, HtmlUnit, Selenium WebDriver, HttpClient, and Jaunt. These libraries take different approaches to web scraping and can be chosen based on the exact requirements of your project.
- Integration: Java integrates well with a host of other technologies such as databases, servers, and cloud platforms. This makes it easier for developers to perform web scraping tasks and connect the solution to other services.
For this project, we will be scraping the Books to Scrape website using Jsoup as the Java library and Replit as the online IDE where we will write and run the code. Finally, we will save the results to a CSV file for later use.
This project assumes that the reader is familiar with the basics of Java. If you need a refresher, you can check out the official Oracle Java documentation.
Setting Up Your Environment
Jsoup is a Java library that makes it easy to parse HTML and XML documents. It is very easy to add Jsoup to your Replit environment.
Replit is a platform that provides many of the tools and features necessary for developing software. It allows developers to write, run, and collaborate on code in the web browser. Once you create an account, you’re taken to the page below. Fill it in and start building.
Installing Jsoup
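The exact setup depends on how your Repl is configured. If your project uses Maven, one common way to pull in Jsoup is to add its dependency to pom.xml (the version number below is only an example — use the latest release):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```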
Web Scraping
An interesting quirk of Replit is that after you create a Java project, only the main.java file runs when you click the Run button. The easiest way around this is to clear the code in that file and write yours there.
To get our data, we first import the libraries whose functionality we’ll be making use of.
Importing the Libraries
The first set of code imports the necessary classes from the Jsoup external library. The next block of code imports classes from Java’s standard library.
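The combined import section might look like the sketch below; the exact set depends on which classes the final program ends up using.

```java
// Jsoup classes for connecting to the page and working with its HTML
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Java standard library classes for collections and file output
import java.util.ArrayList;
import java.util.List;
import java.io.FileWriter;
import java.io.IOException;
```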
The Web Scraping Method
This block of code creates a class that serves as the entry point for the Java program. It then declares the main method. A try block is included for exception handling. Next, it uses the Jsoup library to connect to the target URL that we will scrape and obtain the page’s HTML content.
The select method from Jsoup is used to choose elements from the HTML document. The CSS selector .product_pod targets the elements with that class. These elements are then stored in a variable named books.
extractBookInformation extracts the relevant information from the selected HTML elements and returns a list of Book objects, which is stored in a bookList variable. The saveToCsv method then saves that information to the CSV file. Finally, an update message is printed, and a catch block handles any exceptions.
In summary, this code block connects to the specified URL, retrieves the HTML content, selects the important elements, extracts the information from those elements, stores the information in a CSV file and then prints a status update message.
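The steps above can be sketched as follows. The URL and file name follow the Books to Scrape project described earlier, and extractBookInformation and saveToCsv are the helper methods covered in the next sections:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.List;

public class BookScraper {

    public static void main(String[] args) {
        try {
            // Connect to the target URL and download the page's HTML
            Document document = Jsoup.connect("https://books.toscrape.com/").get();

            // Select every element with the class "product_pod" (one per book)
            Elements books = document.select(".product_pod");

            // Extract the relevant fields and save them to a CSV file
            List<Book> bookList = extractBookInformation(books);
            saveToCsv(bookList, "books.csv");

            System.out.println("Book information saved to books.csv");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```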
Extracting the Book Information
This block of code declares the method extractBookInformation. The next line uses the ArrayList class to create an empty list called bookList. A loop is then started to iterate over each Element. On each iteration, the element is assigned to book, from which the title, price, and availability are extracted and then added to bookList. Finally, return bookList; returns all the information extracted and stored.
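A sketch of that method is shown below. The CSS selectors (h3 a for the title, .price_color for the price, .availability for stock status) are assumptions based on the markup of books.toscrape.com:

```java
static List<Book> extractBookInformation(Elements books) {
    List<Book> bookList = new ArrayList<>();  // empty list to collect results
    for (Element book : books) {
        // The full title is stored in the "title" attribute of the link inside <h3>
        String title = book.select("h3 a").attr("title");
        // The price is the text of the element with class "price_color"
        String price = book.select(".price_color").text();
        // Availability text, e.g. "In stock"
        String availability = book.select(".availability").text().trim();
        bookList.add(new Book(title, price, availability));
    }
    return bookList;
}
```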
Saving to CSV
The saveToCsv method takes two arguments: bookList, which is a list of all the information gathered, and filePath, which is the path to the CSV file. A try block is used to handle errors that could come up. Next, we initialize a FileWriter named writer with the specified filePath; this writer is used to write character data to the file. Headers are then added to the file. StringBuilder is a class that allows a string to be built efficiently by appending other strings or characters.
Finally, a loop is created to write the information from bookList to the file.
Looping the Content
First, it appends the title, price, and availability, separating each value with a comma so the format adheres to CSV rules. The getTitle(), getPrice(), and getAvailability() methods are called on each book object to retrieve the corresponding information. The loop then ends.
The toString() method converts the csvContent object to a string, and the write method writes it to the file. A catch block is included here to report exactly where any error occurs, because this is a long process.
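Putting the pieces together, the whole method can be sketched as below. To keep the example self-contained and runnable, a compact stand-in for the Book class (covered in the next section) is nested inside; only saveToCsv itself corresponds to this section:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class CsvSaver {

    // Compact stand-in for the Book class described in the next section
    static class Book {
        private final String title, price, availability;
        Book(String title, String price, String availability) {
            this.title = title;
            this.price = price;
            this.availability = availability;
        }
        String getTitle() { return title; }
        String getPrice() { return price; }
        String getAvailability() { return availability; }
    }

    static void saveToCsv(List<Book> bookList, String filePath) {
        // try-with-resources closes the writer even if an exception is thrown
        try (FileWriter writer = new FileWriter(filePath)) {
            StringBuilder csvContent = new StringBuilder();
            csvContent.append("Title,Price,Availability\n");  // header row

            // Append one comma-separated line per book
            for (Book book : bookList) {
                csvContent.append(book.getTitle()).append(",")
                          .append(book.getPrice()).append(",")
                          .append(book.getAvailability()).append("\n");
            }

            writer.write(csvContent.toString());
        } catch (IOException e) {
            System.out.println("Error while saving CSV: " + e.getMessage());
        }
    }
}
```

Note that this sketch does not escape commas or quotes inside field values; for real-world CSV output a dedicated CSV library would be safer.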
The Book Class
This Book class represents a book and declares variables for storing book-related information. It provides a public constructor that takes arguments to initialize those variables, and getter methods to retrieve their values. Together, these give us a convenient way to access a book’s information.
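Based on that description, the Book class might look like this sketch:

```java
public class Book {
    // Variables for storing book-related information
    private final String title;
    private final String price;
    private final String availability;

    // Public constructor that initializes the variables
    public Book(String title, String price, String availability) {
        this.title = title;
        this.price = price;
        this.availability = availability;
    }

    // Getter methods to retrieve the stored values
    public String getTitle() { return title; }
    public String getPrice() { return price; }
    public String getAvailability() { return availability; }
}
```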
Running the Script
Now that we’ve written the code, running the script is easy. Simply click the Run button, and the results will appear as seen in the figure below. The program outputs “Book information saved to books.csv”, and books.csv becomes visible in the Files area of the left sidebar.
The content of the CSV file will be as follows:
Conclusion
This article introduced the Jsoup library for web scraping in Java, along with Replit, an incredible resource for software development. The steps of web scraping, such as defining methods, extracting information from the elements of a webpage, and saving it to a CSV file, have all been explained in depth. This is a simple program, and you can use Jsoup for far more advanced projects. Good luck and happy scraping!