Introduction
Web scraping is a powerful tool. In this article, we’ll use BeautifulSoup and Python to collect some data from the Internet. Web scraping is a method of extracting information from web pages automatically, using software tools or scripts to retrieve relevant information for immediate use or to store it for later.
This tutorial introduces you to scraping websites with Python and walks step by step through how to obtain information from a website. We will use the Sky Sports website to build a data frame of the headlines on the home page and the categories these headlines fall under. This information can be used for a variety of purposes, such as monitoring trends in popular sports or finding common terms.
Tutorial Requirements and Setup
For this project, we’ll be making use of the following packages:
- Requests
- BeautifulSoup
- Pandas
Importing Relevant Packages
BeautifulSoup is a Python package designed for parsing web pages. Web pages are structured documents written in HTML (and sometimes XML), built from nested elements. BeautifulSoup lets you navigate the HTML or XML of a page, extract specific information based on tags or attributes, and then work with that data as you see fit. It makes this process much easier, which is why it is so popular.
The Requests package is useful for web scraping because it lets you send HTTP requests to a web server and receive the HTML content of a page in response. Pandas is a Python package for managing, cleaning, and visualizing the data obtained through web scraping.
Scraping sends real requests to the site’s server, so it is worth keeping the number of requests low. A Jupyter notebook helps here because you can download a page once in one cell and then rerun the parsing cells as often as you like without requesting the page again. In this tutorial, we will use Google Colab, an online notebook platform where most of the required packages are pre-installed. If you’re working locally instead, you can run the following command to install the packages required for this project (lxml is the parser used later in the tutorial):
pip install requests beautifulsoup4 pandas lxml
After installing those packages, run the code below to begin scraping.
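Before scraping anything, it can help to confirm that the packages import cleanly. The snippet below is a small check, not part of the original notebook; it only prints the installed version of each package.

```python
import bs4
import pandas as pd
import requests

# Collect the version string each package reports about itself.
versions = {
    "requests": requests.__version__,
    "beautifulsoup4": bs4.__version__,
    "pandas": pd.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails, rerun the pip command above in the same environment the notebook uses.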
Scraping the Website
To scrape the site, the first line of code sends a request to the URL we want to scrape, in this case skysports.com, using the Requests library. The response contains the page’s full HTML content.
The second line creates a BeautifulSoup object that we’ve named soup. The constructor takes two arguments: the HTML content (page.content) and the name of the parser to use (lxml).
A parser is a library that interprets the HTML text obtained from a webpage. It analyzes the document’s structure so that you can extract the information you need. Parsers exist for many formats, including HTML, XML, and CSS, and for many languages: lxml (Python), jsoup (Java), and Nokogiri (Ruby) are common examples.
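The two steps described above might look like the following sketch. It is a reconstruction under assumptions rather than the article's original code: the full URL form, the fetch_soup helper, the timeout value, and the raise_for_status() check are all additions for illustration. Only page, soup, page.content, and the lxml parser come from the text.

```python
import requests
from bs4 import BeautifulSoup

# Assumed full form of the skysports.com address mentioned in the text.
URL = "https://www.skysports.com"


def fetch_soup(url=URL, parser="lxml", timeout_seconds=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    # Step 1: send an HTTP GET request and keep the response.
    page = requests.get(url, timeout=timeout_seconds)
    # Stop early on HTTP errors such as 404 or 500.
    page.raise_for_status()
    # Step 2: pass the raw HTML and the parser name to BeautifulSoup.
    return BeautifulSoup(page.content, parser)


# The constructor also accepts a plain HTML string, which allows a quick
# offline trial. "html.parser" ships with Python, so it needs no extra install.
sample = BeautifulSoup(
    "<html><head><title>Demo page</title></head></html>", "html.parser"
)
print(sample.title.text)  # Demo page

# To fetch the real page (needs network access and the lxml package):
# soup = fetch_soup()
```

Keeping the download in its own function means a notebook can call it once and reuse the resulting soup object in later cells.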
Getting the Specific Information
To find the exact information you need on the webpage, right-click the item and choose Inspect from the context menu; this opens the browser’s developer tools with that element highlighted. HTML is a markup language in its own right and cannot be covered thoroughly in this article, but most elements on a page carry a class attribute, which you can read in the developer tools panel. Take note of the tag and its class, and use them in the code that extracts that element and related elements from the page.
The code above finds all instances of that class and saves them to a BeautifulSoup result set named match1. A loop then goes through the first 20 matches, and text.strip() keeps only the text of each one, without surrounding whitespace. Running the cell displays the extracted text. This is done for both headlines and categories, and it can be extended to as many items as you want to scrape.
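Once you know the tag and class, the extraction itself might look like the sketch below. It is not the article's original notebook code: the HTML is a made-up stand-in for the real page, and the class names news-item__headline and news-item__tag are hypothetical, so replace them with the ones you see in the Inspect panel. The name match2 is also an assumption; match1 is the name used in the text.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page (class names are hypothetical).
html = """
<div class="news-item">
  <a class="news-item__headline" href="/story-1">
    First headline
  </a>
  <span class="news-item__tag"> Football </span>
</div>
<div class="news-item">
  <a class="news-item__headline" href="/story-2">
    Second headline
  </a>
  <span class="news-item__tag"> Cricket </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find every element with the given tag and class.
match1 = soup.find_all("a", class_="news-item__headline")
match2 = soup.find_all("span", class_="news-item__tag")

# Keep at most the first 20 matches and strip surrounding whitespace.
headlines = []
for element in match1[:20]:
    headlines.append(element.text.strip())

categories = []
for element in match2[:20]:
    categories.append(element.text.strip())

print(headlines)   # ['First headline', 'Second headline']
print(categories)  # ['Football', 'Cricket']
```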
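To scrape more items without repeating the loop for each one, the same pattern can be wrapped in a small helper. This is one possible design, not something the article prescribes: extract_texts, the fields mapping, and all class names below are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_texts(soup, tag_name, class_name, limit=20):
    """Return the stripped text of the first `limit` matching elements."""
    matches = soup.find_all(tag_name, class_=class_name, limit=limit)
    return [element.text.strip() for element in matches]


# Made-up HTML; the class names are placeholders for real ones.
html = """
<div class="news-item">
  <a class="news-item__headline"> First headline </a>
  <span class="news-item__tag"> Football </span>
  <p class="news-item__summary"> Short summary one. </p>
</div>
<div class="news-item">
  <a class="news-item__headline"> Second headline </a>
  <span class="news-item__tag"> Cricket </span>
  <p class="news-item__summary"> Short summary two. </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each output column to the (tag, class) pair that identifies it.
fields = {
    "Headline": ("a", "news-item__headline"),
    "Category": ("span", "news-item__tag"),
    "Summary": ("p", "news-item__summary"),
}

scraped = {
    column: extract_texts(soup, tag_name, class_name)
    for column, (tag_name, class_name) in fields.items()
}
print(scraped)
```

Adding another item to scrape then only requires one more entry in fields.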
Storing the Data
After scraping the data, we want to store it in a format that is easier to work with, so we use a dataframe. First, we collect the lists built in the loops into a dictionary, which is then converted into a pandas dataframe. The result holds the headline and sports category for each of the first 20 headlines on Sky Sports’ homepage.
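The dictionary-to-dataframe step described above can be sketched as follows. The sample lists and the column names Headline and Category are placeholders chosen for illustration; in the notebook they would be the lists filled by the scraping loops.

```python
import pandas as pd

# Placeholder lists standing in for the scraped results.
headlines = ["First headline", "Second headline", "Third headline"]
categories = ["Football", "Cricket", "Football"]

# Each dictionary key becomes a column; both lists must have the same length.
data = {"Headline": headlines, "Category": categories}
df = pd.DataFrame(data)

print(df.head())

# Example follow-up: count how many headlines fall under each category.
print(df["Category"].value_counts())
```

If the two lists can end up with different lengths (for example, a headline without a category), pandas will raise an error, so it is worth checking the lengths before building the dataframe.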
Summary
Getting started with web scraping is straightforward, but as you scrape more sites, the HTML of the pages becomes more complex, which makes scraping harder. This article covers the basics that will help you in any project you undertake.
Make sure you read the website’s terms and conditions before scraping data. Setting a timeout on your requests keeps the script from hanging when a site stops responding, and pausing between requests lowers the chance of the site blocking your IP address. Web scraping opens up many possibilities for working with data.
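A minimal sketch of a timeout combined with a pause between requests is shown below. The helper name fetch_politely, the delay and timeout values, and the injectable get parameter are all assumptions made for illustration; suitable values depend on the site you are scraping.

```python
import time

import requests


def fetch_politely(urls, delay_seconds=2.0, timeout_seconds=10.0, get=requests.get):
    """Fetch each URL in turn, waiting between requests (hypothetical helper).

    Returns a dict mapping each URL to its HTML text, or None if it failed.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        try:
            # The timeout stops the script from waiting forever on a slow server.
            response = get(url, timeout=timeout_seconds)
            response.raise_for_status()
            pages[url] = response.text
        except requests.RequestException as error:
            # Covers timeouts, connection errors, and HTTP error statuses.
            print(f"Skipping {url}: {error}")
            pages[url] = None
    return pages


# Example call (needs network access):
# pages = fetch_politely(["https://www.skysports.com"])
```

Recording failures as None instead of raising lets a longer scraping run continue after a single bad page.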
Conclusion
This article describes how to easily scrape a website using Jupyter Notebook and BeautifulSoup. The notebook containing the code used in this article can be accessed here. Thank you for reading and happy scraping.