Web Scraping with Scrapy

Stasbytes
6 min read · Feb 6, 2024

Introduction

As technology advances, web scraping remains one of the most useful tools in the tech toolkit for extracting data from the internet. Having access to accurate information sets one apart in today’s world. In this article, we’ll use a Python framework called Scrapy to perform web scraping tasks.

This matters for job hunting in particular, where obtaining accurate and up-to-date job information is of utmost importance.

Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to site monitoring and automated testing. There are different ways of using Scrapy to handle web scraping; in this tutorial, we will be using a spider. A spider is a class that defines how to crawl a site and extract data from its pages.

In this article, we’ll learn the fundamentals of scraping a website by working with a fake jobs dashboard, which contains a list of roles, their locations and the times they were advertised. At the end of this tutorial, you should have a fully functional web scraping script that you understand how to use and that will be useful when developing future Scrapy projects.

Setting up the Scrapy Project

Getting started with Scrapy is simple. The first step is to install the package, typically with pip install scrapy.
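To confirm that the install worked, a quick check like the sketch below can help. The helper name is_installed is hypothetical and not part of Scrapy; it relies only on Python's standard importlib module.

```python
from importlib.util import find_spec


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in the current environment."""
    return find_spec(package) is not None


# After running `pip install scrapy`, this should report True.
print("Scrapy installed:", is_installed("scrapy"))
```

If it reports False, the most likely cause is that pip installed the package into a different Python environment than the one running the check.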

The next step is to create a project, which keeps the working environment clean and manageable. This is done with the scrapy startproject command, followed by the project name, scrape_jobs.

This creates a folder called scrape_jobs that will be used throughout this project. A common mistake that leads to frustration is forgetting to change directories into the folder just created, so make sure you do that (cd scrape_jobs) before continuing.

The scrapy genspider command then creates the spider that handles the project. It takes the spider’s name, jobs in this example, followed by the URL that is going to be scraped.

If you’ve been following along, the project folder now contains a scrapy.cfg file and an inner scrape_jobs package with a spiders folder. The last command created a new file in that spiders folder called jobs.py, which is the main file we will be working with in this tutorial.
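As a rough way to check the folder against the layout Scrapy typically generates, the sketch below lists any expected files that are missing. The file list reflects the usual output of startproject and genspider and is an assumption about your Scrapy version; missing_files is a hypothetical helper written for this illustration.

```python
from pathlib import Path

# Files that `scrapy startproject scrape_jobs` followed by
# `scrapy genspider jobs <url>` is expected to produce (assumed layout).
EXPECTED_FILES = [
    "scrapy.cfg",
    "scrape_jobs/__init__.py",
    "scrape_jobs/items.py",
    "scrape_jobs/middlewares.py",
    "scrape_jobs/pipelines.py",
    "scrape_jobs/settings.py",
    "scrape_jobs/spiders/__init__.py",
    "scrape_jobs/spiders/jobs.py",
]


def missing_files(project_root: str) -> list[str]:
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Run from the directory that contains scrapy.cfg; an empty list means
# every expected file is present.
print(missing_files("."))
```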

When it is first generated, the jobs.py file contains a minimal spider class: a name, the allowed domain, the start URL and an empty parse method.

Starting to Scrape

Now that we have laid the groundwork, it is time to start scraping the web page.

The scrapy shell command, followed by the URL of the page, opens an interactive terminal that interacts directly with the page.

When web scraping, an understanding of HTML is extremely important, and basic CSS knowledge is also helpful. To find the information you need, right-click on it in the browser and select Inspect. This shows the HTML of the page and the element where the needed information is stored.

For this project, we simply want to get the job role, location and timestamp. HTML is structured, so each job posting, including all the necessary information, sits inside an element with a specific class.

Information for one job posting

Getting the information for one job posting is important as it helps set up what we need to get the rest.

This is still done in the interactive terminal. As mentioned earlier, all the information about each job is contained within the structure of the page’s HTML. On closer inspection, we see that the div with the class card-content holds the information for each job.

We already have the response from loading the webpage earlier. Now we use CSS selectors to target the exact information required. Selecting div.card-content and then calling get() returns the HTML of the first matching element, which is the first job posting.

Now that we have all the information for one posting, we drill down to the specifics needed.

Job Title

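From the HTML of the posting, you will notice that the text we need, ‘Senior Python Developer’, is contained within an h2 tag with the class attribute title is-5. The space means the tag has two classes, title and is-5. In a CSS selector each class is prefixed with ‘.’, so the selector becomes h2.title.is-5, with ::text appended to get just the text.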

Job Location

The job location is in the p tag with the class location. To get the exact information we need, we apply ::text to get just the text and strip() to remove the surrounding whitespace and line breaks.
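What strip() does here can be seen with a plain string. The value below is a made-up example of what ::text might return when the location sits on its own indented line in the HTML.

```python
# Illustrative value: text surrounded by line breaks and indentation.
raw_location = "\n        Example City, AA\n      "

# strip() removes leading and trailing whitespace, including newlines.
location = raw_location.strip()
print(repr(location))

# strip() does not remove markup; the tags are already excluded by ::text.
print("<p>Example City, AA</p>".strip())
```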

Job Timestamp

The timestamp sits in a time tag, so the selector is simply time. We apply ::text to get the exact data needed.

Getting All Information

Now that we know how to get the exact information we need, we scrape the entire web page to get all the necessary data at once. We do this by adding a loop to the parse method of the spider in jobs.py, the Python file generated earlier.

This loop goes through every job posting on the page and gets the role, location and timestamp for each one.

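Inside the loop, a yield statement produces a Python dictionary containing the extracted data. Using yield rather than return makes parse a generator, so Scrapy receives one dictionary at a time while the loop keeps running. Each dictionary represents a scraped job post and contains the role, location, and time posted as key-value pairs.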

Next, we save the file and leave the interactive terminal that we have been working in by calling exit().

Running the Script

Finally, we run the script in the terminal.

The scrapy crawl jobs -o jobs.json command runs the jobs spider that we’ve been working with and extracts the information. The -o option specifies the output file, here jobs.json, and Scrapy infers the JSON format from the file extension. The project is complete.

Once the crawl finishes, the final JSON file contains a list with one object per job posting, each holding the role, location and time posted.
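One way to inspect the exported file is to load it back into Python, as in the sketch below. It assumes jobs.json sits in the current directory and holds a JSON array of objects; load_jobs is a hypothetical helper, not part of Scrapy.

```python
import json
from pathlib import Path


def load_jobs(path: str) -> list[dict]:
    """Read the exported file and return the list of job dictionaries."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Only runs if the crawl has already produced the file.
if Path("jobs.json").exists():
    jobs = load_jobs("jobs.json")
    print(f"{len(jobs)} jobs scraped")
    # Show the first few postings as a quick sanity check.
    for job in jobs[:3]:
        print(job)
```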

Conclusion

In this article, we looked into Scrapy for web scraping. We covered the installation process, used CSS selectors on the page’s HTML to extract specific information, looped over every job posting to collect all the data, and exported the results to a JSON file.

This is a solid foundation for any more complex web scraping projects you might want to undertake. There is a lot more functionality available in Scrapy; if you need any extra help, check out the extensive documentation. Scrapy allows you to save time and effort while extracting data for your web scraping projects. Good luck and happy scraping!
