Web Scraping using BeautifulSoup for Your Niche

How to scrape the website of your choice using Python

As someone who is regularly fascinated by the potential of data mining, I find myself desiring very specific datasets more often than I’d like to admit, and I’m not one to shy away from making this known. Unsurprisingly, I was recently contacted by a former colleague hoping to get a dataset on blue-collar jobs to determine trends and inform a study they were part of. Given how niche the topic is, I knew this was one I’d have to scrape for. And scrape I did!

Ideally, contacting the owners of target websites will yield the most accurate data, including historical data. That is easier said than done, however, for more reasons than one, which is why scraping has become the de facto way of obtaining data for studies focusing on niches.

The titular Python module is what we’ll be using to scrape and parse the target website. Python is an extremely powerful language, thanks in part to the plethora of modules available to import and use. We begin by installing and using the Python module beautifulsoup4. While this module is powerful in its own right, when used in conjunction with various other modules, any website can be scraped, irrespective of whether it loads data dynamically or requires us to log in. For our purposes here, we will restrict ourselves to a dataset from a static website on the aforementioned topic: blue-collar jobs.

Terminal:
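The original command isn’t reproduced here; a typical terminal install covering everything used later in this post would be:

```
pip install beautifulsoup4 requests pandas selenium
```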

If you are using Google Colab like I am:
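In a Colab cell the same installs are simply prefixed with an exclamation mark:

```
!pip install beautifulsoup4 requests pandas selenium
```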

Please refer to the second post in the references (final section) for more information.

We start by importing all the required modules. The tail end of the following code block is for running Selenium on Colab. Please note that Selenium is only required to deal with websites that load data dynamically (e.g. websites that load more data as the user keeps scrolling; dealt with in step vi).
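The original cell isn’t preserved, but a minimal sketch of the imports, with the usual headless-Chrome options for Colab at the tail end, would look like this:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Selenium is only needed for pages that load data dynamically (step vi)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Headless Chrome options so Selenium can run inside a Colab notebook
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
```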

The target website uses Cloudflare, which by default blocks any request we send, presumably to ward off unwanted automated traffic. A simple workaround is to modify the header and change the user-agent to that of a known browser (instead of the default python-requests/2.23.0).
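A minimal sketch of that workaround; the URL is a placeholder for the actual job-listing page, and the user-agent string can be any recent browser’s:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36'
}

# Placeholder URL standing in for the actual job-listing page
url = 'https://www.example-jobs.com/jobs-in-mumbai'
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the request was not blocked
```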

We then create a BeautifulSoup object using the webpage’s content. Upon careful inspection of the source HTML, I realised that the website’s developer did an excellent job of organising all the divs, and the class we’re after is ‘JobItem’. I would recommend using your favourite browser’s inspect mode for any such exploration, as it lets you highlight elements on the page.
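Assuming the response from the previous step, the parsing step looks roughly like this (‘JobItem’ is the class named above):

```python
soup = BeautifulSoup(response.content, 'html.parser')

# Each job listing sits inside a div of class 'JobItem'
job_items = soup.find_all('div', class_='JobItem')
print(len(job_items))  # number of listings found on this page
```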

With the main class identified, getting to the specifics becomes as easy as diving deeper into the divs hierarchy. We determine and obtain the specific classes we’re after and store them in appropriate variables.
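The inner class names below (‘JobTitle’, ‘CompanyName’, and so on) are illustrative stand-ins for whatever the inspect mode actually reveals; the point is simply that each field is another find() inside the ‘JobItem’ div:

```python
# Testing step: print the fields for each listing on the page
for job in job_items:
    title = job.find('div', class_='JobTitle')
    company = job.find('div', class_='CompanyName')
    salary = job.find('div', class_='Salary')
    requirements = job.find('div', class_='Requirements')  # gender + education, clubbed
    posted_on = job.find('div', class_='PostedOn')
    print(title.text.strip(), company.text.strip(), salary.text.strip())
```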

At this point, it is safe to assume that if one piece of information (the company name) is available, the rest will be too.

With all pieces of the puzzle at the ready, we create a function that takes in the target city as an argument, uses it to create the city-specific URL, and generates a list of URLs corresponding to the 50 pages of entries.
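A sketch of the URL-building part of that function; the page count of 50 comes from the description above, while the domain and query pattern are placeholders:

```python
def generate_urls(city, pages=50):
    """Return the list of page URLs for the given city."""
    base_url = f'https://www.example-jobs.com/jobs-in-{city.lower()}'
    return [f'{base_url}?page={page}' for page in range(1, pages + 1)]

urls = generate_urls('Mumbai')
```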

Looping through each URL, we repeat the process described in the testing steps, but instead of printing, we append the obtained information to appropriately named lists.
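Continuing the sketch, with the same assumed class names and the headers defined earlier:

```python
titles, companies, salaries, requirements, posted_dates = [], [], [], [], []

for page_url in urls:
    page = requests.get(page_url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    for job in soup.find_all('div', class_='JobItem'):
        company = job.find('div', class_='CompanyName')
        if company is None:
            continue  # incomplete entry; see the assumption above
        titles.append(job.find('div', class_='JobTitle').text.strip())
        companies.append(company.text.strip())
        salaries.append(job.find('div', class_='Salary').text.strip())
        requirements.append(job.find('div', class_='Requirements').text.strip())
        posted_dates.append(job.find('div', class_='PostedOn').text.strip())
```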

With the lists now ready, the dataset is generated using the built-in zip function to combine each entry in one list with the corresponding entries in the other lists, each combination forming a row of the final dataset.
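With the illustrative lists above, the zip step is a one-liner:

```python
# Each tuple produced by zip() becomes one row of the final dataset
dataset = list(zip(titles, companies, salaries, requirements, posted_dates))
```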

With the dataset handy (obtained using the scraper function), we use Pandas to create a DataFrame.
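Using column names that match the fields assumed above:

```python
df = pd.DataFrame(
    dataset,
    columns=['Job Title', 'Company', 'Salary', 'Requirements', 'Posted On']
)
df.head()
```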

Since the dataset has the gender requirement and educational qualifications clubbed, we split them to create two different columns. Additionally, the substring ‘Posted on: ’ is removed from the ‘Posted On’ column.
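Assuming the clubbed field uses a consistent delimiter (a ‘|’ here, purely for illustration), the cleanup might look like this:

```python
# Split the clubbed column into two, then drop the original
df[['Gender', 'Education']] = df['Requirements'].str.split('|', n=1, expand=True)
df = df.drop(columns=['Requirements'])

# Strip the 'Posted on: ' prefix
df['Posted On'] = df['Posted On'].str.replace('Posted on: ', '', regex=False)
```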

The modified DataFrame is then converted into a .csv file, ready for data analysis.
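The file name is arbitrary:

```python
df.to_csv('blue_collar_jobs.csv', index=False)
```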

While our target website has all the data divided between pages, some websites load in more data as the user scrolls. In order to obtain all the available data, all we’ll have to do is scroll down until the maximum depth is reached, at which point we scrape the entire website in one go.

The scraper here obtains the current height of the website, scrolls to the end (using Selenium), and compares the current height with the previous. If these happen to be the same, it means we have scrolled down to the end (thanks to the third post in the references).
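A sketch of that scroll-and-compare loop, reusing the headless-Chrome options from earlier; the sleep interval is a guess that depends on how quickly the site loads new entries:

```python
import time

driver = webdriver.Chrome(options=chrome_options)
driver.get(url)  # a page that loads more entries as you scroll

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # Scroll to the bottom and give the page time to load new entries
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # height unchanged, so we have reached the end
    last_height = new_height

# With everything loaded, hand the full page source to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
```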

I’ve said this before, and I’ll say it again: data excites me. I see the potential to ascertain relationships and discover insights that can help show the way forward. When a niche dataset is explored in this way, the potential is boundless.

What’s more, we have an infinite source: the internet. The internet is a beautiful thing. It is a labyrinth that has all the information we need. Any tool that helps us make sense of this vast labyrinth is power unimaginable.

If you’ve come this far, you have my sincerest gratitude. I hope this helps you obtain the dataset you need for your niche.
