Nowadays, there are APIs for nearly everything. If you wanted to build an app that told people the current weather in their area, you could find a weather API and use the data from the API to give users the latest forecast.
But what do you do when the website you want to use doesn't have an API? That's where Web Scraping comes in. Web pages are built using HTML to create structured documents, and these documents can be parsed using programming languages to gather the data you want.
If you're going to spend time crawling the web, one task you might encounter is stripping out visible text content from HTML. If you're working in Python, you can accomplish this using BeautifulSoup. The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. To effectively harvest that data, you’ll need to become skilled at web scraping. The Python libraries requests and Beautiful Soup are powerful tools for the job.
Web Scraping with Python and Beautiful Soup
There are two basic steps to web scraping for getting the data you want:
- Load the web page (i.e. the HTML) into a string
- Parse the HTML string to find the bits you care about
Python provides two very powerful tools for doing both of these tasks. We can use the Requests library to retrieve the web page containing our data, and we can use the awesome Beautiful Soup package for parsing and extracting the data. If you'd like to know a bit more about the Requests library and how it works, check out this post for a bit more depth.
Using Beautiful Soup we can easily select any links, tables, lists or whatever else we require from a page with the library's powerful built-in methods. So let's get started!
HTML basics
Before we get into the web scraping, it's important to understand how HTML is structured so we can appreciate how to extract data from it. The following is a simple example of an HTML page:
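As an illustration, here is a small page held in a Python string, with Beautiful Soup used to pull out a few of its tags. The page content, the id `intro` and the class `note` are made up for this sketch and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML page; the id and class values are illustrative only.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>My first page</title>
  </head>
  <body>
    <h1 id="intro">Welcome</h1>
    <p class="note">This is a paragraph.</p>
    <a href="https://www.example.com">A link</a>
  </body>
</html>"""

soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")

heading = soup.find("h1")    # first <h1> tag
paragraph = soup.find("p")   # first <p> tag
link = soup.find("a")        # first <a> tag

# Attributes are read like dictionary keys; get_text() returns the visible text.
print(heading["id"], heading.get_text())
print(paragraph["class"], paragraph.get_text())
print(link["href"])
```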
HTML will always start with a type declaration of <!DOCTYPE html> and will be contained between <html> / </html> tags.
The <body> tags wrap around the visible part of a website, which is made up of various combinations of header tags (<h1> to <h6>), paragraphs (<p>), links (<a>) and several others not shown in this example, such as <input> and <table> tags.
HTML tags can also be given attributes, like the id and class attributes in the example above. These attributes can help with styling by uniquely identifying elements.
If these tags are new to you, it might be worth taking some time quickly getting up to speed with HTML. Codecademy and W3Schools both offer excellent introductions into HTML (and CSS) that will be more than enough for this tutorial.
Analyzing the HTML
Have you ever followed one of those links on your social media to a 'Top 10 films of 2017', only to find it's one of those sites where each listing is on a different page? Part of you wants to find out what they thought was number one, the other part wants to give up waiting for all the ads to load. Well, web scraping can help you with that.
We are going to use this article from CinemaBlend to find out the 10 Greatest Movies of All-Time.
Take a look at the link. It should bring you to a page where you can see that Taxi Driver was ranked 10th in the list. We want to grab this, so the first thing we need to do is look at the page structure. Right click on the page in the link above, and select the Page Source option.
This will bring up the HTML document for the entire page, side-menus and all. Don't be alarmed, I don't expect you to read all that. Instead press Ctrl + F and search for 10. Taxi Driver.
You should find a div element with a class of liststyle that contains the text 10. Taxi Driver.
This part of the HTML represents the rank and title found underneath the movie image on the page.
We can be sure of this because the search should return only one result, which means we must be looking at the same part of the page.
So the 10th entry in our list is Taxi Driver, but how do we get the other 9 without having to click through every page?
Open the page source again, but this time search for Continued On Next Page. You should find an a tag with a class of next-story, nested inside a div with a class of nextpage.
This section is rendered as the link we need to click on to see the next entry.
Again, we can tell this is the same element because it is the only result in the whole page source that should match.
Believe it or not, with just those two HTML segments we can create a Python script that will get us all the results from the article.
Scraping the HTML
Before we can write our scraping script, we need to install the necessary packages. Type the following into the console:
pip install requests
pip install beautifulsoup4
Now we can write our web scraper. Create a script called scraper.py and open it in your development environment. We'll start by importing Requests and BeautifulSoup.
Let's use the Requests library to grab the page holding the 10. Taxi Driver entry and store it in a variable called page. We'll also create a variable called results, which will store the film rankings in a list for us.
Do you remember when we looked at the HTML for the web article using page source? Essentially, we now have that page's HTML stored in our variable, and we're going to use BeautifulSoup to parse through the response to find the data we care about.
The next step is to feed page into BeautifulSoup.
Now we can use the BeautifulSoup built-in methods to extract the film and its ranking from the snippet we examined earlier.
To do this, we can use CSS selector syntax. In CSS, selectors are used to select elements for styling. Notice how the div element has a class of liststyle? We can use this to select the div tag, since a div tag with this exact class only appears once on the page.
Note: Usually, class attributes aren't unique and are used to style multiple elements in a similar way. If you want to guarantee uniqueness, try to use an id attribute.
Here, we use the BeautifulSoup select method with the selector div.liststyle to grab the div element we want, storing the result in a variable called element. The select method returns a list containing any matching elements. In our case, element returns: [<div>10. Taxi Driver</div>].
Since our list only contains one item, we get the element with index 0. We then use the BeautifulSoup get_text method to return just the text inside the div element, which will give us '10. Taxi Driver'.
Finally, let's append the result to our results list:
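Putting the steps above together, a minimal sketch of the single-page extraction might look like this. To keep it runnable offline, the HTML is a short stand-in string whose shape is assumed from the description above; in the real script you would fetch the page with `page = requests.get(url)` and pass `page.text` to BeautifulSoup instead.

```python
from bs4 import BeautifulSoup

# Stand-in for page.text; the markup is an assumed shape, not the real page.
page_html = '<html><body><div class="liststyle">10. Taxi Driver</div></body></html>'

results = []  # will hold the film rankings

# Feed the HTML into BeautifulSoup using Python's built-in parser.
soup = BeautifulSoup(page_html, "html.parser")

# select() takes a CSS selector and returns a list of matching elements.
element = soup.select("div.liststyle")

# Only one element matches, so take index 0 and keep just its text.
text = element[0].get_text()

results.append(text)
print(results)  # ['10. Taxi Driver']
```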
Crawling the HTML
Another key part of web scraping is crawling. In fact, the terms web scraper and web crawler are used almost interchangeably; however, they are subtly different. A web crawler gets web pages, whereas a web scraper extracts data from web pages. The two are often used together, since usually when you crawl some web pages you also want to get some data from them, hence the confusion.
In order for us to determine the other 9 rankings in the article, we will need to crawl through the web pages to find them. To do that, we are going to use the Continued On Next Page snippet we discovered before.
An <a> tag represents a link, and the destination for that link is held by the href attribute. We want to pass the value held by the href attribute to the Requests library, just like we did for the first page.
To do that, we select any a tag that has the class next-story and sits within a parent div element that itself has a class of nextpage. This will return just a single result, since a link matching these criteria occurs just once on the page, for our Continued On Next Page link.
We can then get the value of the href attribute by calling the get method on the a tag and storing it in a variable called url.
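As a rough sketch of that step, the snippet below runs the selector against a stand-in fragment. The markup and the example.com address are assumptions based on the class names described above, not the article's actual HTML.

```python
from bs4 import BeautifulSoup

# Assumed shape of the next-page link; the href is a placeholder.
snippet = (
    '<div class="nextpage">'
    '<a class="next-story" href="https://www.example.com/page-2">'
    "Continued On Next Page</a></div>"
)
soup = BeautifulSoup(snippet, "html.parser")

# An a tag with class next-story inside a div with class nextpage.
next_link = soup.select("div.nextpage a.next-story")

# get() reads an attribute from the tag, here the link destination.
url = next_link[0].get("href")
print(url)
```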
The next step would be to pass the url variable into the Requests library get method like we did at the beginning, but in order to do that we are going to need to refactor our code slightly to avoid repeating ourselves.
Refactoring the Scraper
Right now, our scraper successfully grabs our chosen page and extracts the movie title and ranking, but to do the same for the remaining pages we need to repeat the process without just duplicating our code. To do this we are going to use recursion.
Recursion involves coding a function that calls itself one or more times, something Python supports directly. Our scraper can be refactored as a recursive function called scraper.
Let's go through each section of the code and see what is happening.
The scraper function takes two arguments. The first, url, is the URL of the page you want to extract information from, which gets passed into requests.
The second argument, results, is optional but is key to the operation of our recursive function. When the function is first called, it should be called as follows:
scraper('https://www.cinemablend.com/new/10-Greatest-Movies-All-Time-According-Actors-73867.html')
The results parameter is not provided, and thus is set to an empty list. The function then grabs the page and extracts the information from it, appending it to the results list.
The next vital part of our recursive function is the check for a link to the next page.
If we find a link on the page matching the CSS selector div.nextpage a.next-story, then we will call the scraper function again, this time with the href of the link to the next page AND the results list we have generated so far. This means when scraper runs for any subsequent calls, the results parameter will not be empty and instead we will continue to append new results to it.
When the scraper reaches the last page of the article (i.e. the movie ranked number one), then there will be no link matching the CSS selector and our recursive function will return the final results list.
Note: Take care when using recursion. If you don't create a condition that will eventually end the function calls, a recursive function will keep calling itself until it causes a runtime error. Python raises this error (a RecursionError) once its recursion limit is exceeded, which protects against an issue known as stack overflow.
A complete working script could look something like this:
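The sketch below is one way to write it, following the description above rather than reproducing the original script. Two details are assumptions added here: the optional `fetch` parameter (so the function can be exercised without a network connection) and the use of `urljoin` in case the next-page href is relative.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def scraper(url, results=None, fetch=requests.get):
    """Collect the ranking on this page, then follow the next-page link."""
    if results is None:
        results = []  # first call: start with an empty list

    page = fetch(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # Extract the rank and title, e.g. '10. Taxi Driver'.
    element = soup.select("div.liststyle")
    if element:
        results.append(element[0].get_text())

    # Recurse while a 'Continued On Next Page' link exists.
    next_link = soup.select("div.nextpage a.next-story")
    if next_link:
        next_url = urljoin(url, next_link[0].get("href"))
        return scraper(next_url, results, fetch)

    # Base case: no next-page link, so this was the last page.
    return results
```

Calling scraper with the article URL shown earlier and printing the return value should list the entries in the order the pages were visited, provided the page still uses the class names described here.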
Scraper limitations
So now you've seen how easily you can extract information from a web page, why wouldn't you use it all the time? Well, sadly, there are downsides.
For starters, web scraping can be slower at obtaining information than an equivalent API, and some sites don't like you scraping information from their pages, so you need to check their policies to see if it's okay.
But perhaps the most significant drawback is changes to the HTML page structure. One of the advantages of APIs is that they are designed with developers in mind, and are therefore less likely to change how they work. Web pages on the other hand can change quite dramatically. If the web page author decides to change the class names of their elements, such as the nextpage and next-story classes we used in our CSS selectors, our scraper will break. This can be frustrating if a website updates regularly.
That being said, web sites have improved their structures a lot over the years with the popularity of many easy-to-use frameworks, which means pages are unlikely to change too much over time.
Summary
Hopefully you've seen enough that you can now use web scraping confidently in your own projects. The advantage of web scraping is that what you see is what you get. If you know the information you are after, you don't need to dig around trying to figure out an API to get it. Just code a simple scraper and it's yours!
Once you’ve put together enough web scrapers, you start to feel like you can do it in your sleep. I’ve probably built hundreds of scrapers over the years for my own projects, as well as for clients and students in my web scraping course.
Occasionally though, I find myself referencing documentation or re-reading old code looking for snippets I can reuse. One of the students in my course suggested I put together a “cheat sheet” of commonly used code snippets and patterns for easy reference.
I decided to publish it publicly as well – as an organized set of easy-to-reference notes – in case they’re helpful to others.
While it’s written primarily for people who are new to programming, I also hope that it’ll be helpful to those who already have a background in software or python, but who are looking to learn some web scraping fundamentals and concepts.
Table of Contents:
- Extracting Content from HTML
- Storing Your Data
- More Advanced Topics
Useful Libraries
For the most part, a scraping program deals with making HTTP requests and parsing HTML responses.
I always make sure I have requests and BeautifulSoup installed before I begin a new scraping project. Both can be installed from the command line with pip (pip install requests and pip install beautifulsoup4).
Then, at the top of your .py file, make sure you’ve imported these libraries correctly.
Making Simple Requests
- Make a simple GET request (just fetching a page)
- Make a POST request (usually used when sending information to the server, like submitting a form)
- Pass query arguments, aka URL parameters (usually used when making a search query or paging through results)
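The three request types above can be sketched as small wrappers around the requests library. The function names and the example.com URL are placeholders. The final lines prepare a request without sending it, which is a handy way to see how query arguments end up in the URL.

```python
import requests


def fetch_page(url):
    """Simple GET request: just fetch a page."""
    return requests.get(url)


def submit_form(url, form_data):
    """POST request: send form fields (a dict) to the server."""
    return requests.post(url, data=form_data)


def search(url, query_args):
    """GET request with query arguments appended to the URL."""
    return requests.get(url, params=query_args)


# Build (but do not send) a request to inspect the final URL.
prepared = requests.Request(
    "GET",
    "https://www.example.com/search",
    params={"q": "python", "page": 2},
).prepare()
print(prepared.url)
```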
Inspecting the Response
- See what response code the server sent back (useful for detecting 4XX or 5XX errors)
- Access the full response as text (get the HTML of the page in a big string)
- Look for a specific substring of text within the response
- Check the response’s Content Type (see if you got back HTML, JSON, XML, etc)
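As a sketch, the checks above can be gathered into one helper that reads the `status_code`, `text` and `headers` attributes of a requests response. The function name `describe_response` and the `needle` parameter are hypothetical.

```python
def describe_response(response, needle):
    """Summarise a requests-style response object in a plain dict."""
    return {
        # Numeric status, e.g. 200, 404 or 500.
        "status_code": response.status_code,
        # True for any 4XX or 5XX response.
        "is_error": response.status_code >= 400,
        # Length of the HTML returned as one big string.
        "length": len(response.text),
        # Simple substring check against the body.
        "contains_needle": needle in response.text,
        # Tells you whether you got HTML, JSON, XML, etc.
        "content_type": response.headers.get("Content-Type"),
    }
```

With a real response you would call something like `describe_response(requests.get(url), "Taxi Driver")`.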
Extracting Content from HTML
Now that you’ve made your HTTP request and gotten some HTML content, it’s time to parse it so that you can extract the values you’re looking for.
Using Regular Expressions
Using Regular Expressions to look for HTML patterns is famously NOT recommended at all.
However, regular expressions are still useful for finding specific string patterns like prices, email addresses or phone numbers.
Run a regular expression on the response text to look for specific string patterns:
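For instance, a minimal sketch using Python's re module on a sample string (in practice this would be the response text). The patterns are deliberately simple and are assumptions for illustration; real-world prices, emails and phone numbers come in more formats than these cover.

```python
import re

# Stand-in for response.text.
text = "Contact sales@example.com or call 555-123-4567. Price: $19.99 (was $24.50)."

# Dollar amounts with optional cents.
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)

# A loose email pattern: something@domain.tld
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

# US-style phone numbers written as 123-456-7890.
phones = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text)

print(prices, emails, phones)
```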
Using BeautifulSoup
BeautifulSoup is widely used due to its simple API and its powerful extraction capabilities. It has many different parser options that allow it to understand even the most poorly written HTML pages – and the default one works great.
Compared to libraries that offer similar functionality, it’s a pleasure to use. To get started, you’ll have to turn the HTML text that you got in the response into a nested, DOM-like structure that you can traverse and search.
- Look for all anchor tags on the page (useful if you’re building a crawler and need to find the next pages to visit)
- Look for all tags with a specific class attribute (eg <li>...</li>)
- Look for the tag with a specific ID attribute (eg: <div>...</div>)
- Look for nested patterns of tags (useful for finding generic elements, but only within a specific section of the page)
- Look for all tags matching CSS selectors (similar query to the last one, but might be easier to write for someone who knows CSS)
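The lookups listed above might be written as follows. The HTML fragment, the class name `search-result` and the id `results` are invented for this sketch.

```python
from bs4 import BeautifulSoup

html = """
<div id="results">
  <ul>
    <li class="search-result"><a href="/item/1">First</a></li>
    <li class="search-result"><a href="/item/2">Second</a></li>
  </ul>
</div>
<a href="/about">About</a>
"""

# Turn the HTML text into a nested structure we can search.
soup = BeautifulSoup(html, "html.parser")

# All anchor tags on the page.
all_links = soup.find_all("a")

# All tags with a specific class attribute.
result_items = soup.find_all("li", class_="search-result")

# The tag with a specific id attribute.
results_div = soup.find(id="results")

# Nested pattern: anchors, but only inside the results div.
nested_links = results_div.find_all("a")

# The same nested query written as a CSS selector.
css_links = soup.select("div#results li.search-result a")
```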
- Get a list of strings representing the inner contents of a tag (this includes both the text nodes as well as the text representation of any other nested HTML tags within)
- Return only the text contents within this tag, but ignore the text representation of other HTML tags (useful for stripping out pesky <span>, <strong>, <i>, or other inline tags that might show up sometimes)
- Convert the text that you are extracting from unicode to ascii if you’re having issues printing it to the console or writing it to files
- Get the attribute of a tag (useful for grabbing the src attribute of an <img> tag or the href attribute of an <a> tag)
Putting several of these concepts together, here’s a common idiom: iterating over a bunch of container tags and pulling out content from each of them.
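Here is a rough sketch of those operations and of the container idiom. The product markup and the dictionary keys are made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<div class="product"><h2>Blue <strong>Widget</strong></h2>
  <img src="/img/blue.png"><a href="/p/1">Details</a></div>
<div class="product"><h2>Red <i>Gadget</i></h2>
  <img src="/img/red.png"><a href="/p/2">Details</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

first_title = soup.find("h2")
inner = first_title.contents        # child nodes: a text node plus the <strong> tag
plain = first_title.get_text()      # text only, inline tags stripped

# Drop any non-ASCII characters if they cause trouble when printing or saving.
ascii_only = "Café menu".encode("ascii", "ignore").decode("ascii")

# Common idiom: loop over container tags and pull values out of each one.
items = []
for container in soup.find_all("div", class_="product"):
    items.append({
        "title": container.find("h2").get_text(),
        "image": container.find("img").get("src"),
        "link": container.find("a").get("href"),
    })
```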
Using XPath Selectors
BeautifulSoup doesn’t currently support XPath selectors, and I’ve found them to be really terse and more of a pain than they’re worth. I haven’t found a pattern I couldn’t parse using the above methods.
If you’re really dedicated to using them for some reason, you can use the lxml library instead of BeautifulSoup, as described here.
Storing Your Data
Now that you’ve extracted your data from the page, it’s time to save it somewhere.
Note: The implication in these examples is that the scraper went out and collected all of the items, and then waited until the very end to iterate over all of them and write them to a spreadsheet or database.
I did this to simplify the code examples. In practice, you’d want to store the values you extract from each page as you go, so that you don’t lose all of your progress if you hit an exception towards the end of your scrape and have to go back and re-scrape every page.
Writing to a CSV
Probably the most basic thing you can do is write your extracted items to a CSV file. By default, each row that is passed to the csv.writer object to be written has to be a python list.
In order for the spreadsheet to make sense and have consistent columns, you need to make sure all of the items that you’ve extracted have their properties in the same order. This isn’t usually a problem if the lists are created consistently.
If you’re extracting lots of properties about each item, sometimes it’s more useful to store the item as a python dict instead of having to remember the order of columns within a row. The csv module has a handy DictWriter that keeps track of which column is for writing which dict key.
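Both approaches can be sketched with the standard library csv module. The function names, file path and column names are placeholders chosen for this example.

```python
import csv


def write_rows(path, header, rows):
    """Write a header plus a list of row lists using csv.writer."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def write_dicts(path, fieldnames, items):
    """Write a list of dicts; DictWriter maps each key to its column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(items)
```

For example, `write_dicts("films.csv", ["rank", "title"], [{"title": "Taxi Driver", "rank": 10}])` writes the columns in the order given by fieldnames, regardless of the key order inside each dict.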
Writing to a SQLite Database
You can also use a simple SQL insert if you’d prefer to store your data in a database for later querying and retrieval.
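A minimal sketch with the standard library sqlite3 module follows. The table and column names are invented, and an in-memory database is used so the example leaves no file behind; pass a file path to `sqlite3.connect` to keep the data.

```python
import sqlite3


def save_items(conn, items):
    """Insert (ranking, title) tuples, creating the table if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS films (ranking INTEGER, title TEXT)"
    )
    # Placeholders (?) let sqlite3 escape the values safely.
    conn.executemany(
        "INSERT INTO films (ranking, title) VALUES (?, ?)", items
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save_items(conn, [(10, "Taxi Driver")])
rows = conn.execute(
    "SELECT ranking, title FROM films ORDER BY ranking"
).fetchall()
print(rows)  # [(10, 'Taxi Driver')]
```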
More Advanced Topics
These aren’t really things you’ll need if you’re building a simple, small scale scraper for 90% of websites. But they’re useful tricks to keep up your sleeve.
Javascript Heavy Websites
Contrary to popular belief, you do not need any special tools to scrape websites that load their content via Javascript. In order for the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere.
It usually means that you won’t be making an HTTP request to the page’s URL that you see at the top of your browser window, but instead you’ll need to find the URL of the AJAX request that’s going on in the background to fetch the data from the server and load it into the page.
There’s not really an easy code snippet I can show here, but if you open the Chrome or Firefox Developer Tools, you can load the page, go to the “Network” tab and then look through all of the requests that are being sent in the background to find the one that’s returning the data you’re looking for. Start by filtering the requests to only XHR or JS to make this easier.
Once you find the AJAX request that returns the data you’re hoping to scrape, then you can make your scraper send requests to this URL, instead of to the parent page’s URL. If you’re lucky, the response will be encoded with JSON, which is even easier to parse than HTML.
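Finding the right URL is manual work, but once you have it, handling a JSON response is short. The payload below is a made-up example of what such an endpoint might return; with the requests library, calling `response.json()` performs the same decoding as `json.loads(response.text)`.

```python
import json

# Hypothetical body of an AJAX response.
payload = '{"items": [{"title": "Taxi Driver", "rank": 10}]}'

# Decode the JSON text into ordinary Python dicts and lists.
data = json.loads(payload)

# No HTML parsing needed: just index into the structure.
titles = [item["title"] for item in data["items"]]
print(titles)  # ['Taxi Driver']
```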
Content Inside Iframes
This is another topic that causes a lot of hand wringing for no reason. Sometimes the page you’re trying to scrape doesn’t actually contain the data in its HTML, but instead it loads the data inside an iframe.
Again, it’s just a matter of making the request to the right URL to get the data back that you want. Make a request to the outer page, find the iframe, and then make another HTTP request to the iframe’s src attribute.
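A sketch of the middle step, finding the iframe and working out the URL to request next, could look like this. The helper name `iframe_urls` is hypothetical, and `urljoin` is used here on the assumption that the src may be relative to the outer page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(page_url, page_html):
    """Return absolute URLs for every iframe on the page that has a src."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [
        urljoin(page_url, frame["src"])
        for frame in soup.find_all("iframe", src=True)
    ]
```

Each URL returned can then be fetched with a second request, for example `requests.get(iframe_urls(url, page.text)[0])`.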
Sessions and Cookies
While HTTP is stateless, sometimes you want to use cookies to identify yourself consistently across requests to the site you’re scraping.
The most common example of this is needing to log in to a site in order to access protected pages. Without the correct cookies sent, a request to the URL will likely be redirected to a login form or presented with an error response.
However, once you successfully log in, a session cookie is set that identifies who you are to the website. As long as future requests send this cookie along, the site knows who you are and what you have access to.
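With the requests library, a Session object stores cookies from one response and sends them with later requests. The sketch below assumes a site with a simple form-based login; the function name, URLs and form field names are placeholders, and real sites often need extra fields such as CSRF tokens.

```python
import requests


def fetch_protected(login_url, credentials, protected_url, session=None):
    """Log in with a session, then request a page that needs the cookie."""
    if session is None:
        session = requests.Session()

    # The server's Set-Cookie header is stored on the session.
    session.post(login_url, data=credentials)

    # The stored cookie is sent automatically with this request.
    return session.get(protected_url)
```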
Delays and Backing Off
If you want to be polite and not overwhelm the target site you’re scraping, you can introduce an intentional delay or lag in your scraper to slow it down.
Some also recommend adding a backoff that’s proportional to how long the site took to respond to your request. That way if the site gets overwhelmed and starts to slow down, your code will automatically back off.
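One way to sketch both ideas is a fixed base delay plus a backoff proportional to how long the last request took. The one-second base and the factor of two are arbitrary assumptions, and `fetch` stands for whatever function you use to download a page, such as `requests.get`.

```python
import time


def polite_delay(response_seconds, base_delay=1.0, factor=2.0):
    """Seconds to wait: a fixed pause plus a multiple of the response time."""
    return base_delay + factor * response_seconds


def fetch_politely(urls, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing longer when the site responds slowly."""
    pages = []
    for url in urls:
        started = time.monotonic()
        pages.append(fetch(url))
        elapsed = time.monotonic() - started
        # Slow responses lead to longer pauses before the next request.
        sleep(polite_delay(elapsed))
    return pages
```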
Spoofing the User Agent
By default, the requests library sets the User-Agent header on each request to something like “python-requests/2.12.4”. You might want to change it to identify your web scraper, perhaps providing a contact email address so that an admin from the target website can reach out if they see you in their logs.
More commonly, this is used to make it appear that the request is coming from a normal web browser, and not a web scraping program.
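Setting the header once on a Session applies it to every request made through that session. The User-Agent string below is a made-up example of the "identify yourself" style; for a single request you can instead pass `headers={"User-Agent": ...}` to `requests.get`.

```python
import requests

session = requests.Session()

# Replace the default python-requests User-Agent for all requests
# made through this session. The value here is only an example.
session.headers.update(
    {"User-Agent": "my-scraper/0.1 (contact: admin@example.com)"}
)

print(session.headers["User-Agent"])
```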
Using Proxy Servers
Even if you spoof your User Agent, the site you are scraping can still see your IP address, since they have to know where to send the response.
If you’d like to obfuscate where the request is coming from, you can use a proxy server in between you and the target site. The scraped site will see the request coming from that server instead of your actual scraping machine.
If you’d like to make your requests appear to be spread out across many IP addresses, then you’ll need access to many different proxy servers. You can keep track of them in a list and then have your scraping program simply go down the list, picking off the next one for each new request, so that the proxy servers get even rotation.
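The rotation described above can be sketched with itertools.cycle, which loops over the list endlessly. The proxy addresses are placeholders, not working servers. The returned dictionary uses the format that the requests library accepts for its `proxies` argument.

```python
from itertools import cycle

# Placeholder addresses; substitute proxies you actually have access to.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

# cycle() yields the list items in order and starts over at the end.
proxy_pool = cycle(PROXIES)


def next_proxy_config():
    """Return a proxies dict for the next proxy in the rotation."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}
```

Each request then picks up the next proxy, for example `requests.get(url, proxies=next_proxy_config())`.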
Setting Timeouts
If you’re experiencing slow connections and would prefer that your scraper moved on to something else, you can specify a timeout on your requests.
Handling Network Errors
Just as you should never trust user input in web applications, you shouldn’t trust the network to behave well on large web scraping projects. Eventually you’ll hit closed connections, SSL errors or other intermittent failures.
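A hedged sketch that combines a request timeout with a simple retry loop is shown below. The retry count, the ten-second timeout and the doubling wait are arbitrary choices, and the `get` and `sleep` parameters are assumptions added so the function can be tried without a network connection.

```python
import time

import requests


def fetch_with_retries(url, retries=3, timeout=10,
                       get=requests.get, sleep=time.sleep):
    """Fetch a URL, retrying on network errors and error status codes."""
    for attempt in range(1, retries + 1):
        try:
            # timeout stops the request from hanging on a slow connection.
            response = get(url, timeout=timeout)
            # Turn 4XX and 5XX responses into exceptions as well.
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            # Covers closed connections, timeouts, SSL errors and HTTP errors.
            if attempt == retries:
                raise
            # Wait 2, 4, 8... seconds before trying again.
            sleep(2 ** attempt)
```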
Learn More
If you’d like to learn more about web scraping, I currently have an ebook and online course that I offer, as well as a free sandbox website that’s designed to be easy for beginners to scrape.
You can also subscribe to my blog to get emailed when I release new articles.