I’ve taken a shine to posting on the /r/learnpython subreddit recently, and I’ve noticed a trend over the last few days: quite a lot of people seem to be using the holiday break to learn how to web scrape, and they keep running into the same problem. They’ve checked the HTML of a website in their browser to find the info they want and constructed a perfectly acceptable selector, but it isn’t returning anything! I decided to write up a quick tutorial that I can link to instead of having to answer the question every time.
How Does The Internet Even Work?
So, first we’ll chat about what actually happens when you visit a web page. When you type an address into your address bar, your browser sends an HTTP GET request to that address. This usually returns a file, typically an HTML file to be specific: a plain text document with the content of the web page wrapped in HTML tags. This is the document your web scraping software acts on; your selectors look through these tags to grab the content inside them. However, when you’re using a browser, this first file isn’t all that gets requested.
At minimum, you’ll also get a CSS file that tells your browser how to make the page look pretty, and if there are any images, embedded videos, etc., your browser will download those too. On more complex pages there’ll be further requests or instructions, and nine times out of ten these are the cause of the missing info. In short, the HTML you are inspecting in your browser is NOT the initial HTML your scraper requests; it has been altered by those additional requests. You can check for this and get the data you actually want by following the steps below.
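To make that concrete, here is a minimal sketch of what your scraper actually sees, assuming the requests library and a placeholder URL:

```python
# A minimal sketch, assuming the requests library and a placeholder URL.
# What gets printed is the *initial* HTML only, before any JavaScript
# or follow-up requests have altered it.
import requests

response = requests.get("https://example.com/page")
print(response.text)  # the raw HTML your selectors actually run against
```

If your content shows up somewhere in that output, you’re in scenario 1 below; if it doesn’t, read on to scenarios 2 and 3.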
Scenario 1: Restructuring the Page
The first scenario is that your data is in the initial request, just in another spot. Some JavaScript runs to restructure the page and move the data into place, and the data itself is usually sitting in either embedded JavaScript or JSON. The best way to check for this is to add a print(response.text) to your code and then use Ctrl+F to look for the content. If you are lucky it will be easily selected, like an attribute in a tag. If you are less lucky it will be some JSON, probably in a script tag; in that case you can usually select the tag and use the json module in Python to parse its contents into a dict. If you are really unlucky it will be buried in a bunch of JavaScript, but with a mix of the right selector and a regex you might be able to pull it out.
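For the script-tag case, a sketch along these lines usually does the job. BeautifulSoup and the “page-data” id are assumptions here; use whatever selector matches the tag you found with your Ctrl+F:

```python
# A sketch of parsing JSON embedded in a script tag. The URL and the
# "page-data" id are hypothetical; substitute what you found via Ctrl+F.
import json

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/page")
soup = BeautifulSoup(response.text, "html.parser")

# Grab the script tag holding the data.
script = soup.find("script", id="page-data")

# Parse its contents into a regular Python dict.
data = json.loads(script.string)
print(data)
```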
Scenario 2: The Data is Requested Later
The next scenario is that the data is literally missing from the initial request. The initial HTML contains some code that requests the data from somewhere else, and when it arrives it is inserted into the HTML by your browser. The best way to check for this is to open your browser’s developer tools, go to the Network tab and refresh the page. Do this with a true refresh, which is usually done by holding down Shift and clicking the refresh icon. When it completes, search for the content you are looking to scrape and find the request that returned it.
Once you’ve spotted a likely request, the next step is to find its details. For this you should go to the Headers tab. In the General section at the top you want to grab the URL and check whether it is a POST or a GET request. Below that, the Request Headers section shows what headers were sent with the request; some, all or maybe even none of these will need to be included in your own request. There may also be a Form Data section at the bottom that should be included, mainly in the case of POST requests.
You’ll need some trial and error to get it right, but once you have a working request you can use it instead of the website’s usual URL. In most cases the response will be JSON data, so you can throw response.text into a JSON parser (or just call response.json()) and turn it into a dict. In some rare cases it will be additional HTML, which you can select as you would any normal HTML content.
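As a rough template, replicating a request you found in the Network tab looks something like this. The URL, headers and form data below are all placeholders; substitute the values you copied from the Headers tab:

```python
# A sketch of replicating a request found in the Network tab.
import requests

url = "https://example.com/api/items"  # from the General section

# Headers copied from the Request Headers section; trial and error
# will tell you which ones the server actually requires.
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
}

# For POST requests, include the Form Data section as the request body.
form_data = {"page": "1"}

response = requests.post(url, headers=headers, data=form_data)

# Most of the time the response is JSON, which requests can parse directly.
data = response.json()
print(data)
```

A practical way to run the trial and error: start by sending every header you saw in the browser, then strip them out one at a time until the request stops working.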
Scenario 3: You’ve Been Banned
The final scenario is that you’ve been banned by some kind of anti-bot system. You can usually tell this quite easily: check response.status_code, and if it isn’t 200 you’ve probably been banned. Alternatively, the page will have been replaced by an error message, which you would have seen when checking scenario 1. In this case there may not be a lot you can do; circumventing anti-bot systems ranges from simple to complex to impossible depending on the system being used. My best advice here is to check YouTube or Google for more in-depth guides.
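A quick sketch of both checks, again assuming requests and a placeholder URL (the error-page marker is hypothetical and varies by site):

```python
# A quick sketch of both checks described above.
import requests

response = requests.get("https://example.com/page")  # placeholder URL

if response.status_code != 200:
    print(f"Probably blocked: got status {response.status_code}")
elif "are you a robot" in response.text.lower():  # hypothetical marker
    print("Looks like an anti-bot error page")
else:
    print("Request looks fine; the problem is elsewhere")
```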
Conclusion
In summary, the cause is usually that the initial response is different from the final content rendered in your browser. If you are lucky, the data is in the initial response, just in another place. If you are less lucky, it came from an additional request, probably to an API. If you are super unlucky, the page has anti-bot systems active and you’ll need to find other means to work around them.