Using Scrapy-GUI In Your Scrapy Shell

I spent my Christmas break working on my first Python package, Scrapy-GUI. This package is an add-on for the Scrapy shell that gives you a graphical user interface for writing and testing Scrapy queries and processors. This post is going to focus on how the Tools tab works and how you can utilise it to help build your Scrapy web spiders.

Installation and Activation

First things first, how to install the package and integrate it into your shell.

Installation is simple: the package is on the Python Package Index, so all you need to do is install it via pip. Running pip install scrapy_gui will download and install it plus its dependencies. Please note that it requires Python 3.6+.

Once installed, to use it in your shell you just have to import the load_selector function from the module with from scrapy_gui import load_selector. Then, if for example you want to load your response into the UI, you call the function with the response as its argument, load_selector(response), and the window will open.
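For example, a first session might look something like this, using quotes.toscrape.com (the site in the screenshots below) as the target:

    $ pip install scrapy_gui
    $ scrapy shell "https://quotes.toscrape.com"
    ...
    >>> from scrapy_gui import load_selector
    >>> load_selector(response)  # opens the Scrapy-GUI window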

Tools Tab

The Tools tab allows you to test three elements of a parser: a query, a regex filter, and a processor.

The Query input is compulsory and can be found in the top left. While Scrapy can handle both XPath and CSS queries, the GUI currently only lets you test CSS queries. Anything you type here returns results equivalent to running selector.css(your_query).getall() in your shell.
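For instance, on quotes.toscrape.com a query that extracts the text of every quote would be equivalent to running this in your shell (the query itself is just an illustrative example, not something from the package):

    # Equivalent of typing div.quote span.text::text into the Query box
    results = selector.css('div.quote span.text::text').getall()
    # results is a list of strings, one per matching element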

The Regex input allows you to test regular expression filtering for your query. When the box is checked, it updates your query so that it runs selector.css(your_query).re(your_regex) instead.
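Continuing the example above, with a hypothetical pattern that pulls out capitalised words:

    # With the Regex box checked, the query is filtered through .re()
    results = selector.css('div.quote span.text::text').re(r'[A-Z]\w+')
    # .re() returns every substring matching the pattern, as a list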

Finally, the Function input lets you write Python code to process the output of your query. This code can be quite complex; for example, in the image below I imported the json package and used it to parse the content and extract specific elements.

Example Tools tab window. It shows a selector along with a complex JSON parsing function.

The Function input has two requirements:

  1. It must define a function named user_fun that takes results and selector as its arguments.
  2. The user_fun function must return a list.

Behind the scenes, the tool passes your code through Python’s exec function to define user_fun, then calls it with the results from your query and your initial selector.
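As a minimal sketch, a function that satisfies both requirements might look like this (the JSON structure and the 'name' key are made-up examples, loosely in the spirit of the screenshot above):

    import json

    # Must be named user_fun and accept results and selector
    def user_fun(results, selector):
        output = []
        for result in results:
            data = json.loads(result)    # parse each query result as JSON
            output.append(data['name'])  # 'name' is a hypothetical key
        return output                    # must return a list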

Source Tab

This tab contains the HTML source of your selector. I want to focus on this to hammer home something I mentioned in my previous entry: sometimes the content you see in your browser is not the same as the content your spider actually receives. In addition, sometimes the data you are looking for can be found more easily in hidden elements such as script tags, meta tags, or attributes on other tags.

Using this tab you can perform simple text searches on the source of your selector. This lets you quickly check that the content you are looking for is present and helps you craft the best query to extract your data.
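If you prefer staying in the shell, a rough equivalent of that search is a plain substring test against the selector’s source (using the "albert" example from the screenshot below):

    # Rough shell equivalent of the Source tab's text search
    'albert' in selector.get().lower()  # True if the text appears in the source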

Example showing the HTML source of quotes.toscrape.com. The name "albert" is highlighted.

Conclusion

Thus ends a brief introduction to Scrapy-GUI. I hope this is enough to get you started using it when building your own Scrapy spiders. If you come across any bugs or have any feedback, feel free to open an issue on its GitHub page, or even create a fork and submit a pull request!
