Scrapy is a Python framework that can download, parse, and store the data you scrape, and the data it collects can in turn feed APIs and web services.
It works like a spider, crawling pages and extracting data from them, and it handles this work asynchronously: a spider hands a request to the engine, the page is downloaded, and the response comes back to the spider for processing. The spider then generates new requests for the next pages, and the cycle repeats.
A web crawler is a program that automatically finds documents on the web and then indexes and catalogs their contents for later retrieval. This capability is what allows Scrapy to be used on large websites.
The main components of a Scrapy crawler are the engine, the scheduler, the downloader, and the spiders. The engine coordinates the data flow between them: it passes requests to the scheduler, which queues them and hands them back when the engine is ready to download the next page.
Requests and responses that travel between the engine and the spiders pass through the spider middleware. The spiders themselves are Python classes that you write: they process each response by extracting the items they need, generate further requests if necessary, and send both back to the engine.
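As a rough illustration, here is a minimal spider sketch. The site URL, CSS selectors, and field names are assumptions and would have to match the real page structure.

```python
import scrapy


class BookSpider(scrapy.Spider):
    """Minimal sketch: parse a listing page, yield items,
    and hand follow-up requests back to the engine."""
    name = "books"
    # Hypothetical starting URL -- replace with a site you are allowed to crawl.
    start_urls = ["https://example.com/books"]

    def parse(self, response):
        # Extract one item per product block (selectors are assumptions).
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }

        # Generate a further request: follow the pagination link, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Everything the parse method yields, whether an item or a new request, goes back through the engine, which routes items toward the pipelines and requests toward the scheduler.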
Depending on the size of the website, each spider can generate any number of requests. The engine passes those requests to the scheduler, which adds them to a queue and returns them whenever the engine asks for the next one.
Scrapy can also retry requests that fail, through its built-in retry middleware. Letting the framework handle this makes crawls faster and less expensive than relying on the spiders themselves to detect failures and resubmit requests.
In addition to retrying, the framework handles concurrent downloads (the engine and downloader process many requests in parallel) and data cleaning (through item pipelines). These functions let you build scrapers that work with large amounts of data while maintaining a good performance level.
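A rough sketch of how these concerns are usually configured in a project's settings.py and pipelines.py follows; the numeric values and the pipeline class are illustrative assumptions, not recommendations.

```python
# settings.py -- illustrative values, tune them for the site you crawl.
RETRY_ENABLED = True
RETRY_TIMES = 2                      # retry a failed request up to 2 extra times
CONCURRENT_REQUESTS = 16             # requests processed in parallel overall
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain politeness limit
DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same site

ITEM_PIPELINES = {
    # Hypothetical module path; points at the class sketched below.
    "myproject.pipelines.CleanPricePipeline": 300,
}


# pipelines.py -- a minimal data-cleaning pipeline (assumes dict items
# with a "price" field, as in the spider sketch above).
class CleanPricePipeline:
    def process_item(self, item, spider):
        # Strip whitespace and a leading currency symbol from the price.
        price = (item.get("price") or "").strip().lstrip("£$€")
        item["price"] = price
        return item
```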
These features make Scrapy well suited to crawling large websites, and they are one of the reasons why so many developers use it for their applications.
The Scrapy library also includes a number of middlewares and tools that help automate your crawling process. For example, the Scrapy documentation points to the SelectorGadget browser extension for finding the CSS or XPath selector of an element on a web page, and the built-in scrapy shell lets you test those selectors interactively.
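Once a tool like SelectorGadget suggests a selector, you can check it against real markup with Scrapy's Selector class (or interactively with scrapy shell). The HTML below is made up for the sake of a self-contained sketch.

```python
from scrapy.selector import Selector

# Stand-in HTML for a page you would normally fetch with scrapy shell.
html = """
<ul id="faculty">
  <li class="person"><a href="/jdoe">Jane Doe</a></li>
  <li class="person"><a href="/rroe">Richard Roe</a></li>
</ul>
"""

sel = Selector(text=html)

# The same elements selected by CSS and by XPath.
print(sel.css("li.person a::text").getall())          # ['Jane Doe', 'Richard Roe']
print(sel.xpath("//li[@class='person']/a/text()").getall())
```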
You can also use XPath predicates and recursive (descendant) queries. These let you filter elements by specific attributes and search an entire subtree, which saves you from writing custom filtering code for each search.
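A short sketch of both ideas; the markup and attribute names are invented for illustration.

```python
from scrapy.selector import Selector

html = """
<div class="listing">
  <article data-dept="math"><h3>Calculus Notes</h3></article>
  <article data-dept="cs"><h3>Parsing Basics</h3></article>
  <article data-dept="cs"><h3>Crawler Design</h3></article>
</div>
"""

sel = Selector(text=html)

# Predicate filter: only articles whose data-dept attribute equals "cs".
cs_titles = sel.xpath('//article[@data-dept="cs"]//h3/text()').getall()
print(cs_titles)  # ['Parsing Basics', 'Crawler Design']

# Recursive query: "//" searches the whole subtree at any depth.
all_text = sel.xpath('//div[@class="listing"]//text()[normalize-space()]').getall()
print(all_text)
```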
When using XPath, you must be careful about what you are selecting. For instance, if you are scraping the names of faculty members at UCSB, the name and email fields may appear in different places on each detail page, so you have to make sure your XPath handles every variant.
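This is the kind of defensive XPath that situation calls for, as a sketch; the page structures shown here are entirely hypothetical, not UCSB's actual markup.

```python
from scrapy.selector import Selector

# Two made-up faculty detail pages with the email in different places.
page_a = '<div class="card"><h1>Dr. Ada Lovelace</h1><span class="email">ada@example.edu</span></div>'
page_b = '<div class="bio"><h1>Dr. Alan Turing</h1><p>Contact: <a href="mailto:alan@example.edu">alan@example.edu</a></p></div>'

for html in (page_a, page_b):
    sel = Selector(text=html)
    name = sel.xpath("//h1/text()").get()
    # Try the common layout first, then fall back to a mailto link.
    email = (
        sel.xpath('//span[@class="email"]/text()').get()
        or sel.xpath('//a[starts-with(@href, "mailto:")]/text()').get()
    )
    print(name, email)
```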
For the best results, test your XPath on several pages of a website before using it in production. This ensures that you are getting consistent results and that your selectors are not matching more than you intend.
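One way to run that sanity check, assuming a handful of pages saved locally (for example with scrapy fetch) in a samples/ directory; the directory name and the selector under test are placeholders.

```python
from pathlib import Path
from scrapy.selector import Selector

SAMPLE_PAGES = Path("samples").glob("*.html")
NAME_XPATH = "//h1/text()"  # the selector under test -- an assumption about the site

for page in SAMPLE_PAGES:
    sel = Selector(text=page.read_text(encoding="utf-8"))
    matches = sel.xpath(NAME_XPATH).getall()
    # Flag pages where the selector matches nothing, or more than one node.
    status = "ok" if len(matches) == 1 else f"check ({len(matches)} matches)"
    print(f"{page.name}: {status}")
```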