Responsible Crawling with Scrapy

How to collect business data with a minimum of requests

No worries, this mission is not impossible ;-)

The portal we are about to visit is not a testing site but real business. Customers do find information about the services provided by the salons and may book an appointment with the hairdresser of their choice. Please remember that most salons were closed during lockdown. Some did not survive.

Suppose you were only interested in the hair salons in Germany. You could collect all relevant URLs via an XPath expression like //div[@class="salondirectory-citylist"][1]/a/@href (yes, the index starts at one), iterate over the list, and fire a request for each URL. But you can do so much better!

Image by Klaus-Dieter Warzecha

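A rule in a CrawlSpider settles two things: which links to follow and what to do with the pages behind them. The first is controlled by a LinkExtractor, the latter is decided via a callback. This is the standard approach in Scrapy to process the answer to an HTTP request, also known as a response.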

If you were to fetch Austrian and Swiss salons as well, simply expand the tuple for allowed regular expressions that define the URLs. Alternatively, remember how to use the pipe symbol in regular expressions.
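The two variants can be compared with nothing but the standard library. The URL scheme with a country code in the path is an assumption made for this sketch. The check with re.search mirrors how, to my understanding, LinkExtractor applies its allow patterns.

```python
import re

# Hypothetical URL scheme: a country code as a path segment (assumption).
ALLOW_TUPLE = (r"/salons/de/", r"/salons/at/", r"/salons/ch/")
ALLOW_PIPE = r"/salons/(de|at|ch)/"

urls = [
    "https://example.com/salons/de/berlin",
    "https://example.com/salons/at/wien",
    "https://example.com/salons/ch/zuerich",
    "https://example.com/salons/fr/paris",
]


def allowed_by_tuple(url):
    # A URL passes if any one of the patterns is found in it.
    return any(re.search(pattern, url) for pattern in ALLOW_TUPLE)


def allowed_by_pipe(url):
    # The pipe symbol expresses the same alternatives in a single pattern.
    return re.search(ALLOW_PIPE, url) is not None


for url in urls:
    print(url, allowed_by_tuple(url), allowed_by_pipe(url))
```

Both variants accept the same URLs here, so the choice is mostly a matter of readability.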

If you really need the attributes outlined above, you will have to

The good news is that you just need another rule for your CrawlSpider and a suitable callback method:

The bad news is that this comes with a price. You will be firing a lot of requests against a portal. Don’t be surprised if you’re met with reluctance and run into countermeasures.
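If the deep crawl cannot be avoided, Scrapy's built-in settings can at least slow it down. The setting names below are Scrapy's own. The values are illustrative assumptions, not recommendations from the original article.

```python
# Conservative crawl settings. The setting names are Scrapy's own;
# the numbers are assumptions chosen for illustration.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 2.0,                # seconds between requests to a site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # no parallel requests to one domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 2.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "HTTPCACHE_ENABLED": True,            # avoid re-fetching during development
    "HTTPCACHE_EXPIRATION_SECS": 86400,   # keep cached pages for one day
}

for key, value in POLITE_SETTINGS.items():
    print(f"{key} = {value!r}")
```

The dictionary could be assigned to a spider's custom_settings attribute, or its entries copied into the project's settings.py.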

Step back and rethink the idea of deep-diving into the individual web pages. Do you really need all the data at this very moment?

If you’re scraping just to stockpile data, don’t! Your bandwidth is not unlimited and the transport cost is not zero.

What are your options, then, to use your potential more responsibly?

Overview pages, like the ones with all the hairdressers from one city, are an excellent and cheap source of information.

The individual entries contain company name, postal code, city, street, latitude, longitude, the URL of the shop, the number of ratings, and the mean value of the rating. All this information is easily available from the anchor tags within one div element.
