Utilities used by any spider of the Behoof project

Project description

Overview

The bhfutils package is a collection of utilities used by any spider of the Behoof project.

Requirements

  • Unix-based machine (Linux or OS X)

  • Python 2.7 or 3.6

Installation

Inside a virtualenv, run pip install -U bhfutils. This installs the latest version of the Behoof Scrapy Cluster spider utilities. After that you can use a special settings.py compatible with Scrapy Cluster (a template is placed in crawler/setting_template.py).
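
The utilities are wired into a crawler through ordinary Scrapy settings. The sketch below shows the general shape such a settings.py might take; the class paths and Redis setting names here are assumptions for illustration only, so treat crawler/setting_template.py as the authoritative reference.

    # settings.py -- minimal sketch, not the shipped template
    # (class paths and setting names below are assumptions)

    # Redis connection used by the scheduler and the filters
    REDIS_HOST = 'localhost'
    REDIS_PORT = 6379
    REDIS_DB = 0

    # hand scheduling and duplicate filtering over to the Redis-backed components
    SCHEDULER = 'bhfutils.crawler.distributed_scheduler.DistributedScheduler'
    DUPEFILTER_CLASS = 'bhfutils.crawler.redis_dupefilter.RFPDupeFilter'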

Documentation

Full documentation for the bhfutils package does not exist yet; the sections below summarize the individual modules.

custom_cookies.py

The custom_cookies module is a custom cookies middleware that passes the required cookies along but does not persist them between calls.
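
Such a middleware is normally registered in DOWNLOADER_MIDDLEWARES in place of Scrapy's built-in cookies middleware. A hedged sketch, assuming the class is exposed as bhfutils.crawler.custom_cookies.CustomCookiesMiddleware (check the module for the real class path):

    # settings.py -- swap the stock cookies middleware for the custom one
    DOWNLOADER_MIDDLEWARES = {
        # disable Scrapy's default cookies handling (700 is its default priority)
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
        # class path is an assumption; check custom_cookies.py for the exact name
        'bhfutils.crawler.custom_cookies.CustomCookiesMiddleware': 700,
    }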

distributed_scheduler.py

The distributed_scheduler module is a Scrapy request scheduler that utilizes Redis throttled priority queues to moderate scrape requests to different domains within a distributed Scrapy cluster.
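
Because the queues are priority queues, per-request priorities set in a spider influence the order in which the scheduler pops work, while throttling keeps any single domain from being hit too fast. A small, hypothetical spider shown only to illustrate request priorities (it assumes the scheduler has been enabled via the SCHEDULER setting as in the installation sketch):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # hypothetical spider used purely for illustration
        name = 'example'

        def start_requests(self):
            # higher priority values are popped first from a priority queue
            yield scrapy.Request('https://example.com/important', priority=10)
            yield scrapy.Request('https://example.com/normal', priority=0)

        def parse(self, response):
            pass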

redis_domain_max_page_filter.py

The redis_domain_max_page_filter module is a Redis-based max-page filter applied per domain: it bounds the maximum number of pages crawled for a particular domain.
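
Conceptually, such a bound can be kept with a per-domain counter in Redis that is incremented for every accepted request and compared against the limit. A simplified, stand-alone sketch of the idea using the redis-py client (not the module's actual code):

    import redis

    # conceptual sketch only; the real filter lives in redis_domain_max_page_filter.py
    r = redis.Redis(host='localhost', port=6379, db=0)

    def allow_page(domain, max_pages):
        """Return True while the per-domain page counter stays within max_pages."""
        count = r.incr('pages:%s' % domain)  # atomic increment, starts at 1
        return count <= max_pages

    if allow_page('example.com', max_pages=1000):
        pass  # schedule the request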

redis_dupefilter.py

The redis_dupefilter module is a Redis-based request duplication filter.
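
The usual approach for a Redis-based dupe filter is to store request fingerprints in a Redis set and drop any request whose fingerprint is already present. A simplified sketch of the idea using Scrapy's request_fingerprint helper (deprecated in newer Scrapy releases, shown here only to illustrate the technique; this is not the module's actual implementation):

    import redis
    from scrapy.utils.request import request_fingerprint

    # conceptual sketch; the shipped filter is in redis_dupefilter.py
    r = redis.Redis(host='localhost', port=6379, db=0)

    def seen_before(request, key='dupefilter:fingerprints'):
        """SADD returns 0 when the fingerprint was already in the set."""
        fp = request_fingerprint(request)
        return r.sadd(key, fp) == 0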

redis_global_page_per_domain_filter.py

The redis_global_page_per_domain_filter module is a Redis-based request-count filter. When this filter is enabled, all crawl jobs have GLOBAL_PAGE_PER_DOMAIN_LIMIT as a hard limit on the maximum number of pages they are allowed to crawl for each individual spiderid+domain+crawlid combination.
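
The limit itself is set in settings.py; the value below is only an example, and how the filter is switched on is not covered here, so check the module and crawler/setting_template.py for the enabling mechanism.

    # settings.py -- hard cap applied to every spiderid+domain+crawlid combination
    GLOBAL_PAGE_PER_DOMAIN_LIMIT = 100  # example value; choose a limit for your crawl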
