
This module implements a crawler to find all links and resources


Cr0wl3r

Description

This package implements a discreet web crawler to find all visible URLs on a website. The crawler can store pages (and reuse them for the next crawl), scan web content for dynamic content (useful for pentesting, red teaming and hacking), create a full JSON report and database so the analysis can be reused, and try to identify web pages, static content and assets in order to request only what is useful.

The name Cr0wl3r is a pun on Crawler and Growler: the tool is not offensive by itself, but it is the first step in attacking a web server.

Requirements

This package requires:

  • python3
  • python3 Standard Library

Optional:

  • Selenium

Installation

pip install Cr0wl3r 
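Selenium is optional. If you want to use it, it can be installed separately (assuming the standard selenium package on PyPI):

pip install selenium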

Usage

Command lines

# Python executable
python3 Cr0wl3r.pyz -h
# or
chmod u+x Cr0wl3r.pyz
./Cr0wl3r.pyz --help

# Python module
python3 -m Cr0wl3r https://github.com/mauricelambert

# Entry point (console)
Cr0wl3r -F report.json -L DEBUG -l logs.log -R -S -d -c "mycookie=foobar" -H "User-Agent:Chrome" -m 3 -t "p" -r https://github.com/mauricelambert
Cr0wl3r -R -S -C -d -u -i -F report.json -L DEBUG -l logs.log -c "mycookie=foobar" "session=abc" -c "counter=5" -H "User-Agent:Chrome" "Api-Key:myapikey" -H "Authorization:Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==" -m 5 -t "p" "img" -t "link" -I 3.5 -f "raw-url-only" -D4 "text/html" -r https://github.com/mauricelambert
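
The -F/--report-filename option writes a JSON report that can be reused later. Below is a minimal sketch for inspecting it; the internal structure of the report is not documented here, so only the top-level shape is printed:

from json import load

# Load the JSON report written with "-F report.json" and show its top-level
# shape without assuming anything about its internal structure.
with open("report.json") as file:
    report = load(file)

print(type(report).__name__)
if isinstance(report, dict):
    print(list(report.keys()))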

Python3

from Cr0wl3r import CrawlerRawPrinter

CrawlerRawPrinter(
    "https://github.com/mauricelambert",
    recursive=False,
).crawl()
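
The same printer can also crawl recursively. The following is a sketch only, assuming CrawlerRawPrinter accepts the constructor options used by _Crawler further below (recursive is shown above; max_request and interval are assumptions here):

from Cr0wl3r import CrawlerRawPrinter

# Sketch: max_request and interval are assumed to be accepted by
# CrawlerRawPrinter, mirroring the _Crawler options used below.
CrawlerRawPrinter(
    "https://github.com/mauricelambert",
    recursive=True,
    max_request=10,
    interval=3.5,
).crawl()

The larger example below subclasses _Crawler directly to customize how discovered web pages, static files and other assets are handled.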
from ssl import _create_unverified_context
from Cr0wl3r import _Crawler, reports
from logging import basicConfig
from typing import Union

basicConfig(level=1)

class CustomCr0wl3r(_Crawler):
    def handle_web_page(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New web page:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

    def handle_static(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New static:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

    def handle_resource(
        self, from_url: str, url: str, tag: str, attribute: str
    ) -> Union[bool, None]:

        print("[+] New assets:", url, "from", from_url, f"{tag}<{attribute}>")
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

cr0wl3r = CustomCr0wl3r(
    "https://github.com/mauricelambert",
    recursive=True,
    update=True,
    max_request=10,
    only_domain=False,
    headers={"User-Agent": "Chrome", "Cookie": "mycookie=abc"},
    robots=False,
    sitemap=False,
    crossdomain=False,
    context=_create_unverified_context(),
    interval=3.5,
    download_policy="do not download",
)
cr0wl3r.crawl()

with open("urls.txt", 'w') as report:
    [report.write(url + '\n') for url in reports]
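
If you prefer JSON output, the same reports iterable (treated above as an iterable of URL strings) can be dumped directly; a small sketch:

from json import dump

# Dump the crawled URLs as a JSON list; reports is iterated as URL strings,
# exactly as in the text report above.
with open("urls.json", 'w') as file:
    dump(list(reports), file, indent=4)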

Links

Help

~# python3 Cr0wl3r.py -h
usage: Cr0wl3r.py [-h] [--recursive] [--update] [--insecure] [--do-not-request-robots] [--do-not-request-sitemap] [--do-not-request-crossdomain]
                  [--not-only-domain] [--max-request MAX_REQUEST] [--cookie COOKIE] [--headers HEADERS [HEADERS ...]]
                  [--tags-counter TAGS_COUNTER [TAGS_COUNTER ...]] [--report-filename REPORT_FILENAME] [--loglevel {WARNING,CRITICAL,DEBUG,INFO,ERROR}]
                  [--logfile LOGFILE] [--interval-request INTERVAL_REQUEST] [--output-format {raw,colored,raw-only-url}]
                  [--download-all | --download-html | --download-static | --download-resources | --download-by-content-type DOWNLOAD_BY_CONTENT_TYPE | --download-requested | --do-not-download]
                  url

This script crawls web site and prints URLs.

positional arguments:
  url                   First URL to crawl.

options:
  -h, --help            show this help message and exit
  --recursive, -r       Crawl URLs recursively.
  --update, -u          Re-downloads and overwrites responses from requests made during previous crawls.
  --insecure, -i        Use insecure SSL (support selenium and urllib)
  --do-not-request-robots, --no-robots, -R
                        Don't search, request and parse robots.txt
  --do-not-request-sitemap, --no-sitemap, -S
                        Don't search, request and parse sitemap.xml
  --do-not-request-crossdomain, --no-crossdomain, -C
                        Don't search, request and parse crossdomain.xml
  --not-only-domain, -d
                        Do not request only the base URL domain (request all domains).
  --max-request MAX_REQUEST, -m MAX_REQUEST
                        Maximum request to perform.
  --cookie COOKIE, -c COOKIE
                        Add a cookie.
  --headers HEADERS [HEADERS ...], -H HEADERS [HEADERS ...]
                        Add headers.
  --tags-counter TAGS_COUNTER [TAGS_COUNTER ...], --tags TAGS_COUNTER [TAGS_COUNTER ...], -t TAGS_COUNTER [TAGS_COUNTER ...]
                        Add a tag counter for scoring.
  --report-filename REPORT_FILENAME, --report REPORT_FILENAME, -F REPORT_FILENAME
                        The JSON report filename.
  --loglevel {WARNING,CRITICAL,DEBUG,INFO,ERROR}, -L {WARNING,CRITICAL,DEBUG,INFO,ERROR}
                        WebSiteCloner logs level.
  --logfile LOGFILE, -l LOGFILE
                        WebCrawler logs file.
  --interval-request INTERVAL_REQUEST, --interval INTERVAL_REQUEST, -I INTERVAL_REQUEST
                        Interval between each requests by domain.
  --output-format {raw,colored,raw-only-url}, --format {raw,colored,raw-only-url}, -f {raw,colored,raw-only-url}
                        Output format.
  --download-all, --download, -D, -D0
                        Download (store) all responses
  --download-html, --dh, -D1
                        Download (store) only HTML responses
  --download-static, --ds, -D2
                        Download (store) only static files (HTML, CSS, JavaScript)
  --download-resources, --dr, -D3
                        Download (store) only resources files (images, documents, icon...)
  --download-by-content-type DOWNLOAD_BY_CONTENT_TYPE, --dct DOWNLOAD_BY_CONTENT_TYPE, -D4 DOWNLOAD_BY_CONTENT_TYPE
                        Download (store) only responses with Content-Type that contains this value
  --download-requested, --dR, -D5
                        Download all requests responses and try to requests only Web page
  --do-not-download, --dN, -D6
                        Try to requests only Web page and do not download
~# 
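
For example, to crawl recursively while storing only the HTML responses (combining the documented flags above):

Cr0wl3r --recursive --download-html https://github.com/mauricelambert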

Licence

Licensed under the GPL, version 3.
