
Creepy Crawlies: Web Crawling and Reconnaissance

Web crawling is a powerful technique for automatically gathering data from websites, far faster and more efficient than collecting it by hand. With the help of various crawling tools, you can focus on analyzing the extracted data rather than on collecting it manually. These tools vary in capability, and each serves different use cases, particularly for security and reconnaissance tasks.

Here are some of the most widely used web crawlers for reconnaissance and web scraping:

1. Burp Suite Spider

  • Purpose: A part of the Burp Suite web application testing platform, Spider is designed to crawl web applications. It maps out the website, finds hidden content, and identifies potential security vulnerabilities.
  • Use Case: Web application security testing and vulnerability discovery.

2. OWASP ZAP (Zed Attack Proxy)

  • Purpose: An open-source web application security scanner that includes a spidering component. It can crawl and test web applications for vulnerabilities in both automated and manual modes.
  • Use Case: Comprehensive web security testing.

3. Scrapy (Python Framework)

  • Purpose: A versatile Python framework for building custom web crawlers. Scrapy lets you handle complex crawling scenarios and extract structured data from websites. It’s highly flexible and ideal for tailored reconnaissance tasks (a minimal spider sketch follows this list).
  • Use Case: Custom web scraping and reconnaissance with high scalability.

4. Apache Nutch (Scalable Crawler)

  • Purpose: A highly extensible open-source web crawler written in Java. Nutch is designed to scale for large web crawls and can be customized for specific domain crawling.
  • Use Case: Large-scale web crawling and reconnaissance projects.
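
To make the idea of a custom crawler concrete before the ReconSpider walkthrough below, here is a minimal sketch of a Scrapy spider. The spider name, file name, and https://example.com target are placeholders, not part of any tool above:

import scrapy

# Minimal sketch of a custom Scrapy spider: it visits a start URL and
# yields every hyperlink it finds as a structured item.
class LinkSpider(scrapy.Spider):
    name = "link_spider"                   # placeholder spider name
    start_urls = ["https://example.com"]   # placeholder target URL

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}

Saved as link_spider.py, this could be run with scrapy runspider link_spider.py -o links.json, which writes the extracted links to a JSON file.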

Ethical Crawling Practices

Regardless of the tool used, it’s essential to follow ethical and responsible crawling practices:

  • Obtain permission: Before crawling a website, ensure you have the necessary permissions, especially for large or intrusive scans.
  • Avoid overloading servers: Be mindful of the number of requests made to the website so you don’t overwhelm its server (one way to enforce this in Scrapy is sketched below).
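
In Scrapy, for instance, both practices can be encoded directly in a project’s settings. A short sketch, with values chosen purely for illustration:

# Illustrative Scrapy settings (settings.py) for polite crawling;
# the specific values here are examples, not recommendations.
ROBOTSTXT_OBEY = True                # honor the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                 # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # back off automatically under server load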

Using Scrapy for Web Crawling

Scrapy is a powerful and customizable Python framework, ideal for reconnaissance tasks. Below, we will walk through setting up Scrapy and using it to crawl a target website like inlanefreight.com.

Installing Scrapy

To install Scrapy, use the following command:

pip3 install scrapy

This command installs Scrapy and all necessary dependencies.
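
To confirm the installation succeeded, you can print the installed version:

scrapy version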

Setting Up ReconSpider

Next, download a custom Scrapy spider designed for reconnaissance tasks (in this case, ReconSpider) and extract the files:

wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/1
unzip ReconSpider.zip

This will extract the necessary files to your working directory.

Running the Spider

Once you have extracted the files, run the spider using the following command:

python3 ReconSpider.py http://inlanefreight.com

Replace inlanefreight.com with any domain you are authorized to crawl. The spider will crawl the target website and collect valuable information.

Analyzing the Results

The results of the crawl will be saved in a JSON file called results.json. This file contains various types of data extracted from the target website, such as:

{
  "emails": [
    "lily.floid@inlanefreight.com",
    "cvs@inlanefreight.com"
  ],
  "links": [
    "https://www.themeansar.com",
    "https://www.inlanefreight.com/index.php/offices/"
  ],
  "external_files": [
    "https://www.inlanefreight.com/wp-content/uploads/2020/09/goals.pdf"
  ],
  "js_files": [
    "https://www.inlanefreight.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=3.3"
  ],
  "form_fields": [],
  "images": [
    "https://www.inlanefreight.com/wp-content/uploads/2021/03/AboutUs_01-1024x810.png"
  ],
  "videos": [],
  "audio": [],
  "comments": [
    "<!-- #masthead -->"
  ]
}

JSON Structure Breakdown

| JSON Key | Description |
| --- | --- |
| emails | Lists email addresses found on the domain. |
| links | Lists URLs of links found within the domain. |
| external_files | Lists URLs of external files such as PDFs. |
| js_files | Lists URLs of JavaScript files used by the website. |
| form_fields | Lists form fields found on the domain (empty in this example). |
| images | Lists URLs of images found on the domain. |
| videos | Lists URLs of videos found on the domain (empty in this example). |
| audio | Lists URLs of audio files found on the domain (empty in this example). |
| comments | Lists HTML comments found in the source code. |
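
Because the output is plain JSON, it is easy to post-process. Here is a minimal Python sketch, assuming the results.json produced by the run above (the choice of fields to summarize is just an example):

import json

# Minimal sketch: load ReconSpider's results.json and summarize a few fields.
with open("results.json") as f:
    results = json.load(f)

for email in results.get("emails", []):
    print(f"[email] {email}")

print(f"{len(results.get('links', []))} links and "
      f"{len(results.get('js_files', []))} JS files discovered")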