
bartunek.me

Mirroring a website with wget

Sometimes you want to quickly mirror a web page - perhaps to keep an offline copy, or to download only part of a site, such as its PDF documents. Another use case is creating a mirror as the base for phishing sites used during red team engagements or awareness training campaigns. In such cases, wget can quickly download a page - including all stylesheets, images, and other media files. wget is a powerful tool - I highly recommend going through its man page. Below I describe only a tiny portion of the features and options that are available.

To create a mirror of a web page, you can start with the following command:

wget --mirror --convert-links --adjust-extension --page-requisites \
--no-parent http://example.org

Options used:

  • --mirror or -m - download the site recursively, with settings suitable for mirroring
  • --convert-links or -k - convert all links to relative paths, suitable for viewing locally
  • --adjust-extension or -E - append .html extension for HTML file types with other extensions
  • --page-requisites or -p - download all files required to view a webpage - this includes images, stylesheets, sounds, etc.
  • --no-parent or -np - do not ascend to the parent directory, useful to restrict download to only part of the website.

The short version:

wget -mkEpnp http://example.org

Note: There are two p letters in the flags - the first p is --page-requisites, the second is part of -np (--no-parent).

Other useful options

If wget complains about the site certificate - for example, when you are mirroring an internal corporate page with a certificate issued by your company's Certificate Authority - you can force wget to skip certificate validation with the --no-check-certificate flag.
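Combined with the mirror flags from above, this could look as follows (intranet.example.com is a placeholder for your internal host):

```shell
# Mirror an internal page signed by a private CA; skip certificate validation
# (intranet.example.com is a placeholder hostname)
wget --no-check-certificate -mkEpnp https://intranet.example.com
```

A safer alternative, if you can, is to add your company CA to the system trust store (or point wget at it with --ca-certificate) instead of disabling validation entirely.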

By default, wget respects the robots.txt file, so it will not download files specified as disallowed in that file. To force wget to download all paths use: -e robots=off.

Sometimes you may want to skip cookies (tracking cookies, for example) - in such a case use: --no-cookies

Some websites serve different content depending on the User-Agent header, or block tools such as wget. In that case, set a different User-Agent header, for example: -U Mozilla, or use a full browser string, for example Firefox 68 on Linux: -U "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"

When you need to download assets (JavaScript, CSS, Images, etc.) from other hosts, use --span-hosts or -H.
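On its own, -H will follow links to any host, so it is usually paired with --domains (-D) to keep the crawl bounded. A sketch, assuming the assets live on a CDN subdomain (cdn.example.org is a placeholder):

```shell
# Follow page requisites onto other hosts, but only within the listed domains
# (cdn.example.org is a placeholder for the asset host)
wget -mkEp -H --domains=example.org,cdn.example.org http://example.org
```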

Only downloading specified file extensions

To download only specific types of files, for example .pdf files from a directory, use:

wget -e robots=off -r -l1 -A ".pdf" http://example.org

Options:

  • -e robots=off - disable robots.txt
  • -r - recursive download
  • -l1 - limit recursion to a maximum depth of 1
  • -A - comma-separated list of accepted file name suffixes or patterns
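Since -A accepts a comma-separated list, you can grab several document types in one run; the depth and extensions below are just an illustration:

```shell
# Download PDFs and slide decks up to two levels deep, ignoring robots.txt
wget -e robots=off -r -l2 -A "pdf,pptx" http://example.org
```

The opposite flag, -R (--reject), takes the same kind of list and skips matching files instead.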