Mirroring a website with wget
Sometimes you want to quickly mirror a web page - maybe to keep an offline
version, or to download part of a site, for example documents such as PDFs.
Another use case is creating a mirror as a base for phishing sites used during red team engagements or awareness training campaigns.
In such cases, we can quickly download a page with wget
- including all stylesheets, images, other media files, etc. wget
is a powerful tool - I highly recommend going through its man page. Below I will describe only a tiny portion of the available features and options.
To create a mirror of a web page, you can start with the following command:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Options used:
--mirror or -m - download the page recursively
--convert-links or -k - convert all links to relative paths, suitable for viewing locally
--adjust-extension or -E - append the .html extension to HTML files saved with other extensions
--page-requisites or -p - download all files required to view a webpage - this includes images, stylesheets, sounds, etc.
--no-parent or -np - do not ascend to the parent directory, useful to restrict the download to only part of the website
The short version:
wget -mkEpnp http://example.org
Note: there are two p's in the flags - the first p is --page-requisites, the second p is part of -np (--no-parent).
Other useful options
If wget complains about the site certificate - for example, when you are mirroring an internal corporate page with a cert issued by your company's Certificate Authority - force wget to skip certificate validation with the --no-check-certificate flag.
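As a sketch, mirroring such an internal page could look like this (the hostname is a placeholder - substitute your own):

```shell
# Skip certificate validation for a host signed by an internal CA.
# intranet.example.org is a placeholder, not a real site.
wget --mirror --convert-links --adjust-extension --page-requisites \
     --no-parent --no-check-certificate https://intranet.example.org/
```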
By default, wget respects the robots.txt file, so it will not download files that file marks as disallowed. To force wget to download all paths, use: -e robots=off.
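Combined with the mirror command from the beginning, that looks like this (a sketch - adjust to your target):

```shell
# Ignore robots.txt exclusions while mirroring.
wget -e robots=off --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent http://example.org
```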
Sometimes you may want to skip cookies (tracking cookies, for example) - in such a case, use: --no-cookies
Some websites serve different content depending on the user agent, or block
tools such as wget. In that case, set a different User-Agent header -
for example: -U Mozilla. You can also set a full version string, for example Firefox 68 on Linux: -U "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
When you need to download assets (JavaScript, CSS, images, etc.) from other hosts, use --span-hosts or -H.
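A sketch of spanning hosts while keeping the crawl contained; --domains is an additional flag (not covered above) that restricts which hosts may be followed, and cdn.example.org is an assumed asset host:

```shell
# Follow page requisites onto other hosts, but only the listed domains.
wget --mirror --page-requisites --no-parent --span-hosts \
     --domains=example.org,cdn.example.org http://example.org
```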
Only downloading specified file extensions
To download only a specific type of file - for example, .pdf files
from a directory - use:
wget -e robots=off -r -l1 -A ".pdf" http://example.org
Options:
-e robots=off - disable robots.txt processing
-r - recursive download
-l1 - maximum recursion depth of 1
-A - comma-separated list of accepted file name suffixes or patterns
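Since -A accepts a comma-separated list, you can grab several document types in one pass - a sketch, assuming the documents live under /docs/ on the target:

```shell
# Download PDF and DOCX files one level deep, ignoring robots.txt.
wget -e robots=off -r -l1 -A "pdf,docx" http://example.org/docs/
```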