Sometimes you want to quickly mirror a web page - maybe to keep an offline
version, or to download only part of a site - for example, documents such as PDFs.
Another use case is creating a mirror as a base for phishing sites used during red team engagements or awareness training campaigns.
In such cases, we can quickly download a page with
wget - including all stylesheets, images, and other media files.
wget is a powerful tool - I highly recommend going through its man page. Below I describe only a tiny portion of the features and options that are available.
To create a mirror of a web page, you can start with the following command:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
-m - download the page recursively
-k - convert all links to relative paths, suitable for viewing locally
-E - append the .html extension to HTML files saved with other extensions
-p - download all files required to view the webpage - this includes images, stylesheets, sounds, etc.
-np - do not ascend to the parent directory, useful to restrict the download to only part of the website
The short version:
wget -mkEpnp http://example.org
Note: there are two p's in the flags - the first p is the -p option (--page-requisites), while the second is part of -np (--no-parent).
Other useful options
If wget complains about the site certificate - for example, when you are mirroring an
internal corporate page with a cert issued by your company's Certificate Authority - force
wget to skip certificate validation with --no-check-certificate:
wget --no-check-certificate http://example.org
By default, wget respects the
robots.txt file, so it will not download files specified as disallowed in that file. To force
wget to download all paths regardless, use:
wget -e robots=off http://example.org
Sometimes you may want to skip cookies (tracking cookies, for example) - in such a case, use:
wget --no-cookies http://example.org
Some websites serve different content depending on the User-Agent, or block
tools such as
wget. In this case, set a different User-Agent header:
-U Mozilla, or set a full version string - for example, Firefox 68 on Linux:
-U "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
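Combining the options above, a mirroring run against a stubborn site might look like this - a sketch, with http://example.org and the User-Agent string as placeholders to adjust for your target:

```shell
# Mirror a site that blocks wget's default User-Agent:
# spoof a Firefox User-Agent and ignore robots.txt.
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
     -e robots=off \
     -U "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0" \
     http://example.org
```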
Downloading only specified file extensions
To download only a specific type of file - for example, PDFs:
wget -e robots=off -r -l1 -A ".pdf" http://example.org
-e robots=off - disable the robots.txt restriction
-r - recursive download
-l1 - maximum recursion depth of 1, so only links one level deep are followed
-A - comma-separated list of accepted file name suffixes or patterns
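Since -A accepts a comma-separated list, you can grab several document types in one run; a sketch, with the extension list chosen purely as an example:

```shell
# Download PDF, DOC and DOCX files linked from the page, one level deep,
# ignoring robots.txt. Adjust the extension list as needed.
wget -e robots=off -r -l1 -A "pdf,doc,docx" http://example.org
```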