It's hard to overstate the usefulness and robust feature set that most of the GNU tools have in their arsenal. Today, I'll make mention of one such tool, wget, and a novel use of the command.

As I go through my work, I find that sites we agree to take over have little structure. They generally were slapped together a long time ago, with little thought to organization, made with Dreamweaver or, Stallman forbid, FrontPage. I'm not judging; as long as something looks okay in the browser, a company can proclaim, "We're on the intarwebs!" However, tracking down all of their pages to be converted into a CMS, for instance, can be time-consuming. Not wanting to waste a client's money by searching through the source for links and images, then manually reconstructing the layout of the files, I fell back on my trusty GNU tool, wget. (I also did not have FTP access, and I knew there were dead pages that I didn't want to resurrect. Using wget in this case helped me retrieve only the pages that were still linked to from the main page.)

Here's a variation of the wget incantation I used:
wget -r -A '*.htm*, *.jpg, *.png, *.gif' -l 3 http://www.example-site.com
What's it all mean?

-r: wget should retrieve recursively.
-A: takes a comma-separated list of patterns for the files to accept (use -R to reject). In this case, we want all htm and html files, plus the most common image formats.
-l: denotes how far down the rabbit hole to venture. I started with 1, so only links from the first page were parsed and followed. I then tried 2, which also follows the links found on those pages, and compared the resulting structure. Trying 3, I found no difference between 3's results and 2's, meaning all links had been followed and accounted for.

The result: a directory called www.example-site.com that contains the files in the same layout as on the server. Now I knew which pages needed converting and which images to add to the new site.

A side note: a handy way to see the layout of your newly downloaded directory is to use the tree command.
Running tree on the downloaded directory will display something like this:
www.example-site.com/
|-- about.html
|-- calendar.html
|-- committees.html
|-- contact.html
|-- otherdir
|   `-- index.html
|-- images
|   |-- header.gif
|   |-- logo.gif
|   `-- spacer.gif
|-- index.html
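
Comparing the results of two depth levels by eye works for a small site, but the check can also be scripted. The following is only a rough sketch of how I might do it, not what I actually ran: the function names list_files and same_layout and the directory names depth2 and depth3 are made up for illustration, and it assumes each depth was mirrored into its own directory using wget's -P (directory prefix) option.

```shell
#!/bin/sh
# Hypothetical helper: print every file under directory $1,
# relative to that directory, in a stable sorted order.
list_files() {
    (cd "$1" && find . -type f | LC_ALL=C sort)
}

# Hypothetical helper: succeed (exit status 0) only if the two
# directories contain exactly the same set of relative file paths.
# It compares names only, not file contents.
same_layout() {
    a=$(list_files "$1") || return 2
    b=$(list_files "$2") || return 2
    [ "$a" = "$b" ]
}

# Example use (needs network access, so it is left commented out;
# depth2 and depth3 are assumed directory names):
#   wget -r -l 2 -P depth2 -A '*.htm*,*.jpg,*.png,*.gif' http://www.example-site.com
#   wget -r -l 3 -P depth3 -A '*.htm*,*.jpg,*.png,*.gif' http://www.example-site.com
#   same_layout depth2 depth3 && echo "no new files at depth 3"
```

If same_layout succeeds, going one level deeper turned up no new files, which is the same conclusion I reached by comparing the directories manually.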