More Wget

It's hard to overstate the usefulness and robust feature set that most of the GNU tools have in their arsenal. Today, I'll mention one such tool, wget, and a novel use of the command.

As I go through my work, I find that sites we agree to take over have little structure. They were generally slapped together a long time ago, with little thought given to organization, made with Dreamweaver or, Stallman forbid, FrontPage. I'm not judging; as long as something looks okay in the browser, a company can proclaim, "We're on the intarwebs!" However, tracking down all of their pages to be converted into a CMS, for instance, can be time-consuming.

Not wanting to waste a client's money by searching through the source for links and images, then manually reconstructing the layout of the files, I fell back on my trusty GNU tool, wget. (I also did not have FTP access, and I knew there were dead pages that I didn't want to resurrect. Using wget in this case helped me retrieve only the pages that were still reachable through links from the main page.) Here's a variation of the incantation of wget I used:

    wget -r -A '*.htm*, *.jpg, *.png, *.gif' -l 3

What's it all mean?

-r: wget should retrieve recursively.

-A: takes a comma-separated list of patterns to match files to accept (use -R to reject). In this case, we want all htm and html files, plus the most common image formats.

-l: denotes how far down the rabbit hole to venture. I started with 1, so only links from the first page were parsed and followed. I then tried 2, which also follows the links found on those pages, and compared the resulting structure. Trying 3, I found no difference between 3's results and 2's results, meaning all links had been followed and accounted for.

The result: a directory, named after the site's hostname, that contains the files in their layout on the server. Now I knew which pages needed converting and which images to add to the new site.

A side note: a handy way to see the layout of your newly downloaded directory is to use the tree command.


Running tree on the downloaded directory will display something like this:

    |-- about.html
    |-- calendar.html
    |-- committees.html
    |-- contact.html
    |-- otherdir
    |   `-- index.html
    |-- images
    |   |-- header.gif
    |   |-- logo.gif
    |   `-- spacer.gif
    |-- index.html
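
The depth experiment described earlier (try -l 1, then 2, then 3, and stop when nothing new shows up) could also be scripted rather than eyeballed. What follows is only a rough sketch, not the exact procedure I used: the function names, the crawl-N directory names, and the SITE_URL placeholder are all made up for illustration. It assumes wget's -P (save under a given directory), -nH (skip the hostname directory), and -q (quiet) options, and it treats two crawls as identical when they contain the same set of file paths.

```shell
#!/bin/sh
# Sketch: crawl a site at increasing depths and stop once a deeper crawl
# finds no new files. All names here (list_files, same_layout, crawl,
# find_stable_depth, crawl-N, SITE_URL) are hypothetical, for illustration.

# Print the relative path of every file under directory $1, sorted.
list_files() {
    (cd "$1" && find . -type f | sort)
}

# Succeed (exit 0) when directories $1 and $2 hold the same set of file paths.
same_layout() {
    [ "$(list_files "$1")" = "$(list_files "$2")" ]
}

# Crawl URL $1 to depth $2, saving the files under directory $3.
# -nH drops the hostname directory so the layouts compare cleanly.
crawl() {
    mkdir -p "$3"
    wget -q -r -l "$2" -nH -P "$3" -A '*.htm*,*.jpg,*.png,*.gif' "$1"
}

# Increase the depth until two consecutive crawls produce the same layout,
# then print the smaller of the two depths. $2 caps the depth (default 5).
find_stable_depth() {
    url=$1
    max=${2:-5}
    depth=1
    crawl "$url" "$depth" "crawl-$depth"
    while [ "$depth" -lt "$max" ]; do
        next=$((depth + 1))
        crawl "$url" "$next" "crawl-$next"
        if same_layout "crawl-$depth" "crawl-$next"; then
            echo "$depth"
            return 0
        fi
        depth=$next
    done
    return 1
}

# Example call (SITE_URL is a placeholder you would set yourself):
#   find_stable_depth "$SITE_URL" 5
```

Each depth is downloaded from scratch into its own directory, which is wasteful but keeps the comparison honest. One caveat: if wget cannot reach the site at all, every crawl directory stays empty and the sketch will happily report that depth 1 is enough, so glance at the first directory before trusting the answer.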