Archive for the ‘Geekdom’ Category

Connect to remote erlang shell while inside emacs

Thursday, January 7th, 2010

While developing my top secret project, I have been getting into the fun stuff in Erlang and Emacs. Connecting to a running instance of my app from a remote shell wasn’t straightforward to me at first, so below is my documented way of connecting, as well as dropping into the Erlang JCL from within an Emacs erlang shell.

  1. Start yaws: yaws --daemon -sname appname --conf /path/to/yaws.conf
  2. Start emacs, and from within emacs start an Erlang shell with C-c C-z (assuming you have distel configured).
  3. From the Emacs erlang shell, get into Erlang’s JCL by typing C-q C-g and pressing enter. A ^G will be printed at the prompt, but won’t be evaluated until you press enter. You should see the familiar JCL prompt “User switch command -->”.
  4. Type ‘j’ to see current jobs you have running locally, which is probably just the current shell (1 {shell,start,[init]}).
  5. Type ‘r appname@compy’ to connect to the remote node identified by appname (from the -sname parameter) on the computer compy (usually whatever hostname returns).
  6. Type ‘j’ to see current jobs, which should list your current shell as “1 {shell,start,[init]}”, and a second shell “2* {appname@compy,shell,start,[]}”.
  7. Type ‘c 2’ to connect to the remote shell. You can now run commands in the node’s shell. You may have to press enter again to bring up a shell prompt.
james@compy 14:33:34 ~/dev/erlang/app
> yaws --daemon -sname app --conf config/yaws.conf
 
james@compy 14:34:00 ~/dev/erlang/app
> emacs
Eshell V5.7.4  (abort with ^G)
1> ^G
 
User switch command
 --> j
   1* {shell,start,[init]}
 --> r app@compy
 --> j
   1  {shell,start,[init]}
   2* {app@compy,shell,start,[]}
 --> c 2
 
1>

PHP’s json_last_error

Monday, December 28th, 2009

A quick note that I hope Google picks up concerning php’s json_last_error function. I was trying to debug a json string I was decoding with json_decode, but was getting NULL. When I tried to use the json_last_error(), a fatal undefined function error was returned. The reason: json_last_error doesn’t exist in php versions < 5.3. Ah, version numbers! So, check your php version if the function is undefined.

Simple, yet a detail easily overlooked.

Erlang, Primes, and Sieves Again (Lazy edition)

Thursday, December 17th, 2009

I was working on the Project Euler problems, specifically problem 7, and decided to implement a prime number generator based on my previous prime number implementation, which was itself based on the Melissa O’Neill paper.

The problem says that, given that the 6th prime number is 13, find the 10,001st prime. Since we don’t have a known upper bound to use the primes:queue/1 function, I created a lazy sieve that would return the next prime, its position, and an iterator. The code:

-export([lazy_sieve/0]).
 
lazy_sieve() ->
    Table = insert_prime(2, skew_kv:empty()),
    [ { 2, 1 } | fun() -> lazy_sieve({ 3, 2 }, Table) end ].
 
lazy_sieve({ X, Pos }, Table) ->
    {NextComposite, _Value} = skew_kv:min(Table),
    case NextComposite =< X of
        true -> lazy_sieve({X+1, Pos}, adjust(Table, X));
        _Else -> [ {X, Pos} | fun() -> lazy_sieve({X+1, Pos+1}, insert_prime(X,Table)) end]
    end.

This can be used thusly:

1>[{Prime1, Pos1} | Next1] = primes:lazy_sieve().
% Prime1 = 2, Pos1 = 1, Next1 = fun()
2>[{Prime2, Pos2} | Next2] = Next1().
% Prime2 = 3, Pos2 = 2, Next2 = fun()
3>[{Prime3, Pos3} | Next3] = Next2().
% Prime3 = 5, Pos3 = 3, Next3 = fun()

This makes defining a function to find the nth prime trivial:

-export([nth/1]).
 
nth(N) ->
    nth(lazy_sieve(), N).
 
nth([ {Prime, N} | _Next], N) -> Prime;
nth([ {_Prime, _Pos} | Next], N) ->
    nth(Next(), N).

Lazy evaluation twice, in the main functionality and in the value of a key in the queue. And the performance of nth is pretty good too (time is the left number, in microseconds, and the 10,001st prime is the right number):

9> timer:tc(primes, nth, [10001]).
{1635839,104743} % 1.6 seconds
10> timer:tc(primes, queue, [104743]).
{1788997, % 1.7 seconds to generate the full list of primes
 [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
  79,83,89,97,101,103|...]}

So it takes 1.6 seconds to find the 10,001st prime and 1.7 seconds to generate the list of primes up to the 10,001st prime.

Red-black Trees in Erlang

Tuesday, December 1st, 2009

Working through Chris Okasaki’s “Purely Functional Data Structures”, I couldn’t find an Erlang version of the red-black tree implementation Chris shows on page 28.

The code, also available as a pastie and download (with more comments of mine) .erl and .hrl:

%% redblack.hrl
-record(node, {color, data, left, right}).
 
%% redblack.erl
%% Adapted from Okasaki "Purely Functional Data Structures" p. 28
member(_X, undefined) ->
    false;
member(X, #node{left=A, data=Y}) when X < Y ->
    member(X, A);
member(X, #node{data=Y, right=B}) when X > Y ->
    member(X, B);
member(_X, _S) ->
    true.
 
insert(X, undefined) ->
    #node{color=black, data=X};
insert(X, S) ->
    % to do recursive anonymous functions, pass the created fun in as 2nd parameter
    Ins = fun(undefined, _F) ->
                  #node{color=red, data=X}; % insert the new data as a red node
             (#node{color=Color, left=A, data=Y, right=B}, F) when X < Y->
                  balance(Color, F(A, F), Y, B);
             (#node{color=Color, left=A, data=Y, right=B}, F) when X > Y->
                  balance(Color, A, Y, F(B, F));
             (Node, _F) ->
                  Node
          end,
    #node{left=A, data=Y, right=B} = Ins(S, Ins),
    #node{color=black, left=A, data=Y, right=B}.
 
%% detect Black->Red->Red Patterns and balance the situation
balance(black, #node{color=red, left=#node{color=red, left=A, data=X, right=B}, data=Y, right=C}, Z, D) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, #node{color=red, left=A, data=X, right=#node{color=red, left=B, data=Y, right=C}}, Z, D) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=#node{color=red, left=B, data=Y, right=C}, data=Z, right=D}) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=B, data=Y, right=#node{color=red, left=C, data=Z, right=D}}) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(Color, Left, Data, Right) ->
    #node{color=Color, left=Left, data=Data, right=Right}.

So, I’m going to go through this, mainly to prove to myself that I kinda get it, and because Ben will complain I didn’t dumb it down enough for him.

Line 2 defines the node structure in an Erlang header file. Comparable to a struct in C.

Member

Lines 6-13 define the member function. Each member() -> … is a function clause, essentially a case statement on steroids. The first clause says if the tree is undefined, then whatever X is, it is not a member of undefined, so return false.
The second clause unpacks the current node ( #node{left=A, data=Y} ) and assigns the values of left and data to A and Y, respectively (I have tried to maintain Chris’s variable names as much as possible, though I think Erlang’s record syntax complements the Standard ML syntax found in the book). The clause then compares Y, the node’s data, to the query X, and if X < Y, checks the left branch (the variable A) of the node.
The third clause is nearly identical, except it handles when X > Y, and checks the right branch (B) for X’s existence.
The fourth and final clause handles a match (because when X is not greater than or less than Y, it must be equal), returning true. Simple enough to follow and grok.

Insertion

When I read several different articles online concerning red-black trees, the insertion routine sounded quite complex to me. I was quite pleased when I read the algorithm in the book, as it crystallized my thoughts and allowed me to grok the action.

The first clause of the insert function handles the empty (or undefined) tree, creating and returning a black node (since the root of the tree must be black).
The second clause takes the new data X and a tree S and creates a new function, stored in the var Ins. Now, calling an anonymous function recursively is tricky, and I will explain why this part varies slightly from the text. Since there’s no way to refer to the function (as it isn’t named), we have to pass the anonymous function as a second parameter to the anonymous function, which is what happens on line 28. So when you see F(A, F), mentally think Ins(A).
The first parameter to Ins() is our current node in the tree. If it is undefined, we insert our new data as a red node. Why? That is discussed at length elsewhere, but it boils down to being easier to fix a red violation than a black violation (look up the constraints a red-black tree has on it, and the violations will make more sense).
The second clause unpacks the node and makes our less than/greater than comparison like a normal BST. What’s different is that we call the appropriate recursive insert call (remember F(A, F)? This inserts X somewhere into the left side of Node, and F(B, F) inserts X somewhere into the right side), and pass the result to the balance function.
The final clause handles a match between X and Y and returns the Node and its branches unmodified.

Balance

Why do we need a balance function? In case our data being inserted into the tree is more ordered than we may have expected. A vanilla BST with inputs [3,2,1] will be a glorified list. We force the situation by inserting a red node in the tree and causing a situation where the parent of the newly inserted red node is also a red node. This violates the constraint that a red node’s children must be black nodes. There are four possible combinations of this violation:

  1. (B)-Left->(Red)-Left->(Red)
  2. (B)-Left->(Red)-Right->(Red)
  3. (B)-Right->(Red)-Left->(Red)
  4. (B)-Right->(Red)-Right->(Red)

Each of those possibilities maps to the first four clauses of the balance function. The fifth clause passes a node back unchanged. So let’s look at the first clause. This clause is activated when a black node’s left child is red, and that red node’s left child is also red.

      B3
     /   \
   R2     *
  /   \
R1     *

So we unpack the three nodes and return the three nodes rotated to remove the red violation.

Variable   Meaning            Value
A          R1’s left node     undefined
X          R1’s data          1
B          R1’s right node    undefined
Y          R2’s data          2
C          R2’s right node    undefined
Z          B3’s data          3
D          B3’s right node    undefined

Unpacking from the pattern match, we then pack up a new tree structure according to the clause’s definition:

#node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};

This essentially replaces the black node having a red child and red grandchild with a red node that has two black nodes for children. The next illustration shows the result of the balance call:

      R2
     /   \
   B1    B3

Now, the bubbling up of the red node may cause a red violation with the parent and grandparent of R2, which will be handled as the recursion of Ins() unwinds.

When the initial Ins() ends (line 28), the resulting node is the root of the new tree, which we then unpack into the data, left, and right fields. We then force the root node to be black by creating the root node with the unpacked parts, forcing color=black.

So there you have it. Simplified, with a lot of things glossed over/ignored, but this is my port of the red-black code listing on page 28. Looking forward to progressing through the book more and increasing my data structure- and Erlang-fu.

Further fun: Read Chris’s reflections 10 years after publishing his book.

Symfony/Propel Memory Issues

Tuesday, October 6th, 2009

I have a bulk import process that needs running nightly. Currently there are around 4,200 “rows” to process, which actually can encompass many tables so row is not entirely appropriate. The problem is that the script poops out after ~200 “rows” with memory limit errors. While increasing the memory limit is do-able, I am not interested in that as a solution currently.

First, I recorded some numbers to benchmark where the script was in memory consumption. The first number is the “row”, the peak is how much memory the script has allocated, and the mem is the delta in memory allocation per row.

0 peak: 41,103,408 mem: 29,884,416
1 peak: 42,440,264 mem: 1,310,720
2 peak: 43,613,848 mem: 1,310,720
3 peak: 43,893,960 mem: 262,144
4 peak: 44,223,040 mem: 262,144
5 peak: 44,896,296 mem: 786,432
6 peak: 45,671,560 mem: 786,432
7 peak: 45,865,952 mem: 0
8 peak: 46,418,272 mem: 786,432
9 peak: 47,917,888 mem: 1,310,720
10 peak: 48,566,312 mem: 786,432

After the initialization phase, memory allocation increases fairly steadily and it is not long until the 128M memory limit is reached. This is unacceptable as I know some “rows” should be much closer to 0 as nothing is imported on the majority of rows.

My first solution was to disable logging:

sfConfig::set('sf_logging_enabled', FALSE);

The initial memory allocation was decreased, but the running deltas remained higher than expected.

Second, I inserted a ton of unset() calls in the various functions. This dropped my deltas a little:

0 peak: 30,611,472 mem: 20,971,520
1 peak: 31,969,760 mem: 1,310,720
2 peak: 32,180,112 mem: 262,144
3 peak: 32,333,216 mem: 262,144
4 peak: 32,525,144 mem: 262,144
5 peak: 32,656,616 mem: 0
6 peak: 32,863,736 mem: 262,144
7 peak: 33,103,264 mem: 262,144
8 peak: 33,455,544 mem: 262,144
9 peak: 33,754,288 mem: 262,144
10 peak: 33,984,976 mem: 262,144

But allocation still killed the import before it could complete. Browsing through various sites, I discovered Propel had a hard time cleaning up circular references, which meant PHP couldn’t garbage-collect that memory. However, to combat this, Propel 1.3 offers a static method disableInstancePooling that allowed me to override Propel’s desire to keep instances around.

Adding:

Propel::disableInstancePooling();

to the beginning of the import gave me these results:

0 peak: 26,569,632 mem: 17,301,504
1 peak: 28,582,152 mem: 2,097,152
2 peak: 30,455,352 mem: 1,835,008
3 peak: 30,536,176 mem: 0
4 peak: 30,536,176 mem: 0
5 peak: 31,517,088 mem: 1,048,576
6 peak: 31,534,152 mem: 0
7 peak: 31,552,120 mem: 0
8 peak: 31,589,632 mem: 0
9 peak: 31,695,504 mem: 0
10 peak: 31,695,504 mem: 0

Now new memory was allocated only when the import was actually doing something of significance. In fact, watching the import proceed with the deltas displayed, I could observe the memory decreasing at times, prolonging the life of the script by more than an order of magnitude. Whereas before the script was processing ~200 “rows”, it now processes the whole batch (4,237 “rows” currently) in one go.

As the importable “rows” increase, I know I won’t be butting up against memory limits for some time.

iwlagn: Microcode SW error detected. Restarting 0x82000000.

Saturday, August 22nd, 2009

If you have gotten the above error in your syslog, you probably are not able to connect to wireless APs that require WPA passphrases, but can connect to ones that require WEP or no security. While searching for solutions, there were some that offered solutions for controlling the power of the card (for temperature-related issues), but those did not address my issue. Fortunately, the solution was fairly simple for my problem.

I have Ubuntu Jaunty, with the 2.6.28-15 kernel. The resolution came when I looked in synaptic at the installed linux-restricted-modules and found I had two instances from older kernels still installed. Purging those completely from the system, and retrying to connect to my WPA2-enabled AP succeeded.

Give it a shot.

Converting a site to use Cachefly for static content

Tuesday, February 24th, 2009

I recently needed to move static content from a live site to a cachefly account. Rather than go through the directories, looking for the resources (js/css/images) I needed to ftp, I thought, “Man, this sure sounds like it could be automated”.

The first step was to collect a list of items that needed ftping to cachefly. I know what you’re saying, “Use find!” In case Ben is reading this, find will “search for files in a directory hierarchy” (that’s from find’s man page, Ben). I wanted to separate the resources out so I ran three different invocations.

For javascripts and css, the invocation was nearly identical:

find . -name '*.js' > js.libs
find . -name '*.css' > css.libs

Images were a little trickier. Most of the images are static content, but some are user-generated, likely to change or be removed. These do not go up to the CDN (at least for now). The user-generated content is located under one directory (call it /images/usergen), so we simply need to exclude it from find’s search.

find . -path '*images/usergen*' -prune -o \( -iname '*.gif' -o -iname '*.jpg' -o -iname '*.png' \) -print > image.files

The important parts:

  • .

    Search within the current directory (the root of the project).

  • -path '*images/usergen*' -prune -o

    Skip any directory whose path contains images/usergen, without descending into it; everything else falls through to the tests after -o.

  • \( -iname '*.gif' -o -iname '*.jpg' -o -iname '*.png' \) -print

    Match, case-insensitive (-iname instead of -name), any files ending in gif, jpg, or png. The escaped parentheses group the three tests so that the explicit -print applies to all of them and not to the pruned directory.
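
A quick way to sanity-check this kind of prune expression is to run it against a throwaway tree first. The sketch below assumes a made-up layout (the directory and file names are invented for the demo) and uses the grouped form with an explicit -print, so the pruned directory itself stays out of the results:

```shell
# Build a throwaway project tree (all names here are made up for the demo).
demo=$(mktemp -d)
mkdir -p "$demo/images/header" "$demo/images/usergen" "$demo/js"
touch "$demo/images/header/header_left.png" \
      "$demo/images/header/logo.GIF" \
      "$demo/images/usergen/avatar.jpg" \
      "$demo/js/site.js"

# Prune the user-generated directory, then print only image files.
# The parentheses group the -iname tests so -print applies to all three.
image_files=$(cd "$demo" && find . -path '*images/usergen*' -prune -o \
    \( -iname '*.gif' -o -iname '*.jpg' -o -iname '*.png' \) -print | sort)

printf '%s\n' "$image_files"
rm -rf "$demo"
```

Only the two files under images/header should be listed; the avatar under images/usergen and the javascript file are left out.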

We are then left with three files, each line of which contains the path, relative to the project root, of each resource I want to upload. I created a simple php script to upload the images, maintaining the pathing, to cachefly. So an image with relative path /images/header/header_left.png would now be accessible at instance.cachefly.com/images/header/header_left.png.

So the images are now up on the CDN. Now we need our code to point there as well. Fortunately, most of the resources were prepended with a domain (stored in the global $live_site). So the src attribute of an image, for instance, would be src="<?= $live_site ?>/images/header/header_left.png". Creating a $cachefly_site global, we now only need to find lines in our code that have a basic layout of “stuff……$live_site…stuff…..png” where stuff is (.*) in regex land. So we utilize two commands, find and egrep. Find locates files we want and egrep searches the found files for a regex that would locate the resources in the code.

So first, we build the regex. Two elements need to be present on a line: live_site and a resource extension (js/css/jpg/png/gif). Lines that point at the “images/usergen” path should be left alone, as this is user generated content, but egrep has no way to say “not this string” inside a pattern: a bracket expression such as [^images/usergen] only matches a single character that is not one of those letters. That exclusion has to be applied per line, so for picking out candidate files the regex only covers the two required elements:

'live_site.+\.(png|gif|jpg|css|js)'

This is the arg for egrep (the -l switch means print the file names that have a match, rather than the lines of a file that match):

egrep -lr 'live_site.+\.(png|gif|jpg|css|js)'

Now we need to tell egrep what files to search using find:

find . -name "*.php" -exec egrep -lr 'live_site.+\.(png|gif|jpg|css|js)' {} \;

We then store this list of files into a shell variable:

export FILES=`find . -name "*.php" -exec egrep -lr 'live_site.+\.(png|gif|jpg|css|js)' {} \;`
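
If the file list itself should ignore lines that point at user-generated content, one option is to filter the matching lines with grep -v before reducing them to file names. This is only a sketch under assumptions: the three PHP files and their contents are invented for the demo, and it relies on grep's -r and -H switches, which GNU and BSD grep both provide.

```shell
# Throwaway PHP files; names and contents are invented for the demo.
demo=$(mktemp -d)
cat > "$demo/header.php" <<'EOF'
<img src="<?= $live_site ?>/images/header/header_left.png">
EOF
cat > "$demo/profile.php" <<'EOF'
<img src="<?= $live_site ?>/images/usergen/avatar.jpg">
EOF
cat > "$demo/about.php" <<'EOF'
<a href="<?= $live_site ?>/contact.php">Contact</a>
EOF

# 1. print file:line for every candidate line
# 2. drop lines that point at user-generated content
# 3. keep only the file name and de-duplicate
files=$(cd "$demo" && grep -rHE 'live_site.+\.(png|gif|jpg|css|js)' . \
    | grep -v 'images/usergen' \
    | cut -d: -f1 \
    | sort -u)

printf '%s\n' "$files"
rm -rf "$demo"
```

Here only header.php survives: profile.php is dropped by the grep -v stage, and about.php never matches because it has no resource extension after live_site.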

Now that we have the files we need, we can search and replace $live_site with $cachefly_site for resources. The go-to command for search and replace is sed. The sed command will look generically like this:

sed -i 's/search/replace/g' FILE

We actually have two issues though. Due to the nature of the code, we have to account for the $live_site variable being passed in via the global keyword. So not only are we searching for resource files, but we also have to add $cachefly_site to the global lines to make sure $cachefly_site is defined within the function where output is generated.

Searching and replacing resource files is pretty easy:

sed -i -E '/images\/usergen/!{/live_site.+\.(js|css|gif|png|jpg)/s/live_site/cachefly_site/g;}' $FILES

$FILES, of course, came from our find/egrep call earlier. There is one catch to the regex used here. It is actually of a different generic form than mentioned above:

sed -i '/contains/s/search/replace/g' FILE

With this format, we put a condition on whether to replace text, meaning the regex in the “contains” portion must be matched before the search and replace is performed on that line.
So our sed above says if the line contains live_site, followed by anything, followed by a dot and one of the listed resource extensions (| means OR; the -E switch turns on extended regexes so that + and | work without backslashes), then replace live_site with cachefly_site. The /images\/usergen/!{ … } wrapper around it skips any line that mentions the user-generated directory, since the regex alone can’t express “does not contain”. I left off the $ since it’s common to both variables.
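
Before letting sed -i loose on real files, a dry run of a guarded replacement on a sample file is cheap insurance. The sketch below is an assumption-laden example rather than the exact command from the conversion: the file name and contents are invented, -i is left off so the result goes to stdout, and it uses extended regexes (-E) plus a /images\/usergen/! guard to leave user-generated lines alone.

```shell
# Sample template; file name and contents are invented for the demo.
demo=$(mktemp -d)
cat > "$demo/page.php" <<'EOF'
<img src="<?= $live_site ?>/images/header/header_left.png">
<img src="<?= $live_site ?>/images/usergen/avatar.png">
<a href="<?= $live_site ?>/about.php">About</a>
EOF

# Skip lines mentioning images/usergen; on the remaining lines where
# live_site is followed by a resource extension, swap in cachefly_site.
rewritten=$(sed -E '/images\/usergen/!{/live_site.+\.(js|css|gif|png|jpg)/s/live_site/cachefly_site/g;}' "$demo/page.php")

printf '%s\n' "$rewritten"
rm -rf "$demo"
```

Only the header image line should change; the usergen image and the plain link keep $live_site.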

Running the sed command replaces everything nicely, but when we reload the page, we see notices about $live_site being undefined and resources being pulled from the host and not cachefly. So we need to handle the global importing.

This one is a little trickier because we are not really replacing live_site with cachefly_site, but appending it to the list of imported globals. So a line like

global $foo, $bar, $live_site, $baz;

becomes

global $foo, $bar, $live_site, $cachefly_site, $baz;

The other trick is that the global line should not already contain $cachefly_site. We don’t need that redundancy. So, without further ado, the sed:

sed -i '/cachefly_site/!{/global.*live_site/s/live_site/live_site, $cachefly_site/g;}' $FILES

The “contains” portion matches the keyword global, followed by stuff, followed by live_site. A regex can’t say “cachefly_site appears exactly 0 times” (a \{0\} repeat just matches the empty string, which is always there), so the /cachefly_site/! in front does that job: any line that already mentions cachefly_site is skipped. This ensures we only replace live_site when cachefly_site is not in the line already.
The “search” portion is easy; search for live_site. The replace portion replaces live_site with live_site, $cachefly_site. Whatever followed live_site, be it a comma or a semi-colon, is left in place, so we don’t get syntax errors.
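
One way to check that the global rewrite is safe to run more than once is to apply it twice to a sample and compare the results. This is a sketch with invented file contents; the sed script is held in a hypothetical shell variable called add_global, and -i is left off so nothing is modified in place.

```shell
# Sample PHP lines, invented for the demo.
demo=$(mktemp -d)
cat > "$demo/functions.php" <<'EOF'
global $foo, $bar, $live_site, $baz;
global $live_site, $cachefly_site;
$url = $live_site . '/about.php';
EOF

# Lines already mentioning cachefly_site are skipped; on the remaining
# "global ... live_site" lines, $cachefly_site is added after live_site.
add_global='/cachefly_site/!{/global.*live_site/s/live_site/live_site, $cachefly_site/;}'

once=$(sed "$add_global" "$demo/functions.php")
twice=$(printf '%s\n' "$once" | sed "$add_global")

printf '%s\n' "$once"
rm -rf "$demo"
```

The first line gains $cachefly_site, the second is left alone because it already has it, and the third is not a global line. Since the second pass finds nothing left to change, $once and $twice come out identical.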

And that is basically how I converted a site to use cachefly for static content.

Writing Excel Spreadsheets Using PHP

Thursday, July 24th, 2008

When using the Spreadsheet_Excel_Writer library from the PEAR repository, I came across an issue I didn’t see handled in the docs (as of this writing, I am using Spreadsheet_Excel_Writer 0.9.1 beta).

My application creates spreadsheets that contain order information. Part of each row is a list of up to 20 ISBNs and the quantities desired of each. The issue came in how to handle ISBNs that had a leading zero. When I first looked through the PEAR docs for the library, a Worksheet method, writeString, looked to be the solution. However, the end result was that while the leading zero was maintained, the cell’s format was still numeric. This resulted in the application receiving the generated xls to then drop the zero, resulting in an invalid ISBN.

Looking over the internals of the Worksheet::writeString method didn’t reveal an undocumented feature that would ensure a cell was read as text, regardless of its contents. I next looked at the Format::setNumFormat method as I knew it contained ways to format the number as currency, timestamp, fractions, etc. You could then pass this Format object as the optional fourth parameter to the Worksheet::write method.

Contained in the Format::setNumFormat docs was a link to the OpenOffice.org documentation of the Excel File Format (found here, pdf). Interested in how exactly the file was structured, I read on. What I learned that was directly applicable is that each cell contains a pointer to a format definition, or XF record, and it was this XF record where formatting was stored. From the doc, section 4.6:

All cell formatting attributes are stored in XF records…The cell records themselves contain an index into the XF record list. This way of storing cell formatting saves memory and decreases the file size.

So if two cells use the same formatting, like the ISBN columns would, each cell would contain a pointer to the XF record that would tell Excel the cell was text. Section 4.6.1 lists the 6 groups of formatting attributes, the first of which is number format, which is then an index to a FORMAT record. Okay, we’re on to something here. Further in the pdf, in section 5.49, we see the definition of the FORMAT record. Lo and behold, the table of formats from the setNumFormat page is listed in the pdf, but we see that the PEAR listing is incomplete. Scanning the complete table in the pdf, we see index 49, type Text, format string ‘@’. Bingo.

Our code for formatting numeric data as text in a string goes a little something like this (modified from the PEAR example code):

$workbook = new Spreadsheet_Excel_Writer();
$worksheet =& $workbook->addWorksheet();
 
// Format the cell as text ('@' is the Text number format) so leading zeros are kept
$text_format =& $workbook->addFormat();
$text_format->setNumFormat('@');
 
$worksheet->write(0, 0, "Without formatting");
$worksheet->write(0, 1, '0123'); // cell contains 123
 
$worksheet->write(1, 0, "With formatting");
$worksheet->write(1, 1, '0123', $text_format); // cell contains 0123

To verify, generate the xls and open it. Right click the cells to modify the format of the cell, and see that the first cell is formatted as a general number, and the second cell is formatted as text.

The meta-moral is to read the docs and follow references to get at the source material. Had I not opened the pdf, it may have been a few more time units finding the information on Google. Plus, I learned a lot more about an important file format. I can sleep easy knowing I’m that much more knowledgeable.

Announcements

Monday, April 28th, 2008

Thought I’d go ahead and announce, mainly to myself, that I will be working through SICP. The rub…doing it in Javascript. Seems as though most other languages are covered (I know Erlang is taken) and since I am doing an increasingly large amount of Javascript, coupled with the eventual prevalence of server-side Javascript, I figured it best to start getting intimate. What I like about this task is that since SICP has been so widely covered on the web, I have many resources to aid in better understanding the material (and it is some thick material). Anyway, I’ve begun chapter one and will post the chapters, as well as excerpts I find interesting, in no pre-defined timeframe.

Oh yeah, and I’m engaged.

More Wget

Thursday, April 10th, 2008

It’s hard to overstate the usefulness and robust feature set that most of the GNU tools have in their arsenal. Today, I’ll make mention of one such tool, wget, and a novel use of the command.

As I go through my work, I find that sites we agree to take over have little structure. They generally were slapped together a long time ago, with little thought to organization, made with Dreamweaver or, Stallman forbid, FrontPage. I’m not judging; as long as something looks okay in the browser, a company can proclaim, “We’re on the intarwebs!” However, tracking down all of their pages to be converted into a CMS, for instance, can be time consuming. Not wanting to waste a client’s money by searching through the source for links and images, then manually reconstructing the layout of the files, I fell on my trusty GNU tool wget. (I also did not have FTP access, but I knew there were dead pages that I didn’t want to resurrect. Using wget in this case helped me retrieve only the pages that were still linked to from the main page).

Here’s a variation of the incantation of wget I used:

wget -r -A '*.htm*, *.jpg, *.png, *.gif' -l 3 http://www.example-site.com

What’s it all mean?
-r: wget should retrieve recursively
-A: takes a comma-separated list of patterns to match files to accept (use -R to reject). In this case, we want all htm, html, and most picture format files.
-l: denotes how far down the rabbit hole to venture. I started with 1, so only links from the first page were parsed and followed. I then tried 2, following links that were a level below the parent and compared the resulting structure. Trying 3, I found no difference between 3’s results and 2’s results, meaning all links had been followed and accounted for.

The result:
A directory called www.example-site.com that contains the files in their layout on the server. Now I knew which pages needed converting and which images to add to the new site.

A side note: A handy way to see the layout of your newly downloaded directory is to use the tree command.

tree www.example-site.com/

will display something like this:

www.example-site.com/
|-- about.html
|-- calendar.html
|-- committees.html
|-- contact.html
|-- otherdir
|   `-- index.html
|-- images
|   |-- header.gif
|   |-- logo.gif
|   `-- spacer.gif
|-- index.html
|-- join.html
|-- news.html
|-- partnerships.html
`-- scoopholiday.html