Re: Please Help W/Wget.
From: Baard Ove Kopperud (bokoppeNOSpamHere_at_frisurf.no)
Date: Sat, 22 Nov 2003 19:40:18 +0100
"Bozo Schmozo" <firstname.lastname@example.org> wrote in message
> What I wish to be able to do is instruct wget the directories from
> which to download. I am only able to download one file at a time by
> specifying the names of the files by:
> But when I try this:
> wget -r -np
> (with, or without the end "/"), I get a 403 error. I also tried
> without "-np", but the result is the same.
I think you should specify the recursion depth -- the -l option.
An argument of 0 or inf to -l specifies infinite recursion depth,
but as long as you also use -np, you don't risk wget starting
to download the whole site.
You may also consider quoting the URL.
wget -r -np -l inf http://www.someplace.org/dir_1/dir_2/
> I also tried it by giving it an "accept" flag with, "-A pdf" but that
> didn't work either.
By default wget takes everything, so if all that is there are
PDF files, there is no reason to specify an accept list.
> AFAIK, "directory_I_wish_to_download_from" only
> has PDF files, no HTML or any other files.
> Anyone know why this may be happening or what I may be doing wrong?
> Am I missing some flags? Any input or suggestions would be greatly
> appreciated.
Here you rely on the server listing the contents of the
directory in question, but many servers forbid this. Try using
a normal browser to view the directory in question; my guess
is that it will produce an "ERROR: Forbidden" and *not* a
directory listing.
If you know what pattern the files have (e.g. doc_001.pdf, doc_002.pdf,
doc_003.pdf, ...) you can generate an input file for wget (specified with
the "-i filename" option). It could look like this:
If there is no pattern, then find the page(s) that the files in
the directory are linked from, save them, and clean them so that
only the links to the files you're interested in remain.
You may have to edit the HTML code so only the path
to the file remains, or fully qualify the entries (e.g. turning
"file1.pdf" into "http://www.someplace.org/dir_1/dir_2/file1.pdf"),
but 'sed' should help you a lot. In the end, you should have
something like the example above, which you can feed to
wget with the -i option.
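Something like this might do it (a rough sketch -- it assumes
relative links, at most one per line, and the same placeholder
host and path as above):

   sed -n 's|.*href="\([^"]*\.pdf\)".*|http://www.someplace.org/dir_1/dir_2/\1|p' \
       saved_page.html > filelist.txt
   wget -i filelist.txt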
If there is only one page that links to this directory, you can
perhaps use that page directly. You should probably
specify a recursion depth of 1 (wget -r -l 1). Here you may have
use for the -A option to specify that you only want PDF files, but
be aware that wget deletes the files that aren't of the correct type
only *after* they've been downloaded.
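For example (the page name here is just a placeholder for whatever
page links to the directory):

   wget -r -l 1 -A pdf http://www.someplace.org/page_with_links.html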
You could also save the page with the links, remove all anchors
you're not interested in, and use it as an argument to wget (*not*
as an argument to -i):
wget -r -l 1 -A pdf saved_page.html
This will work if the page uses *fully-qualified* links (i.e.
with domain name); if it doesn't, you must either edit
the file accordingly (adding the domain name and possibly the
directory path) or use the options to wget that let you
specify the domain (I know this one exists) and higher
directories (I *think* this option exists too).
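(If memory serves, the options in question are -B/--base, which sets
the base URL for resolving relative links, and -F/--force-html, which
makes wget parse an input file as HTML -- though both work together
with -i, not with the page as a direct argument. E.g., with the same
placeholder host and path as above:

   wget -r -l 1 -A pdf -F -B http://www.someplace.org/dir_1/dir_2/ -i saved_page.html
)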
Hope this helped,