Re: How to wget download all PDF files larger than 100 Kbytes
From: Alan Connor (zzzzzz_at_xxx.yyy)
Date: Thu, 06 May 2004 20:17:43 GMT
On 6 May 2004 12:34:47 -0700, Orak Listalavostok <email@example.com> wrote:
> How do I get GNU web get (wget) to download all the PDFs
> (potentially thousands) on a stated web page but ignore
> any PDF smaller than a given size?
> I read the fine manual (wget --help), soon arriving at:
> % wget -prA.pdf http://foo.bar.com
> Which means (roughly): accept only PDF files (-A .pdf),
> fetch page requisites (-p), and recurse (-r) to the
> default depth of 5.
> But how do I eliminate the copying of files smaller than
> a certain size; that is, how do I tell wget to ignore PDF
> files of (say) 100 Kbytes or smaller?
This doesn't seem to be anything wget can do unaided. But if you download the
web page containing the PDF links and extract the URLs into a file, you can do this:
$ wget --spider http://home.earthlink.net/~alanconnor/elrav1/er1.tar.gz
Resolving home.earthlink.net... done.
Connecting to home.earthlink.net[220.127.116.11]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21,736 [application/x-tar]
Notice the "Length: ..." header? Feed the URL list to wget --spider via the
-i file option, parse out the URLs whose size clears your threshold, and feed
THAT list back to wget. Ask in comp.unix.shell for help writing the script
you'll need; it's not really hard with sed and/or awk (see the sketch below).
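Here's a minimal, untested sketch of the whole pipeline. It assumes the page
is http://foo.bar.com as in your example, that the links are absolute URLs in
plain href="..." attributes (relative links would need to be resolved first),
and a 100-Kbyte cutoff:

  #!/bin/sh
  # Pull the page down and extract the PDF links into a file.
  # The sed only catches one PDF link per line of HTML; good
  # enough for simple markup.
  wget -q -O - http://foo.bar.com |
    sed -n 's/.*href="\([^"]*\.pdf\)".*/\1/p' > urls.txt

  # Spider each URL and keep it only if the Length: header
  # reports more than 100 Kbytes. wget writes its report to
  # stderr, hence the 2>&1; the gsub strips the thousands
  # commas, and the regexp test skips "Length: unspecified".
  while read url; do
    len=`wget --spider "$url" 2>&1 |
         awk '/^Length:/ { gsub(",", "", $2);
                           if ($2 ~ /^[0-9]+$/) print $2 }'`
    [ "${len:-0}" -gt 102400 ] && echo "$url"
  done < urls.txt > big.txt

  # Fetch the survivors in one go.
  wget -i big.txt

It spiders one URL at a time instead of handing wget the whole list, so that
each Length: line pairs unambiguously with its URL; slower, but the script
stays trivial.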
Perhaps there is another web tool that will do the job, but I'm not aware of one.
-- 
Pass-List -----> Block-List ----> Challenge-Response
The key to taking control of your mailbox.
Design Parameters: http://tinyurl.com/2t5kp || http://tinyurl.com/3c3ag
Challenge-Response links -- http://tinyurl.com/yrfjb