Wget usage: request for comments
- From: babaji <banerjee.anirban@xxxxxxxxx>
- Date: Mon, 4 Feb 2008 16:04:59 -0800 (PST)
I am going to start a small project to analyze
how 8 websites are connected to each other, and would be grateful for
any comments people might have regarding the methodology I plan to use.
The problem: 8 websites with hyperlinks, images, JavaScript, etc.
embedded in them. I need to find out how they reference each other. In
general, I need to know where the links in the HTML code of these sites
point to on the Internet.
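
One way to summarize "who references whom" once the links are known is a
simple count per (source site, target site) pair. The sketch below is
only an illustration: the function names, the "external" bucket and the
host names are assumptions, not part of any existing tool, and it
assumes the links have already been extracted as absolute URLs.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def cross_references(links_by_site, sites):
    """Count links from each source site to the other studied sites.

    links_by_site: {source_host: [absolute URLs found on that site]}
    sites: the hosts under study (hypothetical names such as 'sitea.com')
    Targets outside `sites` are lumped together as 'external'.
    """
    known = {s.lower() for s in sites}
    counts = Counter()
    for source, urls in links_by_site.items():
        for url in urls:
            target = host_of(url)
            if not target or target == source:
                continue  # skip relative/malformed links and self-references
            counts[(source, target if target in known else "external")] += 1
    return counts
```

The resulting counter can be printed as an 8x8 table, or fed to a graph
tool, depending on how the connections are to be presented.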
What I plan to do: I will use wget as a crawler (I like the command
line!) and extract the needed information. I will set the command-line
parameters so that wget follows links up to 2 levels deep and extracts
the data listed below.
Exactly what information I need: (a) the hyperlinks on each site,
(b) the image file(s), (c) any binaries stored on the sites, and
(d) the actual HTML code of the main page.
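
For items (a) and (b), the links can be pulled out of the downloaded
HTML after wget has finished. The following is a minimal sketch using
only Python's standard library; the class name, the chosen set of
tag/attribute pairs and the example URLs are assumptions made for
illustration, and real pages may reference resources in other ways
(CSS, inline JavaScript) that this does not catch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect URLs from a few common tags, resolved against a base URL."""

    # (tag, attribute) pairs that carry a URL; extend as needed
    URL_ATTRS = {("a", "href"), ("img", "src"), ("script", "src"), ("link", "href")}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # list of (tag, absolute URL) in document order

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.URL_ATTRS:
                self.links.append((tag, urljoin(self.base_url, value)))


def extract_links(html, base_url):
    """Return [(tag, absolute_url), ...] for the given HTML text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links() on the text of each saved page, with the page's
original URL as base_url, gives absolute URLs that can then be grouped
by host.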
A few questions:
I plan to use
nirvana$>wget -i input-urls -o logfile -x -P output-dir/ --wait=1.5 \
--random-wait -U "put-in-mozilla-user-string" -r -l 2 -p "URL of a site"
I have not mentioned the no-clobber/timeout/retries options but will
probably use them.
As I understand it, -x (--force-directories) together with
-P output-dir/ tells wget to store all data from sitea.com under
output-dir/sitea.com/, right?
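
If the crawl is driven from a script, the same invocation can be built
as an argument list, which avoids shell quoting problems with the
user-agent string. This is only a sketch: the function name, the default
timeout and retry values, and the choice to pass the output directory
through -P (wget's directory prefix option) are assumptions, not
settings taken from the plan above.

```python
import subprocess


def build_wget_command(url, output_dir, log_file, user_agent,
                       depth=2, wait_seconds=1.5):
    """Build a wget argument list for a polite, depth-limited crawl."""
    return [
        "wget",
        "-r", "-l", str(depth),        # recurse, at most `depth` levels
        "-p",                          # also fetch page requisites (images etc.)
        "-x",                          # force creation of a directory hierarchy
        "-P", output_dir,              # save everything under this prefix
        "-o", log_file,                # write wget's messages to a log file
        "--wait=%s" % wait_seconds,    # base delay between requests
        "--random-wait",               # vary that delay from request to request
        "-nc",                         # no-clobber: keep files already downloaded
        "--timeout=30", "--tries=3",   # assumed values; tune per site
        "-U", user_agent,
        url,
    ]


def run_wget(command):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Running one command per site keeps the log files separate, which makes
it easier to see which site caused a problem.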
Is it a good idea to modify the standard Mozilla user-agent string to
include my name and email, so that a site's admin can contact me if
they do not like what I am doing?
The major problem I foresee is this: as I am using recursive downloads,
some of the sites being downloaded may have very large files in their
directories. I went through the wget manpage but could not find an
option to set the reject list based on size; it can reject by file
type, but not by size. There was another post on groups.google which
answered this question, but the solution was to download the data and
then "not analyze" it if it is greater than, say, 10 MB. I need to stop
wget from downloading it in the first place.
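
One possible workaround, sketched here under assumptions rather than as
a wget feature, is to ask the server for the size before downloading: a
HEAD request usually returns a Content-Length header, and URLs above the
cut-off can be left out of the list handed to wget. The function names
and the 10 MB default are illustrative. Servers may omit or misreport
Content-Length, so unknown sizes are let through in this sketch.

```python
import urllib.request

MAX_BYTES = 10 * 1024 * 1024  # 10 MB cut-off, as in the example above


def is_small_enough(content_length, max_bytes=MAX_BYTES):
    """Decide from a Content-Length header value whether to download.

    Returns True when the size is unknown (header missing or malformed).
    """
    try:
        return int(content_length) <= max_bytes
    except (TypeError, ValueError):
        return True


def remote_size_ok(url, user_agent, max_bytes=MAX_BYTES, timeout=30):
    """Send a HEAD request and check the reported size against the limit."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_small_enough(response.headers.get("Content-Length"), max_bytes)
```

This costs one extra request per URL, so it fits best as a filter over
an already known list of URLs rather than inside the recursive crawl
itself.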
Also, if anyone knows about a tool/service that would give me the
links from/to a page if I give a URL as input, it would be super. I
looked through Google's Social API, but it only parses publicly
declared "social" references. There is a service called Kartoo, but it
seems I can't download the results in text format.
Any help/comments/pointers would be greatly appreciated.