Wget usage: request for comments



Hello everyone,
I am going to start a small project to analyze how 8 websites are
connected to each other, and I would be grateful for any comments
people might have on the methodology I plan to follow.

The problem: 8 websites with hyperlinks, images, JS, etc. embedded in
them. I need to find out how they reference each other. In general, I
have to know where the links in the HTML code of these sites point on
the Internet.

What I plan to do: I will use wget as a crawler (I like the command
line!!) and extract the needed information. I will set the
command-line parameters so that wget follows links two levels deep and
extracts those pages too.

What exact information I need: (a) the hyperlinks on each site, (b)
the image file(s), (c) any binaries stored on the sites, (d) the
actual HTML code of the main page.
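
As a rough first pass at the extraction step (my own sketch, not
something wget does for me; a real HTML parser would be more robust),
I was thinking of pulling the href/src targets out of the mirrored
pages with standard shell tools, e.g.

nirvana$ grep -rhioE '(href|src)="[^"]+"' --include='*.html' output-dir/ \
           | cut -d'"' -f2 | sort -u > all-links.txt

This only catches quoted attributes and leaves relative URLs as-is,
and I may also need wget's -E/--html-extension so that dynamically
generated pages actually end up with a .html suffix.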

A few questions:
I plan to use

nirvana$ wget -i input-urls -o logfile -x -P output-dir/ -w 1.5 --random-wait \
         -U "put-in-mozilla-user-string" -r -l 2 -p "URL of a site"

I have not mentioned the no-clobber/timeout/retries options, but I
will probably use them.
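
Roughly like this, I think (the 30-second timeout and 3 retries are
just placeholder values):

nirvana$ wget -i input-urls -o logfile -x -P output-dir/ -w 1.5 --random-wait \
         -nc -T 30 -t 3 -U "put-in-mozilla-user-string" -r -l 2 -p "URL of a site"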

Using -x together with -P output-dir/ should tell wget to store all
data from sitea.com under output-dir/sitea.com/, right?
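
In other words, this is the layout I am expecting (the file names are
made up; please correct me if I have the mechanics wrong):

nirvana$ wget -r -l 2 -p -x -P output-dir/ "http://sitea.com/"

  output-dir/sitea.com/index.html
  output-dir/sitea.com/images/logo.png
  output-dir/sitea.com/js/main.js

i.e. the paths under output-dir/sitea.com/ mirror the paths on the
remote server.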

Is it a good idea to modify the standard Mozilla user-agent string to
include my name and email address, so that the admin of a site can
contact me in case he does not like what I am doing?
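
Something along these lines is what I had in mind (the contact address
is made up):

nirvana$ wget -U "Mozilla/5.0 (compatible; link-survey; contact: myname@example.org)" \
         -r -l 2 -p "URL of a site"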

The major problem I foresee is this: since I am using recursive
downloads, some of the sites may have very large files in their
directories. I went through the wget manpage but could not find an
option to build the reject list based on size; rejecting by file type
is possible, but not by size. There was another post on Google Groups
that answered this question, but the solution was to download the data
and then "not analyze" it if it is larger than, say, 10 MB. I need to
stop wget from downloading it in the first place.
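
The closest workarounds I have come up with so far are only partial
ones, if I read the manpage right:

(1) A total download quota with -Q; in a recursive run wget stops
fetching new files once the quota is exceeded, although it still
finishes the file it is currently downloading:

nirvana$ wget -r -l 2 -p -Q 10m -x -P output-dir/ "URL of a site"

(2) Checking a single file's size first with a HEAD-style request
(--spider together with -S prints the server headers) and only
fetching it if Content-Length is under a threshold; 10485760 bytes
(10 MB) and the file name are just examples:

url="http://sitea.com/some-big-file.bin"
size=$(wget -S --spider "$url" 2>&1 \
        | awk 'tolower($1) == "content-length:" { print $2; exit }')
if [ -n "$size" ] && [ "$size" -lt 10485760 ]; then
    wget -x -P output-dir/ "$url"
fi

Neither of these is a true per-file size limit during recursion, which
is what I actually want.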

Also, if anyone knows of a tool or service that, given a URL as input,
would give me the links from/to that page, that would be super. I
looked through Google's Social Graph API, but it only parses publicly
declared "social" references. There is a service called Kartoo, but it
seems I can't download the results in text format.

Any help/comments/pointers would be greatly appreciated.
Best,
-A




