Re: Right tool and method to strip off html files (python or sed?)
- From: Florian Diesch <diesch@xxxxxxxxxxxxx>
- Date: Sun, 15 Jul 2007 02:29:45 +0200
sebzzz@xxxxxxxxx wrote:
I'm in the process of refactoring a lot of HTML documents and I'm
using HTML Tidy to do part of this
work (clean up, convert to XHTML, and remove font and center tags).
Tidy will only do part of the work I need to
do, though. I have to remove all the presentational tags and attributes from
the pages (in other words, strip the pages down), including the tables that
are used for layout (how to differentiate?).
I thought about doing that with Python (which I'm in the process of
learning), but maybe another tool (like sed?) would be better suited
for this job.
Use an HTML parser.
I kind of know generally what I need to do:
1- Find all HTML files in the folders (sub-folders ...)
2- Do some file I/O and feed sed or Python or whatever else with the file.
3- Recursively apply some regular expressions to the file to do the
things I want (delete certain tags and certain attributes when it
encounters them).
Parsing HTML with regexes isn't fun if you want to do anything beyond
very simple things.
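As a rough sketch of the three steps quoted above, here is one way it could look with a parser instead of regexes. It uses only `html.parser` and `pathlib` from Python's standard library, so nothing needs to be installed. The names `DROP_TAGS`, `DROP_ATTRS`, `Stripper`, `strip_presentation` and `process_tree` are made up for illustration, the tag and attribute lists are assumptions to adjust for your pages, and it makes no attempt to tell layout tables from data tables (that still needs a human decision).

```python
from html import escape
from html.parser import HTMLParser
from pathlib import Path

# Assumed lists of presentational markup; adjust them to your documents.
DROP_TAGS = {"font", "center"}
DROP_ATTRS = {"align", "bgcolor", "border", "color", "face", "size"}


class Stripper(HTMLParser):
    """Re-emit the HTML it is fed, minus the listed tags and attributes."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs, close=""):
        if tag in DROP_TAGS:
            return  # drop the tag itself, keep whatever it contains
        kept = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
            if name not in DROP_ATTRS
        )
        self.out.append(f"<{tag}{kept}{close}>")

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_presentation(html):
    """Return html with presentational tags and attributes removed."""
    parser = Stripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def process_tree(root):
    """Rewrite every *.html file under root in place (assumes UTF-8)."""
    for path in sorted(Path(root).rglob("*.html")):
        text = path.read_text(encoding="utf-8")
        path.write_text(strip_presentation(text), encoding="utf-8")
```

Calling `process_tree("some/folder")` overwrites the files, so try it on a copy first. Note that the parser lowercases tag and attribute names, which is usually what you want when converting to XHTML anyway.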
But I don't know how to actually do it, the syntax and everything. I
also want to pick the tool that's easiest for this job. I heard
about BeautifulSoup and lxml for Python, but I don't know if those
modules would help.
They will.
Florian
--
<http://www.florian-diesch.de/>
-----------------------------------------------------------------------
** Hi! I'm a signature virus! Copy me into your signature, please! **
-----------------------------------------------------------------------