Re: Right tool and method to strip off html files (python or sed?)



sebzzz@xxxxxxxxx wrote:

I'm in the process of refactoring a lot of HTML documents, and I'm
using HTML Tidy to do part of this work (clean up, convert to XHTML,
and remove font and center tags).

Tidy will only do part of the work I need, though. I also have to
remove all the presentational tags and attributes from the pages (in
other words, strip the pages down), including the tables that are used
only for layout (how do I tell those apart?).

I thought about doing that with Python (which I'm in the process of
learning), but maybe another tool (like sed?) would be better suited
for this job.

Use an HTML parser.
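
To make that concrete, here is a minimal sketch using only the standard library's html.parser module. It re-emits the document while dropping presentational tags (keeping their contents) and presentational attributes. The names PRESENTATIONAL_TAGS, PRESENTATIONAL_ATTRS, PresentationStripper and strip_presentation are made up for this illustration, and the tag and attribute lists are assumptions you would adjust to your own pages. It does not try to tell layout tables from data tables.

```python
from html import escape
from html.parser import HTMLParser

# Assumed lists, for illustration only; adjust them to your own pages.
PRESENTATIONAL_TAGS = {"font", "center", "u", "big", "strike"}
PRESENTATIONAL_ATTRS = {"align", "bgcolor", "border", "color", "style"}


class PresentationStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus presentational tags and attributes."""

    def __init__(self):
        # Keep entities such as &amp; as they were written in the source.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs, closing):
        if tag in PRESENTATIONAL_TAGS:
            return  # drop the tag itself, keep whatever it wraps
        kept = []
        for name, value in attrs:
            if name in PRESENTATIONAL_ATTRS:
                continue
            if value is None:  # attribute without a value, e.g. "disabled"
                kept.append(f" {name}")
            else:  # the parser unescaped the value, so escape it again
                kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, "")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_presentation(html):
    """Return html with presentational tags unwrapped and attributes removed."""
    parser = PresentationStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the parser hands you tags and attributes as separate events, nesting and quoting inside attribute values are not your problem, which is the part that gets painful with regexes. Note that the parser lowercases tag and attribute names, so the output is not byte-for-byte identical to the input even where nothing was removed.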


I kind of know generally what I need to do:

1- Find all HTML files in the folders (sub-folders ...)
2- Do some file I/O and feed sed or Python or whatever else with the file.
3- Apply some regular expressions recursively on the file to do the
things I want (delete certain tags and certain attributes when it
encounters them).

Parsing HTML with regexes isn't fun if you want to do anything beyond
very simple things.
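
Steps 1 and 2 are the easy part in Python. A rough sketch with pathlib, where step 3 is whatever text-to-text function you pass in (ideally one built on a parser, not on regexes). The function name rewrite_html_files and its parameters are invented for this example, and the UTF-8 default is an assumption: older pages are often Latin-1, so check before running it on real files.

```python
from pathlib import Path


def rewrite_html_files(root, transform, patterns=("*.html", "*.htm"),
                       encoding="utf-8"):
    """Apply transform(text) -> text to every HTML file under root, in place.

    Returns the list of paths whose content actually changed.
    """
    changed = []
    for pattern in patterns:
        # rglob descends into all sub-folders of root.
        for path in sorted(Path(root).rglob(pattern)):
            if not path.is_file():
                continue
            original = path.read_text(encoding=encoding)
            cleaned = transform(original)
            if cleaned != original:
                path.write_text(cleaned, encoding=encoding)
                changed.append(path)
    return changed
```

This overwrites files in place, so run it on a copy of the tree (or under version control) until you trust the transform.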


But I don't know how to actually do it, the syntax and everything. I
also want to pick the tool that's easiest for this job. I heard
about BeautifulSoup and lxml for Python, but I don't know if those
modules would help.

They will.
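
For example, with BeautifulSoup (the third-party beautifulsoup4 package, imported as bs4) the same kind of cleanup is only a few lines. This is a sketch, not a complete solution: the tag and attribute lists and the function name strip_presentation_bs4 are assumptions for illustration, and it makes no attempt to decide which tables are layout tables.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Assumed lists, for illustration only; adjust them to your own pages.
PRESENTATIONAL_TAGS = ["font", "center", "u", "big", "strike"]
PRESENTATIONAL_ATTRS = {"align", "bgcolor", "border", "color", "style"}


def strip_presentation_bs4(html):
    """Return html with presentational tags unwrapped and attributes removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Replace each presentational element with its own children.
    for tag in soup.find_all(PRESENTATIONAL_TAGS):
        tag.unwrap()
    # Drop presentational attributes from every remaining element.
    for tag in soup.find_all(True):
        for attr in list(tag.attrs):
            if attr in PRESENTATIONAL_ATTRS:
                del tag[attr]
    return str(soup)
```

Since the result is a tree, you can also inspect each table before deciding what to do with it, which is something a line-oriented tool like sed cannot really offer.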





Florian
--
<http://www.florian-diesch.de/>
-----------------------------------------------------------------------
** Hi! I'm a signature virus! Copy me into your signature, please! **
-----------------------------------------------------------------------