Re: bayesian filter training question

From: Kjetil Kjernsmo (kjetil_at_kjernsmo.net)
Date: 09/30/05

    To: debian-user@lists.debian.org
    Date: Fri, 30 Sep 2005 09:14:53 +0200
    
    

    On Thursday 29 September 2005, 21:51, Roberto C. Sanchez wrote:
    > So, I finally decided to get with the 20th century and install
    > spamassassin (actually spampd hooked through postfix) to do site-wide
    > spam filtering for my server.

    Yiiihaaa!

    > My question is this.  As I am training
    > it with sa-learn, is it (good|bad|indifferent) to train it on spam
    > that has already been flagged as spam.  That is, will this reinforce
    > spamassassin's notion of spam or ruin it?

    No, that's fine. In fact, SA has an auto-whitelist concept that does
    something similar (it's not really a whitelist, though; it's more a way
    of evening out weird things that may happen. I'm not using it).

    You should have a good look at bayes_ignore_header, so that it won't
    train on headers that obviously mark a message as spam, such as those
    added by another filter. SA is pretty good at this itself, but if you
    see a lot of spam that has already been filtered elsewhere, be sure to
    use it.
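
    A minimal sketch of what that could look like in local.cf. The header
    names below are hypothetical placeholders; substitute whatever your
    upstream filter actually adds:

    ```
    # Keep Bayes from learning tokens out of headers that another
    # filter has already stamped on the message.
    # (Header names are illustrative assumptions, not from a real setup.)
    bayes_ignore_header X-Upstream-Spamfilter
    bayes_ignore_header X-Upstream-Spam-Status
    ```

    Each directive names one header, so repeat the line for every header
    you want excluded from training.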

    I'm guessing that you, like me, are doing this for your family. In that
    case, I have found that it is quite sufficient to train a single
    database with the spam and ham of the entire family. If you have more
    diverse users, you would probably need to have a per-user
    configuration. For example, a friend of mine has an uncle who is a
    psychiatrist working with people with gambling obsessions, and SA was
    pretty catastrophic for him until he got a per-user config.

    Finally, I found that SA, in its default 3.0 form, was much too
    conservative about the assigned scores, so I have adjusted the scores
    of a bunch of rules. You'll get a feel for that in time, I guess. Also
    note that SA 3.1 has been released upstream.
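
    A score adjustment of that kind might look like this in local.cf. The
    rule picked and the value given are only illustrative assumptions, not
    the adjustments actually made here:

    ```
    # Raise the score of a rule judged too conservative by default.
    # (Rule choice and value are examples only; tune against your own mail.)
    score BAYES_99 4.5
    ```

    It is probably wise to change one rule at a time and watch for false
    positives before touching the next.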

    Cheers,

    Kjetil

    -- 
    Kjetil Kjernsmo
    Programmer / Astrophysicist / Ski-orienteer / Orienteer / Mountaineer
    kjetil@kjernsmo.net   
    Homepage: http://www.kjetil.kjernsmo.net/     OpenPGP KeyID: 6A6A0BBC
    
