Re: How do I teach Spam Assassin? [LONG]

From: jdow (jdow_at_earthlink.net)
Date: 03/13/04

    To: <fedora-list@redhat.com>
    Date: Fri, 12 Mar 2004 18:07:21 -0800
    
    

    From: "Jeff Vian" <jvian10@charter.net>

    > jdow wrote:
    >
    > >http://wiki.spamassassin.org/ is an astonishingly good place to learn
    > >about the ins and outs of SpamAssassin. It also mentions the home pages
    > >of various custom rule sets like 99_TripWire, BigEvil, and many others.
    > >
    > >It is well worth the visit.
    > >
    > >(I have "progressed" to the point that SA filters my mail. I read my
    > >mail via a secure pop3 connection. I maintain spam, oldspam, ham, and
    > >oldham folders on a special account via IMAP to mbox files. I have a
    > >futility that filters off the message 1 the imap tool insists must be
    > >there so that a cron job every night runs "sa-learn" for me. This is
    > >all rather handy when I am running this email tool. When I get time I
    > >plan to revisit tossing the special folders into my main email account
    > >"safely". I understand more now than when I started. {^_-})
    > >
    > >
    > >
    > I am interested in how you do this.
    > I use fetchmail to get my mail by pop3 from the ISP and put it in my
    > account on the linux box. I then need to get it into the maildirs
    > (instead of the default mailbox) so I can teach SA, but am unsure of the
    > mechanics of making that happen. Any pointers on getting the maildirs
    > working will be greatly appreciated.

    The data path is fetchmail to sendmail to procmail to spamc to mbox
    output. (Sendmail queues the mail in /var/spool/mqueue as fetchmail
    feeds it in. Then it farms the mail back through procmail for delivery
    to the mbox file.) This part is easy and should be a slam dunk for most
    folks here.

    You must get spamd running, though. I use the spamassassin.org rpm
    for 2.63. So it may be different for Fedora. But the key lines are:

    # SPAMDOPTIONS="-d -c -a -m5 -H"
            SPAMDOPTIONS="-d -c -m10 -H"

    The first is commented out. The second is only partly silly. (I see I
    should change the -m option back for sanity. I have a slow machine plus
    massive mail chunks coming in (chiefly from LKML) that lead to imap
    becoming a little disoriented. I get batches of mail delivered twice
    every once in a while.) I removed the -a option for AWL (the
    auto-whitelist). It's something I do not trust. And several people's
    reports on the spamassassin list reinforce that view.
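
    A commented sketch of that setting follows. The flag meanings are
    taken from the spamd documentation for the 2.6x series, and the -m5
    value is only an illustration of "changing the -m option back"; the
    right number depends on the machine.

```shell
# Sketch only: flag meanings as documented for spamd 2.6x.
#   -d   run as a daemon
#   -c   create per-user preference files when they are missing
#   -m5  allow at most 5 child processes (illustrative value)
#   -H   helper home dir option, given here without a path
# -a (the auto-whitelist, AWL) is deliberately left out.
SPAMDOPTIONS="-d -c -m5 -H"
```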

    I fired up the imap and later the secure pop3 tool straight out of RH9.
    It picked up the mbox mail box and presents it to me for reading on a
    different machine in Outlook Express.

    Then I decided it was too awkward to use a :0c: rule in procmail to save
    a copy in mbox format of all unprocessed incoming email so that I could
    yank out spam and train with it. (mail and its "s100 spam" ability. Yeah,
    REAL primitive. It HAD to be improved.)

    A throwaway comment by one of the other users on the spamassassin list
    led me to the dual account setup. But I think he used the second account
    for global training. I wanted per user training. (I also put the
    "allow_user_rules 1" line in /etc/mail/spamassassin/local.cf.)

    <digression> Make no changes to /usr/share/spamassassin. I made that
    mistake. I expect pain when I upgrade. Use the /etc/mail/spamassassin
    folder for all your add-on rule sets. They appear to run alphabetically.
    </digression>

    I created the second account and fired up imap for that account. That is
    where I created the spam and ham folders I used, briefly, for training.
    SpamAssassin is friendly enough it can train on its own marked up spam
    if you want. It can train from mbox format. It gets slower and slower
    as your spam database grows.

    On a slow machine getting slower and slower is a decided disadvantage.
    So I created "oldspam" and "oldham" to save already processed messages
    and diffidently copied all the already processed spam and ham to the
    appropriate retraining folders. (If you ever blow away the bayes
    databases, all three files, this will allow relatively pain-free
    retraining.)

    sa-learn is (apparently) unfriendly enough to train on the message you
    will find using "mail" as the permanent first message in the folder.
    "This," decides me, "is not right!" So I figured to further automate
    the whole process. Now, I speak C far better than Perl. Besides, for
    the very limited parsing I had to perform, C is far faster than Perl.
    The resultant futility, imapstrip, works in four steps. In step one it
    opens the folder you want, the "old<name>" folder that goes with it
    (append mode), and a "<name>_temp" file (write mode, to erase the old
    one). It checks that the first item in the file is indeed the imap
    header message. If not, it proceeds to step three directly. In step two
    it parses to the first real "^From ", saving the material in between
    the start of the file and that From for later on. It completes step two
    by writing out the rest of the buffer to both the appended file and the
    new temp file. In step three it reads and writes to both files in 64k
    chunks until the end of the file. Then in step four, if the imap header
    was present, it closes the input file and rewrites it with the header
    it captured in step two.
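
    The steps above can be sketched in shell rather than C. This is a
    rough illustration, not the imapstrip program itself: the function
    name imapstrip_sketch is made up, the header message is recognised
    by an X-IMAP: header in the first message (an assumption about how
    the imap server marks it), and no mailbox locking is attempted.

```shell
#!/bin/bash
# Hypothetical stand-in for imapstrip; not the original C program.
# Usage: imapstrip_sketch /path/to/folder
imapstrip_sketch() {
    local folder=$1 dir name old temp header
    [ -f "$folder" ] || return 1
    dir=$(dirname "$folder")
    name=$(basename "$folder")
    old="$dir/old$name"        # archive, appended to
    temp="$dir/${name}_temp"   # fresh copy for sa-learn, overwritten
    header=$(mktemp) || return 1
    : > "$temp"

    # Step 1: is the first message the IMAP header message?
    # Assumption: its header block contains an X-IMAP: line.
    if awk '/^$/ { exit } /^X-IMAP:/ { found = 1 } END { exit !found }' "$folder"
    then
        # Step 2: everything before the second "From " line is the
        # header message; the rest is real mail.
        awk -v hdr="$header" -v tmp="$temp" '
            /^From / { n++ }
            n <= 1   { print > hdr; next }
                     { print > tmp }
        ' "$folder"
    else
        # No header message: the whole folder is real mail.
        cat "$folder" > "$temp"
    fi

    # Step 3: append the real mail to the archive folder.
    cat "$temp" >> "$old"

    # Step 4: leave only the header message (or nothing) in the folder.
    cat "$header" > "$folder"
    rm -f "$header"
}
```

    Called as imapstrip_sketch ~/alice_train/spam (alice_train being a
    made-up account), it would leave spam_temp ready for
    sa-learn --spam --mbox and append the same messages to oldspam.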

    Voila, I have the spam archive updated, the spam_temp for learning, and
    the spam folder emptied out all without my intervention.

    Since these folders exist in "<name>_train" I had to cross connect
    the <name> and <name>_train accounts usefully. That means the ~/mail
    for "<name>_train" is linked to "~/<name>_train" for the <name> account.
    I also had to make <name> and <name>_train members of each other's
    ad hoc RedHat groups. A little "satrain" script and .procmailrc
    editing later and I'm in business up to today.
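
    One way the cross connection might look, as a dry run that only
    prints commands for review. Everything specific here is an
    assumption: home directories under /home, one private group per
    user named after the user, the made-up helper name
    show_train_link_commands, and the particular chmod modes.

```shell
#!/bin/bash
# Dry run: prints the commands, runs nothing. Review, then run as root.
# Assumptions: homes under /home and a private group per user.
show_train_link_commands() {
    local user=$1
    local train="${user}_train"
    # <name> sees the training account's mail folder as ~/<name>_train.
    echo "ln -s /home/$train/mail /home/$user/$train"
    # Put each account in the other's private group.
    echo "gpasswd -a $user $train"
    echo "gpasswd -a $train $user"
    # Let group members reach and modify the training folders.
    echo "chmod g+x /home/$train"
    echo "chmod -R g+rwX /home/$train/mail"
}
```

    For example, show_train_link_commands alice prints five commands
    for the hypothetical accounts alice and alice_train.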

    I note that I am now convinced that the "<name>_train" accounts are
    not really needed. But for the time being "it works so I am not messing
    with it."

    For the nonce the C coding is left as an exercise for the student. It
    could be optimized, I suppose. "Done is good!" So I am leaving it alone
    for a while. The rest of it is simple minded.

    --8<-- minimum .procmailrc for the <name> account
    DROPPRIVS=yes
    PROCMAILMATCH="X-Procmail: Matched on"
    PROCMAILHEADER="X-Procmail: "

    :0 fw: spamassassin.lock
    * < 250000
    * !^List-Id: .*(spamassassin-talk\.lists\.sourceforge\.net|spamassassin\.apache\.org)
    | /usr/bin/spamc
    --8<-- added lines for rewriting spamassassin headers for easy replyto
    # Tag SA-talk list mail
    :0 Efw
    * ^List-Id: .*(spamassassin-talk\.lists\.sourceforge\.net|spamassassin\.apache\.org)
    | formail -A "$PROCMAILHEADER SA-talk list mail not processed."

    # NEW sa-talk list
    :0 fw
    * ^List-Id: .*spamassassin\.apache\.org
    | formail -A "$PROCMAILMATCH SpamAssassin Talk list" -i "Reply-To: spamassassin-users@incubator.apache.org"
    --8<-- satrain script for the user's directory
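    #!/bin/bash
    # Nightly Bayes training for one user; run from that user's crontab.
    date
    USER=$LOGNAME
    USERTRAIN="${USER}_train"
    echo "$USERTRAIN"
    # Stop the running fetchmail daemon for the duration of training.
    /usr/bin/fetchmail -q
    # Log whether imapstrip is installed.
    ls ~/bin/imapstrip
    if [ -f ~/bin/imapstrip ]; then
            # imapstrip archives each folder and leaves <folder>_temp
            # behind; learn from that only if imapstrip succeeded.
            echo "imapstrip training ham"
            ~/bin/imapstrip ~/"$USERTRAIN"/ham &&
                    sa-learn --ham --showdots --mbox ~/"$USERTRAIN"/ham_temp
            echo "imapstrip training spam"
            ~/bin/imapstrip ~/"$USERTRAIN"/spam &&
                    sa-learn --spam --showdots --mbox ~/"$USERTRAIN"/spam_temp
    else
            # No imapstrip: learn straight from the folders.
            sa-learn --ham --showdots --mbox ~/"$USERTRAIN"/ham
            sa-learn --spam --showdots --mbox ~/"$USERTRAIN"/spam
    fi
    # Restart fetchmail as a daemon polling every 120 seconds.
    /usr/bin/fetchmail -d 120 --fetchmailrc ~/.fetchmailrc
    date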
    echo "====================================================================="
    --8<--

    Don't forget to set up fetchmail. But that's no big deal if you have
    already done it. It looks remarkably like:
    --8<--
    # Configuration created sometime in 2003 by jdow
    set syslog
    set postmaster "jdow"
    set no bouncemail
    set no spambounce
    set properties ""
    #set daemon 60
    #set logfile /var/log/fetchmail
    #set syslog
    # repeat these lines below for each email account fetched to the user's
    # mailbox.
    poll smtp.earthlink.net with proto APOP
           user 'jdow' there with password 'ZZYYZZYY' is 'jdow@mymachine' here
    options pass8bits
     smtpaddress ' '
    --8<--

    Then I set up a user crontab entry to run the script at a unique per user
    time in the late late night hours.
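
    A minimal sketch of such an entry, built by a small helper so each
    user can pick a different time. The helper name satrain_cron_line,
    the ~/bin/satrain location, and the log file name are assumptions
    made for illustration.

```shell
#!/bin/bash
# Print a crontab line that runs satrain nightly at HOUR:MINUTE.
# Assumptions: the script is installed as ~/bin/satrain and its
# output is appended to ~/satrain.log.
satrain_cron_line() {
    local minute=$1 hour=$2
    # $HOME stays literal here; cron's shell expands it at run time.
    printf '%s %s * * * $HOME/bin/satrain >> $HOME/satrain.log 2>&1\n' \
        "$minute" "$hour"
}

# To install for the current user (appends to any existing crontab):
#   ( crontab -l 2>/dev/null; satrain_cron_line 17 3 ) | crontab -
```

    Giving each user a different minute, say 17 for one and 47 for the
    next, keeps the training runs from piling up on a slow machine.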

    I hope that is enough to get you rolling. The C file, once I put in
    some basic bounds checking, is a little too large for me to include here.

    {^_^}

    -- 
    fedora-list mailing list
    fedora-list@redhat.com
    To unsubscribe: http://www.redhat.com/mailman/listinfo/fedora-list
    
