68098 – Bug machinery mails get a 3.2 spamassassin score

Summary: Bug machinery mails get a 3.2 spamassassin score

Status:	CLOSED FIXED

Alias:	None

Product:	Infrastructure
Classification:	Infrastructure
Component:	Bugzilla
Version:	current
Hardware:	All All

Importance:	P3 Trivial
Target Milestone:	Patch
Assignee:	Unknown
QA Contact:	issues@www

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-08-03 09:21 UTC by sthibaul
Modified:	2017-05-20 10:27 UTC
CC List:	3 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Description sthibaul 2006-08-03 09:21:44 UTC

Hi,

The mails that the Bug machinery sends have a spamassassin score of 3.2:

score=3.2 required=10.0 tests=BAYES_50,FORGED_RCVD_HELO,NO_REAL_NAME,SUBJECT_ENCODED_TWICE,SUBJECT_EXCESS_BASE64

Except for the Bayesian test (which I'm trying to train), all of these can be eliminated by fixing the corresponding mail headers.

Comment 1 stx123 2006-08-10 11:01:25 UTC

Please help us to understand what the problems are.

NO_REAL_NAME is obvious.

Could you explain:
SUBJECT_ENCODED_TWICE,SUBJECT_EXCESS_BASE64
FORGED_RCVD_HELO

Comment 2 sthibaul 2006-08-20 00:29:59 UTC

Hi,

FORGED_RCVD_HELO is actually a problem on my side, do not worry about it.

SUBJECT_EXCESS_BASE64 fires because mail composers usually encode the subject in a way that minimizes size and maximizes readability. For instance, an English subject (which can hence be encoded in ASCII) shouldn't be wrapped in the =?utf-8?blabla?= quirk at all. Latin languages (which have a few non-ASCII characters) should have only their non-ASCII parts encoded with =?utf-8?q?blabla?=, i.e. the quoted-printable form, so that the ASCII characters within those parts can still be read easily.

Currently, SpamAssassin considers it excessive that OOo's bug machinery always uses the base64 (=?utf-8?B?hex?=) encoding:

- if the subject is plain ASCII, it shouldn't be encoded at all.
- if the subject contains only a few non-ASCII characters, those parts should be encoded with =?utf-8?q?blabla?=
- otherwise, =?utf-8?b?hex?= is indeed the preferred way (and SpamAssassin shouldn't frown in such a case)
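
The three cases above can be sketched in Python with the standard `email` package. This is only an illustration, not the bug machinery's actual code: the function name `encode_subject` and the 30% threshold for "only a few non-ASCII characters" are assumptions made for the example. Note also that `email.header.Header` encodes the whole subject as one encoded-word rather than only its non-ASCII parts.

```python
from email.charset import BASE64, QP, Charset
from email.header import Header


def encode_subject(subject: str, max_non_ascii_ratio: float = 0.3) -> str:
    """Pick the lightest RFC 2047 encoding for a Subject value.

    max_non_ascii_ratio is a hypothetical threshold chosen for this sketch.
    """
    non_ascii = sum(1 for ch in subject if ord(ch) > 127)
    if non_ascii == 0:
        # Plain ASCII: no encoded-word at all.
        return subject

    charset = Charset("utf-8")
    if non_ascii / len(subject) <= max_non_ascii_ratio:
        # Few non-ASCII characters: quoted-printable keeps ASCII readable.
        charset.header_encoding = QP
    else:
        # Mostly non-ASCII: base64 is the compact choice.
        charset.header_encoding = BASE64
    return Header(subject, charset).encode()


if __name__ == "__main__":
    for s in ("Hello world", "café au lait", "日本語の件名"):
        print(encode_subject(s))
```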

SUBJECT_ENCODED_TWICE is actually a consequence of the previous one. The problem is with long subjects, which need to be split across several header lines. Since the bug machinery currently always encodes the whole subject, it has to split the encoding too, resulting in:

Subject: =?utf-8?b?hexhexhexhexhehxehxehexhehexhehexhehxehehxehhexh?=
	=?utf-8?b?hexhexhexhexhexhehehxehehxehhxehexheh?=

This is what SpamAssassin calls "encoding the subject twice". Avoiding excessive encoding should prevent it in most cases, but not all. That's why I asked the SpamAssassin developers to stop tagging such subjects (since there is no other way to encode them), but they preferred to just reduce the associated score. The bug machinery should hence just try to avoid it as much as possible.
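
To see the effect described above, the following Python sketch counts the encoded-words in a Subject value. The regular expression is a rough approximation written for this example and is not SpamAssassin's actual rule; `count_encoded_words` is a hypothetical helper name.

```python
import re
from email.charset import BASE64, Charset
from email.header import Header

# Approximate shape of an RFC 2047 encoded-word: =?charset?B-or-Q?text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def count_encoded_words(header_value: str) -> int:
    """Return how many encoded-words appear in a header value."""
    return len(ENCODED_WORD.findall(header_value))


# A long non-ASCII subject forced into base64 does not fit on one header
# line, so the email package folds it into several encoded-words.
charset = Charset("utf-8")
charset.header_encoding = BASE64
folded = Header("é" * 60, charset).encode()

if __name__ == "__main__":
    print(folded)
    print(count_encoded_words(folded))
```

A plain ASCII subject of the same length would contain no encoded-word at all, which is why dropping the unnecessary encoding removes most of these hits.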

Comment 3 stx123 2006-08-27 06:44:13 UTC

Thanks "sthibaul" for the explanation.

Support, could you please take care of these problems.

Comment 4 Unknown 2006-09-20 07:21:14 UTC

Started working on this issue.

Comment 5 Unknown 2006-10-11 12:10:23 UTC

I don't think there is much we can do about this. I have been reading the
RFC documents, and I believe what we are doing is correct and exactly as stated
in RFC 2047:

<snip>
The following are examples of message headers containing 'encoded-
   word's:

   From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
   To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
   CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>
   Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

      Note: In the first 'encoded-word' of the Subject field above, the
      last "=" at the end of the 'encoded-text' is necessary because each
      'encoded-word' must be self-contained (the "=" character completes a
      group of 4 base64 characters representing 2 octets).  An additional
      octet could have been encoded in the first 'encoded-word' (so that
      the encoded-word would contain an exact multiple of 3 encoded
      octets), except that the second 'encoded-word' uses a different
      'charset' than the first one.
</snip>
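
For reference, the Subject line from the RFC 2047 example quoted above can be decoded with Python's standard library. This is just a small check of the quoted example; the variable names are chosen for illustration.

```python
from email.header import decode_header

# Subject value copied from the RFC 2047 example: two base64 encoded-words
# on two header lines, each with its own charset.
raw_subject = (
    "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\n"
    " =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="
)

# decode_header returns one (bytes, charset) pair per encoded-word here,
# and drops the folding whitespace between adjacent encoded-words.
parts = decode_header(raw_subject)
decoded = "".join(data.decode(charset) for data, charset in parts)

if __name__ == "__main__":
    print(parts)
    print(decoded)
```

The two encoded-words join into a single sentence, so a header of this shape is valid per the RFC even though it contains more than one encoded-word.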

Please have a look at this link for more details on the subject:

http://aspn.activestate.com/ASPN/Mail/Message/spamassassin-users/3107435

Here are the details of an issue reported on the Apache site for this kind of
problem, together with the workaround or suggestion that was provided.

http://www.mail-archive.com/dev@spamassassin.apache.org/msg15778.html

<snip>
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5026

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME

------- Additional Comments From [EMAIL PROTECTED]  2006-08-04 17:34 -------
Hi,

Thanks for the ticket.  What you're reporting is commonly referred to as a
"false positive" (aka: FP).  The rule is actually doing the right thing -- the
subject does have two encodings in it, and so the rule is triggered.

It appears that this is more common now than it was before:

old: 1.047   1.4619   0.0792    0.949   0.58    0.89  SUBJECT_ENCODED_TWICE
new: 0.597   0.6926   0.1444    0.827   0.65    0.89  SUBJECT_ENCODED_TWICE

which basically means that the spam hits have decreased by ~50% while the ham
hits increased by ~50%.  So the next time the scores are generated, I would
expect this rule's score to drop a bit.  In the mean time, you can lower the
score on your installation as you see fit.

Hope this helps. :)

As for the ticket, since the rule is doing the right thing, I'm closing as WFM.

</snip>

I would like to close this issue as WONTFIX, so I am going ahead and closing it
as such. Please reopen if you feel otherwise.

Thanks 
Jobin.

Comment 6 sthibaul 2006-10-11 12:47:26 UTC

Hi,

Could you please at least fix NO_REAL_NAME? (This is really easy).

Samuel

Comment 7 Unknown 2006-10-12 12:21:09 UTC

I understand; however, there is a workaround for this. Mail generated by IZ may
not have a legitimate mail ID such as jobin@abc, since it is generated
internally. Normal AOL users face the same problem. Given below is an example,
along with a workaround: a rule that can be set up to allow mails from IZ.

<snip>
AOL has no real name. The NO_REAL_NAME test of SA will add points when email is
not in the format Joe Smith <jsmith@anyisp.com>. However, AOL email software
does not append a "real name" in the "from" field - so this recipe counteracts
the effect of the NO_REAL_NAME test to avoid false positives from AOL users.
(Contributed by A. Marshall, 7/26/03)

header MAIL_FROM_AOL From =~ /aol\.com/i
meta AM_AOL_HAS_NO_NAME MAIL_FROM_AOL && NO_REAL_NAME
describe AM_AOL_HAS_NO_NAME Counteracts NO_REAL_NAME test for AOL email
score AM_AOL_HAS_NO_NAME -1.1

</snip>

Comment 8 sthibaul 2006-10-12 19:05:35 UTC

Yeah, that's easy to do. But the problem is that not everyone will know how to do it, so most people will have BTS mails end up in their spam folder. Of course, SpamAssassin itself could add a rule to its default config, but that seems pretty ugly to me: should they have to add every bot that doesn't set a real name?!

Comment 9 maison.godard 2006-10-20 16:59:08 UTC

I have the same problems, especially with IZ automatic notifications:

0.6 NO_REAL_NAME           The From: field does not contain the full name of 
1.5 SUBJECT_ENCODED_TWICE  Subject: MIME encoded twice
0.0 SUBJECT_EXCESS_BASE64  Subject: base64 encoded encoded unnecessarily

Comment 10 Unknown 2007-02-01 17:45:12 UTC

Updating whiteboard.

Comment 11 Unknown 2007-02-01 20:53:54 UTC

We were able to identify a workaround that would allow us to eliminate the
NO_REAL_NAME hit from the test. With it, any IZ mail notification would have
something like Name <emailid> in the From field.
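
As an illustration of the Name <emailid> form, here is a short Python sketch using the standard `email.utils` helpers. The display name and address are made-up placeholders, and `has_real_name` is a hypothetical helper that only approximates the idea behind NO_REAL_NAME; it is not SpamAssassin's test.

```python
from email.utils import formataddr, parseaddr


def has_real_name(from_header: str) -> bool:
    """Return True when the From value carries a display name."""
    name, _address = parseaddr(from_header)
    return bool(name.strip())


# Placeholder values for the example, not the real IZ sender.
bare_from = "issues@example.org"
named_from = formataddr(("Issue Tracker", "issues@example.org"))

if __name__ == "__main__":
    print(bare_from, has_real_name(bare_from))
    print(named_from, has_real_name(named_from))
```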

More updates to follow.

Comment 12 Unknown 2007-02-05 12:16:22 UTC

The engineers have added a facility that resolves the NO_REAL_NAME problem for
mails generated from IZ, which addresses this issue going forward.

Comment 13 Unknown 2007-05-10 15:11:59 UTC

Plans are in place to resolve this issue in the next patch release, CEE 4.5.2.
Setting the target milestone to reflect this and marking the issue as
Resolved Later. Support will continue to track this issue internally and review
the fix once the patch has been applied on the site.

Comment 14 Unknown 2007-07-09 07:17:21 UTC

The option of adding %currentuserrealname% to the From field by default via the
IZ Email Notification template is now available for Add/Modify issue.

Stefan, please make the changes in the IZ template to verify that this issue
has been fixed.

Comment 15 Unknown 2007-07-09 07:18:14 UTC

Actually marking this issue as Fixed. Please verify and close this issue.