Bug 2173 - Message subject encoding is assumed ISO 8859-1 instead of using charset in Content-Type header
Status: REOPENED
Alias: None
Product: Claws Mail (GTK 2)
Classification: Unclassified
Component: Other
Version: 3.7.2
Hardware: PC Linux
Importance: P3 enhancement
Assignee: users
Reported: 2010-04-17 12:31 UTC by Michael Orlov
Modified: 2012-09-19 10:34 UTC

Description Michael Orlov 2010-04-17 12:31:39 UTC
Sometimes non-English messages are received with the Subject header unencoded - that is, the subject is not in "=?charset?...?=" format. In this case, it appears that Claws Mail assumes ISO 8859-1 encoding of the subject, instead of taking the charset from Content-Type header, even when the latter is available.

I am not sure how this corresponds to RFC 822, but the fact of the matter is that a message whose body is not in ISO 8859-1 is highly unlikely to intentionally use non-ASCII ISO 8859-1 characters in an unencoded Subject header. However, it is quite likely to get, e.g., KOI8-R messages with an unencoded Subject header in KOI8-R from Russian correspondents.
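
To make the symptom concrete, here is a minimal sketch (not Claws Mail code; latin1_to_utf8 is a hypothetical helper) of what happens when raw KOI8-R bytes are decoded as ISO 8859-1. Every byte value is a valid ISO 8859-1 character, so the conversion cannot fail and the wrong guess goes unnoticed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decode a NUL-terminated byte string as ISO 8859-1
 * and return it as newly allocated UTF-8 (caller frees).
 * Each ISO 8859-1 byte maps to the Unicode code point of the same value,
 * so bytes >= 0x80 always become a two-byte UTF-8 sequence. */
char *latin1_to_utf8(const char *in)
{
    size_t len = strlen(in);
    char *out = malloc(2 * len + 1);   /* worst case: every byte doubles */
    if (!out)
        return NULL;

    char *q = out;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p < 0x80) {
            *q++ = (char)*p;
        } else {
            *q++ = (char)(0xC0 | (*p >> 6));
            *q++ = (char)(0x80 | (*p & 0x3F));
        }
    }
    *q = '\0';
    return out;
}
```

For instance, the KOI8-R bytes F0 D2 C9 D7 C5 D4 (the word "Привет") come out as "ðÒÉ×ÅÔ" when treated as ISO 8859-1.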
Comment 1 Holger Berndt 2010-04-17 15:13:34 UTC
I don't see how this is a Claws Mail bug. The Content-Type header refers to the message body, not the headers. This is at most an enhancement request for a workaround to deal with broken messages.
Comment 2 Michael Orlov 2010-04-17 18:40:28 UTC
OK, I wrote that I don't know what RFC-822 says about header encoding. Looking at Content-Type header for that purpose (if it's available, which is not always the case of course - for multi-part messages, for example) would be a highly desirable workaround, however, and I don't think that there are real scenarios where such a workaround could be applied incorrectly.

Actually, looking at RFC 822 now - headers are assumed to be ASCII, so non-ASCII headers fall under undefined behavior. It's probably better to use an educated guess in such cases (i.e., the Content-Type header) than just ISO 8859-1.
Comment 3 Colin Leroy 2010-04-17 18:54:58 UTC
Over the years, we've found that guessing the charset using the user's locale was more often valid than using the Content-Type header. That's why we do it this way :)
Comment 4 Michael Orlov 2010-04-17 19:08:02 UTC
Interesting... That's probably the best way for most users indeed (granted that broken mailers are probably used for mostly in-country correspondence).

But there is something strange - my LANG is set to en_US.UTF8. Yet, the broken messages I have in KOI8-R have headers shown in summary and message views using ISO 8859-1. I checked by pasting the Subject headers to "iconv -t iso8859-1 | iconv -f koi8-r" in console - this shows the original header.

Is it possible that there is some fallback to ISO 8859-1 or latin1 in the code that attempts to parse the header using the user's locale? If this is the case, the fallback could be changed to Content-Type's charset (and if that fails, latin1).
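
The proposed fallback order could look roughly like the sketch below. It is an illustration only, under two assumptions: the user's locale is UTF-8 (so "the locale attempt succeeds" is approximated by a simplified UTF-8 well-formedness check), and the Content-Type charset has already been extracted by the caller. Both function names are hypothetical and are not taken from codeconv.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified UTF-8 well-formedness check: validates lead bytes and
 * continuation bytes, but does not reject every overlong form or
 * surrogate code point. */
static bool is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        int extra;
        if (*p < 0x80)
            extra = 0;
        else if (*p >= 0xC2 && *p <= 0xDF)
            extra = 1;
        else if (*p >= 0xE0 && *p <= 0xEF)
            extra = 2;
        else if (*p >= 0xF0 && *p <= 0xF4)
            extra = 3;
        else
            return false;
        p++;
        while (extra-- > 0) {
            /* A NUL terminator fails this test, so we never read past it. */
            if ((*p & 0xC0) != 0x80)
                return false;
            p++;
        }
    }
    return true;
}

/* Hypothetical: choose the charset used to decode a raw, unencoded header.
 * Order: UTF-8 locale, then the Content-Type charset, then ISO 8859-1. */
const char *pick_header_charset(const char *raw, const char *ct_charset)
{
    if (is_valid_utf8(raw))
        return "UTF-8";
    if (ct_charset != NULL && *ct_charset != '\0')
        return ct_charset;
    return "ISO-8859-1";
}
```

With this order, a KOI8-R subject in a message declaring charset=koi8-r would be decoded as KOI8-R, and ISO 8859-1 is used only when no charset is declared.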
Comment 5 Michael Orlov 2010-04-17 19:27:25 UTC
Regarding the fallback, it seems that this is indeed more or less what happens, according to conv_localetodisp() in codeconv.c - ".UTF8" is stripped from the locale if the initial conversion attempt fails, if I understand the code correctly.
Comment 6 Colin Leroy 2010-04-17 20:41:00 UTC
Yes (see the utf8_instead_of_locale_for_broken_mail=0 hidden preference about that).
Comment 7 Holger Berndt 2010-04-17 21:09:08 UTC
> Actually, looking at RFC-822 now - headers are assumed being ASCII, so
> non-ASCII headers fall under undefined behavior

That they're not in rfc 822 (or rather in 2822) doesn't make them undefined. In fact, none of the MIME stuff is in 2822. Non-ASCII header text is described in rfc 2047.
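
For reference, an RFC 2047 encoded-word has the shape =?charset?encoding?encoded-text?=, where encoding is Q or B. The sketch below is a rough, simplified detector for that shape (has_encoded_word is a hypothetical name, and the check is looser than the RFC grammar); a header with 8-bit bytes and no such token is the "broken" case discussed in this bug.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: return true if s contains something shaped like an
 * RFC 2047 encoded-word: "=?" charset "?" (Q|B) "?" text "?=".
 * Simplified: it does not validate the charset token, the encoded text,
 * or the 75-character length limit. */
bool has_encoded_word(const char *s)
{
    for (const char *p = strstr(s, "=?"); p; p = strstr(p + 1, "=?")) {
        const char *cs_end = strchr(p + 2, '?');    /* end of charset */
        if (!cs_end || cs_end == p + 2)
            continue;                               /* missing or empty charset */

        char enc = cs_end[1];
        if (enc != 'Q' && enc != 'q' && enc != 'B' && enc != 'b')
            continue;
        if (cs_end[2] != '?')
            continue;

        if (strstr(cs_end + 3, "?="))               /* closing delimiter */
            return true;
    }
    return false;
}
```

A properly encoded KOI8-R subject such as =?KOI8-R?B?8NLJ18XU?= passes this check, while the same six bytes sent raw do not.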
Comment 8 Andrey Gursky 2012-09-16 21:08:17 UTC
If the received message has a Subject field that is not properly encoded, so that its charset is not stated, the charset has to be guessed. The best source for that guess is the Content-Type header: if it names a charset, it is reasonable to assume that the same charset was also used for the subject. That charset should therefore be passed as default_encoding in all subsequent calls to conv_unmime_header(..., const gchar *default_encoding, ...). The argument exists for exactly this purpose, and it is currently ignored entirely.

PLEASE, make use of it.

For example, this is how it is done in Sylpheed:
procheader.c:
MsgInfo *procheader_parse_stream(FILE *fp, MsgFlags flags, gboolean full)
...
	gchar *charset = NULL;
...
		case H_SUBJECT:
			if (msginfo->subject) break;
			subject = g_strdup(hp);
			break;
...
		case H_CONTENT_TYPE:
			if (!g_ascii_strncasecmp(hp, "multipart", 9)) {
				MSG_SET_TMP_FLAGS(msginfo->flags, MSG_MIME);
			} else if (!charset) {
				procmime_scan_content_type_str
					(hp, NULL, &charset, NULL, NULL);
			}
			break;
...
	if (subject) {
		msginfo->subject = conv_unmime_header(subject, charset);
		subst_control(msginfo->subject, ' ');
		g_free(subject);
	}
...
----------------
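
The excerpt above relies on procmime_scan_content_type_str() to pull the charset out of the Content-Type value. As a standalone illustration of that step, here is a minimal sketch in plain C. It is not the Sylpheed or Claws Mail implementation: content_type_charset and has_prefix_ci are hypothetical names, and the parser ignores comments, RFC 2231 parameter encoding, and semicolons inside quoted values.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive test for whether s starts with prefix. */
static int has_prefix_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Hypothetical helper: return a newly allocated copy of the charset
 * parameter of a Content-Type header value, or NULL if there is none.
 * The caller frees the result. */
char *content_type_charset(const char *value)
{
    const char *p = strchr(value, ';');
    while (p) {
        p++;                                    /* skip ';' */
        while (*p == ' ' || *p == '\t')
            p++;

        if (has_prefix_ci(p, "charset=")) {
            const char *end;
            p += strlen("charset=");
            if (*p == '"') {                    /* quoted value */
                p++;
                end = strchr(p, '"');
                if (!end)
                    end = p + strlen(p);
            } else {                            /* bare token */
                end = p;
                while (*end && *end != ';' && *end != ' ' && *end != '\t')
                    end++;
            }

            size_t n = (size_t)(end - p);
            if (n == 0)
                return NULL;
            char *out = malloc(n + 1);
            if (!out)
                return NULL;
            memcpy(out, p, n);
            out[n] = '\0';
            return out;
        }
        p = strchr(p, ';');                     /* next parameter */
    }
    return NULL;
}
```

The returned string is what would then be handed over as the default encoding when decoding an unencoded Subject header.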

Thanks,
Andrey
Comment 9 Andrey Gursky 2012-09-18 16:27:49 UTC
Is it not possible to reopen this bug?
Comment 10 Michael Orlov 2012-09-19 10:34:42 UTC
Reopened, as requested.
