Bug 1434 - Cache loses sync when mail delivered directly by procmail
Summary: Cache loses sync when mail delivered directly by procmail
Status: NEW
Alias: None
Product: Claws Mail (GTK 2)
Classification: Unclassified
Component: Plugins/mailMBOX (show other bugs)
Version: 3.2.0
Hardware: PC Linux
: P3 major
Assignee: users
URL:
Depends on:
Blocks:
 
Reported: 2007-12-21 18:19 UTC by Cindy Matthews
Modified: 2010-01-29 20:05 UTC (History)
0 users

See Also:


Attachments

Description Cindy Matthews 2007-12-21 18:19:03 UTC
My received email is delivered from the MTA via procmail into a tree of
folders in my home directory which I NFS mount. Claws mail supports this,
but for one (serious) problem; the folders and cache files get horribly out
of sync, eventually resulting in new messages not being reported, incorrect
message counts, deleted messages reappearing and messages in the message list
window not corresponding in the slightest with those presented in the message
pane. The only solution is to close CM, blow away the caches and restart.
Clearly this is only possible on small folders.

This bug exists in 2.7.2 (with 2.7.2's plugins), 3.1.0 (with its plugins)
and 3.2.0 (with its plugins). This problem affects mbox folders (via the mailmbox plugin) and MH folders (when delivery is via nmh's rcvstore) equally.
I describe below how to set up CM to make it misbehave repeatably.

Configuration settings:

 1. Start CM.

 2. Unload all the plugins apart from mailmbox. This is important as it
    stops corruption accumulating and makes this demo easier.

 3. Turn off automatic mailbox checking so you can see what is going on.

 4. Turn on Execute immediately when moving or deleting messages.

 5. Add an et-Pan! mbox mailbox.

 6. Add a folder inside that mailbox.

 7. Copy in some messages.

 8. Read all the messages with CM.

 9. Click on another folder in the folder list.

10. Set up procmail so it can deliver to that folder. As an alternative you
    could simply cat some (well-formed) messages to the end of the folder
    when you need to.

Now for the problem itself:

11. Select the mbox folder again. You will notice that nothing has changed.
    This is good.

12. In a window, run up mailx and examine the mailbox folder. You will notice
    that the file looks good.

13. Exit mailx with x. This ensures that mailx does not modify the file nor
    update the modtime.

14. If you have set up direct delivery to a mailbox folder with procmail (or
    some such), send yourself a mail, eg. "date | mailx -s test foo". If
    you've skipped the set-up, just cat a (valid and unique) message onto
    the end of the file.

15. Start up mailx again and examine the folder. You will notice that the
    folder still looks good. It contains one new message.

16. Exit mailx with x.

17. Back in CM, press Check for new messages in all folders. 1 new message
    should be noted in the folder list pane. The message list pane should
    be blanked too. I am not sure why this should be so, but if it is a bug,
    perhaps someone should look into this as well.

18. Click on another folder. This is nothing to do with exemplifying the bug
    either, but it seems to be a necessary step to get CM to respond to
    step 19. Perhaps someone should look into this one too.

19. Click on the mbox folder again to bring up the message list pane.

20. You will notice a few read messages and one message marked as new. This
    is good.

21. Click on the message's entry on the message list to show the message.
    It shows the correct message text. This too is good.

22. Press delete to delete the message.

23. Without doing anything further with the GUI, start up mailx again and
    examine the folder. You will notice that the folder has not been updated.
    It still contains one new message.

24. Exit mailx with x.

25. Now repeat step 14 to insert another message into the folder.

26. Using mailx will confirm the mail has gone in correctly. mailx should
    show two new messages in the folder.

27. Exit mailx with x.

28. Repeat steps 17, 18 and 19. This time, you will notice two messages
    marked as new. Clearly this is wrong.

However, the situation can be corrected by doing this:

29. Delete the two 'new' messages in CM.

30. Click on another folder.

31. Examine the folder with mailx and you will see that both messages have been
    correctly deleted.

The folder will then operate as expected.

The critical point is that if a user opens a folder with CM, modifies it,
perhaps by reading messages or deleting them and then stays looking at that
folder whilst an external delivery programme happens to deliver a new message
CM simply throws away its changes to the folder when the user moves off that
folder.

This only becomes apparent when the user next checks for new mail and finds
that the changes from the last time have not been stored. On a large and active
folder this is quite a problem.

However, the situation becomes really ugly if one is silly enough to use the
cachesaver plugin. When this happens the cache does get saved from time to
time, but the mailboxes don't get updated. This results in CM's notion of
where messages are in the folder becoming progressively more distorted over
time. On a large and active folder one quickly finds that clicking on one
message's summary line throws up a completely different message's content.

Possible resolutions are, (at least for mbox):

1. Hold a lock on the folder whilst CM has the folder open, thus preventing
   procmail from delivering to that folder. This stinks because procmail
   sessions will back up on the server. This will be a real problem if
   automatic checking for new messages is switched off too.

2. When leaving a folder and getting ready to apply updates, try to
   re-synchronise the cache with the underlying folder. A lock on the folder
   would need to be in place during this process.

NB. Although I have claimed that this problem affects MH as much as mailmbox
I have not tested this scientifically. In particular I have not tested MH
without cachesaver.
Comment 1 Paul 2007-12-22 10:49:51 UTC
Did you try using the 'Update all local folders' option in the 'After receiving new mail...' frame of the Mail Handling/Receiving page of the Prefs, and the 'Scan for new mail' option on the General page of the Folder Properties?

(Forget the cachesaver plugin, it's more-or-less a relic)
Comment 2 Cindy Matthews 2007-12-23 15:57:14 UTC
Dear Paul,

Thank you for responding so quickly.

I did a little testing on this issue and yesterday I was confident that
enabling 'Update all local folders' and enabling 'Scan for new mail' (which
was on previously anyway) had done the trick. However I left CM running
overnight and it lost sync again. It managed to display a completely
different message when I clicked on the subject line of a new message so
it's back to square one. The cachesaver plugin remains unloaded.

This set me thinking about why this might happen, and after a little
bit of investigation I think I have found something quite interesting. It all
seems to come down to the order in which changes to CM and the mailbox happen.
Let me explain.

If I have a box called Deposits into which procmail drops new messages
and another box called Irrelevant and I perform the following operations
in the order below I can show what is going on. I've used two columns to
show this more clearly.

This sequence in this column works        This sequence in this column doesn't
------------------------------------------------------------------------------

Press Inbox                               Press Inbox

Deposit a message via procmail            Deposit a message via procmail

Press Check for new messages,             Press Check for new messages,
which shows (1)                           which shows (1)

Press Irrelevant                          Press Irrelevant

Press Inbox(1)                            Press Inbox(1)

Deposit a message via procmail            Delete the new message in CM

Delete the new message in CM              Deposit a message via procmail            

Press Check for new messages,             Press Check for new messages,
which shows (1)                           which shows (2)


What matters is the relative timing of deletions and deposits. Unlike
those CM users who collect their mail via IMAP (or some such) I have no
control over when deposits are made, which is perhaps why this bug has
remained unreported until now.

It makes no difference whether 'Update all local folders' is selected or not.

It seems to me that the mailbox merging code works just fine but there is
a deficiency in a check before merging which cannot cope with the notion that
a message might be added to a mailbox after a change has been made in CM's
cache.

I wonder whether this is just true of mailmbox or whether it applies to
MH as well.
Comment 3 Cindy Matthews 2008-02-22 16:34:17 UTC
The bug also affects CM 3.3.0 with CM plugins 3.3.0 and libetpan 0.54.
Comment 4 Cindy Matthews 2010-01-29 20:05:42 UTC
I have finally diagnosed the problem, or so I believe. It goes like this:

CM keeps a 'cache' of the messages in the mailbox file. It keeps two sorts of caches: a set of flags for each message indicating such things as whether a message is unread, and; a duplicate of the content of each message.

When another programme alters the mailbox file CM must update its cache to reflect the changes to the mailbox. This is where the problems set in. The problem is not that CM keeps a cache of the messages per-se, but that it keeps a cache of the /content/ of the messages.

Each message, no matter where it originated from, or how easily accessible it is, is stored in its own separate numbered cache file. For a single locally-accessible mbox file containing 10,000 messages, the filesystem must cope with 10,001 files, 10,000 of which are verbatim copies of the original data. Apart from the obvious inefficiencies of this approach (think I/O time, the time spent syncing directory operations and the time taken to dump the filesystem) having 10,000 cached copies of slices of the mailbox is asking for things to slip out of sync.

The basic problem is that once a message is deleted from a folder, the 'flag cache' (the .claws_* files) is updated (first in memory, and later on disc), and (with my expunge patch) so is the mailbox itself, but the cache file is not deleted. Seen in isolation this is not a problem, but at a later date when a new message is plonked into the mailbox by a proper, old-school programme, CM makes a mess of updating its caches.

If CM concludes that a new message deposited into the mailbox by a proper programme is to be given the same message number as a message which used to be in the mailbox file, the flags cache will be updated correctly, but s_claws_mailmbox_fetch_msg() will conclude that the old, undeleted cache file from the earlier message is a perfectly satisfactory cached copy of the new message. It is this which results in CM displaying the wrong content for a message.

How likely is this scenario? In my experience with over a million messages stored in mbox files and a turn-over of perhaps 1000-2000 messages per day across several folders, it happens several times every day. It happens particularly often when deleting or moving messages which have been delivered only recently. My solution up until now has been to stop CM and blow away the entire cache (apart from the .claws_mark files) every time this happens, but some respite can be gained by blowing away the requisite files containing the cached content whilst CM is running.

So, how to fix it. At first I was tempted to delete the cached content files when the real messages are deleted, but this is complicated somewhat by the message moving code. Someone more familiar with this code should look into it before deleting these files.

The easiest route to take at this stage is simply to delete the cached files in s_claws_mailmbox_fetch_msg() before re-creating them lower down in the function. Although this is not efficient, it is at least reliable. Simply replacing

        if (is_file_exist(file)) {
             ....
        }

with

        /* Bug 1434
         */
        unlink(file);

does the trick.

Then we need to convince CM to actually delete the emails from the mailbox when it is told to do so. We do this by calling claws_mailmbox_expunge() when removing a message, like so:

static gint claws_mailmbox_remove_msg(Folder *folder, FolderItem *item, gintnum)
{
        struct claws_mailmbox_folder * mbox;
        int r;

        g_return_val_if_fail(item != NULL, -1);
        mbox = get_mbox(item, 0);
        if (mbox == NULL)
                return -1;
        r = claws_mailmbox_delete_msg(mbox, num);

        /* Bug 1434
         */
        claws_mailmbox_expunge(mbox);

        if (r != MAILMBOX_NO_ERROR)
                return -1;
        return 0;
}

This keeps the mailbox file in sync with the GUI. It is vital to do this or CM will mis-recognise the messages in the mailbox themselves and then corrupt its flags cache as well. This can cause CM to delete different messages from a mailbox to the ones you intend. This same call must be added to the end of claws_mailmbox_remove_all_msg() as well.

One further bug in the same vein is the fact that CM can't delete a message from an mbox file containing just one message. This in itself causes messages to be resurected. The fix is inside claws_mailmbox_expunge_to_file_no_lock() as follows....

  r = ftruncate(dest_fd, size);
  if (r < 0) {
    res = MAILMBOX_ERROR_FILE;
    goto err;
  }

    /* Bug 1434.
     */
    if( size==0 )
    {
        goto good;
    }

  dest = (char *) mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, dest_fd, 0);
  if (dest == MAP_FAILED) {
    res = MAILMBOX_ERROR_FILE;
    goto err;
  }

....followed by....

  fflush(stdout);

  msync(dest, size, MS_SYNC);
  munmap(dest, size);

    /* Bug 1434.
     */
good:
    fflush(stdout);
  * result_size = size;
  return MAILMBOX_NO_ERROR;

 err:
  return res;
}

It comes about because it is not possible to mmap() a zero-length block of memory without error.

These fixes are mature and are ported to 3.7.4.

Note You need to log in before you can comment on or make changes to this bug.