Re: [RFC v6][PATCH 0/9] Kernel based checkpoint/restart





Cedric Le Goater wrote:
Ingo Molnar wrote:
* Dave Hansen <dave@xxxxxxxxxxxxxxxxxx> wrote:

On Thu, 2008-10-09 at 15:44 +0200, Ingo Molnar wrote:
there might be races as well, especially with proxy state - and
current->flags updates are not serialized.

So maybe it should be a completely separate flag after all? Stick it
into the end of task_struct perhaps.
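
For illustration only, a user-space sketch of what a separate, atomically updated flag word could look like. C11 atomics stand in for the kernel's atomic bit operations here; `struct task_sketch`, `ckpt_flags`, `CKPT_FLAG_SKETCH` and both helpers are hypothetical names, not taken from any posted patch.

```c
#include <stdatomic.h>

/* Hypothetical bit; the thread does not name the actual flag. */
#define CKPT_FLAG_SKETCH (1u << 0)

/* Stand-in for task_struct: 'flags' mimics current->flags, which is
 * updated with plain read-modify-write, while 'ckpt_flags' is a separate
 * word that any task may update atomically. */
struct task_sketch {
    unsigned int flags;
    atomic_uint  ckpt_flags;
};

/* Safe to call from another task: the OR is one atomic operation, so a
 * concurrent update to a different bit cannot be lost. */
static inline void ckpt_flag_set(struct task_sketch *t, unsigned int bit)
{
    atomic_fetch_or(&t->ckpt_flags, bit);
}

/* Atomically clears the bit and reports whether it was set before. */
static inline int ckpt_flag_test_and_clear(struct task_sketch *t,
                                           unsigned int bit)
{
    return (atomic_fetch_and(&t->ckpt_flags, ~bit) & bit) != 0;
}
```

The point of the separate word is that a task other than the owner can set a bit without racing against the owner's unserialized updates to its own flags.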
What do you mean by proxy state? nsproxy?
it's a concept: one task installing some state into another task (which
state must be restored after a checkpoint event), while that other task
is running. Such as a pi-futex state for example.

So a task can acquire state not just by its own doing, but via some
other task too.

thinking aloud,

hmm, that's rather complex, because we have to take into account the
kernel stack, no ? This is what Andrey was trying to solve in his patchset
back in September :

http://lkml.org/lkml/2008/9/3/96

the restart phase simulates a clone and switch_to to (not) restore the kernel
stack. right ?

the self checkpoint and self restore syscalls, like Oren is proposing, are
simpler but they require the process's cooperation to be triggered. we could
imagine doing that in a special signal handler which would allow us to jump
into the right task context.

This description is not accurate:

For checkpoint, both implementations use an "external" task to read the state
from other tasks. (In my implementation that "other" task can be self).

For restart, both implementations expect the restarting process to restore its
own state. They differ in that Andrey's patchset also creates that process
while mine (at the moment) relies on the existing (self) task.

In other words, neither of them will require any cooperation on the part of the
checkpointed tasks, and both will require cooperation on the part of the restarting
tasks (the latter is easy since we create and fully control these tasks).


I don't have any preference but looking at the code of the different patchsets
there are some tricky areas and I'm wondering which path is easier, safer,
and portable.

I am wondering which path is preferred: create the processes in kernel space
(like Andrey's patch does) or in user space (like Zap does). In the mini-summit
we agreed in favor of kernel space, but I can still see arguments why user space
may be better. (note: I refer strictly to the creation of the processes during
restart, not how their state is restored).

any thoughts ?

Oren.
