Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

From: James Bottomley (James.Bottomley_at_SteelEye.com)
Date: 04/06/05

  • Next message: Kyle Moffett: "Re: Use of C99 int types"
    To: Jens Axboe <axboe@suse.de>
    Date:	Wed, 06 Apr 2005 17:09:59 -0400
    
    

    On Wed, 2005-04-06 at 21:08 +0200, Jens Axboe wrote:
    > > I think the correct model for all of this is that the block driver
    > > shouldn't care (or be tied to) the scsi one. Thus, as long as SCSI can
    > > reject requests from a queue whose device has been released (without
    > > checking the device) then everything is fine as long as we sort out the
    > > lock lifetime problem.
    >
    > But you are tying it completely to the SCSI problem, since we have no
    > locking problems of this sort elsewhere. At least the notified could
    > potentially be used for something else. The lock is just hack number two
    > to work around the problem.

    Then help me understand the problem as you see it.

    My current understanding is that these problems occur because we've
    mixed data in two objects of different lifetimes. So far, this is stack
    independent.

    My proposal is to correct this by moving the data back to the correct
    object, and make any object using it hold a reference, so this would
    make the provider of the block request_fn hold a reference to the queue
    and initialise the queue lock pointer with the lock currently in the
    queue. Drivers that still use a global lock would be unaffected. This
    also means that any provider of a request_fn may expect to process (and
    reject) requests for a period of time after blk_cleanup_queue().
    Really, this refcounting is inherent in blk_init_queue(),
    blk_cleanup_queue() so the only additional requirement is sanely
    processing queue requests after you think it's been cleaned up. This is
    request_fn() provider independent, I think (providers who currently
    don't violate the lifetime rules don't need fixing).

    I claim that this proposal solves the current problem, and any other
    ones we run across that occur because of data mixing.

    Your current patch tries to solve the problem by tying the two objects
    together; unifying the lifetimes. I agree this can be done (it was how
    the sd/sr close race was fixed). But the current way the freeing
    functions thread across the subsystems doesn't look nice and I don't
    currently see a way to get the queue freed: We free the queue on scsi
    device release; if the queue holds a reference to the scsi_device, the
    release function will never be called. Our only other choice is to try
    to free the queue on device_del instead, but that's too early, I think.

    James

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Kyle Moffett: "Re: Use of C99 int types"

    Relevant Pages

    • Re: Seeing lock order reversal
      ... 21 vm page queue free mutex -- ... 18 UMA zone -- ... 18 sleep mtxpool -- ... 15 process lock -- ...
      (freebsd-current)
    • Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler
      ... >> But you are tying it completely to the SCSI problem, ... The lock is just hack number two ... > make the provider of the block request_fn hold a reference to the queue ...
      (Linux-Kernel)
    • Re: location of bioq lock
      ... queue is owned by the driver, and the locking scheme remains the same. ... from a different scheduler than the default, it can be easily plugged in. ... process you have to lock each queue before playing with it. ...
      (freebsd-current)
    • Re: thread communication
      ... I changed this into a simple class which has a SyncRoot which I can Wait and Pulse, and then some code to process the queue. ... One thing I noticed in your code though, in your desire to move the actual processing outside of the lock, you have introduced an inefficiency in the synchronization. ... IFileProcessor processor = Queue.Dequeue; ... Especially with the above alternative, you could easily maintain a local Queue to which all of the public Queue elements are copied, and then just loop while unlocked until the local Queue is empty, adding the "ready" items back into the local Queue as you find them. ...
      (microsoft.public.dotnet.languages.csharp)
    • Re: thread-safety
      ... but that would depend on the implementation of your Queue class. ... lock { ... It's odd for you to be passing a WaitHandle of any sort to the Monitor class; if you've got a WaitHandle, you would normally just wait on _that_, rather than using the WaitHandle simply as a synchronization object for Monitor. ... It would be a problem to call Monitor.Pulseif no other thread is waiting at a call to Monitor.Wait, because then the waiting thread would miss the Pulse. ...
      (microsoft.public.dotnet.languages.csharp)