Re: poll() in 2.6 and beyond

linux_at_horizon.com
Date: 03/03/04

  • Next message: Mike Fedyk: "bad: scheduling while atomic in nfs with 2.6.3"
    Date:	3 Mar 2004 03:06:04 -0000
    To: linux-kernel@vger.kernel.org, root@chaos.analogic.com
    
    

    > I'm talking about the driver! When a open fd called poll() or select(),
    > in user-mode code, the driver's poll() was called, and the driver's poll()
    > would call poll_wait(). Poll_wait() used to NOT return until the driver
    > executed wake_up_interruptible() on that wait-queue. When poll_wait()
    > returned, the driver would return to the caller with the new poll-
    > status.

    poll_wait has ALWAYS, since it was introduced during the select/poll
    changeover in 2.1 development, been a non-sleeping, immediately returning
    function.

    Its predecessor, select_wait(), has been a non-sleeping function since 1.0.

    > So, if the poll_wait isn't a wait-function, but just some add-wakeup
    > to the queue function, then its name probably should have been
    > changed when it changed. At one time it did, truly, wait until
    > it was awakened with wake_up_interruptible.

    There is no "when it changed". It has never changed. Go look at
    the 2.2.20 code on http://lxr.linux.no/.

        ***************************************************************
        * *
        * poll_wait: *
        * - DOES NOT sleep. *
        * - NEVER HAS slept, in any kernel version, EVER. *
        * - WOULD NOT WORK if it did sleep, for reasons that are *
        * so BLATANTLY OBVIOUS that arguing about it after it's *
        * been REPEATEDLY pointed out is a sign that the person *
        * arguing needs to go and visit the rest home with those *
        * nice, young men in their clean, white coats. *
        * *
        ***************************************************************

    It has always, since select_wait in Linux 1.0, been nothing more than
    an "add-wakeup-to-the-queue" function. The last time the code changed
    significantly was the select/poll changeover in 2.1.x, and even then it
    was very similar.

    When a particular filp's poll method is called, there are two things
    that have to get done:
    1) Check if the wakeup conditions are already satisfied
       (the "no-wait" case), and
    2) Schedule the task for wakeup when the filp's condition changes
       (the "wait" case)

    Now, 2) only has to be done if 1) fails, but we can't do things in that
    order because there's a race condition. If the condition changes between
    the two steps, but doesn't change after that, we'll never wake up.

    So we have to do 2), and THEN check for 1). This is the fundamental
    race condition of sleeping until a condition becomes true, so anyone who
    entertains the remotest chance of ever writing functioning kernel code
    should be excruciatingly familiar with it, and its solution.
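
    To make the ordering argument concrete, here is a small userspace
    sketch. It is not kernel code: struct source, struct waiter and all
    three functions are made-up names for illustration, and "sleeping" is
    modelled only by the return value. The fire_in_gap flag makes the
    event arrive exactly between the two steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a sleeping task and the thing it waits on. */
struct waiter { int woken; };
struct source { int ready; struct waiter *registered; };

/* "Driver side": the condition becomes true and whoever is registered
 * at this moment gets woken. Nobody registered means nobody is woken. */
static void source_fire(struct source *s)
{
    s->ready = 1;
    if (s->registered)
        s->registered->woken = 1;
}

/* Broken order: 1) check, then 2) register.
 * Returns 1 if the task has seen the event or been woken,
 * 0 if it goes to sleep with no wakeup delivered. */
int check_then_register(struct source *s, struct waiter *w, int fire_in_gap)
{
    assert(s && w);
    if (s->ready)
        return 1;
    if (fire_in_gap)
        source_fire(s);         /* no waiter registered yet: wakeup lost */
    s->registered = w;
    return w->woken;
}

/* Correct order: 2) register, then 1) check. Same return convention. */
int register_then_check(struct source *s, struct waiter *w, int fire_in_gap)
{
    assert(s && w);
    s->registered = w;
    if (fire_in_gap)
        source_fire(s);         /* waiter already on the queue */
    if (s->ready)
        return 1;
    return w->woken;
}
```

    With fire_in_gap set, check_then_register() returns 0 (the task would
    sleep forever unless the condition changes again), while
    register_then_check() returns 1.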

    For the terminally dim, everyone hold hands and follow me. Remember
    when we read Kernighan & Ritchie aloud and don't get lost...

    In the 2.2.20 kernel, poll_wait is defined (include/linux/poll.h) as:

    struct poll_table_entry {
            struct file * filp;
            struct wait_queue wait;
            struct wait_queue ** wait_address;
    };

    typedef struct poll_table_struct {
            struct poll_table_struct * next;
            unsigned int nr;
            struct poll_table_entry * entry;
    } poll_table;

    #define __MAX_POLL_TABLE_ENTRIES ((PAGE_SIZE - sizeof (poll_table)) / sizeof (struct poll_table_entry))

    extern void __pollwait(struct file * filp, struct wait_queue ** wait_address, poll_table *p);

    extern inline void poll_wait(struct file * filp, struct wait_queue ** wait_address, poll_table *p)
    {
            if (p && wait_address)
                    __pollwait(filp, wait_address, p);
    }

    This is a trivial wrapper around __pollwait(). Nothing else in the
    function could possibly take more than a few clock cycles.

    __pollwait is defined (fs/select.c) as:

    void __pollwait(struct file * filp, struct wait_queue ** wait_address, poll_table *p)
    {
            for (;;) {
                    if (p->nr < __MAX_POLL_TABLE_ENTRIES) {
                            struct poll_table_entry * entry;
                            entry = p->entry + p->nr;
                            entry->filp = filp;
                            filp->f_count++;
                            entry->wait_address = wait_address;
                            entry->wait.task = current;
                            entry->wait.next = NULL;
                            add_wait_queue(wait_address,&entry->wait);
                            p->nr++;
                            return;
                    }
                    p = p->next;
            }
    }

    This does a little bit of bookkeeping and calls add_wait_queue().
    Nothing else in the function could possibly take more than a few dozen
    clock cycles.
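
    For anyone who wants to poke at this outside the kernel, the
    bookkeeping can be modelled in plain userspace C. Everything named
    model_* below is a made-up stand-in, not a kernel API: the wait queue
    is reduced to a counter, MODEL_MAX_ENTRIES replaces
    __MAX_POLL_TABLE_ENTRIES, and unlike the original loop the model stops
    at the end of the table chain instead of assuming a free slot exists.
    The last function sketches how a driver's poll method would use it:
    register first, then check.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ENTRIES 4     /* stand-in for __MAX_POLL_TABLE_ENTRIES */
#define MODEL_POLLIN 0x0001     /* stand-in for a "readable" poll bit */

/* Hypothetical wait queue: only counts how many waiters were added. */
struct model_waitq { int nr_waiters; };

struct model_entry { struct model_waitq *wait_address; };

struct model_table {
    struct model_table *next;
    unsigned int nr;
    struct model_entry entry[MODEL_MAX_ENTRIES];
};

/* Same shape as __pollwait(): find a table with room, record the
 * queue, bump the count, return. Nothing here can block. */
void model_pollwait(struct model_waitq *q, struct model_table *p)
{
    assert(q != NULL);
    for (; p != NULL; p = p->next) {
        if (p->nr < MODEL_MAX_ENTRIES) {
            p->entry[p->nr].wait_address = q;
            q->nr_waiters++;    /* stands in for add_wait_queue() */
            p->nr++;
            return;
        }
    }
}

/* Same shape as poll_wait(): a NULL table means "do not register". */
void model_poll_wait(struct model_waitq *q, struct model_table *p)
{
    if (p && q)
        model_pollwait(q, p);
}

/* Hypothetical device with one wait queue and one readiness flag. */
struct model_dev { struct model_waitq wq; int data_ready; };

/* Driver poll method: register for wakeup, then report current state. */
unsigned int model_dev_poll(struct model_dev *dev, struct model_table *p)
{
    unsigned int mask = 0;

    model_poll_wait(&dev->wq, p);
    if (dev->data_ready)
        mask |= MODEL_POLLIN;
    return mask;
}
```

    Calling model_dev_poll() on an idle device returns 0 straight away
    with one waiter registered. Any sleeping is left to the caller, which
    is the whole point.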

    Now look at add_wait_queue(), which is defined (include/linux/sched.h) as
    nothing more than calls to three other functions:

    extern inline void __add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
    {
            wait->next = *p ? : WAIT_QUEUE_HEAD(p);
            *p = wait;
    }

    extern rwlock_t waitqueue_lock;

    extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
    {
            unsigned long flags;

            write_lock_irqsave(&waitqueue_lock, flags);
            __add_wait_queue(p, wait);
            write_unlock_irqrestore(&waitqueue_lock, flags);
    }

    write_lock_irqsave() and write_unlock_irqrestore() can get a little
    bit complicated, so pay careful attention.

    We're going to consider just the i386, non-SMP case. SMP locking is
    for big boys and girls who did all their homework and got As on their
    tests.

    They are defined (include/asm-i386/spinlock.h) as:
    #define write_lock_irqsave(lock, flags) \
            do { save_flags(flags); cli(); } while (0)
    #define write_unlock_irqrestore(lock, flags) \
            restore_flags(flags)

    These, in turn, go through some wrappers in include/asm-i386/system.h:
    #define cli() __cli()
    #define sti() __sti()
    #define save_flags(x) __save_flags(x)
    #define restore_flags(x) __restore_flags(x)

    and end up calling the primitive assembly routines in the same file:

    /* interrupt control.. */
    #define __sti() __asm__ __volatile__ ("sti": : :"memory")
    #define __cli() __asm__ __volatile__ ("cli": : :"memory")
    #define __save_flags(x) \
            __asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */ :"memory")
    #define __restore_flags(x) \
            __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory")

    These are one or two machine instructions each.

    Finally, __add_wait_queue, above, is just two assignments. Just to
    be absolutely sure we've explored every function-like thing that
    the most deluded paranoid could think might sleep, I'll mention that
    WAIT_QUEUE_HEAD is defined in include/linux/wait.h as:

    #define WAIT_QUEUE_HEAD(x) ((struct wait_queue *)((x)-1))

    This, also, cannot possibly sleep.

    To recap, looking back in the wayback machine to 2.2.20, the complete
    call/macro graph of poll_wait is:

    poll_wait
        __pollwait
            add_wait_queue
                write_lock_irqsave
                    save_flags
                        __save_flags
                    cli
                        __cli
                __add_wait_queue
                    WAIT_QUEUE_HEAD
                write_unlock_irqrestore
                    restore_flags
                        __restore_flags

    and NONE OF THESE FUNCTIONS SLEEP. Therefore, poll_wait DID NOT USED TO
    SLEEP.

    2.4 and later allocate the poll_table pages on demand (with GFP_KERNEL)
    rather than preallocating everything, so *that* part can sleep, but it
    doesn't depend on the filp being polled, and the rest CANNOT.

