[PATCH] unregister_ioctl32_conversion and modules. ioctl32 revisited.

From: Michael S. Tsirkin (mst_at_mellanox.co.il)
Date: 12/17/04

  • Next message: Alan Cox: "Re: How to add/drop SCSI drives from within the driver?"
    Date:	Fri, 17 Dec 2004 03:43:45 +0200
    To: Andrew Morton <akpm@osdl.org>
    
    

    Hello, Andrew!
    Since I didnt get any negative comments on this (and some positive)
    Andi Kleen suggested I submit the following patch to you.
    It boots fine for me, please consider for mainline inclusion.

    Dependencies:
    The patch below is against 2.6.10-rc3. Please note it replaces
    Ingo's ->unlocked_ioctl patch from rc3-mm1, so you have to back that off
    before applying mine to mm:

    http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc3/2.6.10-rc3-mm1/broken-out/unlocked_ioctl.patch

    Please mail me directly with comments since I'm not on the list.

    Changes from the last version I posted ( http://lkml.org/lkml/2004/12/15/62 )
    - Two whitespace cleanups. I hope its all good now.
    - Arnd Bergmann's idea: make it possible to go back from ioctl_compat
      to hash lookup (good for static ioctl tables) by returning -ENOIOCTLCMD
      from ioctl_compat
    - Petr Vandrovec's idea: add HAVE_... macros to make it easier for
      out-of-kernel modules to detect the new file_operations.

    Description:
    The patch introduces two new methods (ioctl_native and ioctl_compat):
    ioctl_native is called on native ioctl syscall, without BKL being taken,
    and is, in that respect, equivalent to Ingo's unlocked_ioctl (which is
    why it conflicts).
    ioctl_compat is called on compat (i.e. 32 bit app on 64 bit OS) ioctl,
    again without BKL being taken.
    If a new call is not defined, default to the old behaviour.
    (It should be possible for me to build a patch that applies on top of Ingo's
    unlocked_ioctl, if its really needed let me know and I'll look at it the next
    week.)

    Motivation: Quoting Andi Kleen:
    > Hallo,
    >
    > There seems to be an unfixable module unload race in the current
    > register_ioctl32_conversion support. The problem is that
    > there is no way to wait for a conversion handler is blocked
    > in a sleeping *_user access before module unloading. The module
    > count is also not increase in this case.
    > ...

    [Snip]

    > A better solution would be to switch the few users of
    > register_ioctl32_conversion() over to a new ->ioctl32 method
    > in file_operations and do the conversion from there. This would
    > avoid the race because the VFS will take care of the module
    > count in open/release.
    >
    > Michael did a patch for this some time ago for a different motivation -
    > he had some benchmarks where the hash table lookup hurt and it was
    > noticeable faster to use a O(1) ->ioctl32 lookup from the file_operations
    > for his application.
    >
    > An useful side effect would be also to the ability to support
    > a per device ioctl name space. While the core kernel doesn't have
    > much (any?) ioctls with duplicated numbers this mistake seems
    > to be quite common in out of tree drivers and it is hard to
    > fix without breaking compatibility.
    >
    > And it would be faster for this case of course too, so even performance
    > critical in kernel ioctls could be slowly converted to ioctl32
    > I wouldn't do it for all, because the current central tables work
    > reasonably well for them and most ioctls are not speed critical
    > anyways.
    >
    > As for in kernel code it won't affect much code because near
    > all conversion handlers in the main tree are not modular (alsa
    > is one exception, there are a few others e.g. in some raid drivers).
    > I expect it will be a bigger problem in the future though as ioctl
    > emulation becomes more widespread and is done more in individual drivers.
    >
    > averell:lsrc/v2.6/linux% gid register_ioctl32_conversion | wc -l
    > 75
    > averell:lsrc/v2.6/linux%
    >
    > In tree users are alsa, aaraid, fusion, some s390 stuff, sisfb, alsa
    >
    > My proposal would be to dust off Michael's patch and convert
    > all users in tree over to ioctl32 and then deprecate and later remove
    > (un)register_ioctl32_conversion

    There was an additional motivation for my patch:
    As noted by Juergen Kreileder, the compat hash does not work
    for ioctls that encode additional information in the command, like this:

    #define EVIOCGBIT(ev,len) _IOC(_IOC_READ, 'E', 0x20 + ev, len)

    Comments?

    Thanks,
    MST

    Signed-off-by: Michael Tsirkin <mst@mellanox.co.il>

    diff -rup linux-2.6.10-rc3/Documentation/filesystems/Locking linux-2.6.10-rc3-mstioctl/Documentation/filesystems/Locking
    --- linux-2.6.10-rc3/Documentation/filesystems/Locking 2004-12-16 15:20:36.000000000 +0200
    +++ linux-2.6.10-rc3-mstioctl/Documentation/filesystems/Locking 2004-12-16 18:25:35.289970464 +0200
    @@ -350,6 +350,10 @@ prototypes:
             unsigned int (*poll) (struct file *, struct poll_table_struct *);
             int (*ioctl) (struct inode *, struct file *, unsigned int,
                             unsigned long);
    + long (*ioctl_native) (struct inode *, struct file *, unsigned int,
    + unsigned long);
    + long (*ioctl_compat) (struct inode *, struct file *, unsigned int,
    + unsigned long);
             int (*mmap) (struct file *, struct vm_area_struct *);
             int (*open) (struct inode *, struct file *);
             int (*flush) (struct file *);
    @@ -383,6 +387,8 @@ aio_write: no
     readdir: no
     poll: no
     ioctl: yes (see below)
    +ioctl_native: no (see below)
    +ioctl_compat: no (see below)
     mmap: no
     open: maybe (see below)
     flush: no
    @@ -428,6 +434,9 @@ move ->readdir() to inode_operations and
     anything that resembles union-mount we won't have a struct file for all
     components. And there are other reasons why the current interface is a mess...
     
    +->ioctl() on regular files is superceded by the ->ioctl_native() and
    +->ioctl_compat() pair. The lock is not taken for these new calls.
    +
     ->read on directories probably must go away - we should just enforce -EISDIR
     in sys_read() and friends.
     
    diff -rup linux-2.6.10-rc3/include/linux/fs.h linux-2.6.10-rc3-mstioctl/include/linux/fs.h
    --- linux-2.6.10-rc3/include/linux/fs.h 2004-12-16 15:20:46.000000000 +0200
    +++ linux-2.6.10-rc3-mstioctl/include/linux/fs.h 2004-12-16 18:25:35.291970160 +0200
    @@ -900,6 +900,12 @@ typedef struct {
     
     typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
     
    +/* These macros are for out of kernel modules to test that
    + * the kernel supports the ioctl_native and ioctl_compat
    + * fields in struct file_operations. */
    +#define HAVE_IOCTL_COMPAT 1
    +#define HAVE_IOCTL_NATIVE 1
    +
     /*
      * NOTE:
      * read, write, poll, fsync, readv, writev can be called
    @@ -915,6 +921,24 @@ struct file_operations {
             int (*readdir) (struct file *, void *, filldir_t);
             unsigned int (*poll) (struct file *, struct poll_table_struct *);
             int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    + /* The two calls ioctl_native and ioctl_compat described below
    + * can be used as a replacement for the ioctl call above. They
    + * take precedence over ioctl: thus if they are set, ioctl is
    + * not used. Unlike ioctl, BKL is not taken: drivers manage
    + * their own locking. */
    +
    + /* If ioctl_native is set, it is used instead of ioctl for
    + * native ioctl syscalls.
    + * Note that the standard glibc ioctl trims the return code to
    + * type int, so dont try to put a 64 bit value there.
    + */
    + long (*ioctl_native) (struct inode *, struct file *, unsigned int, unsigned long);
    + /* If ioctl_compat is set, it is used for a 32 bit compatible
    + * ioctl (i.e. a 32 bit binary running on a 64 bit OS).
    + * Return -ENOIOCTLCMD if you dont handle it.
    + * Note that only the low 32 bit of the return code are passed
    + * to the user-space application. */
    + long (*ioctl_compat) (struct inode *, struct file *, unsigned int, unsigned long);
             int (*mmap) (struct file *, struct vm_area_struct *);
             int (*open) (struct inode *, struct file *);
             int (*flush) (struct file *);
    diff -rup linux-2.6.10-rc3/include/linux/ioctl.h linux-2.6.10-rc3-mstioctl/include/linux/ioctl.h
    --- linux-2.6.10-rc3/include/linux/ioctl.h 2004-12-16 15:19:34.000000000 +0200
    +++ linux-2.6.10-rc3-mstioctl/include/linux/ioctl.h 2004-12-16 18:25:35.291970160 +0200
    @@ -3,5 +3,13 @@
     
     #include <asm/ioctl.h>
     
    +/* Handles standard ioctl commands, and returns the result in status.
    + Does nothing and returns non-zero if cmd is not one of the standard commands.
    +*/
    +
    +struct file;
    +int std_sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg,
    + struct file *filp, long *status);
    +
     #endif /* _LINUX_IOCTL_H */
     
    diff -rup linux-2.6.10-rc3/fs/ioctl.c linux-2.6.10-rc3-mstioctl/fs/ioctl.c
    --- linux-2.6.10-rc3/fs/ioctl.c 2004-12-16 15:20:45.000000000 +0200
    +++ linux-2.6.10-rc3-mstioctl/fs/ioctl.c 2004-12-16 18:27:21.000899960 +0200
    @@ -36,7 +36,9 @@ static int file_ioctl(struct file *filp,
                             if ((error = get_user(block, p)) != 0)
                                     return error;
     
    + lock_kernel();
                             res = mapping->a_ops->bmap(mapping, block);
    + unlock_kernel();
                             return put_user(res, p);
                     }
                     case FIGETBSZ:
    @@ -46,29 +48,21 @@ static int file_ioctl(struct file *filp,
                     case FIONREAD:
                             return put_user(i_size_read(inode) - filp->f_pos, p);
             }
    - if (filp->f_op && filp->f_op->ioctl)
    - return filp->f_op->ioctl(inode, filp, cmd, arg);
             return -ENOTTY;
     }
     
     
    -asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
    -{
    - struct file * filp;
    +EXPORT_SYMBOL(std_sys_ioctl);
    +int std_sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg,
    + struct file *filp, long *status)
    +{
             unsigned int flag;
    - int on, error = -EBADF;
    -
    - filp = fget(fd);
    - if (!filp)
    - goto out;
    + int on, error, unknown = 0;
     
             error = security_file_ioctl(filp, cmd, arg);
    - if (error) {
    - fput(filp);
    + if (error)
                     goto out;
    - }
     
    - lock_kernel();
             switch (cmd) {
                     case FIOCLEX:
                             set_close_on_exec(fd, 1);
    @@ -100,8 +94,11 @@ asmlinkage long sys_ioctl(unsigned int f
     
                             /* Did FASYNC state change ? */
                             if ((flag ^ filp->f_flags) & FASYNC) {
    - if (filp->f_op && filp->f_op->fasync)
    + if (filp->f_op && filp->f_op->fasync) {
    + lock_kernel();
                                             error = filp->f_op->fasync(fd, filp, on);
    + unlock_kernel();
    + }
                                     else error = -ENOTTY;
                             }
                             if (error != 0)
    @@ -125,15 +122,46 @@ asmlinkage long sys_ioctl(unsigned int f
                             break;
                     default:
                             error = -ENOTTY;
    - if (S_ISREG(filp->f_dentry->d_inode->i_mode))
    + if (S_ISREG(filp->f_dentry->d_inode->i_mode)) {
                                     error = file_ioctl(filp, cmd, arg);
    - else if (filp->f_op && filp->f_op->ioctl)
    - error = filp->f_op->ioctl(filp->f_dentry->d_inode, filp, cmd, arg);
    + }
    + if (error == -ENOTTY) {
    + unknown = 1;
    + goto out;
    + }
    + break;
    + }
    +out:
    + *status = error;
    + return unknown;
    +}
    +
    +asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
    +{
    + struct file *filp;
    + long error = -EBADF;
    + int fput_needed;
    +
    + filp = fget_light(fd, &fput_needed);
    + if (!filp)
    + goto out2;
    +
    + if (!std_sys_ioctl(fd, cmd, arg, filp, &error))
    + goto out;
    +
    + if (filp->f_op && filp->f_op->ioctl_native)
    + error = filp->f_op->ioctl_native(filp->f_dentry->d_inode,
    + filp, cmd, arg);
    + else if (filp->f_op && filp->f_op->ioctl) {
    + lock_kernel();
    + error = filp->f_op->ioctl(filp->f_dentry->d_inode,
    + filp, cmd, arg);
    + unlock_kernel();
             }
    - unlock_kernel();
    - fput(filp);
     
     out:
    + fput_light(filp, fput_needed);
    +out2:
             return error;
     }
     
    diff -rup linux-2.6.10-rc3/fs/compat.c linux-2.6.10-rc3-mstioctl/fs/compat.c
    --- linux-2.6.10-rc3/fs/compat.c 2004-12-16 15:20:45.000000000 +0200
    +++ linux-2.6.10-rc3-mstioctl/fs/compat.c 2004-12-16 18:26:23.013715352 +0200
    @@ -401,16 +401,21 @@ asmlinkage long compat_sys_ioctl(unsigne
                                     unsigned long arg)
     {
             struct file * filp;
    - int error = -EBADF;
    + long error = -EBADF;
             struct ioctl_trans *t;
    + int fput_needed;
     
    - filp = fget(fd);
    - if(!filp)
    + filp = fget_light(fd, &fput_needed);
    + if (!filp)
                     goto out2;
     
    - if (!filp->f_op || !filp->f_op->ioctl) {
    - error = sys_ioctl (fd, cmd, arg);
    + if (!std_sys_ioctl(fd, cmd, arg, filp, &error))
                     goto out;
    + else if (filp->f_op && filp->f_op->ioctl_compat) {
    + error = filp->f_op->ioctl_compat(filp->f_dentry->d_inode,
    + filp, cmd, arg);
    + if (error != -ENOIOCTLCMD)
    + goto out;
             }
     
             down_read(&ioctl32_sem);
    @@ -425,9 +430,12 @@ asmlinkage long compat_sys_ioctl(unsigne
                             error = t->handler(fd, cmd, arg, filp);
                             unlock_kernel();
                             up_read(&ioctl32_sem);
    - } else {
    + } else if (filp->f_op && filp->f_op->ioctl) {
                             up_read(&ioctl32_sem);
    - error = sys_ioctl(fd, cmd, arg);
    + lock_kernel();
    + error = filp->f_op->ioctl(filp->f_dentry->d_inode,
    + filp, cmd, arg);
    + unlock_kernel();
                     }
             } else {
                     up_read(&ioctl32_sem);
    @@ -466,7 +474,7 @@ asmlinkage long compat_sys_ioctl(unsigne
                     }
             }
     out:
    - fput(filp);
    + fput_light(filp, fput_needed);
     out2:
             return error;
     }
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Alan Cox: "Re: How to add/drop SCSI drives from within the driver?"

    Relevant Pages

    • Re: unregister_ioctl32_conversion and modules. ioctl32 revisited.
      ... > the ioctl32 lookup is independent from the file descriptor ... > Michael did a patch for this some time ago for a different motivation - ... > a per device ioctl name space. ... -asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg) ...
      (Linux-Kernel)
    • Re: BUG: unable to handle kernel NULL pointer dereference (drm_getunique)
      ... unable to handle kernel NULL pointer dereference at 00000000 ... My patch fixes this, but here is a better one that checks the whole cmd ... This patch checks the incoming ioctl command against the defined ones ... This patch will break some other parts of the DRM interface, ...
      (Linux-Kernel)
    • HEADS UP: KBI breakage for Ethernet modules
      ... It will bring ether_ioctl() into accord with ioctlWRT ... the type of the command argument. ... so this patch will be a little step on ... This change will inevitably break the kernel interface to network ...
      (freebsd-current)
    • Re: HEADS UP: KBI breakage for Ethernet modules
      ... It will bring ether_ioctl() into accord with ioctlWRT ... the type of the command argument. ... so this patch will be a little step on ... This change will inevitably break the kernel interface to network ...
      (freebsd-current)
    • Re: HEADS UP: KBI breakage for Ethernet modules
      ... It will bring ether_ioctl() into accord with ioctlWRT ... the type of the command argument. ... so this patch will be a little step on ... This change will inevitably break the kernel interface to network ...
      (freebsd-current)