Re: [Patch] Support UTF-8 scripts

From: Bodo Eggert (harvested.in.lkml_at_posting.7eggert.dyndns.org)
Date: 09/18/05

  • Next message: Christian Aichinger: "[PATCH] ext[23]: fix missing DQUOT_DROP in error paths"
    To: Bernd Petrovitsch <bernd@firmix.at>, Martin v. Löwis <martin@v.loewis.de>, "H. Peter Anvin" <hpa@zytor.com>, linux-kernel@vger.kernel.org
    Date:	Sun, 18 Sep 2005 21:23:42 +0200
    
    

    Bernd Petrovitsch <bernd@firmix.at> wrote:
    > On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
    > [...]
    >> >>Hmm. What does that have to do with the patch I'm proposing? This
    >> >>patch does *not* interfere with all text files. It is only relevant
    >> >>for executable files starting with the #! magic.
    >> >
    >> > It *does* interfere since scripts are also text files in every aspect.
    >> > So every feature you want for "scripts" you also get for text files (and
    >> > vice versa BTW).
    >>
    >> The specific feature I get is that when I pass a file starting
    >> with <utf8sig>#! to execve, Linux will execute the file following
    >> the #!. In what way do I get this feature for text in general?
    >> And if I do, why is that a problem?
    >
    > After applying this patch it seems that "Linux" is supporting this
    > marker officially in general - especially if the kernel supports it.

    It will be the first POSIX kernel to correctly support utf-8 scripts.
    It's 2005, and according to other(?) posters, this should be standard.

    > I
    > suppose the next kernel patch is to support Win-like CR-LF sequences
    > (which is not the case AFAIK).

    Maybe it should, maybe it shouldn't. If I used MAC or DOS, I'd be sure it
    should.-)

    > BTW even some standards body thinks that this is the way to go,

    Not surprisingly the Unicode Consortium is one of them.

    > it
    > raises more problems and questions than resolves anything.

    The problem of ow to handle BOM is solved by reading the standard.

    > And though scripts are usually edited/changed/"parsed"/... with an text
    > editor, it is not always the case. Therefore the automatic extension to
    > *all text files* (especially as the marker basically applies to all text
    > files, not only scripts).
    > You want to focus just on your patch and ignore the directly implied
    > potential problems arising ...

    There is no problem arising from the patch, it solves one.
    To solve the rest, use recode.

    [...]
    > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
    > a.txt and b.txt have this marker, then c.txt have the marker of b.txt
    > somewhere in the middle. Does this make sense in anyway?
    > How do I get rid of the marker in the middle transparently?

    The unicode standard defines how to handle them.

    >> > Let alone the confusion why the size of a file with `ls -l` is different
    >> > from the size in the editor or a marker-aware `wc -c`.
    >>
    >> This is true for any UTF-8 file, or any multibyte encoding. For any
    >> multibyte encoding, the number of bytes in the file is different from
    >> the number of characters. That doesn't (and shouldn't) stop people from
    >> using multi-byte encodings.
    >
    > It is different even if a pure ASCII file is marked as UTF-8.

    No pure ASCII file will be marked, since a marked file will be no
    ASCII file.

    > And sure, the problem exists in general with multi-byte encodings.

    ACK, but that's not a kernel problem nor a specific unicode problem.
    Fix it by making China, Greece an Japan convert to ASCII and by making
    all mathematicans stop using strange characters. All other users will
    follow.

    >> What the editor displays as the number of "things" is up to its own.
    >> The output of wc -c will always be the same as the one of ls -l,
    >> as wc -c does *not* give you characters:
    >>
    >> -c, --bytes
    >> print the byte counts
    >>
    >> You might have been thinking of 'wc -m'.
    >
    > It depends on the definition of "character". There are other standards
    > which define "character" as "byte".

    There are architectures defining a byte to be 32 bit.
    They are irrelevant, too.

    [...]
    >> Not sure what this has
    >> to do with the specific patch, though.
    >
    > It is not supported by the kernel. So either you remove it or you make
    > some compatibility hack (like an appropriate sym-link

    -EDOESNOTWORK

    #!/usr/bin/perl -T -s -w

    >, etc.). Since the
    > kernel can start java classes directly, you can probably make a similar
    > thing for the UTF-8 stuff.

    If MSDOS text files are text files are legal scripts, the kernel
    should recognize [\x0D\x0A] as valid line breaks.

    (The real reason would be unicode allowing NEL to be encoded as 0x0D
     or 0x0A.)

    This compile-tested patch adds 32 bytes to binfmt_script:

    --- ./fs/binfmt_script.c.old 2005-09-18 20:28:32.000000000 +0200
    +++ ./fs/binfmt_script.c 2005-09-18 20:29:44.000000000 +0200
    @@ -18,7 +18,7 @@

     static int load_script(struct linux_binprm *bprm,struct pt_regs *regs)
     {
    - char *cp, *i_name, *i_arg;
    + char *cp, *cp2, *i_name, *i_arg;
            struct file *file;
            char interp[BINPRM_BUF_SIZE];
            int retval;
    @@ -47,6 +47,9 @@ static int load_script(struct linux_binp
            bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
            if ((cp = strchr(bprm->buf, '\n')) == NULL)
                    cp = bprm->buf+BINPRM_BUF_SIZE-1;
    + if ((cp2 = strchr(bprm->buf, '\x0D')) != NULL
    + && cp2 < cp)
    + cp = cp2;
            *cp = '\0';
            while (cp > bprm->buf) {
                    cp--;

    -- 
    Ich danke GMX dafür, die Verwendung meiner Adressen mittels per SPF
    verbreiteten Lügen zu sabotieren.
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at  http://www.tux.org/lkml/
    

  • Next message: Christian Aichinger: "[PATCH] ext[23]: fix missing DQUOT_DROP in error paths"

    Relevant Pages

    • Re: RT patch acceptance
      ... judge the complexity of a design for that type of system. ... claim that you cannot judge the complexity of a kernel modification. ... Since the patch in question doesn't actually need that information to ... nanokernel's API up to date with additions to Linux's API that RT people ...
      (Linux-Kernel)
    • Re: inline asm semantics: output constraint width smaller than input
      ... Now in this case the patch you suggest might end up hurting the end result ... The below patch is to build the kernel for x86_64, ... # Device Drivers ... # PCI IDE chipsets support ...
      (Linux-Kernel)
    • [RFC] Making percpu module variables have their own memory.
      ... Someone using the -rt patch found that one of the tracing options caused ... 64K for every CPU to cover all the per_cpu variables used in the kernel ... static void wakeup_softirqd_prio ...
      (Linux-Kernel)
    • Re: This is [Re:] How to improve the quality of the kernel[?].
      ... The -mm kernel already implements what your proposed PTS would do. ... If patch have no TS ID, ... Thus i can apply for example lguest patches and implement and test new ... How many open source projects use Bugzilla and how many use the Debian BTS? ...
      (Linux-Kernel)
    • Re: Documentation - how to apply patches for various trees
      ... >> explanation of the various kernel trees and how to apply their patches. ... +a patch to the kernel or, more specifically, what base kernel a patch for ... +and what new version the patch will change the source tree into. ...
      (Linux-Kernel)