Re: detecting file types



On 2006-02-08, Måns Rullgård <mru@xxxxxxxxxxxxx> wrote:
Grant Edwards <grante@xxxxxxxx> writes:

On 2006-02-08, dagecko <dagecko@xxxxxxx> wrote:

I read it quickly, but in fact this was not what I was looking
for. I needed a way to detect, in a program made in C/C++, if
a file is a binary one or a simple ascii one.

1) Do a bitwise or of all the bytes in the file.

2) If bit 7 is set in the result, it's not an ASCII file.

3) If bit 7 is not set in the result, it _might_ be an ASCII
file. Or it might be a binary file that doesn't have any
bytes with bit 7 set.
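
A minimal C sketch of the three steps above. The function names or_of_bytes and stream_has_high_bit, and the 4096-byte read buffer, are assumptions made for illustration and are not from the original post. A result of 0 only means the data _might_ be ASCII.

```c
#include <stddef.h>
#include <stdio.h>

/* OR together every byte in buf. Bit 7 of the result is set
 * if and only if at least one byte is >= 0x80. */
static unsigned char or_of_bytes(const unsigned char *buf, size_t len)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < len; i++)
        acc |= buf[i];
    return acc;
}

/* Read the stream to the end and report on bit 7.
 * Returns 1 if some byte has bit 7 set (not ASCII),
 *         0 if no byte does (might be ASCII),
 *        -1 on a read error. */
static int stream_has_high_bit(FILE *fp)
{
    unsigned char buf[4096];   /* hypothetical chunk size */
    unsigned char acc = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        acc |= or_of_bytes(buf, n);
    if (ferror(fp))
        return -1;
    return (acc & 0x80) != 0;
}
```

Open the file in binary mode ("rb") before passing it to stream_has_high_bit, so that no newline translation hides bytes on platforms that distinguish text and binary streams.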

If you know what language the ASCII is supposed to be, you
could look at the frequency distributions of individual
characters to give you a better idea if a file is really ASCII
or if it's a degenerate binary file.
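
One rough way to sketch the frequency idea in C: build a byte histogram and measure how far it is from a reference distribution. The names byte_histogram and histogram_distance are hypothetical, the sum-of-absolute-differences measure is just one possible choice, and the reference table has to be supplied by the caller (for example, measured from known text in the expected language). No reference values are given here.

```c
#include <stddef.h>

/* Count how often each byte value occurs in buf. */
static void byte_histogram(const unsigned char *buf, size_t len,
                           size_t counts[256])
{
    for (int i = 0; i < 256; i++)
        counts[i] = 0;
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Sum of absolute differences between the observed relative
 * frequencies and a caller-supplied expected distribution.
 * 0.0 means identical; 2.0 is the maximum when both
 * distributions sum to 1 and share no byte values. */
static double histogram_distance(const size_t counts[256], size_t total,
                                 const double expected[256])
{
    double dist = 0.0;
    for (int i = 0; i < 256; i++) {
        double observed = total ? (double)counts[i] / (double)total : 0.0;
        double diff = observed - expected[i];
        dist += diff < 0.0 ? -diff : diff;
    }
    return dist;
}
```

What distance counts as "close enough" is a judgment call that depends on the reference data and the file sizes involved.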

Looking for non-printable characters < 0x20 is also a good idea.

Those non-printable characters are all perfectly legal ASCII.

However, if what he's looking for is a typical ASCII _text_
file, then I wouldn't expect to find too many non-printable
characters other than form-feed, line-feed, carriage-return,
horizontal-tab, and maybe backspace.
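
As a sketch, that check could be written in C by counting the bytes below 0x20 that are not one of the five control characters listed above. The function names are made up for this example, and how many such bytes count as "too many" is left to the caller.

```c
#include <stddef.h>

/* Control characters one would expect in an ordinary text file:
 * backspace, horizontal tab, line feed, form feed, carriage return. */
static int is_expected_control(unsigned char c)
{
    return c == '\b' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Count bytes below 0x20 that are not in the expected set. */
static size_t count_unexpected_controls(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] < 0x20 && !is_expected_control(buf[i]))
            count++;
    return count;
}
```

A count of zero does not prove the data is text, and a small nonzero count does not prove it is binary; it is only one more hint alongside the bit 7 test.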

--
Grant Edwards grante Yow! As President I
at have to go vacuum my coin
visi.com collection!