Re: detecting file types



Grant Edwards <grante@xxxxxxxx> writes:

> On 2006-02-08, Joe Pfeiffer <pfeiffer@xxxxxxxxxxx> wrote:
>
>>> I read it quickly, but in fact this was not what I was looking
>>> for. I needed a way to detect, in a program made in C/C++, if
>>> a file is a binary one or a simple ascii one.
>>
>> There is no quick and reliable way to do it. All you can really do is
>> scan the file looking for non-printing characters, and if you find
>> enough of them decide it's not ASCII (do you really mean ASCII, by the
>> way, or an eight-bit extension like ISO-8859-1?);
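
A minimal sketch of the scan described above, in C. The 5% threshold, the 4096-byte sample size, and the function names are assumptions made for illustration; nothing in the thread specifies them. It treats tab, newline, carriage return and form feed as ordinary text, and any NUL byte as a sign of binary data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical threshold: call the data binary when more than this
 * percentage of the sampled bytes look like non-text. An assumption,
 * not a value taken from the thread. */
#define SUSPECT_PERCENT_LIMIT 5

/* Return 1 if a byte is unusual in plain 7-bit ASCII text. */
static int is_suspect_byte(unsigned char c)
{
    if (c == '\t' || c == '\n' || c == '\r' || c == '\f')
        return 0;                 /* common whitespace controls */
    return c < 0x20 || c >= 0x7f; /* other controls, DEL, bit 7 set */
}

/* Heuristic: 1 if buf looks like ASCII text, 0 otherwise.
 * Empty input is treated as text. */
int looks_like_ascii(const unsigned char *buf, size_t len)
{
    size_t i, suspect = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == 0)
            return 0;             /* NUL almost never appears in text */
        if (is_suspect_byte(buf[i]))
            suspect++;
    }
    return suspect * 100 <= len * SUSPECT_PERCENT_LIMIT;
}

/* Classify the first 4096 bytes of a file.
 * Returns 1 for text, 0 for binary, -1 if the file cannot be opened. */
int file_looks_like_ascii(const char *path)
{
    unsigned char buf[4096];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return looks_like_ascii(buf, n);
}
```

Because every byte with bit 7 set counts as suspect, ISO-8859-1 text with many accented characters will be reported as binary; that is the strict reading of "ASCII".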

> If he wants to allow something like ISO-8859-1, then he's going
> to need to build a table containing the file's byte
> distribution frequencies and do a "fuzzy" compare to the
> distributions of known language/charset pairs. Not a
> particularly easy/simple thing to do.
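
The table-and-compare idea above might be sketched like this in C. The function names and the choice of L1 distance as the "fuzzy" measure are assumptions for illustration. Reference profiles for each language/charset pair would have to be built by running the same function over real sample text; none are supplied here.

```c
#include <assert.h>
#include <stddef.h>

#define NBYTES 256

/* Fill freq[] with the relative frequency of each byte value in buf.
 * For empty input every entry is 0.0. */
void byte_frequencies(const unsigned char *buf, size_t len,
                      double freq[NBYTES])
{
    size_t i, counts[NBYTES] = {0};

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        counts[buf[i]]++;
    for (i = 0; i < NBYTES; i++)
        freq[i] = len ? (double)counts[i] / (double)len : 0.0;
}

/* L1 distance between two frequency tables: 0.0 when they are
 * identical, 2.0 when they share no byte values at all. */
double profile_distance(const double a[NBYTES], const double b[NBYTES])
{
    size_t i;
    double d = 0.0;

    for (i = 0; i < NBYTES; i++)
        d += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return d;
}
```

A caller would compute the file's table once, measure its distance to each reference table, and accept the closest one only if the distance falls under some cutoff chosen by experiment.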

Not easy at all -- but lots of people say "ASCII" these days when they
don't really mean it.

>> or, you can use the "system" call from inside your program to
>> execute file.
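
A sketch of calling file(1) from C. It uses POSIX popen() rather than system(), since system() does not hand the command's output back to the program; that substitution, the function names, and the use of the `-b --mime-type` options are assumptions for illustration and presume the common file(1) implementation is installed.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen() and pclose() */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if a MIME type string reported by file(1) names a text type. */
int mime_is_text(const char *mime)
{
    assert(mime != NULL);
    return strncmp(mime, "text/", 5) == 0;
}

/* Ask file(1) about path. Returns 1 for text, 0 for anything else,
 * -1 on error (including paths this sketch refuses to quote). */
int file_says_text(const char *path)
{
    char cmd[4096], mime[256];
    FILE *fp;
    int n, got_line;

    assert(path != NULL);
    if (strchr(path, '\'') != NULL)
        return -1;  /* a quote would break the single-quoting below */
    n = snprintf(cmd, sizeof cmd,
                 "file -b --mime-type -- '%s' 2>/dev/null", path);
    if (n < 0 || (size_t)n >= sizeof cmd)
        return -1;  /* command did not fit in the buffer */
    fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;
    got_line = fgets(mime, sizeof mime, fp) != NULL;
    if (pclose(fp) != 0 || !got_line)
        return -1;
    return mime_is_text(mime);
}
```

The path is passed through the shell, so the sketch refuses any name containing a single quote instead of trying to escape it.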

> Or he can trust that the user knows what he's doing and just
> process the file he's been told to. ;)

Yeah -- "do the contents of this file conform to the syntax my program
is expecting?" is both a lot simpler and a lot more useful than "is
this file a binary file or a text file?"
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
skype: jjpfeifferjr