Re: remove/replace non-ascii characters from file



Johannes Wiedersich wrote:
Mike McCarty wrote:

garbage (represented as ^@^@^@^@^@^@ etc.)


I suppose you mean "non-graphic ASCII". Those are NUL characters,
which the ASCII *definition* states can be inserted or removed
from *any* stream without changing its meaning. This means that
your application is not ASCII compliant. Sorry, but in this case
(unusual, I know) Windows is right and your app is wrong.


Well, I don't know that much about the ASCII *definition*, but if I open the file in Window$ notepad (I never use that for any purpose, I just did it out of curiosity), these characters appear as additional spaces. When the file is saved, they are written out as ordinary spaces (i.e. linux-compliant spaces).

So, if you are right, that means that M$ notepad converts these NUL characters to spaces, which is a bad thing, if these are indeed different characters and useful for anything.

Yes, it is doing a Bad Thing. ASCII was originally intended for
information interchange, including use over serial lines and with
slow (mechanical) printers connected on the other end.
The purpose of NUL was to allow the sender to pad the transmission
after sending characters which might take the receiver a "long"
time to process, like CR (carriage return). They are like NOPs in
computer programming. They eat time, but otherwise do nothing.
One is supposed to be able to insert or delete them from
any ASCII stream without changing the meaning of the stream.

The ASCII code for SP (graphic space) is 0x20. The ASCII code for NUL
(null character) is 0x00. They are indeed not the same thing. SP is
supposed to be *meaningful* in an ASCII stream. NUL is not.
Deleting/inserting an ASCII space is supposed to change the stream's meaning.
For example, "therapist" and "the rapist" do not mean the same thing
(usually).
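
To make the distinction concrete, here is a minimal sketch in C of "equal up to NUL padding". The function name same_ignoring_nul is made up for this illustration and is not from any library. It skips 0x00 bytes on both sides but treats SP (0x20) like any other byte.

```c
#include <stddef.h>

/* Compare two byte streams while skipping every NUL (0x00) byte.
 * Returns 1 if the remaining bytes match exactly, 0 otherwise.
 * SP (0x20) is an ordinary byte here, so it is never skipped. */
int same_ignoring_nul(const char *a, size_t alen,
                      const char *b, size_t blen)
{
    size_t i = 0, j = 0;

    for (;;) {
        while (i < alen && a[i] == '\0') i++;   /* skip padding in a */
        while (j < blen && b[j] == '\0') j++;   /* skip padding in b */

        if (i == alen || j == blen)
            return i == alen && j == blen;      /* both must be exhausted */

        if (a[i] != b[j])
            return 0;
        i++;
        j++;
    }
}
```

With this, the 10 bytes "the\0rapist" compare equal to "therapist", while "the rapist" does not.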

Anyway, I don't think it is a useful feature of a program to include NUL characters in the header of data files like the present one, which just consists of a short header and two columns of x and y data. I'd be curious about the programmer's reason for putting about 50 of these at the end of the comment.

I have no idea why they were inserted there[*]. They are not very useful
when used to *store* as opposed to *move* data. If one had a very dumb
terminal program, and needed to communicate with some possibly slow
"other" device (like a uController programming EEPROM or the like)
it might be useful to insert NUL characters into the file itself
at strategic points to allow programming time.

[*] A possible guess why they were put there: This is a fixed-length
field, and it makes a C programmer's job a little easier if he reads
a NUL-terminated string into a fixed array.
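
That guess can be sketched in C as follows. Everything here is an assumption made for illustration: the 64-byte width and the names COMMENT_FIELD_LEN, pack_comment and unpack_comment are invented, since the real file format is not known. The point is only that a short comment written into a fixed-length field leaves a run of NUL bytes behind it.

```c
#include <string.h>

#define COMMENT_FIELD_LEN 64   /* hypothetical width; the real format is unknown */

/* Writer side: place a comment in a fixed-length field.
 * strncpy() fills the rest of the field with NUL bytes (and silently
 * truncates a comment longer than the field). */
void pack_comment(char field[COMMENT_FIELD_LEN], const char *comment)
{
    strncpy(field, comment, COMMENT_FIELD_LEN);
}

/* Reader side: copy the raw field into a buffer one byte larger and
 * terminate it. Thanks to the padding, the result is an ordinary C string. */
void unpack_comment(const char field[COMMENT_FIELD_LEN],
                    char out[COMMENT_FIELD_LEN + 1])
{
    memcpy(out, field, COMMENT_FIELD_LEN);
    out[COMMENT_FIELD_LEN] = '\0';  /* covers a comment that fills the field */
}
```

An 8-character comment in such a field would be followed by 56 NUL bytes on disk, which is the kind of run a text editor shows as ^@^@^@.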

You might try tr. On another note, here's a C program which will do what
you want. It's written as a filter, so no file names on the command
line... this is strictly no-frills programming. Placed into the public
domain by me, the original author, today, Thursday 3 August 2006. If you
*need* file names on the command line (like for use with find and xargs)
then I can add that, but I thought something quick'n'nasty might
be more what you need.
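
The program itself is not reproduced in this message. What follows is only a rough sketch of what the core of such a NUL-stripping filter might look like; it is not the original code, and the function name strip_nul_stream is made up. For the tr route, deleting NUL bytes is normally written as tr -d '\000' with the file on standard input.

```c
#include <stdio.h>

/* Copy `in` to `out`, dropping every NUL (0x00) byte.
 * Returns the number of NUL bytes dropped, or -1 on an I/O error. */
long strip_nul_stream(FILE *in, FILE *out)
{
    long dropped = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == 0) {           /* NUL: skip it */
            dropped++;
            continue;
        }
        if (putc(c, out) == EOF)
            return -1;          /* write error */
    }
    return ferror(in) ? -1 : dropped;
}
```

Called from a main() as strip_nul_stream(stdin, stdout), this would behave as a plain filter with no file names on the command line. It deletes the NUL bytes rather than turning them into spaces, so the rest of the file is left byte-for-byte unchanged.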


I appreciate your effort! I was anyway writing a script to postprocess the data, so the most convenient way was to remove the junk via another command line.

You're welcome, and no problem if you don't use it. It was a 15-minute
effort anyway. I did test it, as you saw, though.

Mike
--
p="p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
This message made from 100% recycled bits.
You have found the bank of Larn.
I can explain it for you, but I can't understand it for you.
I speak only for myself, and I am unanimous in that!

