I have a command that removes non-printable characters and single quotes from a file, but it takes a long time to run because I apply it to multiple files and the files are around 30GB in size.

LANG=iso-8859-1 sed -i 's/[^[:print:]]//g;s/'\''//g;s/&apos;//g' $path/EID*_$1.xml

Both $path and $1 are passed in as parameters. How can I make the process run faster, and is there another command I could use? I have heard that tr is faster than sed, but how can I use tr in my situation (ideally a single tr command line for all the files)?

I tried this command:

LANG=iso-8859-1 sed 's/[^[:print:]]//g;s/'\''//g;s/&apos;//g' < $path/EID123_$1.xml > $path/EID123_$1_new.xml
mv -f $path/EID123_$1_new.xml EID123_$1.xml
LANG=iso-8859-1 sed 's/[^[:print:]]//g;s/'\''//g;s/&apos;//g' < $path/EID456_$1.xml > $path/EID456_$1_new.xml
mv -f $path/EID456_$1_new.xml EID456_$1.xml

for each individual file, without the -i option, but it does not give the expected result: I can still see the non-printable characters in the file.

terdon
Azhar
  • Please [edit] your question and include i) an example of your file; ii) the output you would like from that example and iii) an explanation of exactly what characters you want to remove. What is &apos; supposed to be, for example? – terdon Feb 11 '16 at 14:57
  • Without knowing the details I can't say for sure, but doing a for loop that forks itself into several background processes might increase your sed performance by a great deal, there is an answer about it here: http://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop – Gravy Feb 11 '16 at 15:12
  • would the strings command work for you? – Rui F Ribeiro Feb 11 '16 at 15:43
  • For one, you do not need the mv part; just use sed's -i option and it will do that for you. – Rob Feb 11 '16 at 15:54
  • @terdon, my file is an XML file and &apos; represents a single quote. So I want to get rid of single quotes (') and &apos; from the file, and I also want to remove all the non-printable characters ([^[:print:]]) from the file. The above command works fine; it is just taking a long time. – Azhar Feb 12 '16 at 07:48
  • @Rob, with the -i option the sed command usually takes more time to execute; that's why I was trying to drop -i and use the mv command at the end. – Azhar Feb 12 '16 at 07:49
  • Can we use tr command to do the same thing? – Azhar Feb 12 '16 at 07:49
  • @Azhar please [edit] your question and give us an example of your input and desired output as I requested. That way, we can know exactly what you need. Make sure to include examples of all the characters/strings you want to remove. Note, however, that if you want to modify several files of ~30GB size, it will be slow no matter what you do. – terdon Feb 12 '16 at 08:46

1 Answer


Read the binary file foo and replace every character that is neither printable nor some kind of whitespace with a space. Pipe the result to a second tr, which replaces single quotes with spaces, and write the output to bar.

tr --complement '[:print:][:space:]' ' ' < foo | tr "'" ' ' > bar
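The command above substitutes spaces and handles a single file, whereas the question asks for the characters to be removed from several files, including the &apos; entity. A minimal sketch of that variant follows. The function name clean_file and the temporary-file suffix are hypothetical, not from the question. It assumes that LC_ALL=C is acceptable: in that locale every byte above 0x7F counts as non-printable, so accented ISO-8859-1 characters would be deleted too.

```shell
#!/bin/sh
# clean_file: hypothetical helper, not part of the original question or answer.
# Rewrites one file in place via a temporary file in the same directory.
clean_file() {
    src=$1
    tmp="${src}.tmp.$$"   # assumed temp name; must not collide with a real file

    # 1. tr -cd deletes every byte that is NOT printable or whitespace.
    # 2. sed removes the multi-character entity &apos; (tr cannot match strings).
    # 3. tr -d deletes literal single quotes.
    # LC_ALL=C makes all three tools work byte by byte.
    LC_ALL=C tr -cd '[:print:][:space:]' < "$src" |
        LC_ALL=C sed 's/&apos;//g' |
        LC_ALL=C tr -d "'" > "$tmp" &&
        mv -f -- "$tmp" "$src"
}
```

It could then be called once per file, for example: for f in "$path"/EID*_"$1".xml; do clean_file "$f"; done. Following the suggestion in the comments, appending & to the call and running wait after the loop would process the files in parallel. Whether that helps depends on whether the disk or the CPU is the bottleneck.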
agc