I have a database control file on a Linux system, and I want to filter the output of strings on it (for training purposes). However, I cannot find a way to get rid of the "block-like" characters:

▒▒▒▒
▒▒▒▒
▒▒▒▒
▒▒▒▒▒
▒▒▒{
▒▒▒▒▒▒9
▒▒▒▒
▒▒▒▒▒

I've tried many ways, but they do not get rid of the block chars:

strings o1_mf_d3rrgv0l_.ctl|grep -vE '▒'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|?|+|(|)|<|>'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|(|)|<|>'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>|!'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>|!|\^|\%|\`'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>|!|\^|\%|\`|\$'
strings o1_mf_d3rrgv0l_.ctl|grep -v '[^[:print:]]'
strings o1_mf_d3rrgv0l_.ctl|grep -v '[[:print:]]'
strings o1_mf_d3rrgv0l_.ctl|grep  '[[:print:]]'
strings o1_mf_d3rrgv0l_.ctl|grep -v  '[[:cntrl:]]'
strings o1_mf_d3rrgv0l_.ctl|grep -v '\x{09}'
strings o1_mf_d3rrgv0l_.ctl|grep -vP '[^\x00-\x7f]'
strings o1_mf_d3rrgv0l_.ctl|tr -dc '\007-\011\012-\015\040-\376'
strings -1 o1_mf_d3rrgv0l_.ctl|tr -dc '\007-\011\012-\015\040-\376'
strings o1_mf_d3rrgv0l_.ctl|tr -dc '[:print:]\n\r'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>|!|\^|\%|\`|\$'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>|!|\^|\%|\`|\;|\:|\=|\$'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>|!|\^|\%|\`|\;|\:|\=|\$|\"'
strings o1_mf_d3rrgv0l_.ctl|grep -vE '@|\?|\+|\(|\)|\<|\>|!|\^|\%|\`|\;|\:|\=|\$|\"|\&|\#'
  • grep -v is not the right tool, as it removes entire lines that contain the regular expression. You could try tr as shown at https://unix.stackexchange.com/questions/201751/replace-non-printable-characters-in-perl-and-sed, followed by sed. Or start with cat -v, which represents non-printable characters like ^A, then also filter them out with sed. The problem with cat -v is that it doesn't distinguish between a ^ character and an unprintable character. I am sure there are other solutions. – berndbausch Jan 26 '21 at 09:14
  • Welcome, you want to remove the characters or remove the lines that contain them? – schrodingerscatcuriosity Jan 26 '21 at 10:03
  • yes, I want to remove the lines that contain these weird brackets. – ChrisTheDevil Jan 26 '21 at 11:36
  • cat -v displays the following output:

    ^@▒^@^@▒▒^@^@^@^@^@^@^@^@^@^@<▒^@^@^@^@@^@^@^@^D~z{|}^@^@^▒^@^@^@^@^@^@^@^@^@^@

    – ChrisTheDevil Jan 26 '21 at 11:40

2 Answers

After an interesting journey, I hope this answers your question. With GNU grep:

Sample file.txt:

▒▒▒▒
▒▒▒▒
▒▒▒▒
foo
bar
@▒^@^@▒▒^@^@^@^@^@^@^@^@^@^@<▒^@^@^@^@@^@^@^@^D~z{|}^@^@^▒^@^@^@^@^@^@^@^@^@^@

$ grep -v $(printf %b \\U2592) file.txt
foo
bar
  • unfortunately grep@AIX does not have the -P parameter available – ChrisTheDevil Jan 26 '21 at 11:54
  • @user452948 it was not necessary, it remained from previous tests. Remove it and try. – schrodingerscatcuriosity Jan 26 '21 at 12:01
  • It has no effect: BEFORE: strings o1_mf_d3rrgv0l_.ctl |head -10 ~z{|} H▒DB_CHRIS 7aM▒▒;▒Q ▒'y?Yܣ ▒;▒^ ▒▒X?Yܩ= (▒(A'y ?S▒▒Fr▒ ▒▒▒▒ ▒▒▒▒

    . . and AFTER: strings o1_mf_d3rrgv0l_.ctl |head -10 | grep -v $(printf %b \U2592) ~z{|} H▒DB_CHRIS 7aM▒▒;▒Q ▒'y?Yܣ ▒;▒^ ▒▒X?Yܩ= (▒(A'y ?S▒▒Fr▒ ▒▒▒▒ ▒▒▒▒

    – ChrisTheDevil Jan 26 '21 at 12:10

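It may be easier to remove anything except "known good" characters. For example, to limit the output to printable ASCII characters you could use

tr -dc ' -~\012\015'

This keeps only the characters from SPACE (octal 040) through ~ (character 126), plus LF (\012) and CR (\015). Everything else is removed. Note that tr takes a plain list of characters rather than a regular expression, so the set needs no surrounding [ ] and no leading ^; the -c option does the negation.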
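
As a quick check of the "keep only known good characters" idea, here is a minimal sketch. The function name keep_ascii and the sample bytes are made up for illustration; LC_ALL=C is added on the assumption that tr should treat the input as raw bytes.

```shell
# Hypothetical helper: keep printable ASCII (SPACE..~) plus LF and CR,
# delete every other byte. LC_ALL=C makes tr work on single bytes.
keep_ascii() {
    LC_ALL=C tr -dc ' -~\012\015'
}

# Sample input (invented): text mixed with bytes outside the ASCII range,
# similar in spirit to what strings prints for the control file.
printf 'H\342\226\222DB_CHRIS\n\377\376foo\n' | keep_ascii
```

In real use the helper would sit at the end of the pipeline, e.g. strings o1_mf_d3rrgv0l_.ctl | keep_ascii. Note that this removes the offending bytes but keeps the rest of each line.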

Alternatively, you might want to replace them with another character, e.g. a space:

tr -c ' -~\012\015' ' '

which preserves indentation and column positions.
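
The replacement variant can be sketched the same way. The name blank_non_ascii and the sample input are assumptions for illustration. One thing to keep in mind is that tr works byte by byte here, so a single multi-byte UTF-8 character turns into several spaces.

```shell
# Hypothetical helper: replace every byte that is not printable ASCII,
# LF or CR with a single space.
blank_non_ascii() {
    LC_ALL=C tr -c ' -~\012\015' ' '
}

# The block character U+2592 is three bytes in UTF-8 (octal 342 226 222),
# so it becomes three spaces between "a" and "b".
printf 'a\342\226\222b\n' | blank_non_ascii
```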

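Finally, you might be seeing this because of locale settings; e.g. if the OS assumes UTF-8 but the terminal uses a different encoding, bytes that do not decode cleanly can be displayed as block characters.

So setting LANG=C before running the command might change the output (if LC_ALL or LC_CTYPE is set, it takes precedence over LANG, so LC_ALL=C is the more reliable choice)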

LANG=C strings o1_mf_d3rrgv0l_.ctl

That can change what the strings command considers to be a printable character, depending on the implementation.
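
Since the asker clarified in the comments that whole lines containing these characters should go, the locale idea can also be combined with grep -v. This is only a sketch under assumptions: the function name drop_dirty_lines and the sample input are invented, and it assumes a grep that treats input as single bytes in the C locale.

```shell
# Hypothetical helper: drop every line that contains a byte outside
# printable ASCII (SPACE..~). In the C locale the bracket expression
# is evaluated on byte values, so bytes >= 0x80 are matched by [^ -~].
drop_dirty_lines() {
    LC_ALL=C grep -v '[^ -~]'
}

# Invented sample: two clean lines and two lines with non-ASCII bytes.
printf 'foo\n\342\226\222\342\226\222{\nbar\nH\342\226\222DB_CHRIS\n' | drop_dirty_lines
```

In a pipeline this would be strings o1_mf_d3rrgv0l_.ctl | drop_dirty_lines. If GNU grep reports a binary file match instead of printing lines, its -a (--text) option may help; that option is not available in every grep.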