Remove new line, space from file

Question

I have many files in a directory each like so:

AAA 
AA

AAAAAA
A


AAAA

I want to end up with this:

AAAAAAAAAAAAAAAA

So that when I run:

find ./ -name '*' -exec wc -m {} +

I get back 16, not 20+ depending on how many new line/spaces are counted.

Basically, I want to remove EVERYTHING from a file unless it is a letter.

Stéphane Chazelas · Answer 1 · 2023-06-23T09:15:47.647

Note that if you remove every newline character from a file, even the last one, then it's no longer a text file (unless the file ends up being empty) as a text file contains a sequence of text lines, text lines being delimited by newline characters.

Now, to remove all but alphabetical characters (any alphabet), as @Kusalananda said, POSIXly, you'd use tr -cd '[:alpha:]'.

Now, unfortunately, with some tr implementations, including GNU tr, that doesn't work for multi-byte characters. In UTF-8 locales, that means all characters but ASCII ones.
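To find out which camp your own tr falls into, a small probe along these lines may help. It is only a sketch: the function name tr_keeps_multibyte_letters is made up for illustration, and the result is only meaningful when run in a UTF-8 locale (in some 8-bit locales both bytes of the test character happen to be letters on their own).

```shell
# tr_keeps_multibyte_letters is a hypothetical helper name, not a standard tool.
# Assumption: the current locale uses UTF-8.
# Feeds the UTF-8 encoding of "é" (bytes 0xC3 0xA9) through tr and succeeds
# only if both bytes survive the deletion of non-alphabetical characters.
tr_keeps_multibyte_letters() {
    kept=$(printf '\303\251' | tr -cd '[:alpha:]' | wc -c)
    # $(( ... )) strips the blank padding some wc implementations add.
    [ "$(( kept ))" -eq 2 ]
}

if tr_keeps_multibyte_letters; then
    echo 'tr kept the letter: it seems to handle multibyte characters here'
else
    echo 'tr dropped the letter: consider one of the sed/awk alternatives'
fi
```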

On GNU systems, you can use GNU awk or GNU sed which do support multibyte characters:

<file sed 's/[^[:alpha:]]//g' | tr -d '\n'
<file awk -v ORS= '{gsub(/[^[:alpha:]]/, ""); print}'

That syntax is not GNU-specific, but you'll find some non-GNU sed/awk implementations that don't support multibyte characters. Beware that GNU sed/awk at least will not remove sequences of bytes that don't form valid characters (like the output of printf 'à b \200\n' in a UTF-8 locale).

With uconv from the ICU project, you could do:

<file uconv -i -x '[^[:Letter:]]>;'

Where -i tells uconv to skip input it can't decode.

But that only works for UTF-8 data. Note that it uses Unicode character properties (from some version of Unicode) rather than your locale's notion of what is alphabetical.

With GNU grep, you could use:

<file grep -o '[[:alpha:]]' | tr -d '\n'

Or if built with PCRE support (using Unicode properties):

<file grep -Po '\pL' | tr -d '\n'

With GNU awk, another approach to skip the invalid input is to use RS:

<file gawk -v RS='[[:alpha:]]' -v ORS= '{print RT}'

To modify the files in-place, you can use gawk's inplace module:

gawk -i /usr/share/awk/inplace.awk -v RS='[[:alpha:]]' -v ORS= '{print RT}' file

Do not use -i inplace, as gawk tries to load the inplace extension (as inplace or inplace.awk) from the current working directory first, where someone could have planted malware. The path of the inplace extension supplied with gawk may vary with the system; see the output of gawk 'BEGIN{print ENVIRON["AWKPATH"]}'.
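One way to avoid hard-coding that path is to walk the search path yourself and accept only absolute directories. The following is a minimal sketch, not something shipped with gawk: find_awk_lib is a hypothetical helper name, and it assumes the search path is a colon-separated list as printed by the command above.

```shell
# find_awk_lib NAME PATHLIST (hypothetical helper, not part of gawk)
# Prints the first PATHLIST entry that is an absolute directory containing
# NAME. Relative entries such as "." are skipped on purpose, so a file
# planted in the current directory is never picked up.
find_awk_lib() {
    name=$1 path=$2
    (
        IFS=:      # split the list on colons
        set -f     # no globbing while splitting
        for dir in $path; do
            case $dir in
                /*) [ -f "$dir/$name" ] && { printf '%s\n' "$dir/$name"; exit 0; } ;;
            esac
        done
        exit 1     # nothing found
    )
}

# Possible usage (assumes gawk is installed; commented out as it edits "file"):
# lib=$(find_awk_lib inplace.awk "$(gawk 'BEGIN{print ENVIRON["AWKPATH"]}')") &&
#     gawk -i "$lib" -v RS='[[:alpha:]]' -v ORS= '{print RT}' file
```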

Kusalananda · Answer 2 · 2019-04-10T16:42:39.817

You don't need -name '*' as you want to process every file (* matches every file anyway, so it does not make any difference). You might however want -type f to only process regular files (not directories etc.)

To remove anything that is not a letter, you may use

tr -cd '[:alpha:]' <file

The -c complements the given set of characters, and [:alpha:] matches only alphabetical characters. The -d instructs tr to delete the matching characters.
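The effect of the two options can be seen on the sample from the question. This is just an illustrative sketch; the variable name sample is arbitrary, and the od call is only there to make the leftover whitespace visible.

```shell
# The question's input written as a printf format string
# (note the trailing space after the first "AAA").
sample='AAA \nAA\n\nAAAAAA\nA\n\n\nAAAA\n'

# -d alone deletes the letters and keeps everything else (space, newlines):
printf "$sample" | tr -d '[:alpha:]' | od -c

# -c flips the set, so -cd deletes everything except the letters.
# The echo only adds a final newline so the prompt is not glued to the output.
printf "$sample" | tr -cd '[:alpha:]'; echo
```

The second pipeline prints the 16 letters on one line: AAAAAAAAAAAAAAAA.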

The command you may want to execute is therefore

tr -cd '[:alpha:]' <file | wc -m

for each file.

Since this is too complex for find to execute directly, you will have to employ an in-line script:

find . -type f -exec sh -c '
    for pathname do
        tr -cd "[:alpha:]" <"$pathname" | wc -m
    done' sh {} +

Here, the in-line sh -c script will get batches of pathnames of files as arguments from find. The pipeline will be executed for each file.
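If the goal is to change the files themselves rather than just count, the same loop could write the filtered data back. The sketch below is one possible way to do that, under two assumptions: mktemp is available (it is widespread but historically not in POSIX), and strip_in_place is a name invented here for illustration.

```shell
# strip_in_place FILE... (hypothetical helper name)
# Rewrites each file so that only [:alpha:] characters remain.
strip_in_place() {
    for pathname do
        tmp=$(mktemp) || return 1
        # Filter into a temporary file first; touch the original only on success.
        if tr -cd '[:alpha:]' <"$pathname" >"$tmp"; then
            # cat into the original keeps its inode, owner and permissions.
            cat "$tmp" >"$pathname"
        fi
        rm -f "$tmp"
    done
}

# With find, the same loop body would go into the in-line script, e.g.
# (commented out because it modifies every regular file below "."):
#   find . -type f -exec sh -c '
#       for pathname do
#           tmp=$(mktemp) || exit 1
#           tr -cd "[:alpha:]" <"$pathname" >"$tmp" && cat "$tmp" >"$pathname"
#           rm -f "$tmp"
#       done' sh {} +
```

After this, wc -m on a file that held the question's sample should report 16, since neither spaces nor newlines are left.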

score 0 · Answer 3 · answered Jul 10 '23 at 17:44

Using Raku (formerly known as Perl_6)

~$ raku -e 'S:g/ <-alpha> //.put given lines;'  file
#OR
~$ raku -e 'S:g/ <- :L > //.put given lines;'  file

OR:

~$ raku -e 'S:g/ <-alpha> //.put given slurp;'  file
#OR
~$ raku -e 'S:g/ <- :L > //.put given slurp;'  file

Raku features high-level support for Unicode built-in, so no external libraries need be loaded to count multibyte characters. The Regex character-class :L denotes Unicode letters, and the <- :L > means "everything but" Unicode letters will get deleted in the substitution.

Sample Input (first line w/ ~ 6 trailing spaces, sixth line with ~ 12 spaces):

AAA     
AA1234
ÀÁÂÃÄÅÆ
1234
AAAA

Sample Output:

AAAAAÀÁÂÃÄÅÆAAAA

Counting characters:

~$ raku -e 'S:g/ <- :L > //.raku.put given lines;'  file
"AAAAAÀÁÂÃÄÅÆAAAA"
~$ raku -e 'S:g/ <- :L > //.chars.put given lines;'  file
16
~$ raku -e 'S:g/ <- :L > //.comb.elems.put given lines;'  file
16

https://docs.raku.org/language/unicode
https://raku.org
