Using sed, how to format one word per line, removing white space

Question

I'm trying to replace patterns and clean up a file containing multiple words so that it ends up with one word per line.

The result is achieved using this command line:

sed -e '/^[[:space:]]*$/ d' \             # remove empty line
    -e 's/^[[:space:]]*//' \              # remove white space at the beginning
    -e 's/[[:space:]]*$//' \              # remove white space at the ending (EOL)
    -e 's/[[:space:]][[:space:]]*/\n/g' \ # convert blanks between words to newline
    -e '$a\'                              # add a newline if missing at EOF
    -e .....                              # replace other patterns.

(the last expression was found in How to add a newline to the end of a file?)

The idea is to process the file (e.g. replace some patterns) and format it at the same time with a single small sed program.

I'm sure it's possible to use other sed features to shorten the expression.

Regards

score 8 · Answer 1 · edited Jun 14 '13 at 17:08

You can use tr:

tr -s '[:blank:]' '\n' < file | grep .

The [:blank:] character class includes all horizontal whitespace (spaces and tabs). The -s option squeezes each run of the resulting newlines into a single one.

The grep . removes the empty line that is left over when the input starts with whitespace.
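As a quick check of the behaviour described above, here is a minimal sketch. The function name and the sample input are made up for illustration, and the class is written as plain '[:blank:]' so that literal brackets in the input are left alone.

```shell
# Hypothetical helper: split stdin into one word per line with tr and grep.
one_word_per_line() {
    # Blanks (space, tab) become newlines; -s squeezes runs of newlines into one.
    # grep . then drops the empty first line caused by leading blanks.
    tr -s '[:blank:]' '\n' | grep .
}

# Sample input with leading blanks, repeated blanks, an empty line and a tab.
printf '  foo   bar\n\nbaz\tqux\n' | one_word_per_line
# prints:
# foo
# bar
# baz
# qux
```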

answered Jun 14 '13 at 15:26 by JRFerguson · edited Jun 14 '13 at 17:08 by Stéphane Chazelas

Thanks for the suggestion, but it's not using sed and an empty line in the beginning of the input file is not removed and a newline is not added after the last word. Regards. – Yann Droneaud Jun 17 '13 at 08:49

score 4 · Answer 2 · edited Jun 14 '13 at 15:46

Try this

sed -e 's/[[:space:]]/\n/g' | grep -v '^$'

It uses both grep and sed, but I hope it's OK (if you have sed on a system, you usually have grep too)

answered Jun 14 '13 at 15:06 by Karel Bílek · edited Jun 14 '13 at 15:46 by don_crissti

@Karel-Bilek: it works (it puts one word per line, removes all spaces, and adds a newline at EOF), but could this be done with a single invocation of 'sed', without any other Unix tool? Regards. – Yann Droneaud Jun 17 '13 at 08:54
@ydroneaud: don_crissti wrote exactly that. But it needs GNU version of sed. Not sure about UNIX standard. – Karel Bílek Jun 17 '13 at 21:11
@KarelBílek not exactly as it's using 2 invocations of sed with a pipe. I would like only one sed invocation. – Yann Droneaud Jun 18 '13 at 13:31
@YannDroneaud: Although very messy, there is a solution to ensure a single newline at EOF. Not sure if it works with non-GNU sed implementations. See my updated answer. – Larry Jan 05 '19 at 11:03

glenn jackman · Answer 3 · 2013-06-14T18:44:45.317

4

Not sed, but:

gawk length RS='[[:space:]]+' file

That treats any sequence of whitespace as the record separator, and prints each non-empty record.
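A regular expression in RS is an extension that not every awk supports. As a hedged sketch of a more portable variant, default field splitting gives a similar result, and a gsub call can replace a pattern in the same pass. The function name, the sample input and the colour/color replacement are assumptions made up for illustration.

```shell
# Hypothetical helper: one whitespace-separated field per output line.
words_awk() {
    awk '{
        gsub(/colour/, "color")             # illustrative replacement only
        for (i = 1; i <= NF; i++) print $i  # empty lines have NF == 0, so nothing is printed
    }'
}

printf '  red colour\n\n\tblue\n' | words_awk
# prints:
# red
# color
# blue
```

Because print ends every record with a newline, the output is newline-terminated even when the input file is not.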

answered Jun 14 '13 at 17:06 by glenn jackman · edited Jun 14 '13 at 18:44

@don_crissti, leading spaces are actually removed, but awk treats the empty string before the whitespace as an empty record. I've updated to only print non-empty records – glenn jackman Jun 14 '13 at 18:45
To convince me to use awk instead of sed, you need to show how to replace multiple patterns in the file while formatting it with a POSIX awk. Regards. – Yann Droneaud Jun 17 '13 at 08:56

score 1 · Answer 4 · edited Jan 12 '19 at 15:29

Since the OP seems to be adamant about using a "single invocation" of sed, here is one:

Non-word splitting approach with partial pattern-space hiding:

sed -n -e 's/^\W*//' -e 's/\W\+/\n/gp' words.txt

EDIT: Note that, as pointed out by @don_crissti, this solution is not complete: it fails to print words that are alone on their line to begin with, and it does not add a newline at the very end of the output if the file was missing a terminating newline. To remedy this, see the extremely ugly solution further down.

The main issue with sed is that the pattern space, on which each -e expression operates, holds one input line at a time. If one expression inserts newlines, the following expressions still see the result as a single pattern space rather than as several separate lines, so they cannot process each new line on its own.
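A small sketch of this behaviour, assuming GNU sed (where \n in the replacement inserts a newline). The function name and input are made up for illustration.

```shell
# Hypothetical demo: the first expression splits the line in two, but the
# second expression still sees one pattern space, so ^ matches only once.
split_then_prefix() {
    sed -e 's/ /\n/' -e 's/^/> /'
}

printf 'a b\n' | split_then_prefix
# prints:
# > a
# b
```

Only the first of the two output lines gets the prefix, because the embedded newline does not start a new cycle.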

Explanation:

First, leading non-word characters (whitespace included), if any, are removed from each line. Lines that consist entirely of them become empty lines, which stay in the pattern space for the next expression.
The key in the second part is the combination of the -n option and the p flag of the s command, which some people like to call sed's "grep mode": -n suppresses the automatic printing of the pattern space, and p prints it only when the substitution succeeded. This way, you avoid printing lines that were completely blank. Since \W\+ needs at least one non-word character, empty lines never match. Leading whitespace, which could otherwise have matched, was already removed by the first expression.
EDIT: I forgot to explain that the lack of the p flag in the first expression is also key. If it had one, every line it changed would be printed an extra time, before the second expression ran. Without it, the pattern space is not printed at that point, but it is still carried over to the next expression in its changed form. That lets us chain expressions that work on one input line while seeing only the output of the last expression.
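To see the effect of -n together with the p flag, here is a minimal sketch assuming GNU sed (\W and \+ are GNU extensions) and the substitution written as s/\W\+/\n/gp. The function name and input are illustrative.

```shell
# Hypothetical demo: only lines on which the second substitution succeeds
# are printed, because -n turns off automatic printing.
split_changed_only() {
    sed -n -e 's/^\W*//' -e 's/\W\+/\n/gp'
}

printf 'A B\n\n E\n' | split_changed_only
# prints:
# A
# B
```

The empty line is dropped as intended, but so is the lone word E, which is the limitation mentioned in the EDIT note above.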

If you prefer to treat words as sequences of non-whitespace characters, well... that definition covers a lot more than words: those are non-whitespace sequences, not words. However, if you'd like to match these and print each one on a separate line, use:

sed -n -e 's/^\s*//' -e 's/\s\+/\n/gp' words.txt

Zero-byte substitution approach

EDIT: The issue of lines with a single word on them and missing newline on EOF as pointed out by @don_crissti can be solved by the following command. Although not too long, aside from it being ridiculously hacky, it has at least one flaw I know of: namely that it does not work for a file with only a single line, if that single line has multiple words. An idea to solve that would be to add branches to check if the last line is the first, complicating the program even more (and taking me even more time :D). Here is the command:

sed -rn 's/(\b|\W)+/\x0/g; s/^\x0//; s/\x0$//; s/\x0/\n/g; /^$/d; $! p; $ { s/$/\n/; P }'

Explanation:

The command works in the following passes:

First, every run of non-word characters and word boundaries is replaced by a zero byte. Word boundaries, such as the beginning and end of a line, are zero-width assertions rather than characters, so a boundary is replaced together with any non-word characters adjacent to it.
Then, zero-bytes are removed from the beginning and end of each line.
Then, every intermediate zero-byte is substituted by a newline.
Any resulting empty lines are deleted from the pattern space. There are no whitespace-only lines at this point.
If the current line is not the last line of input, we simply print the pattern space.
At the end of our data, we execute 2 commands:
- We add a newline at the end of the current pattern space, to have at least 1 terminating newline, even if the original data didn't end in one.
- We print only until the first embedded newline in our current pattern space, which has a maximum of 2 newlines.

By the way, the simplest solution to this problem I've seen is:

grep -o '\w\+' words.txt

Or, if you don't need to deal with lines starting with whitespace:

fmt -1 words.txt
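The difference between words and non-whitespace sequences discussed earlier also shows up with grep -o. A hedged sketch, assuming a grep that supports -o and \w (GNU grep does). The function names and input are made up for illustration.

```shell
# Hypothetical helpers comparing the two notions of "word".
words_only()  { grep -o '\w\+'; }              # runs of word characters only
tokens_only() { grep -oE '[^[:space:]]+'; }    # runs of any non-whitespace

printf ' foo-bar  baz\n' | words_only
# prints:
# foo
# bar
# baz

printf ' foo-bar  baz\n' | tokens_only
# prints:
# foo-bar
# baz
```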

printf '%s\n' 'A B' ' C D' ' E' | your_sed_here fails to print the E on last line. Your code also fails to add a newline at EOF if it doesn't exist... This isn't that easy to solve (with sed). — don_crissti, Jan 04 '19 at 20:38
