Since the OP seems adamant about using a "single invocation" of sed, here is one:
Non-word splitting approach with partial pattern-space hiding:
sed -n -e 's/^\W*//' -e 's/\(\W\+\)/\n/gp' words.txt
EDIT: As pointed out by @don_crissti, this solution is not complete: it fails to print words that sit on a line by themselves, and it inserts a newline at the very end of the output if the file was missing a terminating newline.
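To make the first of those shortcomings concrete, here is a small sketch. It assumes GNU sed (for \W, \+ and \n in the replacement); the function name split_words and the sample input are made up for illustration.

```shell
# Hypothetical wrapper around the command above, reading stdin.
# GNU sed is assumed.
split_words() {
    sed -n -e 's/^\W*//' -e 's/\(\W\+\)/\n/gp'
}

# The middle line holds a single word, so the second expression
# changes nothing on it and its p flag never fires.
printf 'one two\nsingle\nthree four\n' | split_words
# prints:
# one
# two
# three
# four
```

Note that "single" never shows up in the output.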
To remedy this issue, see the extremely ugly zero-byte substitution approach further down.
The main issue with sed is that the pattern space, on which each -e expression operates, is always defined by input lines. If you insert newlines, thereby changing the line structure between one expression and the next, the next expression still sees one pattern space with embedded newlines, so it won't be able to treat the processed data as separate lines.
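A quick way to see this, as a sketch assuming GNU sed (the function name mark_starts is made up): the first expression inserts a newline, and the second tries to mark every line start.

```shell
# Hypothetical helper; GNU sed assumed for \n in the replacement.
mark_starts() {
    # 1st expression: turn the first space into a newline.
    # 2nd expression: put ">" at every "line start" it can find.
    sed -e 's/ /\n/' -e 's/^/>/g'
}

printf 'a b\n' | mark_starts
# prints:
# >a
# b
```

Only "a" gets the marker: ^ matches once, at the start of the pattern space, not after the embedded newline.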
Explanation:
First, leading non-word characters (typically whitespace), if any, are stripped from each line. Lines that consist entirely of them are turned into empty lines, so the number of lines stays the same.
The key in the second part is the combination of the -n option and the p (print) flag, which some people like to call sed's "grep mode": only matched and/or changed lines are printed.
-n suppresses the automatic printing of the pattern space, and p forces printing of matched and/or changed lines.
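A minimal sketch of the difference, with made-up function names and input:

```shell
# With -n: only the line where the substitution succeeded is printed.
grep_mode() {
    sed -n 's/o/0/p'
}

# Without -n: the changed line is printed by p and then again
# automatically at the end of the cycle; unchanged lines appear once.
default_mode() {
    sed 's/o/0/p'
}

printf 'foo\nbar\n' | grep_mode
# prints:
# f0o

printf 'foo\nbar\n' | default_mode
# prints:
# f0o
# f0o
# bar
```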
This way, you avoid printing lines that were completely blank. Since \W\+ expects at least one non-word character, empty lines are out. And lines consisting only of whitespace, which the expression could otherwise have matched, were already turned into empty lines by the first expression.
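As a sketch of the blank-line handling (GNU sed assumed; the function name and input are invented for the example):

```shell
# Hypothetical wrapper around the two-expression command, reading stdin.
split_words() {
    sed -n -e 's/^\W*//' -e 's/\(\W\+\)/\n/gp'
}

# Line 2 is whitespace-only and line 3 is empty; after the first
# expression both are empty, so \W\+ cannot match and nothing is printed.
printf 'foo bar\n   \n\nbaz qux\n' | split_words
# prints:
# foo
# bar
# baz
# qux
```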
EDIT: I forgot to explain that the lack of the p flag in the first expression is also key. If every expression carried a p, the pattern space would be printed by each expression that matched, causing us to see each line as many times as there were expressions that printed it, with variations if any of those expressions also changed the given line.
However, even though the pattern space is not printed, it is carried over to subsequent expressions in its changed form, allowing us to chain expressions that operate on a single pipeline that originates with one input line, while only seeing the output of the last expression.
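To illustrate, here is a sketch (GNU sed assumed, function names made up) that compares p on both expressions with p on the last one only:

```shell
# p on both expressions: the intermediate state is printed as well.
both_print() {
    sed -n -e 's/^\W*//p' -e 's/\(\W\+\)/\n/gp'
}

# p on the last expression only: just the final result is printed.
last_only() {
    sed -n -e 's/^\W*//' -e 's/\(\W\+\)/\n/gp'
}

printf ' a b\n' | both_print
# prints:
# a b
# a
# b

printf ' a b\n' | last_only
# prints:
# a
# b
```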
If you like to see words as sequences of non-whitespace characters, well... they are, but that definition encompasses a lot more than words. Those are not words, those are non-whitespace sequences.
However, if you'd like to match these and print them on separate lines instead of words, use:
sed -n -e 's/^\s*//' -e 's/\(\s\+\)/\n/gp' words.txt
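The difference between the two definitions shows up as soon as punctuation is involved. A small sketch, assuming GNU sed; the function names and the sample text are made up:

```shell
# Splits on runs of non-word characters (\W).
split_nonword() {
    sed -n -e 's/^\W*//' -e 's/\(\W\+\)/\n/gp'
}

# Splits on runs of whitespace (\s) only.
split_space() {
    sed -n -e 's/^\s*//' -e 's/\(\s\+\)/\n/gp'
}

printf "don't stop\n" | split_nonword
# prints:
# don
# t
# stop

printf "don't stop\n" | split_space
# prints:
# don't
# stop
```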
Zero-byte substitution approach
EDIT: The issue of lines with a single word on them and missing newline on EOF as pointed out by @don_crissti can be solved by the following command.
Although not too long, aside from it being ridiculously hacky, it has at least one flaw I know of: namely that it does not work for a file with only a single line, if that single line has multiple words. An idea to solve that would be to add branches to check if the last line is the first, complicating the program even more (and taking me even more time :D).
Here is the command:
sed -rn 's/(\b|\W)+/\x0/g; s/^\x0//; s/\x0$//; s/\x0/\n/g; /^$/d; $! p; $ { s/$/\n/; P }'
Explanation:
The command works in the following passes:
First, runs of non-word characters and word boundaries are substituted by zero-bytes. Word boundaries, which also occur at the beginning and end of a line when a word sits there, are zero-width assertions, not characters. Where a boundary is adjacent to a sequence of non-word characters, the boundary and that sequence are matched together.
Then, zero-bytes are removed from the beginning and end of each line.
Then, every intermediate zero-byte is substituted by a newline.
Any resulting empty lines are deleted from the pattern space. There are no whitespace-only lines at this point.
If the current line is not the last line of input (the $! address), we simply print the pattern space.
At the end of our data, we execute 2 commands:
We add a newline at the end of the current pattern space, to have at least 1 terminating newline, even if the original data didn't end in one.
We print only until the first embedded newline in our current pattern space, which has a maximum of 2 newlines.
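The building blocks of the last two passes can be tried in isolation. This is only a sketch of the $! address and of p versus P, not of the full command; GNU sed is assumed and the function names are made up.

```shell
# p prints the whole pattern space, embedded newlines included.
print_all() {
    sed -n 's/ /\n/; p'
}

# P prints only up to the first embedded newline.
print_first() {
    sed -n 's/ /\n/; P'
}

# $! restricts a command to every line except the last.
all_but_last() {
    sed -n '$! p'
}

printf 'a b\n' | print_all
# prints:
# a
# b

printf 'a b\n' | print_first
# prints:
# a

printf '1\n2\n3\n' | all_but_last
# prints:
# 1
# 2
```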
By the way, the simplest solution to this problem I've seen is:
grep -o '\w\+' words.txt
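For comparison, a sketch of how this behaves on the kind of input that trips up the first sed command (GNU grep assumed; the function name and input are made up):

```shell
# Hypothetical wrapper around the grep command, reading stdin.
grep_words() {
    grep -o '\w\+'
}

# Leading whitespace, punctuation and a single-word line are all handled.
printf '  hello, world\nsingle\n' | grep_words
# prints:
# hello
# world
# single
```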
Or, if you don't need to deal with lines starting with whitespace:
fmt -1 words.txt