
I have a file named /tmp/urlFile where each line represents a url. I am trying to read from the file as follows:

cat "/tmp/urlFile" | while read url
do
    echo $url
done

If the last line doesn't end with a newline character, that line won't be read. I was wondering why?

Is it possible to read all the lines, regardless if they are ended with a new line or not?

Tim

7 Answers


You'd do:

while IFS= read -r url || [ -n "$url" ]; do
  printf '%s\n' "$url"
done < url.list

(effectively, that loop adds back the missing newline on the last (non-)line).
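A quick way to see the difference, using a throwaway file (the path is illustrative):

```shell
# The last (unterminated) line is lost with a plain read,
# and preserved with the || [ -n "$url" ] guard.
printf 'one\ntwo' > /tmp/demo.list    # note: no final newline

while IFS= read -r url; do
  printf '%s\n' "$url"
done < /tmp/demo.list                 # prints only: one

while IFS= read -r url || [ -n "$url" ]; do
  printf '%s\n' "$url"
done < /tmp/demo.list                 # prints: one, then: two
```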

See also:

  • Thanks. I read the linked articles, and maybe I miss something, why "that loop adds back the missing newline on the last (non-)line"? – Tim Jan 18 '18 at 18:26
  • 1
    @Tim What Stephane seems to mean is that it adds back the missing newline in the output since all printf calls here have \n . – Sergiy Kolodyazhnyy Jan 18 '18 at 22:53
  • This is really clever. If the input does not end in a newline, then it must be the last/only line, so as long as it's not completely empty, we still want to process it. I suspect this could also work with read -r url || : instead, but then (I think) you'd always get an empty string at the very end of a file that does in fact end in \n. – shadowtalker Oct 06 '23 at 17:02
  • 1
    @shadowtalker read -r url || : is always true, so the loop would never end and you'd keep printing empty lines after the end of the input is reached. – Stéphane Chazelas Oct 06 '23 at 18:20
  • Of course, that's silly of me. – shadowtalker Oct 06 '23 at 18:27

Well, read returns a falsy value if it meets end-of-file before a newline, but even if it does, it still assigns the value it read. So we can check whether the final call of read returned something other than an empty line, and process it as normal. That is, only exit the loop after read returns false and the line is empty:

#!/bin/sh
while IFS= read -r line || [ "$line" ]; do 
    echo "line: $line"
done

$ printf 'foo\nbar' | sh ./read.sh 
line: foo
line: bar
$ printf 'foo\nbar\n' | sh ./read.sh 
line: foo
line: bar
ilkkachu

By definition, a text file consists of a sequence of lines. A line ends with a newline character. Thus a text file ends with a newline character, unless it's empty.

The read builtin is only meant to read text files. You aren't passing a text file, so you can't expect it to work seamlessly. The shell does read all the lines; what it skips is the extra characters after the last line.

If you have a potentially malformed input file that may be missing its last line, you could add a newline to it, just to be sure.

{ cat "/tmp/urlFile"; echo; } | …

Files that should be text files but are missing the final newline are often produced by Windows editors. This usually goes in combination with Windows line endings, which are CR LF, as opposed to Unix's LF. CR characters are rarely useful anywhere, and can't appear in URLs in any case, so you should remove them.

{ <"/tmp/urlFile" tr -d '\r'; echo; } | …

In case the input file is well-formed and does end with a newline, the echo adds an extra blank line. Since URLs can't be empty, just ignore blank lines.

Note also that read does not read lines in a straightforward way. It ignores leading and trailing whitespace, which for a URL is probably desirable. It treats backslash at the end of a line as an escape character, causing the next line to be joined with the first minus the backslash-newline sequence, which is definitely not desirable. So you should pass the -r option to read. It is very, very rare for read to be the right thing rather than read -r.

{ <"/tmp/urlFile" tr -d '\r'; echo; } | while read -r url
do
  if [ -z "$url" ]; then continue; fi
  …
done
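An end-to-end check of the pipeline above, using a CRLF file with no final newline (the path and URLs are illustrative):

```shell
# Build a Windows-style input: CR LF line endings, missing final newline.
printf 'http://a.example\r\nhttp://b.example' > /tmp/urlFile_demo

# tr strips the CRs, echo supplies the missing final newline,
# and the loop skips the blank line that echo adds for well-formed input.
{ <"/tmp/urlFile_demo" tr -d '\r'; echo; } | while read -r url
do
  if [ -z "$url" ]; then continue; fi
  printf 'url: %s\n' "$url"
done
# url: http://a.example
# url: http://b.example
```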
Jeff Schaller

This seems to be solved in part with readarray -t:

readarray -t urls < "/tmp/urlFile"
for url in "${urls[@]}"; do
    printf '%s\n' "$url"
done

Note however that while this does work for reasonably-sized files, this solution introduces a potential new problem with very large files - it first reads the file into an array which then must be iterated through. For very large files this could be both time- and memory-consuming, potentially to the point of failure.
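A quick check (readarray, also known as mapfile, is bash-specific, bash 4+) that the final line survives without a trailing newline; the file path here is just for the demo:

```shell
#!/usr/bin/env bash
# readarray/mapfile keeps the final line even when the file
# lacks a trailing newline.
printf 'http://a.example\nhttp://b.example' > /tmp/demo.list   # no final \n
readarray -t urls < /tmp/demo.list
echo "count: ${#urls[@]}"      # count: 2
printf '%s\n' "${urls[@]}"
```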

DopeGhoti

Another way would be like this:

When read reaches end-of-file instead of end-of-line, it does read in the data and assign it to the variables, but it exits with a non-zero status. So if your loop is constructed as "while read; do stuff; done", the body never runs for that final, unterminated line: read's failure ends the loop before the body executes.

So instead of testing read's exit status directly, test a flag, and have the read command set that flag from within the loop body. That way the entire loop body runs regardless of read's exit status, because read is just one of the commands in the loop like any other, not the deciding factor of whether the loop runs at all.

DONE=false
until $DONE; do
    IFS= read -r || DONE=true
    printf '%s\n' "$REPLY"
done < /tmp/urlFile

Referred from here.
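A quick behaviour check of the loop above, fed input that lacks the final newline:

```shell
# The flag-based loop still prints the unterminated last line.
printf 'foo\nbar' | {
  DONE=false
  until $DONE; do
    IFS= read -r || DONE=true
    printf '%s\n' "$REPLY"
  done
}
# prints:
# foo
# bar
```

Caveat: if the input does end in a newline, the final iteration runs once more with an empty $REPLY, so this loop prints one extra blank line at the end.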

cat "/tmp/urlFile" | while read url
do
    echo $url
done

This is a Useless Use of cat.

Ironically, you can replace the cat process here with something actually useful: a tool that POSIX systems have for adding the missing newline, and making the file into a proper POSIX text file.

sed -e '$a\' "/tmp/urlFile" | while read -r url
do
    printf "%s\n" "${url}"
done
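A quick way to verify the effect, assuming GNU sed (as noted in the comment below, POSIX leaves the behaviour unspecified when the input doesn't end in a newline):

```shell
# '$a\' (append nothing after the last line) forces a final newline.
printf 'foo\nbar' | od -c                  # last bytes: b a r (no \n)
printf 'foo\nbar' | sed -e '$a\' | od -c   # last bytes: b a r \n
```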


JdeBP
  • The behaviour of sed is unspecified by POSIX when the input doesn't end in a newline character though; also when there are lines larger than LINE_MAX, while the behaviour of read is specified in those cases. – Stéphane Chazelas Jan 19 '18 at 17:36

To read all the lines, regardless if they are ended with a new line or not:

{ cat "/tmp/urlFile" ; echo ; } | while read -r url; do printf '%s\n' "$url"; done

Source : My open source project https://sourceforge.net/projects/command-output-to-html-table/