
I know this is not a very descriptive title (suggestions are welcome), but the fact is that I've been pulling my hair over this for hours and I have no clue where the root of the problem might lie.

I wrote a simple Bash script for CLI chat between peers on a local network:

#!/usr/bin/env bash

# Usage: ./lanchat <local_ip>:<local_port> <remote_ip>:<remote_port>

set -x

set -o errexit -o nounset -o pipefail

IFS=':' read -a socket <<< "$1"
LOCAL_IP=${socket[0]}
LOCAL_PORT=${socket[1]}

IFS=':' read -a socket <<< "$2"
REMOTE_IP=${socket[0]}
REMOTE_PORT=${socket[1]}

RECV_FIFO=".tmp.lanchat"

trap "rm '$RECV_FIFO'; kill 0" EXIT

mkfifo "$RECV_FIFO"

# EDIT: As per @Kamil Maciorowski's suggestion, removing the -q 0 part below solves the issue.

while true; do nc -n -l -q 0 -s "$LOCAL_IP" -p "$LOCAL_PORT" > "$RECV_FIFO"; done &

TMUX_TOP="while true; do cat '$RECV_FIFO'; done"
TMUX_BOTTOM="while IFS= read -r line; do nc -n -q 0 '$REMOTE_IP' '$REMOTE_PORT' <<< \$line; done"

tmux new "$TMUX_TOP" \; split -v "$TMUX_BOTTOM"

The machine on IP 172.16.0.2 is a VPS running Debian 11, and on 172.16.0.100 is my local computer running Arch.

When I run the commands manually at the prompt on both sides, I get the desired result, which confirms that there is no issue with network communication and that the logic of the script is correct.

## VPS (Debian) side as follows; exchange IPs for local (Arch) side.
$ mkfifo .tmp.lanchat
$ while true; do nc -n -l -q 0 -s 172.16.0.2 -p 1234 > .tmp.lanchat; done &
$ tmux new "while true; do cat .tmp.lanchat; done" \; split -v "while IFS= read -r line; do nc -n -q 0 172.16.0.100 1234 <<< \$line; done"
## Test communication in both directions: all right; then CTRL-C twice to exit both tmux panels
$ kill %1; rm .tmp.lanchat

When I run both sides as a script, however, only the local side (Arch) prints messages from the server (Debian). The server prints nothing from my local computer. When I trace the execution with set -x, everything on both sides looks exactly like the commands that I enter manually, with the right values in place of variables.

Now the odd thing is that if I run the script on the Arch side and commands at the prompt (like above) on the Debian side, then everything works fine again. Furthermore, if I execute the script on the Arch side but source it on the Debian side, that too works fine.

Adding verbose output to both nc calls on the Arch side even prints Connection to 172.16.0.2 1234 port [tcp/*] succeeded!. However, adding a tee log.txt to the call to nc in listening mode on the Debian side does not capture anything:

#...
while true; do
    nc -n -l -q 0 -s "$LOCAL_IP" -p "$LOCAL_PORT" | tee log.txt > "$RECV_FIFO";
done &
#...

I tried establishing the connection in all possible orders between the two peers. I even restarted both the server and my local machine to make sure that there were no orphaned or zombie instances of nc hugging the socket that had somehow evaded detection.

Now, Debian and Arch run different versions of nc. So, on the face of it, it sounds like this could be a possible explanation. But doesn't the fact that sourcing the script on Debian's side works fine rule out that possibility?

What the heck is going on, here?

mesr
  • Well, at least that tmux new "while true; do ... is not the same command as that tmux new "$TMUX_TOP" \; ... and for some reason you have single quotes embedded in the TMUX_TOP and TMUX_BOTTOM vars, ones that don't appear in the command you show in the other snippet. But you also don't show the set -x trace output, so we can't see what exactly happens. – ilkkachu Feb 26 '24 at 08:16
  • By "both commands are not the same", do you mean that there are no quotes in the CLI form? Aren't quotes inert anyway? They're in the scripted form to avoid code injection exploits. I tried adding them to the CLI form, just to be sure, with the exact same result. I will update my question with debug output. – mesr Feb 26 '24 at 12:16
  • Right, tmux runs those commands through a shell (since otherwise that <<< within wouldn't work), and the shell interprets the quotes here. So never mind. But, in general, foo="'something'"; echo "$foo" is not the same as echo "something", and putting extra quotes is a common mistake people make when trying to store a command in a variable... In any case, it's best to take care to not add unwanted differences when comparing one situation to another. – ilkkachu Feb 26 '24 at 13:43
  • Just to note, you are re-inventing "talk". https://sourceforge.net/directory/unix-talk/ – Preston L. Bannister Mar 01 '24 at 16:56
  • Good point, @PrestonL.Bannister and I'm aware. I tried talk already and could never get it to work. I came across multiple issues that many other users seem to be experiencing too. That is normally the point where I start exploring the possibility of implementing what I need, especially for such a simple, single task that I need this for. Without mentioning the learning value of doing so. – mesr Mar 01 '24 at 18:16
  • There are (were) a number of talk derivatives. I think ytalk was one that worked for me - but it was a good number of years ago – Chris Davies Mar 01 '24 at 22:40

1 Answer


I have tested your script in Debian 12 (localhost to localhost, separate working directories) and I confirm the problem. My nc is from netcat-traditional 1.10-47 (i.e. not from netcat-openbsd).

The problem is in -q 0 of the listening nc. From man 1 nc:

-q seconds
after EOF on stdin, wait the specified number of seconds and then quit. If seconds is negative, wait forever.

It seems that, because of -q 0, the listening nc waits for an incoming connection before quitting, but it does not wait for incoming data. Establishing a connection and transmitting data are separate events, and because of -q 0 the tool usually quits in between. It's a race; in my tests the listening nc sometimes did relay incoming data to the pipe.

The EOF that triggers the unexpected behavior happens immediately because when a shell without job control runs an asynchronous command (terminated by &, this is how you run the loop with the listening nc), it is obliged to redirect its stdin to /dev/null or to an equivalent file.
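The stdin redirection described above can be observed without any networking. The following is only a sketch: the helper names fg_stdin_bytes and bg_stdin_bytes are made up for this illustration, and it assumes bash and wc are available. It feeds the same five bytes to a foreground command and to an asynchronous command in a non-interactive shell, then counts what each one actually reads.

```shell
#!/usr/bin/env bash

# Count the bytes a foreground command reads from the inherited stdin.
fg_stdin_bytes() {
    printf 'hello' | bash -c 'wc -c' | tr -d '[:space:]'
}

# Count the bytes an asynchronous command reads. 'bash -c' runs without
# job control, so the command started with '&' should get /dev/null (or an
# equivalent) as stdin and hit EOF immediately.
bg_stdin_bytes() {
    printf 'hello' | bash -c 'wc -c & wait' | tr -d '[:space:]'
}

echo "foreground read: $(fg_stdin_bytes) bytes"
echo "background read: $(bg_stdin_bytes) bytes"
```

The foreground count should be 5 and the background count 0, which is the immediate EOF that a listening nc with -q 0 reacts to.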

When you source the script, your interactive shell interprets it. It's probably bash with job control enabled (the default behavior for an interactive bash). If so, it runs the background loop in a separate process group, but its stdin is still connected to the terminal (in general this allows us to fg a background job and type to it). A background job cannot steal input from the terminal because it gets SIGTTIN when it tries to read; EOF never happens. This way, when the script is being sourced, the listening nc does not suffer from the -q 0 that is the problem when you run the script without sourcing.
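To see which of the two situations a given shell is in, one can inspect the shell's option flags. This is a small sketch, with job_control_state being a hypothetical helper name: bash lists the letter m in $- when job control (monitor mode) is enabled.

```shell
#!/usr/bin/env bash

# Report whether the shell interpreting this code has job control enabled.
# The special parameter $- holds the current option flags; 'm' means monitor mode.
job_control_state() {
    case $- in
        *m*) echo "on" ;;
        *)   echo "off" ;;
    esac
}

job_control_state
```

Run as a script, this is expected to report off; sourced from a typical interactive bash, it is expected to report on, matching the difference in behavior described above.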

Specifying -q 1 for the listening nc will help in practice (while still being racy in theory, I guess), but I think it's best to use -q -1 (wait forever) or simply omit -q (in my tests the default behavior seems to be "wait forever").

-q 0 for the connecting nc (the one inside tmux) makes sense, you do want this nc to quit immediately after sending the payload.

nc on your Arch behaved differently maybe because it's a different implementation, or maybe because the overall stress on the OS at that time affected the race.

The lesson is: in case of a nc+nc -l pair that sends data in only one direction (you use one such pair for each line), -q 0 is a useful option for the sender; but for the receiver it's unnecessary, in some circumstances even harmful.
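As a rough sketch of that lesson, the two roles could be wrapped like this. The function names recv_line and send_line and the NC variable are hypothetical, introduced only for this illustration; the nc options are the ones from the question and assume the same nc variant used there.

```shell
#!/usr/bin/env bash

# NC names the netcat binary. Overriding it with a stub makes it possible
# to inspect the command lines without touching the network.
NC=${NC:-nc}

# Receiver: no -q here, so an early EOF on stdin (e.g. /dev/null in a
# background job) does not make nc quit before the data arrives.
# usage: recv_line <local_ip> <local_port>
recv_line() {
    "$NC" -n -l -s "$1" -p "$2"
}

# Sender: -q 0 makes nc quit right after EOF on stdin, i.e. after the
# payload from the here-string has been sent.
# usage: send_line <remote_ip> <remote_port> <text>
send_line() {
    "$NC" -n -q 0 "$1" "$2" <<< "$3"
}
```

In the question's script, the receive loop would then call recv_line "$LOCAL_IP" "$LOCAL_PORT" > "$RECV_FIFO", and each typed line would go through send_line.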


There is more to improve, e.g.:

  • there is a code injection vulnerability (./lanchat <local_ip>:<local_port> <remote_ip>:<remote_port>"'; rogue command'");
  • there are short time windows when there is no listening nc on one end or the other;
  • one pair of ncs is enough to handle a whole "session".

I won't address these here, however I can give you a sketch of an alternative script:

#!/usr/bin/env bash

target="$(tmux new -dP 'tail -f /dev/null')"
uptty="$(tmux display-message -p -F '#{pane_tty}' -t "$target")"
tmux split -t "$target" -v "
   rlwrap tee >(sed -u 's/^/ < /' | ts %H:%M >${uptty@Q}) \
   | nc ${*@Q} > >(sed -u 's/^/> /' | ts %H:%M >${uptty@Q})
   "
tmux a -t "$target"

The script does require bash (for itself and inside tmux). You run it with arguments you want to provide to nc, so e.g.

  • first a listening side: ./lanchat -n -l -s 192.168.11.22 -p 2345,
  • then a connecting side: ./lanchat 192.168.11.22 2345.

A single nc to nc connection handles all the communication in both directions. The script uses ts for timestamps (you can remove both instances of | ts %H:%M if you want) and rlwrap for line editing with readline (you can remove rlwrap if you want). sed -u is not portable; sed without -u will cause buffering issues, unless you also get rid of ts.

Tested in bash 5.2.15, tmux 3.3a.

  • Thanks @Kamil Maciorowski for this detailed explanation. I confirm that the -q 0 part is indeed what caused the issue, and removing it is the solution. It was a carry-over from my earlier attempts where I had issues with nc -l seemingly hanging indefinitely after accepting a connection. Special thanks for your explanation of how shells without job control handle STDIOs; very useful information indeed. However, I'm particularly interested in your final points. Would you mind expanding a bit more on them? – mesr Feb 26 '24 at 12:49
  • @mesr "I'm particularly interested in your final points. Would you mind expanding a bit more on them?" – (1) If $REMOTE_PORT contains a single-quote then "… '$REMOTE_PORT' …" will not behave nicely when given to another shell for interpretation (in your case it's the shell in tmux). In Bash use "… ${REMOTE_PORT@Q} …" etc. (see ${parameter@operator} here). (2) In while true; do nc -n -l …, when true runs, nc does not. (3) nc to nc is bidirectional and able to last. – Kamil Maciorowski Feb 26 '24 at 13:13
  • Thanks again for your edit, @Kamil Maciorowski. There are a couple of good ideas in there! – mesr Feb 26 '24 at 15:31
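
Point (1) from the comment above, about '$REMOTE_PORT' versus ${REMOTE_PORT@Q}, can be illustrated with a harmless payload. This is a sketch only: build_naive, build_safe and the payload are invented for the example, and the @Q operator needs bash 4.4 or later. Both helpers build a command string meant for another shell, the way the question's script builds strings for tmux.

```shell
#!/usr/bin/env bash

# Build "echo '<value>'" by wrapping the value in literal single quotes.
# A single quote inside the value ends the quoting early.
build_naive() {
    printf "echo '%s'\n" "$1"
}

# Build "echo <value>" with the value quoted by bash itself via ${...@Q},
# so the receiving shell sees it as one literal word.
build_safe() {
    printf 'echo %s\n' "${1@Q}"
}

payload="1234'; echo INJECTED; echo '"

echo "--- naive ---"
bash -c "$(build_naive "$payload")"
echo "--- safe ---"
bash -c "$(build_safe "$payload")"
```

With the naive form the second shell runs echo INJECTED as a command of its own; with the @Q form the whole payload is printed back as plain text.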