While it's not explicitly written, I guess the goal is to split traffic such that:
- 172.16.0.0/24 traffic flows through enp1s0f1
- 10.0.0.0/24 traffic flows through enp4s0f0
As OP wrote, this needs policy/source-based routing. iptables and netfilter are rarely useful (at least alone):
- Generally speaking, iptables and netfilter don't route and don't care about routes: the network routing stack routes. Some of iptables' actions can still cause a routing decision to change (as described in this schematic).
- Any action done in POSTROUTING happens, as the name says, after the routing decision was made: it's too late to alter the route. Here the nat/POSTROUTING rules are needed, but they won't alter the route.
Whenever a routing problem can be solved without iptables, it's better to avoid it. Sometimes it can't be avoided (usually iptables is then used to add a mark to packets, and this mark is matched in an ip rule entry).
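As a rough illustration of that last case, a minimal sketch of mark-based policy routing could look like this. The mark value 0x10, the table number 100 and the match on TCP port 443 are hypothetical and not part of the setup discussed here; by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run helper: print the command unless DRY_RUN=0 is set (root needed then).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

mark_and_route() {
    # Hypothetical match: mark forwarded HTTPS traffic arriving on enp4s0f1 ...
    run iptables -t mangle -A PREROUTING -i enp4s0f1 -p tcp --dport 443 -j MARK --set-mark 0x10
    # ... and let the routing stack select a table based on that mark.
    run ip rule add fwmark 0x10 lookup 100
}

mark_and_route
```

The mark has to be set in mangle/PREROUTING, before the routing decision, for the ip rule to see it.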
Routes
I will assume that rp_filter=1 is set on all interfaces, since it's the default for most distributions, to enable Strict Reverse Path Forwarding.
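To check that assumption on a given system, the following sketch prints the effective rp_filter value per interface. As far as I know, the kernel uses the maximum of conf/all/rp_filter and conf/<interface>/rp_filter when validating sources; the helper name and its optional base-directory argument are my own.

```shell
#!/bin/sh
# Print "<interface> <effective rp_filter>" for every interface.
# Effective value = max(conf/all/rp_filter, conf/<interface>/rp_filter).
# $1: base directory (defaults to the real sysctl tree; overridable for testing).
effective_rp_filter() {
    base=${1:-/proc/sys/net/ipv4/conf}
    all=$(cat "$base/all/rp_filter")
    for dir in "$base"/*/; do
        iface=$(basename "$dir")
        case $iface in all|default) continue ;; esac
        val=$(cat "$dir/rp_filter")
        if [ "$all" -gt "$val" ]; then
            val=$all
        fi
        echo "$iface $val"
    done
}

if [ -d /proc/sys/net/ipv4/conf ]; then
    effective_rp_filter
fi
```

A value of 1 means strict mode, 2 loose mode and 0 no source validation.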
The source address is matched by the rule, the destination by the routing table. The additional routing tables should hold enough information to override routes without ambiguity when only one among several candidates should be chosen (only that one is then added to the table). Often additional routes from the main table must also be copied, or bad things can happen.
In my answer I will give no preference to one network over another: each will get its own routing table. I'll forget table 1 and use table 10 for LAN 10.0.0.0/24 and table 172 for LAN 172.16.0.0/24. Keep the NAT rules, remove the existing policy rules and additional routing tables, and also remove 192.168.0.1 dev enp4s0f0 scope link from main.
Routes for 10.0.0.0/24 <--> 10.0.0.6 enp4s0f1 | enp4s0f0 192.168.0.6 <--> 192.168.0.1/default:
ip rule add from 10.0.0.0/24 lookup 10
ip route add table 10 10.0.0.0/24 dev enp4s0f1
ip route add table 10 192.168.0.0/24 dev enp4s0f0 src 192.168.0.6
ip route add table 10 default via 192.168.0.1
Without the duplicated route entry for 10.0.0.0/24 above, the system itself wouldn't be able to reach this LAN: it would resolve the route as having to go through the default gateway (if only for Strict Reverse Path Forwarding (SRPF) purposes), making this difficult to debug. That's an example of the bad things that can happen when a route isn't copied. When in doubt, just duplicate routes.
Another equivalent option, instead of adding the extra route, would have been to change the rule above into:
ip rule add from 10.0.0.0/24 iif enp4s0f1 lookup 10
so it wouldn't have matched for local (non-routed) traffic and only the main table would be used.
Routes for 172.16.0.0/24 <--> 172.16.0.3 enp1s0f0 | enp1s0f1 192.168.0.3 <--> 192.168.0.1/default:
ip rule add from 172.16.0.0/24 lookup 172
ip route add table 172 172.16.0.0/24 dev enp1s0f0
ip route add table 172 192.168.0.0/24 dev enp1s0f1 src 192.168.0.3
ip route add table 172 default via 192.168.0.1
Rules to also alter the route (the link) used by locally initiated outgoing traffic, depending on the outgoing source IP address chosen on the Linux system. These should be optional, but the next part about ARP flux makes them mandatory:
ip rule add from 192.168.0.6 lookup 10
ip rule add from 192.168.0.3 lookup 172
Any non-special case involving the routes overridden by the rules must also be duplicated.
Here the only missing routes are between the two special LANs themselves:
- in table 10 to reach 172.16.0.0/24
- in table 172 to reach 10.0.0.0/24
Because each additional table doesn't yet have a route to the other side, it would use the default route (and be blocked yet again by SRPF), preventing the two special networks from communicating with each other. So just duplicate the missing route in each table:
ip route add table 10 172.16.0.0/24 dev enp1s0f0
ip route add table 172 10.0.0.0/24 dev enp4s0f1
With this model, if for example two other "normal" internal networks were to be added, they could communicate between themselves (and would use the main table's default route to go outside) without extra setting, but would again require duplication of their routes in each additional routing table to communicate with the two special LANs.
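The duplication described above can be sketched as a small helper that emits one ip route add command per additional table. The network 10.1.0.0/24 and the interface enp5s0 in the usage line are hypothetical, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Additional tables used in this answer for the two special LANs.
TABLES="10 172"

# Print the commands that copy a connected route into each additional table.
# $1: prefix of the internal network, $2: interface it is attached to.
dup_route() {
    for table in $TABLES; do
        echo "ip route add table $table $1 dev $2"
    done
}

# Hypothetical extra internal LAN; pipe the output to "sh" as root to apply it.
dup_route 10.1.0.0/24 enp5s0
```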
Routes are now fine, but there's still...
ARP flux
Linux follows the weak host model. That's the case for IP routing, and likewise for the way Linux answers ARP requests: from any interface for any IP, but of course using the answering interface's own MAC address. As this can happen on all interfaces simultaneously when multiple interfaces are on the same LAN, usually the fastest wins. The ARP information is then cached on the remote system and stays there for some time. Eventually the cache expires and the same happens again, possibly with a different outcome. So how does this cause a problem? Here's an example:
- Router (modem) sends an ARP request for 192.168.0.6 to send back routed and NATed (by Linux) reply to traffic initially sent from 10.0.0.0/24.
- Linux replies on enp1s0f1 (enp1s0f1 won the race) using enp1s0f1's MAC address in reply to tell it has 192.168.0.6.
- For a few seconds to a few minutes, future ingress IP packets from Router for 192.168.0.6 arrive on enp1s0f1,
- at the same time egress packets from 192.168.0.6 leave using enp4s0f0.
This asymmetric routing is caught by Strict Reverse Path Forwarding (rp_filter) and the traffic will fail. This can even appear to work randomly for a few seconds then fail again. Depending on overall traffic the problem could even later switch to the other link (and then the problems switch to the other LAN).
Luckily, to prevent this, Linux provides a setting, to be used only together with policy routing, that makes ARP follow the same rules as those defined by routing: arp_filter.
arp_filter - BOOLEAN
1 - Allows you to have multiple network interfaces on the same subnet,
and have the ARPs for each interface be answered based on whether or
not the kernel would route a packet from the ARP'd IP out that
interface (therefore you must use source based routing for this to
work). In other words it allows control of which cards (usually 1)
will respond to an arp request.
sysctl -w net.ipv4.conf.enp4s0f0.arp_filter=1
sysctl -w net.ipv4.conf.enp1s0f1.arp_filter=1
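Settings made with sysctl -w don't survive a reboot. One possible way to persist them, assuming the distribution reads drop-in files from /etc/sysctl.d (the file name mentioned below is my own choice), is to generate the corresponding lines:

```shell
#!/bin/sh
# Print sysctl.d-style lines enabling arp_filter on the given interfaces.
arp_filter_conf() {
    for iface in "$@"; do
        echo "net.ipv4.conf.$iface.arp_filter = 1"
    done
}

# Review the output, then as root redirect it to a file such as
# /etc/sysctl.d/90-arp-filter.conf (hypothetical name) and run "sysctl --system".
arp_filter_conf enp4s0f0 enp1s0f1
```

Per-interface settings can only be applied once the interface exists, so depending on boot ordering they may have to be applied again after the interfaces come up.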
Now the ARP behaviour is correct. If the settings have just been put in place, one should force-flush the ARP cache of peers (here: the modem) by doing a duplicate address detection with arping (from iputils / iputils-arping), which will broadcast to peers and have them update their cache:
arping -c 5 -I enp4s0f0 -D -s 192.168.0.6 192.168.0.6 &
arping -c 5 -I enp1s0f1 -D -s 192.168.0.3 192.168.0.3
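Note that the two rules from the previous part matching the router-facing addresses (ip rule add from 192.168.0.6 lookup 10 and ip rule add from 192.168.0.3 lookup 172) are now mandatory, because the IP addresses 192.168.0.3 and 192.168.0.6 must match in the policy routing rules for correct ARP resolution with arp_filter=1.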
How to debug
ip route get is very useful to check routes and reverse path filtering:
New test case for the routes between the two special LANs (duplicated above):
# ip route get from 10.0.0.111 iif enp4s0f0 172.16.0.111
172.16.0.111 from 10.0.0.111 dev enp1s0f0 table 10
cache iif enp4s0f0
# ip route get from 172.16.0.111 iif enp1s0f0 to 10.0.0.111
10.0.0.111 from 172.16.0.111 dev enp4s0f1 table 172
cache iif enp1s0f0
When deleting rules or routes:
# ip route get from 10.0.0.111 iif enp4s0f1 8.8.8.8
8.8.8.8 from 10.0.0.111 via 192.168.0.1 dev enp4s0f0 table 10
cache iif enp4s0f1
# ip rule del from 10.0.0.0/24 lookup 10
# ip route get from 10.0.0.111 iif enp4s0f1 8.8.8.8
8.8.8.8 from 10.0.0.111 via 192.168.0.1 dev enp1s0f1
cache iif enp4s0f1
# ip route get from 192.168.0.1 iif enp4s0f0 192.168.0.6
local 192.168.0.6 from 192.168.0.1 dev lo table local
cache <local> iif enp4s0f0
# ip rule delete from 192.168.0.6 lookup 10
# ip route get from 192.168.0.1 iif enp4s0f0 192.168.0.6
RTNETLINK answers: Invalid cross-device link
This shows how results are altered depending on (lack of) rules and additional routes. The last result is the error message indicating that the Reverse Path Forwarding check failed (=> drop).
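Such checks can be wrapped in a small loop. The sketch below runs ip route get for a few source / input interface / destination triples taken from this answer and reports those the kernel refuses. The function names and the IP variable (used to substitute another command for testing) are my own.

```shell
#!/bin/sh
IP=${IP:-ip}   # command to run; overridable for testing

# $1: source address, $2: input interface, $3: destination address.
# Prints "ok" or "FAIL" followed by the triple; returns non-zero on failure.
check_route() {
    if "$IP" route get from "$1" iif "$2" "$3" >/dev/null 2>&1; then
        echo "ok   from $1 iif $2 to $3"
    else
        echo "FAIL from $1 iif $2 to $3"
        return 1
    fi
}

# Triples taken from the examples in this answer.
check_all() {
    check_route 10.0.0.111   enp4s0f1 8.8.8.8
    check_route 172.16.0.111 enp1s0f0 10.0.0.111
    check_route 192.168.0.1  enp4s0f0 192.168.0.6
}
```

Running check_all on the router after each change gives a quick view of which paths still pass the reverse path check. A FAIL only tells that ip route get returned an error; rerun the failing command by hand to see the actual message.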
Then there are ip neigh (most useful on peer systems) to check ARP entries, tcpdump, etc.
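For instance, on a Linux peer the MAC address currently cached for one of the router-facing addresses can be extracted from ip neigh output and compared with the MAC address of the interface that is expected to answer. A minimal sketch (the helper name is mine):

```shell
#!/bin/sh
# Read "ip neigh show" output on stdin and print the MAC cached for address $1.
cached_mac() {
    awk -v ip="$1" '$1 == ip {
        for (i = 2; i < NF; i++)
            if ($i == "lladdr") print $(i + 1)
    }'
}

# On the peer:    ip neigh show | cached_mac 192.168.0.6
# On the router:  ip -br link show dev enp4s0f0   (MAC address that should match)
# Watching ARP:   tcpdump -e -n -i enp4s0f0 arp
```

If the cached MAC address belongs to enp1s0f1 rather than enp4s0f0, the peer still holds an entry from before arp_filter was enabled.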