Posted: Thu Sep 01, 2022 9:47 Post subject: [TESTING] Public IP used instead of local IP as source on oet1
Let me know by DM if I can help DD-WRT contributors or the community in return once this issue is solved.
----
This is happening on an R7800 router.
Why do I see the public IP as the source of packets sent to a local IP on WG oet1? Does that look right to you? Exactly those packets don't reach the remote WG endpoint. Other packets work fine.
Please see the tcpdump below for "7.X.X.110 > 192.168.250.111".
The packet arrives from the local network.
Is there any way to change the source IP from "7.X.X.110 > 192.168.250.111" to "192.168.240.111 > 192.168.250.111"?
root@dd-2:~# ip r
default via 7.X.X.254 dev eth0
10.1.220.1 via 10.1.240.1 dev oet1
10.1.230.1 via 10.1.240.1 dev oet1
10.1.240.0/24 dev oet1 scope link src 10.1.240.1
10.1.250.1 via 10.1.240.1 dev oet1
7.X.X.0/22 dev eth0 scope link src 7.X.X.110
127.0.0.0/8 dev lo scope link
192.168.220.0/24 dev oet1 scope link
192.168.230.0/24 dev oet1 scope link
192.168.240.0/24 dev br0 scope link src 192.168.240.240
192.168.250.0/24 dev oet1 scope link
root@dd-2:~#
Note that the issue happens only occasionally. When it happens, I have to restart this DD-WRT router (192.168.240.240) or restart the sender (192.168.240.111). Sometimes multiple restarts are needed, or either of those two units has to be kept offline for ~10 minutes. That helps, but I don't know why.
Firmware: DD-WRT v3.0-r41813 std (12/29/19)
Wireguard version v1.0.20191226
Many thanks!
Last edited by LaimisV on Mon Oct 17, 2022 18:35; edited 2 times in total
Joined: 18 Mar 2014 Posts: 12812 Location: Netherlands
Posted: Thu Sep 01, 2022 9:53 Post subject:
You have a very old build and a lot of things have changed.
Upgrade to the latest build, 50012. *After* the upgrade, reset to defaults and enter your settings manually; never restore from a backup (one made on a different build, that is).
I have three R7800 that connect different locations via Wireguard.
Those locations are remote, so doing an upgrade means traveling, probably multiple times until it is stable.
It would be great to first try something remotely on the existing firmware (I always test each action with "command; sleep 5m; reboot"). So I would only do an upgrade if we know there is a non-trivial chance that it solves the issue.
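The "command; sleep 5m; reboot" habit can also be written as a fallback that is armed before the change and cancelled once the router is confirmed to be reachable. This is only a sketch: `arm_failsafe` and `disarm_failsafe` are made-up helper names, not DD-WRT commands, and it assumes that a reboot discards the change being tested (i.e. the change was not saved).

```shell
#!/bin/sh
# Hypothetical helpers, not part of DD-WRT.

arm_failsafe() {
    # $1: delay in seconds, $2: fallback command (for example "reboot").
    # HUP is ignored so that a dropped SSH session does not cancel the fallback.
    ( trap '' HUP; sleep "$1" && eval "$2" ) >/dev/null 2>&1 &
    FAILSAFE_PID=$!
}

disarm_failsafe() {
    # Kill the waiting subshell so the fallback command never runs.
    kill "$FAILSAFE_PID" 2>/dev/null
}

# Intended use (not executed here):
#   arm_failsafe 300 reboot
#   ...apply the risky change...
#   disarm_failsafe        # only if the router is still reachable
```

If the change cuts off access, nothing disarms the fallback and the router reboots after the delay.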
----
I believe I had the same issue with OpenVPN, before migrating to WG.
Successful flow is this:
192.168.230.111 (sender) -> 192.168.230.230 (DD-3) -> source address on oet1 is 192.168.230.111 -> public IP -> internet -> another public IP -> 192.168.250.250 (DD-1) -> 192.168.250.111. So it reached last IP via WG successfully.
Unsuccessful flow is this:
192.168.240.111 (sender) -> 192.168.240.240 (DD-2) -> source address on oet1 is the PUBLIC IP, NOT 192.168.240.111 AS EXPECTED -> public IP -> internet -> another public IP -> 192.168.250.250 (DD-1) -> 192.168.250.111. So the packet fails locally on DD-2, without even reaching the internet.
----
It is more complicated than this. 192.168.230.111, 192.168.240.111 and 192.168.250.111 can always communicate with each other successfully via WG (over the internet); the issue happens only with specific packets exchanged between those three hosts (Calico Kubernetes networking based on IPIP). So I'm running IPIP (Calico) inside the WireGuard tunnel. It works fine, but sometimes DD-WRT decides to use the public IP on oet1.
You can skip this part to avoid the complexity. I don't think the issue is in the servers. Based on tcpdump etc., it is a DD-WRT issue.
era@master-2:~$ sudo tcpdump -i any -n | grep ' 1308$'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
11:57:04.381414 IP 10.233.66.0 > 10.233.79.26: ICMP echo request, id 10356, seq 1, length 1308
# request looks good, but there is no response
# dd-2 is 192.168.240.240
root@dd-2:~# sudo tcpdump -i any proto 4 -n | grep ' 1308$'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), snapshot length 262144 bytes
13:57:04.372490 IP 192.168.240.111 > 192.168.250.111: IP 10.233.66.0 > 10.233.79.26: ICMP echo request, id 10356, seq 1, length 1308
13:57:04.372490 IP 192.168.240.111 > 192.168.250.111: IP 10.233.66.0 > 10.233.79.26: ICMP echo request, id 10356, seq 1, length 1308
13:57:04.372574 IP 7.X.X.110 > 192.168.250.111: IP 10.233.66.0 > 10.233.79.26: ICMP echo request, id 10356, seq 1, length 1308
# request received in dd-2, but it weirdly tried to use 7.X.X.110 so this can be root cause why there is no response.
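To catch the next occurrence without reading the capture by eye, the tcpdump text output can be filtered for IPIP packets whose outer source is not a LAN address. A minimal sketch, assuming the line layout shown in the capture above (newer tcpdump versions may add interface columns with `-i any`); `flag_unexpected_outer_src` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: reads tcpdump text output on stdin and prints only the
# IP-in-IP lines whose outer source address does not start with the given prefix.
flag_unexpected_outer_src() {
    # $1: expected prefix of the outer source address, e.g. "192.168."
    awk -v prefix="$1" '
        # Assumed fields: time, "IP", outer src, ">", outer dst, "IP", inner src, ...
        $2 == "IP" && $6 == "IP" && index($3, prefix) != 1 { print }
    '
}
```

On the router this could be fed with something like `tcpdump -l -i any -n proto 4 | flag_unexpected_outer_src 192.168.`, where `-l` keeps tcpdump's output line-buffered so matches show up immediately.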
The R7800s are configured in exactly the same way, and the servers are configured in exactly the same way. I ping the same endpoint. One location can reach it, another cannot. I'm sure all locations work perfectly fine when this issue doesn't appear (it appears only occasionally).
It's a really hard issue, isn't it? TBH, I have had this issue for 2 years now. It is the only one I have been unable to solve. If you can't go into details because of the time it would take, do you have any high-level ideas about where I should dig deeper?
Posted: Thu Sep 01, 2022 11:28 Post subject:
That build is really old and has security issues. Like I said, a lot has changed, especially regarding WireGuard. It is now much easier to set up and you do not need any scripts etc.; everything can be handled via the GUI.
You have a multi-site site-to-site setup. Some examples (mesh, hub and spoke) are covered in the advanced guide, but it assumes you are running a recent build.
Unfortunately I cannot help you very well if you stay on this old build.
Thanks. I will consider an upgrade in a while. The security points are also interesting (even though I have a custom dynamic-secret IP whitelist, non-root SSH login with a key and a custom user, etc.).
Your docs are really good; I followed them and set up WG successfully at the beginning of this year.
It seems I missed the part about SFE. I had it set as sfe=1; now I have switched it to zero.
It likely didn't help, because I had to restart DD-WRT twice to solve this oet1/public IP issue.
This issue will repeat in around 2-10 weeks. It's hard to fix because it happens rarely and is hard to reproduce on demand.
If you ever need to explore off topics that I mentioned in my posts and they are useful, feel free to get in touch.
Also I'll check tcpdump & netstat. If I understood correctly, SFE somehow limits the TCP/IP stack, which is why it performs better. Personally I prefer to ensure quality first, then performance.
Intuitively, I believe the frequency of this issue dropped, if the issue didn't disappear completely, when I had backups turned off on our servers for 3-4 months. Those backups ran very frequently and generated traffic and connections through the WG links. Backups were turned on again a few months ago, and the issue is back at its old frequency.
I've checked the maximum connections value on DD-WRT; it is at a low level, but maybe something else related to load makes DD-WRT go crazy with this issue.
Note that CPU is at normal levels, let's say 10-30% on average on the routers.
Also, the fact that I have to turn off the server or the router for 10 minutes, or do multiple restarts, to temporarily fix the issue suggests that some layer 2/layer 3 entries have to be fully cleared to start fresh without this issue.
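A minimal sketch of how that idea could be checked, assuming (this is not confirmed anywhere in this thread) that the stale state is a connection-tracking entry that still carries a source NAT mapping to the public IP. The helper name `find_snat_entries` is made up for illustration. It reads a conntrack table dump and prints the entries towards a given destination whose reply tuple no longer points back at the original sender, which is what a source-NATed flow looks like.

```shell
#!/bin/sh
# Hypothetical helper: list conntrack entries to one destination whose source
# was rewritten (reply-direction dst differs from original-direction src).
find_snat_entries() {
    # $1: conntrack table file, $2: destination address of interest
    awk -v want="$2" '
    {
        n = 0
        for (i = 1; i <= NF; i++) {
            # Each tuple is written as src=... followed by dst=...
            if ($i ~ /^src=/) { n++; src[n] = substr($i, 5) }
            if ($i ~ /^dst=/) { dst[n] = substr($i, 5) }
        }
        # Tuple 1 is the original direction, tuple 2 the expected reply.
        if (n >= 2 && dst[1] == want && src[1] != dst[2]) print
    }' "$1"
}
```

On the router this might be run as `find_snat_entries /proc/net/nf_conntrack 192.168.250.111` (older kernels expose `/proc/net/ip_conntrack` instead; whether either file is present on a given DD-WRT build is an assumption). IPIP has no dedicated tracker and is handled as a generic protocol, and the kernel's default timeout for generic entries (`nf_conntrack_generic_timeout`) is 600 seconds, which would at least be consistent with the ~10 minutes offline that clears the problem.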
BTW, none of the three R7800s is free of this issue. It appears on any of them, randomly.
Note that Kubernetes Calico networking (the app, let's say) requires WireGuard without NAT. It should see the local IPs as they are. So NAT was removed two years ago.
I've also attached the oet1 interface settings. If anything there is related to this source address issue, I can play with those values. E.g. I have never tried to disable "Masquerade / NAT" in the Networking section for oet1.
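To see which interfaces still have source-rewriting rules regardless of what the GUI shows, the NAT table listing can be reduced to its MASQUERADE and SNAT rules. A small sketch, assuming the usual column layout of `iptables -L -n -v` (pkts, bytes, target, prot, opt, in, out, source, destination); `nat_out_interfaces` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "iptables -t nat -L POSTROUTING -n -v" output on
# stdin and prints target, outgoing interface and source for each NAT rule.
nat_out_interfaces() {
    awk '$3 == "MASQUERADE" || $3 == "SNAT" { print $3, $7, $8 }'
}
```

Usage would be `iptables -t nat -L POSTROUTING -n -v | nat_out_interfaces`. A rule whose outgoing interface is `*` or `eth0` and whose source is `0.0.0.0/0` also applies to forwarded LAN traffic leaving through that interface.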
Ok, after some time using build 50012 on all three R7800 routers:
1) The issue that I reported on oet1 (public IP incorrectly used as source towards a local IP) does not seem to have happened. That's promising.
2) Random router reboots roughly every 10-20 days.
3) If I access the router via SSH and run commands like `ip` or `wg`, it reboots after 5-30 minutes. So it can reboot multiple times a day. I'm mostly using OpenSSH. I've switched to the default Dropbear to see how it goes.
4) WiFi disconnects every 30-60 minutes, depending on the client computer. E.g. two WiFi clients work fine, one has this issue.
5) When I connected additional WG clients on MacBooks (the OS probably doesn't matter), I noticed WG disconnections (judging by "latest handshake") lasting around 5-10 minutes every few hours. The disconnections happen between the R7800 units, not on the clients such as the Macs. I removed those computers from the WG tunnels and that fixed the issue.
I know that it could be just my configuration, but on r41813 I didn't have these additional issues 2-5.
If you have any advice on what to do, how to debug this precisely, or how to select another firmware version for stability, that would be great.