OpenVPN defect, bug, on MTU handling - you decide

tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Thu May 06, 2021 15:16    Post subject: OpenVPN defect, bug, on MTU handling - you decide
Here's the setup:

I have two networks. The first is 172.16.1.0. It has a dd-wrt router, an Asus RT-N16 running DD-WRT v3.0-r46446 big (04/24/21) K3 build, set up as an OpenVPN SERVER. Its inside interface is 172.16.1.30 and its outside interface is foo1. It does not act as an internet gateway, just a VPN server. This network has a VoIP phone system on it and a gateway router at 172.16.1.1.

The second is 172.16.100.0. It has a dd-wrt router, a Netgear WNDR4500 running DD-WRT v3.0-r46395 giga (04/19/21), which also acts as an internet gateway. Its inside interface is 172.16.100.1 and its outside interface is foo2. It is set up as an OpenVPN CLIENT. There is a server behind it at 172.16.100.15.

There are multiple actual Internet routers and several networks between foo1 and foo2.
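
A rough sketch of the topology described above:

Code:
 172.16.100.0/24 (remote)                    172.16.1.0/24 (main, VoIP PBX)
 server 172.16.100.15                        gateway 172.16.1.1
         |                                           |
 WNDR4500 [OpenVPN CLIENT]                   RT-N16 [OpenVPN SERVER]
 inside 172.16.100.1                         inside 172.16.1.30
 outside foo2 ==== Internet (several routers/networks) ==== outside foo1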

I can establish a routed VPN between 172.16.1.0 and 172.16.100.0 with NAT and compression turned off on the CLIENT. I can ping from the server at 172.16.100.15 to the router at 172.16.1.1 - that is, a ping travels from the server through the dd-wrt router, into the VPN, through foo2, through the Internet, through foo1, into the dd-wrt router, out of the VPN, then to the gateway router, just fine. I can do filesharing and telnet/ftp/ssh/whatever over it. I can even run Microsoft networking and mount shares over it.

HOWEVER I have SIP phones on 172.16.100.0 that register into the VoIP phone system on 172.16.1.0 that will not maintain registration.

Now, I have been screwing around for a month or so trying to fix this. The problems started when I upgraded from an antique version of dd-wrt that had NO problems with the phones or anything else. I went way down the rabbit hole of thinking it was code efficiency, etc. All baloney.

I finally noticed the following error on the client router's OpenVPN log:

20210505 09:53:26 W WARNING: 'link-mtu' is used inconsistently local='link-mtu 1570' remote='link-mtu 1550'

I attempted to file a bug with dd-wrt and was sent here for help; the devs did not believe it was a bug. So here are the results of my investigation - you can tell me whether it's a bug or not.

The first thing I did was replace the dd-wrt router on the SERVER side with a second ASUS RT-N16 running firmware DD-WRT v3.0-r45993 mega (03/12/21) K2.6 build.
The MTU error messages went away, and the phones NOW maintain registration. This is with the exact same config. Maybe it's an older OpenVPN build; I didn't check.

So I switched back to the original ASUS RT-N16 running the new K3 build and the problems came back - warnings in the logs on both routers and the phones started dropping registration.

I then went on to the server at 172.16.100.15 and issued the following command:

ping -M do -c 1 -s 1419 172.16.1.1

it worked. But, if I issued

ping -M do -c 1 -s 1420 172.16.1.1

it failed. BUT, and this is CRITICAL, not only did it fail but it failed in the WORST possible way - by simply trashing the ICMP packet and dropping it into the bit bucket. No ICMP error message was sent back.
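
One way to confirm the silent drop (a sketch - eth0 is an example, substitute the pinging host's interface) is to watch for the ICMP type 3, code 4 "fragmentation needed" replies while the oversized ping runs. On the broken build this capture stays silent:

Code:
tcpdump -ni eth0 'icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'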

So I switched back to the RT-N16 running the K2.6 code and from the server I executed the command:

ping -M do -c 1 -s 1420 172.16.1.1

it now works. In fact it works all the way up to a segment size of 1472. If I set the segment size above that and do this:

ping -M do -c 1 -s 1480 172.16.1.1
PING 172.16.1.1 (172.16.1.1) 1480(1508) bytes of data.
ping: local error: message too long, mtu=1500

I get an error. BUT the CRITICAL thing is that the error is a PROPER error, insofar as I'm being told the packet is too big.

THIS IS NORMAL, since 1472+28=1500: the IPv4 header is 20 bytes and the ICMP echo header (like a UDP header) is 8 bytes, so ping's payload rides inside 28 bytes of headers. The logs on the client and the server say the link is negotiating at an MTU of 1500, so no payload going through the VPN will EVER be able to be above 1472. BUT that is perfectly fine - because the K2.6 dd-wrt router is properly returning an ICMP "fragmentation required" message for packets larger than that. This of course causes the TCP/IP stacks on the phones to resend their SIP registration messages at a smaller size, so those messages get through and the phones stay registered.

Now, I posted a query here a few weeks ago thinking it was some sort of efficiency nonsense in the OpenVPN code. It got some useful academic discussion but not much beyond that. BUT it doesn't matter, since I stumbled over the real issue - the MTU handling.

It was now clear to me that the dd-wrt router running OpenVPN as a server on the K3 builds is not properly negotiating MTU with the dd-wrt router running OpenVPN as a client, so the client dd-wrt is not returning Packet Too Big error messages to the phones. Thus the UDP SIP packets are getting trashed, causing the phones to lose registration.

OpenVPN on dd-wrt K3 builds works properly IF it is set up as a CLIENT. But it DOES NOT if it's set up as a SERVER - at least not on the current versions. Possibly a newer OpenVPN version went into dd-wrt between the 4/19/21 and 4/24/21 builds, but I don't know.

So I tried some more configuration experimenting on the dd-wrt routers to prove all this out, and here are the results.

If I run the ASUS RT-N16 OpenVPN SERVER dd-wrt router on the new K3 build AND do the following, I can get it to work properly:

On the Server router, enable the Advanced options.
In Advanced options, set the Tunnel MTU setting to 1440

On the Client dd-wrt router, do the same thing. Advanced options is already enabled there in order to turn off NAT; just knock the Tunnel MTU down from its default of 1500 to 1440.
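
For anyone scripting this instead of using the GUI, the equivalent OpenVPN directive (a sketch - assuming your additional-config text is passed straight through to the generated config) is the same single line on both ends:

Code:
tun-mtu 1440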

NOW: on the Server dd-wrt openvpn log I get this:

20210506 00:11:00 W 68.185.12.178:56313 WARNING: 'link-mtu' is used inconsistently local='link-mtu 1490' remote='link-mtu 1510'

On the Client dd-wrt openvpn log I get this:

20210506 00:11:00 W WARNING: 'link-mtu' is used inconsistently local='link-mtu 1510' remote='link-mtu 1490'

BUT MUCH MORE IMPORTANTLY on the 172.16.100.15 server ping tests I get the following:

pings with a segment size of 1412 make it through
pings with a segment size of 1413 get an error sent back that the message is too long

This is the exact same behavior as the K2.6 build, except of course that the maximum packet size is smaller - but it still tracks out: 1412+28=1440.

Most importantly, my phones do not seem to be losing registration. In fact they seem to immediately get and stay registered.

Basically, somewhere in the cipher or encryption engine, OpenVPN is completely screwing up or losing the MTU. When the tunnel is forced down to a 1440 MTU, the grand total packet size drops below 1500. Since the dd-wrt routers have Ethernet connections and Ethernet's MTU is 1500, that has to happen or the encrypted packet will get trashed.

Now, I DID notice on the 4/24 code that Brainslayer put "Default 1400" on the Tunnel MTU setting under the advanced settings. BUT the actual default that is still put in there is 1500. And worse, when I tried using MTU 1400 on each side, I got more errors from OpenVPN because now it was too small. When I tried above 1440 it worked, HOWEVER I started getting black holes in segment-size tests above 1400. 1440 seems to make both sides happy. And of course this is tun MTU, not link MTU - but apparently OpenVPN calculates (more like MIS-calculates) the link MTU from the tun MTU. Attempting to set the link-mtu parameter in the openvpn config does not work.

Sure seems like a bug to me. OK, maybe it's an OpenVPN bug - but it's still a bug. The "workaround" is to set the DEFAULT tun MTU to 1440 if OpenVPN is in server mode; then, even though OpenVPN will complain, it will negotiate a link MTU that fits under the Ethernet MTU of 1500. Or you can run K2.6 code on your OpenVPN servers.

One last thing - TCP has black-hole path MTU discovery that will work around OpenVPN's brokenness. That's why a TCP connection will work - although it WILL trash your throughput over the VPN.
tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Thu May 06, 2021 18:24
For a (crude) way to find the max MTU your VPN link will pass before it errors out:

Start by picking an MTU that's well below what you think is being trashed (I'll pick 1350). Then pick the destination - a machine on the network on the other side of the VPN; I'll pick 172.16.1.1. Then, at a root Ubuntu command prompt:

# size=1350
# while ping -s $size -c1 -M do 172.16.1.1; do ((size+=1)); done

Hit enter and make sure you get a response for ALL values!!! At the end, ping will error out, which ends the loop and prints something like:

--- 172.16.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 49.832/49.832/49.832/0.000 ms
PING 172.16.1.1 (172.16.1.1) 1413(1441) bytes of data.
ping: local error: message too long, mtu=1440

--- 172.16.1.1 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Now, to see the ACTUAL max MTU size:

# echo $size
1413

1413 is when it starts getting too big. Next, you have to continue testing all packet sizes ABOVE 1413, because Ethernet's MTU is 1500, and unless you are pinging from a Linux box jacked into an OC3, 1500 will ALWAYS be presented to your IP stack as the max MTU.

This hack will do that:

while ping -s $size -c1 -M do 172.16.1.1 || true ;
do ((size+=1)); done

Carefully examine the output: it is important that EVERY SIZE TESTED up to 1500 comes back with a "message too long".

You will have to close the shell session to kill the loop.
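
Here's a bounded variant of the same probe, as a sketch (the target and range are examples): it classifies every payload size as a pass, a proper ICMP error, or a black hole, and exits on its own:

Code:
#!/bin/bash
target=172.16.1.1
for size in $(seq 1350 1500); do
    # exit status 0 means an echo reply came back
    if out=$(ping -c1 -W2 -M do -s "$size" "$target" 2>&1); then
        echo "$size PASS"
    # a proper rejection mentions the size problem
    elif grep -qiE 'too long|frag needed' <<<"$out"; then
        echo "$size TOO BIG (proper ICMP error)"
    else
        echo "$size BLACK HOLE (silently dropped)"
    fi
done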
tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Fri May 07, 2021 6:21
Hi All,

Well I have a bit of egg on my face - I said:

"First thing I did was replace the dd-wrt router on the SERVER side with a second ASUS RT-N16 running firmware DD-WRT v3.0-r45993 mega (03/12/21) K2.6 build
The mtu error messages went away. The phones NOW maintain registration"

This wasn't exactly true. The error messages DID NOT go away, although the phones DID maintain registration. The reason they did is that the MTU negotiation was PROPERLY handled - that is, up to a packet size of 1472 the packets went through, and for all packet sizes above that I got a proper ICMP "packet too large" error message. This was on the K2.6 code. The K3 code is the one that gave me MTU "black holes" for packets above a certain size. So far, the MTU hack I outlined above OR switching to K2.6 code is the only way to keep large packets from getting trashed.

I CAN confirm that firmware DD-WRT v3.0-r46239 mega (04/01/21) K2.6 on a Netgear WNDR4000 ALSO works properly on the OpenVPN Server side. That is, with tun MTUs set to 1500 on both ends of the VPN, packets up to a size of 1472 are passed, and ALL packet sizes 1473 and above are dropped with a proper ICMP "packet too large" error message returned.

I can also confirm that firmware DD-WRT v3.0-r46395 mega (04/19/21) K2.6 nv64k on the WNDR4000 router also works properly with the MTU size issue. That is the last K2.6 version available. I'm now going to look at some other K3 devices that I know were broken with this.
tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Fri May 07, 2021 9:57
Now for the K3 stuff.

The first router I tested was a Netgear AC1450 running firmware DD-WRT v3.0-r46446 std (04/24/21). This is an ARM device, not one of the older Broadcom MIPS units.

With the MTU on the tun interfaces set at 1500 on both sides of the VPN, I was only able to pass packets up to a size of 1419 through the VPN. From size 1420 to size 1472 there exists a black hole where the proper "packet too large" ICMP error messages ARE NOT returned when a packet in that range is put through the VPN. From size 1473 onwards I AM getting the proper ICMP "packet too large" messages. So THIS device with dd-wrt is broken and must use the set-tun-to-1440 hack detailed above.

The next router was an RT-N66U running firmware DD-WRT v3.0-r46446 (04/24/21). It displayed the same bug (as expected), indicating this is not related to hardware.

Clearly we have a bug here, although it is not strictly an OpenVPN bug: the K2.6 and K3 builds of the same version of dd-wrt behave differently - the K3 build has a bug handling MTU and the K2.6 build does not. And it seems to be in the OpenVPN server, not the client.

For my own use I would prefer routers like the Netgear AC1450, since they're faster and more powerful, but I don't want to give up the larger packets I get with the K2.6 code. Unfortunately there are not many K2.6 routers with more than 32k of nvram, which makes OpenVPN a non-starter on the newest firmware since you run out of nvram. So I hope this bug gets fixed.

For a MTU discussion that is in more layman's terms see: https://blog.cloudflare.com/ip-fragmentation-is-broken/
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12837
Location: Netherlands

PostPosted: Fri May 07, 2021 12:30
Are you sure you disabled SFE on both server and client?

I have seen that wreaking havoc.

MTU problems can be a real PITA.

I added this thread to the VPN troubleshooting guide; there is a section about MTU problems which also details the approach for getting the right MTU size if PMTUD is not functioning (for interested readers: https://en.wikipedia.org/wiki/Path_MTU_Discovery )

But strange it is; I will try to replicate it on my setup.

_________________
Routers:Netgear R7000, R6400v1, R6400v2, EA6900 (XvortexCFE), E2000, E1200v1, WRT54GS v1.
Install guide R6400v2, R6700v3,XR300:https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=316399
Install guide R7800/XR500: https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=320614
Forum Guide Lines (important read):https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=324087
tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Fri May 07, 2021 16:48
SFE is NOT disabled as part of the config on either router. I can swap in a K3 router on the server side (which displays the bug), disable SFE, then run the MTU black-hole discovery test again and see if anything changes.

Just to reiterate for the benefit of anyone reading: the MTU size limit itself IS NOT PARTICULARLY IMPORTANT. As long as the MTU the link is able to carry is above 1000, it will be plenty large enough to carry ANY IPv4 traffic. What IS critical is whether the routers on each end properly fragment/reassemble packets, and whether they reply to the sender when a packet is too big - for EVERY SIZE of large packet. By definition they can receive a packet up to a max of 1500, because dd-wrt devices all have Ethernet interfaces.

Because the MTU of Ethernet is 1500, any sender on either side of the VPN link is going to initially assume it can send a packet as large as 1500. If a VPN link (or any other link) has to restrict this for whatever reason, the VPN routers must fragment the packet, send the fragments, and reassemble them at the other end. However, some packets cannot be fragmented, so the sender sets the Do Not Fragment flag. In that case the routers must respond to a too-large packet by notifying the sender it's too large. The sender can then elect to reduce the size of the packets it sends.

With TCP, the TCP/IP stack determines whether fragmentation is working and what the optimum MTU size is. It does this through Path MTU Discovery, which you referenced in the Wikipedia link.

With UDP, the application itself is in charge of the size of the packets it sends. Many UDP applications are simple and crude and lack sophisticated black-hole path-MTU-discovery methods, so they use the crudest and simplest method, which is to depend on any router in the path returning an ICMP Packet Too Big message. If a router in the path fails to do this for a packet size too large for its MTU, that router is referred to as a black hole.
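
As an aside, there's a one-command way to watch PMTUD work (or fail) from a Linux host - tracepath from iputils probes with DF-marked UDP packets and prints the path MTU it discovers, so a black hole shows up as the probe stalling without ever reporting the reduced pmtu:

Code:
tracepath -n 172.16.1.1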

Modern TCP stacks can compensate for black holes, but simple UDP applications - such as my VoIP phones - cannot. I tested a number of VoIP phones on this link years ago when I set it up and discovered that SOME of them (I won't name manufacturer names) are so incredibly crude that they don't pay attention to Packet Too Large messages at all!!! They assume the Ethernet MTU of 1500 and won't work over a VPN at all (at least not an OpenVPN one).

It is not just phones that are affected by a black hole. DNS, VoIP applications, and even many voice and video streaming applications depend on UDP packets and lack any sort of advanced path discovery.

That is why this bug is so serious. It's simple enough to test for black holes using the crude ping example above (it is worth noting that even the K2.6 router that properly returns an ICMP Packet Too Large message reports an INCORRECT maximum MTU in that message - but what's a cup of water poured over your head when you are drowning). Testing proper fragmentation/reassembly of UDP packets is more difficult, and I have not done it - assuming that if the code doesn't botch MTU size reporting, it won't botch fragmentation/reassembly. Fortunately my VoIP phones don't appear to depend on correct fragmentation/reassembly and are indeed setting the DF flag on the packets they send out. (Of course that's yet another crude application shortcut, but what do you expect from phone guys designing phones?)
tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Sat May 08, 2021 13:36
I set up a second WNDR4000 with K3 code, version 46446 mega, and swapped it in for the one running K2.6 code.

I turned off SFE as requested. It made ZERO difference.

I also did some more extensive testing for the black hole - this time I sent traffic BOTH from devices behind the OpenVPN Server on the main network AND from devices behind the OpenVPN client on the remote network, to machines on the opposite network. Based on those results I am now recommending a max tun size of 1430, not 1440 as I said before. The 1440 size is OK for packets originating from the Client side, but it trashes packets originating from the Server side.

With a tun size of 1430 on each side, from the Linux host on the client side running the pings, I can pass a packet of up to size 1402, while size 1403 and larger kicks back an ICMP "packet too large" message.

From a FreeBSD host on the Server side running the pings, I can pass a packet of up to size 1402; packets of 1403 and larger get back "frag needed and DF set" up to a size of 1472, and then from a packet size of 1473 and larger I get back "message too long". (1402+28=1430, where the 28 bytes are the IP-plus-ICMP/UDP headers - that is how it relates to the tun MTU - and 1472+28=1500, which is the max MTU of Ethernet, if anyone is interested.)

I believe the diagnostic response from the ping on the Linux side is probably due to limitations in the Debian TCP/IP stack - Debian probably lumps "frag needed" in with "message too long". Linux tends to do this - it strips out useful diagnostic info because the developers think users are stupid. That attitude really annoys me.

This leads me to a guess about the OpenVPN bug: the 41 bytes of packet size that get black-holed when the tun MTU is set higher than 1430 are being consumed by the encryption protocol. I set AES-128-GCM on this, but who knows what else OpenVPN is stuffing in there. I think OpenVPN's bug is that it is not taking into account the space needed for encryption when it gets a packet to encrypt. What SHOULD happen is that when OpenVPN starts, it queries the kernel for the MTU of every interface in the router; then, when it gets a packet to encrypt, if the total packet size after encryption is too large to send out an interface, OpenVPN should abort the encryption and transmit an ICMP "too large" message back if the DF bit is set, OR it should fragment the packet. It may be that in the K3/4 kernel OpenVPN is not able to obtain the correct MTU, while in the K2.6 kernel it IS. This may be a kernel bug introduced in the port of the K3/4 kernel to the Broadcom or ARM architecture. Or it could just be stupidity in OpenVPN - after all, it botched the link-mtu settings.
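
Worth a try while chasing this: OpenVPN ships a built-in empirical prober. Adding this directive on both ends (assuming the dd-wrt additional-config field passes it through) makes OpenVPN send varying-size test payloads on connect and log the MTU it actually measured:

Code:
mtu-test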

Anyway getting back to it,

On the Server the warning message from OpenVPN is

WARNING: 'link-mtu' is used inconsistently local='link-mtu 1480' remote='link-mtu 1500'

On the Client the warning message from OpenVPN is

WARNING: 'link-mtu' is used inconsistently local='link-mtu 1500' remote='link-mtu 1480'

It's possible to set the tun size on the Server to 1450; this makes the link-mtu message go away but replaces it with a message that the tun-mtu sizes are inconsistent. However, doing that then causes a black hole of packet sizes to open up. In other words, OpenVPN appears to work with asymmetrical tun-mtu sizes, but that triggers a failure to respond with an ICMP message for certain packet sizes.

I SUSPECT, but have not tested, that the K2.6<->K3 OpenVPN link was producing a black hole of packet sizes with tun set to 1500 in the direction from the server network to the client network. I should have tested for asymmetrical black-hole behavior while that link was up. I'm now starting to believe there are two bugs, one in OpenVPN and the other in the K3/4 kernel - the OpenVPN bug being present in both K2.6 and K3/4 code, while the kernel bug is only present in the K3/4 kernels. Or something like that.

Another thing that needs testing is the different ciphers. If my guess is right, and the different ciphers use less or more space, then the tun MTU of 1430 may not work - or it could even be wasting space; the packet could be larger and still fit in 1500 bytes.
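
A starting point for that test, as a sketch (works on any box with the openvpn binary): list the candidate ciphers with their key and block sizes, since block size drives the padding portion of the per-packet overhead for the CBC modes:

Code:
openvpn --show-ciphers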

Lastly, proper fragmentation handling needs testing. I'm only looking at proper "packet too large" handling on dd-wrt/OpenVPN, but that only applies to packets that have DF set in the header. A regular packet does not have DF set, so is OpenVPN properly fragmenting/reassembling it? (Well, we KNOW it's DEFINITELY fragging packets when it dumps them in the bit bucket, hah hah.)

AND I need better tools on Debian. The FreeBSD ping is very good - all the FreeBSD diagnostic network tools are great. Debian, not so much.

IN SUMMARY, all of this is very interesting but mainly of interest to us academics. It's quite clear to me that dd-wrt's responsibility is to not lay a trap for users (that's openwrt's forte), and the quickest way to do that is to set the default tun MTU in OpenVPN server mode to 1430 or lower, so that the great many regular users who just want to run dd-wrt with an OpenVPN server are not bitten by this and sending in periodic help queries about phantom "bugs" in OpenVPN on dd-wrt.

So, about bug reports - I opened an svn bug that you closed. That is OK, because it was not complete. But I now think enough investigation has been done that it at least warrants an immediate change of the default tun size in dd-wrt. Obviously the "proper" fix is to figure out why the K2.6 builds work right and the K3/4 builds don't, fix the K3/4 builds, and then figure out whether there isn't some other brokenness in OpenVPN. But my guess is this ALSO affects Tomato and all the other router projects, plus people building on Raspberry Pis and other such hardware, and probably should be handled in the OpenVPN project. That's going to take a while, and until then I think we should protect our users with that default change. It can always be put back to 1500 later.
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12837
Location: Netherlands

PostPosted: Sat May 08, 2021 16:19
I am just testing with CHACHA20-POLY1305. The tun-MTU value applies to packets before compression/encapsulation, so that can play a role in the (mis)calculation, as can the different kernels used.

I can pass packets of 1420 bytes (both from client to server and back).
Above that value packets are black-holed; above 1472 I get the proper response.

OpenVPN (if it really is an OpenVPN bug) is certainly not bug-free. We were troubled by non-functioning PBR when multiple tables were present: OpenVPN always used the default route of the lowest table number instead of the main table (254), so we had to add a low table with the correct default route. Luckily that bug was resolved in the last update, from April 24.

I can see the users complaining about why the value is so low. OpenVPN advises using mssfix and fragment and leaving tun-mtu at 1500.
From the MAN page:
Quote:
--tun-mtu n
Take the TUN device MTU to be n and derive the link MTU from it (default=1500). In most cases, you will probably want to leave this parameter set to its default value.
The MTU (Maximum Transmission Units) is the maximum datagram size in bytes that can be sent unfragmented over a particular network path. OpenVPN requires that packets on the control or data channels be sent unfragmented.
MTU problems often manifest themselves as connections which hang during periods of active usage.
It's best to use the --fragment and/or --mssfix options to deal with MTU sizing issues.
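
For reference, the shape the man page is recommending looks like this (values illustrative; same lines on both ends):

Code:
tun-mtu 1500
fragment 1400
mssfix 1400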


The warning I get when setting MTU to 1448:
Code:
20210508 16:40:23 W WARNING: normally if you use --mssfix and/or --fragment you should also set --tun-mtu 1500 (currently it is 1448)


And of course using 1500 usually works, as most traffic is TCP and UDP traffic like DNS usually uses smaller packets, so a normal speed test or packet-loss black-hole detection test shows no problems.
But if you are using specific UDP traffic like you are, then you are in trouble.

But it gets even stranger. I fired up tcpdump to see what was going on while using streaming media (YouTube), which runs fine because it apparently sends UDP packets of 1350 bytes.
(MTU was set to 1500.) Then all of a sudden the black hole was gone and the maximum unfragmented size was 1392; above that I got the proper response. So no more black hole, WT*, and of course streaming was still fine.

I rebooted both client and server, and the black hole between 1420 and 1472 was there again.

So I'm not sure what is going on. Something is buggy, but of course 99% of users will not be affected or notice.

A well-documented bug report which is discussed in the forum is more than welcome, so thanks for your work.

SurprisedItWorks
DD-WRT Guru


Joined: 04 Aug 2018
Posts: 1446
Location: Appalachian mountains, USA

PostPosted: Sat May 08, 2021 17:43
Hi @egc... just to add another data point for you. In 46069, in recent weeks, I've done many ping tests for MTU across the OpenVPN client. My approach recently has NOT been the usual one of picking one target to ping and just searching the packet-size space for success - that was yielding confusing results. Instead I've been pinging all 36 (or so) other US servers of AirVPN (can't ping the one I'm using) at a fixed packet size, with an outer loop searching the packet-size space.

What I've seen is that 1419 and below always works and that 1473 and up always fails. But 1420 through 1472 is more interesting: generally these packet sizes work with many ping targets but fail on one or two or perhaps a few. Of course, 1420 through 1472 at the OpenVPN client's input corresponds to 1448 through 1500 after ping adds its own overhead of 28 bytes, so it's as if the desired 1500 tunnel mtu isn't really provided over all paths, in my case from my OpenVPN client to various other AirVPN servers around the US. Most paths work at 1500, but some don't.
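
A sketch of that two-loop probe (the server-list file is hypothetical; the size range brackets the interesting region):

Code:
for size in $(seq 1415 1475); do
    while read -r host; do
        ping -c1 -W2 -M do -s "$size" "$host" >/dev/null 2>&1 \
            && echo "$size $host PASS" || echo "$size $host FAIL"
    done < airvpn_us_servers.txt
done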

_________________
2x Netgear XR500 and 3x Linksys WRT1900ACSv2 on 53544: VLANs, VAPs, NAS, station mode, OpenVPN client (AirVPN), wireguard server (AirVPN port forward) and clients (AzireVPN, AirVPN, private), 3 DNSCrypt providers via VPN.
tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Sat May 08, 2021 19:49
egc wrote:
I am just testing with CHACHA20-POLY1305. The tun-MTU value applies to packets before compression/encapsulation, so that can play a role in the (mis)calculation, as can the different kernels used.


I can test with CHACHA

egc wrote:

I can pass packets of 1420 bytes (both from client to server and back).
Above that value packets are black-holed; above 1472 I get the proper response.


Do you have control of both ends of the VPN link or is this terminating at a VPN provider?

egc wrote:

I can see the users complaining about why the value is so low. OpenVPN advises using mssfix and fragment and leaving tun-mtu at 1500.
From the MAN page:
Quote:
--tun-mtu n
Take the TUN device MTU to be n and derive the link MTU from it (default=1500). In most cases, you will probably want to leave this parameter set to its default value.
The MTU (Maximum Transmission Units) is the maximum datagram size in bytes that can be sent unfragmented over a particular network path. OpenVPN requires that packets on the control or data channels be sent unfragmented.
MTU problems often manifest themselves as connections which hang during periods of active usage.
It's best to use the --fragment and/or --mssfix options to deal with MTU sizing issues.




from

https://openvpn.net/community-resources/reference-manual-for-openvpn-2-4/

on the fragment option:

"It should also be noted that this option is not meant to replace UDP fragmentation at the IP stack level. It is only meant as a last resort when path MTU discovery is broken. Using this option is less efficient than fixing path MTU discovery for your IP link and using native IP fragmentation instead."

In other words, it's for when one of the various Internet routers between the two OpenVPN devices has a black hole. Well, I did NOT check for that, but I CAN, since I have subnets on both ends and I happen to have servers at both ends that sit directly on the public IPs of those subnets.

From the manual on the mssfix:

"Announce to TCP sessions running over the tunnel that they should limit their send packet sizes such that after OpenVPN has encap.."

So right there, that's off the table - this is UDP traffic being sent over the VPN.

egc wrote:

The warning I get when setting MTU to 1448:
Code:
20210508 16:40:23 W WARNING: normally if you use --mssfix and/or --fragment you should also set --tun-mtu 1500 (currently it is 1448)



Because we are dealing with a defect, all bets are off - you cannot assume that setting the MTU to 1448 or 1500 or even 1430 is actually doing what the manual says it does. I don't think there's any value in trying to calculate what the value is logically supposed to be and then using it. That's why I changed my recommendation from a tun size of 1440 to 1430 - because of observation of what was going on in response to various tun MTU sizes.

egc wrote:

And of course using 1500 usually works, as most traffic is TCP and UDP traffic like DNS usually uses smaller packets.


DNS lookups CAN be affected by a black hole if the DNS server is on the "server" network and the client is behind a dd-wrt router. Some DNS replies are quite large - see these discussions: https://redmine.pfsense.org/issues/6870, https://www.zdnet.com/article/how-to-test-your-resolver-for-dns-reply-size-issues/ and https://www.icann.org/en/system/files/files/sac-035-en.pdf

Even regular non-DNSSEC replies can get large if, for example, they are looking up a website with a load balancer that lists a dozen server names. The problem exists when the DNS server is on one side of the VPN and the client is on the other.
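
The ZDNet article above covers DNS-OARC's reply-size tester. Assuming dig is available on a client behind the VPN, the quick check looks like this (it reports the largest reply size your resolver path handled):

Code:
dig +short rs.dns-oarc.net txt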

egc wrote:

But it gets even stranger. I fired up tcpdump to see what was going on while using streaming media (YouTube), which runs fine because it apparently sends UDP packets of 1350 bytes.
(MTU was set to 1500.) Then all of a sudden the black hole was gone and the maximum unfragmented size was 1392; above that I got the proper response. So no more black hole, WT*, and of course streaming was still fine.

I rebooted both client and server, and the black hole between 1420 and 1472 was there again.


The YouTube UDP packets very likely do not have the DF flag set, so OpenVPN and/or the Linux kernel might be fragmenting them - but are your YouTube packets coming through the VPN? Wouldn't they be coming from YouTube? Do you have control of both ends of your VPN, or just one end?

egc wrote:

So I'm not sure what is going on. Something is buggy, but of course 99% of users will not be affected or notice.


I think this is wishful thinking. TCP runs over the same IP layer as UDP, so IP-level problems like a black hole can be worked around, but there is always going to be a performance penalty because the stack will just fragment everything to bits.

egc wrote:

A well-documented bug report which is discussed in the forum is more than welcome, so thanks for your work.


You're welcome! This all started out from my desire to save around $50 a month paying for a land phone line at the remote site. Over the last two years that would have cost me $1200.

I suppose I have spent at least twice that on labor and keeping all of this running not to mention the phone hardware and time spent learning asterisk and phone provisioning.

I suppose I am an idiot.

But seriously, it's my hope that others can benefit from the work here. VoIP is still pretty primitive, and most solutions out there are packaged deals all rolled up into one - you pay a frightfully large amount of money and they provide -everything- and claim they will make all of it work together perfectly. My setup is a roll-your-own, and I have ALWAYS had just a bit of trouble with phones staying provisioned, and it always bugged the eff out of me why that was.

Ever since closing the black hole on that VPN link - rock solid phones. ROCK solid. That's almost reward enough!

I have one customer running 300 phones on Cisco's turnkey UCS at one site, plus at least 10 remote sites with 15 extensions each. You cannot imagine how much they have to fork over every month for this, let alone the initial hardware cost. And yet, underneath the pretty plastic with Cisco stamped on it are the same bones we are dealing with here. Cisco just packages 'em up. And do they feed bugs they discover back upstream to OpenVPN? I have to wonder. I really, really have to wonder. You know, the RV340 supports OpenVPN clients and it's a wrapper on Linux.....
tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Sat May 08, 2021 19:56
SurprisedItWorks wrote:
it's as if the desired 1500 tunnel mtu isn't really provided over all paths, in my case from my OpenVPN client to various other AirVPN servers around the US. Most paths work at 1500, but some don't.


It is simply impossible for OpenVPN to provide a full 1500 MTU. The encrypted, encapsulated VPN packet must exit the OpenVPN router at no larger than 1500 bytes, because that is Ethernet's MTU, and that packet must have room for OpenVPN's own overhead. So a path through an OpenVPN tunnel can NEVER offer 1500. You should be seeing this when you set the DF bit on your pings.
SurprisedItWorks
DD-WRT Guru


Joined: 04 Aug 2018
Posts: 1446
Location: Appalachian mountains, USA

PostPosted: Sat May 08, 2021 22:37
[Edited to a null post. Effectively replaced with my next post, further downstream.]
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12837
Location: Netherlands

PostPosted: Sun May 09, 2021 8:38
@SurprisedItWorks: MTU is one of the more complicated and ill-understood things (at least for me).
It has similarities with Heisenberg's uncertainty principle: merely looking at it changes it.

@tedm, I do control both ends (both are DD-WRT routers on K4.4, one running build 46450 and the other 46601; both are experimental builds, and SFE and CTF are off).

YouTube comes via the VPN and runs fine. I also tested with my own HD material, which also works without a problem, so "regular" UDP traffic seems to work.

One user reported VoIP problems when using VoIP via the VPN on DD-WRT K4.4, but that was due to SFE; after disabling SFE his problems were gone, I presume with MTU 1500 ( https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=328961 )

But the strangest thing was that while testing, PMTUD suddenly seemed to start working and the black hole was temporarily gone, as described above.

I will test more in the coming weeks (I am starting another project tomorrow, so I will be busy the next few weeks), also testing K4.9 (maybe only K4.4 is affected, as K2.6 works like you showed) and connections to commercial providers.

The VPN troubleshooting guide already had a paragraph about MTU problems; I added this thread as a reference and updated the text.

The help text on the server setup (Default: 1400) has been there for ages, and as 1500 is actually used, that is contradictory. I have asked to have it updated, but we might conclude we should ask to lower that value in future releases (maybe only for K4.4?).
At least the problem and the solution are now known and described.

Again, thanks for bringing this up. To be continued.

tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Mon May 10, 2021 1:11
SurprisedItWorks wrote:

The thing that will make you tear your hair out is that there are roughly 1.4 Gazillion different MTU quantities, each including a different subset of all the overheads.


Heh. No, seriously, don't overthink it. There is only ONE MTU quantity that matters - the maximum size a packet can be on any given link. At that size and under, the network is required to make a best effort at delivering an IPv4 packet, and the TCP/IP stack is expected to retry or do whatever it needs to do to compensate for latency, packet loss, and so on. If it's TCP, by definition it's guaranteed to be transmitted unless the link is down; if it's UDP, it's not guaranteed to be transmitted 100% of the time. Whether 99.99% or 80% of the time is OK for a UDP-based application is up to the application. A well-written UDP application will test the link, decide if the link is good enough, and if it's not, inform the user.

For all MTU sizes above that, the network is required to inform the sender the packet is too big. The network MAY elect to be nice about it and split the packet into fragments and deliver those, whereupon the receiving stack is required to reassemble them into a complete packet, but that is all. The sender's IP stack can override that by setting DF in the packet header.

For ICMP, the network is REQUIRED to deliver the messages. Period. No arguments. Unless the link is in a down state, or the packet is too large and DF is set.

Everything else is supposed to be handled by the application or the TCP/IP stack.

Years ago there were a lot of dorkuses in the world who didn't understand this and thought they were making things "more secure" by blocking the ICMP traffic types that inform the sender a packet is too big. I have gotten into online shouting matches with those sorts of dorkuses, who insist they are "protecting themselves" from the so-called "ping of death". Those generally result in me dragging in a bunch of RFCs that say no, you can't do that, whereupon the dorkus claims they can. I then follow up by sending them the firewall documentation from their very own firewall vendor saying they are not supposed to do that either, and usually by then I'm told to go to hell. Later, of course, when they get tired of their users screaming at them about weird network problems, they get off their lazy asses, study what I told them, and follow it. Of course, I'm never thanked. Sigh. It's a cruel world we live in.

Most of the ICMP blocking wars died after Microsoft removed raw sockets from Windows XP SP2 and all later versions of Windows, as that prevented the script kiddies from hijacking 'doze boxes to attack people. The current generation likely doesn't even know what the ping of death is, much less has seen one in the wild, and thankfully the marketing departments of firewall vendors have stopped dangling "block ICMP" in their materials, so the dummies generally aren't going down that rabbit hole anymore. It's rare that I see an MTU issue anymore, much less get into a discussion about one. Of course, taking away raw sockets crippled Windows for any decent network testing, but it's a desktop OS, so Microsoft will tell you "here's a nickel, buy a real computer" if you complain. You do still get one raw socket, for UDP use only, if you wish to write a network flooding app to test maximum throughput on an Ethernet switch or something (I have), but no more ICMP. We just can't have nice stuff.

Anyhoo, the MTUs you are most likely to see in the wild are 1500 for Ethernet; up to 9000 for so-called "jumbo" frames on Gigabit Ethernet (it would be interesting to know if ANY of the dd-wrt capable routers can do jumbo frames); and 810 for STS-1 and 2430 for STM-1/STS-3. The last two are weird optical standards used on long-haul networks, and unless you are doing work in a datacenter you will never see them - and the gear often presents an MTU of 4470 on DS3 interfaces to the user, which is even more fun.

This has led to a sort of de facto standard where everyone assumes they can send a 1500-byte frame on the Internet. That mostly works, except for PPPoE connections - but ISPs are mostly getting rid of PPPoE because of user complaints (likely triggered by black-hole routers at those same ISPs hosing up datastreams to their users).

The fly in the ointment is the various encapsulation protocols. PPPoE running over a 1500 MTU link presents an MTU of 1492. GRE (so-called "iptunnel" in Cisco parlance) presents an MTU of 1476 (20 bytes for the IPv4 header and 4 bytes for the GRE header). There's a good article that explains the common ones here: https://www.networkworld.com/article/2224654/mtu-size-issues.html

OpenVPN is likely screwing up by reserving THREE IPv4 headers instead of one, which would account for that 40-byte "black hole". But I'm NOT an OpenVPN developer, so I have no idea what the actual issue is or why it doesn't appear on the K2.6 kernel.

SurprisedItWorks wrote:

If this 66 bytes of overhead is taken off the ethernet 1500, we seem to have that OpenVPN MTU needs to be no larger than 1434. But I don't think we are expected to set this number. Do the peers negotiate down from 1500 to reach this on their own?



YES, ABSOLUTELY! After all, it does it properly on K2.6.

As I said, when there's a defect all bets are off. It's difficult enough to understand how something works internally by external observation when it's functioning properly - but when it's defective, you are EXTREMELY lucky to winnow away enough outer layers to make the problem reproducible.

I don't know how long this bug has been there, but you can be very sure that if it's been there for years, a LOT of people have had unreproducible network failures over OpenVPN tunnels on dd-wrt - random glitches blamed on buggy applications, the occasional trashed packet interacting with a weak TCP/IP stack to cause a freeze, and a host of other seemingly random, unexplainable issues with OpenVPN on dd-wrt. There are a LOT of embedded IoT devices out there, from my phones to printers to God knows what else, including the proverbial "refrigerator connected to the Internet", that implement a stripped-down IP stack because they are built on inadequate ram/flash/CPU, purchased because some pinhead could save a dollar on a lot of 10,000 of them. They don't have the space in their device to graft in the complete, full-meal-deal IP stacks from FreeBSD or Linux or whatever.

Do you seriously think the little FTP server in the Broadcom CFE, a miserable 1k in size, has room to implement complete black-hole path MTU discovery? Hardly. The world is FULL of devices like that, from thermostats to doorbells; the list goes on and on. The developers who write code for that stuff make every assumption in the book to shave 20 bytes off the code size, and so those devices will crash and burn. The average user is going to set up an OpenVPN tunnel, ping across it with a 64-byte ICMP echo packet, maybe send a file or two across with FTP, then conclude "hells bells, she's up and running, let 'er rip!!" And if ONE device of theirs is hosed by a bug like this while all their other ones work - into the garbage it goes, followed by a muttered curse about "chincom crap". And MAYBE, if they are lucky, the next Internet-connected doorbell/thermostat/weather station/tee shirt/tush warmer they buy will just happen to use a smaller default packet size and "work". Well, mostly.


tedm
DD-WRT Guru


Joined: 13 Mar 2009
Posts: 554

PostPosted: Mon May 10, 2021 1:40
egc wrote:

One user reported VoIP problems when using VoIP via the VPN on DD-WRT K4.4, but that was due to SFE; after disabling SFE his problems were gone, I presume with MTU 1500 ( https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=328961 )


Yeah, I saw that one too and was tempted to ask, but as you said, he isn't complaining, and there's no point trying to engage with someone who is happy. If we get the bug found and fixed by the time he gets around to upgrading again, he won't even know there was an issue. And there's always the hack of changing the default tun to 1430 on K3 and K4 if we don't get it identified.

I need to set up a more complete test bed for this, and I really should test on two PCs. I have public IPs available at different locations, as well as the space for servers and such. If I can duplicate the problem on i386 or amd64 "regular Linux", I can post it on the actual OpenVPN forum and light them up about it. Unfortunately, since one site is 80 miles away, this isn't going to happen immediately.

In the meantime, I am going to put together a jpg of my network and VPN link, complete the other tests I mentioned (the parallel test and different crypto), and try to find the exact largest MTU the tun can be set to before the bug presents, then post that to this thread. After that I'll file a new bug in SVN.

Let me ask you this - the K4.9 you are testing, I assume that is dd-wrt? The most powerful router I have is a Netgear R7000; if dd-wrt on K4.9 can run on that, I could test with it as well.