At the main net is the SIP PBX. It has a Netgear AC1450 running r46974 and is acting as the OpenVPN server.
At the remote net 80 miles away are the SIP phones. It has a Netgear R6300v2 running r46788 and acts as the OpenVPN client. The SIP phones register to the PBX over a site-to-site OpenVPN tunnel.
I set this link up a month ago using r46788 on both ends and it was solid for a week. Then I tried loading a newer firmware revision on the R6300v2 and the SIP phones would NOT remain registered. I tried factory resetting, screwing with OpenVPN parameters, the works. They would NOT remain registered for more than an hour, even with the exact same config on the remote, and even after trying different firmware builds.
After a couple days of this I reverted the R6300v2 back to r46788. The phones have been rock solid since, even after changing OpenVPN parameters to use TCP instead of UDP and other experiments.
It does not seem to affect the phones if I firmware update the OpenVPN server end of the link. But if I try any dd-wrt version newer than r46788 on the remote, I get problems.
I've looked through the SVN changes and I cannot figure out what in the world was changed after 46788 that could possibly have anything to do with this. What is so special about r46788?
It's probably worth mentioning that if I back-rev from r46788 the phones seem to work for a while, but if I go back many months they become problematic.
Last edited by tedm on Thu Jul 22, 2021 7:03; edited 1 time in total
I'm at the main end right now so it would be fairly easy to try running the latest version with OpenVPN 2.5.3 and see if that screws anything up. But the main end is ONLY acting as an OpenVPN server, it is not acting as an Internet router - no other traffic goes through it than the VPN traffic.
The remote end is acting as both a VPN client -and- an Internet router.
Unfortunately it seems that just when all the OpenVPN black hole nonsense was fixed with the change to the tun MTU default, now it's time to break things with SFE. Sigh.
I loaded build r47000 (6/28/21) on my server end of the VPN a week ago and the phones through the VPN have been stable since. I did need to make a few changes to the VPN: both Encryption Algorithm settings were changed from "None" to "not set", since otherwise OpenVPN filled up the logs with "link not encrypted" nonsense, and the tunnel protocol was changed from TCP to UDP. First data cipher is AES-128-GCM. It would be interesting to know whether AES-128 is faster or slower than CHACHA20-POLY1305, as I've figured out that the BCM4708 CPU used in the Netgear AC1450 is just a tad too old to have the special AES instructions in it, so there's no hardware benefit to running AES.
It also appears the new OpenVPN version enforces a TLS key renegotiation every hour. So I bumped the TLS renegotiation interval up to 8 hours, since 1 hour on a key renegotiation is too often - too much chance of interrupting a phone call. Besides, I doubt I'll be able to send 256 exabytes of data over this link in less than 8 hours (to where a birthday attack would be feasible). /s
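For anyone checking the numbers, here is a minimal sketch of the arithmetic. It assumes the setting being changed is OpenVPN's `reneg-sec` option, which takes a value in seconds (the exact line used is not shown in this post), and that the 256 exabyte figure refers to the birthday bound of a 128-bit block cipher, roughly 2^64 blocks of 16 bytes each.

```python
# Back-of-the-envelope numbers for the renegotiation interval.
# Assumption: the interval is expressed in seconds, as OpenVPN's reneg-sec expects.

HOURS = 8
reneg_seconds = HOURS * 3600  # 8 hours expressed in seconds

# Birthday bound for a 128-bit block cipher: about 2**64 blocks of 16 bytes.
BLOCK_BYTES = 16
birthday_bytes = (2 ** 64) * BLOCK_BYTES      # 2**68 bytes
birthday_eib = birthday_bytes // (2 ** 60)    # same volume in exbibytes

# Sustained throughput needed to reach that volume inside one 8 hour window.
required_bytes_per_second = birthday_bytes / reneg_seconds

print(f"reneg interval: {reneg_seconds} s")
print(f"birthday bound: {birthday_eib} EiB")
print(f"needed rate:    {required_bytes_per_second:.3e} bytes/s")
```

The needed rate comes out many orders of magnitude above anything a home router can push, which is the point of the sarcasm above.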
So it does appear that the issue is centered on the Netgear R6300v2 running r46788 and acting as an OpenVPN client, since updating that one beyond that version triggered the dropping. I have another Netgear router with a faster 1GHz CPU and I'm thinking it might be worth testing that at the remote and seeing if that would allow me to run current builds on it.
Hmmmm...setting the key renegotiation time period to 8 hours was not good. Started losing the phone registrations and the openvpn log on the server side started showing "bad source address from client [172.16.1.16] packet dropped" error messages from the PBX.
So now I finally think I have a working theory of what is actually going on - at least this is my most current working theory. (it will do until something better comes along)
When the phones register in they are initiating the registration on the client side of the vpn. That causes the client OpenVPN to send a host route to the server VPN. The server OpenVPN then adds it with the message:
Further info on this as the phones started losing registration again. I reverted back to 46974 on the server end but I don't think that is the key.
I think there are several things that have to be set up perfectly for the phone registrations to stay up:
1) If the phone is using SIP-over-UDP then the VPN must use TCP transport. However, if the phone is using SIP-over-TCP then the VPN can use either UDP or TCP transport. Some phones (like the Cisco 7940) can only do SIP-over-UDP, so if you are using those you have to run OpenVPN over TCP transport.
This is because the underlying public Internet has too much loss for regular SIP over UDP. Loss on the Internet is not constant; some days it's higher than others. And it is disguised, because routers and ISPs along the path do not necessarily treat ICMP, TCP and UDP traffic the same way (this is operator policy, not something the TCP/IP standards require), and TCP hides loss by retransmitting. So you can have maybe 10-20% loss on UDP, and if you try measuring loss with ping (which uses ICMP) or with an app like FTP (which uses TCP) you will see nothing. Yet during that time there will be loss on UDP traffic. SIP over UDP is not very tolerant of loss.
It was maddening to figure this out because some days there IS no loss on a UDP path. If you configure the VPN on one of those days then SIP-over-UDP works perfectly over OpenVPN-over-udp.
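Since ping and TCP apps can hide this, one way to see UDP loss directly is to send numbered UDP probes to an echo responder at the far end and count the missing replies. The following is a minimal sketch, not a tool used in this thread: the function names, probe count and timeout are all made up for illustration, and the demo at the bottom only runs over loopback.

```python
import socket
import threading
import time


def run_echo_server(sock, stop_event):
    """Echo every UDP datagram back to its sender until stop_event is set."""
    sock.settimeout(0.2)
    while not stop_event.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def measure_udp_loss(host, port, count=50, timeout=0.5):
    """Send `count` numbered probes; return the fraction that got no matching reply."""
    lost = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        for seq in range(count):
            payload = seq.to_bytes(4, "big")
            client.sendto(payload, (host, port))
            deadline = time.monotonic() + timeout
            answered = False
            while not answered:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                client.settimeout(remaining)
                try:
                    reply, _ = client.recvfrom(2048)
                except socket.timeout:
                    break
                # Ignore late replies that belong to earlier probes.
                answered = reply == payload
            if not answered:
                lost += 1
    return lost / count


if __name__ == "__main__":
    # Loopback demo; in practice the echo side would run at the far end of the link.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    stop = threading.Event()
    worker = threading.Thread(target=run_echo_server, args=(server, stop), daemon=True)
    worker.start()
    loss = measure_udp_loss("127.0.0.1", server.getsockname()[1], count=20)
    stop.set()
    worker.join()
    server.close()
    print(f"UDP loss: {loss:.0%}")
```

Running something like this on several different days, outside the tunnel, would show whether the path itself is dropping UDP.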
2) If using UDP for SIP from the phone, there can be no black holes for particular packet sizes, because UDP has no means of path MTU discovery.
This was covered extensively in the prior thread referenced at the beginning of this thread.
3) The OpenVPN server installs individual host routes for hosts on the other end of the link. It does this when it first gets a packet from one of those hosts via the OpenVPN client. However, the host routes time out and are expired on the server end if there are no TCP connections through them. So if the PBX sends a UDP keepalive to a phone on the client end and the host route is gone, the OpenVPN server and client take time discussing it and re-establishing the route. This is OK for SIP over TCP because the TCP stack on the PBX will just send a retry. But with UDP the keepalive packet will be lost and the phone registration can fail.
The fix for this is to give any phone using SIP-over-UDP a static IP address at the client end, then install a host route (mask 255.255.255.255) for that address in the OpenVPN server configuration.
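As a rough sketch of what those host routes could look like, the helper below prints OpenVPN directives for a list of statically addressed phones. The `route` and `iroute` directives are real OpenVPN options, but whether the setup described here used one, the other, or both is not stated, so treat the pairing as an assumption. The phone addresses and the function name are hypothetical.

```python
import ipaddress

HOST_MASK = "255.255.255.255"  # host route: exactly one address


def host_route_directives(phone_ips):
    """Return (server_lines, ccd_lines) of OpenVPN directives for static phone IPs."""
    server_lines = []
    ccd_lines = []
    for ip in phone_ips:
        addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
        # `route` in the server config adds a kernel route toward the tunnel.
        server_lines.append(f"route {addr} {HOST_MASK}")
        # `iroute` in the client's client-config-dir file tells the OpenVPN
        # server which client sits in front of that host.
        ccd_lines.append(f"iroute {addr} {HOST_MASK}")
    return server_lines, ccd_lines


if __name__ == "__main__":
    # Hypothetical phone addresses on the remote LAN.
    server_cfg, ccd_cfg = host_route_directives(["192.168.50.21", "192.168.50.22"])
    print("# server config")
    print("\n".join(server_cfg))
    print("# client-config-dir entry for the remote router")
    print("\n".join(ccd_cfg))
```

Because the routes are written into the configuration, they do not depend on the server having recently seen traffic from the phone.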
4) For SIP-over-UDP, registration expiration timers appear to need to be set well below the default of 3600 seconds. Near as I can tell, this is because dd-wrt's Linux core IP filter settings seem to have UDP timeouts set lower, and there seems to be an interaction with OpenVPN even though we are not running NAT in this configuration. Note that I am not 100% positive about this and will be doing further experimentation on it.
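One way to test that theory is to compare the registration interval against the kernel's UDP connection-tracking timeouts. The sketch below reads them from `/proc/sys/net/netfilter`, which is where Linux exposes them when conntrack is loaded. It has not been verified on dd-wrt (routers usually lack Python, and older kernels may use different paths), so the same files would more likely be read there with `cat`. The helper names are hypothetical.

```python
from pathlib import Path

# Standard location on Linux when connection tracking is loaded (assumption:
# the router's kernel exposes the same files).
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")
TIMEOUT_FILES = ("nf_conntrack_udp_timeout", "nf_conntrack_udp_timeout_stream")


def read_udp_timeouts(base=CONNTRACK_DIR):
    """Return the UDP conntrack timeouts (in seconds) found under `base`."""
    timeouts = {}
    for name in TIMEOUT_FILES:
        path = Path(base) / name
        if path.exists():
            timeouts[name] = int(path.read_text().strip())
    return timeouts


def expiry_outlives_state(expiry_s, timeouts):
    """True if the registration interval is longer than any UDP timeout found."""
    return any(expiry_s > seconds for seconds in timeouts.values())


if __name__ == "__main__":
    found = read_udp_timeouts()
    print(found)
    # 3600 is the default registration expiry mentioned above.
    print("default expiry outlives UDP state:", expiry_outlives_state(3600, found))
```

If the interval is longer than the timeouts, the kernel's UDP state for the phone can age out between registrations, which would be consistent with shorter timers helping.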
To do: more testing with newer dd-wrt/openvpn versions on the server side, more testing with UDP sip timers.
Updated to r47117 a month ago on the server end of the OpenVPN tunnel and the SIP registrations on the phones have remained completely stable.
Updated to r47206 at the remote client end a week ago and the SIP registrations on the phones have continued to remain stable.
I suppose it's time for a wiki article. Fundamentally SIP is timing sensitive at a millisecond level and to be able to reduce packet delays to where they won't interfere with it you need a fast CPU in the router.