Posted: Sun May 21, 2023 10:47 Post subject: Bridge to bridge performance
Hello, I've recently upgraded my Netgear R7800 router, which I use as a PPPoE gateway, to r52569 (with a full reset). It has 4 bridges. Their use is probably not relevant to the question, but for clarity: one bridge carries direct traffic to the web, and a second is for IoT -- also direct but with extra safeguards. The other two have all external traffic routed via two separate OpenVPN connections. Each bridge has its own subnet, VLAN tag, and WAP. With the exception of IoT, all the bridges/subnets have routes to each other (and I'm not testing IoT in the below).
I've experienced some instability in connections, particularly VoIP, and speedtests have been unusually volatile (i.e. speed goes up and down) and mostly a little lower than expected. In investigating this I noticed something that seems strange to me: traffic within a single bridge is fast, as expected. When sending across the LAN from one bridge to the other, I'm seeing very high packet loss and decreased throughput. To me it seems this shouldn't be happening (but maybe you can tell me it's just normal, what do I know, I'm not a networking guru!). All the more so because, looking at 'top' on the router while this is happening, the CPU never gets above maybe 20% and the sirq (software interrupt time, as far as I can tell) tops out at under 30% (also the load rarely goes over 1). I could understand the router dropping packets if the CPU were unable to keep up, but that doesn't even seem to be the case?
The setup: two Linux laptops, connected directly to the router's ethernet ports with 1Gbps connections. All results are shown from the perspective of server1, connecting to server2, which runs 'iperf3 -s'. (Note that not much else is happening on the network during these tests.)
First, results from when both laptops are on the same bridge (br0)/subnet (192.168.9.*):
Note that I ran the reverse test twice there, and there is already some significant variation for reasons unknown (including during each test). There is also a difference between directions, for reasons unknown, but not enough that I'm too worried (maybe I should be?).
Now when server 2 is on another subnet (192.168.10.*)/bridge (br1):
Across bridges, there is a loss rate of 33%-44% of UDP packets and about 40% reduced throughput!
Could I be doing something wrong here? Anybody have any ideas what's going on? Is it just me? Any suggestions for digging further into this?
Joined: 16 Nov 2015 Posts: 6446 Location: UK, London, just across the river..
Posted: Sun May 21, 2023 11:37 Post subject:
I can imagine the topology of your network... kind of...
but... the switch has its own CPU, so when traffic only crosses the switch, the router CPU is not in use
like port to port... the router CPU gets involved only when there is traffic from WAN to LAN and the opposite.. moreover you run PPPoE, which takes a toll on the router's CPU..
so for lost packets you can blame either different clients with different NICs and settings.. or the switch firmware that uses the switch CPU's capabilities... that has nothing to do with the DD-WRT side...
as you stated, you have tagged VLANs assigned to bridges... on their own subnets.. DHCP..
and VAPs (wifi stations) on each VLAN/bridge... if wifi is used to measure, it could generate losses..
in general UDP is a stateless protocol.. and it's normal to have some loss there.. well, I'm not a networking guru either, nor can I picture your overall goal... but complex networks are not the best thing to squeeze out of a consumer-grade router... although the R7800 is rock solid..
What I have on it you can see in my signature... basically I have 3 VLANs on bridges, and those are for 3 of the LAN ports on their own subnets... nothing is tagged (apart from the tags you need to make VLANs work).. see the guide... https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=334342
I do have an extra switch on one of the LAN ports and an extra router on another port (IoT), and the 3rd VLAN goes to another switch where there are lots of devices too, plus another router.. I also have bridge-to-bridge limiting rules, since I don't want those to communicate, and I have net & AP isolation.. and my setup is working great.. so far no complaints, and no complex network with tags. I believe tags are made to identify traffic, and tagged traffic can go out of the WAN port as well as over the LAN ports (never had a need to); tags are only needed if you use a single port and want differentiated kinds of traffic to come out of it.. that could also be too overwhelming for the switch CPU... and then you have errors..
_________________
Atheros
TP-Link WR740Nv1 ---DD-WRT 55630 WAP
TP-Link WR1043NDv2 -DD-WRT 55723 Gateway/DoT,Forced DNS,Ad-Block,Firewall,x4VLAN,VPN
TP-Link WR1043NDv2 -Gargoyle OS 1.15.x AP,DNS,QoS,Quotas
Qualcomm-Atheros
Netgear XR500 --DD-WRT 55779 Gateway/DoH,Forced DNS,AP Isolation,4VLAN,Ad-Block,Firewall,Vanilla
Netgear R7800 --DD-WRT 55819 Gateway/DoT,AD-Block,Forced DNS,AP&Net Isolation,x3VLAN,Firewall,Vanilla
Netgear R9000 --DD-WRT 55779 Gateway/DoT,AD-Block,AP Isolation,Firewall,Forced DNS,x2VLAN,Vanilla
Broadcom
Netgear R7000 --DD-WRT 55460 Gateway/SmartDNS/DoH,AD-Block,Firewall,Forced DNS,x3VLAN,VPN
NOT USING 5Ghz ANYWHERE
------------------------------------------------------
Stubby DNS over TLS I DNSCrypt v2 by mac913
Last edited by Alozaros on Sun May 21, 2023 12:27; edited 1 time in total
The UDP test is just bullshit: with "-b 1000M" you tell iperf to send at 1Gbit regardless of whether the client drivers can do it or whether there are enough resources available in the network - of course there is packet loss.
You are also only testing a single stream; many NIC drivers have better multistream performance.
And with TCP you need not (and should not) specify any bandwidth at all; it is unlimited by default.
Test TCP with "-P 4 or 8" and without "-m".
The rest is of no interest.
And like Alozaros said, in your first test the traffic runs only over the switch and the router is not involved at all - there you should measure a stable 1Gbit throughput (if not, your notebooks have bad network cards and drivers).
With your other test over several bridges the traffic runs also through the router CPU - logical that the results are worse.
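A minimal sketch of the kind of test runs suggested above, for anyone repeating them. The server address 192.168.10.2, the 30-second duration, the 100M default UDP rate and the helper names are assumptions for illustration, not taken from the thread; only the iperf3 options themselves (-c, -P, -t, -R, -u, -b) are standard.

```shell
#!/bin/sh
# Hypothetical address of the laptop running "iperf3 -s"; adjust to your setup.
SERVER="${SERVER:-192.168.10.2}"

# With IPERF_DRY_RUN=1 the command line is printed instead of executed.
run() {
    if [ "${IPERF_DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# TCP: several parallel streams, no -b (TCP is unlimited by default).
tcp_test()         { run iperf3 -c "$SERVER" -P "${1:-4}" -t 30; }

# Same test in the other direction (server sends, client receives).
tcp_test_reverse() { run iperf3 -c "$SERVER" -P "${1:-4}" -t 30 -R; }

# UDP: pick a target rate the path can plausibly carry rather than 1000M.
udp_test()         { run iperf3 -c "$SERVER" -u -b "${1:-100M}" -t 30; }
```

Calling `tcp_test 8` would run eight parallel streams; raising the UDP rate step by step (for example `udp_test 300M`, then `udp_test 600M`) shows roughly where loss begins, instead of only showing the loss at a rate the path cannot sustain.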
For me, a test on the same bridge (via the switch) looks like this:
@Alozaros I have a similar setup to yours, but I need the VLAN tagging because my actual topology is the one below; the tagging is necessary to traverse the switches and have ports on different bridges:
Code:
DD-WRT GATEWAY                              DD-WRT WAP
Netgear R7800      --- Managed switch ---   Netgear AC1450   --- My normal work location
Wifi with                                   Wifi extension       wired or wireless
virtual interfaces                          w/ virtual IFs
All the earlier tests were done with ethernet cables on the router itself to isolate things. That's hard because I have to be atop a ladder with two laptops. But I have a problem with the WAP too!
@egc I do have (and have had) Shortcut Forwarding Engine disabled.
@ho1Aetoo I understand that some loss with UDP is expected, but I was surprised how much. Running the TCP tests you suggest when directly connected seems to give results similar to yours, which aren't bad, so I guess I'll chalk it up as normal performance.
Overall throughput is similar to yours, but the variability is still much higher (e.g. goes down to 357Mbps one second), not sure what that's about. (On the same bridge, throughput is much more steady and around 920Mbps -- so all good at the switch level and with these two machines.) With 4 sockets open, throughput increases to around 810-820Mbps, which would be totally unconcerning by itself. And you're right that when running like that for a while, the CPU is getting taxed, for example:
What's most concerning is the variability at this point. I have a suspicion that may be driving what I'm seeing with video calling (skype/teams/etc) having occasional hiccups. (Could it be interrupt handling?)
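To put a number on that variability, the per-second lines of a saved iperf3 run can be summarised. This is a rough sketch that assumes the usual iperf3 text layout with rates reported in Mbits/sec; the helper name `iperf_summary` is made up for illustration. It prints the number of intervals, then the minimum, maximum and mean rate.

```shell
#!/bin/sh
# Summarise per-interval throughput from iperf3 text output read on stdin.
# Prints: <intervals> <min> <max> <mean>   (all rates in Mbits/sec)
# Assumes rates are shown in Mbits/sec; lines in other units are ignored.
iperf_summary() {
    awk '
        # Skip the final "sender"/"receiver" totals, keep per-interval lines.
        /Mbits\/sec/ && !/sender|receiver/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "Mbits/sec") {
                    v = $(i - 1) + 0
                    if (n == 0 || v < min) min = v
                    if (n == 0 || v > max) max = v
                    sum += v
                    n++
                }
            }
        }
        END {
            if (n > 0) printf "%d %.0f %.0f %.1f\n", n, min, max, sum / n
        }
    '
}
```

For a single-stream run, something like `iperf3 -c <server> -t 30 | iperf_summary` would do. With -P, every stream prints its own line, so filtering for the `[SUM]` lines first (for example with `grep SUM`) keeps the numbers meaningful.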
Well, as far as I know you only have an ADSL connection as WAN (at least that's what you wrote in another thread).
With ~16Mbit down + ~1Mbit up, your throughput and latency problems are more likely on the WAN side.
The router should handle such bandwidths effortlessly.
Maybe your connection also has crappy bufferbloat, and for some services like VoIP low latency is more important than a high data rate (keyword: QoS).
When I run a test across the bridges and across my full network, throughput in one direction (only!) drops to around 300Mbps. Looking at the network topology in the post above, server1 is in my normal work location, server2 is connected to the managed switch. If they were both on the managed switch (or on the gateway router), results are normal. With server1 connected to the WAP, I get this when sending from server1 to server2 (all components using wired gigabit ethernet):
311 Mbps, what?! That's 60% less than the other direction. No such problem when these two devices are on the gateway router (as in the previous test), or both on the managed switch. But when server1 is on the wap, this happens consistently. Any ideas here? I see effectively no load on the WAP router at this time...
Last edited by jtbr on Tue May 23, 2023 10:43; edited 2 times in total
I misremembered, it's VDSL now. 100Mbps down and 35Mbps up. But your point still stands -- shouldn't be a problem for this router. Generally I've been happy with my ISP, but who knows what's going on downlink. I ran a test here with dslreports
http://www.dslreports.com/speedtest/71936087. You can see the variability, but they say no bufferbloat.
Perhaps you're right I should be doing some QoS.. so far I haven't turned anything on because I don't think I have enough network traffic to justify it, but maybe I need it just to prioritize voip stuff?
BTW some of my internet speed issues last week were apparently due to a loose WAN cable which autonegotiated 100Mbps rather than 1000Mbps after I updated the router using a cable ... so at least that's fixed. I'm now getting my usual speeds again, in the low 80s Mbps down and low 30s Mbps up. Now it's more the stability issue like I said.
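A quick way to catch a link that has negotiated down like this is to read the speed Linux reports in sysfs. The sketch below assumes an interface named eth0 and made-up helper names (`link_speed`, `check_gigabit`); the optional second argument only exists so the sysfs root can be overridden. On a router with an internal switch, the value may describe the CPU-to-switch link rather than the physical WAN port, so treat it as a hint; it is most useful on the laptops used for testing.

```shell
#!/bin/sh
# Print the negotiated speed in Mbit/s as reported by sysfs, or "unknown".
# $1 = interface name (e.g. eth0), $2 = sysfs root (defaults to /sys)
link_speed() {
    cat "${2:-/sys}/class/net/$1/speed" 2>/dev/null || echo "unknown"
}

# Report whether the interface negotiated less than gigabit.
check_gigabit() {
    cg_speed=$(link_speed "$1" "$2")
    case "$cg_speed" in
        ''|*[!0-9]*)
            # Covers "unknown" and the -1 the kernel reports for no link.
            echo "$1: speed unknown" ;;
        *)
            if [ "$cg_speed" -lt 1000 ]; then
                echo "$1: only $cg_speed Mbit/s"
            else
                echo "$1: $cg_speed Mbit/s"
            fi ;;
    esac
}
```

Running `check_gigabit eth0` after replugging a cable gives a one-line answer; `ethtool eth0`, where installed, shows the same speed along with the duplex setting.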
I've gone ahead and tried to set up QoS to see if that helps. It looks like it might (surprising because the issues occur when the WAN link is not saturated).
as the output from 'iptables -t mangle -nvL', but that's not what I see. Does anyone know how to show the output above for active connections?
2) When I enable QoS and apply settings when the router is already up, it seems to work as expected. However when I reboot with those same settings, the WAN PPPoE doesn't seem to be fully configured. I can connect to the internet from the router, but not from outside the router, and the dd-wrt web interface shows WAN is connected with an IP of 0.0.0.0 (and the disconnect button has no effect). Any ideas what might be causing this? When I kill pppd and restart it manually, things work again.
1. Your bufferbloat test is meaningless because you did not enable "Hi-Res BufferBloat" in the settings.
2. No idea what your WAN / PPPoE / QoS problems are; all of that works fine here - these are all things that I use myself.
3. The output shown is not an iptables rule but the active connections from "netfilter conntrack" (and I don't know what you want with that, it's completely irrelevant).
Code:
cat /proc/net/nf_conntrack
Thanks for your help.
1) When I turn on that setting, after disabling QoS again, it gives BufferBloat score of A rather than A+.
2) I'm guessing it is because my boot times are so long. Not sure what I can do about that, don't think there's anything special about my wireless settings. All channels are fixed in the GUI. I guess I'll try resetting pppd after a sleep in the startup commands.
3) Perfect. It's helpful because it shows the 'mark' field which tells how the router is treating each connection, so I'll be able to verify if the protocols/apps are treated correctly by QoS.
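For checking the marks in bulk instead of reading the file by eye, the connections can be counted per mark value. This is a sketch that assumes each conntrack line carries a `mark=<decimal>` field, as described above; the function name `count_marks` is made up for illustration.

```shell
#!/bin/sh
# Count tracked connections per mark value.
# Reads conntrack lines on stdin, prints "<mark> <count>" sorted by mark.
count_marks() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # Match the field "mark=N" exactly, not e.g. "secmark=N".
                if ($i ~ /^mark=/) {
                    m = substr($i, 6)
                    n[m]++
                }
            }
        }
        END { for (m in n) print m, n[m] }
    ' | sort -n
}
```

On the router this would be run as `count_marks < /proc/net/nf_conntrack`. To look at the individual connections for one class, `grep 'mark=10240 ' /proc/net/nf_conntrack` (with whichever decimal mark is of interest) lists them.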
You can also see this in the GUI, when you set a service priority in the QoS tab, a packet counter is also displayed.
If the packet counter shows 0 then there is either no traffic or the filter is not working ...
Adding a comment for posterity:
I've noticed that the QoS tab packet counter seems not to fully reflect what I see in /proc/net/nf_conntrack. There are at least some cases where the GUI seems to indicate that QoS is not working for certain services, but the conntrack file shows that the connections are in fact being correctly tagged (so presumably QoS is working correctly on them). Generally the packet counts are strangely low as well. Maybe it's my somewhat complicated routing table, who knows. But if you're worried something isn't being tagged properly, have a look at conntrack to be sure.
NB: The mark tags are not those mentioned in the WIKI. Here is what I've noticed:
Code:
Mark              Priority   Minimum % at full capacity
?                 Maximum    75%
0x2800 / 10240    Premium    50%
0x5000 / 20480    Express    25%
0x7800 / 30720    Standard   15%
0xa000 / 40960    Bulk       5%
0                 default
I don't have any "Maximum" priority, so don't know that mark. Note that iptables uses hexadecimal (eg 0x2800) while conntrack uses decimal (10240).
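Since the two tools print the same mark in different bases, a pair of small helpers makes it easy to go back and forth. The function names are made up for illustration; the conversion itself is just the shell's printf.

```shell
#!/bin/sh
# Convert a mark as printed by iptables (hex, e.g. 0x2800) to the decimal
# form seen in /proc/net/nf_conntrack, and back again.
mark_to_dec() { printf '%d\n' "$1"; }
mark_to_hex() { printf '0x%x\n' "$1"; }
```

For example, `mark_to_dec 0x2800` gives 10240 (Premium in the table above) and `mark_to_hex 40960` gives 0xa000 (Bulk).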