nas / timer / rekey hang watchdog

Post new topic   Reply to topic    DD-WRT Forum Index -> Broadcom SoC based Hardware
Goto page 1, 2  Next
Author Message
lazardo
DD-WRT User


Joined: 17 Apr 2014
Posts: 135
Location: SF Bay Area

PostPosted: Tue Sep 11, 2018 19:26    Post subject: nas / timer / rekey hang watchdog Reply with quote
Update: Simplified here: https://forum.dd-wrt.com/phpBB2/viewtopic.php?p=1170847#1170847

This script here https://forum.dd-wrt.com/phpBB2/viewtopic.php?p=1141085#1141085 functions as a watchdog+fixup for a sequence of events thought to be related to the post-KRACK nas blob and 'Key Renewal Interval', or simply 'rekey'.

For as yet unknown reasons the rekey cycle will sometimes generate thousands of user.info 'timer' log entries along with high system load and the occasional wifi and/or system hang.

When a 'timer' entry is caught, the script increments rekey values by 1 for easy event tracking and restarts nas. Temporarily setting rekey values to '360' makes for a quick sanity check.

Tested with Asus RT-N66U / v3.0-r36808 mega.

Cheers,

Update: Still present in 39xxx builds, timer events seem to degrade system usability if not caught/reset.


Last edited by lazardo on Mon Jul 29, 2019 20:23; edited 11 times in total
Sponsor
jpp
DD-WRT Novice


Joined: 26 Sep 2017
Posts: 28

PostPosted: Tue Sep 11, 2018 19:42    Post subject: Reply with quote
Do i have to enter this script (the first part under Code, i imagine the second part is a log) in administration -> commands? Then save startup?

JP
lazardo
DD-WRT User


Joined: 17 Apr 2014
Posts: 135
Location: SF Bay Area

PostPosted: Tue Sep 11, 2018 19:53    Post subject: Reply with quote
jpp wrote:
Do i have to enter this script (the first part under Code, i imagine the second part is a log) in administration -> commands? Then save startup?

JP

Yes, BUT, check the code before running as there are escape rules in the GUI 'commands' section that can be tricky so I don't use it for anything but one liners.

I would suggest copy/paste to a different computer then use ssh/scp/putty to move to /tmp on the router.

Remember that /tmp contents do not survive a reboot.
kernel-panic69
DD-WRT Guru


Joined: 08 May 2018
Posts: 14125
Location: Texas, USA

PostPosted: Wed Sep 12, 2018 0:37    Post subject: Reply with quote
The thing is, the nas, gtk, KRACK fixes are related to client modes mainly, from my best understanding. In router / gateway / AP modes, this shouldn't be an issue IMHO, but it is. Seems like the timer fixes and all haven't really fixed things (yet). Anyhow, can this also be saved as a custom script in the webUI and used, if one chooses? Not sure if I want to add this to my startup script as I already have one, and since I am not using any client modes, it's probably moot, I presume. Awesome workaround / fix for this problem, though, thanks for the info!
lazardo
DD-WRT User


Joined: 17 Apr 2014
Posts: 135
Location: SF Bay Area

PostPosted: Wed Sep 12, 2018 5:48    Post subject: Reply with quote
I ran a test. Cut+paste into the 'admin -> commands' window, followed by 'save custom script' gets you this:
Code:
root@dd-wrt:/tmp# ls -l | grep custom
-rwx------    1 root     root           905 Sep 11 22:34 custom.sh

Which can be run on the command line, or in the same 'commands' window, eg, 'sh -x /tmp/custom.sh'.

Running 36 hours now under 'normal' wifi traffic, no hangs. I'll probably run it in 24 hour cycles rather than looping forever.

Cheers,
slice1900
DD-WRT User


Joined: 18 Feb 2013
Posts: 99

PostPosted: Wed Sep 12, 2018 14:43    Post subject: Reply with quote
It is also saved in the rc_custom variable, so that it survives a reboot.
lazardo
DD-WRT User


Joined: 17 Apr 2014
Posts: 135
Location: SF Bay Area

PostPosted: Wed Sep 12, 2018 17:21    Post subject: Reply with quote
Good catch re: rc_custom.
lazardo
DD-WRT User


Joined: 17 Apr 2014
Posts: 135
Location: SF Bay Area

PostPosted: Thu Sep 13, 2018 17:56    Post subject: Re: nas / timer / rekey hang watchdog Reply with quote
Cleaned up, added virtual wl interfaces per kernel-panic69
Code:

#!/bin/sh
# watchdog for post-KRACK, Broadcom-based K3x nas / rekey / timer hang+crash
# tags: experimental, works-for-me

##########################################
# adjust these for your router

# working directory
WD=/tmp

#0....+....1....+....2....+....3....+....4...
#Sep 11 02:23:23 dd-wrt user.info : timer : #445 (0x42ca20)->0x42ca40: 0 sec 0 usec0x4033a9
#...................................^^^^^^^^^
IDX=35
SIZE=9

# nvram show | grep _gtk_rekey
KEYS="wl0_wpa_gtk_rekey wl1_wpa_gtk_rekey"

# OPTIONAL: virtual interfaces may be a factor even if unused
#KEYS="$KEYS wl0.1_wpa_gtk_rekey wl0.2_wpa_gtk_rekey wl0.3_wpa_gtk_rekey"

# ls /tmp/nas*pid
PIDS="/tmp/nas.wl0lan.pid /tmp/nas.wl1lan.pid"

# optional
TAG=rekey.watchdog

##########################################
# no edits below

REKEYPID=/var/run/rekey.pid
TAILPID=/var/run/rekey.tail.pid

cd $WD
for oldproc in $REKEYPID $TAILPID; do
   if [ -e $oldproc ]; then
      kill $( cat $oldproc )
      rm $oldproc
   fi
done

[ -e REKEYTAIL ] && rm REKEYTAIL

[ "$1" == "stop" ] && exit

echo $$ > $REKEYPID
logger -st rekey started

mkfifo REKEYTAIL
tail -n 0 -F /var/log/messages > REKEYTAIL &
echo $! > $TAILPID

##########################################

cat REKEYTAIL |
while read raw; do
   [ "${raw:IDX:SIZE}" != "timer : #" ] && continue

   for PID in $PIDS; do
      kill $( cat $PID )
   done

   # may be redundant
   stopservice nas
   stopservice wlconf

   for KEY in $KEYS; do
      NEWKEY=$(( $( nvram get $KEY ) +1 ))
      [ $NEWKEY -gt 99990 ] && NEWKEY=86401
      nvram set $KEY=$NEWKEY
   done
   nvram commit

   startservice wlconf
   startservice nas

   logger -s -t $TAG "caught ${raw:IDX}, rekey is now $NEWKEY"
   touch $TAG.killed.$NEWKEY

   rm $REKEYPID $TAILPID
   ( sleep 3; $WD/rekey_watchdog.sh & ) &

   exit
done
Code:
...
/var/log/messages.1:Oct  5 17:01:39 192.168.1.251 ntpclient[30048]: Time set from 192.168.1.13 [192.168.1.13].
/var/log/messages.1:Oct  5 17:01:39 192.168.1.251 process_monitor[1642]: cyclic NTP Update success (servers 192.168.1.13)
/var/log/messages.1:Oct  5 17:51:44 192.168.1.251 : timer : #0 (0x42e600)->0x42bfa0: ^I36 sec 204686 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:44 192.168.1.251 : timer : #1 (0x42bfa0)->0x42a920: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:44 192.168.1.251 : timer : #2 (0x42a920)->0x4297a0: ^I0 sec 0 usec^I0x4033a9
...
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #963 (0x42e320)->0x42bd40: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #964 (0x42bd40)->0x42a6c0: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #965 (0x42a6c0)->0x4295a0: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 [b]rekey.watchdog: caught : timer : #0 (0x42e600)->0x42bfa0: ^I36 sec 204686 usec^I0x4033a9[/b]
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #966 (0x4295a0)->0x431220: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #967 (0x431220)->0x42e340: ^I32 sec 136777 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #968 (0x42e340)->0x42bd60: ^I0 sec 0 usec^I0x4033a9
...
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1000 (0x4313a0)->0x42e440: ^I28 sec 821709 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1001 (0x42e440)->0x42be00: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1002 (0x42be00)->0x42a780: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 [b]: nas : daemon successfully stopped[/b]
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1003 (0x42a780)->0x429620: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1004 (0x429620)->0x4313e0: ^I0 sec 0 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1005 (0x4313e0)->0x4313c0: ^I0 sec 990000 usec^I0x4033a9
...
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1085 (0x4317a0)->0x4317c0: ^I21 sec 502485 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1086 (0x4317c0)->0x4317e0: ^I76 sec 471811 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : #1087 (0x4317e0)->0x0: ^I100 sec 570553 usec^I0x4033a9
/var/log/messages.1:Oct  5 17:51:45 192.168.1.251 : timer : eventlist is full
/var/log/messages.1:Oct  5 17:51:47 192.168.1.251 : nas : daemon hanging, send SIGKILL
/var/log/messages.1:Oct  5 17:51:48 192.168.1.251 : nas : start nas lan
/var/log/messages.1:Oct  5 17:51:48 192.168.1.251 : nas : start nas for wl0
/var/log/messages.1:Oct  5 17:51:48 192.168.1.251 : nas : NAS lan (wl0 interface) successfully started
/var/log/messages.1:Oct  5 17:51:48 192.168.1.251 : nas : start nas lan
/var/log/messages.1:Oct  5 17:51:48 192.168.1.251 : nas : start nas for wl1
/var/log/messages.1:Oct  5 17:51:48 192.168.1.251 : nas : NAS lan (wl1 interface) successfully started
/var/log/messages.1:Oct  5 17:51:48 192.168.1.251 [b]rekey.watchdog: rekey is now 36003[/b]
/var/log/messages.1:Oct  5 17:51:53 192.168.1.251 rekey: started
/var/log/messages.1:Oct  5 18:01:39 192.168.1.251 ntpclient[30447]: Time set from 192.168.1.13 [192.168.1.13].
/var/log/messages.1:Oct  5 18:01:39 192.168.1.251 process_monitor[1642]: cyclic NTP Update success (servers 192.168.1.13)
...


Last edited by lazardo on Fri Oct 19, 2018 22:09; edited 1 time in total
kernel-panic69
DD-WRT Guru


Joined: 08 May 2018
Posts: 14125
Location: Texas, USA

PostPosted: Wed Oct 17, 2018 13:10    Post subject: Reply with quote
Ok, the 'update' at the top of your last post is a little confusing, looking at the 'updated' script towards the bottom of the post. Could you please verify the changes and updated script, please?

EDIT: 'nvram show | grep _gtk_rekey' shows values for vifs / vaps that are not configured / enabled. I am currently testing 'nvram set' via cli, setting those values to 0 to see if it affects problems that seem to have recently arisen with wi-fi stability, because I think even though I have the main interfaces (wl0 and wl1) re-key set to 0, those values were being read and affecting things.
moraym
DD-WRT Novice


Joined: 21 Jul 2015
Posts: 1

PostPosted: Mon Dec 24, 2018 3:51    Post subject: Reply with quote
I did not have any NAS and yet there are plenty of these messages and LAN/WiFi are stuck sporadically for 2-20 seconds. Will it be enough for me just to stop nas and wlconf services?
stopservice nas
stopservice wlconf


Router Model Asus RT-AC66U
Firmware Version DD-WRT v3.0-r37442 giga (10/19/1Cool
Kernel Version Linux 3.10.108-d6 #21410 Fri Oct 19 15:34:31 CEST 2018 mips

Thanks!
danielwritesback
DD-WRT User


Joined: 29 Aug 2011
Posts: 240

PostPosted: Tue Dec 25, 2018 9:06    Post subject: Reply with quote
moraym wrote:
I did not have any NAS and yet there are plenty of these messages and LAN/WiFi are stuck sporadically...

NAS = Network Attached Storage, but nas = network authentication service

So, run this command:
nvram set wl0_wpa_gtk_rekey=0;nvram set wl0.1_wpa_gtk_rekey=0;nvram set wl0.2_wpa_gtk_rekey=0;nvram set wl0.3_wpa_gtk_rekey=0

And run this command:
nvram set wl1_wpa_gtk_rekey=0;nvram set wl1.1_wpa_gtk_rekey=0;nvram set wl1.2_wpa_gtk_rekey=0;nvram set wl1.3_wpa_gtk_rekey=0

And run this command too:
nvram commit

And install this startup script:
sleep 10;stopservice nas;stopservice wlconf;startservice wlconf;startservice nas

P.S.
The error was installed at 35667; so, earlier versions don't need the script.
kernel-panic69
DD-WRT Guru


Joined: 08 May 2018
Posts: 14125
Location: Texas, USA

PostPosted: Tue Dec 25, 2018 9:57    Post subject: Reply with quote
danielwritesback wrote:

P.S.
The error was installed at 35667; so, earlier versions don't need the script.


Supposedly, that was a 'bugfix' to $linuxver/net/ipv4/etherip.c that was backported all the way to 3.2 (in DD-WRT).... but, it introduced another bug. Now, the question is, was it a fix from upstream vanilla kernels or from the linux-MIPS community? Also, I am curious how you determined that it was that specific revision?
danielwritesback
DD-WRT User


Joined: 29 Aug 2011
Posts: 240

PostPosted: Thu Dec 27, 2018 7:25    Post subject: Reply with quote
kernel-panic69 wrote:
danielwritesback wrote:

P.S.
The error was installed at 35667; so, earlier versions don't need the script.


Supposedly, that was a 'bugfix' to $linuxver/net/ipv4/etherip.c that was backported all the way to 3.2 (in DD-WRT).... but, it introduced another bug. Now, the question is, was it a fix from upstream vanilla kernels or from the linux-MIPS community? Also, I am curious how you determined that it was that specific revision?

Version 35531 didn't have much errors, and it is quite famous for being a recent stable 'go to' version. After that point, the new build threads can be roughly translated as something about hell and a handbasket. So, I decided that the error was introduced in the version(s) after 35531. It was a somewhat educated guess also involving a bit of research (not too conclusive due to small number of reports posted).

Speaking of educated guess! Oh, I need your help in those top secret dnsmasq commands, as in whatever makes it work well aboard the e4200.
I'd love to try it.
kernel-panic69
DD-WRT Guru


Joined: 08 May 2018
Posts: 14125
Location: Texas, USA

PostPosted: Thu Dec 27, 2018 15:28    Post subject: Reply with quote
danielwritesback wrote:
Speaking of educated guess! Oh, I need your help in those top secret dnsmasq commands, as in whatever makes it work well aboard the e4200.
I'd love to try it.


The only problems I am having right now is due to the APs always being 'isolated', or something. DNSMasq tends to be a little screwy (again), but ONLY on wi-fi, not wired. I think the last good running build I had was 37442 or 37582 K3X, didn't seem to be too problematic. Below is how I have it set up, I don't use local DNS or the DNSSEC cache (that may be where I might be having problems?). Also, I am using OpenDNS servers via the additional options box with no-resolv directive:

Code:
no-resolv
server=208.67.222.220
server=208.67.220.222
server=208.67.220.220
server=208.67.222.222



setup_page.png
 Description:
 Filesize:  80.92 KB
 Viewed:  7386 Time(s)

setup_page.png



DNSMasqOPT.png
 Description:
 Filesize:  67.93 KB
 Viewed:  7386 Time(s)

DNSMasqOPT.png


danielwritesback
DD-WRT User


Joined: 29 Aug 2011
Posts: 240

PostPosted: Sun Dec 30, 2018 9:06    Post subject: Reply with quote
Thanks!
kernel-panic69 wrote:
...DNSMasq tends to be a little screwy (again), but ONLY on wi-fi, not wired....
I bet that is related to wi-fi connectivity in some way. If a wifi client wants dns but can't get through, there can be a cyclic retry issue.

Forced redirect may make that get worse: If you have a google home that really wants 8.8.8.8, connection count skyrockets out of control (unless you block 8.8.8.8 using reject). In case you needed the redirect, here's a cpu efficient version, without dnat.
iptables -t nat -A PREROUTING -i br0 -p udp --dport 53 -j REDIRECT
This an example of redirect to myself command (and dnsmasq answers port 53). If you have an optional time server installed in your router, same command will do for port 123. The dnat version (in the dd-wrt checkbox) is only needed if you want to redirect to a different locale.
Client on a retry spree + dnat + wifi = wee little mushroom cloud.

Here's per client restriction to prevent udp bombardment over wifi (startup script):
iptables -I INPUT -i eth1 -p udp -s 192.168.1.0/24 -m connlimit --connlimit-mask 32 --connlimit-above 60 -j REJECT
iptables -I INPUT -i eth2 -p udp -s 192.168.1.0/24 -m connlimit --connlimit-mask 32 --connlimit-above 60 -j REJECT
nvram set ip_conntrack_udp_timeouts=60
The only client to get 'punished' is the one out of control on a retry spree, and the suppression only lasts a minute at worst, but probably just the couple of seconds it takes for a client to back off when it gets hammered with rejects (a QOS method).
I actually have my connlimit range 192.168.1.1/25 and dhcp auto range ends at 192.168.1.126 so that connlimit is applied to it (I don't have -i interface specified). This leaves the upper half for static assignments without connlimit and that's where I put server oriented equipment that is supposed to make more numerous connections.
You can do other things with connlimit, such as if more than x number connections set to very worst QOS classification and/or limit overage connections to x number per second (trickle enough to prevent timeout). That would be very transparent, but I'm not quite that skilled with itables.

Meanwhile, back to wifi,
DNS can be the first thing to complain, even if DNS isn't the actual cause; so, if the '53 unreplied' connection count gets extreme... personally, I have to restart my cable modem to fix that. There's a variety of commands that could make DNSMasq slightly more resilient to connectivity losses; however, fixes are less effective when applied too far away from the locale of the actual problem. So, it turns in to more of a question, cause or effect?

According to my records, 36104 has working WDS, which, possibly, means more stable wifi.

Thanks again for sharing the DNSmasq examples.
Goto page 1, 2  Next Display posts from previous:    Page 1 of 2
Post new topic   Reply to topic    DD-WRT Forum Index -> Broadcom SoC based Hardware All times are GMT

Navigation

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You can attach files in this forum
You can download files in this forum