CC3220SF: Secondary DNS Failure

Ben Schoeler

Intellectual 770 points

Part Number: CC3220SF
Other Parts Discussed in Thread: CC3200

Hi TI support, we are still looking for a resolution to this issue here:

over 4 years ago

0 Michael Reymond over 4 years ago

TI__Mastermind 40965 points

Hi Ben,

I see the original thread had left off with the NWP logs being shared.

The NWP logs unfortunately do not show any irregularities: there are no errors or other unexpected behavior. To fully debug this I will need to reproduce the issue. I do have a Raspberry Pi on hand, so I will try to follow the steps explained in the original thread to set up a DNS server under my control, and then purposely induce a failure to see if the CC32xx can correctly switch DNS servers.

I will work on getting that set up and done by the end of the week.

Regards,

Michael

0 Michael Reymond over 4 years ago in reply to Michael Reymond

Hi Ben,

An update on the DNS issue you are seeing - the NWP logs you had provided in the previous thread did not show any attempts to use DNS through the sl_NetAppDnsGetHostByName API. So I am still not 100% sure what fault behavior you're running into.

That being said, I set up a test that is an analogue of your Pi-hole DNS server test. I have my phone running a Wi-Fi hotspot, with internet connectivity through cellular. I also have a secondary DNS set through the SimpleLink API, to the Google 8.8.8.8 DNS.

To run my test, I have my CC3220 attempt to ping an arbitrary hostname such as http://www.ti.com/. With internet connectivity, my primary DNS is used as expected. When I turn off my phone's cellular connection, removing connectivity to both the primary and secondary DNS, I see the CC3220 attempt to contact the secondary server as expected once the primary DNS lookup times out. So I do not see the same behavior you're seeing on the CC3220.

If you could collect NWP logs again, this time capturing your DNS failure cases, and perhaps use an air sniffer capture to confirm that the secondary DNS is not used at all, that would be helpful.

Regards,

Michael

0 Ben Schoeler over 4 years ago in reply to Michael Reymond

Hi Michael,

I will attempt to gather another log where the secondary DNS is not being used.

I believe the first log that Tom left may show the DHCP sequencing, where the DHCP is overwriting the custom secondary DNS we set previously. We would like to know of something to key off so that we know when it is safe to set the custom secondary DNS, instead of introducing an arbitrary delay.

0 Ben Schoeler over 4 years ago in reply to Michael Reymond

Hi Michael,

Here is an NWP log exhibiting the issue.

dns_log.log

The NWP gets into a state where DNS repeatedly fails, even though the Primary DNS 192.168.86.1 and Secondary DNS 1.1.1.1 should be valid and functional. The only thing that can recover this condition is power cycling the device.

Towards the end of the log, I power cycle the device and you should see the DNS complete successfully with identical Primary and Secondary DNS IP addresses.

Hope this helps things.

Ben

0 Michael Reymond over 4 years ago in reply to Ben Schoeler

Hi Ben,

Thanks for providing the fresh set of logs.

Looking at the log data, it does appear that the NWP tries both the primary and secondary DNS repeatedly even during the failure case - I see the NWP sending DNS requests to the IPs that you mention in your post, for a total of 15 retries for each server.

Have you used an air sniffer to see if the DNS request packets are actually sent out by the CC3220? Perhaps the packets are somehow dropped by the AP, or otherwise not sent to the DNS servers. In the logs, the DNS request simply fails due to no response, error 6150. Do you see the DNS packets actually get transmitted on the air during the failure cases?

Regards,
Michael

0 Ben Schoeler over 4 years ago in reply to Michael Reymond

Hi Michael

Thanks for the insight. I did reproduce the problem with a WiFi sniffer: the DNS queries go out from the NWP to the router, but we do not receive any packets back with the DNS response at the NWP level.

However, further upstream, at the router to the DNS server layer, I can see this:

The DNS Response does come back from the server to the Nest Mesh Router, but then the Router reports to the DNS server an ICMP error (even though it should be expecting this packet). So it seems like the router is eating the packet and not passing it along to the device...

Not really sure what is actionable from this...

Are there any internal DNS parameters we can change other than the timeout and retry?

Ben

0 Michael Reymond over 4 years ago in reply to Ben Schoeler

Hi Ben,

Thanks for performing the test with an air sniffer, and checking the WAN traffic of the router to see where the DNS request appears to fail.

From your results, it does appear that the DNS response from the server is black-holed by the AP. So the issue isn't directly caused by the CC3220. Still, the behavior of the CC3220 could be a factor. Some questions I have:

1. Does this DNS request behavior, where the response is black-holed by the AP, ever manifest itself if you do not use the secondary DNS?

2. Are you able to see this issue occur with other models of APs, or just the specific Nest Mesh AP?

3. When this DNS issue occurs, are you able to still maintain connectivity on existing sockets? I'm wondering whether this issue only impacts DNS lookup functionality, or whether there is a deeper underlying issue at hand that also affects broader internet connectivity on the CC3220.

In the past, there have been instances where we have observed interoperability issues with the CC3220 and mesh APs. I do not remember if the Nest Mesh AP was specifically impacted, but one issue that we have seen is that for some reason the satellite APs of the mesh network will drop the CC3220 MAC address from their ARP tables, making it impossible to route packets from an internet server back to the CC3220. Are you able to examine the running state of the Nest Mesh AP to check its ARP table? This usually isn't something readily accessible in the AP's management interface, but if you had a method to SSH into the underlying Linux OS then you could check the ARP table to see if the CC3220 is still present when this issue occurs.

The resolution to the issue we saw with the CC3220 being dropped from the AP's ARP tables was to deliberately issue a dummy ARP request from the CC3220 at regular intervals - for efficiency, when responding to an ARP request most APs will also record the requesting device's MAC and IP in their ARP tables.

At this post here, I demonstrate how to build and perform this 'gratuitous' ARP request:

https://e2e.ti.com/support/wireless-connectivity/wifi/f/968/p/845693/3134002#3134002

If you adjust the IP/MAC addresses in the example to fit your CC3220 and your AP, and run the Wireless_Arp_Test() function when a DNS error is encountered, are you able to then successfully perform a DNS request?
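For readers who want to see what such a request looks like on the wire, here is a minimal sketch in C of the standard ARP request layout (an Ethernet II header followed by the 28-byte IPv4-over-Ethernet ARP payload). This is not the code from the linked post: `build_arp_request` and its parameters are hypothetical names chosen for illustration, and the exact framing and socket calls the CC3220 raw socket expects should be taken from that post rather than from this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARP_FRAME_LEN 42u  /* 14-byte Ethernet header + 28-byte ARP payload */

/* Hypothetical helper: writes an Ethernet II frame carrying an ARP request
 * into out[0..41]. Returns the frame length, or 0 if the buffer is missing
 * or too small. Addresses are passed as raw bytes in network order. */
size_t build_arp_request(uint8_t *out, size_t out_len,
                         const uint8_t src_mac[6], const uint8_t src_ip[4],
                         const uint8_t dst_mac[6], const uint8_t dst_ip[4])
{
    static const uint8_t arp_fixed[8] = {
        0x00, 0x01,  /* hardware type: Ethernet */
        0x08, 0x00,  /* protocol type: IPv4 */
        0x06,        /* hardware address length */
        0x04,        /* protocol address length */
        0x00, 0x01   /* opcode: request */
    };

    assert(src_mac != NULL && src_ip != NULL && dst_mac != NULL && dst_ip != NULL);
    if (out == NULL || out_len < ARP_FRAME_LEN) {
        return 0;
    }

    memcpy(out + 0, dst_mac, 6);      /* Ethernet destination (the AP) */
    memcpy(out + 6, src_mac, 6);      /* Ethernet source (this device) */
    out[12] = 0x08;                   /* EtherType 0x0806: ARP */
    out[13] = 0x06;
    memcpy(out + 14, arp_fixed, 8);   /* fixed ARP header fields */
    memcpy(out + 22, src_mac, 6);     /* sender hardware address */
    memcpy(out + 28, src_ip, 4);      /* sender protocol address */
    memset(out + 32, 0, 6);           /* target hardware address: left zero */
    memcpy(out + 38, dst_ip, 4);      /* target protocol address */
    return ARP_FRAME_LEN;
}
```

The sender fields are the part that matters here: they carry the device's own MAC and IP, which is what the AP can record when it processes the request. The Ethernet destination is a parameter so the frame can be addressed directly to the AP, as suggested above; a general-purpose ARP request would normally use the broadcast address ff:ff:ff:ff:ff:ff instead.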

Regards,
Michael

0 Ben Schoeler over 4 years ago in reply to Michael Reymond

Hi Michael. Thanks for the idea. I will implement and test the ARP request from CC3220 to router this morning.

Interestingly enough, with the WiFi packet sniffer, at the beginning of the failure case I do see a flurry of DNS requests with a DNS response from the AP; however, this does not seem to stop the DNS queries from continuing from the CC3220. Then, after a bit, an ARP request and response go out, and then the DNS responses stop coming. Not sure that provides us any info, but in theory the router should have its MAC address due to that. Also, it seems like the CC3200 should just accept the first DNS response and stop sending further requests?

Your questions:

1. Not sure... we've had secondary DNS set for a while now. I will turn it off and report after the ARP request test.

2. We've had reports of other mesh routers having similar issues.

3. a. Not sure. We only have one connection on our system, and during the DNS resolve we are trying to (re)establish that connection. Would it be helpful if I tried a ping to a known good address during the fail case?

b. The Google Nest router seems pretty locked down. Not sure I can SSH to check an ARP table but I will try and see if there's a way.

0 Ben Schoeler over 4 years ago in reply to Ben Schoeler

Hi Michael.

It appears sending the ARP Request to the Router IP and MAC immediately after the initial DNS failure allows subsequent DNS requests to work again!

I ported the code you linked but we are not waiting for a response (not sure it's necessary in our case to know the response anyhow).

Do you suggest this ARP Request only on DNS failures, or are there other failure cases that may indicate we should send an ARP request? We have seen another failure case where the error was related to a TLS error, but have not been able to reproduce this yet. One thing that did seem to help that problem was enabling the NO POLL feature, but it doesn't make much sense why that would help things. Do you think this could be the root cause of that?

Thanks!

Ben

0 Michael Reymond over 4 years ago in reply to Ben Schoeler

Hi Ben,

Reading the ARP response back from the AP is not necessary in your case. I had it in the linked post to form a complete example of how you could build a full request using the raw socket mechanism.

You should only use the gratuitous ARP request when needed, such as during a DNS error case. It may or may not help in your TLS error case, and more debug info would be needed there.

In general, performing the gratuitous ARP request should be limited, as while the raw socket is open all existing sockets will not function correctly. While the window of time in which the raw socket is open is very small, this could still potentially cause an issue with your existing socket code if a packet happened to come in right when the ARP request is made. In the case of a DNS error, performing the ARP request is fairly safe since the AP cannot route packets back to the CC3220 anyway, but if your CC3220 has connectivity it would be best to avoid performing the gratuitous ARP request.
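As a rough illustration of that policy (send the ARP request only after a DNS failure, and only a bounded number of times), here is a minimal sketch in C. All names in it are hypothetical: `dns_recovery_t`, `resolve_with_arp_recovery`, and the two hook types are invented for this example. In real firmware the lookup hook would presumably wrap sl_NetAppDnsGetHostByName and the refresh hook would wrap the Wireless_Arp_Test() routine from the linked post. The `sim_*` functions only simulate an AP that has dropped the device from its ARP table, and the -1 they return is a placeholder, not a SimpleLink error code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks: both return 0 on success and a negative value on failure. */
typedef int (*dns_lookup_fn)(const char *host, void *ctx);
typedef int (*arp_refresh_fn)(void *ctx);

typedef struct {
    dns_lookup_fn  lookup;            /* performs one DNS lookup attempt */
    arp_refresh_fn arp_refresh;       /* sends one ARP request to the AP */
    void          *ctx;               /* passed through to both hooks */
    int            max_arp_refreshes; /* upper bound on ARP requests per call */
} dns_recovery_t;

/* Tries the lookup once. Only if it fails, sends an ARP request and retries,
 * up to max_arp_refreshes times. Returns the result of the last lookup. */
int resolve_with_arp_recovery(const dns_recovery_t *r, const char *host)
{
    assert(r != NULL && r->lookup != NULL && r->arp_refresh != NULL && host != NULL);

    int rc = r->lookup(host, r->ctx);
    for (int i = 0; rc < 0 && i < r->max_arp_refreshes; i++) {
        if (r->arp_refresh(r->ctx) < 0) {
            break;  /* the ARP request could not be sent; keep the DNS error */
        }
        rc = r->lookup(host, r->ctx);
    }
    return rc;
}

/* Simulation stand-ins for host-side experiments (not SimpleLink calls). */
typedef struct {
    int ap_has_entry;  /* 1 if the simulated AP can route replies to us */
    int lookups;       /* number of lookup attempts made */
    int arps;          /* number of ARP requests sent */
} sim_link_t;

int sim_lookup(const char *host, void *ctx)
{
    sim_link_t *link = (sim_link_t *)ctx;
    (void)host;
    link->lookups++;
    return link->ap_has_entry ? 0 : -1;
}

int sim_arp_refresh(void *ctx)
{
    sim_link_t *link = (sim_link_t *)ctx;
    link->arps++;
    link->ap_has_entry = 1;  /* the simulated AP relearns our MAC and IP */
    return 0;
}
```

With `sim_link_t link = {0, 0, 0}` and `max_arp_refreshes` set to 1, a call to `resolve_with_arp_recovery` makes one failed lookup, one ARP request, and one successful lookup. When the link is healthy from the start, no ARP request is sent at all, which keeps the raw socket closed whenever existing sockets may be carrying traffic. A production version would likely trigger the refresh only on the specific no-response DNS error rather than on any negative return value.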

Regards,

Michael
