DNS resolution failures manifest in several distinct ways, each pointing to different underlying problems. When users cannot reach your site or when monitoring alerts fire, understanding the specific error code returned by resolvers allows you to quickly narrow down the root cause and apply the appropriate fix.
Two of the most common resolver errors are SERVFAIL and NXDOMAIN, but they indicate entirely different failure modes. NXDOMAIN means the resolver successfully contacted authoritative servers and confirmed that the requested name does not exist. SERVFAIL means something went wrong during the resolution process itself, preventing the resolver from obtaining a definitive answer. Distinguishing between these errors and understanding their variants forms the foundation of effective DNS troubleshooting.
Understanding DNS Response Codes
DNS responses include a status code (RCODE) that indicates what happened during query processing. RCODE 0 (NOERROR) means the query succeeded, though it may or may not have returned records. RCODE 3 (NXDOMAIN) means the queried domain name does not exist. RCODE 2 (SERVFAIL) indicates a server failure during resolution.
These codes appear in resolver responses and authoritative responses. A recursive resolver might receive NOERROR from an authoritative server but still return SERVFAIL to the client if it encountered problems during recursion. Understanding where in the resolution chain the error originates helps identify whether the problem lies with your authoritative infrastructure, with intermediate nameservers, or with the recursive resolver itself.
Other response codes exist but appear less frequently in typical troubleshooting. RCODE 5 (REFUSED) indicates the server refused to answer the query, often due to policy restrictions. RCODE 4 (NOTIMP) means the server does not implement the requested operation. For most domain owners, SERVFAIL and NXDOMAIN represent the bulk of resolution failures requiring investigation.
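As a small illustration, the two helper functions below map the numeric RCODEs named in this section to their mnemonics and pull the status field out of dig's header line. The function names are hypothetical (they are not part of dig), and the sketch assumes dig's usual `->>HEADER<<-` output line.

```shell
# rcode_name NUMBER: print the mnemonic for the RCODEs discussed above.
rcode_name() {
  case "$1" in
    0) echo "NOERROR" ;;
    2) echo "SERVFAIL" ;;
    3) echo "NXDOMAIN" ;;
    4) echo "NOTIMP" ;;
    5) echo "REFUSED" ;;
    *) echo "OTHER" ;;   # codes not covered in this article
  esac
}

# dig_status: read dig output on stdin and print the status from the header line.
dig_status() {
  sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Example (performs a live query):
#   dig www.example.com | dig_status
```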
When NXDOMAIN is Actually Correct
NXDOMAIN responses are not always errors. If a user mistypes a subdomain or requests a resource that genuinely does not exist, NXDOMAIN is the correct response. The challenge is distinguishing between legitimate negative answers and configuration problems that incorrectly produce NXDOMAIN.
If your monitoring shows NXDOMAIN responses for domains you expect to exist, start by verifying that the records are actually published. Query your authoritative nameservers directly using dig or another DNS tool:
dig @ns1.example.com www.example.com
If the authoritative server returns the expected records, but recursive resolvers return NXDOMAIN, the problem likely involves delegation or zone configuration. Check that your parent zone correctly delegates to your nameservers and that all delegated nameservers are responding consistently.
A common cause of unexpected NXDOMAIN responses is stale or lame delegation, where a parent zone delegates to nameservers that no longer serve the current zone. This often happens after migrating DNS hosting if the parent zone still points to the old nameservers. An old nameserver that still holds a stale copy of the zone answers authoritatively, so a name added after the migration comes back as NXDOMAIN. A nameserver that no longer holds the zone at all usually answers REFUSED or SERVFAIL instead, which resolvers report to clients as SERVFAIL.
If you recently registered a domain or changed nameservers, verify that the parent zone delegation has propagated. Registry updates can take time, and during the transition period, some resolvers may reach nameservers with stale or missing zone data.
SERVFAIL: When Resolution Itself Breaks
SERVFAIL indicates that a resolver encountered a problem preventing it from completing the query. Unlike NXDOMAIN, which represents a successful query with a negative answer, SERVFAIL means the resolution process failed. The specific cause can range from network timeouts to DNSSEC validation failures to resource exhaustion.
When troubleshooting SERVFAIL responses, your first step is determining where the failure occurs. Query a public resolver like 8.8.8.8 or 1.1.1.1 to see if they return SERVFAIL. If they do, the problem likely lies with your authoritative infrastructure or the zone configuration itself. If public resolvers work but your local resolver fails, investigate network connectivity, firewall rules, or resolver configuration.
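The comparison described above can be written down as a tiny decision helper. This is only a sketch: the function name and the labels it prints are hypothetical, and it takes the two status strings as arguments rather than running queries itself.

```shell
# locate_servfail PUBLIC_STATUS LOCAL_STATUS
# PUBLIC_STATUS: status seen via a public resolver (e.g. 8.8.8.8)
# LOCAL_STATUS:  status seen via your local resolver
locate_servfail() {
  public_status=$1
  local_status=$2
  if [ "$public_status" = "SERVFAIL" ]; then
    # Public resolvers fail too: look at authoritative servers or the zone.
    echo "authoritative-or-zone"
  elif [ "$local_status" = "SERVFAIL" ]; then
    # Only the local resolver fails: look at connectivity, firewalls, resolver config.
    echo "local-resolver-or-network"
  else
    echo "no-servfail"
  fi
}

# Example: locate_servfail NOERROR SERVFAIL
```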
SERVFAIL responses often lack detailed error information in the DNS response itself. The RCODE indicates failure, but not why. Advanced troubleshooting requires examining resolver logs, testing with diagnostic flags, or using tools that provide extended error information.
DNSSEC Validation Failures and Bogus Status
One of the most common causes of SERVFAIL responses is DNSSEC validation failure. When a zone is signed with DNSSEC but the signatures are invalid, expired, or do not chain correctly to the root of trust, validating resolvers return SERVFAIL rather than potentially serving unvalidated data.
DNSSEC validation failures fall into several categories. A "bogus" status means the resolver detected a validation error—perhaps signatures have expired, or the chain of trust is broken. An "insecure" status means the zone is not signed with DNSSEC, which is acceptable. An "indeterminate" status means the resolver could not determine validation status, possibly due to timeouts or missing records.
To test DNSSEC validation, use dig with the +dnssec flag:
dig +dnssec www.example.com
The response will include RRSIG records if the zone is signed. Check the signature expiration timestamps to ensure they are current. If signatures have expired, your zone signing process has failed, and resolvers validating DNSSEC will return SERVFAIL until you generate fresh signatures.
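Reading expiration timestamps by eye is error-prone, so a small filter can help. The sketch below assumes dig's standard presentation format for RRSIG records, where the ninth whitespace-separated field of an answer line is the expiration time as YYYYMMDDHHMMSS in UTC. The function name is hypothetical.

```shell
# rrsig_expiry_report NOW
# Reads dig answer lines on stdin; NOW is a UTC timestamp as YYYYMMDDHHMMSS.
# Prints owner name, covered type, expiration, and whether it has passed.
rrsig_expiry_report() {
  awk -v now="$1" '$4 == "RRSIG" {
    state = ($9 < now) ? "EXPIRED" : "valid"
    print $1, $5, "expires", $9, state
  }'
}

# Example (performs a live query):
#   dig +dnssec www.example.com | rrsig_expiry_report "$(date -u +%Y%m%d%H%M%S)"
```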
A more comprehensive DNSSEC test involves checking the entire chain of trust:
dig +trace +dnssec www.example.com
This command follows the delegation chain from the root through TLDs to your authoritative servers, showing DNSSEC status at each level. If the chain breaks at any point—missing DS records in the parent zone, invalid signatures, or algorithm mismatches—validation will fail and resolvers will return SERVFAIL.
When a domain uses DNSSEC alongside TLS certificates, a failure in either layer can disrupt service. TLS certificate validation happens at the application layer and DNSSEC operates at the DNS layer, but resolution comes first: a DNSSEC failure that prevents resolution means users never reach your site to validate its certificate.
EDNS and Response Size Issues
Extension Mechanisms for DNS (EDNS) allows DNS messages to exceed the original 512-byte limit for UDP responses. Modern zones with DNSSEC signatures or numerous records often produce responses larger than 512 bytes. EDNS negotiation allows resolvers and servers to agree on larger message sizes, but problems in this negotiation can cause SERVFAIL.
When a server's response exceeds the size the resolver can handle, the server sets the truncation flag (TC=1) and the resolver should retry over TCP. If the resolver fails to perform this retry, or if the server does not properly support TCP fallback, resolution fails with SERVFAIL.
You can test EDNS behavior by querying with specific buffer sizes:
dig +bufsize=512 www.example.com
dig +bufsize=4096 www.example.com
If the smaller buffer size triggers truncation or SERVFAIL while the larger size works, your zone's responses may be too large for some resolvers. This commonly affects signed zones, where RRSIG, DNSKEY, and NSEC3 records add substantially to each response.
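One practical wrinkle: dig retries over TCP by default when it receives a truncated UDP reply, so the truncation itself is easy to miss. Adding dig's `+ignore` option keeps the truncated reply, and the `tc` flag then shows up in the flags line. The helper below is a sketch with a hypothetical name; it only inspects dig output passed on stdin.

```shell
# has_tc_flag: succeed (exit 0) if dig output on stdin shows the tc flag.
has_tc_flag() {
  grep -Eq '^;; flags:[^;]* tc( |;)'
}

# Example (performs a live query; +ignore stops dig from retrying over TCP):
#   dig +bufsize=512 +ignore +dnssec www.example.com | has_tc_flag && echo "truncated"
```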
Firewalls or middleboxes that block large UDP packets or TCP port 53 can also cause EDNS failures. If your infrastructure sits behind security appliances, verify that they permit DNS over TCP and allow UDP fragments or large UDP packets. Some security devices incorrectly treat large DNS responses as attacks and block them.
Lame Delegation and Authority Confusion
Lame delegation occurs when a zone's NS records point to nameservers that do not actually serve the zone. This creates inconsistent responses depending on which nameserver a resolver queries. Some queries succeed while others return SERVFAIL or NXDOMAIN, creating intermittent failures that are difficult to diagnose.
To check for lame delegation, query each of your authoritative nameservers individually:
dig @ns1.example.com example.com SOA
dig @ns2.example.com example.com SOA
Each nameserver should return the same SOA record with the same serial number. If any nameserver returns SERVFAIL, REFUSED, or NXDOMAIN, that nameserver is lame for your zone. Remove it from your NS records or fix its configuration to properly serve the zone.
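With more than a couple of nameservers, comparing serials by hand gets tedious. The sketch below assumes the output of `dig +short ... SOA`, where the serial is the third field, and uses hypothetical helper names.

```shell
# soa_serial: print the serial (third field) from `dig +short ... SOA` output on stdin.
soa_serial() {
  awk 'NF >= 7 { print $3; exit }'
}

# serials_consistent: read one serial per line on stdin;
# succeed only if there is exactly one distinct non-empty value.
serials_consistent() {
  [ "$(grep -v '^$' | sort -u | wc -l | tr -d ' ')" = "1" ]
}

# Example (performs live queries):
#   for ns in ns1.example.com ns2.example.com; do
#     dig @"$ns" example.com SOA +short | soa_serial
#   done | serials_consistent || echo "serial mismatch or missing answer"
```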
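Parent zone delegation can also be incorrect. Check that your domain's NS records in the parent zone match the NS records in your own zone. Query a nameserver for the parent zone first (a.gtld-servers.net for a .com domain; the NS records appear in the authority section of the referral), then query your own nameserver:
dig @a.gtld-servers.net example.com NS +norecurse
dig @ns1.example.com example.com NS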
The two sets of results should be identical. Mismatches indicate stale delegation or synchronization problems between your zone and the registry.
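Comparing NS sets is mostly a normalization problem: case, trailing dots, and ordering differ between outputs. The sketch below uses hypothetical function names and takes each set as a newline-separated list of host names.

```shell
# normalize_ns: lowercase, strip the trailing dot, drop blanks, sort, dedupe.
normalize_ns() {
  tr 'A-Z' 'a-z' | sed 's/\.$//' | grep -v '^$' | sort -u
}

# ns_sets_match PARENT_LIST CHILD_LIST: succeed if both sets contain the same hosts.
ns_sets_match() {
  [ "$(printf '%s\n' "$1" | normalize_ns)" = "$(printf '%s\n' "$2" | normalize_ns)" ]
}

# Example (performs live queries; a.gtld-servers.net assumes a .com domain):
#   parent=$(dig +noall +authority +norecurse @a.gtld-servers.net example.com NS | awk '$4 == "NS" { print $5 }')
#   child=$(dig +short @ns1.example.com example.com NS)
#   ns_sets_match "$parent" "$child" || echo "delegation mismatch"
```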
Network Timeouts and Connectivity Problems
SERVFAIL can result from network timeouts when resolvers cannot reach authoritative nameservers within acceptable time limits. If your nameservers are unreachable due to network outages, firewall misconfigurations, or routing problems, resolvers will time out and return SERVFAIL.
Test nameserver connectivity from multiple vantage points:
dig @ns1.example.com example.com +time=2 +tries=1
The +time and +tries flags limit how long dig waits for a response. If queries consistently time out from certain networks or regions, investigate network paths between resolvers and your nameservers.
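When scripting this check, dig's exit status is a convenient signal: per the dig manual, 0 means a reply was received (whatever its RCODE) and 9 means no reply from the server. The helper below is a sketch with a hypothetical name that translates the status into a short message.

```shell
# dig_exit_meaning CODE: describe dig's exit status.
dig_exit_meaning() {
  case "$1" in
    0) echo "reply received" ;;          # includes SERVFAIL/NXDOMAIN replies
    9) echo "no reply from server" ;;    # timeout or unreachable
    *) echo "other dig error" ;;         # usage or internal errors
  esac
}

# Example (performs a live query):
#   dig @ns1.example.com example.com +time=2 +tries=1 >/dev/null
#   dig_exit_meaning "$?"
```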
Load balancers or anycast configurations can mask individual nameserver failures by automatically routing around problems. However, if all instances in a region fail simultaneously, resolvers in that region will experience SERVFAIL. Monitoring should test each nameserver independently and from diverse geographic locations.
For operators managing hosting infrastructure with authoritative DNS, ensure that nameservers have adequate capacity and redundancy. A single overloaded nameserver might respond slowly or drop queries, causing intermittent SERVFAIL responses that are difficult to reproduce.
Recursion Failures and Resolver Chain Problems
Recursive resolvers must traverse the DNS hierarchy from root servers through TLDs to authoritative servers. If this traversal breaks at any point, the resolver returns SERVFAIL. Problems might include:
- Root or TLD servers not responding
- Intermediate delegation failures
- Circular dependencies in NS records
The +trace flag helps diagnose recursion failures:
dig +trace www.example.com
This shows each step of recursion, from root servers to TLDs to your authoritative servers. If the trace stops or returns errors before reaching your authoritative nameservers, the problem lies in upstream delegation.
Circular dependencies occur when nameserver hostnames themselves require resolution from the zone they serve. For example, if example.com delegates to ns1.example.com, but resolvers need to look up ns1.example.com to find example.com's nameservers, a circular dependency exists. Glue records in the parent zone break this dependency by providing the IP address of ns1.example.com directly.
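Whether a nameserver host name needs glue comes down to whether it sits at or below the zone it serves. The sketch below checks that with plain string matching; the function name is hypothetical and it does no DNS lookups.

```shell
# needs_glue ZONE NS_HOST: succeed if NS_HOST is at or below ZONE,
# i.e. resolving it would depend on the zone it serves.
needs_glue() {
  zone=$(printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed 's/\.$//')
  host=$(printf '%s\n' "$2" | tr 'A-Z' 'a-z' | sed 's/\.$//')
  case "$host" in
    "$zone" | *."$zone") return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   needs_glue example.com ns1.example.com && echo "glue required in parent zone"
```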
Building an Effective Troubleshooting Runbook
Systematic troubleshooting requires a methodical approach. When resolution failures occur, follow this workflow:
First, identify the RCODE returned to clients. Use dig or similar tools to query from the perspective of affected users:
dig www.example.com @8.8.8.8
If the response is NXDOMAIN, verify the record exists by querying authoritative nameservers directly. If it exists, check delegation and nameserver consistency.
If the response is SERVFAIL, test DNSSEC validation:
dig +dnssec +cd www.example.com @8.8.8.8
The +cd flag sets the Checking Disabled bit, which asks the resolver to skip DNSSEC validation for this query. If queries succeed with +cd but fail without it, DNSSEC validation is the problem. Examine signatures, check DS records in the parent zone, and verify that signing infrastructure is working.
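The with-and-without comparison can be captured in a few lines. This is a sketch only: the function name and output labels are hypothetical, and the two statuses are supplied by the caller.

```shell
# dnssec_verdict STATUS_VALIDATING STATUS_WITH_CD
# STATUS_VALIDATING: status from a normal query
# STATUS_WITH_CD:    status from the same query with +cd
dnssec_verdict() {
  if [ "$1" = "SERVFAIL" ] && [ "$2" != "SERVFAIL" ]; then
    echo "dnssec-validation-failure"
  elif [ "$1" = "SERVFAIL" ]; then
    # Fails either way: look at reachability, EDNS, or delegation instead.
    echo "not-dnssec-specific"
  else
    echo "no-servfail"
  fi
}

# Example: dnssec_verdict SERVFAIL NOERROR
```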
Next, check EDNS and response size:
dig +bufsize=512 www.example.com
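If small buffer sizes cause failures, your responses may be too large. Consider reducing DNSSEC signature sizes (for example, by moving from RSA to an elliptic-curve algorithm) or trimming large record sets. Lowering NSEC3 iterations reduces CPU cost for servers and resolvers but does not shrink responses.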
Finally, verify nameserver connectivity and authority:
for ns in $(dig +short example.com NS); do dig @"$ns" example.com SOA +short; done
This tests each nameserver individually. Any that fail indicate lame delegation or connectivity problems.
Monitoring and Alerting Strategies
Proactive monitoring helps catch DNS problems before users notice. Configure synthetic monitoring to query your critical DNS records from multiple geographic locations and resolver types. Alert on consistent SERVFAIL responses or unexpected NXDOMAIN results.
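To avoid paging on a single transient failure, one option is to alert only after several consecutive SERVFAIL results. The sketch below assumes a probe that records one status per line, oldest first; the function name and the idea of a consecutive-failure threshold are illustrative choices, not something a particular monitoring product prescribes.

```shell
# should_alert THRESHOLD
# Reads one status per line (oldest first) on stdin; succeeds if the most
# recent THRESHOLD results are all SERVFAIL.
should_alert() {
  awk -v n="$1" '
    { run = ($1 == "SERVFAIL") ? run + 1 : 0 }
    END { exit (run >= n ? 0 : 1) }
  '
}

# Example:
#   printf 'NOERROR\nSERVFAIL\nSERVFAIL\nSERVFAIL\n' | should_alert 3 && echo "alert"
```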
For email infrastructure relying on MX records, monitor both the MX records themselves and the A records of mail servers. Email delivery can fail if either layer returns errors. Include SPF, DKIM, and DMARC records in monitoring to catch misconfigurations that affect deliverability.
Set up logging on authoritative nameservers to track query patterns and response codes. Spikes in SERVFAIL responses might indicate DNSSEC problems, capacity issues, or attacks. Tracking NXDOMAIN responses helps identify common typos or obsolete subdomains that still receive traffic.
Many DNS hosting providers offer built-in monitoring with dashboards showing query volumes, response codes, and latency. Review these metrics regularly to establish baselines and quickly identify anomalies.
Common Pitfalls and How to Avoid Them
Several configuration mistakes consistently lead to resolution failures. After changing nameservers, always verify that the parent zone delegation has updated, and allow the old delegation's TTL to expire from caches, before decommissioning old nameservers. Removing them prematurely leaves resolvers that still hold the cached delegation querying servers that no longer answer, and those resolvers return SERVFAIL to their clients.
When implementing DNSSEC, test thoroughly before publishing DS records in the parent zone, since that is the step that makes validating resolvers enforce your signatures. Use tools like DNSViz to visualize your DNSSEC chain and identify potential problems. Ensure your signing process refreshes signatures well before they expire. The right interval depends on your signature validity period; regenerating and publishing signatures at least weekly is a common baseline.
Keep zone transfers working between primary and secondary nameservers. If secondaries serve stale data, they may return outdated records or SERVFAIL for newly added names. Monitoring should verify that all nameservers have the current zone serial number.
Avoid mixing authoritative and recursive resolver roles on the same server unless you fully understand the implications. Many DNS server packages default to operating in both modes, which can create confusion and security issues. Production authoritative nameservers should typically have recursion disabled.
When to Escalate and What Information to Provide
Some DNS problems require support from your hosting provider, domain registrar, or DNS service. When opening support tickets, include specific diagnostic information to speed resolution:
- The exact query that fails
- The resolver used for testing
- Complete dig output including flags and response sections
- Results from querying authoritative nameservers directly
- Trace output showing delegation path
- Recent changes to zone configuration or infrastructure
Avoid generic descriptions like "DNS is not working." Instead, specify "Queries for www.example.com to 8.8.8.8 return SERVFAIL with DNSSEC validation enabled, but succeed with validation disabled. Authoritative queries to ns1.example.com return valid signatures."
If the problem involves DNSSEC, run a full validation check using online tools like Verisign's DNSSEC Debugger and include the results. These tools often identify specific validation failures that are not obvious from dig output alone.
Resolution Failures in Context
DNS resolution failures sit at the foundation of many service disruptions. When resolution breaks, everything above it in the stack becomes unreachable, regardless of whether applications, databases, or servers are functioning correctly. Understanding SERVFAIL and NXDOMAIN responses, along with their common causes, allows you to quickly diagnose and resolve problems.
The key is building systematic troubleshooting habits. Rather than randomly trying fixes, methodically test each layer of DNS infrastructure, from parent zone delegation through authoritative responses to DNSSEC validation. This approach reliably identifies root causes and avoids the trial-and-error that extends outages.
As your DNS infrastructure grows in complexity, invest in monitoring, automation, and documentation. Runbooks that capture common failure modes and their solutions allow team members to respond effectively even if they were not involved in the original infrastructure design. DNS reliability depends on both proper configuration and operational discipline when problems arise.