XML sitemaps are an important (and often overlooked) tool for pointing search engines to the most important pages on your website.
Instead of leaving search engines to discover your website page by page, a sitemap gives them an efficient way to find, crawl, and index all of its pages.
Sitemaps are an important SEO factor, and they are often the first thing we look at when evaluating the SEO standing of a website. After submitting a sitemap through Google Search Console and Bing Webmaster Tools, many site owners learn that search engines have found issues with it.
It is important to eliminate as many sitemap errors as possible, as search engines don’t like to rely on sitemaps that have problems. Below are some of the most common sitemap errors and how you can go about fixing them.
Submitted URL has Crawl Issue
This is probably the most common sitemap issue that you will encounter in Google Search Console.
This error message means that your sitemap has a page listed with a known crawl error, but the status of the error is unspecified.
This is basically a catch-all warning, where Google isn’t able to provide the specifics regarding what is wrong with the page, but they do believe that something is wrong.
This requires a bit of investigating on your part, and the best place to start is by loading the page in your browser. The root cause could be one of many problems, including the following:
- Page loads too slowly and the connection timed out
- Too many redirects – Search engines will only follow a few redirects. If this is the case for your page, you will want to rework your site’s 301 redirects so that each one points directly to the final URL
- The page returns a 4xx status code other than 404, such as 403 “Forbidden” or 410 “Gone”
If the page loads correctly on your end, you may want to submit a request for Google to fetch and render the page to see how Googlebot crawls and views your content.
If you believe that you have fixed the issue, or if you don’t believe an issue exists after fetching and rendering the page in Google Search Console, you can submit the URL to Google’s index by selecting “Inspect URL” and then “Request Indexing”. Monitor the page over the next few days to see if the error message continues to appear in Google Search Console or not.
Submitted URL not found (404)
There are pages in your sitemap that return a 404 “Not Found” error. These pages should be removed from your sitemap, and you may want to create 301 redirects from them to a live page on your website.
Keep in mind that Google may list only a handful of the pages in your sitemap that have crawl issues, not every page that returns a 404 error. You can use a tool like Screaming Frog to check the crawl status of every page in your sitemap and find additional errors.
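If you would rather script a quick check than run a full crawler, the idea can be sketched in Python. This is a minimal sketch, not a replacement for a dedicated tool: the function names (`sitemap_urls`, `fetch_status`, `find_problem_urls`) are made up for illustration, and it assumes a plain sitemap file rather than a sitemap index.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return every <loc> value listed in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url (hypothetical default fetcher).

    urlopen follows redirects on its own, so a redirected URL reports the
    status of the final page. Some servers also reject HEAD requests.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def find_problem_urls(sitemap_xml, get_status=fetch_status):
    """Map each sitemap URL whose status is not 200 to that status."""
    problems = {}
    for url in sitemap_urls(sitemap_xml):
        status = get_status(url)
        if status != 200:
            problems[url] = status
    return problems
```

Calling `find_problem_urls(xml_text)` on the contents of your sitemap returns a dictionary of URLs to clean up. The `get_status` parameter is there so you can swap in your own fetcher, for example one that adds a delay between requests.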
Once you have fixed these issues and removed the pages from your sitemap, you should resubmit the sitemap to search engines.
Submitted URL seems to be a Soft 404
Unlike a true 404 page, a soft 404 page displays a “not found” message to the user but returns a 200-level status code. In other words, it sends conflicting messages: it tells the person using the website that the page does not exist while telling Google that the page exists.
If the page no longer exists, it should return a 404 or 410 response code or, if a suitable replacement exists, 301 redirect to a live page that returns a 200 status. Do not list either the 4xx error page or the redirecting page in the sitemap – only list URLs that return a 200 status.
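To make the mismatch concrete, here is a rough Python heuristic for flagging a possible soft 404. It is only a sketch: the phrase list is an assumption and far from complete, and this is not how Google itself detects soft 404s.

```python
# Phrases that often appear on "not found" pages; an assumed, incomplete list.
NOT_FOUND_PHRASES = ("page not found", "no longer exists", "404")


def looks_like_soft_404(status_code, body_text):
    """Return True when a page reports success but reads like an error page."""
    if not 200 <= status_code < 300:
        # A real 4xx or 5xx status is an honest error, not a *soft* 404.
        return False
    text = body_text.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)
```

A page that returns 200 with the text “Sorry, page not found” would be flagged, while the same text served with a real 404 status would not. Expect false positives (for example, a product page with “404” in its name), so treat the result as a list of pages to review by hand.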
Submitted URL returns unauthorized request (401)
A page in your sitemap returns a 401 “Unauthorized” response message. This is typically a page that is only accessible by a user who is logged into the website. Depending on the page, you may or may not want this URL in your sitemap.
Either remove the URL from your sitemap or change the authorization requirements for this page. If you are removing the page from your sitemap, you may also want to find and remove any internal links on your website that point to the 401 error page. You can use a crawler like Screaming Frog or Xenu to do this.
Invalid date
If your sitemap lists last modified dates (the lastmod value for each URL), they need to be in the proper W3C date format (the time portion is optional).
Dates need to be listed as YYYY-MM-DD – not YYYY/MM/DD, DD/MM/YYYY, or any other format.
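If your dates come out of a CMS or spreadsheet in another format, a small conversion step can normalize them before the sitemap is generated. The Python sketch below is one way to do it. The function name and the list of accepted input formats are assumptions, and slash-separated day-first dates are read as DD/MM/YYYY, so adjust the list if your data uses MM/DD/YYYY.

```python
from datetime import datetime

# Input formats to try, in order. DD/MM/YYYY is assumed for day-first dates;
# a site that stores MM/DD/YYYY would need a different entry here.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")


def to_w3c_date(value):
    """Return value rewritten as YYYY-MM-DD, or raise ValueError."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Try the next known format.
    raise ValueError(f"Unrecognized date format: {value!r}")
```

Dates that are already correct pass through unchanged, and anything the function cannot parse raises an error instead of silently ending up in the sitemap.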
Submitted page blocked by robots.txt file
Your site’s robots.txt file tells search engines which pages they should not crawl. This helps reduce bandwidth and keeps search engines away from pages that you don’t want them to crawl and index, such as a login page.
You can view your site’s robots.txt file by visiting www.yoursite.com/robots.txt – the Disallow: directive instructs bots to not crawl a particular URL path.
If you list a page like yourwebsite.com/services/pet-sitting in your sitemap, but your robots.txt file blocks all service pages with “Disallow: /services/*”, that’s a conflict. Either remove the page from the sitemap or update robots.txt so the page can be crawled.
While making changes to your robots.txt file, it is a good idea to link to your sitemap in the file. You can do this by adding a line in the following form, using the full URL of your own sitemap:
Sitemap: https://www.yoursite.com/sitemap.xml
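As a rough way to double-check both points, the sketch below uses Python’s standard `urllib.robotparser` module to test sitemap URLs against a robots.txt file and to read back its Sitemap line. The robots.txt contents and the helper names are hypothetical. Note that this parser only matches path prefixes and does not interpret `*` wildcards, so the example rule is written as `/services/`. The `site_maps()` method requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The rule is a plain path prefix because
# urllib.robotparser matches prefixes and does not interpret "*" wildcards.
ROBOTS_TXT = """\
User-agent: *
Disallow: /services/

Sitemap: https://www.yoursite.com/sitemap.xml
"""


def load_robots(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def blocked_urls(parser, urls, user_agent="Googlebot"):
    """Return the URLs that the given user agent is not allowed to crawl."""
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


robots = load_robots(ROBOTS_TXT)
conflicts = blocked_urls(robots, [
    "https://www.yoursite.com/services/pet-sitting",
    "https://www.yoursite.com/about",
])
declared_sitemaps = robots.site_maps()  # None if no Sitemap line is present
```

Any URL that ends up in `conflicts` is both listed in your sitemap and blocked from crawling, which is the situation that triggers this error.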
Submitted URL marked ‘noindex’
Website owners can block search engines from indexing a page by using a noindex tag – typically as a meta tag, but it can also be sent as an HTTP response header. This is often used on pages that a website owner does not want search engines to index, like a login page or a terms of service page.
This poses a problem when pages with a noindex tag are listed in a sitemap. Review these pages and either remove the noindex tag or remove these pages from the sitemap.
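When reviewing these pages, you need to check both places a noindex can hide: the robots meta tag and the X-Robots-Tag response header. Here is a minimal Python sketch of that check. The class and function names are made up for illustration, and it assumes you already have the page’s HTML and response headers in hand.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "robots":
            self.directives.append((attributes.get("content") or "").lower())


def has_noindex(html, headers=None):
    """Return True if the meta tag or X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = ""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            header_value = value.lower()
    in_meta = any("noindex" in directive for directive in parser.directives)
    return in_meta or "noindex" in header_value
```

Any sitemap URL for which `has_noindex` returns True should either lose the noindex or come out of the sitemap.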
Too Many URLs in Sitemap
Google currently limits a single sitemap to 50,000 URLs. You can submit more than one sitemap to search engines, and this is best done by creating a sitemap index file, which lists all of your sitemap files.
Regardless, if your sitemap is anywhere near 50,000 URLs, it is a good idea to split your sitemap into multiple sitemaps, as some studies have shown that Google indexes a larger percentage of URLs if they are listed in smaller sitemaps.
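Splitting a large URL list and generating the index file is easy to script. The Python sketch below shows the general shape. The helper names and the sitemap file URLs in the usage note are placeholders, and writing out the individual sitemap files is left to you.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # Per-sitemap URL limit described above.


def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into consecutive chunks of at most size items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locations):
    """Return sitemap index XML that lists each child sitemap URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in sitemap_locations:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = location
    return ET.tostring(root, encoding="unicode")
```

For example, a site with 120,000 URLs would get three chunks from `chunk_urls`, and `build_sitemap_index` would then be given the three addresses where those files are published, such as https://www.example.com/sitemap-1.xml. Passing a smaller `size` is a simple way to try out smaller sitemaps.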
Submitted URL not selected as canonical
A website may have different URL versions of the same page, such as the following:
Website owners can specify the preferred version of a page by using the canonical tag. Using the above example, if you want search engines to display the second option, the other versions would point to that page as the canonical using a rel=canonical tag in the <head> section of the page. It would look like:
<link rel="canonical" href="https://www.example.com/potatoes" />
However, if the sitemap lists a different version of that page, that sends conflicting information to search engines. Check which version of the page is set as the canonical – you can do this by viewing the source code of the page, or by using one of many tools like SEO Minion. Then, make sure that version matches what is listed in your sitemap.
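If you have many pages to compare, the same check can be scripted. The Python sketch below pulls the canonical URL out of a page’s HTML and compares it to the URL listed in the sitemap. The names are illustrative, and the comparison is an exact string match, so differences such as a trailing slash count as a mismatch.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = (attributes.get("href") or "").strip()


def canonical_url(html):
    """Return the canonical URL declared in html, or None if there is none."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


def matches_sitemap(html, sitemap_url):
    """Return True when the page's canonical equals the sitemap entry."""
    return canonical_url(html) == sitemap_url
```

A page whose canonical is https://www.example.com/potatoes but whose sitemap entry is a different version of that URL would return False, which tells you the sitemap entry needs to be updated.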
Any other sitemap error messages that we missed? Let us know!