As our websites mature, we often move from doing things one way to a different, more efficient way. But sometimes, the footprint of how our site used to work remains behind.
Recently, I had a client come to me with a most intriguing problem. They had a large e-commerce site that had gone through many iterations of product inventory, URL structure, and sitemap structure over the years.
They had noticed a sudden proliferation of Crawl Errors in their Google Search Console (formerly Google Webmaster Tools). What we discovered was that numerous old URLs that no longer existed were being recorded by Googlebot as Not Found (404) and Access Denied (403).
The culprit was an old sitemap in the root folder. The website had used the Google XML Sitemaps plugin several years ago, but had been using WordPress SEO by Yoast for sitemaps for a couple of years.
Many older sitemap plugins create a single file in the root folder called sitemap.xml.gz. The Yoast plugin, by contrast, creates a sitemap for each post type, along with sitemaps for tags and categories, and an index named sitemap_index.xml that points to all of them.
Though the old sitemap.xml.gz was not submitted to Google Search Console under Crawl > Sitemaps (only the Yoast sitemaps were submitted), Google still crawled the older sitemap, because it had never been deleted from the root folder.
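For context, a sitemap index of the kind Yoast generates is just an XML file that points at the individual sitemaps, per the Sitemaps.org protocol. A sketch (the domain and filenames here are illustrative, not my client's):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One entry per post type, plus tags and categories -->
  <sitemap>
    <loc>https://example.com/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/page-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/category-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
```

Googlebot follows every `<loc>` entry it finds, which is why a stale index file left in the root folder keeps feeding it dead URLs.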
Sitemaps Are Only A Suggestion Of What To Crawl
You might think that deleting the old sitemap would deter Google from trying to crawl those defunct URLs — but you would be incorrect.
What I found out through an afternoon of research is that Google will attempt to crawl old URLs several times, just to make sure the 404s are real and not an accident. It may choose to crawl these URLs months, or even years, later.
Googlebot remembers the links it finds in your sitemaps, on your own website, and on other sites. It can come back to these at seemingly random intervals to see if those pages still exist. Googlebot wants to know whether the links are still valid, or whether they have succumbed to link rot.
Naturally, it makes site owners very nervous when they see the number of Crawl Errors increasing instead of decreasing.
How could we speed up the process of telling Google to disregard the old sitemap?
Killing Unwanted Sitemap Crawls
As it turned out, there was a way to do this through the .htaccess file.
WordPress sites usually have this file in the root folder. Files that start with a dot are hidden files, so you may need FTP access, or to enable hidden file visibility in cPanel’s File Manager, in order to edit it.
A word of caution: editing the .htaccess file incorrectly can take your site down, so make sure you have a backup, or can revert your .htaccess quickly, in case something goes unexpectedly wrong.
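Since a bad edit can take the site down, it is worth making the backup step concrete. A minimal sketch from the command line (assuming shell access to your WordPress root; here we work in a throwaway directory with a stand-in file, so the example is safe to run anywhere):

```shell
# Sandbox directory so this example never touches a real site
cd "$(mktemp -d)"

# Stand-in for your site's real .htaccess
printf '# BEGIN WordPress\n# END WordPress\n' > .htaccess

# Keep a backup copy BEFORE editing
cp .htaccess .htaccess.bak

# ...make your edits to .htaccess here...

# If the site breaks, restore the backup:
cp .htaccess.bak .htaccess
```

If you only have cPanel's File Manager, the equivalent is copying .htaccess to a second filename before you touch it.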
What we are doing is adding a redirect for the sitemap we want to disappear from Googlebot’s crawl. This type of redirect is a 410 Gone, which basically means “Don’t bother looking for this file ever again.”
Add the following to the end of your .htaccess file, adjusting the relative URL to fit whatever sitemap you want to vanish.
# Kill old sitemap crawls

# The most common sitemap URL
Redirect 410 /sitemap.xml.gz

# Or, fill in the blank with your own URL
Redirect 410 /path/to/sitemap.xml
After adding this snippet to the .htaccess file, the expired URLs started disappearing from my client’s Crawl Errors at a rapid pace; within a few days, they were gone.
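You can confirm the rule is working by requesting the old sitemap yourself, for example with curl -I https://example.com/sitemap.xml.gz, and checking that the status line reads 410 Gone. As a self-contained illustration of what a crawler sees, here is a small Python simulation of that response (a local stand-in, not the actual Apache behavior):

```python
# Local simulation of the 410 Gone response the .htaccess rule produces.
# A real check would be `curl -I` against your own site.
import http.server
import threading
import urllib.error
import urllib.request

class GoneHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Mimic: Redirect 410 /sitemap.xml.gz
        if self.path == "/sitemap.xml.gz":
            self.send_response(410)  # Gone: never coming back
        else:
            self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep the example's output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), GoneHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

try:
    urllib.request.urlopen(f"http://127.0.0.1:{port}/sitemap.xml.gz")
    status = 200
except urllib.error.HTTPError as err:
    status = err.code  # 4xx responses surface as HTTPError

print(status)  # 410: the crawler is told the file is gone for good
server.shutdown()
```

Unlike a 404, which Googlebot treats as possibly temporary and retries, a 410 is an explicit signal that the resource has been removed on purpose.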
Keep in mind, Google expects to encounter some degree of 404s. The web is a temporal place. Things change. Google just needs to know what has changed permanently, and what has not.