
Prevent Google From Crawling Old Sitemaps

John Locke is an SEO consultant from Sacramento, CA. He helps manufacturing businesses rank higher through his web agency, Lockedown SEO.

As our websites mature, we often move from doing things one way to a different, more efficient way. But sometimes, the footprint of how our site used to work remains behind.

Recently, a client came to me with an intriguing problem. They had a large e-commerce site that had gone through many iterations of product inventory, URL structure, and sitemap structure over the years.

They had noticed a sudden proliferation of Crawl Errors in their Google Search Console (formerly Google Webmaster Tools). What we discovered was that Googlebot was requesting numerous old URLs that no longer existed and recording them as 404 Not Found and 403 Access Denied.

The culprit was an old sitemap in the root folder. The website had used the Google XML Sitemaps plugin several years earlier, but had since switched to WordPress SEO by Yoast, which had been generating its sitemaps for a couple of years.

Many older sitemap plugins create a file in the root folder called sitemap.xml.gz. The Yoast plugin creates a sitemap for each post type, along with sitemaps for tags and categories, and an index named sitemap_index.xml.

Though the old sitemap.xml.gz was not submitted to Google Search Console under Crawl > Sitemaps, and only the Yoast sitemaps were submitted, Google still crawled the older sitemap, because it had never been deleted from the root folder.
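Before removing a leftover sitemap, it can help to see which URLs it is still advertising. The following is a minimal Python sketch, not something from the original troubleshooting session. It assumes the file follows the standard sitemaps.org format, and the function name and sample URLs are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_in_gzipped_sitemap(data: bytes) -> list[str]:
    """Return every <loc> value found in a gzip-compressed sitemap."""
    root = ET.fromstring(gzip.decompress(data))
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# In-memory stand-in for an old sitemap.xml.gz (hypothetical URLs).
sample_xml = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/old-product-1</loc></url>"
    b"<url><loc>https://example.com/old-product-2</loc></url>"
    b"</urlset>"
)

for url in urls_in_gzipped_sitemap(gzip.compress(sample_xml)):
    print(url)
```

To run it against a real file, read a downloaded copy of sitemap.xml.gz in binary mode and pass the bytes to the function. Comparing that list with the URLs in Crawl Errors is a quick way to confirm the old sitemap is the source.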

Sitemaps Are Only A Suggestion Of What To Crawl

You might think that deleting the old sitemap would deter Google from trying to crawl those defunct URLs — but you would be incorrect.

What I found out through an afternoon of research is that Google will attempt to crawl old URLs several times, just to make sure that the 404s are real and not an accident. It may choose to crawl these URLs months, or even years, later.

Googlebot stores the memories of links that it finds in your sitemaps, on your own website, and on other sites. It can come back to these at seemingly random intervals, to see if those pages still exist. Googlebot wants to know if the links are still valid, or if they have experienced link rot.

Naturally, it makes site owners very nervous when they see the number of Crawl Errors increasing instead of decreasing.

How could we speed up the process of telling Google to disregard the old sitemap?

Killing Unwanted Sitemap Crawls

As it turned out, there was a way to do this through the .htaccess file.

WordPress sites usually have this file in the root folder. Files that start with a dot are hidden files, so you may need FTP access, or to enable hidden file visibility in cPanel’s File Manager, to edit this.

Editing your .htaccess file incorrectly can take your site down, so make sure you have a backup, or can revert your .htaccess quickly in case something goes wrong.

What we are doing is adding a rule for the sitemap we want to disappear from Googlebot’s crawl. The rule uses Apache’s Redirect directive, but instead of sending visitors somewhere else, it returns a 410 Gone status, which basically means “Don’t bother looking for this file ever again”.

Add the following to the end of your .htaccess file. Adjust the relative URL to fit whatever sitemap you want to vanish.

# Kill old sitemap crawls
# The most common sitemap URL
Redirect 410 /sitemap.xml.gz

# Or fill in the path of your own old sitemap
Redirect 410 /path/to/sitemap.xml
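Once the rule is saved, it is worth confirming that the server really answers with 410 for the old sitemap. Here is a minimal Python sketch of one way to check, using only the standard library. The helper names and the example.com URL are placeholders rather than anything from the client's site.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is on the exception.
        return err.code


def describe(status: int) -> str:
    """Translate a status code into a short note about the old sitemap."""
    if status == 410:
        return "410 Gone: the rule is working"
    if status == 404:
        return "404 Not Found: the file is missing, but the 410 rule is not applied"
    if status == 200:
        return "200 OK: the old sitemap is still being served"
    return f"{status}: unexpected, check the .htaccess rule"


# Example usage (placeholder domain; replace with your own site):
# print(describe(fetch_status("https://example.com/sitemap.xml.gz")))
```

If the check reports 404 or 200, the rule is probably not being read, for example because the path does not match or the server ignores .htaccess overrides.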

The Results

After adding this snippet to the .htaccess file, the expired URLs started dropping out of my client’s Crawl Errors at a rapid pace. Within a few days, they were gone.

Keep in mind, Google expects to encounter some degree of 404s. The web is a temporal place. Things change. Google just needs to know what has changed permanently, and what has not.


2 comments on “Prevent Google From Crawling Old Sitemaps”

  1. Thank you for this Mr. Locke! These 404 not found errors are driving me crazy because of an old sitemap. Now I just need to go looking for the .htaccess file, and I may need to call my web hosting provider as I’m not really sure where to find it. Again, thanks for this article, it has put me on the right track! 🙂

  2. Hi Jade:

    Usually your .htaccess file is at the root of your server directory. If you have a File Manager or log into your site via FTP, you’ll find it in the top-level folder.

    If you have Google Search Console crawl errors due to a sitemap that no longer exists on your site, Google resurfaces those URLs from time to time, but they also expect a certain amount of 404s.
