XML Sitemap
An XML Sitemap is a special document which lists all pages on a website to provide search engines with an overview of all the available content.
It's strongly recommended to implement an XML Sitemap, especially on larger websites (500+ pages).
Stick to the following best practices when implementing an XML Sitemap:
- Keep the XML Sitemap up to date with your website's content.
- Make sure it's clean: only indexable pages should be included.
- Reference the XML Sitemap from your robots.txt file.
- Don't list more than 50,000 URLs in a single XML Sitemap.
- Make sure the (uncompressed) file size doesn't exceed 50MB.
- Don't obsess about the lastmod, priority and changefreq properties.
What is an XML Sitemap?
An XML Sitemap is a special document which lists all pages on a website and is meant for search engines. Compare it to a telephone book: it tells the search engine what content is available and how to reach it. Furthermore, some extra information can be provided, such as when the content was last updated and what the relative importance of the content is.
XML Sitemaps are very useful for search engines, as they provide a single overview of all the available content at once. They serve both as a starting point the first time search engines go through your website and as a way to quickly discover newly added content.
What's important to note is the distinction between XML sitemaps and "regular" sitemaps (also called "HTML sitemaps"). Those sitemaps are meant for your visitors to find content on your website, while XML sitemaps are meant for search engines.
Why should you care about XML Sitemaps?
XML Sitemaps help search engines assess your website's content, and they are a mechanism to notify them of new or updated content. It's therefore recommended to implement them whenever feasible. For larger websites (500+ pages) especially, they become a real must-have.
What does an XML Sitemap look like?
XML Sitemaps are meant for search engines, and thus they are formatted in a language that's easy for computers to understand: XML. Fortunately, XML is quite readable for humans as well, so let's take a look at an example:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.contentkingapp.com/</loc>
    <lastmod>2017-06-14T19:55:25+02:00</lastmod>
  </url>
  <url>
    <loc>https://www.contentkingapp.com/blog/</loc>
    <lastmod>2016-06-24T10:23:20+02:00</lastmod>
  </url>
</urlset>
Now, to understand what's going on let's dissect the individual parts!
XML Header
<?xml version="1.0" encoding="UTF-8"?>
This header denotes that the content is structured according to version 1.0 of the XML standard and describes the character encoding. It basically informs search engines what they can expect from the file.
Definition of the URL set
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
This urlset definition encapsulates all the URLs contained in the sitemap and describes which version of the XML Sitemap standard is used. Note that the urlset gets closed at the bottom of the document:
</urlset>
Definition of the individual URLs
<url>
  <loc>https://www.contentkingapp.com/</loc>
  <lastmod>2017-06-14T19:55:25+02:00</lastmod>
</url>
Finally we get to the most important part: the definition of the individual URLs through the url tag. Every URL definition needs to contain at least the loc tag (short for location). The value of this tag should be the full URL of the page, including the protocol (e.g. "http://").
On top of that every URL definition may contain the following optional properties:
- lastmod: the date when the content on that URL was last modified. The date is in "W3C datetime" format.
- priority: the priority of the URL, relative to the other URLs on your own website, on a scale between 0.0 and 1.0.
- changefreq: how often the content on the URL is expected to change. Possible values are always, hourly, daily, weekly, monthly, yearly and never.
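The structure described above can be generated programmatically. What follows is a minimal sketch using Python's standard library, not a reference implementation: the function name build_sitemap and the example.com URLs are illustrative assumptions, and only the required loc tag plus the optional lastmod tag are written.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the urlset definition shown in the example sitemap.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML as bytes for an iterable of (loc, lastmod) pairs.

    build_sitemap is a hypothetical helper name. lastmod may be None,
    in which case the optional lastmod tag is left out for that URL.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # required: full URL incl. protocol
        if lastmod is not None:
            ET.SubElement(url, "lastmod").text = lastmod  # optional
    # xml_declaration requires Python 3.8 or newer.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("https://www.example.com/", "2017-06-14T19:55:25+02:00"),
        ("https://www.example.com/blog/", None),
    ])
    print(xml_bytes.decode("utf-8"))
```

Building the document through an XML library rather than string concatenation means special characters in URLs (such as "&") are escaped automatically.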
Where should I place my XML Sitemap?
Just like your website's pages, the XML Sitemap resides on its own URL. Usually the URL for an XML Sitemap is /sitemap.xml, and it's recommended to follow this convention to make it easy for search engines to discover it.
However, if for any reason this is not possible you can choose a different location or filename, as long as you reference it in your robots.txt file through the Sitemap-directive:
Sitemap: http://www.example.com/alternativelocation/alternativefilename.xml
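To check which sitemaps a robots.txt file actually advertises, the Sitemap directive can be read back with Python's standard urllib.robotparser module. This is a rough sketch: the helper name sitemaps_from_robots is an assumption, the robots.txt content is passed in as a string rather than fetched, and site_maps() requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the URLs listed in Sitemap directives of a robots.txt string.

    sitemaps_from_robots is a hypothetical helper name.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap directive is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    example = (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "Sitemap: http://www.example.com/alternativelocation/alternativefilename.xml\n"
    )
    print(sitemaps_from_robots(example))
```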
Are there any limitations for XML Sitemaps?
XML Sitemaps have a couple of limitations to keep in mind:
- They must not contain more than 50,000 URLs.
- Their file size is limited to 50MB when uncompressed.
If your XML Sitemap exceeds either of these limits, you need to split it across multiple XML Sitemaps and use an XML Sitemap Index.
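As a sketch of how these two limits might be enforced before publishing, the snippet below splits a URL list into chunks of at most 50,000 entries and checks a serialized file against the size limit. The function names are illustrative assumptions, and 50MB is interpreted here as 50 × 1024 × 1024 bytes.

```python
MAX_URLS_PER_SITEMAP = 50_000
# Assumption: "50MB" is read as 50 * 1024 * 1024 bytes (uncompressed).
MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024


def split_urls(urls, max_urls=MAX_URLS_PER_SITEMAP):
    """Split an iterable of URLs into lists of at most max_urls items each."""
    urls = list(urls)
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


def within_size_limit(xml_bytes, max_bytes=MAX_UNCOMPRESSED_BYTES):
    """Return True if the uncompressed sitemap fits within the size limit."""
    return len(xml_bytes) <= max_bytes


if __name__ == "__main__":
    all_urls = [f"https://www.example.com/page-{i}" for i in range(120_001)]
    for number, chunk in enumerate(split_urls(all_urls), start=1):
        print(f"sitemap{number}.xml would hold {len(chunk)} URLs")
```

Each resulting chunk would become its own XML Sitemap file, and the files would then be listed together in an XML Sitemap Index.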
What is an XML Sitemap Index?
Whenever you exceed the limits for a single XML Sitemap, you need to split it up into separate XML Sitemaps and bundle them together with an XML Sitemap Index. This index is a separate XML file which references the various XML Sitemaps.
Let's take a look at an example:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
  </sitemap>
</sitemapindex>
This XML Sitemap Index references two XML Sitemaps: sitemap1.xml.gz and sitemap2.xml.gz. Let's dissect this file as well!
XML Header
<?xml version="1.0" encoding="UTF-8"?>
Nothing new here: just like with the XML Sitemap file, we first define that the file is in XML format and which character encoding is used.
Definition of the Sitemap Index
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
Now, instead of a urlset definition we see a sitemapindex definition. This definition encapsulates all the sitemaps contained in the sitemap index and again describes which version of the XML Sitemap standard is used. Just like the urlset definition, the sitemapindex definition is closed at the bottom of the document:
</sitemapindex>
Definition of the individual sitemaps
<sitemap>
  <loc>http://www.example.com/sitemap1.xml.gz</loc>
  <lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
And then on to the meat: the actual definition of the individual sitemaps. Just like for URLs, every sitemap definition needs to contain at least the loc tag, containing the full URL of the individual XML Sitemap.
On top of that, the sitemap definition may optionally contain a lastmod definition: the date when the referenced XML Sitemap was last updated, again in "W3C datetime" format.
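Reading an XML Sitemap Index is the mirror image of writing one. The sketch below parses an index with Python's standard xml.etree.ElementTree and lists each referenced sitemap with its optional lastmod value. The function name list_child_sitemaps and the "sm" prefix are illustrative assumptions; the prefix is only a local alias for the sitemap namespace.

```python
import xml.etree.ElementTree as ET

# "sm" is an arbitrary local prefix for the sitemap namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_child_sitemaps(index_xml):
    """Return (loc, lastmod) tuples for every sitemap entry in an index.

    index_xml is the raw index document as bytes; lastmod is None when
    the optional tag is absent. list_child_sitemaps is a hypothetical name.
    """
    root = ET.fromstring(index_xml)
    entries = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", default=None, namespaces=NS)
        if loc is None:
            continue  # loc is required; skip malformed entries
        lastmod = sitemap.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append((loc.strip(), lastmod))
    return entries


if __name__ == "__main__":
    index = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
  </sitemap>
</sitemapindex>"""
    for loc, lastmod in list_child_sitemaps(index):
        print(loc, lastmod)
```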
Where should I place my XML Sitemap Index?
Similar to XML Sitemaps, there is a convention for the location and filename of the XML Sitemap Index: /sitemap_index.xml. But again you're free to deviate from this, as long as you reference it in your robots.txt file:
Sitemap: http://www.example.com/alternativelocation/alternativefilename.xml
Best practices for XML Sitemaps
When implementing XML Sitemaps it's essential to follow these best practices.
Keep your XML Sitemap up-to-date
Make sure that your XML Sitemap provides an up-to-date picture of your website. Whenever a page is removed it should also be delisted from your XML Sitemap. If you're using the optional lastmod tag, make sure to update the timestamp whenever the page changes.
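When the lastmod timestamp is generated from a page's modification time, it needs to come out in the same shape as the values in the examples (such as 2017-06-14T19:55:25+02:00). A minimal sketch, assuming your system can supply a timezone-aware datetime for each page; the helper name to_w3c_datetime is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_w3c_datetime(moment):
    """Format a timezone-aware datetime like 2017-06-14T19:55:25+02:00.

    to_w3c_datetime is a hypothetical helper name. Naive datetimes are
    rejected because the output would lack a UTC offset.
    """
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("a timezone-aware datetime is required")
    # Drop microseconds so the value matches the second-level precision
    # used in the example sitemaps.
    return moment.replace(microsecond=0).isoformat()


if __name__ == "__main__":
    modified = datetime(2017, 6, 14, 19, 55, 25,
                        tzinfo=timezone(timedelta(hours=2)))
    print(to_w3c_datetime(modified))
```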
The "Indexed, not submitted in sitemap" issue in Google Search Console's Index Coverage report is very useful for verifying whether your XML Sitemap includes all of your indexable pages.
Only include indexable pages in your XML Sitemap
Your XML Sitemap should only describe indexable pages. This means that you should leave out all URLs pointing to redirects (e.g. 301 status code) and missing pages (e.g. 404 status code).
Furthermore these pages need to be indexable, which means they are accessible for search engines (no exclusion in robots.txt) and there are no directives telling search engines not to index the page (such as meta robots, canonical links or x-robots-tag).
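The criteria above can be expressed as a simple filter that runs before the sitemap is written. This is only a sketch under stated assumptions: the Page record and its fields are hypothetical stand-ins for whatever crawl data you have, and a page whose canonical link points to a different URL is treated as non-indexable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical crawl record; the field names are illustrative only."""
    url: str
    status_code: int
    blocked_by_robots: bool = False      # excluded in robots.txt
    noindex: bool = False                # via meta robots or x-robots-tag
    canonical_url: Optional[str] = None  # None means no canonical link


def is_indexable(page):
    """Return True if the page belongs in the XML Sitemap."""
    if page.status_code != 200:
        return False  # redirects (e.g. 301) and missing pages (e.g. 404)
    if page.blocked_by_robots or page.noindex:
        return False
    if page.canonical_url is not None and page.canonical_url != page.url:
        return False  # canonicalized to another URL
    return True


if __name__ == "__main__":
    pages = [
        Page("https://www.example.com/", 200),
        Page("https://www.example.com/old", 301),
        Page("https://www.example.com/private", 200, noindex=True),
    ]
    print([p.url for p in pages if is_indexable(p)])
```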
Stick to the default location and filename
Whenever possible, stick to the default location and filename for your XML Sitemap (/sitemap.xml) and XML Sitemap Index (/sitemap_index.xml). This makes it easiest for search engines to find them.
Reference the XML Sitemap in your robots.txt file
When you're deviating from the convention for the URL of your XML Sitemap or XML Sitemap Index you should reference it in your robots.txt file. However, even if you're sticking to the standard URL it's recommended to include a reference to it in your robots.txt to ensure discoverability by search engines.
Don't obsess about lastmod, priority and changefreq
Although you can define the lastmod, priority and changefreq properties for every URL, they are fully optional. Defining them won't hurt, and there may be a slight chance search engines will use this information, but it's generally understood that search engines don't pay (much) attention to them.
Stick to the limits for XML Sitemaps
Make sure that your XML Sitemaps don't contain more than 50,000 URLs and that the uncompressed file size doesn't exceed 50MB. Whenever you exceed either limit, you should split the XML Sitemap up and use an XML Sitemap Index.
Frequently asked questions about XML Sitemaps
1. What does the .gz extension mean?
The .gz extension is added to the filename when the XML Sitemap is compressed (via gzip compression). XML Sitemaps containing many URLs usually grow to significant file sizes, and through the use of compression the impact of this on disk storage and network transfer time can be reduced.
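A minimal sketch of producing and reading such a compressed file with Python's standard gzip module follows. The function names are illustrative assumptions. Note that the 50MB limit mentioned earlier applies to the uncompressed content, so the size check belongs before compression.

```python
import gzip


def compress_sitemap(xml_bytes):
    """Return the gzip-compressed form of a sitemap (what a .gz file holds)."""
    return gzip.compress(xml_bytes)


def decompress_sitemap(gz_bytes):
    """Return the original sitemap XML from gzip-compressed bytes."""
    return gzip.decompress(gz_bytes)


if __name__ == "__main__":
    # A repetitive sample document, similar in shape to a real sitemap.
    entry = b"<url><loc>https://www.example.com/page</loc></url>"
    xml_bytes = (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        + entry * 1000
        + b"</urlset>"
    )
    packed = compress_sitemap(xml_bytes)
    print(len(xml_bytes), "bytes uncompressed,", len(packed), "bytes compressed")
```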