
robots.txt and sitemap.xml: A Complete Guide for 2026

Codaiman Admin
Author · Codaiman
May 31, 2026
Updated Jun 12, 2026

Two tiny text files quietly govern how Google crawls and indexes your entire website. Get them right and search engines find your best pages efficiently. Get them wrong and you can accidentally hide your whole site from Google — it happens more often than you'd think. Here's how robots.txt and sitemap.xml actually work.

robots.txt — the crawler's instructions

robots.txt lives at the root of your host (yourdomain.com/robots.txt) and tells search-engine crawlers which paths they may or may not crawl. Compliant crawlers fetch it before requesting anything else on that host, and each subdomain needs its own file.

A basic, sensible robots.txt

User-agent: *
Allow: /
Disallow: /admin
Disallow: /account
Disallow: /api/

Sitemap: https://yourdomain.com/sitemap.xml

This says: all crawlers may crawl everything, except the admin, account and API areas — and here's where the sitemap lives.
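To see how rules like these are evaluated, here is a minimal sketch using Python's standard-library urllib.robotparser. It is an illustration under stated assumptions, not a model of how Google parses the file: the standard-library parser applies the first matching rule in file order, whereas Google and RFC 9309 use the longest matching rule, so the redundant "Allow: /" line is left out of the sample. The build_parser helper is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

# Rules from the example above, minus "Allow: /" (allowing is the default,
# and this parser would otherwise stop at that first matching line).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /account
Disallow: /api/

Sitemap: https://yourdomain.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for path in ["/", "/blog/my-post", "/admin", "/administrator", "/api/users", "/api"]:
    allowed = parser.can_fetch("*", "https://yourdomain.com" + path)
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Rules are prefix matches, so "Disallow: /admin" also blocks /administrator, while "Disallow: /api/" (with the trailing slash) leaves a bare /api crawlable.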

The one mistake that hides your whole site

Disallow: /
Placed under User-agent: *, that single line tells every compliant crawler to stay away from your entire website. It's one of the most common catastrophic SEO mistakes, usually left over from a staging site.
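A check like the following could run before each deploy to catch that leftover line. It is a rough heuristic sketch, not a full robots.txt parser: it only looks for a bare "Disallow: /" inside a group that applies to all user agents, and blocks_entire_site is a hypothetical helper name.

```python
def blocks_entire_site(robots_txt: str) -> bool:
    """Return True if a 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents: list[str] = []   # user agents named by the group being read
    in_rules = False         # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:     # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


print(blocks_entire_site("User-agent: *\nDisallow: /\n"))       # True
print(blocks_entire_site("User-agent: *\nDisallow: /admin\n"))  # False
```

The check deliberately ignores groups aimed at a single named bot, since blocking one crawler is often intentional.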

Other key points:

  • robots.txt controls crawling, not indexing. A blocked page can still appear in results (without a description) if it's linked elsewhere. To keep a page out of the index, use a noindex meta tag instead — and don't block it in robots.txt, or Google can't see the noindex.
  • Never block your CSS and JavaScript. Google needs them to render and understand your pages.
  • It's public. Don't use it to "hide" sensitive URLs — you're publishing a list of them. Protect private areas with real authentication.
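The interaction between noindex and crawl blocking in the first point can be sketched in a few lines. This is an illustrative sketch only: it reads the robots meta tag from HTML you supply, does not look at the X-Robots-Tag HTTP header, and the names RobotsMetaFinder, has_noindex and noindex_will_work are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots"> (or "googlebot") tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


def noindex_will_work(html: str, crawl_allowed: bool) -> bool:
    """A noindex only takes effect if crawlers are allowed to fetch the page."""
    return crawl_allowed and has_noindex(html)


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(noindex_will_work(page, crawl_allowed=True))   # True
print(noindex_will_work(page, crawl_allowed=False))  # False: the tag is never seen
```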

sitemap.xml — your site's table of contents

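A sitemap is an XML file listing the URLs you want indexed, helping search engines discover them quickly, especially on large or newly launched sites. Each entry can include the URL, last-modified date, how often it changes, and a relative priority. Google has said it ignores changefreq and priority and uses lastmod only when it is consistently accurate, so lastmod is the field worth maintaining.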

<url>
  <loc>https://yourdomain.com/blog/my-post</loc>
  <lastmod>2026-06-13</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>
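The entry above has to sit inside a urlset root element in the sitemap namespace. One way to generate a complete file is sketched below with Python's standard library. The build_sitemap function and its input shape are assumptions for illustration; a real site would feed it from its CMS or router.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, date]]) -> str:
    """Build sitemap XML from (absolute URL, last-modified date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap([
    ("https://yourdomain.com/blog/my-post", date(2026, 6, 13)),
])
print(sitemap_xml)
```

This sketch writes only loc and lastmod; changefreq and priority could be added the same way if another consumer of the file needs them.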

Sitemap best practices

  • Only include indexable URLs — canonical, 200-status, not noindex'd or redirected. A sitemap full of junk URLs wastes crawl budget and erodes trust.
  • Keep it current. Generate it dynamically so new pages appear automatically and lastmod stays accurate.
  • Use a sitemap index for big sites: split into child sitemaps (e.g. pages, blog, products) under one index file. Each sitemap caps at 50,000 URLs or 50 MB uncompressed, whichever comes first.
  • Use absolute URLs (full https://) — relative paths are invalid in a sitemap.
  • Submit it in Google Search Console and reference it in robots.txt.
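The sitemap-index approach from the list above might look like the sketch below: URLs are split into chunks of at most 50,000, and an index file points at one child sitemap per chunk. The child file names (products-1.xml and so on) and the helper names are hypothetical, and the sketch does not enforce the file-size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_SITEMAP) -> list[list[str]]:
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls: list[str]) -> str:
    """Build a sitemap index that lists each child sitemap by absolute URL."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child in child_sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = child
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


urls = [f"https://yourdomain.com/products/{n}" for n in range(120_000)]
chunks = chunk_urls(urls)
children = [
    f"https://yourdomain.com/sitemaps/products-{i + 1}.xml"  # hypothetical paths
    for i in range(len(chunks))
]
index_xml = build_sitemap_index(children)
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```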

How they work together

Think of it as: robots.txt sets the rules of the house (where crawlers can go), and sitemap.xml is the guided tour (here are the rooms worth visiting). robots.txt points to your sitemap, and your sitemap should only list pages robots.txt allows. When they contradict each other — e.g. a sitemap lists a page robots.txt blocks — you send Google mixed signals.
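That kind of contradiction is easy to test for. The sketch below lists sitemap URLs that a given robots.txt would block, using only Python's standard library. Treat it as an approximation: urllib.robotparser applies the first matching rule in file order rather than Google's longest-match rule, and blocked_sitemap_urls is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt: str, sitemap_xml: bytes,
                         user_agent: str = "*") -> list[str]:
    """Return sitemap URLs that robots.txt does not allow user_agent to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    return [loc for loc in locs if not parser.can_fetch(user_agent, loc)]


robots = "User-agent: *\nDisallow: /account\n"
sitemap = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://yourdomain.com/blog/my-post</loc></url>"
    b"<url><loc>https://yourdomain.com/account/settings</loc></url>"
    b"</urlset>"
)
print(blocked_sitemap_urls(robots, sitemap))
# ['https://yourdomain.com/account/settings']
```

An empty result means the two files agree, at least for the user agent that was checked.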

Common problems we see

  • A leftover Disallow: / from development blocking the entire live site.
  • Sitemaps full of redirected, 404 or noindex URLs.
  • Relative URLs in the sitemap (invalid).
  • A sitemap that's never updated, so new content goes undiscovered.
  • The sitemap not submitted to Search Console at all.
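Of these, the relative-URL problem is the simplest to catch automatically. Here is a small sketch using Python's urllib.parse; the function names are hypothetical. Detecting redirected or 404 URLs would need live HTTP requests, which this sketch leaves out.

```python
from urllib.parse import urlsplit


def is_absolute_http_url(url: str) -> bool:
    """True only for URLs with an http(s) scheme and a host."""
    parts = urlsplit(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def relative_entries(locs: list[str]) -> list[str]:
    """Return the sitemap <loc> values that are not absolute URLs."""
    return [loc for loc in locs if not is_absolute_http_url(loc)]


print(relative_entries([
    "https://yourdomain.com/blog/my-post",
    "/blog/my-post",
    "blog/my-post",
    "//yourdomain.com/blog/my-post",
]))
# ['/blog/my-post', 'blog/my-post', '//yourdomain.com/blog/my-post']
```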

Check yours in seconds

Not sure if your robots.txt and sitemap are set up correctly? Our free SEO checker inspects both as part of a crawlability audit, and siteIQ flags conflicts and indexing blockers across your site. For the full foundation, work through the technical SEO checklist — and if pages still aren't showing up, see why your website isn't ranking.

Tags: robots.txt, sitemap, crawling, indexing, technical SEO, siteIQ
Written by
Codaiman Admin

Part of the Codaiman team — building AI-powered digital solutions and sharing insights on web development, mobile apps, and the future of technology.