Two tiny text files quietly govern how Google crawls and indexes your entire website. Get them right and search engines find your best pages efficiently. Get them wrong and you can accidentally hide your whole site from Google — it happens more often than you'd think. Here's how robots.txt and sitemap.xml actually work.
robots.txt — the crawler's instructions
robots.txt lives at yourdomain.com/robots.txt and tells search-engine crawlers which parts of your site they may or may not crawl. Well-behaved crawlers fetch it before they request anything else on the site.
A basic, sensible robots.txt
User-agent: *
Allow: /
Disallow: /admin
Disallow: /account
Disallow: /api/
Sitemap: https://yourdomain.com/sitemap.xml
This says: all crawlers may crawl everything, except the admin, account and API areas — and here's where the sitemap lives.
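To make the matching concrete, here is a minimal Python sketch of how a crawler might evaluate those rules. It assumes the longest-match behaviour described in RFC 9309 (the most specific path wins, and Allow wins a tie), reads only the `User-agent: *` group, and ignores wildcards such as `*` and `$`. The function names are illustrative and not part of any library.

```python
ROBOTS_TXT = """\
User-agent: *
Allow: /
Disallow: /admin
Disallow: /account
Disallow: /api/
Sitemap: https://yourdomain.com/sitemap.xml
"""


def parse_rules(robots_txt: str) -> list[tuple[str, bool]]:
    """Collect (path_prefix, allowed) pairs from the `User-agent: *` group."""
    rules = []
    in_star_group = False
    prev_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share one group of rules.
            in_star_group = value == "*" or (prev_was_agent and in_star_group)
            prev_was_agent = True
            continue
        prev_was_agent = False
        if in_star_group and field in ("allow", "disallow") and value:
            rules.append((value, field == "allow"))
    return rules


def is_allowed(path: str, rules: list[tuple[str, bool]]) -> bool:
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    matches = [(len(prefix), allowed) for prefix, allowed in rules
               if path.startswith(prefix)]
    if not matches:
        return True
    return max(matches)[1]


rules = parse_rules(ROBOTS_TXT)
print(is_allowed("/blog/my-post", rules))  # True
print(is_allowed("/admin/users", rules))   # False
print(is_allowed("/api/v1/items", rules))  # False
```

Because rules are prefix matches, `Disallow: /admin` in this sketch also blocks a path such as `/administrator`, while `Disallow: /api/` leaves a bare `/api` crawlable.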
The one mistake that hides your whole site
Disallow: /
Under User-agent: *, that single line tells every compliant crawler to stay out of your entire website. It's one of the most damaging SEO mistakes, and it is usually left over from a staging site.
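One way to catch this before it ships is a small pre-deploy check. The sketch below uses Python's standard `urllib.robotparser`; the function name `blocks_homepage` is hypothetical, and the check only asks whether the site root is crawlable for the given user agent, so treat it as a smoke test rather than a full audit.

```python
from urllib.robotparser import RobotFileParser


def blocks_homepage(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if this robots.txt forbids crawling the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


staging = "User-agent: *\nDisallow: /\n"
live = "User-agent: *\nAllow: /\nDisallow: /admin\n"

print(blocks_homepage(staging))  # True
print(blocks_homepage(live))     # False
```

Running a check like this in a deployment pipeline, and failing the build when it returns True for the production file, would stop a staging robots.txt from reaching the live site.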
Other key points:
- robots.txt controls crawling, not indexing. A blocked page can still appear in results (without a description) if it's linked elsewhere. To keep a page out of the index, use a noindex meta tag instead, and don't block it in robots.txt, or Google can't see the noindex.
- Never block your CSS and JavaScript. Google needs them to render and understand your pages.
- It's public. Don't use it to "hide" sensitive URLs — you're publishing a list of them. Protect private areas with real authentication.
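For reference, the noindex directive is normally written as `<meta name="robots" content="noindex">` in the page's head. The Python sketch below shows one way to detect it with the standard `html.parser` module. The class and function names are illustrative, and it only looks at the generic `robots` meta tag, not crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content is a comma-separated list, e.g. "noindex, follow"
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```

A page that returns True here should stay crawlable in robots.txt, otherwise the crawler never gets to read the tag.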
sitemap.xml — your site's table of contents
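A sitemap is an XML file listing the URLs you want indexed, helping search engines discover them quickly, especially on large or newly launched sites. Each entry must include the URL (loc) and can optionally carry a last-modified date (lastmod), an expected change frequency (changefreq) and a relative priority (priority). Google has said it ignores changefreq and priority, so an accurate lastmod is the optional field that matters most.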
<url>
<loc>https://yourdomain.com/blog/my-post</loc>
<lastmod>2026-06-13</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
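The entry above is a fragment: in a real file it sits inside a `<urlset>` root element that declares the sitemap namespace. As a rough sketch, a complete file could be generated in Python like this. The function name and the `(url, lastmod)` input shape are assumptions for illustration, and the optional changefreq and priority fields are left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[tuple[str, str]]) -> str:
    """Build a sitemap document from (absolute_url, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # YYYY-MM-DD
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap([("https://yourdomain.com/blog/my-post", "2026-06-13")]))
```

ElementTree escapes characters such as `&` in URLs automatically, which is one reason to generate the file with an XML library rather than by string concatenation.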
Sitemap best practices
- Only include indexable URLs — canonical, 200-status, not noindex'd or redirected. A sitemap full of junk URLs wastes crawl budget and erodes trust.
- Keep it current. Generate it dynamically so new pages appear automatically and lastmod stays accurate.
- Use a sitemap index for big sites: split into child sitemaps (e.g. pages, blog, products) under one index file. Each sitemap caps at 50,000 URLs and 50 MB uncompressed.
- Use absolute URLs (full https://); relative paths are invalid in a sitemap.
- Submit it in Google Search Console and reference it in robots.txt.
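To illustrate the sitemap index idea, the Python sketch below splits a long URL list into chunks of at most 50,000 and builds an index file that points at the child sitemaps. The child file names (`sitemap-1.xml` and so on) and the function names are hypothetical; use whatever naming your site generator produces.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_SITEMAP) -> list[list[str]]:
    """Split the URL list into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls: list[str]) -> str:
    """Build a sitemap index that lists each child sitemap by absolute URL."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child in child_sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = child
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


urls = [f"https://yourdomain.com/products/{i}" for i in range(120_000)]
chunks = chunk_urls(urls)
children = [f"https://yourdomain.com/sitemap-{n}.xml"  # hypothetical file names
            for n in range(1, len(chunks) + 1)]
print(len(chunks))  # 3
print(build_sitemap_index(children))
```

Each chunk would then be written out as its own sitemap file, and only the index needs to be submitted and referenced from robots.txt.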
How they work together
Think of it as: robots.txt sets the rules of the house (where crawlers can go), and sitemap.xml is the guided tour (here are the rooms worth visiting). robots.txt points to your sitemap, and your sitemap should only list pages robots.txt allows. When they contradict each other — e.g. a sitemap lists a page robots.txt blocks — you send Google mixed signals.
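A contradiction like that is easy to test for. The sketch below, in Python with only the standard library, lists every sitemap URL that robots.txt blocks. Two caveats: the function name is illustrative, and `urllib.robotparser` applies rules in file order rather than by longest match, so the example robots.txt deliberately omits a leading `Allow: /` line. Treat the result as a first pass, not as a statement of how Google will behave.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_conflicts(robots_txt: str, sitemap_xml: str, user_agent: str = "*") -> list[str]:
    """Return sitemap URLs that robots.txt does not let user_agent crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    return [loc for loc in locs if not parser.can_fetch(user_agent, loc)]


robots_txt = "User-agent: *\nDisallow: /admin\nDisallow: /account\nDisallow: /api/\n"
sitemap_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://yourdomain.com/blog/my-post</loc></url>"
    "<url><loc>https://yourdomain.com/admin/login</loc></url>"
    "</urlset>"
)
print(find_conflicts(robots_txt, sitemap_xml))
# ['https://yourdomain.com/admin/login']
```

Any URL this returns should either be removed from the sitemap or unblocked in robots.txt, depending on whether you want it indexed.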
Common problems we see
- A leftover Disallow: / from development blocking the entire live site.
- Sitemaps full of redirected, 404 or noindex URLs.
- Relative URLs in the sitemap (invalid).
- A sitemap that's never updated, so new content goes undiscovered.
- The sitemap not submitted to Search Console at all.
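Some of these problems can be spotted offline. As one example, the Python sketch below flags relative or otherwise non-absolute `<loc>` values in a sitemap. The function name is hypothetical, and checking for redirects, 404s or noindex would additionally require fetching each URL, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_relative_locs(sitemap_xml: str) -> list[str]:
    """Return <loc> values that are not absolute http(s) URLs."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    bad = []
    for el in root.findall(".//sm:loc", NS):
        loc = (el.text or "").strip()
        parts = urlparse(loc)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(loc)
    return bad


sitemap_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://yourdomain.com/about</loc></url>"
    "<url><loc>/blog/my-post</loc></url>"
    "</urlset>"
)
print(find_relative_locs(sitemap_xml))  # ['/blog/my-post']
```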
Check yours in seconds
Not sure if your robots.txt and sitemap are set up correctly? Our free SEO checker inspects both as part of a crawlability audit, and siteIQ flags conflicts and indexing blockers across your site. For the full foundation, work through the technical SEO checklist — and if pages still aren't showing up, see why your website isn't ranking.