You published content. Google ignored it. This is the systematic checklist we use to find the exact reason—robots.txt, noindex, sitemap gaps, or canonical bleed—and fix it in minutes, not days.
Most SEOs jump straight to 'submit to Google Search Console' when pages don't index. That's a waste of a ping. The bottleneck is almost never a missing request. It's a blocked resource, a canonical pointing at the wrong URL, or a sitemap that lists URLs Google can't even reach. In practice, when you open the Page indexing report in GSC (formerly the Coverage report), you'll see 'Discovered - currently not indexed' for pages that Google knows exist but has not crawled yet. That's a different problem from 'Crawled - currently not indexed' or an explicit exclusion such as a noindex tag. Each status demands a different fix. This checklist treats them distinctly.
We start with the three gates every URL must pass: accessible (no block), indexable (no noindex or canonical theft), and valuable (not thin or duplicate). A fourth check, sitemap accuracy, affects how quickly Google discovers the URL. If you skip one gate, the fix fails. The canonicalization rules in Google's official canonicalization documentation are especially misunderstood: many sites accidentally point all product pages to the category page. That single misconfiguration can kill indexation for thousands of URLs. Let's walk the gates.
| Gate / Checkpoint | What to Inspect | Common Tool / Command | Expected Result | Failure Mode & Risk |
|---|---|---|---|---|
| Gate 1: Accessibility | Check robots.txt, HTTP status code, and server response time | curl -I https://example.com/page or GSC URL Inspection | 200 OK, no disallow in robots.txt, load under 3s | Blocked by disallow (even wildcard * disallow) or 5xx errors. Risk: entire section potentially excluded for weeks. |
| Gate 2: Indexability | Check noindex tag, X-Robots-Tag, and canonical URL | View page source or use browser devtools for meta robots and response headers | No 'noindex' meta or header; canonical points to itself or a valid variant | A canonical pointing to the wrong URL sends link equity there and tells Google not to index this version. The failure is silent: the page still loads normally. |
| Gate 3: Content Value | Check word count, uniqueness vs. other pages, internal linking | Screaming Frog crawl + site: search for duplicate title samples | Roughly 300+ words of unique content (a rule of thumb, not a Google threshold), 2+ internal links, no exact match within site | Thin content often ends up as 'Crawled - currently not indexed', and near-empty pages can be treated as soft 404s. Google prioritizes higher-quality pages. |
| Gate 4: Sitemap Accuracy | Check if URL appears in sitemap and sitemap is in robots.txt | GSC Sitemaps report or manual XML validation | URL present in sitemap, sitemap submitted and has no errors, lastmod is recent | Old lastmod dates or sitemap that lists 404s signals neglect. Google may deprioritize the entire sitemap. |
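The Gate 4 check can be scripted against a downloaded sitemap. The sketch below is a minimal illustration, not a full validator: it assumes a plain `<urlset>` sitemap (not a sitemap index), assumes `lastmod` starts with a `YYYY-MM-DD` date, and uses a 365-day staleness cutoff that is an arbitrary placeholder. The function names and sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2019-01-15T10:00:00+00:00</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""


def sitemap_entries(xml_text):
    """Return {loc: lastmod or None} for every <url> element in the sitemap."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries[loc] = lastmod.strip() if lastmod else None
    return entries


def check_sitemap(xml_text, page_url, today, max_age_days=365):
    """Classify a URL as missing, listed without lastmod, stale, or ok."""
    entries = sitemap_entries(xml_text)
    if page_url not in entries:
        return "missing"
    lastmod = entries[page_url]
    if lastmod is None:
        return "listed, no lastmod"
    # Assumption: lastmod begins with YYYY-MM-DD (date or full datetime form).
    age = today - date.fromisoformat(lastmod[:10])
    return "stale lastmod" if age > timedelta(days=max_age_days) else "ok"


if __name__ == "__main__":
    for path in ("a", "b", "c", "d"):
        url = "https://example.com/" + path
        print(url, "->", check_sitemap(SAMPLE_SITEMAP, url, date(2024, 6, 1)))
```

Passing `today` explicitly keeps the check reproducible. In a real audit you would fetch the sitemap first and pass in `date.today()`.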
Run the URL in <strong>Google Search Console URL Inspection</strong>. Read the page indexing status exactly; don't guess.
Check <strong>robots.txt</strong> via the live test tool. One disallow line for the wrong path blocks everything underneath.
Inspect the page's <strong>HTML source</strong> for <code><meta name="robots" content="noindex"></code>. Also check HTTP headers for X-Robots-Tag.
Verify the <strong>canonical URL</strong>. The inspection tool lists both the 'User-declared canonical' and the 'Google-selected canonical' in its page indexing details. If the Google-selected canonical is not the current URL, that's very likely your root cause.
Look at the <strong>sitemap</strong>: is the URL listed? Is the sitemap referenced in robots.txt? Does the sitemap have any errors in GSC?
Cross-check the page's <strong>internal link profile</strong>. Pages with zero internal links often get crawled late or not at all.
Review <strong>server logs</strong> (or use a log analyzer) to confirm Googlebot actually requested the URL. No request = no chance to index.
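The noindex and canonical checks in the steps above can be approximated offline once you have a page's HTML and response headers. This is a rough sketch using only the standard library. It assumes the HTML is the server-rendered source (tags injected by JavaScript will not be seen), it compares canonical URLs by exact string match, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RobotsCanonicalParser(HTMLParser):
    """Collects meta robots directives and the first rel=canonical href."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").lower().split(","):
                self.directives.add(token.strip())
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            if self.canonical is None:
                self.canonical = attr.get("href", "").strip()


def indexability_problems(page_url, html, headers):
    """Return a list of human-readable indexability problems (empty if none)."""
    parser = RobotsCanonicalParser()
    parser.feed(html)
    problems = []
    # "none" is shorthand for "noindex, nofollow".
    if parser.directives & {"noindex", "none"}:
        problems.append("meta robots noindex")
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        # Drop an optional "googlebot:" style prefix before comparing tokens.
        tokens = {t.split(":")[-1].strip() for t in value.lower().split(",")}
        if tokens & {"noindex", "none"}:
            problems.append("X-Robots-Tag noindex")
    if parser.canonical is not None:
        target = urljoin(page_url, parser.canonical)  # resolve relative hrefs
        if target != page_url:
            problems.append("canonical points to " + target)
    return problems


if __name__ == "__main__":
    html = '<head><link rel="canonical" href="/products/shoes"></head>'
    url = "https://example.com/products/shoes/nike-air-max"
    print(indexability_problems(url, html, {}))
```

Exact matching is deliberately strict: a trailing-slash or protocol difference is reported as a mismatch, and you decide whether it is a valid variant.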
The situation: An e-commerce site with 1,200 product pages had only 340 indexed. GSC showed 'Alternate page with proper canonical tag' for 860 URLs. The inspection revealed all product pages had a canonical tag pointing to the category page (e.g., /products/shoes instead of /products/shoes/nike-air-max). This was a template bug: the canonical was dynamically set to the parent category URL.
The fix: We updated the template to set the canonical tag to the product's own URL. Then we removed all non-canonical product URLs from the sitemap. The sitemap now contained only the 1,200 product URLs with self-canonical tags.
The result: Within 10 days, indexed count rose from 340 to 1,080. The remaining 120 were thin pages (under 100 words) that needed content expansion. The fix required 2 hours of developer time and no server changes.
If the URL is not found, check that the page exists and returns a 200 status.
Check robots.txt and noindex tag. If blocked, remove the block and request reindexing.
Ensure canonical tag points to the current URL or an equivalent variant. Fix if pointing elsewhere.
Verify the URL is in a submitted sitemap. If missing, add it and resubmit the sitemap.
Check if the page has at least 2 internal links from other indexed pages. Add links if missing.
Wait 2-7 days. Recheck in GSC. If still not indexed, check for thin content or quality issues.
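The order of these steps matters, because an earlier failure makes the later checks meaningless. As a minimal sketch, the sequence can be written as a function that returns the first action still outstanding. The `PageCheck` fields are hypothetical and would be filled from your own crawl or GSC data.

```python
from dataclasses import dataclass


@dataclass
class PageCheck:
    """Results gathered for one URL (all field names are illustrative)."""
    status_code: int
    robots_allowed: bool
    noindex: bool
    canonical_is_self: bool
    in_sitemap: bool
    internal_links: int


def next_action(check):
    """Walk the steps in order and return the first fix that is still needed."""
    if check.status_code != 200:
        return "fix the page so it returns 200"
    if not check.robots_allowed:
        return "remove the robots.txt block and request reindexing"
    if check.noindex:
        return "remove the noindex tag and request reindexing"
    if not check.canonical_is_self:
        return "point the canonical at this URL or an equivalent variant"
    if not check.in_sitemap:
        return "add the URL to the sitemap and resubmit it"
    if check.internal_links < 2:
        return "add internal links from indexed pages"
    return "wait 2-7 days, then recheck for thin content or quality issues"


if __name__ == "__main__":
    page = PageCheck(200, True, False, False, True, 5)
    print(next_action(page))
```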
Agencies should create a standardized diagnostic checklist (like the one above) and use Google Search Console's API to pull coverage data for all clients at scale. Automate the 'blocked by robots.txt' and 'canonical misconfiguration' checks using a script that flags any URL where the canonical does not match the inspected URL. This reduces manual work by 80% and ensures consistent quality across accounts.
For guest posts, the most common cause is that the host site's robots.txt blocks the post's URL or the page has a noindex tag. Ask the host to check their robots.txt and meta robots for that specific page. Also request an internal link from a high-authority page on the host site. Google often discovers guest posts through internal links, not sitemaps.
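Use the Search Console API's 'urlInspection.index.inspect' method to read index status programmatically. The quota is 2,000 inspections per day per property. Parse 'inspectionResult.indexStatusResult' in the response: 'coverageState' carries the same status text shown in GSC (for example 'Discovered - currently not indexed'), and fields such as 'verdict', 'robotsTxtState', 'indexingState', 'googleCanonical', and 'userCanonical' point to the reason. Note that this method only inspects; it does not submit URLs. The separate Google Indexing API can request crawling, but it is supported only for pages with job posting or livestream markup. Automate the re-inspection of fixed URLs.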
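Once you have the JSON from a URL Inspection call, pulling out the useful fields is straightforward. The sketch below works on an already-parsed response and makes no network request. The sample payload is hand-written for illustration and is not real API output, and the field names assume the documented `indexStatusResult` shape, so verify them against the current API reference before relying on them.

```python
# Hand-written sample, shaped like a URL Inspection response (assumption).
SAMPLE_RESPONSE = {
    "inspectionResult": {
        "indexStatusResult": {
            "verdict": "NEUTRAL",
            "coverageState": "Alternate page with proper canonical tag",
            "robotsTxtState": "ALLOWED",
            "indexingState": "INDEXING_ALLOWED",
            "userCanonical": "https://example.com/products/shoes",
            "googleCanonical": "https://example.com/products/shoes",
        }
    }
}


def summarize_inspection(response, inspected_url):
    """Reduce one inspection response to the fields used for triage."""
    status = response.get("inspectionResult", {}).get("indexStatusResult", {})
    google_canonical = status.get("googleCanonical")
    return {
        "verdict": status.get("verdict"),
        "coverage_state": status.get("coverageState"),
        "blocked_by_robots": status.get("robotsTxtState") == "DISALLOWED",
        "indexing_state": status.get("indexingState"),
        # True when Google chose a different URL than the one inspected.
        "canonical_elsewhere": bool(google_canonical)
        and google_canonical != inspected_url,
    }


if __name__ == "__main__":
    url = "https://example.com/products/shoes/nike-air-max"
    for key, value in summarize_inspection(SAMPLE_RESPONSE, url).items():
        print(key, "=", value)
```

Comparing `googleCanonical` with the inspected URL, rather than with `userCanonical`, catches the case where the template itself declares the wrong canonical and Google agrees with it.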
The top crawl errors are: 404 (page deleted without redirect), 500 (server timeout or error), DNS resolution failures, and robots.txt timeouts. Googlebot waits only a few seconds. If your server responds slowly or errors, Google may drop the URL and not retry for days. Fix these by setting up server monitoring and using a CDN to ensure fast, stable responses.
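Server access logs show which of these errors Googlebot actually hit. The sketch below tallies 4xx and 5xx responses served to requests whose user-agent contains "Googlebot". It assumes the common "combined" log format used by Apache and Nginx, and it trusts the user-agent string, which can be spoofed. Confirming genuine Googlebot traffic needs a reverse DNS check that is not shown here. The sample lines are invented.

```python
import re
from collections import Counter

# Matches: "METHOD /path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Invented sample lines in combined log format.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/May/2024:10:00:00 +0000] "GET /products/shoes HTTP/1.1" 200 5120 "-" "%s"' % GOOGLEBOT_UA,
    '66.249.66.1 - - [10/May/2024:10:00:05 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "%s"' % GOOGLEBOT_UA,
    '203.0.113.9 - - [10/May/2024:10:00:06 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '66.249.66.1 - - [10/May/2024:10:00:09 +0000] "GET /checkout HTTP/1.1" 500 0 "-" "%s"' % GOOGLEBOT_UA,
]


def googlebot_errors(lines):
    """Count (status, path) pairs for error responses served to Googlebot."""
    errors = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        status = int(match.group("status"))
        if status >= 400:
            errors[(status, match.group("path"))] += 1
    return errors


if __name__ == "__main__":
    for (status, path), count in googlebot_errors(SAMPLE_LOG).most_common():
        print(status, path, count)
```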
If you inspect a URL and GSC says 'URL is on Google' but the post doesn't show up in search, check the canonical. The new post might be canonicalized to an older, similar post. Also, look for the 'Crawled - currently not indexed' status, which means Google fetched the page but decided not to index it yet. Improve internal links from your homepage or top posts to signal importance.
After removing a robots.txt disallow or noindex tag, Google can re-crawl and index within hours to a few days if you request indexing via GSC. Without a manual request, it can take 1-2 weeks for Google to naturally rediscover the page. We recommend using the 'Request Indexing' button in GSC immediately after removing the block.
'Discovered - currently not indexed' means Google knows the URL exists (from sitemap or links) but hasn't crawled or indexed it yet, often due to crawl budget limits. 'Excluded' means Google intentionally chose not to index the page because of a noindex tag, canonical pointing elsewhere, or a blocked resource. The fix for 'Discovered' is to add internal links; for 'Excluded', fix the specific exclusion reason.
Yes, absolutely. If you set a self-canonical correctly, it's fine. But if you accidentally point the canonical to a different page, Google may treat the current page as a duplicate and exclude it from the index entirely. This is a common error in CMS templates that dynamically generate canonicals. Always verify the canonical tag on every template type.
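A command-line request such as <code>curl -I -A 'Googlebot' https://example.com/page</code> shows how your server responds to that user-agent string: a 200 means the server or firewall is not rejecting it, while a 403, 404, or 500 points to a server-side problem. It does not test robots.txt, because curl never reads that file. To test robots.txt itself, run a live test in GSC URL Inspection, which reports whether crawling is allowed, and review the fetched file in the robots.txt report (which replaced the older robots.txt Tester). The live test is more reliable than parsing the file manually.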
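For a quick local pre-check before going to Search Console, robots.txt rules can be evaluated offline with Python's standard `urllib.robotparser`. Treat this as an approximation: the standard-library parser uses simple prefix matching and does not implement Google's `*` and `$` wildcard extensions, so results can differ from Googlebot's on files that use them. The sample file and helper name are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; in practice, paste in your own file.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/private/page", "https://example.com/blog/post"):
        print(url, "allowed" if is_allowed(SAMPLE_ROBOTS, url) else "blocked")
```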
Export the not-indexed URLs from the GSC Page indexing report. Use a script to check each URL's robots.txt accessibility, canonical tag, and HTTP status. Group by error type (robots, canonical, noindex, thin content). Fix each group using site-wide template changes or .htaccess rules. Then resubmit the affected sitemaps and use 'Validate fix' in GSC for each repaired issue; the Indexing API applies only to job posting and livestream pages. Track progress weekly.
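Grouping is what turns a long export into a short fix list. As a sketch, the function below buckets failing URLs by error type and by first path segment, on the assumption that URLs sharing a path section usually share a template. The input format, a list of `(url, error_type)` pairs, is hypothetical and would come from your own checking script.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_failures(results):
    """Group (url, error_type) pairs by (error_type, first path segment)."""
    groups = defaultdict(list)
    for url, error_type in results:
        segments = [s for s in urlparse(url).path.split("/") if s]
        section = "/" + segments[0] if segments else "/"
        groups[(error_type, section)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        ("https://example.com/products/a", "canonical"),
        ("https://example.com/products/b", "canonical"),
        ("https://example.com/blog/x", "noindex"),
        ("https://example.com/", "robots"),
    ]
    # Largest groups first: these are the likeliest template-level bugs.
    for key, urls in sorted(group_failures(sample).items(), key=lambda kv: -len(kv[1])):
        print(key, len(urls))
```

A large group under a single key, such as many 'canonical' failures under one section, suggests one template fix rather than many page-level edits.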