A diagnostic workflow for site owners who need to verify indexing status, uncover crawl blockers, and fix the root causes of missing pages. Built on Google Search Central best practices and real-world audit data.
Every page you want found must survive the crawl queue, pass robots.txt rules, avoid noindex directives, and meet Google's quality bar. Most site owners assume their content is indexed. Often it is not. A common situation we see is a client with 2,000 blog posts complaining about zero organic traffic. When we run a bulk index check, 1,400 pages are missing from the index. The root cause is almost never a penalty. It is a misconfigured robots.txt, a rogue noindex tag, or thin content that Google quietly declines to index. You do not need a panic button. You need a repeatable diagnostic workflow.
This guide walks you through the exact steps to check if Google is indexing your site, interpret the data Google gives you, and fix the three most common indexing killers. We use production-tested methods from Google's official robots.txt documentation and field-tested bulk verification protocols.
| Method | How It Works | Accuracy & Scale | Common Failure Mode |
|---|---|---|---|
| Google Search Console (URL Inspection tool) | Submit one URL per query. Returns live index status, crawl date, and coverage details. | Authoritative for single URLs. Manual, no bulk export by default. | Rate limits after ~50 inspections per hour. API quota errors on larger sites. |
| site: operator (direct search query) | Type site:yourdomain.com/page in Google search. Shows an approximate result count per query. | Rough estimate only. Google rounds numbers and omits certain pages intentionally. | Zero results shown even if the page is indexed. Parameterized URLs often fail. |
| Bulk Index Checker API (automated batch tool) | Upload a list of URLs to an API endpoint. Returns indexed / not indexed per URL. | 95-98% match with GSC data. Handles 10,000+ URLs per run. | Slow vendors throttle after 500 URLs. False positives for redirected URLs. |
| Server log analysis (raw access logs) | Parse log files for Googlebot user-agent hits. Shows actual crawl activity per URL. | Hard evidence of crawl attempts. Requires log storage and parsing tools like GoAccess or ELK. | Log rotation deletes data after 7 days. No index status, only crawl presence. |
| URL Inspection API (programmatic GSC access) | Uses the Search Console URL Inspection API via OAuth. Returns structured index status for each URL. | Authoritative data. High setup cost but reliable for recurring checks. | Daily quota of 2,000 queries per property. Access tokens expire after about an hour and must be refreshed. |
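The server log method in the table can be sketched in a few lines. This is a minimal example, assuming access logs in the common combined format; the sample lines and the `googlebot_hits` helper are illustrative, not taken from a real site. User-agent strings can be spoofed, so a production check would also verify the requesting IP.

```python
import re
from collections import Counter

# Captures the request path and the user agent from a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count requests per path whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits

# Hypothetical sample lines for illustration only.
sample = [
    '66.249.66.1 - - [10/Jan/2025:10:00:00 +0000] "GET /product/a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Jan/2025:10:00:05 +0000] "GET /product/a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0)"',
    '66.249.66.1 - - [10/Jan/2025:10:01:00 +0000] "GET /product/b HTTP/1.1" 404 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
print(dict(googlebot_hits(sample)))
```

A URL that never appears in this count has not been crawled during the log window, which is a different problem from a URL that was crawled but not indexed.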
1. Export all pages from your CMS or sitemap. Include canonical URLs only and remove parameter duplicates.
2. Run the list through the GSC API or a reliable bulk checker. Process batches of 500 URLs to avoid timeouts.
3. Flag all 'not in index' and 'crawled but not indexed' results. These are your target fix list.
4. For each flagged URL, check robots.txt rules, meta robots tags, HTTP status codes, and content quality.
5. Apply the fix. Blocked by robots.txt? Update the file. Noindex tag present? Remove it. Thin content? Consolidate or improve. Then use GSC URL Inspection to request a re-crawl.
6. Re-run the bulk check after 2-3 days. Track the index coverage trend week over week in GSC.
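The batching and flagging steps above can be sketched as follows. This is a minimal illustration, assuming the checker returns one status string per URL; the helper names `batches` and `fix_list` and the example URLs are hypothetical.

```python
def batches(urls, size=500):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]

# Statuses that put a URL on the fix list (wording follows the workflow text).
TARGET_STATUSES = {"not in index", "crawled but not indexed"}

def fix_list(results):
    """Return the URLs whose status marks them as needing a fix."""
    return [url for url, status in results.items() if status in TARGET_STATUSES]

urls = [f"https://example.com/post/{n}" for n in range(1200)]
print([len(batch) for batch in batches(urls)])

results = {
    "https://example.com/post/1": "indexed",
    "https://example.com/post/2": "not in index",
    "https://example.com/post/3": "crawled but not indexed",
}
print(fix_list(results))
```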
Scenario: An online retailer with 1,200 product pages sees only 340 indexed pages in Google Search Console. We need to check if Google is indexing the missing 860 pages.
Step 1 - Export URLs: We pull the full product sitemap (1,200 URLs) and remove faceted filter parameters (e.g., ?color=red, ?size=L) leaving 980 canonical product URLs.
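Stripping faceted parameters before the check might look like the sketch below. It assumes the only facet parameters are `color` and `size`, as in the scenario; the `FACET_PARAMS` set, the helper names, and the example hostname are illustrative and would need adjusting for a real catalog.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed facet parameters; extend this set to match your own filters.
FACET_PARAMS = {"color", "size"}

def canonical_url(url):
    """Drop facet parameters and any fragment, keeping other query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FACET_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def dedupe(urls):
    """Return canonical URLs in first-seen order with duplicates removed."""
    return list(dict.fromkeys(canonical_url(url) for url in urls))

sitemap = [
    "https://shop.example/product/1?color=red",
    "https://shop.example/product/1?size=L",
    "https://shop.example/product/1",
    "https://shop.example/product/2?page=2",
]
print(dedupe(sitemap))
```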
Step 2 - Bulk check: We run the 980 URLs through a GSC API script in batches of 100. Each batch takes ~2 seconds. Total runtime: 20 seconds.
Step 3 - Results: 340 indexed (matches GSC). 640 not indexed. Breakdown: 220 blocked by robots.txt (Disallow: /product/old), 310 with noindex meta tags (staging pages pushed live), 110 returned 404 errors (deleted variants).
Step 4 - Fixes: Updated robots.txt to remove the Disallow rule. Removed noindex tags from 310 pages. Set up 301 redirects for the 110 deleted pages.
Step 5 - Outcome: After re-submission, 210 of the previously blocked pages indexed within 5 days. Remaining 430 had thin content (less than 200 words) and required content expansion before re-check.
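The numbers in this scenario can be checked with a short tally. The figures come from the steps above; the variable names are only for illustration.

```python
import math

checked = 980          # canonical product URLs after removing facet parameters
indexed = 340
causes = {"blocked_by_robots_txt": 220, "noindex_tag": 310, "http_404": 110}

not_indexed = checked - indexed
# The per-cause breakdown should account for every non-indexed URL.
assert sum(causes.values()) == not_indexed

batch_count = math.ceil(checked / 100)   # batches of 100 URLs
runtime_seconds = batch_count * 2        # about 2 seconds per batch

recovered = 210
remaining = not_indexed - recovered

print(not_indexed, batch_count, runtime_seconds, remaining)
```

Keeping a tally like this per audit makes it obvious when a category is missing from the breakdown.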
In practice, when you check if Google is indexing your site, you will hit data limits, wrong filters, and duplicate lists. Here are the ones that trip up most practitioners:
Blocked URLs: A site-wide Disallow in robots.txt can keep thousands of pages from being crawled, yet Google may still index a blocked URL, without its content, if other pages link to it. Always verify which rules Google has actually fetched with the robots.txt report inside GSC.
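A quick local pre-check of robots.txt rules is possible with Python's standard `urllib.robotparser`. This is a rough sketch: the standard-library parser does simple prefix matching and does not reproduce every detail of Google's matching (wildcards, for instance), so treat it as a first pass rather than a substitute for GSC. The rules and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]

rules = """User-agent: *
Disallow: /product/old
"""
urls = [
    "https://shop.example/product/old/123",
    "https://shop.example/product/new/456",
]
print(blocked_urls(rules, urls))
```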
Wrong filters in GSC: The 'Index Coverage' report can show 'Submitted and indexed' for a sitemap URL that actually returns a 302 redirect. The green checkmark is misleading. Always confirm with the URL Inspection tool.
Bad data from bulk checkers: Some cheap bulk index checkers return 'indexed' for a URL that merely redirects to an indexed page. They report the destination's status without telling you a redirect happened, so you end up with false positives. Always use a tool that resolves and reports the final destination URL.
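One way to guard against this is to resolve each URL to its final destination before trusting an 'indexed' label. The sketch below works on a redirect map you have already collected (for example from a crawl); the `final_url` and `label` helpers and the sample paths are hypothetical.

```python
def final_url(url, redirects, max_hops=10):
    """Follow a source -> target redirect map until a URL no longer redirects."""
    seen = set()
    for _ in range(max_hops):
        if url not in redirects:
            return url
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
        url = redirects[url]
    raise ValueError("too many redirect hops")

def label(url, indexed, redirects):
    """Separate truly indexed URLs from ones that only redirect to an indexed page."""
    destination = final_url(url, redirects)
    if destination != url:
        state = "indexed" if destination in indexed else "not indexed"
        return f"redirects to {destination} ({state})"
    return "indexed" if url in indexed else "not indexed"

redirects = {"/old": "/mid", "/mid": "/new"}
indexed = {"/new", "/about"}
print(label("/old", indexed, redirects))
print(label("/about", indexed, redirects))
```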
Empty results due to quota limits: The GSC API has a daily quota of 2,000 queries per property. If your list has 5,000 URLs, you need to spread the check over three days or use a vendor with higher limits.
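Weak pages with almost no content: Google may crawl a page but decline to index it when the page offers very little visible text. Google publishes no word-count threshold; as a rule of thumb, pages with fewer than roughly 80-100 words of visible text are the ones most often affected. These 'crawled but not indexed' entries are the hardest to fix because they require content improvement, not just technical changes.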
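To triage thin pages in bulk, you can estimate the visible word count of each page's HTML. The sketch below uses only the standard library and ignores script, style, and similar non-visible elements; it is a rough heuristic for sorting a fix list, not a measure Google itself uses. The class name, helper name, and sample markup are illustrative.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Counts words in text nodes outside non-visible elements."""
    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self.words = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())

def visible_word_count(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words

page = (
    "<body><h1>Blue widget</h1><p>In stock now.</p>"
    "<script>var a = 1;</script><style>p { color: red }</style></body>"
)
print(visible_word_count(page))
```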
| Symptom in GSC | Likely Root Cause | Immediate Fix | Priority Level |
|---|---|---|---|
| Submitted URL not indexed (Coverage report shows this status for 30%+ of pages) | Pages are orphaned, have thin content, or internal linking is weak. Google sees no value in indexing them. | Improve internal links from high-authority pages. Add unique content (minimum 300 words per page). | High: affects discoverability of entire site sections. |
| Crawled but not indexed (URL Inspection says 'Page was crawled but not indexed') | Content quality is below threshold. Duplicate or near-duplicate content detected. | Consolidate duplicate pages with 301 redirects. Add canonical tags pointing to the original versions. | High: wastes crawl budget and dilutes relevance. |
| Blocked by robots.txt (URL Inspection shows 'Blocked by robots.txt') | A Disallow directive in robots.txt prevents crawling. Often a legacy rule from an old site structure. | Edit robots.txt to allow the blocked path. Use the robots.txt report in GSC to confirm Google has fetched the updated file. | Critical: a page that cannot be crawled cannot be indexed with its content. |
| Soft 404 (page returns 200 status but content is missing or useless) | Empty category pages, search result pages with no results, or placeholder pages with no real content. | Remove or redirect soft 404 pages. For empty categories, add curated product lists or remove the page. | Medium: harms user experience and wastes Googlebot time. |
| Alternate page with proper canonical (GSC reports the URL as a duplicate of an indexed canonical) | Multiple URLs point to the same canonical. This is normal for paginated or faceted URLs. | No action needed if the canonical is correct. If the wrong canonical is used, fix the tag on the duplicate pages. | Low: only fix if a wrong canonical is diluting the intended primary page. |
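The priority column can be applied mechanically to a fix list. The sketch below assumes each finding is a `(url, symptom)` pair using the symptom names from the table; the numeric ranks and the `prioritized` helper are illustrative.

```python
# Lower rank means fix sooner; ranks mirror the Priority Level column.
PRIORITY = {
    "Blocked by robots.txt": 0,                   # Critical
    "Submitted URL not indexed": 1,               # High
    "Crawled but not indexed": 1,                 # High
    "Soft 404": 2,                                # Medium
    "Alternate page with proper canonical": 3,    # Low
}

def prioritized(findings):
    """Sort (url, symptom) pairs by priority rank, then by URL; unknown symptoms go last."""
    return sorted(findings, key=lambda item: (PRIORITY.get(item[1], len(PRIORITY)), item[0]))

findings = [
    ("/category/empty", "Soft 404"),
    ("/product/old/1", "Blocked by robots.txt"),
    ("/blog/thin-post", "Crawled but not indexed"),
    ("/product/1?color=red", "Alternate page with proper canonical"),
]
for url, symptom in prioritized(findings):
    print(symptom, url)
```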
Export a complete list of all public page URLs from your CMS or sitemap index. Include only canonical URLs. Exclude pagination parameters.
Remove any URLs that intentionally should not be indexed (login pages, admin sections, thank-you pages).
Verify that your robots.txt file allows crawling of the paths you want indexed. Use the robots.txt report in GSC to see the version Google last fetched.
Ensure you have Google Search Console owner access to the property. Without it, you cannot use the URL Inspection API.
Check your API quota limits if you plan to run bulk checks through the URL Inspection API. The daily limit is 2,000 queries per property, so larger lists must be spread over several days.
Prepare a staging environment to test robot meta tag changes before pushing to production.
Use Google Search Console's URL Inspection tool for individual URLs. The Index Coverage report shows aggregate data. Both are free. For bulk checks, use the GSC API with a simple script; it costs nothing beyond your development time. Avoid paid tools until you hit the 2,000-query daily limit.
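A simple script for the GSC API route might look like the sketch below. It assumes `service` is a Search Console client built elsewhere with the Google API Python client (for example `googleapiclient.discovery.build("searchconsole", "v1", credentials=...)`) and that you have access to the property; the `inspect_url` and `bulk_inspect` helpers are illustrative, and error handling and retries are left out.

```python
def inspect_url(service, site_url, page_url):
    """Call the URL Inspection API for one page and return its index status block."""
    body = {"inspectionUrl": page_url, "siteUrl": site_url}
    response = service.urlInspection().index().inspect(body=body).execute()
    return response["inspectionResult"]["indexStatusResult"]

def bulk_inspect(service, site_url, urls, daily_quota=2000):
    """Inspect at most `daily_quota` URLs; return (results, URLs left for another day)."""
    todo, leftover = urls[:daily_quota], urls[daily_quota:]
    results = {url: inspect_url(service, site_url, url) for url in todo}
    return results, leftover
```

Calling `bulk_inspect(service, "https://example.com/", urls)` once per day and feeding the leftover list into the next run keeps the script inside the per-property quota.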
Common reasons: robots.txt blocks the path, a noindex meta tag is present, the page returns a 404 or 500 status, the content is too thin (as a rough guide, under about 80 words of visible text), or the page is orphaned (no internal links). Check each factor in order. The GSC URL Inspection tool will usually tell you the blocking reason.
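Checking the factors in order can be expressed as a small decision function over a URL Inspection API result. The field names and enum values used here (`robotsTxtState`, `indexingState`, `pageFetchState`, `verdict`, `referringUrls`) are written from memory of the API's index status result and should be confirmed against the current API reference; the `root_cause` helper and its labels are illustrative.

```python
def root_cause(status):
    """Map an index status dict to a likely root cause, checking blockers first."""
    if status.get("robotsTxtState") == "DISALLOWED":
        return "blocked by robots.txt"
    if status.get("indexingState") in {"BLOCKED_BY_META_TAG", "BLOCKED_BY_HTTP_HEADER"}:
        return "noindex directive"
    fetch_state = status.get("pageFetchState")
    if fetch_state in {"NOT_FOUND", "SERVER_ERROR", "SOFT_404"}:
        return f"fetch problem: {fetch_state}"
    if status.get("verdict") != "PASS":
        if not status.get("referringUrls"):
            return "no known referring pages (possibly orphaned)"
        return "review content quality"
    return "indexed"

print(root_cause({"robotsTxtState": "DISALLOWED", "verdict": "NEUTRAL"}))
print(root_cause({"robotsTxtState": "ALLOWED", "indexingState": "BLOCKED_BY_META_TAG"}))
print(root_cause({"verdict": "PASS"}))
```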
After submitting a re-crawl request via GSC, Google typically re-crawls within 1-4 days. Full re-indexing of a fixed page can take 1-3 weeks depending on the site's crawl budget and content quality. Monitor the URL Inspection tool for status updates. Do not re-submit repeatedly; it does not speed up the process.
Crawled but not indexed (shown in GSC as 'Crawled - currently not indexed') means Googlebot visited the page but chose not to add it to the index, usually due to thin content, duplication, or low value. Discovered but not crawled (shown in GSC as 'Discovered - currently not indexed') means Google knows the URL exists, from a sitemap or link, but has not yet attempted a crawl. The first requires content improvement; the second requires patience or better internal linking.
Yes. A Disallow: / directive in robots.txt blocks crawling of all pages, and a page Googlebot cannot crawl cannot be indexed with its content. However, Google may still index the bare URL, without a description, if other sites link to it. A noindex meta tag or X-Robots-Tag HTTP header only works when Googlebot is allowed to crawl the page and see the directive. To keep a page out of the index reliably, allow crawling and serve noindex instead of combining it with a robots.txt block.
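Detecting a noindex directive in either place it can appear, the robots meta tag or the X-Robots-Tag response header, can be sketched with the standard library. The helper assumes you already have the page's HTML and a plain dict of response headers keyed exactly as `X-Robots-Tag`; the names `RobotsMeta` and `has_noindex` are illustrative.

```python
from html.parser import HTMLParser

class RobotsMeta(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in {"robots", "googlebot"}:
            for token in (attr.get("content") or "").split(","):
                self.directives.add(token.strip().lower())

def has_noindex(html, headers=None):
    """True if the HTML meta tags or the X-Robots-Tag header contain noindex."""
    parser = RobotsMeta()
    parser.feed(html)
    header_value = (headers or {}).get("X-Robots-Tag", "")
    header_tokens = {token.strip().lower() for token in header_value.split(",")}
    return "noindex" in parser.directives or "noindex" in header_tokens

print(has_noindex('<meta name="robots" content="noindex, follow">'))
print(has_noindex("<p>Hello</p>", {"X-Robots-Tag": "noindex"}))
print(has_noindex('<meta name="robots" content="index, follow">'))
```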
After a domain or URL structure migration, run a bulk index check on the new URLs using the GSC API. Compare the count against the old site's index count. Monitor the Index Coverage report for spikes in '404' or 'Soft 404' errors. Expect a dip in indexing for 2-4 weeks as Google re-crawls. Use 301 redirects from old to new URLs to preserve link equity.
Agencies handling 50,000+ URLs per month should use the Search Console URL Inspection API directly with a custom script, or tools like Sitebulb, Screaming Frog (with GSC integration), or the dedicated Bulk Google Index Checker protocol documented on Medium. Avoid tools that charge per URL without showing their false positive rate. Always cross-check a 5% sample manually.
Google sometimes indexes the redirect destination URL instead of the source. The URL Inspection tool may report the source as indexed if the destination is also on your site. To fix this, make sure the source URL returns a 301 permanent redirect (which Google follows), or, if the source should stay live but out of the index, add a noindex tag to it.
With URL-prefix properties, each subdomain (blog.yoursite.com, shop.yoursite.com) is a separate property in GSC and must be verified individually; a Domain property, verified through DNS, covers all subdomains at once. Run the same diagnostic workflow per property. Use the site: operator with the subdomain prefix to get a quick count, but rely on GSC for accurate data. Cross-subdomain redirects can cause indexing confusion, so verify redirect chains.
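Splitting one URL export into per-property lists is straightforward. The sketch below assumes URL-prefix properties, which are keyed by scheme and hostname; the `group_by_property` helper and the sample URLs are illustrative.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_by_property(urls):
    """Group URLs under a scheme://hostname/ key, one key per URL-prefix property."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[f"{parts.scheme}://{parts.hostname}/"].append(url)
    return dict(groups)

export = [
    "https://blog.yoursite.com/post-1",
    "https://shop.yoursite.com/product/1",
    "https://blog.yoursite.com/post-2",
]
for site, members in group_by_property(export).items():
    print(site, len(members))
```

Each group can then be checked against its own property and its own daily quota.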
The top five: 'Submitted URL not indexed' (thin content), 'Crawled but not indexed' (low quality), 'Blocked by robots.txt' (misconfiguration), 'Soft 404' (empty pages with 200 status), and 'Alternate page with proper canonical' (duplicate handling). Fix these in priority order: robots.txt issues first, then content quality, then redirect errors.
Randomly checking five URLs per week gives you a false sense of control. A structured bulk check reveals the real scale of missing pages. For teams that need to validate indexing for guest posts, link placements, or client deliverables, a repeatable protocol saves hours of manual work. The detailed approach described in the Bulk Google Index Checker Protocol shows how to automate this for agency-scale operations, including handling rate limits and deduplication.
Your next step is simple: export your current sitemap, run a bulk check, and categorize every 'not indexed' result by root cause. Fix the robots.txt block first, then the noindex tags, then the thin content. Re-check after one week. That cycle is the core of a mature indexing strategy.