🔴 Broken — will cause problems

This page's canonical points to /canonicals/robots-blocked-destination/ — a URL that is disallowed in robots.txt. The destination page exists but Google cannot crawl it. Check https://sallymills.com/robots.txt to see the live disallow rule.

What this demonstrates

The canonical tag on this page points to a URL that has been blocked from crawling by a Disallow rule in robots.txt. The destination page exists and returns a 200 — but Googlebot is instructed not to access it, so it can't read the page to verify it as a valid canonical destination.

Why it matters

This is one of the subtler canonical failures. The canonical looks correct — it's an absolute URL, the destination exists, there's no 404. The problem is invisible unless you cross-reference the canonical destination against robots.txt.

If Google can't access the canonical destination, it can't confirm it as the preferred version, so consolidation may not happen. This trap appears most often when a site blocks staging or parameter-based variants in robots.txt and someone then sets those blocked URLs as canonical destinations, or when a later robots.txt change blocks URLs that are already in use as canonical destinations.

The code

The canonical tag on this page — and the robots.txt rule that blocks its destination.

<!-- Canonical tag on this page -->
<link rel="canonical" href="https://sallymills.com/canonicals/robots-blocked-destination/">

# robots.txt: the destination is disallowed
Disallow: /canonicals/robots-blocked-destination/

What Google does

  1. Googlebot crawls this page and reads the canonical pointing to /canonicals/robots-blocked-destination/.
  2. Googlebot checks robots.txt before attempting to crawl the destination.
  3. The Disallow rule prevents Googlebot from accessing the destination.
  4. Google cannot verify the destination as a valid preferred URL.
  5. The canonical hint is effectively ignored. This page is likely treated as self-canonical.
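
The five steps above can be sketched as a toy model in Python. This is only an illustration of the behaviour described on this page, not Google's actual canonical selection logic. It assumes the Disallow rule sits in a "User-agent: *" group (the page shows only the rule itself), and PAGE_URL is a hypothetical stand-in for this page's address. Note that Python's standard urllib.robotparser uses simple prefix matching and does not handle Google's wildcard extensions, which is enough for this rule.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the Disallow rule lives in a "User-agent: *" group.
# The page shows only the Disallow line, not the group it belongs to.
ROBOTS_TXT = """\
User-agent: *
Disallow: /canonicals/robots-blocked-destination/
"""

# Hypothetical URL standing in for this page.
PAGE_URL = "https://sallymills.com/canonicals/this-page/"
DECLARED = "https://sallymills.com/canonicals/robots-blocked-destination/"


def effective_canonical(page_url, declared, robots_txt, user_agent="Googlebot"):
    """Toy model: honour the declared canonical only if it is crawlable."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if rules.can_fetch(user_agent, declared):
        return declared
    # Destination is blocked, so the hint cannot be verified:
    # fall back to treating the page as self-canonical.
    return page_url


print(effective_canonical(PAGE_URL, DECLARED, ROBOTS_TXT))
```

With the Disallow rule in place the function falls back to the page's own URL; with an empty robots.txt it returns the declared canonical.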

How to detect it

  • view-source Ctrl+U (Windows) or Cmd+Option+U (Mac; Cmd+U in Firefox) → search for canonical → copy the destination URL. Then open https://sallymills.com/robots.txt and search for that path.
  • curl Open Command Prompt (Windows) or Terminal (Mac) and run:
      curl https://sallymills.com/robots.txt | grep robots-blocked-destination
    This returns the Disallow rule. (Windows: replace | grep robots-blocked-destination with | findstr robots-blocked-destination.) Then run:
      curl -I https://sallymills.com/canonicals/robots-blocked-destination/
    The -I flag fetches headers only. The destination returns a 200: it exists, it is just blocked to crawlers.
  • Google Search Console The destination URL may appear in the Page indexing report (formerly Coverage) under "Blocked by robots.txt". This page may appear under "Crawled - currently not indexed", since its canonical destination is inaccessible.
  • Screaming Frog Canonicals tab → copy the canonical destination URL → crawl it separately with robots.txt checking enabled → it will show as "Blocked by robots.txt". Or run a robots.txt check on the destination directly from the Robots tab.
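
The view-source step can also be scripted. The sketch below pulls the canonical href out of a page's HTML using only Python's standard library, so the returned URL can then be looked up in robots.txt. SAMPLE_HTML reuses the tag shown on this page; PAGE_URL and the helper names are hypothetical, and a real audit would fetch the HTML rather than hard-code it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical URL standing in for the page being inspected.
PAGE_URL = "https://sallymills.com/canonicals/this-page/"

SAMPLE_HTML = """
<html><head>
<link rel="canonical" href="https://sallymills.com/canonicals/robots-blocked-destination/">
</head><body></body></html>
"""


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html, page_url):
    """Return the absolute canonical URL, or None if the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonical:
        return None
    # Resolve relative hrefs against the page URL.
    return urljoin(page_url, finder.canonical)


print(find_canonical(SAMPLE_HTML, PAGE_URL))
```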

How to fix it

Ensure canonical destinations are always crawlable. Never disallow a URL in robots.txt that you're using as a canonical destination — and never set a robots.txt-blocked URL as a canonical target. Audit your canonical destinations against your robots.txt rules, especially after migrations or when robots.txt changes are made.
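
An audit like this can be automated. The following is a minimal sketch, assuming you already have a mapping of pages to their canonical destinations (for example from a crawler export) and that every destination lives on the host the robots.txt belongs to. The page URLs in CANONICALS and the "User-agent: *" group are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the rule belongs to a "User-agent: *" group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /canonicals/robots-blocked-destination/
"""

# Hypothetical page -> canonical destination pairs.
CANONICALS = {
    "https://sallymills.com/canonicals/page-a/":
        "https://sallymills.com/canonicals/robots-blocked-destination/",
    "https://sallymills.com/canonicals/page-b/":
        "https://sallymills.com/canonicals/page-b/",
}


def blocked_canonicals(robots_txt, canonicals, user_agent="Googlebot"):
    """Return (page, canonical) pairs whose canonical is disallowed."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [
        (page, target)
        for page, target in canonicals.items()
        if not rules.can_fetch(user_agent, target)
    ]


for page, target in blocked_canonicals(ROBOTS_TXT, CANONICALS):
    print(f"{page} -> {target} is blocked by robots.txt")
```

Running a check of this kind after each migration or robots.txt change would flag any page whose canonical destination has become uncrawlable.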