Using ImportXML in Sheets for SEO & Scraping

a cute robot in rainbow colours using importxml

Often as SEO’s we need to review content and data points on different URLs. Screaming Frog is an amazing tool for this but when we are working with a list of less than 1,000 and only need a couple of data points I find it way easier to use Google Sheets in built Scraping.

Gimme! Gimme! Gimme! the Google sheet now – Take a copy or check out the formulas now!

What’s XPath?*

XPath (XML Path Language) is a query language used for selecting nodes from an XML (or HTML) document. It allows you to navigate through the elements and attributes of an XML document, enabling you to pinpoint specific parts of the document’s structure.

XPath expressions can be used in various programming languages and tools for tasks such as data extraction, validation, transformation, and traversal of XML documents. It provides a syntax for selecting nodes based on their properties, relationships, and positions within the document hierarchy.

Basically, XPath lets you target a specific HTML element such a h1 or a meta title etc.

You can get the XPath in dev tools by copying the element you want to capture.

ImportXML for SEO Basics

The examples below will all work on the following cute page from the Australian Museum as an example in cell A2:

https://australian.museum/learn/animals/mammals/greater-bilby/

They should work on your website too assuming you have the element on the page. If you have more than one of these elements on a page see my troubleshooting at the bottom.

H1

<h1>Greater Bilby</h1>

XPath = //h1

Xpath to HTML Break down:

  • //h1 looks for <h1>
Google Sheet FormulaOutput
=IMPORTXML(A2,”//h1″)Greater Bilby

Meta Title / Page Title

<title>Greater Bilby - The Australian Museum</title>

XPath = //title

Xpath to HTML Breakdown:

  • //title looks for <title>
Google Sheet FormulaOutput
=IMPORTXML(A2,”//title”)Greater Bilby – The Australian Museum

Meta Description

<meta name="description" content="The Greater Bilby, sometimes depicted as Australia's Easter Bunny, belongs to a group of ground-dwelling marsupials known as bandicoots.">

XPath = //meta[@name=’description’]/@content

Xpath to HTML Break down:

  • //meta looks for <meta
  • [@name=’description’] looks for name=”description”
  • /@content looks for content=”The Greater Bilby…..

Formula:

Google Sheet FormulaOutput
=IMPORTXML(A2,”//meta[@name=’description’]/@content”)The Greater Bilby, sometimes depicted as Australia’s Easter Bunny, belongs
to a group of ground-dwelling marsupials known as bandicoots.

Canonical

<link rel="canonical" href="https://australian.museum/learn/animals/mammals/greater-bilby/">

XPath = //link[@rel=’canonical’]/@href

Xpath to HTML Break down:

  • //link looks for <link
  • [@rel=’canonical’] looks for rel=”canonical”
  • /@href looks for href=”https://australi…

Formula:

Google Sheet FormulaOutput
=IMPORTXML(A2,”//link[@rel=’canonical’]/@href”)https://australian.museum/learn/animals/mammals/greater-bilby/

H2/H3/H3 etc.

<h2>H2 1</h2>
<p>Some text</p>
<h2>H2 3</h2>
<p>Some text</p>
<h2>H2 3</h2>
<p>Some text</p>

XPath = //h2

Xpath to HTML Break down:

  • //h2 looks for all <h2>

Formula:

When there are multiple values you’ll need to use transpose to make sure it doesn’t overwrite content in the cell below.

Google Sheet FormulaOutput Col 1Output Col 2Output Col 3
=transpose(IMPORTXML(A2,”//h2))H2 1H2 2H2 3

Custom ImportXml

Sometimes we want to grab say authors or dates of articles. So we need to use a bit more complex XML. The two below are quite complicated but all you really need to know is you can copy XPath and use it.

Date & Author Example

https://australian.museum/learn/animals/mammals/greater-bilby/

<main id="main">
..
<article>
<div class="page-header__metadata">
		<ul class="metadata">
			<li class="metadata__item metadata__item--author">
				<dl class="metadata__pair">
					<dt class="metadata__label">Author(s)</dt>
					<dd class="metadata__value">Sue Burrell</dd>
				</dl>
			</li>
			<li class="metadata__item">
				<dl class="metadata__pair">
					<dt class="metadata__label">Updated</dt>
					<dd class="metadata__value">21/04/22</dd>
				</dl>
			</li>
			<li class="metadata__item">
				<dl class="metadata__pair">
					<dt class="metadata__label">Read time</dt>
					<dd class="metadata__value">2 minutes</dd>
				</dl>
			</li>
		</ul>
	</div>
..
</article>
</main>

Date

Date XPath = //*[@id=’main’]/article/header/div/div/div/div/ul/li[2]/dl/dd”

Formula:

Google Sheet FormulaOutput
=importxml(A2,”//*[@id=’main’]/article/header/div/div/div/div/ul/li[2]/dl/dd”)28/04/22

Author

Author XPath = //*[@id=’main’]/article/header/div/div/div/div/ul/li[contains(@class, ‘metadata__item metadata__item–author’)]/dl/dd

Formula:

Google Sheet FormulaOutput
=importxml(A2,”//*[@id=’main’]/article/header/div/div/div/div/ul/li[contains(@class, ‘metadata__item metadata__item–author’)]/dl/dd”)Sue Burrell

Troubleshooting

I have more than one value for the element and it is overwriting the one below.

You can use the =transpose(importXML(A2, “XPATH”) to cater for this. Transpose will instead add the values in the next column instead of row.

My value isn’t working and I copied straight from dev tools?

Watch out for the different quotation marks.

Xpath in Chrome Dev Tools uses ” ” around attributes

Google Sheets uses ‘ ‘ around attributes

If you have any questions feel free to reach out through email or twitter or comment below!


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *