Welcome all! Below is a quick guide that lets you take a CSV with a list of up to 50,000 keywords, cluster them, and add keyword intent as well.
Gimme! Gimme! Gimme! the script – Take a copy or use it now!
Thank you to these geniuses
I have Frankensteined a couple of scripts together to build exactly what I want, so a shout out to these geniuses for both of their scripts. I have very much merged them and made changes to suit my use. You guys rock!
Semantic Clustering Tool
Website: https://www.searchsolved.co.uk
Twitter: https://twitter.com/LeeFootSEO
Keyword Intents
Article: https://importsem.com/use-python-to-label-query-intent-entities-and-keyword-count/
Twitter: https://twitter.com/GregBernhardt4
Jump to my Google Colab Script: Keyword Clustering Intent
What is keyword topic clustering?
Keyword topic clustering is a technique used in search engine optimization (SEO) to group related keywords based on their semantic meaning. This process helps organize and optimize website content, allowing search engines to better understand the topic and intent behind the keywords.
One popular tool for semantic keyword clustering is Python, a programming language known for its versatility and ease of use. Using Python libraries and algorithms, SEO professionals can analyze large sets of keywords and identify patterns or similarities between them, creating clusters of keywords that share a common theme or intent.
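To get a feel for what "semantic similarity" means in practice, here is a minimal sketch using the same sentence-transformers library the script below relies on (the keywords are made up for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# two phrasings of the same idea score high; an unrelated keyword scores low
embeddings = model.encode(['dog toys', 'toys for dogs', 'cat litter'], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # much lower similarity

The clustering step later in the script does essentially this comparison across all your keywords and groups the ones that score above a threshold.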
What is keyword intent and how can we categorise it?
Understanding keyword intent is crucial. Keyword intent refers to the underlying motivation or purpose behind a user’s search query. By categorizing keywords based on their intent, we can optimize content to better align with what users are looking for. This not only improves search engine rankings but also enhances the overall user experience.
Keyword intent categorization involves analyzing search queries and classifying them into different intent categories. There are generally four main types of keyword intent: informational, navigational, transactional, and commercial investigation.
Informational Intent Keywords:
Users search these when looking for knowledge or answers to particular questions. They generally contain terms like who, what, when, where, why, how, can, or will. They could also be entities such as movie or song names. Any time a user wants more information and isn’t going to action something online or in-store, it’s an informational keyword.
Examples:
- What is a German Shepherd?
- My Chemical Romance
- Is Deception Bay a safe suburb?
Navigational Intent Keywords:
These keywords signal that users aim to find a specific website or brand. They already have a destination in mind and use search engines to navigate directly there. This could be to land on a specific product or brand in e-commerce as well.
Examples:
- Bunnings warehouse
- Kong dog toys
- Semrush login
Transactional Intent Keywords:
These keywords indicate users’ readiness to buy or perform a particular action. They denote a high intent to engage in a commercial activity, such as making a purchase.
Examples:
- buy dog food
- cheap blue dress
- books for sale
Commercial Investigation Keywords:
These are employed by users who are comparing their purchasing options. These potential customers are gathering information and evaluating alternatives before finalizing their buying decisions.
Examples:
- Compare dog toys
- ahref vs semrush
- best tech SEO tool (Obviously Screaming Frog duh)
My issue with the above keyword intent categorisation?
Whenever a tool automatically builds this into its keyword data, I consistently find it’s not quite right. So, this is part of the reason I built it into the Python script below.
I also find that commercial and transactional can have a massive overlap. Across most of the verticals I work in, be that a hobby pet store or a financial website, I find myself targeting commercial and transactional together.
Finally, I don’t love ‘navigational’ as a grouping for things in the pet industry either. Tools end up lumping your own brand name, product brand names, and specific products together under navigational.
So, because of all these nuances, and maybe because I’m a control freak, I prefer to set the categories myself.
How to use the Google Colab Python Script
I am writing this for complete beginners as I know many SEOs have never coded before or used Google Colab.
Google Colab is a free coding notebook, so you can go ahead and open up my script, Keyword Clustering Intent, and take a copy, just like you would with any other Google Drive Doc, Sheet, Slide, etc.
Then, working from the top, you’ll click play next to each block of code and let it run.
CSV Structure or download my example:
Please ensure your document is a CSV (not an Excel file) and has two columns: your keywords in the first and search volumes in the second.
Keyword | Search Volume
cat     | 201000
dog     | 165000
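Opened as plain text, the same file would look like this (assuming a comma delimiter; the script also detects pipes, semicolons, and tabs):

Keyword,Search Volume
cat,201000
dog,165000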
Decide on your Keyword Intent Categories
These are mine; you can alter them however you’d like. The script looks for these words when deciding on intent.
# Define your intent categories
# (these variable names need to match the ones used in the filter step later in the script)
informational = ['what', 'who', 'when', 'where', 'which', 'why', 'how', 'can ']
transactional = ['buy', 'order', 'purchase', 'cheap', 'price', 'discount', 'shop', 'sale', 'offer',
                 'snuffle mat', 'pet crate', 'food', 'toy', 'feeder', 'collar', 'bed', 'harness',
                 'ball', 'carrier', 'litter', 'bowl', 'best', 'top', 'review', 'comparison',
                 'compare', 'vs', 'versus', 'guide', 'worm treatment']
commercial = []  # I fold commercial-investigation terms into transactional, so this stays empty
custom = ['royal canin', 'revolution', 'science diet', 'bravecto', 'balance life', 'black hawk', 'adaptil']
Step 1: Run Pip
If you haven’t used Colab before, there are little play buttons in the top-left corner of each code cell. Click play. This will install the libraries needed for the code.
You’ll get a green tick when it’s done.
For this script we are using 4 libraries:
Sentence Transformers
Sentence Transformers is a Python framework from sbert.net that lets us compute sentence and text embeddings and compare their similarity, which is what allows us to cluster keywords.
Pandas
If you’ve ever used Python before, you’ve probably heard of pandas, which lets us analyse and manipulate data. It’s an amazing tool and looks after our DataFrames.
Chardet
Chardet lets us detect a wide range of character encodings, which helps when working with CSVs.
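A toy example of what it returns (not part of the script):

import chardet

# chardet inspects raw bytes and guesses the encoding plus a confidence score
print(chardet.detect('Schrödinger'.encode('utf-8')))
# e.g. {'encoding': 'utf-8', 'confidence': 0.75, 'language': ''}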
detect_delimiter
This helps us detect the delimiter of the CSV, whether it’s a comma, pipe, semicolon, etc. It checks against this whitelist:
[',', ';', ':', '|', '\t']
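For example (a toy call, not part of the script):

from detect_delimiter import detect

# given the first line of a file, detect returns the most likely delimiter
print(detect("cat;201000"))  # ';'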
Step 2: Run Python imports
We then need to import the libraries we want to use. Same thing, click play.
import sys
import time
import pandas as pd
import chardet
import codecs
from detect_delimiter import detect
from google.colab import files
from sentence_transformers import SentenceTransformer, util
Step 3: Upload CSV Keyword & Volume File
- Script expects keywords to be in the first column
- Script expects search volumes to be in the second column
- Expects a CSV file
- Recommended: no more than 50K rows
Click play and an upload option will appear, allowing you to upload your file.
# upload the keyword export
upload = files.upload()
input_file = list(upload.keys())[0]  # get the name of the uploaded file
Step 4: Set Cluster Accuracy & Size, and Choose a Sentence Transformer
You can change these figures if you wish.
cluster_accuracy:
0-100 (100 = very tight clusters, but higher percentage of no_cluster groups)
min_cluster_size:
set the minimum size of cluster groups. (Lower number = tighter groups)
Sentence Transformer
It defaults to the faster one; the best-quality option runs better on a premium Colab subscription.
To change this, remove the “#” from the first option and comment out the second one.
Pre-Trained Models: https://www.sbert.net/docs/pretrained_models.html
cluster_accuracy = 85  # 0-100 (100 = very tight clusters, but higher percentage of no_cluster groups)
min_cluster_size = 2   # set the minimum size of cluster groups. (Lower number = tighter groups)

# transformer = 'all-mpnet-base-v2'  # provides the best quality
transformer = 'all-MiniLM-L6-v2'  # 5 times faster and still offers good quality
Step 5: Run CSV Checker
This confirms which character encoding your CSV uses and makes changes if needed.
# automatically detect the character encoding type
acceptable_confidence = .8

contents = upload[input_file]

codec_enc_mapping = {
    codecs.BOM_UTF8: 'utf-8-sig',
    codecs.BOM_UTF16: 'utf-16',
    codecs.BOM_UTF16_BE: 'utf-16-be',
    codecs.BOM_UTF16_LE: 'utf-16-le',
    codecs.BOM_UTF32: 'utf-32',
    codecs.BOM_UTF32_BE: 'utf-32-be',
    codecs.BOM_UTF32_LE: 'utf-32-le',
}

encoding_type = 'utf-8'  # Default assumption
is_unicode = False

for bom, enc in codec_enc_mapping.items():
    if contents.startswith(bom):
        encoding_type = enc
        is_unicode = True
        break

if not is_unicode:
    # Didn't find a BOM, so let's try to detect the encoding
    guess = chardet.detect(contents)
    if guess['confidence'] >= acceptable_confidence:
        encoding_type = guess['encoding']

print("Character Encoding Type Detected", encoding_type)
Step 6: Run to Create DataFrames
# automatically detect the delimiter
with open(input_file, encoding=encoding_type) as myfile:
    firstline = myfile.readline()
delimiter_type = detect(firstline)

# create a dataframe using the detected delimiter and encoding type
df = pd.read_csv(input_file, on_bad_lines='skip', encoding=encoding_type, delimiter=delimiter_type)

count_rows = len(df)
if count_rows > 50_000:
    print("WARNING: You May Experience Crashes When Processing Over 50,000 Keywords at Once. Please consider smaller batches!")
print("Uploaded Keyword CSV File Successfully!")

dfkeyword = df

# Get the names of the first two columns
first_column_name = df.columns[0]
second_column_name = df.columns[1]

# If the first column is not named 'Keyword', rename the first two columns
if first_column_name != "Keyword":
    df.rename(columns={first_column_name: "Keyword", second_column_name: "Search Volume"}, inplace=True)

# Continue with your data processing
cluster_name_list = []
corpus_sentences_list = []
df_all = []

corpus_set = set(df['Keyword'])
corpus_set_all = corpus_set
cluster = True
Step 7: Run Keyword Clustering – This can take a while!
# keep looping through until no more clusters are created
cluster_accuracy = cluster_accuracy / 100
model = SentenceTransformer(transformer)

while cluster:
    corpus_sentences = list(corpus_set)
    check_len = len(corpus_sentences)

    corpus_embeddings = model.encode(corpus_sentences, batch_size=256, show_progress_bar=True, convert_to_tensor=True)
    clusters = util.community_detection(corpus_embeddings, min_community_size=min_cluster_size, threshold=cluster_accuracy)

    for cluster_id, cluster_ids in enumerate(clusters):
        print("\nCluster {}, #{} Elements ".format(cluster_id + 1, len(cluster_ids)))
        for sentence_id in cluster_ids:
            print("\t", corpus_sentences[sentence_id])
            corpus_sentences_list.append(corpus_sentences[sentence_id])
            cluster_name_list.append("Cluster {}, #{} Elements ".format(cluster_id + 1, len(cluster_ids)))

    df_new = pd.DataFrame(None)
    df_new['Cluster Name'] = cluster_name_list
    df_new["Keyword"] = corpus_sentences_list

    df_all.append(df_new)
    have = set(df_new["Keyword"])

    corpus_set = corpus_set_all - have
    remaining = len(corpus_set)
    print("Total Unclustered Keywords: ", remaining)
    if check_len == remaining:
        break

# make a new dataframe from the list of dataframes and merge back into the original df
df_new = pd.concat(df_all)
df = df.merge(df_new.drop_duplicates('Keyword'), how='left', on="Keyword")

# rename each cluster to its highest-volume keyword
df = df.sort_values(by="Search Volume", ascending=False)
df['Cluster Name'] = df.groupby('Cluster Name')['Keyword'].transform('first')
df.sort_values(['Cluster Name', "Search Volume"], ascending=[True, False], inplace=True)
df['Cluster Name'] = df['Cluster Name'].fillna("zzz_no_cluster")

# move the cluster and keyword columns to the front
col = df.pop("Keyword")
df.insert(0, col.name, col)
col = df.pop('Cluster Name')
df.insert(0, col.name, col)

df.sort_values(["Cluster Name", "Keyword"], ascending=[True, True], inplace=True)

uncluster_percent = (remaining / count_rows) * 100
clustered_percent = 100 - uncluster_percent
print(clustered_percent, "% of rows clustered successfully!")
Step 8: Apply Intents
In this part of the code, you will need to make changes to suit your intents, and you can choose whatever groupings you want as well.
Update the intents below, and feel free to add more if you wish. These are Greg’s defaults:
# Define your intent categories
transactional = ['buy', 'order', 'purchase', 'cheap', 'price', 'discount', 'shop', 'sale', 'offer']
commercial = ['best', 'top', 'review', 'comparison', 'compare', 'vs', 'versus', 'guide', 'ultimate']
informational = ['what', 'who', 'when', 'where', 'which', 'why', 'how']
custom = ['brand variation 1', 'brand variation 2', 'brand variation 3']
The next step is to run the filters below, updating them if you changed the default category names:
# Start with an empty 'Intent' column (the filters below append to it,
# so it needs to exist first), then label each matching category
df['Intent'] = ''
df.loc[df['Keyword'].str.contains('|'.join(transactional), case=False, na=False), 'Intent'] = df['Intent'] + ' Transactional'
df.loc[df['Keyword'].str.contains('|'.join(commercial), case=False, na=False), 'Intent'] = df['Intent'] + ' Commercial'
df.loc[df['Keyword'].str.contains('|'.join(informational), case=False, na=False), 'Intent'] = df['Intent'] + ' Informational'
df.loc[df['Keyword'].str.contains('|'.join(custom), case=False, na=False), 'Intent'] = df['Intent'] + ' Custom'
df['Intent'] = df['Intent'].str.strip()
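One caution: str.contains treats the joined terms as a regular expression, so if any of your terms contain regex metacharacters (for example ‘c++’ or a ‘?’), escape them first. A minimal sketch of the safer pattern, shown here for the transactional list:

import re

# escape each term so it matches literally rather than as a regex pattern
pattern = '|'.join(re.escape(term) for term in transactional)
df.loc[df['Keyword'].str.contains(pattern, case=False, na=False), 'Intent'] = df['Intent'] + ' Transactional'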
Final Step!
Run the last bit of code and it’ll output your clustered CSV!
df.to_csv('clustered.csv', index=False)
files.download("clustered.csv")
You can then put these into a pivot table to allow you to filter by intent and find your next topic to target!
Here is an example of mine: [screenshot of my pivot table]
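If you’d rather build the summary in Python before exporting, here is a rough sketch with pandas (assuming the df from the steps above, with the columns created earlier):

# sum search volume by cluster and intent to spot the biggest opportunities
pivot = pd.pivot_table(df, index='Cluster Name', columns='Intent',
                       values='Search Volume', aggfunc='sum', fill_value=0)
print(pivot.head(20))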
If you have any questions, feel free to reach out through email or Twitter, or comment below!