Below you'll find a script to help you match old 404 URLs with new URLs. Rather than manually matching URLs one by one, you can click play and let FuzzyMatch do the work for you.
Get the script — take a copy or use it now in Google Colab
The script outputs a CSV with your old URLs matched to your new ones, along with a similarity score between 0 and 1 for each match.
Thank you to these geniuses
As with most of my scripts, I've frankensteined other people's work — so a massive shoutout to these legends.
Understanding Fuzzy Matching
A huge thank you to Lazarina Stoy — her article on Fuzzy matching for SEO is excellent, and her blog is generally worth bookmarking.
Twitter: @lazarinastoy
Fuzzy Matching in Python
I originally got into FuzzyMatch thanks to Antoine Eripret and his script — the code is very much his, with minor changes to make it more accessible for less technical SEOs. Also check out the second part of his article where he uses Google Search to decide where a URL should redirect to.
Twitter: @antoineripret
Example Excel workbook
For the script to work, you need to upload an .xlsx file with a specific structure:
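- Create an .xlsx workbook with 2 worksheets
- Name the first sheet old and paste your old website URLs in the first column
- Name the second sheet new and paste your new website URLs in the first column
- Put the header URL in cell A1 of both sheets, because the script reads each sheet's column by that name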
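If you would rather build the workbook with code than by hand, here is a minimal sketch using pandas and openpyxl. The file name redirects.xlsx and the example URLs are made up for illustration. The URL header is taken from the script further down, which reads each sheet's URL column.

```python
import pandas as pd

# Hypothetical example URLs; replace these with your own crawl exports.
old_urls = ["https://example.com/blog/old-post", "https://example.com/shop/red-shoes"]
new_urls = ["https://example.com/articles/old-post", "https://example.com/store/red-shoes"]


def build_workbook(path, old_urls, new_urls):
    """Write an .xlsx file with an 'old' sheet and a 'new' sheet, each with a 'URL' header."""
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        pd.DataFrame({"URL": old_urls}).to_excel(writer, sheet_name="old", index=False)
        pd.DataFrame({"URL": new_urls}).to_excel(writer, sheet_name="new", index=False)


# "redirects.xlsx" is an assumed file name, any name ending in .xlsx should do.
build_workbook("redirects.xlsx", old_urls, new_urls)
```

The resulting file can then be uploaded in Step 3 like any hand-made workbook.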
Step 1: Run pip install PolyFuzz
Click the play button next to the line of code below. This installs the PolyFuzz library.
pip install PolyFuzz
Step 2: Load in the libraries
PolyFuzz is a Python library that runs FuzzyMatch to match similar strings. It does the heavy lifting in this script.
Pandas allows us to analyse and manipulate data. It handles our data frames throughout the script.
Openpyxl allows Python to open, read, and manipulate Excel files.
import pandas as pd
from openpyxl import load_workbook
from polyfuzz import PolyFuzz
from google.colab import files
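To get a feel for what the matching does, here is a rough sketch in plain Python. As far as I understand it, the TF-IDF model used in Step 4 compares strings by their overlapping character chunks (n-grams). The version below is a simplified stand-in and not PolyFuzz's own code: it counts 3-character chunks and skips the IDF weighting entirely, so its scores will not equal the ones PolyFuzz gives you. The function names and example paths are made up for illustration.

```python
from collections import Counter
from math import sqrt


def trigrams(text):
    """Return a Counter of the overlapping 3-character chunks in text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between the trigram counts of two strings (0.0 to 1.0)."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def best_match(old_url, new_urls):
    """Return the (new_url, score) pair with the highest similarity to old_url."""
    scored = ((new, cosine_similarity(old_url, new)) for new in new_urls)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical paths: the first candidate shares most of its chunks with the old URL.
new_urls = ["/store/red-shoes", "/articles/seo-tips"]
match, score = best_match("/shop/red-shoes", new_urls)
print(match, round(score, 2))
```

The more chunks two URLs share, the closer the score gets to 1, which is why URLs that only changed a folder name tend to match well.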
Step 3: Upload your Excel file
- Create an .xlsx workbook with 2 worksheets
- First sheet called old, with your old 404 URLs in column A
- Second sheet called new, with your new 200 URLs in column A
# Upload the xlsx file
upload = files.upload()
input_file = list(upload.keys())[0] # get the name of the uploaded file
Step 4: Run FuzzyMatch
This is the part of the script that does all the matching and outputs results. If results look good and there are no errors, move to the final step.
# Load URL lists
old = pd.read_excel(input_file, sheet_name='old')
new = pd.read_excel(input_file, sheet_name='new')
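# Convert to Python lists (required by PolyFuzz), skipping blank cells
# Each sheet needs the header "URL" in cell A1 for this to work
old = old['URL'].dropna().astype(str).tolist()
new = new['URL'].dropna().astype(str).tolist()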
# Launch fuzzy matching
model = PolyFuzz("TF-IDF")
model.match(old, new)
# Load results
result = model.get_matches()
# Print results so you can visually check what's been done
print(result)
# Create a DataFrame from the results
df = pd.DataFrame(result, columns=['From', 'To', 'Similarity'])
# Save the DataFrame to a CSV file
df.to_csv('/content/redirect-map.csv', index=False)
Step 5: Download your redirect map
files.download("/content/redirect-map.csv")
Done. Your URLs will be matched and you can QA the results rather than doing the matching by hand.
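For the QA pass, one option is to split the redirect map into matches you can probably trust and matches worth checking by hand. This is only a sketch: the rows below are invented, the 0.85 cut-off is an assumption you should tune for your own site, and the split_for_review helper is hypothetical. It assumes the same From, To and Similarity columns as the CSV from Step 4.

```python
import pandas as pd

# Hypothetical rows in the same From / To / Similarity shape as the redirect map.
matches = pd.DataFrame({
    "From": ["/shop/red-shoes", "/blog/old-post", "/about-us-2019"],
    "To": ["/store/red-shoes", "/articles/old-post", None],
    "Similarity": [0.91, 0.78, 0.0],
})

REVIEW_THRESHOLD = 0.85  # assumed cut-off, not a recommended value


def split_for_review(df, threshold=REVIEW_THRESHOLD):
    """Split matches into rows to trust and rows to check by hand."""
    # A row needs review if it has no target URL or a low similarity score.
    needs_review = df["To"].isna() | (df["Similarity"] < threshold)
    confident = df[~needs_review].sort_values("Similarity", ascending=False)
    review = df[needs_review].sort_values("Similarity")
    return confident, review


confident, review = split_for_review(matches)
print(review)
```

To run it on the real output, you could swap the invented rows for pd.read_csv on the downloaded redirect-map.csv file. Sorting the review list from lowest score up puts the weakest matches first, which is usually where the manual fixes are.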