Howdy folks, below you’ll find a script to help you match old 404 URLs with new URLs! This means that rather than playing matchy with URLs you can click play and let Fuzzy Match do the work for you.
Gimme! Gimme! Gimme! the script now – Take a copy or use the script now!
This script will output a CSV with your old URLs matched to your new ones, plus a similarity score for each match (from 0 to 1, where 1 means the two strings are identical).
Thank you to these geniuses
As with most of my scripts over the years, I have frankensteined other people’s work. So a massive shout-out to these legends!
Understanding Fuzzy Matching
A massive thank you to Lazarina Stoy, your article on Fuzzy matching for SEO is amazing! And your blog is amazing – you are a genius.
If you want to learn about fuzzy matching, please read her article and honestly everything else on her site.
Twitter: lazarinastoy
Fuzzy Matching in Python
I originally started using fuzzy matching thanks to Antoine Eripret. This script is very much his, with some minor changes so that less technical SEOs can use it easily.
Thank you Antoine!
Also, check out the second part of his article where he uses Google Search to decide where a URL should be sent to. Very cool.
Twitter: antoineripret
Example Excel Workbook
For this script to work you need to upload an xlsx file with a specific layout.
- Create an xlsx workbook with 2 worksheets.
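- Name the first sheet ‘old’, type URL in cell A1 as the column header, and paste your old website URLs below it in the first column.
- Name the second sheet ‘new’, type URL in cell A1 as the column header, and paste your new website URLs below it in the first column.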
Download an example!
Step 1: Run ‘pip install PolyFuzz’
Click the play button next to the line of code below. This installs the PolyFuzz library. Don’t be worried about the movie-hacker code below.
pip install PolyFuzz
Step 2: Load in the libraries
PolyFuzz
PolyFuzz is a Python library that runs fuzzy matching to pair up similar strings. This is the important part of the script, doing the hard work for us.
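If you want a feel for what "matching similar strings" means before running anything, here is a minimal sketch using Python's built-in difflib. To be clear, this is not what PolyFuzz does under the hood (the script below uses its TF-IDF model); it is just a simpler stand-in for illustration, and the URLs and the `best_match` helper are made up for the example.

```python
from difflib import SequenceMatcher


def best_match(old_url, new_urls):
    """Return (closest_new_url, score), where score is a 0-1 similarity ratio."""
    scored = [
        (new_url, SequenceMatcher(None, old_url, new_url).ratio())
        for new_url in new_urls
    ]
    # Keep the candidate with the highest similarity ratio
    return max(scored, key=lambda pair: pair[1])


# Hypothetical URLs, purely for illustration
old_urls = ["/blog/seo-tips-2019", "/shop/red-shoes"]
new_urls = ["/products/red-shoes", "/articles/seo-tips", "/contact"]

for old_url in old_urls:
    match, score = best_match(old_url, new_urls)
    print(f"{old_url} -> {match} ({score:.2f})")
```

Each old URL is compared against every new URL and the closest one wins. PolyFuzz does the same kind of job, just with a smarter similarity model that copes much better with long lists.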
Pandas
If you’ve ever used Python before you’ve probably heard of Pandas which allows us to analyse and manipulate data. It’s an amazing tool and looks after our data frames.
Openpyxl
Openpyxl allows Python to open, manipulate and read Excel files.
import pandas as pd
from openpyxl import load_workbook
from polyfuzz import PolyFuzz
from google.colab import files
Step 3: Upload your Excel file
- Create an xlsx workbook with 2 worksheets:
- 1st called ‘old’ with your old 404 URLs in the first column, under a header cell that says URL
- 2nd called ‘new’ with your new 200 URLs in the first column, under a header cell that says URL
# upload the xlsx file
upload = files.upload()
input_file = list(upload.keys())[0]  # get the name of the uploaded file
Step 4: Run FuzzyMatch!
This is the fancy part of the script that does all the matching for you and outputs results. If the results look good and you have no errors you can move to the final step and export the matched URLs!
# load the URL lists from the two worksheets
old = pd.read_excel(input_file, sheet_name='old')
new = pd.read_excel(input_file, sheet_name='new')

# convert to Python lists (required by PolyFuzz)
# note: each sheet needs a header cell named 'URL' in row 1, or this raises a KeyError
old = old['URL'].tolist()
new = new['URL'].tolist()

# launch fuzzy matching
model = PolyFuzz("TF-IDF")
model.match(old, new)

# load results
result = model.get_matches()

# print the results below so you can visually see what's been done
print(result)

# create a DataFrame from the results
df = pd.DataFrame(result, columns=['From', 'To', 'Similarity'])

# save the DataFrame to a CSV file
df.to_csv('/content/redirect-map.csv', index=False)
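If Step 4 throws an error, two things are worth checking first: a sheet without a URL header (that gives a KeyError), and blank cells in the URL column (pandas reads those as NaN, which are not strings). Below is a rough sketch of a tidy-up helper you could run on each sheet before converting it to a list. The `clean_urls` name and the sample data are my own assumptions and not part of Antoine's original script.

```python
import pandas as pd


def clean_urls(sheet: pd.DataFrame, column: str = "URL") -> list:
    """Return the URLs in `column` as a list, minus blanks and duplicates."""
    if column not in sheet.columns:
        raise KeyError(
            f"Expected a '{column}' header in row 1, found: {list(sheet.columns)}"
        )
    # Drop empty cells (NaN), force everything to text, trim stray spaces
    urls = sheet[column].dropna().astype(str).str.strip()
    # Remove cells that were only whitespace, then remove repeated URLs
    urls = urls[urls != ""].drop_duplicates()
    return urls.tolist()


# Hypothetical sheet standing in for pd.read_excel(input_file, sheet_name='old')
sample = pd.DataFrame({"URL": ["/old-page", " /old-page ", None, "/about-us"]})
print(clean_urls(sample))
```

In the script above you would swap `old['URL'].tolist()` for `clean_urls(old)`, and the same for `new`.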
Step 5: Download your redirect map! Woohoo!
files.download("/content/redirect-map.csv")
Then you’re done! Your URLs are matched, so instead of mapping them by hand, all that’s left is to QA the results.
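For the QA itself, one approach is to sort the matches by score and eyeball the weakest ones first. Here is a small sketch of that idea. The 0.7 cut-off is an assumption on my part (tune it for your own site), and the sample rows are invented; in practice you would load your file with `pd.read_csv('redirect-map.csv')` instead.

```python
import pandas as pd

# Hypothetical rows in the same From / To / Similarity layout as the redirect map
redirects = pd.DataFrame(
    {
        "From": ["/blog/seo-tips-2019", "/shop/red-shoes", "/old-landing-page"],
        "To": ["/articles/seo-tips", "/products/red-shoes", "/contact"],
        "Similarity": [0.82, 0.91, 0.31],
    }
)

REVIEW_THRESHOLD = 0.7  # assumed cut-off, not a PolyFuzz default

# Weak matches go to a human, lowest score first
needs_review = redirects[redirects["Similarity"] < REVIEW_THRESHOLD].sort_values("Similarity")
# Stronger matches still deserve a quick spot check
looks_ok = redirects[redirects["Similarity"] >= REVIEW_THRESHOLD]

print(f"{len(needs_review)} to review, {len(looks_ok)} that look OK")
print(needs_review)
```

A high score only means the two URL strings look alike, so it is still worth spot-checking a handful of the "OK" rows before you ship the redirects.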
If you have any questions, feel free to reach out by email or Twitter, or comment below!