Python is all the rage in SEO, and for good reason. Python is a programming language that is accessible enough for those with little programming experience, but powerful enough to extend your SEO efforts with advanced data analysis and tool automation.
In this intermediate tutorial, Rocket Clicks will guide you step-by-step in creating the necessary framework to automate Screaming Frog and store the data in a MySQL database. First, we’re going to map out the requirements, then import the needed modules, get the crawl ready, process the crawl, and then automate it. From there, you can analyze historical trends generated for the metrics you care most about.
Note: placeholders exist in some of the code below where you need to fill in the details for your environment.
Requirements and Assumptions
- Access to a Linux installation (I recommend Ubuntu) and an understanding of basic terminal functions
- Python 3 is installed and basic Python syntax understood
- Screaming Frog for Linux is installed
- Access to a MySQL database, locally or externally
Creating MySQL Tables
First, we need to set up the database where we’ll store our crawl data. There are two common ways to access and manage MySQL: from the command terminal, or through the phpMyAdmin GUI provided by cPanel.
Option 1 – Command Line Terminal:
If you don’t have cPanel, or you’re most comfortable in the terminal, see this guide for logging into MySQL and creating the database and user. Then run the SQL statements below to create the tables.
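If you’d like a concrete starting point, statements along these lines will create the database and user from the MySQL prompt. The database name (seo_crawls), user name (sf_user), and password shown here are placeholders for this example; substitute your own details:

CREATE DATABASE seo_crawls;
CREATE USER 'sf_user'@'localhost' IDENTIFIED BY 'YOUR_PASSWORD';
GRANT ALL PRIVILEGES ON seo_crawls.* TO 'sf_user'@'localhost';
FLUSH PRIVILEGES;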
Option 2 – phpMyAdmin:
If you have access to cPanel, you can create the database and user in the MySQL Databases area. After that, head over to phpMyAdmin (also found in cPanel). Select your database from the list on the left side and, in the SQL tab found at the top, enter the SQL below to create the table that will contain the websites you want to run Screaming Frog on.
CREATE TABLE websites (
    websiteid int NOT NULL AUTO_INCREMENT,
    name varchar(255),
    url varchar(255),
    PRIMARY KEY (websiteid)
);
At this point, you’ll have an empty table for your websites. Naturally, you’ll want to populate it with the websites you want to crawl. If you created the table in phpMyAdmin, you can select it in the left-hand column and then select “Insert” at the top. Fill out that form for each website you want to crawl.
You can also insert website records via SQL as shown below (Note: websiteid is auto-generated):
INSERT INTO websites (name,url) VALUES ("Rocket Clicks","https://rocketclicks.com")
Next, we’re ready to create the table for the crawl data using the SQL statement below.
CREATE TABLE crawls (
    crawlid int NOT NULL AUTO_INCREMENT,
    websiteid int(255),
    date varchar(255),
    status_200 int(255),
    status_301 int(255),
    status_404 int(255),
    status_403 int(255),
    avg_wordcount int(255),
    avg_titlecount int(255),
    avg_metacount int(255),
    avg_crawl_depth int(255),
    avg_h1count int(255),
    avg_textratio int(255),
    indexability int(255),
    avg_inlinks int(255),
    avg_outlinks int(255),
    PRIMARY KEY (crawlid)
);
Importing Necessary Modules
Python modules are like libraries in other coding languages. They are collections of premade functions that you can use to save time by not reinventing the wheel. Most of the Python modules we’re going to use should be preinstalled, but one that isn’t is mysql.connector. To install this, go to your command terminal and type in this command:
pip3 install mysql.connector
If you get any errors about other missing modules, you can use the same code above to install the rest. Just make sure to replace the last part with the name of the new module. Sometimes the names aren’t obvious; you can search for the module names here.
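For example, if pandas (which we’ll use later to process the crawl data) isn’t already installed on your system:

pip3 install pandas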
Getting the Crawl Ready
Next, fire up your favorite code editor or IDE. I recommend PyCharm for more experienced coders or Thonny for beginners.
Place the code below on the first line of the file. It’s called a shebang (or hashbang) and tells Linux how to execute the file. It’s often optional, but it is required when the script runs from a cronjob, which we will set up later. This line tells Linux to run the file with Python 3.
#!/usr/bin/python3
Next, let’s import the modules we’re going to be using:
### Used to execute ScreamingFrog
import os
### Cleanup files after script is done
import glob
### Get date
from datetime import date
### Get time of day
from datetime import datetime
### Use to delay script execution
import time
### Communicate with database
import mysql.connector
### Convert csv file to table form
import pandas as pd
Now we’ll retrieve the list of website names and URLs from the table you created earlier. We’ll use this list to have Screaming Frog loop through each of those sites. Inside the loop we assign values to the variables id, name, and url, replacing any spaces in the name with underscores so the crawl file names are easier to work with when we save them.
Customize the “mydb” and “sql_websites” variables with your own details.
mydb = mysql.connector.connect(port="3306", host="HOST_IP", user="USERNAME", password="PASSWORD", database="DATABASE_NAME")
cursor = mydb.cursor()

sql_websites = "SELECT * FROM websites"
cursor.execute(sql_websites)
records = cursor.fetchall()

for row in records:
    id = str(row[0])
    name = row[1].replace(" ", "_")
    url = row[2]
    ### Everything from here through the folder cleanup below runs inside this loop, once per website
Next, we will use the os module to execute Screaming Frog, which you installed earlier. If you need more assistance, see the Screaming Frog user guide for command-line (headless) options. We run it using a URL from the MySQL records we retrieved, output the HTML crawl data, and apply a specific configuration. I strongly suggest creating a configuration file and exporting it for headless mode, limiting it to only the data you want to store, and limiting the number of pages to crawl.
Why limit the number of pages crawled? When you execute Screaming Frog via a Python script, there is no communication between the two applications (I’d love to hear workarounds for this, if someone has one). The script won’t inherently wait for Screaming Frog to end the crawl. It will just keep moving on. So we need to put in an artificial pause in the script. I set it for 10 minutes (600 seconds) in the example below. For sites with 2000 or fewer pages, that should be plenty of time. For bigger sites, you’ll need to play with the timing. If you undercut your time, you’ll end up with a Python error when the script starts looking for the outputted crawl file but can’t find it because the crawl hasn’t finished yet.
Customize the variable “stream” with your own desired configuration and details.
stream = os.popen('screamingfrogseospider --crawl ' + url + ' --headless --save-crawl --output-folder PATH_TO_EXPORT --export-tabs "Internal:HTML" --overwrite --config "PATH_TO_FILE/NAME_OF_FILE.seospiderconfig"')
### Give Screaming Frog time to finish the crawl (adjust for larger sites)
time.sleep(600)
print(name + ": Crawl Finished")
Processing the Crawl
If the crawl completed properly, it will output two files: a CSV file and a Screaming Frog-specific .seospider file. These files are nice to keep for historical reference, so we’re going to keep them, but we need to rename the generically named files or they’ll be overwritten when the next website is crawled.
First, we’ll get today’s date and use it in the new file names so we can identify them at a glance. We’ll use another short pause, since we’re invoking another operating system process that Python can’t communicate with.
Customize both os.rename() calls with your own paths.
getdate = datetime.now().strftime("%m-%d-%y")
os.rename(r'PATH_TO_OLD.csv', r'PATH_TO_NEW' + name + '_' + getdate + '.csv')
os.rename(r'PATH_TO_OLD.seospider', r'PATH_TO_NEW' + name + '_' + getdate + '.seospider')
time.sleep(2)
Now that we have our renamed crawl files, we can load the CSV into a Pandas dataframe. Pandas is a module used for data manipulation and analysis, and a dataframe is essentially a table used to store data, with columns and rows just like a spreadsheet. In the code below, we also rename the Status Code column to StatusCode (removing the space) to make it easier to reference later on.
Customize the variable “data” with your own details.
data = pd.read_csv('PATH_TO' + name + '_' + getdate + '.csv')
data.columns = data.columns.str.replace('Status Code', 'StatusCode')
Now that we’ve loaded our crawl CSV into a dataframe (spreadsheet), we can start using it!
At this point, you can decide what metrics of the crawl you want to store. Below are the metrics I thought were important.
In this guide, we’re not going to store individual page metrics, but averages based on the entire crawl. To achieve that, we access a column like data['Word Count'], take the average with the .mean() function, round() the number, and use str() to turn it into a string for later use.
avg_word_count = str(round(data['Word Count'].mean()))
avg_title_length = str(round(data['Title 1 Length'].mean()))
avg_crawl_depth = str(round(data['Crawl Depth'].mean()))
avg_h1_count = str(round(data['H1-1 length'].mean()))
avg_text_ratio = str(round(data['Text Ratio'].mean()))
avg_metacount = str(round(data['Meta Description 1 Length'].mean()))
avg_inlinks = str(round(data['Unique Inlinks'].mean()))
avg_outlinks = str(round(data['Unique External Outlinks'].mean()))
Continuing with aggregating our data, we’ll use some conditionals on the Pandas dataframe to count the number of each status code we want to track, along with the number of pages marked Non-Indexable.
indexability = str(data[data.Indexability == "Non-Indexable"].count()['Indexability'])
status_200 = str(data[data.StatusCode == 200].count()['StatusCode'])
status_404 = str(data[data.StatusCode == 404].count()['StatusCode'])
status_403 = str(data[data.StatusCode == 403].count()['StatusCode'])
status_301 = str(data[data.StatusCode == 301].count()['StatusCode'])
Once we have all the data we want to store, we can start building part of our SQL statement:
addsql = (status_200 + "','" + status_301 + "','" + status_404 + "','" + status_403 + "','"
          + avg_word_count + "','" + avg_title_length + "','" + avg_metacount + "','"
          + avg_crawl_depth + "','" + avg_h1_count + "','" + avg_text_ratio + "','"
          + indexability + "','" + avg_inlinks + "','" + avg_outlinks)
Next, we will execute the SQL statement that adds the crawl data and commit it to the database. These fields need to correspond to the columns in the crawls table we set up earlier. After the cleanup step below, the script loops back and processes the next website:
new_crawl = "INSERT INTO crawls (websiteid,date,status_200,status_301,status_404,status_403,avg_wordcount,avg_titlecount,avg_metacount,avg_crawl_depth,avg_h1count,avg_textratio,indexability,avg_inlinks,avg_outlinks) VALUES ('" + id + "','" + getdate + "','" + addsql + "')"
cursor.execute(new_crawl)
### Commit so the insert is actually written to the database
mydb.commit()
print(name + " added to database")
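Building the statement by hand with string concatenation works, but it’s easy to get the quoting wrong. As an alternative sketch (using the same variables from above), mysql.connector also accepts parameterized queries, which handle the quoting for you:

insert_sql = ("INSERT INTO crawls (websiteid,date,status_200,status_301,status_404,status_403,"
              "avg_wordcount,avg_titlecount,avg_metacount,avg_crawl_depth,avg_h1count,avg_textratio,"
              "indexability,avg_inlinks,avg_outlinks) "
              "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
### The %s placeholders are filled in safely by the MySQL driver
values = (id, getdate, status_200, status_301, status_404, status_403,
          avg_word_count, avg_title_length, avg_metacount, avg_crawl_depth,
          avg_h1_count, avg_text_ratio, indexability, avg_inlinks, avg_outlinks)
cursor.execute(insert_sql, values)
mydb.commit()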
Lastly, we’ll clean up the folder where the Screaming Frog output files were stored so the next run starts with a clean folder. Make sure the path is correct; this can be a dangerous command to run because you could accidentally delete the wrong folder if you’re not careful. Once the loop has finished with every website, we also close the database connection.
Customize the variable “files” with your own details.
files = glob.glob('PATH_TO_CRAWL_FILES/*')
for f in files:
    os.remove(f)
    print("removed: " + f)

### After the main loop has processed every website, close the database connection
mydb.close()
Automating the Crawl
If your script works well when you run it manually, it’s time to automate it. Luckily, Linux already supplies a solution: the crontab. The crontab stores entries for scripts and lets you dictate when to execute them (like a scheduler). You have lots of flexibility in how you schedule your crawl (any time of day, day of the week, day of the month, and so on). To add entries to the crontab, run this command:
crontab -e
It will likely open the crontab file in the vi editor. On a blank line at the bottom of the file, enter the line below, which runs the script at midnight every Sunday. To change the schedule, use this cronjob time editor. Customize it with your path to the script.
0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py
If you want to create a log file to record each time the script ran, you can use this instead. Customize with your path to the script.
0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py > PATH_TO_FILE/FILENAME.log 2>&1
Save the crontab file and you’re good to go! Just note, your computer needs to be on at the time the cronjob is set to run.
Conclusion
So there you have it! As you can see, you can automate Screaming Frog and store the data without too much effort. Naturally, the next step would be to tap into the database with another script or an existing application to display or further analyze the data. Please follow me on Twitter to share feedback and to see interesting ways to extend the script. Enjoy!
Next up, see my guide: Automate GTmetrix with Python!
Have you tried throwing the stream line into a while loop that looks for the existence of the file? Something similar to this: https://stackoverflow.com/questions/21746750/check-and-wait-until-a-file-exists-to-read-it
Very interesting, I’ll try this out, thanks so much!
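For anyone who wants to experiment with that idea, here’s a minimal sketch of polling for the exported file instead of relying on the fixed time.sleep(600). It reuses the placeholder export path from the rename step (PATH_TO_OLD.csv) and assumes a hypothetical one-hour timeout:

### Poll for the exported CSV instead of sleeping for a fixed period (sketch)
import os
import time

export_file = 'PATH_TO_OLD.csv'   # the CSV Screaming Frog writes to the output folder
timeout = 3600                    # hypothetical cap: give up after an hour
waited = 0

while not os.path.exists(export_file) and waited < timeout:
    time.sleep(30)                # check every 30 seconds
    waited += 30

if os.path.exists(export_file):
    time.sleep(10)                # small buffer in case the file is still being written
    print(name + ": Crawl Finished")
else:
    print(name + ": timed out waiting for the crawl output")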