24th May 2017
Who doesn’t like to turning big jobs into small jobs? As an SEO analyst, I do, and when I discovered that I could scrape insightful data into excel – better yet, a combination of insightful data – saving me from downloading excel sheets from several online tools to combine them.
In this blog, you’re going to learn how to extract data and information from APIs and websites in ways the webmasters would never have imagined!
*You’re going to need SEOToolsforExcel to perform some of the following formulas in Excel. It takes a few minutes to download and install this Excel plugin. If you don’t want to install this plugin, it’s possible to scrape with Google sheets using the ImportXML formula but I prefer to do it in Excel.
Let’s say you’ve discovered a ton of 404ing legacy URLs that have external backlinks and you need to set up relevant redirects to reclaim link equity, but these legacy URLs are structured like this… http://www.dafuqrly.com/Product.aspx?id=1337.
How do you conclude a relevant redirect for that? Using The Internet Archive Wayback Machine you can find a backup of old pages. There you can gain some insight as to what this page was about, then you can figure out a relevant redirect. However, if you have hundreds or URLs to check, it can take a while. Fortunately, you’re about to learn how to speed this process up!
The fastest way to gain insight about mysterious legacy URLs is to scrape the wayback machine for meta titles, these meta titles will provide enough information for you to assign an appropriate redirect.
Now you should have an ugly looking spreadsheet that contains meta titles for all of these mysterious URLs. Enjoy.
The method for retrieving meta titles from mysterious URLs does not require any paid subscriptions. The following spreadsheets may require paid subscriptions to Majestic, STAT etc so we’re just going to provide you with the ideas. Here are some resources that will help you learn how to scrape and to put these ideas into practice:
In the screenshot below, you’ll see the how the top ten URLs that rank for an example keyword can be accompanied by link metric data.
The spreadsheet that this screenshot is taken from only requires you to enter the keyword and you will receive this data (as long as you have this keyword tracked in STAT, or whatever ranking tool you use, and you need a subscription to MajesticSEO). By the power of scraping it would also be easy to extract information such as meta titles, meta descriptions and so on that can give more insight as to why these pages rank where they rank.
At the same time as appreciating STAT for its accuracy, insightfulness, simplicity and user friendliness, you can’t help but wish they’d use the data they have to its full potential.
The following chart displays how STAT’s interface would show you a ranking trend line for a given keyword. Pay attention to the sudden dips and complete drop offs:
The chart below shows how I’ve improved the use of STAT’s data for checking keyword cannibalisation issues. This has been achieved by scraping the top 100 URLs (that STAT provides every day for a given keyword for up to 6 months) and matching the given domain you want to check across a period of time.
The insight gained from here is: the URL represented by the grey line ranks the highest, but not most frequently. Fortunately, this URL happens to be the most relevant URL that I would choose to rank for the given keyword. Now that we know this, we can take steps towards making sure that the strongest, most relevant URL is the only undisturbed ranking URL!
Once you create an automatic/dynamic spreadsheet like this, you’ll just have to enter in new keyword IDs, wait a few minutes for Excel to scrape/process the data and viola.
Here are a couple of examples of my favourite scraping formulas and why I use them.
You can use XPathOnURL to extract information from an on-page element. The following formula will extract the on-page element we’re looking for, which is a hyperlink “/a” that’s classed with [email protected]=’r’” and =Dump means that it will auto fill the following rows in Excel with more results with just this one formula. So by entering this formula you will extract the top ten search results for the query “dogs”.
I won’t go into great detail, there are just some ideas: to make this formula more dynamic, or if you want to enter several keywords, you could split the URL up in this formula and use the =CONCATENATE formula to build the URL back up again but with something different to “dogs” in different cells.
Furthermore, you can use the MajesticSEO formulas that are pre-built into the SEO Tools for Excel plugin to get the link metrics of all the results displayed. This is similar to what I have done for the Automatically Combine Ranking Data with Link Metrics idea above.
Here’s the scenario and dilemma that I most often use this formula for:
Sometimes the 301s displayed by Screaming Frog will redirect to a page that 404s, but you don’t know that unless you go into Screaming Frog > Spider > Advanced and tick ‘Always follow redirects’, then you can extract a redirect chain report to see which URLs go to a 404 page. I do this quite often but only when I’m checking a lot of URLs.
If I only need to check a few hundred URLs, I will use the =UnshortURL formula because it follows a URL to the final destination and let you know if it 404s or not. This way you can just add a few hundred URLs to excel, enter this formula =UnshortURL, copy down and it’ll just take a minute to process.
Please share your favourite scrape formulas, guides and ideas with us in the comments below!
Jamal Hunter, SEO Analyst, Further