
The underrated power of scraping with Excel

by Zak Ireson
7th Oct 2016 - 9 min read

Who doesn't like turning big jobs into small jobs? As an SEO analyst, I do, and I was delighted to discover that I could scrape insightful data straight into Excel (better yet, a combination of insightful data), saving me from downloading Excel sheets from several online tools just to combine them.

In this blog, you're going to learn how to extract data and information from APIs and websites in ways webmasters would never have imagined!


*You're going to need SEO Tools for Excel to run some of the following formulas. It takes a few minutes to download and install this Excel plugin. If you don't want to install it, it's possible to scrape with Google Sheets using the IMPORTXML formula, but I prefer to do it in Excel.
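If you do go the Google Sheets route, a sketch of the equivalent of the title scrape later in this post looks like this (assuming the URL you want to check sits in cell A2):

=IMPORTXML(A2,"//title")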

How to find relevant redirects for mysterious URLs

Let’s say you’ve discovered a ton of 404ing legacy URLs that have external backlinks and you need to set up relevant redirects to reclaim link equity, but these legacy URLs are structured like this… http://www.dafuqrly.com/Product.aspx?id=1337.

Whuuuurrrt?

How do you work out a relevant redirect for that? Using the Internet Archive's Wayback Machine, you can find backups of old pages. There you can gain some insight into what each page was about, and then figure out a relevant redirect. However, if you have hundreds of URLs to check, it can take a while. Fortunately, you're about to learn how to speed this process up!

Scraping meta titles from the web archive

The fastest way to gain insight about mysterious legacy URLs is to scrape the Wayback Machine for meta titles; these meta titles will provide enough information for you to assign an appropriate redirect.

Method

  • Title cell A1 'URLs' and paste the URLs you want meta titles for into column A, starting at cell A2.
  • Put http://web.archive.org/web/2010*/ into cell B1. You'll have to guesstimate which year is most appropriate; as you can see, I've used 2010 in this example. Any URLs that end up returning #VALUE or #N/A weren't archived in that year, and it'd be worth retrying them with a different year.
  • Put this formula in cell B2: =CONCATENATE($B$1,A2), and apply it down.
  • Put /web/2010 into cell C1.
  • Put this formula in cell C2: =Resize(TRANSPOSE(RegexpFindOnUrl(B2,".*"&$C$1&".*",0))), and apply it down. *If there are a lot of URLs, you may want to do this gradually, perhaps 100 URLs at a time.
  • Now it's time to strip away the code that surrounds the URL of the archived page we want. Copy and PASTE AS VALUES the contents of column C into column D.
  • Put this formula in cell E2: =LEFT(D2,FIND("""",D2,30)-1), and apply it down.
  • Copy and PASTE AS VALUES the contents of column E into column F.
  • Highlight column F, press CTRL+H, put <a href=" into Find and click Replace All. This replaces every instance with nothing. Sometimes a stray "on" will appear, so do the same again, replacing all instances of "on" with nothing.
  • Put http://web.archive.org into cell G1.
  • Put this formula in cell G2: =CONCATENATE($G$1,F2), and apply it down.
  • I haven't figured out why, but random quotation marks and spaces sometimes appear. Copy the URLs from column G into Notepad, replace all instances of quotation marks with nothing, then copy the space between .org and /web and replace all of those spaces with nothing. Paste the cleaned URLs back into column G, starting at cell G2.
  • Put this formula in cell H2: =XPathOnUrl(G2,"//title"), and apply it down. *Again, if there are a lot of URLs, you may want to do this 100 or so at a time.
  • If this hasn't worked for some URLs, see step 2: copy and paste them into a different tab and process them with a different year.

Now you should have an ugly-looking spreadsheet that contains meta titles for all of these mysterious URLs. Enjoy.
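For reference, here's the finished column layout in one place (using the example year 2010; swap in whichever year fits your URLs):

A1: URLs                                A2 down: your legacy URLs
B1: http://web.archive.org/web/2010*/   B2: =CONCATENATE($B$1,A2)
C1: /web/2010                           C2: =Resize(TRANSPOSE(RegexpFindOnUrl(B2,".*"&$C$1&".*",0)))
D:  column C, pasted as values
E2: =LEFT(D2,FIND("""",D2,30)-1)
F:  column E, pasted as values, with <a href=" and stray "on" replaced with nothing
G1: http://web.archive.org              G2: =CONCATENATE($G$1,F2)
H2: =XPathOnUrl(G2,"//title")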

Data scraping ideas – combining datasets and more

The method above for retrieving meta titles from mysterious URLs doesn't require any paid subscriptions. The following ideas may require paid subscriptions to Majestic, STAT and so on, so we're just going to outline them; plenty of resources online will help you learn how to scrape and put these ideas into practice.

Automatically combine ranking data with link metrics

In the screenshot below, you'll see how the top ten URLs that rank for an example keyword can be accompanied by link metric data.

[Screenshot: the top ten ranking URLs for an example keyword alongside their Majestic link metrics]

The spreadsheet this screenshot is taken from only requires you to enter the keyword, and you'll receive this data (as long as you have the keyword tracked in STAT, or whatever ranking tool you use, and you have a subscription to MajesticSEO). By the power of scraping, it would also be easy to extract information such as meta titles and meta descriptions, which can give more insight into why these pages rank where they do.
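As a hedged sketch of that last point (assuming the ranking URLs sit in column B, starting at B2), the same XPathOnUrl function used elsewhere in this post will pull each result's meta title and meta description:

=XPathOnUrl(B2,"//title")
=XPathOnUrl(B2,"//meta[@name='description']","content")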

Turn STAT into a keyword-cannibalisation insights powerhouse

Much as you appreciate STAT for its accuracy, insightfulness, simplicity and user-friendliness, you can't help but wish they'd use the data they have to its full potential.

The following chart shows how STAT's interface displays a ranking trend line for a given keyword. Pay attention to the sudden dips and complete drop-offs:

[Screenshot: STAT's ranking trend line for a single keyword, showing sudden dips and drop-offs]

The chart below shows how I've improved the use of STAT's data for checking keyword cannibalisation issues. This has been achieved by scraping the top 100 ranking URLs (which STAT provides every day for a given keyword, for up to six months) and matching them against the domain you want to check across a period of time.

[Screenshot: ranking positions over time for each of the domain's URLs appearing in the keyword's top 100]

The insight gained here: the URL represented by the grey line ranks the highest, but not the most frequently. Fortunately, this URL happens to be the most relevant URL that I would choose to rank for the given keyword. Now that we know this, we can take steps towards making sure that the strongest, most relevant URL is the only undisturbed ranking URL!

Once you create an automatic/dynamic spreadsheet like this, you'll just have to enter new keyword IDs, wait a few minutes for Excel to scrape and process the data, and voilà.
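If you fancy building something similar, the core matching step is plain Excel. A minimal sketch (the cell layout here is just an assumption: each day's top 100 URLs in column C, the domain you're checking in cell B1):

=IF(ISNUMBER(SEARCH($B$1,C2)),C2,"")

Applied down each day's list, this keeps only your own domain's ranking URLs, ready to chart over time.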

Examples of my favourite scrape formulas

Here are a couple of examples of my favourite scraping formulas and why I use them.

=XPathOnUrl

You can use XPathOnUrl to extract information from an on-page element. The following formula extracts the on-page element we're looking for: a hyperlink (/a) inside an h3 heading with the class 'r' (the //h3[@class='r']/a part, which matches Google's search result titles). Wrapping it in =Dump means Excel will auto-fill the following rows with the remaining results from this one formula. So by entering this formula you'll extract the top ten search results for the query "dogs".

=Dump(XPathOnUrl("http://www.google.com/search?q=dogs&num=10","//h3[@class='r']/a","href"))

See the SEO Tools for Excel documentation for more information about XPathOnUrl.

I won't go into great detail; here are just some ideas. To make this formula more dynamic, or if you want to enter several keywords, you could split the URL up in this formula and use =CONCATENATE to rebuild it with something other than "dogs" pulled from different cells.
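A hedged sketch of that idea (assuming your keywords sit in column A, starting at A2):

=Dump(XPathOnUrl(CONCATENATE("http://www.google.com/search?q=",A2,"&num=10"),"//h3[@class='r']/a","href"))

Pointing the same formula at A3, A4 and so on gives you a top ten per keyword.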

Furthermore, you can use the MajesticSEO formulas that are pre-built into the SEO Tools for Excel plugin to get the link metrics of all the results displayed. This is similar to what I have done for the Automatically Combine Ranking Data with Link Metrics idea above.

=UnshortURL

Here’s the scenario and dilemma that I most often use this formula for:

Sometimes the 301s displayed by Screaming Frog redirect to a page that 404s, but you won't know that unless you go into Screaming Frog > Spider > Advanced and tick 'Always follow redirects'; you can then export a redirect chains report to see which URLs end up at a 404 page. I do this quite often, but only when I'm checking a lot of URLs.

If I only need to check a few hundred URLs, I'll use the =UnshortURL formula instead, because it follows a URL to its final destination and lets you know whether it 404s or not. This way you can just add a few hundred URLs to Excel, enter the =UnshortURL formula, copy it down, and it'll only take a minute to process.
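A minimal sketch (assuming your URLs start in cell A2; pairing it with SeoTools' =HttpStatus to read the final status code is optional, if your version includes that function):

B2: =UnshortUrl(A2)
C2: =HttpStatus(B2)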

See the SEO Tools for Excel documentation for more information about =UnshortURL.

Tips:

  • If you get really good at scraping and make massive spreadsheets, you can speed up the processing time by wrapping your formulas in =IF to check a cell first: if it's empty, do nothing; if there's a URL, scrape something.
  • You can also set up a switch so that scraping only starts once you change a certain cell to "Yes". This prevents formulas from re-scraping every time you alter dates, keywords or sites, for example. You may notice yourself creating settings tabs in Excel as you level up as a scraper! Both tips are sketched below.
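Minimal sketches of both tips (cell references are examples only):

=IF(A2="","",XPathOnUrl(A2,"//title"))
=IF($B$1="Yes",XPathOnUrl(A2,"//title"),"")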

Please share your favourite scrape formulas, guides and ideas with us in the comments below!

Jamal Hunter, SEO Analyst, Further
