May 20, 2018

Unleashing AI on SEO, predict Google rankings with a click

Will I ever rank in the top 5 on Google? Can someone predict Google rankings for me? What if someone could answer these questions for you? What if, even before you write your next blog post, someone could tell you that your page will never rank for a keyword? But who could this someone be? Actually, no one but a machine: an AI-based algorithm!

SEO has matured and so has AI, so why not combine AI with SEO techniques to answer one of the most long-awaited questions in digital marketing?

“Can you tell me, well in advance, whether my page will rank in the top 5 on Google for a keyword? If not, what changes should I make?”

Impact on a Digital Marketer’s Life?

Getting straight to the point: as a digital marketer, you spend your entire day optimizing content for different keywords to the best of your knowledge. A few days after optimizing, you do a Google search only to find that your page is still not on the first page! It’s not that you’re not good enough; maybe the keyword’s difficulty was too high, or your website’s pages load too slowly. You never know which factors are important for the type of keyword you’re trying to rank for.

For example, in your first project you heavily optimized the page by stuffing your focus keyword anywhere and everywhere possible, keeping all other factors constant. In your second project you follow the same steps, because of course they worked the first time, and you hardly make an impact on the rankings. The reason is that the focus keyword is very different from the previous one, and maybe the competition here comes down to the quality of backlinks that your website has. In both scenarios the SEO strategy was spot on, but the way Google interprets results for these two focus keywords is very different.

What if an AI algorithm could simulate scenarios based on the type of keyword you’re trying to rank for and tell you why you won’t rank for it? Similarly, what if an AI algorithm could also perform predictive lead scoring for you? If these kinds of tools were just a click away, I’m sure you’d get a good night’s sleep that day!

So, without wasting any further time, let me walk you through how Znbound is planning to solve this long-awaited problem!

The Experiment

1. Getting the Data

Any AI system is only as good as its data. An AI model also needs to be trained before it can make any predictions, so feeding it high-quality, meaningful data is important.

One of the many obstacles in acquiring data was avoiding Google’s personalization of results for its users. It’s a trap: personalized results would give an inaccurate sample of data. Similarly, you need to know which Google scenario you are trying to re-create, for example your location, which is another way you could get very biased results. Keeping all these factors in mind, we started to scrape data for approximately 100 keywords, extracting 20 results for each keyword. That gives us around 2,000 results to play with, which I know isn’t a convincing size for a dataset, but it would at least give us an idea of the data quality we’ll be working with.

We only managed to extract the URL, rank and number of backlinks for a keyword.

Sample SEO Data

2. Preparing/Improving the Data

Getting this data wasn’t the hard part, but the following process is. As you can see in our sample SEO data, we have around 20 ranked URLs for every keyword, i.e. 20 different classes of data (one per rank position), with each class having only about 100 data points, which is too small a dataset for machine learning. To overcome this problem, let’s convert it into a binary classification problem. We’ll take the top 5 ranking URLs for every keyword and put them in one class, say “1”, and the other 15 URLs in class “0”.
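To make the relabeling concrete, here is a minimal Python sketch of the top 5 versus the rest split. The field names (`keyword`, `url`, `rank`, `backlinks`), the `label_results` helper and the sample rows are all hypothetical, made up for illustration; this is not the code used in the experiment.

```python
from typing import Dict, List

TOP_N = 5  # threshold from the article: ranks 1-5 go into class 1


def label_results(rows: List[Dict]) -> List[Dict]:
    """Return copies of the rows with a binary 'label' field added."""
    labeled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["label"] = 1 if row["rank"] <= TOP_N else 0
        labeled.append(new_row)
    return labeled


# Hypothetical scraped rows (values invented for illustration only).
sample = [
    {"keyword": "inbound marketing", "url": "https://example.com/a", "rank": 1, "backlinks": 120},
    {"keyword": "inbound marketing", "url": "https://example.com/b", "rank": 5, "backlinks": 45},
    {"keyword": "inbound marketing", "url": "https://example.com/c", "rank": 6, "backlinks": 80},
    {"keyword": "inbound marketing", "url": "https://example.com/d", "rank": 20, "backlinks": 3},
]

labeled = label_results(sample)
for row in labeled:
    print(row["rank"], row["label"])
```

With 20 results per keyword, this puts 5 URLs in class 1 and 15 in class 0 for every keyword, which is where the skew discussed next comes from.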

Distribution of data in the 2 classes

Now, you might be thinking that this is very skewed data I’ve created, but there are techniques for handling such datasets. So, without going into the technical part, let’s concentrate on which other factors we should track for these URLs.

One of the objectives of this experiment is also to tell users what optimizations they should perform to rank their URL better, not just whether they’ll rank. Considering this, we should confine our scope to tracking parameters that are under the control of a digital marketer. For every keyword, we’ve tracked around 30 factors that can later be optimized. Among these are some important ones that define the semantic strength between the keyword and the content found on that URL (this was made possible by ParallelDots), while others check keyword density, keyword frequency, page loading speed and many similar characteristics. Here is some analysis from the dataset, distributed among Top 5 URLs (Class: 1) and Other 15 URLs (Class: 0):

(These charts are after balancing the two classes)

As you can see, there is a noticeable distinction between the two classes, which our model will hopefully also discover during training. At this point, we can conclude that our data is ready to go into the modeling stage, and soon we can get some predictions from it!
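For readers curious about the balancing step mentioned above, here is one common option, sketched in Python: random oversampling, which duplicates minority-class rows until both classes are the same size. The article does not say which balancing technique was actually used, so treat this as an assumption; the `oversample_minority` helper and the toy rows are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def oversample_minority(rows: List[Dict], label_key: str = "label", seed: int = 0) -> List[Dict]:
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest one."""
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class: Dict[int, List[Dict]] = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced: List[Dict] = []
    for group in by_class.values():
        balanced.extend(group)
        # Top up smaller classes with random duplicates.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data for one keyword: ranks 1-20, top 5 labeled 1 (illustrative only).
toy_rows = [{"rank": r, "label": 1 if r <= 5 else 0} for r in range(1, 21)]
balanced = oversample_minority(toy_rows)
counts = Counter(row["label"] for row in balanced)
print(counts[1], counts[0])  # prints: 15 15
```

If you try this yourself, it is generally safer to balance only the training split, so that duplicated rows do not leak into the test data.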

3. Results

The results we achieved on the test data weren’t bad enough to call the experiment a failure, but they do demand improvement.

Accuracy to detect Class 1 URLs correctly: 66.6%

Accuracy to detect Class 0 URLs correctly: 73%
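Per-class figures like the two above are what is usually called recall: of all the URLs that truly belong to a class, the share the model put in that class. A minimal sketch of the calculation follows, assuming plain lists of true and predicted labels; the `per_class_accuracy` helper and the labels are made up for illustration and are not our test data.

```python
from typing import Dict, List


def per_class_accuracy(y_true: List[int], y_pred: List[int]) -> Dict[int, float]:
    """For each true class, the fraction of its items predicted correctly."""
    total: Dict[int, int] = {}
    correct: Dict[int, int] = {}
    for truth, pred in zip(y_true, y_pred):
        total[truth] = total.get(truth, 0) + 1
        if truth == pred:
            correct[truth] = correct.get(truth, 0) + 1
    return {cls: correct.get(cls, 0) / total[cls] for cls in total}


# Invented labels: four class-1 URLs and five class-0 URLs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]

result = per_class_accuracy(y_true, y_pred)
print(result)  # prints: {1: 0.75, 0: 0.8}
```

Reporting the two classes separately matters here because a model that always answers “0” would score well on overall accuracy for skewed data while never finding a single top 5 URL.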

Here’s a chart of what our model thinks is important for classifying the URLs into the two classes:

What’s Next?

Improvement! The results confirm that we’re heading in the right direction, but there’s still scope for improvement. Among the several things we’ve researched that could eventually improve the model is our data acquisition strategy.

Maybe the channel from which we gather our data, i.e. Google Search, is passing noise into our data; some call it the “Noisy Layer” of the search engines. This layer is also very dynamic: the results you see today may not be there tomorrow. The best way to address this problem could be to automate the scraping process so that it runs live; that way we can be assured that the model is trained on the most recent data and predicts accordingly.

After fixing some of the data quality issues, we’ll have a more robust model to work with, and soon you’ll be able to ask us “Can I rank in the top 5 on Google or not?” and we’ll be able to answer with a decent degree of confidence.

Keep checking back for updates on our model’s launch date!