Google Paid Text Link Detection

March 5th, 2009

There was a time when SEO companies and affiliate websites could gain top organic ranking positions with relative ease.  That was about 3-4 years ago, before Google added filters to detect reciprocal and paid text links.

If this is your first exposure to reciprocal and paid links, Google’s blog post is good prerequisite reading: http://googlewebmastercentral.blogspot.com/2007/12/information-about-buying-and-selling.html.

Whether or not Google should be filtering such links has been the subject of many forum and blog debates.  Some have been so passionate about the subject that you would think they were preaching a gospel message of salvation.  Fortunately, that is not the subject of this post.  Rather, we will explore how Google may be performing their link filtering.

Two-Way Links

Google considers reciprocal links (also known as link trading or two-way links) to be a form of paid linking.  Although no money changes hands, the process involves bartering: you give me something and I give you something in return.  As such, Google considers these kinds of links to be biased and not representative of the public’s interest in a website, which is what Google attempts to reflect in its rankings.

It is pretty easy to understand how Google might detect reciprocal links.  Their database of links includes links from and to every website page.  So, with relative ease, Google can determine which incoming links may be offset by outbound links.
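To make the idea concrete, here is a minimal sketch of how reciprocal links could be found in a list of links.  This is our own illustration, not Google's method: the function names and the example domains are hypothetical, and a real link database would be far larger and would need more careful host normalization.

```python
from urllib.parse import urlparse


def host(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def reciprocal_pairs(links):
    """Given (source_url, target_url) pairs, return the host pairs that link to each other."""
    edges = set()
    for src, dst in links:
        a, b = host(src), host(dst)
        if a and b and a != b:  # ignore internal links and malformed URLs
            edges.add((a, b))
    # A pair is reciprocal when the reverse edge is also present.
    return {tuple(sorted((a, b))) for a, b in edges if (b, a) in edges}


# Hypothetical example data.
links = [
    ("http://www.a.example/partners.html", "http://b.example/"),
    ("http://b.example/links.html", "http://www.a.example/"),
    ("http://c.example/", "http://a.example/"),
]
print(reciprocal_pairs(links))
```

Here a.example and b.example link to each other and are flagged, while the one-way link from c.example is not.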

One-Way Links

But now, to the tougher question: How does Google algorithmically detect a paid text link (a one-way link)?  To answer that question, let’s first review what we do know about the subject.

  1. What Kind of Websites Offer Text Links? Of the 4 primary types of websites (E-commerce, Info/Corporate, Directories, and Publications, including blogs and forums), both Publications and Directories are most likely to have a business model based on ad revenue, which includes text links.  So, it follows that if a website has an ad revenue model, such as Publishers and Directories, they are the most likely types of websites to offer paid links.  Likewise, E-commerce and Info/Corporate types of sites are most likely to be purchasers of paid text links.
  2. Google Can Profile Website Types! Google can easily detect websites that fit Publication and Directory profiles.

    Directories have thousands of pages with outbound links.  Publications, on the other hand, have very few outbound links.  Most of their ads are served by ad serving systems that use internal linking schemes with 302 re-directs.  Although it would be relatively easy for Google to profile Publication websites on this merit alone, Google also has a list of over 9000 URLs of the most prominent online publishers – those that are part of  Google’s News network.  In a recent study conducted by Position Research,  about 1/2 of paid text links found were from domains on this list.

    If you would like a spreadsheet list of Google’s news partners, click here.  Link to this post as I will be changing the URL of the spreadsheet list frequently.  Credit to newsknife.com as the primary source for the raw data.

    But there are Publications and Directories that do not use their own ad serving system or are not part of Google’s News network – they use Adsense or other 3rd party ad serving systems like Clickbank or DoubleClick.  In these cases, links are constructed in a specific and repeatable manner – giving each of these types of links a specific ‘fingerprint’.  It is relatively easy to detect these link ‘fingerprints’ and add the websites that are using them to the Publication or Directory category.

  3. Google can Count Outbound Links/Page! Google knows how many outbound links that point to other domains are on a page.  A large number of outbound links suggests a directory page.  Fewer outbound links (fewer than 10) are common among a broad range of website types, so such pages are less likely to be profiled.
  4. Google can Count How Often an Outbound Link Appears on a Site! Google knows if the same outbound link appears multiple times on the same website.  If the same link appears on multiple pages, the chances are very high that it is a paid link – just like other advertisements.  But if an outbound link appears on just one page, then the chances are reasonable that the link could be part of editorial copy.
  5. Google can Analyze Link Proximity! Google can detect how closely nested text links appear on a page.  There are several processes that Google can use.  One method simply finds an outbound link pointing to another domain and then looks at the surrounding html code to see if other links exist.  If the majority of surrounding characters are all anchor text, then the probability is greater that links are paid text links.
  6. Google can Analyze Relevancy! Google can determine whether a text link target (the page where the link points) is consistent with the content on the page.  One simple way Google may be using is to examine the anchor text of the links to determine if the page content is consistent with the phrases in the anchor text.  For example, if the anchor text reads, “ring tones” on a page that is about fishing, and there are no words on the page (except for the anchor text) that are even closely related to “ring tones,” then Google assumes the anchor text is not related.  Another way is to compare the page title with the title of the target page.  With these 2 strings of text, Google can apply an algorithm that determines the general relevance of the 2 strings.  If the relevance is weak, then Google may consider the likelihood of paid text links to be high.
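Three of the signals above (outbound link count, site-wide repetition, and anchor text relevance) can be sketched in a few lines using Python's standard HTML parser.  This is only an illustration under simple assumptions: the class and function names are ours, the example pages are made up, and "relevance" is reduced to a crude check for shared words between the anchor text and the rest of the page copy.  Link proximity is left out to keep the sketch short.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collects (href, anchor_text) pairs and the page text outside of links."""

    def __init__(self):
        super().__init__()
        self.links, self.body_text = [], []
        self._href, self._anchor = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        (self._anchor if self._href is not None else self.body_text).append(data)


def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def page_signals(page_url, html):
    """Return outbound link targets and those whose anchor text shares no word with the copy."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    own_host = urlparse(page_url).netloc.lower()
    outbound = [(h, t) for h, t in parser.links
                if urlparse(h).netloc and urlparse(h).netloc.lower() != own_host]
    copy_words = words(" ".join(parser.body_text))
    irrelevant = [h for h, t in outbound if words(t) and not (words(t) & copy_words)]
    return {"outbound": [h for h, _ in outbound], "irrelevant": irrelevant}


def sitewide_repeats(pages):
    """pages maps page URL to HTML; return outbound targets found on more than one page."""
    counts = Counter()
    for url, html in pages.items():
        counts.update(set(page_signals(url, html)["outbound"]))
    return {target for target, n in counts.items() if n > 1}


# Hypothetical pages on a fishing site.
pages = {
    "http://fishing.example/trout.html":
        '<p>Trout fishing tips for rivers.</p>'
        '<a href="http://tones.example/">ring tones</a> '
        '<a href="http://tackle.example/">fishing tackle</a>',
    "http://fishing.example/bass.html":
        '<p>Bass fishing in lakes.</p>'
        '<a href="http://tones.example/">ring tones</a>',
}
trout = page_signals("http://fishing.example/trout.html",
                     pages["http://fishing.example/trout.html"])
print(len(trout["outbound"]), trout["irrelevant"], sitewide_repeats(pages))
```

In this made-up example the “ring tones” link is flagged twice: its anchor text has nothing in common with the fishing copy, and it appears on more than one page of the same site.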

So, in summary, Google knows:

  • The websites that are the most likely candidates to engage in selling paid text links
  • Whether there are several text links in close proximity of one another
  • If the same link appears multiple times on the same website
  • Whether the link anchor text or target is relevant to the content on the page

This means that if it (a text link) walks like a duck, quacks like a duck, and looks like a duck, the chances are good that it’s a duck (a paid text link).  With this level of information, Google can make some pretty good guesses whether a text link is legitimate or paid.  And Google will be right most of the time.

Is it a “Duck”?:

| Criteria                                  | Case A    | Case B    | Case C  |
|-------------------------------------------|-----------|-----------|---------|
| Type of website hosting backlink          | Publisher | Publisher | Org/Edu |
| Number of outbound links on page          | 15        | 8         | 25      |
| Number of target links on site            | >1        | 1         | 1       |
| Links closely nested                      | YES       | YES       | YES     |
| Link anchor text is relevant to page copy | NO        | YES       | NO      |
| Likelihood of Paid Text Link              | Very High | Low       | Low     |

Let’s consider 3 hypothetical examples.  Case A describes a link located on a Publisher’s website that is a partner of Google’s news network.  The link is among 15 other links on the same page.  And the same link appears on more than one page on the same website.  With this information alone, there is high confidence that the link is paid.  The fact that the link is nested with many others and some of the other links are not relevant to the page subject puts the nail in the coffin.  It’s a “duck”.

For case B, suppose an online magazine posts an article that includes a list of 7 or 8 prominent companies (with links).  These are legitimate editorial links – not paid links.  But in all likelihood, these links pass through the publisher’s own internal web systems that monitor visitor behavior and are not search engine friendly (they will not count toward SEO).  But for the sake of this example, let’s suppose these links are search engine friendly.  In this case, the links are modest in number (fewer than 10) and the target sites are very relevant to the article subject matter.

Even though these links reside on a publisher’s site, which in and of itself makes them suspicious, Google may let these links ‘slide’ and count toward the target site’s SEO rankings because the links, as a group, are relevant to the page subject matter.  It also helps that the list is relatively short (fewer than 10) and the links do not appear on other pages of the same website.

As our final example (case C), consider a .org or .edu site listing a number of websites that are part of its organization.  Google would recognize that neither type of website is a Publication and that neither is likely to rely on an advertising business model.  Google may be more likely to count these types of links toward SEO ranking, even though they might be paid, especially if the number of links is modest and no ad serving system or Adsense listings are found.
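The three cases can be expressed as a small scoring sketch.  To be clear, the weights and thresholds below are our own invention, chosen only so that the output matches the table above; nobody outside Google knows how (or whether) such signals are actually weighted.

```python
# Hypothetical weights: site types that typically run on an ad revenue model.
AD_SUPPORTED = {"publisher", "directory"}


def paid_link_score(site_type, outbound_links, pages_with_link, nested, relevant):
    """Add up illustrative 'duck' points for one link."""
    score = 0
    if site_type.lower() in AD_SUPPORTED:
        score += 2  # site profile suggests an ad revenue model
    if pages_with_link > 1:
        score += 2  # same link repeated across the site
    if outbound_links >= 10:
        score += 1  # many outbound links on the page
    if nested:
        score += 1  # links packed closely together
    if not relevant:
        score += 1  # anchor text unrelated to page copy
    return score


def likelihood(score):
    """Map a score to a label; the cut-offs are assumptions."""
    if score >= 6:
        return "Very High"
    if score >= 4:
        return "Moderate"
    return "Low"


# (site type, outbound links, pages carrying the link, nested, relevant)
# Case A's ">1" pages is represented here as 2.
cases = {
    "A": ("Publisher", 15, 2, True, False),
    "B": ("Publisher", 8, 1, True, True),
    "C": ("Org/Edu", 25, 1, True, False),
}
for name, signals in cases.items():
    print(name, likelihood(paid_link_score(*signals)))
```

The point of the sketch is that no single signal decides the outcome; it is the combination that makes a link look like a “duck”.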

The Tattletale System:

Google’s paid link algorithm cannot detect all paid links.  But remember: Google has a wild card.  They have their tattletale system.  Sites that engage in paid text links and are somehow flying below Google’s radar may still get caught.  That is because Google encourages webmasters (that includes your competitors) to tattletale on a site if it is thought to be engaging in paid text link practices.  Reid Yokoyama, Google Search Quality Engineer, claims thousands of reports have been received:

“Even though we work hard to discount these links through algorithmic detection, if you see a site that is buying or selling links that pass PageRank, please let us know. Over the last year, users have submitted thousands and thousands of paid link reports to Google, and each report can contain multiple websites that are suspected of selling links.” http://googlewebmastercentral.blogspot.com/2008/06/impact-of-user-feedback-part-1.html

Google has suggested they may take punitive action on both the seller and purchasers of paid text links.  So if you are engaging in paid text links, it is likely that Google knows and has already discounted the value of the link from an SEO ranking perspective.  These links may still have value from an advertising point of view, but not likely from an SEO point of view.

Google Chaos

December 13th, 2008

What happens when your website falls out of Google’s index?  Most people react with panic.  But after seven (7) years of reading forum threads whose contributors have suffered similar fates, we know panic is the last thing you should do.

From the outside looking in, Google seems like a well-behaved giant.  Users rarely see Google errors.  And Google is not in the habit of issuing press releases when they do have a technology issue.  However, those who follow Google closely know better.

Google has had a history of technology goofs; many of Google’s updates don’t go as planned.  And for a few unlucky souls whose livelihoods are tied to the ‘giant’, the world comes crashing down when Google goofs.  When Google Chaos happens:

  • Websites, for no apparent reason, lose significant rankings.
  • Pages appear to be de-indexed.
  • And no matter what the Webmaster does to try to reverse the condition, in the short term, Google chaos persists.

Self Inflicted Wounds

To be fair, not all of these circumstances are Google’s fault.  Sometimes, Webmasters inadvertently do something that creates problems for Google.  Here is a short list that covers some fatal moves:

  1. A Webmaster decides to create a new home page and changes the URL to www.domain.com/home.  Furthermore, the Webmaster uses a 302 server re-direct from www.domain.com to www.domain.com/home.  And, finally, the Webmaster strips the content off the old /index.html page.  All works perfectly in a browser but Google sees something entirely different – Google no longer sees content on www.domain.com and does not necessarily follow a 302 server re-direct.  As a consequence, /home never inherits the SEO value associated with www.domain.com.
  2. A Webmaster accidentally makes an adjustment to the robots.txt file that disallows a primary directory.  Google still knows about the pages but stops ranking all pages in that directory.
  3. A Webmaster makes an adjustment and adds a new, slick looking JavaScript menu.  Google does not typically read JavaScript and no longer follows the links in the menu.  As a consequence, ranked pages disappear from Google’s index.
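The robots.txt mistake in item 2 is easy to check for before it does damage.  Below is a minimal sketch using Python's standard `urllib.robotparser` module; the helper name, the robots.txt content, and the list of URLs are made up for illustration, and the parser follows the general robots.txt conventions rather than any Google-specific behavior.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt that accidentally blocks a primary directory.
robots_txt = """\
User-agent: *
Disallow: /products/
"""
important = [
    "http://www.domain.com/",
    "http://www.domain.com/products/widget.html",
]
print(blocked_urls(robots_txt, important))
```

Running a check like this against your most important pages after every robots.txt change takes seconds and can save weeks of lost rankings.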

Google Volatility

In other cases, Webmasters report wide swings in rankings.  This symptom is not unusual.  In May 2008, Matt Cutts, an engineer on the Google Spam team, went on record saying that the Search Engine Titan was conducting a major experiment, code named “Dewey.”  Webmasters in the SEO community have observed that pages with little or no PageRank, which have never shown up in the top 100 SERPs (Search Engine Result Pages), are now displacing other sites that had ‘page one’ rankings for more than five (5) years. Others observe that site rankings fluctuate +/- 30 positions at different times of the day and some have reported ranking fluctuations of more than 50 positions within the same day.

Position Research has been observing these conditions for many months. We speculate that Google is performing live testing similar to the tests described in a recent white paper titled “Search Engines that Learn from Implicit Feedback.”  The premise of the paper is that search engines can determine website relevance by analyzing which listings users do not click.  This testing requires that Google bring websites that may not otherwise deserve high ranking into a top position for a short period of time in order to record user behavior.

Data Loss

Many times, something happens within the Google system and data gets lost.  You might think: “How could data get lost”?  The answer has to do with understanding Google’s spidering and reporting network. Based on latest reports, Google maintains over 200,000 spidering servers.  These servers are constantly crawling website pages.  When you consider that over 100,000 new website pages are added daily, and that the current number of website pages is estimated at over 80 BILLION, you begin to understand the enormity of the task.  Other servers are consolidating and synchronizing this information so that data can be compiled into ranking data.

There is another set of servers dedicated to serving search results to the public.  These servers are clustered in datacenters spread throughout the world.  At last count, Google has over 40 datacenters with more than 750 IP addresses, each comprising several servers.  Check out http://www.seocritique.com/datacentertool/ for a humbling view of Google datacenters.

Now we all know that (data) electrons are obedient most of the time, but not all the time.  Hard drives crash.  Data packets get lost during transit from one location to another.  And sometimes hardware fails during critical transmissions.  You can begin to understand how enormous Google’s task is and how easy it might be to lose data during all the consolidation / synchronization / compiling steps involved.

So what happens when data is lost?  That depends on what data is lost.  If the lost data is compiled data, then Google simply recompiles.  But if the data is original data, then Google must re-gather and then re-compile.  This can take time – weeks if not months.

Filter Traps

Google filters are another story.  In part, Google compiles page data to determine page attributes – and Google collects over 300 unique attributes for each page.  If Google determines that there is a combination of negative attributes sufficient to merit a ranking adjustment, then rankings decline.  But these attributes are only reasonably predictive when combined with other attributes.

Google filters are based on statistics – and in statistics, the larger the sample, the higher the correlation.  In a perfect world, Google would have all the page attributes it needs and unlimited computing power to reach very high correlation coefficients.  Under these circumstances, Google would be able to detect the bad from the good websites with 100% accuracy.  But in reality, Google doesn’t have enough attributes or computing power.  Therefore, their filters are less than perfect.  In other words, Google presumes that if ‘it’ walks like a duck, quacks like a duck, and smells like a duck, ‘it’ is probably a duck, but not certain.  ‘It’ may be a goose.  So to some degree, Google’s filters are throwing some ‘baby’ out with the ‘bathwater’.

Fireworks really start flying when Google introduces a new filter in an effort to improve rankings.  Invariably, some website pages are collateral damage.  It gets more interesting when Google starts turning the dials on these filters.  Website pages pop back in and out like popcorn.  The Webmaster’s hope is that Google engineers optimize their filter algorithms and minimize collateral damage.  But there are always some pages that get the ‘short end of the stick’.

Minimize Google Chaos

So what can you do to avoid Google Chaos?  First, recognize that Google offers rankings for free and as such is not obligated to be bug free.  Second, realize Google ‘love’ goes to those whom Google chooses (through its complex algorithms).  And third, benchmark, benchmark, benchmark.

Google bugs and Google ‘love’ are things you cannot control.  But benchmarking is something you can control, because when you record and log metrics (i.e. critical observations) you can better determine what course of action you should take.

Here is a list of metrics that you should record:

  • Keyword rankings – Track Google rankings on a daily basis – weekly is not good enough because you need to know if a poor ranking condition is temporary or permanent. You also need to know the exact date when rankings declined so that you can compare your date with that of other Webmasters who may have experienced a similar condition on the same date.
  • Google ‘cache’ query – Make sure Google is caching your pages and check cache dates.
  • Google ‘info:’ query – Make sure Google is reporting ‘info:’ query results. If your page does not show results for an ‘info:’ query, something is wrong.
  • Google ‘URL’ query – Make sure Google is reporting results when a page URL is entered. Your page URL should be at or near the top of the results.
  • Google Webmaster account – Record changes to and observations reported by Google.
  • Record and log all website navigation and infrastructure changes. This can include changes to robots.txt and sitemap.xml files, your Google Webmaster account, and server changes.

Unexpected Google results for any of the Google queries may be a Google glitch or may be indicative of something more serious.  Apart from keyword rankings, which should be tracked daily, these checks may be performed on a weekly basis, or daily if you notice something unusual.  A Google Webmaster account should be checked on a monthly basis; more frequently if any other metric is concerning.
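A daily ranking log pays off when you need to separate a one-day blip from a lasting decline and to pin down the date it began.  Here is a minimal sketch of that analysis; the function name, the 10-position threshold, the 3-day window, and the sample data are all assumptions chosen for illustration.

```python
from datetime import date


def first_sustained_drop(history, threshold=10, days=3):
    """history is a date-sorted list of (date, rank) pairs, where rank 1 is best.

    Return the first date on which the rank became at least `threshold`
    positions worse than the previous entry and stayed that bad for `days`
    consecutive entries, or None if no such drop exists.
    """
    for i in range(1, len(history) - days + 1):
        baseline = history[i - 1][1]
        window = history[i:i + days]
        if all(rank >= baseline + threshold for _, rank in window):
            return history[i][0]
    return None


# Hypothetical daily log for one keyword: a one-day blip on Dec 3,
# then a lasting decline that starts on Dec 6.
history = [
    (date(2008, 12, 1), 4),
    (date(2008, 12, 2), 5),
    (date(2008, 12, 3), 30),
    (date(2008, 12, 4), 4),
    (date(2008, 12, 5), 5),
    (date(2008, 12, 6), 40),
    (date(2008, 12, 7), 42),
    (date(2008, 12, 8), 45),
]
print(first_sustained_drop(history))
```

With weekly tracking, the blip and the real decline in this example would be indistinguishable, and the start date would be a guess.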

Making Sense of Google Chaos

If Google Chaos strikes your website and you have benchmarked Google metrics, you will be armed with the kind of information necessary to determine your next course of action. 

If your site loses rankings, the first thing to determine is whether your Google metrics have changed and whether your experience correlates with that of other webmasters.  Check the forums to see if something unusual is happening.  If your observations seem to be isolated, then the problem is likely to be self-inflicted.  Check your logs and start reverting to known stable conditions.  Then allow Google to react to these changes, which may take weeks or months depending on what changes were made.  As a rule of thumb, the time Google took to react to the original change is the time Google will need to react again.  Observing your Google metrics will help determine whether your actions are making a real difference.

If your experience is not isolated and others are reporting the same conditions, it is probably a Google error.  Don’t panic.  Most of the time, Google corrects its mistakes within 1-2 weeks. 

But if the reason for Google’s reaction is based on a new filter, hang on to your hat.  It may take much more time for Google to sort things out.  And even when it does, your site may be part of an ‘elite’ few that is considered acceptable collateral damage.

How can you tell if your site is part of collateral damage?  This is pretty tough – it is a process of elimination.  First, wait and make sure that a Google bug has not caused your situation.  The forums can help determine this condition.  Second, make sure Google filters are optimized and stable.  Again, forum activity will help determine this condition.  Only after ruling out both of these conditions should a more radical approach be considered.

If the forums are quiet and your site is still in Google Chaos, start an extensive research effort that considers any and all page attributes.  Check outbound links.  Check inbound links.  Check duplicate and near-duplicate content.  Check everything you can think of and start making site adjustments.  It just may be something really subtle that needs to be changed so that Google’s filters think your website is a ‘goose’ and not a ‘duck’.