More free money from Google for site scrapers
Back in December I wrote an article on how site scrapers were gaming Google’s search algorithm to make advertising money from articles they scrape from RSS feeds. Just after I wrote that piece, Google got religion about spam sites and revamped their search results to penalise content farms. Yet the problem with scrapers remains.
Here’s how it works. Someone who doesn’t have original content picks a few newsworthy topics to follow – typically politics and business. The scraper then sets up a website oriented around those topics and makes sure to check all the boxes that define a normal high-content, multi-author site: Twitter and Facebook accounts linked to the site, an ‘about this site’ page, an ‘about our team’ page, and a terms of service page. The scraper then finds out which high-quality, high-ranking sites offer full RSS feeds, imports the content from those feeds, and duplicates it on the scraper site. After the scraper has stolen enough content and optimized the site with keywords that the search engines deem most relevant to the niche, it submits the site for inclusion in Google News. Once Google News includes the site, validating it as a reputable news source to Google users, the scraper can make money from advertising because it has a guaranteed stream of visitors via Google search and Google News.
Two weeks ago, Yves Smith wrote a post at Naked Capitalism alerting us that evil site scrapers are back! She mentioned Zmarter, one flagrant violator I called out in December. I have since shaken them off by filing a DMCA takedown notice (a notice under the Digital Millennium Copyright Act that they are infringing my copyright). But apparently they are still at it, scraping other sites’ content. There are many sites of this ilk out there, but Yves mentioned another scraper site, favstocks.com, that is now getting a lot of traffic from Google News as well. When I looked up favstocks.com, I saw a lot of content from Credit Writedowns on their site – all of the links in the posts had been stripped out in order to prevent ‘link value leakage’ (typical search engine optimization nonsense). There was a lot of content from other leading finance blogs and sites as well: Zacks Research, Mike Konczal, Naked Capitalism, Pragmatic Capitalism, Mike Shedlock and Econbrowser.
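To make concrete just how little work this involves, here is a minimal sketch of the import step described above, including the link-stripping I found on the favstocks copies. It is purely illustrative: the feed URL is hypothetical, and the use of the feedparser and BeautifulSoup libraries is my assumption, not anything favstocks is known to run.

```python
# A minimal, illustrative sketch of the scraping step described above: pull
# full-text posts from a victim's RSS feed and strip outbound links, the same
# link-stripping behaviour observed on the favstocks copies.
# Assumes the third-party feedparser and beautifulsoup4 packages.
import feedparser
from bs4 import BeautifulSoup

FEED_URL = "https://example.com/feed"  # hypothetical full-content RSS feed

def scrape_feed(feed_url=FEED_URL):
    feed = feedparser.parse(feed_url)
    posts = []
    for entry in feed.entries:
        # Full feeds expose the entire post body; summary feeds do not.
        html = entry.content[0].value if "content" in entry else entry.get("summary", "")
        soup = BeautifulSoup(html, "html.parser")
        # Remove every link but keep the anchor text (the 'link value leakage' logic).
        for a in soup.find_all("a"):
            a.unwrap()
        posts.append({"title": entry.title, "body": str(soup)})
    return posts
```

The point is not the code; it is that a full feed hands a scraper clean, ready-to-republish HTML, which is why summary feeds blunt the whole scheme.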
I wrote to Google News and posted a note on Google News’ support forum but have received no response. I know of at least two other bloggers who are upset about this. James Hamilton of Econbrowser told me that favstocks was outranking him in Google search even though his site is linked to by all of the top bloggers and financial news sites and is well-respected. He wrote in response to my note:
FavStocks is a rogue site which routinely reproduces material from www.econbrowser.com despite having been repeatedly instructed that they are doing so without permission. The site FavStocks unquestionably should be banned from Google News.
So here you have a well-respected PhD economist, Chairman of the Economics Department at UCSD, a major American university, and a blogger since 2005, being outranked for the content he actually wrote by a bunch of yahoos stealing his content and re-posting it. Do you see the problem here? This is exactly why Vivek Wadhwa says Google Search Still Needs ‘A Lot More Work’.
Let me give you a feel for how this has played out. Around the time Yves was complaining about favstocks, I went to their site and left a comment in their Disqus comment section asking that they remove content from Credit Writedowns. I also contacted them through the contact form on their site. No response. So I submitted a DMCA notice against them to Google AdSense, their main advertiser at the time. That got this gibberish e-mail response:
Edward Harrison,
FavStocks is a registered service provider with the US Copyright Office. The DMCA provides service providers a safe harbor in a case of copyright infringement.
The DMCA notice you sent to Google is improper and they can not do anything to help you. The proper notice should have been sent to our registered designated agent as listed on our site and also listed on the US Copyright Office website. Please see the link below for our DMCA compliant information.
https://www.favstocks.com/copyright/
Even though your notice was improper and sent to the wrong person we did however identify and locate the content from the complaint and disabled the infringing content. It is our policy to disable accounts of repeat infringers.
Anthony
FavStocks.com
This got me banned from their Disqus comment section AND caused them to switch away from AdSense to other advertisers. It did have the desired effect, though: my content is no longer scraped. But I have switched to summary feeds because I am sick of this game.
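For anyone wondering what the switch to summary feeds amounts to in practice: instead of putting the full post body into each feed item, you publish only a short plain-text excerpt. Most platforms have this built in (WordPress, for example, lets you choose full text or summary for feeds); the sketch below is just the idea in standard-library Python, with hypothetical names.

```python
# A minimal sketch of the 'summary feed' idea: emit a short plain-text excerpt
# in each RSS item instead of the full post body. Standard library only.
import re
from html import unescape

def to_excerpt(post_html, max_words=60):
    """Strip tags and truncate to the first max_words words."""
    text = unescape(re.sub(r"<[^>]+>", " ", post_html))  # crude tag strip
    words = text.split()
    excerpt = " ".join(words[:max_words])
    return excerpt + ("…" if len(words) > max_words else "")

# Usage: put the excerpt, not the full body, into each feed item's description.
# item_description = to_excerpt(post.body_html)
```

A scraper can still lift the excerpt, but it no longer gets a complete article to republish.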
Clearly, if you go to the site you will see that FavStocks’ entire business model is built on scraping content. They are just covering their bases in order not to get penalised by the search engines.
In the end, I see this as a sign of flaws in Google’s business model, which relies far too much on a lack of human intervention. It makes customer service atrocious. That’s why I have yet to receive a response from Google News, and that’s why FavStocks is still a Google News provider despite having almost no original content and scraping the majority of what it publishes.
Google’s business model is dependent on scalability, which gives them tremendous operating leverage: they can scale their individual business lines without a large amount of additional cost, so if they have high growth, the revenue is supposed to fall to the bottom line. I think Google is starting to reach the point where that operating leverage has dissipated. Their costs are growing exactly because of these kinds of situations. You need human intervention because computers simply are not sophisticated enough to reliably discern original content from copied content. I anticipate this new lack of scalability will be a challenge Google will continue to face as it looks to grow its business.
“Google’s business model which relies far too much on a lack of human intervention”
I’ve thought this for a while; it’s probably one reason why they stopped selling the Nexus phone online. Occasionally I’ve had to deal with them on a technical level and found things to be very ad hoc.
You only have to look at Amazon – where in my experience customer service is top notch – to see that good large scale customer support is possible.
I remember thinking that as well at the time. Obviously Amazon has lower gross margins because they have a different business model. But I think you’re right that Google can learn something from this kind of more people-oriented customer focus.
OK, I’m gonna be a bit of a contrarian here. I don’t have a “blog”, but I have a large website that dates back to 2002 that is relatively popular and was heavily scraped by bots back in the days before RSS feeds were available. In short, site scraping has been going on a while and isn’t a new thing. How it’s being done, though, is far different.
RSS feeds have made the scraper’s job far simpler. Gone is the need to fully “spider” a site and then pick out pieces of it to display for various keywords.
Ultimately, if you make the full post available via RSS, you lose control of the content. In short, it’s a mistake to allow full RSS feeds, in my opinion. A webmaster/author can literally spend forever trying to chase down where their RSS feed content is being used. And it’s a losing game, since these types of scraper sites come and go all the time. And good luck doing anything at all about content that ends up on web servers in places like Eastern Europe, Russia or China.
So…moral of the story is…if you don’t want your content appearing on scraper sites all over the place, don’t make your posts fully available via RSS, as you’re just inviting trouble. Yes, you’ll upset some legitimate people who use RSS readers instead of their browsers to read a post, but that’s a small problem compared to having your content spread endlessly across the web.
Jim,
I don’t think you’re being too contrarian here. I agree with you 100%. If you make the full feed available, then you will get scraped. That’s the bottom line, isn’t it?
“If you don’t want to get robbed, don’t carry cash.”
Pathetic.
Killing site scrapers really shouldn’t be that difficult. My money says a Google engineer could solve this problem before lunch, if anyone there thought it was important or interesting.
They have a fingerprint which should be easily identifiable through mechanical means: (1) text identical to that found elsewhere on the net; (2) text identical to that found on multiple sites; (3) a low percentage of original text; (4) text that always appears after it has been published elsewhere on the net. If two of the first three criteria are triggered, set a bot to monitor the competing sources for #4. Whichever one comes second gets banned. Splogs killed (roughly as sketched below).
Put a Summer of Code student on it, already.
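For what it’s worth, here is a rough sketch of what that fingerprint check might look like. Everything in it is hypothetical – the index API, the thresholds, the field names – and it assumes the one thing only a search engine actually has: an index that can say where else a passage appears and when each copy was first crawled.

```python
# A rough, hypothetical sketch of the four-criteria splog fingerprint described
# in the comment above. 'index' stands in for a web-scale index that can report
# other copies of a passage and when each copy was first seen.

def looks_like_splog(site, index, max_original_share=0.2, second_share=0.9):
    pages = index.pages_for(site)                  # hypothetical index API
    if not pages:
        return False
    copied, published_second = 0, 0
    for page in pages:
        copies = index.other_copies(page.text)     # criteria (1) and (2)
        if copies:
            copied += 1
            # criterion (4): this copy was first seen later than some other copy
            if page.first_seen > min(c.first_seen for c in copies):
                published_second += 1
    original_share = 1 - copied / len(pages)       # criterion (3)
    return (copied > 0
            and original_share <= max_original_share
            and published_second / copied >= second_share)
```

The logic is trivial; the expensive part is the other_copies lookup at web scale, which is presumably where the real engineering – and the original-source question raised below – lives.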
I know I shouldn’t defend Google on this but the problem I see is in their ability to differentiate the original source from the duplicate source. I would think that you could figure this out by determining what percentage of content is duplicated on the web and from how many sources. If the number of sources is high, it means that it’s a spam site.
Jay, I have the same reaction you do: why is this so hard?