The risk of data loss from a Website can come from multiple avenues. There could be an outright data breach where an attacker steals content directly from a database, or an automated bot could scrape the site, stealing data that is out in the open. The challenge of dealing with automated Web-content scraping is one that startup ScrapeDefender is aiming to tackle.
“Websites have all sorts of different types of content that is available, free to the public, and the creators of those sites intend for that content to be consumed by people to use,” Robert Kane, CEO of ScrapeDefender, told eWEEK. “What has happened is there is now a whole industry of scraping with bots that harvest mass amounts of data from sites.”
Those data-harvesting bots can grab pricing information from retail or travel sites, for example, and then repurpose the data in ways the original content creator did not intend. ScrapeDefender is now launching its cloud-based anti-scraping and real-time monitoring service with the goal of tracking and limiting the risk of scraping.
Kane is no stranger to the world of security. In 1992, he founded a company called Intrusion Detection, which he sold to RSA in 1998. He also has experience in the financial services market and is the founder of Bondview, a municipal bond information site. His experience at Bondview helped him identify the need for an anti-scraping service.
“We discovered at Bondview that we were being scraped,” Kane said. “There were no tools to help us, and that was the spark for ScrapeDefender.”
How It Works
ScrapeDefender works much like Google Analytics, Kane said, with a piece of JavaScript code embedded on Website pages that helps to monitor and track activity.
“We receive a copy of the Website activity and analyze the activity on our servers and pass it through a whole bunch of metrics,” Kane said.
Those metrics include 25 different parameters that help determine whether traffic is legitimate or indicates scraping activity. The parameters include checks for excessive page views from a single address and for direct visits to a page that would normally be reached only through a referring click or page.
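To make the two parameters named above concrete, here is a minimal Python sketch of how such checks might look. It is an illustration only, not ScrapeDefender's code: the `PageView` record, the `MAX_VIEWS_PER_IP` threshold, the `DEEP_PATH_PREFIXES` list and the `flag_suspicious` function are all hypothetical names and values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class PageView:
    """One logged page view (hypothetical record layout)."""
    ip: str
    path: str
    referrer: Optional[str]  # None when the visitor arrived with no referrer


# Hypothetical values; a real service would tune these per site.
MAX_VIEWS_PER_IP = 100
DEEP_PATH_PREFIXES = ("/quotes/", "/prices/")


def flag_suspicious(views: List[PageView]) -> Dict[str, Set[str]]:
    """Map each suspicious address to the set of checks it tripped."""
    flagged: Dict[str, Set[str]] = {}

    # Check 1: excessive page views from a single address.
    counts = Counter(view.ip for view in views)
    for ip, total in counts.items():
        if total > MAX_VIEWS_PER_IP:
            flagged.setdefault(ip, set()).add("excessive_page_views")

    # Check 2: direct visit to a page normally reached via a referring page.
    for view in views:
        if view.referrer is None and view.path.startswith(DEEP_PATH_PREFIXES):
            flagged.setdefault(view.ip, set()).add("direct_deep_visit")

    return flagged
```

Calling `flag_suspicious` on a batch of logged views returns only the addresses that tripped at least one check, which keeps the two signals separate so they can be weighed alongside other parameters.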
ScrapeDefender Protects Data by Thwarting Web-Scraping Attempts
The system is able to determine a baseline level of activity for a given site that will help indicate what is normal traffic and what activity should be considered suspicious.
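The article does not say how the baseline is computed. One common way to express the idea, shown here purely as an assumption, is to compare current traffic against the mean and standard deviation of past traffic. The `is_anomalous` function and its `threshold` default are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Return True when `current` sits far above the historical baseline.

    `history` holds past request counts for comparable time windows.
    `threshold` is the number of standard deviations above the mean
    that counts as suspicious (an assumed default, not a vendor value).
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline

    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase stands out.
        return current > baseline

    return (current - baseline) / spread > threshold
```

With an hourly history of around 100 requests, a sudden hour of 500 requests would be flagged, while an hour of 104 would not.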
“That’s what intrusion detection is all about, and this is an intrusion detection application of sorts,” Kane said.
Remediation
Kane described the remediation actions available to users as being a bit of a game of “whack-a-mole,” where new threats pop up as old ones are blocked.
“We have the ability to digitally fingerprint devices so that as activity pops up we can identify and block scraping,” Kane said.
ScrapeDefender has an application programming interface (API) that can integrate with Web Application Firewall (WAF) technology. WAFs are widely used on the Internet today as a means of blocking application attacks.
There is also an intellectual property enforcement component of ScrapeDefender, which includes form letters that enable users to contact offenders and ask them to stop their activities.
ScrapeDefender has some competition in the space, including Distil Networks, which offers a similar type of solution. Kane said a key difference is how the two solutions are implemented on a Website. The Distil cloud solution involves Websites redirecting their DNS information, an approach similar to the one many cloud security vendors, including Incapsula and Cloudflare, use to protect their customers. In contrast, ScrapeDefender requires JavaScript code snippets in order to work with the cloud service.
Overall, Kane sees a lot of growth potential for the anti-scraping industry.
“I look at this business like the antivirus business was early on,” Kane said. “There is an enormous amount of opportunity for anti-scraping technology, and I think we’ll see more companies come into this space.”
Sean Michael Kerner is a senior editor at eWEEK and InternetNews.com. Follow him on Twitter @TechJournalist.