
    Facebook Scraper Launches Open Source Data Analysis Tool for “Big Data”

    Written by

    Fahmida Y. Rashid
    Published March 23, 2011


      Analyzing “big data” doesn’t have to be expensive, and it got even cheaper, thanks to a new tool that automates the data analysis needed to make sense of massive amounts of data.

      Pete Warden, the man who became famous after he scraped 220 million public Facebook profiles last year, unveiled his Data Science Toolkit at GigaOM’s Structure BigData conference in New York City on March 23. The Data Science Toolkit allows anyone to do automated conversions and data analysis on large data sets, he said.

      In a 20-minute talk titled “Supercomputing on a Minimum Wage,” Warden noted that data analysis doesn’t have to be expensive. “You can hire a hundred servers from Amazon for $10 an hour,” he said.

      The toolkit is a collection of open data sets and open-source data analysis tools wrapped in an easy-to-use interface. Its features include filtering geographic locations out of news articles and other unstructured data, and using OCR (optical character recognition) to convert PDFs of scanned images into text files, Warden said.

      The Data Science Toolkit is available under the GPL (GNU General Public License) and can be used either as a Web service or downloaded to run on Amazon EC2 (Elastic Compute Cloud) or in a virtual machine.

      Users can also convert street addresses or IP addresses into latitude/longitude coordinates and apply those coordinates to map the information against political demographics data, according to the toolkit’s Website.

      A quick test of a residential address in Brooklyn, N.Y., returned information about which Congressional district it was associated with.

      It can also pull country, city and regional names from a block of text and return the relevant coordinates using the Geodict tool. This is similar to Yahoo’s Placemaker tool, according to the toolkit’s description. Users can also paste in blocks of HTML from any page, including a news article, and see just the text that would actually be displayed in the browser, or identify real sentences within a block of text. The toolkit can also extract people’s names and titles, as well as guess gender, from entered text.
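
      The HTML-to-text step described above can be illustrated with a minimal sketch. This is not the toolkit's own code or API; it uses only Python's standard `html.parser` module, and the names `VisibleTextParser` and `visible_text` are hypothetical. It assumes that "displayed text" means everything outside `script`, `style` and `title` elements, which is a simplification of what a browser actually renders.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside script, style and title elements."""

    # Assumption: content inside these tags is never shown in the page body.
    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces as they appeared, then collapse runs of whitespace.
        # Adjacent blocks with no whitespace between them will run together.
        return " ".join("".join(self._chunks).split())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<p>Hello <b>world</b></p><script>var x = 1;</script>"
print(visible_text(sample))  # prints "Hello world"
```

      A production tool would also need to handle hidden elements, malformed markup and block-level spacing, none of which this sketch attempts.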

      Warden used Amazon servers and a number of tools to analyze profile data from 220 million Facebook users in February 2010. He crawled Facebook with WebCrawler, scraping 500 million pages representing those 220 million users. Thanks to “about a hundred bucks” and Amazon’s servers, he transformed the scraped data into a database-ready format in 10 hours, he said.
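
      The figures Warden quoted are consistent with each other, as a quick calculation suggests. The sketch below assumes that the Facebook job ran on the same hundred-server, $10-an-hour setup he mentioned in his talk, which the article does not state explicitly; the function name `cluster_cost` is hypothetical.

```python
def cluster_cost(hourly_rate_usd, hours):
    """Total cost of renting a cluster at a flat hourly rate."""
    return hourly_rate_usd * hours


# Figures quoted in the talk: 100 servers for $10 an hour, a 10-hour job.
rate_for_100_servers = 10.0
job_hours = 10

total = cluster_cost(rate_for_100_servers, job_hours)
per_server_hour = rate_for_100_servers / 100

print(f"${total:.2f} total, ${per_server_hour:.2f} per server-hour")
# prints "$100.00 total, $0.10 per server-hour"
```

      Under that assumption, ten hours of processing comes to roughly the “hundred bucks” Warden described.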

      Using the data, he analyzed friendship relationships on Facebook and created some fun visualizations of how cities and states in the United States are connected to each other through the site. He also correlated the data to show the most common names, fan pages and friend locations around the world.

      Warden noted there were a number of ways to harvest similar data from other sources, including Google Profiles.

      Facebook didn’t like what he was doing with the data and took steps to stop him. It took him two months and $3,000 in legal fees to convince Facebook that what he was doing wasn’t illegal, he said, but he still had to delete the data from the servers. Facebook claimed that he didn’t have permission to scrape the profiles, although he did not hack or compromise any pages and looked only at publicly available pages. Facebook also claimed that his stated plan to make the raw data available to researchers violated its terms of service.

      “Big data? Cheap. Lawyers? Not so cheap,” Warden said to audience laughter.
