2 minute read / Mar 5, 2013 / data analysis /
Crawling - The Most Underrated Hack
It’s been a little while since I traded code with anyone. But a few weeks ago, one of our entrepreneurs-in-residence, Javier, who joined Redpoint from VMware, told me about a Ruby gem called Mechanize that makes it really easy to crawl websites, particularly those with username/password logins.
In about 30 minutes I had a working LinkedIn crawler built, pulling the names of new followers, new LinkedIn connections and LinkedIn status updates. All of that information is useful for me. But I just can’t seem to pull it from LinkedIn any other way. Crawling is the fastest, easiest and best solution.
Over the years, I’ve used or built a number of crawlers: at Google to track competitive market share across ad networks, at Redpoint to build a BD pipeline and to mine social networks, as well as mobile app store crawlers. Programmed well, crawlers are the quick and dirty solution to aggregating data from the web at scale.
Crawlers are one of the most powerful tools at the disposal of startups and, I think, one of the most underrated.
In the spirit of paying it forward, I’ve copied the beginning of my LinkedIn crawler below. The script pulls the first page of people who have viewed your profile in the last day and extracts their names, locations, titles and industries.
H/t to Lachy Groom who helped me refactor a bit.
require 'rubygems'
require 'mechanize'
require 'nokogiri'

@linkedin_username = "your_username"
@linkedin_password = "your_password"

agent = Mechanize.new
agent.user_agent_alias = "Mac Safari"
agent.follow_meta_refresh = true
agent.get("https://www.linkedin.com")

# Log in to LinkedIn
form = agent.page.form_with(:action => '/uas/login-submit')
form['session_key'] = @linkedin_username
form['session_password'] = @linkedin_password
agent.submit(form)
pp "Login successful"

# Return the inner HTML of every element matching search_query
def search_class(page, search_query)
  page.search(search_query).map do |element|
    if !element.nil? && !element.inner_html.nil?
      element.inner_html
    end
  end
end

# Return the alt text of every image matching search_query
def search_image_class(page, search_query)
  page.search(search_query).map do |element|
    if !element.nil? && !element["alt"].nil?
      element["alt"]
    end
  end
end

# Fetch the profile-views page and pull out each field
agent.get("http://www.linkedin.com/wvmx/profile?trk=nmp_profile_stats_viewed_by") do |page|
  names = search_image_class(page, 'img[@class="photo"]')
  titles = search_class(page, 'dd[@class = "title"]')
  locations = search_class(page, 'dd[@class="location"]')
  industries = search_class(page, 'dd[@class="industries"]')
end
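The script stops once it has four parallel arrays: names, titles, locations and industries. One possible next step, sketched here rather than taken from my crawler, is to zip those arrays into one record per viewer. The `build_viewers` helper and the sample values are hypothetical and only there for illustration. The sketch uses plain Ruby, so it runs without logging in to anything.

```ruby
# Hypothetical sample values standing in for the arrays the crawler collects.
names      = ["Ada Example", "Grace Sample"]
titles     = ["Engineer", nil]
locations  = ["San Francisco Bay Area", "Greater New York City Area"]
industries = ["Internet", "Computer Software"]

# Combine the parallel arrays into one hash per profile viewer.
# Array#zip pads with nil when an array is shorter than names.
def build_viewers(names, titles, locations, industries)
  names.zip(titles, locations, industries).map do |name, title, location, industry|
    { name: name, title: title, location: location, industry: industry }
  end
end

viewers = build_viewers(names, titles, locations, industries)

# Print one line per viewer, skipping any fields that are missing.
viewers.each do |viewer|
  puts viewer.values.compact.join(" | ")
end
```

This assumes the four arrays line up by position. If a profile on the page is missing a field, the arrays can drift out of step, so a sturdier version would probably search within each viewer's own block of HTML instead.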