Crawling - The Most Underrated Hack

It’s been a little while since I traded code with anyone. But a few weeks ago, one of our entrepreneurs-in-residence, Javier, who joined Redpoint from VMware, told me about a Ruby gem called Mechanize that makes it really easy to crawl websites, particularly those with username/password logins.

In about 30 minutes I had a working LinkedIn crawler built, pulling the names of new followers, new LinkedIn connections and LinkedIn status updates. All of that information is useful to me, but I can’t seem to pull it from LinkedIn any other way. Crawling is the fastest, easiest and best solution.

Over the years, I’ve used or built a number of crawlers: at Google to track competitive market share across ad networks, and at Redpoint to build a BD pipeline, mine social networks and crawl mobile app stores. Programmed well, crawlers are the quick and dirty solution to aggregating data from the web at scale.

Crawlers are one of the most powerful tools at the disposal of startups and, I think, one of the most underrated.

In the spirit of paying it forward, I’ve copied the beginning of my LinkedIn crawler below. The script pulls the first page of people who have viewed your profile in the last day and extracts their names, locations, titles and industries.

H/t to Lachy Groom who helped me refactor a bit.

require 'rubygems'
require 'mechanize' 
require 'nokogiri'

@linkedin_username = "your_username"
@linkedin_password = "your_password"

agent = Mechanize.new
agent.user_agent_alias = "Mac Safari"
agent.follow_meta_refresh = true
agent.get("https://www.linkedin.com")

# Log in to LinkedIn
form = agent.page.form_with(:action => '/uas/login-submit')
raise "Login form not found" if form.nil?
form['session_key'] = @linkedin_username
form['session_password'] = @linkedin_password
agent.submit(form)
puts "Login form submitted"

# Return the inner HTML of every element matching the selector.
def search_class(page, search_query)
  page.search(search_query).map { |element| element.inner_html }
end

# Return the alt text of every image matching the selector
# (nil for images without an alt attribute).
def search_image_class(page, search_query)
  page.search(search_query).map { |element| element["alt"] }
end

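# Fetch the "who viewed your profile" page and pull out each field.
# Each call returns one array; entries are assumed to line up by position.
# Note that these variables are local to the block.
agent.get("http://www.linkedin.com/wvmx/profile?trk=nmp_profile_stats_viewed_by") do |page|
  names = search_image_class(page, 'img[@class="photo"]')
  titles = search_class(page, 'dd[@class = "title"]')
  locations = search_class(page, 'dd[@class="location"]')
  industries = search_class(page, 'dd[@class="industries"]')
end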
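
The script above stops once it has collected four separate arrays. One possible next step, sketched here and not taken from the original crawler, is to combine them into one record per viewer. The helper names build_viewer_records and clean_field are hypothetical, the sample data is made up, and the sketch assumes the arrays line up by position, which may not hold if the page markup is irregular.

```ruby
# Hypothetical helper, not part of the original script: turn the four
# parallel arrays into one hash per profile viewer.
def build_viewer_records(names, titles, locations, industries)
  names.zip(titles, locations, industries).map do |name, title, location, industry|
    {
      :name     => clean_field(name),
      :title    => clean_field(title),
      :location => clean_field(location),
      :industry => clean_field(industry)
    }
  end
end

# Strip surrounding whitespace; leave missing values as nil.
def clean_field(value)
  value.nil? ? nil : value.strip
end

# Made-up sample data standing in for a scraped page.
records = build_viewer_records(
  ["Jane Doe"],
  ["  VP Engineering\n"],
  ["San Francisco Bay Area"],
  ["Internet"]
)
records.each { |record| puts record.inspect }
```

Inside the crawler, the same call would go at the end of the agent.get block, where names, titles, locations and industries are in scope.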

Mar 3, 2013
