    Crawling - The Most Underrated Hack by @ttunguz

    Venture Capitalist at Theory

    2 minute read / Mar 5, 2013 /

    It’s been a little while since I traded code with anyone. But a few weeks ago, one of our entrepreneurs-in-residence, Javier, who joined Redpoint from VMware, told me about a Ruby gem called Mechanize that makes it really easy to crawl websites, particularly those with username/password logins.

    In about 30 minutes I had a working LinkedIn crawler built, pulling the names of new followers, new LinkedIn connections and LinkedIn status updates. All of that information is useful for me. But I just can’t seem to pull it from LinkedIn any other way. Crawling is the fastest, easiest and best solution.

    Over the years, I’ve used or built a number of crawlers: at Google to track competitive market share across ad networks, and at Redpoint to build a BD pipeline, to mine social networks and to crawl mobile app stores. Programmed well, crawlers are the quick and dirty solution to aggregating data from the web at scale.

    Crawlers are one of the most powerful tools at the disposal of startups and, I think, one of the most underrated.

    In the spirit of paying it forward, I’ve copied the beginning of my LinkedIn crawler below. The script pulls the first page of people who have viewed your profile in the last day and extracts their names, titles, locations and industries.

    H/t to Lachy Groom who helped me refactor a bit.

    require 'rubygems'
    require 'mechanize' 
    require 'nokogiri'
    
    @linkedin_username = "your_username"
    @linkedin_password = "your_password"
    
    agent = Mechanize.new
    agent.user_agent_alias = "Mac Safari"
    agent.follow_meta_refresh = true
    agent.get("https://www.linkedin.com")
    
    # Log in to LinkedIn
    form = agent.page.form_with(:action => '/uas/login-submit')
    form['session_key'] = @linkedin_username
    form['session_password'] = @linkedin_password
    agent.submit(form)
    # The response is not checked here, so this only confirms the form was sent.
    puts "Login form submitted"
    
    # Return the inner HTML of every element matching search_query.
    def search_class(page, search_query)
      page.search(search_query).map { |element| element.inner_html }
    end
    
    # Return the alt text of every matching image (nil where alt is missing).
    def search_image_class(page, search_query)
      page.search(search_query).map { |element| element["alt"] }
    end
    
    agent.get("http://www.linkedin.com/wvmx/profile?trk=nmp_profile_stats_viewed_by") do |page|
      names = search_image_class(page, 'img[@class="photo"]')
      titles = search_class(page, 'dd[@class = "title"]')
      locations = search_class(page, 'dd[@class="location"]')
      industries = search_class(page, 'dd[@class="industries"]')
    end
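
    The script stops once the four arrays are filled. As a minimal sketch of one possible next step, the arrays can be zipped into one record per viewer. This assumes the arrays line up by position, that is, that the page lists each viewer's name, title, location and industry in the same order. The build_viewers helper and the sample values are hypothetical and not part of my original crawler.

```ruby
# Hypothetical helper (not in the original script): combine the four
# parallel arrays into one hash per profile viewer.
# Assumes the arrays are in the same order; shorter arrays are padded with nil.
def build_viewers(names, titles, locations, industries)
  names.zip(titles, locations, industries).map do |name, title, location, industry|
    { name: name, title: title, location: location, industry: industry }
  end
end

# Made-up sample values standing in for what the crawler would extract.
names      = ["Ada Example", "Bob Sample"]
titles     = ["Engineer", "Designer"]
locations  = ["San Francisco Bay Area", "Greater New York City Area"]
industries = ["Internet", nil] # a missing field stays nil

viewers = build_viewers(names, titles, locations, industries)

# Print one line per viewer, substituting "n/a" for missing fields.
viewers.each do |viewer|
  puts viewer.values.map { |value| value || "n/a" }.join(" | ")
end
```

    In the crawler itself, the call would sit inside the agent.get block, where names, titles, locations and industries are in scope.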
    
