Wednesday, September 21, 2011

 

Image sizing and scraping with JQuery and Rails


I know, I know, long time no type. Not my fault, I swear! First I went travelling, then I had a buncha work to do, then I went to San Francisco TechCrunch Disrupt, all of which occupied my time. Well, I suppose, on reflection, all those things were my fault, but, um ... look, I'm back now, OK?

Back with a li'l discussion of image sizing and scraping with Rails and JQuery, for a project I can't really talk about yet. Suffice to say that sometimes I want to calculate the size of various remote images - quite a lot of them, in fact, making performance important - and sometimes I want to scrape all the images from a set of web pages.

I thought the first task was going to be tricky. And maybe it is. But fortunately, someone has solved it for me, via the totally awesome FastImage Ruby gem, for which praise should be heaped upon one sdsykes. It works exactly as advertised, which is to say, like this:


  def self.picture_info_for(url)
    return nil if url.blank?
    FastImage.size(url)  # [width, height]
  rescue StandardError
    logger.info "Error getting info for picture at #{url}"
    [0, 0]  # this makes sense for my app, but maybe not yours
  end
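
If you have a lot of URLs to size, one option is to spread the lookups over a few threads so the slow ones don't hold up the rest. What follows is only a minimal sketch under my own assumptions, not code from the app: `sizes_for`, `sizer` and `thread_count` are made-up names, and `sizer` stands in for whatever does the real work (something like `->(u) { FastImage.size(u) }`).

```ruby
require 'thread'

# Hypothetical helper: sizes many image URLs on a small pool of threads.
# `sizer` is any callable that takes a URL and returns [width, height];
# it is an assumption here, e.g. ->(u) { FastImage.size(u) } in a real app.
def sizes_for(urls, sizer, thread_count = 4)
  queue = Queue.new
  urls.each_with_index { |url, idx| queue << [url, idx] }
  results = Array.new(urls.size)

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          url, idx = queue.pop(true)  # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break                       # nothing left to do
        end
        results[idx] = begin
          sizer.call(url) || [0, 0]   # treat "no answer" like a failure
        rescue StandardError
          [0, 0]
        end
      end
    end
  end

  workers.each(&:join)
  results  # same order as `urls`
end
```

Results come back in the same order as the input, with `[0, 0]` standing in for anything that failed, so the caller can zip them back up with the original URLs.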


So, hurrah! This meant the scraping bit was actually trickier. Sure, I could have done it all in Rails, but it's user-facing, and I didn't want the user to have to wait for a bunch of potentially sequential HTTP requests to complete without seeing any results. I could have done it all on the client side, but parsing HTML with JavaScript, even with JQuery, sounded painful and fraught with difficulties, compared to using the dead-easy Hpricot gem. So I came up with a compromise I quite like:

1. On the client side: (written using HAML, which I mostly adore)
- @image_urls.each_with_index do |url,idx|
  = link_to url, url
  %div{:id => 'page_'+idx.to_s}     

%script
  $(function() {
  - @image_urls.each_with_index do |url,idx|
    $.ajax({
    url: '/stories/scrape_images?url=#{CGI::escape(url)}',
    success: function(msg){ $('#page_#{idx}').html(msg); },
    error: function(xhr){ $('#page_#{idx}').html(xhr.statusText); }
    });
  });

2. On the server, to first create and then respond to that client page:
  def popup_scraped_images
    start_time = Time.now
    @image_urls = []
    @seed = Seed.find(params[:seed_id])
    @seed.active_signals.each do |signal|
      next if signal.main_url.blank?
      @image_urls << signal.main_url
    end
  end

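  def scrape_images
    url = params[:url]
    # Everything up to the first single slash, e.g. "http://example.com"
    slash = url =~ /[^\/]\/[^\/]/
    host = slash.nil? ? "" : url[0,slash+1]
    html = ''

    page = HTTParty.get(url, :timeout => 5)
    Hpricot(page.body).search("//img").each do |element|
      img_src = element.attributes["src"]
      next if img_src.blank?  # skip <img> tags with no src
      img_src = host+img_src if img_src.match(/^\//)
      html += '<img src="'+img_src+'" />' if img_src.match(/^http/)
    end
    render :text => html  # hand the markup back to the AJAX success callback
  end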
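
One wrinkle with the host-prefixing regex in `scrape_images`: it only rescues `src` values that start with a slash, so page-relative paths like `pics/cat.png` get dropped, and a page URL with no path at all yields an empty host. If that matters for your pages, Ruby's standard `URI.join` can do the resolving instead. A minimal sketch, with `absolute_image_url` being a made-up helper name rather than anything from the app:

```ruby
require 'uri'

# Hypothetical helper: turns an <img> src found on `page_url` into an
# absolute http(s) URL, or returns nil if that isn't possible.
def absolute_image_url(page_url, img_src)
  return nil if img_src.nil? || img_src.strip.empty?
  absolute = URI.join(page_url, img_src.strip).to_s
  # Keep only things a browser can fetch over HTTP (drops data: URIs etc.)
  absolute =~ /\Ahttps?:\/\// ? absolute : nil
rescue URI::Error
  nil  # malformed page URL or src
end
```

Inside the loop you would call it as `absolute_image_url(url, element.attributes["src"])` and skip any `nil` results.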


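Hopefully how they all interact is self-explanatory: popup_scraped_images renders the page with one empty div per URL, and each AJAX call then asks scrape_images for that URL's images and drops the returned img tags into the matching div. Et voilà - semi-asynchronous Rails/JQuery image scraping, handled on the server side for easy caching if need be later on.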


