Web scraping with JavaScript
Web scraping is a common process in which content is continuously gathered from web pages and then put to either good use, as in search engines, or bad use, such as stealing content. It’s mostly a server-side process: bots and crawlers visit pages and parse their content using pattern matching, string comparison, and regular expressions.
But today, with the popularity of JavaScript, flexible access to the DOM, and the availability of libraries such as jQuery, page scraping can be approached differently: with less code, and less intrusively, in JavaScript. So I decided to give it a try and build a page scraper in JavaScript, using a well-structured site like Digg as an example.
DiggStripper is the result of this experiment. Its functionality is simple: it takes the Digg home page, traverses the DOM, extracts the stories, and builds a JSON object containing them. Digg does provide an API for accessing its information, so there is probably not much use for this page scraper other than to serve as an example of page scraping with JavaScript, or to get around any limits set by the Digg API.
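To give a feel for the traverse-extract-build steps, here is a minimal sketch. It is not the DiggStripper source: it uses the browser's standard querySelectorAll/querySelector calls rather than jQuery, and the selectors, the extractStories name, and the title/url/diggs fields are all assumptions made for illustration, since the actual Digg markup is not reproduced in this post.

```javascript
// Hypothetical selectors: placeholders for illustration, not Digg's real markup.
const SELECTORS = {
  story: ".news-summary",      // assumed wrapper element for one story
  title: "h3 a",               // assumed link holding the story title and URL
  diggs: ".digg-count strong", // assumed element holding the digg count
};

// Return the trimmed text of a node, or "" when the node is missing.
function text(node) {
  return node ? node.textContent.trim() : "";
}

// Walk every story element under `root` and collect plain objects.
function extractStories(root, selectors = SELECTORS) {
  const stories = [];
  for (const item of Array.from(root.querySelectorAll(selectors.story))) {
    const link = item.querySelector(selectors.title);
    if (!link) continue; // skip elements that do not look like a story

    const count = parseInt(text(item.querySelector(selectors.diggs)), 10);
    stories.push({
      title: text(link),
      url: link.getAttribute("href"),
      diggs: Number.isNaN(count) ? null : count, // null when no count is found
    });
  }
  return { stories };
}
```

Run from a script that is already executing inside the page being scraped, something like JSON.stringify(extractStories(document), null, 2) would produce the JSON text. The sketch assumes it runs in the page's own context; a script served from a different site cannot read another site's DOM because of the browser's same-origin policy.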
The DiggStripper code is available as open source under the MIT License, so feel free to download it, and do send your feedback and ideas for taking it to levels I have not thought of yet.




2 Comments
May 20, 2009 | 6:55 pm
How do I use this tool through a proxy to access Digg?
Oct 21, 2009 | 6:57 am
How did you solve the problem that JavaScript is not allowed to call a remote host?
Your code does not work for me (security exception).
Greetings