Wednesday, December 25, 2013

web screen scraping tools

HtmlUnit is a "GUI-less browser for Java programs". It models HTML documents and provides an API for loading pages, filling out forms, clicking links, and so on, just as you would in a regular browser.
http://htmlunit.sourceforge.net/


Tool: http://screen-scraper.com/

List of Java libraries: http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java/view

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used to clean up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which lets you use JTidy as a DOM parser for real-world HTML.
http://jtidy.sourceforge.net/
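
To make the "DOM parser for real-world HTML" point concrete, here is a minimal sketch that feeds sloppy markup to JTidy and reads the link targets through the standard org.w3c.dom API. The class name JTidyDomExample, the helper linkTargets, and the sample markup are made up for illustration, and the sketch assumes the JTidy jar is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyDomExample {

    // Hypothetical helper: parse (possibly malformed) HTML with JTidy
    // and return the href value of every <a> element.
    static List<String> linkTargets(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // suppress per-warning output

        ByteArrayInputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII));
        // parseDOM also writes the cleaned-up markup to the second argument;
        // this sketch only needs the DOM, so the output is discarded.
        Document doc = tidy.parseDOM(in, new ByteArrayOutputStream());

        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample input with no <head> and an unclosed second paragraph.
        String html = "<html><body><p>See <a href=\"http://jtidy.sourceforge.net/\">JTidy</a>"
                + "<p>Unclosed paragraph</body></html>";
        for (String href : linkTargets(html)) {
            System.out.println(href);
        }
    }
}
```

JTidy's DOM implementation does not cover every method of the newer DOM levels, so the sketch sticks to getElementsByTagName and getAttribute.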

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
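
As a rough illustration of the CSS-selector style, the sketch below parses an HTML string with jsoup and lists every link along with its absolute URL. The class name JsoupLinkExtractor, the helper extractLinks, the sample markup, and the example.com base URI are all invented for this example, and the jsoup jar is assumed to be on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinkExtractor {

    // Hypothetical helper: returns "link text -> absolute URL" for every
    // anchor that carries an href attribute.
    static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs.
        Document doc = Jsoup.parse(html, baseUri);

        List<String> links = new ArrayList<>();
        // "a[href]" is a CSS selector: all <a> elements with an href attribute.
        for (Element anchor : doc.select("a[href]")) {
            links.add(anchor.text() + " -> " + anchor.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: the second link and the paragraph are never closed.
        String html = "<p>Tools: <a href='/jsoup'>jsoup</a> and <a href='/tidy'>JTidy";
        for (String link : extractLinks(html, "http://example.com/")) {
            System.out.println(link);
        }
    }
}
```

To scrape a live page instead of a string, Jsoup.connect(url).get() returns the same kind of Document, so the selector code stays unchanged.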


Web crawler: http://andreas-hess.info/programming/webcrawler/index.html