http://htmlunit.sourceforge.net/
tool http://screen-scraper.com/
Java library list http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java/view
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used to clean up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, so it can effectively be used as a DOM parser for real-world HTML.
http://jtidy.sourceforge.net/
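A minimal sketch of using JTidy as a DOM parser, as described above. It assumes the JTidy jar is on the classpath; the class name JTidyDomSketch and the helper countLinks are made up for illustration, and the sample markup is invented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyDomSketch {

    // Hypothetical helper: repair the markup with JTidy, then count <a> elements via the DOM.
    static int countLinks(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // suppress per-line repair warnings

        // Passing null as the output stream: we only want the DOM, not the pretty-printed HTML.
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), null);

        NodeList anchors = doc.getElementsByTagName("a");
        return anchors.getLength();
    }

    public static void main(String[] args) {
        // Invented sample: the <li> elements are never closed.
        String messy = "<ul><li><a href=\"a.html\">A</a><li><a href=\"b.html\">B</a></ul>";
        System.out.println(countLinks(messy));
    }
}
```

From the returned org.w3c.dom.Document, the usual DOM traversal methods can be used to walk the cleaned-up tree.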
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
http://andreas-hess.info/programming/webcrawler/index.html
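As a rough illustration of the jsoup extraction API mentioned above, the sketch below parses an HTML string and collects link targets with a CSS selector. It assumes the jsoup jar is on the classpath; the class name JsoupSketch, the helper linkTargets, and the sample markup are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    // Hypothetical helper: return the href value of every <a> element that has one.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);   // tolerant of unclosed or missing tags
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {  // CSS selector: anchors with an href attribute
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Invented sample: one anchor has no href, and the last anchor is never closed.
        String html = "<p>See <a href='a.html'>A</a> and <a>no target</a> <a href='b.html'>B";
        System.out.println(linkTargets(html));
    }
}
```

The selector-based lookup is what makes this shorter than walking a DOM by hand; the same Document also exposes DOM-style traversal if that fits better.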