HtmlCleaner - an open-source HTML parser written in Java

HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. Given an HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
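
To make the idea of tag balancing concrete, here is a deliberately naive, stack-based sketch. It is not HtmlCleaner's algorithm and uses none of its classes; the class name TagBalancerSketch and the method balance are hypothetical, and the sketch assumes tags without attributes and does no text escaping. It only shows how unclosed tags can be closed and stray closing tags dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: a toy balancer, not part of HtmlCleaner.
final class TagBalancerSketch {
    // Matches a simple opening or closing tag without attributes, e.g. <b> or </b>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // currently open tags, innermost first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy the text that precedes this tag
            pos = m.end();
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it and emit it.
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Closing tag with a matching open tag: close inner tags first.
                while (!open.isEmpty()) {
                    String top = open.pop();
                    out.append("</").append(top).append('>');
                    if (top.equals(name)) {
                        break;
                    }
                }
            }
            // Otherwise: a stray closing tag with no open counterpart is dropped.
        }
        out.append(html.substring(pos)); // trailing text after the last tag
        while (!open.isEmpty()) {
            // Close whatever is still open at the end of the input.
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<p>Hello <b>world"));
    }
}
```

A real cleaner also has to handle attributes, entities, comments and per-tag nesting rules, which is what the customizable tag descriptions mentioned below are for.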

Features Summary
  • HtmlCleaner parses input HTML and generates a tree structure suitable for programmatic manipulation.
  • Serializers output the resulting tree as XML, HTML, DOM or JDOM.
  • The parsing phase relies on tag descriptions, which can be customized by the user.
  • HtmlCleaner's behaviour can be configured through a number of parameters.
  • HtmlCleaner is thread safe, meaning that a single instance can clean multiple HTML sources at the same time.
  • HtmlCleaner can be used from Java code, from the command line or as an Ant task.
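
As a rough sketch of use from Java code, the following assumes an HtmlCleaner 2.x jar on the classpath. The class name CleanExample and the method toXml are made up for illustration, and omitting comments is just one arbitrary choice among the configurable parameters. Some releases declare a checked exception on clean or getAsString, so the sketch catches Exception to compile either way.

```java
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

// Hypothetical wrapper class; only the org.htmlcleaner types come from the library.
final class CleanExample {
    static String toXml(String html) {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Behaviour is adjusted through the cleaner's properties object.
        CleanerProperties props = cleaner.getProperties();
        props.setOmitComments(true); // example setting: drop HTML comments

        try {
            // Parse the (possibly ill-formed) input into a tree of TagNode objects.
            TagNode root = cleaner.clean(html);
            // Serialize the tree as indented, well-formed XML.
            return new PrettyXmlSerializer(props).getAsString(root);
        } catch (Exception e) {
            // Covers releases where the calls above declare IOException.
            throw new IllegalStateException("Cleaning failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("<p>Hello <b>world"));
    }
}
```

The returned TagNode can also be inspected or modified before serialization, and other serializers can be swapped in for PrettyXmlSerializer depending on the output format needed.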
