Regular Expression Pattern to Clean Microsoft Word-Generated HTML  # 18 Aug 2010

Great stuff from Tim Mackey: a regex search pattern to clean Word-HTML cruft, to be run on your dirty data in two passes:

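  1. Removes unwanted tags:
    <[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>
  2. Removes unwanted attributes:
    <([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>

  Note: the doubled quotes ("") in the second pattern look like C# verbatim-string escaping. In other regex flavors you will probably want a single " in each of those spots.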
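
For anyone not working in .NET, here is a minimal Python sketch of the two passes. It is not Tim's code, and it rests on a few assumptions: the doubled quotes are collapsed to a single `"`, matching is case-insensitive, tag matches are replaced with nothing, and attribute matches are replaced with the two captured groups (`<\1\2>`). Each substitution appears to strip only one attribute per tag, so the second pass is repeated until nothing changes. The function name `clean_word_html` and the sample string are made up for illustration.

```python
import re

# Pass 1: whole tags to drop (the text between them is kept).
TAG_PATTERN = re.compile(
    r"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>",
    re.IGNORECASE,  # assumption: case-insensitive matching
)

# Pass 2: one unwanted attribute inside a tag.
# Assumption: "" from the original pattern is written as a single ".
ATTR_PATTERN = re.compile(
    r"""<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^\s>]+)([^>]*)>""",
    re.IGNORECASE,
)


def clean_word_html(html):
    """Hypothetical helper: apply both passes to a string of Word HTML."""
    # Pass 1: delete the unwanted tags outright.
    html = TAG_PATTERN.sub("", html)
    # Pass 2: rebuild each tag without the matched attribute, and repeat
    # until no tag has an unwanted attribute left.
    while True:
        html, count = ATTR_PATTERN.subn(r"<\1\2>", html)
        if count == 0:
            return html


sample = (
    "<p class=\"MsoNormal\" style='margin:0'>"
    "<span lang=EN-US>Hello <o:p></o:p></span>"
    "<font face=\"Arial\">World</font></p>"
)
print(clean_word_html(sample))
```

The cleaned tags can be left with stray whitespace where attributes used to be, so a follow-up tidy step may still be worthwhile.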

Check out the comments under Tim’s article for more.
