Regular Expression Pattern to Clean Microsoft Word-Generated HTML # 18 Aug 2010

Great stuff from Tim Mackey: two regex search patterns to clean Word-HTML cruft, to be run on your dirty data in two passes:
- Removes unwanted tags:
<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>
- Removes unwanted attributes:
<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>
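Here is a minimal sketch of how the two passes might be wired together in Python. Several things in it are assumptions rather than part of the patterns quoted above: the function name `clean_word_html` is made up for illustration, the attribute pass is assumed to rebuild each tag from its two capture groups (`<\1\2>`), and the doubled quotes (`""`) are read as string-literal escaping for a single `"`, so they are collapsed here. The attribute pattern strips only one attribute per tag per substitution, so the sketch repeats that pass until nothing changes.

```python
import re

# Pass 1: strip Word-specific tags (opening and closing); their inner text stays.
UNWANTED_TAGS = re.compile(
    r"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>",
    re.IGNORECASE,
)

# Pass 2: strip one unwanted attribute from a tag.
# Assumption: the doubled quotes in the quoted pattern stand for a single '"'.
UNWANTED_ATTRS = re.compile(
    r"""<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^\s>]+)([^>]*)>""",
    re.IGNORECASE,
)


def clean_word_html(html: str) -> str:
    """Hypothetical helper: run the tag pass once, then the attribute pass to a fixed point."""
    html = UNWANTED_TAGS.sub("", html)
    while True:
        # Assumed replacement: keep whatever surrounded the removed attribute.
        cleaned = UNWANTED_ATTRS.sub(r"<\1\2>", html)
        if cleaned == html:
            return cleaned
        html = cleaned


if __name__ == "__main__":
    dirty = (
        '<p class="MsoNormal" style="margin:0">'
        '<span lang="EN-US">Hello</span> <o:p></o:p>world</p>'
    )
    print(clean_word_html(dirty))
```

On that sample the `span` and `o:p` tags disappear and the `p` tag loses both attributes, though some leftover whitespace stays inside the tag. Like any regex approach to HTML, this is a blunt instrument: the tag pattern matches on prefixes (so a tag name merely starting with `span` or `ins` is also removed), which is worth checking against your own data first.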
Check out the comments under Tim’s article for more.