13 Hours of work for 1 line of code?

The search for the ultimate regular expression is a long and tedious one. I am working on a project where text is parsed into XML files and then converted into HTML using XSL stylesheets. XSLT parsers are generally very sensitive to broken tags, much more so than a normal browser or a human. I was too lazy to fix every broken tag by hand, so I set out to create, or find, a regular expression that closes unclosed HTML tags.
I tried Googling for one, but nothing turned up, so I did what I usually do and wrote it myself.
I spent about 13 hours making it work with every combination of broken tags I could come up with. The final solution came to me while programming in my car on the way to see a brain-doctor about my possible ADHD diagnosis. Now that I have spent the time, I am hoping my work will help someone else.

The code is below. This version works for the <b> tag, but it can be changed to work with any tag. On my 12" PowerBook G4 with a 1 GHz CPU, using Java, it executes in about 25 ms if there is anything to fix; otherwise it averages 0-1 ms. I would bet it will be even faster in Perl or PHP, especially on a faster machine.

I've tested it on numerous pieces of HTML, and wherever the tags are out of balance, it corrects them with surprisingly good results. It does not add closing tags at the end of the document; basically, it closes the open tag if it finds that a new tag is opened before the current one is closed, and it gives any stray closing tag an opening tag of its own. Try it, and you'll see.

String regex = "(?s)(?:(?=<b>)(<b>)(.*?)(?:(?:(?=<b>)(<b>))|(?:(?=</b>)(</b>))))|(</b>)";
text = text.replaceAll(regex,"<b>$2</b>");
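
Since the pattern only differs by tag name, it can be built from a parameter. The following is a minimal sketch of that idea, not part of my original code: the class name TagCloser and the helper closeTag are made up for illustration, and it assumes plain tags without attributes, just like the <b> version above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCloser {
    // Hypothetical helper (not from the original post): builds the same
    // pattern for an arbitrary tag name and applies the same replacement.
    static String closeTag(String text, String tag) {
        // Quote the literal tags so they are matched character for character.
        String open = Pattern.quote("<" + tag + ">");
        String close = Pattern.quote("</" + tag + ">");
        String regex = "(?s)(?:(?=" + open + ")(" + open + ")(.*?)"
                + "(?:(?:(?=" + open + ")(" + open + "))"
                + "|(?:(?=" + close + ")(" + close + "))))"
                + "|(" + close + ")";
        // Group 2 is the text after an opening tag; it is substituted as an
        // empty string when only a stray closing tag matched.
        String replacement = Matcher.quoteReplacement("<" + tag + ">")
                + "$2"
                + Matcher.quoteReplacement("</" + tag + ">");
        return text.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        System.out.println(closeTag("<i>one <i>two</i>", "i"));
        System.out.println(closeTag("<b>This is a <b>piece of code.</b>", "b"));
    }
}
```

Calling closeTag once per tag name (b, i, u and so on) should cover a document with several kinds of broken tags, though I have only reasoned through the simple cases shown in main.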

The code is Java, but the regular expression should work in Perl too, probably along these lines (I haven't coded Perl in years, so bear with my errors). The g flag makes it replace every match, the way replaceAll does in Java:
$text =~ s!(?:(?=<b>)(<b>)(.*?)(?:(?:(?=<b>)(<b>))|(?:(?=</b>)(</b>))))|(</b>)!<b>$2</b>!sgi;

Here are some test results:
Before:
This is a piece of broken </b>HTML <i>code</i> </b>which <b>I am <b>testing<b> for you.</b>

After:
This is a piece of broken <b></b>HTML <i>code</i> <b></b>which <b>I am </b>testing<b> for you.</b>

Another smaller example:
Before:
<b>This is a <b>piece of code.</b>

After:
<b>This is a </b>piece of code.<b></b>
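
The results above can be reproduced with a small harness. This is only a sketch: the class name BrokenTagDemo and the method fix are invented for the example, and the last two inputs are extra cases I added to show what happens with already balanced tags and with an opening tag that is never followed by another <b> or </b>.

```java
class BrokenTagDemo {
    // The regular expression and replacement from the post, unchanged.
    static final String REGEX =
        "(?s)(?:(?=<b>)(<b>)(.*?)(?:(?:(?=<b>)(<b>))|(?:(?=</b>)(</b>))))|(</b>)";

    // Hypothetical wrapper around the two lines of code from the post.
    static String fix(String text) {
        return text.replaceAll(REGEX, "<b>$2</b>");
    }

    public static void main(String[] args) {
        String[] inputs = {
            // The two "Before" strings from the test results.
            "This is a piece of broken </b>HTML <i>code</i> </b>which <b>I am <b>testing<b> for you.</b>",
            "<b>This is a <b>piece of code.</b>",
            // Extra cases, not from the original tests.
            "plain <b>bold</b> text",
            "<b>never closed"
        };
        for (String input : inputs) {
            System.out.println(input);
            System.out.println("  -> " + fix(input));
        }
    }
}
```

The first two inputs should come out as the "After" lines above. The balanced input is left as it is, and so is the last one, because nothing gets appended at the end of the document.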