Script and Style HTML/XHTML Code Smells

Remember the time? Grabbing a copy of the HTML 2 RFC, reading through it in an afternoon, and after a while starting to understand where this all might lead. Then 3.2, 4.01 and XHTML 1.0 passed by, and it all finally ended up as tag soup. Strictly speaking, it was tag soup all along. And that is a good thing: it is why HyperText Markup Language documents have worked in your browser from the very beginning until today.

But now standards get mixed in favor of one party or another, and you are left alone. The W3C is publishing more and more technically incorrect material nowadays. On the one hand, there are forces pushing XML forward because, as an industry standard, it was meant to solve so many issues. On the other hand, there are still some clever people left who keep their fingers in the tags. The differences may not look like much, but when it comes to writing valid documents and interpreting them validly, you can run into problems that render XHTML incompatible with HTML. In short: if you write XHTML documents and serve them as text/html, the browser reads them as HTML. And since XHTML is not an HTML standard, it reads them as tag soup.

In this article I show how to find (smell) invalid XHTML sequences in a larger code base (here, WordPress) and how to fix them using regular expressions and file-based search and replace.

Gaining XHTML Compatibility

One thing you can do to work around those issues is to make your XHTML documents as compatible as possible, so that they can be interpreted not only as HTML but as X(HT)ML as well. I will share some of the smells I used to create the patches for Ticket #11939. Regular expressions can be your friend when you search for problematic areas in a code base. But first, it should be established how script and style elements are properly written:

<style type="text/css"><!--/*--><![CDATA[/*><!--*/

<script type="text/javascript"><!--//--><![CDATA[//><!--

Maybe this looks a bit complicated at first, but it is pretty straightforward: in XHTML, the content of style and script elements is #PCDATA (and not CDATA as in HTML), so the CDATA section markers need to be properly commented out for the markup to be XHTML compliant while still working as HTML. Luckily, <!-- and --> are comment delimiters in XHTML as well, so they can do the job here.
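As a rough sketch of the wrapping described above, the following Python helpers build a script or style element in that form. The function names are hypothetical. The closing sequences (`//--><!]]>` and `/*]]>*/-->`) are an assumption: they are the commonly cited counterparts to the openers shown above, which list only the opening line.

```python
# Opening sequences as listed above. The closing sequences are assumed
# (commonly cited counterparts), not taken from the listing above.
SCRIPT_OPEN = '<script type="text/javascript"><!--//--><![CDATA[//><!--'
SCRIPT_CLOSE = '//--><!]]></script>'
STYLE_OPEN = '<style type="text/css"><!--/*--><![CDATA[/*><!--*/'
STYLE_CLOSE = '/*]]>*/--></style>'


def wrap_script(body: str) -> str:
    """Wrap JavaScript source in the commented CDATA form (// comments)."""
    return f"{SCRIPT_OPEN}\n{body.strip()}\n{SCRIPT_CLOSE}"


def wrap_style(body: str) -> str:
    """Wrap CSS source in the commented CDATA form (/* */ comments)."""
    return f"{STYLE_OPEN}\n{body.strip()}\n{STYLE_CLOSE}"


if __name__ == "__main__":
    print(wrap_script("var answer = 42;"))
    print(wrap_style("body { margin: 0; }"))
```

The element content always starts on its own line here, which keeps the opener and closer easy to search for later.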

The Smells

To find out whether code contains fragments that do not meet these XHTML prerequisites, I created a regular expression that, when searched for, should return the places where action is needed. So if you get zero results, it looks like you have already done your job. This one is for script elements; it at least worked for me on the WordPress code base:

# Script:
(//\s*<!\[CDATA\[|/\*\s*<!\[CDATA\[\s*\*/|<script type=("text/javascript"|'text/javascript')(| charset="utf-8")>(?!<!--//--><!\[CDATA\[//><!--)|(?<!(<!\]\]>|'>|">))</script>|<script[^>]*>(?!(<!--//--><!\[CDATA\[//><!--|</script>)))

# Style:

I built and used that regex in my Eclipse PDT setup with its nice File Search. It is a very productive multi-file search and replace tool that is well integrated into the rest of the IDE. You should give Eclipse a try even if you use it for file search only; on the Windows platform in particular, I have not found anything comparable.
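Outside of Eclipse, a similar search can be scripted. The Python sketch below is an illustration under stated assumptions, not the setup used for the patches: it uses only the last alternative of the script expression above, because Python's `re` module rejects the variable-width look-behind in the full pattern. The name `find_script_smells` is hypothetical.

```python
import re

# Last alternative of the script smell: a <script ...> opening tag that is
# followed neither by the commented CDATA opener nor directly by </script>.
SCRIPT_SMELL = re.compile(
    r"<script[^>]*>(?!(<!--//--><!\[CDATA\[//><!--|</script>))"
)


def find_script_smells(text: str) -> list[tuple[int, str]]:
    """Return (line number, matched tag) pairs for suspicious script tags."""
    hits = []
    for match in SCRIPT_SMELL.finditer(text):
        # Line numbers are 1-based: count the newlines before the match.
        line_no = text.count("\n", 0, match.start()) + 1
        hits.append((line_no, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = (
        '<script type="text/javascript"><!--//--><![CDATA[//><!--\n'
        "var ok = true;\n"
        "//--><!]]></script>\n"
        '<script type="text/javascript">\n'
        "//<![CDATA[\n"
        "var smelly = true;\n"
        "//]]>\n"
        "</script>\n"
    )
    for line_no, tag in find_script_smells(sample):
        print(f"line {line_no}: {tag}")
```

An empty result for a file means this one check found nothing; the other alternatives of the full expression would still need a regex engine that supports them.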

Find and Replace (Warning: Regular Expressions ahead)

So once the areas are identified, what is more natural than to search and replace them intelligently with more expressions? I must admit that even if this sounds very easy to do, I checked each finding manually to see whether the pattern matched properly and the replacement was done as intended (Eclipse makes this easy as well by showing you the diff side by side). Below is a list of expressions and replacements (one line each), with a comment on top of each. These are the actual expressions I used to create the patch for script elements:

# A1: script start
(<script type=("text/javascript"|'text/javascript')(| charset="utf-8")>)\s*\n\s*(/\*\s*<!\[CDATA\[\s*\*/|//<!\[CDATA\[)

# A2: script end

# B1: script start
(<script type=("text/javascript"|'text/javascript')(| charset="utf-8")>)\s*\n\s*<!--

# B2: script end

# C1: blow up single-line scripts
([ \t]*)(<script type=("text/javascript"|'text/javascript')(| charset="utf-8")>)(?!(\n|\\n|<!--//-->))([^\n]+)(</script>)

# C2: blow up single-line var scripts in double quotes
("<script type=("text/javascript"|'text/javascript')(| charset="utf-8")>)(\\n[^\n]+\\n)(</script>(|\\n)")

# D1: script start
(<script type=("text/javascript"|'text/javascript')(| charset="utf-8")>)\s*\n

# D2: script end

# E1: style start in string, commented
("<style type='text/css'>)(|\\n)<!--

# E2: style end in string, commented

# F1: style start w newline
(<style type=("text/css"|'text/css')(| media=("[a-z0-9 -]+"|'[a-z0-9 -]+'))>)(\R)

# F2: style end w newline

# ...
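To illustrate how one of these expressions could be applied outside Eclipse, here is a minimal Python sketch using A1. The replacement string is an assumption (it rewrites the matched opener into the script form shown earlier), and `normalize_script_start` is a hypothetical name.

```python
import re

# Expression A1 from the list above, unchanged.
A1 = re.compile(
    r"""(<script type=("text/javascript"|'text/javascript')(| charset="utf-8")>)"""
    r"""\s*\n\s*(/\*\s*<!\[CDATA\[\s*\*/|//<!\[CDATA\[)"""
)

# Assumed replacement: keep the opening tag (group 1) and append the
# commented CDATA opener. This is a guess, not a listed replacement line.
A1_REPLACEMENT = r"\1<!--//--><![CDATA[//><!--"


def normalize_script_start(source: str) -> str:
    """Rewrite script openers followed by a bare CDATA marker on the next line."""
    return A1.sub(A1_REPLACEMENT, source)


if __name__ == "__main__":
    before = '<script type="text/javascript">\n/* <![CDATA[ */\nvar a = 1;\n'
    print(normalize_script_start(before))
```

This only touches the start of the element; the matching end marker needs its own expression. As with the Eclipse workflow, every replacement should still be reviewed by hand.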

So those were used for normalizing the script elements. For style elements the same principle applies; only the comments in use differ (/* */ instead of //). I will update this article after finishing the replacements for those. The ticket has been updated as well.
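For style elements, a start-tag rewrite along the lines of F1 might look like this in Python. Two assumptions apply: `\R` (any line break) is written out as `\r\n|\n|\r`, since Python's `re` module has no `\R`, and the replacement is a guess based on the style opener shown earlier. The name `normalize_style_start` is hypothetical.

```python
import re

# Expression F1, with \R spelled out as an explicit line-break alternation.
F1 = re.compile(
    r"""(<style type=("text/css"|'text/css')"""
    r"""(| media=("[a-z0-9 -]+"|'[a-z0-9 -]+'))>)(\r\n|\n|\r)"""
)

# Style opener in the /* */ comment form shown earlier in the article.
STYLE_CDATA_OPEN = "<!--/*--><![CDATA[/*><!--*/"


def normalize_style_start(source: str) -> str:
    """Append the commented CDATA opener to style tags that end the line."""
    # Group 1 is the opening tag, group 5 the original line break (assumed
    # replacement: tag + opener + the same line break).
    return F1.sub(lambda m: m.group(1) + STYLE_CDATA_OPEN + m.group(5), source)


if __name__ == "__main__":
    before = '<style type="text/css" media="screen">\nbody { margin: 0; }\n'
    print(normalize_style_start(before))
```

A tag that is already followed by the opener is not followed by a line break, so running the function twice leaves the result unchanged.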
