CSS Selector to XPath conversion

While experimenting with a parser that fully supports the CSS selector syntax, and after discovering the Selectors API, I started to think about translating CSS selectors to XPath. I'm surely not the only one, so I collected some existing resources to get a broad overview (you find the list below). What I missed in those documents is the full picture, which is often not shown, so I took the opportunity to add more detail about case-sensitivity in the HTML context.

PHP developers might be interested to know that the XPath examples here have been tested with DOMXPath, which is part of PHP's DOM extension, and that they are meant for HTML documents. The CSS examples are adapted from the CSS3 selectors summary table; pseudo-classes have largely been left out, and I might write about them soon. I have added some combination examples because I found them interesting:

CSS Xpath Meaning
* //* any element
P //P|//p an element of type P
Remark: CSS syntax is case-insensitive within the ASCII range (i.e., [a-z] and [A-Z] are equivalent). Most examples elsewhere don't reflect that. Here the XPath node-set union operator | is used to match both p and P tags, which is only one way to achieve case-insensitivity.
BODY //*['BODY' = translate(name(.), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')] an element of type BODY
Remark: Another variant for case-insensitive element name matching.
P[align] (//P|//p)[@align] a P element with an “align” attribute
P[align] (//P|//p)/@*['ALIGN' = translate(name(.), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')]/.. a P element with an “align” attribute
Remark: This example is about case-insensitivity again, which applies to attribute names as well. It is a more correct variant of the previous example when the CSS targets an HTML document, which is normally the case.
P[class~="intro"] (//P|//p)[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'intro', ' '))] a P element whose “class” attribute value is a list of whitespace-separated values, one of which is exactly equal to “intro”
Remark: This is the famous CSS class selector, P.intro in this specific case.
P.intro (//P|//p)[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'intro', ' '))] a P element whose class is “intro” (the document language specifies how class is determined).
Remark: As this example is for HTML, it's the same as the previous P[class~="intro"]. Because this is too simple, the next example adds case-insensitivity for the tag name and the attribute name.
P.intro //*['P' = translate(name(.), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')]/@*['CLASS' = translate(name(.), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') and contains(concat(' ', normalize-space(.), ' '), concat(' ', 'intro', ' '))]/.. a P element whose class is “intro” (the document language specifies how class is determined).
Remark: The class attribute name is matched case-insensitively, while the class name itself stays case-sensitive. The trailing /.. steps from the matched attribute back to its P element.
P[align^="le"] (//P|//p)[starts-with(@align, 'le')] a P element whose “align” attribute value begins exactly with the string “le”
P[align$="t"] (//P|//p)[substring(@align, string-length(@align), 1) = 't'] a P element whose “align” attribute value ends exactly with the string “t”
Remark: Unlike starts-with in the previous example, XPath 1.0 has no ends-with string function, so the string-length and substring functions are used.
P[align$="t"] //*['P' = translate(name(.), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')]/@*['ALIGN' = translate(name(.), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') and substring(., string-length(.), 1) = 't']/.. a P element whose “align” attribute value ends exactly with the string “t”
Remark: Case-insensitive variant for both the tag name and the attribute name. This example demonstrates well what impact the CSS specification has when a simple-looking CSS selector is ported to XPath.
P[align*="igh"] (//P|//p)[contains(@align, 'igh')] a P element whose “align” attribute value contains the substring “igh”
P[lang|="en"] (//P|//p)[@lang='en' or starts-with(@lang, 'en-')] a P element whose “lang” attribute has a hyphen-separated list of values beginning (from the left) with “en”
Remark: This is not the same as :lang("en") which to the best of my knowledge is not possible to port to xpath in a single expression.
P * (//P|//p)//* all descendant elements of a P element (Descendant combinator)
P > * (//P|//p)/* all child elements of a P element (Child combinator)
P > *:first-child (//P|//p)/*[1] any element, first child of its parent P element
H1 + P (//P|//p)['H1' = translate(name(preceding-sibling::*[1]), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')] a P element immediately preceded by an H1 element
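H1 ~ P (//P|//p)[preceding-sibling::*['H1' = translate(name(.), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')]] a P element preceded by an H1 element
Remark: The name test sits in a predicate on preceding-sibling::* because name() applied to a node-set only looks at its first node in document order, which would be the first sibling and not any H1 among the preceding siblings.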
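
The translate() pattern repeats in many of the rows, so generating it is less error-prone than typing it by hand. Here is a minimal Python sketch of that idea. The helper names are made up for illustration, the code only builds XPath 1.0 strings (it does not evaluate them), and it assumes plain ASCII names and values without quote characters:

```python
# Hypothetical helpers: they only assemble XPath 1.0 strings.
# Assumption: names and values are ASCII and contain no quote characters.
LOWER = "abcdefghijklmnopqrstuvwxyz"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def name_test(name: str) -> str:
    """Predicate body comparing the context node's name, ignoring ASCII case."""
    return f"'{name.upper()}' = translate(name(.), '{LOWER}', '{UPPER}')"


def element(tag: str) -> str:
    """Any element named `tag`, in any letter case."""
    return f"//*[{name_test(tag)}]"


def with_attribute(tag: str, attr: str, value_test: str = "") -> str:
    """Element `tag` that has attribute `attr`.

    `value_test` is evaluated with the attribute node as context node.
    The trailing /.. steps from the attribute back to its element.
    """
    test = name_test(attr)
    if value_test:
        test += f" and {value_test}"
    return f"{element(tag)}/@*[{test}]/.."


def class_test(class_name: str) -> str:
    """Whitespace-separated token match, as needed for [class~=...] and .class."""
    return (
        "contains(concat(' ', normalize-space(.), ' '), "
        f"concat(' ', '{class_name}', ' '))"
    )


def ends_with_test(suffix: str) -> str:
    """Suffix match without ends-with(), for a suffix of any length."""
    return f"substring(., string-length(.) - {len(suffix)} + 1) = '{suffix}'"


if __name__ == "__main__":
    print(element("body"))                                  # BODY
    print(with_attribute("p", "align"))                     # P[align]
    print(with_attribute("p", "class", class_test("intro")))  # P.intro
    print(with_attribute("p", "align", ends_with_test("t")))  # P[align$="t"]
```

The printed expressions have the same shape as the case-insensitive rows in the table. The suffix test is written for a suffix of any length, so it differs slightly from the single-character form used in the table.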

This subset of CSS selectors shows that, as far as HTML documents are concerned, the conversion is not as simple as existing documents outline it, because of case-sensitivity. Next to case-sensitivity, there are also some XPath string issues. XPath 1.0 string literals have no escape sequences, so a literal delimited by one kind of quote cannot contain that quote. For example, if you search for the string more "of 'this'", you cannot put it literally into any of the XPath expressions above; it has to be assembled from pieces with concat(). If you plan to write those XPath expressions manually, things become more and more awkward, but it is still possible.
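
To make the string issue concrete, here is a small Python sketch of one way to quote arbitrary text for XPath 1.0. The function name xpath_literal is my own choice for illustration and not part of any library; the approach splits the text at single quotes and joins the pieces with concat():

```python
def xpath_literal(text: str) -> str:
    """Return an XPath 1.0 expression that evaluates to `text`.

    Hypothetical helper: XPath 1.0 has no escape sequences, so text
    containing both quote characters is assembled with concat().
    """
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    pieces = []
    for index, part in enumerate(text.split("'")):
        if index:
            # Re-insert the single quote we split on, wrapped in double quotes.
            pieces.append('"\'"')
        if part:
            # No single quote is left in `part`, so single quotes are safe.
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


if __name__ == "__main__":
    needle = xpath_literal("more \"of 'this'\"")
    print(f"(//P|//p)[contains(., {needle})]")
```

The result can be dropped into any of the expressions wherever a quoted value such as 'intro' appears.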

Namespaces aren't fully covered here and need additional discussion. CSS can now declare a default namespace as well as other namespaces. Luckily, as far as HTML documents are concerned, there is not much namespacing involved in practice, so it is probably OK to keep it out of this first table.

However, pseudo-classes are largely missing, and they are quite interesting as well, so there is some room for a follow-up post.

Resources

I’ve used the following blog-posts for conversion suggestions/examples:

Related Stackoverflow CSS Selector to XPATH questions:

This entry was posted in Developing, PHP Development, Pressed.
