SimpleXML and JSON Encode in PHP – Part II

In the previous post (Part I) I gave a short overview of the common woes of turning a SimpleXMLElement into JSON when the XML carries structural information that JSON cannot easily encode. The explanations there were aimed at readers new to the matter, to help them understand the general dilemma this kind of encoding/serialization has to deal with.

In this part I will point out some more detailed issues and show straightforward ways to deal with them, specific to encoding a SimpleXMLElement object as JSON.

As you might know, SimpleXML is simple, and as with PHP itself, which wants to do things the simple way, it turns out that in the details these simple things are extremely differentiated and complicated. In short: the last part dealt with what JSON can’t handle of XML; this part is more concerned with what SimpleXMLElement can’t handle of XML.

<![CDATA[]]> and json_encode()

There is not much of interest here. SimpleXML partially supports CDATA in normal use, but that support is already limited, and with json_encode() it is even more limited. Before I show some of those limitations, it’s worth suggesting a feature of the underlying libxml library that expands CDATA nodes when the XML document is loaded. It takes care of all CDATA nodes and turns them into text nodes, which we learned to deal with in the first part, so all CDATA-related problems go away. A great simplification.

First, a code example with CDATA and the feature deactivated (the default) to demonstrate the limitation:

$buffer = <<<BUFFER
<root>
    <element><![CDATA[test]]></element>
</root>
BUFFER;

$xml = simplexml_load_string($buffer);

echo json_encode($xml, JSON_PRETTY_PRINT), "\n";

{
    "element": {

    }
}

As the JSON output shows, there is no text for the element. It’s an empty object.
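For contrast, here is a minimal sketch of the partial support mentioned above: reading the same element with a string cast still yields the CDATA text, even though json_encode() does not pick it up. The variable name $text is only for illustration.

```php
$buffer = <<<BUFFER
<root>
    <element><![CDATA[test]]></element>
</root>
BUFFER;

// Loaded without LIBXML_NOCDATA, so the CDATA node stays a CDATA node.
$xml = simplexml_load_string($buffer);

// A string cast returns the element's text content, CDATA included.
$text = (string) $xml->element;

echo $text, "\n"; // test
```

So the text is not lost in the document; it just isn't exposed in the form json_encode() looks at.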

The CDATA-related option is called LIBXML_NOCDATA. When it is specified while creating a SimpleXMLElement or calling the simplexml_load_* functions, the expansion is done and all CDATA nodes are converted into text nodes. Then the rules outlined in Part I about text nodes apply:

$buffer = <<<BUFFER
<root>
    <element><![CDATA[test]]></element>
</root>
BUFFER;

$xml = simplexml_load_string($buffer, NULL, LIBXML_NOCDATA);

echo json_encode($xml, JSON_PRETTY_PRINT), "\n";

{
    "element": "test"
}

As the PHP example and the JSON output show, LIBXML_NOCDATA was passed when loading the XML, so the CDATA text test has been turned into a text node that json_encode() can encode properly.
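The same option can also be passed to the constructor instead of simplexml_load_string(). A minimal sketch, assuming the same XML as before; note that for new SimpleXMLElement() the options are the second argument, not the third:

```php
$buffer = <<<BUFFER
<root>
    <element><![CDATA[test]]></element>
</root>
BUFFER;

// Options are the second constructor argument (there is no class-name parameter here).
$xml = new SimpleXMLElement($buffer, LIBXML_NOCDATA);

$json = json_encode($xml);

echo $json, "\n"; // {"element":"test"}
```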

This should normally do it. If that feature is not used, some different rules apply. In the element passed to json_encode(), CDATA is always ignored, and it kills all otherwise valid text nodes as well. In child elements, CDATA is normally ignored too. Only when a child starts with a non-whitespace text node followed by a CDATA node is the text of both, the text node and the CDATA node, encoded as a string value in the JSON. If no text node, or only a whitespace text node, comes before it, the CDATA and all text nodes that potentially follow are dropped.

No Comment

It stays weird with XML comments. As with CDATA nodes, SimpleXML does not let you access or manipulate comment nodes, but it keeps them intact. And this time there is no special creation/loading option to strip them. This is extremely sad, because with json_encode() those XML comments are encoded as empty objects named comment!

<root>
    <!-- no comment -->
</root>

{
    "comment": {

    }
}

This is probably the point where using out-of-the-box SimpleXML stops being fun for your encoding needs. This “comment element” is always encoded as if it were an element with that name, regardless of namespaces, for example. So the only option here is to remove all comments before encoding. This is possible with the DOM extension; here is a short excursion that imports the SimpleXMLElement:

$xml = simplexml_load_string($buffer);

/* Remove all comment nodes from a SimpleXML document: */

$doc   = dom_import_simplexml($xml)->ownerDocument;
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//comment()') as $comment)
{
    /* @var DOMComment $comment */
    $comment->parentNode->removeChild($comment);
}


echo json_encode($xml, JSON_PRETTY_PRINT), "\n";

The part in the middle effectively removes all comment nodes from the SimpleXML XML document.

Processing Instruction is Encoding Instruction? Rly?

Let’s make this quick. As with comments, if the XML document contains processing instructions, each one is encoded as an empty object, with the instruction’s name (its target) as the property name:

<root>
    <element>
        <?php processing instruction ?>
    </element>
</root>

{
    "element": {
        "php": {

        }
    }
}

So the “solution” here is again to remove them prior to JSON encoding:

/* Remove all processing instructions from a SimpleXML document: */

$doc   = dom_import_simplexml($xml)->ownerDocument;
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//processing-instruction()') as $instruction)
{
    /* @var DOMProcessingInstruction $instruction */
    $instruction->parentNode->removeChild($instruction);
}
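Since both removals follow the same pattern, they can be folded into a single pass. The following is only a sketch: stripCommentsAndPis() is a hypothetical helper name, not something SimpleXML or DOM provides, and the sample XML is made up for the demonstration. It uses an XPath union to select both node types at once:

```php
/**
 * Hypothetical helper (not part of SimpleXML or DOM): removes all comment
 * and processing-instruction nodes from the document behind $xml.
 */
function stripCommentsAndPis(SimpleXMLElement $xml): void
{
    $doc   = dom_import_simplexml($xml)->ownerDocument;
    $xpath = new DOMXPath($doc);

    /* The XPath union selects both node types in one query. */
    foreach ($xpath->query('//comment() | //processing-instruction()') as $node)
    {
        $node->parentNode->removeChild($node);
    }
}

/* Nowdoc, so nothing inside the sample XML is interpolated. */
$buffer = <<<'BUFFER'
<root>
    <!-- no comment -->
    <?php processing instruction ?>
    <element>test</element>
</root>
BUFFER;

$xml = simplexml_load_string($buffer);
stripCommentsAndPis($xml);

$json = json_encode($xml);

echo $json, "\n"; // {"element":"test"}
```

Because dom_import_simplexml() works on the same underlying document, the SimpleXMLElement sees the removal right away and no re-import is needed.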

Summary of Part II

The three node types CDATA, comment and processing instruction are poorly supported, or not supported at all, by SimpleXMLElement, and they stand in your way when such an object is JSON encoded. For CDATA nodes there is a built-in feature when creating the object; for comments and processing instructions you’re on your own.

This is somewhat sad, as comments and processing instructions are otherwise not exposed at all by SimpleXMLElement.

This second part already shows how to work around these issues. In the next part I’ll show how to not run into these issues in the first place. Have fun.


Continue Reading: SimpleXML and JSON Encode in PHP – Part III and End
