SimpleXML and JSON Encode in PHP – Part I

Passing a SimpleXMLElement to json_encode() looks like a quick and easy way to turn some XML into JSON. But not everything in PHP that has an easy interface works out of the box. In this three-part series I’ll cover the basics of using the json_encode() function on a SimpleXMLElement, point out the problematic areas and explain them by the limitations of JSON and SimpleXML, and show how to deal with them, including how alternative JSON encoding can be done easily, even with advanced options.

The Dead-Simple Use-Case

Let’s start easy. Under normal circumstances this is pretty straightforward. The following example does not contain anything particularly fancy and is similar to what Convert XML to JSON in PHP by Senthil Nathan et al. (Jan 2007), XML to JSON with PHP by Sharon Lee (Sep 2010) and Simple XML to JSON with PHP by Sean Biefeld (Oct 2011) outlined as well:

$buffer = <<<BUFFER
<root>
  <element>value</element>
</root>
BUFFER;

$xml = simplexml_load_string($buffer);

echo json_encode($xml, JSON_PRETTY_PRINT), "\n";
{
    "element": "value"
}

The root element gets turned into an object and its child-elements into properties with their node-value as value (here strings).

If the element contains further children, those are listed as further properties.

<root>
  <elements>
    <element>value</element>
  </elements>
</root>
{
    "elements": {
        "element": "value"
    }
}

Because properties in JSON object notation should have unique names (compare RFC 4627, 2.2. Objects: "The names within an object SHOULD be unique.") and as XML allows children with the same name, PHP’s json_encode() will create an array holding all children with the same name:

<root>
  <elements>
    <element>value</element>
    <element>value</element>
  </elements>
</root>
{
    "elements": {
        "element": [
            "value",
            "value"
        ]
    }
}
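A consequence for anyone consuming this JSON is that the same property is a single value when the XML had one child and an array when it had several. The following is a minimal sketch of one way to cope with that; as_list() is a hypothetical helper of my own naming, not something PHP provides:

```php
<?php
// Hypothetical helper (not part of PHP): wrap a single decoded value in a
// list so callers can always iterate, however many elements the XML had.
function as_list($value): array
{
    return is_array($value) ? $value : [$value];
}

$one  = '<root><elements><element>value</element></elements></root>';
$many = '<root><elements><element>value</element><element>value</element></elements></root>';

$counts = [];
foreach ([$one, $many] as $buffer) {
    // Round-trip through JSON the same way a consumer of the output would.
    $data = json_decode(json_encode(simplexml_load_string($buffer)));
    $counts[] = count(as_list($data->elements->element));
}

echo implode(', ', $counts), "\n"; // prints "1, 2"
```

This relies on json_decode() returning stdClass objects for JSON objects (its default), so that only real JSON arrays satisfy is_array().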

The last example already shows that you cannot simply map an XML structure onto a JSON structure 1:1. Besides elements with the same name, there are also attributes. Those are turned into a fake property named @attributes, which is a JSON object containing the XML attributes as property/value pairs:

<root attribute1="value1" attribute2="value2"/>
{
    "@attributes": {
        "attribute1": "value1",
        "attribute2": "value2"
    }
}
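Because @attributes is not a valid PHP identifier, reading it back after json_decode() needs a little care. A short sketch of both decoding styles, using the same XML as above:

```php
<?php
$xml  = simplexml_load_string('<root attribute1="value1" attribute2="value2"/>');
$json = json_encode($xml);

// Decoded as objects: "@attributes" needs the brace syntax for property names.
$object = json_decode($json);
echo $object->{'@attributes'}->attribute1, "\n"; // value1

// Decoded as associative arrays: it is just a plain string key.
$array = json_decode($json, true);
echo $array['@attributes']['attribute2'], "\n"; // value2
```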

So far, even if not every XML can be turned into JSON 1:1, this still seems to work well.

So when does it break?

It breaks exactly at the point where a decision must be made between encoding a value as text or as a structure. That decision is made when an XML element contains a non-whitespace text-node as the first child-node. So the first child-node decides whether the element becomes a JSON string holding its textual node-value, or a JSON object again with all child-elements and attributes.

Let’s give two examples that show both. First for the JSON string value. This XML contains an element with a non-whitespace text-node as first child:

<root>
 <element>
   non <child/>
   white <child/>
   space
 </element>
</root>
{
    "element": "\n   non \n   white \n   space\n "
}

As the JSON shows, there is one property named element with a string value. The child-elements are dropped because the decision was made to prefer the string value.

Now let’s see the other example where the object value is preferred. The XML now contains an element with a whitespace text-node as first child:

<root>
 <element>
   <child/>
   white <child/>
   space
 </element>
</root>
{
    "element": {
        "child": [
            {

            },
            {

            }
        ]
    }
}

The JSON now does contain the child-elements (here an array of two empty <child/> elements) and all the text-nodes are dropped. So quite the opposite from the previous example.

Both illustrate the decision point. As XML attributes are turned into the @attributes property, they are also only available in the second mode. Here is a short example:

<root>
 <element attribute="value">
   non-whitespace-text
 </element>
 <element attribute="value">
   <child/>
 </element>
</root>
{
    "element": [
        "\n   non-whitespace-text\n ",
        {
            "@attributes": {
                "attribute": "value"
            },
            "child": {

            }
        }
    ]
}

As the JSON shows, the first element is again the string value, while the second element contains attributes and children but no strings. So keep in mind that there is a decision point on the first child-node being a non-whitespace text-node or not. There is some reason behind this when you consider how White Space in XML Documents works combined with the limitations of JSON (compared to XML). That’s the trade-off of JSON’s simplicity: it cannot express as much as XML can.
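To check up front which of the two modes a given child-element will fall into, the rule can be sketched as a small test on the first child-node. encodes_as_string() is a hypothetical helper written for this illustration; it assumes the DOM extension is available and uses trim() as an approximation of what XML counts as whitespace:

```php
<?php
// Hypothetical helper: predicts whether json_encode() would render a child
// element as a plain string (true) or as an object (false), following the
// first-child rule described above.
function encodes_as_string(SimpleXMLElement $element): bool
{
    $first = dom_import_simplexml($element)->firstChild;

    return $first !== null
        && $first->nodeType === XML_TEXT_NODE
        && trim($first->nodeValue) !== '';
}

$buffer = <<<BUFFER
<root>
 <element attribute="value">
   non-whitespace-text
 </element>
 <element attribute="value">
   <child/>
 </element>
</root>
BUFFER;

$xml = simplexml_load_string($buffer);

var_dump(encodes_as_string($xml->element[0])); // bool(true)
var_dump(encodes_as_string($xml->element[1])); // bool(false)
```

A true result means the attributes and child-elements of that element will not show up in the JSON.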

No Rule without an Exception

There is no rule without an exception. What I just wrote is true for all elements but the root-element or, more correctly, the SimpleXMLElement object that is passed into the json_encode() function. Let’s illustrate the exception with the following PHP example:

$buffer = <<<BUFFER
<root attribute1="value1" attribute2="value2">
  text
</root>
BUFFER;

$xml = simplexml_load_string($buffer);

echo json_encode($xml, JSON_PRETTY_PRINT), "\n";

According to the decision formulated above, as this root element (also named root for clarity) contains a non-whitespace text-node as first child-node, the attributes should be dropped in the JSON and only the string value should survive. Well, neither really happens the way I’ve shown so far. Something different happens; see the JSON this generates:

{
    "@attributes": {
        "attribute1": "value1",
        "attribute2": "value2"
    },
    "0": "\n  text\n"
}

As this JSON output shows, the attributes are still serialized in the JSON as the known @attributes property, so they are not dropped. And the text? It’s also part of the JSON, as a property named "0", that is, a string containing the number zero. This is perfectly valid in JavaScript and so it is in JSON.

What this example does not show is that if that root element has a child element, the text-node is no longer encoded. And that is regardless of whether the child-element in the passed element is preceded by a non-whitespace text-node or not. A child-element in the passed element will always kill text-nodes.
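A small sketch makes this visible by comparing a passed-in element with text only against one with text followed by a child-element. The XML strings are made up for this illustration:

```php
<?php
$textOnly  = simplexml_load_string('<root attribute="value">text</root>');
$withChild = simplexml_load_string('<root attribute="value">text<child/></root>');

// Decode to arrays so the keys of the top-level JSON object can be inspected.
$a = json_decode(json_encode($textOnly), true);
$b = json_decode(json_encode($withChild), true);

var_dump(array_key_exists(0, $a));       // text survives as property "0"
var_dump(array_key_exists(0, $b));       // text is gone once a child exists
var_dump(array_key_exists('child', $b)); // the child-element is kept
```

In both cases the @attributes property is still present, as it always is for the element that gets passed in.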

Eager first, Pragmatic then

So let’s extend the previous decision point and say that on the very first element passed to JSON encoding, PHP is eager to traverse all child-nodes. This makes sense especially if the element we pass in stands for many elements at once, as the following PHP example shows:

$buffer = <<<BUFFER
<root>
 <element attribute="value">
   text
 </element>
 <element attribute="value">
   <child/>
 </element>
 <element attribute="value">
   text
   <child/>
 </element>
 <element attribute="value">
   <child/>
   text
 </element>
</root>
BUFFER;

$xml = simplexml_load_string($buffer);

echo json_encode($xml->element, JSON_PRETTY_PRINT), "\n";

This example is a little larger because of the XML. It now contains four element children, each holding one of the combinations of child-nodes we already know from above. A difference is that now not $xml but $xml->element is passed into json_encode(). This can represent two things in SimpleXML depending on context: a) the first element node, or b) an iterator over all element nodes. As it gets passed into json_encode(), and with the second rule in mind that json_encode() is eager to process as many children as possible of the object passed in, the JSON contains all four elements:

{
    "@attributes": {
        "attribute": "value"
    },
    "0": "\n   text\n ",
    "1": {
        "@attributes": {
            "attribute": "value"
        },
        "child": {

        }
    },
    "2": "\n   text\n   \n ",
    "3": {
        "@attributes": {
            "attribute": "value"
        },
        "child": {

        }
    }
}
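If a real JSON array is preferred over an object with numeric property names, one possible approach (a sketch, not the only way) is to copy the elements into a plain PHP array first. Each element is then encoded on its own, as if it had been passed into json_encode() directly, so the eager rule applies to every one of them. The shortened XML here is made up for this illustration:

```php
<?php
$buffer = <<<BUFFER
<root>
 <element attribute="a">text</element>
 <element attribute="b"><child/></element>
</root>
BUFFER;

$xml = simplexml_load_string($buffer);

// Copy the matching elements into a plain PHP array so that json_encode()
// sees a list and encodes each element separately.
$list = [];
foreach ($xml->element as $element) {
    $list[] = $element;
}

echo json_encode($list, JSON_PRETTY_PRINT), "\n";
```

With this, every entry carries its own @attributes property instead of only one showing up at the top level.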

I’d say this is enough to chew on for one blog-post. In a later part I’ll show the handling of CDATA and comments and how SimpleXMLElement can be easily adapted to provide the JSON you need in case you’re not fine with this default handling. Have fun.


Continue Reading: SimpleXML and JSON Encode in PHP – Part II


Last time I blogged about SimpleXML was about the SimpleXML Type Cheatsheet.
