Using Catalogs for Validation with PHP’s DOMDocument and Libxml2

DOMDocument schema and DTD validation in PHP can make use of libxml2’s catalog support.

A catalog is basically an XML file that tells libxml2 where to find DTD and XSD files on the local disk. It maps a “logical” name such as -//w3c//dtd html 4.01 transitional//en, taken from the common <!doctype html public "-//w3c//dtd html 4.01 transitional//en"> doctype, to a concrete file on disk. It can also map a remote URI such as http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd to a local copy of the file.

The second case is where a catalog is extremely useful: the World Wide Web Consortium (W3C) adds an artificial delay of 30 seconds when serving these files, because most libraries (including PHP’s DOMDocument extension) do not cache remote files, which results in millions of hits on its servers each day.

Because of that delay, and because validation should always use local resources for performance reasons anyway, validating against such XSD files is not practical without a catalog.

Setting up the Catalog

The catalog for libxml2, the library behind PHP’s DOMDocument class, is specified via the environment variable XML_CATALOG_FILES. It must be set in the environment in which the PHP script is executed. Setting it from within the PHP script, for example with putenv('XML_CATALOG_FILES=...'), is not enough and does not work.
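One way to make sure the variable is present before a script starts is to launch that script as a child process with a prepared environment. The following is only a minimal sketch under assumptions: runWithCatalog() is a hypothetical helper name, the script and catalog paths are placeholders, and the array form of proc_open() requires PHP 7.4 or later. Setting the variable in the shell or in the service configuration that starts PHP serves the same purpose.

```php
<?php
/**
 * Hypothetical helper: run a PHP script in a child process whose
 * environment already contains XML_CATALOG_FILES when PHP starts.
 *
 * @param string $script  path of the PHP script to run (placeholder)
 * @param string $catalog value for XML_CATALOG_FILES
 * @return int exit code of the child process
 */
function runWithCatalog(string $script, string $catalog): int
{
    // Copy the current environment and add the catalog variable to it.
    $env = getenv();
    $env['XML_CATALOG_FILES'] = $catalog;

    // The child inherits the parent's standard streams (CLI only).
    $descriptors = [0 => STDIN, 1 => STDOUT, 2 => STDERR];

    $process = proc_open([PHP_BINARY, $script], $descriptors, $pipes, null, $env);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start the child PHP process.');
    }

    // proc_close() waits for the child and returns its exit code.
    return proc_close($process);
}

// Usage (paths are assumptions):
// $exitCode = runWithCatalog('validate.php', '/path/to/schema/catalog.xml');
```

The child process sees the variable from its very first instruction, which is the condition described above.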

The variable can also point to multiple catalog files, with the filenames separated by spaces. If a filename itself contains a space, the workaround is to encode it as a file URI:

C:\Documents and Settings\hakre\PhpstormProjects\schema-validation/schema/catalog.xml
file://C:/Documents%20and%20Settings/hakre/PhpstormProjects/schema-validation/schema/catalog.xml

In this example, setting XML_CATALOG_FILES to file://C:/Documents%20and%20Settings/hakre/PhpstormProjects/schema-validation/schema/catalog.xml successfully loads catalog.xml.
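The encoding step can be sketched as a small function. This is an illustration only: catalogPathToFileUri() is a hypothetical name, and the output deliberately mirrors the file://C:/ form used in the example above rather than claiming to cover every platform's file URI rules.

```php
<?php
/**
 * Hypothetical helper: turn a filesystem path into a file URI
 * without spaces, suitable as an entry in XML_CATALOG_FILES.
 */
function catalogPathToFileUri(string $path): string
{
    // Normalize Windows backslashes to forward slashes.
    $path = str_replace('\\', '/', $path);

    // Percent-encode every path segment, but leave a drive letter
    // such as "C:" untouched so the colon is not encoded.
    $segments = array_map(
        static function (string $segment): string {
            return preg_match('/^[A-Za-z]:$/', $segment)
                ? $segment
                : rawurlencode($segment);
        },
        explode('/', $path)
    );

    return 'file://' . implode('/', $segments);
}

echo catalogPathToFileUri(
    'C:\\Documents and Settings\\hakre\\PhpstormProjects\\schema-validation/schema/catalog.xml'
), "\n";
// prints file://C:/Documents%20and%20Settings/hakre/PhpstormProjects/schema-validation/schema/catalog.xml
```

Several encoded entries could then be joined with a single space, for example with implode(' ', $uris), to form the value for XML_CATALOG_FILES.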

Defining the Catalog

The catalog itself is an XML file (Wikipedia: XML Catalog). This small example makes use of two schemas, XHTML 1.0 and XML. Both XSD files are stored in the same directory as the catalog.xml file:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
            uri="xhtml1-transitional.xsd"/>
    <system systemId="http://www.w3.org/2001/xml.xsd"
            uri="xml.xsd"/>
</catalog>

This example shows that you can use relative references to the XSD files. With the environment variable set and the catalog file in place, validation is now straightforward:

<?php
/**
 * Validate with a catalog
 */

$doc = new DOMDocument();
$doc->load('test-data.xml');
$isValid = $doc->schemaValidate('test-schema.xsd');
var_dump($isValid);

And that’s basically it. As an alternative, PHP lets you register a callback function that resolves public and system identifiers; however, once the catalog.xml file is set up, I found the catalog much nicer to work with than the callback function.
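For comparison, the callback approach presumably refers to libxml_set_external_entity_loader(), which is available since PHP 5.4. The sketch below is an assumption-laden illustration, not the author's code: resolveLocalSchema() is a hypothetical name, and the schema/ directory next to the script is a placeholder for wherever the local XSD copies live.

```php
<?php
/**
 * Hypothetical resolver: map known remote schema URIs to local copies.
 * The file names mirror the catalog example; the directory is an assumption.
 *
 * @return string|null local path to load, or null to refuse the request
 */
function resolveLocalSchema(?string $public, ?string $system, array $context): ?string
{
    static $map = [
        'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd' => 'xhtml1-transitional.xsd',
        'http://www.w3.org/2001/xml.xsd' => 'xml.xsd',
    ];

    // Known remote identifiers are redirected to the local copy.
    if ($system !== null && isset($map[$system])) {
        return __DIR__ . '/schema/' . $map[$system];
    }

    // Files that already exist locally are passed through unchanged.
    if ($system !== null && is_file($system)) {
        return $system;
    }

    // Anything else is refused, so no unexpected network request is made.
    return null;
}

// Register the resolver for all subsequent libxml loads in this process.
libxml_set_external_entity_loader('resolveLocalSchema');
```

Unlike the catalog, this mapping lives in PHP code and has to be registered in every script that validates, which is one reason the catalog file can be the more convenient choice.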


5 Responses to Using Catalogs for Validation with PHP’s DOMDocument and Libxml2

  1. Another method is to validate the DOM against the root XSD:
    $dom->schemaValidate

    Then keep the XSD files in a single folder and use schemaLocation with just the XSD filename.

  2. NAMEREQUIRED says:

    This article is no help because it doesn’t state how to set the xml_catalog_files variable correctly. If not done with setenv(), how else?

    • hakre says:

      I’m sorry if that isn’t clear from the text on its own. I write about putenv(), which is the PHP function: https://secure.php.net/putenv . Environment variables are, as the name says, variables of a certain environment. libxml needs the environment variables of the underlying system, not those of the thread of the system PHP runs in but of a different area (maybe better described as “a level above”).

      As libxml is loaded when you initialize PHP, the environment variable needs to be available at that point. When the PHP code is executed, it’s “too late” ™. So add this to your system configuration if you would like to be on the safe side. Contact your system administrator or Linux professional and you should be able to have this sorted out for *any* system within no time in case you have doubts. Otherwise just go through the issue with a pair of fresh eyes. Sorry for the late reply.

  3. dwilbourne says:

    Off-topic question: what is this?

    xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"

    Obviously it is a namespace declaration for the catalog element, but I have never seen a sequential series of prefixes separated by colons without any namespace identifier. Or is this shorthand for setting up a whole number of prefixes (urn through xml), all of which are prefixes for the namespace identifier “catalog”? Can you explain? Thanks!
