DOMDocument schema and DTD validation in PHP can make use of libxml2’s Catalog support feature.
A catalog is basically an XML file that tells the parser where on local disk to find DTD and XSD schema files. It can map a “logical” name such as -//w3c//dtd html 4.01 transitional//en, taken from the common <!doctype html public "-//w3c//dtd html 4.01 transitional//en"> doctype, to a concrete file on disk. It can also map a remote URI such as http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd to a local copy of that file.
The second case is extremely useful, because the World Wide Web Consortium (W3C) adds an arbitrary delay of 30 seconds when serving these files. It does so because most libraries (including PHP’s DOMDocument extension) do not cache the remote files, which results in millions of hits on its servers each day.
Because of that delay, and because you should always validate against local resources for performance reasons, it’s not practical to validate against XSD files without such a catalog.
Setting up the Catalog
The catalog for libxml (the library behind PHP’s DOMDocument object) is specified via an environment variable called XML_CATALOG_FILES. It must be set in the environment in which the PHP script is executed. Setting it from within the PHP script, for example with putenv('XML_CATALOG_FILES=...'), is not enough and does not work.
The variable can also point to multiple catalog files. The filenames are separated by spaces. If a filename itself contains a space, the workaround is to encode it as a file URI:
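One way to guarantee the variable exists before PHP starts is to launch the validating script as a child process with a prepared environment. The following is only a minimal sketch of that idea: the function name runWithCatalog and the script name validate.php are hypothetical and not part of the original setup, and the array form of proc_open() assumes PHP 7.4 or later.

```php
<?php
/**
 * Run a command in a child process whose environment already contains
 * XML_CATALOG_FILES, and return its exit code and standard output.
 * (Hypothetical helper, for illustration only.)
 *
 * @param string[] $command      program and arguments
 * @param string   $catalogFiles space-separated catalog file names or file URIs
 * @return array{0:int,1:string} exit code and captured stdout
 */
function runWithCatalog(array $command, string $catalogFiles): array
{
    // Start from the current environment (getenv() without arguments needs PHP 7.1+).
    $env = getenv();
    $env['XML_CATALOG_FILES'] = $catalogFiles;

    // Only stdout is captured; stdin and stderr are inherited from the parent.
    $process = proc_open($command, [1 => ['pipe', 'w']], $pipes, null, $env);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start the child process.');
    }

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    return [proc_close($process), $output];
}

// Hypothetical usage: validate.php would contain the schemaValidate() call.
// [$exitCode, $output] = runWithCatalog([PHP_BINARY, 'validate.php'], '/path/to/schema/catalog.xml');
```

For a web server or a service, the equivalent step is to set the variable in the server or system configuration, so that it is present when the PHP process itself starts.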
C:\Documents and Settings\hakre\PhpstormProjects\schema-validation/schema/catalog.xml
file://C:/Documents%20and%20Settings/hakre/PhpstormProjects/schema-validation/schema/catalog.xml
In this example, setting XML_CATALOG_FILES to file://C:/Documents%20and%20Settings/hakre/PhpstormProjects/schema-validation/schema/catalog.xml will successfully load catalog.xml.
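The encoding step can be sketched in PHP as follows. The helper name catalogPathToFileUri is hypothetical. It simply reproduces the file:// form shown above by normalizing backslashes and percent-encoding each path segment, and it is not meant as a complete path-to-URI converter.

```php
<?php
/**
 * Turn a catalog path into the file URI form used above.
 * (Hypothetical helper, for illustration only.)
 */
function catalogPathToFileUri(string $path): string
{
    // Use forward slashes throughout, as in the URI above.
    $normalized = str_replace('\\', '/', $path);

    // Percent-encode every segment, so a space becomes %20.
    $segments = array_map('rawurlencode', explode('/', $normalized));

    // rawurlencode() also encodes the drive colon; restore it.
    $encoded = str_replace('%3A', ':', implode('/', $segments));

    return 'file://' . $encoded;
}

echo catalogPathToFileUri(
    'C:\Documents and Settings\hakre\PhpstormProjects\schema-validation/schema/catalog.xml'
), "\n";
```

Because the resulting URI contains no spaces, it can be combined with other catalog entries in the space-separated XML_CATALOG_FILES value.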
Defining the Catalog
The catalog itself is an XML file (Wikipedia: XML Catalog). This little example makes use of two schemas, XHTML 1.0 and XML. Both XSD files are stored in the same directory as the catalog.xml file:
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
            uri="xhtml1-transitional.xsd"/>
    <system systemId="http://www.w3.org/2001/xml.xsd"
            uri="xml.xsd"/>
</catalog>
This example shows that you can use relative references to the XSD files. With the environment variable set and the catalog file in place, validation is now straightforward:
<?php
/**
 * Validate with a catalog
 */
$doc = new DOMDocument();
$doc->load('test-data.xml');
$isValid = $doc->schemaValidate('test-schema.xsd');
var_dump($isValid);
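When validation returns false, it helps to see what libxml complained about. The sketch below wraps the same load() and schemaValidate() calls and collects the messages through PHP's libxml error functions. The function name schemaErrors is hypothetical, and the file names in the usage note are the ones from the example above.

```php
<?php
/**
 * Validate an XML file against an XSD and return libxml's messages.
 * An empty array means that no problem was reported.
 * (Hypothetical helper, for illustration only.)
 *
 * @return string[]
 */
function schemaErrors(string $xmlFile, string $xsdFile): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    if ($doc->load($xmlFile)) {
        $doc->schemaValidate($xsdFile);
    }

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('%s:%d: %s', $error->file, $error->line, trim($error->message));
    }

    // Restore the previous error handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}
```

Calling schemaErrors('test-data.xml', 'test-schema.xsd') would then return one message per problem, including failures to load a schema that the catalog was expected to resolve.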
And that’s basically it. PHP also offers a workaround that uses a callback function to resolve public and system identifiers. However, once the catalog.xml file is set up, I found the catalog works much better than the callback function.
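For comparison, the callback workaround could look roughly like the sketch below, assuming it refers to PHP's libxml_set_external_entity_loader() (available since PHP 5.4). The function name useLocalCopies and the mapping array are hypothetical. Whether imports during schemaValidate() pass through this callback in a given PHP and libxml build is an assumption worth verifying before relying on it.

```php
<?php
/**
 * Register a resolver that serves local copies for known identifiers.
 * (Hypothetical helper, for illustration only.)
 *
 * @param array<string,string> $localCopies system or public ID => local file path
 */
function useLocalCopies(array $localCopies): void
{
    libxml_set_external_entity_loader(
        static function (?string $publicId, ?string $systemId, array $context) use ($localCopies) {
            // Prefer a match on the system identifier, then on the public one.
            foreach ([$systemId, $publicId] as $id) {
                if ($id !== null && isset($localCopies[$id])) {
                    return $localCopies[$id];
                }
            }

            // Unknown identifiers are passed through unchanged.
            return $systemId;
        }
    );
}

// Hypothetical usage, mirroring the two catalog entries above:
// useLocalCopies([
//     'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd' => __DIR__ . '/xhtml1-transitional.xsd',
//     'http://www.w3.org/2001/xml.xsd' => __DIR__ . '/xml.xsd',
// ]);
```

Unlike the catalog, this mapping lives in the PHP code and has to be registered in every script that validates.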
See Also
- 47. Catalog Common Resources (Chapter from the book “Effective XML”; Copyright 2003 Elliotte Rusty Harold)
- Cache Soap envelope schema for schema validation (20 Oct 2011; by Chris)
- Speeding up XML schema validations of a batch of XML files against the same XML schema (XSD) (13 Dec 2012; related Stackoverflow Q&A material)
- Handle XML Catalogs by php (10 May 2011; related Stackoverflow Q&A material)
Another method is to validate the DOM against the root XSD:
$dom->schemaValidate
Then keep the XSD files in a single folder and use schemaLocation with just the XSD filename.
This article is no help because it doesn’t state how to set the xml_catalog_files variable correctly. If not done with setenv(), how else?
I’m sorry if that isn’t clear from the text on its own. I write about putenv(), which is the PHP function: https://secure.php.net/putenv . Environment variables are, as the name says, variables of a certain environment. libxml needs the environment variables of the underlying system, not those of the thread of the system PHP runs in, but of a different area (maybe better described as “a level above”).
As libxml is loaded when you initialize PHP, the environment variable needs to be available at that point. When the PHP code is executed, it’s “too late” ™. So add this to your system configuration if you would like to be on the safe side. If you have doubts, contact your system administrator or Linux professional and you should be able to have this sorted out for *any* system in no time. Otherwise just go through the issue with a pair of fresh eyes. Sorry for the late reply.
off topic question – what is this?
Obviously it is a namespace declaration for the catalog element but I have never seen a sequential series of prefixes separated by colons without any namespace identifier. Or is this shorthand for setting up a whole number of prefixes (urn through xml) all of which are prefixes for the namespace identifier “catalog”. Can you explain? Thanks!
I guess the example was eaten from your comment and you were referring to this:
xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"
And more specifically to the URI used in it. It is a standard URI of the URN subset (the other subset is URL). All in all it is “just” a URI, as outlined in RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. In section 1.1.3. URI, URL, and URN, that RFC points to the (earlier by number) RFC 2141: URN Syntax, which is what I guess you’re probably looking for here.
Hope this helps. And I don’t think it’s off topic.