The old, rusty tech-monster from the swamp, the beloved robots.txt, which kept runaway droids from DDoSing your servers years ago, still has its place in SEO, SEM and generic robot access control today. A site shouldn’t run without this file – and not only to prevent 404 messages in your log files.
The de-facto standard has its own history, and plenty of documentation and resources exist. The everyday usage is well described elsewhere, so I will skip the basics and come straight to the point that caught my attention: how should the rules put in there be encoded, and how should the encoding/charset of the robots.txt be announced? This came up in Ticket #14069 and concerns HTTP.
About Charsets in the Internet (Fast Lane)
On the internet there is always a safe charset/encoding to fall back on: US-ASCII. It’s a favorite of IANA and the IETF and should do the job out of the box. The downside is its limited set of characters. Germans need to drop their crazy umlauts, the French their grave accents, the Spanish their eñes, etc. – and do not even think about leaving Western Europe or North America. You would be lost unless you’re a god of ASCII art.
Gladly the internet has evolved over the last years, and after a period of special-case character sets like Latin-1 or multi-byte charsets for Asian languages (e.g. EUC-JP), things have become easier again: Unicode. It’s worth learning about Unicode because it solves a lot of problems at once (one charset for the whole world, incl. music and math etc.). In the end, as a programmer you no longer need to be a specialist in every other encoding – even if you do not know all the characters and ranges Unicode offers. The Unicode encoding that is especially important on the web is called UTF-8. That’s the suggested encoding for WordPress these days as well, btw. And – surprise, surprise – it’s backward compatible with US-ASCII. Circle closed.
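That backward compatibility is easy to check for yourself – a quick Python sketch (one possible illustration, not from the original post): any pure-ASCII string produces byte-for-byte identical output under both encodings.

```python
# Sanity check: UTF-8 is backward compatible with US-ASCII,
# so ASCII-only text encodes to the exact same bytes either way.
text = "User-agent: *"
assert text.encode("ascii") == text.encode("utf-8")

# Non-ASCII characters are where the two diverge: UTF-8 represents
# them as multi-byte sequences, while US-ASCII cannot represent them.
print("ü".encode("utf-8"))  # b'\xc3\xbc' - two bytes in UTF-8
```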
Learning about robots.txt encoding
How does this influence the encoding of the robots.txt file? Well, if you have directory or file names on your server which contain characters that cannot be written in US-ASCII (like Hebrew or Japanese), how can those locations on your server be used in the robots.txt? How do you “write” them?
| # | URL | Writing |
|---|---|---|
| 1 | http://example.com/וורדפרס-בעברית/ | Hebrew |
| 2 | http://example.com/一样/ | Japanese |
Resources from search engines like Google or Bing make different suggestions about how your robots.txt should or could be encoded. One charset that all search engine crawlers support is the basic 7-bit US-ASCII charset, the default charset in internet communication. My suggestion is therefore to make your robots.txt a US-ASCII file, because it is the least common denominator – and it is possible to do so.
But why is it so hard to find a clear recommendation for which charset to use? If you read the robots.txt standard draft memo, you might notice that charset encoding is never discussed. The memo is about plain text over HTTP (at that time HTTP/1.0), and it’s about URLs. All of this indicates that the encoding should be US-ASCII, but it’s never named concretely.
And what about all the letters you cannot write in US-ASCII? Those can be URL-encoded, because URLs are always encoded, and the memo defines that the rules inside robots.txt contain URLs relative to the site’s root.
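As a sketch of how that URL-encoding works in practice (Python’s standard library shown here; the path is the Hebrew example from the first table):

```python
from urllib.parse import quote

# Percent-encode a non-ASCII path so it fits into a pure US-ASCII robots.txt.
# quote() encodes the string as UTF-8 first, then escapes each byte as %XX.
path = "/וורדפרס-בעברית/"
encoded = quote(path)  # '/' and '-' are safe characters and stay as-is
print(encoded)
# /%D7%95%D7%95%D7%A8%D7%93%D7%A4%D7%A8%D7%A1-%D7%91%D7%A2%D7%91%D7%A8%D7%99%D7%AA/
```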
So regardless of which encoding your blog uses, if you create validly encoded URLs, you can use them inside robots.txt with no problems and deliver the file in the internet’s standard charset. There is (native writing in comments not taken into account) not even a need to deliver it in any encoding other than plain-text 7-bit ASCII.
| # | Rule | Writing |
|---|---|---|
| 1 | Disallow: /%D7%95%D7%95%D7%A8%D7%93%D7%A4%D7%A8%D7%A1-%D7%91%D7%A2%D7%91%D7%A8%D7%99%D7%AA/ | Hebrew |
| 2 | Disallow: /%e4%b8%80%e6%a0%b7/ | Japanese |
The great plus of this basic charset is that every programming language I can think of supports it, so any robot should be able to deal with your site, even one that predates the Unicode era – because the encoding of your site’s URLs is transparently transported over HTTP in properly URL-encoded form, in both directions.
References
- American Standard Code for Information Interchange, ASA standard X3.4-1963.
- Koster, M., “A Method for Web Robots Control”, Internet Draft, Nov 1996.
- Berners-Lee, T., Fielding, R., and Frystyk, H., “Hypertext Transfer Protocol – HTTP/1.0”, RFC 1945, MIT/LCS, May 1996.
- Berners-Lee, T., Fielding, R., and Masinter, L., “Uniform Resource Identifier (URI): Generic Syntax”, RFC 3986, Jan 2005.
- Fielding, R., “Relative Uniform Resource Locators”, RFC 1808, Jun 1995.
- Url Sources: #14292, #14313.
- Image Credits: Robot © Christopher Grant Harris, Japanese Nada Letter provided by Haruno Akiha.