When you link to an item on your own weblog, it would be convenient to use a relative rather than an absolute URL. For example:
Table 1
relative | <a href="rss.xml">rss.xml</a> |
fully-qualified | <a href="http://weblog.infoworld.com/udell/rss.xml"> http://weblog.infoworld.com/udell/rss.xml </a> |
relative vs fully-qualified URIs
There are two advantages to the relative approach. First, it's more concise, so there's less typing and less likelihood of a copy/paste error. Second, it's more flexible. When you decide to relocate your blog to a new home -- it happens to everyone sooner or later -- you'd rather not have to chase down and adjust every internal reference.
For a while now, people in the weblog community have been discussing the use of an XML feature called XML Base:
This document describes a mechanism for providing base URI services to XLink, but as a modular specification so that other XML applications benefiting from additional control over relative URIs but not built upon XLink can also make use of it. The syntax consists of a single XML attribute named xml:base.
Figure 1 illustrates a simple RSS 2.0 feed that uses xml:base to specify the base URI for relative links.
Figure 1
<?xml version="1.0"?>
<rss version="2.0" xml:base="http://weblog.infoworld.com/udell">
  <channel>
    <title>Jon's Radio</title>
    <link>http://weblog.infoworld.com/udell/ </link>
    <description>Jon Udell's Radio Blog</description>
    <item>
      <title>relative URI test</title>
      <link>http://weblog.infoworld.com/udell/2003/08/04.html#a766 </link>
      <description>(<a href="/index.html">relative</a>, <a href="http://weblog.infoworld.com/udell/index.html">fully-qualified</a>)</description>
    </item>
  </channel>
</rss>
sample RSS 2.0 feed with xml:base
This feed is currently not considered valid, and that's appropriate. The RSS 2.0 specification is silent on this issue, and RSS readers have not hitherto been advised to support xml:base.
Although it is not deemed valid, you can in fact subscribe to the test feed shown in Figure 1. Here's how three RSS readers handle the relative and absolute links in the sample:
Table 2
| relative | absolute |
Radio UserLand | http://localhost:5335/index.html | http://weblog.infoworld.com/udell/index.html |
SharpReader | http://weblog.infoworld.com/index.html | http://weblog.infoworld.com/udell/index.html |
NetNewsWire | http://weblog.infoworld.com/udell/index.html | http://weblog.infoworld.com/udell/index.html |
How three RSS readers handle xml:base
[[[Although there is no requirement to do so, it appears that NetNewsWire implements xml:base as a best practice.]]] Although it appears that NetNewsWire implements xml:base in this example, it doesn't. Instead it uses the top-level link (rss/channel/link) which, in this example, is also http://weblog.infoworld.com/udell.
[[[We are inclined to recommend that all RSS readers honor xml:base when present.]]] We are inclined to recommend that RSS readers construe /rss/channel/link as the base by default, and xml:base (if present) as an override (if the base must differ from /rss/channel/link).
If that were to happen, weblog authors could -- after some transitional period -- begin using relative URIs for local references. Should they? The best practice in that case is debatable. Authors will, of course, always be free to choose between relative and absolute addressing. Depending on the content management system used, adjusting references when a blog moves might not be a problem, and absolute addressing will ensure that no RSS reader will be left behind. That said, if RSS readers embrace [[[xml:base]]] /rss/channel/link and xml:base, we would be inclined to recommend -- after a suitable transitional period -- that weblog authors take advantage of the feature.
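A minimal sketch of that lookup order, assuming Python 3 and only its standard library: take xml:base from the root <rss> element if it is present, otherwise fall back to /rss/channel/link. The feed text, the example.org URLs, and the function name feed_base are illustrative assumptions, and the sketch ignores xml:base on nested elements, which the XML Base spec also allows.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The xml: prefix is bound to this namespace by the XML specification.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def feed_base(rss_text):
    """Return xml:base from the root element, else /rss/channel/link, else None."""
    root = ET.fromstring(rss_text)
    override = root.get(XML_BASE)
    if override:
        return override.strip()
    link = root.findtext("channel/link")
    return link.strip() if link else None


# Hypothetical feed: no xml:base, so the channel link acts as the base.
PLAIN = """<rss version="2.0"><channel>
<title>Example</title>
<link>http://example.org/blog/ </link>
</channel></rss>"""

# The same feed with an xml:base override on the root element.
OVERRIDDEN = PLAIN.replace(
    '<rss version="2.0">',
    '<rss version="2.0" xml:base="http://example.org/archive/">',
)

for feed in (PLAIN, OVERRIDDEN):
    base = feed_base(feed)
    print(base, "->", urljoin(base, "2003/08/04.html"))
```

A reader that wanted the opposite precedence would only need to swap the two lookups inside feed_base.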
So -- I don't understand the xml:base spec.
How should this resolve:
xml:base = example.com URL = /bar/
Should it be...
1. example.com
or...
2. example.com
Jon (in email) goes for option #2. I think it's #1.
I think the answer is clear in the case where the relative URL does not begin with a /. If the relative URL is bar/ and the base URL is example.com then the result is example.com
The reason I think it's #1 is that it works the way relative URL resolution already works. Say you're in your browser at example.com and there's a link to /bar/ -- if you click it, you go to example.com
I don't see how specifying the base URL in xml:base is different from the above situation.
What if there's no xml:base? What's the fallback position?
I see three alternatives:
1. It could be left undefined.
2. Relative URLs should be resolved relative to the URL of the RSS feed.
3. Relative URLs should be resolved relative to the top-level link item. (The home page of the site, often. Where the news appears in HTML.)
I advocate #3.
To dispose of #1 -- it means aggregator authors all have to figure it out themselves. Because relative links do appear in feeds, no matter whether or not it's correct, aggregators need to deal with them. Better to have a policy all aggregators can follow.
To dispose of #2 -- if you move a feed (make a copy on your local machine, for instance), then this would mean that all of the relative links are to be resolved differently. That's totally weird. RSS feeds are portable. Moving a feed should not change how it's interpreted. (Note that there's no item in the RSS feed which points to its own URL.)
Also: people write for their news page, not for their RSS feed. When they're creating links, they're expecting them to work for the HTML home page. If their publishing systems don't expand those links in the RSS feed, you can assume that they're relative to the news page.
In other words, going by #2 means aggregator authors will get bug reports about this.
That leaves #3, use the link item as the base URL, as the fallback position.
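To make the portability argument concrete, here is a small Python 3 sketch. All of the URLs are made up for illustration. Resolving against the feed's own location gives a different answer once the feed is copied somewhere else; resolving against the link item does not.

```python
from urllib.parse import urljoin

# Hypothetical values, not taken from any real feed.
CHANNEL_LINK = "http://example.org/news/"
RELATIVE = "archive/2003.html"
FEED_LOCATIONS = [
    "http://example.org/news/rss.xml",           # where the feed was published
    "http://mirror.example.net/copies/rss.xml",  # a copy made somewhere else
]

# Alternative #2: the answer depends on where the feed was fetched from.
by_feed_url = [urljoin(loc, RELATIVE) for loc in FEED_LOCATIONS]

# Alternative #3: the answer depends only on what the feed itself says.
by_channel_link = [urljoin(CHANNEL_LINK, RELATIVE) for _ in FEED_LOCATIONS]

print(by_feed_url)
print(by_channel_link)
```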
I'm with Brent on this one: #1.
In Python:
>>> import urlparse
>>> urlparse.urljoin('example.com','/bar/')
'example.com'
For more information, see www.ietf.org
Appendix C. Page 29.
Re: "What if there's no xml:base? What's the fallback position?", if the definition of xml:base is included within a document (the other choice is within an external entity, which is NOT what is happening here), then the right answer is the base URI of the document itself (i.e., #2).
See the first sentence of www.w3.org
Sam: Good example, thanks. Let's expand it:
>>> urlparse.urljoin('example.com','/bar')
'example.com'
>>> urlparse.urljoin('example.com','/bar')
'example.com'
>>> urlparse.urljoin('example.com','bar')
'example.com'
>>> urlparse.urljoin('example.com','bar')
'example.com'
When 'example.com' is the desired result -- as in the example I gave in table 2 above -- it would seem that the xml:base must have a trailing slash, and the relative URL must not have a preceding slash.
Is that correct, and should it be stated explicitly?
Brent suggests that the top-level link (/rss/channel/link) should be the fallback, when there is no xml:base. (This explains why NNW appeared to honor xml:base in the example above; in fact, it was using /rss/channel/link which happened to be the same.)
Sam refers to this sentence from the xml:base spec: "The attribute xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity." Is this a question of precedence? Should it be stated explicitly that when both xml:base and /rss/channel/link are present, the former can override?
PS: I'm delighted to be able to retitle these entries. Useful! Now, how do I do <s>strikethrough</s> here?
My recommendation would be to either follow the xml:base specification or invent a new attribute. The danger is that someday somebody implements an XML parser with "proper" xml:base support, i.e. by providing a means to evaluate xml:base at every node in the tree. If such were done, having a grammar-specific element override the xml:base spec would make things more difficult.
Social factors to consider: relative links are uncommon. The validator today does what it can to discourage relative links everywhere. Few aggregators support relative links. Those that do, do so inconsistently.
Try urlparse.urljoin('example.com','bar')
Given this "must have a trailing slash" is misleading.
> Try urlparse.urljoin('example.com','bar')
> Given this "must have a trailing slash" is misleading.
Heh. OK, what's a concise, user-friendly way to express this matrix:
xml:base relative URI result
a.com /b a.com
a.com a.com
a.com b a.com
a.com b http:
a.com /c a.com
a.com /c a.com
a.com c a.com
a.com c a.com
a.com /c a.com
a.com c a.com
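One way to express such a matrix without typing it by hand is to generate it. The sketch below assumes Python 3, where urlparse.urljoin lives in urllib.parse. The bases and relative references are hypothetical values chosen to cover the cases under discussion, not the ones from the table above.

```python
from urllib.parse import urljoin

# Hypothetical bases and relative references covering the cases in this thread.
BASES = ["http://a.com/b", "http://a.com/b/", "http://a.com/b/index.html"]
RELATIVES = ["c", "/c"]


def resolution_matrix(bases, relatives):
    """Return (base, relative, result) rows as resolved by urljoin."""
    return [(b, r, urljoin(b, r)) for b in bases for r in relatives]


for base, rel, result in resolution_matrix(BASES, RELATIVES):
    print(f"{base:28} {rel:4} {result}")
```

In these rows a leading slash on the relative reference always resolves from the host root. Without it, the last path segment of the base is replaced, so only a base whose path ends in a slash keeps its full path.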
A year ago I proposed a <base/> tag for RSS 0.94/2.0. It's nice to see that someone finally acknowledges the issue.
E.
"My recommendation would be to either follow the xml:base specification or invent a new attribute. The danger is that someday somebody implements an XML parser with "proper" xml:base support, i.e. by providing a means to evaluate xml:base at every node in the tree. If such were done, having a grammar specific element override the xml:base spec would make things more difficult."
Hmm. Now I'm back to Brent's point. Why isn't /rss/channel/link sufficient?
Yes, relative links are infrequently used and (properly) discouraged now. But it's a pain to fully qualify my URLs, and the ones for which I did so, back in the radio.weblogs.com/0100887 days, are now semi-bogus. So I'd like to be able to use relative URLs with confidence.
Why not say "OK, so long as you supply /rss/channel/link and construe it as your base"?
If in most cases xml:base would be the same -- as it was in the example I cooked up -- then the benefit of using xml:base devolves to the minority of cases where the two must differ. What is an example of such a case?
Why isn't /rss/channel/link sufficient?
I have no idea why not. Are there any cases where xml:base would be different?
As an aggregator developer who's looked at thousands of feeds, it hasn't come up before. Using /rss/channel/link is what people seem to naturally expect.
FWIW, I happen to hang around with Apache web server developers. The people I hang around with seem to naturally expect something quite different than the people that you do. Of course, the people you hang around with are more representative of your target market.
Be that as it may, doesn't the argument that the default for xml:base will almost always be the same as /rss/channel/link significantly weaken the argument that one should deviate from the documented specification for xml:base?
For a concrete example where the document URL, /rss/channel/link, and the common root of the item URLs differ:
www.25hoursaday.com
NOTE: Someone with editor privileges responded to my original post that was in this entry by *editing* this post instead of making a new post. I'm sure this was accidental. In what follows, the comments preceded by the > signs are actually mine, and the response is someone else's. The bulk of my original post is gone.
> Has anyone *ever* actually *wanted* a base different than their /rss/channel/link?
Evidently Dare Obasanjo does exactly that.
> Please don't strive to make a messy spec > that can handle *Every* *Single* *Theoretical* *Possibility*.
Point taken. At the moment, it seems that /rss/channel/link could be a reasonable default, and that xml:base could be a rarely-needed override.
But I'm still listening and learning.
There are many target audiences for RSS feeds. And not all of them are going to be human beings running browsers. So there may yet be important examples that have not come to light.
> www.25hoursaday.com
Excellent example, thanks. How's this for a (condensed) FAQ?
Why would I want to use relative URIs?
1. less typing
2. portability
What's my base address?
1. /rss/channel/link, if suitable and no xml:base given *
2. xml:base if the base must differ from /rss/channel/link
How do these bases combine with relative URIs?
... big nasty table of examples ...
* On the theory that, in the absence of xml:base, an xml:base-related preference for using the feed's URL rather than /rss/channel/link has no standing.
There's something wrong. Somebody's response to my post is appearing under my name, and my original comments are missing.
For a concrete example where the document URL, /rss/channel/link, and the common root of the item URLs differ:
www.25hoursaday.com
I don't see the point you're trying to make.
The document URL is immaterial (I assume you mean the URL of the feed document). The feed is in one location, the site the feed is generated from is in another, sure, but so what? Who says there even has to *be* a site that the feed is generated from?
All the individual item links point to a single site, kuro5hin.org, but so what? Just because a user only links to a single site (which is extremely rare), he should get a special exception to accommodate this? If I set up a blog to comment on New York Times stories and only provide links directly to the stories, I'm still going to need an absolute URL for each one of those since my site isn't nytimes.com. Sure, I could set a BASE target in my HTML page, but this seems like an abuse, as I'd then have to provide absolute URLs instead of relative ones for all the links to my own site (archive, navigation etc.), which is counterproductive at best. Even still, in this case, xml:base is not needed. If we look at the /rss/channel/link for the feed (www.kuro5hin.org), any relative URLs (e.g. /story/2003/8/4/0430/97775) would still work if they begin with a slash.
Why would I want to use relative URIs?
1. less typing
2. portability
That's exactly the right question to ask. Portability is the most common answer if you ask blog authors. Plus, you still get the benefit of less typing for all of your internally pointing links.
* On the theory that, in the absence of xml:base, an xml:base-related preference for using the feed's URL rather than /rss/channel/link has no standing.
Bingo. The location of the feed is irrelevant. It could be (and often is) generated by a third party. The /rss/channel/link is the only thing that makes sense in 99.9999% of the cases.
Another thing... I just thought of this as I got into bed and had to jump back up before I forgot it.
For portability, relative URLs should only be used to point to stuff the author controls. Even if it's on the same server, if you don't own it, you should use absolute URLs. Dare is pointing to kuro5hin stories from his kuro5hin journal, but if he ever moves his journal to his own server, the original stories will still be on kuro5hin.org. If I have a weblog at example.com and I want to comment on an entry in the weblog at example.com I should use an absolute link, not /~bar/entry.html.
Also, the behavior of urlparse.urljoin is what should be expected. I may have sub-sites... eg, example.org is my main site, but I have two subsites, each with their own feeds, at example.org and example.org I want to point to a boat entry from a car entry, so I would use a relative url like href="/boat/entry1.html". If I just want to point to another car entry from a car entry, I could just use href="entry2.html". Both are then portable when I move my entire site and all subsites to foobar.com.
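The sub-site scenario can be checked with a few lines of Python 3. This is only a sketch: the /car/ and /boat/ paths are an assumed layout for the two sub-sites, and resolve_all is a made-up helper name.

```python
from urllib.parse import urljoin

# Assumed layout: a "car" sub-site and a "boat" sub-site under one host.
LINKS_IN_CAR_ENTRY = ["/boat/entry1.html", "entry2.html"]


def resolve_all(base, links):
    """Resolve each link against the given base URL."""
    return [urljoin(base, link) for link in links]


# The same two links, resolved before and after the whole site moves hosts.
before_move = resolve_all("http://example.org/car/", LINKS_IN_CAR_ENTRY)
after_move = resolve_all("http://foobar.com/car/", LINKS_IN_CAR_ENTRY)

print(before_move)
print(after_move)
```

Neither link has to be edited for the move; only the base changes.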
It seems that at a basic level, we're all really in agreement. Just use /rss/channel/link, and for the once in a blue moon that you actually need to have the base differ from that, use xml:base, but otherwise just leave it out.
I think xml:base should be the recommended approach if relative links are used.
/rss/channel/link might be functionally equivalent, but I don't think it should be recommended as it's less in line with XML standards.
Another justification for including one of these is that the transport might not be http - RSS data may be sent over email/IM/whatever.
Leaving aside the questions as to what the xml:base spec and the relevant RFCs for evaluation of relative URLs mean, whether or not they should be followed (and if not which portions should be), and how best to explain them in the context of an RSS 2.0 spec...
Which elements (and attributes) should be interpreted as potentially containing relative URLs? Clearly elements like pubdate and category should not be interpreted this way.
Let me propose two interesting test cases:
/rss/channel/link
/rss/channel/item/content:encoded
/r/c/l is interesting in that allowing this to be relative permits an entire site to be relocated in the normal case where the RSS feed resides on the same host as the items. It is also the proposed default base for evaluating relative URLs...
/r/c/i/c:e raises the interesting question: where should this be documented?
Which elements (and attributes) should be interpreted as potentially containing relative URLs? Clearly elements like pubdate and category should not be interpreted this way.
Good question. My first instinct is that /rss/channel/link should not have relative URLs. I still don't think that having the feed residing on the same host as the items is the "normal" case (though it is common). The aggregator should *only* care about the location of the feed for the purpose of *getting* the feed. Once it has it, how it got it should not matter.
I'd say that /rss/channel/item/link should also not contain a relative URL, though I don't have a good reason for this other than aesthetics. Even though the /r/c/i/l is often used as a permalink, the fact that feeds are re-generated regularly and the fact that /r/c/i/l is most often auto-generated by the RSS renderer and not manually input by the author makes me think it's best to avoid any chance of ambiguity and just use an absolute URL. If the site moves, the RSS renderer will change the /r/c/l, and can auto-update the /r/c/i/l as well.
By this reasoning, I'm tempted to say that even in /rss/channel/item/content:encoded only absolute URLs should be allowed, and it should be up to the renderer to detect relative URLs in the input and convert them to absolute URLs before outputting the feed.
Portability is a big concern for me, but I'm not worried about portability of the RSS feed itself, as it only lives until my next post, when it is regenerated. I only want to use relative URLs in the entries.
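A rough sketch of what such a renderer step might look like in Python 3. The function name absolutize and the sample entry are hypothetical, and the regular expression only handles double-quoted href and src attributes; a real renderer would more likely use an HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches double-quoted href/src attributes only; deliberately simplistic.
_URL_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def absolutize(html_fragment, base):
    """Rewrite relative href/src values in html_fragment against base."""
    def fix(match):
        attr, url = match.group(1), match.group(2)
        # urljoin leaves URLs that are already absolute unchanged.
        return f'{attr}="{urljoin(base, url)}"'
    return _URL_ATTR.sub(fix, html_fragment)


# Hypothetical entry body with one relative and one absolute link.
entry = (
    '<a href="2003/08/06.html#a768">earlier</a> and '
    '<a href="http://other.example/x">elsewhere</a>'
)
print(absolutize(entry, "http://example.org/blog/"))
```

The author keeps typing relative URLs in entries, and the feed only ever carries absolute ones.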
Oops, my bad. Sorry Paul. I'm only just now learning to use this system.
Danny: "/rss/channel/link might be functionally equivalent, but I don't think it should be recommended as it's less in line with XML standards."
I'm a user. My feed contains /r/c/l/. I have the option to regard it as a base URL, and simplify my life in several ways, changing nothing in my feed to achieve those benefits.
An alternate approach requires:
- that I modify my feed to use xml:base
- that my reader provide UI for doing so
A user would want to do this to be in line with XML standards? Doubtful.
Danny: "Another justification for including one of these is that the transport might not be http - RSS data may be sent over email/IM/whatever."
You lost me there. I don't see how the interpretation of /r/c/l or xml:base in a payload is related to the payload's transport.
OK, Jon. You are a user. So, let's take your feed as an example. Follow along with me in Python, if you like:
urlparse.urljoin('weblog.infoworld.com','2003/08/06.html#a768')
Now try that again, with the value of your r/c/l:
urlparse.urljoin('weblog.infoworld.com','2003/08/06.html#a768')
Do you see the difference? (Hint: there isn't one).
Note that you (correctly) put the trailing '/' in your link URL. Try it again without the '/' (a common user error). See the difference?
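For readers on Python 3, where urlparse.urljoin lives in urllib.parse, the trailing-slash experiment might look like the sketch below. The base is the channel link from Figure 1, with and without its final slash; whether that matches the exact values used in the calls above is an assumption.

```python
from urllib.parse import urljoin

RELATIVE = "2003/08/06.html#a768"

# Channel link from Figure 1, and the same URL with the final slash dropped.
WITH_SLASH = "http://weblog.infoworld.com/udell/"
WITHOUT_SLASH = "http://weblog.infoworld.com/udell"

with_slash = urljoin(WITH_SLASH, RELATIVE)
without_slash = urljoin(WITHOUT_SLASH, RELATIVE)

print(with_slash)
print(without_slash)
```

Without the slash, "udell" is treated as the last path segment and is replaced rather than extended.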
= = = =
My suggestion, and simply take it for what it is worth: there was a lot of thought put into xml:base and how to resolve relative URLs. If this works for you, use it. If it doesn't work for you, my only suggestion is to resist the temptation to incorrectly apply the existing standard, and instead simply create a standard that better suits your needs.
Sam: "Which elements (and attributes) should be interpreted as potentially containing relative URLs?"
Right. I noted this came up over here (diveintomark.org) as well:
"Using xml:base to handle relative links. But what should it apply to? All URLs (even the <link> element and so forth)? URLs in content only? What about escaped content? Discuss in www.intertwingly.net"
Good questions. I don't see a lot of discussion on the wiki, though, maybe it's in the change logs somewhere?
I would like to hear from Brent on this point, and also from other aggregator writers. For example, NNW already resolves to /r/c/l. Does it enumerate the possible content containers ( <description>, <content:encoded>, <xhtml:body>) as candidates for the treatment? And what do SharpReader and others do?
Enumeration seems problematic, since content:encoded and xhtml:body only appeared recently, and other content elements may follow.
Straw man: resolve anything relative to /r/c/l (or xml:base). In XML: MUST. In escaped content: SHOULD.
A user would have to go out of his/her way to form a relative URI in non-content. In which case:
User: "I made my <link> URL relative, and it hurt when I did that."
Best-practices document. "Don't do that."
A user would have to go out of his/her way to form a relative URI in non-content. In which case:
Yep. A lot of this discussion seems to subconsciously assume that some human is going to manually type in a bunch of relative URLs in fields other than "description" and/or "content:encoded". A tool is going to generate the feed (of course, there will always be a few wackos who make their feed by hand in notepad). While I believe the spec should be simple enough to write a feed by hand, I also believe that anyone who chooses to do so on an ongoing basis deserves what they get. Mucking up the spec to save one or two wackos a few keystrokes is a bad idea. The only place there should be relative URLs is in the content.
NetNewsWire resolves relative links for <link> items and for URLs in items -- including any alternate descriptions like <content:encoded>.
At this point, my thinking is that bringing xml:base into this means I have to write more code for no practical benefit. It seems much clearer and cleaner if the standard is always to resolve relative to /r/c/l.
"Note that you (correctly) put the trailing '/' in your link URL. Try it again without the '/' (a common user error). See the difference?"
I'm assuming that weblog software is going to provide the base, and that it can do so correctly.
Where the user is most likely to err, it seems to me, is in typing the relative URI, which may or may not be preceded by '/' and will give different results depending.
"There was a lot of thought put into xml:base and how to resolve relative URLs."
How does xml:base help the user to decide whether to type '2003/08/06.html#a768' or '/2003/08/06.html#a768' ?
Oops, my bad. Sorry Paul. I'm only just now learning to use this system.
No problem, you actually kept only the important parts of my post anyway, so you really did everyone else a favor by deleting a bunch of crap they didn't need to read. :)
As a user, what do you care about what markup your feed contains?
I'm a user. I want applications that work and aren't too expensive. Expensive starts when work is duplicated, broken starts when variation is introduced without good reason. As a developer I'm aware that standards can help prevent both of these.
In any case, just because your feed contains X, Y, Z doesn't make it the best solution. If broader interop is possible with virtually no effort, surely it's perverse to insist on a non-standard approach?
On the second point, if all the feed contains is relative links, and its transport doesn't anchor the references, they are useless.
"As a user, what do you care about what markup your feed contains?"
I don't. I care about benefits (less typing, portability) and costs (looking like an idiot if I try to write relative URIs and screw them up).
"If broader interop is possible with virtually no effort, surely it's perverse to insist on a non-standard approach?"
I'm not insisting on anything. I'm exploring an issue. However, you remind me to ask: What experience has there been with other xml:base applications? Can you point to some? Are there users of them whose experiences we can tap?
My major takeaway in all this, FWIW, is not about xml:base vs /r/c/l at all. It's about how it will ever be possible, in either scenario, for the user to be able to write relative URIs with confidence. This is not an RSS issue at all, it's a general issue of hypertext authoring, which although some of us have been doing it for years, is far from second nature to most people. In fact, although I've been writing HTML by hand for almost a decade, I myself still make relative URI errors from time to time.
We can make the tools responsible for protecting the user from screwups, but history shows that if you rely on tools, and if the fallback manual technique is not widely comprehensible, there's going to be a problem.
The weblog is even more complicated than straight Web publishing, because of the different representations and contexts: the environment of the published blog, the environment of the RSS reader.
Some problems simply do not have clean solutions. It's occurring to me that this might be one of them.
Trouble is a GREAT many sites don't use a channel link that ends in a trailing slash. Many sites are based on parameters (ecademy for one) and as such won't be suitable for concatenation.
"Trouble is a GREAT many sites don't use a channel link that ends in a trailing slash. Many sites are based on parameters (ecademy for one) and as such won't be suitable for concatenation."
Good data point, thanks Bill.
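A quick Python 3 sketch of what happens with a parameter-driven channel link. The URL below is hypothetical and is not ecademy's actual scheme.

```python
from urllib.parse import urljoin

# Hypothetical parameter-driven site; the query string carries the identity.
PARAM_BASE = "http://example.com/blog.php?user=42"
RELATIVE = "entry7.html"

naive = PARAM_BASE + RELATIVE             # plain string concatenation
resolved = urljoin(PARAM_BASE, RELATIVE)  # standard relative resolution

print(naive)
print(resolved)
```

Concatenation glues the relative reference onto the query string, and standard resolution drops both the script name and the query, so neither result is likely to be what the author meant.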
At this point I can see arguments in favor of all four of these approaches:
1. /r/c/l only
2. xml:base only
3. /r/c/l then xml:base
4. xml:base then /r/c/l
Question for xml:base advocates: Obviously you'd like 2. Would you buy into 3 or 4 as well, on the grounds that current RSS users are more likely to be familiar with /r/c/l than xml:base?
Another question for xml:base aficionados: Can we explore some cases where it's in use now, to see how it's working?
Questions for everybody: without regard to what method is used to specify the base,
- how should the algorithm for forming the base be presented to users?
- how should the algorithm for forming the relative URI be presented to users?
- what recommendation should be given to writers of authoring software w/respect to helping users form correct bases and relative URIs?