I recently came across the PostCon format (an RDF-based format) in a document describing an article on monsters. Take a look at it: that’s a lot of metadata! It got me thinking: how much metadata should we store on a given article?
The Finetto XML format is very small, but also in its early unsettled days (Finetto is the content management system I use and build). The elements are:
- ID – a unique ID for each article, derived from the time it was written,
- Title – the title of the item, not necessarily unique,
- Date – the date the article was created. Storing this separately is a throwback to when I didn’t understand how to use event-driven parsers properly, and it has always annoyed me,
- Description – A short description of the article, entered manually,
- Author – the name of the person who wrote the article. This is filled in automatically (taken from the user’s log-in), but can be entered manually,
- Content – The content itself as a chunk of XHTML.
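To make the list concrete, here is a minimal Python sketch of what one article record could look like and how it might be read. The element names, the sample values, and the `YYYYMMDDHHMMSS` layout of the ID are all assumptions for illustration, not Finetto’s actual schema; `parse_article` and `date_from_id` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical record: tag names and the ID layout are assumed, not Finetto's real format.
SAMPLE = """\
<article>
  <id>20040312153000</id>
  <title>How much metadata is enough?</title>
  <date>2004-03-12</date>
  <description>Musing on metadata bloat.</description>
  <author>Example Author</author>
  <content><p>Hello, <em>world</em>.</p></content>
</article>
"""


def parse_article(xml_text: str) -> dict:
    """Read the six elements of one article into a plain dict."""
    root = ET.fromstring(xml_text)
    article = {
        tag: root.findtext(tag, default="")
        for tag in ("id", "title", "date", "description", "author")
    }
    # Content is a chunk of XHTML, so re-serialise its children rather than
    # taking only the text.
    content = root.find("content")
    if content is None:
        article["content"] = ""
    else:
        article["content"] = (content.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in content
        )
    return article


def date_from_id(article_id: str) -> datetime:
    """Recover the creation time from the ID, assuming a YYYYMMDDHHMMSS layout."""
    return datetime.strptime(article_id, "%Y%m%d%H%M%S")


article = parse_article(SAMPLE)
created = date_from_id(article["id"])
```

If the ID really does encode the creation time like this, the separate Date element carries nothing the ID doesn’t already hold.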
Now, compared to PostCon, that is tiny. But there are times when I wish I had stored category information, or used an RSS-like format, or even scrapped the date (it can be taken from the ID). The question is: should we attempt to store every piece of information that might be useful down the line? I’m not convinced either way.
On one side, you’ve got the benefit that if you ever need to know anything about the document, it’s right there: no need to infer it from other sources (the web page that the article appears on, for instance). On the other side, you have a tremendous amount of bloat if the data is never used. If a post is small, the metadata outweighs the data, which strikes me as horribly wrong.
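One rough way to put a number on that bloat is to compare the text held in the metadata elements with the text held in the content. The sketch below is only illustrative: the tag names and the sample post are assumptions, `metadata_ratio` is a hypothetical helper, and it counts text bytes only, ignoring the tags themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical short post; the tag names are assumed for illustration.
SHORT_POST = (
    "<article>"
    "<id>20040312153000</id>"
    "<title>Hi</title>"
    "<author>Example Author</author>"
    "<content><p>Ok.</p></content>"
    "</article>"
)


def metadata_ratio(xml_text: str, content_tag: str = "content") -> float:
    """Return metadata text bytes divided by content text bytes."""
    root = ET.fromstring(xml_text)
    metadata_bytes = 0
    content_bytes = 0
    for child in root:
        # Count only the text inside each element, not the markup around it.
        size = len("".join(child.itertext()).encode("utf-8"))
        if child.tag == content_tag:
            content_bytes += size
        else:
            metadata_bytes += size
    if content_bytes == 0:
        return float("inf")
    return metadata_bytes / content_bytes


ratio = metadata_ratio(SHORT_POST)
```

A ratio above 1 means the post carries more metadata than content, which is exactly the case that bothers me.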
When I can get a clear path to backwards-compatibility, I’ll seriously look at getting a lot more metadata into my format. For now, I’ll just muse over how much is enough.