April 18, 2003


Recently, someone asked me what semantic markup is. Semantic markup is the way all markup should be. A page can validate as XHTML (using the W3C validator) without really being proper XHTML.

To understand what I mean, you have to understand what markup really is. I lack a complete historical knowledge of the subject, but I know the rough details (and this isn’t a history lesson, just a brief overview). People wanted the information they stored to carry meaning, so that anyone could tell what kind of data was being stored just by looking at it. The standard that eventually emerged was SGML.

Created in 1974 by Dr Charles F. Goldfarb, SGML was originally conceived to allow data to be used across different platforms and by different applications. By making the way text is marked up part of the text itself, it became easy to write something once on one computer and then re-use it on many other computers.

Tim Berners-Lee, generally considered to be the father of the web, created HTML from his loose understanding of SGML. As most people know, HTML is what a webpage is made of. Very simply, it’s just the information you want, wrapped in tags. For example, if you want a paragraph of text you use the P tag. If you want to show an image, you use the IMG tag.

The biggest problem with HTML is that it lacked two of the main properties of other markup languages: well-formedness and real meaning. Almost every markup language requires that tags be closed and follow some sort of case rule. This is called well-formedness (as in, the tags are complete), and sadly HTML did not require it. Then there is the issue of meaning: every tag has a precise meaning (as mentioned before), but people actually using HTML just reached for any old tag that got the work done, regardless of what it said about their content. The line-break tag (BR) and the table tags were the most abused: people would (and still do) shove entire essays into a single paragraph and separate the real paragraphs with inappropriate line-breaks. They would also lay out everything on their website with tables and never know it was wrong. Tabular data, such as scientific results, belongs in tables. Essays and navigation lists generally don’t.
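
To make the well-formedness point concrete, here is a rough sketch in Python (the language is just my choice for illustration, and the helper name `is_well_formed` is made up). It feeds two versions of the same content to the standard library's XML parser, which refuses anything with unclosed or mismatched tags.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML: every tag closed and properly nested."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Typical old-style HTML: P tags never closed, inconsistent case.
sloppy = "<body><P>First paragraph<p>Second paragraph</body>"
# The same content with every tag closed and consistently lower-cased.
tidy = "<body><p>First paragraph</p><p>Second paragraph</p></body>"

print(is_well_formed(sloppy))  # False
print(is_well_formed(tidy))    # True
```

A browser would happily display both versions, which is exactly why so much sloppy HTML went unnoticed.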

The markup purists balked. What to do about this bastard child of meaningful markup and presentation? The masses liked it, so getting rid of it wasn’t an option. Or was it?

In 1996, Tim Bray and a few others began deriving XML from SGML, and XML 1.0 became a W3C Recommendation in 1998. Much like its parent language, it’s a general way of taking information and giving it structure. It’s just a fair bit simpler (and more aimed towards web development) than SGML. It’s been used to structure everything from mathematical documents to porn metadata (such as the title and star of the film).

Every article on this website exists as an XML document. I can access that one file and turn it into any format I wish: from the markup that you see on these pages, to the RSS feeds that I provide. In short, XML is very powerful, meaningful and widely-used.
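
As a rough illustration of the one-source, many-formats idea, here is a small Python sketch. The article format below is hypothetical: the element names `article`, `title` and `para`, and the functions `to_xhtml` and `to_rss_item`, are invented for this example and are not the actual format or code behind this site.

```python
import xml.etree.ElementTree as ET

# Hypothetical article format; the element names are invented for this sketch.
ARTICLE = """<article>
  <title>What is semantic markup?</title>
  <para>Semantic markup is the way all markup should be.</para>
  <para>Tabular data belongs in tables.</para>
</article>"""

def to_xhtml(article: ET.Element) -> str:
    """Render the article as an XHTML fragment: a heading plus one P per paragraph."""
    body = ET.Element("div")
    ET.SubElement(body, "h1").text = article.findtext("title")
    for para in article.findall("para"):
        ET.SubElement(body, "p").text = para.text
    return ET.tostring(body, encoding="unicode")

def to_rss_item(article: ET.Element) -> str:
    """Render the same article as an RSS item, using the first paragraph as the description."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = article.findtext("title")
    ET.SubElement(item, "description").text = article.findtext("para")
    return ET.tostring(item, encoding="unicode")

article = ET.fromstring(ARTICLE)
print(to_xhtml(article))
print(to_rss_item(article))
```

Because the source only says what each piece of text is, each output format can decide for itself how to present it.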

At some point, someone at the World Wide Web Consortium (the people in charge of updating the HTML specification amongst other things) realised that HTML needed to be more like XML. And so, a new language was born: XHTML.

Presented as the future of HTML, the new XHTML had to incorporate the well-formedness and meaning of XML, as well as its extensibility (the X at the beginning of XHTML comes from extensibility), among other features. Finally, a web markup language capable of being meaningful, useful and platform-independent.

So what did the masses of web developers do? Those of us who learned of XHTML and understood the real point behind it, rather than seeing it as HTML+1, used it properly. We uncoupled the way our web pages looked from the data itself, and made our words richer by using the tags at our disposal rather than fudging everything together with tables, line-breaks and other presentational markup (B and U tags – I’m looking at you). And that is what we call Semantic Markup: using the tags properly.

Now, the XHTML specification is still open to abuse: people write markup that will validate and think it’s good XHTML. It isn’t necessarily. The validator only checks syntax: that the page is well-formed and uses the tags the specification allows. It can’t look at a page and see whether it’s using the language in the spirit that was intended: to provide meaning for content.
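
Here is a small Python sketch of that gap. It uses the standard library's XML parser as a stand-in for a syntax check (it is not the W3C validator), and the function name `paragraph_count` is made up for the example. Both fragments parse without complaint, but only one of them tells a program where the paragraphs are.

```python
import xml.etree.ElementTree as ET

# Two fragments; both are well-formed, but only one says what the content is.
presentational = "<div>First paragraph.<br /><br />Second paragraph.</div>"
semantic = "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"

def paragraph_count(markup: str) -> int:
    """Count the paragraphs a program can actually find in the markup."""
    return len(ET.fromstring(markup).findall(".//p"))

print(paragraph_count(presentational))  # 0
print(paragraph_count(semantic))        # 2
```

To a human reader in a visual browser the two look much the same. To a screen-reader, a search engine or a conversion script, the first is one undivided lump of text.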

Why use semantic markup?

  • It’s well-formed, so it can be understood by pretty much any browser or parser on any device. That means more people can view and understand what you write. This site works in any browser: from Mosaic (which is 10 years old today), to Internet Explorer, to Mozilla, to any number of non-visual browsers (screen-readers used by blind people, for example). That is the power of well-formed markup.
  • It’s meaningful. Every word I write has implied meaning via the markup. This gives it both hidden and viewable depth. For example, if you use a visual browser, you might have noticed words throughout this article that have a dashed underline (and because my markup is separate from the presentation, I can change how they look). Those are acronyms. Because they are marked up with the ACRONYM tag, you can hover your mouse over them to find out what they stand for.
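
The same meaning is available to programs, not just to mouse pointers. As a rough Python sketch (the sample sentence and the function name `glossary` are invented for illustration), a script can pull every acronym and its expansion out of a page without guessing:

```python
import xml.etree.ElementTree as ET

# An invented sample sentence marked up with ACRONYM tags.
fragment = (
    '<p><acronym title="Standard Generalized Markup Language">SGML</acronym> '
    'is the parent of <acronym title="Extensible Markup Language">XML</acronym>.</p>'
)

def glossary(markup: str) -> dict:
    """Map each acronym in the markup to the expansion stored in its title attribute."""
    root = ET.fromstring(markup)
    return {a.text: a.get("title") for a in root.iter("acronym")}

for short, full in glossary(fragment).items():
    print(short, "=", full)
# SGML = Standard Generalized Markup Language
# XML = Extensible Markup Language
```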

And those are the reasons that every web developer who is even remotely competent should be using semantic markup.