This weekend much of the geekosphere was buzzing about the "Web 3.0" article in the NY Times, but from where I stand, Web 3.0 does not validate.
Apparently, Web 3.0 is the latest re-branding of the Semantic Web, an attempt to turn the Web of documents into a Web of data. Don't get me wrong - the goals of the Semantic Web are good ones, and I believe many of those goals will be met in my lifetime. But too much of the Semantic Web relies on data being valid - that is, valid XML, XHTML, RDF, etc. - and too many of us will never publish valid data.
Unless the world comes up with a way to punish those who publish invalid data, invalid data will always exist. Yeah, companies like Google could be the punishers by refusing to index data that isn't valid, but what are the chances of that happening? Google's Web search is successful in part because it makes sense of the chaos of the invalid Web. Why mess with that formula?
If the Semantic Web hopes to exist, it's going to have to deal with invalid HTML, badly formed XML, and RSS feeds with sloppy entity escaping. It's also going to have to filter out every new variation of spam, and be smart enough to know when people lie.
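To give a feel for what "dealing with it" looks like in practice, here's a rough Python sketch using the feedparser library (the feed URL is made up): rather than rejecting a feed that isn't well-formed, it flags the problem and salvages whatever it can.

```python
import feedparser

FEED_URL = "http://example.com/feed.xml"  # hypothetical feed

# feedparser is deliberately liberal: it still parses feeds that are
# not well-formed XML, and sets the "bozo" flag instead of giving up.
result = feedparser.parse(FEED_URL)

if result.bozo:
    # The feed was invalid in some way; note it and keep going.
    print("Feed is not well-formed:", result.bozo_exception)

for entry in result.entries:
    # Fall back to safe defaults for anything the broken feed omitted.
    title = entry.get("title", "(no title)")
    link = entry.get("link", "")
    print(title, link)
```

That's the formula aggregators already live by: be liberal in what you accept, because the feeds won't meet you halfway.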
The Semantic Web may happen, but if it does, it's going to be a helluva lot messier than the architects would like.
You are totally missing the point here.. First, Web 3.0 is just another buzzword, invented not by tech people but by sales/marketing people..
Second, right now we're stuck in an era where loads of people are still writing relatively low-level data..
Stuff like RDF will be generated from existing data sources, and most of the people who publish it will likely never have to worry about the syntax, in the same way people publish .mp3s, .docs, .pdfs or whatever..
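Just to sketch what I mean, here's a rough Python/rdflib example with made-up table fields.. the author never touches the RDF syntax by hand, the library serializes it:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

# A row as it might come out of an existing database (hypothetical fields).
row = {"id": 42, "name": "Evert", "homepage": "http://example.com/evert"}

EX = Namespace("http://example.com/people/")

g = Graph()
person = URIRef(EX[str(row["id"])])
g.add((person, FOAF.name, Literal(row["name"])))
g.add((person, FOAF.homepage, URIRef(row["homepage"])))

# The serializer, not the publisher, worries about getting the syntax right.
print(g.serialize(format="turtle"))
```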
You can also compare it with TCP/IP/HTTP.. protocols we use every day and never have to worry about. Of course there will always be buggy generators, but what's the point of writing one if it can't interface with proper parsers?
Sadly, incompatibilities happen.. like with SOAP, but RDF is stable and well-defined right now.. We should not worry about the amateur writing shitty markup, but about the big vendors that have the actual power to turn an incompatibility into a semi-standard.
Posted by: Evert | Sunday, November 12, 2006 at 11:23 PM
Evert, what you're suggesting is that future tools will generate valid syntax, yet past experience has proved this wrong (people said that about HTML, and then XML). What is really needed is more tools that can read invalid syntax.
TCP/IP works because geeks such as myself have agreed to try to follow the rules. But once a technology grows beyond the geekosphere, it's unreasonable to assume that it can remain syntactically valid.
Posted by: Nick Bradbury | Sunday, November 12, 2006 at 11:47 PM
I missed the hype, which is what I'm sure it is. I thought that was what Web 2.0 was all about.
I think that systems, and users on systems, that produce broken code will be increasingly invisible, or rather unfindable and unsearchable. I have a feeling the usefulness of the semantic web will aid its own proliferation, and you just won't hear about those who don't conform unless they shout and spend lots of cash getting themselves noticed.
Posted by: Chris Hayes | Monday, November 13, 2006 at 07:28 AM
STOP, STOP THIS FUCKING WEB 2.0.1.2.3.2.4 BULLSHIT! JUST STOP! YOU'RE RUINING THE INTERNET!!!!
Posted by: yanni | Monday, November 13, 2006 at 09:27 AM
Your arguments are flawed.
As stated in an earlier comment, much of the published data is formatted automatically by tools, which eliminates much of the invalid markup.
You also equate invalid markup with not being able to determine whether a person is lying. There's no logical connection between the two, by any stretch. And no one is proposing that the semantic web would be a mind reader.
Posted by: Shelley | Monday, November 13, 2006 at 10:17 AM
Shelley, the fact that much of the data will be formatted by tools doesn't mean it will be valid. People made the same arguments about HTML, but for years we've dealt with web authoring tools that generate invalid markup. And of course, right now we're dealing with tons of invalid RSS feeds despite similar arguments that we could rely on tools to generate well-formed XML files.
Also, I wasn't trying to equate invalid markup with being able to determine whether a person is lying, but I can see how my imprecise writing could be interpreted that way. I just think that the Semantic Web assumes too much about the quality and reliability of data.
Posted by: Nick Bradbury | Monday, November 13, 2006 at 11:35 AM
Don't count on Google to validate anything. It's not in their genes.
Have you ever run a validator over the pages they produce??
Posted by: Mike Gale | Monday, November 13, 2006 at 03:46 PM
Ok, my turn -
re. "The Semantic Web may happen, but if it does, it's going to be a helluva lot messier than the architects would like."
I believe it's starting to happen, and it certainly is messy.
Evert got a key point in early - there's all this stuff already in databases, moved around by software. Why should expressing it in a slightly different fashion make it any the less reliable?
Posted by: Danny | Monday, November 13, 2006 at 05:54 PM
After having to trawl through thousands of feeds, dealing with all the 'intricacies' (too polite) to 'sanitise' them for presentation, I have to heartily agree.
Call the XML Police!! ;)
Posted by: Kosso | Monday, November 13, 2006 at 08:25 PM
Hi,
What an interesting report about Web 3.0.
The whole world is talking about Web 2.0 and Bubble 2.0, because nobody knows exactly what Web 2.0 actually means.
Amid this confusion, I read an article about Web 3.0 in a German newspaper a few days ago. It was very amusing.
Best wishes from Germany
Posted by: Bloggern | Tuesday, November 14, 2006 at 07:27 AM
Couldn't resist sharing this link, which I read just after Nick's post.
http://www.codinghorror.com/blog/archives/000723.html
Seems Nick has a point.
Posted by: moorpipe | Tuesday, November 14, 2006 at 05:56 PM
I've had this out with some semwebbers recently. There's an entire layer missing in the semantic web, which is reverse engineering structured information out of semi-structured and ill-formed nonsense. You're right - the semweb is making lab-level assumptions about data quality that fall apart within minutes in the field. It's a very GOFAI way to think. The winners here will be those who parse at any cost.
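Roughly, the missing layer looks something like this (a Python/BeautifulSoup sketch over invented tag soup, not anyone's actual product): scrape structure out of markup that no validator would ever accept.

```python
from bs4 import BeautifulSoup

# Tag soup: unquoted attributes, a bare ampersand, an unclosed <a> tag.
html = """
<div class=vcard>
  <span class=fn>Bill & Ted</span>
  <a href=http://example.com class=url>homepage
</div>
"""

# html.parser is forgiving: it builds a tree out of whatever it is given.
soup = BeautifulSoup(html, "html.parser")

for card in soup.find_all(class_="vcard"):
    name = card.find(class_="fn")
    url = card.find(class_="url")
    print(name.get_text(strip=True) if name else None,
          url.get("href") if url else None)
```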
Shelley: "Your arguments are flawed."
Really? Look around you. We're drowning in junk markup.
Danny: "there's all this stuff already in databases, moved around by software. Why should expressing it in a slightly different fashion make it any the less reliable?"
This doesn't make any sense - it already *is* unreliable. Data in databases is engineered or validated to be reliable, sure. Yet lots of the malformed junk on the web comes straight from a DB.
Posted by: Bill de hOra | Tuesday, November 28, 2006 at 08:25 AM