Dave Winer has been exploring a superb news resource: the depth and breadth of the New York Times’ data stream. The most traditional of news organizations is opening up, including its archives, in ways that could be truly revolutionary in the news business, and Dave is leading the way toward a new way of seeing a core part of our history and current knowledge.
As he inspires others to do some spelunking of their own, the result is that people outside the Times are doing crucial R&D for the world’s most important newspaper — figuring out what’s available in the story archive and current flow that, in many ways, represents a fundamental baseline for journalism about vital topics, and then figuring out how to make it more available, in smarter ways, to more people.
In this morning’s posting, Dave shows us a new outline view of the Times’ stories, part of what he’s calling “rivers of news” that take headline services to a deeper place. If you have a mobile phone that has a Web browser, load the New York Times river to see what this means. The outline is, in effect, a taxonomy of what’s happening in the world, and was inspired in large part by people at the newspaper who suggested a direction he might take.
More is coming, he says:
Further, in the process of exploring this, I’ve been shown the work of other developers who discovered the keywords on their own, and one in particular is very interesting. I’m hoping that these projects will come public so I can show them to you and tell you what I think they mean.
I don’t know where this will end up, but it’s important work. The Times is being incredibly smart, meanwhile. It’s leveraging the passion of technologists who care about news and journalism. In the end, the value will accrue not just to the paper but to everyone who cares about getting the news they need, when and where they need it.
Kudos all around on this one.
(Disclosure: I own a small amount of New York Times Co. stock, which is currently worth a lot less than I paid for it.)
on Oct 22nd, 2007 at 2:26 pm
Irony of ironies.
When RSS was being conceived in the summer of 2000, there were two basic camps. Rael Dornfest saw in it a way to make a true RDF query engine. Imagine: anybody could then build their own query engine to universally query blogs + news + anything RDF. (By *anybody*, I don’t mean *everybody*; it’s just the notion that if anybody can build their own JavaScript libraries, you’ll have a fierce Darwinian competition.)
But Dave wanted to make it simple. (“I find the activity towards ‘modularization’ to be dry and uninteresting.”) So, things like keywords (and a bunch of useful metadata out of NewsML) were left out, as an exercise for the developer. Hence the many different Atom extensions these days, none of which are wholly standard.
And, here’s the irony: the Times is not supplying the metadata in their RSS 2.0 feeds. They are putting it in the HTML, leaving Dave to scrape it from there! ha ha!
And, of course, Google and Topix already do this kind of clustering, across thousands of sources. They scrape, because news organizations do not make metadata-rich NewsML feeds available…
Not that pulling META tags out of an HTML document is hard. It’s just curious: if you wound the clock back 7 years and told a bunch of software engineers that the coolest news clustering applications today would still rely on HTML scraping, I don’t know whether they’d laugh or cry.
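To give a sense of how little code the scraping takes, here is a minimal sketch in Python using only the standard library. It assumes the keywords sit in a `<meta name="keywords" content="...">` tag as a comma-separated list; that tag name, and the helper names `extract_keywords` and `cluster_by_keyword`, are my own assumptions for illustration, not a description of what the Times pages or Dave's code actually use.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects comma-separated values from <meta name="keywords"> tags.

    The tag name "keywords" is an assumption; a real site may use a
    different name for its story metadata.
    """

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values may be None.
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords.extend(
                k.strip() for k in content.split(",") if k.strip()
            )


def extract_keywords(html):
    """Return the list of keywords found in one HTML document."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


def cluster_by_keyword(pages):
    """Group story URLs under each keyword they carry.

    pages: dict mapping a URL to that page's HTML text.
    Returns a dict mapping keyword -> list of URLs.
    """
    clusters = {}
    for url, html in pages.items():
        for keyword in extract_keywords(html):
            clusters.setdefault(keyword, []).append(url)
    return clusters
```

Feeding `cluster_by_keyword` a dictionary of already-fetched pages gives a crude keyword-to-stories outline. Fetching the pages, and coping with whatever the real markup looks like, is left out here.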