Annotated blog corpus to be released at WWE 2006
Intelliseek will be releasing a big corpus of spidered and annotated blog posts to attendees at the 3rd Annual Workshop on the Weblogging Ecosystem (held in conjunction with the WWW 2006 Conference in Edinburgh, Scotland):
The data release comprises a complete set of weblog posts for three weeks in July 2005 (on the order of 10M posts from 1M weblogs). This data set has been selected as it spans a period of time during which an event of global significance occurred, namely the London bombings.
The data set includes the full content of the posts plus mark-up. The marked-up fields include: date of posting, time of posting, author name, title of the post, weblog url, permalink, tags/categories, and outlinks classified by type – details may be found here.
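For anyone wondering what a record in a corpus like this might look like in code, here is a rough sketch of the fields listed above as a Python data class. The class names, field names, and the `posts_on` helper are all my own assumptions for illustration; the announcement doesn't say what the actual file format or outlink type labels are.

```python
from dataclasses import dataclass, field
from datetime import date, time
from typing import List


@dataclass
class Outlink:
    url: str
    # Hypothetical label: the announcement says outlinks are "classified by
    # type" but does not list the actual categories.
    link_type: str


@dataclass
class BlogPost:
    # One field per marked-up item named in the announcement;
    # the names themselves are assumptions.
    posted_date: date
    posted_time: time
    author: str
    title: str
    weblog_url: str
    permalink: str
    content: str
    tags: List[str] = field(default_factory=list)
    outlinks: List[Outlink] = field(default_factory=list)


def posts_on(posts: List[BlogPost], day: date) -> List[BlogPost]:
    """Return the posts whose posting date equals the given day."""
    return [p for p in posts if p.posted_date == day]
```

With records shaped like this, pulling out everything posted on a particular day (say, to look at reaction to a single event) would be a one-line filter over the list.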
Sounds like a great resource for researchers. I’m also amused (in a dark sort of way) by the datashare individual agreement they require people to sign. Essentially they admit that there’s no way they can get copyright clearance from all of the million or so bloggers they’ve collected, so they just ask everyone to agree to remove any posts if anyone complains, not to use the results for commercial purposes, and not to use the data past the workshop.