Hack day project idea(s), inspired by the data science session this morning. Look at a random sample of comments across and…

  • Classify their content (e.g. how they’re responding to the post).
  • Do a topical classification of post content and compare against comment word count or frequency.
  • Calculate diversity of commenters for a site as the ratio of unique email addresses to number of comments.
  • Build a network graph indicating correlation between commenters across different sites.
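
As a rough sketch of the last two ideas, assuming each comment can be reduced to a (site, commenter email) pair: diversity is the number of unique email addresses divided by the number of comments, and the weight of a network edge between two sites is the share of their combined commenters who appear on both. The sample data, the function names, and the choice of Jaccard overlap as the "correlation" measure are all my own assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample: one (site, commenter_email) pair per comment.
comments = [
    ("site-a", "ann@example.com"),
    ("site-a", "ann@example.com"),
    ("site-a", "bob@example.com"),
    ("site-a", "cat@example.com"),
    ("site-b", "bob@example.com"),
    ("site-b", "bob@example.com"),
    ("site-b", "dan@example.com"),
    ("site-c", "eve@example.com"),
]


def commenter_diversity(comments):
    """Return {site: unique commenter emails / total comments}."""
    totals = defaultdict(int)
    unique = defaultdict(set)
    for site, email in comments:
        totals[site] += 1
        # Normalize so "Ann@Example.com " and "ann@example.com" count once.
        unique[site].add(email.strip().lower())
    return {site: len(unique[site]) / totals[site] for site in totals}


def shared_commenter_edges(comments):
    """Return {(site_x, site_y): Jaccard overlap} for sites sharing commenters."""
    by_site = defaultdict(set)
    for site, email in comments:
        by_site[site].add(email.strip().lower())
    edges = {}
    for a, b in combinations(sorted(by_site), 2):
        shared = by_site[a] & by_site[b]
        if shared:
            # Assumed edge weight: shared commenters / all commenters on either site.
            edges[(a, b)] = len(shared) / len(by_site[a] | by_site[b])
    return edges


diversity = commenter_diversity(comments)
edges = shared_commenter_edges(comments)
```

A diversity near 1.0 means almost every comment comes from a different person; a value near 0 means a handful of regulars do most of the talking. The edge dictionary could be handed to any graph library for drawing.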

The big takeaway: with any given dataset, play with visualizations first before trying to draw a conclusion.

Questions publishers want answered

Short list of questions publishers want answered that I believe could be answered with the right data:

  • Who are my best writers?
  • What topics are my audience most engaged in?
  • Which types of pieces do best over time?
  • What type of stories should I have my writers work on?
  • When is the best time to publish?
  • What’s the best length for a piece?
  • Does including rich media help with engagement?
  • Do my writers actually need to include links? How many?

What am I missing?

Obviously most publishers know most of these by heart; it’s key to running a successful business. What’s more interesting is to use this type of data as a baseline for experimentation.

It’s important to remember the difference between creation and optimization, and how data can be used for each.

Background information on our survey of Knight News Challenge projects

If you’re reading this post because I, Chris Amico, or one of two other collaborators emailed you this link, congratulations! You’re one of the 64 projects funded since 2007 through the Knight Foundation’s News Challenge contest. These projects have been granted $21.9 million over the last four years, and we’re curious to hear how they ended up.

A bit of background. Next weekend, David Cohn of (not one of the trouble-makers) is bringing a couple dozen of us together for Hardly Strictly Young. It’s at the Reynolds Journalism Institute, sponsored by the Knight Foundation, and will be my first trip to Missouri. Over two full days, we’ll discuss facets of the Knight Foundation’s commission on the information needs of communities. Part of the goal, at least as those of us running the survey see it, is to help the Knight Foundation learn from the first four years of the News Challenge. It is arguably the most significant effort from news industry actors to inspire innovation within said industry. In other words, it’s been our only hope.

Unfortunately, there’s not much data for us to work with. Yet. The Knight Foundation has all of the winners listed on the News Challenge website, along with their project descriptions and amount granted, but very little information on outcomes. This is where you fit into our crowdsourced reporting project.

We have two sections on our survey form. The first asks for quantitative information on your project, and is intentionally required for you to submit the form. We want to know whether your project is still active, how much of your grant you actually spent, and whether you achieved your stated objectives. These responses will go on the big ol’ spreadsheet of data we’ll eventually release. The second (optional and/or anonymous) section asks for a qualitative perspective on your project, including how it was successful, what challenges you faced, and what you thought of your experience with the News Challenge. These questions are intentionally broad. If you decide to respond anonymously, we won’t publish the remarks with your name (if we choose to publish them).

This data is quite important. Thank you in advance for taking at least a few minutes to respond. To make things fun, we’ll be updating a public list of who has and hasn’t yet responded. So encourage your friends who haven’t yet replied to do so. I’d like to thank On The Media for the creative idea.

Idea: Alternate census

In tandem with the regular census, do an alternate census of the inanimate objects that make up the human environment. This might include numbers like:

  • Cell phone and landline subscribers
  • Grocery stores per neighborhood
  • Schools and hospitals, and their capacity
  • Restaurants per neighborhood
  • Average wait time for metro or bus

Obviously the open question is how you’d collect this data. I suspect some already exists and some we’d have to get creative about. The thought came while discussing a Craft 2 census project this morning. To me, this alternate data set is the missing (and more actionable) other half.

Key departures suggest 4 factors critical to the future of programming and journalism

Both Matt Waite and Jeremy Bowers are out at the St. Petersburg Times. Factors influencing data journalism at larger news organizations:

  • News apps challenge longstanding perceptions of who owns technology within a media company.
  • Regardless of who is placed in what department, developers and journalists must be able to collaborate so they can create new tools.
  • News organizations will have to emphasize project management and product development if they hope to compete with digitally-native information companies.
  • News organizations must truly support risk-taking in order to see its rewards.

Excellent state-of-the-field analysis. All challenges to solve.

Tuesday night distraction: Versioned Data Carnival of Journalism

A few days back, Saturday to be exact, the crazy notion that I should spend dozens of hours doing content analysis on The Locals came to mind. For my Carnival of Journalism blog post, I want to paint a clear picture of what university-sponsored hyperlocal journalism is like today. This can then be a foundation for any bushy-eyed speculation I might do about the future.

Sunday evening, I created a Github repository for two reasons: to see how my code is evolving and to track step by step how I’m putting this data together. After all, journalism must be reproducible.

Now that it’s closer to deadline, I want to open the floor. What data points would you like to see established about The Locals? As of right now, I know that the LEV (Local East Village) produced 100 blog posts in November 2010 from 29 authors and 19 community contributors. The FGCH (Fort Greene-Clinton Hill) produced 105 blog posts in November 2010 from 23 authors and 23 community contributors. The rest of the questions I’ve established are in my research notes.
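
For anyone curious how counts like these could be reproduced, here is a minimal sketch, assuming each post has been reduced to a record with a site, a publish date, a byline, and a role. The field names, the `monthly_summary` helper, and the sample rows are hypothetical and made up for illustration; they are not the actual data or code from my repository.

```python
from collections import defaultdict

# Made-up rows for illustration only; not actual data from The Locals.
posts = [
    {"site": "LEV", "published": "2010-11-02", "byline": "Writer One", "role": "author"},
    {"site": "LEV", "published": "2010-11-09", "byline": "Writer One", "role": "author"},
    {"site": "LEV", "published": "2010-11-15", "byline": "Neighbor A", "role": "community"},
    {"site": "LEV", "published": "2010-12-01", "byline": "Writer Two", "role": "author"},
    {"site": "FGCH", "published": "2010-11-20", "byline": "Writer Three", "role": "author"},
]


def monthly_summary(posts, month):
    """Count posts and distinct bylines per site for a 'YYYY-MM' month."""
    summary = defaultdict(lambda: {"posts": 0, "author": set(), "community": set()})
    for post in posts:
        # ISO dates start with 'YYYY-MM', so a prefix check selects the month.
        if not post["published"].startswith(month):
            continue
        row = summary[post["site"]]
        row["posts"] += 1
        row[post["role"]].add(post["byline"])
    return {
        site: {
            "posts": row["posts"],
            "authors": len(row["author"]),
            "community_contributors": len(row["community"]),
        }
        for site, row in summary.items()
    }


november = monthly_summary(posts, "2010-11")
```

Because the bylines are collected in sets, someone who writes ten posts still counts as one author.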

P.S. Another part of the experiment is to see how well Git works as a versioned authoring tool.