Diffbot: Identify and Extract from Any Web page. Uses NLP and machine learning to extract structured data from arbitrary web pages.
First point: Let’s approach journalism as a science of civic participation. Give journalism the goal of helping us improve our standards of living, create a more just society, and so on. Make those goals measurable in various ways, and we can track our progress toward them.
Science, according to Wikipedia, “builds and organizes knowledge in the form of testable explanations and predictions about the world.” A report in a scientific journal has an abstract, methodology, presentation of the data, discussion and conclusion. News articles typically have the first and last. They’re missing two critical pieces: presentation of the data and the methodology used to collect the data. Reproducibility is a vital aspect of the scientific method (related: Jonah Lehrer has a fascinating article on this topic in the New Yorker).
Standards-based journalism in a semantic economy. Forget widgets and Twitter apps, this is the opportunity. Required reading.
Imagine a global economy in which every piece of information is linked directly to its meaning and origin. In which queries produce answers, not expensive, time-consuming evaluation tasks. Imagine a world in which reliable, intelligent information structures give everyone an equal ability to make profitable decisions, or in many cases, profitable new information products. Imagine companies that get paid for the information they generate or collect based on its value to end users, rather than on the transitory attention it generates as it passes across a screen before disappearing into oblivion.
This is exactly the opportunity.
I find it interesting that no one has considered this in terms of the adoption of stream-based social tools, where the use of URLs is increasingly not about navigation but about fetching. Instead of clicking on a URL to a photo in my Twitter stream, my Twitter client pulls the photo and resolves it in the context of my streaming application.
One of the principles of the web of flow is the fragmentation of older, page-based media into easily streamable bits. And for that to work, each fragment has to have a unique ID, which is exactly what deep links are all about.
Holy Moses this is smart.
An idea for the now: a web application that OCRs my grocery receipts to track my dietary habits over time while at the same time building a realtime database of food prices across the city. A bit like the Brian Lehrer project, but on a grander scale.
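The parsing half of that idea can be sketched quickly. This is a minimal, hypothetical illustration, assuming the OCR step is handled elsewhere (by an OCR library such as Tesseract) and that receipts print one uppercase item per line followed by a price; every name and the sample receipt below are my own invention, not part of any existing project:

```python
# Hypothetical sketch: turn OCR'd receipt text into structured records
# that could feed both a personal diet log and a shared price database.
import re
from datetime import date

# One uppercase item name, whitespace, optional "$", then a price like 3.49.
ITEM_RE = re.compile(r"^(?P<name>[A-Z][A-Z ]+?)\s+\$?(?P<price>\d+\.\d{2})$")

# Summary lines we don't want recorded as purchased items.
SKIP = {"TOTAL", "SUBTOTAL", "TAX"}

def parse_receipt(text, store, when):
    """Extract line items from raw OCR output as dicts ready for storage."""
    items = []
    for line in text.splitlines():
        m = ITEM_RE.match(line.strip())
        if m and m.group("name").strip() not in SKIP:
            items.append({
                "item": m.group("name").strip(),
                "price": float(m.group("price")),
                "store": store,           # lets prices be compared across the city
                "date": when.isoformat(),  # lets habits be tracked over time
            })
    return items

# Invented sample of what OCR output might look like.
sample = """WHOLE MILK  $3.49
BANANAS  $0.89
TOTAL  $4.38"""

records = parse_receipt(sample, "Corner Grocer", date(2010, 8, 1))
```

In the aggregate, records like these keyed by item, store, and date are exactly the realtime city-wide price database the idea describes; real receipts would of course need far messier parsing than this regex handles.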
What follows are a few of the questions that have been consuming a significant amount of my brain cycles recently. This may or may not be a departure from what I might normally post, but I’d like to start using my web presence as a personal data store as much as a place to publish opinionated pieces about this, that, or the other.
Two more notes. First, on the subject of journalism, it’d be fascinating to see beat reporters regularly post their current questions of interest. This may even be a sellable asset. In addition to benefiting from the information they produce, I, as a reader, could also learn tremendously from their research process.
Second, I literally cannot wait until I have a tool that allows me to manage my learning process. Specifically, I’d love to be able to articulate questions that inspire movement towards knowledge, map my answers when I find them, and then computationally mine the activity data for insights.
How many hours a day are wasted trying to solve a problem that has either already been solved or just needs existing data to generate a solution? Which industries spend the greatest amount of time solving information problems, and what would be the economic gains if you could provide the “just-in-time” data needed to solve the problems? What tools do you need to actively monitor and provide for these information needs?
How does the nature of work change when the efficiencies of technology render an increasing number of jobs obsolete? How is the nature of local business and commerce shifting because of the web and supply chain efficiencies?
What percentage of students have to take out loans for tuition, and how has that number changed over the years? How has the payback period changed in total and by course of study? Does higher education make more or less economic sense? This data repository may hold answers.
What is the breakdown of information provided by a traditional newspaper (how much and of what topics)? What other local information providers overlap with this information, and how much of it is unique to the newspaper? What are the overall information needs of the community, and how do you surface and visualize this?
What percentage of vehicles drive down I-5 with solely a single occupant? How could you incentivize these drivers to self-report their “flight plans”? What systems have attempted to solve this, and what have been their successes and failures?
In what ways can you produce, structure and save a lot of personal data in such a fashion that it can become useful in the aggregate? How do you bake this into your workflow so that it isn’t extra work? What bits of data would be useful on a personal level, a community level, and/or a societal level? Related: an absolutely fascinating RadioLab episode explores how the mining of Agatha Christie’s written works led to a surprising insight.
The river seems like a wholly inadequate format for visualizing this data in any actionable sense, especially because the use of the data is context-specific.
What type of data would be useful from a real time, human-powered, and geographically distributed sensor network? For Ushahidi Haiti, it was emergencies, public health issues, security threats, infrastructure damage, and natural hazards. Part one is organically producing the data in a structured, usable format, and part two is leveraging it to significantly improve the speed and quality of the response to the event.