Introduction. Computational Journalism Week 1

Upload: jonathan-stray

Post on 06-Jan-2016


DESCRIPTION

Jonathan Stray, Columbia University, Fall 2015. Syllabus at http://www.compjournalism.com/?p=133

TRANSCRIPT

  • Frontiers of Computational Journalism

    Columbia Journalism School

    Week 1: Introduction

    September 11, 2015

  • Lecture 1: Basics

    Computer Science and Journalism

    Course Structure

    Interpreting High Dimensional Data

  • Computational Journalism: Definitions

    Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking. - Cohen, Hamilton, Turner, Computational Journalism, 2011

  • Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents. - Cohen, Hamilton, Turner, Computational Journalism, 2011

    Computational Journalism: Definitions

  • Cohen et al. model

    [Diagram: Data, Reporting, User, Computer Science]

  • CS for presentation / interaction

    [Diagram: Data, Reporting, User; CS label appears twice]

  • Filter stories for user

    [Diagram: three Data, Reporting chains; Filtering; User; CS labels throughout]

  • Examples of filters

    o Facebook news feed
    o What an editor puts on the front page
    o Google News
    o Reddit's comment system
    o Twitter
    o Techmeme
    o New York Times recommendation system

  • http://snap.stanford.edu/nifty

  • Kony 2012 early network, by Gilad Lotan

  • CS in Journalism

    [Diagram: three Data, Reporting chains; Filtering; User; Effects; CS labels throughout]

  • Journalism with algorithms vs.

    Journalism about algorithms

  • Websites Vary Prices, Deals Based on Users' Information - Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012

  • Message Machine - Jeff Larson, Al Shaw, ProPublica, 2012

  • Where does data come from?

  • Computer Science in Journalism

    Reporting

    Presentation Filtering Tracking

    Algorithmic accountability

  • Quantification

    Data

  • Journalism as a cycle

    [Diagram: Data, Reporting, Filtering, Effects, User, with CS labels]

  • Computational Journalism: Definitions

    the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. - Jonathan Stray, A Computational Journalism Reading List, 2011

  • Course Structure

    o Information retrieval: TF-IDF, search engines
    o Text analysis: clustering and topic modeling
    o Information filtering systems
    o Social network analysis
    o Knowledge representation
    o Drawing conclusions from data
    o Writing about data
    o Information security
    o Tracking flow and effects

  • Natural Language Processing

    Visualization

    Sociology

    Artificial Intelligence

    Cognitive Science

    Statistics

    Graph Theory

    Clustering

    Text Analysis

    Filter Design

    Social Network Analysis

    Knowledge Representation

    Drawing Conclusions

    Information Retrieval

    Epistemology

  • Administration

    Assignment after each class

    Four assignments require programming, but your writing counts for more than your code!

    Course blog http://compjournalism.com

    Final project for 6-pt students only

  • Grading

    Dual degree students: Pass/Fail. Final project: paper, story, or software.

    Non-journalism students: 80% assignments, 20% class participation

  • Definition of data?

  • My Definition of data

    a collection of related pieces of recorded information

  • structured data

  • unstructured data

  • Quantification

    \[
    \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
    \]

  • Other things that are tricky to quantify, but quantified anyway

    o Intelligence
    o Academic performance
    o Gender
    o Race, ethnicity, nationality
    o Number of sexual harassment incidents
    o Income
    o Political ideology
    o ...

  • Different types of quantitative

    Numeric
    o continuous
    o countable
    o bounded?
    o units of measurement?

    Categorical
    o finite, e.g. {on, off}
    o infinite, e.g. {red, yellow, blue, ... chartreuse}
    o ordered?
    o equivalence classes or other structure?

  • Different types of scales

    Temperature: continuous scale, fixed zero point, physical units, comparative, uniform

    Likert scale: discrete scale, no fixed origin, abstract units, comparative, non-uniform

  • Likert scales are non-uniform

  • No averages on a non-uniform scale

    It's not linear, so is 2X1 twice as good?

    (X1 + c) − (X2 + c) ≠ X1 − X2

    Lots of things don't make much sense, such as

    sum(X1 ... XN) / N = ?

    Average is not well defined! (Nor std dev, etc.) But rank order statistics are robust. And all of this might not be a problem in practice.
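One way to see why the average is fragile while rank order statistics hold up is to recode the same Likert responses under two different numeric codings that both preserve order. The sketch below is only an illustration: the two groups of responses and the "stretched" coding are invented here, not taken from the lecture.

```python
from statistics import mean, median

# Hypothetical 5-point Likert responses for two groups (made-up data).
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 2, 5, 5]

# Two codings that keep the same order of categories.
# "uniform" assumes equal spacing; "stretched" assumes the top
# category is much further from the rest. Both are assumptions.
uniform = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def recode(responses, coding):
    """Map each Likert category to the number the coding assigns it."""
    return [coding[r] for r in responses]


def higher_group(stat, coding):
    """Return which group scores higher under a statistic and a coding."""
    a = stat(recode(group_a, coding))
    b = stat(recode(group_b, coding))
    if a == b:
        return "tie"
    return "A" if a > b else "B"


for name, coding in [("uniform", uniform), ("stretched", stretched)]:
    print(name, "mean:", higher_group(mean, coding),
          "median:", higher_group(median, coding))
```

With these invented numbers the comparison of means flips between the two codings, while the comparison of medians does not, because the median depends only on the order of the responses.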

  • Other issues with quantitative

    Where did the data come from?
    o physical measurement
    o computer logging
    o human recording

    What are the sources of error?
    o measurement error
    o missing data
    o ambiguity in human classification
    o process errors
    o intentional bias / deception

  • Vector representation of objects

    Fundamental representation for many data mining, clustering, machine learning, visualization, NLP, etc. algorithms.

    \[
    \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
    \]

    Each x_i is a numerical or categorical feature

    N = number of features or dimension

  • Examples of features

    o number of claws
    o latitude
    o color {red, yellow, blue}
    o number of break-ins
    o 1 for bought X, 0 for did not buy X
    o time, duration, etc.
    o number of times word Y appears in document
    o votes cast
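To make a few of these features concrete, here is a minimal sketch that turns one record into a flat numeric vector. The record's field names and values are hypothetical, and the one-hot encoding of the color category is one common choice assumed here, not something the slides prescribe.

```python
# Categories for the "color" feature, in a fixed order.
COLORS = ["red", "yellow", "blue"]


def to_vector(record):
    """Encode a record (hypothetical field names) as a list of numbers."""
    # One-hot encoding: one slot per color, 1 where the record matches.
    one_hot = [1 if record["color"] == c else 0 for c in COLORS]
    return [
        record["claws"],                 # numeric, countable
        record["latitude"],              # numeric, continuous
        *one_hot,                        # categorical, finite
        1 if record["bought_x"] else 0,  # 1 for bought X, 0 for did not
    ]


example = {"claws": 4, "latitude": 40.8, "color": "yellow", "bought_x": True}
vector = to_vector(example)
print(vector)  # [4, 40.8, 0, 1, 0, 1]
```

Here N is 6: two numeric features, three slots for the color category, and one binary feature.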

  • Feature selection

    Technical meaning in machine learning etc.: which variables matter?

    We're journalists, so we're interested in an earlier process: how to describe the world in numbers?

  • Choosing Features

    \[
    \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
    \longrightarrow
    \begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix}
    \quad \text{where } k \ll N
    \]

    Journalism: How do we represent the world numerically?

    Machine learning: Which variables carry the most information?
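The machine learning sense of choosing features can be sketched as picking k of the N positions in each vector. The example below uses variance as a crude stand-in for "carries the most information". That ranking rule is an assumption made for illustration, as are the function names and the data; real feature selection methods are more careful.

```python
from statistics import pvariance


def select_features(x, f):
    """Keep only the entries of x at the chosen indices f(1), ..., f(k)."""
    return [x[i] for i in f]


def top_k_by_variance(rows, k):
    """Pick the k feature indices with the largest variance across rows.

    Variance is only a rough proxy for informativeness (an assumption
    of this sketch): a feature that never changes cannot distinguish items.
    """
    n = len(rows[0])
    variances = [pvariance([row[i] for row in rows]) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:k])


# Made-up data: the first feature is constant, so it is dropped.
rows = [
    [1, 0, 10],
    [1, 1, 20],
    [1, 0, 30],
    [1, 1, 40],
]
f = top_k_by_variance(rows, k=2)
reduced = [select_features(row, f) for row in rows]
print(f)        # [1, 2]
print(reduced)  # [[0, 10], [1, 20], [0, 30], [1, 40]]
```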

  • Examples of vector representations

    Obvious
    o movies watched / items purchased
    o legislative voting history for a politician
    o crime locations

    Less obvious, but standard
    o document vector space model
    o psychological survey results

    Tricky research problem: disparate field types
    o corporate filing document
    o Wikileaks SIGACT
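As a rough illustration of the document vector space model, each document can become a vector of word counts over a shared vocabulary. This sketch uses plain term counts (not TF-IDF) and a deliberately simple tokenizer. The sample sentences and function names are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_vectors(documents):
    """Return a sorted vocabulary and one count vector per document."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


docs = ["The mayor met the council.", "The council voted."]
vocabulary, vectors = document_vectors(docs)
print(vocabulary)  # ['council', 'mayor', 'met', 'the', 'voted']
print(vectors)     # [[1, 1, 1, 2, 0], [1, 0, 0, 1, 1]]
```

Each position in a vector is the feature "number of times word Y appears in document", and N is the size of the vocabulary.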

  • What can we do with vectors?

    Predict one variable based on others
    o this is called regression
    o or maybe "classification"
    o supervised machine learning

    Group similar items together
    o this is clustering
    o or maybe "classification" with unknown categories
    o unsupervised machine learning
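Both uses of vectors can be sketched in a few lines of plain Python. The first function fits a straight line by ordinary least squares (a minimal regression with one predictor). The second is a bare-bones k-means clustering that seeds its centroids with the first k points. Both the choice of algorithms and the data are assumptions made for illustration, not the methods the course necessarily uses.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (supervised)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def k_means(points, k, iterations=10):
    """Assign each point to one of k clusters (unsupervised).

    Naive initialization: the first k points are the starting centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [mean(dim) for dim in zip(*members)]
    return labels


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0

points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

In the regression the target values are given, which is what makes it supervised. In the clustering no labels are supplied, and the groups come only from distances between vectors.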