

Ethnographies of Datasets: Teaching Critical Data Analysis through R Notebooks

Abstract

With the growth of data science in industry, academic research, and government planning over the past decade, there is an increasing need to equip students with skills not only in responsibly analyzing data, but also in investigating the cultural contexts from which the values reported in data emerge. A risk of several existing models for teaching data ethics and critical data literacy is that students will come to see data critique as something that one does in a compliance capacity prior to performing data analysis or in an auditing capacity after data analysis rather than as an integral part of data practice. This article describes how I integrate critical data reflection with data practice in my undergraduate course Data Sense and Exploration. I present a series of R Notebooks that walk students through a data analysis project while encouraging them, each step of the way, to record field notes on the history and context of their data inputs, the erasures and reductions of narrative that emerge as they clean and summarize the data, and the rhetoric of the visualizations they produce from the data. I refer to the project as an “ethnography of a dataset” not only because students examine the diverse cultural forces operating within and through the data, but also because students draw out these forces through immersive, consistent, hands-on engagement with the data.

Introduction

Last Spring one of my students made an important discovery regarding the politics encoded in data about California wildfires. Aishwarya Asthana was examining a dataset published by California’s Department of Forestry and Fire Protection (CalFIRE), documenting the acres burned for each government-recorded wildfire in California from 1878 to 2017. The dataset also included variables such as the fire’s name, when it started and when it was put out, which agency was responsible for it, and the reason it ignited. Asthana was practicing applying techniques for univariate data analysis in R—taking one variable in the dataset and tallying up the number of times each value in that variable appears. Such analyses help to summarize and reveal patterns in the data, prompting questions about why certain values appear more than others.
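The kind of univariate tally described above can be sketched in a few lines of R. This is a minimal illustration rather than the course's actual lab code: the `fires_toy` data frame and all of its rows are invented, and only the column name `CAUSE` is borrowed from the plotting code that appears later in this article.

```r
library(dplyr)

# Hypothetical toy data standing in for the CalFIRE dataset;
# every row below is invented for illustration only.
fires_toy <- data.frame(
  FIRE_NAME = c("A", "B", "C", "D", "E", "F"),
  CAUSE = c("1 - Lightning", "14 - Unknown/Unidentified", "1 - Lightning",
            "4 - Campfire", "14 - Unknown/Unidentified",
            "14 - Unknown/Unidentified")
)

# Univariate tally: count how many rows carry each distinct CAUSE value,
# sorted so that the most frequent value comes first.
cause_counts <- fires_toy %>%
  count(CAUSE, sort = TRUE)

print(cause_counts)
```

The resulting table has one row per distinct value and a column `n` holding its frequency, which is the starting point for asking why some values appear more often than others.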

Tallying up the number of times each distinct wildfire cause appeared in the dataset, Asthana discovered that CalFIRE categorizes each wildfire into one of nineteen distinct cause codes, such as “1—Lightning,” “2—Equipment Use,” “3—Smoking,” and “4—Campfire.” According to the analysis, 184 wildfires were caused by campfires, 1,543 wildfires were caused by lightning, and, in the largest category, 6,367 wildfires were categorized with a “14—Unknown/Unidentified” cause code. The cause codes that appeared the fewest number of times (and thus were attributed to the fewest number of wildfires) were “12—Firefighter Training” and the final code in the list: “19—Illegal Alien Campfire.”

fires %>% 
  ggplot(aes(x = reorder(CAUSE,CAUSE,
                     function(x)-length(x)), fill = CAUSE)) +
  geom_bar() +
  labs(title = "Count of CalFIRE-documented Wildfires since 1878 by Cause", x = "Cause", y = "Count of Wildfires") + 
  theme_minimal() +
  theme(legend.position = "none", 
        plot.title = element_text(size = 12, face = "bold")) +
  coord_flip() 

Figure 1. Plot of CalFIRE-documented wildfires by cause, produced in R. In the plot, the fewest fires have been attributed to illegal alien campfires and firefighter training.

Interpreting the data unreflectively, one might say, “From 1878 to 2017, four California wildfires have been caused by illegal alien campfires—making it the least frequent cause.” Toward the beginning of the quarter in Data Sense and Exploration, many students, particularly those majoring in math and statistics, compose statements like this when asked to draw insights from data analyses. However, in only reading the data on its surface, this statement obscures important cultural and political factors mediating how the data came to be reported in this way. Why are “illegal alien campfires” categorized separately from just “campfires”? Who has stakes in seeing quantitative metrics specific to campfires purportedly ignited by this subgroup of the population—a subgroup that can only be distinctly identified through systems of human classification that are also devised and debated according to diverse political commitments?

While detailing the history of the data’s collection and some potential inconsistencies in how fire perimeters are calculated, the data documentation provided by CalFIRE does not answer questions about the history and stakes of these categories. In other words, it details the provenance of the data but not the provenance of its semantics and classifications. In doing so, it naturalizes the values reported in the data in ways that inadvertently discourage recognition of the human discernment involved in their generation. Yet, even a cursory Web search of the key phrase “illegal alien campfires in California” reveals that attribution of wildfires to undocumented immigrants in California has been used to mobilize political agendas and vilify this population for more than two decades (see, for example, Hill 1996). Discerning the critical import of this data analysis thus demands more than statistical savvy; to assess the quality and significance of this data, an analyst must reflect on their own political and ethical commitments.

Data Sense and Exploration is a course designed to help students reckon with the values reported in a dataset so that they may better judge their integrity. The course is part of a series of undergraduate data studies courses offered in the Science and Technology Studies Program at the University of California Davis, aiming to cultivate student skill in applying critical thinking towards data-oriented environments. Data Sense and Exploration cultivates critical data literacy by walking students through a quarter-long research project contextualizing, exploring, and visualizing a publicly accessible dataset. We refer to the project as an “ethnography of a dataset,” not only because students examine the diverse cultural forces operating within and through the data, but also because students draw out these forces through immersive, consistent, hands-on engagement with the data, along with reflections on their own positionality as they produce analyses and visualizations. Through a series of labs in which students learn how to quantitatively summarize the features in a dataset in the coding language R (often referred to as a descriptive data analysis), students also practice researching and reflecting on the history of the dataset’s semantics and classification. In doing so, the course encourages students to recognize how the quantitative metrics that they produce reflect not only the way things are in the world, but also how people have chosen to define them. Perhaps most importantly, the course positions data as always already structured according to diverse biases and thus aims to foster student skill in discerning which biases they should trust and how to responsibly draw meaning from data in spite of them. In this paper, I present how this project is taught in Data Sense and Exploration and some critical findings students made in their projects.

Teaching Critical Data Analysis

With the growth of data science in industry, academic research, and government planning over the past decade, universities across the globe have been investing in the expansion of data-focused course offerings. Many computationally or quantitatively focused data science courses seek to cultivate student skill in collecting, cleaning, wrangling, modeling, and visualizing data. Simultaneously, high-profile instances of data-driven discrimination, surveillance, and misinformation have pushed universities to also consider how to expand course offerings regarding responsible and ethical data use. Some emerging courses, often taught directly in computer and data science departments, introduce students to frameworks for discerning “right from wrong” in data practice, focusing on individual compliance with rules of conduct at the expense of attention to the broader institutional cultures and contexts that propagate data injustices (Metcalf, Crawford, and Keller 2015). Other emerging courses, informed by scholarship in science and technology studies (STS) and critical data studies (CDS), take a more critical approach, broadening students’ moral reasoning by encouraging them to reflect on the collective values and commitments that shape data and their relationship to law, democracy, and sociality (Metcalf, Crawford, and Keller 2015).

While such courses help students recognize how power operates in and through data infrastructure, a risk is that students will come to see the evaluation of data politics and the auditing of algorithms as a separate activity from data practice. While seeking to cultivate student capacity to foresee the consequences of data work, coursework that divorces reflection from practice ends up positioning these assessments as something one does after data analysis in order to evaluate the likelihood of harm and discrimination. Research in critical data studies has indicated that this divide between data science and data ethics pedagogy has rendered it difficult for students to recognize how to incorporate the lessons of data and society into their work (Bates et al. 2020). Thus, Data Sense and Exploration takes a different approach, walking students through a data analysis project while encouraging them, each step of the way, to record field notes on the history and context of their data inputs, the erasures and reductions of narrative that emerge as they clean and summarize the data, and the rhetoric of the visualizations they produce. As a cultural anthropologist, I’ve structured the class to draw from my own training in and engagement with “experimental ethnography” (Clifford and Marcus 1986). Guided by literary, feminist, and postcolonial theory, cultural anthropologists engage experimental ethnographic methods to examine how systems of representation shape subject formation and power. In this sense, Data Sense and Exploration positions data inputs as cultural artifacts, data work as a cultural practice, and ethnography as a method that data scientists can and should apply in their work to mitigate the harm that may arise from it. Importantly, walking students into awareness of the diverse cultural forces operating in and through data helps them more readily recognize opportunities for intervention. Rather than criticizing the values and political commitments that they bring to their work as biasing the data, the course celebrates such judgments when bent toward advancing more equitable representation.

The course is predominantly inspired by literature in data and information infrastructure studies (Bowker et al. 2009). These fields study the cultural and political contexts of data and the infrastructures that support them by interviewing data producers, observing data practitioners, and closely reading data structures. For example, through historical and ethnographic studies of infrastructures for data access, organization, and circulation, the field of data infrastructure studies examines how data is made and how it transforms as it moves between stakeholders and institutions with diverse positionalities and vested interests (Bates, Lin, and Goodale 2016). Critiquing the notion that data can ever be pure or “raw,” this literature argues that all data emerge from sites of active mediation, where diverse epistemic beliefs and political commitments mold what ultimately gets represented and how (Gitelman 2013). Diverting from an outsized focus on data bias, Data Sense and Exploration prompts students to grapple with the “interpretive bases” that frame all data, regardless of whether it has been produced through personal data collection, institutions with strong political proclivities, or automated data collection technologies. In this sense, the course advances what Gray, Gerlitz, and Bounegru (2018) refer to as “data infrastructure literacy” and demonstrates how students can apply critical data studies techniques to critique and improve their own day-to-day data science practice (Neff et al. 2017).

Studying a Dataset Ethnographically

Data Sense and Exploration introduces students to examining a dataset and data practices ethnographically through an extended research project, carried out incrementally through a series of weekly labs.[1] While originally the labs were completed collaboratively in a classroom setting, in the move to remote instruction in Spring 2020, the labs were reformulated as a series of nine R Notebooks, hosted in a public GitHub repository that students clone into their local coding environments to complete. R Notebooks are digital documents, written in the markup language Markdown, that enable authors to embed chunks of executable R code amidst text, images, and other media. The R Notebooks that I composed for Data Sense and Exploration include text instruction for how to find, analyze, and visualize a rectangular dataset, or a dataset in which values are structured into a series of observations (or rows) each described by a series of variables (or columns). The Notebooks also model how to apply various R functions to analyze a series of example datasets, offer warnings of the various faulty assumptions and statistical pitfalls students may encounter in their own data practice, and demonstrate the critical reflection that students will be expected to engage in as they apply the functions in their own data analysis.

Interspersed throughout the written instruction, example code, and reflections, the Notebooks provide skeleton code for students to fill in as they go about applying what they have learned to a dataset they will examine throughout the course. At the beginning of the course, when many students have no prior programming experience, the skeleton code is quite controlled, asking students to “fill-in-the-blank” with a variable from their own dataset or with a relevant R function.

# Uncomment below and count the distinct values in your unique key. Note that you may need to select multiple variables. If so, separate them by a comma in the select() function.
#n_unique_keys <- _____ %>% select(_____) %>% n_distinct()

# Uncomment below and count the rows in your dataset by filling in your data frame name.
#n_rows <- nrow(_____)

# Uncomment below and then run the code chunk to make sure these values are equal.
# n_unique_keys == n_rows

Figure 2. Example of skeleton code from R Notebooks.
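
To illustrate how a student might complete the skeleton above, here is a hedged sketch using a small invented data frame. The names `routes_toy`, `route_id`, `year`, and `birds_counted` are hypothetical and are not drawn from any course dataset; the point is only to show the unique-key check when the key spans two variables.

```r
library(dplyr)

# Hypothetical toy data: each row is one survey route in one year.
# Neither column alone identifies a row, but the pair does.
routes_toy <- data.frame(
  route_id = c(1, 1, 2, 2),
  year = c(2018, 2019, 2018, 2019),
  birds_counted = c(40, 52, 31, 31)
)

# Count the distinct combinations of the candidate key variables.
n_unique_keys <- routes_toy %>% select(route_id, year) %>% n_distinct()

# Count the rows in the data frame.
n_rows <- nrow(routes_toy)

# If the two numbers are equal, the selected variables
# uniquely identify each row of the dataset.
n_unique_keys == n_rows
```

Selecting only `route_id` in this toy example would yield two distinct values against four rows, signaling that it is not a unique key on its own.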

However, as students gain familiarity with the language, each week, they are expected to compose code more independently. Finally, in each Notebook, there are open textboxes, where students record their critical reflections in response to specific prompts.

Teaching this course in the Spring 2020 quarter, I found that the structure provided by the R Notebooks overall was particularly supportive to students who were coding in R for the first time and that, given the examples provided throughout the Notebooks, students exhibited greater depth of reflection in response to prompts. However, without the support of a classroom once we moved online, I also found that novice students struggled more to interpret what the plots they produced in R were actually showing them. Moreover, advanced students were more conservative in their depth of data exploration, closely following the prompts and relying on code templates. In future iterations of the course, I thus intend to spend more synchronous time in class practicing how to quantitatively summarize the results of their analysis. I also plan to add new sections at the end of each Notebook, prompting students to leverage the skills they learned in that Notebook in more creative and free-form data explorations.

Each time I teach the course, individual student projects are structured around a common theme. In the iteration of the course that inspired the project that opens this article, the theme was “social and environmental challenges facing California.” In the most recent iteration of the course, the theme was “social vulnerability in the wake of a pandemic.” In an early lab, I task students with identifying issues warranting public concern related to the theme, devising research questions, and searching for public data that may help answer those questions. Few students entering the course have been taught how to search for public research, let alone how to search for public data. In order to structure their search activity, I task the students with imagining and listing “ideal datasets”—intentionally delineating their topical, geographic, and temporal scope—prior to searching for any data. Examining portals like data.gov, Google’s dataset search, and city and state open data portals, students very rarely find their ideal datasets and realize that they have to restrict their research questions in order to complete the assignment. Grappling with the dearth of public data for addressing complex contemporary questions around equity and social justice provides one of the first eye-opening experiences in the course. A Notebook directive prompts students to reflect on this.

Throughout the following week, I work with groups of students to select datasets from their research that will be the focus of their analysis. This is perhaps one of the most challenging tasks of the course for me as the instructor. While a goal is to introduce students to the knowledge gaps in public data, some public datasets have so little documentation that the kinds of insights students could extrapolate from examinations of their history and content would be considerably limited. Further, not all rectangular datasets are structured in ways that will integrate well with the code templates I provide in the R Notebooks. I grapple with the tension of wanting to expose students to the messiness of real-world data, while also selecting datasets that will work for the assignment.

Once datasets have been assigned, the remainder of the labs provide opportunities for immersive engagement with the dataset. In what follows, I describe a series of concepts (i.e. routines and rituals, semantics, classifications, calculations and narrative, chrono-politics, and geo-politics) around which I have structured each lab, and provide some examples of both the data work that introduced students to these concepts and the critical reflections they were able to make as a result.

Data Routines and Rituals

In one of the earlier labs, students conduct a close reading of their dataset’s documentation, an example of what Geiger and Ribes (2011) refer to as a “trace ethnography.” They note the stakeholders involved in the data’s collection and publication, the processes through which the data was collected, the circumstances under which the data was made public, and the changes in the data’s structure. They also search for news articles and scientific articles citing the dataset to get a sense of how governing bodies have leveraged the data to inform decisions, how social movements have advocated for or against the data’s collection, and how the data has advanced other forms of research. They outline the costs and labor involved in producing and maintaining the data, the formal standards that have informed the data’s structure, and any laws that mandate the data’s collection.

From this exercise, students learn about the diverse “rituals” of data collection and publication (Ribes and Jackson 2013). For instance, studying the North American Breeding Bird Survey (BBS), a dataset that annually records bird populations along about 4,100 roadside survey routes in the United States and Canada, Tennyson Filcek learned that the data is produced by volunteers skilled in visual and auditory bird identification. After completing training, volunteers drive to an assigned route with a pen, paper, and clipboard and count all of the bird species seen or heard over the course of three minutes at each designated stop along the route. They report the data back to the BBS Office, which aggregates the data and makes them available for public consumption. While these rituals shape how the data get produced, the unruliness of aggregating data collected on different days, by different individuals, under different weather and traffic conditions, and in different parts of the continent has prompted the BBS to implement recommendations and routines to account for disparate conditions. The BBS requires volunteers to complete counts around June, start the route a half-hour before sunrise, and avoid completing counts on foggy, rainy, or windy days. Just as these routines domesticate the data, the heterogeneity of the data’s contexts demands that the data be cared for in particular ways, in turn patterning data collection as a cultural practice. This lab is thus an important precursor to the remaining labs in that it introduces students to the diverse actors and commitments mediating the dataset’s production and affirms that the data could not exist without them.

While I have been impressed with students’ ability to outline details involving the production and structure of the data, I have found that most students rarely look beyond the data documentation for relevant information—often missing critical perspectives from outside commentators (such as researchers, activists, lobbyists, and journalists) that have detailed the consequences of the data’s incompleteness, inconsistencies, inaccuracies, or timeliness for addressing certain kinds of questions. In future iterations of the course, I intend to encourage students to characterize the viewpoints of at least three differently positioned stakeholders in this lab in order to help illustrate how datasets can become contested artifacts.

Data Semantics

In another lab, students import their assigned dataset into the R Notebook and programmatically explore its structure, using the scripting language to determine what makes one observation distinct from the next and what variables are available to describe each observation. As they develop an understanding of what each row of the dataset represents and how columns characterize each row, they refer back to the data documentation to consider how observations and variables are defined in the data (and what these definitions exclude). This focused attention to data semantics invites students to go behind the scenes of the observations reported in a dataset and develop a deeper understanding of how its values emerge from judgments regarding “what counts.”

ca_crimes_clearances <- read.csv("https://data-openjustice.doj.ca.gov/sites/default/files/dataset/2019-06/Crimes_and_Clearances_with_Arson-1985-2018.csv")

str(ca_crimes_clearances)
## 'data.frame':    24950 obs. of  69 variables:
##  $ Year               : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ County             : chr  "Alameda County" "Alameda County" "Alameda County" "Alameda County" ...
##  $ NCICCode           : chr  "Alameda Co. Sheriff's Department" "Alameda" "Albany" "Berkeley" ...
##  $ Violent_sum        : int  427 405 101 1164 146 614 671 185 199 6703 ...
##  $ Homicide_sum       : int  3 7 1 11 0 3 6 0 3 95 ...
##  $ ForRape_sum        : int  27 15 4 43 5 34 36 12 16 531 ...
##  $ Robbery_sum        : int  166 220 58 660 82 86 250 29 41 3316 ...
##  $ AggAssault_sum     : int  231 163 38 450 59 491 379 144 139 2761 ...
##  $ Property_sum       : int  3964 4486 634 12035 971 6053 6774 2364 2071 36120 ...
##  $ Burglary_sum       : int  1483 989 161 2930 205 1786 1693 614 481 11846 ...
##  $ VehicleTheft_sum   : int  353 260 55 869 102 350 471 144 74 3408 ...
##  $ LTtotal_sum        : int  2128 3237 418 8236 664 3917 4610 1606 1516 20866 ...
##  $ ViolentClr_sum     : int  122 205 58 559 19 390 419 146 135 2909 ...
##  $ HomicideClr_sum    : int  4 7 1 4 0 2 4 0 1 62 ...
##  $ ForRapeClr_sum     : int  6 8 3 32 0 16 20 6 8 319 ...
##  $ RobberyClr_sum     : int  32 67 23 198 4 27 80 21 16 880 ...
##  $ AggAssaultClr_sum  : int  80 123 31 325 15 345 315 119 110 1648 ...
##  $ PropertyClr_sum    : int  409 889 166 1954 36 1403 1344 422 657 5472 ...
##  $ BurglaryClr_sum    : int  124 88 62 397 9 424 182 126 108 1051 ...
##  $ VehicleTheftClr_sum: int  7 62 16 177 8 91 63 35 38 911 ...
##  $ LTtotalClr_sum     : int  278 739 88 1380 19 888 1099 261 511 3510 ...
##  $ TotalStructural_sum: int  22 23 2 72 0 37 17 17 7 287 ...
##  $ TotalMobile_sum    : int  6 4 0 23 1 26 18 9 3 166 ...
##  $ TotalOther_sum     : int  3 5 0 5 0 61 21 64 2 22 ...
##  $ GrandTotal_sum     : int  31 32 2 100 1 124 56 90 12 475 ...
##  $ GrandTotClr_sum    : int  11 7 1 20 0 14 7 2 2 71 ...
##  $ RAPact_sum         : int  22 9 2 31 4 21 25 9 15 451 ...
##  $ ARAPact_sum        : int  5 6 2 12 1 13 11 3 1 80 ...
##  $ FROBact_sum        : int  77 56 23 242 35 38 136 13 22 1120 ...
##  $ KROBact_sum        : int  22 23 2 71 10 7 43 3 4 264 ...
##  $ OROBact_sum        : int  3 11 2 43 11 3 7 1 1 107 ...
##  $ SROBact_sum        : int  64 130 31 304 26 38 64 12 14 1825 ...
##  $ HROBnao_sum        : int  59 136 26 351 56 32 116 3 0 1676 ...
##  $ CHROBnao_sum       : int  38 48 15 150 9 21 43 4 13 253 ...
##  $ GROBnao_sum        : int  23 2 1 0 2 7 43 6 9 83 ...
##  $ CROBnao_sum        : int  32 2 2 0 0 8 21 2 2 46 ...
##  $ RROBnao_sum        : int  11 20 6 47 14 9 19 3 2 306 ...
##  $ BROBnao_sum        : int  3 2 3 21 0 2 6 0 3 37 ...
##  $ MROBnao_sum        : int  0 10 5 91 1 7 2 11 12 915 ...
##  $ FASSact_sum        : int  25 16 3 47 6 47 43 10 26 492 ...
##  $ KASSact_sum        : int  27 30 2 103 8 38 55 13 21 253 ...
##  $ OASSact_sum        : int  111 90 10 224 9 120 208 29 43 396 ...
##  $ HASSact_sum        : int  68 27 23 76 36 286 73 92 49 1620 ...
##  $ FEBURact_Sum       : int  1177 747 85 2040 161 1080 1128 341 352 9011 ...
##  $ UBURact_sum        : int  306 242 76 890 44 706 565 273 129 2835 ...
##  $ RESDBUR_sum        : int  1129 637 100 2015 89 1147 1154 411 274 8487 ...
##  $ RNBURnao_sum       : int  206 175 33 597 32 292 295 100 44 2114 ...
##  $ RDBURnao_sum       : int  599 195 44 1418 26 485 532 163 103 5922 ...
##  $ RUBURnao_sum       : int  324 267 23 0 31 370 327 148 127 451 ...
##  $ NRESBUR_sum        : int  354 352 61 915 116 639 539 203 207 3359 ...
##  $ NNBURnao_sum       : int  216 119 32 224 44 274 238 104 43 1397 ...
##  $ NDBURnao_sum       : int  47 46 21 691 14 110 45 34 26 1715 ...
##  $ NUBURnao_sum       : int  91 187 8 0 58 255 256 65 138 247 ...
##  $ MVTact_sum         : int  233 187 42 559 85 219 326 76 56 2711 ...
##  $ TMVTact_sum        : int  56 33 4 55 9 71 88 40 9 121 ...
##  $ OMVTact_sum        : int  64 40 9 255 8 60 57 28 9 576 ...
##  $ PPLARnao_sum       : int  5 31 26 133 5 10 1 4 3 399 ...
##  $ PSLARnao_sum       : int  60 20 4 163 4 14 20 6 3 251 ...
##  $ SLLARnao_sum       : int  289 664 40 1277 1 704 1058 106 435 1123 ...
##  $ MVLARnao_sum       : int  930 538 147 3153 207 1136 753 561 241 8757 ...
##  $ MVPLARnao_sum      : int  109 673 62 508 153 446 1272 155 252 901 ...
##  $ BILARnao_sum       : int  205 516 39 611 16 360 334 276 151 349 ...
##  $ FBLARnao_sum       : int  44 183 46 1877 85 493 417 187 281 4961 ...
##  $ COMLARnao_sum      : int  11 53 17 18 24 27 59 7 2 70 ...
##  $ AOLARnao_sum       : int  475 559 37 496 169 727 696 304 148 4055 ...
##  $ LT400nao_sum       : int  753 540 84 533 217 937 1089 370 235 976 ...
##  $ LT200400nao_sum    : int  437 622 68 636 122 607 802 299 262 2430 ...
##  $ LT50200nao_sum     : int  440 916 128 2793 161 1012 1102 453 464 4206 ...
##  $ LT50nao_sum        : int  498 1159 138 4274 164 1361 1617 484 555 13254 ...

Figure 3. Basic examination of the structure of the CA Crimes and Clearances dataset.

For instance, studying aggregated totals of crimes and clearances for each law enforcement agency in California in each year from 1985 to 2017, Simarpreet Singh noted how the definition of a crime gets mediated by rules in the US Federal Bureau of Investigation (FBI)’s Uniform Crime Reporting Program (UCR), the primary source of statistics on crime rates in the US. Singh learned that one such rule, known as the hierarchy rule, states that if multiple offenses occur in the context of a single crime incident, for the purposes of crime reporting, the law enforcement agency classifies the crime only according to the most serious offense. In descending order, these classifications include (1) Criminal Homicide, (2) Criminal Sexual Assault, (3) Robbery, (4) Aggravated Battery/Aggravated Assault, (5) Burglary, (6) Theft, (7) Motor Vehicle Theft, and (8) Arson. This means that in the resulting data, for incidents where multiple offenses occurred, certain classes of crime are likely to be underrepresented in the counts.
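
The effect of the hierarchy rule on counts can be made concrete with a small sketch. The incidents below are invented, and the code is an illustrative simplification of the rule as described above, not the FBI's or any agency's actual reporting procedure. The names `offenses_toy`, `hierarchy`, and `comparison` are hypothetical.

```r
library(dplyr)

# Offense classes in descending order of seriousness, as listed in the text.
hierarchy <- c("Criminal Homicide", "Criminal Sexual Assault", "Robbery",
               "Aggravated Battery/Aggravated Assault", "Burglary", "Theft",
               "Motor Vehicle Theft", "Arson")

# Hypothetical incidents: incident 1 involves two offenses,
# incident 2 involves three, and incident 3 involves one.
offenses_toy <- data.frame(
  incident_id = c(1, 1, 2, 2, 2, 3),
  offense = c("Robbery", "Motor Vehicle Theft",
              "Criminal Homicide", "Robbery", "Burglary",
              "Burglary")
)

# Apply the rule: rank each offense by its position in the hierarchy
# (1 = most serious) and keep only the top-ranked offense per incident.
reported <- offenses_toy %>%
  mutate(rank = match(offense, hierarchy)) %>%
  group_by(incident_id) %>%
  slice_min(rank, n = 1, with_ties = FALSE) %>%
  ungroup()

# Compare how often each offense occurred with how often it is reported.
occurred_counts <- offenses_toy %>% count(offense, name = "occurred")
reported_counts <- reported %>% count(offense, name = "reported")

comparison <- occurred_counts %>%
  left_join(reported_counts, by = "offense") %>%
  mutate(reported = ifelse(is.na(reported), 0L, reported))

print(comparison)
```

In this toy example six offenses occur but only three are reported, one per incident, and the motor vehicle theft disappears from the counts entirely because it co-occurred with a more serious offense.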

Singh also acknowledged how counts for individual offense types get mediated by official definitions. A change in the FBI’s definition of “forcible rape” (including only female victims) to “rape” (focused on whether there had been consent instead of whether there had been physical force) in 2014 led to an increase in the number of rapes reported in the data from that year on. From 1927 (when the original definition was documented) up until this change, male victims of rape had been left out of official statistics, and often rapes that did not involve explicit physical force (such as drug-facilitated rapes) went uncounted. Such changes come about, not in a vacuum, but in the wake of shifting norms and political stakes to produce certain types of quantitative information (Martin and Lynch 2009). By encouraging students to explore these definitions, this lab has been particularly effective in getting students to reflect not only on what counts and measures of cultural phenomena indicate, but also on the cultural underpinnings of all counts and measures.

Data Classifications

In the following lab, students programmatically explore how values get categorized in the dataset, along with the frequency with which observations fall into each category. To do so, they select categorical variables in the dataset and produce bar plots that display the distributions of values in that variable. Studying a US Environmental Protection Agency (EPA) dataset that reported the daily air quality index (AQI) of each county in the US in 2019, Farhat Bin Aznan created a bar plot that displayed the number of counties that fell into each of the following air quality categories on January 1, 2019: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous.

aqi$category <- factor(aqi$category, levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy", "Very Unhealthy", "Hazardous"))

aqi %>%
  filter(date == "2019-01-01") %>%
  ggplot(aes(x = category, fill = category)) +
  geom_bar() +
  labs(title = "Count of Counties in the US by Reported AQI Category on January 1, 2019", subtitle = "Note that not all US counties reported their AQI on this date", x = "AQI Category", y = "Count of Counties") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 12, face = "bold")) +
  scale_fill_brewer(palette="RdYlGn", direction=-1)

Figure 4. Bar plot of counties in each AQI category on January 1, 2019, produced in R. The plot shows that most counties reported Good air quality on that day.

Studying the US Department of Education’s Scorecard dataset, which documents statistics on student completion, debt, and demographics for each college and university in the US, Maxim Chiao created a bar plot that showed the number of universities that fell into each of the following ownership categories: Public, Private nonprofit, and Private for-profit.

scorecard %>%
  # Recode the numeric CONTROL codes as readable ownership labels
  mutate(CONTROL_CAT = case_when(
    CONTROL == 1 ~ "Public",
    CONTROL == 2 ~ "Private nonprofit",
    CONTROL == 3 ~ "Private for-profit",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(x = CONTROL_CAT, fill = CONTROL_CAT)) +
  geom_bar() +
  labs(title = "Count of Colleges and Universities in the US by Ownership Model, 2018-2019", x = "Ownership Model", y = "Count of Colleges and Universities") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 12, face = "bold"))

Figure 5. Barplot of colleges and universities in the US by ownership model.

I first ask students to interpret what they see in the plot. Which categories are more represented in the data, and why might that be the case? I then ask students to reflect on why the categories are divided the way that they are and how the categorical divisions reflect a particular cultural moment, and to consider values that may not fit neatly into the identified categories. As it turns out, the AQI categories in the EPA’s dataset are specific to the US and do not easily translate to the measured AQIs in other countries, where, for a variety of reasons, different pollutants are taken into consideration when measuring air quality (Plaia and Ruggieri 2011). The ownership models categorized in the Scorecard dataset gloss over the nuance of quasi-private universities in the US such as the University of Pittsburgh and other universities in Pennsylvania’s Commonwealth System of Higher Education.

For some students, this Notebook was particularly effective in encouraging reflection on how all categories emerge in particular contexts to delimit insight in particular ways (Bowker and Star 1999). For example, air pollution does not know county borders, yet, as Victoria McJunkin pointed out in her labs, the EPA reports one AQI for each county based on a value reported from one air monitor that can only detect pollution within a delimited radius. AQI is also reported on a daily basis in the dataset, yet for certain pollutants in the US, pollution concentrations are monitored on an hourly basis, averaged over a series of hours, and then the highest average is taken as the daily AQI. The choice to classify AQI by county and day, then, is not neutral; it has considerable implications for how we come to understand who experiences air pollution and when.
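
To make the aggregation described above concrete, here is a rough sketch in base R of taking the highest rolling average across one day of hourly readings. The readings are invented, the 8-hour window is an assumption chosen for illustration (the window differs by pollutant), and the result is a concentration rather than an index value. It is not a reproduction of the EPA's procedure.

```r
# Hypothetical hourly concentrations for one monitor on one day (24 values);
# the numbers are invented for illustration.
hourly <- c(rep(10, 8), rep(30, 8), rep(20, 8))

# Assumed averaging window, in hours
window <- 8

# Mean of every run of `window` consecutive hours within the day
rolling_means <- sapply(
  seq_len(length(hourly) - window + 1),
  function(i) mean(hourly[i:(i + window - 1)])
)

# Highest rolling average of the day, the value a daily summary would carry
daily_value <- max(rolling_means)

# Plain 24-hour mean, for comparison
daily_mean <- mean(hourly)

print(c(daily_value = daily_value, daily_mean = daily_mean))
```

In this toy case the daily summary reflects only the worst eight-hour stretch, so two days with very different hourly profiles could be reported with the same daily value.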

Still, I found that, in this lab, other students struggled to confront their own assumptions about categories they consider to be neutral. For instance, many students categorizing their data by state in the US suggested that there were no cultural forces underlying these categories because states are “standard” ways of dividing the country. In doing so, they missed critical opportunities to reflect on the politics behind how state boundaries get drawn and which people and places get excluded from consideration when relying on this bureaucratic schema to classify data. Going forward, to help students place even “standard” categories in a cultural context, I intend to prompt students to produce a brief timeline outlining how the categories emerged (both institutionally and discursively) and then to identify at least one thing that remains “residual” (Star and Bowker 2007) to the categories.

Data Calculations and Narrative

The next lab prompts students to acknowledge the judgment calls they make in performing calculations with data, including how these choices shape the narrative the data ultimately conveys. Selecting a variable that represents a count or a measure of something in their data, students measure the central tendency of the variable—taking an average across the variable by calculating the mean and the median value. Noting that they are summarizing a value across a set of numbers, I remind students that such measures should only be taken across “similar” observations, which may require first filtering the data to a specific set of observations or performing the calculations across grouped observations. The Notebook instructions prompt students to apply such filters and then reflect on how they set their criteria for similarity. Where do they draw the line between relevant or irrelevant, similar or dissimilar? What narratives do these choices bring to the fore, and what do they exclude from consideration?
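
As a minimal sketch of why the grouping decision matters, the base R example below computes the mean and the median with and without grouping. The spending data frame, its two states "X" and "Y", and all amounts are invented for illustration and do not come from any dataset used in the course.

```r
# Toy data, invented for illustration: state "X" has one extreme value.
spending <- data.frame(
  state = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  amount = c(10, 12, 11, 200, 50, 52, 51, 49)
)

# Central tendency across all observations, ignoring state
overall_mean <- mean(spending$amount)
overall_median <- median(spending$amount)

# Central tendency within each state
by_state_mean <- tapply(spending$amount, spending$state, mean)
by_state_median <- tapply(spending$amount, spending$state, median)

print(c(mean = overall_mean, median = overall_median))
print(rbind(mean = by_state_mean, median = by_state_median))
```

Here the single extreme value pulls the mean for "X" far above its median, while the two measures agree for "Y". Whether that extreme observation counts as "similar" to the rest is exactly the kind of judgment call the Notebook asks students to document.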

For instance, studying a dataset documenting changes in eligibility policies for the US Supplemental Nutrition Assistance Program (SNAP) by state since 1995, Janelle Marie Salanga sought to calculate the average spending on SNAP outreach across geographies in the US and over time. Noting that we could expect there to be differences in state spending on outreach due to differences in population, state fiscal politics, and food accessibility, Salanga decided to group the observations by state before calculating the average (here, the median) spending across time. Noting that the passing of the American Recovery and Reinvestment Act of 2009 considerably expanded SNAP benefits to eligible families, Salanga decided to filter the data to only consider outreach spending in the 2009 fiscal year through the 2015 fiscal year. Through this analysis, Salanga found California to have the highest median spending on SNAP outreach in the designated fiscal years, while the median for several states was zero.

# Outreach spending is reported annually, but this dataset is reported monthly,
# so we filter to the first month of each fiscal year (October).
# month() and year() come from the lubridate package.
snap %>%
  filter(month(yearmonth) == 10 & year(yearmonth) %in% 2009:2015) %>%
  group_by(statename) %>%
  summarize(median_outreach = median(outreach * 1000, na.rm = TRUE),
            num_observations = n(),
            # Share of observations with a missing outreach value
            missing_observations = paste(sum(is.na(outreach)) / n() * 100, "%"),
            .groups = 'drop') %>%
  arrange(desc(median_outreach))
statename               median_outreach  num_observations  missing_observations
California                 1129009.3990                 7  0 %
New York                    469595.8557                 7  0 %
Texas                       422051.5137                 7  0 %
Washington                  273772.9187                 7  0 %
Minnesota                   261750.3357                 7  0 %
Arizona                     222941.9250                 7  0 %
Nevada                      217808.7463                 7  0 %
Illinois                    195910.5835                 7  0 %
Connecticut                 184327.4231                 7  0 %
Georgia                     173554.0009                 7  0 %
Pennsylvania                153474.7467                 7  0 %
South Carolina              126414.4135                 7  0 %
Ohio                        125664.8331                 7  0 %
Rhode Island                 99755.1651                 7  0 %
Tennessee                    98411.3388                 7  0 %
Massachusetts                97360.4965                 7  0 %
Wisconsin                    87527.9999                 7  0 %
Maryland                     81700.3326                 7  0 %
Vermont                      69279.2511                 7  0 %
North Carolina               62904.8309                 7  0 %
Indiana                      58047.9164                 7  0 %
Oregon                       57951.0803                 7  0 %
Michigan                     53415.1688                 7  0 %
Florida                      37726.1696                 7  0 %
Hawaii                       29516.3345                 7  0 %
New Jersey                   23496.2501                 7  0 %
Missouri                     23289.1655                 7  0 %
Louisiana                    20072.0005                 7  0 %
Colorado                     19113.8344                 7  0 %
Iowa                         18428.9169                 7  0 %
Virginia                     15404.6669                 7  0 %
Delaware                     14571.0001                 7  0 %
Alabama                      11048.8329                 7  0 %
District of Columbia          9289.5832                 7  0 %
Kansas                        8812.2501                 7  0 %
North Dakota                  8465.0002                 7  0 %
Mississippi                   4869.0000                 7  0 %
Alaska                        3199.3332                 7  0 %
Arkansas                      3075.0833                 7  0 %
Nebraska                       217.1667                 7  0 %
Idaho                            0.0000                 7  0 %
Kentucky                         0.0000                 7  0 %
Maine                            0.0000                 7  0 %
Montana                          0.0000                 7  0 %
New Hampshire                    0.0000                 7  0 %
New Mexico                       0.0000                 7  0 %
Oklahoma                         0.0000                 7  0 %
South Dakota                     0.0000                 7  0 %
Utah                             0.0000                 7  0 %
West Virginia                    0.0000                 7  0 %
Wyoming                          0.0000                 7  0 %
Table 1. Median of annual SNAP outreach spending from 2009 to 2015 per US state.

The students then consider how their measures may be reductionist—that is, how the summarized values erase the complexity of certain narratives. For instance, Salanga went on to plot a series of boxplots that displayed the dispersion of outreach spending across fiscal years for each state from 2009 to 2015. She found that, while outreach spending had been fairly consistent in several states across these years, in other states there had been a difference of several hundred thousand dollars between the fiscal year with the maximum outreach spending and the year with the minimum.

snap %>%
  filter(month(yearmonth) == 10 & year(yearmonth) %in% 2009:2015) %>%
  ggplot(aes(x = statename, y = outreach * 1000)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Distribution of Annual SNAP Outreach Spending per State from 2009 to 2015", x = "State", y = "Outreach Spending") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, face = "bold")) 

Figure 6. Boxplot showing distribution of annual SNAP outreach spending from 2009 to 2015.

This nuanced story of variations in spending over time gets obfuscated when relying on a measure of central tendency alone to summarize the values.
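
The point can be illustrated with a small sketch in base R, under the assumption of two invented states ("Steady" and "Variable") whose amounts are made up for this example. Both share the same median, yet their spread differs sharply, which a table of medians alone would hide.

```r
# Toy data, invented for illustration; not the SNAP dataset.
outreach_toy <- data.frame(
  state = rep(c("Steady", "Variable"), each = 5),
  spending = c(100, 100, 100, 100, 100, 0, 50, 100, 150, 200)
)

# Summarize one state's rows: central tendency plus simple dispersion measures
summarize_state <- function(d) {
  data.frame(
    state = d$state[1],
    median = median(d$spending),
    minimum = min(d$spending),
    maximum = max(d$spending),
    spread = max(d$spending) - min(d$spending)
  )
}

# Apply the summary to each state and stack the results into one table
dispersion <- do.call(
  rbind,
  lapply(split(outreach_toy, outreach_toy$state), summarize_state)
)
print(dispersion)
```

Reporting the minimum, maximum, and their difference next to the median is one simple way to keep some of that variation in view; a boxplot, as in the student's analysis, shows the same information graphically.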

This lab has been effective in getting students to recognize data work as a cultural practice that involves active discernment. Still, I have noticed that some students complete this lab feeling uncomfortable with the idea that the choices they make in data work may be framed, at least in part, by their own political and ethical commitments. In other words, in their reflections, some students describe their efforts to divorce their own views from their decision-making: they express concern that their choices may be biasing the analysis in ways that invalidate the results. To help them further grapple with the judgment calls that frame all data analyses (and especially the calls that they individually make when choosing how to filter, sort, group, and visualize the data), the next time I run the course I plan to ask students to explicitly characterize their own standpoint in relation to the analysis and reflect on how their unique positionality both influences and delimits the questions they ask, the filters they apply, and the plots they produce.

Data Chrono-Politics and Geo-Politics

In a subsequent lab, I encourage students to situate their datasets in a particular temporal and geographic context in order to consider how time and place impact the values recorded. Students first segment their data by a geographic variable or a date variable to assess how the calculations and plots vary across geographies and time. They then characterize, not only how and why there may be differences in the phenomena represented in the data across these landscapes and timescapes, but also how and why there may be differences in the data’s generation.

For instance, in Spring 2020, a group of students studied a dataset documenting the number of calls related to domestic violence received each month by each law enforcement agency in California.

dom_violence_calls %>%
  ggplot(aes(x = YEAR_MONTH, y = TOTAL_CALLS, group = 1)) +
  stat_summary(geom = "line", fun = "sum") +
  facet_wrap(~COUNTY) +
  labs(title = "Domestic Violence Calls to California Law Enforcement Agencies by County", x = "Month and Year", y = "Total Calls") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 5, angle = 90, hjust = 1),
        strip.text.x = element_text(size = 6))

Figure 7. Timeseries of domestic violence calls to California law enforcement agencies by county.

One student, Laura Cruz, noted how more calls may be reported in certain counties not only because domestic violence may be more prevalent or because those counties had a higher or denser population, but also due to different cultures of police intervention in different communities. Trust in law enforcement may vary across California communities, impacting which populations feel comfortable calling their law enforcement agencies to report any issues. This creates a paradox in which the counts of calls related to domestic violence can be higher in communities that have done a better job responding to them.

Describing how the values reported may change over time, Hipolito Angel Cerros further noted that cultural norms around domestic violence have changed over time for certain social groups. As a result of this cultural change, certain communities may be more likely to call law enforcement agencies regarding domestic violence in 2020 than they were a decade ago, while other communities may be less likely to call.

This was one of the course’s more successful labs, which helped students discern the ways in which data are products of the cultural contexts of their production. Dividing the data temporally and geographically helped affirm the dictum that “all data are local” (Loukissas 2019)—that data emerge from meaning-making practices that are never completely stable. Leveraging data visualization techniques to situate data in particular times and contexts demonstrated how, when aggregated across time and place, datasets can come to tell multiple stories from multiple perspectives at once. This called on students, in their role as data practitioners, to convey data results with more care and nuance.

Conclusion

Ethnographically analyzing a dataset can draw to the fore insights about how various people and communities perceive difference and belonging, how people represent complex ideas numerically, and how they prioritize certain forms of knowledge over others. Programmatically exploring a dataset’s structure, schemas, and contexts helped students see datasets not just as a series of observations, counts, and measurements about their communities, but also as cultural objects, conveying meaning in ways that foreground some issues while eclipsing others. The project also helped students see data science as a practice that is always already political, as opposed to something that can potentially become politicized when placed into the wrong hands or leveraged in the wrong ways. Notably, the project helped students cultivate these insights by integrating a computational practice with critical reflection, highlighting how they can incorporate social awareness and critique into their work. Still, the course content could be strengthened to encourage more critical examinations of categories students consider to be standard, and to better connect their choices in data analysis with their own political and ethical commitments.

Notably, there is great risk to calling attention to just how messy public data is, especially in a political moment in the US where a growing culture of denialism is undermining the credibility of evidence-based research. I encourage students to see themselves as data auditors and their work in the course as responsible data stewardship, and on several occasions, we have worked together to compose emails to data publishers describing discrepancies we have found in the datasets. In this sense, rather than disparaging data for its incompleteness, inconsistencies, or biases, the project encourages students to rethink their role as critical data practitioners, responsible for considering when and how to advocate for making datasets and data analysis more comprehensive, honest, and equitable.

Notes

[1] I typically assign Joe Flood’s The Fires (2011) as the course text. The book tells a gripping and sobering story of how a statistical model and a blind trust in numbers contributed to the burning of New York City’s poorest neighborhoods in the 1970s.

Bibliography

Bates, Jo, David Cameron, Alessandro Checco, Paul Clough, Frank Hopfgartner, Suvodeep Mazumdar, Laura Sbaffi, Peter Stordy, and Antonio de la Vega de León. 2020. “Integrating FATE/Critical Data Studies into Data Science Curricula: Where Are We Going and How Do We Get There?” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 425–435. FAT* ’20. Barcelona, Spain: Association for Computing Machinery. https://dl.acm.org/doi/abs/10.1145/3351095.3372832.

Bates, Jo, Yu-Wei Lin, and Paula Goodale. 2016. “Data Journeys: Capturing the Socio-Material Constitution of Data Objects and Flows.” Big Data & Society 3, no. 2. https://doi.org/10.1177/2053951716654502.

Bowker, Geoffrey, Karen Baker, Florence Millerand, and David Ribes. 2009. “Toward Information Infrastructure Studies: Ways of Knowing in a Networked Environment.” In International Handbook of Internet Research, edited by Jeremy Hunsinger, Lisbeth Klastrup, and Matthew Allen, 97–117. Springer Netherlands. https://doi.org/10.1007/978-1-4020-9789-8_5.

Bowker, Geoffrey C., and Susan Leigh Star. 1999. Sorting Things Out: Classification and Its Consequences. Cambridge, Massachusetts: MIT Press.

Clifford, James, and George E. Marcus. 1986. Writing Culture: The Poetics and Politics of Ethnography: A School of American Research Advanced Seminar. Berkeley: University of California Press.

Flood, Joe. 2011. The Fires: How a Computer Formula, Big Ideas, and the Best of Intentions Burned Down New York City—and Determined the Future of Cities. New York: Riverhead Books.

Geiger, R. Stuart, and David Ribes. 2011. “Trace Ethnography: Following Coordination through Documentary Practices.” In 2011 44th Hawaii International Conference on System Sciences, 1–10. https://doi.org/10.1109/HICSS.2011.455.

Gitelman, Lisa, ed. 2013. “Raw Data” Is an Oxymoron. Cambridge, Massachusetts: MIT Press.

Gray, Jonathan, Carolin Gerlitz, and Liliana Bounegru. 2018. “Data Infrastructure Literacy.” Big Data & Society, July. https://doi.org/10.1177/2053951718786316.

Hill, Jim. 1996. “Illegal Immigrants Take Heat for California Wildfires.” CNN, July 28, 1996. https://web.archive.org/web/20051202202133/https://www.cnn.com/US/9607/28/border.fires/index.html.

Loukissas, Yanni Alexander. 2019. All Data Are Local: Thinking Critically in a Data-Driven Society. Cambridge, Massachusetts: The MIT Press.

Martin, Aryn, and Michael Lynch. 2009. “Counting Things and People: The Practices and Politics of Counting.” Social Problems 56, no. 2: 243–66. https://doi.org/10.1525/sp.2009.56.2.243.

Metcalf, Jacob, Kate Crawford, and Emily F. Keller. 2015. “Pedagogical Approaches to Data Ethics.” Council for Big Data, Ethics, and Society. https://bdes.datasociety.net/council-output/pedagogical-approaches-to-data-ethics-2/.

Neff, Gina, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. 2017. “Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science.” Big Data 5, no. 2: 85–97. https://doi.org/10.1089/big.2016.0050.

Plaia, Antonella, and Mariantonietta Ruggieri. 2011. “Air Quality Indices: A Review.” Reviews in Environmental Science and Bio/Technology 10, no. 2: 165–79. https://doi.org/10.1007/s11157-010-9227-2.

Ribes, David, and Steven J. Jackson. 2013. “Data Bite Man: The Work of Sustaining a Long-Term Study.” In Gitelman 2013, 147–166.

Star, Susan Leigh, and Geoffrey C. Bowker. 2007. “Enacting Silence: Residual Categories as a Challenge for Ethics, Information Systems, and Communication.” Ethics and Information Technology 9, no. 4: 273–80. https://doi.org/10.1007/s10676-007-9141-7.

Acknowledgments

Thanks are due to the students enrolled in STS 115: Data Sense and Exploration in Spring 2019 and Spring 2020, whose work helped refine the arguments in this paper. I also want to thank Matthew Lincoln and Alex Hanna for their thoughtful reviews, which strengthened not only the arguments in the paper but also my planning for future iterations of this course.

About the Author

Lindsay Poirier is Assistant Professor of Science and Technology Studies at the University of California, Davis. As a cultural anthropologist working within the field of data studies, Poirier examines data infrastructure design work and the politics of representation emerging from data practices. She is also the Lead Platform Architect for the Platform for Experimental Collaborative Ethnography (PECE).

Data Literacy in Media Studies: Strategies for Collaborative Teaching of Critical Data Analysis and Visualization

Abstract

This essay addresses challenges of teaching critical data literacy and describes a shared instruction model that encourages undergraduates at a large research university to develop critical data literacy and visualization skills. The model we propose originated as a collaboration between the library and an undergraduate media and cultural program, and our specific intervention is the development of a templated data-visualization instruction session that can be taught by many people each semester. The model we describe has the dual purpose of supporting the major and serving as an organizational template, a structure for building resources and approaches to instruction that supports librarians as they develop replicable pedagogical strategies, including those informed by a cultural critical lens. We intend our discussion for librarians who are teaching in an academic setting, and particularly in contexts involving large-scale or programmatic approaches to teaching. The discussion is also useful to faculty in the disciplines who are considering partnering with the library to interject aspects of data or information literacy into their program.

Learning that emphasizes data literacy and encourages analysis within multimedia visualization platforms is a growing trend in higher education pedagogy. Because data as a form of evidence holds a privileged position in our cultural discourse, interdisciplinary undergraduate degree programs in the social sciences, humanities, and related disciplines increasingly incorporate data visualization, thus elevating data literacy alongside other established curricular outcomes. When well-conceived, critical data literacy instruction engenders a productive blend of theory and practice and positions students to examine how race-based bigotry, gender bias, colonial dominance, and related forms of oppression are implicated in the rhetoric of data analysis and visualization. Students can then create visualizations of their own that establish counternarratives or otherwise confront the locus of power in society to present alternative perspectives.

As scholarship in media, communications, and cultural studies pedagogy has established, data visualizations “reflect and articulate their own particular modes of rationality, epistemology, politics, culture, and experience,” so as to embody and perpetuate “ways of knowing and ways of organizing collective life in our digital age” (Gray et al. 2016, 229). Catherine D’Ignazio and Lauren F. Klein (2020, 10) explain this dialectic more pointedly in Data Feminism, arguing, “we must acknowledge that a key way power and privilege operate in the world today has to do with the word data itself,” especially the assumptions and uses of it in daily life. Critical instruction positions undergraduates to question how data, in its composition, analysis, and visualization, can often perpetuate an unjust socio-cultural status quo. Undergraduates who are introduced to frames for interpreting culture also need to be exposed to tools—literal and conceptual—that help them critique data visualizations. The goal is to enable a holistic critical literacy, through which students can find data, structure it with a research question in mind, and produce accurate, inclusive visualizations.

However, data instruction is challenging, and planning data learning within the context of an existing course requires an array of skills. Effective data visualization pedagogy demands that instructors locate example datasets, clean data to minimize roadblocks, and create sample visualizations to initiate student engagement with first-order cultural-critical concepts. These steps, a substantial time investment, are necessary for teaching that enables data novices to contend with the mechanics of data manipulation while remaining focused on social and political questions that surround data. When charged with developing data visualization assignments and instructional assistance, faculty often seek the support and expertise of librarians and educational technologists, who are located at the nexus of data learning within the university (Oliver et al. 2019, 243).

Even in cases where librarians and instructional support staff are well-positioned to assist, the demand for teaching data visualization can be overwhelming. It can become burdensome to deliver in-person instruction to cohort courses with a large student enrollment, across many sections and in successive semesters. In order to initiate and maintain an effective, multidisciplinary data literacy program, teaching faculty, librarians, and educational technologists must establish strong teaching partnerships that can be replicated and reimagined in multiple contexts.

This essay addresses some challenges of teaching critical data literacy and describes a shared instruction model that encourages undergraduates at a large research university to develop critical data literacy and visualization skills. Although anyone engaged in teaching critical data literacy can draw from this essay, we intend our discussion for librarians who are teaching in an academic setting, and particularly in contexts involving large-scale or programmatic approaches to teaching. In addition, we believe our essay is particularly pertinent to those designing program curricula within discipline-specific settings, as our ideas engage questions of determining scale, scope, and learning outcomes for effective undergraduate instruction.

The teaching model we propose originated as a collaboration between the New York University Libraries and NYU’s Media, Culture, and Communications (MCC) department, and our specific intervention is the development of an assignment involving data visualization for a Methods in Media Studies (MIMS) course. The distributed teaching model we describe has the dual purpose of supporting the major and serving as an organizational template, a structure for building resources and approaches to instruction that supports librarians as they develop replicable pedagogical strategies, including those informed by a cultural critical lens. In this regard, we believe that collaborative instruction empowers librarians and faculty from many disciplines to develop their own data literacy competency while growing as teachers. It also enables the library to affect undergraduate learning throughout the university.

There is already an extensive body of research about the role of critical data literacy instruction, including critical approaches to the technical elements of data visualization (Drucker 2014; Sosulski 2019; Engebretsen and Kennedy 2020). While we draw from that scholarly discussion, we focus instead on the upshot of programmatic, extensible teaching partnerships between libraries and discipline-specific undergraduate programs. Along the way, we engage two crucial questions: What is the value of creating replicable lesson plans and materials, to be taught by an array of library staff repeatedly? How can the librarians who design these materials strike a balance between creating a step-by-step lesson plan that library instructors follow and structuring a guided lesson that is flexible and capacious enough for instructors to experience meaningful teaching encounters of their own?

Data Literacy in Undergraduate Education

Several curricular initiatives and assessment rubrics in higher education pedagogy recognize the need for students to develop fluidity with digital media and quantitative reasoning, a precursor to effective data visualization. In 2005, the Association of American Colleges and Universities (AAC&U) began a decade-long initiative called Liberal Education and America’s Promise (LEAP), which resulted in an inventory of 21st century learning outcomes for undergraduate education. Quantitative literacy is on the list of outcomes (Association of American Colleges and Universities 2020). A corresponding AAC&U rubric statement asserts that “[v]irtually all of today’s students … will need basic quantitative literacy skills such as the ability to draw information from charts, graphs, and geometric figures, and the ability to accurately complete straightforward estimations and calculations.” The rubric urges faculty to develop assignments that give students “contextualized experience” analyzing, evaluating, representing, and communicating quantitative information (Association of American Colleges and Universities 2020). The substance of the LEAP initiative informed the development of our collaborative teaching model, for it allowed us to ground our curricular interventions within larger university curricular trends that had already emerged.

Although quantitative literacy is important, there are other structures for teaching that see data fluidity and visualization as being tied to larger information seeking practices. For this reason, we also turned to the Framework for Information Literacy for Higher Education, developed by the Association of College and Research Libraries (ACRL). The Framework embraces the concept of metaliteracy, which promotes metacognition and a critical examination of information in all its forms and iterations, including data visualization. One of the six frames posed by the document, “Information Creation as a Process,” closely aligns with data competency, including data visualization. This frame emphasizes that the information creation process can “result in a range of information formats and modes of delivery” and that the “unique capabilities and constraints of each creation process as well as the specific information need determine how the product is used.” Within the Framework, learning is measured according to a series of “dispositions,” or knowledge practices that are descriptive behaviors of those who have learned a concept. Here, the Framework is apropos, as students who see information creation as a process “value the process of matching an information need with an appropriate product” and “accept ambiguity surrounding the potential value of information creation expressed in emerging formats or modes” (ACRL 2016). The Framework recognizes that evolved undergraduate curricula must incorporate active, multimodal forms of analysis and production that synthesize information seeking, evaluation, and knowledge creation.

Other organizations and disciplines also advocate for quantitative literacy in the undergraduate curriculum. For instance, Locke (2017) discusses the relevance of data in the humanities classroom and points to ways undergraduate digital humanities projects can incorporate data analysis and visualization to extend inquiry and interpretation. And Berret and Phillips (2016, 13) recommend that every journalism degree program provide a foundational data journalism course, because interdisciplinary data instruction cultivates professionals “who understand and use data as a matter of course—and as a result, produce journalism that may have more authority [or] yield stories that may not have been told before.” In sum, LEAP, the ACRL Framework, and movements for data literacy in the disciplines influenced the Libraries’ collaboration with the Media, Culture, and Communications department, and this informed the effort to create and support a meaningful learning experience for students in this major.

Learning-by-Teaching: Structured, Programmatic Instruction and Libraries

Our collaborative model evolved with the conviction that structured, programmatic teaching can foster professional growth for librarians and library technologists. In addition to creating impactful learning for students, programmatic teaching provides a structure that allows educators to expand the contexts in which they can teach. In many cases, librarians who specialize in information literacy are less adroit with the concepts and mechanics of working with data. Teaching data as a form of information, then, necessarily requires a baseline technical expertise.

Several studies published within the past decade indicate that learning with the intent to teach can lead to better understanding, regardless of the content in question. One such study finds that learners who were expecting to teach the material to which they were being introduced show better acquisition than learners who were expecting only to take a test, theorizing that learning-by-teaching pushes the learner beyond essential processing to generative processing, which involves organizing content into a personally meaningful representation and integrating it with prior knowledge (Fiorella and Mayer 2013, 287). Another study finds that learners who were expecting to teach show better organizational output and recall of main points than those who were not expecting to teach, which suggests that learners who anticipate teaching tend to put themselves “into the mindset of a teacher,” leading them to use preparation techniques—such as concept organizing, prioritizing, and structuring—that double as enhancements to a learner’s own encoding processes (Nestojko et al. 2014, 1046). This evidence strengthens our belief that learning-by-teaching is a good strategy for librarians to build foundational data literacy skills, and it informed the development of our program.

Development and Implementation of the Collaborative Teaching Model

Situated in NYU’s Steinhardt School of Culture, Education, and Human Development, the MCC program covers global and transcultural communication, media institutions and politics, and technology and society, among other related fields. MCC program administrators, who were looking to incorporate practical skills into what had previously been a theory-heavy degree, approached the library to co-develop instructional content that would expose students to applied data literacy and multimedia visualization platforms. The impetus for the program administrators to reach out to the library was their participation in a course enhancement grant program, which testifies to the lasting effects that school or university-based curriculum initiatives can have on undergraduate learning. In this case, what emerged was a sustained teaching partnership. Though the support was refined over time, its core remained constant: individual sections of a media studies methods class would attend a librarian-led class session that prepares students to evaluate data and construct a visualization exploring some element of media and political economy, grounded in an assigned reading which asserts that ownership of or access to media and communications infrastructure is intrinsically related to the well-being and development of countries around the world.

The class is a first-year requirement in Media, Culture, and Communication, one of NYU’s largest majors. The course tends to be taught by beginning doctoral students, and is by design a highly fluid teaching environment. In early iterations of library support, we designed a module that attempted to have students perform a range of analysis and visualization tasks. Students were introduced to basic socio-demographic datasets and were invited to create a visualization that investigates a research question of their choosing, provided that the question adhered generally to the themes of media and political economy. The assignment as initially constituted expected the student to frame a question, find a dataset and clean it, choose a visualization platform, and generate one or more visualizations that imply a causal relationship between variables that they had identified.

The learning outcomes and assignment developed in this initial sequence turned out to be too ambitious. The assignment had fairly loose parameters, which proved problematic, and the 75-minute class session could not provide sufficient preparation. Students struggled with developing viable research questions, finding data sets, and cleaning the data (the multivalent process of normalizing, reshaping, redacting, or otherwise configuring data to be ingested and visualized in online platforms without errors). Also, we had pointed them to an overwhelming array of data analysis software tools, including ESRI’s ArcMap, Carto, Plot.ly, Raw, and Tableau. We found they had great difficulty with both selecting a tool and learning how to use it, in addition to the connected process of finding a dataset to visualize within it. The Libraries tried to accommodate, but ultimately realized that the module needed significant adjustment going forward, especially since the MCC department decided to expand the project to include up to 10 sections of the course each semester.

Beyond the struggles with research questions, datasets, and tools, it was also apparent that students had trouble connecting this work to the broader ideas of media and political economy intrinsic to the assignment. Informed by these first-round outcomes, we came together again to revise the instructional content and assignment. Taking our advice into account, the MCC teaching faculty and program administrators refined the learning outcomes as follows:

  • Become familiar with the principles, concepts, and language related to data visualization
  • Investigate the context and creation of a given dataset, and think critically about the process of creating data
  • Emphasize how online visualization platforms allow users to make aesthetic choices, which are part and parcel of the rhetoric of visualization

The librarians also created a student-facing online guide as a home base for the module and decided to distribute the teaching load by inviting specialists from the Libraries’ Data Services department to help teach the library sessions (MCC-UE 2019). And to provide a better lead-in to the library session, a preparatory lesson plan was developed for the MCC instructors to present in the class prior to the library visit.

After further feedback from program administrators and consideration, we inserted a scaffolding component into the library session lesson plan to better prepare students for their assignment. The component involved comparing four sample visualizations created from the very same data, and it included questions for eliciting a discussion about the origins and constructions of data. Scenario-based exercises for creating visualizations in Google Sheets and Carto were also incorporated into the lesson, giving students practice before tackling the actual assignment. The assignment was also redesigned with built-in support. Students would no longer be expected to find their own dataset and attempt to clean extracted data, tasks that had caused them frustration and anxiety. Instead, they would choose from a handful of prescribed and pre-cleaned datasets. Data Services staff worked to remediate a set of interesting datasets to anticipate the kind of visualization students would attempt. Also, rather than having to choose from a confusing array of data visualization tools, they would be directed to use Google Sheets or Carto only. Assuming the task of identifying, cleaning, and preparing datasets meant extra front-loaded work on the Libraries’ part, but it also freed students to focus on the higher order activity of investigating the relationship between visualizing information and examining social or political culture.

Instructional Support from a Wide Community of Teachers: Growing a Base

Another issue at hand was the strain the project was placing on the members of the Data Services team and the Communications Librarian, who taught all ten library sessions offered each semester. To make the program sustainable, a broader group of librarians would be needed to help teach the library sessions, so the Data and Communications librarians decided to recruit other NYU librarians to participate as instructors. Most of the recruits were data novices, but they viewed the invitation as an opportunity to learn data basics, expand their instruction repertoire, and strengthen their teaching practice. Calling on colleagues to teach outside their comfort zone is a big ask, one that requires strong support and administrative buy-in. So recruits were provided with a thorough lesson plan, a comprehensive hands-on training session, and the opportunity to shadow more experienced instructors before teaching the module solo (MCC-UE 2019).

By including a more robust roster of instructors, the structure also gave us the ability to further tie our lesson to what was planned in the MIMS curriculum. A new reading was chosen by the media studies faculty, “Erasing Blackness: the media construction of ‘race’ in Mi Familia, the first Puerto Rican situation comedy with a black family,” by Yeidy Rivero. The article grounds the students’ exploration of the relationship between media and political economy within the MIMS class, and it also provides a good entry point to explore critical data literacy concepts. According to Rivero, the show Mi Familia deliberately represents a “flattened,” racially homogeneous “imagined community” of lower-middle class black family life that erases Puerto Rico’s hybrid racial identity. This flattening, Rivero argues, is part and parcel of multidimensional efforts to “Americanize” Puerto Rico and align its culture with the interests of the U.S. Furthermore, since the Puerto Rican media is regulated by the U.S. Federal Communications Commission (FCC) and owned by U.S. corporations, Puerto Ricans themselves had little recourse to question the portrayal of constructed racial identities in the mainstream culture (Rivero 2002).

Students were instructed to complete the reading prior to the library session. During the session, the library instructor referred to the reading and introduced a dataset with particular relevance to it. The instructor engaged students in a discussion about the importance of reviewing the dataset description and variables in order to form a question that can be reasonably asked of the data. With students following along, the instructor then modeled how to use Google Sheets to manipulate the data and create a visualization that speaks to the question.

The selected dataset resulted from a study of the experiences and expressions of racial identity by young adults who lived in first- and second-generation immigrant households in the New York City area during the late 1990s (Mollenkopf, Kasinitz, and Waters 2011). The timeframes of the article and the dataset line up well. The sitcom discussed in the article first aired in 1994, but had been picked up by Telemundo’s NYC-area affiliates by the late 1990s, so it is quite possible that the sitcom was on the air in the homes of study participants. The dataset, which is aggregated at the person level, includes variables about participants’ family and home context, patterns of socialization, exposure to media, and sense of self. In order to foreground the analytic process of looking at data, ascertaining its possibilities, and gesturing at potential visualizations, we created a simplified version of the raw data, which omits some columns and imputes other variables for easier use. To accompany this dataset, we also created some simple data visualizations in Google Sheets, ArcGIS Online, and Tableau, which are intentionally “impoverished” and thus designed to elicit discussion from students about the claims made by the visualizations.

Undoubtedly, these adjustments to the module led to students performing better on the assignment. Improvements to the lead-in session provided by the MCC instructors ensured that the students were prepared with context for the library workshop and an understanding of why the library was supporting the assignment. Basing the assignment on a specific article made it possible for librarians to model a way of bridging the theoretical concepts of the class to a question that could be asked of data. There was also more time for two pair-and-share discussions and group work in Google Sheets and Carto, which addressed a fundamental and recurring frustration in the students’ understanding of the assignment: the ability to ask an original question of a dataset, and to ask a question that would address a larger theme of media and political economy.

From the standpoint of instructors in NYU Libraries, we also found that the model produced a stronger group of teachers. Several people who worked with sections of MIMS contributed ideas to the instructor manual and created ancillary slides and examples tailored to their own interest in the claims about racial and national identity that the Rivero article makes. For us, this flexibility is an important element of the collaborative teaching model: it offers enough structure for those who are new to data analysis and visualization to teach effectively, yet it also contains enough pathways for discussion to be meaningful and personal, should individual instructors want to branch out in their own teaching.

Conclusion

Despite being familiar with technology, many students arrive at college without a holistic ability to interpret, analyze, and visualize data. Educators now recognize the need to provide foundational data literacy to undergraduates, and many teaching faculty look to the library for support in instructional design and implementation. In this article, we recognize that creating integrated, meaningful data learning lessons is a complex task, yet we believe that the collaborative teaching model can be applied in various disciplinary contexts. Sustainability of this model depends on equipping a wide range of librarians with necessary data literacy skills, which can be achieved with a learning-by-teaching approach. After developing a teaching model that calls upon the expertise of teachers across the library, we gained some important insights on maintaining the communication and support to make it sustainable, building the workshop itself, and balancing the labor that all of this requires.

Good communication and organization between the MCC department and librarians were key in maintaining the scalability of this instruction program. Given the heavy rotation of new teachers on both the library and MCC side, we needed to provide module content that was streamlined and assignment requirements that were clear cut in order to onboard teachers quickly to the goals, process, and output of the module. When recruiting library instructors, we emphasized that volunteers would not only build their data literacy skill set, but would also expand their pedagogical knowledge and teaching range. Finally, to ensure that volunteer instructors have a successful experience, we also provide support mechanisms such as a step-by-step lesson plan, thorough train-the-trainer sessions, opportunities to observe and team-teach before going solo, and a point person to contact with questions and concerns.

There is much hidden labor in all of this work. Robust student support for the course was also crucial, and really took off when the MCC department created a dedicated student support team from graduate assistants in the program. On the library side, communicating regularly with the MCC department, assessing and revising the learning objects, organizing and hosting train-the-trainer sessions, and scheduling all of the library visits take many hours of time and planning. This work should not be overlooked when considering a program of this scale.

A collaboration at this level can provide rich data literacy at scale to undergraduates, while also offering the chance for instructors in the library and in disciplinary programs to develop their own skills in numeracy and data visualization as they learn by teaching. Through time, effort, and dedicated maintenance, a program like this becomes a successful partnership that has a broad and demonstrated impact on student learning, strengthens ties between the library and the departments we serve, and allows librarians and data services specialists the opportunity to learn and grow from each other.

Related to the learning objects themselves, we had the most success when we matched the scope of the assignment closely with the time and support the students would have to complete it, and preparing a small selection of data sets for the students in advance was very helpful in this regard. We also built in a full class session of preparation before the library visit, in which MCC teachers introduced the assignment, some principles of data visualization (via a slide deck prepared by the library’s Data Services department), and how this method can connect to broader concepts of media analysis. This led to more effective learning for students. These changes to the student assignment, learning outcomes, and library lesson plan were developed through regular and structured assessments of the workshop: a survey to the instructors teaching the course, classroom visits to see the students’ final projects, and in-depth conversations with instructors on which aspects of the lesson plan were successful and which fell flat. Following each assessment the MCC administrators and the librarians would get together to discuss and iterate on the learning objects. This process of gathering feedback on the workshop, reflecting on that information and then revising the assignment enabled us to improve the teaching and learning experience over the years.

Bibliography

Association of American Colleges and Universities. n.d. “Essential Learning Outcomes.” Accessed June 2, 2020. https://www.aacu.org/essential-learning-outcomes.

Association of American Colleges and Universities (AAC&U). n.d. “VALUE Rubrics.” Accessed June 2, 2020. https://www.aacu.org/value/rubrics/quantitative-literacy.

Association of College & Research Libraries. 2016. “Framework for Information Literacy for Higher Education.” Accessed June 2, 2020. http://www.ala.org/acrl/standards/ilframework.

Berret, Charles and Cheryl Phillips. 2016. Teaching Data and Computational Journalism. New York: Columbia Journalism School. https://journalism.columbia.edu/system/files/content/teaching_data_and_computational_journalism.pdf.

D’Ignazio, Catherine and Lauren F. Klein. 2020. Data Feminism. Cambridge, Massachusetts: MIT Press. ProQuest Ebook Central.

Drucker, Johanna. 2014. Graphesis: Visual Forms of Knowledge Production. Cambridge, Massachusetts: Harvard University Press.

Engebretsen, Martin and Helen Kennedy, eds. 2020. Data Visualization in Society. Amsterdam: Amsterdam University Press. Project MUSE.

Fiorella, Logan, and Richard E. Mayer. 2013. “The Relative Benefits of Learning by Teaching and Teaching Expectancy.” Contemporary Educational Psychology 38, no. 4: 281–288. https://doi.org/10.1016/j.cedpsych.2013.06.001.

Gray, Jonathan, Lillian L. Bounegru, Stefania Milan, and Paolo Ciuccarelli. 2016. “Ways of Seeing Data: Toward a Critical Literacy for Data Visualizations as Research Objects and Research Devices.” In Innovative Methods in Media and Communication Research, edited by Sebastian Kubitschko and Anne Kaun, 227–252. Cham, Switzerland: Palgrave Macmillan. ProQuest Ebook Central.

Locke, Brandon T. 2017. “Digital Humanities Pedagogy as Essential Liberal Education: A Framework for Curriculum Development.” Digital Humanities Quarterly 11, no. 3. http://www.digitalhumanities.org/dhq/vol/11/3/000303/000303.html.

Nestojko, John F., Dung C. Bui, Nate Kornell, and Elizabeth Ligon Bjork. 2014. “Expecting to Teach Enhances Learning and Organization of Knowledge in Free Recall of Text Passages.” Memory & Cognition 42, no. 7: 1038–1048. https://doi.org/10.3758/s13421-014-0416-z.

Mollenkopf, John, Phillip Kasinitz, and Mary C. Waters. 2011. Immigrant Second Generation in Metropolitan New York. Ann Arbor: Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/ICPSR30302.v1/.

“MCC-UE 14 Media & Cultural Analysis.” 2019. New York University. https://guides.nyu.edu/mims/.

Oliver, Jeffry, Christine Kollen, Benjamin Hickson, and Fernando Rios. 2019. “Data Science Support at the Academic Library.” Journal of Library Administration 59, no. 3: 241–257. https://doi.org/10.1080/01930826.2019.1583015.

Rivero, Yeidy M. 2002. “Erasing Blackness: The Media Construction of ‘Race’ in Mi Familia, the First Puerto Rican Situation Comedy with a Black Family.” Media, Culture & Society 24, no. 4: 481–497. https://doi.org/10.1177/016344370202400402.

Sosulski, Kristen. 2018. Data Visualization Made Simple: Insights into Becoming Visual. London: Routledge. ProQuest Ebook Central.

Acknowledgments

This teaching partnership, data, and associated resources would not have been possible without the work of many people in NYU Libraries and Data Services, as well as the NYU Steinhardt Methods in Media Studies program including: Bonnie Lawrence, Denis Rubin, Dane Gambrill, Yichun Liu, and Jamie Skye Bianco.

About the Authors

Andrew Battista is a Librarian for Geospatial Information Systems at New York University and teaches regularly on data visualization, geospatial software, and the politics of information.

Katherine Boss is the Librarian for Journalism and Media, Culture, and Communication at New York University, and specializes in information literacy instruction in media studies.

Marybeth McCartin is an Instructional Services Librarian at New York University, specializing in teaching information literacy fundamentals to early undergraduates.
