
Ethnographies of Datasets: Teaching Critical Data Analysis through R Notebooks

Abstract

With the growth of data science in industry, academic research, and government planning over the past decade, there is an increasing need to equip students with skills not only in responsibly analyzing data, but also in investigating the cultural contexts from which the values reported in data emerge. A risk of several existing models for teaching data ethics and critical data literacy is that students will come to see data critique as something that one does in a compliance capacity prior to performing data analysis or in an auditing capacity after data analysis rather than as an integral part of data practice. This article introduces how I integrate critical data reflection with data practice in my undergraduate course Data Sense and Exploration. In the course, I introduce a series of R Notebooks that walk students through a data analysis project while encouraging them, each step of the way, to record field notes on the history and context of their data inputs, the erasures and reductions of narrative that emerge as they clean and summarize the data, and the rhetoric of the visualizations they produce from the data. I refer to the project as an “ethnography of a dataset” not only because students examine the diverse cultural forces operating within and through the data, but also because students draw out these forces through immersive, consistent, hands-on engagement with the data.

Introduction

Last spring, one of my students made an important discovery regarding the politics encoded in data about California wildfires. Aishwarya Asthana was examining a dataset published by California’s Department of Forestry and Fire Protection (CalFIRE), documenting the acres burned for each government-recorded wildfire in California from 1878 to 2017. The dataset also included variables such as the fire’s name, when it started and when it was put out, which agency was responsible for it, and the reason it ignited. Asthana was practicing applying techniques for univariate data analysis in R—taking one variable in the dataset and tallying up the number of times each value in that variable appears. Such analyses help to summarize and reveal patterns in the data, prompting questions about why certain values appear more than others.

Tallying up the number of times each distinct wildfire cause appeared in the dataset, Asthana discovered that CalFIRE categorizes each wildfire into one of nineteen distinct cause codes, such as “1—Lightning,” “2—Equipment Use,” “3—Smoking,” and “4—Campfire.” According to the analysis, 184 wildfires were caused by campfires, 1,543 wildfires were caused by lightning, and, in the largest category, 6,367 wildfires were categorized with a “14—Unknown/Unidentified” cause code. The cause codes that appeared least often (and thus were attributed to the fewest wildfires) were “12—Firefighter Training” and the final code in the list: “19—Illegal Alien Campfire.”

library(tidyverse)  # provides ggplot2 and the %>% pipe

# Order the causes by descending frequency so the most common cause plots first
fires %>%
  ggplot(aes(x = reorder(CAUSE, CAUSE, function(x) -length(x)), fill = CAUSE)) +
  geom_bar() +
  labs(title = "Count of CalFIRE-documented Wildfires since 1878 by Cause",
       x = "Cause", y = "Count of Wildfires") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 12, face = "bold")) +
  coord_flip()

Figure 1. Plot of CalFIRE-documented wildfires by cause, produced in R. The fewest fires have been attributed to firefighter training and illegal alien campfires.

Interpreting the data unreflectively, one might say, “From 1878 to 2017, four California wildfires have been caused by illegal alien campfires—making it the least frequent cause.” Toward the beginning of the quarter in Data Sense and Exploration, many students, particularly those majoring in math and statistics, compose statements like this when asked to draw insights from data analyses. However, such surface readings obscure important cultural and political factors mediating how the data came to be reported in this way. Why are “illegal alien campfires” categorized separately from just “campfires”? Who has stakes in seeing quantitative metrics specific to campfires purportedly ignited by this subgroup of the population—a subgroup that can only be distinctly identified through systems of human classification that are themselves devised and debated according to diverse political commitments?

While detailing the history of the data’s collection and some potential inconsistencies in how fire perimeters are calculated, the data documentation provided by CalFIRE does not answer questions about the history and stakes of these categories. In other words, it details the provenance of the data but not the provenance of its semantics and classifications. In doing so, it naturalizes the values reported in the data in ways that inadvertently discourage recognition of the human discernment involved in their generation. Yet, even a cursory Web search of the key phrase “illegal alien campfires in California” reveals that attribution of wildfires to undocumented immigrants in California has been used to mobilize political agendas and vilify this population for more than two decades (see, for example, Hill 1996). Discerning the critical import of this data analysis thus demands more than statistical savvy; to assess the quality and significance of this data, an analyst must reflect on their own political and ethical commitments.

Data Sense and Exploration is a course designed to help students reckon with the values reported in a dataset so that they may better judge the integrity of those values. The course is part of a series of undergraduate data studies courses offered in the Science and Technology Studies Program at the University of California, Davis, aiming to cultivate student skill in applying critical thinking towards data-oriented environments. Data Sense and Exploration cultivates critical data literacy by walking students through a quarter-long research project contextualizing, exploring, and visualizing a publicly-accessible dataset. We refer to the project as an “ethnography of a dataset,” not only because students examine the diverse cultural forces operating within and through the data, but also because students draw out these forces through immersive, consistent, hands-on engagement with the data, along with reflections on their own positionality as they produce analyses and visualizations. Through a series of labs in which students learn how to quantitatively summarize the features in a dataset in the coding language R (often referred to as a descriptive data analysis), students also practice researching and reflecting on the history of the dataset’s semantics and classifications. In doing so, the course encourages students to recognize how the quantitative metrics that they produce reflect not only the way things are in the world, but also how people have chosen to define them. Perhaps most importantly, the course positions data as always already structured according to diverse biases and thus aims to foster student skill in discerning which biases they should trust and how to responsibly draw meaning from data in spite of them. In this paper, I present how this project is taught in Data Sense and Exploration and some critical findings students made in their projects.

Teaching Critical Data Analysis

With the growth of data science in industry, academic research, and government planning over the past decade, universities across the globe have been investing in the expansion of data-focused course offerings. Many computationally or quantitatively-focused data science courses seek to cultivate student skill in collecting, cleaning, wrangling, modeling, and visualizing data. Simultaneously, high-profile instances of data-driven discrimination, surveillance, and misinformation have pushed universities to also consider how to expand course offerings regarding responsible and ethical data use. Some emerging courses, often taught directly in computer and data science departments, introduce students to frameworks for discerning “right from wrong” in data practice, focusing on individual compliance with rules of conduct at the expense of attention to the broader institutional cultures and contexts that propagate data injustices (Metcalf, Crawford, and Keller 2015). Other emerging courses, informed by scholarship in science and technology studies (STS) and critical data studies (CDS), take a more critical approach, broadening students’ moral reasoning by encouraging them to reflect on the collective values and commitments that shape data and their relationship to law, democracy, and sociality (Metcalf, Crawford, and Keller 2015).

While such courses help students recognize how power operates in and through data infrastructure, a risk is that students will come to see the evaluation of data politics and the auditing of algorithms as a separate activity from data practice. While seeking to cultivate student capacity to foresee the consequences of data work, coursework that divorces reflection from practice ends up positioning these assessments as something one does after data analysis in order to evaluate the likelihood of harm and discrimination. Research in critical data studies has indicated that this divide between data science and data ethics pedagogy has rendered it difficult for students to recognize how to incorporate the lessons of data and society into their work (Bates et al. 2020). Thus, Data Sense and Exploration takes a different approach—walking students through a data analysis project while encouraging them, each step of the way, to record field notes on the history and context of their data inputs, the erasures and reductions of narrative that emerge as they clean and summarize the data, and the rhetoric of the visualizations they produce. As a cultural anthropologist, I’ve structured the class to draw from my own training in and engagement with “experimental ethnography” (Clifford and Marcus 1986). Guided by literary, feminist, and postcolonial theory, cultural anthropologists engage experimental ethnographic methods to examine how systems of representation shape subject formation and power. In this sense, Data Sense and Exploration positions data inputs as cultural artifacts, data work as a cultural practice, and ethnography as a method that data scientists can and should apply in their work to mitigate the harm that may arise from them. Importantly, walking students into awareness of the diverse cultural forces operating in and through data helps them more readily recognize opportunities for intervention. Rather than criticizing the values and political commitments that they bring to their work as biasing the data, the course celebrates such judgments when bent toward advancing more equitable representation.

The course is predominantly inspired by literature in data and information infrastructure studies (Bowker et al. 2009). These fields study the cultural and political contexts of data and the infrastructures that support them by interviewing data producers, observing data practitioners, and closely reading data structures. For example, through historical and ethnographic studies of infrastructures for data access, organization, and circulation, the field of data infrastructure studies examines how data is made and how it transforms as it moves between stakeholders and institutions with diverse positionalities and vested interests (Bates, Lin, and Goodale 2016). Critiquing the notion that data can ever be pure or “raw,” this literature argues that all data emerge from sites of active mediation, where diverse epistemic beliefs and political commitments mold what ultimately gets represented and how (Gitelman 2013). Diverting from an outsized focus on data bias, Data Sense and Exploration prompts students to grapple with the “interpretive bases” that frame all data—regardless of whether it has been produced through personal data collection, institutions with strong political proclivities, or automated data collection technologies. In this sense, the course advances what Gray, Gerlitz, and Bounegru (2018) refer to as “data infrastructure literacy” and demonstrates how students can apply critical data studies techniques to critique and improve their own day-to-day data science practice (Neff et al. 2017).

Studying a Dataset Ethnographically

Data Sense and Exploration introduces students to examining a dataset and data practices ethnographically through an extended research project, carried out incrementally through a series of weekly labs.[1] While originally the labs were completed collaboratively in a classroom setting, in the move to remote instruction in Spring 2020, the labs were reformulated as a series of nine R Notebooks, hosted in a public GitHub repository that students clone into their local coding environments to complete. R Notebooks are digital documents, written in the scripting language Markdown, that enable authors to embed chunks of executable R code amidst text, images, and other media. The R Notebooks that I composed for Data Sense and Exploration include text instruction for how to find, analyze, and visualize a rectangular dataset, or a dataset in which values are structured into a series of observations (or rows) each described by a series of variables (or columns). The Notebooks also model how to apply various R functions to analyze a series of example datasets, offer warnings of the various faulty assumptions and statistical pitfalls students may encounter in their own data practice, and demonstrate the critical reflection that students will be expected to engage in as they apply the functions in their own data analysis.
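
To make this format concrete, below is a minimal sketch of what such a document looks like—an illustrative fragment of my own rather than one of the course Notebooks, with hypothetical file and dataset names:

---
title: "Lab: Importing Your Dataset"
output: html_notebook
---

Import your dataset and check how many observations and variables it contains.

```{r}
library(tidyverse)
fires <- read_csv("calfire_wildfires.csv")  # hypothetical dataset file
dim(fires)
```

Record below: what does one row of this dataset represent?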

Interspersed throughout the written instruction, example code, and reflections, the Notebooks provide skeleton code for students to fill in as they go about applying what they have learned to a dataset they will examine throughout the course. At the beginning of the course, when many students have no prior programming experience, the skeleton code is quite controlled, asking students to “fill-in-the-blank” with a variable from their own dataset or with a relevant R function.

# Uncomment below and count the distinct values in your unique key. Note that you may need to select multiple variables. If so, separate them by a comma in the select() function.
#n_unique_keys <- _____ %>% select(_____) %>% n_distinct()

# Uncomment below and count the rows in your dataset by filling in your data frame name.
#n_rows <- nrow(_____)

# Uncomment below and then run the code chunk to make sure these values are equal.
# n_unique_keys == n_rows
Figure 2. Example of skeleton code from R Notebooks.

However, as students gain familiarity with the language, each week, they are expected to compose code more independently. Finally, in each Notebook, there are open textboxes, where students record their critical reflections in response to specific prompts.

Teaching this course in the Spring 2020 quarter, I found that the structure provided by the R Notebooks was, overall, particularly supportive of students who were coding in R for the first time and that, given the examples provided throughout the Notebooks, students exhibited greater depth of reflection in response to prompts. However, without the support of a classroom once we moved online, I also found that novice students struggled more to interpret what the plots they produced in R were actually showing them. Moreover, advanced students were more conservative in their depth of data exploration, closely following the prompts and relying on code templates. In future iterations of the course, I thus intend to spend more synchronous time in class practicing how to quantitatively summarize the results of their analysis. I also plan to add new sections at the end of each Notebook, prompting students to leverage the skills they learned in that Notebook in more creative and free-form data explorations.

Each time I teach the course, individual student projects are structured around a common theme. In the iteration of the course that inspired the project that opens this article, the theme was “social and environmental challenges facing California.” In the most recent iteration of the course, the theme was “social vulnerability in the wake of a pandemic.” In an early lab, I task students with identifying issues warranting public concern related to the theme, devising research questions, and searching for public data that may help answer those questions. Few students entering the course have been taught how to search for public research, let alone how to search for public data. In order to structure their search activity, I task the students with imagining and listing “ideal datasets”—intentionally delineating their topical, geographic, and temporal scope—prior to searching for any data. Examining portals like data.gov, Google’s dataset search, and city and state open data portals, students very rarely find their ideal datasets and realize that they have to restrict their research questions in order to complete the assignment. Grappling with the dearth of public data for addressing complex contemporary questions around equity and social justice provides one of the first eye-opening experiences in the course. A Notebook directive prompts students to reflect on this.

Throughout the following week, I work with groups of students to select datasets from their research that will be the focus of their analysis. This is perhaps one of the most challenging tasks of the course for me as the instructor. While a goal is to introduce students to the knowledge gaps in public data, some public datasets have so little documentation that the kinds of insights students could extrapolate from examinations of their history and content would be considerably limited. Further, not all rectangular datasets are structured in ways that will integrate well with the code templates I provide in the R Notebooks. I grapple with the tension of wanting to expose students to the messiness of real-world data, while also selecting datasets that will work for the assignment.

Once datasets have been assigned, the remainder of the labs provide opportunities for immersive engagement with the dataset. In what follows, I describe a series of concepts (i.e., routines and rituals, semantics, classifications, calculations and narrative, chrono-politics, and geo-politics) around which I have structured each lab, and provide some examples of both the data work that introduced students to these concepts and the critical reflections they were able to make as a result.

Data Routines and Rituals

In one of the earlier labs, students conduct a close reading of their dataset’s documentation—an example of what Geiger and Ribes (2011) refer to as a “trace ethnography.” They note the stakeholders involved in the data’s collection and publication, the processes through which the data was collected, the circumstances under which the data was made public, and the changes in the data’s structure. They also search for news articles and scientific articles citing the dataset to get a sense of how governing bodies have leveraged the data to inform decisions, how social movements have advocated for or against the data’s collection, and how the data has advanced other forms of research. They outline the costs and labor involved in producing and maintaining the data, the formal standards that have informed the data’s structure, and any laws that mandate the data’s collection.

From this exercise, students learn about the diverse “rituals” of data collection and publication (Ribes and Jackson 2013). For instance, studying the North American Breeding Bird Survey (BBS)—a dataset that annually records bird populations along about 4,100 roadside survey routes in the United States and Canada—Tennyson Filcek learned that the data is produced by volunteers skilled in visual and auditory bird identification. After completing training, volunteers drive to an assigned route with a pen, paper, and clipboard and count all of the bird species seen or heard over the course of three minutes along each designated stop on the route. They report the data back to the BBS Office, which aggregates the data and makes them available for public consumption. While these rituals shape how the data get produced, the unruliness of aggregating data collected on different days, by different individuals, under different weather and traffic conditions, and in different parts of the continent has prompted the BBS to implement recommendations and routines to account for disparate conditions. The BBS requires volunteers to complete counts around June, start the route a half-hour before sunrise, and avoid completing counts on foggy, rainy, or windy days. Just as these routines domesticate the data, the heterogeneity of the data’s contexts demands that the data be cared for in particular ways, in turn patterning data collection as a cultural practice. This lab is thus an important precursor to the remaining labs in that it introduces students to the diverse actors and commitments mediating the dataset’s production and affirms that the data could not exist without them.

While I have been impressed with students’ ability to outline details involving the production and structure of the data, I have found that most students rarely look beyond the data documentation for relevant information—often missing critical perspectives from outside commentators (such as researchers, activists, lobbyists, and journalists) that have detailed the consequences of the data’s incompleteness, inconsistencies, inaccuracies, or timeliness for addressing certain kinds of questions. In future iterations of the course, I intend to encourage students to characterize the viewpoints of at least three differently positioned stakeholders in this lab in order to help illustrate how datasets can become contested artifacts.

Data Semantics

In another lab, students import their assigned dataset into the R Notebook and programmatically explore its structure, using the scripting language to determine what makes one observation distinct from the next and what variables are available to describe each observation. As they develop an understanding for what each row of the dataset represents and how columns characterize each row, they refer back to the data documentation to consider how observations and variables are defined in the data (and what these definitions exclude). This focused attention to data semantics invites students to go behind-the-scenes of the observations reported in a dataset and develop a deeper understanding of how its values emerge from judgments regarding “what counts.”

# Import the Crimes and Clearances dataset directly from the California DOJ OpenJustice portal
ca_crimes_clearances <- read.csv("https://data-openjustice.doj.ca.gov/sites/default/files/dataset/2019-06/Crimes_and_Clearances_with_Arson-1985-2018.csv")

str(ca_crimes_clearances)
## 'data.frame':    24950 obs. of  69 variables:
##  $ Year               : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ County             : chr  "Alameda County" "Alameda County" "Alameda County" "Alameda County" ...
##  $ NCICCode           : chr  "Alameda Co. Sheriff's Department" "Alameda" "Albany" "Berkeley" ...
##  $ Violent_sum        : int  427 405 101 1164 146 614 671 185 199 6703 ...
##  $ Homicide_sum       : int  3 7 1 11 0 3 6 0 3 95 ...
##  $ ForRape_sum        : int  27 15 4 43 5 34 36 12 16 531 ...
##  $ Robbery_sum        : int  166 220 58 660 82 86 250 29 41 3316 ...
##  $ AggAssault_sum     : int  231 163 38 450 59 491 379 144 139 2761 ...
##  $ Property_sum       : int  3964 4486 634 12035 971 6053 6774 2364 2071 36120 ...
##  $ Burglary_sum       : int  1483 989 161 2930 205 1786 1693 614 481 11846 ...
##  $ VehicleTheft_sum   : int  353 260 55 869 102 350 471 144 74 3408 ...
##  $ LTtotal_sum        : int  2128 3237 418 8236 664 3917 4610 1606 1516 20866 ...
##  $ ViolentClr_sum     : int  122 205 58 559 19 390 419 146 135 2909 ...
##  $ HomicideClr_sum    : int  4 7 1 4 0 2 4 0 1 62 ...
##  $ ForRapeClr_sum     : int  6 8 3 32 0 16 20 6 8 319 ...
##  $ RobberyClr_sum     : int  32 67 23 198 4 27 80 21 16 880 ...
##  $ AggAssaultClr_sum  : int  80 123 31 325 15 345 315 119 110 1648 ...
##  $ PropertyClr_sum    : int  409 889 166 1954 36 1403 1344 422 657 5472 ...
##  $ BurglaryClr_sum    : int  124 88 62 397 9 424 182 126 108 1051 ...
##  $ VehicleTheftClr_sum: int  7 62 16 177 8 91 63 35 38 911 ...
##  $ LTtotalClr_sum     : int  278 739 88 1380 19 888 1099 261 511 3510 ...
##  $ TotalStructural_sum: int  22 23 2 72 0 37 17 17 7 287 ...
##  $ TotalMobile_sum    : int  6 4 0 23 1 26 18 9 3 166 ...
##  $ TotalOther_sum     : int  3 5 0 5 0 61 21 64 2 22 ...
##  $ GrandTotal_sum     : int  31 32 2 100 1 124 56 90 12 475 ...
##  $ GrandTotClr_sum    : int  11 7 1 20 0 14 7 2 2 71 ...
##  $ RAPact_sum         : int  22 9 2 31 4 21 25 9 15 451 ...
##  $ ARAPact_sum        : int  5 6 2 12 1 13 11 3 1 80 ...
##  $ FROBact_sum        : int  77 56 23 242 35 38 136 13 22 1120 ...
##  $ KROBact_sum        : int  22 23 2 71 10 7 43 3 4 264 ...
##  $ OROBact_sum        : int  3 11 2 43 11 3 7 1 1 107 ...
##  $ SROBact_sum        : int  64 130 31 304 26 38 64 12 14 1825 ...
##  $ HROBnao_sum        : int  59 136 26 351 56 32 116 3 0 1676 ...
##  $ CHROBnao_sum       : int  38 48 15 150 9 21 43 4 13 253 ...
##  $ GROBnao_sum        : int  23 2 1 0 2 7 43 6 9 83 ...
##  $ CROBnao_sum        : int  32 2 2 0 0 8 21 2 2 46 ...
##  $ RROBnao_sum        : int  11 20 6 47 14 9 19 3 2 306 ...
##  $ BROBnao_sum        : int  3 2 3 21 0 2 6 0 3 37 ...
##  $ MROBnao_sum        : int  0 10 5 91 1 7 2 11 12 915 ...
##  $ FASSact_sum        : int  25 16 3 47 6 47 43 10 26 492 ...
##  $ KASSact_sum        : int  27 30 2 103 8 38 55 13 21 253 ...
##  $ OASSact_sum        : int  111 90 10 224 9 120 208 29 43 396 ...
##  $ HASSact_sum        : int  68 27 23 76 36 286 73 92 49 1620 ...
##  $ FEBURact_Sum       : int  1177 747 85 2040 161 1080 1128 341 352 9011 ...
##  $ UBURact_sum        : int  306 242 76 890 44 706 565 273 129 2835 ...
##  $ RESDBUR_sum        : int  1129 637 100 2015 89 1147 1154 411 274 8487 ...
##  $ RNBURnao_sum       : int  206 175 33 597 32 292 295 100 44 2114 ...
##  $ RDBURnao_sum       : int  599 195 44 1418 26 485 532 163 103 5922 ...
##  $ RUBURnao_sum       : int  324 267 23 0 31 370 327 148 127 451 ...
##  $ NRESBUR_sum        : int  354 352 61 915 116 639 539 203 207 3359 ...
##  $ NNBURnao_sum       : int  216 119 32 224 44 274 238 104 43 1397 ...
##  $ NDBURnao_sum       : int  47 46 21 691 14 110 45 34 26 1715 ...
##  $ NUBURnao_sum       : int  91 187 8 0 58 255 256 65 138 247 ...
##  $ MVTact_sum         : int  233 187 42 559 85 219 326 76 56 2711 ...
##  $ TMVTact_sum        : int  56 33 4 55 9 71 88 40 9 121 ...
##  $ OMVTact_sum        : int  64 40 9 255 8 60 57 28 9 576 ...
##  $ PPLARnao_sum       : int  5 31 26 133 5 10 1 4 3 399 ...
##  $ PSLARnao_sum       : int  60 20 4 163 4 14 20 6 3 251 ...
##  $ SLLARnao_sum       : int  289 664 40 1277 1 704 1058 106 435 1123 ...
##  $ MVLARnao_sum       : int  930 538 147 3153 207 1136 753 561 241 8757 ...
##  $ MVPLARnao_sum      : int  109 673 62 508 153 446 1272 155 252 901 ...
##  $ BILARnao_sum       : int  205 516 39 611 16 360 334 276 151 349 ...
##  $ FBLARnao_sum       : int  44 183 46 1877 85 493 417 187 281 4961 ...
##  $ COMLARnao_sum      : int  11 53 17 18 24 27 59 7 2 70 ...
##  $ AOLARnao_sum       : int  475 559 37 496 169 727 696 304 148 4055 ...
##  $ LT400nao_sum       : int  753 540 84 533 217 937 1089 370 235 976 ...
##  $ LT200400nao_sum    : int  437 622 68 636 122 607 802 299 262 2430 ...
##  $ LT50200nao_sum     : int  440 916 128 2793 161 1012 1102 453 464 4206 ...
##  $ LT50nao_sum        : int  498 1159 138 4274 164 1361 1617 484 555 13254 ...
Figure 3. Basic examination of the structure of the CA Crimes and Clearances dataset.

For instance, studying aggregated totals of crimes and clearances for each law enforcement agency in California in each year from 1985 to 2017, Simarpreet Singh noted how the definition of a crime gets mediated by rules in the US Federal Bureau of Investigation (FBI)’s Uniform Crime Reporting Program (UCR)—the primary source of statistics on crime rates in the US. Singh learned that one such rule, known as the hierarchy rule, states that if multiple offenses occur in the context of a single crime incident, for the purposes of crime reporting, the law enforcement agency classifies the crime only according to the most serious offense. In descending order of seriousness, these classifications are: (1) Criminal Homicide, (2) Criminal Sexual Assault, (3) Robbery, (4) Aggravated Battery/Aggravated Assault, (5) Burglary, (6) Theft, (7) Motor Vehicle Theft, and (8) Arson. This means that in the resulting data, for incidents where multiple offenses occurred, certain classes of crime are likely to be underrepresented in the counts.
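
To see how the hierarchy rule reshapes counts, consider a small sketch in R—the incident records and rank table below are hypothetical illustrations of mine, not part of the California data:

library(dplyr)

# UCR hierarchy: 1 = most serious offense
ucr_rank <- c("Criminal Homicide" = 1, "Criminal Sexual Assault" = 2,
              "Robbery" = 3, "Aggravated Battery/Aggravated Assault" = 4,
              "Burglary" = 5, "Theft" = 6, "Motor Vehicle Theft" = 7, "Arson" = 8)

# Hypothetical incident-level records: incidents 1 and 2 each involve two offenses
offenses <- tibble(
  incident_id = c(1, 1, 2, 2, 3),
  offense = c("Robbery", "Aggravated Battery/Aggravated Assault",
              "Burglary", "Theft", "Arson")
)

offenses %>%
  mutate(rank = ucr_rank[offense]) %>%
  group_by(incident_id) %>%
  slice_min(rank, n = 1) %>%  # keep only the most serious offense per incident
  ungroup() %>%
  count(offense, name = "reported_crimes")  # the assault and the theft vanish from the counts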

Singh also acknowledged how counts for individual offense types get mediated by official definitions. A change in the FBI’s definition of “forcible rape” (including only female victims) to “rape” (focused on whether there had been consent instead of whether there had been physical force) in 2014 led to an increase in the number of rapes reported in the data from that year on. From 1927 (when the original definition was documented) up until this change, male victims of rape had been left out of official statistics, and often rapes that did not involve explicit physical force (such as drug-facilitated rapes) went uncounted. Such changes come about, not in a vacuum, but in the wake of shifting norms and political stakes to produce certain types of quantitative information (Martin and Lynch 2009). By encouraging students to explore these definitions, this lab has been particularly effective in getting students to reflect not only on what counts and measures of cultural phenomena indicate, but also on the cultural underpinnings of all counts and measures.
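
One way to make such a definitional shift visible is to plot the affected counts over time and mark the year the definition changed. The sketch below is my own illustration (not the student's code), using the Year and ForRape_sum variables from the data frame imported above:

library(tidyverse)

ca_crimes_clearances %>%
  group_by(Year) %>%
  summarize(rapes_reported = sum(ForRape_sum, na.rm = TRUE)) %>%
  ggplot(aes(x = Year, y = rapes_reported)) +
  geom_line() +
  geom_vline(xintercept = 2014, linetype = "dashed") +  # FBI redefinition of rape
  labs(title = "Rapes Reported by California Law Enforcement Agencies per Year",
       subtitle = "Dashed line marks the 2014 change in the FBI's definition of rape",
       x = "Year", y = "Reported rapes") +
  theme_minimal()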

Data Classifications

In the following lab, students programmatically explore how values get categorized in the dataset, along with the frequency with which each observation falls into each category. To do so, they select categorical variables in the dataset and produce bar plots that display the distributions of values in that variable. Studying a US Environmental Protection Agency (EPA) dataset that reported the daily air quality index (AQI) of each county in the US in 2019, Farhat Bin Aznan created a bar plot that displayed the number of counties that fell into each of the following air quality categories on January 1, 2019: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous.

# Order the AQI categories from best to worst air quality so the bars plot in severity order
aqi$category <- factor(aqi$category, levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy", "Very Unhealthy", "Hazardous"))

aqi %>%
  filter(date == "2019-01-01") %>%
  ggplot(aes(x = category, fill = category)) +
  geom_bar() +
  labs(title = "Count of Counties in the US by Reported AQI Category on January 1, 2019", subtitle = "Note that not all US counties reported their AQI on this date", x = "AQI Category", y = "Count of Counties") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 12, face = "bold")) +
  scale_fill_brewer(palette="RdYlGn", direction=-1)

Figure 4. Barplot of the number of US counties in each reported AQI category on January 1, 2019. Most counties that reported an AQI that day reported Good air quality.

Studying the US Department of Education’s Scorecard dataset, which documents statistics on student completion, debt, and demographics for each college and university in the US, Maxim Chiao created a bar plot that showed the number of universities that fell into each of the following ownership categories: public, private nonprofit, and private for-profit.

scorecard %>%
  # Recode the numeric CONTROL variable into readable ownership categories
  mutate(CONTROL_CAT = ifelse(CONTROL == 1, "Public",
                       ifelse(CONTROL == 2, "Private nonprofit",
                              ifelse(CONTROL == 3, "Private for-profit", NA)))) %>%
  ggplot(aes(x = CONTROL_CAT, fill = CONTROL_CAT)) +
  geom_bar() +
  labs(title = "Count of Colleges and Universities in the US by Ownership Model, 2018-2019",
       x = "Ownership Model", y = "Count of Colleges and Universities") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 12, face = "bold"))

Figure 5. Barplot of colleges and universities in the US by ownership model in the 2018-2019 academic year.

I first ask students to interpret what they see in the plot. Which categories are more represented in the data, and why might that be the case? I then ask students to reflect on why the categories are divided the way that they are, how the categorical divisions reflect a particular cultural moment, and to consider values that may not fit neatly into the identified categories. As it turns out, the AQI categories in the EPA’s dataset are specific to the US and do not easily translate to the measured AQIs in other countries, where, for a variety of reasons, different pollutants are taken into consideration when measuring air quality (Plaia and Ruggieri 2011). The ownership models categorized in the Scorecard dataset gloss over the nuance of quasi-private universities in the US, such as the University of Pittsburgh and other universities in Pennsylvania’s Commonwealth System of Higher Education.

For some students, this Notebook was particularly effective in encouraging reflection on how all categories emerge in particular contexts to delimit insight in particular ways (Bowker and Star 1999). For example, air pollution does not know county borders, yet, as Victoria McJunkin pointed out in her labs, the EPA reports one AQI for each county based on a value reported from one air monitor that can only detect pollution within a delimited radius. AQI is also reported on a daily basis in the dataset, yet for certain pollutants in the US, pollution concentrations are monitored on an hourly basis, averaged over a series of hours, and then the highest average is taken as the daily AQI. The choice to classify AQI by county and day then is not neutral, but instead has considerable implications for how we come to understand who experiences air pollution and when.
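
The windowing logic McJunkin examined can be expressed in a few lines of R. The sketch below assumes a hypothetical data frame hourly with a datetime column and one pollutant concentration reading per hour; the trailing eight-hour window mirrors the averaging period the EPA uses for pollutants like ozone:

library(dplyr)
library(zoo)

hourly %>%
  arrange(datetime) %>%
  # trailing 8-hour rolling mean of the hourly concentration readings
  mutate(avg_8hr = rollmean(concentration, k = 8, fill = NA, align = "right")) %>%
  group_by(date = as.Date(datetime)) %>%
  # the highest windowed average becomes the value reported for the day
  summarize(daily_max_avg = max(avg_8hr, na.rm = TRUE))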

Still, I found that, in this lab, other students struggled to confront their own assumptions about categories they consider to be neutral. For instance, many students categorizing their data by state in the US suggested that there were no cultural forces underlying these categories because states are “standard” ways of dividing the country. In doing so, they missed critical opportunities to reflect on the politics behind how state boundaries get drawn and which people and places get excluded from consideration when relying on this bureaucratic schema to classify data. Going forward, to help students place even “standard” categories in a cultural context, I intend to prompt students to produce a brief timeline outlining how the categories emerged (both institutionally and discursively) and then to identify at least one thing that remains “residual” (Star and Bowker 2007) to the categories.

Data Calculations and Narrative

The next lab prompts students to acknowledge the judgment calls they make in performing calculations with data, including how these choices shape the narrative the data ultimately conveys. Selecting a variable that represents a count or a measure of something in their data, students measure the central tendency of the variable—taking an average across the variable by calculating the mean and the median value. Because these measures summarize a value across a set of numbers, I remind students that they should only be taken across “similar” observations, which may require first filtering the data to a specific set of observations or performing the calculations across grouped observations. The Notebook instructions prompt students to apply such filters and then reflect on how they set their criteria for similarity. Where do they draw the line between relevant or irrelevant, similar or dissimilar? What narratives do these choices bring to the fore, and what do they exclude from consideration?

For instance, studying a dataset documenting changes in eligibility policies for the US Supplemental Nutrition Assistance Program (SNAP) by state since 1995, Janelle Marie Salanga sought to calculate the average spending on SNAP outreach across geographies in the US and over time. Noting that we could expect there to be differences in state spending on outreach due to differences in population, state fiscal politics, and food accessibility, Salanga decided to group the observations by state before calculating the average spending across time. Because the passing of the American Recovery and Reinvestment Act of 2009 considerably expanded SNAP benefits to eligible families, she also filtered the data to consider only outreach spending from the 2009 through the 2015 fiscal years. Through this analysis, Salanga found that California had the highest median annual outreach spending in the designated fiscal years, while several states spent nothing.

library(lubridate)  # provides the month() and year() helpers used below

snap %>%
  # Outreach spending is reported annually, but this dataset is reported monthly,
  # so we filter to the observations on the first month of each fiscal year (October)
  filter(month(yearmonth) == 10 & year(yearmonth) %in% 2009:2015) %>%
  group_by(statename) %>%
  summarize(median_outreach = median(outreach * 1000, na.rm = TRUE),
            num_observations = n(),
            missing_observations = paste(as.character(sum(is.na(outreach)) / n() * 100), "%"),
            .groups = 'drop') %>%
  arrange(desc(median_outreach))
statename median_outreach num_observations missing_observations
California 1129009.3990 7 0 %
New York 469595.8557 7 0 %
Texas 422051.5137 7 0 %
Washington 273772.9187 7 0 %
Minnesota 261750.3357 7 0 %
Arizona 222941.9250 7 0 %
Nevada 217808.7463 7 0 %
Illinois 195910.5835 7 0 %
Connecticut 184327.4231 7 0 %
Georgia 173554.0009 7 0 %
Pennsylvania 153474.7467 7 0 %
South Carolina 126414.4135 7 0 %
Ohio 125664.8331 7 0 %
Rhode Island 99755.1651 7 0 %
Tennessee 98411.3388 7 0 %
Massachusetts 97360.4965 7 0 %
Wisconsin 87527.9999 7 0 %
Maryland 81700.3326 7 0 %
Vermont 69279.2511 7 0 %
North Carolina 62904.8309 7 0 %
Indiana 58047.9164 7 0 %
Oregon 57951.0803 7 0 %
Michigan 53415.1688 7 0 %
Florida 37726.1696 7 0 %
Hawaii 29516.3345 7 0 %
New Jersey 23496.2501 7 0 %
Missouri 23289.1655 7 0 %
Louisiana 20072.0005 7 0 %
Colorado 19113.8344 7 0 %
Iowa 18428.9169 7 0 %
Virginia 15404.6669 7 0 %
Delaware 14571.0001 7 0 %
Alabama 11048.8329 7 0 %
District of Columbia 9289.5832 7 0 %
Kansas 8812.2501 7 0 %
North Dakota 8465.0002 7 0 %
Mississippi 4869.0000 7 0 %
Alaska 3199.3332 7 0 %
Arkansas 3075.0833 7 0 %
Nebraska 217.1667 7 0 %
Idaho 0.0000 7 0 %
Kentucky 0.0000 7 0 %
Maine 0.0000 7 0 %
Montana 0.0000 7 0 %
New Hampshire 0.0000 7 0 %
New Mexico 0.0000 7 0 %
Oklahoma 0.0000 7 0 %
South Dakota 0.0000 7 0 %
Utah 0.0000 7 0 %
West Virginia 0.0000 7 0 %
Wyoming 0.0000 7 0 %
Table 1. Median of annual SNAP outreach spending from 2009 to 2015 per US state.

The students then consider how their measures may be reductionist—that is, how the summarized values erase the complexity of certain narratives. For instance, Salanga went on to plot a series of boxplots that displayed the dispersion of outreach spending across fiscal years for each state from 2009 to 2015. She found that, while outreach spending had been fairly consistent in several states across these years, in other states there had been a difference of several hundred thousand dollars between the fiscal year with the maximum outreach spending and the year with the minimum.

snap %>%
  filter(month(yearmonth) == 10 & year(yearmonth) %in% 2009:2015) %>%
  ggplot(aes(x = statename, y = outreach * 1000)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Distribution of Annual SNAP Outreach Spending per State from 2009 to 2015", x = "State", y = "Outreach Spending") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, face = "bold")) 

Figure 6. Boxplots showing the distribution of annual SNAP outreach spending per state from 2009 to 2015.

This nuanced story of variations in spending over time gets obfuscated when relying on a measure of central tendency alone to summarize the values.

This lab has been effective in getting students to recognize data work as a cultural practice that involves active discernment. Still, I have noticed that some students complete this lab feeling uncomfortable with the idea that the choices they make in data work may be framed, at least in part, by their own political and ethical commitments. In other words, in their reflections, some students describe their efforts to divorce their own views from their decision-making: they express concern that their choices may be biasing the analysis in ways that invalidate the results. To help them further grapple with the judgment calls that frame all data analyses (and especially the calls that they individually make when choosing how to filter, sort, group, and visualize the data), the next time I run the course I plan to ask students to explicitly characterize their own standpoint in relation to the analysis and reflect on how their unique positionality both influences and delimits the questions they ask, the filters they apply, and the plots they produce.

Data Chrono-Politics and Geo-Politics

In a subsequent lab, I encourage students to situate their datasets in a particular temporal and geographic context in order to consider how time and place impact the values recorded. Students first segment their data by a geographic variable or a date variable to assess how the calculations and plots vary across geographies and time. They then characterize, not only how and why there may be differences in the phenomena represented in the data across these landscapes and timescapes, but also how and why there may be differences in the data’s generation.

For instance, in Spring 2020, a group of students studied a dataset documenting the number of calls related to domestic violence received each month by each law enforcement agency in California.

dom_violence_calls %>%
  ggplot(aes(x = YEAR_MONTH, y = TOTAL_CALLS, group = 1)) +
  stat_summary(geom = "line", fun = "sum") +
  facet_wrap(~COUNTY) +
  labs(title = "Domestic Violence Calls to California Law Enforcement Agencies by County", x = "Month and Year", y = "Total Calls") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 5, angle = 90, hjust = 1),
        strip.text.x = element_text(size = 6))

Figure 7. Timeseries of domestic violence calls to California law enforcement agencies, faceted by county.

One student, Laura Cruz, noted how more calls may be reported in certain counties not only because domestic violence may be more prevalent or because those counties had a higher or denser population, but also due to different cultures of police intervention in different communities. Trust in law enforcement may vary across California communities, impacting which populations feel comfortable calling their law enforcement agencies to report any issues. This creates a paradox in which the counts of calls related to domestic violence can be higher in communities that have done a better job responding to them.
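
One way to probe the population side of this confound is to normalize the counts—for instance, joining county population figures (the county_pop data frame below is hypothetical, not part of this dataset) to compute calls per 100,000 residents. Normalizing this way removes population size as an explanation, though it cannot account for differences in trust:

library(dplyr)

dom_violence_calls %>%
  group_by(COUNTY) %>%
  summarize(total_calls = sum(TOTAL_CALLS, na.rm = TRUE)) %>%
  # county_pop: hypothetical table with COUNTY and population columns
  left_join(county_pop, by = "COUNTY") %>%
  mutate(calls_per_100k = total_calls / population * 100000) %>%
  arrange(desc(calls_per_100k))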

Describing how the values reported may shift across the years, Hipolito Angel Cerros further noted that cultural norms around domestic violence have changed over time for certain social groups. As a result of this cultural change, certain communities may be more likely to call law enforcement agencies regarding domestic violence in 2020 than they were a decade ago, while other communities may be less likely to call.

This was one of the course’s more successful labs, which helped students discern the ways in which data are products of the cultural contexts of their production. Dividing the data temporally and geographically helped affirm the dictum that “all data are local” (Loukissas 2019)—that data emerge from meaning-making practices that are never completely stable. Leveraging data visualization techniques to situate data in particular times and contexts demonstrated how, when aggregated across time and place, datasets can come to tell multiple stories from multiple perspectives at once. This called on students, in their role as data practitioners, to convey data results with more care and nuance.

Conclusion

Ethnographically analyzing a dataset can draw to the fore insights about how various people and communities perceive difference and belonging, how people represent complex ideas numerically, and how they prioritize certain forms of knowledge over others. Programmatically exploring a dataset’s structure, schemas, and contexts helped students see datasets not just as a series of observations, counts, and measurements about their communities, but also as cultural objects, conveying meaning in ways that foreground some issues while eclipsing others. The project also helped students see data science as a practice that is always already political, as opposed to something that can potentially become politicized when placed into the wrong hands or leveraged in the wrong ways. Notably, the project helped students cultivate these insights by integrating a computational practice with critical reflection, highlighting how they can incorporate social awareness and critique into their work. Still, the course content could be strengthened to encourage more critical examinations of categories students consider to be standard, and to better connect their choices in data analysis with their own political and ethical commitments.

Notably, there is great risk in calling attention to just how messy public data is, especially in a political moment in the US where a growing culture of denialism is undermining the credibility of evidence-based research. I encourage students to see themselves as data auditors and their work in the course as responsible data stewardship, and on several occasions, we have worked together to compose emails to data publishers describing discrepancies we have found in the datasets. In this sense, rather than disparaging data for its incompleteness, inconsistencies, or biases, the project encourages students to rethink their role as critical data practitioners, responsible for considering when and how to advocate for making datasets and data analysis more comprehensive, honest, and equitable.

Notes

[1] I typically assign Joe Flood’s The Fires (2011) as the course text. The book tells a gripping and sobering story of how a statistical model and a blind trust in numbers contributed to the burning of New York City’s poorest neighborhoods in the 1970s.

Bibliography

Bates, Jo, David Cameron, Alessandro Checco, Paul Clough, Frank Hopfgartner, Suvodeep Mazumdar, Laura Sbaffi, Peter Stordy, and Antonio de la Vega de León. 2020. “Integrating FATE/Critical Data Studies into Data Science Curricula: Where Are We Going and How Do We Get There?” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 425–435. FAT* ’20. Barcelona, Spain: Association for Computing Machinery. https://dl.acm.org/doi/abs/10.1145/3351095.3372832.

Bates, Jo, Yu-Wei Lin, and Paula Goodale. 2016. “Data Journeys: Capturing the Socio-Material Constitution of Data Objects and Flows.” Big Data & Society 3, no. 2. https://doi.org/10.1177/2053951716654502.

Bowker, Geoffrey, Karen Baker, Florence Millerand, and David Ribes. 2009. “Toward Information Infrastructure Studies: Ways of Knowing in a Networked Environment.” In International Handbook of Internet Research, edited by Jeremy Hunsinger, Lisbeth Klastrup, and Matthew Allen, 97–117. Springer Netherlands. https://doi.org/10.1007/978-1-4020-9789-8_5.

Bowker, Geoffrey C., and Susan Leigh Star. 1999. Sorting Things Out: Classification and Its Consequences. Cambridge, Massachusetts: MIT Press.

Clifford, James, and George E. Marcus. 1986. Writing Culture: The Poetics and Politics of Ethnography: A School of American Research Advanced Seminar. Berkeley: University of California Press.

Flood, Joe. 2011. The Fires: How a Computer Formula, Big Ideas, and the Best of Intentions Burned Down New York City—and Determined the Future of Cities. New York: Riverhead Books.

Geiger, R. Stuart, and David Ribes. 2011. “Trace Ethnography: Following Coordination through Documentary Practices.” In 2011 44th Hawaii International Conference on System Sciences, 1–10. https://doi.org/10.1109/HICSS.2011.455.

Gitelman, Lisa, ed. 2013. “Raw Data” Is an Oxymoron. Cambridge, Massachusetts: MIT Press.

Gray, Jonathan, Carolin Gerlitz, and Liliana Bounegru. 2018. “Data Infrastructure Literacy.” Big Data & Society 5, no. 2. https://doi.org/10.1177/2053951718786316.

Hill, Jim. 1996. “Illegal Immigrants Take Heat for California Wildfires.” CNN, July 28, 1996. https://web.archive.org/web/20051202202133/https://www.cnn.com/US/9607/28/border.fires/index.html.

Loukissas, Yanni Alexander. 2019. All Data Are Local: Thinking Critically in a Data-Driven Society. Cambridge, Massachusetts: The MIT Press.

Martin, Aryn, and Michael Lynch. 2009. “Counting Things and People: The Practices and Politics of Counting.” Social Problems 56, no. 2: 243–66. https://doi.org/10.1525/sp.2009.56.2.243.

Metcalf, Jacob, Kate Crawford, and Emily F. Keller. 2015. “Pedagogical Approaches to Data Ethics.” Council for Big Data, Ethics, and Society. https://bdes.datasociety.net/council-output/pedagogical-approaches-to-data-ethics-2/.

Neff, Gina, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. 2017. “Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science.” Big Data 5, no. 2: 85–97. https://doi.org/10.1089/big.2016.0050.

Plaia, Antonella, and Mariantonietta Ruggieri. 2011. “Air Quality Indices: A Review.” Reviews in Environmental Science and Bio/Technology 10, no. 2: 165–79. https://doi.org/10.1007/s11157-010-9227-2.

Ribes, David, and Steven J. Jackson. 2013. “Data Bite Man: The Work of Sustaining a Long-Term Study.” In Gitelman 2013, 147–166.

Star, Susan Leigh, and Geoffrey C. Bowker. 2007. “Enacting Silence: Residual Categories as a Challenge for Ethics, Information Systems, and Communication.” Ethics and Information Technology 9, no. 4: 273–80. https://doi.org/10.1007/s10676-007-9141-7.

Acknowledgments

Thanks are due to the students enrolled in STS 115: Data Sense and Exploration in Spring 2019 and Spring 2020, whose work helped refine the arguments in this paper. I also want to thank Matthew Lincoln and Alex Hanna for their thoughtful reviews, which not only strengthened the arguments in the paper but also informed my planning for future iterations of this course.

About the Author

Lindsay Poirier is Assistant Professor of Science and Technology Studies at the University of California, Davis. As a cultural anthropologist working within the field of data studies, Poirier examines data infrastructure design work and the politics of representation emerging from data practices. She is also the Lead Platform Architect for the Platform for Experimental Collaborative Ethnography (PECE).


Thinking Through Data in the Humanities and in Engineering

Abstract

This article considers how the same data can be differently meaningful to students in the humanities and in data science. The focus is on a set of network data about Renaissance humanists that was extracted from historical source materials, structured, and cleaned by undergraduate students in the humanities. These students learned about a historical context as they created first travel data and then network data, with each student working on a single historical figure. The network data was then shared with a graduate engineering class in which students were learning R. They too were assigned to acquaint themselves with the historical figures. Both groups then created visualizations of the data using a variety of tools: Palladio, Cytoscape, and R. They were encouraged to develop their own questions based on the networks. The humanists’ questions demanded that the data be reembedded in a context of historical interpretation—they wanted to reembrace contingency and uncertainty—while the engineers tried to create the clarity that would allow for a more forceful, visually comprehensible presentation of the data. This paper compares how humanities and engineering pedagogy treats data and what pedagogical outcomes can be sought and developed around data across these very different disciplines.

In the humanities, we train students to interpret their material within a larger context. Facts exist to be contextualized, biases uncovered, problems revealed. Students in many corners of the humanities are rarely confronted with something termed data, which they imagine as dry and quantitative and unyielding. Art history in particular is still a discipline of printed books and, especially, of material objects. Of course data do exist in our field, adhering to objects as physical information or tagged contents, or to the objects’ makers, as in the University of Amsterdam’s monumental ECARTICO project (Manovich 2015; Bok et al. n.d.). But introducing students to data is normally much less central to our work than persuading them to engage in close examination of the visual, and to use libraries to gather information.

Modern engineering is distinguished by the production of massive amounts of data, most of which can be accessed from anywhere in the world. Engineering students often take computer science and statistics classes, in addition to a curriculum in their chosen field, as a way of acquiring the expertise to deal with modern data. In the engineering realm, quantitative data are central, and the context from which data arise is usually not discussed. As a result, engineering educators have devised pedagogy to motivate students to contextualize findings. One of the primary ways that engineering pedagogy has changed in the past twenty years to meet this challenge is the introduction of experiential and project-based learning (Crawley et al. 2007; Savage, Chen, and Vanasupa 2008). Both of these approaches are designed to couple the development of technical skills with increasing contextual awareness and cultural literacy. In this paper, we unpack key assumptions at the heart of the current state of pedagogy in both engineering and digital humanities by posing two questions:

  1. Does digital training in the humanities alone motivate students to consider an outward focus for their contextual learning?
  2. Does project-based learning in engineering motivate students sufficiently to dig below the exploration of data and production of visualizations, and into context?

We implicitly challenge the notion that teaching digital humanities and the construction and meaning of “data” is enough to create a digital scholar. In engineering, we challenge the notion that a shift to project-based instruction is sufficient to motivate student learning beyond digital skills and computational methods.

To conduct this study, we consider how one data set functioned pedagogically in a humanities course taught within an art history department, and how the same data and core assignment was used in parallel in a data science course taught in engineering. In both cases, the process of working with data was meant to unsettle the ways in which students had normally been asked to work in their discipline. “Data” was framed as both a subject of analysis and a pedagogical tool to make students question their habits of thought, further empowering them to ask questions they had never thought to ask before. In both cases, students had to move back and forth between interpretability and quantification, recognizing the limitations and opportunities of approaching their data as (historical) material, and organizing their historical material as data.

The Humanities Class

The course “Humanists on the Move” introduced liberal arts undergraduates to data gathering and structuring as well as visualization and analysis. The goal of the class was to make students engage with the most fundamental humanities source material—primary written historical documents—as well as with data: the former should make the analysis of the latter meaningful. In fact, by the end of the semester, the class would not merely have learned about the early sixteenth century, about individual humanist figures, and about data and their analysis, but as a group the students would have produced new knowledge about this historical period, things that could not have been found in any published source.

Each student took on a single humanist figure for the semester. The characters ranged from Martin Luther to Isabella d’Este, Erasmus to Copernicus, Henry VIII to Cellini and Leonardo da Vinci. Students worked in groups according to the type of figure they were studying: Rulers, Artists, Scientists, and Thinkers. Every week the class read and discussed a primary source text, “met” its author, and investigated the historical context within which that figure had lived and ruled, painted, or written. Students learned enough about their own figure’s life to provide both a short written introduction and a longer oral presentation about them to the class. Having attained familiarity with their figures, other students’ figures, and a sense of the period based on contemporary writings, students then moved on to consider how the humanists’ historical roles were impacted by mobility and network-building—and, further, how other variables (gender, profession, national origin) factored into these complexities. This process required original research, and would necessitate collecting, structuring, cleaning, visualizing and analyzing data.

Using biographical sources, particularly actual printed books (which many in the class had never thought to consult before), students first gathered information on the travels of their figure: locations visited, and dates of travel. They geocoded each location so that it could be mapped, and they structured their material as data, each creating a three-sheet Excel spreadsheet. The members of each group then combined their data into a single spreadsheet, so that all Rulers, or all Artists, would eventually be visualized and analyzed together.
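
To give a sense of the shape of this material, here is a minimal sketch in R of what one student’s travel records might have looked like; the column names, the approximate coordinates, and the choice of Erasmus’s itinerary are illustrative rather than drawn from the actual class spreadsheets.

fires_style_example <- NULL  # placeholder name unused; see the data frame below

travels <- data.frame(
  figure    = "Erasmus",                          # one figure per student
  location  = c("Rotterdam", "Paris", "London"),  # places visited
  latitude  = c(51.92, 48.86, 51.51),             # approximate geocodes
  longitude = c(4.48, 2.35, -0.13),
  year      = c(1466, 1495, 1499)                 # dates of travel
)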

The class was initially held at UMD’s Collaboratory, where Collaboratory staff introduced students to OpenRefine, an open source platform created in Google Labs (originally as GoogleRefine) to clean and parse data using a simple set of tools (Muñoz 2013a; 2013b; 2014). This introduction covered installation and basic use. Each time it is opened, OpenRefine creates a server instance on the host computer, which is interfaced via a web browser. Users can open a local dataset (the default choice), as well as live data accessed via a URL (e.g., that of the City Permit Office of Toronto, Canada, which is the basis for the tutorials on using OpenRefine found in the Documentation section at openrefine.org).

Using a dataset contained within an Excel spreadsheet (“Sample Messy Humanist Data,” provided by Professor Elizabeth Honig), Christian Cloke and Quint Gregory demonstrated the use of basic tools within OpenRefine, such as Common Transforms, Faceting, and Clustering, which allow the user quickly to reconcile data values that are similar though not the same (such as capitalized/non-capitalized entries, misspellings, or entries with stray spaces before or after a string). Through such operations, which require one to think carefully about how the data are structured, the user develops a deeper awareness of the dataset and confidence in its soundness and consistency. In addition, students were shown how different columns of data could be joined or split, depending on the desired outcome, to make new data expressions. The resulting “cleaned” dataset could be exported to a data table in any number of preferred formats (CSV/TSV, Excel, JSON, etc.).

To visualize their travel data, students were trained to use the Stanford-based platform Palladio (Humanities + Design n.d.). Palladio is an open source tool originally conceived to visualize data from the “Mapping the Republic of Letters” project, which had collected material on scholarly networks in early modern Western Europe. Its main capabilities are therefore the visualization of networks and the creation of maps. Designed to be usable by humanists, Palladio does still necessitate correctly structured data, and students explored how that structuring impacted the generation of maps in Palladio’s system. Within its map function, Palladio also allows the visualization of chronological data linked to travels as both a timeline and timespans, so that the user can see the locations mapped (with locations sized according to criteria such as number of times visited) and the years in which travels occurred (Figure 1). Palladio also allows for “faceting,” i.e., dividing and recategorizing elements of data so that they can be examined in another dimension. For example, faceting enabled students to study over what distances female humanists were able to travel, or what cities attracted the most scientists vs. the most theologians, or which figures might have been together in Rome during a given year.

The travels of artists, shown as a map overlayed with a timeline along which locations visited in each year are visualized.
Figure 1. Visualization of the travels of artists, with faceted timeline overlaying the underlying map of locations visited.

Based on the maps and faceting, and on their research on individual figures whose travels were now visualized together, the class was able to explore what life events, ambitions, and exigencies led to travel in the Renaissance, and how travel mattered differently to figures with different professions.

The Data Set

The data set shared between humanists and engineers was created in the next phase of “Humanists on the Move,” which concerned humanist networks. Historical networks have been thoroughly studied and, more recently, elegantly visualized. The vast and remarkable website The Six Degrees of Francis Bacon, hosted by the Carnegie Mellon University Libraries, is a model of what a collaborative project using humanities data can accomplish (Lincoln 2016; Moretti 2011). Nevertheless, network material as we imagined it would be considerably less clear-cut as data than travel had been. A person is or isn’t in a given location at a given time, but a connection—in network terms, an edge—is harder to define. There are obvious connections such as family, colleagues, allies, collaborators. But when a figure read a book by another humanist, did that make them connected? And if so, how deeply connected had they become? How would the importance of that connection compare to, say, attending a performance in which another figure had acted, being present at a diplomatic meeting but not as a main player, writing a letter but (as far as we know) never receiving a reply to it? Historical resources are often fragmentary, and the class tangled with how to account for that as they assembled data. These were issues that most undergraduates had never confronted as they studied history, but now, history’s lacunae were of immediate relevance to their work.

In structuring their data, students were asked first to come up with a limited set of labels that would describe relationships. These might include patronage, respect, influence, friendship, antagonism. Often they encountered an example that none of their labels seemed to fit, but which was not sufficiently different, or representative, to warrant a new label. They learned how to compromise. Next, the students had to agree on criteria by which those edges could be weighted on a scale of one to three.

Another way of thinking about this exercise entails recognizing that it involved phases of translation, from humanist ways of thinking about material into quantifiable terms and then back again (Handelman 2015; Bradley 2018). Describing relationships, even determining what makes a relationship and why it matters, is a perfect example of humanistic work. Art historians love to talk about influence, patronage, and collaboration; this is all fundamental to how we write our histories. We could all probably say who was an important patron or a minor influence. But the students were asked to take information they had gathered and make it numerically regular, working against the humanist instinct to value irregularity and to see each instance of a given relationship, whether patronage or correspondence, as essentially a unique event with its own characteristics that are not simple to equate with those of a comparable event (Rawson and Muñoz 2016). Now every relationship had to be described using a fixed term from a limited list; every edge had to have a weight, from one to three. Long discussions ensued, even though the COVID pandemic had pushed our meetings onto Zoom.

The class gathered nearly 700 connections representing the ways in which over 450 different persons were connected to our core of twenty humanist figures (Figure 2). All of the groups combined their data into one large class spreadsheet. Every person (node) was described by a profession, every relationship (edge) had a label, sometimes several, and a numerical weight. This was the data set that we passed along to the engineers.

Section of a spreadsheet showing how network connections were recorded. Each line represents an edge, or relationship between two individuals, and includes information on gender, profession, and nature and closeness of the connection.
Figure 2. Part of the network spreadsheet, in progress. Each line represents an edge, with our key figures in column C and their connected nodes in column G. Information about each figure includes profession terms and gender; relationships are characterized in terms of type of connection and edge weight.
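
A minimal sketch in R suggests the structure of such an edge list; the figures, relationship labels, and weights below are invented for illustration, not taken from the class data.

edges <- data.frame(
  source       = c("Erasmus", "Erasmus", "Isabella d'Este"),           # key figures
  target       = c("Thomas More", "Henry VIII", "Leonardo da Vinci"),  # connected nodes
  relationship = c("friendship", "patronage", "patronage"),            # label from the agreed list
  weight       = c(3, 2, 2)                                            # closeness, on a scale of one to three
)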

Engineers, Data, and a Humanities Data Set

The course “Data in the Built Environment” is designed to teach data science skills to graduate engineering students. One of its main aims is to motivate students to dig deeper into context via project-based learning concepts (Hicks and Irizarry 2018). To do this, students are given a new dataset each week with which to practice a newly introduced data science technique. Students practice the technique in class in groups and then use new data (also in groups) for homework as a way of deepening and solidifying their understanding (Horton et al. 2018; Neff et al. 2017). In short, each week students are challenged to synthesize the technical knowledge and then apply this learning through a practical data application with questions relevant to the data rather than to the technique. This approach is designed to create a tension between data as viewed by engineers and problems that require deeper analysis to understand the contextual story. Throughout the semester, the class pedagogy (and grading) emphasized the importance of characterizing data analysis results within the context from which the data emerge. The network class was taught toward the end of the semester, so students had practice with linking data subtleties to context—but only in data reflective of the built environment (e.g., transportation, water, and housing data).

The underlying assumption of most engineering students is that data are data, mostly the same in all applications. Rarely do engineering students grapple with data that are unfamiliar to them. The Humanists on the Move data offered a completely novel opportunity to practice network visualization, motivating students to understand the underlying data in a way that they would not normally worry about.

The engineering class assignment mimicked the instructions for the humanist class, but compressed the time allocated for background research. Each student was assigned three humanists, who were themselves selected because they gave students the opportunity to uncover interesting contextual information. The engineering students prepared a one-page summary of basic background information for each figure, including important acquaintances and any documented travel, using three or more sources of information. Because the time allocated for background research was compressed, Wikipedia was an allowable source of information. It was notable that even this limited information-gathering exercise threw the engineering students into new terrain. Many had questions about how to decide what was important, how to find sources of information, even why they were working on these data in particular. Preparing to work with the data both energized and confused them.

The engineering students were organized into groups of three. Because each student had background sheets on three humanists, groups were assigned so that each group had multiple sources of information on one or more humanists. This deliberate tactic was intended to motivate them to think more about the information that their networks were conveying. The exercise was structured so that groups started by developing standard networks and then moved to allow each group to design more elaborate or situational networks.

Visualizing Network Data

Each class now visualized the network data. For the engineering students, this was the entire point of the class: to visualize data, with the implicit assumption that they would draw on the contextual information they had gathered beforehand. For students from art history and other humanities disciplines, this was new terrain. A map is a reasonably familiar object, even from the Renaissance, and students understood all of its basic parameters (Harley 2001). Superimposing information about travels onto it was not in itself a vast step. A network, however, was not something they were used to thinking about in visual form, nor were they adept at analyzing one. A visible network gathers data and presents it in a way that suggests new questions and demands interpretation in and of itself—humanistic interpretation that returns the uncertain and the variable while also incorporating the regular and the quantified.

In engineering, visualization is essential for exploring, cleaning, understanding, and explaining data. In the class, students master programming for data visualization that makes data exploration easier and more productive, and that allows an engineer both to better understand the data and to present it in a way that has impact, particularly on audiences such as policy makers and the public. Students are taught appropriate (and inappropriate) uses of different kinds of charts and graphs, graphical composition, and the design aspects of effectively conveying information, such as selecting colors, minimizing chartjunk, and emphasizing key features of the data. The focus in engineering is on the mechanics of visualization. As noted earlier, though, the transition to project-based learning in our field has ideally involved preparing students to explore context more deeply, even contexts with which they are truly unfamiliar.

The engineering class used a variety of network packages within R, which is a language that provides an environment for statistics and visualization (R Core Team n.d.). The language is open-source, rooted in statistical computing and provides a reproducible platform for engineering calculations. One of R’s major strengths is that it can be easily extended through packages to include modern computing methods and approaches. The network packages within R that were used in the class included igraph, ggraph, tidygraph, and visNetwork.

The igraph package provides functions that implement a wide range of graphing algorithms and can handle very large graphs (Csárdi and Nepusz 2006). The ggraph package extends ggplot (a core package for visualization) to handle networks using the grammar of graphics approach (Wickham 2010). Next, tidygraph provides tools to manipulate and analyze networks and is a wrapper for most of the igraph capabilities (Pedersen 2020). Finally, visNetwork allows for interactive visualization. Students were given the opportunity to work with any of these tools on this exercise.
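
As a minimal sketch of how these packages fit together, the following R code builds a small, invented edge list (the names and weights are illustrative, not the class data) into a tidygraph object and plots it with ggraph, scaling edge width by weight:

library(tidygraph)
library(ggraph)

# An invented edge list in the same form as the class data
edges <- data.frame(
  from   = c("Erasmus", "Erasmus", "Isabella d'Este"),
  to     = c("Thomas More", "Henry VIII", "Leonardo da Vinci"),
  weight = c(3, 2, 2)
)

# Convert the edge list to a graph object; nodes are inferred from the edges
humanists <- as_tbl_graph(edges, directed = FALSE)

ggraph(humanists, layout = "fr") +
  geom_edge_link(aes(width = weight), alpha = 0.5) +   # heavier edges for closer ties
  geom_node_point(size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph()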

The humanities students had started their visualization process using Palladio again. As in its mapping function, Palladio allows for faceting networks, so at this stage students could see all the connections based on friendship, for example, or isolate how and where clerics fit into the network (Figure 3).

Network of connections between rulers and other figures, visualized by humanities students using Palladio. The network is drab but readable. Nodes sized by number of connections.
Figure 3. Rulers’ network, as visualized using Palladio.

Palladio, however, is a tool for visualization and not for computational analysis. It can’t actually work with edge weights, which as humanists we had found to be such an important and complex issue. So at this point the Collaboratory stepped in again with an introduction to Cytoscape. Cytoscape would allow students to visualize the data, while at the same time furnishing a richer understanding of the underlying mathematical analysis of their networks. Cytoscape was developed for analyzing networks of data in systems biology research, as practitioners in this field were not proficient in the use of R (Shannon 2003). As a platform, however, it is discipline-agnostic: data sets of all types and from varied fields, including the humanities, can be analyzed and visualized, and as a result Cytoscape has become a platform researchers in the humanities are comfortable using.

Students were introduced to Cytoscape on the last day of class, and because it was introduced so late in the semester it was advertised as a way for interested students to build another skill and continue querying the dataset they had thus far created and visualized. Students were fascinated by the insights gained from network analyses possible in Cytoscape, but unavailable in Palladio. In addition, they responded favorably to the powerful suite of options within the visualization environment of Cytoscape. For instance, the appearance of nodes and edges can be customized prior to analysis to isolate certain types of values, or the researcher can use the results of statistical analysis to draw out nodes and connections of greater importance within the network. Also of considerable value is the ability of Cytoscape to parse larger datasets, or focus in on specific nodes to make sense of networks within networks, which can be selected and excised into separate visualizations (Figure 4).

Network of Erasmus’s network, visualized using Cytoscape. Both nodes and edges are colored, and the nodes are sized, so that more information about centrality, edge weight, and clustering coefficient can be seen.
Figure 4. Visualization in Cytoscape version 3.7.2, showing a sub-network centering on Erasmus. The nodes are scaled in correspondence with their betweenness centrality (i.e., how much a node bridges other nodes, indicating a key player in a network) and color-coded according to their clustering coefficient (the degree to which nodes cluster together, moving from light to dark as values increase), and the edges are scaled and color-coded (from light to dark) according to their weight.

Interpreting the Visualized Data

For the humanities students, it was the process and outcome of visualization that made the data intriguing to interpret. But crucially, the data had been created by them, over a period of months, before they could move ahead with visualizing and interpreting it. It was only then that they could see, for instance, that certain thinkers held key positions between powerful figures while others, extremely famous in our day, were on the margins of the main humanist network. Persons who wrote a great deal, be it sermons or conduct books or even letters, might have an enormous “degree centrality” (or number of connections), even while the edge weight of many of their connections was relatively low. Some secondary figures whom we would have thought to be quite outside our network assumed rather central positions in it. What, we asked, should we make of these unexpected findings?

Because students had developed the data themselves, and had in the process become very familiar with individual figures within the network, they were better able to interpret the positions of each major person. And because of their previous experience with mapping, they had extra knowledge that informed their interpretation of the network. For instance, a figure who travelled very little—say, Raphael—was hampered in his network-building despite his enormous historical influence. This led the class not only to question their art-historical preconceptions—for example, that as a superstar, Raphael would be at the center of a network—but also to pose further humanistic questions that the data could not answer. Network-building was crucial for some figures (Aretino springs to mind) but of limited importance for others. What were the alternatives? Creating, visualizing, and then interpreting data was a means of creating new knowledge and a stimulus to further thinking, which was grounded in humanistic knowledge and posed questions that would be answered through humanistic means. The shuttle back and forth between quantifiable data and humanistic inquiry through data and its visualization was a hugely fruitful exercise (Drucker 2011).

While producing reasonably well-designed networks, the engineering students studiously avoided connecting networks to a more textual analysis. For example, Figure 5 on the left shows the most common output (from ~90% of the groups) when students were asked to portray the network (an open-ended question). When asked to focus on one or more attributes, every group produced a gender network (Figure 5 on the right). This happened despite the relative abundance of other types of attributes and of group and individual knowledge specific to each of the humanists.

Two visualizations of humanist networks made by engineering students using R. One shows all links between figures, and the other separates out networks of women from those of men.
Figure 5. Humanist networks as visualized in R by engineering students. The full network, and a network distinguished by gender.

Conclusion

Humanists were challenged by the idea of extracting data from context, taking facts (“Do we believe in facts in this class?” one student had asked) and turning them into quantifiable data.  The more they discretized and structured the data, the more resistant they became to compromise, to what they perceived as flattening out the nuance of individual relationships or even professional identities. However, once the data were visualized, class members were well prepared to read those results and return them to a humanist framework. Without caring particularly how the networks themselves looked, they approached the data with a more historically informed eye than did the engineers and moved quickly to interpretation. For instance, they already knew well the limitations on women’s travel and connections—we had read primary sources about women’s education—and so that and other historical aspects of the network were more revealing to them.

Much of engineering pedagogy focuses on design techniques to solve a problem. In the engineering R class, the design techniques were tuned toward learning about visualization (e.g., color ramps) and toward coding and designing features that draw attention to the aspects of a visualization relevant to the analytical objective. This approach to the exercise resulted in networks that lacked texture, despite the interesting and often provocative information on the humanists that students had gathered before class. Engineers tend to gravitate toward well-produced visualizations (e.g., appropriately labeled axes, descriptive titles) or toward portraying some important design feature. When the data cannot be understood without context, engineers are less able to navigate the tension between accuracy and context.

Engineers are, however, more alert to the subtleties of the visualization itself and how it communicates information about the data. The caveat here is that the engineering students seemed unable to bring those visualization subtleties back to the data context. In other words, they produce beautiful graphics but do not reflexively use these visualizations to think more about the problem from which their data emerges. Conversely, humanists, even art historians, have not been trained to care about the aesthetic and persuasive presentation of data. Perhaps this is because humanists see themselves as talking mostly with one another, moving rather quickly from visualized data back to humanistic queries and a written argument. It may be that humanist students need to be formally trained to make their visualizations an integral part of their textual analysis. It might also be useful to the future of the humanities, particularly a public-facing humanities, if humanists were not only more comfortable with data, but also more comfortable using it to speak beyond the confines of the classroom or the pages of a scholarly journal.

Bibliography

Bok, Marten Jan, Harm Nijboer, and Judith Brouwer, eds. n.d. ECARTICO: Linking cultural industries in the early modern Low Countries, ca. 1475 – ca. 1725. Accessed October 17, 2020. http://www.vondel.humanities.uva.nl/ecartico/.

Bradley, Adam James. 2018. “Visualization and the Digital Humanities.” IEEE Computer Graphics and Applications 38, no. 6: 26–38.

Csárdi, Gábor, and Tamás Nepusz. 2006. “The igraph software package for complex network research.” InterJournal Complex Systems: 1695. https://igraph.org.

Crawley, Edward, Johan Malmqvist, Soren Ostlund, Doris Brodeur, and Kristina Edstrom. 2007. Rethinking Engineering Education: The CDIO Approach. Springer.

Drucker, Johanna. 2011. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 5, no. 1. http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html.

Handelman, Matthew. 2015. “Digital Humanities as Translation: Visualizing Franz Rosenzweig’s Archive.” TRANSIT 10, no. 1. https://escholarship.org/uc/item/69d0g81v.

Harley, J.B. 2001. “Maps, Knowledge, and Power” and “Silences and Secrecy: The Hidden Agenda of Cartography in Early Modern Europe.” In The New Nature of Maps, 51–107. Baltimore: Johns Hopkins University Press.

Hicks, Stephanie C., and Rafael A. Irizarry. 2018. “A Guide to Teaching Data Science.” The American Statistician 72, no. 4: 382–391. https://doi.org/10.1080/00031305.2017.1356747.

Humanities + Design. n.d. “Palladio.” Accessed October 17, 2020. https://hdlab.stanford.edu/palladio/.

Lincoln, Matthew. 2016. “Social Network Centralization Dynamics in Print Production in the Low Countries, 1550–1750.” International Journal of Digital Art History 2: 134–152.

Manovich, Lev. 2015. “Data Science and Digital Art History.” International Journal for Digital Art History, no. 1 (June). https://doi.org/10.11588/dah.2015.1.21631.

Moretti, Franco. 2011. “Network Theory, Plot Analysis.” New Left Review 68: 80–102.

Muñoz, Trevor. 2013a. “What IS on the Menu? More Work with NYPL’s Open Data, Part One.” http://trevormunoz.com/notebook/2013/08/08/what-is-on-the-menu-more-work-with-nypl-open-data-part-one.html.

———. 2013b. “Refining the Problem — More Work with NYPL’s Open Data, Part Two.”
http://trevormunoz.com/notebook/2013/08/19/refining-the-problem-more-work-with-nypl-open-data-part-two.html.

———. 2014. “Borrow a Cup of Sugar? Or Your Data Analysis Tools? — More Work with NYPL’s Open Data, Part Three.”
http://trevormunoz.com/notebook/2014/01/10/borrowing-data-science-tools-more-work-with-nypl-open-data-part-three.html.

Neff, Gina, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. 2017. “Critique and contribute: A practice-based framework for improving critical data studies and data science.” Big Data 5, no. 2: 85–97.

Horton, Paul Alexander, S. S. Jordan, Steven Weiner, and Micah Lande. 2018. “Project-Based Learning among Engineering Students during Short-Form Hackathon Events.” In ASEE Annual Conference and Exposition, Conference Proceedings.

Pedersen, Thomas Lin. 2020. “A Tidy API for Graph Manipulation.” Accessed October 17, 2020. https://tidygraph.data-imaginist.com/.

R Core Team. n.d. Accessed October 17, 2020. https://www.r-project.org/about.html.

Rawson, Katie, and Trevor Muñoz. 2016. “Against Cleaning.” Curating Menus, July 7. http://www.curatingmenus.org/articles/against-cleaning/.

Savage, Richard, Katherine Chen, and Linda Vanasupa. 2008. “Integrating Project-Based Learning throughout the Undergraduate Engineering Curriculum.” Journal of STEM Education 8, no. 3.

Shannon, Paul. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13: 2498–2504.

Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19, no. 1: 3–28. https://doi.org/10.1198/jcgs.2009.07098.

Acknowledgments

Thanks to Rebecca Levitan, who originally suggested to Elizabeth Honig the idea for this course, and who acted as her teaching assistant when she taught the class at UC Berkeley.

About the Author

Elizabeth Alice Honig is Professor of Northern European Art at the University of Maryland. She is the author of, most recently, Pieter Bruegel and the Idea of Human Nature (Reaktion, 2019), while her current research is about the experience of captivity in Renaissance Europe. She curates the websites janbrueghel.net, pieterbruegel.net, and brueghelfamily.net, and her work in digital art history deriving from those projects has focused on mapping patterns of similarity between pictures produced in the Brueghel workshop network.

Deb Niemeier is the Clark Distinguished Chair in Energy and Sustainability at the University of Maryland, College Park and a professor in the Department of Civil and Environmental Engineering. She works with sociologists, planners, geographers, and education faculty to study the formal and informal governance processes in urban landscapes and the risks and disparities associated with outcomes in the intersection of finance, housing, infrastructure and environmental hazards. She is an AAAS Fellow, a Guggenheim Fellow, and a member of the National Academy of Engineering.

Christian Cloke specializes in the archaeology of the ancient Mediterranean world, employing a range of digital methods and technologies to do so. In service to his archaeological fieldwork (in Italy, Jordan, Armenia, Albania, and Greece), he builds and works with custom databases, Geographical Information Systems (GIS), and a wide array of imaging techniques. He holds a PhD in Classical Archaeology from the University of Cincinnati and is currently the associate director of the Michelle Smith Collaboratory for Visual Culture at the University of Maryland, College Park, where he works on varied digital research and pedagogical projects with students and faculty.

Quint Gregory specializes in seventeenth-century Dutch and Flemish art, as well as museum theory and practice. He is the creator and director of the Michelle Smith Collaboratory for Visual Culture, a center within the University of Maryland’s Department of Art History and Archaeology committed to supporting students, faculty, staff, and members of the broader community who are interested in adopting digital humanities methods and tools in their work and practice. He is especially interested in using offline and online platforms and skills in the causes of social and racial justice and to repair our relationship with the planet.


Numbering Ulysses: Digital Humanities, Reductivism, and Undergraduate Research

Abstract

Ashplant: Reading, Smashing, and Playing Ulysses is a digital resource created by Grinnell College students working with Erik Simpson and staff and faculty collaborators.[1] In the process of creating Ashplant, the students encountered problems of data entry, some of which reflect the general difficulties of placing humanistic materials in tabular form, and some of which reveal problems arising specifically from James Joyce’s experimental techniques in Ulysses. By presenting concrete problems of classification, the process of creating Ashplant led the students to confront questions about the tendency of Digital Humanities methods to treat humanistic materials with regrettably “mechanistic, reductive and literal” techniques, as Johanna Drucker puts it. It also placed the students’ work in a century-long history of readers’ efforts to tame the difficulty of Ulysses by imposing numbering systems and quasi-tabular tools—analog anticipations of the tools of digital humanities. In grappling with the challenges of creating the digital project, the students engaged with the complexity of the literary material but found it difficult to convey that complexity to readers of the site. The article closes with some concrete suggestions for a self-conscious and reflexive digital pedagogy that maintains humanistic complexity and subtlety for student creators and readers alike.

In calculating the addenda of bills she frequently had recourse to digital aid.
—James Joyce, Ulysses (17.681–82), describing Molly Bloom[2]

The growth of the Digital Humanities has been fertilized by widespread training institutes, workshops, and certifications of technical skills. As we have developed those skills and methods, we DHers have also cultivated a corresponding skepticism about their value. Katie Rawson and Trevor Muñoz, for example, write that humanists’ suspicions of data cleaning “are suspicions that researchers are not recognizing or reckoning with the framing orders to which they are subscribing as they make and manipulate their data” (Rawson and Muñoz 2019, 280–81). Similarly and more broadly, Catherine D’Ignazio and Lauren Klein describe in Data Feminism the need for high-level critique when our data does not fit predetermined categories—to move beyond questioning the categories to question “the system of classification itself” (D’Ignazio and Klein 2020, 105). When we have “recourse to digital aid,” to reappropriate Joyce’s phrase from the epigraph above, we break fluid, continuous (analog) information into discrete units. In the process, we gain computational tractability. What do we lose? What do we lose, especially, when we use digital tools in humanistic teaching, where we value the cultivation of fluidity and complexity? These recent critiques build on foundational work such as Johanna Drucker’s earlier indictment of the methods of quantitative DH:

Positivistic, strictly quantitative, mechanistic, reductive and literal, these visualization and processing techniques preclude humanistic methods from their operations because of the very assumptions on which they are designed: that objects of knowledge can be understood as self-identical, self-evident, ahistorical, and autonomous. (Drucker 2012, 86)

The problems that Drucker identifies echo the distinction that constitutes the category of the digital itself: whereas analog information exists on a continuous spectrum, digital data becomes computationally tractable by creating discrete units that are ultimately binary. Tractability can require a loss of complexity. One of Drucker’s examples involves James Joyce’s Ulysses: she evokes the history of mapping the novel’s Dublin as a signature instance of the “grotesque distortion” that can occur when we use non-humanistic methods to transform the materials of imaginative works (Drucker 2012, 94). At their worst, such methods can operate like the budget of Bloom’s day in Ulysses, which is an accounting of a complex set of interactions that obscures at least as much as it reveals, mainly by sweeping messy expenditures into a catch-all category called “balance” (17.1476). Making the numbers add up can render complexity invisible.[3]

I agree with Drucker’s point, albeit with some discomfort, as I am also the faculty lead of a digital project that involves mapping Ulysses, albeit in a different way. In this essay, I take up the pre-digital and digital history of transforming information about Joyce’s novel into structured data. Then I consider the concrete application of those methods in Ashplant: Reading, Smashing, and Playing Ulysses, a website that shares the scholarship of my Grinnell College students. Working on the site has brought us into the history of numbering Ulysses, for better and for worse, and shown us how Ulysses specifically—more than most texts—resists and undermines the very processes that give digital projects their analytical power. In creating Ashplant, we have found that undergraduate research provides an especially generative environment for breaking down the “unproductive binary relation,” in Tara McPherson’s words, between theory and practice in the digital humanities (McPherson 2018, 22).

Numbering Ulysses from The Little Review to the Database

Although created with twenty-first–century digital technologies, Ashplant takes part in a tradition that has built over the full century since the publication of Ulysses. Readers of Joyce’s text have long sought to discipline its complexity by creating reading aids structured like tabular data. That process begins with numbering: facing a novel that sometimes runs for many pages without a paragraph break, we readers have given ourselves a unique, identifying value for each line of the text. In database architecture, such a value is called—with unintentionally Joycean overtones—a primary key.[4] The primary key for Ulysses assigns the lines a value based on episode and line numbers. In its most conventional form, the line numbering is based on the Gabler edition of the novel. The episode-line key lets us point, for instance, to “Ineluctable modality of the visible”—the first line of the third episode—as 3.1.

Ulysses is unusual in having such a reliably fixed convention for a work of prose. Prose normally resists stable line numbering because its line breaks change in response to variations of typesetting that we do not normally read as meaningful; thus arises the variation among editions of Shakespeare’s plays in the line numbering of prose passages. The difficulty of reading Ulysses, however, creates a desire for reliably numbered reference points, for stable ground upon which communities of readers can gather. Such numbering imposes orderly hierarchy upon a text that implicitly and explicitly resists the concept and practices of orderly hierarchy. The early history of numbering the chapters and pages of Ulysses reveals our modern standardization—and by extension the structured data in Ashplant—as the product of a century-old conflict between printed versions of Ulysses and efforts of readers to retrofit the text into tractable data.

The serialization of Ulysses in The Little Review gave readers their first opportunity to grasp Joyce’s text with names and numbers. In the first issue containing part of Ulysses, the numbers begin: issue V.11, for March 1918. The Table of Contents reads, “Ulysses, 1.” The heading of the piece itself, on three lines, is “ULYSSES / JAMES JOYCE / Episode 1.”

The opening header of the first episode of Ulysses in The Little Review, volume 5, number 11, March 1918.
Figure 1. The opening header of the first episode of Ulysses in The Little Review (photograph by the author from the copy in the Special Collections and Archives of Grinnell College).

The 1922 Shakespeare and Company edition removes the episode numbers of the Little Review installments and, indeed, removes most signposting numbers altogether. The volume has no table of contents. In the front matter, the sole indication of a section or chapter number is a page containing only the roman numeral I, placed a bit above and to the left of the center of the page.

Figure 2. The Shakespeare and Company page with only the roman numeral “I” (Joyce, 1922).

This page is preceded by one blank page and followed by another, after which the main text of the novel begins, with no chapter title or episode number.

The beginning of the first episode in the Shakespeare and Company edition, showing no numbering or other header above the text.
Figure 3. The beginning of the first episode in the Shakespeare and Company edition (Joyce, 1922).

The page has no number, either. The following page is numbered 4, so the reader can infer that this is page three, and that the page with the roman numeral I was also page one of the book. Episodes two and three are also unnumbered, so the reader can infer retrospectively that the roman numeral indicated a section rather than a chapter or episode number. At this point, that is to say, the only stable numbering from The Little Review—the episode numbers—has disappeared entirely, replaced by section numbers (just three for the whole book) that reveal their signification only gradually.

To fill the void of stable numbering, early readers of Ulysses relied on supplemental texts that have structured the naming and numbering conventions of the text ever since: the two schemata that Joyce hand-wrote for Carlo Linati and Stuart Gilbert in 1920 and 1921, respectively. The schemata have retained their power in part because of the tantalizing (apparent) simplicity of their form: they organize information about the novel in a structure closely resembling that of a spreadsheet or relational database table. Both schemata use the novel’s eighteen episodes as their records, or tuples—the rows that act essentially as entries in the database—and both populate each record with data corresponding to a series of columns largely but not entirely shared between the two schemata. The relational structure of the Linati schema, for example, has a record for the first episode that identifies it with the number “1,” the title “Telemachus,” and the hour of “8–9” (Ellmann 1972).

Today, any number of websites re-create the schemata by making the quasi-tabular structure fully tabular, as does the Wikipedia page for the Linati schema (“Linati Schema for Ulysses”).

A screenshot of the Linati schema in Wikipedia, showing the tabular structure of the data in the web page.
Figure 4. A screenshot of the Linati schema in Wikipedia. (https://en.wikipedia.org/wiki/Linati_schema_for_Ulysses).


Joyce’s sketches mimic tabular data so neatly that many later versions of the schemata put them in tabular structure without comment. The episode names and numbers again prove their utility: though the two schemata contain different columns, a digital version can join them by creating a single, larger table organized by episode.[5]
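
A minimal sketch in dplyr shows the kind of join involved; only the first three episodes appear here, and the column names (and the specific schemata values) are filled in for illustration:

library(dplyr)

linati <- tibble(
  episode_number = 1:3,
  title          = c("Telemachus", "Nestor", "Proteus"),
  linati_hour    = c("8-9", "9-10", "10-11")
)

gilbert <- tibble(
  episode_number = 1:3,
  gilbert_scene  = c("The Tower", "The School", "The Strand")
)

# A full join keeps every episode's row even where one schema has no entry
schemata <- full_join(linati, gilbert, by = "episode_number")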

Organizing information by episode number has become standard practice in analog and digital supplements to Ulysses. In the analog tradition, readers have long used supplemental materials, organized episode by episode, to assist in the reading of the novel. An offline reader might prepare to read episode three, for instance, by reading a brief summary of the episode on Wikipedia, then consulting the schemata entries for the episode in Richard Ellmann’s Ulysses on the Liffey, then reading the longer summary in Harry Blamires’s New Bloomsday Book. In each case, that reader would look for materials associated with episode number three or the associated name “Proteus.” As much as any other work of literature, Ulysses invites that kind of hand-crafted algorithm for the reading process, all organized by the supplements’ adoption of conventional episode numbers and, often, the Gabler edition’s line numbers as well.

Digital projects, including Ashplant, rely on these episode and line numbers even more fundamentally. One section of Ashplant—the most conventional section—is called “Ulysses by episode.” It organizes and links to resources created by other writers and scholars, from classic nodes of reading Ulysses such as the Linati and Gilbert schemata to contemporary digital projects such as Boston College’s “Walking Ulysses” maps and the textual reproductions of the Modernist Versions Project.[6] This part of Ashplant, like those digital supplements to Ulysses and a long list of others, relies on the standard numerical organization: it has eighteen sections, corresponding to the eighteen episodes of the novel.

The underlying structure of these pages illustrates the importance of the episode number as a primary key: at the level of HTML markup, all eighteen pages point to the same file (“episode.php”) and therefore contain the same code.[7] The only change that happens when the user moves from the page about episode one to that for episode two, for example, is that the value of a single variable, “episode_number,” changes from “01” to “02.” Based on that variable, the page changes the information it displays, mainly by altering database calls that say, essentially, “Look at this database table and give me the information in the row where episode_number equals” the value of that variable. The database call looks like this:

/* Performing SQL query: join the episode_names and schemata tables
   on the shared episode_number key */
$query = "SELECT a.episode_number, a.episode_name, b.episode_number, b.ls_time, b.ls_color, b.ls_people, b.ls_scienceart, b.ls_meaning, b.ls_technic, b.ls_organ, b.ls_symbols, b.gs_scene, b.gs_hour, b.gs_organ, b.gs_color, b.gs_symbol, b.gs_art, b.gs_technic
FROM episode_names a, schemata b
WHERE (a.episode_number='$episode_number') and (a.episode_number=b.episode_number)";
$result = mysql_query($query) or die("Query failed: " . mysql_error());
$num = mysql_num_rows($result);

The user’s input provides the value of episode_number (from 01 to 18) when the page loads.[8] Then, the page with this code queries the database to gather the information—the episode’s name, the schemata entries, and much more—from two database tables joined by the column containing that two-digit number in each table. The process gains its effectiveness from the conventionality of the episode numbers. The price of that efficacy is the loss of a good deal of information—from quirks of typesetting and handwriting, to alternative approaches to numbering the episodes, to the Italianate episode names from the schemata—that have at least as much textual authority as do our later simplifications.

Hierarchy and Classification

If there could be that which is contained in that which is felt there would be a chair where there are chairs and there would be no more denial about a clatter. A clatter is not a smell. All this is good.
—Gertrude Stein, Tender Buttons (Stein 2018, 53)

Line numbers, mainly those of the Gabler edition, impose further numerical discipline on Ulysses. The line numbers produce a hierarchy that allows humans and machines alike to arrive at a shared understanding of textual location:

Line (beginning at 1 and incrementing by 1 within each episode)

Episodes (1–18, or “Telemachus” to “Penelope”)

Ulysses (the whole)

This rationalization functions so powerfully, not only in digital projects but also in conventional academic citation, because it assigns to each location in the text—with some exceptions, such as images—a line, and every line belongs to an episode, and every episode belongs to Ulysses. The hierarchy of information allows shared understanding of reference points.

Though created before the age of contemporary digital humanities, the line/episode/book hierarchy produces a kind of standardization—simple, technical, and reductive—that is enormously useful for digital methods. For example, I embedded into Ashplant a script that combs parts of the site for references to Ulysses, based on a standard citation format of “(U [episode].[line(s)]),” then generates automatically a list of references to an episode, with a link to each source page.

A screenshot of an index of line references from Ashplant, showing a machine-generated list of references from Episode Five of Ulysses.
Figure 5. A screenshot of an index of line references from Ashplant.

These references point to entries in our collective lexicon of key terms in Ulysses. The students’ lexicon entries together constitute a playful, inventive exploration of the book’s language. The automated construction of the list itself, however, relies on an episode-line hierarchy that has none of that playfulness or invention.
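
Although the site’s own script is written in PHP, a rough R sketch conveys the kind of pattern matching involved; the regular expression below assumes the citation convention described above, and the sample text is invented:

library(stringr)

page_text <- 'The dog recites verse (U 12.718-19), and the third
episode opens with the ineluctable modality of the visible (U 3.1).'

# Capture the episode and line numbers from each "(U episode.line)" citation
refs <- str_match_all(page_text, "\\(U (\\d{1,2})\\.(\\d+)(?:-\\d+)?\\)")[[1]]
episodes <- refs[, 2]   # first capture group: the episode numbers
table(episodes)         # tally of references per episode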

For playful inventiveness in a hierarchy of location, we can turn instead to Stephen Dedalus in Joyce’s Portrait of the Artist as a Young Man, who reads the hierarchical self-location he has written on the flyleaf of his geography book:

Stephen Dedalus
Class of Elements
Clongowes Wood College
Sallins
County Kildare
Ireland
Europe
The World
The Universe (Joyce 2003, 12)

Stephen’s hierarchy is personal and resistant, not only containing the humorously self-centered details of his hyper-local situation but also excluding, for instance, any layer between “Ireland” and “Europe” that would acknowledge Ireland’s containment within the United Kingdom. It also confuses categories: although the list seems geographical—appropriately, given its location on a geography book—it contains some elements that would require additional information to become geographical (“Stephen Dedalus,” “Class of Elements”), and others whose mapping would be contentious (“Ireland,” “Europe”). It even contains an element, “Class of Elements,” that constitutes a self-reflexive joke about the impulse to classification that the list satirizes.

Such knowing irony does not infuse the hierarchies that drive much of our work in the digital humanities. That work requires schemes of classification that rely on one element’s containment within another. Consider what “Words API” claims to “know” about words:

A screenshot from the 'About' section of Words API describing the hierarchical relations of certain terms, such as 'a hatchback is a type of car'.
Figure 6. A screenshot from the “About” section of Words API describing the hierarchical relations of certain terms (https://www.wordsapi.com).

An API, or Application Programming Interface, provides methods for different pieces of software to communicate with one another in a predictable way. Words API, “An API for the English Language,” performs this function by adding hierarchical metadata to every word. That metadata, in turn, allows other software to draw on that information by searching for all the words that refer to parts of the human body, for instance, or for singular nouns. In other kinds of DH applications, such as textual editions that are part of the XML-based Text Encoding Initiative, the hierarchical relationships are often hand-encoded: feminine rhyme is a type of rhyme, and rhyming words are sections of lines, which are elements of stanzas, and so forth.

The utility of these techniques is not surprising. Stanzas do generally consist of lines; transitive is a type of verb. The reliance on these encoded hierarchies echoes the methods of New Criticism, such as Wellek and Warren’s hierarchical sequence of image, metaphor, symbol, and myth—for them, the “central poetic structure” of a work (Wellek and Warren 1946, 190). For contemporary scholars more invested in decentering and poststructuralism, however, the echo of New Critical hierarchies in DH is unwelcome. Centrality implies exclusion; structure implies oversimplification; formal hierarchy implies social hierarchy. Or, as John Bradley writes, “XML containment often represents a certain kind of relationship between elements that, for want of a better term, can be thought of as ‘ownership’” (Bradley 2005, 145). Even for scholars working specifically to counteract the hierarchical containments of XML, the attempt can lead—as in Bradley’s work—to the linking of multiple hierarchical structures rather than the disruption of hierarchical organization itself.

Problems of “Subtle Things”

Addressing the challenges of encoding historical materials in XML, Bradley describes the limitations of hierarchical classification. “Humanities material,” he writes, “sometimes does not suit the relational model,” and he cites the Orlando Project’s opposition to placing its data in a relational database because it wanted to say more “subtle things” than the relational model could express (Bradley 2005, 141).[9] Bradley responds to that challenge with an ingenious method of integrating the capabilities of SQL databases into XML, solving the problem of expressing how a name in a historical document might refer to one of three people with discrete identifiers in the database. Even this problem, however, involves a relatively simple kind of uncertainty, representable on a line between “certain” and “unlikely.” The data being encoded is used for humanistic purposes, but the problem itself is not especially humanistic: it assumes an objective historical reality that can be mapped, with varying levels of confidence, onto stable personal identifiers.

The data of Ulysses presents additional difficulties, many of which are specifically literary, as the students working on Ashplant have repeatedly found. In one case, a group of them sought to document every appearance of every character in Ulysses. That data has clear utility for readers: when made searchable, it could assist a reader by identifying, for a given episode and line number, the active characters, perhaps adding a brief annotation to each name. Many earlier aids to reading the novel offer descriptions of the main characters, but the students set out to develop a resource that was more comprehensive and more responsive to a reader’s needs at a given place in the text. The students quickly discovered that identifying and describing a handful of major characters is easy; identifying all of them and their textual locations is not just a bigger problem but a fundamentally different one.

Take, for example, the novel’s dogs. We had established early on that non-human entities could be characters in our classification, given Joyce’s attribution of speech and intention to, say, hats and bars of soap. At least one dog seemed clearly to reach the level of “character”: Garryowen, the dog accompanying the Citizen in the “Cyclops” episode. According to one of the satirical interpolations of the episode, Garryowen has attained, “among other achievements, the recitation of verse” (12.718–19), a sample of which is included in the text. For the purposes of our data and accompanying visualization, we therefore needed basic information about the character, such as its name and when it appears in Ulysses.

The name creates the first problem. The description of the dog in “Cyclops” identifies it as “the famous old Irish red setter wolfdog formerly known by the sobriquet of Garryowen and recently rechristened by his large circle of friends and acquaintances Owen Garry” (12.715–17). In itself, the attribution of two names to one being is not a problem. For instance, Leopold Bloom can be “Bloom” or “Poldy,” but switching between them does not rename him. In Ulysses, such an entity could take the names “Garryowen” and “Owen Garry.” As long as the underlying identity is stable, this kind of multiplicity (two names at two different times) can fit easily into a relational database.
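
A minimal sketch in R illustrates the point; the table layout and column names here are hypothetical, not Ashplant’s actual schema:

# Each entity gets one stable identifier...
characters <- data.frame(
  character_id = c(1, 2),
  description  = c("advertising canvasser", "the Citizen's dog")
)

# ...and any number of names can point to that identifier
aliases <- data.frame(
  character_id = c(1, 1, 2, 2),
  name         = c("Bloom", "Poldy", "Garryowen", "Owen Garry")
)

# Joining on the key recovers every alias of each stable entity
merge(characters, aliases, by = "character_id")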

But Ulysses does not work so simply. Within the fiction, the renaming of the dog has questionable reality-status.[10] The “rechristening” has no durability within the narrative; it exists only in the context of one satirical interpolation, and subsequent references to the dog revert to “Garryowen.” Arguably, if our task is to describe the characters that are real within the world of the novel, the name “Owen Garry” has no status at all, as it attaches more to the voice of the temporary narrator than to what we could imagine as the real (fictional) dog.

However, we could just as plausibly say that, in the fiction, “Owen Garry” must not only exist but also be recorded as a separate character representing the temporary re-creation of Garryowen by this narrative voice. All of this messiness anticipates the further complications of the hallucinatory “Circe” episode, in which Bloom is followed by a dog that metamorphoses among species—spaniel, retriever, terrier, bulldog—until Bloom addresses it as “Garryowen,” and it transforms into a wolfdog. The dog might say, as Stephen does, “I am other I now” (9.205). Joyce’s method relies on the unresolvability of these ambiguities.

Emily Mester, the student who took the lead on the Ashplant character project, brought the transforming dog to our working group as a problem of data entry. We discussed how the problem stemmed from a breakdown of classification: rather than allowing the reader to rely on conventional relationships between sets and their elements (living things include humans and other animals, which include dogs, which include species, which include individual dogs), Joyce’s transforming dog implies a relation in which the individual dog contains multiple species. Our conversation led us to consider the dog as a device through which Joyce upends hierarchies of containment by attaching the name “Garryowen” to a dog, or an assortment of dogs, whose characteristics arise from the surrounding narration.

We realized together that the exercise of entering data into our spreadsheet led us to new questions about earlier scholarship on Ulysses. We found, for example, that Vivien Igoe assumed that Joyce’s Garryowen represented a historical dog of the same name, as in her statement that “Garryowen, who appears in three of the episodes in Ulysses (‘Cyclops’, ‘Nausicaa’, and ‘Circe’), was born in 1876” (Igoe 2009, 89). Although Igoe subsequently notes that Joyce distorts Garryowen for the purposes of fiction, this sentence still relies on several related presumptions for the purposes of historicist explanation: that the historical and fictional Garryowens are the same, that the Garryowen of Ulysses has the species identification of the historical dog (“famous red setter” [Igoe 2009, 89] rather than the “Irish red setter wolfdog” of “Cyclops”), and that within Ulysses, the fictional dog maintains a constant identity across episodes.[11] Reading phrasing such as Igoe’s in light of our questions about Garryowen led the Ashplant group to consider the confrontation between certain kinds of historicist methods and poststructuralist skepticism.

As the students continued to develop Ashplant, they discovered more and more examples of data entry problems that gave rise to probing discussions of Ulysses and, often, of how fictions work and how readers receive them. We sought, for instance, to map the global imagination of Ulysses, resisting the tendency Drucker had criticized of producing simplistic, naïve Dublin-centric visualizations of the “action” of the book. Instead, our map included only places outside of Dublin. For that subproject, guided by the student Christopher Gallo, we asked, How do we map an imaginary place? One that a character remembers by the wrong name? One that seems to refer to a historical event but puts it in the wrong place? For another part of the project, led by Magdalena Parkhurst, we created a visualization of the Blooms’ bookshelf as it has been disrupted in “Ithaca,” which required us to represent books with missing, incorrect, and imaginary information according to our research into historical sources.

Again and again, we found that the parts of Ashplant that appeared to involve the simplest kinds of data entry prompted us to have some of our deepest conversations about Ulysses, often leading us to further reading in contemporary criticism and theory. We found that, as Rachel Buurma and Anna Tione Levine have argued,


Building an archive for the use of other researchers with different goals, assumptions, and expectations requires sustained attention to constant tiny yet consequential choices: “Should I choose to ignore this unusual marking in my transcription, or should I include it?” “Does this item require a new tag, or should it be categorized using an existing one?” “Is the name of the creator of this document data or metadata?” (Buurma and Levine 2016, 275–76)


Though our project is not archival, our experience has aligned with Buurma and Levine’s argument. Undergraduate research, which “has long emphasized process over product, methodology over skills, and multiple interpretations over single readings,” is well situated to foster the “sympathetic research imagination” necessary for creating useful digital projects. As our process became product, we felt more powerfully the constraints of using the “reductive and literal” tools that concern Drucker. No matter how nuanced and far-reaching our conversation about Garryowen had been, for instance, the needs of our spreadsheet compelled us to choose: do the appearances of the transforming dog(s) of “Circe” count as Garryowen or not?[12]

We found that the machinery of data entry and visualization performed what Donna Haraway calls the “god trick,” producing the illusion of objectivity even when our conversations and methods aspired to privilege, in Haraway’s words, “contestation, deconstruction, passionate construction, webbed connections, and hope for transformation of systems of knowledge and ways of seeing” (Haraway 1988, 585).[13] Our timeline-based visualization of character appearances, for example, could not resist the binary choice of yes or no; even a tool that could represent probability would not be capable of representing non-probabilistic indeterminacy in the way that our conversation had. We needed to find other ways to make Ashplant into a site that produces a humanistic experience for its readers as well as its creators.

Ways Forward for Humanism in Undergraduate Digital Studies

This essay will not fully solve the problem it addresses: that digital methods gain some of their power by selecting from and simplifying complex information, sometimes in ways that run contrary to humanistic practices. Like Buurma and Levine, however, I find that the scale and established practices of undergraduate research create opportunities to do digital work that minimizes the problem and may, in fact, point to approaches that can inform humanistic digital work in general. With that goal in mind, I offer a few propositions based on our Ashplant team’s experience to date.

  1. Narrate the problems. Undergraduate research often operates at a scale that allows for hand-crafted digital humanities, in which the consequences of data manipulation can become the explicit subject of a project. The structure of Ashplant allows us to explain the problems of documenting character and location in Ulysses, and it also provides space for a wider range of student research: an analysis of Bloom’s scientific thinking and mis-thinking in “Ithaca,” a piece about Ulysses and the film Inside Llewyn Davis that uses hyperlinks to take a circular rather than linear form, and students’ artistic responses to the novel. The scale of undergraduate research allows it to become an arena for confrontation with and immersion in the problems created by the intersection of data science and the humanities.
  2. Connect conventional research to digital outcomes. Ashplant has at its heart an annotated bibliography, for which students read, cite, and summarize existing scholarship. Creating such a bibliography in digital form—specifically, with the bibliographical information in a database accessed through our web interface—enables searchability and linking. The bibliography thus becomes the scholarly backbone of the site, linked from and linking to every other section. Contributing to this part of the project grounds the students in the kind of reading and writing they have done for their other humanistic work, while also illustrating the affordances of the digital environment.
  3. Use the genre of the hypertext essay. Writing essays that combine traditional scholarly citation with other means of linking—bringing a project’s data to bear on a problem, connecting the project to other digital collections and resources—allows students to experience and demonstrate the impact of their digital projects on scholarly argumentation. Ashplant therefore includes a section of topical essays and theoretical explorations, addressing subjects from dismemberment to music. These essayistic materials link to and, importantly, are linked from the parts of the site that are based more explicitly on structured data. Our visualization of the global locations of Ulysses can lay the foundations for discussions of the Belgian King Leopold and the postcolonial Ulysses, for example, and a tool we developed for finding phonemic patterns in the text became the prompt for Emily Sue Tomac, a student specializing in linguistics as well as English, to undertake a project on Joyce’s use of vowel alternation in word sets such as tap/tip/top/tup. Hypertext essays can reanimate the complexities and contestations hidden by the god trick.
  4. Make creative expression a pathway to DH. My initial design of Ashplant involved an unconventional division of labor. For the most part, students wrote the content of the site, while I took the roles of faculty mentor, general editor, and web developer. As the site evolved, so did those roles, and I perceived an important limitation of our model: students were rarely responsible for the visual elements of our user interface, and their interest in that part of the project was growing. The students saw the creative arts as a means of resisting the constraints of digital methods, and some of them created art projects that now counterbalance the lexical content of the site. When I designed a new course on digital methods for literary studies, therefore, I put artistic creativity first.[14] In that class, the students learned frameworks for discussing the affordances and effects of electronic literature, and we applied those frameworks to texts such as “AH,” by Young-hae Chang Heavy Industries; Illya Szilak and Cyril Tsiboulski’s Queerskins: A Novel; and Ana María Uribe’s Tipoemas y Anipoemas.[15] These works model a range of approaches to interactivity and digital interfaces. As a result, all of our subsequent work in the semester—from the creation of the students’ own works of electronic literature, to the collection and presentation of geographical data, to writing Python scripts for textual analysis—takes place after this initial framing of digital work as a set of creative practices.

The complexity of humanistic inquiry does not involve solving well-defined problems with clear endpoints and signs of success. Our wholes have holes. As I have worked with my students on Ulysses, we have come to embrace a practice of digital humanities that puts creativity, resistance, and questioning at its heart, even (or especially) when we use the tabular and relational structures that appear at first to build walls within the imaginative works we study. Asking questions as simple as “What do we call this chapter of Ulysses?” and “When does this character appear?” has led my students to think and play and draw, representing contours of absurdity and art that help draw new maps of undergraduate study in the humanities.

In some ways, that new mapping takes part in the tradition I have described here: the translation and even reduction of textual complexity into reference materials that help students grasp Ulysses and begin the process of making meaning of and around it. In other ways, however, Ashplant has led us to a practice of digital humanities more aligned with Tara McPherson’s emphasis on “the relations between the digital, the arts, and more theoretically inflected humanities traditions” (McPherson 2018, 13). The scale of undergraduate pedagogy allows spreadsheets, essays, maps, and paintings to grow from the same intellectual soil, maintaining the value that structured data has long provided while preserving the complex energies of humanistic inquiry.

Notes

[1] “Ashplant” is the word Joyce uses to describe the walking stick of Stephen Dedalus. As the site explains, Stephen’s ashplant “is not a simple support but his ‘casque and sword’ (9.296) that he uses for everything from dancing and drumming to smashing a chandelier.” We likewise sought to take the conventional idea of a digital site supporting the reading of Ulysses and create varied and surprising possibilities for its use.

[2] I cite Ulysses (Joyce 1986) by episode and line number throughout, following the numbering convention that is about to become the subject of this essay.

[3] On the other hand, when Bloom later relates the events of his day to his wife in words, his account includes similar evasions and omissions. Joyce’s larger point seems less about the deceptions of quantification than about the many modes of deception humans can use when sufficiently motivated to hide something.

[4] More technically, in a database, a primary key is a column or combination of columns that has a unique value for every row. For example, in Ulysses, line number alone cannot be a primary key because every episode has, for instance, a line numbered 33. Therefore, the primary key requires the combination of episode and number: 1.33, 4.33, and 17.33 are all unique values. The other main characteristic of a primary key is that it cannot contain a null value in any row.
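
The distinction can be checked directly. A small illustration in R, using a toy table rather than our actual line data:

# Three lines that share the number 33 across different episodes
lines <- data.frame(episode = c(1, 4, 17), line = c(33, 33, 33))

# Line number alone repeats, so it cannot serve as a primary key...
any(duplicated(lines$line))                       # TRUE

# ...but the combination of episode and line is unique in every row
any(duplicated(lines[, c("episode", "line")]))    # FALSE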

[5] As we have seen, the numbering convention of labeling the episodes from one to eighteen follows the lead of the Little Review episodes but not of the Shakespeare & Company edition, which uses only section numbers. The schemata employ yet another system, primarily restarting the episode numbering at the beginning of each section (so the fourth episode, which begins the second section, becomes a second episode “1”).

[6] These projects’ addresses are http://ulysses.bc.edu/ and http://web.uvic.ca/~mvp1922/, respectively.

[7] HTML (Hypertext Markup Language) is the standard markup language for web pages. HTML describes the structure of the data in a page, along with some information about formatting, so that it can be rendered by a browser. HTML does not contain executable scripts (or “code”). To create executable scripts and access information in databases, Ashplant embeds the scripting language PHP within its HTML code and connects to databases created with MySQL. Using this combination of PHP and MySQL is a common approach to creating dynamic web pages.

[8] This numbering involves another small translation: using two digits—01 and 02 rather than 1 and 2—allows the numbers to sort properly when interpreted computationally.
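
In R, for instance, the padding can be generated with sprintf(), and the difference shows up immediately in sorting (a small illustration, not code from the project):

# Unpadded numbers sort lexicographically: "1", "10", "11", ..., "18", "2", ...
sort(as.character(1:18))

# Zero-padded numbers sort in the intended order: "01", "02", ..., "18"
sort(sprintf("%02d", 1:18))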

[9] The ability of XML-based schemes to contain non-hierarchical information remains a point of lively contention. The subject prompted a lengthy conversation on the HUMANIST listserv in early 2019 under the heading “the McGann-Renear debate.” That conversation is archived at https://dhhumanist.org/volume/32/.

[10] “Cyclops” implies yet another variant of the name: the Citizen familiarly calls the dog “Garry,” and the mock-formal narrator calls the poet-dog “Owen,” reversing the usual functions of first and last names and implying another identity called “Garry Owen.”

[11] Igoe’s phrasing also conflates the birth years of the historical and fictional Garryowens, although the historical dog would have had an implausible age of around twenty-eight years at the time Ulysses takes place.

[12] For whatever it’s worth, the unsatisfying decision we made was to classify the “Garryowen” (and “Owen Garry”) of “Cyclops” and “Nausicaa” as the character “Garryowen,” then create a separate character called “Circe Dog” to capture the transforming species of the dog(s) of that episode.

[13] Haraway’s skepticism echoes the sentiments of the “foundational crisis” of mathematics about a century ago, when Joyce was conceiving Ulysses and when, in 1911, Oskar Perron wrote, “This complete reliability of mathematics is an illusion, it does not exist, at least not unconditionally” (Engelhardt 2018, 14).

[14] That course, “Lighting the Page: Digital Methods for Literary Study,” was designed in partnership with my student collaborator Christina Brewer, who made especially valuable contributions to the unit on electronic literature.

[15] “AH” is online at http://www.yhchang.com/AH.html, Queerskins at http://online.queerskins.com/, and Uribe’s poetry at http://collection.eliterature.org/3/works/tipoemas-y-anipoemas/typoems.html.

Bibliography

Blamires, Harry. 1996. The New Bloomsday Book. London: Routledge.

Bradley, John. 2005. “Documents and Data: Modelling Materials for Humanities Research in XML and Relational Databases.” Literary and Linguistic Computing 20, no. 1: 133–51.

Buurma, Rachel Sagner and Anna Tione Levine. 2016. “The Sympathetic Research Imagination: Digital Humanities and the Liberal Arts.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein, 274–279. Minneapolis: University of Minnesota Press.

D’Ignazio, Catherine and Lauren F. Klein. 2020. Data Feminism. Cambridge: MIT Press.

Drucker, Johanna. 2012. “Humanistic Theory and Digital Scholarship.” In Debates in the Digital Humanities, edited by Matthew K. Gold, 85–95. Minneapolis: University of Minnesota Press.

Ellmann, Richard. 1972. Ulysses on the Liffey. Oxford: Oxford University Press.

Engelhardt, Nina. 2018. Modernism, Fiction, and Mathematics. Edinburgh: Edinburgh University Press.

Haraway, Donna. 1988. “Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective.” Feminist Studies 14, no. 3: 575–599.

Igoe, Vivien. 2009. “Garryowen and the Giltraps.” Dublin James Joyce Journal 2, no. 2: 89–94.

Joyce, James. 1922. Ulysses. Paris: Shakespeare and Company. http://web.uvic.ca/~mvp1922/ulysses1922/

Joyce, James. 1986. Ulysses, edited by Hans Walter Gabler with Wolfhard Steppe and Claus Melchior. New York: Vintage.

Joyce, James. 2003. A Portrait of the Artist as a Young Man, edited by Seamus Deane. New York: Penguin.

McPherson, Tara. 2018. Feminist in a Software Lab: Difference + Design. Cambridge: Harvard University Press.

Rawson, Katie and Trevor Muñoz. 2019. “Against Cleaning.” In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein, 279–92. Minneapolis: University of Minnesota Press.

Stein, Gertrude. 2018. Tender Buttons: Objects, Food, Rooms, edited by Leonard Diepeveen. Peterborough: Broadview.

Wellek, René and Austin Warren. 1946. Theory of Literature. New York: Harcourt.

Acknowledgments

This essay names some of the contributors to Ashplant, but dozens of student, faculty, and staff collaborators have made important contributions to the project, and the full accounting of gratitude for their work is on the site’s “About” page at http://www.math.grinnell.edu/~simpsone/Ulysses/About/index.php. I also thank Amanda Golden, Elyse Graham, and Brandon Walsh for their insightful comments on earlier versions of this piece.

About the Author

Erik Simpson is Professor of English and Samuel R. and Marie-Louise Rosenthal Professor of Humanities at Grinnell College. He is the author of two books: Literary Minstrelsy, 1770–1830 and Mercenaries in British and American Literature, 1790–1830: Writing, Fighting, and Marrying for Money. His current research concerns digital pedagogy and, in collaboration with Carolyn Jacobson, the representation of spoken dialects in nineteenth-century literature.

A greyscale map with circles over countries, the size and darkness of which indicate density. The US is the darkest, followed by India, Indonesia, Viet Nam, West Africa, Europe, and the Caribbean.
3

Data Fail: Teaching Data Literacy with African Diaspora Digital Humanities

Abstract

This essay examines the authors’ experiences working collaboratively on Power Players of Pan-Africanism, a data curation and data visualization project undertaken as a directed study with undergraduate students at Salem State University. It argues that data-driven approaches to African diaspora digital humanities, while beset by challenges, promote both data literacy and an equity lens for evaluating data. Addressing the difficulties of undertaking African diaspora digital humanities scholarship, the authors discuss their research process, which focused on using archival and secondary sources to create a data set and designing data visualizations. They emphasize challenges of doing this work: from gaps and omissions in the archives of the Pan-Africanism social movement to the importance of situated data to the realization that the original premises of the project were flawed and required pivoting to ask new questions of the data. From the trials and tribulations—or data fails—they encountered, the authors assess the value of the project for promoting data literacy and equity in the cultural record in the context of high school curricula. As such, they propose that projects in African diaspora digital humanities that focus on data offer teachers the possibility of engaging reluctant students in data literacy while simultaneously encouraging students to develop an ethical lens for interpreting data beyond the classroom.

What can data visualization tell us about the scope and spread of Pan-Africanism during the first half of the 20th century, and what insights does undertaking this research offer for teaching data literacy? These questions were at the heart of a directed study during the 2019–2020 academic year, in which we, a professor (Roopika Risam) and two students (initially Jennifer Mahoney in Fall 2019, with Hibba Nassereddine joining in Spring 2020), examined the utility of data visualization for African diaspora digital humanities and its possibilities for cultivating students’ interest in and knowledge of data-driven research. As part of Mahoney’s participation in Salem State University’s Digital Scholars Program, which introduces students to humanities research using computational research methods, the directed study offered her the experience of undertaking interdisciplinary independent research (a rare opportunity in the humanities at Salem State University), an introduction to working with data and data visualization, and the opportunity to broaden her knowledge of African diaspora literature and history. While the process of undertaking this research included many twists and turns and, ultimately, did not yield the insights we had anticipated, it opened up new areas of inquiry for computational approaches to the African diaspora, critical insights about the value of introducing students to African diaspora digital humanities, and the pedagogical imperatives of data literacy. As we propose, data projects on the African diaspora offer the possibility of introducing students both to important stories and voices that are often underrepresented in curricula and to the ethics of working with data in the context of communities that have been dehumanized and oppressed by unethical uses of data.

The State of Data in African Diaspora Digital Humanities

In recent years, Black Digital Humanities has grown tremendously in scope. The African American Digital Humanities (AADHum) Initiative at the University of Maryland, College Park, led initially by Catherine Knight Steele and now by Marisa Parham, and the Center for Black Digital Research at Penn State, led by P. Gabrielle Foreman, Shirley Moody-Turner, and Jim Casey, attest to increased institutional investment in digital approaches to Black culture. An extensive list of projects, created by the Colored Conventions Project, demonstrates the variety of methodologies, histories, and voices being explored through Black Digital Humanities scholarship. Since Kim Gallon outlined the case for Black Digital Humanities in her essay in the 2016 volume of the Debates in the Digital Humanities series, she has, indeed, “set in motion a discussion of the black digital humanities by drawing attention to the ‘technology of recovery’ that undergirds black digital scholarship, showing how it fills the apertures between Black studies and digital humanities” (Gallon 2016, 42–43). Black Digital Humanities is, as scholars like Gallon (2016), Parham (2019), Safiya Umoja Noble (2019), and others propose, fundamentally transnational. An emphasis on the African diaspora has, thus, become an essential dimension of Black Digital Humanities. The Digital Black Atlantic (University of Minnesota Press, 2021), which Risam co-edited with Kelly Baker Josephs for the Debates in the Digital Humanities series, will be the first volume to articulate the scope and span of African diaspora digital humanities as a multidisciplinary, transnational assemblage of diverse scholarly practices spanning a range of disciplines (e.g., literary studies, history, library and information science, musicology) and methodologies (e.g., community archives, library collection development, textual analysis, network analysis).

African diaspora digital humanities, we contend, offers students opportunities to engage in active learning through participation in civically engaged scholarship. Such forms of authentic learning are “participatory, experimental, and carefully contextualized via real-world applications, situations, or problems” (Hancock et al. 2010, 38). They draw on scholarship that supports deep learning through the experiences of actively constructing knowledge (Downing et al. 2009; Ramsden 2003; Vanhorn et al. 2019). In the context of digital humanities, as Tanya Clement (2012) suggests, “Project-based learning in digital humanities demonstrates that when students learn how to study digital media, they are learning how to study knowledge production as it is represented in symbolic constructs that circulate within information systems that are themselves a form of knowledge production” (366). As Risam (2018) has argued, undertaking this work in the context of postcolonial and diaspora studies “empowers students to not only understand but also intervene in the gaps and silences that persist in the digital cultural record” (89–90). As projects like Amy E. Earhart and Toniesha L. Taylor’s White Violence, Black Resistance demonstrate, authentic learning through research-based projects in African diaspora studies “teach recovery, research, and digitization skills while expanding the digital canon” (Earhart and Taylor 2016, 252). Such projects allow undergraduate students to develop both digital and data literacy skills, which are often only implicitly taught in undergraduate courses, particularly in the humanities (Carlson et al. 2015; Battershill and Ross 2017; Anthonysamy 2020).

Approaches to the African diaspora that foreground working with data have shown particular promise as the technologies of recovery for which Gallon advocates. The Transatlantic Slave Trade Database, which aggregates data from slave ship records, was first conceived in the early 1990s by David Eltis, David Richardson, and Stephen Behrendt, researchers who were compiling data on enslavement and decided to join forces. Over the decades, the team and database expanded to include 36,000 voyages. The Transatlantic Slave Trade Database is now partnering with other projects on enslavement through Michigan State University’s Enslaved project, which is working to develop interoperable linked open data between these various databases. Projects like In the Same Boats, directed by Kaiama L. Glover and Alex Gil, with contributions from a team of scholars of the African diaspora (including Risam), demonstrate the value of a transnational, data-driven approach to more recent facets of African diasporic culture. The directors compiled data sets from their partners identifying the locations where Black writers and artists found themselves throughout the twentieth century and created data visualizations that show their intersections. While co-location of these figures at a given time does not necessarily mean they met, the project opens up new research questions about relationships and collaborations between them. The possibility of creating new avenues of transnational research is, perhaps, the most critical contribution of African diaspora digital humanities projects that focus on data.

But working with data in the context of the African diaspora is not an unambiguous proposition. Writing about the Transatlantic Slave Trade Database in her essay “Markup Bodies,” Jessica Marie Johnson argues, “Metrics in minutiae neither lanced historical trauma nor bridged the gap between the past itself and the search for redress” (2018, 62). In Dark Matters, Simone Browne notes that data has played a role in racialized surveillance from transatlantic slavery to the present and has been complicit with social control (2015, 16). COVID Black, a task force on Black health and data directed by Kim Gallon, Faithe Day, and Nishani Frazier with a team of collaborators, addresses racial disparities from the COVID-19 pandemic through data. Recognizing and addressing these issues is critical for African diaspora digital humanities projects that focus on data, particularly when working with undergraduate humanities students because of the twin challenges of students’ general lack of exposure to African diaspora studies and to data literacy in curricula.

Understanding Data through the Lens of Pan-Africanism

All of these issues came together in our project, Power Players of Pan-Africanism, which collects data on and develops data visualizations of attendees of Pan-Africanist gatherings from 1900 to 1959. Pan-Africanism, a social movement of great significance during the 20th century, fostered a sense of solidarity and political organization between people in Africa and African-descended people around the world. The timeframe encompasses the First Pan-African Conference in 1900, Pan-African Congresses held between 1919 and 1945, the Bandung Conference held in 1955, the Congresses of Black Writers and Artists in 1956 and 1959, the Afro-Asian Writers’ Conference in 1958, and assorted events during this time period that created space for people of Africa and its diaspora to meet and discuss their common political, social, and economic concerns. We chose to include events fostering Afro-Asian connections as well because they offered opportunities for Pan-Africanist connections in the broader context of Afro-Asian solidarity. Additionally, we ended in 1959 because 1960—widely known as the “Year of Africa”—saw the successes of decolonization movements in Africa and significantly changed the stakes of the conversation among Pan-Africanists.

While the idea for Power Players of Pan-Africanism emerged as a side project from Risam’s work on The Global Du Bois, a data visualization project that explores how computational data-driven research challenges, complicates, and assists with how we understand W.E.B. Du Bois’s role as a global actor in anticolonial struggles, and from her contribution of the Du Bois data set to Glover and Gil’s In the Same Boats, this project was undertaken as a collaboration between Risam and Mahoney, who together designed a plan for research, data collection and curation, and data modeling. We were joined in the Spring 2020 semester by Hibba Nassereddine, another student in the Digital Scholars Program, who collaborated with us on research for the data set, the iterative process of designing research questions based on the data, and prototyping of data visualizations.

The first challenge we encountered is that Pan-Africanism is largely unexamined within both high school and college curricula in the US. Despite its significance for understanding anti-colonial and anti-racist movements in the US and abroad, it goes largely unexplored in the classroom. However, its emphasis on global cooperation between Africa and its diaspora is poised to open up significant insights for the study of the African diaspora, global history, political science, and literary studies, among other fields. The thriving network of intellectuals, artists, writers, and politicians who participated in Pan-Africanist movements reveals rich global connections and world travel that brought Black people of the US, Caribbean, Europe, and Africa into communication and collaboration during the first half of the 20th century. Thus, Mahoney, and later Nassereddine, first had to learn about an entirely new area of study in preparation for their participation in this project.

Data literacy is also a sorely missing part of curricula in high schools and colleges in the US. Therefore, both Mahoney and Nassereddine had to learn about working with data as well. We focused on the concept that data is situated, an idea that Jill Walker Rettberg has articulated (Rettberg 2020; Risam 2020). Data is not, as many think, objective and neutral but is a product of how it is collected—who collects it, what terms they use, what their biases are—and of how it is represented—what choices are made in data visualization and how those choices shape the way data is interpreted and received by audiences. We examined principles of data visualization, influenced by the work of Edward Tufte, Alberto Cairo, and Isabel Meirelles, to consider how data visualization risks misrepresenting or skewing data. Thus, to be prepared to undertake the project, Mahoney, and later Nassereddine, needed a firm grounding in data literacy and data ethics, which they had not received elsewhere in their education.

Recognizing the challenges of working with data in the context of the African diaspora, Risam and Mahoney set out to identify connections between attendees at Pan-Africanist events. By identifying conferences and other events that created space for Pan-Africanists to meet, we believed we could bring to life a data set that would reveal connections between figures in Pan-Africanist networks. Would network analysis reveal new key figures beyond names like Du Bois, George Padmore, Kwame Nkrumah, Marcus Garvey, Jomo Kenyatta, and Léopold Sédar Senghor?

Right away, we encountered another issue: the lack of readily available data sets for this work. The absence was not particularly surprising, as it reflects historical and ongoing marginalization of scholarship on the African diaspora more generally and Pan-Africanism specifically within academic knowledge production and archives. As Risam (2018) argues, the lack of preservation and digitization of material related to communities within the African diaspora and in the Global South is a major deterrent to undertaking digital humanities projects. Therefore, research to create a data set was a necessary precursor to data visualization.

This process turned out to be far more difficult than expected. We spent months digging into the history of Pan-Africanism, using monographs, journal articles, digital archives, theses and dissertations, historic Black newspapers, organization newsletters, and primary source documents from the events, such as published pamphlets listing attendees and photographs with captions, to identify events where Pan-Africanism was an important focus and uncover names of delegates and other participants. Explicitly named “Pan-African” events (First Pan-African Conference, First Pan-African Congress, Second Pan-African Congress, etc.) were the easiest to identify. However, Pan-Africanist conferences went by many other names: writers’ conferences, peace conferences, and anti-colonial conferences. Furthermore, a single event often appears under multiple names, a consequence of the relative lack of attention Pan-Africanism has received in academic discourse. In these cases, we labeled events by the names with which they most commonly appear in academic and archival sources. For example, we identify one event as the “All-African People’s Conference,” held in Accra, Ghana in December 1958, based on corroboration of sources, but this event is also referred to as the “Congress of African Peoples” (Adi and Sherwood 2003). Even more confusingly, Immanuel Geiss’s The Pan-African Movement (1974), arguably the first scholarly treatment of Pan-Africanism, refers to the All-African People’s Conference as the “Sixth Pan-African Congress,” while the Sixth Pan-African Congress typically refers to an event held in Dar es Salaam, Tanzania in 1974 in the lineage of earlier Pan-African Congresses but in a different mode given the acceleration of decolonization from 1960 on. Some events were also unnamed. In one such case, we learned that the West African activist, editor, and teacher Garan Kouyaté held an event in Paris in 1934, and we internally referred to this as “Kouyaté’s Event.” While we kept running into Kouyaté’s name in other sources, we were unable to find substantially more information about that particular event. This became a common theme in our research: individuals clearly played important roles in the Pan-African movement but do not commonly appear among the most cited figures in scholarship on Pan-Africanist thought. These omissions suggest that there is still much more research on Pan-Africanism that needs to be done, but their inclusion in our data set offers researchers new names of figures whose influence on Pan-Africanism should be pursued.

Despite this challenge, the research process often delivered moments of validation, when the simple act of locating multiple obscure sources confirming an event made us grateful that we could prove it happened. Therefore, the work of creating the data set was itself a scholarly activity, using both primary and secondary sources to validate the existence of lesser-known Pan-Africanist gatherings that deserve better recognition. For example, in The Pan-African Movement (1974), Geiss introduces an event called “The Negro in the World Today.” Harold Moody, a Jamaican-born physician residing in London, hosted the event in July 1934 to coincide with a visit from a Gold Coast delegation, including the prince and politician Nana Ofori Atta. Geiss explains, “One of the motives given for convening was the racial discrimination which faced coloured workers and students in Britain” (1974, 357). This event, among others, led to the Fifth Pan-African Congress in October 1945 in Manchester, England. However, finding any details of who attended “The Negro in the World Today” proved fruitless, and we almost began to question whether the event was significant enough to be included in the data set. A bright moment in our research occurred when we found the event named in a newspaper article titled “Africans Hold Important Three-Day Conference in London” in the July 21, 1934 issue of The Pittsburgh Courier (ANP 1934, 2). Confirming the existence of this event was cause for celebration, and such moments made worthwhile the many excruciating hours of research that turned up nothing. All told, we identified close to seventy events within our timeframe that fit our criteria of explicitly creating space for Pan-African connections among Black participants from around the world.

More obstacles appeared as we worked to identify the names of delegates and other participants in these events. In some cases, sources identified only the names of the organizations being represented, not the names of the people from those organizations who were in attendance. Often, we had much more success identifying the numbers of delegates and attendees at events than locating their names. Knowing the numbers, however, gave us a sense of the percentage of attendee names that we had confirmed. For example, we know that there were over 200 delegates and 5,000 participants at the Fourth Pan-African Congress, held in New York in 1927, but we have successfully identified only twenty-six of those names. In our most successful case, the Conference on Africa, held in New York in 1944, we identified the names of all 112 delegates, as well as additional participants and observers.

Among the many names that we added to our data set, we encountered further discrepancies we had to address. Some of the same participants were listed under different names in multiple sources, requiring additional research to verify. In some cases, this was a matter of typos within the sources. For example, a participant named “William Fonaine” attended the First International Conference of Negro Writers and Artists, and a participant named “W. F. Fontaine” attended the Second International Conference of Negro Writers and Artists. We were able to confirm that William F. Fontaine attended both events. In other cases, delegates had changed their names, which was not unusual at the time. In some instances, people changed their names to embrace their African roots and resist the imposition of colonial languages on their identities. T. Ras Makonnen was born George Thomas N. Griffiths in 1900 but changed his name in 1935. Kwame Nkrumah, born Francis Nwia Kofi Nkrumah in 1909, changed his name to Kwame Nkrumah in 1945 (and later became the first Prime Minister and then first President of Ghana). In other cases, differences in non-Anglophone names reflected divergent transliteration practices. We chose to include delegates’ country or colony of origin as well, which introduced a further level of inconsistency. Of course, we encountered changes in names reflecting transitions from colony to independent nation, such as Gold Coast to Ghana. But there were more puzzling inconsistencies as well. In many cases this reflected the mobility of participants in Pan-Africanism, their shifting national allegiances, and/or their affiliation with multiple locales. For others, however, it reflects inconsistencies in archival materials. In perhaps the oddest case, we found “Miguel Francis Delanang” from Ethiopia attending the Bandung Conference and a “Miguel Francis Delanang” from Ghana at the same conference. Based on our research, this is the same person. While we have done our best to identify as many discrepancies as we could, we fully expect that others exist that we have not caught because they are less obvious, such as aliases or pseudonyms that we have not yet connected to another name. Therefore, we view our data set not as a static and finished object but a living, collaborative document for other researchers who want to contribute to it.
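
Approximate string matching can flag candidate duplicates of this kind for review, though it cannot resolve them. A brief sketch in base R using adist(), which computes the edit distance between strings:

# A one-letter typo sits a single edit away and is easy to flag
adist("Fonaine", "Fontaine")    # 1

# Changed or abbreviated names are another matter: no string metric
# connects "T. Ras Makonnen" to "George Thomas N. Griffiths," so flagged
# pairs (and unflagged ones) still require verification against sources
adist("T. Ras Makonnen", "George Thomas N. Griffiths")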

Although we could easily spend years continuing our research, we decided that we had substantial enough data for a subset of twenty-one events to begin prototyping our data visualizations. When we began the project, we were curious about the networks among the participants. Would a network show significant connections among participants? How dense would these networks be? Which figures would be the hubs in the network? Would they be the usual suspects, or might new voices emerge? To explore these questions, we created a force-directed graph—and the results were virtually meaningless. There was little density in the network and few connections among attendees. Light clustering in the network appeared around W.E.B. Du Bois, widely known as the father of Pan-Africanism, which was hardly surprising.
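
For readers who want to reproduce the shape of this experiment, the sketch below builds a bipartite person-event graph and projects it onto participants with the igraph package in R; the attendance table is invented for illustration rather than drawn from our data set:

library(igraph)

# One row per (participant, event) pair
attendance <- data.frame(
  person = c("Du Bois", "Du Bois", "Du Bois", "Padmore",
             "Delegate A", "Delegate B"),
  event  = c("Event 1900", "Event 1919", "Event 1945", "Event 1945",
             "Event 1919", "Event 1945")
)

# Mark event vertices, then project: two people are linked when they
# attended the same event
g <- graph_from_data_frame(attendance, directed = FALSE)
V(g)$type <- V(g)$name %in% attendance$event
people <- bipartite_projection(g)$proj1

# A force-directed (Fruchterman-Reingold) layout of a sparse projection
# produces the scattered, nearly meaningless picture described above
plot(people, layout = layout_with_fr(people))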

These disappointing results prompted several teachable moments about data and research design. We looked closely at our data set to understand why the network visualization seemed little more than noise. While we had expected to find participants attending more than one event, our twenty-one events gave us over one thousand names, the majority of whom attended only one event. Logically, it was unsurprising that better-known figures like Du Bois attended more events because they had access to the means to do so. Also, since our events spanned six decades punctuated by major events like World Wars I and II, the rise of the Soviet Union, and the beginning of decolonization, the power players in the movement changed as their investment in Pan-Africanism waxed and waned over time. We also knew, based on the information we had found about the total numbers of participants, that some of our data sets were incomplete—and may always be incomplete. Without accounting for the situatedness of the data we had curated, the results simply did not make sense.
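
The skew itself is visible in a simple tally of events attended per participant; the table here is again a stand-in for our own:

library(dplyr)

# One row per (participant, event) pair, as in the sketch above
attendance <- data.frame(
  person = c("Du Bois", "Du Bois", "Du Bois", "Padmore",
             "Delegate A", "Delegate B"),
  event  = c("Event 1900", "Event 1919", "Event 1945", "Event 1945",
             "Event 1919", "Event 1945")
)

# Count events per person, then count the counts: the long tail of
# one-event participants is what flattened our network
attendance %>%
  count(person, name = "events_attended") %>%
  count(events_attended, name = "n_people")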

We also recognized that our initial hypothesis about the existence of a network with well-defined connections was an erroneous assumption. Engagement of delegates with an event did not necessarily imply extended participation in the global dimensions of a movement. This realization led us to reconsider how we imagine what “participating” in a social movement means. In a conversation about these challenges, digital humanist Quinn Dombrowski suggested that perhaps what is most meaningful lies not in the network but in the brokenness of the network—in what a network visualization cannot represent. There may, for example, be forms of participation that cannot be captured within the bounds of face-to-face gatherings. These might be captured, instead, through correspondence between those engaged in Pan-Africanism. There might also be local effects of an individual’s attendance at an event that similarly would not manifest in a network visualization of participants. Rather than offer a clear picture of Pan-Africanism, our data set and meaningless network visualization opened up a new set of questions about the role of digital humanities in understanding Pan-Africanism.

This misstep was also an opportunity to explore the iterative nature of project design with students. Digital humanists, after all, are not unaccustomed to encountering failure and pivoting with research questions and methods to see what these methods make possible (Dombrowski 2019; Graham 2019). Engaging with iterative project design and negotiating the inevitable errors offers undergraduate students the opportunity to develop both creativity and problem solving skills (Pierrakos et al. 2010; Shernoff et al. 2011; Wood and Bilsborow 2013). We began to ask new questions about our data set and continued developing prototypes to see if they offered more meaningful insight on the data. One question that emerged was how to visualize the data in a way that would make the events and delegate information more easily navigable than reading a spreadsheet. We experimented with a sunburst data visualization, which shows hierarchical relationships between data. The top level of the hierarchy focused on decades, then years, then events, and finally participants. The sunburst visualization allowed us to organize the data and provide easy access to a complex data set, while also representing the data proportionally (which decades and years included the most events and which events included the largest numbers of delegates). Another question we considered was how our data might speak to the reach of Pan-Africanism both geographically and temporally. We created two maps to examine this question. The first, a static map, simply dropped pins at the locations of the nearly seventy events we had identified, revealing a broad geographical scope for Pan-Africanist gatherings—in the US, the Caribbean, Europe, Africa, and Asia. A second map, focusing on the twenty-one events for which we had identified a significant number of participants, mapped the attendees’ colonies and countries of origin. This dynamic heat map, animated to aggregate participant data over time, demonstrated the significant geographic scope of Pan-Africanism and its growth and spread over the first sixty years of the 20th century. Critically, we understood these visualizations as representations of particular elements of our data set, each shedding light on different details within the data but none showing the entire picture. While this is a feature of digital humanities scholarship that engages with data more generally—data visualizations are representations that slice and sample data sets, showing particular aspects of the data—it is a critical way of understanding data-driven approaches to African diaspora digital humanities.
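
A sunburst of this kind can be prototyped in R with the plotly package. The sketch below hand-codes a few decade, year, and event slices to show the labels-and-parents structure the chart expects; it is illustrative rather than our project code:

library(plotly)

# Each slice names its parent; empty parents sit at the top of the hierarchy
plot_ly(
  labels  = c("1900s", "1910s", "1900", "1919",
              "First Pan-African Conference", "First Pan-African Congress"),
  parents = c("", "", "1900s", "1910s", "1900", "1919"),
  type    = "sunburst"
)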

Teaching (and Learning) Data Literacy with African Diaspora Digital Humanities

Despite the challenges of this work, we came away from the experience with key insights for both scholarship of the African diaspora and pedagogy. Risam was reminded that when working in the context of a subject that has been marginalized in the broader landscape of scholarly knowledge production, we are inherently limited by what archives have preserved and what scholarship has covered. Our research is encumbered by what Risam (2018) has described as the omissions of the cultural record, and as much as we can undertake the important work—like curating data sets—to avoid reproducing and amplifying these gaps, we inevitably must contend with fragments of information and the larger question of what data can and cannot reveal about the African diaspora. Although this knowledge ultimately proved frustrating, it was profound for Mahoney and Nassereddine in their first foray into working with data. Risam also found the experience an instructive lesson in how to teach humanities students to engage with data when we miss the mark—e.g. when our presumptions about the network failed to pan out. While scientific methods in STEM prompt students to contemplate and negotiate failure, this is not typically foregrounded in humanities methodologies (Henry et al. 2019; Melo et al. 2019; Croxall and Warnick 2020). However, this project offered Risam the opportunity to encourage students to move away from assumptions and be open to the new insights that emerge from a challenge. As Mahoney and Nassereddine are both students pursuing their teaching licenses in English, Risam used this experience as an opportunity to model reflective practice for the heartbreaks we encounter in both digital humanities research and in teaching—sometimes one’s brilliant idea does not prove to be so in execution, and the appropriate response is not to shut down and yield to failure but to pivot—ask questions, reassess, and re-plan.

From this experience, Mahoney had the opportunity to delve deeply into archival research and scholarship on the African diaspora for the first time. She was also surprised to learn that many high school teachers and professors with whom she discussed her work had not heard of Pan-Africanism, reflecting the lack of coverage of this powerful movement within high school and college curricula. At the same time, projects like ours are examples of how we can engage students in addressing these gaps in both curriculum and the cultural record (Risam 2018; Hill and Dorsey 2019; Thompson and McIlnay 2019; Dallacqua and Sheahan 2020; Davila and Epstein 2020). This project also led Mahoney to realize that often we are left with more questions than answers. For example, what breakthroughs or achievements for the African diaspora did Pan-Africanist gatherings create? How were these participants, who faced travel or visa restrictions, funding their travels for these events? Mahoney also discovered the moments of serendipity, joy, and surprise that are part of the research experience, in the way it opens up a virtually limitless garden of forking paths to explore. She was particularly excited to uncover the significance of women to Pan-Africanism. The Fourth Pan-African Congress in New York in 1927, for example, was organized primarily by women. Although women’s names are not counted among the key figures of Pan-Africanism, through the curation of our data set, Mahoney identified that Amy Ashwood Garvey, the first wife of well-known Pan-Africanist Marcus Garvey, arguably played a more significant role in Pan-Africanism than her husband. Aside from one out-of-print biography, Lionel M. Yard’s Biography of Amy Ashwood Garvey, 1897–1969, there is little research focused on Ashwood Garvey, but Mahoney was able to reconstruct her role. Ashwood Garvey used her father’s credit to help Garvey found the Universal Negro Improvement Association in Jamaica, and she worked with Garvey in the US, where they were married and divorced within two years. After their separation, Ashwood Garvey committed herself to Pan-Africanism, co-founding the Nigerian Progress Union and the International African Friends of Abyssinia (later the International African Service Bureau). Additionally, she was a respected speaker at Pan-Africanist and other political events throughout Europe, the Caribbean, the United States, and Africa. After organizing the Fifth Pan-African Congress in Manchester, England in 1945, Ashwood Garvey spent several years in Africa speaking to women and children and raising money for schools, lecturing in Nigeria, residing for two years as a guest of the Asantehene in Kumasi, Ghana, and adopting two daughters in Monrovia, Liberia. Later in her life, she opened the Afro-Woman Service Bureau in London. Mahoney began to recognize the questions that emerged as a consequence of the relative lack of scholarly attention that Pan-Africanism has received in spite of its significance, which is a reflection of the biases within the cultural record—and in curricula—that favor knowledge production on canonical histories, figures, and movements of the Global North over the stories and voices of the Global South (Akua 2019; Lehner and Ziegler 2019; Span and Sanya 2019; Caldwell and Chávez 2020). This experience also led Mahoney to recognize the importance of incorporating the voices of Black writers and artists engaged in Pan-Africanism into her classroom as a high school teacher.

From her crash course in data literacy while working on the project, Mahoney also realized that digital humanities must be included in the high school English Language Arts classroom. Contextualizing her experiences in her prior coursework on English teaching methods and technology teaching methods, Mahoney came to understand digital humanities as a way of teaching data literacy to her own students. In Massachusetts, where Mahoney will be teaching, high school teachers are beholden to the Massachusetts Curriculum Frameworks, which are based on Common Core Standards. In 2016, Massachusetts released Digital Literacy Standards, but there has been no incentive, accountability, or professional development provided to support their implementation. African diaspora digital humanities, in particular, Mahoney recognized, facilitates students’ digital literacy while furthering the essential goal of expanding the canon in the classroom to ensure inclusive representation for all students. Focusing on the two together allows teachers to move past perceived barriers—such as the cost of adding new books to curriculum or lack of interest from colleagues—to work towards justice and equity through students’ engagement with data. In the context of working with informational texts in the Common Core Standards, data literacy encourages students to understand the ethics of data and data visualizations—How was data collected? Who collected the data? What questions were asked? What terminology was used to ask the questions and how might that have informed the response? What is the difference between quantitative and qualitative data? What implicit messages appear in data visualizations? What stories can they tell and what are their limits?

We, therefore, propose that African diaspora digital humanities has an essential role to play in pedagogy, particularly at the high school level. Reading and analyzing data sets and data visualization is a cross-disciplinary skill that needs to be incorporated across the curriculum, and English Language Arts teachers have a responsibility to ensure that students are prepared to understand data, as a cornerstone of literacy. Teaching data literacy holds the possibility of appealing to students who might struggle with or be less interested in literature, allowing teachers to leverage their engagement with data sets and data visualization into deeper connections to the practices of reading and analyzing texts, while building their knowledge of the social value of data literacy (Kjelvik and Schultheis 2019; Špiranec et al. 2019; Bergdahl et al. 2020). Furthermore, it acquaints students with the iterative nature of research and interpretation, while building their capacity to recognize failure and to redirect their efforts towards new avenues of inquiry that may be more fruitful. This is not a matter of “grit”—the troubling emphasis on underserved students’ attitudes towards perseverance rather than on the structural oppressions that impede learning (Barile 2014; Duckworth 2016; Stitzlein 2018)—but strengthening critical thinking skills, particularly when working with English language learners (Parris and Estrada 2019; Smith 2019; Yang et al. 2020). Working with data of the African diaspora also contributes to greater diversity within curricula, while encouraging students to recognize the power dynamics at play in whose voices and experiences are preserved in the artifacts that form our cultural record. Ensuring that students have the opportunity to learn about the Black writers and artists who were the power players of Pan-Africanism in the context of data literacy offers teachers the possibility of promoting equity in the classroom and developing students’ ability to use their knowledge to interpret data through an ethical lens beyond the classroom.

Bibliography

Adi, Hakim, and Marika Sherwood. 2003. Pan-African History: Political Figures from Africa and the Diaspora since 1787. London: Routledge.

Akua, Chike. 2019. “Standards of Afrocentric Education for School Leaders and Teachers.” Journal of Black Studies 51, no. 2 (December): 107–27. https://doi.org/10.1177/0021934719893572.

Anthonysamy, Lilian. 2020. “Digital Literacy Deficiencies in Digital Learning Among Undergraduates.” In Understanding Digital Industry, edited by Siska Noviaristanti, Hasni Mohd Hanafi, and Donny Trihanondo, 133–36. London: Routledge.

Associated Negro Press. 1934. “Africans Hold Important Three-Day Conference in London.” The Pittsburgh Courier, July 21, 1934.

Barile, Nancy. 2014. “Is ‘Getting Gritty’ the Answer? Can Grit Solve All Your Students’ Problems? This Urban High School Teacher Shares Her Experiences.” Educational Horizons 93, no. 2 (December): 8–9. https://doi.org/10.1177/0013175X14561418.

Battershill, Claire and Shawna Ross. 2017. Using Digital Humanities in the Classroom: A Practical Introduction for Teachers, Lecturers, and Students. London: Bloomsbury Academic.

Bergdahl, Nina, Jalal Nouri, and Uno Fors. 2020. “Disengagement, Engagement and Digital Skills in Technology-enhanced Learning.” Education and Information Technologies 25: 957–983. https://doi.org/10.1007/s10639-019-09998-w.

Browne, Simone. 2015. Dark Matters: On the Surveillance of Blackness. Durham, NC: Duke University Press.

Cairo, Alberto. 2019. How Charts Lie: Getting Smarter about Visual Information. New York: W. W. Norton.

Caldwell, Kia Lilly, and Emily Susanna Chávez. 2020. Engaging the African Diaspora in K–12 Education. New York: Peter Lang Publishing Group.

Carlson, Jake, Megan Sapp Nelson, Lisa R. Johnston, and Amy Koshoffer. 2015. “Developing Data Literacy Programs: Working with Faculty, Graduate Students and Undergraduates.” Bulletin of the Association for Information Science and Technology 41, no. 6 (August/September): 14–17.

Clement, Tanya. 2012. “Multiliteracies in the Undergraduate Digital Humanities Curriculum: Skills, Principles, and Habits of Mind.” In Digital Humanities Pedagogy: Practices, Principles, and Politics, edited by Brett D. Hirsch, 365–88. Cambridge: Open Book Publishers.

Croxall, Brian, and Quinn Warnick. 2020. “Failure.” In Digital Pedagogy in the Humanities: Concepts, Models, and Experiments, edited by Rebecca Frost Davis, Matthew K. Gold, Katherine D. Harris, and Jentery Sayers. https://digitalpedagogy.hcommons.org/keyword/Failure.

Dallacqua, Ashley K., and Annmarie Sheahan. 2020. “Making Space: Complicating a Canonical Text Through Critical, Multimodal Work in a Secondary Language Arts Classroom.” Journal of Adolescent & Adult Literacy 64, no. 1 (July/August): 67–77. https://doi.org/10.1002/jaal.1063.

Davila, Denise, and Elouise Epstein. 2020. “Contemporary and Pre–World War II Queer Communities: An Interdisciplinary Inquiry Via Multimodal Texts.” English Journal 110, no. 1 (September): 72–79.

Dombrowski, Quinn. 2019. “Towards a Taxonomy of Failure.” http://quinndombrowski.com/?q=blog/2019/01/30/towards-taxonomy-failure.

Downing, Kevin, Theresa Kwong, Sui-Wah Chan, Tsz-Fung Lam, and Woo-Kyung Downing. 2009. “Problem-based Learning and the Development of Metacognition.” Higher Education 57: 609–621.

Duckworth, Angela. 2016. Grit: The Power of Passion and Perseverance. New York: Scribner.

Earhart, Amy E. and Toniesha L. Taylor. 2016. “Pedagogies of Race: Digital Humanities in the Age of Ferguson.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein, 251–264. Minneapolis: University of Minnesota Press.

Eltis, David, et al. 2020. The Transatlantic Slave Trade Database. https://www.slavevoyages.org.

Gallon, Kim. 2016. “Making the Case for Black Digital Humanities.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein, 43–49. Minneapolis: University of Minnesota Press.

Gallon, Kim, et al. 2020. COVID Black. https://www.cla.purdue.edu/academic/sis/p/african-american/covid-black/team.html.

Geiss, Imanuel. 1974. The Pan-African Movement. New York: Africana Publishing Company.

Glover, Kaiama L. and Alex Gil. 2020. In the Same Boats. https://sameboats.org.

Graham, Shawn. 2019. Failing Gloriously and Other Essays. Grand Forks, ND: The Digital Press.

Hancock, Thomas, Stella Smith, Candace Timpte, and Jennifer Wunder. 2010. “PALs: Fostering Student Engagement and Interactive Learning.” Journal of Higher Education Outreach and Engagement 14, no. 4. https://openjournals.libs.uga.edu/jheoe/article/view/798/798.

Henry, Meredith A., Shayla Shorter, Louise Charkoudian, Jennifer M. Heemstra, and Lisa A. Corwin. 2019. “FAIL Is Not a Four-Letter Word: A Theoretical Framework for Exploring Undergraduate Students’ Approaches to Academic Challenge and Responses to Failure in STEM Learning Environments.” CBE—Life Sciences Education 18, no. 1 (Spring): 1–17. https://doi.org/10.1187/cbe.18-06-0108.

Hill, Craig, and Jennifer Dorsey. 2020. “Expanding the Map of the Literary Canon Through Multimodal Texts.” In Handbook of the Changing World Language Map, edited by Stanley D. Brunn and Roland Kehrein, 77–89. Cham, Switzerland: Springer.

Johnson, Jessica Marie. 2018. “Markup Bodies: Black [Life] Studies and Slavery [Death] Studies at the Digital Crossroads.” Social Text 36, no. 4 (2018): 57–79. https://doi.org/10.1215/01642472-7145658.

Johnston, Brenda, Peter Ford, Rosamond Mitchell, and Florence Myles. 2011. Developing Student Criticality in Higher Education: Undergraduate Learning in the Arts and Social Sciences. London: Bloomsbury Publishing.

Kjelvik, Melissa K., and Elizabeth H. Schultheis. 2019. “Getting Messy with Authentic Data: Exploring the Potential of Using Data from Scientific Research to Support Student Data Literacy.” CBE—Life Sciences Education 18, no. 2 (Summer): 1–18. https://doi.org/10.1187/cbe.18-02-0023.

Lehner, Edward and John R. Ziegler. 2019. “Re-Conceptualizing Race in New York City’s High School Social Studies Classrooms.” In Handbook of Research on Social Inequality and Education, edited by Sherrie Wisdom, Lynda Leavitt, and Cynthia Bice, 24–45. Hershey, Pennsylvania: IGI Global.

Meirelles, Isabel. 2013. Design for Information. Beverly, Massachusetts: Rockport Publishers.

Melo, Marijel, Elizabeth Bentley, Ken S. McAllister, and José Cortez. 2019. “Pedagogy of Productive Failure: Navigating the Challenges of Integrating VR into the Classroom.” Journal of Virtual Worlds Research 12, no. 1 (January): 1–20. https://doi.org/10.4101/jvwr.v12i1.7318.

Noble, Safiya Umoja. 2019. “Toward a Critical Black Digital Humanities.” In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein, 25–35. Minneapolis: University of Minnesota Press.

Pangrazio, Luci, and Julian Sefton-Green. 2020. “The Social Utility of ‘Data Literacy.’” Learning, Media, and Technology 45, no. 2 (June): 208–20. https://doi.org/10.1080/17439884.2020.1707223.

Parham, Marissa. 2019. “Sample | Signal | Strobe: Haunting, Social Media, and Black Digitality.” In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein, 101–122. Minneapolis: University of Minnesota Press.

Parris, Heather, and Lisa M. Estrada. 2019. “Digital Age Teaching for English Learners.” In The Handbook of TESOL in K‐12, edited by Luciana C. de Oliveira, 149–62. Hoboken, New Jersey: Wiley-Blackwell.

Pierrakos, Olga, Anna Zilberberg, and Robin Anderson. 2010. “Understanding Undergraduate Research Experiences through the Lens of Problem-based Learning: Implications for Curriculum Translation.” Interdisciplinary Journal of Problem-Based Learning 4, no. 2 (September): 35–62. https://doi.org/10.7771/1541-5015.1103.

Ramsden, Paul. 2003. Learning to Teach in Higher Education. New York: Routledge.

Rettberg, Jill Walker. 2020. “Situated Data Analysis: A New Method for Analysing Encoded Power Relationships in Social Media Platforms and Apps.” Humanities and Social Sciences Communications 7, no. 5 (2020). https://doi.org/10.1057/s41599-020-0495-3.

Risam, Roopika. 2018. New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy. Evanston, Illinois: Northwestern University Press.

Risam, Roopika. 2020. “‘It’s Data, Not Reality’: On Situated Data with Jill Walker Rettberg.” Nightingale, June 29, 2020. https://medium.com/nightingale/its-data-not-reality-on-situated-data-with-jill-walker-rettberg-d27c71b0b451.

Shernoff, Elisa S., Ane M. Maríñez-Lora, Stacy L. Frazier, Lara J. Jakobsons, Marc S. Atkins, and Deborah Bonner. 2011. “Teachers Supporting Teachers in Urban Schools: What Iterative Research Designs Can Teach Us.” School Psychology Review 40, no. 4 (December): 465–85. https://doi.org/10.1080/02796015.2011.12087525.

Smith, Blaine E. 2019. “Mediational Modalities: Adolescents Collaboratively Interpreting Literature through Digital Multimodal Composing.” Research in the Teaching of English 53, no. 3 (February): 197–222. https://search.proquest.com/docview/2196370157?pq-origsite=gscholar&fromopenview=true.

Span, Christopher M., and Brenda N. Sanya. 2019. “Education and the African Diaspora.” In The Oxford Handbook of History Education, edited by John L. Rury and Eileen H. Tamura, 399–412. New York: Oxford University Press.

Špiranec, Sonja, Denis Kos, and Michael George. 2019. “Searching for Critical Dimensions in Data Literacy.” In Proceedings of CoLIS, the Tenth International Conference on Conceptions of Library and Information Science, Ljubljana, Slovenia, June 16–19, 2019. Information Research 24, no. 4 (December). http://informationr.net/ir/24-4/colis/colis1922.html.

Stitzlein, Sarah M. 2018. “Teaching for Hope in the Era of Grit.” Teachers College Record 120, no. 3 (March): 1–28. http://www.tcrecord.org/Content.asp?ContentId=22085.

Thompson, Riki, and Matthew McIlnay. 2019. “Nobody Wants to Read Anymore! Using a Multimodal Approach to Make Literature Engaging.” Journal of English Language and Literature 7, no. 1 (January): 21–40. https://www.researchgate.net/publication/341312737.

Tufte, Edward. 2001. The Visual Display of Quantitative Information, 2nd edition. Cheshire, Connecticut: Graphics Press.

Vanhorn, Shannon, Susan M. Ward, Kimberly M. Weismann, Heather Crandall, Jonna Reule, et al. 2019. “Exploring Active Learning Theories, Practices, and Contexts.” Communication Research Trends 38, no. 3 (January): 5–25. https://search.proquest.com/docview/2308823162?fromopenview=true&pq-origsite=gscholar.

Wood, Denise, and Carolyn Bilsborow. 2015. “‘I am not a Person with a Creative Mind’: Facilitating Creativity in the Undergraduate Curriculum Through a Design-Based Research Approach.” In Leading Issues in e-Learning Research MOOCs and Flip: What’s Really Changing?, edited by Mélanie Ciussi, 79–107. United Kingdom: Academic Conferences and Publishing Limited.

Yang, Ya-Ting Carolyn, Yi-Chien Chen, and Hsiu-Ting Hung. 2020. “Digital Storytelling as an Interdisciplinary Project to Improve Students’ English Speaking and Creative Thinking.” Computer Assisted Language Learning. https://doi.org/10.1080/09588221.2020.1750431.

Acknowledgments

The authors gratefully acknowledge Krista White for thoughtful feedback on this essay; Gail Gasparich, Regina Flynn, Elizabeth McKeigue, and J.D. Scrimgeour at Salem State University for supporting the Digital Scholars Program; and Haley Mallett for her support preparing the manuscript.

About the Authors

Jennifer Mahoney is an MEd student at Salem State University. She received her Bachelor of Arts in English from Salem State and is currently completing her Master’s in Secondary Education. Mahoney is a teaching fellow at Revere High School, an urban public school just outside of Boston, Massachusetts. She was the inaugural recipient of the Richard Elia Scholarship, and her research interests include contemporary pedagogical approaches, underrepresented historical events, and digital humanities.

Roopika Risam is Chair of Secondary and Higher Education and Associate Professor of Secondary and Higher Education and English at Salem State University. Her research interests lie at the intersections of postcolonial and African diaspora studies, humanities knowledge infrastructures, and digital humanities. Risam’s monograph, New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy was published by Northwestern University Press in 2018. She is co-editor of Intersectionality in Digital Humanities (Arc Humanities/Amsterdam University Press, 2019). Risam’s co-edited collection The Digital Black Atlantic for the Debates in the Digital Humanities series (University of Minnesota Press) is forthcoming in 2021.

Hibba Nassereddine is an MEd student at Salem State University. She received her Bachelor of Arts in English from Salem State and is currently completing her Master’s in Secondary Education. Nassereddine is a teaching fellow at Holten Richmond Middle School in Danvers, Massachusetts.

A woman works at a laptop looking at an image of a snowy landscape.

Data Literacy in Media Studies: Strategies for Collaborative Teaching of Critical Data Analysis and Visualization

Abstract

This essay addresses challenges of teaching critical data literacy and describes a shared instruction model that encourages undergraduates at a large research university to develop critical data literacy and visualization skills. The model we propose originated as a collaboration between the library and an undergraduate media and cultural studies program, and our specific intervention is the development of a templated data-visualization instruction session that can be taught by many people each semester. The model has the dual purpose of supporting the major and serving as an organizational template, a structure for building resources and approaches to instruction that supports librarians as they develop replicable pedagogical strategies, including those informed by a cultural-critical lens. We intend our discussion for librarians who are teaching in an academic setting, particularly in contexts involving large-scale or programmatic approaches to teaching. The discussion is also useful to faculty in the disciplines who are considering partnering with the library to integrate aspects of data or information literacy into their programs.

Learning that emphasizes data literacy and encourages analysis within multimedia visualization platforms is a growing trend in higher education pedagogy. Because data as a form of evidence holds a privileged position in our cultural discourse, interdisciplinary undergraduate degree programs in the social sciences, humanities, and related disciplines increasingly incorporate data visualization, thus elevating data literacy alongside other established curricular outcomes. When well-conceived, critical data literacy instruction engenders a productive blend of theory and practice and positions students to examine how race-based bigotry, gender bias, colonial dominance, and related forms of oppression are implicated in the rhetoric of data analysis and visualization. Students can then create visualizations of their own that establish counternarratives or otherwise confront the locus of power in society to present alternative perspectives.

As scholarship in media, communications, and cultural studies pedagogy has established, data visualizations “reflect and articulate their own particular modes of rationality, epistemology, politics, culture, and experience,” so as to embody and perpetuate “ways of knowing and ways of organizing collective life in our digital age” (Gray et al. 2016, 229). Catherine D’Ignazio and Lauren F. Klein (2020, 10) explain this dialectic more pointedly in Data Feminism, arguing, “we must acknowledge that a key way power and privilege operate in the world today has to do with the word data itself,” especially the assumptions and uses of it in daily life. Critical instruction positions undergraduates to question how data, in its composition, analysis, and visualization, can often perpetuate an unjust socio-cultural status quo. Undergraduates who are introduced to frames for interpreting culture also need to be exposed to tools—literal and conceptual—that help them critique data visualizations. The goal is to enable a holistic critical literacy, through which students can find data, structure it with a research question in mind, and produce accurate, inclusive visualizations.

However, data instruction is challenging, and planning data learning within the context of an existing course requires an array of skills. Effective data visualization pedagogy demands that instructors locate example datasets, clean data to minimize roadblocks, and create sample visualizations to initiate student engagement with first-order cultural-critical concepts. These steps, a substantial time investment, are necessary for teaching that enables data novices to contend with the mechanics of data manipulation while remaining focused on social and political questions that surround data. When charged with developing data visualization assignments and instructional assistance, faculty often seek the support and expertise of librarians and educational technologists, who are located at the nexus of data learning within the university (Oliver et al. 2019, 243).

Even in cases where librarians and instructional support staff are well-positioned to assist, the demand for teaching data visualization can be overwhelming. It can become burdensome to deliver in-person instruction to cohort courses with a large student enrollment, across many sections and in successive semesters. In order to initiate and maintain an effective, multidisciplinary data literacy program, teaching faculty, librarians, and educational technologists must establish strong teaching partnerships that can be replicated and reimagined in multiple contexts.

This essay addresses some challenges of teaching critical data literacy and describes a shared instruction model that encourages undergraduates at a large research university to develop critical data literacy and visualization skills. Although anyone engaged in teaching critical data literacy can draw from this essay, we intend our discussion for librarians who are teaching in an academic setting, and particularly in contexts involving large-scale or programmatic approaches to teaching. In addition, we believe our essay is particularly pertinent to those designing program curricula within discipline-specific settings, as our ideas engage questions of determining scale, scope, and learning outcomes for effective undergraduate instruction.

The teaching model we propose originated as a collaboration between the New York University Libraries and NYU’s Media, Culture, and Communication (MCC) department, and our specific intervention is the development of an assignment involving data visualization for a Methods in Media Studies (MIMS) course. The distributed teaching model we describe has the dual purpose of supporting the major and serving as an organizational template, a structure for building resources and approaches to instruction that supports librarians as they develop replicable pedagogical strategies, including those informed by a cultural-critical lens. In this regard, we believe that collaborative instruction empowers librarians and faculty from many disciplines to develop their own data literacy competency while growing as teachers. And it enables the library to affect undergraduate learning throughout the university.

There is already an extensive body of research on critical data literacy instruction, including critical approaches to the technical elements of data visualization (Drucker 2014; Sosulski 2018; Engebretsen and Kennedy 2020). While we draw from that scholarly discussion, we focus instead on the outcomes of programmatic, extensible teaching partnerships between libraries and discipline-specific undergraduate programs. Along the way, we engage two crucial questions: What is the value of creating replicable lesson plans and materials to be taught repeatedly by an array of library staff? And how can the librarians who design these materials strike a balance between creating a step-by-step lesson plan for library instructors to follow and structuring a guided lesson flexible and capacious enough for instructors to have meaningful teaching encounters of their own?

Data Literacy in Undergraduate Education

Several curricular initiatives and assessment rubrics in higher education pedagogy recognize the need for students to develop fluency with digital media and quantitative reasoning, a precursor to effective data visualization. In 2005, the Association of American Colleges and Universities (AAC&U) began a decade-long initiative called Liberal Education and America’s Promise (LEAP), which resulted in an inventory of 21st-century learning outcomes for undergraduate education. Quantitative literacy is among these outcomes (Association of American Colleges and Universities 2020). A corresponding AAC&U rubric statement asserts that “[v]irtually all of today’s students … will need basic quantitative literacy skills such as the ability to draw information from charts, graphs, and geometric figures, and the ability to accurately complete straightforward estimations and calculations.” The rubric urges faculty to develop assignments that give students “contextualized experience” analyzing, evaluating, representing, and communicating quantitative information (Association of American Colleges and Universities 2020). The substance of the LEAP initiative informed the development of our collaborative teaching model, for it allowed us to ground our curricular interventions within larger university curricular trends that had already emerged.

Although quantitative literacy is important, there are other structures for teaching that tie data fluency and visualization to larger information-seeking practices. For this reason, we also turned to the Framework for Information Literacy for Higher Education, developed by the Association of College and Research Libraries (ACRL). The Framework embraces the concept of metaliteracy, which promotes metacognition and a critical examination of information in all its forms and iterations, including data visualization. One of the six frames posed by the document, “Information Creation as a Process,” closely aligns with data competency, including data visualization. This frame emphasizes that the information creation process can “result in a range of information formats and modes of delivery” and that the “unique capabilities and constraints of each creation process as well as the specific information need determine how the product is used.” Within the Framework, learning is measured according to a series of “dispositions,” or knowledge practices that describe the behaviors of those who have learned a concept. Here, the Framework is apropos, as students who see information creation as a process “value the process of matching an information need with an appropriate product” and “accept ambiguity surrounding the potential value of information creation expressed in emerging formats or modes” (ACRL 2016). The Framework recognizes that evolved undergraduate curricula must incorporate active, multimodal forms of analysis and production that synthesize information seeking, evaluation, and knowledge creation.

Other organizations and disciplines also advocate for quantitative literacy in the undergraduate curriculum. For instance, Locke (2017) discusses the relevance of data in the humanities classroom and points to ways undergraduate digital humanities projects can incorporate data analysis and visualization to extend inquiry and interpretation. And Berret and Phillips (2016, 13) recommend that every journalism degree program provide a foundational data journalism course, because interdisciplinary data instruction cultivates professionals “who understand and use data as a matter of course—and as a result, produce journalism that may have more authority [or] yield stories that may not have been told before.” In sum, LEAP, the ACRL Framework, and movements for data literacy in the disciplines influenced the Libraries’ collaboration with the Media, Culture, and Communication department and informed the effort to create and support a meaningful learning experience for students in this major.

Learning-by-Teaching: Structured, Programmatic Instruction and Libraries

Our collaborative model evolved from the conviction that structured, programmatic teaching can foster professional growth for librarians and library technologists. In addition to creating impactful learning for students, programmatic teaching provides a structure that allows educators to expand the contexts in which they can teach. In many cases, librarians who specialize in information literacy are less adroit with the concepts and mechanics of working with data. Teaching data as a form of information, then, necessarily requires a baseline of technical expertise.

Several studies published within the past decade indicate that learning with the intent to teach can lead to better understanding, regardless of the content in question. One such study finds that learners who expected to teach the material to which they were being introduced showed better acquisition than learners who expected only to take a test, theorizing that learning-by-teaching pushes the learner beyond essential processing to generative processing, which involves organizing content into a personally meaningful representation and integrating it with prior knowledge (Fiorella and Mayer 2013, 287). Another study finds that learners who expected to teach showed better organizational output and recall of main points than those who did not, suggesting that learners who anticipate teaching tend to put themselves “into the mindset of a teacher,” leading them to use preparation techniques (such as concept organizing, prioritizing, and structuring) that double as enhancements to a learner’s own encoding processes (Nestojko et al. 2014, 1046). This evidence bolsters our belief that learning-by-teaching is a good strategy for librarians to build foundational data literacy skills, and it informed the development of our program.

Development and Implementation of the Collaborative Teaching Model

Situated in NYU’s Steinhardt School of Culture, Education, and Human Development, the MCC program covers global and transcultural communication, media institutions and politics, and technology and society, among other related fields. MCC program administrators, who were looking to incorporate practical skills into what had previously been a theory-heavy degree, approached the library to co-develop instructional content that would expose students to applied data literacy and multimedia visualization platforms. The impetus for the program administrators to reach out to the library was their participation in a course enhancement grant program, which testifies to the lasting effects that school- or university-based curriculum initiatives can have on undergraduate learning. In this case, what emerged was a sustained teaching partnership. Though the support was refined over time, its core remained constant: individual sections of a media studies methods class would attend a librarian-led session that prepares students to evaluate data and construct a visualization exploring some element of media and political economy, grounded in an assigned reading which asserts that ownership of or access to media and communications infrastructure is intrinsically related to the well-being and development of countries around the world.

The class is a first-year requirement in Media, Culture, and Communication, one of NYU’s largest majors. The course tends to be taught by beginning doctoral students and is, by design, a highly fluid teaching environment. In early iterations of library support, we designed a module that attempted to have students perform a range of analysis and visualization tasks. Students were introduced to basic socio-demographic datasets and were invited to create a visualization investigating a research question of their choosing, provided that the question adhered generally to the themes of media and political economy. The assignment as initially constituted expected students to frame a question, find a dataset and clean it, choose a visualization platform, and generate one or more visualizations that imply a causal relationship between variables they had identified.

The learning outcomes and assignment developed in this initial sequence turned out to be too ambitious. The assignment had fairly loose parameters, which proved problematic, and the 75-minute class session could not provide sufficient preparation. Students struggled with developing viable research questions, finding datasets, and cleaning the data (the multivalent process of normalizing, reshaping, redacting, or otherwise configuring data to be ingested and visualized in online platforms without errors). Also, we had pointed them to an overwhelming array of data analysis software tools, including ESRI’s ArcMap, Carto, Plot.ly, Raw, and Tableau. We found they had great difficulty both selecting a tool and learning how to use it, in addition to the connected process of finding a dataset to visualize within it. The Libraries tried to accommodate these difficulties but ultimately realized that the module needed significant adjustment going forward, especially since the MCC department decided to expand the project to include up to ten sections of the course each semester.
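To make the cleaning step concrete, here is a minimal sketch in R (the language used elsewhere in this collection) of one common task: reshaping a wide table into the long, one-row-per-observation format that platforms such as Carto and Tableau expect. The dataset and column names are invented for illustration, not drawn from the course materials.

# A minimal sketch of one common "cleaning" task, using invented data:
# visualization platforms generally expect tidy, long-format tables, so a
# wide country-by-year table must be reshaped before upload.
library(dplyr)
library(tidyr)

media_wide <- tibble(
  country = c("Chile", "Ghana", "Vietnam"),
  `2015`  = c(64.3, 23.5, 52.7),  # e.g., percent of households online
  `2016`  = c(66.0, 27.8, NA)     # raw data often has missing values
)

media_long <- media_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "pct_online") %>%
  mutate(year = as.integer(year)) %>%
  drop_na(pct_online)  # redact rows a platform would reject

media_long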

Beyond their struggles with research questions, datasets, and tools, students also had apparent trouble connecting this work to the broader ideas of media and political economy intrinsic to the assignment. Informed by these first-round outcomes, we came together again to revise the instructional content and assignment. Taking our advice into account, the MCC teaching faculty and program administrators refined the learning outcomes as follows:

  • Become familiar with the principles, concepts, and language related to data visualization
  • Investigate the context and creation of a given dataset, and think critically about the process of creating data
  • Emphasize how online visualization platforms allow users to make aesthetic choices, which are part and parcel of the rhetoric of visualization

The librarians also created a student-facing online guide as a home base for the module and decided to distribute the teaching load by inviting specialists from the Libraries’ Data Services department to help teach the library sessions (MCC-UE 2019). And to provide a better lead-in to the library session, a preparatory lesson plan was developed for the MCC instructors to present in the class prior to the library visit.

After further feedback from program administrators and further reflection, we inserted a scaffolding component into the library session lesson plan to better prepare students for their assignment. The component involved comparing four sample visualizations created from the very same data, and it included questions for eliciting a discussion about the origins and constructions of data. Scenario-based exercises for creating visualizations in Google Sheets and Carto were also incorporated into the lesson, giving students practice before tackling the actual assignment. The assignment itself was also redesigned with built-in support. Students would no longer be expected to find their own dataset and attempt to clean extracted data, tasks that had caused them frustration and anxiety. Instead, they would choose from a handful of prescribed, pre-cleaned datasets. Data Services staff worked to remediate a set of interesting datasets in anticipation of the kinds of visualizations students would attempt. And rather than having to choose from a confusing array of data visualization tools, students would be directed to use only Google Sheets or Carto. Assuming the task of identifying, cleaning, and preparing datasets meant extra front-loaded work on the Libraries’ part, but it also freed students to focus on the higher-order activity of investigating the relationship between visualizing information and examining social or political culture.
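The spirit of the compare-the-visualizations component can be sketched in R, though the classroom exercise itself used prepared images; the figures below are invented for illustration. Rendering the same data with and without a zero baseline lets students see how a single aesthetic choice changes the visual argument a chart makes.

# Two charts from the very same (invented) data: where the y-axis starts
# changes the story the visualization appears to tell.
library(ggplot2)

ownership <- data.frame(
  region    = c("A", "B", "C"),
  pct_local = c(48, 52, 55)  # hypothetical share of locally owned outlets
)

p <- ggplot(ownership, aes(x = region, y = pct_local)) +
  geom_col(fill = "grey40") +
  labs(x = "Region", y = "Locally owned outlets (%)")

p                                      # zero baseline: modest differences
p + coord_cartesian(ylim = c(45, 56))  # truncated axis: an exaggerated gap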

Instructional Support from a Wide Community of Teachers: Growing a Base

Another issue at hand was the strain the project placed on the members of the Data Services team and the Communications Librarian, who taught all ten library sessions offered each semester. To achieve sustainability, a broader group of librarians would be needed to help teach the sessions, so the Data and Communications librarians decided to recruit other NYU librarians to participate as instructors. Most of the recruits were data novices, but they viewed the invitation as an opportunity to learn data basics, expand their instruction repertoire, and strengthen their teaching practice. Calling on colleagues to teach outside their comfort zone is a big ask, one that requires strong support and administrative buy-in, so recruits were provided with a thorough lesson plan, a comprehensive hands-on training session, and the opportunity to shadow more experienced instructors before teaching the module solo (MCC-UE 2019).

Including a more robust roster of instructors also gave us the ability to tie our lesson more closely to what was planned in the MIMS curriculum. A new reading was chosen by the media studies faculty: “Erasing Blackness: The Media Construction of ‘Race’ in Mi Familia, the First Puerto Rican Situation Comedy with a Black Family,” by Yeidy Rivero. The article grounds the students’ exploration of the relationship between media and political economy within the MIMS class, and it also provides a good entry point for exploring critical data literacy concepts. According to Rivero, the show Mi Familia deliberately represents a “flattened,” racially homogeneous “imagined community” of lower-middle-class black family life that erases Puerto Rico’s hybrid racial identity. This flattening, Rivero argues, is part and parcel of multidimensional efforts to “Americanize” Puerto Rico and align its culture with the interests of the U.S. Furthermore, since Puerto Rican media is regulated by the U.S. Federal Communications Commission (FCC) and owned by U.S. corporations, Puerto Ricans themselves had little recourse to question the portrayal of constructed racial identities in mainstream culture (Rivero 2002).

Students were instructed to complete the reading prior to the library session. During the session, the library instructor referred to the reading and introduced a dataset with particular relevance to it. The instructor engaged students in a discussion about the importance of reviewing the dataset description and variables in order to form a question that can reasonably be asked of the data. With students following along, the instructor then modeled how to use Google Sheets to manipulate the data and create a visualization that speaks to the question.
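The live demonstration used Google Sheets, but a rough R analog of the modeled workflow, with hypothetical variable names standing in for fields in the study data, might look like the following: tally a variable relevant to the question, then chart the result.

# A rough R analog of the modeled workflow, with hypothetical variable
# names: count respondents by primary media language, then chart it.
library(dplyr)
library(ggplot2)

respondents <- data.frame(
  media_language = c("English", "Spanish", "Both", "English", "Both")
)

respondents %>%
  count(media_language) %>%
  ggplot(aes(x = reorder(media_language, n), y = n)) +
  geom_col() +
  labs(title = "Respondents by Primary Media Language",
       x = "Media language", y = "Count of respondents") +
  coord_flip() +
  theme_minimal()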

The selected dataset resulted from a study of the experiences and expressions of racial identity by young adults who lived in first- and second-generation immigrant households in the New York City area during the late 1990s (Mollenkopf, Kasinitz, and Waters 2011). The timeframes of the article and the dataset line up well: the sitcom discussed in the article first aired in 1994 but had been picked up by Telemundo’s New York City area affiliates by the late 1990s, so it is quite possible that the show was on the air in the homes of study participants. The dataset, which is aggregated at the person level, includes variables about participants’ family and home context, patterns of socialization, exposure to media, and sense of self. In order to foreground the analytic process of looking at data, ascertaining its possibilities, and gesturing at potential visualizations, we created a simplified version of the raw data, which omits some columns and imputes other variables for easier use. To accompany this dataset, we also created some simple data visualizations in Google Sheets, ArcGIS Online, and Tableau that are intentionally “impoverished” and thus designed to elicit discussion from students about the claims the visualizations make.
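A compressed sketch of that preparation step, again with invented column names standing in for the actual study variables: a few usable columns are kept, a simpler derived variable is added, and a deliberately context-poor chart is drawn to seed the critique.

# Sketch of the data preparation, with invented column names: keep a few
# usable variables, derive a simpler one, and draw a deliberately
# "impoverished" chart (no title, no labels) for students to question.
library(dplyr)
library(ggplot2)

raw <- data.frame(
  id       = 1:4,
  born_us  = c(TRUE, FALSE, TRUE, FALSE),
  tv_hours = c(2, 5, 3, 4),
  notes    = c("...", "...", "...", "...")  # free text, omitted below
)

simplified <- raw %>%
  select(id, born_us, tv_hours) %>%
  mutate(generation = ifelse(born_us, "Second", "First"))

ggplot(simplified, aes(x = generation, y = tv_hours)) +
  geom_col() +   # silently sums hours per group
  theme_void()   # strips all labels and context, inviting critique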

Undoubtedly, these adjustments to the module led to students performing better on the assignment. Improvements to the lead-in session provided by the MCC instructors ensured that students arrived with context for the library workshop and an understanding of why the library was supporting the assignment. Basing the assignment on a specific article made it possible for librarians to model a way of bridging the theoretical concepts of the class to a question that could be asked of data. There was also more time for two pair-and-share discussions and for group work in Google Sheets and Carto, which addressed a fundamental and recurring difficulty in the students’ understanding of the assignment: asking an original question of a dataset that also addresses a larger theme of media and political economy.

From the standpoint of instructors in NYU Libraries, we also found that the model produced a stronger group of teachers. Several people who worked with sections of MIMS contributed ideas to the instructor manual and created ancillary slides and examples tailored to their own interest in the claims the Rivero article makes about racial and national identity. For us, this flexibility is an important element of the collaborative teaching model: it offers structure for those who are new to data analysis and visualization to teach effectively, yet it also contains enough pathways for discussion to be meaningful and personal, should individual instructors want to branch out in their own teaching.

Conclusion

Despite being familiar with technology, many students arrive at college without a holistic ability to interpret, analyze, and visualize data. Educators now recognize the need to provide foundational data literacy to undergraduates, and many teaching faculty look to the library for support in instructional design and implementation. In this article, we recognize that creating integrated, meaningful data learning lessons is a complex task, yet we believe that the collaborative teaching model can be applied in various disciplinary contexts. Sustainability of this model depends on equipping a wide range of librarians with the necessary data literacy skills, which can be achieved with a learning-by-teaching approach. After developing a teaching model that calls upon the expertise of teachers across the library, we gained important insights about maintaining the communication and support that make it sustainable, building the workshop itself, and balancing the labor that all of this requires.

Good communication and organization between the MCC department and the librarians were also key to maintaining the scalability of this instruction program. Given the heavy rotation of new teachers on both the library and MCC sides, we needed to provide module content that was streamlined and assignment requirements that were clear-cut in order to quickly onboard teachers to the goals, process, and output of the module. When recruiting library instructors, we emphasized that volunteers would not only build their data literacy skill set but would also expand their pedagogical knowledge and teaching range. Finally, to ensure that volunteer instructors have a successful experience, we provide support mechanisms such as a step-by-step lesson plan, thorough train-the-trainer sessions, opportunities to observe and team-teach before going solo, and a point person to contact with questions and concerns.

There is much hidden labor in all of this work. Robust student support for the course was also crucial, and it really took off when the MCC department created a dedicated student support team drawn from graduate assistants in the program. On the library side, communicating regularly with the MCC department, assessing and revising the learning objects, organizing and hosting train-the-trainer sessions, and scheduling all of the library visits take many hours of time and planning. This work should not be overlooked when considering a program of this scale.

A collaboration at this level can provide rich data literacy at scale to undergraduates, while also offering the chance for instructors in the library and in disciplinary programs to develop their own skills in numeracy and data visualization as they learn by teaching. Through time, effort, and dedicated maintenance, a program like this becomes a successful partnership that has a broad and demonstrated impact on student learning, strengthens ties between the library and the departments we serve, and allows librarians and data services specialists the opportunity to learn and grow from each other.

As for the learning objects themselves, we had the most success when we matched the scope of the assignment closely to the time and support students would have to complete it, and preparing a small selection of datasets for students in advance was very helpful in this regard. We also built in a full class session of preparation before the library visit, in which MCC teachers introduced the assignment, some principles of data visualization (via a slide deck prepared by the library’s Data Services department), and ways this method can connect to broader concepts of media analysis. This led to more effective learning for students. These changes to the student assignment, learning outcomes, and library lesson plan were developed through regular and structured assessments of the workshop: a survey of the instructors teaching the course, classroom visits to see the students’ final projects, and in-depth conversations with instructors about which aspects of the lesson plan were successful and which fell flat. Following each assessment, the MCC administrators and the librarians would meet to discuss and iterate on the learning objects. This process of gathering feedback on the workshop, reflecting on that information, and then revising the assignment enabled us to improve the teaching and learning experience over the years.

Bibliography

Association of American Colleges and Universities. n.d. “Essential Learning Outcomes.” Accessed June 2, 2020. https://www.aacu.org/essential-learning-outcomes.

Association of American Colleges and Universities (AAC&U). n.d. “VALUE Rubrics.” Accessed June 2, 2020. https://www.aacu.org/value/rubrics/quantitative-literacy.

Association of College & Research Libraries. 2016. “Framework for Information Literacy for Higher Education.” Accessed June 2, 2020. http://www.ala.org/acrl/standards/ilframework.

Berret, Charles and Cheryl Phillips. 2016. Teaching Data and Computational Journalism. New York: Columbia Journalism School. https://journalism.columbia.edu/system/files/content/teaching_data_and_computational_journalism.pdf.

D’Ignazio, Catherine and Lauren F. Klein. 2020. Data Feminism. Cambridge, Massachusetts: MIT Press. ProQuest Ebook Central.

Drucker, Johanna. 2014. Graphesis: Visual Forms of Knowledge Production. Cambridge, Massachusetts: Harvard University Press.

Engebretsen, Martin and Helen Kennedy, eds. 2020. Data Visualization in Society. Amsterdam: Amsterdam University Press. Project MUSE.

Fiorella, Logan, and Richard E. Mayer. 2013. “The Relative Benefits of Learning by Teaching and Teaching Expectancy.” Contemporary Educational Psychology 38, no. 4: 281–288. https://doi.org/10.1016/j.cedpsych.2013.06.001.

Gray, Jonathan, Lillian L. Bounegru, Stefania Milan, and Paolo Ciuccarelli. 2016. “Ways of Seeing Data: Toward a Critical Literacy for Data Visualizations as Research Objects and Research Devices.” In Innovative Methods in Media and Communication Research, edited by Sebastian Kubitschko and Anne Kaun, 227–252. Cham, Switzerland: Palgrave Macmillan. ProQuest Ebook Central.

Locke, Brandon T. 2017. “Digital Humanities Pedagogy as Essential Liberal Education: A Framework for Curriculum Development.” Digital Humanities Quarterly 11, no. 3. http://www.digitalhumanities.org/dhq/vol/11/3/000303/000303.html.

“MCC-UE 14 Media & Cultural Analysis.” 2019. New York University. https://guides.nyu.edu/mims/.

Mollenkopf, John, Philip Kasinitz, and Mary C. Waters. 2011. Immigrant Second Generation in Metropolitan New York. Ann Arbor: Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/ICPSR30302.v1.

Nestojko, John F., Dung C. Bui, Nate Kornell, and Elizabeth Ligon Bjork. 2014. “Expecting to Teach Enhances Learning and Organization of Knowledge in Free Recall of Text Passages.” Memory & Cognition 42, no. 7: 1038–1048. https://doi.org/10.3758/s13421-014-0416-z.

Oliver, Jeffry, Christine Kollen, Benjamin Hickson, and Fernando Rios. 2019. “Data Science Support at the Academic Library.” Journal of Library Administration 59, no. 3: 241–257. https://doi.org/10.1080/01930826.2019.1583015.

Rivero, Yeidy. M. 2002. “Erasing Blackness: The Media Construction of ‘Race’ in Mi Familia, the First Puerto Rican Situation Comedy with a Black Family.” Media, Culture & Society 24, no. 4: 481–497. https://doi.org/10.1177/016344370202400402.

Sosulski, Kristen. 2018. Data Visualization Made Simple: Insights into Becoming Visual. London: Routledge. ProQuest Ebook Central.

Acknowledgments

This teaching partnership, data, and associated resources would not have been possible without the work of many people in NYU Libraries and Data Services, as well as the NYU Steinhardt Methods in Media Studies program, including Bonnie Lawrence, Denis Rubin, Dane Gambrill, Yichun Liu, and Jamie Skye Bianco.

About the Authors

Andrew Battista is a Librarian for Geospatial Information Systems at New York University and teaches regularly on data visualization, geospatial software, and the politics of information.

Katherine Boss is the Librarian for Journalism and Media, Culture, and Communication at New York University, and specializes in information literacy instruction in media studies.

Marybeth McCartin is an Instructional Services Librarian at New York University, specializing in teaching information literacy fundamentals to early undergraduates.
