Issue Eighteen


Using Wikipedia in the Composition Classroom and Beyond: Encyclopedic “Neutrality,” Social Inequality, and Failure as Subversion

Cherrie Kwok, University of Virginia

Abstract

Instructors who use Wikipedia in the classroom typically focus on teaching students how to adopt the encyclopedia’s content practices so that they can improve their writing style and research skills, and conclude with an Edit-a-Thon that invites them to address Wikipedia’s social inequalities by writing entries about minority groups. Yet these approaches do not sufficiently highlight how Wikipedia’s social inequalities function at the level of language itself. In this article, I outline a pedagogical approach that invites students to examine the ways that language and politics shape, and are shaped by, each other on Wikipedia. In the case of my Spring 2020 class, my approach encouraged students to examine the relationship between Wikipedia’s content policies and white supremacy, and Wikipedia’s claims to neutrality. I also draw on the Edit-A-Thon that I organized at the end of the unit to show how instructors can extend a critical engagement with Wikipedia by building in moments of failure, in addition to success. In the process, my pedagogical approach reminds instructors—especially in composition and writing studies—to recognize that it is impossible to teach writing decoupled from the politics of language.

Wikipedia has become a popular educational tool over the last two decades, especially in the fields of composition and writing studies. The online encyclopedia’s “anyone-can-edit” ethos emphasizes the collective production of informative writing for public audiences, and instructors have found that they can use it to teach students about writing processes such as citation, collaboration, drafting, editing, research, and revision, in addition to stressing topics such as audience, tone, and voice (Purdy 2009, 2010; Hood 2009; Vetter, McDowell, and Stewart 2019; Xing and Vetter 2020). Composition courses that use Wikipedia have thus begun to follow a similar pattern. Students study Wikipedia’s history, examine the way its three content policies (neutral point of view [NPOV], no original research, and verifiability) govern how entries are written and what research sources are cited, and discuss the advantages and limits of Wikipedia’s open and anonymous community of volunteer contributors. Then, as a final assignment, instructors often ask students to edit an existing Wikipedia entry or write their own. By contrast, instructors in fields like cultural studies, feminism, and postcolonialism foreground Wikipedia’s social inequalities by asking students to examine how the predominance of white and male volunteer editors has resulted in a regrettable lack of entries about women and people of color (Edwards 2015; Pratesi, Miller, and Sutton 2019; Rotramel, Parmer, and Oliveira 2019; Montez 2017; Koh and Risam n.d.). When they ask students to edit or write Wikipedia entries, these instructors also invite students to focus on minority groups or underrepresented topics, thus transforming the typical final assignment into one that mirrors the Edit-A-Thons hosted by activist groups like Art + Feminism.

The socially conscious concerns that instructors in cultural studies, feminism, and postcolonialism have raised are compelling because they foreground Wikipedia’s power dynamics. When constructing my own first-year undergraduate writing course at the University of Virginia, then, I sought to combine these concerns with the general approach instructors in composition and writing studies are using. In the Fall 2019 iteration of my course, my students learned about topics like collaborative writing and citation, in addition to examining academic and journalistic articles about the encyclopedia’s racial and gender inequalities. The unit concluded with a two-day Edit-A-Thon focused on African American culture and history. The results seemed fabulous: my brilliant students produced almost 20,000 words on Wikipedia, and created four new entries—one about Harlem’s Frederick Douglass Book Center and three about various anti-slavery periodicals.[1] In their reflection papers, many conveyed that Edit-A-Thons could help minority groups and topics acquire greater visibility, and argued that the encyclopedia’s online format accelerates and democratizes knowledge production.

Yet, as an instructor, I felt that I had failed to sufficiently emphasize how Wikipedia’s content policies also played a role in producing the encyclopedia’s social inequalities. Although I had devoted a few classes to those policies, the approaches I adapted for my unit from the composition and the cultural studies fields meant my students only learned how to adopt those policies—not how to critically interrogate them. The articles we read also obscured how these policies relate to the encyclopedia’s social inequalities because scholars and journalists often conceptualize such inequalities in terms of proportion, describing how there is more or less information about this particular race or that particular gender (Lapowsky 2015; Cassano 2015; Ford 2011; Graham 2011; John 2011). Naturally, then, that’s how our students learn to frame the issue, too—especially when the Edit-A-Thons we organize for them focus on adding (or subtracting) content, rather than investigating how Wikipedia’s inequalities also occur due to the way the encyclopedia governs language. Similar observations have been raised by feminist instructors like Leigh Gruwell, who has found that Wikipedia’s policies “exclude and silence feminist ways of knowing and writing” and argued that current pedagogical models have not yet found ways to invoke Wikipedia critically (Gruwell 2015).[2]

What, then, might a pedagogical model that does invoke Wikipedia critically look like? I sought to respond to this question by creating a new learning goal for the Wikipedia unit in the Spring 2020 iteration of my course. This time around, I would continue to encourage my students to use Wikipedia’s content policies to deepen their understanding of the typical topics in a composition course, but I would also invite them to examine how those policies create, and then conceal, inequalities that occur at the linguistic level. In this particular unit, we concentrated on how various writers had used Wikipedia’s content policies to reinscribe white supremacy in an entry about UVa’s history. The unit concluded with an Edit-A-Thon where students conducted research on historical materials from UVa’s Albert and Shirley Small Special Collections Library to produce a Wikipedia page about the history of student activism at UVa. This approach did not yield the flashy, tweet-worthy results I saw in the Fall. But it is, to my mind, much more important, not only because it is influenced by postcolonial theorists such as Gayatri Spivak, who has demonstrated that neutrality or “objectivity” is impossible to achieve in language, but also because it prompted my students to discuss how language and politics shape, and are shaped by, each other. In the process, this approach also reminds instructors, especially in composition and writing studies, to recognize that it is impossible to teach writing decoupled from the politics of language. Indeed, Jiawei Xing and Matthew A. Vetter’s recent survey of 113 instructors who use Wikipedia in their classrooms reveals that they did so to develop their students’ digital communication, research, critical thinking, and writing skills, but only 40% of those instructors prompted their students to engage with the encyclopedia’s social inequalities as well (Xing and Vetter 2020). While the study’s participant pool is small and not all the instructors in that pool teach composition and writing courses, the results remain valuable because they suggest that current pedagogical models generally do not ask students to examine the social inequalities that Wikipedia’s content policies produce. This article therefore outlines an approach that I used to invite my students to explore the relationship between language and social inequalities on Wikipedia, with the hope that other instructors may improve upon, and then interweave, this approach into existing Wikipedia-based courses today.

Given that this introduction (and the argument that follows) stresses a set of understudied issues in Wikipedia, however, my overall insistence that we should continue using Wikipedia in our classrooms may admittedly seem odd. Wouldn’t it make more sense, some might ask, to support those who have argued that we should stop using Wikipedia altogether? Perhaps, but I would have to be a fool to encourage my students to disavow an enormously popular online platform that is amassing knowledge at a faster rate than any other encyclopedia in history, averages roughly twenty billion views a month, and shows no signs of slowing down (“Wikimedia Statistics – All Wikis” n.d.). Like all large-scale projects, the encyclopedia contains problems, but, as instructors, we would do better to equip our students with the skills to address such problems when they arise. The pedagogical approach that I describe in this paper empowers our students to identify some problems directly embedded in Wikipedia’s linguistic structures, rather than to study demographic data about the encyclopedia alone. Only when these internal dynamics are grasped can the next generation begin to truly reinvent one of the world’s most important platforms in the ways that they desire.

1. Wikipedia’s Neutrality Problem

Wikipedia’s three interdependent content policies—no original research, verifiability, and neutral point of view—are a rich opportunity for students to critically engage with the encyclopedia. Neutral point of view is the most non-negotiable policy of the three, and the Wikipedia community defines it as follows:

Neutral point of view (NPOV) … means representing fairly, proportionately, and, as far as possible, without editorial bias, all the significant views that have been published by reliable sources about the topic … [it means] carefully and critically analyzing a variety of reliable sources and then attempting to convey to the reader the information contained in them fairly, proportionally … including all verifiable points of view. (“Wikipedia: Neutral Point of View” 2020)

Brief guidelines like “avoid stating opinions as facts” and “prefer nonjudgmental language” (“Wikipedia: Neutral Point of View” 2020) follow this definition. My students in both semesters fixated on these points and the overall importance of eschewing “editorial bias” when engaging with NPOV for the first time, and for good reason. A writing style that seems to promise fact alone is particularly alluring to a generation that has grown up on fake news and photoshopped Instagram bodies. It is no surprise, then, that my students responded enthusiastically to the first writing exercise I assigned, which asked them to pick a quotidian object and describe it from what they understood to be a neutral point of view as defined by Wikipedia. The resulting pieces were well-written. When I ran my eyes over careful descriptions of lamps, pillows, and stuffed animals, I glimpsed what Purdy and the composition studies cadre have asserted: that writing for Wikipedia does, indeed, provoke students to write clearly and concisely, and to pay closer attention to grammar and syntax.

Afterwards, however, I asked my students to consider the other part of NPOV’s definition: that the writer should proportionally articulate multiple perspectives about a topic (“Wikipedia: Neutral Point of View” 2020). A Wikipedia entry about our planet, for example, would include fringe theories claiming the Earth is flat—but a writer practicing NPOV would presumably ensure that these claims do not carry what Wikipedians describe as “undue weight” over the scientific sources which demonstrate that the Earth is round. Interestingly, the Wikipedia community’s weighing rhetoric associates the NPOV policy with the archetypal symbol of justice: the scales. Wikipedians do not merely summarize information. By adopting NPOV, they appear to summarize information in the fairest way. They weigh out different perspectives and, like Lady Justice, their insistence on avoiding editorial bias seems to ensure that they, too, are metaphorically “blindfolded” to maintain impartiality.

Yet, my students and I saw how NPOV’s “weighing” process, and Wikipedia’s broader claims to neutrality, quickly unraveled when we compared a Wikipedia entry to another scholarly text about the same subject. Comparing and contrasting texts is a standard pedagogical strategy, but the exercise—when raised in relation to Wikipedia—is often used to emphasize how encyclopedic language differs from fiction, news, or other writing genres, rather than provoking a critical engagement with Wikipedia’s content policies. In my Spring 2020 course, then, I shifted the purpose of this exercise. This time around, we compared and contrasted two documents—UVa’s Wikipedia page and Lisa Woolfork’s “‘This Class of Persons’: When UVa’s White Supremacist Past Meets Its Future”—to study the limits of Wikipedia’s NPOV policy.

The two documents construct very different narratives of UVa’s history. My students and I discovered that their differences are most obvious when they discuss why Thomas Jefferson established UVa in Charlottesville, and the role that enslaved labor played in constructing the university:

| Wikipedia | Woolfork |
| --- | --- |
| In 1817, three Presidents (Jefferson, James Monroe, and James Madison) and Chief Justice of the United States Supreme Court John Marshall joined 24 other dignitaries at a meeting held in the Mountain Top Tavern at Rockfish Gap. After some deliberation, they selected nearby Charlottesville as the site of the new University of Virginia.[24] (“University of Virginia” 2020) | On August 1, 1818, the commissioners for the University of Virginia met at a tavern in the Rockfish Gap on the Blue Ridge. The assembled men had been charged to write a proposal … also known as the Rockfish Gap Report. … The commissioners were also committed to finding the ideal geographical location for this undertaking [the university]. Three choices were identified as the most propitious venues: Lexington in Rockbridge County, Staunton in Augusta County, and Central College (Charlottesville) in Albemarle County. … The deciding factor that led the commissioners to choose Albemarle County as the site for the university was exclusively its proximity to white people. The commissioners observed, “It was the degree of the centrality to the white population of the state which alone then constituted the important point of comparison between these places: and the board … are of the opinion that the central point of the white population of the state is nearer to the central college….” (Woolfork 2018, 99–100) |
| Like many of its peers, the university owned slaves who helped build the campus. They also served students and professors. The university’s first classes met on March 7, 1825. (“University of Virginia” 2020) | For the first fifty years of its existence, the university relied on enslaved labor in a variety of positions. In addition, enslaved workers were tasked to serve students personally. … Jefferson believed that allowing students to bring their personal slaves to college would be a corrosive influence. … [F]aculty members, however, and the university itself owned or leased enslaved people. (Woolfork 2018, 101) |

Table 1. Comparison of Wikipedia and Woolfork on why Thomas Jefferson established UVa in Charlottesville, and the role that enslaved labor played in constructing the university

Although the two Wikipedia extracts “avoid stating opinions as facts,” they expose how NPOV’s requirement that a writer weigh out different perspectives to represent all views “fairly, proportionately, and, as far as possible” is precisely where neutrality breaks down. In the first pair of extracts, the Wikipedia entry gives scant information about why Jefferson selected Charlottesville. Woolfork’s research, however, outlines that what contributors summarized as “some deliberation” was, in fact, a discussion about locating the university in a predominantly white area. The Wikipedia entry cites source number 24 to support the summary, but the link leads to a Shenandoah National Park guide that highlights Rockfish Gap’s location, instead of providing information about the meeting. Woolfork’s article, by contrast, carefully peruses the Rockfish Gap Report, which was produced in that meeting.

One could argue, as one of my students did, that perhaps Wikipedia’s contributors had not thought to investigate why Jefferson chose Charlottesville, and therefore did not know of the Rockfish Gap Report’s existence—and that is precisely the point. The Wikipedia entry’s inclusion of all three Presidents and the Chief Justice suggests that, when “weighing” different sources and pursuing a range of perspectives about the university’s history, previous contributors decided—whether knowingly or unconsciously—that describing who was at the meeting was a more important viewpoint. They fleshed out a particular strand of detail that would cement the university’s links to American nationalism, rather than inquire how and why Charlottesville was chosen. An entry that looks comprehensive, balanced, well-cited, and “neutral,” then, inevitably prioritizes certain types of information based on the information and the lines of inquiry its contributors choose to expand upon.

The second pair of extracts continues to reveal the fractures in the NPOV policy. Although Woolfork’s research reveals that the university used enslaved labor for the first 50 years, the only mention of slavery in the 10,000-word Wikipedia entry is buried within the three sentences I copied above, which undercuts NPOV’s claims to proportionality. Moreover, the first sentence carefully frames the university’s previous ownership of slaves as usual practice (“like many of its peers”). It is revealing that the sentence does not gaze, as it has done for the majority of the paragraph where this extract is located, on UVa alone, but expands outward to include all universities when conveying this specific fact about slavery. Interestingly, these facts about enslaved labor also come before the sentence about the university’s first day of classes. This means that the entry, which has so far proceeded in chronological fashion, suddenly experiences a temporal warp. It places the reader within the swing of the university’s academic life when it conveys that students and professors benefitted from enslaved labor, only to pull the reader backwards to the first day of classes in the next sentence, as though it were resetting the clock.

I want to stress that the purpose of this exercise was not to examine whether Woolfork’s article is “better” or “truer” than the Wikipedia entry, nor was it an opportunity to undercut the writers of either piece. Rather, the more complex concept my students grappled with was how the article and the entry demonstrate why the true/false—or neutral/biased—binaries that Wikipedia’s content policies rely on are themselves flawed. One could argue that both pieces about UVa are “true,” but the point is that they are slanted differently. The Wikipedia entry falls along an exclusively white axis, while Woolfork’s piece falls along multiple axes—Black and white—and demonstrates how both are actually intertwined due to the university’s reliance on enslaved labor. From a pedagogical standpoint, then, this exercise pushed my students in two areas often unexplored in Wikipedia assignments.

First, it demonstrated to my students that although phrases like “editorial bias” in Wikipedia’s NPOV guidelines presuppose an occasion where writing is impartial and unadulterated, such neutrality does not—and cannot—exist. Instructors in composition studies often ask students to practice NPOV writing for Wikipedia to improve their prose. This process, however, mistakenly conveys that neutrality is an adoptable position even though the comparative exercise I outlined above demonstrates neutrality’s impossibility.

Second, the comparative exercise also demonstrated to my students that Wikipedia’s inequalities occur at the linguistic level as much as the demographic level. Instructors in cultural studies frequently host Edit-A-Thons for their students to increase content about minority cultures and groups on Wikipedia, but this does not address the larger problem embedded in NPOV’s “weighing” of different perspectives. The guidelines state that Wikipedians must weigh perspectives proportionally—but determining what proportionality is to begin with is up to the contributors, as evinced by the two Wikipedia extracts I outlined. Every time a writer weighs different sources and perspectives to write an entry, what they are really doing is slanting their entry along certain axes of different angles, shapes, shades, and sizes. In the articles my students read, the most common axis Wikipedians use, whether knowingly or unconsciously, is one that centers white history, white involvement, and white readers. For example, as my students later discovered in Wikipedia’s “Talk” page for the entry about UVa, when two editors were discussing whether the university’s history of enslaved labor rather than its honor code should be mentioned in the entry’s lead, one editor claimed that the enslaved labor was not necessarily “the most critical information that readers need to know” (“University of Virginia” 2020).[3] Which readers? Who do Wikipedians have in mind when they use that phrase? In this instance, we see how “weighing” different perspectives not only leads one to elevate one piece of information over another, but also one type of reader over another.

As instructors, we need to raise these questions about audience, perspective, and voice in Wikipedia for our students. It is not so much that we have not covered these topics: we just haven’t sufficiently asked our students to engage with the social implications of these topics, like race (and, as Gruwell has said so cogently, gender). One way to begin doing so is by inflecting our pedagogical approaches with the discoveries in fields such as postcolonial studies and critical race studies. For example, my pedagogical emphasis on the impossibility of neutrality as I have outlined it above is partially indebted to critics like Gayatri Spivak. Her work has challenged the western critic’s tendency to appear as though they are speaking from a neutral and objective perspective, and demonstrated how these claims conceal the ways that such critics represent and re-present a subject in oppressive ways (Spivak 1995). Although her scholarship is rooted in deconstructionism and postcolonial theory, her concerns about objectivity’s relationship to white western oppression intersect with US-based critical race theory, where topics like objectivity are central. Indeed, as Michael Omi and Howard Winant have explained, racism in the United States proliferated when figures like Dr. Samuel Morton established so-called “objective” biological measures like cranial capacity to devalue the Black community while elevating the white one (Omi and Winant 1994).

I mention these critics not to argue that one must necessarily introduce a piece of advanced critical race theory or postcolonial theory to one’s students when using Wikipedia in the composition classroom (although this would of course be a welcome addition for whoever wishes to do so). After all, I never set Spivak’s “Can The Subaltern Speak?” as reading for my students. But what she revealed to me about the impossibility of neutrality in that famous paper prompted me to ask my students about Wikipedia’s NPOV policy in our class discussions and during our comparative exercise, rather than taking that policy for granted and inviting my students to adopt it. If instructors judiciously inflect their pedagogical practices with the viewpoints that critical race theory and postcolonial theory provide, then we can put ourselves and our students in a better position to see how digital writing on sites like Wikipedia is not exempt from the dynamics of power and oppression that exist offline. Other areas in critical race theory and postcolonial theory can also be brought to bear on Wikipedia, and I invite others to uncover those additional links. Disciplinary boundaries have inadvertently created the impression that discoveries in postcolonialism or critical race theory should concern only the scholars working within those fields, but the acute sensitivity towards power, marginalization, and oppression that these fields exhibit means that the viewpoints their scholars develop are relevant to any instructor who desires to foster a more socially conscious classroom.

2. The Edit-a-Thon: Failure as Subversion

Composition classes that use Wikipedia usually conclude with an assignment where students are invited to write their own entry. For cultural studies courses in particular, students address the lack of content about minority cultures or groups by participating in a themed Edit-A-Thon organized by their instructor. These Edit-A-Thons mirror those hosted by social justice organizations and activist groups outside of the university. These groups usually plan Edit-A-Thons in ways that guarantee maximum success for the participants because many are generously volunteering their time. Moreover, for many participants, these Edit-A-Thons are the first time they will write for Wikipedia, and if the goal is to inspire them to continue writing after the event, then it is crucial that their initial encounter with this process is user-friendly, positive, and productive. This is why these events frequently offer detailed tutorials on adopting Wikipedia’s content policies, and provide pre-screened secondary source materials that adhere to Wikipedia’s guidelines about “no original research” (which bars writing about topics for which no reliable, published sources exist) and “verifiability” (which requires citing sources that are reliable). Indeed, these thoughtful components at the Art + Feminism Edit-A-Thon I attended a few years ago at the Guggenheim Museum ensured that I had a smooth and intellectually stimulating experience when I approached Wikipedia as a volunteer writer for the first time. It was precisely because this early experience was so rewarding that Wikipedia leapt to the forefront of my mind when I became an instructor and was searching for ways to expand student interest in writing.

It is because I am now approaching Wikipedia as an instructor rather than a first-time volunteer writer, however, that I believe we can amplify critical engagement with the encyclopedia if we set aside “success” as an end goal. Of course, there is no reason why one cannot have critical engagement and success as dual goals, but when I was organizing the Edit-a-Thon in my class, I noticed that building in small instances of “failure” enriched the encounters that my students had with Wikipedia’s content policies.

The encyclopedia stipulates that one should not write about organizations or institutions that they are enrolled in or employed by, so I could not invite my students to edit the entry about UVa’s history itself. Instead, I invited them to create a new entry about the history of student activism at UVa using materials at our library.[4] When I was compiling secondary sources for my students, however, I was more liberal with this list in the Spring than I was in the Fall. Wikipedians have long preferred secondary sources like articles in peer-reviewed journals, books published by university presses or other respected publishing houses, and mainstream newspapers (“Wikipedia: No Original Research” 2020) to ensure that writers typically center academic knowledge when building entries about their topic. Thus, like the many social justice and non-profit organizations who host Edit-A-Thons, for Fall 2019 I pre-screened and curated sources that adhered to Wikipedia’s policies so that my students could easily draw from them for the Edit-A-Thon.

In Spring 2020, however, I invited my students to work with a range of primary and secondary sources, meaning that some historical documents, like posters, zines, and other paraphernalia, either required different reading methods than academically written secondary sources or were impossible to cite because to write about them would constitute “original research.” Experiencing the failure to assimilate other documents and forms of knowledge that are not articulated as published texts can help students interrogate Wikipedia’s lines of inclusion and exclusion, rather than simply taking them for granted. For example, during one particularly memorable conversation with a student who was studying hand-made posters belonging to student activist groups that protested during UVa’s May Days strikes in 1970, she said that she knew she couldn’t cite the posters or their contents, but asked: “Isn’t this history, too? Doesn’t this count?”

By the end of the Spring Edit-a-Thon, my students produced roughly the same amount of content as the Fall class, but their reflection papers suggested that they had engaged with Wikipedia from a more nuanced perspective. As one student explained, a Wikipedia entry may contain features that signal professional expertise, like clear and formal prose or a thick list of references drawn from books published by university presses and peer-reviewed journals, but still exclude or misconstrue a significant chunk of history without seeming to do so.

A small proportion of my students, however, could not entirely overcome one particular limitation. Some continued describing Wikipedia’s writing style as neutral even after asserting that neutrality in writing was impossible in previous pages of their essay. It is possible that this dissonance occurred accidentally, or because such students have not yet developed a vocabulary to describe what that style was even when they knew that it was not neutral. My sense, however, is that this dissonance may also reflect the broader (and, perhaps, predominantly white) desire for the fantasy of impartiality that Wikipedia’s policies promise. Even if it is more accurate to accept that neutrality does not exist in the encyclopedia, this knowledge may create discomfort because it highlights how one has always already taken up a position on a given topic even when one believes one has been writing “neutrally” about it, especially when that topic is related to race. Grasping this particular point is perhaps the largest challenge facing both the student and the instructor in the pedagogical approach I have outlined.

Notes

[1] The results for the Fall 2019 Edit-A-Thon are here. The results for the Spring 2020 Edit-A-Thon are here.

[2] Some of those in composition studies, like Paula Patch and Matthew A. Vetter, partially address Gruwell’s call. Patch, for example, has constructed a framework for critically evaluating Wikipedia that prompts students to focus on authorship credibility, reliability, interface design, and navigation (Patch 2010), often by comparing various Wikipedia entries to other scholarly texts online or in print. By contrast, Vetter’s unit on Appalachian topics on Wikipedia focused on the negative representations of the region within a larger course that sought to examine the way that Appalachia is continually marginalized in mainstream media culture (Vetter 2018).

[3] An extract from the Talk page conversation:

Natureium removed the mention of slavery from the lead as undue. I don’t see why that fact would be undue, but the dozens and dozens of other facts in the lead are not. I mean, the university is known for its student-run honor code? Seriously? (None of the sources in the section on that topic seem to prove that fact.) In addition, I see language in the lead such as “UVA’s academic strength is broad”—if there’s work to be done on the lead, it should not be in the removal of a foundation built with slave labor. If anything, it balances out what is otherwise little more than a jubilation of UvA that could have been written by the PR department. Drmies (talk) 17:40, 18 September 2018 (UTC)

I think this is an area where we run into a difficult problem that plagues projects like Wikipedia that strive to reflect and summarize extant sources without publishing original research. As a higher ed scholar I agree that an objective summary of this (and several other U.S.) university [sic] would prominently include this information. However, our core policies restrict us from inserting our own personal and professional judgments into articles when those judgments are not also matched by reliable sources. So we can’t do this until we have a significant number of other sources that also do this. (I previously worked at a research center that had a somewhat similar stance where the director explained this kind of work to some of us as “we follow the leading edge, we don’t make it.”) ElKevbo (talk) 19:36, 18 September 2018 (UTC)

But here, UvA has acknowledged it, no? Drmies (talk) 20:52, 18 September 2018 (UTC)

Yes and there are certainly enough reliable sources to include this information in the article. But to include the information in the lede is to assert that it’s the most critical information that readers need to know about this subject and that is a very high bar that a handful of self-published sources are highly unlikely to cross. Do scholars and other authors include “was built by slaves” when they first introduce this topic or summarize it? If not, we should not do so. ElKevbo (talk) 21:27, 18 September 2018 (UTC)

[4] The COVID-19 pandemic, which began toward the end of our Edit-A-Thon, meant that my students and I were unable to clean up the draft page and sufficiently converse with other editors who had raised various concerns about notability and conflict of interest, so it is not yet published on Wikipedia. We hope to complete this soon. In the meantime, I want to note that had the pandemic not occurred, I would have presented the concerns of these external editors to my students, and used their comments as another opportunity to learn more about the way that Wikipedia prioritizes certain types of knowledge. The first concern was the belief that the history of student activism at UVa was not a notable enough topic for Wikipedia because there was not enough general news coverage about it. Although another editor later refuted this claim, the impulse to rely on news coverage to determine whether a topic was notable enough is interesting within the context of student activism, and other social justice protests more broadly. Activist movements are galvanized by the very premise that a particular minority group or issue has not yet been taken seriously by those in power, or by the majority of a population. Some protests, like the Black Lives Matter movement and Hong Kong’s “Revolution of our Times,” have gained enough news coverage across the globe to count as notable topics. Does that mean, however, that protests on a smaller scale, and with less coverage, are somehow less important?

The second concern about conflict of interest also raises another question: Does the conflict of interest policy prevent us (and others) from fulfilling UVa’s institutional responsibility to personally confront our university’s close relationships to enslaved labor, white supremacy, and colonization, and foreground the activist groups and initiatives within UVa that have tried to dismantle these relationships? If so, will—or should—Wikipedia’s policy change to accommodate circumstances like this? These are questions that I wish I had the opportunity to pose to my students.

Bibliography

Cassano, Jay. 2015. “Black History Matters, So Why Is Wikipedia Missing So Much Of It?” Fast Company, January 29, 2015. https://www.fastcompany.com/3041572/black-history-matters-so-why-is-wikipedia-missing-so-much-of-it.

Edwards, Jennifer C. 2015. “Wiki Women: Bringing Women Into Wikipedia through Activism and Pedagogy.” The History Teacher 48, no. 3: 409–36.

Ford, Heather. 2011. “The Missing Wikipedians.” In Critical Point of View: A Wikipedia Reader, edited by Geert Lovink and Nathaniel Tkacz, 258–68. Amsterdam: Institute of Network Cultures.

Graham, Mark. 2011. “Palimpsests and the Politics of Exclusion.” In Critical Point of View: A Wikipedia Reader, edited by Geert Lovink and Nathaniel Tkacz, 269–82. Amsterdam: Institute of Network Cultures.

Gruwell, Leigh. 2015. “Wikipedia’s Politics of Exclusion: Gender, Epistemology, and Feminist Rhetorical (In)Action.” Computers and Composition 37: 117–31.

Hood, Carra Leah. 2009. “Editing Out Obscenity: Wikipedia and Writing Pedagogy.” Computers and Composition Online. http://cconlinejournal.org/wiki_hood/wikipedia_in_composition.html.

John, Gautam. 2011. “Wikipedia In India: Past, Present, Future.” In Critical Point of View: A Wikipedia Reader, edited by Geert Lovink and Nathaniel Tkacz, 283–87. Amsterdam: Institute of Network Cultures.

Koh, Adeline, and Roopika Risam. n.d. “The Rewriting Wikipedia Project.” Postcolonial Digital Humanities (blog). Accessed July 3, 2020. https://dhpoco.org/rewriting-wikipedia/.

Lapowsky, Issie. 2015. “Meet the Editors Fighting Racism and Sexism on Wikipedia.” Wired, March 5, 2015. https://www.wired.com/2015/03/wikipedia-sexism/.

Montez, Noe. 2017. “Decolonizing Wikipedia through Advocacy and Activism: The Latina/o Theatre Wikiturgy Project.” Theatre Topics 27, no. 1: 1–9.

Omi, Michael, and Howard Winant. 1994. Racial Formation in the United States. 2nd ed. New York: Routledge.

Patch, Paula. 2010. “Meeting Student Writers Where They Are: Using Wikipedia to Teach Responsible Scholarship.” Teaching English in the Two-Year College 37, no. 3: 278–85.

Pratesi, Angela, Wendy Miller, and Elizabeth Sutton. 2019. “Democratizing Knowledge: Using Wikipedia for Inclusive Teaching and Research in Four Undergraduate Classes.” Radical Teacher: A Socialist, Feminist, and Anti-Racist Journal on the Theory and Practice of Teaching 114: 22–34.

Purdy, James. 2009. “When the Tenets of Composition Go Public: A Study of Writing in Wikipedia.” College Composition and Communication 61, no. 2: 351–73.

———. 2010. “The Changing Space of Research: Web 2.0 and the Integration of Research and Writing Environments.” Computers and Composition 27: 48–58.

Rotramel, Ariella, Rebecca Parmer, and Rose Oliveira. 2019. “Engaging Women’s History through Collaborative Archival Wikipedia Projects.” The Journal of Interactive Technology and Pedagogy 14 (January). https://jitp.commons.gc.cuny.edu/engaging-womens-history-through-collaborative-archival-wikipedia-projects/.

Spivak, Gayatri Chakravorty. 1995. “Can the Subaltern Speak?” In The Post-Colonial Studies Reader, edited by Bill Ashcroft, Gareth Griffiths, and Helen Tiffin. 28–37. 2nd ed. Oxford: Routledge.

Wikipedia. 2020. “University of Virginia.” Last modified 2020. In Wikipedia. https://en.wikipedia.org/w/index.php?title=University_of_Virginia&oldid=965972109.

Vetter, Matthew A. 2018. “Teaching Wikipedia: Appalachian Rhetoric and the Encyclopedic Politics of Representation.” College English 80, no. 5: 397–422.

Vetter, Matthew A., Zachary J. McDowell, and Mahala Stewart. 2019. “From Opportunities to Outcomes: The Wikipedia-Based Writing Assignment.” Computers and Composition 59: 53–64.

Wikipedia. “Wikimedia Statistics – All Wikis.” n.d. Accessed October 13, 2020. https://stats.wikimedia.org/#/all-projects.

Wikipedia. “Wikipedia:Neutral Point of View.” 2020. https://en.wikipedia.org/w/index.php?title=Wikipedia:Neutral_point_of_view&oldid=962777774.

Wikipedia. “Wikipedia:No Original Research.” 2020. https://en.wikipedia.org/w/index.php?title=Wikipedia:No_original_research&oldid=966320689.

Woolfork, Lisa. 2018. “‘This Class of Persons’: When UVA’s White Supremacist Past Meets Its Future.” In Charlottesville 2017: The Legacy of Race and Inequity, edited by Claudrena Harold and Louis Nelson. Charlottesville: UVA Press.

Xing, Jiawei, and Matthew A. Vetter. 2020. “Editing for Equity: Understanding Instructor Motivations for Integrating Cross-Disciplinary Wikipedia Assignments.” First Monday 25, no. 6. https://doi.org/10.5210/fm.v25i6.10575.

Acknowledgments

My thanks must go first to John Modica, my wonderful friend and peer. I am so grateful for his insightful suggestions and constant support when I was planning this Wikipedia unit, for agreeing to pair up his students with mine for the ensuing Spring 2020 Edit-A-Thon and for one of our discussion sessions, and for introducing me to Lisa Woolfork’s excellent article when I was searching for a text for the compare and contrast exercise. I am also indebted to UVa’s Wikimedian-in-Residence, Lane Rasberry, and UVa Library’s librarians and staff—Krystal Appiah, Maggie Nunley, and Molly Schwartzburg—for their help when I hosted my Edit-A-Thons; Michael Mandiberg and Linda Marci for their detailed and rigorous readers’ comments; John Maynard for his smart feedback; Brandon Walsh for his encouragement from start to finish; Kelly Hammond, Elizabeth Alsop, and the editorial staff at JITP; UVa’s Writing and Rhetoric Program for their support; and all of my ENWR students in Fall 2019 and Spring 2020, and John Modica’s Spring 2020 ENWR students as well.

About the Author

Cherrie Kwok is a PhD Candidate and an Elizabeth Arendall Tilney and Schuyler Merritt Tilney Jefferson Fellow at the University of Virginia. She is also the Graduate English Students Association (GESA) representative to UVa’s Writing and Rhetoric Program for the 2020–21 academic year. Her interests include global Anglophone literatures (especially from the Victorian period onwards), digital humanities, poetry, and postcolonialism, and her dissertation examines the relationship between anti-imperialism and late-Victorian Decadence in the poetry and prose of a set of writers from Black America, the Caribbean, China, and India. Find out more about her here.

A sepia-toned stereoscopic image from the turn of the twentieth century depicts a woman in a drawing room, herself looking into a stereoscope.

Issue Eighteen

Interdisciplinarity and Teamwork in Virtual Reality Design

Ole Molvig, Vanderbilt University

Bobby Bodenheimer, Vanderbilt University

Abstract

Virtual Reality Design has been co-taught annually at Vanderbilt University since 2017 by professors Bobby Bodenheimer (Computer Science) and Ole Molvig (History, Communications of Science and Technology). This paper discusses the pedagogical and logistical strategies employed during the creation, execution, and subsequent reorganization of this course through multiple offerings. This paper also demonstrates the methods and challenges of designing a team-based project course that is fundamentally structured around interdisciplinarity and group work.

Introduction

What is virtual reality? What can it do? What can’t it do? What is it good/bad for? These are some of the many questions we ask on the first day of our course, Virtual Reality Design (Virtual Reality for Interdisciplinary Applications from 2017–2018). Since 2017, professors Ole Molvig of the History Department and Bobby Bodenheimer of Computer Science and Electrical Engineering have been co-teaching this course annually to roughly 50 students at a time. With each offering of the course, we have significantly revamped our underlying pedagogical goals and strategies based upon student feedback, the learning literature, and our own experiences. What began as a course about virtual reality has become a course about interdisciplinary teamwork.

Both of those terms, interdisciplinarity and teamwork, have become deeply woven into our effort. While a computer scientist and a historian teach the course, up to ten faculty mentors from across the university participate as “clients.” The course counts toward the computer science major’s project-class requirement, but nearly half the enrolled students are not CS majors. Agile design and group mechanics require organizational and communication skills above all else. And the projects themselves, as shown below, vary widely in topic and demands, requiring flexibility, creativity, programming, artistry, and most significantly, collaboration.

This focus on interdisciplinary teamwork, and not just in the classroom, has led to a significant, if unexpected, outcome: the crystallization of a substantial community of faculty and students engaging in virtual reality-related research from a wealth of disciplinary viewpoints. Equipment purchased for the course remains active and available throughout campus. Teaching projects have grown into research questions and collaborations. A significant research cluster in digital cultural heritage was formed not as a result of, but in synergy with, the community of class mentors, instructors, and students.

Evolution of the Course

Before offering the joint course, both Bodenheimer (CS) and Molvig (History) had offered single-discipline VR-based courses.

From the Computer Science side, Bodenheimer had taught a full three-credit course on virtual reality to computer science students. In lecture and pedagogy this course covered a fairly standard approach to the material for a one-semester course, as laid out by the Burdea and Coiffet textbook or the more recent (and applicable) Lavalle textbook (Lavalle 2017). Topically, the course covered such material as virtual reality hardware, displays, sensors, geometric modeling, three-dimensional transformations, stereoscopic viewing, visual perception, tracking, and the evaluation of virtual reality experiences. The goal of the course was to teach the computer science students to analyze, design, and develop a complex software system in response to a set of computing requirements and project specifications that included usability and networking. The course was also project-based, with teams of students completing the projects. Thus it focused on collaborative learning, and teamwork skills were taught as part of the curriculum, since there is significant work that shows these skills are best taught and do not emerge spontaneously (Kozlowski and Ilgen 2006). This practice allowed a project of significant complexity to be designed and implemented over the course of the semester, giving a practical focus to most of the topics covered in the lectures.

From History, Molvig offered an additional one-credit “lab course” option for students attached to a survey of The Scientific Revolution. This lab option offered students the opportunity to explore the creation of and meaning behind historically informed reconstructions or simulations. The lab gave students their first exposure to a nascent technology alongside a narrative context in which to guide their explorations. At the time of this course offering, Vanderbilt was increasing its commitment to the digital humanities, and this course allowed both its instructor and students to study the contours of this discipline as well. While this first offering of a digital lab experience lacked the firm technical grounding and prior coding experience of the computer science offering, the shared topical focus (the scientific revolution) made for boldly creative and ambitious projects within a given conceptual space.

Centering Interdisciplinarity

Unlike Bodenheimer, Molvig did not have a career-long commitment to the study of virtual reality. Molvig’s interest in VR comes rather from a science studies approach to emergent technology. And in 2016, VR was one of the highest profile and most accessible emergent technologies (alongside others such as artificial intelligence, machine learning, CRISPR, blockchain, etc.). For Molvig, emergent technologies can be pithily described as those technologies that are about to go mainstream, that many people think are likely to be of great significance, but no one is completely certain when, for whom, how, or really even if, this will happen.

For VR then, in an academic setting, these questions look like this: Which fields is VR best suited for? Up to that point, it was reasonably common in computer science and psychology, and relatively rare elsewhere. How might VR be integrated into the teaching and research of other fields? How similar or dissimilar are the needs and challenges of these different disciplines’ pedagogical and research contexts?

Perhaps most importantly, how do we answer these questions? Our primary pedagogical approach crystallized around two fundamental questions:

1. How can virtual reality inform the teaching and research of discipline X?
2. How can discipline X inform the development of virtual reality experiences?

Our efforts to answer these questions led to the core feature that has defined our Virtual Reality Design course since its inception: interdisciplinarity. Rather than decide for whom VR is most relevant, we attempted to test it out as broadly as possible, in collaboration with as many scholars as possible.

Our course is co-taught by a computer scientist and a humanist. Furthermore, we invite faculty from across campus to serve as “clients,” each with a real-world, discipline-specific problem toward which virtual reality may be applicable. While Molvig and Bodenheimer focused on both questions, our faculty mentors focused on question 1: Is VR surgery simulation an effective tool? Can interactive, immersive 3D museums provide users new forms of engagement with cultural artifacts? How can VR and photogrammetry impact the availability of remote archeological sites? We will discuss select projects below, but as of our third offering of this course, we have had twenty-one different faculty serve as clients representing twelve different departments or schools, ranging from art history to pediatrics and chemistry to education. A full list of the twenty-four unique projects may be found in Appendix 1.

At the time of course planning, Vanderbilt began a program of University Courses, encouraging co-taught, cross-disciplinary teaching experiments, incentivizing each with a small budget, which allowed us to purchase the hardware necessary to offer the course. One of our stated outcomes was to increase access to VR hardware, and we have intentionally housed the equipment purchased throughout campus. Currently, most VR hardware available for campus use is the product of this course. Over time, purchases from our course have established 10 VR workstations across three different campus locations (Digital Humanities Center, The Wond’ry Innovation Center, and the School of Engineering Computer Lab). Our standard setup has been the Oculus Rift S paired with desktop PCs with minimum specs of 16GB RAM and GTX 1080 GPUs.

As we envisioned the design of the joint, team-taught, and highly interdisciplinary course, several course design questions presented themselves. In our first iteration of the course, we offered a condensed and more accessible version of the computer science virtual reality class. Thus Bodenheimer, the computer science instructor, lectured on most of the same topics he had covered before, but at a more general level, and focused on how the concepts were implemented in Unity rather than on the more theoretical perspective of the prior offering. Likewise, Molvig brought with him several tools of his discipline: a set of shared readings (such as the novel Ready Player One (Cline 2012)) and a response essay on the moral and social implications of VR. The class was even separated for two lectures, allowing Bodenheimer to lecture in more detail on C#, and Molvig to offer strategies on how to avoid C# entirely within Unity.

Subsequent offerings of the course, however, allowed us to abandon most of this structure, and to significantly revise the format. Our experience with how the projects and student teams worked and struggled led us to re-evaluate the format of the course. Best practices in teaching and learning recommend active, collaborative learning where students learn from their peers (Kuh et al. 2006). Thus, we adopted a structured format more conducive to teamwork, based on Agile (Pope-Ruark 2017). Agile is a framework and set of practices originally created for software development but which has much wider applicability today. It can be implemented as a structure in the classroom with a set of openly available tools that allow students to articulate, manage, and visualize a set of goals for a particular purpose, in our case, the creation of a virtual experience tailored to their client’s specific research. The challenge for us, as instructors, was to develop ways to properly instrument the Agile methods so that the groups in our class can be evaluated on their use of them, and get feedback so that they can improve their practices. This challenge is ongoing. Agile methods are thus used in our class to help teams accomplish their collaborative goals and teach them teamwork practices.

Course Structure

We presume no prior experience with VR, the Unity3D engine, or C# for either the CS or non-CS students. Therefore the first third of the course is mainly focused on introducing those topics, primarily through lecture, demonstration, and a series of cumulative “daily challenges.” By the end of this first section of the course, all students are familiar with the common tools and practices, and capable of creating VR environments upon which they can act directly through the physics engine as well as in a predetermined, or scripted, manner. During the second third of the course, students begin working together on their group projects in earnest, while continuing to develop their skills through continued individual challenges, which culminate in an individual project due at the section’s end. For the second and third sections of the course, all group work incorporates aspects of the Agile method described above, with weekly in-class group standups, and a graded, bi-weekly sprint review, conducted before the entire class. The final section of the course is devoted entirely to the completion of the final group project, which culminates in an open “demo day” held during final examinations, which has proven quite popular.

Three-fifths of our students are upper level computer science students fulfilling a “project course” major requirement, while two-fifths of our students can be from any major except computer science. Each project team is composed of roughly five students with a similar overall ratio, and we tend to have about 50 students per offering. This distribution and size are enforced at registration because of the popularity of the CS major and demand for project courses in it. The typical CS student’s experience will involve at least three semesters of programming in Java and C++, but usually no knowledge of computer graphics or C#, the programming language used by Unity, our virtual reality platform. The non-CS students’ experience is more varied, but currently does not typically involve any coding experience. To construct the teams, we solicit bids from the students for their “top three” projects and “who they would like to work with.” The instructors then attempt to match students and teams so that everyone gets something that they want.

It is a fundamental assertion of this course that all members of a team so constructed can contribute meaningfully and substantially to the project. As it is perhaps obvious what the CS students contribute, it is important to understand what the non-CS students contribute. First, Unity is a sophisticated development platform that is quite usable, and, as mentioned, we spend significant course time teaching the class to use it. There is nothing to prevent someone from learning to code in C# using Unity. However, not everyone taking our class wants to be a coder, even if they are interested in technology and in using technical tools. Everyone can build models and design scenes in Unity. Also, these projects must be robust. Testing that incremental progress works and is integrated well into the whole project is key not only to the project’s success as a product, but also to the team’s grade. We also require that the teams produce documentation about their progress, and interact with their faculty mentor about design goals. These outward-facing aspects of the project are key to the project’s success and are often done by the non-CS students. Each project also typically requires unique coding, and in our experience the best projects are ones in which the students specialize into roles, as each project typically requires a significant amount of work. The Agile framework is key here, as it provides a structure for the roles and a way of tracking progress in each of them.

Since each project is varied, setting appropriate targets and evaluating progress at each review is one of the most significant ongoing challenges faced by the instructors.

Projects

A full list of the twenty-four projects may be found in Appendix 1.

Below are short descriptions and video walkthroughs of four distinctive projects that capture the depth, breadth, and originality fostered by our emphasis on interdisciplinarity in all aspects of the course design and teaching.

Example Project: Protein Modeling

The motivation for this project, mentored by Chemistry Professor Jens Meiler, came from a problem common to structural chemistry: the inherent difficulty of visualizing 3D objects. For this prototype, we aimed to model how simple proteins and molecules composed of a few tens of atoms interact and “fit” together. In drug design and discovery, this issue is of critical importance and can require significant amounts of computation (Allison et al. 2014). These interactions are often dominated by short-range van der Waals forces, although determining the correct configuration for the proteins to bind is challenging. This project illustrated that difficulty by letting people explore binding proteins together. Two graspable proteins were presented in an immersive environment, and users attempted to fit them together. As they fit together, a score showing how well they fit was displayed. This score was computed based on an energy function incorporating van der Waals attractive and repulsive potentials. The goal was to get the minimum score possible. The proteins and the energy equation were provided by the project mentor, although the students implemented a van der Waals simulator within Unity for this project. Figures 1 and 2 show examples from the immersive virtual environment. The critical feature of this project worth noting is that the molecules are asymmetric three-dimensional structures. Viewing them with proper depth perception is necessary to get an idea of their true shape. It would be difficult to recreate this simulation with the same effectiveness using desktop displays and interactions.
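
To make the scoring idea concrete, here is a minimal sketch of how a fit score of this kind could be computed. The mentor’s actual energy equation is not reproduced in this article, so the sketch swaps in the textbook 12-6 Lennard-Jones potential, a common way of combining a short-range repulsive term with an attractive van der Waals term. The function names and the `epsilon` and `sigma` parameters are illustrative assumptions, not the students’ Unity implementation.

```python
import math


def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Pair energy at distance r (r > 0): repulsive r^-12 term minus attractive r^-6 term."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def fit_score(atoms_a, atoms_b, epsilon=1.0, sigma=1.0):
    """Sum the pair energy over every atom pair across two molecules.

    atoms_a and atoms_b are lists of (x, y, z) positions; a lower score
    means a better fit. Assumed units and parameters are placeholders.
    """
    return sum(
        lennard_jones(math.dist(a, b), epsilon, sigma)
        for a in atoms_a
        for b in atoms_b
    )
```

In this toy version, pushing atoms too close together drives the score sharply upward, while pulling them far apart lets it drift back toward zero, so the minimum lies at an intermediate separation. That is the behavior a player would feel when hunting for the lowest score.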

While issues of efficiency and effectiveness in chemical pedagogy drove our mentor’s interest, the student creators and demo day users were drawn to this project for its elements of science communication and gamification. By providing a running “high score” and providing a timed element, users were motivated to interact with the objects and experience far longer than with a 2D or static 3D visualization. One student member of this group did possess subject matter familiarity which helped incorporate the energy function into the experience.

Figure 1. Two proteins shown within the simulation. The larger protein on the left is the target protein to which the smaller protein (right) should be properly fit. A menu containing the score is shown past the proteins. Proteins may be grabbed, moved, and rotated using the virtual reality controllers.

Example Project: Vectors of Textual Movement in Medieval Cyprus

Professor of French Lynn Ramey served as the mentor for this project. Unlike most other mentors, Prof. Ramey had a long history of using Unity3D and game technologies in both her research and teaching. Her goal in working with us was to recreate an existing prototype in virtual reality, and determine the added value of visual immersion and hand-tracked interactivity. This project created a game that simulates how stories might change during transmission and retelling (Amer et al. 2018; Ramey et al. 2019). The crusader Kingdom of Cyprus served as a waypoint between East and West during the years 1192 to 1489. This game focuses on the early period and looks at how elements of stories from The Thousand and One Nights might have morphed and changed to please the sensibilities and tastes of different audiences. In the game, the user tells stories to agents within the game, ideally gaining storytelling experience and learning the individual preferences of the agents. After gaining enough experience, the user can gain entry to the King’s palace and tell a story to the King, with the goal of impressing the King. During the game play, the user must journey through the Kingdom of Cyprus to find agents to tell stories to.

This project was very successful at showcasing the advantages of an interdisciplinary approach. Because it was perhaps the project closest to a traditional video game, faculty and students alike were constantly reminded of the interplay between technical and creative decisions. However, this was not simply an “adaptation” of a finished cultural work into a new medium, but rather an active exploration of an open humanities research project asking how, why, when, and for whom stories are told. No student member of this group majored in the mentor’s discipline.

This project is ongoing, and more information can be found here: https://medievalstorytelling.org.

A video walkthrough of the game can be seen below.

Figure 2. Video walk-through of gameplay.

Example Project: Interactive Geometry for K–8 Mathematical Visualization

In this project, Corey Brady, Professor of Education, challenged our students to take full advantage of the physical presence offered by virtual environments, and build an interactive space where children can directly experience “mathematical dimensionality.” Inspired by recent research (Kobiela et al. 2019; Brady et al. 2019) examining physical geometrical creation in two dimensions (think paint, brushes and squeegees), the students created a brightly lit and colored virtual room, where the user is initially presented with a single point in space. Via user input, the point can be stretched into a line, the line into a plane, and the plane into a solid (rectangles, cylinders, and prisms). As the user increases or decreases the object along its various axes, bar graph visualizations of length, width, height, surface area, and volume are updated in real time.
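
The quantities behind those bar graphs are standard geometry. As a rough sketch (the function names are hypothetical, and this is not the students’ Unity code), the values could be recomputed on every change like this:

```python
import math


def box_measures(length, width, height):
    """Return (surface_area, volume) for a rectangular solid."""
    surface_area = 2 * (length * width + length * height + width * height)
    volume = length * width * height
    return surface_area, volume


def cylinder_measures(radius, height):
    """Return (surface_area, volume) for a closed right circular cylinder."""
    surface_area = 2 * math.pi * radius * (radius + height)
    volume = math.pi * radius ** 2 * height
    return surface_area, volume
```

Setting a dimension to zero mirrors the point-to-line-to-plane progression described above: `box_measures(2, 3, 0)` has zero volume, and its reported area is that of the flat rectangle’s two faces.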

Virtual Reality as an education tool has proven very popular, both amongst our students and in industry. No student member of this group specialized in education, but all members had, of course, firsthand experience learning these concepts themselves as children. The opportunity to reimagine a nearly universal learning process was a significant draw for this project. After this course offering, Brady and Molvig have begun a collaboration to expand its utility.

A video demonstration of the project can be seen below.

Figure 3. User manipulates the x, y, and z axes of a rectangle. Real-time calculations of surface area and volume are shown in the background.

Example Project: Re-digitizing Stereograms

For this project, Molvig led a team to bring nineteenth-century stereographic images into twenty-first-century technology. Invented by Charles Wheatstone in 1838 and later improved by David Brewster, stereograms are nearly identical paired photographs that, when viewed through a binocular display, are perceived by the viewer as a single “3D image” [1], often with an effect of striking realism. For this reason, stereoscopy is often referred to as “Victorian VR.” Hundreds of thousands of scanned, digitized stereo-pair photos exist in archives and online collections; however, it is currently extremely difficult to view these as intended in stereoscopic 3D. Molvig’s goal was to create a generalizable stereogram viewer capable of bringing stereopair images from remote archives for viewing within a modern VR headset.
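
At its simplest, showing a stereopair in a headset means sending one half of the scan to each eye. The sketch below illustrates that step under strong simplifying assumptions: the scan is a plain grid of pixel values, the two views sit side by side at equal width, and there is no card border to trim. The function name is hypothetical and this is not the project’s actual import pipeline.

```python
def split_stereo_pair(pixels):
    """Split a side-by-side stereogram into (left_view, right_view).

    pixels is a non-empty list of equal-length rows. If the width is odd,
    the middle column is dropped so both views have the same size.
    """
    if not pixels or any(len(row) != len(pixels[0]) for row in pixels):
        raise ValueError("expected a non-empty rectangular image")
    width = len(pixels[0])
    half = width // 2
    left = [row[:half] for row in pixels]
    right = [row[width - half:] for row in pixels]
    return left, right
```

Real archival scans rarely divide this cleanly, since mounts, captions, and uneven margins all shift the boundary between the two photographs.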

Student interest quickly coalesced around two sets of remarkable stereoscopic anatomical atlases, the Edinburgh Stereoscopic Atlas of Anatomy (1905) and the Bassett Collection of Stereoscopic Images of Human Anatomy from the Stanford Medical Library. Driven by student interest, the 2019 project branched into a VR alternative to wetlab or flat 2D medical anatomy imagery. This project remains ongoing, as does Molvig’s original generalized stereo viewer, which now includes a machine-learning-based algorithm to automate the import and segmentation of any stereopair photograph.
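To make the idea of segmenting a stereopair concrete, the sketch below shows the most naive approach: cutting a side-by-side scan down the middle so that each half can be shown to one eye. This is an illustrative assumption only, not the machine learning algorithm mentioned above; the function name `split_stereopair` is hypothetical, and the image is modeled as a plain list of pixel rows rather than a real image format.

```python
def split_stereopair(image):
    """Split a side-by-side stereo scan into (left, right) halves.

    `image` is a rectangular grid given as a list of pixel rows.
    If the width is odd, the single middle column is dropped so that
    both halves have the same width.
    """
    if not image or not image[0]:
        raise ValueError("image must be a non-empty grid")
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")
    half = width // 2
    left = [row[:half] for row in image]
    right = [row[width - half:] for row in image]
    return left, right


if __name__ == "__main__":
    scan = [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
    ]
    left_eye, right_eye = split_stereopair(scan)
    print(left_eye)   # left-eye view
    print(right_eye)  # right-eye view
```

Real scanned cards have borders, captions, and uneven margins, which is presumably why a fixed midpoint cut is not enough for archival material.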

Two demonstrations of the stereoview player are below: the first for medical anatomy images, the second for stereophotos taken during the American Civil War. All images appear in stereoscopic depth when viewed in the headset.

Figure 4. Demonstration of anatomy stereoscopic viewer. Images from the Bassett Collection of Stereoscopic Images of Human Anatomy, Stanford Medical Library.

Figure 5. Demonstration of Civil War stereoviews. Images from the Robert N. Dennis collection of stereoscopic views, New York Public Library Digital Collection.

Challenges

This course has numerous challenges, both inside and outside of the classroom, and we have by no means solved them all.

Institutional

Securing support for co-teaching is not always easy. We began offering this course under a Provost-level initiative to encourage ambitious teaching collaborations across disciplines. This initiative made it straightforward to count co-teaching efforts with our Deans, and provided some financial support for the needed hardware purchases. However, that initiative was for three course offerings, which we have now completed. Moving forward, we will need to negotiate our course with our Deans.

We rely heavily on invested Faculty Mentors to provide the best subject matter expertise. So far we have had no trouble finding volunteers, and the growing community of VR-engaged faculty has been one of the greatest personal benefits, but as VR becomes less novel, we may experience a falloff in interest.

Interdisciplinarity

This is both the most rewarding and most challenging aspect of this course. Securing student buy-in on the value of interdisciplinary teamwork is our most consistent struggle. In particular, these issues arise around the uneven distribution of C# experience, and perceived notions of what type of work is “real” or “hard.” To mitigate these issues, we devote significant time during the first month of the course exposing everyone to all aspects of VR project development (technical and non-technical), and require the adoption of “roles” within each project to make responsibilities clear and workload distributed.

Cost

Virtual reality is a rapidly evolving field, with frequent hardware updates and changing requirements. We will need to secure new funding to significantly expand or update our current equipment.

Conclusions and Lessons Learned

Virtual reality technology is more accessible than ever, but it is not as accessible as one might wish in a pedagogical setting. It is difficult to create even moderately rich and sophisticated environments, without the development expertise gleaned through exposure to the computer science curriculum. A problem thus arises on two fronts. First, exposure to the computer science curriculum at the depth currently required to develop compelling virtual reality applications should ideally not be required of everyone. Unfortunately, the state of the art of our tools currently makes this necessary. Second, those who study computer science and virtual reality focus on building the tools and technology of virtual reality, the theories and algorithms integral to virtual reality, and the integration of these into effective virtual reality systems. Our class represents a compromise solution to the accessibility problem by changing the focus away from development of tools and technology toward collaboration and teamwork in service of building an application.

Our class is an introduction to virtual reality in the sense that students see the capability of modern commodity-level virtual reality equipment and software, and their limitations. They leave the class understanding what types of virtual worlds are easy to create, and what types of worlds are difficult to create. From the perspective of digital humanities, our course is a leveraged introduction to technology at the forefront of application to the humanities. Students are exposed to a humanities-centered approach to this technology through interaction with their project mentors.

In terms of the material that we, the instructors, focus most on in class, our class is about teamwork and problem-solving with people one has not chosen to work with. We present this latter skill as one essential to a college education, whether it comes from practical reasons, e.g., that is what students will be faced with in the workforce (Lingard & Barkataki 2011), or from theoretical perspectives on best ways to learn (Vygotsky 1978). The interdisciplinarity that is a core feature of the course is presented as a fact of the modern workforce. Successful interdisciplinary teams are able to communicate and coordinate effectively with one another, and we emphasize frameworks that allow these things to happen.

Within the broader Vanderbilt curriculum, the course satisfies different curricular requirements. For CS students, the course satisfies a requirement that they participate in a group design experience as part of their major requirements. The interdisciplinary nature of the group is not a major requirement, but is viewed as an advantage, since it is likely that most CS majors will be part of interdisciplinary teams during their future careers. For non-CS students, the course currently satisfies the requirements of the Communication of Science and Technology major and minor.[2]

Over the three iterations of this course, we have learned that team teaching an interdisciplinary project course is not trivial. In particular, it requires more effort than each professor lecturing on their own specialty and expecting effective learning to emerge from the two different streams. That expectation was closer to what we did in the first offering of this course, where we quickly perceived that this practice was not the most engaging format for the students, nor was it the most effective pedagogy for what we wanted to accomplish. The essence of the course is creating teams that use mostly accessible technology to build engaging virtual worlds. We have reorganized our lecture and pedagogical practices to support this core. In doing this, each of us brings to the class our own knowledge and expertise on how best to accomplish that goal, and thus the students experience something closer to two views on the same problem. While we are iteratively refining this approach, we believe it is more successful.

Agile methods (Pope-Ruark 2017) have become an essential part of our course. They allow us to better judge the progress of the projects and to determine more quickly where bottlenecks are occurring. They incentivize students to work consistently on the project over the course of the semester rather than trying to build everything at the end in a mad rush of effort. By requiring students to mark their progress on burndown charts, we give them a better visualization of the tasks remaining to be accomplished. Project boards associated with Agile can provide insight into the relative distribution of work that is occurring in the group, ideally allowing us to influence group dynamics before serious tensions arise.
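As a rough illustration of the idea, a burndown chart tracks the work remaining after each sprint against an ideal straight-line pace. The sketch below is a hypothetical Python example; the function names, the use of story points, and the sample numbers are assumptions for illustration and do not describe the tooling used in the course.

```python
def burndown(total_points, completed_per_sprint):
    """Return the work remaining at the start and after each sprint."""
    remaining = [total_points]
    for done in completed_per_sprint:
        # Remaining work never drops below zero.
        remaining.append(max(remaining[-1] - done, 0))
    return remaining


def ideal_line(total_points, sprints):
    """Return the straight-line pace that finishes exactly on time."""
    return [total_points - total_points * i / sprints
            for i in range(sprints + 1)]


if __name__ == "__main__":
    actual = burndown(40, [5, 10, 5, 20])
    ideal = ideal_line(40, 4)
    for sprint, (a, i) in enumerate(zip(actual, ideal)):
        status = "behind" if a > i else "on track"
        print(f"sprint {sprint}: remaining={a}, ideal={i:.0f} ({status})")
```

In this made-up example the team falls behind the ideal pace in the early sprints and catches up only in the last one, which is the end-of-semester rush that a burndown chart is meant to make visible early.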

This latter effort is a work in progress, however. A limitation of the course as it currently exists is that we need to do a better job evaluating teams (Hughes & Jones 2011). Currently our student evaluations rely too heavily on the final outcome of the project and not enough on the effectiveness of the teamwork within the team. Evaluating teamwork, however, has seemed cumbersome, and the best way to give meaningful feedback to improve teamwork practices is something we are still exploring. If we improved this practice, we could give students more refined feedback throughout the semester on their individual and group performance, and use that as a springboard to teach better team practices. Better team practices would likely result in increased quality of the final projects.

Notes

[1] These images are not truly three dimensional, as they cannot be rotated or peered behind. Rather, two images are created precisely to fool the brain into adding a perception of depth into a single combined image.
[2] https://as.vanderbilt.edu/cst/. There is currently no digital humanities major or minor at Vanderbilt.

References

Allison, Brittany, Steven Combs, Sam DeLuca, Gordon Lemmon, Laura Mizoue, and Jens Meiler. 2014. “Computational Design of Protein–Small Molecule Interfaces.” Journal of Structural Biology 185, no. 2: 193–202.

Amer, Sahar, and Lynn Ramey. 2018. “Teaching the Global Middle Ages with Technology.” Parergon: Journal of the Australian and New Zealand Association for Medieval and Early Modern Studies 35: 179–91.

Brady, Corey, and Richard Lehrer. 2020. “Sweeping Area Across Physical and Virtual Environments.” Digital Experiences in Mathematics Education: 1–33. https://link.springer.com/article/10.1007/s40751-020-00076-2.

Cline, Ernest. 2012. Ready Player One. New York: Broadway Books.

Hughes, Richard L., and Steven K. Jones. 2011. “Developing and assessing college student teamwork skills.” New Directions for Institutional Research 149: 53–64.

Kobiela, Marta, and Richard Lehrer. 2019. “Supporting Dynamic Conceptions of Area and its Measure.” Mathematical Thinking and Learning: 1–29.

Kozlowski, Steve W.J., and Daniel R. Ilgen. 2006. “Enhancing the Effectiveness of Work Groups and Teams.” Psychological Science in the Public Interest 7, no. 3: 77–124.

Kuh, George D., Jillian Kinzie, Jennifer A. Buckley, Brian K. Bridges, and John C. Hayek. 2006. What Matters to Student Success: A Review of the Literature. Vol. 8. Washington, DC: National Postsecondary Education Cooperative.

LaValle, Steve. 2017. Virtual Reality. Cambridge, UK: Cambridge University Press.

Lingard, Robert, and Shan Barkataki. 2011. “Teaching Teamwork in Engineering and Computer Science.” 2011 Frontiers in Education Conference. Institute of Electrical and Electronics Engineers.

Pope-Ruark, Rebecca. 2017. Agile Faculty: Practical Strategies for Managing Research, Service, and Teaching. Chicago: University of Chicago Press.

Ramey, Lynn, David Neville, Sahar Amer, et al. 2019. “Revisioning the Global Middle Ages: Immersive Environments for Teaching Medieval Languages and Culture.” Digital Philology 8: 86–104.

Takala, Tuukka M., Lauri Malmi, Roberto Pugliese, and Tapio Takala. 2016. “Empowering students to create better virtual reality applications: A longitudinal study of a VR capstone course.” Informatics in Education 15, no. 2: 287–317.

Zimmerman, Guy W., and Dena E. Eber. 2001. “When worlds collide!: an interdisciplinary course in virtual-reality art.” ACM SIGCSE Bulletin 33, no. 1.

Appendix 1: Complete Project List

Project Title (Mentor, Field, Year(s))

Aristotelian Physics Simulation (Molvig, History of Science, 2017, 2018).
Virtual Excavation (Wernke, Archeology, 2017, 2018).
Aech’s Basement: scene from Ready Player One (Clayton, English, 2017).
Singing with Avatar (Reiser, Psychology, 2017).
Visualizing Breathing: interactive biometric data (Birdee, Medicine, 2017).
Memory Palace (Kunda, Computer Science, 2017).
Centennial Park (Lee, Art History, 2017).
Stereograms (Peters, Computer Science, 2017).
Medieval Storytelling (Ramey, French, 2017, 2018, 2019).
VR locomotion (Bodenheimer, Computer Science, 2017).
3D chemistry (Meiler, Chemistry, 2018).
Data Visualization (Berger, Computer Science, 2018).
Adversarial Maze (Narasimham and Bodenheimer, Computer Science, 2018).
Operating Room Tool Assembly (Schoenecker, Medicine, 2018).
Autism Spectrum Disorder: table building simulation (Sarkar, Mechanical Engineering, 2019).
Brain Flow Visualization (Oguz, Computer Science, 2019).
Interactive Geometry (Brady, Learning Sciences, 2019).
Jekyll and Hyde (Clayton, English, 2019).
fMRI Brain Activation (Chang, Computer Science, 2019).
Virtual Museum (Robinson, Art History, 2019).
Peripersonal Space (Bodenheimer, Computer Science, 2019).
Solar System Simulation (Weintraub, Astronomy, 2019).
Accessing Stereograms (Molvig, History, 2019).

About the Authors

Ole Molvig is an assistant professor in the Department of History and the Program in Communication of Science and Technology. He explores the interactions among science, technology, and culture from 16th-century cosmology to modern emergent technologies like virtual reality or artificial intelligence. He received his Ph.D. in the History of Science from Princeton University.

Bobby Bodenheimer is a professor in the Department of Electrical Engineering and Computer Science at Vanderbilt University. He also holds an appointment in the Department of Psychology and Human Development. His research examines virtual and augmented reality, specifically how people act, perceive, locomote, and navigate in virtual and augmented environments. He is the recipient of an NSF CAREER award and received his Ph.D. from the California Institute of Technology.

A spiral of books on library shelves appears almost as though a pie chart.

Supporting Data Visualization Services in Academic Libraries

Negeen Aghassibake, University of Washington Libraries

Justin Joque, University of Michigan Library

Matthew L. Sisk, Navari Family Center for Digital Scholarship, University of Notre Dame

Abstract

Data visualization in libraries is not a part of traditional forms of research support, but is an emerging area that is increasingly important in the growing prominence of data in, and as a form of, scholarship. In an era of misinformation, visual and data literacy are necessary skills for the responsible consumption and production of data visualizations and the communication of research results. This article summarizes the findings of Visualizing the Future, which is an IMLS National Forum Grant (RE-73-18-0059-18) to develop a literacy-based instructional and research agenda for library and information professionals with the aim to create a community of praxis focused on data visualization. The grant aims to create a diverse community that will advance data visualization instruction and use beyond hands-on, technology-based tutorials toward a nuanced, critical understanding of visualization as a research product and form of expression. This article will review the need for data visualization support in libraries, review environmental scans on data visualization in libraries, emphasize the need for a focus on the people involved in data visualization in libraries, discuss the components necessary to set up these services, and conclude with the literacies associated with supporting data visualization.

Introduction

Now, more than ever, accurately assessing information is crucially important to discourse, both public and academic. Universities play an important role in teaching students how to understand and generate information. But at many institutions, learning how to effectively communicate findings from the research process is considered idiosyncratic for each field or the express domain of a particular department (e.g. applied mathematics or journalism). Data visualization is the use of spatial elements and graphical properties to display and analyze information, and this practice may follow disciplinary customs. However, there are many commonalities in how we visualize information and data, and the academic library, at the heart of the university, can play a significant role in teaching these skills. In the following article, we identify a number of challenges in teaching complex technological and methodological skills like visualization and outline a rationale for, and a strategy for implementing, these types of services in academic libraries. However, the same argument can be made for any academic support unit, whether college, library, or independently based.

Why Do We Need Data Visualization Support in Libraries?

In many ways the argument for developing data visualization services in libraries mirrors the discussion surrounding the inclusion and extension of digital scholarship support services throughout universities. In academic settings, libraries serve as a natural hub for services that can be used by many departments and fields. Often, data visualization (like GIS or text-mining) expertise is tucked away in a particular academic department making it difficult for students and researchers from different fields to access it.

As libraries already play a key role in advocacy for information literacy and ethics, they may also serve as unaffiliated, central places to gain basic competencies in associated information and data skills. Training patrons how to accurately analyze, assess, and create data visualizations is a natural enhancement to this role. Building competencies in these areas will aid patrons in their own understanding and use of complex visualizations. It may also help to create a robust learning community and knowledge base around this form of visual communication.

In an age of “fake news” and “post-truth politics,” visual literacy, data literacy, and data visualization have become exceedingly important. Without knowing the ways that data can be manipulated, patrons are not as capable of assessing the utility of the information being displayed or making informed decisions about the visual story being told. Presently, many academic libraries are investing resources in data services and subscriptions. Training students, faculty and researchers in ways of effectively visualizing these data sources increases their use and utility. Finally, having data visualization skills within the library also comes with an operational advantage, allowing more effective sharing of data about the library.

We are the Visualizing the Future Symposia, an Institute of Museum and Library Services National Forum Grant-funded group created to develop instructional and research materials on data visualization for library professionals and a community of practice around data visualization. The grant was designed to address the lack of community around data visualization in libraries. More information about the grant is available at the Visualizing the Future website. While we have only included the names of the three main authors, this article is a product of the work of the entire cohort, which includes: Delores Carlito, David Christensen, Ryan Clement, Sally Gore, Tess Grynoch, Jo Klein, Dorothy Ogdon, Megan Ozeran, Alisa Rod, Andrzej Rutkowski, Cass Wilkinson Saldaña, Amy Sonnichsen, and Angela Zoss.

We are currently halfway through our grant work and, in addition to providing publicly available resources for teaching visualization, are also in the process of synthesizing and collecting shared insights into developing and providing data visualization instruction. This present article represents some of the key findings of our grant work.

Current Environment

In order to identify some broad data visualization needs and values, we reviewed three environmental scans. The first was carried out by Angela Zoss, who is one of the co-investigators on the grant, at Duke University (2018) based on a survey that received 36 responses from 30 separate institutions. The second, by S.K. Van Poolen (2017), focuses on an overview of the discipline and includes results from a survey of Big Ten Academic Alliance institutions and others. And the final report by Ilka Datig for Primary Research Group Inc (2019) provides a number of in-depth case studies. While none of the studies claim to provide an exhaustive list of every person or institution providing data visualization support in libraries, in combination they provide an effective overview of the state of the field.

Institutions

The combined environmental scans represent around thirty-five institutions, primarily academic libraries in the United States. However, the Zoss survey also includes data from the Australian National University, a number of Canadian universities, and the World Bank Group. The universities represented vary greatly in size and include large research institutions, such as the University of California Los Angeles, and small liberal arts schools, such as Middlebury and Carleton College.

Some appointments were full-time, while others reported visualization as a part of other job responsibilities. In the Zoss survey, roughly 33% of respondents reported having the word “visualization” in their job title.

Types of activities

The combined scans include a variety of services and activities. According to the Zoss survey, the two most common activities (i.e. activities that the most respondents said they engaged in) were providing consultations on visualization projects and giving short workshops or lectures on data visualization. Other services offered include providing internal data visualization support for analyzing and communicating library data; training on visualization hardware and spaces (e.g. large-scale visualization walls, 3D CAVEs); and managing such spaces and hardware.

Resources needed

These three environmental scans also collectively identify a number of resources that are critical for supporting data visualization in libraries. One of the key elements is training for new librarians, or librarians new to this type of work, on visualization itself and on teaching and consulting on data visualization. They also mention that resources are required to effectively teach and support visualization software, including access to the software and learning materials, as well as ample time for librarians to learn, create, and experiment themselves so that they can be effective teachers. Finally, they outline the need for communities of practice across institutions and shared resources to support visualization.

It’s About the People

In all of our work and research so far, one important element seems worth stressing and calling out on its own: It is the people who make data visualization services work. Even visualization services focused on advanced instructional spaces or immersive and large-scale displays require expertise to help patrons learn how to use the space, maintain and manage technology, schedule events to create interest, and, especially in the case of advanced spaces, create and manage content to suggest the possibilities. An example of this is the North Carolina State University Libraries’ Andrew W. Mellon Foundation-funded project “Immersive Scholar” (Vandegrift et al. 2018), which brought visiting artists to produce immersive artistic visualization projects in collaboration with staff for the large-scale displays at the library.

We encourage any institution that is considering developing or expanding data visualization services to start by defining skill sets and services they wish to offer rather than the technology or infrastructure they intend to build. Some of these skills may include programming, data preparation, and designing for accessibility, which can support a broad range of services to meet user needs. Unsupported infrastructure (stale projects, broken technology, etc.) is a continuing problem in providing data visualization services, and starting any conversation around data visualization support by thinking about the people needed is crucial to creating sustainable, ethical, and useful services.

As evidenced by both the information in the environmental scans and the experiences of Visualizing the Future fellows, one of the most consistently important ways that libraries are supporting visualization is through consultations and workshops that span technologies from Excel to the latest virtual reality systems. Moreover, using these techniques and technologies effectively requires more than just technical know-how; it requires in-depth considerations of design aesthetics, sustainability, and the ethical use and re-use of data. Responsible and effective visualization design requires a variety of literacies (discussed below), critical consideration of where data comes from, and how best to represent data—all elements that are difficult to support and instruct without staff who have appropriate time and training.

Services

Data visualization services in libraries exist both internally and externally. Internally, data visualization is used for assessment (Murphy 2015), marketing librarians’ skills and demonstrating the value of libraries (Bouquin and Epstein 2015), collection analysis (Finch and Flenner 2016), internal capacity building (Bouquin and Epstein 2015), and in other areas of libraries that primarily benefit the institution.

External services, in contrast, support students, faculty, researchers, non-library staff, and community members. Some examples of services include individual consultations, workshops, creating spaces for data visualization (both physical and virtual), and providing support for tools. Some libraries extend visualization services into additional areas, like the New York University Health Sciences Library’s “Data Visualization Clinic,” which provides a space for attendees to share and receive feedback on their data visualizations from their peers (Zametkin and Rubin 2018), and the North Carolina State University Libraries’ Coffee and Viz Series, “a forum in which NC State researchers share their visualization work and discuss topics of interest” that is also open to the public (North Carolina State University Libraries 2015).

In order to offer these services, libraries need staff who have some interest in or experience with data visualization. Some models include functional roles, such as data services librarians or data visualization librarians. These functional librarian roles ensure that the focus is on data and data visualization, and that there is dedicated, funded time available to work on data visualization learning and support. It is important to note that if there is a need for research data management support, it may require a position separate from data visualization. Data services are broad and needs can vary, so some assessment of the community’s greatest needs would help focus functional librarian positions.

Functional librarian roles may lend themselves to external facing support and community building around data visualization outside of internal staff. A needs assessment can help identify user-centered services, outreach, and support that could help create a community around data visualization for students, faculty, researchers, non-library staff, and members of the public. Having a community focused on data visualization will make sure that services, spaces, and tools are utilized and meeting user needs.

There is also room to develop non-librarian, technical data visualization positions, such as data visualization specialists or tool-specific specialist positions. These positions may not always have an outreach or community building focus and may be best suited for internal library data visualization support and production. Offering data visualization support as a service to users is separate from data visualization support as a part of library operations, and the decision on how to frame the positions can largely be determined by library needs.

External data visualization services can include workshops, training sessions, consultations, and classroom instruction. These services can be focused on specific tools, such as Tableau, R, Gephi, and so on. They can be focused on particular skills, such as data cleaning and normalizing, dashboard design, and coding. They can also address general concerns, such as data visualization transparency and ethics, which may be folded into all of the services.

There are some challenges in determining which services to offer:

Is there an interest in data visualization in the community? This question should be answered before any services are offered to ensure services are utilized. If there are any liaison or outreach librarians at your institution, they may have deeper insight into user needs and connections to the leaders of their user groups.
Are there staff members who have dedicated time to effectively offer these services and support your users?
Is there funding for tools you want to teach?
Do you have a space to offer these services? This does not have to be anything more complicated than a room with a projector, but if these services begin to grow, it is important to consider the effectiveness of these services with a larger population. For example, a cap on the number of attendees for a tool-specific workshop might be needed to ensure the attendees receive enough individual support throughout the session.

If any of these areas is not addressed, there will be challenges in providing data visualization services and support. Successful data visualization services have adequate staffing, access to the required tools and data, space to offer services (not necessarily a data wall or makerspace, but simply a space with sufficient room to teach and collaborate), and a community that is already interested in, and in need of, data visualization services.

Literacies

The skills that are necessary to provide good data visualization services are largely practical. We derive the following list from our collective experience, both as data visualization practitioners and as part of the Visualizing the Future community of practice. While the following list is not meant to be exhaustive, these are the core competencies that should be developed to offer data visualization services, whether by an individual or as part of a team.

A strong design sense: Without an understanding of how information is effectively conveyed, it is difficult to create or assess visualizations. Thus, data visualization experts need to be versed in the main principles of design (e.g. Gestalt, accessibility) and how to use these techniques to effectively communicate visual information.

Awareness of the ethical implications of data visualizations: Although the finer details are usually assessed on a case by case basis, a data visualization expert should be able to interpret when a visualization is misleading and have the agency to decline to create biased products. This is a critical part of enabling the practitioner to be an active partner in the creation of visualizations.

An understanding of, if not expertise in, a variety of visualization types: Network visualizations, maps, glyphs, and Chernoff faces, for example. There are many specialized forms of data visualization and no individual can be an expert in all of them, but a data visualization practitioner should at least be conversant in many of them. Although universal expertise is impractical, a working knowledge of when particular techniques should be used is a very important literacy.

A similar understanding of a variety of tools: Some examples include Tableau, PowerBI, Shiny, and Gephi. There are many different tools in current use for creating static graphics and interactive dashboards. Again, universal expertise is impractical, but a competent practitioner should be aware of the tools available and capable of making recommendations outside their expertise.

Familiarity with one or more coding languages: Many complex data visualizations happen at the command line (at least partially) so there is a need for an effective practitioner to be at least familiar with the languages most commonly used (likely either R or Python). Not every data visualization expert needs to be a programmer, but familiarity with the potential for these tools is necessary.

Conclusion

The challenges inherent in building and providing data visualization instruction in academic libraries provide an opportunity to address larger pedagogical issues, especially around emerging technologies, methods, and roles in libraries and beyond. In public library settings, the needs for services may be even greater, with patrons unable to find accessible training sources when they need to analyze, assess, and work with diverse types of data and tools. While the focus of our grant work has been on data visualization, the findings reflect the general difficulties of balancing the need and desire to teach tools and invest in infrastructure with the value of teaching concepts and investing in individuals. It is imperative that work teaching and supporting emerging technologies and methods focus on supporting the people and the development of literacies rather than just teaching the use of specific tools. To do so requires the creation of spaces and networks to share information and discoveries.

Bibliography

Bouquin, Daina and Helen-Ann Brown Epstein. 2015. “Teaching Data Visualization Basics to Market the Value of a Hospital Library: An Infographic as One Example.” Journal of Hospital Librarianship 15, no. 4: 349–364. https://doi.org/10.1080/15323269.2015.1079686.

Datig, Ilka. 2019. Profiles of Academic Library Use of Data Visualization Applications. New York: Primary Research Group Inc.

Finch, Jannette L. and Angela R. Flenner. 2016. “Using Data Visualization to Examine an Academic Library Collection.” College & Research Libraries 77, no. 6: 765–778. https://doi.org/10.5860/crl.77.6.765.

Vandegrift, Micah, Shelby Hallman, Walt Gurley, Mildred Nicaragua, Abigail Mann, Mike Nutt, Markus Wust, Greg Raschke, Erica Hayes, Abigail Feldman, Cynthia Rosenfeld, Jasmine Lang, David Reagan, Eric Johnson, Chris Hoffman, Alexandra Perkins, Patrick Rashleigh, Robert Wallace, William Mischo, and Elisandro Cabada. 2018. Immersive Scholar. Released on GitHub and Open Science Framework. https://osf.io/3z7k5/.

LaPolla, Fred Willie Zametkin and Denis Rubin. 2018. “The “Data Visualization Clinic”: a library-led critique workshop for data visualization.” Journal of the Medical Library Association 106, no. 4: 477–482. https://doi.org/10.5195/jmla.2018.333.

Murphy, Sarah Anne. 2015. “How data visualization supports academic library assessment.” College & Research Libraries News 76, no. 9: 482–486. https://doi.org/10.5860/crln.76.9.9379.

North Carolina State University Libraries. “Coffee & Viz.” Accessed December 4, 2019. https://www.lib.ncsu.edu/news/coffee–viz.

Van Poolen, S.K. 2017. “Data Visualization: Study & Survey.” Practicum study at the University of Illinois.

Zoss, Angela. 2018. “Visualization Librarian Census.” TRLN Data Blog. Last modified June 16, 2018. https://trln.github.io/data-blog/data%20visualization/survey/visualization-librarian-census/.

About the Authors

Negeen Aghassibake is the Data Visualization Librarian at the University of Washington Libraries. Her goal is to help library users think critically about data visualization and how it might play a role in their work. Negeen holds an MS in Information Studies from the University of Texas at Austin.

Matthew Sisk is a spatial data specialist and Geographic Information Systems Librarian based in Notre Dame’s Navari Family Center for Digital Scholarship. He received his PhD in Paleolithic Archaeology from Stony Brook University in 2011 and has worked extensively in GIS-based archaeology and ecological modeling. His research focuses on human-environment interactions, the spatial scale of environmental toxins, and community-based research.

Justin Joque is the Visualization Librarian at the University of Michigan. He completed his PhD in Communications and Media Studies at the European Graduate School and holds a Master of Science in Information (MIS) from the University of Michigan.

Ethnographies of Datasets: Teaching Critical Data Analysis through R Notebooks

Lindsay Poirier, University of California, Davis

Abstract

With the growth of data science in industry, academic research, and government planning over the past decade, there is an increasing need to equip students with skills not only in responsibly analyzing data, but also in investigating the cultural contexts from which the values reported in data emerge. A risk of several existing models for teaching data ethics and critical data literacy is that students will come to see data critique as something that one does in a compliance capacity prior to performing data analysis or in an auditing capacity after data analysis rather than as an integral part of data practice. This article introduces how I integrate critical data reflection with data practice in my undergraduate course Data Sense and Exploration. I introduced a series of R Notebooks that walk students through a data analysis project while encouraging them, each step of the way, to record field notes on the history and context of their data inputs, the erasures and reductions of narrative that emerge as they clean and summarize the data, and the rhetoric of the visualizations they produce from the data. I refer to the project as an “ethnography of a dataset” not only because students examine the diverse cultural forces operating within and through the data, but also because students draw out these forces through immersive, consistent, hands-on engagement with the data.

Introduction

Last Spring one of my students made an important discovery regarding the politics encoded in data about California wildfires. Aishwarya Asthana was examining a dataset published by California’s Department of Forestry and Fire Protection (CalFIRE), documenting the acres burned for each government-recorded wildfire in California from 1878 to 2017. The dataset also included variables such as the fire’s name, when it started and when it was put out, which agency was responsible for it, and the reason it ignited. Asthana was practicing applying techniques for univariate data analysis in R—taking one variable in the dataset and tallying up the number of times each value in that variable appears. Such analyses help to summarize and reveal patterns in the data, prompting questions about why certain values appear more than others.

Tallying up the number of times each distinct wildfire cause appeared in the dataset, Asthana discovered that CalFIRE categorizes each wildfire into one of nineteen distinct cause codes, such as “1—Lightning,” “2—Equipment Use,” “3—Smoking,” and “4—Campfire.” According to the analysis, 184 wildfires were caused by campfires, 1,543 wildfires were caused by lightning, and, in the largest category, 6,367 wildfires were categorized with a “14—Unknown/Unidentified” cause code. The cause codes that appeared the fewest number of times (and thus were attributed to the fewest number of wildfires) were “12—Firefighter Training” and the final code in the list: “19—Illegal Alien Campfire.”

Figure 1. Plot of CalFIRE-documented wildfires by cause, produced in R. In the plot, the fewest fires have been attributed to illegal alien campfires and firefighter training.
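The kind of univariate tally described above can be sketched in a few lines of base R. This is a minimal illustration rather than the code used in the course: the data frame `fires`, its column names, and its rows are invented for the example and are not drawn from the CalFIRE dataset.

```r
# Hypothetical toy data: these rows are invented for illustration and are
# not taken from the CalFIRE dataset.
fires <- data.frame(
  fire_name = c("A", "B", "C", "D", "E", "F"),
  cause = c("1 - Lightning", "14 - Unknown/Unidentified", "4 - Campfire",
            "14 - Unknown/Unidentified", "1 - Lightning",
            "14 - Unknown/Unidentified"),
  stringsAsFactors = FALSE
)

# Univariate analysis: count how many times each value of one variable appears,
# then sort so the most frequent category comes first.
cause_counts <- sort(table(fires$cause), decreasing = TRUE)
print(cause_counts)

# Express the same tally as a share of all recorded fires.
cause_share <- round(prop.table(cause_counts), 2)
print(cause_share)
```

A tally like this shows which categories dominate, but it says nothing about who defined the categories or why; that question has to be asked of the documentation and the wider record.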

Interpreting the data unreflectively, one might say, “From 1878 to 2017, four California wildfires have been caused by illegal alien campfires—making it the least frequent cause.” Toward the beginning of the quarter in Data Sense and Exploration, many students, particularly those majoring in math and statistics, compose statements like this when asked to draw insights from data analyses. However, in only reading the data on its surface, this statement obscures important cultural and political factors mediating how the data came to be reported in this way. Why are “illegal alien campfires” categorized separately from just “campfires”? Who has stakes in seeing quantitative metrics specific to campfires purportedly ignited by this subgroup of the population—a subgroup that can only be distinctly identified through systems of human classification that are also devised and debated according to diverse political commitments?

While detailing the history of the data’s collection and some potential inconsistencies in how fire perimeters are calculated, the data documentation provided by CalFIRE does not answer questions about the history and stakes of these categories. In other words, it details the provenance of the data but not the provenance of its semantics and classifications. In doing so, it naturalizes the values reported in the data in ways that inadvertently discourage recognition of the human discernment involved in their generation. Yet, even a cursory Web search of the key phrase “illegal alien campfires in California” reveals that attribution of wildfires to undocumented immigrants in California has been used to mobilize political agendas and vilify this population for more than two decades (see, for example, Hill 1996). Discerning the critical import of this data analysis thus demands more than statistical savvy; to assess the quality and significance of this data, an analyst must reflect on their own political and ethical commitments.

Data Sense and Exploration is a course designed to help students reckon with the values reported in a dataset so that they may better judge their integrity. The course is part of a series of undergraduate data studies courses offered in the Science and Technology Studies Program at the University of California, Davis, aiming to cultivate student skill in applying critical thinking towards data-oriented environments. Data Sense and Exploration cultivates critical data literacy by walking students through a quarter-long research project contextualizing, exploring, and visualizing a publicly-accessible dataset. We refer to the project as an “ethnography of a dataset,” not only because students examine the diverse cultural forces operating within and through the data, but also because students draw out these forces through immersive, consistent, hands-on engagement with the data, along with reflections on their own positionality as they produce analyses and visualizations. Through a series of labs in which students learn how to quantitatively summarize the features in a dataset in the coding language R (often referred to as a descriptive data analysis), students also practice researching and reflecting on the history of the dataset’s semantics and classification. In doing so, the course encourages students to recognize how the quantitative metrics that they produce reflect not only the way things are in the world, but also how people have chosen to define them. Perhaps most importantly, the course positions data as always already structured according to diverse biases and thus aims to foster student skill in discerning which biases they should trust and how to responsibly draw meaning from data in spite of them. In this paper, I present how this project is taught in Data Sense and Exploration and some critical findings students made in their projects.

Teaching Critical Data Analysis

With the growth of data science in industry, academic research, and government planning over the past decade, universities across the globe have been investing in the expansion of data-focused course offerings. Many computationally or quantitatively focused data science courses seek to cultivate student skill in collecting, cleaning, wrangling, modeling, and visualizing data. Simultaneously, high-profile instances of data-driven discrimination, surveillance, and misinformation have pushed universities to also consider how to expand course offerings regarding responsible and ethical data use. Some emerging courses, often taught directly in computer and data science departments, introduce students to frameworks for discerning “right from wrong” in data practice, focusing on individual compliance with rules of conduct at the expense of attention to the broader institutional cultures and contexts that propagate data injustices (Metcalf, Crawford, and Keller 2015). Other emerging courses, informed by scholarship in science and technology studies (STS) and critical data studies (CDS), take a more critical approach, broadening students’ moral reasoning by encouraging them to reflect on the collective values and commitments that shape data and their relationship to law, democracy, and sociality (Metcalf, Crawford, and Keller 2015).

While such courses help students recognize how power operates in and through data infrastructure, a risk is that students will come to see the evaluation of data politics and the auditing of algorithms as a separate activity from data practice. While seeking to cultivate student capacity to foresee the consequences of data work, coursework that divorces reflection from practice ends up positioning these assessments as something one does after data analysis in order to evaluate the likelihood of harm and discrimination. Research in critical data studies has indicated that this divide between data science and data ethics pedagogy has rendered it difficult for students to recognize how to incorporate the lessons of data and society into their work (Bates et al. 2020). Thus, Data Sense and Exploration takes a different approach: walking students through a data analysis project while encouraging them, each step of the way, to record field notes on the history and context of their data inputs, the erasures and reductions of narrative that emerge as they clean and summarize the data, and the rhetoric of the visualizations they produce. As a cultural anthropologist, I’ve structured the class to draw from my own training in and engagement with “experimental ethnography” (Clifford and Marcus 1986). Guided by literary, feminist, and postcolonial theory, cultural anthropologists engage experimental ethnographic methods to examine how systems of representation shape subject formation and power. In this sense, Data Sense and Exploration positions data inputs as cultural artifacts, data work as a cultural practice, and ethnography as a method that data scientists can and should apply in their work to mitigate the harm that may arise from them. Importantly, walking students into awareness of the diverse cultural forces operating in and through data helps them more readily recognize opportunities for intervention. Rather than criticizing the values and political commitments that they bring to their work as biasing the data, the course celebrates such judgments when bent toward advancing more equitable representation.

The course is predominantly inspired by literature in data and information infrastructure studies (Bowker et al. 2009). These fields study the cultural and political contexts of data and the infrastructures that support them by interviewing data producers, observing data practitioners, and closely reading data structures. For example, through historical and ethnographic studies of infrastructures for data access, organization, and circulation, the field of data infrastructure studies examines how data is made and how it transforms as it moves between stakeholders and institutions with diverse positionalities and vested interests (Bates, Lin, and Goodale 2016). Critiquing the notion that data can ever be pure or “raw,” this literature argues that all data emerge from sites of active mediation, where diverse epistemic beliefs and political commitments mold what ultimately gets represented and how (Gitelman 2013). Departing from an outsized focus on data bias, Data Sense and Exploration prompts students to grapple with the “interpretive bases” that frame all data, regardless of whether it has been produced through personal data collection, institutions with strong political proclivities, or automated data collection technologies. In this sense, the course advances what Gray, Gerlitz, and Bounegru (2018) refer to as “data infrastructure literacy” and demonstrates how students can apply critical data studies techniques to critique and improve their own day-to-day data science practice (Neff et al. 2017).

Studying a Dataset Ethnographically

Data Sense and Exploration introduces students to examining a dataset and data practices ethnographically through an extended research project, carried out incrementally through a series of weekly labs.[1] While originally the labs were completed collaboratively in a classroom setting, in the move to remote instruction in Spring 2020, the labs were reformulated as a series of nine R Notebooks, hosted in a public GitHub repository that students clone into their local coding environments to complete. R Notebooks are digital documents, written in the markup language Markdown, that enable authors to embed chunks of executable R code amidst text, images, and other media. The R Notebooks that I composed for Data Sense and Exploration include text instruction for how to find, analyze, and visualize a rectangular dataset, or a dataset in which values are structured into a series of observations (or rows) each described by a series of variables (or columns). The Notebooks also model how to apply various R functions to analyze a series of example datasets, offer warnings of the various faulty assumptions and statistical pitfalls students may encounter in their own data practice, and demonstrate the critical reflection that students will be expected to engage in as they apply the functions in their own data analysis.

Interspersed throughout the written instruction, example code, and reflections, the Notebooks provide skeleton code for students to fill in as they go about applying what they have learned to a dataset they will examine throughout the course. At the beginning of the course, when many students have no prior programming experience, the skeleton code is quite controlled, asking students to “fill-in-the-blank” with a variable from their own dataset or with a relevant R function.

# Uncomment below and count the distinct values in your unique key. Note that you may need to select multiple variables. If so, separate them by a comma in the select() function.
#n_unique_keys <- _____ %>% select(_____) %>% n_distinct()

# Uncomment below and count the rows in your dataset by filling in your data frame name.
#n_rows <- nrow(_____)

# Uncomment below and then run the code chunk to make sure these values are equal.
# n_unique_keys == n_rows

Figure 2. Example of skeleton code from R Notebooks.
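To make the fill-in-the-blank pattern concrete, here is one way the skeleton in Figure 2 might look once completed. It is only a sketch: the data frame `agency_year`, its column names, and its values are invented stand-ins for a student's dataset, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data frame standing in for a student's dataset. The names
# `agency_year`, `Year`, `Agency`, and `Incidents`, and all values, are
# invented for this illustration.
agency_year <- data.frame(
  Year = c(2001, 2001, 2002, 2002),
  Agency = c("Agency A", "Agency B", "Agency A", "Agency B"),
  Incidents = c(10, 25, 12, 30)
)

# Count the distinct values of the candidate unique key. The key spans two
# variables here, so both are listed in select(), separated by a comma.
n_unique_keys <- agency_year %>% select(Year, Agency) %>% n_distinct()

# Count the rows in the data frame.
n_rows <- nrow(agency_year)

# TRUE means the chosen variables uniquely identify every row.
n_unique_keys == n_rows
```

If the comparison returns FALSE, the selected variables do not distinguish every observation; in this toy data, selecting `Year` alone would yield only two distinct values for four rows.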

However, as students gain familiarity with the language, each week, they are expected to compose code more independently. Finally, in each Notebook, there are open textboxes, where students record their critical reflections in response to specific prompts.

Teaching this course in the Spring 2020 quarter, I found that the structure provided by the R Notebooks overall was particularly supportive to students who were coding in R for the first time and that, given the examples provided throughout the Notebooks, students exhibited greater depth of reflection in response to prompts. However, without the support of a classroom once we moved online, I also found that novice students struggled more to interpret what the plots they produced in R were actually showing them. Moreover, advanced students were more conservative in their depth of data exploration, closely following the prompts and relying on code templates. In future iterations of the course, I thus intend to spend more synchronous time in class practicing how to quantitatively summarize the results of their analysis. I also plan to add new sections at the end of each Notebook, prompting students to leverage the skills they learned in that Notebook in more creative and free-form data explorations.

Each time I teach the course, individual student projects are structured around a common theme. In the iteration of the course that inspired the project that opens this article, the theme was “social and environmental challenges facing California.” In the most recent iteration of the course, the theme was “social vulnerability in the wake of a pandemic.” In an early lab, I task students with identifying issues warranting public concern related to the theme, devising research questions, and searching for public data that may help answer those questions. Few students entering the course have been taught how to search for public research, let alone how to search for public data. In order to structure their search activity, I task the students with imagining and listing “ideal datasets”—intentionally delineating their topical, geographic, and temporal scope—prior to searching for any data. Examining portals like data.gov, Google’s dataset search, and city and state open data portals, students very rarely find their ideal datasets and realize that they have to restrict their research questions in order to complete the assignment. Grappling with the dearth of public data for addressing complex contemporary questions around equity and social justice provides one of the first eye-opening experiences in the course. A Notebook directive prompts students to reflect on this.

Throughout the following week, I work with groups of students to select datasets from their research that will be the focus of their analysis. This is perhaps one of the most challenging tasks of the course for me as the instructor. While a goal is to introduce students to the knowledge gaps in public data, some public datasets have so little documentation that the kinds of insights students could extrapolate from examinations of their history and content would be considerably limited. Further, not all rectangular datasets are structured in ways that will integrate well with the code templates I provide in the R Notebooks. I grapple with the tension of wanting to expose students to the messiness of real-world data, while also selecting datasets that will work for the assignment.

Once datasets have been assigned, the remainder of the labs provide opportunities for immersive engagement with the dataset. In what follows, I describe a series of concepts (i.e. routines and rituals, semantics, classifications, calculations and narrative, chrono-politics, and geo-politics) around which I have structured each lab, and provide some examples of both the data work that introduced students to these concepts and the critical reflections they were able to make as a result.

Data Routines and Rituals

In one of the earlier labs, students conduct a close reading of their dataset’s documentation—an example of what Geiger and Ribes (2011) refer to as a “trace ethnography.” They note the stakeholders involved in the data’s collection and publication, the processes through which the data was collected, the circumstances under which the data was made public, and the changes in the data’s structure. They also search for news articles and scientific articles citing the dataset to get a sense of how governing bodies have leveraged the data to inform decisions, how social movements have advocated for or against the data’s collection, and how the data has advanced other forms of research. They outline the costs and labor involved in producing and maintaining the data, the formal standards that have informed the data’s structure, and any laws that mandate the data’s collection.
From this exercise, students learn about the diverse “rituals” of data collection and publication (Ribes and Jackson 2013). For instance, studying the North American Breeding Bird Survey (BBS)—a dataset that annually records bird populations along about 4,100 roadside survey routes in the United States and Canada—Tennyson Filcek learned that the data is produced by volunteers skilled in visual and auditory bird identification. After completing training, volunteers drive to an assigned route with a pen, paper, and clipboard and count all of the bird species seen or heard over the course of three minutes along each designated stop on the route. They report the data back to the BBS Office, which aggregates the data and makes them available for public consumption. While these rituals shape how the data get produced, the unruliness of aggregating data collected on different days, by different individuals, under different weather and traffic conditions, and in different parts of the continent has prompted the BBS to implement recommendations and routines to account for disparate conditions. The BBS requires volunteers to complete counts around June, start the route a half-hour before sunrise, and avoid completing counts on foggy, rainy, or windy days. Just as these routines domesticate the data, the heterogeneity of the data’s contexts demands that the data be cared for in particular ways, in turn patterning data collection as a cultural practice. This lab is thus an important precursor to the remaining labs in that it introduces students to the diverse actors and commitments mediating the dataset’s production and affirms that the data could not exist without them.

While I have been impressed with students’ ability to outline details involving the production and structure of the data, I have found that most students rarely look beyond the data documentation for relevant information—often missing critical perspectives from outside commentators (such as researchers, activists, lobbyists, and journalists) that have detailed the consequences of the data’s incompleteness, inconsistencies, inaccuracies, or timeliness for addressing certain kinds of questions. In future iterations of the course, I intend to encourage students to characterize the viewpoints of at least three differently positioned stakeholders in this lab in order to help illustrate how datasets can become contested artifacts.

Data Semantics

In another lab, students import their assigned dataset into the R Notebook and programmatically explore its structure, using the scripting language to determine what makes one observation distinct from the next and what variables are available to describe each observation. As they develop an understanding of what each row of the dataset represents and how columns characterize each row, they refer back to the data documentation to consider how observations and variables are defined in the data (and what these definitions exclude). This focused attention to data semantics invites students to go behind-the-scenes of the observations reported in a dataset and develop a deeper understanding of how its values emerge from judgments regarding “what counts.”

ca_crimes_clearances <- read.csv("https://data-openjustice.doj.ca.gov/sites/default/files/dataset/2019-06/Crimes_and_Clearances_with_Arson-1985-2018.csv")

str(ca_crimes_clearances)

## 'data.frame':    24950 obs. of  69 variables:
##  $ Year               : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ County             : chr  "Alameda County" "Alameda County" "Alameda County" "Alameda County" ...
##  $ NCICCode           : chr  "Alameda Co. Sheriff's Department" "Alameda" "Albany" "Berkeley" ...
##  $ Violent_sum        : int  427 405 101 1164 146 614 671 185 199 6703 ...
##  $ Homicide_sum       : int  3 7 1 11 0 3 6 0 3 95 ...
##  $ ForRape_sum        : int  27 15 4 43 5 34 36 12 16 531 ...
##  $ Robbery_sum        : int  166 220 58 660 82 86 250 29 41 3316 ...
##  $ AggAssault_sum     : int  231 163 38 450 59 491 379 144 139 2761 ...
##  $ Property_sum       : int  3964 4486 634 12035 971 6053 6774 2364 2071 36120 ...
##  $ Burglary_sum       : int  1483 989 161 2930 205 1786 1693 614 481 11846 ...
##  $ VehicleTheft_sum   : int  353 260 55 869 102 350 471 144 74 3408 ...
##  $ LTtotal_sum        : int  2128 3237 418 8236 664 3917 4610 1606 1516 20866 ...
##  $ ViolentClr_sum     : int  122 205 58 559 19 390 419 146 135 2909 ...
##  $ HomicideClr_sum    : int  4 7 1 4 0 2 4 0 1 62 ...
##  $ ForRapeClr_sum     : int  6 8 3 32 0 16 20 6 8 319 ...
##  $ RobberyClr_sum     : int  32 67 23 198 4 27 80 21 16 880 ...
##  $ AggAssaultClr_sum  : int  80 123 31 325 15 345 315 119 110 1648 ...
##  $ PropertyClr_sum    : int  409 889 166 1954 36 1403 1344 422 657 5472 ...
##  $ BurglaryClr_sum    : int  124 88 62 397 9 424 182 126 108 1051 ...
##  $ VehicleTheftClr_sum: int  7 62 16 177 8 91 63 35 38 911 ...
##  $ LTtotalClr_sum     : int  278 739 88 1380 19 888 1099 261 511 3510 ...
##  $ TotalStructural_sum: int  22 23 2 72 0 37 17 17 7 287 ...
##  $ TotalMobile_sum    : int  6 4 0 23 1 26 18 9 3 166 ...
##  $ TotalOther_sum     : int  3 5 0 5 0 61 21 64 2 22 ...
##  $ GrandTotal_sum     : int  31 32 2 100 1 124 56 90 12 475 ...
##  $ GrandTotClr_sum    : int  11 7 1 20 0 14 7 2 2 71 ...
##  $ RAPact_sum         : int  22 9 2 31 4 21 25 9 15 451 ...
##  $ ARAPact_sum        : int  5 6 2 12 1 13 11 3 1 80 ...
##  $ FROBact_sum        : int  77 56 23 242 35 38 136 13 22 1120 ...
##  $ KROBact_sum        : int  22 23 2 71 10 7 43 3 4 264 ...
##  $ OROBact_sum        : int  3 11 2 43 11 3 7 1 1 107 ...
##  $ SROBact_sum        : int  64 130 31 304 26 38 64 12 14 1825 ...
##  $ HROBnao_sum        : int  59 136 26 351 56 32 116 3 0 1676 ...
##  $ CHROBnao_sum       : int  38 48 15 150 9 21 43 4 13 253 ...
##  $ GROBnao_sum        : int  23 2 1 0 2 7 43 6 9 83 ...
##  $ CROBnao_sum        : int  32 2 2 0 0 8 21 2 2 46 ...
##  $ RROBnao_sum        : int  11 20 6 47 14 9 19 3 2 306 ...
##  $ BROBnao_sum        : int  3 2 3 21 0 2 6 0 3 37 ...
##  $ MROBnao_sum        : int  0 10 5 91 1 7 2 11 12 915 ...
##  $ FASSact_sum        : int  25 16 3 47 6 47 43 10 26 492 ...
##  $ KASSact_sum        : int  27 30 2 103 8 38 55 13 21 253 ...
##  $ OASSact_sum        : int  111 90 10 224 9 120 208 29 43 396 ...
##  $ HASSact_sum        : int  68 27 23 76 36 286 73 92 49 1620 ...
##  $ FEBURact_Sum       : int  1177 747 85 2040 161 1080 1128 341 352 9011 ...
##  $ UBURact_sum        : int  306 242 76 890 44 706 565 273 129 2835 ...
##  $ RESDBUR_sum        : int  1129 637 100 2015 89 1147 1154 411 274 8487 ...
##  $ RNBURnao_sum       : int  206 175 33 597 32 292 295 100 44 2114 ...
##  $ RDBURnao_sum       : int  599 195 44 1418 26 485 532 163 103 5922 ...
##  $ RUBURnao_sum       : int  324 267 23 0 31 370 327 148 127 451 ...
##  $ NRESBUR_sum        : int  354 352 61 915 116 639 539 203 207 3359 ...
##  $ NNBURnao_sum       : int  216 119 32 224 44 274 238 104 43 1397 ...
##  $ NDBURnao_sum       : int  47 46 21 691 14 110 45 34 26 1715 ...
##  $ NUBURnao_sum       : int  91 187 8 0 58 255 256 65 138 247 ...
##  $ MVTact_sum         : int  233 187 42 559 85 219 326 76 56 2711 ...
##  $ TMVTact_sum        : int  56 33 4 55 9 71 88 40 9 121 ...
##  $ OMVTact_sum        : int  64 40 9 255 8 60 57 28 9 576 ...
##  $ PPLARnao_sum       : int  5 31 26 133 5 10 1 4 3 399 ...
##  $ PSLARnao_sum       : int  60 20 4 163 4 14 20 6 3 251 ...
##  $ SLLARnao_sum       : int  289 664 40 1277 1 704 1058 106 435 1123 ...
##  $ MVLARnao_sum       : int  930 538 147 3153 207 1136 753 561 241 8757 ...
##  $ MVPLARnao_sum      : int  109 673 62 508 153 446 1272 155 252 901 ...
##  $ BILARnao_sum       : int  205 516 39 611 16 360 334 276 151 349 ...
##  $ FBLARnao_sum       : int  44 183 46 1877 85 493 417 187 281 4961 ...
##  $ COMLARnao_sum      : int  11 53 17 18 24 27 59 7 2 70 ...
##  $ AOLARnao_sum       : int  475 559 37 496 169 727 696 304 148 4055 ...
##  $ LT400nao_sum       : int  753 540 84 533 217 937 1089 370 235 976 ...
##  $ LT200400nao_sum    : int  437 622 68 636 122 607 802 299 262 2430 ...
##  $ LT50200nao_sum     : int  440 916 128 2793 161 1012 1102 453 464 4206 ...
##  $ LT50nao_sum        : int  498 1159 138 4274 164 1361 1617 484 555 13254 ...

Figure 3. Basic examination of the structure of the CA Crimes and Clearances dataset.

For instance, studying aggregated totals of crimes and clearances for each law enforcement agency in California in each year from 1985 to 2017, Simarpreet Singh noted how the definition of a crime gets mediated by rules in the US Federal Bureau of Investigation (FBI)’s Uniform Crime Reporting Program (UCR)—the primary source of statistics on crime rates in the US. Singh learned that one such rule, known as the hierarchy rule, states that if multiple offenses occur in the context of a single crime incident, for the purposes of crime reporting, the law enforcement agency classifies the crime only according to the most serious offense. In descending order, these classifications include 1. Criminal Homicide 2. Criminal Sexual Assault 3. Robbery 4. Aggravated Battery/Aggravated Assault 5. Burglary 6. Theft 7. Motor Vehicle Theft 8. Arson. This means that in the resulting data, for incidents where multiple offenses occurred, certain classes of crime are likely to be underrepresented in the counts.
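The effect of the hierarchy rule on counts can be illustrated with a small base R sketch. This is a simplified model built only from the ordering listed above; the incidents are invented, and the code is not the FBI's or any agency's actual reporting procedure.

```r
# Offense types in descending order of seriousness, as listed above.
hierarchy <- c("Criminal Homicide", "Criminal Sexual Assault", "Robbery",
               "Aggravated Battery/Aggravated Assault", "Burglary", "Theft",
               "Motor Vehicle Theft", "Arson")

# Hypothetical incidents (invented): each row is one offense within an incident.
offenses <- data.frame(
  incident_id = c(1, 1, 2, 3, 3, 3, 4),
  offense = c("Robbery", "Motor Vehicle Theft", "Burglary",
              "Criminal Homicide", "Robbery", "Theft", "Theft"),
  stringsAsFactors = FALSE
)

# Rank each offense by its position in the hierarchy (1 = most serious).
offenses$rank <- match(offenses$offense, hierarchy)

# Count every offense that occurred, regardless of incident.
all_counts <- table(factor(offenses$offense, levels = hierarchy))

# Apply the hierarchy rule: keep only the most serious offense per incident.
top_rank <- tapply(offenses$rank, offenses$incident_id, min)
reported_counts <- table(factor(hierarchy[top_rank], levels = hierarchy))

# Offenses that occurred but drop out of the reported counts.
undercount <- all_counts - reported_counts
print(undercount[undercount > 0])
```

In this toy example seven offenses occur across four incidents, but only four are reported; one robbery, one theft, and one motor vehicle theft disappear from the totals because a more serious offense occurred in the same incident.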

Singh also acknowledged how counts for individual offense types get mediated by official definitions. A change in the FBI’s definition of “forcible rape” (including only female victims) to “rape” (focused on whether there had been consent instead of whether there had been physical force) in 2014 led to an increase in the number of rapes reported in the data from that year on. From 1927 (when the original definition was documented) up until this change, male victims of rape had been left out of official statistics, and rapes that did not involve explicit physical force (such as drug-facilitated rapes) often went uncounted. Such changes come about not in a vacuum but in the wake of shifting norms and political stakes to produce certain types of quantitative information (Martin and Lynch 2009). By encouraging students to explore these definitions, this lab has been particularly effective in getting students to reflect not only on what counts and measures of cultural phenomena indicate, but also on the cultural underpinnings of all counts and measures.

Data Classifications

In the following lab, students programmatically explore how values get categorized in the dataset, along with how often observations fall into each category. To do so, they select categorical variables in the dataset and produce bar plots that display the distribution of values in each variable. Studying a US Environmental Protection Agency (EPA) dataset that reported the daily air quality index (AQI) of each county in the US in 2019, Farhat Bin Aznan created a bar plot that displayed the number of counties that fell into each of the following air quality categories on January 1, 2019: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous.

Figure 4. Barplot of counties in each AQI category on January 1, 2019. Most counties reported Good air quality on that day.

Studying the US Department of Education’s Scorecard dataset, which documents statistics on student completion, debt, and demographics for each college and university in the US, Maxim Chiao created a bar plot that showed the number of universities that fell into each of the following ownership categories: Public, Private nonprofit, and Private for-profit.

Figure 5. Barplot of colleges and universities in the US by ownership model in the 2018-2019 academic year.
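The tabulation behind bar plots like these can be sketched as follows. This is a minimal illustration with invented records, not the students' code: the data frame `aqi` and its columns are hypothetical, and the category labels follow the EPA's published names.

```r
# Hypothetical daily AQI records (invented values), one row per county.
aqi <- data.frame(
  county = c("A", "B", "C", "D", "E", "F"),
  category = c("Good", "Good", "Moderate", "Good", "Unhealthy", "Moderate"),
  stringsAsFactors = FALSE
)

# Declare the full ordered set of categories so that empty ones stay visible.
levels_aqi <- c("Good", "Moderate", "Unhealthy for Sensitive Groups",
                "Unhealthy", "Very Unhealthy", "Hazardous")
aqi$category <- factor(aqi$category, levels = levels_aqi)

# Count counties per category, including categories with zero counties.
category_counts <- table(aqi$category)
print(category_counts)

# Draw the bar plot; las = 2 rotates labels so long names stay readable.
barplot(category_counts, las = 2, ylab = "Number of counties")
```

Declaring the levels explicitly is a design choice worth discussing with students: a category that no observation falls into still appears on the plot, which makes the classification scheme itself visible rather than only the values that happen to occur.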

I first ask students to interpret what they see in the plot. Which categories are more represented in the data, and why might that be the case? I then ask students to reflect on why the categories are divided the way that they are, how the categorical divisions reflect a particular cultural moment, and to consider values that may not fit neatly into the identified categories. As it turns out, the AQI categories in the EPA’s dataset are specific to the US and do not easily translate to the measured AQIs in other countries, where for a variety of reasons, different pollutants are taken into consideration when measuring air quality (Plaia and Ruggieri 2011). The ownership models categorized in the Scorecard dataset gloss over the nuance of quasi-private universities in the US such as the University of Pittsburgh and other universities in Pennsylvania’s Commonwealth System of Higher Education.

For some students, this Notebook was particularly effective in encouraging reflection on how all categories emerge in particular contexts to delimit insight in particular ways (Bowker and Star 1999). For example, air pollution does not know county borders, yet, as Victoria McJunkin pointed out in her labs, the EPA reports one AQI for each county based on a value reported from one air monitor that can only detect pollution within a delimited radius. AQI is also reported on a daily basis in the dataset, yet for certain pollutants in the US, pollution concentrations are monitored on an hourly basis, averaged over a series of hours, and then the highest average is taken as the daily AQI. The choice to classify AQI by county and day then is not neutral, but instead has considerable implications for how we come to understand who experiences air pollution and when.
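The step from hourly readings to a single daily figure can be sketched in R. This is a simplified illustration under stated assumptions, not the EPA's procedure: the hourly values are invented, the 8-hour window is an assumed parameter, and the conversion from a concentration to an index value is omitted.

```r
# Hypothetical hourly pollutant concentrations for one monitor on one day
# (24 invented values; units are left unspecified).
hourly <- c(20, 22, 21, 19, 18, 20, 25, 31, 38, 44, 52, 60,
            66, 70, 68, 61, 55, 47, 40, 34, 30, 27, 24, 22)

# Assumed averaging window, in hours; the real window depends on the pollutant.
window <- 8

# Rolling means over every run of `window` consecutive hours.
# embed() lays out each window as one row of a matrix.
rolling_means <- rowMeans(embed(hourly, window))

# The daily summary keeps only the highest rolling mean.
daily_value <- max(rolling_means)

cat("Highest", window, "hour mean:", round(daily_value, 1), "\n")
cat("Mean of all 24 hours:", round(mean(hourly), 1), "\n")
cat("Highest single hour:", max(hourly), "\n")
```

Printing the three summaries side by side shows that the daily figure is neither the day's average nor its peak, so the choice of window and of "highest average" is itself a classification decision.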

Still, I found that, in this lab, other students struggled to confront their own assumptions about categories they consider to be neutral. For instance, many students categorizing their data by state in the US suggested that there were no cultural forces underlying these categories because states are “standard” ways of dividing the country. In doing so, they missed critical opportunities to reflect on the politics behind how state boundaries get drawn and which people and places get excluded from consideration when relying on this bureaucratic schema to classify data. Going forward, to help students place even “standard” categories in a cultural context, I intend to prompt students to produce a brief timeline outlining how the categories emerged (both institutionally and discursively) and then to identify at least one thing that remains “residual” (Star and Bowker 2007) to the categories.

Data Calculations and Narrative

The next lab prompts students to acknowledge the judgment calls they make in performing calculations with data, including how these choices shape the narrative the data ultimately conveys. Selecting a variable that represents a count or a measure of something in their data, students measure the central tendency of the variable—taking an average across the variable by calculating the mean and the median value. Noting that they are summarizing a value across a set of numbers, I remind students that such measures should only be taken across “similar” observations, which may require first filtering the data to a specific set of observations or performing the calculations across grouped observations. The Notebook instructions prompt students to apply such filters and then reflect on how they set their criteria for similarity. Where do they draw the line between relevant or irrelevant, similar or dissimilar? What narratives do these choices bring to the fore, and what do they exclude from consideration?

For instance, studying a dataset documenting changes in eligibility policies for the US Supplemental Nutrition Assistance Program (SNAP) by state since 1995, Janelle Marie Salanga sought to calculate the average spending on SNAP outreach across geographies in the US and over time. Noting that we could expect there to be differences in state spending on outreach due to differences in population, state fiscal politics, and food accessibility, Salanga decided to group the observations by state before calculating the average spending across time. Noting that the passing of the American Recovery and Reinvestment Act of 2009 considerably expanded SNAP benefits to eligible families, Salanga decided to filter the data to only consider outreach spending in the 2009 fiscal year through the 2015 fiscal year. Through this analysis, Salanga found California to have, on average, spent the most on SNAP outreach in the designated fiscal years, while several states spent nothing.

library(dplyr)      # filter(), group_by(), summarize(), arrange()
library(lubridate)  # month(), year()

snap %>%
  # Outreach spending is reported annually, but this dataset is reported monthly,
  # so we filter to the first month of each fiscal year (October)
  filter(month(yearmonth) == 10 & year(yearmonth) %in% 2009:2015) %>%
  group_by(statename) %>%
  summarize(median_outreach = median(outreach * 1000, na.rm = TRUE),
            num_observations = n(),
            # Share of observations in each state with no outreach value
            missing_observations = paste(sum(is.na(outreach)) / n() * 100, "%"),
            .groups = 'drop') %>%
  arrange(desc(median_outreach))

statename	median_outreach	num_observations
California	1129009.3990	7
New York	469595.8557	7
Texas	422051.5137	7
Washington	273772.9187	7
Minnesota	261750.3357	7
Arizona	222941.9250	7
Nevada	217808.7463	7
Illinois	195910.5835	7
Connecticut	184327.4231	7
Georgia	173554.0009	7
Pennsylvania	153474.7467	7
South Carolina	126414.4135	7
Ohio	125664.8331	7
Rhode Island	99755.1651	7
Tennessee	98411.3388	7
Massachusetts	97360.4965	7
Wisconsin	87527.9999	7
Maryland	81700.3326	7
Vermont	69279.2511	7
North Carolina	62904.8309	7
Indiana	58047.9164	7
Oregon	57951.0803	7
Michigan	53415.1688	7
Florida	37726.1696	7
Hawaii	29516.3345	7
New Jersey	23496.2501	7
Missouri	23289.1655	7
Louisiana	20072.0005	7
Colorado	19113.8344	7
Iowa	18428.9169	7
Virginia	15404.6669	7
Delaware	14571.0001	7
Alabama	11048.8329	7
District of Columbia	9289.5832	7
Kansas	8812.2501	7
North Dakota	8465.0002	7
Mississippi	4869.0000	7
Alaska	3199.3332	7
Arkansas	3075.0833	7
Nebraska	217.1667	7
Idaho	0.0000	7
Kentucky	0.0000	7
Maine	0.0000	7
Montana	0.0000	7
New Hampshire	0.0000	7
New Mexico	0.0000	7
Oklahoma	0.0000	7
South Dakota	0.0000	7
Utah	0.0000	7
West Virginia	0.0000	7
Wyoming	0.0000	7

Table 1. Median of annual SNAP outreach spending from 2009 to 2015 per US state.

The students then consider how their measures may be reductionist: that is, how the summarized values erase the complexity of certain narratives. For instance, Salanga went on to plot a series of boxplots that displayed the dispersion of outreach spending across fiscal years for each state from 2009 to 2015. She found that, while outreach spending had been fairly consistent in several states across these years, in other states there had been a difference of several hundred thousand dollars between the fiscal year with the maximum outreach spending and the year with the minimum.

Figure 6. Boxplot showing distribution of annual SNAP outreach spending per state from 2009 to 2015.

This nuanced story of variations in spending over time gets obfuscated when relying on a measure of central tendency alone to summarize the values.
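How a median can hide this kind of variation is easy to show with a toy example. The sketch below uses invented dollar figures for two hypothetical states (`State A` and `State B`) and is not drawn from the SNAP dataset.

```r
# Hypothetical annual outreach spending (invented dollar values), 2009 to 2015.
spending <- data.frame(
  statename = rep(c("State A", "State B"), each = 7),
  year = rep(2009:2015, times = 2),
  outreach = c(100, 100, 100, 100, 100, 100, 100,
               5, 20, 60, 100, 240, 400, 610) * 1000
)

# Median spending per state.
medians <- tapply(spending$outreach, spending$statename, median)

# Spread between the highest and lowest spending year per state.
spreads <- tapply(spending$outreach, spending$statename,
                  function(x) max(x) - min(x))

summary_by_state <- data.frame(
  statename = names(medians),
  median_outreach = as.vector(medians),
  spread = as.vector(spreads[names(medians)])
)
print(summary_by_state)
```

Both hypothetical states share the same median, yet one spends the same amount every year while the other varies by several hundred thousand dollars, a difference that only the spread (or a boxplot) reveals.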

This lab has been effective in getting students to recognize data work as a cultural practice that involves active discernment. Still, I have noticed that some students complete this lab feeling uncomfortable with the idea that the choices they make in data work may be framed, at least in part, by their own political and ethical commitments. In other words, in their reflections, some students describe their efforts to divorce their own views from their decision-making: they express concern that their choices may be biasing the analysis in ways that invalidate the results. To help them further grapple with the judgment calls that frame all data analyses (and especially the calls that they individually make when choosing how to filter, sort, group, and visualize the data), the next time I run the course I plan to ask students to explicitly characterize their own standpoint in relation to the analysis and reflect on how their unique positionality both influences and delimits the questions they ask, the filters they apply, and the plots they produce.

Data Chrono-Politics and Geo-Politics

In a subsequent lab, I encourage students to situate their datasets in a particular temporal and geographic context in order to consider how time and place impact the values recorded. Students first segment their data by a geographic variable or a date variable to assess how the calculations and plots vary across geographies and time. They then characterize, not only how and why there may be differences in the phenomena represented in the data across these landscapes and timescapes, but also how and why there may be differences in the data’s generation.

For instance, in Spring 2020, a group of students studied a dataset documenting the number of calls related to domestic violence received each month by each law enforcement agency in California.

Figure 7. Timeseries of total domestic violence calls to California law enforcement agencies over time, by county.

One student, Laura Cruz, noted how more calls may be reported in certain counties not only because domestic violence may be more prevalent or because those counties had a higher or denser population, but also due to different cultures of police intervention in different communities. Trust in law enforcement may vary across California communities, impacting which populations feel comfortable calling their law enforcement agencies to report any issues. This creates a paradox in which the counts of calls related to domestic violence can be higher in communities that have done a better job responding to them.

Describing how the values reported may change over time, Hipolito Angel Cerros further noted that cultural norms around domestic violence have changed over time for certain social groups. As a result of this cultural change, certain communities may be more likely to call law enforcement agencies regarding domestic violence in 2020 than they were a decade ago, while other communities may be less likely to call.
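Segmenting counts by place and time, and separating the effect of population from other explanations, might be sketched as follows. All values are invented, and the data frames `dv_calls` and `population` are hypothetical stand-ins rather than the dataset the students used.

```r
# Hypothetical monthly call counts (invented): two counties over two years.
dv_calls <- data.frame(
  county = rep(c("County A", "County B"), each = 24),
  year = rep(rep(c(2018, 2019), each = 12), times = 2),
  month = rep(1:12, times = 4),
  calls = c(rep(30, 12), rep(45, 12), rep(200, 12), rep(180, 12))
)

# Hypothetical populations, used to put raw counts on a comparable scale.
population <- data.frame(
  county = c("County A", "County B"),
  population = c(50000, 400000)
)

# Total calls per county and year.
by_county_year <- aggregate(calls ~ county + year, data = dv_calls, FUN = sum)

# Attach populations and compute calls per 1,000 residents.
by_county_year <- merge(by_county_year, population, by = "county")
by_county_year$calls_per_1000 <-
  by_county_year$calls / by_county_year$population * 1000

print(by_county_year[order(by_county_year$county, by_county_year$year), ])
```

In this invented example the larger county has more calls in absolute terms but fewer per resident, and the two counties move in opposite directions over time. Neither pattern says whether violence, population, or willingness to call is responsible, which is the interpretive question the students raise above.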

This was one of the course’s more successful labs, which helped students discern the ways in which data are products of the cultural contexts of their production. Dividing the data temporally and geographically helped affirm the dictum that “all data are local” (Loukissas 2019)—that data emerge from meaning-making practices that are never completely stable. Leveraging data visualization techniques to situate data in particular times and contexts demonstrated how, when aggregated across time and place, datasets can come to tell multiple stories from multiple perspectives at once. This called on students, in their role as data practitioners, to convey data results with more care and nuance.

Conclusion

Ethnographically analyzing a dataset can draw to the fore insights about how various people and communities perceive difference and belonging, how people represent complex ideas numerically, and how they prioritize certain forms of knowledge over others. Programmatically exploring a dataset’s structure, schemas, and contexts helped students see datasets not just as a series of observations, counts, and measurements about their communities, but also as cultural objects, conveying meaning in ways that foreground some issues while eclipsing others. The project also helped students see data science as a practice that is always already political, as opposed to something that can potentially become politicized when placed into the wrong hands or leveraged in the wrong ways. Notably, the project helped students cultivate these insights by integrating a computational practice with critical reflection, highlighting how they can incorporate social awareness and critique into their work. Still, the course content could be strengthened to encourage more critical examinations of categories students consider to be standard, and to better connect their choices in data analysis with their own political and ethical commitments.

Notably, there is great risk in calling attention to just how messy public data is, especially in a political moment in the US where a growing culture of denialism is undermining the credibility of evidence-based research. For this reason, I encourage students to see themselves as data auditors and their work in the course as responsible data stewardship, and on several occasions, we have worked together to compose emails to data publishers describing discrepancies we have found in the datasets. In this sense, rather than disparaging data for its incompleteness, inconsistencies, or biases, the project encourages students to rethink their role as critical data practitioners, responsible for considering when and how to advocate for making datasets and data analysis more comprehensive, honest, and equitable.

Notes

[1] I typically assign Joe Flood’s The Fires (2011) as the course text. The book tells a gripping and sobering story of how a statistical model and a blind trust in numbers contributed to the burning of New York City’s poorest neighborhoods in the 1970s.

Bibliography

Bates, Jo, David Cameron, Alessandro Checco, Paul Clough, Frank Hopfgartner, Suvodeep Mazumdar, Laura Sbaffi, Peter Stordy, and Antonio de la Vega de León. 2020. “Integrating FATE/Critical Data Studies into Data Science Curricula: Where Are We Going and How Do We Get There?” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 425–435. FAT* ’20. Barcelona, Spain: Association for Computing Machinery. https://dl.acm.org/doi/abs/10.1145/3351095.3372832.

Bates, Jo, Yu-Wei Lin, and Paula Goodale. 2016. “Data Journeys: Capturing the Socio-Material Constitution of Data Objects and Flows.” Big Data & Society 3, no. 2. https://doi.org/10.1177/2053951716654502.

Bowker, Geoffrey, Karen Baker, Florence Millerand, and David Ribes. 2009. “Toward Information Infrastructure Studies: Ways of Knowing in a Networked Environment.” In International Handbook of Internet Research, edited by Jeremy Hunsinger, Lisbeth Klastrup, and Matthew Allen, 97–117. Springer Netherlands. https://doi.org/10.1007/978-1-4020-9789-8_5.

Bowker, Geoffrey C., and Susan Leigh Star. 1999. Sorting Things Out: Classification and Its Consequences. Cambridge, Massachusetts: MIT Press.

Clifford, James, and George E. Marcus. 1986. Writing Culture: The Poetics and Politics of Ethnography: A School of American Research Advanced Seminar. Berkeley: University of California Press.

Flood, Joe. 2011. The Fires: How a Computer Formula, Big Ideas, and the Best of Intentions Burned Down New York City—and Determined the Future of Cities. New York: Riverhead Books.

Geiger, R. Stuart, and David Ribes. 2011. “Trace Ethnography: Following Coordination through Documentary Practices.” In 2011 44th Hawaii International Conference on System Sciences, 1–10. https://doi.org/10.1109/HICSS.2011.455.

Gitelman, Lisa, ed. 2013. “Raw Data” Is an Oxymoron. Cambridge, Massachusetts: MIT Press.

Gray, Jonathan, Carolin Gerlitz, and Liliana Bounegru. 2018. “Data Infrastructure Literacy.” Big Data & Society, July. https://doi.org/10.1177/2053951718786316.

Hill, Jim. 1996. “Illegal Immigrants Take Heat for California Wildfires.” CNN, July 28, 1996. https://web.archive.org/web/20051202202133/https://www.cnn.com/US/9607/28/border.fires/index.html.

Loukissas, Yanni Alexander. 2019. All Data Are Local: Thinking Critically in a Data-Driven Society. Cambridge, Massachusetts: MIT Press.

Martin, Aryn, and Michael Lynch. 2009. “Counting Things and People: The Practices and Politics of Counting.” Social Problems 56, no. 2: 243–66. https://doi.org/10.1525/sp.2009.56.2.243.

Metcalf, Jacob, Kate Crawford, and Emily F. Keller. 2015. “Pedagogical Approaches to Data Ethics.” Council for Big Data, Ethics, and Society. https://bdes.datasociety.net/council-output/pedagogical-approaches-to-data-ethics-2/.

Neff, Gina, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. 2017. “Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science.” Big Data 5, no. 2: 85–97. https://doi.org/10.1089/big.2016.0050.

Plaia, Antonella, and Mariantonietta Ruggieri. 2011. “Air Quality Indices: A Review.” Reviews in Environmental Science and Bio/Technology 10, no. 2: 165–79. https://doi.org/10.1007/s11157-010-9227-2.

Ribes, David, and Steven J. Jackson. 2013. “Data Bite Man: The Work of Sustaining a Long-Term Study.” In Gitelman 2013, 147–166.

Star, Susan Leigh, and Geoffrey C. Bowker. 2007. “Enacting Silence: Residual Categories as a Challenge for Ethics, Information Systems, and Communication.” Ethics and Information Technology 9, no. 4: 273–80. https://doi.org/10.1007/s10676-007-9141-7.

Acknowledgments

Thanks are due to the students enrolled in STS 115: Data Sense and Exploration in Spring 2019 and Spring 2020, whose work helped refine the arguments in this paper. I also want to thank Matthew Lincoln and Alex Hanna for their thoughtful reviews, which strengthened not only the arguments in the paper but also my planning for future iterations of this course.

About the Author

Lindsay Poirier is Assistant Professor of Science and Technology Studies at the University of California, Davis. As a cultural anthropologist working within the field of data studies, Poirier examines data infrastructure design work and the politics of representation emerging from data practices. She is also the Lead Platform Architect for the Platform for Experimental Collaborative Ethnography (PECE).

Erasmus’s network, visualized using Cytoscape. Both nodes and edges are colored, and the nodes are sized, so that more information about centrality, edge weight, and clustering coefficient can be seen.

Issue Eighteen

Thinking Through Data in the Humanities and in Engineering

Elizabeth Alice Honig, University of Maryland, College Park

Deb Niemeier, University of Maryland, College Park

Christian F. Cloke, University of Maryland, College Park

Quint Gregory, University of Maryland, College Park

Abstract

This article considers how the same data can be differently meaningful to students in the humanities and in data science. The focus is on a set of network data about Renaissance humanists that was extracted from historical source materials, structured, and cleaned by undergraduate students in the humanities. These students learned about a historical context as they created first travel data, and then the network data, with each student working on a single historical figure. The network data was then shared with a graduate engineering class in which students were learning R. They too were assigned to acquaint themselves with the historical figures. Both groups then created visualizations of the data using a variety of tools: Palladio, Cytoscape, and R. They were encouraged to develop their own questions based on the networks. The humanists’ questions demanded that the data be re-embedded in a context of historical interpretation (they wanted to re-embrace contingency and uncertainty), while the engineers tried to create the clarity that would allow for a more forceful, visually comprehensible presentation of the data. This paper compares how humanities and engineering pedagogies treat data and what pedagogical outcomes can be sought and developed around data across these very different disciplines.

In the humanities, we train students to interpret their material within a larger context. Facts exist to be contextualized, biases uncovered, problems revealed. Students in many corners of the humanities are rarely confronted with something termed data, which they imagine as dry and quantitative and unyielding. Art history in particular is still a discipline of printed books and, especially, of material objects. Of course data do exist in our field, adhering to objects as physical information or tagged contents, or to the objects’ makers, as in the University of Amsterdam’s monumental ECARTICO project (Manovich 2015; Bok et al. n.d.). But introducing students to data is normally much less central to our work than persuading them to engage in close examination of the visual, and to use libraries to gather information.

Modern engineering is distinguished by production of massive data, most of which can be accessed from all over the world. Engineering students often take computer science and statistics classes, in addition to a curriculum in their chosen field, as a way of acquiring the expertise to deal with modern data. In the engineering realm, quantitative data are central and the context from which data arise is usually not discussed. As a result, engineering educators have devised pedagogy to motivate students to contextualize findings. One of the primary ways that engineering pedagogy has changed in the past twenty years to meet this challenge is the introduction of experiential and project-based learning (Crawley et al. 2007; Savage, Chen, and Vanasupa 2008). Both of these approaches are designed to couple the development of technical skills with increasing contextual awareness and cultural literacy. In this paper, we unpack key assumptions at the heart of the current state of pedagogy in both engineering and digital humanities by posing two questions:

Does digital training in the humanities alone motivate students to consider an outward focus for their contextual learning?
Does project-based learning in engineering motivate students sufficiently to dig below the exploration of data and production of visualizations, and into context?

We implicitly challenge the notion that teaching digital humanities and the construction and meaning of “data” is enough to create a digital scholar. In engineering, we challenge the notion that a shift to project-based instruction is sufficient to motivate student learning beyond digital skills and computational methods.

To conduct this study, we consider how one data set functioned pedagogically in a humanities course taught within an art history department, and how the same data and core assignment were used in parallel in a data science course taught in engineering. In both cases, the process of working with data was meant to unsettle the ways in which students had normally been asked to work in their discipline. “Data” was framed as both a subject of analysis and a pedagogical tool to make students question their habits of thought, further empowering them to ask questions they had never thought to ask before. In both cases, students had to move back and forth between interpretability and quantification, recognizing the limitations and opportunities of approaching their data as (historical) material, and organizing their historical material as data.

The Humanities Class

The course “Humanists on the Move” introduced liberal arts undergraduates to data gathering and structuring as well as visualization and analysis. The goal of the class was to make students engage with the most fundamental humanities source material—primary written historical documents—as well as with data: the former should make the analysis of the latter meaningful. In fact, by the end of the semester, the class would not merely have learned about the early sixteenth century, about individual humanist figures, and about data and their analysis, but as a group the students would have produced new knowledge about this historical period, things that could not have been found in any published source.

Each student took on a single humanist figure for the semester. The characters ranged from Martin Luther to Isabella d’Este, Erasmus to Copernicus, Henry VIII to Cellini and Leonardo da Vinci. Students worked in groups according to the type of figure they were studying: Rulers, Artists, Scientists, and Thinkers. Every week the class read and discussed a primary source text, “met” its author, and investigated the historical context within which that figure had lived and ruled, painted, or written. Students learned enough about their own figure’s life to provide both a short written introduction and a longer oral presentation about them to the class. Having attained familiarity with their figures, other students’ figures, and a sense of the period based on contemporary writings, students then moved on to consider how the humanists’ historical roles were impacted by mobility and network-building—and, further, how other variables (gender, profession, national origin) factored into these complexities. This process required original research, and would necessitate collecting, structuring, cleaning, visualizing and analyzing data.

Using biographical sources, particularly actual printed books (which many in the class had never thought to consult before), students first gathered information on the travels of their figure: locations visited, and dates of travel. They geocoded each location so that it could be mapped, and they structured their material as data, each creating a three-sheet Excel spreadsheet. The members of each group then combined their data into a single spreadsheet, so that all Rulers, or all Artists, would eventually be visualized and analyzed together.

The class was initially held at UMD’s Collaboratory, where Collaboratory staff introduced students to OpenRefine, an open source platform created in Google Labs (originally as GoogleRefine) to clean and parse data using a simple set of tools (Muñoz 2013a; 2013b; 2014). This introduction covered installation and basic use. Each time it is opened, OpenRefine starts a server instance on the host computer, which the user accesses through a web browser. Users can open a local dataset (the default choice), as well as live data accessed via a URL (e.g., that of the City Permit Office of Toronto, Canada, which is the basis for the tutorials on using OpenRefine found in the Documentation section at openrefine.org).

Using “Sample Messy Humanist Data,” a dataset contained within an Excel spreadsheet provided by Professor Elizabeth Honig, Christian Cloke and Quint Gregory demonstrated the use of basic tools within OpenRefine, such as Common Transforms, Faceting, and Clustering, which allow the user to quickly reconcile data values that are similar though not identical (such as capitalized and uncapitalized entries, misspellings, and entries with a space before or after a string). Through such operations, which require one to think carefully about how the data are structured, the user develops a deeper awareness of the dataset and confidence in its soundness and consistency. In addition, students were shown how different columns of data could be joined or split, depending on the desired outcome, to make new data expressions. The resulting “cleaned” dataset could be exported to a data table in any number of preferred formats (CSV/TSV, Excel, JSON, etc.).
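The logic of this kind of reconciliation can be sketched in R, the language the engineering students were learning. This is an illustration in the spirit of key-based clustering, not OpenRefine's own code: the entries are invented and the helper `make_key` is hypothetical.

```r
# Hypothetical messy entries (invented), showing the kinds of variation
# described above: capitalization, stray spaces, and a misspelling.
places <- c("Rome", "rome", " Rome", "Rome ", "Venice", "VENICE", "Venise")

# Build a normalized key: trim whitespace, lowercase, drop punctuation.
make_key <- function(x) {
  x <- trimws(x)
  x <- tolower(x)
  gsub("[[:punct:]]", "", x)
}
keys <- make_key(places)

# Entries that share a key form one cluster of likely duplicates.
clusters <- split(places, keys)
print(clusters)

# Misspellings survive key normalization; edit distance can flag
# near-identical keys for a human to review.
unique_keys <- unique(keys)
distances <- adist(unique_keys)
dimnames(distances) <- list(unique_keys, unique_keys)
print(distances)
```

Normalization alone merges the capitalization and whitespace variants, while the misspelling stays separate and shows up only as a small edit distance. Deciding whether two such entries really name the same place remains a judgment for the researcher.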

To visualize their travel data, students were trained to use the Stanford-based platform Palladio (Humanities + Design n.d.). Palladio is an open source tool that was originally conceived to visualize data from the “Mapping the Republic of Letters” project, which had collected material on scholarly networks in early modern Western Europe. Its main capabilities are therefore the visualization of networks and the creation of maps. Although designed to be usable by humanists, Palladio still requires correctly structured data, and students explored how that structuring impacted the generation of maps in Palladio’s system. Within its map function, Palladio also allows the visualization of chronological data linked to travels as both a timeline and timespans, so that the user can see the locations mapped (with locations sized according to criteria such as number of times visited) and the years in which travels occurred (Figure 1). Palladio also allows for “faceting,” i.e., dividing and recategorizing elements of data so that they can be examined in another dimension. For example, faceting enabled students to study over what distances female humanists were able to travel, or what cities attracted the most scientists vs. the most theologians, or which figures might have been together in Rome during a given year.

Figure 1. Visualization of the travels of artists, with a faceted timeline overlaying the map of locations visited; the timeline shows the locations visited in each year.
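The structure such travel data needs, and the counting that sits behind sized map points and facets, can be sketched in R. The records below are invented, the figures are placeholders, the coordinates are approximate, and none of this reproduces Palladio's internals.

```r
# Hypothetical travel records (invented), one row per visit.
# Coordinates are approximate and serve only to show the geocoded structure.
travels <- data.frame(
  figure = c("Figure A", "Figure A", "Figure A",
             "Figure B", "Figure B", "Figure C"),
  group = c("Artists", "Artists", "Artists",
            "Scientists", "Scientists", "Artists"),
  place = c("Rome", "Florence", "Rome", "Padua", "Rome", "Rome"),
  lat = c(41.90, 43.77, 41.90, 45.41, 41.90, 41.90),
  lon = c(12.50, 11.26, 12.50, 11.88, 12.50, 12.50),
  year = c(1505, 1506, 1510, 1505, 1510, 1510),
  stringsAsFactors = FALSE
)

# Number of recorded visits per place: a quantity that could size a map point.
visits_per_place <- sort(table(travels$place), decreasing = TRUE)
print(visits_per_place)

# Facet by group: visits per place within each type of figure.
print(table(travels$group, travels$place))

# Facet by year: which figures have a recorded visit to Rome in 1510?
in_rome_1510 <- unique(
  travels$figure[travels$place == "Rome" & travels$year == 1510]
)
print(in_rome_1510)
```

Each facet is simply a different way of grouping the same rows, which is why the one-row-per-visit structure matters: a question such as who might have been in Rome in the same year is only answerable if place and year are recorded consistently for every figure.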

Based on the maps and faceting, and on their research on individual figures whose travels were now visualized together, the class was able to explore what life events, ambitions, and exigencies led to travel in the Renaissance, and how travel mattered differently to figures with different professions.

The Data Set

The data set shared between humanists and engineers was created in the next phase of “Humanists on the Move,” which concerned humanist networks. Historical networks have been thoroughly studied and, more recently, elegantly visualized. The vast and remarkable website The Six Degrees of Francis Bacon, hosted by the Carnegie Mellon University Libraries, is a model of what a collaborative project using humanities data can accomplish (Lincoln 2016; Moretti 2011). Nevertheless, network material as we imagined it would be considerably less clear-cut as data than travel had been. A person is or isn’t in a given location at a given time, but a connection—in network terms, an edge—is harder to define. There are obvious connections such as family, colleagues, allies, collaborators. But when a figure read a book by another humanist, did that make them connected? And if so, how deeply connected had they become? How would the importance of that connection compare to, say, attending a performance in which another figure had acted, being present at a diplomatic meeting but not as a main player, writing a letter but (as far as we know) never receiving a reply to it? Historical resources are often fragmentary, and the class tangled with how to account for that as they assembled data. These were issues that most undergraduates had never confronted as they studied history, but now, history’s lacunae were of immediate relevance to their work.

In structuring their data, students were asked first to come up with a limited set of labels that would describe relationships. These might include patronage, respect, influence, friendship, antagonism. Often they encountered an example that none of their labels seemed to fit, but which was not sufficiently different, or representative, to warrant a new label. They learned how to compromise. Next, the students had to agree on criteria by which those edges could be weighted on a scale of one to three.

Another way of thinking about this exercise entails recognizing that it involved phases of translation, from humanist ways of thinking about material into quantifiable terms and then back again (Handelman 2015; Bradley 2018). Describing relationships, even determining what makes a relationship and why it matters, is a perfect example of humanistic work. Art historians love to talk about influence, patronage, and collaboration; this is all fundamental to how we write our histories. We could all probably say who was an important patron or a minor influence. But the students were asked to take information they had gathered and make it numerically regular, working against the humanist instinct to value irregularity and to see each instance of a given relationship, whether patronage or correspondence, as essentially a unique event with its own characteristics that are not simple to equate with those of a comparable event (Rawson and Muñoz 2016). Now every relationship had to be described using a fixed term from a limited list; every edge had to have a weight, from one to three. These decisions required long discussions, even though the COVID pandemic was widespread and we could meet only via Zoom.

The class gathered nearly 700 connections representing the ways in which over 450 different persons were connected to our core of twenty humanist figures (Figure 2). All of the groups combined their data into one large class spreadsheet. Every person (node) was described by a profession, every relationship (edge) had a label, sometimes several, and a numerical weight. This was the data set that we passed along to the engineers.

Section of a spreadsheet showing how network connections were recorded. Each line represents an edge, or relationship between two individuals, and includes information on gender, profession, and nature and closeness of the connection. — Figure 2. Part of the network spreadsheet, in progress. Each line represents an edge, with our key figures in column C and their connected nodes in column G. Information about each figure includes profession terms and gender; relationships are characterized in terms of type of connection and edge weight.
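The structure of one spreadsheet row, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names are hypothetical, the label list reuses the examples mentioned earlier (the class's actual list may have differed), and the only rules enforced are the ones the text states, namely that every edge carries at least one label from the agreed list and a weight from one to three.

```python
from dataclasses import dataclass

# Example labels mentioned in the text; the class's real list is an assumption here.
ALLOWED_LABELS = {"patronage", "respect", "influence", "friendship", "antagonism"}


@dataclass(frozen=True)
class Edge:
    """One spreadsheet row: a relationship between a key figure and a connected person."""
    source: str    # key humanist figure (hypothetical field name)
    target: str    # connected person (hypothetical field name)
    labels: tuple  # one or more relationship labels
    weight: int    # closeness of the connection, 1 to 3

    def __post_init__(self):
        # Reject rows that break the rules the class agreed on.
        if not self.labels:
            raise ValueError("an edge needs at least one label")
        unknown = set(self.labels) - ALLOWED_LABELS
        if unknown:
            raise ValueError(f"labels not in the agreed list: {sorted(unknown)}")
        if self.weight not in (1, 2, 3):
            raise ValueError("weight must be 1, 2, or 3")


def is_valid_row(source, target, labels, weight):
    """Return True if the row can be recorded as an Edge, False otherwise."""
    try:
        Edge(source, target, tuple(labels), weight)
        return True
    except ValueError:
        return False
```

A row whose relationship fits none of the agreed labels is rejected rather than silently recorded, which mirrors the compromise the students faced: either fit the case to an existing label or argue for a new one.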

Engineers, Data, and a Humanities Data Set

The course “Data in the Built Environment” is designed to teach data science skills to graduate engineering students. One of its main aims is to motivate students to dig deeper into context via project-based learning concepts (Hicks and Irizarry 2018). To do this, students are given a new dataset each week with which to practice a newly introduced data science technique. Students practice the technique in class in groups and then use new data (also in groups) for homework as a way of deepening and solidifying their understanding (Horton et al. 2018; Neff et al. 2017). In short, each week students are challenged to synthesize the technical knowledge and then apply this learning through a practical data application with questions relevant to the data rather than to the technique. This approach is designed to create a tension between data as viewed by engineers and problems that require a deeper analysis to really understand the contextual story. Throughout the semester, the class pedagogy (and grading) emphasized the importance of characterizing data analysis results within the context in which data emerges. The network class was taught toward the end of the semester, so students had practice with linking data subtleties to context, but only in data reflective of the built environment (e.g., transportation, water, and housing data).

The underlying assumption of most engineering students is that data are data, mostly the same in all applications. Rarely do engineering students grapple with data that are unfamiliar to them. The Humanists on the Move data offered a completely novel opportunity to practice network visualization, motivating students to understand the underlying data in a way that they would not normally worry about.

The engineering class assignment mimicked the instructions for the humanist class, but compressed the time allocated for background research. Each student was assigned three humanists, who themselves were selected because they provided students the opportunity to uncover interesting contextual information. The engineering students prepared a one-page summary of basic background information for each figure, including important acquaintances, and any documented travel using three or more sources of information. Because the time allocated for background research was compressed, Wikipedia was an allowable source of information. It was notable that even this limited information gathering exercise threw engineering students into new terrain. Many had questions about how to decide what was important, how to find sources of information, even why they were working on these data in particular. The exercise of preparing them for the data both energized and confused them.

The engineering students were organized into groups of three. Because each student had background sheets on three humanists, groups were assigned so that each group had multiple sources of information on one or more humanists. This deliberate tactic was intended to motivate them to think more about the information that their networks were conveying. The exercise was structured so that groups started by developing standard networks and then moved to allow each group to design more elaborate or situational networks.

Visualizing Network Data

Each class now visualized the network data. For the engineering students, this was the entire point of the class: to visualize data with the implicit assumption that they would draw on the contextual information that they had gathered prior to the class. For students from art history and other humanities disciplines, this was new terrain. A map is a reasonably familiar object, even from the Renaissance, and students understood all of its basic parameters (Harley 2001). Superimposing information about travels onto it was not in itself a vast step. A network, however, was not something they were used to thinking about in visual form, nor were they adept at analyzing a network. A visible network gathers data and presents it in a way that will suggest new questions and will demand interpretation in and of itself—humanistic interpretation, that will return the uncertain and the variable while also incorporating the regular and quantified.

In engineering, visualization is essential for exploring, cleaning, understanding and explaining data. In the class, students master programming for data visualization that makes data exploration easier and more productive, and allows an engineer to both better understand the data and to present data in a way that has impact, particularly on audiences such as policy makers and the public. Students are taught appropriate (and inappropriate) uses of different kinds of charts and graphs, graphical composition, and the design aspects of effectively conveying information such as selecting colors, minimizing chartjunk and emphasizing key features of the data. The focus in engineering is on the mechanics of visualization. As noted earlier though, the transition to project-based learning in our field has ideally involved preparing students to explore context more deeply, even contexts with which they were truly unfamiliar.

The engineering class used a variety of network packages within R, which is a language that provides an environment for statistics and visualization (R Core Team n.d.). The language is open-source, rooted in statistical computing and provides a reproducible platform for engineering calculations. One of R’s major strengths is that it can be easily extended through packages to include modern computing methods and approaches. The network packages within R that were used in the class included igraph, ggraph, tidygraph, and visNetwork.

The igraph package provides functions that implement a wide range of graph algorithms and can handle very large graphs (Csárdi and Nepusz 2006). The ggraph package extends ggplot2 (a core package for visualization) to handle networks using the grammar of graphics approach (Wickham 2010). Next, tidygraph provides tools to manipulate and analyze networks and is a wrapper for most of the igraph capabilities (Pedersen 2020). Finally, visNetwork allows for interactive visualization. Students were given the opportunity to work with any of these tools on this exercise.

The humanities students had started their visualization process using Palladio again. As in its mapping function, Palladio allows for faceting networks, so at this stage students could see all the connections based on friendship, for example, or isolate how and where clerics fit into the network (Figure 3).

Network of connections between rulers and other figures, visualized by humanities students using Palladio. The network is drab but readable. Nodes sized by number of connections. — Figure 3. Rulers’ network, as visualized using Palladio.
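Conceptually, faceting a network amounts to filtering its edge list by an attribute before drawing it. The sketch below illustrates that idea in plain Python. It is not how Palladio is implemented, and the record layout, names, and sample rows are invented placeholders rather than class data.

```python
def facet_by_label(edges, label):
    """Keep only the edges that carry the given relationship label."""
    return [e for e in edges if label in e["labels"]]


def facet_by_profession(edges, professions, profession):
    """Keep edges where at least one endpoint has the given profession."""
    return [
        e for e in edges
        if professions.get(e["source"]) == profession
        or professions.get(e["target"]) == profession
    ]


# Placeholder rows, invented for illustration only.
edges = [
    {"source": "Figure A", "target": "Person B", "labels": ["friendship"], "weight": 3},
    {"source": "Figure A", "target": "Person C", "labels": ["patronage"], "weight": 2},
    {"source": "Figure D", "target": "Person C", "labels": ["friendship", "respect"], "weight": 1},
]
professions = {
    "Figure A": "painter",
    "Person B": "cleric",
    "Person C": "ruler",
    "Figure D": "writer",
}

friendships = facet_by_label(edges, "friendship")
cleric_edges = facet_by_profession(edges, professions, "cleric")
```

Because an edge may carry several labels, the same relationship can appear in more than one facet, which is worth keeping in mind when comparing faceted views.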

Palladio, however, is a tool for visualization and not for computational analysis. It can’t actually work with edge weights, which as humanists we had found to be such an important and complex issue. So at this point the Collaboratory stepped in again with an introduction to Cytoscape. Cytoscape would allow students to visualize the data, while at the same time furnishing a richer understanding of the underlying mathematical analysis of their networks. Cytoscape was developed for analyzing networks of data in systems biology research, as practitioners in this field were not proficient in the use of R (Shannon 2003). As a platform, however, it is discipline-agnostic: data sets of all types and from varied fields, including the humanities, can be analyzed and visualized, and as a result Cytoscape has become a platform researchers in the humanities are comfortable using.

Students were introduced to Cytoscape on the last day of class, and because it was introduced so late in the semester it was advertised as a way for interested students to build another skill and continue querying the dataset they had thus far created and visualized. Students were fascinated by the insights gained from network analyses possible in Cytoscape, but unavailable in Palladio. In addition, they responded favorably to the powerful suite of options within the visualization environment of Cytoscape. For instance, the appearance of nodes and edges can be customized prior to analysis to isolate certain types of values, or the researcher can use the results of statistical analysis to draw out nodes and connections of greater importance within the network. Also of considerable value is the ability of Cytoscape to parse larger datasets, or focus in on specific nodes to make sense of networks within networks, which can be selected and excised into separate visualizations (Figure 4).
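The idea of excising a "network within a network" can be illustrated with a short Python sketch that pulls out everything within a given number of steps of one node. This is a generic breadth-first approach written for illustration, not a description of how Cytoscape selects subnetworks; the function name, the `(source, target, weight)` tuple layout, and the sample edges are assumptions.

```python
from collections import defaultdict, deque


def ego_network(edges, center, radius=1):
    """Return the edges among all nodes within `radius` steps of `center`.

    `edges` is a list of (source, target, weight) tuples, treated as undirected.
    """
    neighbors = defaultdict(set)
    for a, b, _weight in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Breadth-first search outward from the center, stopping at the radius.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    # Keep every edge whose two endpoints were both reached.
    keep = set(distance)
    return [e for e in edges if e[0] in keep and e[1] in keep]


# Placeholder edges, invented for illustration only.
sample = [("A", "B", 3), ("B", "C", 1), ("C", "D", 2), ("A", "E", 2)]
close_circle = ego_network(sample, "A", radius=1)
wider_circle = ego_network(sample, "A", radius=2)
```

Note that the result is the induced subnetwork: links between two of the center's neighbors are kept as well, not only links that touch the center itself.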

Interpreting the Visualized Data

For the humanities students, it was the process and outcome of visualization that made the data intriguing to interpret. But crucially, the data had been created by them, over a period of months, before they could move ahead with visualizing and interpreting it. It was only then that they could see, for instance, that certain thinkers held key positions between powerful figures while others, extremely famous in our day, were on the margins of the main humanist network. Persons who wrote a great deal, be it sermons or conduct books or even letters, might have an enormous “degree centrality” (or number of connections), even while the edge weight of many of their connections was relatively low. Some secondary figures whom we would have thought to be quite outside our network assumed rather central positions in it. What, we asked, should we make of these unexpected findings?
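The contrast between a high number of connections and low edge weights can be made concrete with a small calculation. The following Python sketch is an illustration only, not the analysis run in Cytoscape or R: it counts each node's edges and averages their weights. The tuple layout and sample values are invented, and the count equals the number of distinct connections only if no pair of persons appears in more than one row.

```python
from collections import defaultdict


def degree_and_mean_weight(edges):
    """Map each node to (number of edges, mean weight of those edges).

    `edges` is a list of (source, target, weight) tuples, treated as undirected.
    """
    weights = defaultdict(list)
    for a, b, w in edges:
        weights[a].append(w)
        weights[b].append(w)
    return {node: (len(ws), sum(ws) / len(ws)) for node, ws in weights.items()}


# Placeholder data: a prolific correspondent with many weak ties,
# and a courtier with a single strong one.
sample = [
    ("Writer", "P1", 1),
    ("Writer", "P2", 1),
    ("Writer", "P3", 1),
    ("Writer", "P4", 2),
    ("Courtier", "P1", 3),
]
stats = degree_and_mean_weight(sample)
```

In this invented example the writer has four connections with a mean weight of 1.25, while the courtier has one connection of weight 3, the kind of pattern that prompts the interpretive question of which form of connectedness mattered more.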

Because students had developed the data themselves, and had in the process become very familiar with individual figures within the network, they were better able to interpret the position of each major person. And because of their previous experience with mapping, they had extra knowledge that informed their interpretation of the network. For instance, a figure who travelled very little (say, Raphael) was hampered in his network-building despite his enormous historical influence. This led the class not only to question their art-historical preconceptions (for example, that as a superstar, Raphael would be at the center of a network) but also to pose further humanistic questions that the data could not answer. Network-building was crucial for some figures (Aretino springs to mind) but of limited importance for others. What were the alternatives? Creating, visualizing, and then interpreting data was a means of creating new knowledge and a stimulus to further thinking. This further thinking was based on humanistic knowledge and posed questions that would be answered through those means. The shuttle back and forth between quantifiable data and humanistic inquiry through data and its visualization was a hugely fruitful exercise (Drucker 2011).

While producing reasonably well-designed networks, the engineering students studiously avoided connecting networks to a more textual analysis. For example, Figure 5 on the left shows the most common output (from ~90% of the groups) when students were asked to portray the network (an open-ended question). When asked to focus on one or more attributes, every group produced a gender network (Figure 5 on the right). This happened despite the relative abundance of other types of attributes and of group and individual knowledge specific to each of the humanists.

Two visualizations of humanist networks made by engineering students using R. One shows all links between figures, and the other separates out networks of women from those of men. — Figure 5. Humanist networks as visualized in R by engineering students. The full network, and a network distinguished by gender.

Conclusion

Humanists were challenged by the idea of extracting data from context, taking facts (“Do we believe in facts in this class?” one student had asked) and turning them into quantifiable data. The more they discretized and structured the data, the more resistant they became to compromise, to what they perceived as flattening out the nuance of individual relationships or even professional identities. However, once the data were visualized, class members were well prepared to read those results and return them to a humanist framework. Without caring particularly how the networks themselves looked, they approached the data with a more historically informed eye than did the engineers and moved quickly to interpretation. For instance, they already knew well the limitations on women’s travel and connections—we had read primary sources about women’s education—and so that and other historical aspects of the network were more revealing to them.

Much of engineering pedagogy focuses on design techniques to solve a problem. In the engineering R class, the design techniques were tuned toward learning about visualization (e.g., color ramps) and toward coding and designing visual features that draw attention to the aspects of the visualization relevant to the analytical objective. This approach to the exercise resulted in networks that lacked texture, despite the interesting and often provocative information on the humanists that students gathered prior to the class. Engineers tend to gravitate toward well-produced visualizations (e.g., appropriately labeled axes and descriptive titles) or toward portraying some important design feature. When the data cannot be understood without context, engineers are less able to navigate the tension between accuracy and context.

Engineers are, however, more alert to the subtleties of the visualization itself and how it communicates information about the data. The caveat here is that the engineering students seem unable to bring the visualization subtleties they notice back to the data context. In other words, they produce beautiful graphics but do not reflexively use these visualizations to think more about the problem from which their data emerges. Conversely, humanists, even art historians, have not been trained to care about the aesthetic and persuasive presentation of data. Perhaps this is because humanists see themselves as talking mostly with one another, moving rather quickly from visualized data back to humanistic queries and a written argument. It may be that the humanist students need to be formally trained to make their visualizations an integral part of their textual analysis story. It might also be useful to the future of the humanities, particularly a public-facing humanities, if humanists were more comfortable not only with data, but also with using it to speak beyond the confines of the classroom or the pages of a scholarly journal.

Bibliography

Bok, Marten Jan, Harm Nijboer, and Judith Brouwer, eds. n.d. ECARTICO: Linking cultural industries in the early modern Low Countries, ca. 1475 – ca. 1725. Accessed October 17, 2020. http://www.vondel.humanities.uva.nl/ecartico/.

Bradley, Adam James. 2018. “Visualization and the Digital Humanities.” IEEE Computer Graphics and Applications 38, no. 6: 26–38.

Csárdi, Gábor, and Tamás Nepusz. 2006. “The igraph software package for complex network research.” InterJournal Complex Systems: 1695. https://igraph.org.

Crawley, Edward, Johan Malmqvist, Soren Ostlund, Doris Brodeur, and Kristina Edstrom. 2007. “Rethinking Engineering Education.” The CDIO Approach 302: 60–62.

Drucker, Johanna. 2011. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 5, no. 1. http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html.

Handelman, Matthew. 2015. “Digital Humanities as Translation: Visualizing Franz Rosenzweig’s Archive.” TRANSIT 10, no. 1. https://escholarship.org/uc/item/69d0g81v.

Harley, J.B. 2001. “Maps, Knowledge, and Power” and “Silences and Secrecy: The Hidden Agenda of Cartography in Early Modern Europe.” In The New Nature of Maps, 51–107. Johns Hopkins.

Hicks, Stephanie C., and Rafael A. Irizarry. 2018. “A Guide to Teaching Data Science.” The American Statistician 72, no. 4: 382–391. https://doi.org/10.1080/00031305.2017.1356747.

Humanities + Design. n.d. Accessed October 17, 2020. https://hdlab.stanford.edu/palladio/.

Lincoln, Matthew. 2016. “Social Network Centralization Dynamics in Print Production in the Low Countries, 1550–1750.” International Journal for Digital Art History 2: 134–152.

Manovich, Lev. 2015. “Data Science and Digital Art History.” International Journal for Digital Art History, no. 1 (June). https://doi.org/10.11588/dah.2015.1.21631.

Moretti, Franco. 2011. “Network Theory, Plot Analysis.” New Left Review 68: 80–102.

Muñoz, Trevor. 2013a. “What IS on the Menu? More Work with NYPL’s Open Data, Part One.” http://trevormunoz.com/notebook/2013/08/08/what-is-on-the-menu-more-work-with-nypl-open-data-part-one.html.

———. 2013b. “Refining the Problem — More Work with NYPL’s Open Data, Part Two.” http://trevormunoz.com/notebook/2013/08/19/refining-the-problem-more-work-with-nypl-open-data-part-two.html.

———. 2014. “Borrow a Cup of Sugar? Or Your Data Analysis Tools? — More Work with NYPL’s Open Data, Part Three.” http://trevormunoz.com/notebook/2014/01/10/borrowing-data-science-tools-more-work-with-nypl-open-data-part-three.html.

Neff, Gina, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. 2017. “Critique and contribute: A practice-based framework for improving critical data studies and data science.” Big Data 5, no. 2: 85–97.

Horton, Paul Alexander, S.S. Jordan, Steven Weiner, and Micah Lande. 2018. “Project-Based Learning among Engineering Students during Short-Form Hackathon Events.” In ASEE Annual Conference and Exposition, Conference Proceedings.

Pedersen, Thomas Lin. 2020. “A Tidy API for Graph Manipulation.” A Tidy API for Graph Manipulation. Accessed October 17, 2020. https://tidygraph.data-imaginist.com/.

R Core Team. n.d. Accessed October 17, 2020. https://www.r-project.org/about.html.

Rawson, Katie, and Trevor Muñoz. 2016. “Against Cleaning.” Curating Menus, July 7. http://www.curatingmenus.org/articles/against-cleaning/.

Savage, Richard, Katherine Chen, and Linda Vanasupa. 2008. “Integrating Project-Based Learning throughout the Undergraduate Engineering Curriculum.” Journal of STEM Education 8, no. 3.

Shannon, Paul. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13: 2498–2504.

Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19, no. 1: 3–28. https://doi.org/10.1198/jcgs.2009.07098.

Acknowledgments

Thanks to Rebecca Levitan, who originally suggested to Elizabeth Honig the idea for this course, and who acted as her teaching assistant when she taught the class at UC Berkeley.

About the Author

Elizabeth Alice Honig is Professor of Northern European Art at the University of Maryland. She is the author of, most recently, Pieter Bruegel and the Idea of Human Nature (Reaktion, 2019), while her current research is about the experience of captivity in Renaissance Europe. She curates the websites janbrueghel.net, pieterbruegel.net, and brueghelfamily.net, and her work in digital art history deriving from those projects has focused on mapping patterns of similarity between pictures produced in the Brueghel workshop network.

Deb Niemeier is the Clark Distinguished Chair in Energy and Sustainability at the University of Maryland, College Park and a professor in the Department of Civil and Environmental Engineering. She works with sociologists, planners, geographers, and education faculty to study the formal and informal governance processes in urban landscapes and the risks and disparities associated with outcomes in the intersection of finance, housing, infrastructure and environmental hazards. She is an AAAS Fellow, a Guggenheim Fellow, and a member of the National Academy of Engineering.

Christian Cloke specializes in the archaeology of the ancient Mediterranean world, employing a range of digital methods and technologies to do so. In service to his archaeological fieldwork (in Italy, Jordan, Armenia, Albania, and Greece), he builds and works with custom databases, Geographical Information Systems (GIS), and a wide array of imaging techniques. He holds a PhD in Classical Archaeology from the University of Cincinnati and is currently the associate director of the Michelle Smith Collaboratory for Visual Culture at the University of Maryland, College Park, where he works on varied digital research and pedagogical projects with students and faculty.

Quint Gregory specializes in seventeenth-century Dutch and Flemish art, as well as museum theory and practice. He is the creator and director of the Michelle Smith Collaboratory for Visual Culture, a center within the University of Maryland’s Department of Art History and Archaeology committed to supporting students, faculty, staff, and members of the broader community who are interested in adopting digital humanities methods and tools in their work and practice. He is especially interested in using offline and online platforms and skills in the causes of social and racial justice and to repair our relationship with the planet.