Troublescraping: Migrating JITP’s Archive to Manifold

Brandon Walsh, University of Virginia

Editor’s Note:
In this renewal of our Behind the Seams feature, members of the JITP Editorial Collective reflect on their contributions to the journal-wide migration of our publishing platform from WordPress to Manifold. This piece finds Brandon Walsh in commentary around the challenges of standardizing over 300 WordPress articles to be restaged in layout as Manifold texts. Brandon walks readers through the technical and procedural steps underlying this stage of the migration process, in turn highlighting the oft-hidden process by which text data is normalized and translated across open publishing platforms.
—Zach Muhlbauer

Imagine, for a moment, that you are visiting a friend’s house for dinner. While helping to prepare the meal, they ask you to set the table. Trying to be helpful and perhaps guided by how you have organized your own kitchen, you open a drawer expecting to find silverware. Instead, you find potholders. You open one drawer after another until finally you must ask for help, at which point the host shows where to find the things you need.

While not a perfect analogy, this scenario does begin to illustrate one of the key challenges we’ve faced working on the shift of our publishing platform to Manifold: the procedure by which we move our old, archived content to the new system. It’s not a big deal if you’re only moving one or two pieces of content—you can just copy and paste things. JITP has over three hundred articles, though, so we need a plan for dealing with all these articles, staged in layout by different teams at different times, programmatically. Technical platforms are generally quite opinionated about how they organize their data, and each one makes particular decisions that might not translate well to another setup. If you’re lucky, your old platform will have a mechanism by which you export your data into a format readable by the new one. But even in the best of circumstances the process rarely works seamlessly and can require dozens or hundreds of hours of work to finish the process and make everything uniform across the archive. It’s something akin to saying, “Before I even begin to set this table, I need to reorganize this kitchen to be how I’m used to. Oh, and also I need to move my own silverware in to do this properly. Surprise—we’re now roommates.”

With every migration, it’s important to know where you’re starting from and where you are hoping to end up. JITP has historically been hosted on WordPress, a popular blogging platform. Each individual post is more than just the text of an article; it also contains an author, a date, a featured image, comments, and more. The complicated bit, then, is how all this metadata gets packaged together with the post to be transferred. WordPress has an export tool that provides a dump of your data in a format called XML, a standard markup language similar to HTML, the backbone of the web. We started with this export. Here’s an example article from Issue 19, and its exported code (Figure 1), to show what the XML export looks like. Note the combination of metadata about the text with the content itself as well as the embedded HTML, such as hyperlinks and structural information.

The XML markup of the above linked article, with the abstract preceded by a difficult-to-read set of metadata. — Figure 1. Screenshot of XML export of an article from Issue 19.

So, having successfully gotten everything out of WordPress, we faced our next hurdle: Manifold cannot take in XML directly. We needed to look at options to convert what we had to what we needed. First, Migration Lead Kelly Hammond used a freely available script to convert our XML to Markdown, a more lightweight language that Manifold could actually ingest. Here is the same article in Markdown for comparison:

The Markdown rendering of the above linked article, with the abstract preceded by a set of metadata that is much easier to read. — Figure 2. Screenshot of the same article, as converted to Markdown.

The Markdown ingestion worked, and we were able to get things directly onto Manifold. As mentioned, Markdown is more lightweight as a format, trading off the much more nuanced and complicated options that you get with HTML for a format that highlights with more clarity the features you are most likely to need when writing articles for the web. The result is quicker to write and more legible to the naked eye, both of which are probably legible from the screenshots above (Figures 1 and 2). As with any trade-off, some of the features we gave up nevertheless proved themselves important enough to pose a problem, though in ways that might not be immediately legible. Take these two versions of the same byline (Figure 3), the first in HTML and the second in Markdown:

First line reads open angle bracket h2 class=quote byline unquote close angle bracket Nancy Demerdash-Fatemi, Albion College open angle bracket forward slash h2 close angle brackt. The second line reads only numbersign numbersign Nancy Demerdash-Fatemi, Albion College — Figure 3. Article’s byline code as rendered in HTML, followed by the byline code as rendered in Markdown.

They both represent the byline as a second-level header, but only the HTML version gives it a CSS class of “byline,” which WordPress would use to flag authors’ names for consistent styling across all issues. This type of information gets lost in the transition to Markdown, and so we would find that we needed to find some way to maintain a relatively simple, ingestible format that would nonetheless retain some key CSS information, so that similar components of articles would be styled the same way across the newly migrated archive. The other limitation of the Markdown conversion script was that it was created by someone else. That’s great in one way—no need to reinvent the wheel. But it also meant it was very hard to change for our own purposes.

Ultimately, I decided to develop our own solution by writing a script in Python to scrape the JITP site directly. If you’re not familiar, web scraping is essentially the equivalent of asking a computer to copy and paste web information at scale. Because the script would be targeting the live site, it meant that we got all the information visible to the user. I won’t go into all details of the script itself, but you can view it at this gist link or below if you are interested.

https://gist.github.com/walshbr/ef1ebfb4550570047cec5d5ecf45ebd2.js

Because we wrote the script ourselves, it meant that we had a high degree of control over what the output looked like over and above just getting access to the data we needed. This meant that, for example, as we advanced in the conversations about the migration, we had an easy way to make adjustments. One example of this: a particular CSS class category had had a few names for the same function over the various years of the journal’s existence, and so even with some nudging our styling would look inconsistent. Rather than having someone manually go in and edit hundreds of files, we could take care of that with a single line of code, replacing all of the variations with one CSS class name. We could also make surgical interventions to remove particular elements that we didn’t want to keep in the new platform.

This process took a lot of time and energy, but it’s work that has hopefully made the migration more manageable in the long term. It’s given us an opportunity to look back at content that hasn’t been touched in years and think about how best to sustain it over time. Hopefully, the result will renew interest in all of the fantastic work authors and editorial collective members put into pieces in the back catalog of the journal, reanimating debates from new perspectives. Likewise, we hope this short glimpse into our process of reorganizing our own drawers might spark some ideas about your own digital projects, making it a smidge easier to cook with guests—as we so often do.

About the Author

Brandon Walsh is Head of Student Programs in the Scholars’ Lab in the University of Virginia Library. Prior to that, he was Visiting Assistant Professor of English and Mellon Digital Humanities Fellow in the Washington and Lee University Library. He received his PhD and MA from the Department of English at the University of Virginia, where he also held fellowships in the Scholars’ Lab and acted as Project Manager of NINES. His dissertation examined modern and contemporary literature and culture through the lenses of sound studies and digital humanities, and these days he works primarily at the intersections of digital pedagogy and digital humanities. He serves on the editorial board of The Journal of Interactive Technology and Pedagogy. He is a regular instructor at HILT, and he has work published or forthcoming with Programming Historian, Insights, the Digital Library Pedagogy Cookbook, Pedagogy, Digital Pedagogy in the Humanities, and Digital Scholarship in the Humanities, among others.

This entry is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Tags: digital publishing JITP Migration Manifold Markdown Python wordpress

'Troublescraping: Migrating JITP’s Archive to Manifold' has 2 comments

March 22, 2024 @ 12:37 am Lowes Survey

Lowes allows all the residents of the United States to take part in the customer satisfaction survey. It welcomes all the survey participants to give honest feedback about their last visit. Thus, by providing the feedback, all the customers enter into the sweepstakes directly for a $500 check. As it is a monthly contest, five lucky winners will be selected as the winners. So, take the Lowes survey here. Thank you for providing the feedback; your responses are valuable – regards.

September 11, 2022 @ 5:35 am تصليح افران غاز

If you have a gas stove or oven, it is important to know how to keep it in good condition. This means figuring out how he or she successfully fixed the problem. Gas appliances can be dangerous properly, so safety and knowledge are key. In this blog post

Would you like to share your thoughts?

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Brandon Walsh, University of Virginia

About the Author

'Troublescraping: Migrating JITP’s Archive to Manifold' has 2 comments

Would you like to share your thoughts? Cancel reply

Would you like to share your thoughts?