Leeds Report #6, or, I saw multi-variate statistics!

One of the most interesting sessions by far, from my humble perspective, was 1209: Whodunnit? Literary Forensics and Authorship Attribution for the Middle Ages. Three scholars of Middle Dutch, all of whom work on questions of authorship and transmission, spoke on different uses of statistical analysis in looking at textual variants. Before recapping each paper, allow me to talk for a bit about the interesting ideas and issues the session raised.

Sheer Geekiness - I just think this stuff is really cool (XKCD)

Firstly: computational linguistics. This method of linguistic analysis rests on the fact that individual speakers of a common language have distinct linguistic markers. These markers are not topic-specific, but show up in really common words (articles, conjunctions, subordinators) and grammatical patterns. Put simply, you can tell the difference between a post by Magistra and a post by me by the fact that Magistra talks about early medieval history and I talk about high medieval sex; but a computational linguist would run our two anonymised posts through a computer program and discover that I use certain conjunctions far more than she does, and she uses some particular grammatical structure a lot more than I do.
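For the curious, here is a toy sketch of what "counting really common words" might look like in Python. It is only an illustration of the principle, not anything shown in the session: the marker list `FUNCTION_WORDS` and the function name `function_word_profile` are my own assumptions, and a real study would pick its markers from the corpus itself.

```python
import re
from collections import Counter

# Hypothetical marker list, chosen for illustration only. Real studies
# usually take the most frequent words of the corpus under study.
FUNCTION_WORDS = ["the", "a", "and", "but", "that", "of", "in", "to"]


def function_word_profile(text, markers=FUNCTION_WORDS):
    """Return each marker's share of all word tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {word: counts[word] / total for word in markers}


if __name__ == "__main__":
    post_a = "The bishop and the king met, and the court followed."
    post_b = "But that was a matter of law, to be settled in court."
    print(function_word_profile(post_a))
    print(function_word_profile(post_b))
```

Two writers on the same topic would, on this theory, still produce noticeably different profiles, which is what makes the comparison useful.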

Now, I seem to have a lot more faith in computational linguistics than many literary scholars – I think this is because I got taught the basic principles (although by no means how to do it) in first year, thank you Craig Ronalds. I know, for example, that this business about individual language markers has been rigorously tested on modern speakers from different language backgrounds. I know the method has been used to expose cases of police interfering with witness testimony (police members as a group show certain linguistic traits that are not shared by the general population, as a result of their training). I know its uses for humanities scholars haven’t been fully explored or tested yet, but I also suspect that a lot of the distrust people have for evidence drawn from computational linguistics is to do with the unfamiliar kind of evidence. Computational linguistics relies on data and statistical analysis and sciencey kinds of things: I get the feeling that a lot of humanities scholars don’t trust that (it’s repeatable, sure, but you can’t go through your edition and mark it up and SEE the evidence right there). Our discipline trains us to check everything against the text, rather than checking it for thorough and repeatable experimental process: maybe we’re not so willing to trust people who branch out into other kinds of evidence.

With that said, it must also be stated that I don’t know enough about computational linguistics for my bullshit detector to work properly when I hear about it. So I have no way of knowing if an individual scholar is doing their computational linguistics Rong. Given that the application of computational linguistics to literary scholarship is a relatively new field, one risk would be that there aren’t enough trained bullshit-detectors around, but that can only change with time and the increasing usefulness of computational techniques.

So what are some of the uses of computational linguistics to medievalists?

Rombert Stapel has been using computational linguistics to determine how much of Hendrick Gerardsz van Vianen (sp?)’s Croniken van der Duytcher Order, a late 15th c. chronicle of the Teutonic Order with specific focus on the area around Utrecht, was written by the said Hendrick. Several segments are easily identified as being from other sources – the prologue claims to be by a 12th century bishop who certainly wasn’t in Acre when he said he was; and the Bailiwick chronicle for Utrecht seems separate from the main body of the text.

Traditional philological analysis would look at unusual words, and has been of some use to Rombert Stapel, but in the absence of original source texts it’s hard to tell where emendation has been happening. Instead, he took samples from the privileges written by the said Hendrick in his capacity as secretary to the Lands Commander Johan von Drongen. The samples are not just written at a different time to the Croniken, they’re also in a completely different style – something which would usually override philologically distinct vocabulary features, but doesn’t usually override the grammatical data used in computational linguistics.

The full set of samples which he fed into the program (Delta, by someone named Burrows – it’s free, and apparently easy to use) was:

  • 2 sets of samples from the Croniken where traditional philological evidence (comparisons to original sources, I believe) shows Hendrick left traces as author.
  • The privileges mentioned above
  • The Sachsenspiegel, known to have been copied by Hendrick
  • 2 unrelated texts of the same period and genre – one hagiography and one chronicle.

After testing that Delta could distinguish between the unrelated texts and the Hendrick texts, he then compared the samples to the entire rest of the Croniken, and pulled up several sections clearly not by Hendrick, including the first half of the prologue (but not the second); the Bailiwick chronicle; and some formulaic documents: privileges and court pleadings. The rest appears to be either by Hendrick or substantially modified by him.
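For readers wondering what a Delta comparison involves, here is a minimal Python sketch of the measure usually called Burrows’s Delta, as I understand it: take the most frequent words of the reference texts, turn each text’s relative frequencies into z-scores, and average the absolute z-score differences between the disputed text and each candidate. A lower score means a closer stylistic match. This is not the program Rombert Stapel used, and the function names, the `n_words` default and the sample labels are all my own assumptions.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    # Runs of letters only, lowercased; works for accented characters too.
    return re.findall(r"[^\W\d_]+", text.lower())


def relative_freqs(text, vocab):
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocab]


def most_frequent_words(texts, n):
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def burrows_delta(reference_texts, disputed_text, n_words=30):
    """Map each reference label to its Delta distance from the disputed text.

    reference_texts: dict of label -> text (hypothetical labels).
    """
    vocab = most_frequent_words(reference_texts.values(), n_words)
    ref = {label: relative_freqs(text, vocab)
           for label, text in reference_texts.items()}
    disputed = relative_freqs(disputed_text, vocab)

    # Mean and standard deviation of each word across the reference texts.
    columns = list(zip(*ref.values()))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]

    # Words with no variation cannot be z-scored, so they are skipped.
    keep = [i for i, sd in enumerate(sds) if sd > 0]
    if not keep:
        raise ValueError("reference texts show no variation in word use")

    def z_scores(freqs):
        return [(freqs[i] - means[i]) / sds[i] for i in keep]

    z_disputed = z_scores(disputed)
    return {label: mean(abs(a - b)
                        for a, b in zip(z_scores(freqs), z_disputed))
            for label, freqs in ref.items()}
```

In use, you would pass something like `{"privileges": ..., "hagiography": ...}` as the reference texts and a chunk of the chronicle as the disputed text, then see which label scores lowest. With real texts you would want samples of several thousand words each, since short samples make the frequencies too noisy to mean much.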

Rombert then argued that Hendrick’s strong presence across the Croniken suggests that he was both author and compiler at once; noting the existence of other Teutonic Order chronicle texts in this period in the Lowlands, he says this points to a strong, self-aware hagiographical tradition in the bailiwicks, away from the administrative centre of the Order.

Note: I’ve probably got the author/scribe’s name spelled wrong, but I’m pretty sure Croniken was on the slides, with a C not a K.


15 Responses to “Leeds Report #6, or, I saw multi-variate statistics!”

  1. Annelise Says:

    That is so interesting! I think you’re exactly right that this is difficult because we don’t know how to discriminate within this field, rather than that it can’t be trusted at all. I’m fascinated by ideas of redaction, forgery and misattribution, because to make a decision on either side has huge implications for how a work is read as literature and what its meanings (relationships either within a text or with its context) are imagined to be.

    I guess the science has to come down to probability though, not clear fingerprint evidence. I also wonder how much the markers change when a writer has different themes, audiences, attitudes/passions, and (through the stages of their lives) linguistic influences or development.

    • highlyeccentric Says:

      My understanding is that the markers usually used in computational linguistics *don’t* change significantly from theme to theme (in this case, Hendrick’s legal documents showed up the same as his chronicle, for example), or over the course of your life (although surely there’d have to be a period of noticeable change during school/uni?).

      • Annelise Says:

        That in itself is very interesting. In my experience I agree, my writing has changed a lot over time, especially when I come to _think_ in different ways as well as having read lots of different kinds of texts and had so much input on my writing through education. But if there’s something consistent, formed pretty early in life by various influences, that says something about the consistency of our selves and our thinking/perception/expression, regardless of change and influence.

  2. Annelise Says:

    Oh- and you get that problem of having to say “experts agree that this is the only solution”, rather than being easily able to present the working for other people to look through from their own position of expertise. Still, it’s pretty cool that you can measure this as well as intuiting it.

  3. Katrin Says:

    That is so cool. Thanks for posting about it!

  4. [c] Says:

    “I also suspect that a lot of the distrust people have for evidence drawn from computational linguistics is to do with the unfamiliar kind of evidence”

    I believe you may have hit a nerve here: I’m vaguely reminded of a discussion regarding some 14th century wall paintings a few years ago. The discussion was whether or not some inscriptions in the paintings were part of the original scheme or additions made at a later date. The problem was solved when a chemical analysis proved that the inscriptions were actually inserted while the plaster on the wall was still wet and fresh, so they must have been part of the original layer of paint. While everyone accepted this evidence, there was one particular scholar who still kept contesting it, claiming that the results of chemical analysis weren’t comprehensible for scholars in the humanities and therefore these results MUST NOT be taken into account…

    • highlyeccentric Says:

      It would be interesting to see if Americans who’ve had to take science courses at undergrad level adapt better to this sort of evidence in humanities contexts… if it were so, score one for the American generalist system. If it were not, then we’re still at a point where society at large needs a better understanding of how to discriminate regarding experimental evidence.

  6. Jonathan Jarrett Says:

    Rombert then argued that Hendrick’s strong presence across the Croniken suggests that he was both author and compiler at once

    This exposes a smaller, but still significant problem with this kind of analysis, which is that we may not always understand the concept of authorship the same way as our authors did. We envisage someone scratching away with a quill by themselves, but some of our better-off authors were probably dictating, and some of them may have been handing out chunks of notes or tablets to scribes or amanuenses and saying, “write that up, will you? I have to go to court now” and so on. ‘Didn’t compose’ may not equal ‘could not claim’. The techniques are still valid of course, if Dun Rite, but we may sometimes need a bit more care with the interpretation (see also DNA, statistics…)
