Leeds Report #6b, or, more fun with Computational Linguistics!

After Rombert Stapel’s paper, we moved on to two further papers which were perhaps more ambitious in scope, and concerning which I have less certainty about the method and its application. All presenters talked about using control samples, and talked us through the process by which they determined that Delta could tell the difference between their target author and unrelated samples, but each paper raised some questions for me which might just be revealing my ignorance.
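For readers who have not met Delta before: in its usual form (Burrows's Delta) it takes the most frequent words in a corpus, converts each text's relative frequency for those words into a z-score, and reports the mean absolute difference between two texts' z-scores. Smaller numbers mean more similar habits of word use. What follows is only a rough sketch in Python under my own assumptions: the function name `burrows_delta`, the toy English "texts" and the use of the population standard deviation are all illustrative, and none of it is the presenters' software or data.

```python
from collections import Counter
from statistics import mean, pstdev


def burrows_delta(texts, n_words=3):
    """Pairwise Delta distances between labelled texts.

    texts   -- dict mapping a label to a list of word tokens
    n_words -- how many of the corpus's most frequent words to compare
    (Both names are hypothetical; this is a toy, not a real stylometry tool.)
    """
    # Relative frequency of every word within each text.
    rel = {}
    for name, tokens in texts.items():
        counts = Counter(tokens)
        rel[name] = {w: c / len(tokens) for w, c in counts.items()}

    # The most frequent words across the whole corpus.
    corpus = Counter(w for tokens in texts.values() for w in tokens)
    vocab = [w for w, _ in corpus.most_common(n_words)]

    # z-score each word's frequency across the texts.
    z = {name: {} for name in texts}
    for w in vocab:
        freqs = [rel[name].get(w, 0.0) for name in texts]
        mu, sigma = mean(freqs), pstdev(freqs)
        for name in texts:
            f = rel[name].get(w, 0.0)
            z[name][w] = 0.0 if sigma == 0 else (f - mu) / sigma

    # Delta for a pair = mean absolute difference of their z-scores.
    names = list(texts)
    return {
        (a, b): mean(abs(z[a][w] - z[b][w]) for w in vocab)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy control check: "a" and "b" are the same text, "c" is unrelated.
toy = {
    "a": "the cat and the dog".split(),
    "b": "the cat and the dog".split(),
    "c": "and and and the bird".split(),
}
distances = burrows_delta(toy)
```

The control-sample step the presenters described amounts to checking that distances to the target author's known work come out smaller than distances to unrelated samples, as `("a", "b")` does against `("a", "c")` in the toy data.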

Karina van Dalen-Oskam was attempting to use quantitative computational analysis to distinguish between different scribes of the same text. She talked about some of the difficulties of using computational linguistics for medieval studies: you need an electronic text – but when you fling your electronic text into Delta, are you identifying the medieval author, the medieval scribe or the modern editor as your unique language user? During the course of her own analysis she also had to control for variant spellings – some manuscripts which looked really whacky turned out to be quite conventional once you controlled for variant spellings in feminine pronouns.
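To make the point about variant spellings concrete, here is a small Python sketch of what "controlling for" them might look like: fold known variants into one form before counting anything. The variant table and the function names are invented for illustration; they are not the forms or the procedure from Karina's manuscripts.

```python
from collections import Counter

# Hypothetical variant table: the spellings are illustrative only and are
# not taken from the manuscripts discussed in the paper.
VARIANTS = {
    "sy": "si",
    "sij": "si",
    "hare": "haer",
    "hair": "haer",
}


def normalise(tokens, variants=VARIANTS):
    """Lower-case each token and replace listed variants with one form."""
    return [variants.get(t.lower(), t.lower()) for t in tokens]


def relative_freqs(tokens):
    """Relative frequency of each word in a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


# Two made-up "scribes" who differ only in how they spell the pronoun.
scribe_one = normalise("si sach haer".split())
scribe_two = normalise("Sy sach hare".split())
```

Without the normalising step, the two toy scribes share only one word out of three; with it, their word lists are identical, which is the "whacky turns out to be conventional" effect in miniature.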

I liked Karina’s idea here – that you could bypass that question if you used computational linguistics to distinguish between different scribes of the same text. In this case, she took 15 MSS of a Dutch text, a chronicle of biblical history. She made the transcriptions herself – necessarily short sections, the same sections from each text. She picked sections with interesting women in them partly because that seemed like fun to her, and partly because sections with interesting women in them occur regularly but infrequently across the whole of biblical history.

What she found was that different samples showed different levels of variance across the whole set of manuscripts – one episode from the New Testament involving the three Marias was wildly different across the board. What she also found was that while the Judith episodes overall were pretty consistent, one scribe had got seriously inventive and not only changed things but added whole sections, effectively becoming an author for that stretch of the text.

The problem which arises out of this is that… we don’t know what it means. Using her existing data, Karina plans to look at the Esther episodes; she said she thought the scribe might have been inspired by traditions on the Nine Worthies, so if she was going back for more data she’d start with sections which dealt with the nine male Worthies. But without full transcripts of entire manuscripts, it’s not really possible to say how inventive that scribe was or how unique the manuscript.

My other problem here is that while computational linguistics clearly can demonstrate that the scribe of MS I (in Karina’s numbering) was creative in his account of Judith; and the dot plots were nicely illustrative; and it’s exciting to know this fact – you didn’t actually need computational linguistics to do it. All you needed was someone to look at the Judith sections of all fifteen MSS, and it just so happens that a computational linguist got to it first. Given that the scribe had *added entire lines*, I’m sure Karina noticed this when she was transcribing.

Mike Kestemont was using computational linguistics to argue that one Johan the Clerk was the author of a group of twelve poems from Antwerp, usually attributed to the ‘Antwerp school’ of poems. Now, this is a long stoush – we have one known poem by Johan and one almost-certainly-by-Johan poem[1] – and we have about 12 anonymous poems from the same period, and 20th-century scholarship was greatly devoted to arguing about whether Johan wrote all of said poems or none of said poems.

Mike’s computational stats focused on rhyme words, on the reasonable logic that a poet might change his topic, change his format, but he’s unlikely to change his list of ‘words which rhyme with purple’. And he discovered that all the anonymous poems used substantially the same rhyme words as Johan’s identified works!
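The rhyme-word idea can be illustrated with something far cruder than Mike's multivariate statistics. The Python sketch below simply collects the last word of each line and measures how much two poems' rhyme vocabularies overlap (Jaccard similarity). This is a stand-in I have swapped in for illustration, not his method, and the function names and the toy English verses are made up.

```python
def rhyme_words(poem):
    """Set of line-final words in a poem, lower-cased, punctuation stripped."""
    words = set()
    for line in poem.splitlines():
        tokens = line.split()
        if tokens:
            word = tokens[-1].strip(".,;:!?'\"").lower()
            if word:
                words.add(word)
    return words


def jaccard(a, b):
    """Shared items divided by all items; 0.0 if both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy verses, invented for the example.
known = "The night was long,\nI sang a song.\n"
anonymous = "A bird in song\nthe road is long"
unrelated = "Over the hill\nbeside the mill"

same_stock = jaccard(rhyme_words(known), rhyme_words(anonymous))
different_stock = jaccard(rhyme_words(known), rhyme_words(unrelated))
```

A real study would need to weight common rhymes against rare ones and test against control poets, which is presumably where the multivariate statistics come in.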

Mike was good-humoured about this: he knew well that, exciting as his multivariate statistics were, he’s unlikely to put an end to the argument anytime soon; but he wanted to put his evidence down on the record for the ‘maximalist’ position.

My quibble with this – he was rigorous about his control sample and so on – is that I’d like to see some other statistical studies done on rhyme-words. While, sure, any two or three or twelve friends are probably not going to use the same stock rhyme words, what about teachers and students? In an oral poetic culture, wouldn’t one of the key things you teach your students be a stock of rhyme words for every occasion? But we don’t have much vernacular poetic evidence where we can identify teacher-student pairs or groups, at least not for the European Middle Ages. Medieval and early modern Arabic poetry might be able to help here – but I’m not even sure if Arabic poetry is rhymed; and the distinctly different oral cultures might cancel out the usefulness of such data for European medievalists.

I would also be interested – just because I’m interested – to see a stack of computational analysis done on known Latin authors, particularly authors trained in the same place or by the same people. I’d like to know if the statistical difference between the language use of two second-language users trained in the same place is different to the statistical difference between two native speakers, especially since Latin composition has always been such a stylistically specific art. I’d like to know if you could use computational linguistics programs to run grammatical analyses on a Latin text and identify the author’s native language. These are all things that would be interesting to me! But I don’t have either the Latin or the statistical proficiency to do any of these things myself. Latinists and statisticians of the world, hear ye.


1. This was a fun story. In his identified poem, Johan announces that his patron had rejected his previous work because it was too misogynist. Conveniently, we have a remarkably misogynist poem, the Lekenspiegel, dedicated to the same patron, by ‘John, your poor Clerk’. So ten points to Rogier van Leef for turning down misogynist poetry?

Also, fun fact – John the Clerk from Antwerp turns up in the wardrobe accounts of Edward III – he received payment from the English for spying on the French.


3 Responses to “Leeds Report #6b, or, more fun with Computational Linguistics!”

  1. Jonathan Jarrett Says:

    one scribe had got seriously inventive and not only changed things but added whole sections, effectively becoming an author for that stretch of the text

    Oh, and there you are. Don’t mind me then!

    But without full transcripts of entire manuscripts, it’s not really possible to say how inventive that scribe was or how unique the manuscript.

    It also strikes me that doing that kind of test over whole manuscripts would tend to smooth out local variation. So, if said MS was all one scribe and the Judith bit was the only one where he (or she) got excited, you might not notice it in the overall signal of conformity. Of course a good plot would let one look at the whole curve and you would notice that everything went wild around Judith, but I don’t know if this would be easy, not knowing the software myself.

    • highlyeccentric Says:

      Well, I think for a certain kind of investigation, if Judith were the ONLY bit where he, or possibly she, got excited, that would win you bonus points. Or Judith and Esther, even. Because that would tell you OTHERWISE CONFORMIST SCRIBE GETS EXCITED ABOUT EXCITING LADIES, instead of ‘scribe goes nuts with text’. But to really do that properly you’d need electronic transcripts of the _entire_ MS; and also at least several other _entire_ MSS. You could do this much quicker by simply looking through the MSS for obvious additions!
