I drafted this for Aus Glam Blog Club’s November topic, “Change”. And then I chickened out of posting it, because my thoughts were this looping ramble and I wasn’t sure if they were worth sharing with the wider world.
Today at work there was a discussion about expertise versus economies of scale; about hand-crafted bibliographic metadata and the hidden unpaid labour of crowdsourced text corrections for digital content; and about the opportunities and risks that come from open source and commercial machine learning databases that automate the indexing of content at a deeper level than has ever been available for full-text search before.
I feel like these are bubbles of feelings inside me, and they are still growing and building up pressure. One day they will pop and I’ll be able to look back and make sense of them all. But for now I’ll gather my courage and post what I wrote, and maybe people will disagree with everything I say and get mad and write their own thoughts down, and we’ll get more discussions going in the GLAM sector.
November’s Aus Glam Blog Club topic was Change. As a cataloguer there’s one topic that I’ve changed my mind on a lot. If you ask me in the morning I’ll have one opinion, and in the afternoon I’ll have another. I just can’t pin down what I think or feel about automated subject analysis.
I’m not sure if this is an interesting format for a blog, but here are the arguments I have with myself as I change my mind back and forth.
It’s a great idea, right? A machine learning algorithm could be fed a small batch of good records for academic journals and fiction and recipe books, and the ebooks or scanned pages of the content. The machine could learn that particular categories of subject headings in a thesaurus relate to specific types of text or content layout and begin to identify patterns of grammar, vocabulary, and unique keywords or frequencies of keywords that identify the differences between the publications. It would be error-prone at first, but with corrected data and more data being added over time, the algorithm could learn how to apply subject headings consistently (and update them).
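As a very rough sketch of that idea, here is a toy version in Python that learns word frequencies for each subject heading from a handful of example texts and then suggests a heading for new text. Everything in it is an assumption for illustration: the function names, the four invented "records", and the simple scoring are hypothetical, and a real system would need far more data and a much better model.

```python
from collections import Counter, defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(records):
    """Build a word-frequency profile for each subject heading.

    `records` is a list of (full_text, subject_heading) pairs, standing in
    for catalogue records matched to their ebook or scanned text.
    """
    profiles = defaultdict(Counter)
    for text, heading in records:
        profiles[heading].update(tokenize(text))
    return profiles

def suggest_heading(profiles, text):
    """Return the heading whose profile shares the most word weight with the text."""
    words = Counter(tokenize(text))
    scores = {}
    for heading, profile in profiles.items():
        total = sum(profile.values())
        # Add up the relative frequency of every word the text shares with this heading.
        scores[heading] = sum(n * profile[w] / total for w, n in words.items())
    return max(scores, key=scores.get)

# Hypothetical training data; real records would be far larger and messier.
records = [
    ("Whisk the eggs and fold in the flour, then bake until golden.", "Cooking"),
    ("Simmer the stock, season the sauce and bake the pastry.", "Cooking"),
    ("The court held that the contract was void and the appeal dismissed.", "Law"),
    ("The statute requires the court to consider the appeal.", "Law"),
]
profiles = train(records)
print(suggest_heading(profiles, "Bake the pastry with eggs and flour."))
print(suggest_heading(profiles, "The appeal went to the court."))
```

Even in this toy, common words like "the" count towards every heading, which hints at how easily shared vocabulary could pull unrelated material together.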
This algorithm could keep up with changes in terminology or new ideas in specialist fields, in ways that cataloguers cannot in their day-to-day work. It could help us avoid problems like LCSH and “illegal immigrants”, where different political groups use different vocabularies and the Library of Congress becomes more vulnerable to political bias.
But it’s a terrible idea! I’ve changed my mind. An algorithm would learn all of the awful patterns of racism, sexism, and oppression that we work so hard to keep out of our collections. We’d most likely start the machine learning off with our public domain full-text digitised content, which is more likely to contain outdated and offensive language and to have catalogue records with outdated terms. We couldn’t begin from a neutral place, since the thesauri we currently use to describe our books all have inherent cultural, political, and racial biases in them.
In other words, garbage in… garbage out.
That said, perhaps machine learning could free us from the need to curate and maintain thesauri. Even if it would still be based on content that could have inherent bias, an algorithm could use grammar, syntax, and other patterns to differentiate between pejorative uses of terms and neutral ones. It could be deliberately biased to trust (weight) terminology used by reliable publishers or types of publications… which couldn’t be worse than thousands of distributed human beings with their own political ideologies trying to work with a piecemeal patchwork of subject headings that reflect more about the time they were established than the topics they relate to.
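The "trust (weight)" part could look something like the sketch below, where each record's words count for more or less depending on who published it. The publisher types, the trust numbers, and the two example records are all invented for illustration, and deciding those weights is exactly the kind of judgement call that would need arguing about.

```python
from collections import Counter, defaultdict
import re

# Hypothetical trust weights per publisher type; the numbers are invented.
TRUST = {"academic journal": 1.0, "tabloid": 0.2}

def weighted_profiles(records):
    """Build per-heading term weights, scaling each record by its publisher's trust."""
    profiles = defaultdict(Counter)
    for text, heading, publisher in records:
        weight = TRUST.get(publisher, 0.5)  # unknown publishers get a middling weight
        for word in re.findall(r"[a-z]+", text.lower()):
            profiles[heading][word] += weight
    return profiles

# Hypothetical records: (text, subject heading, publisher type).
records = [
    ("Influenza vaccination reduces hospital admissions", "Health", "academic journal"),
    ("Miracle flu cure that doctors hate", "Health", "tabloid"),
]
profiles = weighted_profiles(records)

# Terms from the trusted source end up with more influence on the heading.
print(profiles["Health"]["vaccination"], profiles["Health"]["miracle"])
```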
Although… we’re now in the territory of proxies. We’d be saying that this type of publisher is a proxy for an ideal librarian who can see through her own biases to the keywords that will help the reader find the information that they need. This could work really well for tried and true disciplines like law, cooking, and even tabletop role-playing games. But this won’t work so well for interdisciplinary studies that may display a hybrid of patterns from two disciplines. And if it can’t work well for them, would we see a negative effect of cross-pollution, where environmental science and newspapers get grouped together a lot because they share a lot of text about the weather and our climate? When we allow an automatic association between two patterns to stand in for the real and unattainable “aboutness” of a text, we become very vulnerable to those associations frustrating a lot of users and hiding information in weird places in our databases.
Plus, not all quotations are indented or in quotation marks, so that could get very weird.
Another thought I often have is that for open-access shelf libraries like public, school, and university libraries, it can be really difficult to shelve useful related texts together. It often comes down to the call of the cataloguer who processes the book, and in Dewey Decimal there are a few different ways you can order and build the numbers. Maybe we aren’t ready for machine learning to assign subject headings, but we could use machine learning based on topics and subject headings to shelve books so that more people can find them and libraries can optimise shelf space.
The worst thought of all, the one that I can’t get past, is that the real power of machine learning for subject description or shelf classification is changeability. It has the most value if it can adapt and respond as the content of a library changes. But that could mean regular “global” changes to subject headings for nearly all the records in the database, and an equivalent amount of exporting and importing changed data from our shared national databases. I’m not sure that our hugely distributed and global shared catalogue database could cope with that. Remember, you can run, but you can’t hide, from WorldCat… I mean, the Blob!
It would take a lot of hardware just to process all of that data. It would take a lot of electricity.
We could use it as a more static one-off kind of deal, replacing the time that a cataloguer spends determining the right subject heading and entering it in the catalogue record. But that almost feels like we’re giving up on the quality of our catalogue records, and on how transformational it could be to improve access for everything. The new records we create are a barely visible speck of iceberg bobbing above the water, while beneath it sits the huge mass of records that have come before us.
Even if we only brought the records to life in one library’s local system in this way, I think it would be too expensive in terms of resources.
Then again, these are all just thoughts and feelings. I don’t know anything for sure. It would be worth trying, thinking, and planning, to see where machine learning could really take library catalogue databases on an individual library level, and globally. That means having bad ideas, getting angry and arguing about them, and trying some of the things that will give us the information we need to argue about how to target automated library work to make things better, and not simply more or faster.