If you’re on Twitter, you should follow the account OnThisDayShe, which highlights an (often overlooked) woman who helped shape our world. Many are the women scientists who did all the work, only to have their male colleagues end up with a Nobel prize after appropriating their research…
Today’s post was about Karen Sparck Jones, one of the greats of the field of Information Retrieval.
Information Retrieval is separate from database retrieval: it’s about finding texts that match certain criteria. At the start of the field (somewhere in the late 1960s or early 1970s), this was mostly about bibliographical information, but as more and more text became digitally accessible (with the advent of the World Wide Web really opening the floodgates), it became more and more of a general problem. (One of the search engines I worked with was literally designed to search in museum catalogues!)
At the start of my career, in 1997, I worked on search engines. Nobody had heard of Google, and AltaVista ruled the roost for web search. Search engines were massive lumbering beasts that were little more than a glorified grep on web pages; in fact, the premier Dutch search engine started out as a script that did a ‘grep’! And yet there was already quite a body of academic literature on Information Retrieval. I read quite a few of those articles to assess algorithms and design solutions, and many times these articles referenced the work of Karen Sparck Jones.
She invented the ‘inverse document frequency’: a measure that is high when only a few documents contain a search term and low when many documents do. To this day, ‘tf-idf’ is a trusted technique to gauge the importance of a term in a document: if it occurs many times in a document (a high ‘term frequency’, tf) but in only a small number of documents (a high ‘inverse document frequency’, idf), then it is safe to say that the term is important for that particular document. You can use that for term weighting, which is what you use to determine which documents to return at the top of your result set. It is safe to say that just about every search engine builds on her research!
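To make the weighting idea a bit more concrete, here is a minimal sketch in Python. It is not how any particular search engine does it: the function name `tf_idf`, the whitespace tokenizer, the three toy documents and the plain `log(N / df)` variant of the inverse document frequency are all my own assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (illustrative sketch only)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        # Term frequency: how often does each term occur in this document?
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical mini-collection, made up for this example.
docs = [
    "the museum catalogue lists the paintings",
    "the search engine indexes the web",
    "the museum search desk",
]
weights = tf_idf(docs)
```

In this toy collection, "the" occurs in every document and so gets weight zero everywhere, while "catalogue" occurs in only one document and outweighs "museum", which occurs in two. Real engines use smoothed and normalized variants, but the intuition is the same.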
I had the opportunity to attend the ACM SIGIR conferences in 1998 (in Melbourne) and 1999 (in Berkeley). It was a bit intimidating to attend a scientific event as a newbie: by then I had already given up on becoming a scientist because I had recognized that I was better suited as an engineer, and these were scientists. These people had been in the field for decades, and what did I know? The ‘social’ events were especially hard on me, because I didn’t know anybody and I’m not the type of person to just walk up to somebody random and start a conversation. One of the advantages I had in 1998 was the presence of Prof. Kees Koster, my erstwhile boss with whom I had written an article that got accepted as part of the proceedings, so I could amble up and introduce myself to the people he was speaking to.
But that, too, was intimidating because these people knew each other from many previous conferences, and I had little to contribute to the discussion. But I do remember them speaking of Karen Sparck Jones in hushed tones, half-joking and half reverential. She was described as the ‘Grand Dame of Information Retrieval’ — and with reason.
Looking back at how search engines have progressed since then, the gains are enormous. With all kinds of information readily available, we do not need to spend hours searching through libraries and encyclopedias (if those were available at all!). Instead of spending all that time to get the information, we have more time to act on it, giving us tremendous gains in productivity.
So today, I think about Karen Sparck Jones, the Grand Dame of Information Retrieval. Without her research, our world would look very different indeed.