At the confluence of machine learning and library science

I am taking a class titled “Information Retrieval” this semester. It covers general topics about how to organize information so that it can be easily searched, retrieved, and used later. Much of the content overlaps with previous coursework I’ve had on databases and machine learning, but with a different emphasis.

I’m really enjoying the assignments in the class so far. In the first one, we were given a collection of scanned postcards to analyze. We looked at them in batches of three (randomly selected), and then were asked to come up with an “attribute” that was true for two of the postcards but false for the third. After doing this 20 times, we each had a list of 20 descriptive postcard attributes; this process was referred to as “attribute elicitation.”
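For the programmers reading along, the triad procedure is easy to sketch in Python. This is only an illustration under my own assumptions: the postcard identifiers and the function names `sample_triads` and `splits_triad` are made up, and the attribute itself still has to come from a human looking at the cards.

```python
import random

def sample_triads(items, n_triads, seed=0):
    """Draw n_triads random batches of three distinct items each."""
    rng = random.Random(seed)
    return [rng.sample(items, 3) for _ in range(n_triads)]

def splits_triad(values):
    """Return True if an attribute holds for exactly two of the three items."""
    return sum(bool(v) for v in values) == 2

# Hypothetical identifiers standing in for the scanned postcards.
postcards = [f"postcard_{i:02d}" for i in range(30)]
triads = sample_triads(postcards, n_triads=20)

# A human proposes an attribute for a triad and records its truth value
# for each of the three cards; the attribute is kept only if it splits 2-vs-1.
print(splits_triad([True, True, False]))  # True: valid attribute for this triad
print(splits_triad([True, True, True]))   # False: does not separate the cards
```

The code only handles the bookkeeping (drawing batches and checking the two-versus-one rule); the creative step of naming the attribute is left to the person.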

I was delighted! The representation question is at the heart of machine learning, too, but we are rarely (if ever) given the chance to MAKE UP THE ATTRIBUTES ourselves. (Unless, of course, it’s a data set that we’re creating, which is also rare.) I felt such freedom. At the same time, I realized that the objective wasn’t quite the same. In machine learning, you want a representation that maximizes your later ability to classify, cluster, or otherwise analyze the data. In library science, you want a representation that maximizes your later ability to find particular items that satisfy a query. This perhaps boils down to discriminability vs. findability.

This goes deeper than it may seem at first. Often the machine learning (ML) task at hand is one of classification, in which case the universe of classes of interest is known in advance. Each item can be assigned to one of those classes. The representation can be (and sometimes is) optimized to maximize performance in classifying the known classes of interest. One of the latest trends in ML is to use “deep learning” to manufacture the representation automatically.

For information retrieval, in the sense used by library science folks, the classes of interest are not known, nor is the goal to craft an automated classifier for future data. Instead, the system (and representation) should support a potentially unlimited variety of future user (human) queries about any of the items in the collection. Success is measured not by classification or clustering accuracy, but by how many queries successfully locate the desired item or items, and how easily this is achieved (from the user perspective).
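To make "how many queries successfully locate the desired item" a bit more concrete, here is a minimal sketch of querying an attribute index and scoring the result with precision and recall, two standard retrieval measures. The class may well use other measures. The card names, the `shows_building` attribute, and the `search` and `precision_recall` helpers are all hypothetical.

```python
def search(index, query):
    """Return the set of items whose attributes match every query term."""
    return {
        item
        for item, attrs in index.items()
        if all(attrs.get(name) == value for name, value in query.items())
    }

def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that were wanted.
    Recall: share of wanted items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical index: each item maps to its manually assigned attribute values.
index = {
    "card_a": {"color_image": True, "shows_building": True},
    "card_b": {"color_image": True, "shows_building": False},
    "card_c": {"color_image": False, "shows_building": True},
}

# The user wants card_a but only thinks to ask for color images.
retrieved = search(index, {"color_image": True})
precision, recall = precision_recall(retrieved, relevant={"card_a"})
print(sorted(retrieved), precision, recall)  # ['card_a', 'card_b'] 0.5 1.0
```

Nothing here depends on a fixed set of classes: any combination of attribute values can serve as a query, which is the point of the contrast with classification.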

Has anyone tried to apply deep learning to library collections? Would it be useful here?

There is a terminology shift between the fields, too. In library science, the process of deciding on a representation is called “attribute elicitation.” This is not to be confused with “feature extraction” in ML, which means the (automated) calculation of feature *values*. Library science calls that second process (assigning attribute values to items) “indexing.” (After creating 20 attributes, we then indexed 10 postcards by filling in their values for each of the attributes.) In ML we generally don’t get to assign the values ourselves, either, and especially not in a manual fashion. It was fun!

Going through the attribute elicitation and item indexing process raised other questions for me. It quickly becomes obvious that some attributes are easier or faster to “compute” than others, even for humans doing the task. “Color image” vs. “not a color image” is an easy decision, but “picture of a French location” can be far more difficult, relying as it does on deeper domain knowledge and deeper analysis of the image.

Should we prefer attributes that are easier to compute, all other things being equal? If you assume human indexers, then it seems you’d also prefer attributes that different people are most likely to compute consistently. We talked briefly about the “indexing rules” (crafted by humans) that go along with any such representation, to help with consistency. However, there was no discussion about informativeness, discriminability, or other properties that would guide you in selecting the best attributes to use. Perhaps we’ll get to that later.
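Coming from ML, I can think of two crude ways to put numbers on these properties. Neither is something we covered in class. One is the entropy of an attribute’s values across the collection, as a rough proxy for informativeness. The other is the fraction of items on which two indexers agree, as a rough proxy for consistency. The function names and the toy values below are purely illustrative.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (in bits) of an attribute's values over a collection.
    0.0 means every item has the same value, so the attribute cannot
    separate anything; 1.0 is the maximum for a yes/no attribute."""
    if not values:
        raise ValueError("need at least one value")
    n = len(values)
    counts = Counter(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def agreement(indexer_a, indexer_b):
    """Fraction of items to which two indexers assigned the same value."""
    if len(indexer_a) != len(indexer_b) or not indexer_a:
        raise ValueError("need two non-empty, equal-length lists")
    matches = sum(a == b for a, b in zip(indexer_a, indexer_b))
    return matches / len(indexer_a)

# Toy values for illustration only, not real class data.
evenly_split = [True] * 5 + [False] * 5
always_true = [True] * 10
print(entropy_bits(evenly_split))  # 1.0
print(entropy_bits(always_true))   # 0.0

indexer_a = [True, True, False, False]
indexer_b = [True, False, False, False]
print(agreement(indexer_a, indexer_b))  # 0.75
```

Raw agreement ignores matches that would happen by chance, so a chance-corrected statistic such as Cohen’s kappa would be the more careful choice for a real comparison.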

Our next task is a group exercise in creating a database catalog of any objects we like, other than books. My group has chosen candles, and we’re now discussing what the most useful attributes might be; what might one like to search on, when in need of a candle? Or candles?

We’re only required to input five (five!) items into the final database. If we manage to get a few more in there, I’m tempted to run a clustering or principal component analysis (PCA) and examine the distribution of candles that we end up with. :)