Thursday, April 2, 2015

Worth a thousand words: captioning and editorial subjectivity

(CW: literary prescriptivism, for some definition of ‘literary’)

Adding dry captions(*) to incidental images on Facebook/Tumblr/etc. posts (or: alt text to images on web pages) is as close to a definite ceteris paribus improvement as I can think of. Readers using screenreading software or its accessibility-oriented ilk to browse the web are able to read [sic] those captions, aiding their consumption of the text. For “typical” readers (i.e. those without visual impairments, browsing the web without additional software assistance) the additional text hardly poses a nuisance: in the case of alt text, they don’t even engage with it unless they specifically go looking for it, whilst suitably demarcated image captions are easy to skim past.

A good caption is also generally a good inferential bridge, conveying most of the context the image provides. Sure, a text description of an image might not produce the exact same mental-emotional experience (“affect”, I believe the kids are calling it) as the image itself, but if “same mental-emotional experience” were an end goal we’d be done for anyway given it’s a subjective experience.

A screenshot of a Facebook link to an article titled ‘10 Trans Women in Love’, with the caption: Image of a Korean man and woman (Andy Marra and Drew Shives), who are a couple. They are both facing the camera. The woman is resting her chin on the man's shoulder. They're both smiling, happily.

Take, for instance, the screenshot above. The caption swiftly communicates the most important emotional aspects of the image, both the implicit (it’s a couple; they’re happy) and the explicit (they are facing the camera; her chin rests on his shoulder). This provides great insight into the ‘value add’ that the image provides — it’s plenty to go on whether you’re making sense of someone else’s comment on the image (“they are the cutest thing” / “I love their expressions”), or whether you’re just interested in how it complements the piece.

A screenshot of an image in a blog post, captioned: “Photo shows a chain-link fence against a blue sky. One of the sections of the fence has been removed, a hammock strung across the posts, and a person lays relaxing in the hammock.”

I’m particularly fond of the above example from a Model View Culture article. It sits at a little over thirty words, and directs our attention to the key ‘action’ in the image (the placement of the hammock; the person lying in it). You could put that description into an art brief and the resulting work would convey a similar mood.

I interpret captioning as far more an art than a mechanistic process. It exists to aid/augment/alter the subjective experience of a reader, and so it is a task informed by subjective understandings. And like any art intended to convey meaning to an audience, it had best be done with that communication in mind.

One important problem with describing the salient features of an image is that it necessarily invites editorial interpretation about what those salient features are. Do you talk about the lighting? Do you describe clothing on humans pictured? Do you describe the relative positions of objects in the scene? Does it matter what kind of hair that person has?

A screenshot of a Facebook link to an article titled ‘Level up your Allyship’, with the caption: “Image of a black background with "BE AN ALLY" written in white text in capital letter. "How to" is written above it in red, with an arrow pointing, placing it in front of "be an ally". There is a red X, crossing out the "n" in "an". with "(better)" being inserted between "a" and "ally".”

I might have described the above image as “the phrase ‘be an ally’ is modified to instead read ‘how to be a better ally’”; the description that was actually used is more verbose and contains more information. Neither of these is a priori better! They reflect different approaches to the question of “what is this image about?”, and insofar as there is no one true way of understanding the image, neither is there one true way of explaining it.

(I must stress, however, that adding more information to something does not always ceteris paribus improve understanding! Three words: bad PowerPoint presentations. Indiscriminately providing more data overloads the limited working memory of an audience. And in particular, the above example buries the lede, rendering it impossible to ‘understand’ the caption without trying to interpret the whole thing at once. As an aid to understanding, it causes unnecessary work for the person relying on it.)

Economical use of attention, legibility, clear hierarchical communication of ideas — these are not always the most important goals of a piece of writing, but for captioning they’re more important than usual.

A screenshot of a Facebook link to an article titled ‘Lighten Up’, with the caption: A close up painting of a forehead, facing slightly to our right. The person has black hair, black eyebrows, and is wearing green rimmed glasses. Their forehead under the light is a medium shade of brown with the hex color code "78616a". The shaded part of their right (our left) forehead is a darker brown with the hex color code "625563". The darkest brown is just below the top rim of their glasses, between their right eye and brow, and has the hex color code "4e444f".

Closely related to the problem of captioners interpreting what features are important is captioners interpreting what those features are. Think emotion and affect. Is the mood dark or hopeful? Do those colours contrast or complement? Do you describe the woman’s face as pensive, sad, bored? Often by this point we enter firmly into the territory of subjective interpretation — one viewer’s mysterious might be another’s bored. This subjectivity is unavoidable if you want to communicate efficiently.

(What do I mean here by efficiently? Well, one way to avoid subjective interpretation of facial expressions would be to mechanically list the exact placement of each of the creases on a subject’s face; which of zir tendons are expanded or contracted; the exact on-screen angle at which zir hair shadows zir eyes. Take this as an extensional definition of inefficient captioning. The viewer is made to do far too much work to understand. You may as well give them the RGB values of every pixel of the image.)

Of course, ‘subjective’ means something very different when you’re the writer or subeditor who curated the image in the first place. In that case, if you picked that particular photo of orphans because it had “warm morning light”, there’s less sense of ‘mismatch’ from adding your own description of the photo, even if you’re still biasing the audience’s interpretations.

However, this is rarely the case on social media (e.g. Facebook or Tumblr reshares), where image thumbnails are often assigned to the linked essays/posts by their respective writers, but it is the resharer who finds themselves attempting to caption the image, without access to the intent of the original curator. They’re forced to make the judgement call: how is the image supposed to complement the piece? What does it represent?

I’ll close with an open question: do styles (“schools”?) of captioning vary in predictable ways between different electronic media? Are there organic “dialects” which on average differentiate between captions from different networks, the same way that top-down style guides create a difference between captions on The Guardian versus Wikipedia?

(*) There are certainly other ways to use captions, alt text, etc., as their own separate part of a text rather than as explanation: think webcomic alt text, captions, etc. These tend to deliberately reframe the interpretation of an image for those who can already form an initial impression of it, providing humour or irony or insight. There’s an interesting question as to whether it’s possible in principle to recapture the peculiar affect of an image-text juxtaposition with text alone, but that’s a whole other discussion. In this post I’m solely concerned with functional, descriptive captions.