People as blindspots
--
Making sense of images posted on social media is something we, as human media consumers, may sometimes take for granted. But if there’s one thing the recent hype around deep learning has shown, it is just how challenging a skill this is for machines to pick up. Nonetheless, considerable effort is being put into developing such technology, some of it resulting in whimsical psychedelic visuals, some in impressive list-making software that can recognize, with little to no context, the presence of a baby or an animal in a photo.
Practically speaking, however, a lot of information still falls through the cracks: nuances and subtleties that for now are only graspable by humans, rendering many of these AIs comically stoic in the face of social media content. Several questions come to mind. For one, what is actually being missed? Does it matter if these systems are unable to grasp such nuances? In which situations would this omitted information actually be useful? There are also questions of interests and ethics at play: who will benefit from this research? To what end is it being funded? What will be the commercial application once this technology is implemented as features on platforms like Facebook? It’s easy to come up with answers for these but difficult to know if any will really stick! Will these systems become instantaneous purchase detectives, so the likes of Facebook have new datasets to sell to advertisers, or will they become automated storytellers for the visually impaired? According to a 2015 Wired article, it would be the latter, for now at least.
The piece is about Facebook’s Accessibility Team’s newest efforts to automate the recognition of an image’s contents in order to communicate them to Matt King, a blind man. The current AI prototype attempts to put into words what it is seeing:
“‘The scene is outdoors. It includes grass and trees and clouds. It’s near some water.’ King can’t completely imagine the photo — a shot of a friend with a bicycle during a ride through European countryside — but he has a decent idea of what it looks like.
“‘My dream is that it would also tell me that it includes Christoph with his bike,’ King says.”
The description provided by the AI is a bit crude but promising. It’s one of several techniques being experimented with at Facebook’s R&D facility to automate image recognition.
Visual Q&A, another system being tested, generates a series of questions the AI can pose to the user in order to have them confirm what it has identified in a given photo. Other programs cross-reference the text of photo captions with what the software believes it is seeing in the image to create more in-depth descriptions. These experiments have yielded surprisingly accurate results, recognizing such things as who the subjects are, their age brackets and key elements of the setting.
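To picture how that cross-referencing might work, here is a deliberately naive sketch. It is not Facebook’s actual system, just a cartoon of the idea: labels a vision model claims to see are kept only when the caption independently mentions them.

```python
# A cartoon of caption cross-referencing (not Facebook's real pipeline):
# promote a detected label into the description only when the poster's
# own caption confirms it. Labels and caption are invented examples.
def enrich(detected_labels, caption):
    """Keep detected labels that the caption independently confirms."""
    caption_words = set(caption.lower().split())
    confirmed = [label for label in detected_labels if label in caption_words]
    if confirmed:
        return "This photo appears to show: " + ", ".join(confirmed) + "."
    return "Caption and detected contents do not overlap."

print(enrich(["bike", "grass", "clouds"],
             "Christoph with his bike in the European countryside"))
# -> This photo appears to show: bike.
```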
Going back to Matt King, anything is an improvement on his current situation, but as impressive as these advances may be, none of these approaches fully equates to being able to see the image.
The frugal output of Facebook’s current AI is more reminiscent of a picture-word book than a story. A list of events is, at best, the skeleton of a story; what makes a strong narrative lies in its tacit cues, something incredibly difficult to program a machine to decipher, let alone recognize. According to Matt, what is missing is his friend’s name and what he is up to: in other words, the identification of the subject (which Facebook already does rather well) and an inference as to what the subject is doing. These elements, a subject and an action, are the basic tenets of narrative building, which is what the AI is struggling to produce. Funnily enough, this is something a person may glean by simply viewing the photo, a mental dexterity machines have yet to be programmed for. Here lies a recurring theme in automation: the dichotomy between the scalability of automation and the lack of adaptability of any given program. To complicate things, photos often have multiple, intersecting narratives, from the simple (what are the subjects doing in the photo?) to the subtler (how do the subjects feel about the contents of the image? what is the photo-taker’s interest in the image they are taking?). The Wired article goes on to quote Jeff Wieland, a member of the Facebook Accessibility team, raising this very concern:
“‘We’re returning a list. We’re not returning a story,’ Wieland says. ‘But that’s really where we want to go.’”
So how do we get a story?
It might make sense to create an AI focused specifically on the recognition of human stories, unspoken feelings and… drama! Software capable of scrutinizing images of humans for micro-expressions that would otherwise be registered as identical or simply ignored. Take for instance the standardization of passport photo-taking, as explained by Frances Stonor Saunders:
“Smiling was banned in 2004 along with frowning or raising eyebrows because this software treats the face as a blank somatic surface, scraped clean — exfoliated — of all affective expression, in order to be differentiated from other faces. It’s a search for fixed markers, not a full cartographic survey.”
To use Saunders’ words, Facebook’s approach to image recognition AI is to conduct a “full cartographic survey” of each photo rather than comparing “fixed markers” from one photo to another to see how they differ and why. A great way to adapt this approach for automated narrative building may just be to flip the passport photo paradigm on its head and concentrate on identifying expressions, not people.
Selfies have a very calculable nature, so why not use those? They have predictable compositions and contents, meaning the AI could be designed to study stances and facial expressions as fixed markers. Better yet, by studying only images tagged #coupleselfies as a screening mechanism, the AI could be challenged to start by developing a narrative about how the two subjects feel about each other based on their postures, their stiffness or comfort.
Any recurring element or form in a #coupleselfie could become a reference point to help the machine dig deeper into the potential narratives existing between the two subjects. Although having two humans in the photo adds a level of complexity to the task, it also guarantees the presence of a dialogue between the two subjects of the image.
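To make the speculation concrete, here is a minimal sketch of such a pipeline. Everything in it is an assumption: the hashtag screening, the smile and lean-in scores (which a real system would need a vision model to produce) and the thresholds are all invented for illustration.

```python
# A hypothetical #coupleselfies pipeline: screen posts by hashtag, then
# compare each subject's expression and posture as "fixed markers" to
# guess at a narrative. Scores stand in for a real vision model's output.
from dataclasses import dataclass

@dataclass
class Subject:
    smile: float    # 0.0 (neutral) to 1.0 (broad smile)
    lean_in: float  # 0.0 (leaning away) to 1.0 (leaning toward partner)

def screen_by_hashtag(posts, tag="#coupleselfies"):
    """Keep only posts whose caption carries the screening hashtag."""
    return [p for p in posts if tag in p["caption"]]

def couple_narrative(a, b):
    """Compare the two subjects' markers and guess at the relationship."""
    warmth = (a.smile + b.smile + a.lean_in + b.lean_in) / 4
    if warmth > 0.7:
        return "The couple looks at ease and affectionate."
    if abs(a.smile - b.smile) > 0.5:
        return "One subject seems far more enthusiastic than the other."
    return "The couple looks reserved; their comfort level is unclear."

posts = [{"caption": "anniversary trip #coupleselfies",
          "subjects": (Subject(0.9, 0.8), Subject(0.2, 0.3))}]
for post in screen_by_hashtag(posts):
    print(couple_narrative(*post["subjects"]))
# -> One subject seems far more enthusiastic than the other.
```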
Another set of fixed markers that could be fed into this speculative program is emoji sequences. Emoji are increasingly used as telegraphic rebuses accompanying photos on social media. According to Tyler Schnoebelen’s research on the matter, emoji sequences often start with a “stance,” an emoji that indicates the tone or mood of the upcoming message. It is described in Time Magazine as follows:
“In sets of two or three emoji, the stance comes before actions or other signals. The face comes first. Consider stance the attitude or emotion you have about something, represented by a happy, sad or flirty yellow face.”
Couldn’t those “stance” emoji be used as the “fixed markers” of social media content, by prompting the software to analyze only photos captioned with an identical series of signs? If so, an AI studying a photo prefaced by a smiling face and a flexing bicep would be able first to look for evidence of a positive “stance” and gym “signals” in the photo, and then to infer a “narrative” by asserting “Your friend is happy about working out.”
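As a toy illustration of that screening logic, the sketch below treats the first emoji of a caption as the stance and the rest as signals. The emoji-to-meaning tables are invented for the example; a real system would still have to verify the inferred signals against what it actually detects in the photo.

```python
# Stance-first emoji parsing, as described above: the leading emoji sets
# the mood, the following ones act as "signals." Both lookup tables are
# illustrative assumptions, not any platform's real taxonomy.
STANCES = {"😊": "happy", "😢": "sad", "😉": "flirty"}
SIGNALS = {"💪": "working out", "🍕": "eating pizza", "🚴": "cycling"}

def narrate(caption_emoji, friend):
    """Turn a stance-first emoji sequence into a one-line narrative."""
    chars = list(caption_emoji)
    if not chars or chars[0] not in STANCES:
        return None  # no leading stance: skip the photo, per the screening idea
    stance = STANCES[chars[0]]
    actions = [SIGNALS[c] for c in chars[1:] if c in SIGNALS]
    if not actions:
        return f"{friend} is feeling {stance}."
    return f"{friend} is {stance} about {' and '.join(actions)}."

print(narrate("😊💪", "Your friend"))
# -> Your friend is happy about working out.
```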
Following this train of thought, after numerous tries and fine-tuning, it’s easy to imagine the AI even recognizing micro-expressions in the photos to give much more telling stories: “Nadine is stoked about her gains but Mark is frustrated… He didn’t work out hard enough.”
That being said, if we go back to Wieland’s point about creating stories, not lists, what types of narratives do we as human viewers currently draw from friends’ social media feeds? Are our interpretations of the photos and videos found there even that deep? Off the top of my head, most of my interpretations remain very shallow and event-based: sunset photos, morning coffees, bike treks, which is also what Matt King seemed to want. So if we could choose how clairvoyant an AI should be, should it be better than humans at unpacking what it sees? Would that be possible, or even necessary? Here’s another recurring theme of automation: where to set the standard of performance. Should a machine always overachieve? In an industrial context this might make sense, but what about a social one? What if your darkest secrets could be automated out of your photos? This reminds me of another excerpt from Saunders’ essay, judging by which she might disagree:
“I suppose the emoticon culture gets what it deserves: an emotional economy, generated by computer modelling and regulated by algorithms, that translates the subjective self into objective data.”
Here Saunders describes what she sees as the “nonsense” of trying to translate complex emotions into emoji. Her point is that no prescribed symbol may be enough to accurately embody the deepest fears, joys and desires hidden in our subconscious, making the study of people through these symbols futile. I find her point debatable, however. For one, it’s unclear whether the “emotional economy” is forced upon us by the advent of the “emoticon culture” or by a much wider cultural shift toward self-editing, brought forth by the condensed nature of social media: 20-second videos, 140 characters, or keeping one’s thumb on the screen to view a video. A lot of platforms directly or indirectly encourage some form of self-volunteered economy of emotion or expression from their users. Moreover, sequences of emoji often result in uniquely personal narratives, which would compel me to claim the opposite of Saunders’ point: that emoji culture itself is enabling social media users to introduce more personal messages and “stances” to contextualize the photos they post.
In fact, it seems as if social media content is evolving into a new take on Egyptian papyrus scrolls, where figures are portrayed according to standard, symbolic pictorial conventions complemented by hordes of hieroglyphs that help develop the given narrative.
Tangent aside, the two Saunders excerpts quoted in this piece come from an essay called “Where on Earth are you?” written for the London Review of Books. In it, she tackles a theme of which automated image recognition is a sub-compartment: the tracking and identification of humans as we migrate across frontiers, documented or not. Her piece lays bare the limitations of our society’s current customs system, its reliance on technology and our subsequent reluctance to welcome those who cannot be logged.
“The faceless unnamed. Not the anonymous clump of one million migrants, but us, verified down to our eyeballs, yet unseeing and unseeable behind the high wall we have built to protect ourselves from the disordered, unauthorised, unregistered others beyond. […] Migrants often make the journey without identity documents […] the attempt to obtain them in their country of origin can be very dangerous. Others lose them at the outset when they’re robbed […]. Many destroy them deliberately because they fear, not without reason, that our system of verification will be a mechanism for sending them back.”
It may seem like a stretch to present side by side an AI’s inability to draw narratives from a photo and the limitations of similar software in interfacing with undocumented migrants. Nonetheless, both are examples of a bias we as a society are starting to develop for what is documentable, ratable and trackable, without fully considering the implied costs. In both situations a specialized technology is presented as a convenient, necessary solution for one set of users (the visually impaired, documented travelers) at the expense of others. In Matt King’s case, running his image-translating app might report its findings back to Facebook, creating a portfolio of his friends’ activities. In Saunders’ example, security checkpoints are made convenient for some while rendering passage far more arduous for those who do not have papers, to the point of refusing them balanced alternatives regardless of their recent history. In both cases tension arises from the desire to automate vision-based tasks (because they seem deceptively simple?) and the struggle to fully consider the human implications of their application.
This is a problem the likes of IDEO, with Human-Centered Design, and Facebook are aware of; in another Wired article, Wieland mentions, “We wanted to build empathy into our engineering.” As commendable a statement as it is, we must remember that engineering is only one corner of a greater system, a social fabric, and empathy is not something the gears of engineering can be oiled with once in a while. Empathy and an awareness of emerging micro-cultures need to be made integral parts of the machinery and its development. All too often, “empathy” is shorthand for “convenient,” turning a blind eye to potential abuse wherein the availability of a tool comes at the detriment of others.
This is murky territory, and challenging startups, designers and engineers to address it upfront (rather than leaving it to governments and human rights activists once the system has been implemented) would better ensure that companies, whether providing image recognition or border security services, do so not to the detriment of those standing in these technologies’ blindspots.