
Screengrab via Microsoft CaptionBot

Microsoft’s CaptionBot will attempt to identify what’s happening in your photos, with varying success

A picture is worth a couple words and emoji.


AJ Dellinger


Not sure what’s in an image? Microsoft’s latest bot will do its best to help you decipher your photos, though you should take its description with a grain of salt.

Microsoft has been all-in on artificial intelligence, and the computing giant’s latest offering to show off its always-learning algorithms is CaptionBot. The premise, as always, is simple: feed CaptionBot a picture and it will analyze what it sees and do its best to describe it.

CaptionBot is primarily a showcase of Microsoft Cognitive Services, the company’s toolkit to build apps powered by machine-learning AI. The image-describing CaptionBot utilizes two particular sets of skills from the Cognitive Services suite, Computer Vision and Natural Language.

Computer vision is fairly self-explanatory, and has been a staple of the many image-based projects Microsoft has rolled out, including tools that guess a person’s age, identify twins, detect emotions in facial expressions, and identify dog breeds.

The Natural Language portion of CaptionBot allows the bot to speak like humans do, a feature that Microsoft showed off, with questionable success, with its Twitter AI chatbot Tay. Microsoft wants to get everyone used to talking to bots, so while CaptionBot hopefully won’t become a racist as it’s fed images, it will try to describe what it sees with emojis.

Microsoft probably doesn’t need to worry about CaptionBot going off the rails when it comes to its personality, but the company might worry that the bot just isn’t very good at its primary purpose.

While it’s pretty solid at identifying very clear, crisp images (stock photography, for example, is ideal) and can get the general idea of what it sees, it’s often way off on the details.

For example, when it comes to anything athletic, CaptionBot is able to identify that there is a sport being played, but it has no idea which one.

While there are surely some football fans out there who think quarterbacks get preferential treatment, the NFL has not allowed them to defend themselves with bats yet.

At least there’s something to actually be mistaken for a baseball bat in this image (unless you count Aaron Rodgers’ arms in the previous one), but it’s hard enough to hit a palm-sized ball hurtling at 98 miles per hour with a solid wood bat, let alone a thin piece of titanium. Rory McIlroy probably wouldn’t mind if fairways were as wide as outfields, though.

Ok, in the technical sense, this is correct. Michael Jordan can be considered a baseball player. But honestly, is every sport baseball to CaptionBot?

Every sport is baseball except for baseball, of course, which is tennis.

If Muhammad Ali knocked out Sonny Liston with a tennis racket, we’d probably know about it. That seems like it’d be a major part of the story.

Because the image of Ali standing over a knocked-out Liston is so iconic to most, CaptionBot’s attempt to figure out what’s going on comes across as dismissive of what’s happening in the picture.

As you’ll see, history according to CaptionBot is considerably less impressive.

CaptionBot’s description of the Tiananmen Square protests is probably closer to how the government of China, which has censored images of the event inside its borders, would describe it.

The history of this picture is disputed, but there’s pretty clearly more happening than just some folks meandering down the street.

Oh no, CaptionBot is a moon landing truther.

As with all of Microsoft’s projects that rely on artificial intelligence, the system will likely improve as it’s fed more images and can learn how to better identify what it sees. Who knows if it will ever be convinced that the moon landing actually happened, though. Maybe there’s just a little bit too much Tay in CaptionBot’s DNA.

The Daily Dot