Audio, Image and video keywording. By people and machines.


Tagging.tech interview with Martin Wilson

Tagging.tech presents an audio interview with Martin Wilson about image recognition.


Listen and subscribe to Tagging.tech on Apple Podcasts, AudioBoom, CastBox, Google Play, RadioPublic or TuneIn.


Keywording Now: Practical Advice on using Image Recognition and Keywording Services

Now available





Henrik de Gyor:  This is Tagging.tech. I’m Henrik de Gyor. Today I’m speaking with Martin Wilson. Martin, how are you?

Martin Wilson:  I’m very well, thank you. How are you?

Henrik:  Good. Martin, who are you and what do you do?

Martin:  I am a director at Asset Bank. Being a director, I’ve done an awful lot of different things over the years. I have done some development on our product, Asset Bank. I’ve done sales and I’ve done consultancy while rolling out the product.

Just to explain a little bit about what Asset Bank is as a product, it is a digital asset management solution. Digital asset management is often shortened to DAM. A DAM solution helps clients and the users to organize the digital assets that almost every organization owns and makes use of nowadays.

By digital asset, we mean primarily files. Things like images, videos, documents and all of those. Digital assets have an awful lot of value to an organization, and it’s very important that they can find them easily, that they don’t waste money recreating digital assets that they already have, and that the assets themselves are used properly in a way that’s consistent with the brand of the organization.

Henrik:  Martin, what are the biggest challenges and successes you’ve seen with image and video recognition?

Martin:  Let me first start by saying how I think that image recognition has the potential to have a really big impact on my industry, digital asset management. Digital asset management is all about being able to find images and then use them properly. That’s the purpose of the DAM system. There’s an old adage which people use and it says that a DAM system is only as good as the metadata that is associated with the assets. The reason for that is, if you have a million images in any system, it’s almost impossible to find the image you want without some sort of a search and/or a browse function. Those search and browse functions at the moment rely on what we call metadata that is associated with the assets. That metadata is things like the title or caption of an image, a description, perhaps some keywords that have been put in, maybe some information about how the image can be used.

The result of this is that people, humans, spend an awful lot of time entering the metadata that is associated with digital assets. Usually, within an organization, the processes, the workflows that are associated with using a DAM application involve uploading one or more or many digital assets, typically images or videos, and then manually entering the data by, for example, looking at the image, seeing what it’s about, what the subject is, who’s in it maybe if it’s of people and then just actually typing in that data.

As you can imagine, that takes a lot of time. It’s also considered quite boring by most people. For that reason, it’s often skipped or not done really well. If it’s not done really well, the data associated with the assets is incomplete and therefore it’s very hard for it to turn up in the right searches.

The idea that it could be automated, this process, and have a computer work out what’s in the image and tag the digital assets appropriately is enormous. It’s almost like the Holy Grail of the upload process for DAM systems.

There was an awful lot of excitement when, for example, Google Cloud Vision came out with their service. It’s what’s called an API, which enables other applications to make use of the image recognition functionality. There are a lot of other services as well that have come out in the last couple of years; Clarifai is another one.

When they came out, lots of DAM vendors got very excited and rushed to add the functionality into their own applications. We did the same. About a year ago we started a project with the objective of developing a component that could be used with Asset Bank in order to add auto-tagging capabilities to Asset Bank.

Let me just describe some of the challenges then that we found in doing that and, when we rolled it out to some of our clients, the challenges they found. One of the challenges, I suppose, which is always like an umbrella challenge over all of it, is people’s expectations.

Humans are very good at looking at images and working out what’s in it. They’ve also got a lot of domain knowledge. Usually, they understand, for example, their products. They can look at a product shot and say, “Yeah, that’s product F-567”, or whatever the code is. It’s actually very hard for computers to do that well. That problem hasn’t been solved that well yet.

What we found is, when compared with how humans tag images, the results coming from the auto-tagging software or APIs were, to be frank, not of good enough quality for most cases. That’s the second specific challenge then, really: the quality of the raw results coming back from the software. The visual recognition software was not quite good enough for use in most organizations, especially in a commercial sense.
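One simple way to put a number on that comparison is to measure how much the machine-suggested tags overlap with the tags a person entered for the same image. The sketch below is only an illustration: the function name, the sample tags and the choice of precision and recall as the measure are assumptions, not something described in the interview.

```python
def tag_precision_recall(auto_tags, human_tags):
    """Compare machine-suggested tags with human tags for one image.

    Both arguments are iterables of strings. The comparison ignores
    case and surrounding whitespace.
    """
    auto = {t.strip().lower() for t in auto_tags}
    human = {t.strip().lower() for t in human_tags}
    agreed = auto & human
    # Precision: share of suggested tags a human also chose.
    precision = len(agreed) / len(auto) if auto else 0.0
    # Recall: share of human tags the software managed to suggest.
    recall = len(agreed) / len(human) if human else 0.0
    return precision, recall


# Hypothetical product shot: the generic service sees a car,
# while the human also knows the product code.
auto = ["car", "vehicle", "wheel", "outdoors"]
human = ["car", "F-567", "red", "wheel"]
precision, recall = tag_precision_recall(auto, human)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Run over a sample of already-tagged assets, figures like these give a rough sense of whether the raw results are close to what the organization's own taggers would have entered.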

That’s not to say that it’s not useful. I’ll come on to that in a bit. On to the successes: what we found was that for certain clients who had more generic or general images, the results were much better. We’ve got some clients who are tourist boards. They’ve got images of landscapes and scenery. Most of the image recognition software is quite good at finding the subjects and suggesting keywords for those types of images.

One of the reasons for that is that most of them have been trained on image data sets made up of images that are found on the internet, for example. Of course they’re going to be generic. At the other end of the spectrum, where we found it didn’t work that well was for clients that have got quite bespoke business domains or subject domains, such as images of their own product range. It is very hard for these fairly generic image recognition software APIs to be able to come up with the right keywords for those sorts of images.

That’s possibly where there are still gaps. That might be something we’ll talk about in a minute about the future, which is the inability for a lot of this tagging software to learn from bespoke data sets.

Henrik:  Martin, as of December 2016, how do you see image and video recognition changing?

Martin:  I think it’s fair to say that it’s in its infancy at the moment. It’s only since it’s become available through the online cloud services or web services that people have found it very easy to start using this technology in their own applications. It’s only been in the last couple of years that it really has kind of taken off as something that can be openly or easily used.

Now I think the vendors of this sort of software are learning very quickly from real use cases. I think it’s quite an exciting time for where the commercial or non-commercial application of this software can go. I think if we first focus a little bit more on the current problems, that gives some insight into where the software might go, what direction it might go in.

I was just talking then about one of the problems being that it is very generic at the moment: the tags that you get back from the online services are going to be fairly generic. That’s obviously the case if you understand how they work and how they learn. I think very quickly we’re going to see these services, and I know some are already, offering the ability for you to train them with your own data sets. That then opens up the application a lot more widely.

One of the things that matters for image recognition, and artificial intelligence in general, is the context in which they’re operating. It’s much easier for image recognition software to work well if it is working within quite a narrow context. As an example, if you want to try and get the software to recognize your product range, then if it’s trained on images that are of your product range, and therefore the context is only products within your product range, then it’s a lot easier for it to recognize the right products, rather than having to think of every product that it’s ever seen an image of in the entire world.

Just to reiterate, I think the ability to train the software on bespoke data sets, and for it to concentrate on, in effect, domain-specific subjects, is a must and that will start to happen.

I think we will see quite a few hybrid solutions. What we found when we were doing our investigation into the software, and what we ended up doing within Asset Bank, or within the component which we call QuickTagger that works with Asset Bank, is coming up with a hybrid human and computer interaction model where the tags that were being suggested by the visual recognition software were not just accepted as job done. They were used to then group the images so a user could very quickly change the tags that weren’t right.

They could, for example, accept some of the tags, because they’re the right tags and the human agreed with the computer in effect, but then they could quite easily change the tags that were wrong. The key thing here is that the grouping was still being done pretty successfully. Although the tags that the image recognition software was suggesting might not be right, it was recognizing that certain images were of the same subject. That therefore meant that a human could go in and say, “Okay, I’ve got 50 images here that are all of a particular, I don’t know, model of car. They’ve all been grouped together, so that makes it really easy for me as a human to now type in the right name of the car or the model of the car.”
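The grouping-then-correcting workflow described here can be sketched roughly as follows. This is a minimal illustration and not how QuickTagger is actually built: grouping by the highest-confidence suggested tag is an assumed stand-in for whatever grouping the real component uses, and the function names, file names and tag values are all hypothetical.

```python
from collections import defaultdict


def group_by_top_tag(suggestions):
    """Group image ids by their highest-confidence suggested tag.

    `suggestions` maps an image id to a list of (tag, confidence) pairs.
    Images with no suggestions go into the "(untagged)" group.
    """
    groups = defaultdict(list)
    for image_id, tags in suggestions.items():
        top = max(tags, key=lambda pair: pair[1])[0] if tags else "(untagged)"
        groups[top].append(image_id)
    return dict(groups)


def apply_correction(groups, suggested_tag, corrected_tag):
    """Give every image in one group the tag a human typed in once."""
    return {image_id: corrected_tag for image_id in groups.get(suggested_tag, [])}


# Hypothetical suggestions returned by some tagging service.
suggestions = {
    "img_001.jpg": [("car", 0.97), ("road", 0.61)],
    "img_002.jpg": [("car", 0.95), ("tree", 0.40)],
    "img_003.jpg": [("mountain", 0.88)],
}
groups = group_by_top_tag(suggestions)

# The human sees the "car" group and replaces the generic tag
# with the specific model name for all of its images in one step.
final_tags = apply_correction(groups, "car", "Example Roadster")
print(final_tags)
```

The point of the design is that the suggested tag does not have to be right. It only has to be consistent enough across similar images that one human correction can be applied to the whole group.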

I think that idea, that where we are right now with this technology is that it can help facilitate, speed up the human interfaces. That’s quite a powerful idea, I think, and I think that will continue, so we’ll see an evolution of that. I think we’re quite a long way off just being able to say, “Okay, you get on with it, computer. Tag these up.” I think we’re going to see improvements and sort of evolution of the idea of humans and computers working together in this auto-tagging sphere.

Henrik:  Martin, what advice would you like to share with people looking at image and video recognition?

Martin:  The first thing I would say is about expectations management. If you are used to having tags generated by humans who know what they’re doing, they understand the domain, the subject domain of the images that they’re tagging, you are likely to be fairly disappointed I would say in the results for most cases.

That’s one thing. See beyond the raw results you’re getting back from the tagging software. Look to how you might use the tags though to your advantage. For example, in hybrid solutions.

Consider what subject matter you’ve got, what your images are actually of, and tailor your expectations accordingly. If you’ve got a lot of images that are of fairly generic subjects, you might find a lot of value from the auto tags. If you’ve got quite specific subjects, be prepared to potentially be a bit disappointed and/or to have to put in quite a lot of work to either start training some of the software that you’re using or look at how you can sort of augment the results with human interactions.

Sorry, another bit of advice is shop around. Have a look at the different services that are available. They’re fairly different. We built our QuickTagger in such a way that we can plug in the different services that are available, so we could just simply change it to work with Google Cloud Vision or with Clarifai, and there are ten other potential candidates that I could list off the top of my head and probably more out there. They give different results. Some of them are better for different applications as well and different subjects. Usually, it’s very simple to get a free trial and try out the software that’s there. That would be my last bit of advice. Shop around with the auto-tagging technologies that are available.
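A plug-in arrangement of this kind might look something like the sketch below, where each service sits behind the same small interface so that swapping one for another is a one-word change. Everything in it is an assumption made for illustration: the registry, the function names and the two stand-in services are hypothetical and return canned tags, and no real vendor API is called.

```python
from typing import Callable, Dict, List

# A tagger takes raw image bytes and returns a list of suggested tags.
Tagger = Callable[[bytes], List[str]]

_TAGGERS: Dict[str, Tagger] = {}


def register_tagger(name: str, tagger: Tagger) -> None:
    """Make a tagging service available under a short name."""
    _TAGGERS[name] = tagger


def suggest_tags(service: str, image: bytes) -> List[str]:
    """Ask the named service for tags; fail loudly if it is unknown."""
    if service not in _TAGGERS:
        raise KeyError(f"no tagging service registered as {service!r}")
    return _TAGGERS[service](image)


# Stand-ins for real adapters. A real adapter would call the vendor's
# API here and translate its response into a plain list of tags.
def fake_service_a(image: bytes) -> List[str]:
    return ["landscape", "mountain"]


def fake_service_b(image: bytes) -> List[str]:
    return ["outdoors", "mountain", "snow"]


register_tagger("service_a", fake_service_a)
register_tagger("service_b", fake_service_b)

image = b"placeholder image bytes"
for name in ("service_a", "service_b"):
    print(name, suggest_tags(name, image))
```

Keeping the rest of the application unaware of which service is behind `suggest_tags` is what makes it cheap to trial several services against the same set of images.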

Henrik:  Martin, where can we find out more information?

Martin:  More information on our product Asset Bank is available on our website, which is www.assetbank.co.uk. If you’re interested in particular in the experiments that we’ve done and the component that we’ve got for Asset Bank, QuickTagger, then just fill in our contact form and express that interest. I would personally be very happy to talk to people about what we found.

There’s some information about QuickTagger that we’ve developed on our website as well. If you’re interested in finding out about the different technologies that are available out there for you to use within your own application, there’s a lot. Personally, I would recommend now the cloud-based ones, because it’s much easier to get up and running with those. There’s quite a lot of information, meaning if you just type in ‘image recognition software’ or ‘image recognition APIs’, you’ll see there are quite a few good articles that people have put together on Quora and so on that have done the research for you. Use that as a starting point because, as I say, things change all the time. New APIs come out. Do your research, but there is a lot of information available on the internet about this.

Henrik:  Thanks, Martin.

Martin:  You’re welcome.

Henrik:  For more on this, visit Tagging.tech.

Thanks again.


For a book about this, visit keywordingnow.com



Tagging.tech interview with Matthew Zeiler

Tagging.tech presents an audio interview with Matthew Zeiler about image recognition






Henrik de Gyor:  [00:02] This is Tagging.tech. I’m Henrik de Gyor. Today I’m speaking with Matthew Zeiler.

Matthew, how are you?

Matthew Zeiler:  [00:06] Good. How are you?

Henrik:  [00:07] Good. Matthew, who are you, and what do you do?

Matthew:  [00:12] I am a founder and CEO of Clarifai. We are a technology company in New York, that has technology that lets the computer see automatically. You can send us an image or a video, and we’ll tell you exactly what’s in it. That means, all the objects like car, dog, tree, mountain.

[00:32] Even descriptive words like love, romance, togetherness, are understood automatically by our technology. We make this technology available to enterprises and developers, through very simple APIs. You can literally send an image with about three lines of code, and we’ll tell you a whole list of objects.

[00:53] As well as how confident we are that those objects appear within the image or video.
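Once a service returns concepts together with confidence scores, a caller typically has to decide which ones to keep. The sketch below shows one way that might be done with a simple threshold. The list-of-pairs shape, the function name and the sample scores are assumptions for illustration and do not reflect Clarifai's actual response format.

```python
def confident_tags(predictions, threshold=0.9):
    """Keep concepts scored at or above `threshold`, best first.

    `predictions` is a list of (concept, confidence) pairs with
    confidence between 0 and 1.
    """
    kept = [(concept, score) for concept, score in predictions if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [concept for concept, _ in kept]


# Hypothetical scores for one image, mixing objects and a descriptive word.
predictions = [("dog", 0.98), ("tree", 0.91), ("mountain", 0.55), ("love", 0.72)]

print(confident_tags(predictions))        # ['dog', 'tree']
print(confident_tags(predictions, 0.7))   # ['dog', 'tree', 'love']
```

A higher threshold gives fewer but safer tags. A lower one surfaces more candidates for a person to review.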

Henrik:  [00:58] Matthew, what are the biggest challenges and successes you’ve seen with image and video recognition?

Matthew:  [01:03] It’s really exciting. We started this company about two years ago, in November, 2013. We scaled it up to now over 30 people. Since the beginning, we kicked it off by winning this competition, called ImageNet. This competition is held every year. An international competition where researchers submit, and the largest companies submit, and we won the top five places.

[01:27] That was key in order to get recognition. Both in the research community, but even more importantly in the enterprise community. Since then we’ve had a tremendous amount of inbound across a wide variety of verticals. We’ve seen the problems in the wedding domain, travel, real estate, asset management. In consumer photos, social media.

[01:50] Every possible vertical and domain you can think of that has image or video content. We have paying customers. We’re solving problems that range from organizing the photos inside your pocket…we actually launched our own consumer app for this in December [2015], called Forevery, which is really exciting. Anyone with an iPhone [could] check it out.

[02:10] All the way to media companies, being able to tag their content for internal use. The tagging is very broad, to understand every possible aspect in the world. We can also get really fine‑grained. Even down to the terms and conditions that you put up for your users to upload content to your products.

[02:33] We can tailor our recognition system to help you moderate that content, and filter out the unwanted content before it reaches your live site. Lots of really exciting applications, and huge successes for both image and video.

[02:48] I think one of the early challenges, when we started two years ago, was really demonstrating the value that this technology can provide to an enterprise, and explaining what the technology is. A lot of people have heard about image recognition, or heard the phrase at least, for decades.

[03:06] It’s because it’s been in research for decades. People have been trying to solve this problem, in making computers see. Not until very recently has this happened. Now they’re seeing this technology actually work in real applications. Not just on the demo that you can see at clarifai.com, where you can throw in your own image.

[03:26] You see it happen in real‑time, but in actual products that people use every day. From customers like Vimeo to improve their video search, or Style Me Pretty to improve their management of all of their wedding albums. Or Trivago, to improve search over hotel listings.

[03:43] When you start seeing these experiences be improved, Clarifai is at the forefront there, of integrating with these leading companies across these different verticals. It went from this challenge of educating the community and enterprises about what this technology does to, now finding the best ways to integrate it.

Henrik:  [04:03] As of early March 2016, how do you see image and video recognition changing?

Matthew:  [04:09] When I started the company about two years ago, a general model that could recognize 1,000 concepts was pretty much state of the art. That’s what won ImageNet, when we kicked off the company. Now, we’ve extended that to over 11,000 different concepts that we can recognize and evolved it to recognize things beyond just objects, like I mentioned.

[04:33] Now, you can see these descriptive words, like idyllic, which will bring up beach photos. Or scenic, which will bring up nice mountain shots. Or nice weather shots, where it’s snowing, and snow on the trees. Just beautiful stuff like that. That people would describe images in this way, but we’ve taught machines to do the same thing.

[04:56] I think, going forward, you’ll see a lot more of this expansion in the capability of the machine learning technology that we use. Also a whole personalization of it. What we’ve seen with the expansion of concepts is, it’s never going to be enough. You want to give the functionality to your users, to let them customize it in the way they talk about the world.

[05:21] There’s a few concrete examples here. In stock media, we sit at the upload process of a lot of stock media sites. A pro photographer might upload an image, and they used to have to manually tag it, but this is a very slow process. We do it in real‑time. We give them the ability to remove some tags, and add some tags, and then it’s uploaded to the site.

[05:45] What this does with the stock media company, is give a much more consistent experience for buyers. If you let different people who don’t know each other, and grew up in different backgrounds, in different parts of the world, all tag their own content, they all talk with different vocabularies.

[06:01] When a buyer comes and talks with their vocabulary, and searches on the site, they get pretty much random results. It’s not the ideal and optimal results. Whereas using Clarifai, you’ll get a consistent view of all of your data, and it’s tagged in the same way. It’s much better for the buyer experience as well.

[06:19] Another example is, in our app Forevery, we’ve baked in some new technology, that’s coming later this year to our enterprise customers, which is the ability to really personalize it to you. This is showing in two different parts of the application. One is around people, where you can actually teach the app your friends and family.

[06:42] The other is around things. You can teach it anything in the world. Whether it’s the name of your specific dog, or it’s the Eiffel Tower, or any of your favorite sports car. Something like that. You can customize it. It actually is training a model on the phone to be able to predict these things.

[07:01] I think, the future of machine learning and image and video recognition is this personalization. Because it becomes more emotionally connected to you, and more powerful. It’s the way you speak about the world and see the world. We’re really excited about that evolving.

Henrik:  [07:17] As of March, 2016, how much of the image and video recognition is done by people versus machines?

Matthew:  [07:24] That’s a great question. I don’t know the concrete numbers. There’s a huge portion of our customers who were doing it manually before. We have a few case studies out there, for example, Style Me Pretty. They were doing exactly that. They had users upload a wedding album, which, as you know, might be a 1,000, 2,000 photos from a weekend wedding.

[07:47] They had a moderation team to look through all that content, and tag it. Because ultimately they want other people to come to their site, to search and find inspirations. Now we’re allowing Style Me Pretty to upload over 10 times more content onto their site, which ultimately drives more revenue for them.

[08:06] Because now they advertise next to this content. They need well‑tagged content, so both their users find it interesting, and they can match the best ads to it. Now we’re helping them automate that system. We see that over and over again across these verticals. People were doing it manually before.

[08:23] It was very costly and time‑consuming. We’re either making that faster, or scaling it up by orders of magnitude.

Henrik:  [08:30] Matthew, what advice would you like to share with people looking into image and video recognition?

Matthew:  [08:35] That’s a great question. There’s a few alternatives, and we literally just released a blog post yesterday about this. You want to consider a lot of different things, when deciding about visual recognition providers, or building the technology in‑house. What Clarifai does is take a lot of the pains out of the process.

[08:55] We have experts in‑house that have PhDs in this field of object recognition. Not just myself as a CEO, but also a whole research team, dedicated to pushing this technology forward, and applying it to new application areas. That’s kind of the expertise piece. We also have the data piece covered.

[09:15] If you come to us, and you want to recognize cars and trees and dogs, you don’t need any labeled data that has those tags already associated with it. We’ve done that process of collecting data, either from the web or from our partners, and we’ve trained a model to recognize these things automatically.

[09:34] This is as broad as possible. We do the job of curating it, so that it’s very high quality, and it doesn’t have any obscene types of concepts, that you wouldn’t want your users to be exposed to. So it’s very nicely packaged for you. Then finally, we take away the need for extensive resources as well.

[09:53] We make it so you don’t need extra machines or specialized machines. We actually use some very specialized hardware to do this efficiently. You don’t need the time it takes to train these models, which takes many weeks, or sometimes months, to get optimal performance. All that is taken care of. You literally just need three lines of code, in order to use Clarifai.

[10:15] Finally, there’s this component of independence that Clarifai has, that some other providers don’t. As a small company, we’re corely focused on understanding every image and video, to improve life. We want to apply this technology to every possible vertical, and solve every possible problem that we can, without competing with our customers.

[10:38] There are some big entries in this space, where they’re building divisions within their companies that end up competing with you. If you’re a big enterprise, looking for image and video recognition, you have to consider that as well. Basically, do you trust the provider of this technology with your data?

[10:56] Because long‑term, you want to make a partnership that you both benefit from, and don’t have to be afraid of. That’s what Clarifai provides, and we make this very affordable for you, and very simple for you to use.

Henrik:  [11:09] Matthew, where can we find out more information about image and video recognition?

Matthew:  [11:13] I would check out Clarifai’s blog. One of the goals of our marketing department is to educate the world about what visual recognition is. Not only how we do it, but how the technology works, and where you can get more resources for it. That’ll be the one‑stop shop. The first place to check out is blog.clarifai.com. We regularly update it with information.

[11:37] There’s also a lot of great resources online. The research community…if you really want to dive into the details. What this community has evolved to do, is actually not wait for conferences or journal publications, but actually publish regularly to an open community of publications, so that the latest research is always available.

[12:00] That’s something really unique in this image and video recognition space, that we don’t see in other fields of research. Depending on what stage you’re at in understanding this technology, you’ll get high-level details from Clarifai’s blog. Then low level, all the way from the research community.

Henrik:  [12:16] Well, thanks Matthew.

Matthew:  [12:17] Thank you.

Henrik:  [12:18] For more of this, visit tagging.tech.

Thanks again.
