tagging.tech

Audio, image and video keywording. By people and machines.

Tagging.tech interview with Joe Dew


Tagging.tech presents an audio interview with Joe Dew about image recognition

Listen and subscribe to Tagging.tech on Apple Podcasts, AudioBoom, CastBox, Google Play, RadioPublic or TuneIn.


Keywording Now: Practical Advice on using Image Recognition and Keywording Services

Now available

keywordingnow.com

 

Transcript:

 

Henrik de Gyor:  This is Tagging.tech. I’m Henrik de Gyor. Today, I’m speaking with Joe Dew. Joe, how are you?

Joe Dew:  I’m well. How are you?

Henrik:  Good. Joe, who are you and what do you do?

Joe:  I am the Head of Product for a company called JustVisual. JustVisual is a deep learning company focused on computer vision and image recognition. We’ve been doing this for almost eight years. Think of my role as the interface between engineering and the computer vision scientists on one side, and end customers on the other.

We have a very deep technology bench and a technology stack that does very sophisticated things, but translating a lot of that technology and those capabilities to end‑consumers can be a challenge. Likewise, we have customers who are interested in the space but aren’t really clear on how to use it. My role is to translate their needs into requirements for engineering.

Henrik:  Joe, what are the biggest challenges and successes you’ve seen with image and video recognition?

Joe:  I think the biggest challenge, for a little perspective, is that the human brain has evolved over millions of years to handle and process visual information very easily. A lot of the things that we as humans can recognize and do ‑‑ things even a two‑ or three‑year‑old child can do ‑‑ are actually quite difficult for computers and take a lot of work.

The implication of this is that the expectations from users on precision and accuracy when it comes to visual recognition are very, very high. I like to say there’s no such thing as a visual homonym.

Meaning that, if you did a text search, for example, and typed in the word jaguar, and it came back with a car and it came back with a cat, you could understand why the search result came back that way. If I asked the question with a visual ‑‑ if I queried a search engine with an image ‑‑ and it came back with a car when I meant a cat, it would be a complete fail.

When we’ve done testing with users, on visual similarity for example, the expectations of similarity are very, very high. They expect almost an exact match. It’s largely because we, as humans, expect that. Again, if you think about how we interact with the world digitally, it’s actually a very unnatural thing.

When you search for things, you oftentimes have to translate that into a word or a phrase. You type it into a box and it returns words and phrases, at which point you then need to translate again into the real world.

In the real world, you just look at something and say, “Hey, I want something like that.” It is a picture in your mind, and you expect to receive something like that. What we’re trying to do is solve that problem, which is a very tricky thing for computers to do at this point. But, having said that, there have been tremendous improvements in this capability in the field.

Companies from Google to Facebook to Microsoft, for example, are doing some very interesting work in that field.

Henrik:  Joe, as of March 2016, how do you see image and video recognition changing?

Joe:  I think there are three big factors impacting this field. The first is the rising processing power of hardware ‑‑ just the chip technology, Moore’s law, that type of thing.

Second is a vast improvement in the sophistication of algorithms ‑‑ specifically, deep learning algorithms that are getting smarter and smarter in training.

The third is the increase in data. There is just so much visual data now ‑‑ which has not been true in years past ‑‑ that can be used for training and for increasing precision and recall. Those are the things that are happening on the technology side.

The translation of all of this is that the accuracy of image recognition and, for that matter, video recognition will see exponential improvements in the next few months, let alone years. You’re starting to see that already in client‑side applications, robotics, websites, and the ability to extract pieces out of an image and see visually similar results.

Henrik:  Joe, what advice would you like to share with people looking at image and video recognition?

Joe:  I think understanding the use case is probably the most important thing to think about. Oftentimes, you hear about the technology and what it can do, but you need to think thoroughly about what, exactly, you want the technology to do.

As an example, a lot of the existing technology today does what we call image recognition: the idea of taking an image or a video clip and essentially tagging it with English‑language words. Think of it as translating an image into text. That’s very useful for a lot of cases, but oftentimes, from a use case ‑‑ from a user’s perspective ‑‑ it’s not that useful.
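
To make “translating an image into text” concrete, here is a minimal sketch of generic image tagging using an off‑the‑shelf pretrained classifier from torchvision. This illustrates the general technique, not JustVisual’s system; chair.jpg is a hypothetical input file.

```python
# Minimal image-tagging sketch: a pretrained ImageNet classifier
# "translates" an image into ranked English-language tags.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # resize, crop, normalize as the model expects

def tag_image(path, top_k=5):
    """Return the top-k (label, confidence) tags for one image."""
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)        # shape: (1, 3, 224, 224)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)[0]  # class probabilities
    top = probs.topk(top_k)
    labels = weights.meta["categories"]
    return [(labels[int(i)], float(p)) for p, i in zip(top.values, top.indices)]

print(tag_image("chair.jpg"))  # e.g. [("rocking chair", 0.71), ...] (hypothetical)
```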

If you take a picture of a chair, for example, and it returns back chair, the user says, “I know it’s a chair. Why do I need this technology to tell me it’s a chair? What I’m really looking for is a chair that looks like this. Where can I find it?” That is a harder question to answer, and that is not an exercise where you’re simply translating it to words.
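
That harder “a chair that looks like this” question is typically answered with visual similarity search rather than tagging: embed each image as a vector and rank a catalog by similarity to the query photo. Here is a hedged sketch of that common technique, under the same torchvision assumption; the catalog and query file names are hypothetical.

```python
# Visual similarity sketch: embed images with a pretrained backbone,
# then rank catalog images by cosine similarity to the query photo.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
backbone.eval()
preprocess = weights.transforms()

def embed(path):
    with torch.no_grad():
        vec = backbone(preprocess(Image.open(path).convert("RGB")).unsqueeze(0))[0]
    return vec / vec.norm()  # unit length, so a dot product is cosine similarity

catalog = {p: embed(p) for p in ["chair_a.jpg", "chair_b.jpg", "chair_c.jpg"]}
query = embed("my_chair_photo.jpg")
ranked = sorted(catalog, key=lambda p: float(query @ catalog[p]), reverse=True)
print(ranked)  # most visually similar catalog image first
```

At production scale, the catalog embeddings would live in an approximate nearest‑neighbor index rather than a Python dictionary, but the ranking idea is the same.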

We found that there are companies that use Mechanical Turk techniques, etc., to essentially tag images, but users have not really adopted that because, again, it’s not that useful. So that’s one thing: think about the use case of what, exactly, you want the technology to do.

A lot of the machine learning and deep learning systems involve a lot of training. The other part you need to think about is what you want the algorithm to train for. Is it simply tagging, or is it to extract certain visual attributes? Is it pattern? Is it color? What is it that you actually want the algorithm to see, essentially?
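
If the answer is attributes rather than generic tags, the training setup changes accordingly. Below is a minimal sketch, assuming PyTorch, of a shared backbone with one prediction head per attribute; the color and pattern vocabularies and the random dummy batch are hypothetical placeholders.

```python
# Attribute-extraction sketch: one shared backbone, separate heads
# trained to "see" color and pattern instead of generic tags.
import torch
from torch import nn
from torchvision import models

COLORS = ["red", "blue", "green", "neutral"]        # hypothetical label set
PATTERNS = ["solid", "striped", "floral", "plaid"]  # hypothetical label set

class AttributeNet(nn.Module):
    def __init__(self):
        super().__init__()
        base = models.resnet18(weights=None)
        base.fc = nn.Identity()                     # 512-d shared features
        self.backbone = base
        self.color_head = nn.Linear(512, len(COLORS))
        self.pattern_head = nn.Linear(512, len(PATTERNS))

    def forward(self, x):
        feats = self.backbone(x)
        return self.color_head(feats), self.pattern_head(feats)

model = AttributeNet()
x = torch.randn(4, 3, 224, 224)                     # dummy batch of 4 images
color_logits, pattern_logits = model(x)
# One cross-entropy loss per attribute, summed (dummy random targets):
loss = (nn.functional.cross_entropy(color_logits, torch.randint(0, 4, (4,)))
        + nn.functional.cross_entropy(pattern_logits, torch.randint(0, 4, (4,))))
loss.backward()
```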

Then the third area is that, right now, user adoption of the technology is still pretty low. I think that as it becomes more and more commonplace, and you start seeing it in more and more applications, adoption will increase, but the concept of using an image as a query is still very foreign to most people.

When you say visual search, it doesn’t really mean anything to them. There’s a whole user adoption curve that has to happen before they can catch up to the technology.

Henrik:  Where can we find out more information about image and video recognition?

Joe:  You can go to our site, justvisual.com, for some background on what we do. There are a lot of interesting companies and research efforts happening right now in the field. It’s a little bit all over the place, so there isn’t necessarily one place that has all the information, because the field is changing so quickly. It’s an exciting time for this field.


 

For a book about this, visit keywordingnow.com

Author: Henrik de Gyor

Consultant. Podcaster. Writer.
