Tagging.tech presents an audio interview with Brad Folkens about image recognition
Henrik de Gyor: This is Tagging.tech. I’m Henrik de Gyor. Today I’m speaking with Brad Folkens. Brad, how are you?
Brad Folkens: Good. How are you doing today?
Henrik: Great. Brad, who are you and what do you do?
Brad: My name’s Brad Folkens. I’m the CTO and co‑founder of CamFind Inc. We make an app that allows you to take a picture of anything and find out what it is, and an image recognition platform that powers everything and you can use as an API.
Henrik: Brad, what are the biggest challenges and successes you’ve seen with image recognition?
Brad: I think the biggest challenge with image recognition that we have today is truly understanding images. It’s something that computers have really been struggling with for decades in fact.
We saw that with voice before this. Voice was always kind of the promised frontier of the next computer‑human interface. It took many decades until we could actually reach a level of voice understanding. We saw that for the first time with Siri, with Cortana.
Now we’re kind of seeing the same sort of transition with image recognition as well. Image recognition is this technology that we’ve had promised to us for a long time. But it hasn’t quite crossed that threshold into true usefulness. Now we’re starting to see the emergence of true image understanding. I think that’s really where it changes from image recognition being a big challenge to starting to become a success when computers can finally understand the images that we’re sending them.
Henrik: Brad, as of March 2016, how much of image recognition is done by humans versus machines?
Brad: That’s a good question. Even in-house, quite a bit of it actually is done by machine now. When we first started out, we had a lot of human-assisted I would say image recognition. More and more of it now is done by computers. Essentially 100 percent of our image recognition is done by computers now, but we do have some human assistance as well. It really kind of depends on the case.
Internally, what we’re going for is what we call six-star answer. If you imagine a five-star answer is something where you take a picture of say a cat or a dog. We know generally what kind of breed it is. A six-star answer is the kind of answer where you take a picture of the same cat, and we know exactly what kind of breed it is. If you take a picture of a spider, we know exactly what kind of species that spider is every time. That’s what we’re going for.
Unsupervised computer learning is something that is definitely exciting, but I think we’re about 20 to 30 years beyond when we’re going to actually see unsupervised computer vision, unsupervised deep learning neural networks as something that actually finally achieves the promise that we expect from it. Until then, supervised deep learning neural networks is something that are going to be around for a long time.
What we’re really excited about is that we’ve really found a way to make that work in a way that’s a cloud site that customers are actually happy. The users of CamFind are happy with the kind of results that they’re getting out of it.
Henrik: As of March 2016, how do you see image recognition changing?
Brad: We talk a little bit about image understanding. I think where this is really going is to video next. Now that we’ve got some technology out there that understands images, really the next phase of this is moving into video. How can we truly automate and machine the understanding of video? I think that’s really the next big wave of what we’re going to see evolve in terms of image recognition.
Henrik: What advice would you like to share with people looking into image recognition?
Brad: I think what we need to focus on specifically is this new state of the art technology. It’s not quite new but of deep learning neural networks. Really we’ve played around…As computer scientists, we’ve screwed around a lot, for decades, with a lot of different machine learning types.
What really is fascinating about deep learning is it mimics the human brain. It really mimics how we as humans learn about the world around us. I think that we need to really inspire different ways of playing around with and modeling these neural networks, training them, on larger and larger amounts of real-world data. This is what we’ve really experimented is in training these neural networks on real-world data.
What we’ve found is that this is what truly brought about the paradigm shift that we were looking to achieve with deep learning neural networks. It’s really all about how we train them. For a long time, when we’ve been experimenting with image recognition, computer vision, these sorts of things. If you look at an applesto apples analogy, we’re trying to train computers very similarly to if we were to shut off all of our senses.
We have all these different senses. We have sight. We have sound. We have smell. We have our emotions. We learn about the world around us through all of these senses combined. That’s what form these very strong relationships in our memory that really teach us about things.
When you hold a ball in your hand, you see it in three dimensions because you’ve got stereoscopic vision, but you also feel the texture of it. You feel the weight of it. You feel the size. Maybe you smell the rubber or you have an emotional connection to playing with a ball as a child. All of these senses combined create your experience of what you know as a ball plus language and everything else.
Computers on the other hand, we feed them lots of two-dimensional images. It’s like if you were to close one of your eyes and look at the ball, but without any other senses at all, not a sense of touch, no sense of smell, no sense of sound, no emotional connection, none of those extra senses. It’s almost like if you’re flashing your eye for 30 milliseconds to that ball, tons of different pictures of the ball, and expecting to learn about it.
Of course, this isn’t how we learn about the world around. We learn about the world around through all these different senses and experiences and everything else. This is what we would like to inspire other computer scientists and those that are working with image recognition to really take this into account. Because this is really where we’ve seen as a company the biggest paradigm shift in image understanding and image cognition. We really want to try to push the envelope as far the state of the art as a whole. This is kind of where we see it going.
Henrik: Where can we find more information about image recognition?
Brad: It’s actually a great question. This is such a buzzword these days, especially in the past couple of years. Really, it sounds almost cheesy but just typing in a search into Google about image recognition brings up so much now.
If you’re a programmer, there’s a lot of different frameworks that you can get started with image recognition. You can get started with one of them’s called OpenCV. This is a little bit more of a toolbox for image recognition. It requires a little bit of an understanding of programming and a little bit of understanding of the math and the sciences behind it. This gives you a lot of tools for basic image recognition.
There’s two frameworks that we particularly like to play around with. One of them is called Cafe, and the other one is called Torch. Both of these are publicly available, open source projects and frameworks for deep learning neural networks. They’re a great place to play around with and learn, see how these things work.
Those are really what people tend to ask about image recognition and deep learning neural networks, that’s the sort of thing. I like to point them to because it’s great introduction and playground to get your feet wet and dirty with this type of technology.
Henrik: Thanks, Brad.
Brad: Absolutely. Thanks again.
Henrik: For more on this, visit Tagging.tech.
For a book about this, visit keywordingnow.com