Audio, image and video keywording. By people and machines.


Tagging.tech interview with Nikolai Buwalda

Tagging.tech presents an audio interview with Nikolai Buwalda about image recognition


Listen and subscribe to Tagging.tech on Apple Podcasts, AudioBoom, CastBox, Google Play, RadioPublic or TuneIn.


Keywording Now: Practical Advice on using Image Recognition and Keywording Services

Now available





Henrik de Gyor:  This is Tagging.tech. I’m Henrik de Gyor. Today, I’m speaking with Nikolai Buwalda. Nikolai, who are you, and what do you do?

Nikolai Buwalda:  I support organizations with product strategy, and I’ve been doing that for the last 15 years. My primary focus is products that have social networking components, and whenever you have social networking and user‑generated content, there is a lot of content moderation that’s a part of that workflow.

Recently, I’ve been working with a French company that has launched its large social network in Europe, and as a part of that, we’ve spun up a startup that I’m the Founder of called moderatecontent.com, which uses artificial intelligence to handle some of the edge cases when moderating content.

Henrik:  Nikolai, what are the biggest challenges and successes you’ve seen with image recognition?

Nikolai:  2015 was really an amazing year with image recognition. A lot of forces really came to maturity and so you’ve seen a lot of organizations deploy products and feature sets in the cloud that used or depend heavily on image recognition. It probably started about 20 years ago with experiments using neural networks.

In 2012, a team from the University of Toronto came forward with a really radical development in how neural networks are used for image recognition. Based on that, there were quite a few open source projects, a lot of video card makers also developed hardware that supported it, and in 2014 you saw another big leap by Google in image recognition.

Those products really matured in 2015, and that’s really allowed for a lot of enterprises to have a very cost effective ability now to integrate image recognition into the work that they do. So 2015 really has seen, in the $1000 range, the ability to buy a video card, use an open source platform, and very quickly have image recognition technology available to your workflow.

In terms of challenges, I continue to see two of the very same challenges existing in the industry. One is the risk to a company’s brand, and that still continues.

Even though image recognition is widely accepted as a technology that can surpass humans in a lot of cases for detecting patterns and understanding content, when you go back to your legal and to your privacy departments, they still want to have an element of humans reviewing content in the process.

It really helps them with their audit, and their ability to represent the organization when an incident does occur. Despite companies like Google going with an image recognition‑first approach and passing the Turing test, you still end up with these parts of the organization that want human review.

I think it’s still another five years before these groups are going to be swayed to have an artificial intelligence machine‑learning first approach.

The second major issue is context. Machine learning or image recognition is really great at matching patterns in content and understanding these are all the different elements that make up some content, but they are not great at understanding the context ‑‑ the metadata that goes along with a piece of content ‑‑ and making assumptions about how all the elements work together.

To illustrate this, it’s probably a very good use case that’s commonly talked about, which is having a person pouring a glass of wine. Now, in all kinds of different contexts, this content could be recognized as something that you don’t want associated with your brand versus not being an issue at all.

Think about somebody pouring a glass of wine at a cafe in France versus somebody pouring a glass of wine in Saudi Arabia. Between the two, there’s a very different context, but it’s very difficult for a machine to draw a conclusion about the appropriateness of that.

Another very common edge case that people like to use as an example is the bicycle example, where machines are great at detecting bicycles. They can do amazing things, far surpassing the ability of people to detect this type of object, but if that bicycle is a few seconds away from being in some sort of accident, machines have great difficulty detecting this.

That’s where human review and human escalation come into play for these types of issues, and they still represent a large portion of the workflow and the cost in moderating content.

Mitigating risk within your organization by having some sort of person review content, and really understanding the context, are two things that I think, in the next five years, will be solved by artificial intelligence and will really put these challenges for image recognition behind them.

Henrik:  As of March 2016, how much of image recognition is completed by people versus machines?

Nikolai:  This is a natural stat to ask about, but I think, with all the advancements in 2015, I really like to talk about a different stat. Right now, anybody developing a platform that has user‑generated content has gone with a computer vision, machine learning approach first.

They’ll have 100 percent of their content initially reviewed with this technology and then, depending on the use case and the risk profile, a certain percentage gets flagged and moved on to a human workflow. I really like to think about it in terms of, “What is the number of people globally working in the industry?”
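[Illustration, not part of the interview: a minimal sketch of the machine‑first routing described above, where every item is scored automatically and only a band of uncertain items is escalated to people. The `Item` type, the `risk_score` field, and both thresholds are hypothetical; a real platform would tune them to its own use case and risk profile.]

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    risk_score: float  # hypothetical classifier output in [0, 1]


def route(item: Item, approve_below: float = 0.2, reject_above: float = 0.95) -> str:
    """Return 'approve', 'reject', or 'human_review' for one scored item."""
    if item.risk_score < approve_below:
        return "approve"       # confidently safe: no human needed
    if item.risk_score > reject_above:
        return "reject"        # confidently unsafe: no human needed
    return "human_review"      # uncertain band: escalate to a person


# Every item goes through the machine first; only the middle band is escalated.
items = [Item("a", 0.05), Item("b", 0.50), Item("c", 0.99)]
decisions = {item.item_id: route(item) for item in items}
```

[Widening the gap between the two thresholds sends more content to people, which is the lever a legal or privacy team would likely ask to control.]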

We know today that about 100,000 to 200,000 people worldwide are working at terminals moderating content. That’s a pretty large cost and a pretty staggering human cost. We know these jobs are quite stressful. We know they have high turnover and have long‑term effects on the people doing these jobs.

The stat I like to think about is, “How do we reduce the number of people who have to do this and move that task over to computers?” We also know that it’s about a thousand times less expensive to use a computer to moderate this. It’s about a tenth of a cent per piece of content versus about 10 cents per piece of content reviewed with human escalation.
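[Illustration, not part of the interview: a rough cost sketch using only the two per‑item figures quoted above (about 0.1 cent for machine review, about 10 cents with human escalation). Taken at face value, those two figures give a per‑item ratio of about 100. The volume and the escalated fraction are made‑up inputs.]

```python
MACHINE_COST_CENTS = 0.1  # quoted: "about a tenth of a cent per piece of content"
HUMAN_COST_CENTS = 10.0   # quoted: "about 10 cents" with human escalation


def moderation_cost_dollars(items: int, human_fraction: float) -> float:
    """Cost when every item is machine-reviewed and a fraction is escalated."""
    machine = items * MACHINE_COST_CENTS
    human = items * human_fraction * HUMAN_COST_CENTS
    return (machine + human) / 100.0


# Hypothetical volume: one million items, 5 percent escalated to people.
mostly_machine = moderation_cost_dollars(1_000_000, 0.05)
# Same volume with every item also reviewed by a person.
all_escalated = moderation_cost_dollars(1_000_000, 1.0)
```

[Under these assumptions the escalated fraction dominates the bill, which is why reducing human review is the metric the speaker prefers.]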

In terms of really understanding how far we’ve advanced, I think the best metric to keep is how we can reduce the number of people who are involved in manual reconciliation.

Henrik:  Nikolai, what advice would you like to share with people looking into image recognition?

Nikolai:  My advice, and it’s something that people have probably heard quite a bit, is that it’s really important to understand your requirements and to gain consensus within your organization about the business function you want image recognition to do.

It’s great to get excited about the technology and to see where the business function can help, but it’s the edge cases that can really hurt your organization. You have to gather all the requirements around.

That means meeting with legal, privacy, security and understanding the use case that you want to use image recognition for and then the edge cases that may pose some risks to your organization. You really have to think about all the different feature sets that go into making a project really successful with image recognition.

One important thing is how it integrates with your existing content management system. A lot of image recognition platforms use third parties, and they can be offshore in countries like the Philippines and India. Understanding your requirements for sending content over there is really important, and your infosec department needs to know how that integrates.

Having escalation and approval workflows, this is really going to protect you in these edge cases where there is the need for human review. That needs to be quite seamless as there’s still a significant amount of content that gets moderated and approved this way.

Having language and cultural support, global companies really have to consider the impact culturally of content from one region versus another. Having features and an understanding built into your image recognition that it can adapt to that is very important.

Crisis management, this is something that all the big social platforms have playbooks ready to go for. It’s very important because, even if it’s, like I said, one image in a million that gets classified poorly, it can have a dramatic impact in media or even legally for you. You want to be able to get ahead of it very quickly.

A lot of third parties provide these types of playbooks, and it’s a feature set that they offer along with their resources. The usual feature set you have to think about ‑‑ language filters, image, video, chat protection. An edge case that has a lot of business rules associated with it is the protection of children and social‑media filtering.

You might want to have a wider band of guardrails to protect you on response rate and throughput. A lot of services have different types of offerings. Some will moderate content over 72 hours, and others you need response rates within the minute.

Understanding your throughput and response rate that’s required is very important and really impacts the cost of the offering that you are looking to provide. Third‑party list support ‑‑ a lot of companies will provide business rule guidance and support on the different rule sets that apply to different regions around the world.

That’s important to understand which ones you need and how to support it within your business process. Having user flags is important to demonstrate control of your content. Giving the people who are consuming your content the ability to flag content into a workflow demonstrates one of the controls that you often need to have in place for the edge cases.

The edge cases are where media and legal really have a lot of traction and are looking for companies to provide really good controls for protecting themselves. Things like suicide prevention, bullying, and hate speech can really dramatically…just one case can have a significant impact on your brand.

The last item is that a lot of organizations, for a lot of different reasons, have their content moderation done within their own organization. They have the human review within their own organization, and so having training of that staff for some of the stressful portions of that job and training for HR is very important. It is something to consider when building out these workflows.

Henrik:  Nikolai, where can we find more information about image recognition?

Nikolai:  The leading research for image recognition really starts at the ImageNet competition that’s hosted at Stanford. If you Google ImageNet in Stanford, you’ll find that the URL isn’t that great and officially it’s called the ImageNet Large Scale Visual Recognition Challenge. This is where all the top organizations, all the top research teams in image recognition compete to have the best algorithms, the best tools, and the best techniques.

This is where all the breakthroughs in 2012 and 2014 happened. Right now, Google is the leader, but it’s very close, and image recognition at that competition is certainly at a level where these teams are far exceeding the capability of humans. So from there, you get to see all the tools and techniques that the latest organizations are using, and what’s amazing is that the same tools and techniques they use on their platforms exist for integrating within your own organization.

On top of that, the competition between video card providers, between AMD and NVIDIA, has really pushed the hardware that supports this to allow for real‑time image recognition in a very cost‑effective manner. The tools that they talk about at this competition leverage that hardware and so it’s a great starting place to understand what the latest techniques are and how you might implement them within your own organization.

Another great site is opencv.org or open computer vision, and they have built up a framework around taking all the latest tools and techniques and algorithms and packaging them up in a really easy‑to‑deploy toolset. It has been around for a long time and so they really have a lot of examples, a lot of the background about how to implement these types of techniques.

If you are hoping to get an experiment going very quickly, using some of the open source platforms from ImageNet competitions and using OpenCV together you can really get something up very quickly.

On top of that, when you’re building out these types of workflows, you need to work closely with a lot of the nonprofits that have great guidance on what are the rule sets, what are the guardrails you need to have in place to protect your users and to protect your organization.

Facebook has really been a leader in this area and they have spun up a bunch of different organizations they work with ‑‑ the National Cyber Security Alliance, Childnet International, connectsafely.org ‑‑ and there are a lot of region‑specific organizations that you can work with. I definitely recommend using their guardrails as a great starting point for a framework when understanding how image recognition can moderate your content, and how image recognition can be used in an ethical and legal manner.

In terms of content moderation, it’s a very crowded space right now. Some of the big partners, they don’t talk a lot about their statistics, but they are doing a very large volume of moderation. Companies like WebPurify, Crisp Thinking, and crowdsource.com, they all have an element of machine learning and computer and human interaction.

The cloud platforms like AWS and Azure have offerings for the machine learning side. Adobe definitely is a content management platform. They have a great integrated software package if you use that platform.

Another aspect, which is quite important, is a lot of companies do their content moderation internally, and so having training for that staff and training for your HR department is very important. But all in all, there are a lot of resources, a lot of open source platforms that make it really easy to get started.

TensorFlow, which is an open source project from Google, they use it across their platform. I think they have…The last I checked, it was about 40 different product offerings that use the TensorFlow platform, and it is a neural network based image recognition type technology. It’s very visual and it’s very easy to understand and can really help reduce the amount of time to go to production with some of this technology.

Other open source projects, if you don’t want to be attached to Google, include Caffe, Torch, Theano, and NVIDIA. They have a great offering tied to their technology.

Henrik:  Well, thanks Nikolai.

Nikolai:  Thank you, Henrik. I’m excited about content moderation. It’s a topic that’s not really talked about a lot, but it’s really important and I think in the next five years we are really going to see the computer side of content moderation and image recognition take over, understand the context of these items, and really reduce the dependency on people to do this type of work.

Henrik: For more on this, visit Tagging.tech. Thanks again.


For a book about this, visit keywordingnow.com


Tagging.tech interview with Nicolas Loeillot

Tagging.tech presents an audio interview with Nicolas Loeillot about image recognition







Henrik de Gyor:  This is Tagging.Tech. I’m Henrik de Gyor. Today, I’m speaking with Nicolas Loeillot. Nicolas, how are you?

Nicolas Loeillot:  Hi, Henrik. Very well, and you?

Henrik:  Great. Nicolas, who are you, and what do you do?

Nicolas:  I’m the founder of a company which is called LM3Labs. This is a company that is entering into its 14th year of existence. It was created in 2003, and we are based in Tokyo, in Singapore, and in Sophia Antipolis in the south of France.

We develop computer vision algorithms, software, and sometimes hardware. Instead of focusing on some traditional markets for this kind of technology, like military or security and these kinds of things, we decided to focus on some more fun markets, like education, museums, entertainment, marketing.

What we do is to develop unique technologies based on computer vision systems. Initially, we were born from the CNRS, which is the largest laboratory in France. We had some first patents for triangulation of fingers in 3D space, so we could very accurately find fingers a few meters away from the camera, and use these fingers for interacting with large screens.

We thought that it would be a good match with large projections or large screens, so we decided to go to Japan and to meet video projector makers like Epson, Mitsubishi, and others. We presented the patent, just the paper, [laughs] explaining the opportunity for them, but nobody understood what would be the future of gesture interaction.

Everybody was saying, “OK, what is it for? There is no market for this kind of technology, and the customers are not asking for this.” That’s a very Japanese way to approach the market.

The very last week of our stay in Japan, we met with NTT DoCoMo, and they said, “Oh, yeah. That’s very interesting. It looks like Minority Report, and we could use this technology in our new showroom. If you can make a product from your beautiful patent, then we can be your first customer, and you can stay in Japan and everything.”

We went back to France. We made the electronics for supporting their technology. Of course, some pilots were already written, so we went back to NTT DoCoMo, and we installed them in February 2004.

From that, NTT DoCoMo introduced us to many big companies, like NEC, DMP, and some others in Japan, and they all came with different types of requests. “OK. You track the fingers, but can you track the body motion? Can you track the gestures? Can you track the eyes, the face, the motions and everything?”

We made a strong evolution of the portfolio with something like 12 products today, which are all computer vision‑related, which are usually pretty unique in their domain, even if we have seen some big competitors like Microsoft [laughs] on our market.

In 2011, we were the first to see the first deployment of 4G networks in Japan, and we said, “OK. What do we do with the 4G? That’s very interesting, very large broadband, excellent response times and everything. What can we do?”

It was very interesting. We could do what we couldn’t do before, which is to put the algorithm on the cloud and to use it on the smartphone, because smartphones were becoming very smart. It was just the beginning of smartphones at the time, with the iPhone 4S, which was the first one which was really capable of something.

We started to develop Xloudia, which is today one of our lead products. Xloudia is mass recognition of images, products, colors, faces and everything from the cloud, and in 200 milliseconds. It goes super fast, and we search in very large databases. We can have millions of items in the base, and we can find the object or the specific item in 200 milliseconds.

Typically, applying the technology to augmented reality, which was done far before us, we said, “OK. The image recognition can be applied to something which is maybe less fun than the augmented reality, but much more useful, which is the recognition of everything.”

You just point your smartphone to any type of object, or people, or colors, or clothes, or anything, and we recognize it. This can be done with the algorithm, with the image recognition and the video recognition. That’s a key point, but not only with these kinds of algorithms.

We need to develop some deep learning recognition algorithms for finding some proximities, some similarities, and to offer the users more capabilities than saying, “Yes, this is it,” or, “No, this is not it.” [laughs]

We focus on this angle, which is, OK. Computer vision is under control. We know our job, but we need to push the R&D into something which is more on the distribution of the search on the network, in order to go very fast. That’s the key point. The key point was going super fast, because for the user experience, it’s absolutely mandatory.

On the other hand is, “If we don’t find exactly what is searched by the user, how can we find something which is similar or close to what they are looking for?” There is an understanding of the search, which is just far beyond the database that we have in catalog, and just to make some links between the search and the environment of the users.
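[Illustration, not part of the interview: a minimal sketch of the "exact match, otherwise something similar" idea, using cosine similarity over small feature vectors. This is a generic nearest‑neighbor technique and not a description of Xloudia's algorithm; the catalog, vectors, and threshold are hypothetical.]

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, catalog, exact_threshold=0.98, top_k=2):
    """Return an exact match if one is close enough, else the nearest items."""
    ranked = sorted(catalog.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    if not ranked:
        return {"match": None, "similar": []}
    best_name, best_vec = ranked[0]
    if cosine(query, best_vec) >= exact_threshold:
        return {"match": best_name, "similar": []}
    # No exact hit: fall back to the closest items so the user always gets something.
    return {"match": None, "similar": [name for name, _ in ranked[:top_k]]}


# Hypothetical catalog of feature vectors.
catalog = {
    "cola_bottle": [1.0, 0.0, 0.0],
    "lemon_soda_bottle": [0.8, 0.6, 0.0],
    "sneaker": [0.0, 0.0, 1.0],
}
exact = search([1.0, 0.0, 0.0], catalog)
fallback = search([0.9, 0.4, 0.1], catalog)
```

[Here `exact` reports a direct match, while `fallback` has no match above the threshold and returns the two closest catalog items instead of a blank answer.]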

The other thing that we focused on was actually the user experience. For us, it was absolutely critical that people don’t press any button for finding something. They just have to use their smartphone, to point it to the object or to the page, or to the clothes, or anything that they want to search, and the search is instantaneous, so there is no other action.

There is no picture to take. There is no capture. There is no sending anything. It’s just capturing in real time from the video flow of the smartphone, directly understanding what is passing in front of the smartphone. That was our focus.

On this end, it implies a lot of processes, I would say, for the synchronization between the smartphone and the cloud. Because you can’t send all the information permanently to the cloud, so there is some protocol to follow in terms of communication. That was our job.

Of course, we don’t send pictures to the cloud because it’s too heavy, too data‑consuming. What we do is make a big chunk of the extraction, or of the work, on the smartphone, and send only the necessary data for the search to the cloud.

The data, they can be feature points for the image. They can be a color reference extracted from the image. They could be vectors, or they could be a series of images from a video, for instance, just to make something which is coherent from frame to frame.
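[Illustration, not part of the interview: a minimal sketch of extracting a compact descriptor on the device and sending only that to the cloud. It uses a coarse color histogram, one of the data types mentioned above; the bin count, the JSON payload shape, and the stand‑in frame are assumptions, and real feature points would need a computer vision library.]

```python
import json


def color_histogram(pixels, bins_per_channel=4):
    """Normalized coarse RGB histogram; bins_per_channel should divide 256."""
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1
    total = len(pixels) or 1
    return [round(count / total, 4) for count in hist]


# Stand-in for one 64x64 video frame: a flat list of (r, g, b) tuples.
frame = [(200, 30, 30)] * (64 * 64)

descriptor = color_histogram(frame)
# Hypothetical payload shape; only the descriptor leaves the device.
payload = json.dumps({"type": "color_hist", "data": descriptor}).encode("utf-8")

raw_size = len(frame) * 3      # bytes for the uncompressed RGB frame
payload_size = len(payload)    # bytes actually sent to the cloud
```

[Even for this tiny frame the descriptor is far smaller than the raw pixels, which is the point of doing the extraction on the smartphone.]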

That’s Xloudia, super fast image recognition with the smartphone, but cloud‑based, I would say, and the purpose is really to focus on the user experience, to go super fast, and to always find something back [laughs] as a reference.

The target market may be narrower than what we had before with augmented reality, and what we target is to help the e‑commerce, or more specifically, the mobile commerce players to be able to implement the visual search directly into their application.

The problem today that we have even in 2016, the problem is that when you want to buy something on your smartphone, it’s very unpleasant. Even if you go to bigger e‑commerce companies like Amazon and the others, what you have on your smartphone is just a replication of what you can see on the Web, but it’s not optimized to your device. Nobody’s using the camera, or very few are using the camera for search.

The smartphone is not a limited version of the Web, typically. It’s coming with much more power. There are cameras. There are sensors, and many things that you’d find on a smartphone which are not on a traditional PC.

The way we do mobile commerce must be completely different from the traditional e‑commerce. It’s not a downgraded version of the e‑commerce. It must be something different.

Today, we see that 50 percent of the Internet traffic to big brand websites is coming from the smartphone. 50 percent, and 30 percent of the e‑commerce is done from mobile.

It means that there is a huge gap between these 50 percent and these 30 percent. There is 20 percent of the visitors who don’t buy on the smartphone because of this lack of confidence or economics or something.

There is something wrong on the road to [laughs] the final basket. They don’t buy with the smartphone, and this smartphone traffic is definitely increasing with time, as well. It’s 50 percent today for some big brands, but it’s increasing globally for everybody.

There are some countries, very critical countries like Indonesia or India, who have a huge population, more than 300 million in Indonesia, one billion people in India. These guys, they go straight from nothing to the latest Samsung S6 or 7.

They don’t go through the PC stage, so they directly buy things from the smartphone, and there’s a huge generation of people who will just buy everything on their smartphone without knowing the PC experience, because there are no ADSL lines, because there are so many problems with the PC. It’s too expensive, no space, or whatever.

We definitely target these kinds of markets, and we want to serve the e‑commerce or the mobile commerce pioneers, people who really consider that there is something to be done in the mobile industry for improving the user experience.

Henrik:  What are the biggest challenges and successes you’ve seen with image and video recognition?

Nicolas:  If you want to find something which is precise, everything is fine today. 2016 saw many technologies and algorithms where you can compare, “OK. Yes, this is a Pepsi bottle, and this is not a Coca‑Cola bottle,” so that’s pretty under control today. There is no big issue with this.

The challenge ‑‑ I would prefer to say war ‑‑ is really understanding the context, so bringing more context than just recognizing a product is, “What is the history? What is the story of the user, the location of the user? If we can’t find, or if we don’t want to find a Pepsi bottle, can we suggest something else, and if yes, what do we suggest?”

It’s more than just tagging things which are similar. It’s just bringing together a lot of sources of information and providing the best answer. It’s far beyond pure computer vision, I would say.

The challenge for the computer vision industry today, I would say, is to merge with other technologies, and the other technologies are machine learning, deep learning, sensor aggregations, and just to be able to merge all these technologies together to offer something which is smarter than previous technologies.

On the pure computer vision technologies, of course, the challenge is to create a database or knowledge where we can actually identify that some objects are close to what we know, but they are not completely what we know, and little by little, to learn or to build some knowledge based on what is seen or recognized by the computer vision.

One of the still‑existing challenges…I have been in this industry for a few decades, but [laughs] there is still a challenge which is remaining, which is actually, I would call it, the background abstraction or the noise abstraction: “How can you extract what is very important in the image from what is less important?”

That’s still something which is a challenge for everyone, I guess, is just, “What is the focus? What do you really want? Within a picture, what is important, and what is not important?” That is a key thing, and algorithms are evolving in this domain, but it’s still challenging for many actors, many players in this domain.

Henrik:  As of March of 2016, how do you see image and video recognition changing?

Nicolas:  The directions are speed. Speed is very important for the user experience. It must be fast. It must be seamless for the users.

This is the only way for service adoption. If the service is not smooth, is not swift ‑‑ there are many adjectives for this in English [laughs] ‑‑ but if the experience is not pleasant, it will not be adopted, and then it can die by itself.

The smoothness of the service is absolutely necessary, and the smoothness for the computer vision is coming from the speed of the answer, or the speed of the recognition. It’s even more important to be fast and swift than to be accurate, I think. That’s the key thing.

The other challenge, the other direction for our company is definitely deep learning. Deep learning is something which is taking time, because we must run algorithms on samples on big databases for building an experience, and building something which is growing by itself.

We can’t say that the deep learning for LM3Labs, or for another company, is ready and finished. It’s absolutely not. It’s something which is permanently ongoing.

Every minute, every hour, every day, it’s getting there, because the training is running on, and we learn more to recognize. We improve the recognitions, and we use the deep learning for two purposes at LM3Labs.

One of them is for the speed of recognitions, so it’s the distribution of the search on the cloud. We use deep learning technologies for smartly distributing the search and going fast.

The other one is more computer vision‑focused, which is to, if we don’t find exactly something that the user is trying to recognize, we find something which is close and we can make recommendations.

These recommendations are used for the final users so they can have something at the end, and it’s not just a blank answer. There is something to propose, or it can be used between the customers.

We can assess some trends in the search, and we can provide our customers, or B2B customers, we can provide them with recommendations saying, “OK. This month, we understand that, coming from all our customers, the brand Pepsi‑Cola is going up, for instance, instead of Coca‑Cola.” This is just an example. [laughs] That’s typically the type of application that we use with the deep learning.
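[Illustration, not part of the interview: a minimal sketch of the trend report described above. The speaker attributes this to deep learning; the sketch swaps in plain month‑over‑month counting of recognized brands, and the brand names and search logs are made up.]

```python
from collections import Counter


def brand_trends(previous, current):
    """Change in recognition counts per brand between two periods."""
    prev_counts, cur_counts = Counter(previous), Counter(current)
    brands = sorted(set(prev_counts) | set(cur_counts))
    return {brand: cur_counts[brand] - prev_counts[brand] for brand in brands}


# Hypothetical logs of which brand each visual search resolved to.
last_month = ["brand_a", "brand_b", "brand_b"]
this_month = ["brand_a", "brand_a", "brand_a", "brand_b"]

trend = brand_trends(last_month, this_month)
rising = [brand for brand, delta in trend.items() if delta > 0]
```

[A positive delta marks a brand that is "going up" across customers, which is the kind of recommendation the speaker describes passing back to B2B customers.]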

Henrik:  What advice would you like to share with people looking at image and video recognition?

Nicolas:  Trust the vision. The vision is very important. There are a lot of players in the computer vision community today.

Some have been acquired recently; Metaio by Apple and Vuforia by PTC are two recent examples. Some people are focused on augmented reality, so really making the visual aspect of things. Some others are more into cloud for the visual search, and just improving the search for law enforcement and these kinds of things.

The scope, the spectrum of the market is pretty wide, and there is probably someone who has exactly the same vision as you [laughs] on the market.

On our side, LM3Labs, we are less interested in augmented reality clients, I would say. We are less interested in machine‑to‑machine search because this is not exactly our focus, either.

We are very excited by the future of mobile commerce, and this is where we focus, and our vision is really on this specific market segment. I would say the recommendation is find a partner who is going with you in terms of vision. If your vision is that augmented reality will invade the world, go for a pure player in this domain.

If you have a smart vision for the future of mobile commerce, join us. [laughs] We are here.

Henrik:  Thanks, Nicolas. For more on this, visit Tagging.tech.

Thanks again.

