
From the State of the Art of Computer Vision Technology to Its Future - with SenseTime Co-Founder and VP Fan Yang


Facial recognition systems have long been a staple of dystopian science fiction, but they may now become an integral facet of life in China. In Shenzhen, facial recognition systems have already been deployed to monitor pedestrians as they cross the street. A pedestrian can be caught crossing illegally once, but the second time a pedestrian is seen jaywalking, his or her face will appear on a screen with text indicating that the pedestrian is breaking the law.

The rapid increase in interest in facial recognition technology has piqued consumers' curiosity about the technology and the companies working in this field. InfoQ, a news site reporting on software development, visited the offices of SenseTime in Shenzhen and interviewed Fan Yang, Co-Founder and Vice President of SenseTime, a Hong Kong-based artificial intelligence startup that has received funding from both Qualcomm and Alibaba Group. (link to original interview)

Article translated exclusively by The Harbinger.



Facial recognition technology has become one of the most powerful tools for surveillance. It is widely used at train stations, airports, and customs checkpoints. Other applications include monitoring deposits, withdrawals, and payments at banks. In late September, a viral video called “China Skynet” depicted how China monitors its citizens. This video demonstrated China’s latest real-time pedestrian detection and identification system. The system is capable of distinguishing motor vehicles, bicycles, and pedestrians, while accurately labeling the vehicle type and the age, gender, and clothing of a pedestrian. The backbone of this technology is SenseVideo, developed by the Chinese technology company SenseTime.

A screenshot of SenseVideo detecting cars and license plate numbers in security camera footage

SenseTime was founded in October 2014 with a focus on developing facial recognition technology. Xiaoou Tang, one of the founders of SenseTime, is a professor at The Chinese University of Hong Kong (CUHK) whose research interests span computer vision and pattern recognition, and he has demonstrated accomplishments in both academic and business settings. SenseTime currently employs 140 PhD researchers. At the 2016 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), SenseTime and Tang’s research group at CUHK won three awards. Together, SenseTime and Tang’s research group submitted a combined 23 papers to the Conference on Computer Vision and Pattern Recognition (CVPR) and an additional 20 papers to the International Conference on Computer Vision (ICCV). These results demonstrated that China can perform world-class research, bringing scholarly attention to China and Chinese researchers at the leading global conferences on computer vision.


People often think of SenseTime as a company that develops facial recognition technology. However, facial recognition is only one area where SenseTime innovates. Fan Yang sees SenseTime as a service provider, developing AI solutions for assorted uses to aid companies across different industries. These industries have the potential to grow and develop by using AI. Yang states, “Of course, our current focus is on computer vision, more specifically, image and video analysis. Analysis of the human face is a very special and highly valuable service, thus a majority of the applications in image and video analysis focus on facial recognition technology. SenseTime offers a variety of business solutions across industries, but our coverage goes well beyond facial recognition.”


SenseTime has partnered with companies including Huawei, Qualcomm, China Mobile, and Xiaomi to develop applications in sectors including finance, security, Internet entertainment, augmented reality (AR), and smartphones. In July 2017, SenseTime raised $410 million in Series B funding, which was the largest single investment in any artificial intelligence (AI) company up to that time.


The developments and breakthroughs of computer vision technology

Deep Learning has allowed computer vision to be implemented outside of research settings and it can now be found in myriad industry applications. Fan Yang has been working on computer vision for many years. While an employee at Microsoft, Yang focused on developing new technologies in computer vision and computer graphics, including facial recognition systems, image recognition systems, and 3D reconstruction software. Similarly, the core technologies of SenseTime are focused on facial recognition, smart surveillance, and image recognition systems.


In the 1990s, there was an uptick in facial recognition research. In research environments, the technology achieved valuable results, but it was not yet capable of overcoming real-world challenges. When Yang joined Microsoft in 2004, the field of computer vision was progressing at a relatively slow pace. Around 2011, however, the hardware reached an important level of maturity and large companies had accumulated enormous data sets. Deep Learning algorithms became increasingly practical for solving real-world problems. This dramatically changed the field and accelerated the growth of computer vision technology, which expanded beyond the walls of research labs and can now be seen across industries in a variety of applications.


The advancement of computer vision systems depended heavily on Deep Learning algorithms and the availability of large data sets. These large data sets allowed researchers to develop algorithms that could be generalized, and increased their utility. Prior to Deep Learning, a breakthrough in one area was more likely to be limited to that specific area. However, now an improvement in processing and analyzing lighting, for example, can impact all computer vision applications.


Computer vision software can now be generalized to impact different fields on a shorter timeline and at a lower cost than was possible before, and the potential value of this is enormous. Although developing a new AI application remains challenging, many more companies are now willing to invest in the development costs of these technologies.


Yang states that, “Previously, only top-tier companies were willing to spend money and establish research institutes to perform core technology R&D. Microsoft and Bell Labs are two examples. However, today the landscape is completely different. I believe in the future, when AI technology has been applied throughout different industries, the complete ecosystem of those industries will be transformed.”


The balance between fundamental research and research applications

Those critical of the current industry approach believe that many companies and developers do not fully understand the operating principles of Deep Learning. They claim that these companies are focusing solely on applications of the technology and not developing or understanding the fundamentals.

Fan Yang indicated there are two schools of thought in academia on this question. The first believes it is critical to understand the underlying principles of these fields before developing industry applications. Yang agrees with this viewpoint and feels that an increasing number of companies, including SenseTime, are devoting more resources to pioneering fundamental scientific research. Yang states, “Fundamental research can lead us down the right path towards a sustainable future.” At the same time, Yang understands the need to strike a balance between fundamental research and commercial applications. It is critical to have both a complete scientific research system and directional guidance on industrial applications. While scientific R&D is important, companies need to work on applications to commercialize their technology.

In recent years, many companies have invested significantly in facial recognition technology and have achieved impressive results. Companies promote their verification rate as one of the leading benchmarks for comparison, and today we often see numbers like 99%, 99.4%, and 99.8%. How should we interpret the implications and gaps behind these numbers?

“Technical indicators should not be generalized. There is a wide range of assumptions behind every technical indicator.”

Yang cited a few examples. One use case in the financial industry is registration for Internet finance services, where 1-to-1 face recognition is applied; the process is similar to identifying photos of the same person in your home photo album. Another use case, in the security industry, is searching for a specific target in a huge fugitive database based on blurred photos. These are all use cases of face recognition technology, and their verification rates could all be around 99%, give or take a few decimal points.

The actual difference behind the numbers is much larger than it seems. It depends on many factors, so the verification rate is strongly tied to the industry and its assumptions. Verification rates from different use cases are hardly comparable.
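To make the distinction concrete, here is a minimal sketch of the two modes Yang contrasts: 1-to-1 verification against a single enrolled identity (the Internet finance case) versus 1-to-N search over a gallery (the fugitive database case). This is illustrative only: a real system uses learned face embeddings, and the toy vectors and the 0.8 threshold below are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify_1_to_1(probe, enrolled, threshold=0.8):
    """1-to-1 verification: is the probe the one enrolled identity?"""
    return cosine_similarity(probe, enrolled) >= threshold

def identify_1_to_n(probe, gallery, threshold=0.8):
    """1-to-N identification: best match in a gallery, or None if no
    gallery entry clears the threshold."""
    best_name, best_score = None, threshold
    for name, embedding in gallery.items():
        score = cosine_similarity(probe, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

A 99% rate on the first task and a 99% rate on the second measure very different things: the 1-to-N search must beat every wrong gallery entry, not just one comparison.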

More important than understanding the premises behind a verification rate is whether a company is able to develop original, breakthrough technology. In the Internet photo album use case, SenseTime developed the world’s first face recognition technology to surpass human performance, and many smart album services that came after originated from this technology. In Yang’s view, it is important for companies to create scalable breakthroughs in the face of unfamiliar use cases and new challenges. It takes the integration of technology, data, and market understanding to create a truly valuable and meaningful technological breakthrough.

As verification rates reach 99%, the technical challenge of face recognition lies in deepening the technology for each specific use case. Although 99% seems high enough already, different industries have different requirements for the verification rate, and in some cases 99% is only an entry requirement. For example, for personal identity verification services for banks, SenseTime’s technology has achieved a 10^-7 error rate, which is equivalent to a 7-digit bank password, and it is only beginning to be adopted. In security use cases, the issues of blurred photos, shadows, and poor angles are also creating new challenges.
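The comparison to a bank password is simple arithmetic: a 10^-7 false-accept rate means roughly one impostor in ten million gets through, the same odds as guessing a 7-digit numeric password in a single try. A quick back-of-envelope check (the daily attempt count is a made-up illustration, not a SenseTime figure):

```python
far = 1e-7                  # quoted false-accept rate
password_odds = 1 / 10**7   # one guess at a 7-digit numeric password

# With a hypothetical one million impostor attempts per day,
# the expected number of false accepts is still only 0.1 per day.
attempts_per_day = 1_000_000
expected_false_accepts = far * attempts_per_day

print(far == password_odds)  # True: both are one in ten million
```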


On the surface, the facial recognition industry might seem homogeneous and straightforward, but in fact the actual use cases can be highly complex. It is therefore meaningless to discuss the technology without understanding the business context. Typical use cases such as security and smartphones present real challenges that are worth tackling by deepening face recognition technology.


Image and video analysis is more complicated than you think

“Image and video analysis is in fact quite a sophisticated technical system from both a functional and competency standpoint. It takes several teams to work together to implement or deepen a technology.”


SenseTime’s work in the field of computer vision technology can be classified into the following areas: image enhancement, object detection, object classification, algorithm modeling, training engine and so on.


Intelligent image enhancement is the first step in image and video analysis. Although today’s cameras and video recorders are already very capable, capturing images and videos still faces various challenges. For example, depth maps extracted by infrared cameras and structured-light cameras still show a lot of noise, and fast-moving objects appear blurred under security surveillance cameras. As a result, image restoration and enhancement of degraded image content are key computer vision tasks to tackle before conducting any analysis. We also call this “Low-Level Vision,” an independent line of work at SenseTime aimed at enhancing the quality of captured images and videos.
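As a toy illustration of what low-level vision work involves, the sketch below denoises a grayscale image (stored as a list of lists) with a 3x3 mean filter. This is only a crude stand-in for real restoration pipelines, which typically use learned models rather than a fixed averaging kernel; the example exists to show the shape of the task, not SenseTime's method.

```python
def mean_filter(img):
    """Smooth a grayscale image with a 3x3 mean filter.

    Each interior pixel is replaced by the average of its 3x3
    neighborhood; border pixels are left unchanged for simplicity.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # copy so borders are preserved
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            neighborhood = (img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = sum(neighborhood) / 9
    return out
```

Running it on a bright patch with one dropped pixel pulls the outlier back toward its neighbors, which is the essence of denoising.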


The recognition and analysis of images and videos can also be classified into several sub-tasks, including object detection – knowing where an object is located; object positioning – knowing the key outline and shape of an object; object classification – identifying what an object is; and segmentation – producing clear descriptions of the edges and outline of an object. In fact, the entire recognition system can be further divided into various academic sub-streams, but in practice it is all about the combined application of these overlapping sub-streams.
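The division into sub-tasks can be pictured as a pipeline in which each stage answers one question about the scene. Everything below is a hypothetical skeleton with stubbed-out stages; the function names, return values, and data shapes are invented for illustration and are not SenseTime's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DetectedObject:
    box: Tuple[int, int, int, int]    # detection: where the object is
    keypoints: List[Tuple[int, int]]  # positioning: key outline and shape
    label: str                        # classification: what it is
    mask: List[Tuple[int, int]]       # segmentation: pixel-level edges

# Stub stages standing in for real trained models.
def detect(image):        return [(10, 10, 50, 80)]
def localize(image, box): return [(12, 15), (45, 70)]
def classify(image, box): return "pedestrian"
def segment(image, box):  return [(10, 10), (10, 11)]

def analyze(image):
    """Apply the overlapping sub-tasks in combination, as done in practice."""
    return [
        DetectedObject(box, localize(image, box),
                       classify(image, box), segment(image, box))
        for box in detect(image)
    ]
```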


SenseTime also has dedicated teams conducting fundamental research. For example, they are working on areas such as: 1) simplifying algorithm models so that AI products can run on mobile devices; 2) optimizing algorithms to enhance processing speed; 3) incremental and evolutionary updates of the AI training engine and operating system; and 4) research on weakly supervised and unsupervised learning, including pioneering technologies such as reinforcement learning and transfer learning. Yang believes that the most important thing, from building the computing engine to developing the data flow architecture, is not the amount of data; rather, it is making the algorithm stable and closed-loop.
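One concrete example of the first direction, simplifying a model so it can run on mobile devices, is weight quantization: storing float weights as 8-bit integers plus one scale factor. The sketch below is a minimal, generic version of symmetric int8 quantization, offered as an illustration of the general technique rather than SenseTime's actual method.

```python
def quantize_int8(weights):
    """Map float weights to int8 values in [-127, 127] plus a scale.

    Cuts storage roughly 4x versus float32 at a small precision cost.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in quantized]
```

Each recovered weight differs from the original by at most half the scale step, which is usually tolerable for inference.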


The contextual use cases of computer vision technology

SenseTime has long been thinking about the commercialization of computer vision technology. Yang shared in his previous speaker events that the advancement of technology would require an integration with the industry. According to him, the key applications of computer vision technology in SenseTime’s products and business lines include the following:

  1. Security. The past definition of security largely focused on public security, but in fact it covers a broader context including transportation, commercial venues, residential areas, schools, and others. The potential coverage of use cases is huge.
  2. Intelligent terminals. Typically, intelligent terminals refer to smartphones, yet the form and shape of intelligent terminals will continue to evolve, and AI technologies will certainly create significant value for these future smart devices.
  3. Online video applications. As online user behavior continues to evolve, an increasing amount of online content will shift from traditional formats like text to multimedia formats such as photos and videos. The recent explosive growth of live streaming is a leading example. SenseTime provides such video applications and platforms with a comprehensive, value-added solution.
  4. Individual identity verification. Image-based identity verification is a highly value-added application because it is a unique cross-industry solution, and it has now spread widely from online to offline. In China, the real-name policy for individual citizen identity information creates strong demand for such a solution, which also helps resolve, to a certain extent, issues in online security and offline public safety. All Internet applications, including both online businesses and offline businesses (airports, supermarkets, and hotels), will have increasing demand for individual identity verification. SenseTime offers a very comprehensive solution in this area.
  5. Autonomous driving. Autonomous driving is one of the major disruptive forces of the future, and AI technology will play an essential role in accelerating its development. SenseTime has already made sizable, strategically planned investments in this field.


The technical support behind applications in the security industry

A capable security product needs more than facial recognition. Take, for example, security in a large public open space. Several technologies are involved in this application.

  1. Hardware, i.e., security cameras. In a large public open space, a single camera cannot provide complete coverage, so panoramic and zoom cameras are needed to collect facial and other images.
  2. Crowd analysis algorithms. Security cameras use an algorithm to analyze the crowd, based on available data and human-specified rules, to determine which parts of the space are more crowded and where people linger for a long time, and thus decide which areas deserve special attention.
  3. Facial recognition. After the previous two steps, facial recognition can be applied in these flagged areas to check whether anyone is on a blacklist (for example, known pickpockets). This is why the second step looks for crowded areas: they are the areas with a higher probability of pickpocketing.
  4. Motion capture and detection. Take pickpocketing as an example: to detect it, the system has to track human body motion and look for the movement patterns associated with theft.
  5. Image enhancement. If the collected images are blurry, image enhancement technology makes them easier to analyze.

As Yang mentioned, a real-world application is typically a combination of multiple image processing technologies and algorithms. Facial recognition and motion detection are key technologies, but not the only ones; a combination of different techniques is required to achieve better performance in every computer vision application.
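The five steps above can be chained into a single frame-processing loop. The sketch below is hypothetical glue code with stubbed components; the function names, the blacklist format, and the returned alerts are all invented to show how the pieces combine, not how any real deployment works.

```python
# Stub components standing in for the real camera, models, and rules.
def enhance(frame):                return frame             # step 5: deblur/denoise
def find_crowded_areas(frame):     return ["station_exit"]  # step 2: crowd analysis
def recognize_faces(frame, areas): return ["face_0017"]     # step 3: face IDs seen
def detect_motion(frame, areas):   return ["loitering"]     # step 4: motion patterns

def process_frame(frame, blacklist):
    """One pass over a captured camera frame (step 1), combining steps 2-5."""
    frame = enhance(frame)
    areas = find_crowded_areas(frame)
    alerts = []
    for face_id in recognize_faces(frame, areas):
        if face_id in blacklist:
            alerts.append(("face_match", face_id))
    for pattern in detect_motion(frame, areas):
        alerts.append(("motion", pattern))
    return alerts
```

The point of the structure is the one Yang makes: no single model produces the alert; the output depends on enhancement, crowd analysis, recognition, and motion detection working together.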


Interdisciplinary talent is the key to applying AI

According to Fan Yang, applying AI technologies to actual projects has two major obstacles: (1) choosing the right path and timing, and (2) finding the most suitable talent.


AI applications must be integrated with industry; therefore, the first challenge is to choose the right industry. “If the company does not yet have a leading edge in an AI technology, such as video search in search engines, it might be fine for large companies to accumulate and learn the technology over time; however, smaller companies that rely on immature technology will die out in perhaps two years if they fail to generate returns,” says Fan Yang. He also mentioned that the most important thing for these companies is to make sure the chosen industry is a valid, scalable market with relatively high and inelastic demand; second, they need to obtain organic data from the market so they can continuously improve their algorithms; finally, when bringing their products to market, they need to leverage their technological advantage (which usually lasts 1–1.5 years) to build barriers to entry and solidify their leading position, so that they will not be surpassed by followers in the long term.


On the other hand, applying AI requires a comprehensive integration of key technologies and industry knowledge. Industry demand is usually obscure and uncertain, especially from a technology perspective, and it needs to be understood and resolved by people with both a technology background and industry knowledge. This is a crucial step for companies seeking to realize their AI applications. “Balancing standard and non-standard offerings is a common problem for AI products, especially in B2B industries. There is huge demand for AI talent with knowledge of both sides; however, due to the technical nature of AI, it has traditionally been less integrated with industry. Therefore, when you try to build a product that fits industry demand, you need to combine an understanding of technology with an understanding of the industry, which is the most challenging part from my point of view, because this type of talent was almost non-existent in the past. There were hardly any tech people with deep industry knowledge.”


Security issues of facial recognition

Facial recognition technology is mostly used in the security and finance sectors, particularly in banking and payment applications that demand high security levels. Not long ago, the launch of Apple’s Face ID also heated up the discussion of facial recognition security.


Fan Yang identifies two types of security issues for facial recognition: first, improving precision and minimizing misrecognition; second, preventing spoofing attacks, such as using photos or videos to bypass the recognition step. The precision of facial recognition has been constantly improving thanks to more training data and better algorithms; relatively speaking, the second issue is more challenging to deal with.


At the moment, to defend against photo and video spoofing, SenseTime accumulates a large amount of attack data to train the recognition system through pattern and spectrum analysis. “Whether the attack uses videos or photos, there are traces and evidence, such as reflections from phone screens, that can easily be found by machines trained on large amounts of data but are probably hard for humans to spot.”
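As a toy version of the idea Yang describes – machines spotting screen-replay artifacts that humans miss – the sketch below measures high-frequency energy in one row of pixels. Replayed videos often introduce sharp pixel-to-pixel jumps (moiré and screen-pixel patterns); the fixed threshold here is invented for illustration, whereas a real system like the one described would learn its decision rule from large amounts of attack data.

```python
def high_freq_energy(row):
    """Mean absolute difference between adjacent pixel values:
    a crude proxy for replay artifacts such as moiré patterns."""
    return sum(abs(a - b) for a, b in zip(row, row[1:])) / (len(row) - 1)

def looks_like_screen_replay(row, threshold=30.0):
    """Flag a pixel row as a possible screen replay (toy heuristic).

    The threshold is a made-up illustration, not a tuned value.
    """
    return high_freq_energy(row) > threshold
```

A smooth gradient of skin pixels scores low, while an alternating screen-pixel pattern scores high, which is the intuition behind detecting "traces" invisible to the eye.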


The 3D facial recognition technology adopted by Apple differs from other approaches in that its 3D camera captures image data in three-dimensional space rather than simply color information in 2D, enabling deeper analysis by the algorithm and thus better precision and security. Fan thinks that research and development in 3D capture devices is a clear industry trend, and SenseTime will also put effort in that direction in the future.


The future of Computer Vision Technology

Regarding the future of computer vision, Fan sees three major challenges. The first, and the one with the broadest industry consensus, is decreasing the reliance on data: current facial recognition models depend heavily on large data sets, while humans need far less data to learn the same task. The second is optimizing overall performance, i.e., reducing the computing cost of analysis, which is crucial for AI applications. The third is theoretical research and development; understanding the underlying principles will always benefit the industry in the long term.

Fan also thinks that the analysis and understanding of video content is one of the most promising fields of computer vision. “People have been discussing it for years, and we all have different judgments as to when it will be mature enough for applications. My personal view is that, since the Internet has become a mature ecosystem with an immense amount of business value, the number of video applications is far below where it should be. The potential value of video, or visual signals, is tremendous because visual information occupies an important proportion of human interactions. Under such circumstances, it is inevitable that we will start to explore the video field more deeply. Many offline industries have strong demand for this technology, and the analysis of images and videos online, especially understanding video content, has huge room for growth; what we can achieve today is still not enough.”


What is the position of computer vision in the blueprint of the AI industry?

“Computer Vision is the core of AI with the largest potential business value.”


Information is at the center of everything. Looking beyond the scope of AI, the whole IT industry serves to collect, transmit, store, analyze, and compute information and provide feedback. AI’s role is to let machines take over more responsibilities in this information cycle and do a better job than human beings. In daily interactions between humans, visual content is a more essential and informative type of information; therefore, it sits higher in the hierarchy of information forms and demands more on the technology side. Once we obtain the basic processing power to analyze visual information, it will create an exponential increase in value that may exceed the whole IT industry today, and may even disrupt how humans interact with the world.


An important aspect of computer vision is that human eyes can only receive and analyze a very narrow band of the electromagnetic spectrum, while machines can recognize a wider range of wavelengths through applications like infrared cameras and structured-light sensors. Fan poses an interesting question: “These sensors expand the spectrum of waves that humans can perceive, so is it possible for us to keep expanding it? From this perspective, computer vision can stand in for human beings and assist us in looking into some more fundamental aspects of this world.”


Fan thinks that the current design and usage of infrared cameras still take the perspective of human beings and rely on the support and guidance of human knowledge. The information captured by infrared cameras is transformed into visuals that humans can perceive, and then humans use machines to understand it. The next step would be to capture the information directly in a machine-understandable form, allowing the machine to expand and explore it on its own.

This article was originally published by InfoQ on November 30th, 2017.