Introduction to video surveillance with Gen AI

Today, it is essential to optimize the management of corporate assets, security and control, especially in industrial facilities. In the era of generative artificial intelligence, video analytics has become a powerful tool for identifying and mitigating potential risks. This post explores the benefits and opportunities of video analytics and industrial surveillance with Gen AI.

Industrial companies that manage assets, machinery and risk need ever tighter control over their premises, a need that is addressed by the discipline of Enterprise Asset Management.

In this context, threat detection has evolved significantly thanks to advances in artificial intelligence. Traditional approaches were based on predefined rules and motion detection, and were unable to identify subtle threats. At the same time, much of the monitoring relied on manual reports from operators or on thermal cameras.

Because these inspections were manual and slow, it was common for several months to elapse between two consecutive thermal inspections at the same site. Such long intervals made the inspections of limited use for preventive and predictive maintenance.

However, AI is revolutionizing video surveillance systems and industrial asset management, complementing innovations in sensorization and Industry 4.0.

By continuously monitoring, processing, and analyzing high-quality images, models are being developed that can learn to recognize patterns and anomalous behavior with greater accuracy, enabling more effective detection of potential threats.

In addition, there is the possibility of implementing generative artificial intelligence (Gen AI) to improve efficiency and reduce costs.

Until a few years ago, the traditional strategy of companies was to train fully customized computer vision models, which was very expensive and complex. In some cases it was simply not feasible, because of the following factors:

  • All the data had to be structured and labeled.
  • It was necessary to start from an existing neural network and fine-tune it to detect latent risks. Specializing it for a specific function was very expensive.
  • Multiple projects and proofs of concept could not be developed in parallel, which was a limitation given the scale of video surveillance projects.
  • In the past, machine vision applications required extensive initial configuration to achieve accuracy.

What innovations and capabilities does Gen AI bring to these problems?

Instead of running dozens of fine-tuning projects, Gen AI lets us choose a powerful and versatile “foundation model” (a deep learning neural network that can be adapted to many needs) and build each new use case on top of it. It also lets us run very fast and inexpensive proofs of concept to find out in advance whether a project is feasible.

Increasingly intelligent and adaptable technology

Currently, more and more well-known vendors are developing accessible solutions for industrial asset management. Google, for example, already offers specific APIs for video processing, such as Video Intelligence, which allow each customer to develop its own solution. At the same time, multimodal models from OpenAI or the Llama family are a natural complement to the use of Gen AI in industrial video surveillance.

A very important factor in the intensive use of video surveillance is that cameras are becoming increasingly intelligent: it is now possible to embed software and machine learning models in the cameras themselves. At the same time, the cost of aerial photography is decreasing and access to it is becoming more widespread. Today, drones can monitor complex pipelines and installations, providing real-time images and collecting large sets of images for analysis.

This whole process relies on so-called “edge computing”, a distributed computing framework that brings business applications closer to data sources such as IoT devices or local edge servers. This proximity to the data at its source can bring great business benefits, such as faster insights, better response times, and greater bandwidth availability. These benefits matter all the more because data volumes will continue to grow as 5G networks increase the number of connected mobile devices.

All of this means that we no longer need to play back a complete video or spend days analyzing it, and that data processing is becoming increasingly efficient. In this respect, multimodal models are becoming more compact while still delivering the performance needed for a customized solution.

On the other hand, high-speed fiber connectivity is already available in modern plants. In an industrial plant whose cameras are fully wired with fiber, connectivity is effectively a commodity with negligible additional cost. This makes it possible to add cameras in key areas of the plant and develop custom applications for that type of plant.

In short, these smart and customizable technologies converge: multimodal models work well, connectivity is available, and cameras are increasingly affordable, which together address the problem of scale. We no longer need to develop a custom computer vision model for each specific application; instead, we test each model and each specific solution ad hoc.

The Problem of Prompt Engineering and LLMs for Video Surveillance

There is no doubt that in these systems Gen AI acts as a “human” interface between the user and the data. The operator asks a technical question, and the model has to answer it correctly and quickly. Asking good questions is a new skill that human users must learn, and accepting that mistakes will be made along the way is the other part of the solution.

This means working extensively on the prompts so that the hallucinations of the models (wrong answers) do not become a problem, and using prompt chains to improve the results iteratively.

Let’s say we’re developing a model for detecting failures, such as displacement in the drill pipe or gas leaks. Getting this model to identify these risks and learning to eliminate the false positives is not a trivial task.

This is, at its core, a prompt engineering problem. Suppose we have about 50 very specific video surveillance applications and, instead of developing a traditional machine learning model for each, we use a multimodal Gen AI application that we steer with instructions. In that case, good prompt engineering is essential.

This is where the scalability issue comes in: technicians must be taught to manage their own prompts, to define what is and is not a problem, and to disambiguate borderline cases. As new use cases come up, new errors come up with them.

For example, suppose we see a video of an employee in the industrial plant without a helmet or any other personal protective equipment. It is clear that the helmet is missing, but the employee is not walking between the yellow lines and is 200 meters away from the machines, so there is no actual risk and an alarm here would be a false positive.
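The helmet example can be sketched as a small post-processing rule applied to the model's answer. This is only an illustration under stated assumptions: the field names (`wearing_helmet`, `inside_yellow_lines`, `distance_to_machine_m`), the reading of the yellow lines as marking a risk zone, and the 50-meter threshold are all hypothetical and would need to be defined for each plant.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical fields parsed from a multimodal model's answer about one frame."""
    wearing_helmet: bool
    inside_yellow_lines: bool
    distance_to_machine_m: float


# Assumed threshold, for illustration only; a real plant would set its own.
RISK_DISTANCE_M = 50.0


def should_alert(obs: Observation) -> bool:
    """Alert only when a missing helmet coincides with actual exposure to risk."""
    if obs.wearing_helmet:
        return False
    # Exposure: inside the marked zone or close to the machines.
    return obs.inside_yellow_lines or obs.distance_to_machine_m < RISK_DISTANCE_M


# The employee from the example: no helmet, outside the lines, 200 m away.
far_away = Observation(wearing_helmet=False, inside_yellow_lines=False,
                       distance_to_machine_m=200.0)
# A contrasting case: no helmet, 10 m from a machine.
near_machine = Observation(wearing_helmet=False, inside_yellow_lines=False,
                           distance_to_machine_m=10.0)

print(should_alert(far_away), should_alert(near_machine))
```

Keeping the rule separate from the model's visual answer lets technicians adjust what counts as a problem without touching the prompt that describes the scene.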

How do we get the model to understand what a leak or a shift in the drill pipes is?

We can look at an image of the infrastructure and machinery rendered in different colors and configure a prompt stating that if the area around the vertex is red above the value of 0.5, that is an alarm.

We then send this prompt together with the specific photo to the large language model (LLM) and ask: Is this condition met? That is, we ask whether the vertex is red above the value 0.5, and the LLM answers “yes”.
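As a rough sketch of this question-and-answer step, the code below builds the condition prompt and turns the model's reply into a boolean alarm. It does not call any real service: `ask_model` is a hypothetical stand-in for whatever multimodal API is used, and the prompt wording, function names and strict yes/no format are assumptions made for illustration.

```python
from typing import Callable


def build_condition_prompt(color: str, threshold: float, location: str) -> str:
    """Compose a question that can only be answered with yes or no."""
    return (
        "You are inspecting a color-coded image of industrial equipment. "
        f"Is the area at the {location} {color} above the value {threshold}? "
        "Answer with exactly one word: yes or no."
    )


def parse_yes_no(answer: str) -> bool:
    """Map the model's reply to True/False; reject anything ambiguous."""
    normalized = answer.strip().strip(".!\"'").lower()
    if normalized in ("yes", "no"):
        return normalized == "yes"
    raise ValueError(f"Unexpected model answer: {answer!r}")


def is_alarm(image: bytes, ask_model: Callable[[str, bytes], str]) -> bool:
    """Ask the (hypothetical) model whether the alarm condition is met."""
    prompt = build_condition_prompt("red", 0.5, "vertex")
    return parse_yes_no(ask_model(prompt, image))


def fake_model(prompt: str, image: bytes) -> str:
    """Stand-in for a real multimodal model call, so the sketch runs."""
    return "Yes."


print(is_alarm(b"", fake_model))
```

Raising an error on any reply other than yes or no is a deliberate choice here: an ambiguous answer is surfaced for review rather than silently counted as an alarm or as a pass.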

Is that answer right or wrong? Clearly, there will be a learning path along which prompt engineers refine the prompts. That learning happens on the asset shown in the video, and false positives or false alarms gradually get eliminated.
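One simple way to track this refinement, sketched here under assumptions, is to compare the answers of each prompt version with labels given by an operator on the same frames and count the false alarms and missed alarms. The data below is made up purely for illustration, and the names `count_errors`, `prompt_v1` and `prompt_v2` are hypothetical.

```python
from typing import List, Tuple


def count_errors(predictions: List[bool], labels: List[bool]) -> Tuple[int, int]:
    """Return (false_alarms, missed_alarms) for one prompt version."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    # False alarm: the model said yes, the operator said no.
    false_alarms = sum(p and not y for p, y in zip(predictions, labels))
    # Missed alarm: the operator said yes, the model said no.
    missed_alarms = sum(y and not p for p, y in zip(predictions, labels))
    return false_alarms, missed_alarms


# Made-up data: operator labels for six frames, and the answers
# that two versions of the prompt produced on those same frames.
labels = [True, False, False, True, False, False]
prompt_v1 = [True, True, True, True, False, True]
prompt_v2 = [True, False, True, True, False, False]

print("v1:", count_errors(prompt_v1, labels))
print("v2:", count_errors(prompt_v2, labels))
```

Tracking missed alarms alongside false alarms matters because a prompt that never raises an alarm would also score zero false positives.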

As we mentioned earlier, building a specific machine learning model tailored to detect an anomaly that occurs only once in a while is very expensive and difficult to implement, because we will not have enough examples of leaks to make the model work well.

Instead, we currently have Gen AI multimodal models that allow us to do this video analysis efficiently and at a very low cost, although there is no benchmarking yet on which model works best.

Benefits of Industrial Video Surveillance with Gen AI

In summary, the main benefits of combining video surveillance in industrial environments with the use of Generative Artificial Intelligence models are:

  1. Operational efficiency to record appropriate procedures.
  2. Monitoring and visual analysis tailored to industrial needs.
  3. Cost reduction due to unbeatable hardware opportunities (cheaper cameras, drones, greater connectivity, etc.).
  4. Guaranteed security conditions and operation in a 100% controlled environment, even in remote locations.
  5. Efficient detection of intruders, industrial or machinery risks, tampering or vandalism.

If your company has video surveillance requirements or you need to understand how to develop a video analytics model with Gen AI, you can contact our 7Puentes experts to help you develop the optimal solution with a minimum viable product.

Our business specialists can provide you with professional advice to develop, test, scale and implement many projects simultaneously related to video analytics, surveillance and industrial monitoring.