Have you ever wondered how to keep the privacy or sensitivity of your confidential data from becoming an obstacle when developing an AI or ML project? In this article, we present solutions and best practices for dealing with AI privacy challenges, which can positively impact your business and data projects, especially in the oil and gas industry.
According to industry data, more than 70% of corporate IT projects fail because of factors related to strategic management, planning, or data quality. Adding an Artificial Intelligence (AI) component makes the situation even more complex.
Machine learning (ML) projects require specific data to train on, learn from, and model reality. Because ML uses statistics to predict and recognize patterns, it works better with large data sets than with small ones.
However, models generally do not have enough data due to several factors:
- The examples provided by the business are not sufficient to infer the rules and patterns that the model needs to discover in order to generate new knowledge.
- Some models are multivariate, and the large number and diversity of variables makes the work much more complex: the more variables, the more examples are needed to discriminate rules.
But the most important factor in the success or failure of an ML project is the privacy barrier faced by the vendors who want to work with these companies, or by the data science team itself. In general, when the data cannot be accessed because some of it is confidential or sensitive, the following issues arise:
- An external consultant cannot work or intervene in any AI and ML project.
- External GenAI services such as ChatGPT cannot be used.
- When in doubt, GenAI is not used at all. The paradox is: if the company trusts cloud computing, why not GenAI?
This is the classic case in the oil and gas industry, which needs to handle more and more real-time information: maintenance reports, failures and anomalies, accident detection, and data related to plant personnel. We saw this problem clearly at the energy sector conference that 7Puentes attended in Houston, Texas.
Another classic example is users' medical or financial information. In Oil & Gas, many of the issues involve accident reports or data about an oil well or a drilling operation carried out at a specific location. Here, the location of the well, its characteristics, its productivity data, and so on are all confidential, because it is essential that this information does not leak to the general public or to the competition. Sometimes the issue is legal, not just commercial, because personal or sensitive data, such as medical or banking information, cannot be disclosed. Most importantly, data associated with an individual or a company employee can never be used as is; it must be anonymized in some way.
The result is that AI initiatives and ML projects stall. Data privacy has to be addressed so that projects can succeed while the company continues to protect its sensitive data.
Privacy solutions
In this case, there are at least two basic techniques to advance AI initiatives and overcome the problem of data privacy or sensitivity:
- Data masking: One option is to mask or obfuscate the data, much like classified FBI documents in which censored passages are blacked out. The parts of the text that should not reach ChatGPT are marked, and then dates, proper names, ID numbers, credit card numbers, and so on are replaced. This can be done with regular expressions that match anything resembling those kinds of content: every document is searched and, for example, John Doyle's name is replaced by "operator_name", the real date by "11/99/99", and the ID by 00000001. These values make no sense in real life, which makes it obvious that the data has been masked by a function. Masking is widely used in banking in particular: a data set that includes bank account numbers, for instance, must be obfuscated to protect the bank's customers, which also helps prevent fraud. Large language models (LLMs) can also be used for this, even locally, such as Llama 2 (a family of pre-trained and fine-tuned large language models published by Meta AI in 2023). A local model is mainly used to pre-process documents for obfuscation, removing sensitive data via semantic prompts, before the data is passed to ChatGPT.
- GANs to generate synthetic data: Generative adversarial networks (GANs) are a powerful machine learning technique for generating synthetic data that is hard to distinguish from real data. They have been used to generate synthetic images, text, audio, and video, with applications in a wide range of fields, including healthcare, finance, and security. A GAN pits two neural networks against each other (hence the term "adversarial"). The goal of the generator is to create synthetic data that is as realistic as possible, while the goal of the discriminator is to distinguish between real and synthetic data. The two are trained simultaneously, and over time the generator learns to produce increasingly realistic synthetic data. In short, the two components are:
- Generator: Generates realistic synthetic examples from random noise.
- Discriminator: Rejects examples that do not appear to be real.
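The regular-expression masking described in the first technique above could be sketched roughly as follows. This is only an illustration: the patterns, the `mask_text` function, and the sample report are hypothetical, and a real project would need patterns tuned to its own documents (and probably a list of known names or an entity recognizer for proper names).

```python
import re

# Illustrative rules only: (pattern, replacement) pairs. The placeholder
# values mirror the ones mentioned in the text; the patterns themselves
# are assumptions and are not a complete detector of sensitive data.
MASKING_RULES = [
    # Card numbers: 16 digits, optionally grouped in blocks of 4.
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "0000-0000-0000-0000"),
    # Dates such as 03/14/2023 or 3/14/23.
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "11/99/99"),
    # ID numbers: assumed here to be standalone runs of 7 to 9 digits.
    (re.compile(r"\b\d{7,9}\b"), "00000001"),
]


def mask_text(text, known_names=()):
    """Return a copy of text with names, card numbers, dates and IDs masked."""
    # Proper names are hard to catch with a pattern, so this sketch
    # replaces names from an explicit list (case-insensitively).
    for name in known_names:
        text = re.sub(re.escape(name), "operator_name", text, flags=re.IGNORECASE)
    for pattern, replacement in MASKING_RULES:
        text = pattern.sub(replacement, text)
    return text


report = "On 03/14/2023 John Doyle (ID 30123456) reported a leak at the well."
masked = mask_text(report, known_names=["John Doyle"])
print(masked)
# On 11/99/99 operator_name (ID 00000001) reported a leak at the well.
```

The rules are applied in order, so the more specific card-number pattern runs before the generic ID pattern. Only the masked text would then be sent to an external service.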
A great advantage of these techniques is that you can build in specific technical or business concepts that are useful for the project, or plain rules to filter the output. There is also a wide range of GAN architectures and topologies, as well as many research articles that serve as input for our projects.
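To make the generator/discriminator loop and the idea of a filtering rule more concrete, here is a deliberately tiny sketch in plain Python. It is a toy under strong assumptions, not a production GAN: the data is one-dimensional, the generator is a linear function of noise, the discriminator is a single logistic unit, and the "real" sampler and the capacity rule are hypothetical stand-ins. With such a simple discriminator the generator can at best be pushed toward the real data's mean; real projects use neural networks for both parts and a deep learning framework.

```python
import math
import random


def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real_sampler, steps=2000, lr=0.01, seed=0):
    """Alternate discriminator and generator updates; return generator params."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: G(z) = a*z + b, with z drawn from N(0, 1)
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        x_real = real_sampler(rng)
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b
        # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
        c += lr * ((1 - d_real) - d_fake)
        # Generator step: gradient ascent on log D(fake) (non-saturating loss).
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (1 - d_fake) * w
        a += lr * grad_x * z
        b += lr * grad_x
    return a, b


def generate(a, b, n, rule, seed=1, max_draws=10_000):
    """Draw up to n synthetic values, keeping only those that satisfy rule."""
    rng = random.Random(seed)
    samples = []
    for _ in range(max_draws):
        if len(samples) == n:
            break
        x = a * rng.gauss(0.0, 1.0) + b
        if rule(x):
            samples.append(x)
    return samples


def real_power(rng):
    # Hypothetical stand-in for confidential measurements (e.g. output in MW).
    return rng.gauss(4.0, 1.0)


def within_capacity(x):
    # Hypothetical business rule: output cannot be negative or above 8 MW.
    return 0.0 <= x <= 8.0


a, b = train_toy_gan(real_power)
synthetic = generate(a, b, 5, within_capacity)
print([round(x, 2) for x in synthetic])
```

The `rule` argument is where a business constraint can be plugged in: samples that violate it are simply discarded before the synthetic data set is handed over.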
Let's say 7Puentes is hired to work on a wind farm, and we want to predict its energy production, but the client is not going to give us the real data from its turbines. What we do is use a GAN to generate synthetic data that behaves statistically the way the wind farm client specifies and that follows certain rules as closely as possible. In short, we build our models on this synthetic data and never see the real data. At the end of the project, we deliver the models to the customer, who does the final fine-tuning with the real data held internally. The same approach applies to an oil well or any industrial plant that has sensitive data and wants to protect it during an AI project.
The reality is that, as consultants, we never see the real information. Nevertheless, the project moves forward and succeeds, whereas before it was stuck for the reasons discussed above.
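The hand-off described above (build the model on synthetic data, then let the customer fine-tune it on real data) can be sketched with a very small example. Everything here is an assumption made for illustration: the model is a plain linear regression of power on wind speed, the synthetic data comes from a made-up formula rather than an actual GAN, and the "real" data is a stand-in that in practice would never leave the customer's premises.

```python
def fit_linear(data, w=0.0, b=0.0, lr=0.01, epochs=200):
    """Gradient descent on mean squared error for y ~ w*x + b, from (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(data, w, b):
    """Mean squared error of the linear model on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Consultant side: only synthetic (wind speed, power) pairs are available.
# Hypothetical relation standing in for GAN output: power = 0.5 * speed.
synthetic = [(x, 0.5 * x) for x in range(1, 11)]
w0, b0 = fit_linear(synthetic)

# Customer side: fine-tuning starts from the delivered parameters (w0, b0).
# Hypothetical stand-in for the confidential data: power = 0.6 * speed - 0.2.
real = [(x, 0.6 * x - 0.2) for x in range(1, 11)]
w1, b1 = fit_linear(real, w=w0, b=b0, epochs=50)

print("error on real data before fine-tuning:", round(mse(real, w0, b0), 4))
print("error on real data after fine-tuning: ", round(mse(real, w1, b1), 4))
```

The point of the design is that the second call to `fit_linear` runs entirely on the customer's side: the consultant delivers parameters and code, and the real data is only ever touched internally.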
The value of 7Puentes in unblocking AI projects
If these problems sound familiar, contact us and we will explain how to turn your data into a valuable asset for your project using data masking and GANs, so that it can be used for decision making. We will not work with confidential data; we will use synthetic data for the entire project, protecting the data that the company wants to protect. At 7Puentes we know how to deal with the classic problems of data protection, and we have many years of experience facing them every day across different industries, especially at leading companies. Contact us!