Breaking Data Silos: an exploration of Federated Learning technology and practice

After a long period of development, AI technology has made major breakthroughs in algorithm design, computing power and data usage in recent years, and it now plays a pivotal role in industrial applications. But many industries that employ AI face new problems, such as “data silos,” where data are dispersed across isolated sources, and growing concern about “data privacy”. To break data silos without compromising data privacy, WeBank’s AI team proposes a widely adaptable solution: a system based on “Federated Learning”. The team has open-sourced FATE, an industrial-grade Federated Learning framework. It allows organizations to exchange information efficiently and build models jointly while respecting data security, user privacy and government regulations. WeBank’s AI department has so far applied FATE in a range of areas, including retail, insurance, regulatory technology and credit risk control.

1.Historical background of Federated Learning

Tianjian Chen, Deputy General Manager of WeBank’s AI department, pointed out that deep learning is a choice of technology, while Federated Learning is a choice of the era. With economic globalization and the global spread of the Internet, massive amounts of data are being produced, and they have profoundly reshaped all industries. However, with the introduction of data protection regulations such as the GDPR (General Data Protection Regulation), data privacy and security have attracted increasing attention.

Research on the domestic legal system for data regulation

Meanwhile, the Chinese government has been advancing laws on data supervision. The regulations share two characteristics:

Strict: data supervision is becoming stricter, with more severe penalties;

Comprehensive: protection covers personal information data, scientific data, medical data, online transactions and other types of data.

In response to this trend, businesses can employ Federated Learning to make use of big data in a reasonable and lawful way.

Tianjian Chen, Deputy General Manager of the AI department at WeBank

According to Tianjian Chen, Federated Learning is a collaborative machine learning technology for big data that is built for security and compliance. It differs fundamentally from other technologies in that it is a tool for adjusting the rights, responsibilities and interests of the parties in big data cooperation, and it has arrived at the right time, in response to public concern over data security. Federated Learning has a wide range of application scenarios and is not limited to any particular domain or algorithm. WeBank has been carrying out technical cooperation with partners across various fields, on tasks that include credit risk control, smart city management, machine vision and equipment fault detection.

After the China Artificial Intelligence Open Source Software Development Alliance (AIOSS) published the country’s first Federated Learning standard, a growing number of requests for cooperation flowed in, reflecting increasing attention to Federated Learning from enterprises and institutions. Many industry organizations have contacted us about applying Federated Learning, which is highly regarded as a solution to the global concern over data privacy.

2.FATE: the new generation of Federated Learning technology and practices

Often the reality differs drastically from the ideal in the application and implementation of AI technology:

Ideal: high quality data, abundant labeled data, concentrated data storage;

Reality: poor-quality data, a lack of labeled data, and dispersed, isolated data; more than 80% of enterprises have data silo problems.

The classification system of Federated Learning

Tao Fan, Senior Researcher at WeBank, pointed out that Federated Learning is the key technology for solving the problems above. Its characteristics are data isolation, the ability to work across data silos, non-destructive modeling, and peer cooperation. Depending on the usage scenario, Federated Learning can be divided into Vertical Federated Learning, Horizontal Federated Learning and Federated Transfer Learning.
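The three categories are usually distinguished by how the parties' datasets overlap: horizontal settings share features but not samples, vertical settings share samples but not features, and transfer settings share little of either. The sketch below illustrates that rule of thumb only. The `classify_setting` helper, the party data and the 0.5 threshold are all hypothetical and are not part of FATE.

```python
def overlap(a, b):
    """Jaccard overlap between two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_setting(ids_a, feats_a, ids_b, feats_b, threshold=0.5):
    """Guess the federated setting from sample-ID and feature overlap.

    The threshold is an illustrative assumption, not a standard value.
    """
    sample_overlap = overlap(ids_a, ids_b)
    feature_overlap = overlap(feats_a, feats_b)
    if feature_overlap >= threshold and sample_overlap < threshold:
        return "horizontal"  # same columns, different users
    if sample_overlap >= threshold and feature_overlap < threshold:
        return "vertical"    # same users, different columns
    if sample_overlap < threshold and feature_overlap < threshold:
        return "transfer"    # little overlap in either dimension
    return "high-overlap"    # both overlap heavily: little to gain


# Two regional banks: different customers, same feature columns.
bank_north = (["u1", "u2", "u3"], ["income", "balance"])
bank_south = (["u7", "u8", "u9"], ["income", "balance"])

# A bank and a retailer in one city: same customers, different columns.
bank = (["u1", "u2", "u3"], ["income", "balance"])
retailer = (["u1", "u2", "u3"], ["basket_size", "visits"])

print(classify_setting(*bank_north, *bank_south))
print(classify_setting(*bank, *retailer))
```

In practice the parties would not compare raw user IDs in the clear as this toy does; the overlap would be found with a privacy-preserving matching step.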

Tao Fan, Senior Researcher at WeBank

Federated Learning has already demonstrated its value in many fields:

Bank + Regulation: joint modeling for anti-money-laundering
Internet + Bank: joint modeling for credit risk control
Internet + Insurance: joint modeling for equity pricing
Internet + Retail: joint modeling for customer value evaluation

Meet FATE

Finally, Tao Fan introduced FATE (Federated AI Technology Enabler), the Federated Learning open-source project led by WeBank. Its core functions include:

FATE-Serving: Federated online model service
FATE-Flow & FATE-Board: Federated modeling Pipeline and visualization tool
FATE FederatedML: algorithm implementation based on Federated Learning framework
EggRoll: Distributed computing and storage abstractions
Federated Network: Communication abstractions across site networks

3.FATE-Flow: a pipeline for end-to-end federated learning production service

The advantage of Federated Learning is that data never leave local storage while the parties cooperatively build machine learning models. While this mechanism enhances data security and privacy, it also brings technical challenges. FATE-Flow is an industrial-scale, end-to-end Federated Learning Pipeline dedicated to running Federated Learning tasks with high resilience and high performance. The Pipeline covers modeling, training, model management, production release and online inference.
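To make "data never leave local storage" concrete, here is a minimal sketch that assumes a simple weighted-averaging scheme (often called federated averaging) on a one-parameter linear model. It is an illustration of the idea, not FATE's implementation: the function names and datasets are hypothetical, and a production system would additionally protect the exchanged parameters, for example with the secure multi-party computing techniques discussed in section 4.

```python
def local_update(w, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps for y = w * x on one party's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, parties):
    """One round: each party trains locally; only weights and sample counts are shared."""
    updates = [(local_update(w_global, data), len(data)) for data in parties]
    total = sum(n for _, n in updates)
    # Weighted average of the local weights, proportional to each party's data size.
    return sum(w * n for w, n in updates) / total


# Hypothetical private datasets, both generated from y = 3x.
party_a = [(1.0, 3.0), (2.0, 6.0)]
party_b = [(0.5, 1.5), (1.5, 4.5), (1.0, 3.0)]

w = 0.0
for _ in range(20):
    w = federated_round(w, [party_a, party_b])

print(round(w, 4))
```

The coordinator only ever sees a weight and a sample count from each party, never the `(x, y)` pairs themselves.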

End-to-end federated learning Pipeline

Jice Zeng shared his thoughts and practices on how to flexibly schedule and manage complex Federated Learning tasks, visualize federated modeling and implement online federated inference services. His experience focuses on bringing experimental machine learning algorithms into production.

Jice Zeng, AI System Architect from WeBank

Features of FATE-Flow include:

DAG-defined Federated Learning Pipeline: multi-asymmetric Pipeline DAG, general JSON format DAG DSL, DSL-Parser
Federated task collaborative scheduling: multi-task queue management, collaborative task distribution, task consistency management, multi-state synchronization, etc.
Federated model management: model storage, model consistency management, version management, release management, etc.
Federated task life cycle management: multi-stop, status detection, etc.
Real-time tracking of input and output of federated tasks: real-time record storage of data, models, custom metrics, logs, etc.
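As a rough illustration of the first feature, a pipeline described as a JSON DAG can be parsed and turned into a valid execution order with a topological sort. The JSON layout and component names below are hypothetical and are not FATE-Flow's actual DSL; the sketch only shows what a DSL parser of this kind has to do.

```python
import json
from collections import deque

# Hypothetical pipeline description: each component lists what it depends on.
PIPELINE_JSON = """
{
  "components": {
    "read_data":   {"depends_on": []},
    "intersect":   {"depends_on": ["read_data"]},
    "train_model": {"depends_on": ["intersect"]},
    "evaluate":    {"depends_on": ["train_model"]}
  }
}
"""


def execution_order(dsl_text):
    """Parse the JSON pipeline and return its components in a valid run order."""
    components = json.loads(dsl_text)["components"]
    pending = {name: set(spec["depends_on"]) for name, spec in components.items()}
    for name, deps in pending.items():
        unknown = deps - set(pending)
        if unknown:
            raise ValueError(f"{name} depends on unknown components: {sorted(unknown)}")

    # Kahn's algorithm: repeatedly run components whose dependencies are all done.
    ready = deque(sorted(n for n, deps in pending.items() if not deps))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for name in sorted(pending):
            deps = pending[name]
            if current in deps:
                deps.remove(current)
                if not deps:
                    ready.append(name)
    if len(order) != len(pending):
        raise ValueError("pipeline contains a cycle")
    return order


print(execution_order(PIPELINE_JSON))
```

In a federated setting each party may run a different subset of these components, which is presumably what the "multi-asymmetric" DAG refers to; this sketch handles a single party's view only.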

Jice Zeng closed his talk with a call to action: “Join FATE, Let’s Federated Everything!”

4.Shield Sandbox: data cooperation and secure multi-party computing

Machine learning techniques have many application scenarios in the current wave of digital empowerment. Machine learning algorithms require large amounts of high-quality data to perform well, which is where secure multi-party computing and Federated Learning are of great use. Building on Tencent’s existing digital ecosystem, Tencent Cloud’s Shield Data Sandbox provides a safe and reliable machine learning platform for data cooperation, covering scenarios from business promotion and joint modeling to online service.

Sandbox distributed collaborative modeling

Secure multi-party computing (MPC) addresses the problem of how multiple parties can securely carry out a joint computation when there is no trusted third party.

Common secure multi-party computing techniques include:

Secret Sharing
Garbled Circuit
Oblivious Transfer
Homomorphic Encryption
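Of the techniques listed, secret sharing is the simplest to show in a few lines. The following is a minimal sketch of additive secret sharing used to compute a sum without any party revealing its input. It simulates all parties in one process, assumes they follow the protocol honestly, and uses hypothetical names and an illustrative modulus; it is not how Shield Sandbox or FATE implements MPC.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus; all arithmetic is done mod this value


def share(secret, n_parties):
    """Split an integer into n random shares that sum to the secret mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine shares; any proper subset of them looks uniformly random."""
    return sum(shares) % MODULUS


def secure_sum(private_values):
    """Sum the inputs so that party j only ever sees the j-th share of each input."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(s[j] for s in all_shares) % MODULUS for j in range(n)]
    return reconstruct(partial_sums)


# Hypothetical private inputs held by three different organizations.
print(secure_sum([120, 45, 300]))
```

Only the published partial sums are combined at the end, so the total is learned without a trusted third party ever holding the individual values.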

Xiong Zhang, Shield Sandbox product technical director from Tencent

In his session, Xiong Zhang first introduced the four basic MPC technologies. He then explained how MPC and Federated Learning techniques help protect the cooperating parties’ data when they use Shield Sandbox. According to Xiong Zhang, the Federated Learning framework FATE means the sandbox does not have to interact directly with the original data during business data cooperation, which keeps that data protected.

Finally, Xiong Zhang said that the goal of Shield Sandbox is to provide a data cooperation environment for big data clusters with high computing and storage capacity. Shield Sandbox is built on the existing big data ecosystem on Tencent Public Cloud and aims to help all Tencent Cloud customers better realize the motto “Technology to Good, Digital Power”. In the future, FATE and Shield Sandbox will cooperate in depth in two main ways. On the one hand, Shield Sandbox will encourage holders of existing data assets on the public cloud to deploy FATE through the sandbox, helping enterprises with a data advantage in their industry to dig deeper into the value of their data and integrate into the Internet digital ecosystem. On the other hand, with the help of FATE, Shield Data Sandbox hopes to build up the digital ecosystem on Tencent Cloud, attracting enterprises that need more data to improve their business conversion rates to migrate to Tencent Cloud and experience the appeal of digital power.

This salon showed that Federated Learning is a viable way to break data silos. The study and exploration of Federated Learning will not stop, and FATE will continue to improve. Looking to the future of Federated Learning, Tianjian Chen said: “Currently the application of federated learning is mainly limited by network bandwidth and chip computing power and we are still doing Federated Learning in the data center, both of which can be satisfied relatively well. If Federated Learning is to be done on edge devices such as mobile phones in the future, greater bandwidth and edge computing power will be necessary. I’m very optimistic that 5G will bring enough bandwidth to Federated Learning and as the phone chips get stronger, Federated Learning won’t be too far away from widespread mobile devices.”