SCB Data Lake Evolution

Siam Commercial Bank (SCB), Thailand's first bank, has been a pillar of the Thai economy for over 115 years. The bank's recognition of data as a corporate asset has kept it ahead of the competition: data-driven practice is still relatively new in Thailand, and SCB was among the first organizations to lay the groundwork for data and technology. This article explains how Thailand's oldest bank became one of the first institutions in the country to adopt cutting-edge Data and AI technology.


A Pioneering Era: From Enterprise Data Warehouse to an On-Premises Data Lake

SCB implemented Enterprise Data Warehouse (EDW) technology in 2015 to consolidate all of its data into a single repository. Before the term "data lake" became commonplace, companies spoke of "Big Data" and "unstructured data." A data lake is a large store that holds both unstructured and semi-structured data to support analytics and other workloads. Before the big-data era, data was collected and structured in tabular form, and the Data Warehouse Team was responsible for building relational data models. In 2016, when this small team was first formed, SCB began working with the leading consulting firm Accenture to build a data platform, with Accenture handling the initial deployment. The turnkey project took a year to complete: it was an on-premises data lake built on the then-popular Cloudera Data Platform, running on servers installed in the bank's data center. With Cloudera, SCB adopted an unusual strategy. Typically, an EDW receives data as files that are exported to Teradata for model creation. SCB instead sent files directly to the big-data platform for transformation into models, bypassing the EDW.


Responsibilities were divided between the two systems: the EDW supported operational, downstream, and application workloads, while analytics became the data lake's primary function. The data lake is used mostly by analytics teams across the organization. In the past, those teams depended on the centralized EDW pool as their main data source; because EDW resources were centralized and could not be scaled up, analyzing massive amounts of data was slow. The data lake offered an alternative to the EDW for analytics requiring long-term historical data, and it was built so that analytics teams could evaluate vast amounts of history more quickly. In the years that followed, more users adopted the data lake. The FIA Team was among the first, submitting requests for financial analysis models to the Data Lake team, and the user base has grown continuously since.

Transitioning from an On-Premises Data Lake to a Cloud Data Lake

In the year following the completion of the data lake platform in 2017, the bank's management mandated a move to cloud computing. Upon completion of this project, the Data Team was reorganized into the Data Science, Data Solution Delivery, Data Engineering, Data Operations, and Data Visualization teams, which together formed the Data Solution Support (DSS) group. Their functions were split along the architecture of the data lake platform. These teams support the analytics teams, with the FIA Team among the initial adopters. During this period, the Data Science Team recruited additional members proficient in analytics and artificial intelligence who worked solely on the data lake. As more data moved to the cloud, the bank saw a steady migration of analytics teams from the EDW to the cloud, although certain teams, such as risk management, continued to use the EDW as their primary source. Throughout 2018 and 2019, services and platform versions were gradually upgraded.

Cloud-based operations are advantageous for their convenience, speed, ease of platform maintenance, and scalability, and they remove the need for a data center. Previously, scaling the on-premises Cloudera Data Platform in the data center required lengthy hardware procurement and networking processes, and when the system failed to execute or perform, methods and scripts had to be re-tuned for troubleshooting. For data warehousing, Teradata had long served as a Massively Parallel Processing (MPP) database engine, superior to Oracle, a conventional database that could only be scaled so far before operations slowed as data grew. Teradata's parallel-processing approach spread work across multiple nodes, but because it required specialized infrastructure it was exceedingly costly. Thanks to its speed advantage over typical relational databases, Teradata nonetheless dominated the data warehouse market for decades.

When open-source big data tools became available, however, customers stopped expanding their use of Teradata because of its high price. With the rise of data science, big data gained further popularity: users could process whatever historical data they needed, or run ordinary queries, at a reasonable cost. On price, speed, and convenience, cloud computing is indisputably more cost-effective and efficient. Since 2016, we have been able to execute tens of thousands of jobs and accumulate large quantities of data. Users can easily configure storage properties, and data older than one year can be separated into archives, allowing files to be transferred automatically into long-term storage.
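As a rough sketch of how such an age-based archiving rule might work (the one-year threshold, directory layout, and function name here are illustrative assumptions, not SCB's actual configuration):

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

# Hypothetical threshold: files untouched for over a year go to the archive.
ARCHIVE_AGE = timedelta(days=365)

def archive_old_files(active_dir, archive_dir, now=None):
    """Move files whose last-modified time exceeds ARCHIVE_AGE from the
    active storage area into long-term archive storage.  Returns the
    names of the files that were moved."""
    now = now or datetime.now()
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in sorted(Path(active_dir).iterdir()):
        if not f.is_file():
            continue
        modified = datetime.fromtimestamp(f.stat().st_mtime)
        if now - modified > ARCHIVE_AGE:
            shutil.move(str(f), str(archive / f.name))
            moved.append(f.name)
    return moved
```

In a real cloud lakehouse this would more likely be expressed as a storage-service lifecycle policy (moving old blobs to an archive tier) rather than a hand-rolled script, but the rule is the same.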


From a Cloud Data Lake to a Cloud Data Lakehouse and Real-Time Signals

SCB's data system is now hosted in the cloud, putting the bank ahead of the curve and making it Thailand's leading company in Data & AI technology. SCB uploaded all of its data and migrated its engines, ETL, and logic to the cloud, and no longer requires a data warehouse. With a cloud data lakehouse, data governance has reached a new level, and data scientists can use Azure Databricks to examine data quickly and build machine learning models. Instead of relying on third-party services, the bank has built its own data engineering capability and now employs around forty data engineers. Teams of Project Managers (PM) and Function Analysts (FA) with well-defined roles and responsibilities have been formed: FAs perform the same duties as Business Analysts, communicating directly with business units about their requirements, while PMs oversee the process. Requirements are captured in Business Requirements Specifications (BRS) and handed to the Data Engineering team, whose System Analysts (SA) evaluate how the FA's business requirements will be delivered technically.

Prior to implementation, the SAs are responsible for mapping the system sources. If business units require extra information, the SAs map the data before passing it to the data engineers for development, and the result is released to the QA team for testing in accordance with the cycle. If testing succeeds, the project proceeds to deployment, where the operations team monitors system performance while the development team moves on to further initiatives. With well-defined roles, projects run efficiently and employees can concentrate on their specific responsibilities. In 2021, our cloud platform was scaled up and upgraded from HDInsight 3.6 to HDInsight 4.0.

Real-Time Signals is a further innovation that debuted in 2021. Before the project launched, the IT, user, and data science teams deliberated extensively, and a prototype using a PowerBI dashboard with animated graphs was produced for senior management's consideration. In production use, data is delivered to SCB LINE Connect: fund-transfer data, for example, is mapped through a data science model before being submitted to SCB LINE Connect, and information such as SCB Easy transactions can also be retrieved. When a customer completes a transaction, the real-time platform receives the data and the Data Team streams it into the data lakehouse to construct a use case, which is then passed to the Data Science Team for matching with campaigns, such as selecting appropriate product offers to send to customers. SCB does not stop there and continues to improve the effectiveness of its systems.
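The event-to-campaign flow above can be sketched in miniature. The event fields, amount thresholds, and campaign names below are hypothetical stand-ins for the bank's actual models and channels:

```python
# Illustrative sketch only: route a real-time transaction event through a
# simple rule to select a campaign offer.  The thresholds and campaign
# names are invented for illustration, not SCB's real logic.

def match_campaign(event):
    """Return a campaign offer for a transaction event, or None."""
    if event.get("type") != "fund_transfer":
        return None
    if event["amount"] >= 50_000:
        return "wealth_product_offer"   # hypothetical campaign name
    return "savings_booster_offer"      # hypothetical campaign name

def handle_event(event, notify):
    """On each incoming event, pick a campaign and, if one matches,
    push an offer to the customer via the notification channel
    (standing in here for SCB LINE Connect)."""
    campaign = match_campaign(event)
    if campaign is not None:
        notify(event["customer_id"], campaign)
    return campaign
```

In production, the decision step would be a trained model served against the lakehouse rather than a hard-coded rule, but the shape of the pipeline (event in, match, notify out) is the same.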

In addition, machine learning is used to monitor jobs: by analyzing log data, the system predicts which processes are experiencing problems and how long it will take to rerun the affected data. This eases the process of notifying users of problems, while the dependency graphs generated make it easier for the Operations Team to visualize the situation.
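As a minimal sketch of the rerun-time idea, an estimate can be fit from historical log records. The log schema assumed here (rows reprocessed, minutes taken) and the plain least-squares model are illustrative assumptions; the bank's actual logs and models will differ:

```python
# Minimal sketch: fit an ordinary least-squares line to historical job
# logs (rows reprocessed vs. minutes taken) and use it to estimate how
# long a pending rerun will take.

def predict_rerun_minutes(history, pending_rows):
    """history: list of (rows_reprocessed, minutes_taken) pairs from
    past reruns.  Returns the predicted minutes for pending_rows."""
    rows, minutes = zip(*history)
    n = len(rows)
    mean_x = sum(rows) / n
    mean_y = sum(minutes) / n
    # Least-squares slope and intercept.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in history)
             / sum((x - mean_x) ** 2 for x in rows))
    intercept = mean_y - slope * mean_x
    return intercept + slope * pending_rows
```

An estimate like this is what lets the monitoring system tell a user not just that a job failed, but roughly when the rerun will finish.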

Moving toward regional dominance with DataX’s Monoline Platform

SCB has advanced to the next level by establishing a company specializing in cutting-edge data and artificial intelligence services, a step that makes the structure of the centralized data platform even clearer. The establishment of SCB DataX will radically alter the duties of the data teams. The SCBX Group has founded group companies in a variety of industries, with DataX serving as the hub of all data and providing data- and AI-related services. To fulfill this purpose, DataX must build a new Monoline Platform for the other SCBX subsidiaries: a data platform that supports the data analytics efforts of the group's firms. At the same time, we must enable cloud-based data sharing.

After data is uploaded to the centralized pool, a direct monoline connection must be made to each client company, of which there could be as many as twenty. This work is well under way. The harder tasks for DataX will be maintaining the platform and governing data exchange. Each company has its own skilled employees; the logical model is identical, but the platforms are distinct. Data exchange will therefore require consent, which must be handled with caution. In the past, when SCB was a single entity and all users reported to SCB, data governance was straightforward; use of data on the Monoline Platform will require a legal process of consent requests.

When businesses are spun off, each acquires its own data source, which requires a completely new system. The process involves mapping data into the Monoline Platform to build a group-wide data sharing system. Raising the data system to this level is a demanding undertaking for the DataX team: no other company in the country has spun out a dedicated data company, and most businesses in Thailand still rely on data warehouses administered by IT departments, data officers, or business unit teams. DataX uses cutting-edge tools, technologies, and platforms validated and placed in the leader quadrant by the research and advisory firm Gartner. DataX will deliver data lakehouse and data integration capabilities, offering advanced and comprehensive data & AI services tailored to the individual demands of each business, elevating the customer experience and contributing to the profitability and success of the SCBX family of enterprises.

SCB's foresight in recognizing the significance of data was essential in making SCB DataX a reality; DataX would not exist had senior management not taken the matter seriously. It illustrates the ambition of the SCBX mothership to lead the entire conglomerate into the Blue Ocean, where various business opportunities lie, using data and smart technology as a compass to guide the SCBX Group to success.

Source: SCB DataX: https://data-x.ai/