Emerging Trends of Big Data Technology in China and the U.S. in 2022

Apache SeaTunnel
Sep 2, 2022

2022 has been a year full of global turmoil, yet it has also witnessed the rapid development of data technology worldwide. During the year, a new generation of Chinese Internet companies such as TikTok, SheIn, and Shopee has made initial progress in globalization. The new generation of data technology stack, the MDS (Modern Data Stack), has flourished in Silicon Valley. And cloud-native data technology companies in the OLAP engine and DataOps engine spaces have attracted the attention of global capital. All of these changes foster the globalization of data technology.

Looking at the overall path from Hadoop and Spark, through the data platform (Data Middle Office), to the new generation of the data technology stack, we find that big data has entered a new stage.

The early stage of big data technology (2005–2015)

The big data technology stack dominated by Apache Hadoop, Apache Spark, and Apache Oozie gradually replaced the commercial data warehouse stack, which had been dominated by Teradata and Greenplum, and quickly took over most of the market because it was distributed, high-performance, open source, and free of charge.

In China, this was the period when Baidu, Alibaba, and Tencent were on their entrepreneurial rise. Encouraged by the successful adoption of a large amount of open-source software by Google, Amazon, and AOL in the U.S., domestic Internet companies began their big data journeys. The rapid growth of open-source big data users in China in turn sped up the construction of big data platforms in traditional industries, which mainly aimed to replace the ODS layer and to store and process unstructured data on Hadoop and Spark.

Open code, support for low-cost x86 hardware, and an easy-to-use SQL ecosystem mean that most medium-sized Internet companies in China still use the Hadoop/Spark ecosystem for big data management and mining.

The period of cloud-based big data platforms and the Data Middle Office (2010–2020)

As Internet businesses developed, the data-driven approach became widely accepted, and more and more business demands flocked to big data platforms. During this period, China and the U.S. went in different but interesting directions in big data.

In the U.S., a flood of big data applications drove a dramatic rise in data volume. Because labor and the global operation and maintenance of physical machines are expensive, Silicon Valley Internet companies such as Amazon, Netflix, and LinkedIn adopted public clouds as their infrastructure. As American companies moved to public cloud virtual machines, their big data moved to the cloud as well, ushering in the era of separated storage and compute. Data no longer lives in on-premises Hadoop or Spark clusters but in cheap public cloud object storage such as Amazon S3 and Google Cloud Storage. It is processed dynamically by elastic public cloud services such as EMR, and big data tasks are scheduled with Azkaban and Apache Airflow. The overall open-source ecosystem further embraced the public cloud era.

With the public cloud boom, public cloud spending accounted for about 24% of the average U.S. IT budget in 2020, and private cloud for about 5% (from the McKinsey report China Public Cloud: Big Challenges, Big Potential). Overall, U.S. enterprises have completed most of their big data migration to the cloud.

In China, the data-driven approach has also been widely accepted. A new generation of Internet companies such as Kuaishou, Toutiao, Meituan, and JD.com began to adopt more open-source big data technologies beyond Apache Hadoop/Spark, such as ClickHouse, Apache Doris, and Presto, to meet enterprise data analysis needs and further narrow the gap between business users and data.

At the same time, because of the huge volume of domestic data, the original scheduling tools such as Apache Airflow and Azkaban were no longer sufficient. Internet companies began to build their own scheduling engines or to adopt Apache DolphinScheduler, a new-generation distributed scheduling engine.
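The schedulers mentioned here (Airflow, Azkaban, DolphinScheduler) all model a workflow as a directed acyclic graph of tasks and run each task only after its upstream tasks have finished. The following is a minimal sketch of that idea using only Python's standard library. The task names are hypothetical, and the code does not use or imitate the API of any of these tools.

```python
from graphlib import TopologicalSorter

# Hypothetical daily workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},
    "daily_report": {"join_orders_users"},
}


def parallel_batches(dag):
    """Group tasks into batches; tasks in one batch have no unmet
    dependencies and could run in parallel on separate workers."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark finished so downstream tasks unlock
    return batches


batches = parallel_batches(workflow)
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

A real distributed scheduler adds what this sketch leaves out: time-based triggers, retries, worker assignment, and failure handling across machines.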

Moreover, Alibaba proposed the concept of a “Data Middle Office”, which integrates a variety of big data tools into one data platform system to quickly meet the needs of enterprise users. It was a further attempt to bring data closer to business personnel.

At this point a watershed appeared between the technical paths of the two countries: the U.S. continued to develop rapidly on the public cloud, while China dove deeper into private deployment.

Cloud-native and new generation technology stack (Modern Data Stack) (2015–present)

After 2020, with the development of cloud-native technology, the Chinese and U.S. technology stacks both began to march toward the cloud-native era, each with its own characteristics.

In Silicon Valley, the rise of the new-generation data technology stack, MDS, further differentiated the original public cloud services:

  • From IT-centric to business-centric

Using no-code or low-code technology to lower the technical threshold and turn data processing and complex processes into services available to more people;

Streamlining the data team with cloud-native and public cloud services, allowing enterprises to focus on high-value business data analysis rather than performance optimization;

Making data technology self-service a core function, so that data technologists become holistic data-driven enablers rather than analytical bottlenecks.

  • From an integrated overall solution to a products and services portfolio

The data is deployed in the cloud rather than locally, and storage and compute are billed separately according to actual usage, lowering overall company cost;

Leveraging the modern data technology stack and out-of-the-box tools such as SaaS makes it possible to quickly solve business problems with DataOps/MLOps;

Unlike the Data Middle Office proposed in China, the mainstream tools in Silicon Valley break complex integrated tools down into a variety of specialized products and services and recombine them to deliver lighter and more specialized services.

  • The rise of DataOps/MLOps enables analysts, engineers, and data scientists to reuse analytical processes and develop more efficiently

a. Compared with the relatively complex development process of the Data Middle Office and traditional ETL, the new generation of data technology stack integrates DevOps-related processes, making development more efficient and rigorous

b. DataOps/MLOps allows engineers to turn the one-off analyses of the past into reusable data analysis and data mining processes, improving the development efficiency of the whole enterprise

c. DataOps/MLOps makes data governance a core element of the modern data stack
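To make the "one-off analysis versus reusable process" point concrete, here is a minimal sketch in plain Python. The function names, the threshold parameter, and the sample data are all hypothetical. The idea is only that an analysis written as a parameterized function, guarded by an automated data-quality check, can be rerun on new data and tested like any other code.

```python
from statistics import mean


def check_orders(orders):
    """Data-quality gate of the kind a CI job could run before any analysis:
    every record must carry a non-negative numeric amount."""
    return all(
        isinstance(o.get("amount"), (int, float)) and o["amount"] >= 0
        for o in orders
    )


def average_order_value(orders, min_amount=0.0):
    """Reusable analysis step: mean order amount at or above a threshold."""
    amounts = [o["amount"] for o in orders if o["amount"] >= min_amount]
    if not amounts:
        raise ValueError("no orders match the filter")
    return mean(amounts)


# Hypothetical sample data for two months.
june = [{"amount": 20.0}, {"amount": 40.0}]
july = [{"amount": 10.0}, {"amount": 30.0}, {"amount": 50.0}]

# The same checked, parameterized step is reused for every dataset.
for name, orders in {"june": june, "july": july}.items():
    if not check_orders(orders):
        raise ValueError(f"bad data in {name}")
    print(name, average_order_value(orders))
```

Keeping the check and the analysis in version control is what lets DevOps-style review and testing apply to data work.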

A collection of next-generation MDS tools, represented by dbt, Fivetran, Airbyte, Airflow, DolphinScheduler, Apache SeaTunnel, Prefect, etc., makes data easier to work with for its users.

In China, the new generation of global Internet companies such as TikTok, SheIn, and Shopee goes straight to cloud-native technology. They deploy cloud-native K8s services on global public clouds, such as AWS EKS and Google Cloud GKE, and combine their own K8s management with tools like Spark on K8s, Flink on K8s, and DolphinScheduler on K8s to build a multi-cloud, hybrid-cloud big data architecture under the cloud-native system.

The trend of big data technology in China and the U.S.

To sum up, looking back over more than a decade of big data technology development in China and the U.S., several clear industry trends emerge:

  • Cloud-native

As labor costs rise and globalization advances, companies without specific industry compliance requirements will gradually choose cloud-based big data infrastructure. But China and the U.S. took this path in a different order: U.S. companies moved to the public cloud first and then switched to cloud-native systems through MDS within the public cloud environment, whereas Chinese companies adopted cloud-native technology first and are gradually moving from on-premises to public cloud-native systems. The two paths lead to the same goal: maximizing resource utilization and data R&D efficiency.

  • The rise of self-service analytics and the democratization of data capabilities

More and more enterprise business personnel are beginning to use data tools directly for internal analysis. With market competition intensifying and new data engines and technologies emerging, it is increasingly hard for internal data engineers to satisfy the frequent requests of data scientists, product managers, and operators to “extract data”. More enterprises therefore use new-generation tools such as Metabase and DolphinScheduler to handle internal data analysis, data extraction, and the packaging of scheduled data tasks, allowing more people in the enterprise to use data efficiently.

  • Commercialization of the open-source ecosystem

After more than ten years of development, the big data ecosystem has produced hundreds of different technologies and interfaces, all evolving rapidly. The traditional software development model is outdated and has failed to adapt to the new generation of big data technology. Against this background, open-source commercial companies built on new-generation technologies have emerged in both China and the U.S. They run excellent open-source technology communities and at the same time offer enterprises the latest cloud-native services through SaaS and commercial subscriptions. Examples include dbt Labs (dbt Core), Astronomer (Apache Airflow), and Airbyte (Airbyte) in the U.S., and SphereEx (Apache ShardingSphere) and WhaleOps (Apache DolphinScheduler, Apache SeaTunnel) in China.

On the one hand, by operating open-source communities, these companies can keep up with the ever-changing needs of new big data workloads and fast-iterating technology interfaces. On the other hand, they provide commercial services based on the open-source versions. By continuously strengthening their open-source communities, they create a positive flywheel between open source and commercial business.

Data technology is still developing rapidly. I believe that in the future, quantum computing, brain-computer interface, and AI applications will broaden the development space and foster development momentum for big data technology.

From the perspective of the user’s experience, every data technology shares the same goals:

  • Democratization: “de-specialization”, serving not only engineers but allowing more internal users to use data;
  • Start simple: “de-Data Middle Office”, allowing organizations and users to start from the components they need to use, avoiding unnecessary costs and complexity;
  • Fast iteration: “extremely fast experience”, allowing users to see the results of data operations as soon as possible instead of days of complicated programming, debugging, and going online to verify the final results;
  • Cost-effective: “use on demand”, whether using private cloud-native technology or public cloud-native technology, the days of huge and idle big data cluster computing have gone away, and cloud-native big data technology that can be used on demand will replace the existing big data computing system in enterprises.
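The “use on demand” argument, together with the earlier observation that storage and compute are billed separately in the cloud, comes down to simple arithmetic. The sketch below compares an always-on cluster with on-demand compute plus object storage. Every price and quantity in it is a made-up placeholder chosen for round numbers, not a quote from any cloud provider, and the always-on node price is assumed to include local disks.

```python
HOURS_PER_MONTH = 730  # common billing approximation: 365 * 24 / 12


def always_on_cost(nodes, price_per_node_hour):
    """Monthly cost of a cluster that runs all the time, busy or idle.
    Assumption: the node price already covers its local storage."""
    return nodes * price_per_node_hour * HOURS_PER_MONTH


def on_demand_cost(nodes, price_per_node_hour, busy_hours,
                   stored_tb, price_per_tb_month):
    """Monthly cost when compute is paid only for busy hours and the data
    sits in separately billed object storage."""
    compute = nodes * price_per_node_hour * busy_hours
    storage = stored_tb * price_per_tb_month
    return compute + storage


# Hypothetical workload: 10 nodes at 0.5 per node-hour, busy 120 hours a
# month, with 50 TB stored at 20.0 per TB-month.
fixed = always_on_cost(nodes=10, price_per_node_hour=0.5)
elastic = on_demand_cost(nodes=10, price_per_node_hour=0.5, busy_hours=120,
                         stored_tb=50, price_per_tb_month=20.0)

print(fixed, elastic)  # 3650.0 1600.0
```

With these placeholder numbers the on-demand setup costs less because the cluster is idle most of the month. The advantage shrinks as utilization rises, so the comparison has to be redone with real prices and real workloads.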

Over the past two years, there have been loud voices claiming that “Hadoop is dead” and that the development of big data technology has stagnated. I would argue instead that big data technology is rapidly being adapted to enterprise use cases in a cloud-native way by the new generation of data technology stack (MDS), which is more efficient, simpler, and cheaper.

Big data technology is still growing rapidly. China’s IT managers should take a global perspective and use technology to achieve their globalization goals. The U.S., for its part, is closely watching the new technologies emerging from China’s huge developer base to advance its own development.

Although the world is still in turmoil at this moment, the development of science and technology often goes through cycles of germination, development, overheating, cooling down, recovery, and climax. I believe that after this turbulent baptism, excellent technology will survive and eventually take the high ground in the next economic cycle!

About Apache SeaTunnel

Apache SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day in a stable and efficient manner.

Why do we need Apache SeaTunnel?

Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

Apache SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of Apache SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with Apache SeaTunnel quickly?

Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite everyone who is interested in helping local open source go global to join the Apache SeaTunnel contributors family and grow open source together!

Submit an issue:

https://github.com/apache/incubator-seatunnel/issues

Contribute code to:

https://github.com/apache/incubator-seatunnel/pulls

Subscribe to the community development mailing list:

dev-subscribe@seatunnel.apache.org

Development mailing list:

dev@seatunnel.apache.org

Join Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-1kcxzyrxz-lKcF3BAyzHEmpcc4OSaCjQ

Follow Twitter:

https://twitter.com/ASFSeaTunnel

Come and join us!

Apache SeaTunnel

The next-generation high-performance, distributed, massive data integration tool.