Introduction

Last Updated: 2020-09-11

Overview

The Baidu Messaging System (BMS) is a distributed, highly scalable, high-throughput hosted message queue service that is fully compatible with Apache Kafka. You can use Kafka's advanced features directly without managing clusters, and you pay only for what you use.

Kafka's Position and Role in the Network

This section uses the evolution of a new-generation Internet application as an example to illustrate Kafka's position and role in the network.

Stage 1: The initial application network is set up as follows:
- A web application is deployed on Cloud Compute to serve users on personal computers and mobile devices.
- A SQL database provides data persistence and data queries for the web application.
Stage 2: As the business grows rapidly, the network is expanded as follows:
- Add a cache service to reduce the load on the SQL database.
- Collect logs and store them in Hadoop for offline processing to gain deeper insight into user behavior.
- Aggregate all data in a data warehouse to produce interactive reports.
- Add real-time modules, external data exchange, and so on.
The expansion introduces new problems:
- Data synchronization between different systems
- System expansion
Stage 3: A new Kafka module provides a message queue. With it, the web application only needs to add data to the queue; every other component in the network then reads the data from the queue in order and processes it on its own. The whole procedure is shown in the following figure:

In this way, the problems caused by the expansion are easily resolved, and the system network becomes less complex. Programming is also simpler: subsystems act as plug-in components rather than negotiating interfaces with one another, so Kafka plays the role of a high-speed data bus.
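
The data-bus idea in Stage 3 can be illustrated with a minimal in-memory sketch. It is a toy model, not the Kafka or BMS API: the class `MessageLog`, the reader names, and the message strings are all hypothetical. The producer only appends, and each downstream component keeps its own read position.

```python
class MessageLog:
    """Toy append-only log standing in for a message queue (not the Kafka API)."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # reader name -> index of the next message to read

    def append(self, message):
        """Called by the producer, e.g. the web application."""
        self._messages.append(message)

    def read(self, reader):
        """Return the messages this reader has not seen yet, in append order."""
        start = self._offsets.get(reader, 0)
        batch = self._messages[start:]
        self._offsets[reader] = len(self._messages)
        return batch


log = MessageLog()
log.append("user=1 action=login")
log.append("user=2 action=search")

cache_batch = log.read("cache")      # the cache service reads what exists so far
log.append("user=1 action=logout")
hadoop_batch = log.read("hadoop")    # a slower reader still sees every message
cache_next = log.read("cache")       # the cache service resumes where it stopped
```

Because the readers never talk to the producer or to each other, adding a new component in this model means adding a new reader name, not a new interface.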

Introduction to Apache Kafka

Apache Kafka is a distributed streaming data platform with three major characteristics:

Provide Pub/Sub processing of massive message volumes.

Kafka's Pub/Sub feature is typically used for asynchronous message exchange. The message publisher (Pub) only needs to specify the message category and does not interact with the subscriber (Sub), while the Sub receives only the category or categories of messages it has subscribed to. Because the Pub and Sub are decoupled, the processing on either side of the interface can be extended or modified independently. For example, you can create separate topics for server logs and for Internet of Things (IoT) devices. Data can then be sent continuously to each topic, and the backend data warehouse, streaming analysis, or full-text index connects to specific topics without needing to know about the individual servers or IoT devices.
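
As a rough illustration of this decoupling, the following sketch models topics and subscribers in memory. It is not the Kafka client API: the `Broker` class and the topic names are hypothetical, and the toy pushes messages to callbacks for brevity, whereas real Kafka consumers pull from the broker.

```python
from collections import defaultdict


class Broker:
    """Toy topic-based Pub/Sub broker (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher names only the topic; it never sees the subscribers.
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
warehouse, search_index = [], []

# The data warehouse follows both topics; the full-text index follows one.
broker.subscribe("server-logs", warehouse.append)
broker.subscribe("iot-devices", warehouse.append)
broker.subscribe("server-logs", search_index.append)

broker.publish("server-logs", "GET /index.html 200")
broker.publish("iot-devices", "sensor-7 temp=21.5")
```

Here the warehouse receives both messages and the index receives only the server log line, although neither publisher call mentions them.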

Meanwhile, Kafka divides each topic into multiple partitions and selects the partition in which to save each message according to the partition rules. If the partition rules are well chosen, messages are distributed evenly across the partitions. This achieves load balancing and horizontal scaling, and allows massive data streams to be stored with high fault tolerance.
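
One common kind of partition rule is to hash a message key and take the result modulo the number of partitions. The sketch below assumes such a rule purely for illustration: the partition count, the `device-N` keys, and the use of CRC32 are all assumptions, and Kafka's own default partitioner uses a different hash function.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count for one topic


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition; the same key always maps to the same partition."""
    # CRC32 is used here only because it is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Count how many of 10,000 distinct keys land in each partition.
counts = Counter(choose_partition(f"device-{i}") for i in range(10_000))

for partition in sorted(counts):
    print(f"partition {partition}: {counts[partition]} keys")
```

Printing the counts shows how the keys spread over the partitions under this particular rule; a rule that sent most keys to one partition would defeat the load balancing described above.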

Multiple Subs can simultaneously consume data from one or more partitions, which supports massive data processing. To ensure data reliability, Kafka designates a Leader for each partition. The Leader handles all read and write operations, while the Followers only synchronize messages from their Leader. When a message is written to a partition, it is stored on the Leader and backed up to its Followers. If a Follower fails, Kafka can still synchronize the Leader's historical messages to the other Followers. If the Leader fails, one of the Followers can become the new Leader so that service continues.
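
The Leader and Follower roles can be sketched with a small in-memory model. This is a simplification under stated assumptions, not Kafka's replication protocol: the `Partition` class, the rule of electing the lowest surviving replica ID, and the full-copy catch-up in `recover` are all invented for illustration.

```python
class Partition:
    """Toy replicated partition: one leader, the remaining replicas are followers."""

    def __init__(self, replica_ids):
        self.logs = {rid: [] for rid in replica_ids}  # replica id -> stored messages
        self.alive = set(replica_ids)
        self.leader = replica_ids[0]

    def write(self, message):
        # All writes go to the leader, then every live follower copies the message.
        self.logs[self.leader].append(message)
        for rid in self.alive - {self.leader}:
            self.logs[rid].append(message)

    def read(self):
        # All reads are served by the current leader.
        return list(self.logs[self.leader])

    def fail(self, replica_id):
        self.alive.discard(replica_id)
        if replica_id == self.leader:
            if not self.alive:
                raise RuntimeError("no replica left to take over")
            self.leader = min(self.alive)  # toy election rule (assumption)

    def recover(self, replica_id):
        # A returning replica catches up by copying the leader's history.
        self.logs[replica_id] = list(self.logs[self.leader])
        self.alive.add(replica_id)


partition = Partition([0, 1, 2])
partition.write("m1")
partition.fail(0)          # the leader fails; a follower takes over
partition.write("m2")
messages_after_failover = partition.read()
partition.recover(0)       # the old leader rejoins as a follower
```

In this model no message is lost when replica 0 fails, because replica 1 already held a copy of "m1" before it became the leader.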

Ensure the sequence of data streams.

All messages that a Pub sends to a Kafka partition are stored sequentially, so every Sub can process them in the same order. Sequential writes are more efficient than random writes, which contributes to Kafka's high throughput.
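
The ordering guarantee applies within a partition. The sketch below assumes, for illustration only, that messages carry a key and that the key decides the partition; the `order-N` keys, the partition count, and the helper names `send` and `history` are hypothetical and are not part of any Kafka client.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count

partitions = defaultdict(list)  # partition id -> append-only list of (key, value)


def partition_for(key):
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def send(key, value):
    """Append to the key's partition and return (partition, offset)."""
    pid = partition_for(key)
    partitions[pid].append((key, value))
    return pid, len(partitions[pid]) - 1  # offsets grow by one per append


def history(key):
    """Read the key's partition front to back, as a subscriber would."""
    return [value for k, value in partitions[partition_for(key)] if k == key]


events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]
positions = [send(key, value) for key, value in events]
```

Reading back `history("order-1")` yields the events for that order in the sequence they were sent, because each partition is only ever appended to.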

BMS Overview

The BMS is a distributed, highly scalable, high-throughput, multi-partition, and multi-replica hosted message queue service based on Apache Kafka. The BMS encapsulates the details of the Kafka cluster and provides it as a hosted service. You can therefore create topics directly in the BMS to integrate large distributed applications without managing clusters, and you pay only for the data processed, based on actual usage. The BMS has the following advantages over traditional messaging services:

Topics can be extended horizontally through partitions.
Partitions are distributed among multiple nodes to achieve the high throughput.
Through consumer groups, messages can be consumed either in queue mode, where the consumers in a single group share the messages, or in Pub/Sub mode, where multiple consumer groups each receive all the messages.
Producers and consumers can interact with each other asynchronously without waiting for each other.
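
The difference between queue-style and Pub/Sub-style consumption can be sketched as follows. This is a toy model rather than the Kafka consumer API: the round-robin assignment, the topic contents, and the consumer names are assumptions chosen for illustration.

```python
def assign(partition_ids, consumers):
    """Round-robin: each partition is owned by exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, pid in enumerate(partition_ids):
        assignment[consumers[i % len(consumers)]].append(pid)
    return assignment


# Hypothetical topic with three partitions: partition id -> messages.
topic = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}


def consume(group_consumers):
    """Return what each consumer in one group receives from the topic."""
    plan = assign(sorted(topic), group_consumers)
    return {
        consumer: [msg for pid in pids for msg in topic[pid]]
        for consumer, pids in plan.items()
    }


# Queue mode: two consumers in one group split the partitions between them.
billing = consume(["billing-1", "billing-2"])

# Pub/Sub mode: a second group independently receives every message.
audit = consume(["audit-1"])
```

Within the billing group each message is handled by one consumer only, while the audit group still sees the complete stream.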

Advantages

One-click deployment: After enabling the BMS, you can use the service immediately and focus on product development without spending effort on installing, deploying, configuring, debugging, or maintaining clusters.
Low price: No investment in hardware or software is required. All you need to do is enable the service and pay for the resources you use. Because the BMS is compatible with community Kafka, the migration cost is very low, and you don't need to worry about being locked in to a particular technology.
Data security: The message center supports only SSL-encrypted data transmission, which ensures that customer data cannot be eavesdropped on or tampered with during transmission.
Reliability and durability: The unique high-availability feature can prevent data from being lost in the case of any application failures, individual machine failures, or device failures.
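
Because only SSL-encrypted transmission is supported, a client has to be configured for SSL before it can connect. The sketch below shows one possible shape of such a configuration in Python. It is an assumption-laden example, not BMS's documented setup: the endpoint is a placeholder, and the keyword names follow the kafka-python client's `security_protocol` and `ssl_context` parameters. Any certificates or keys issued by the service would have to be loaded into the context separately.

```python
import ssl

# Standard-library TLS context that verifies the server certificate and hostname.
context = ssl.create_default_context()

# Keyword arguments in the style of the kafka-python client (assumption);
# they would be passed to a producer or consumer constructor.
client_config = {
    "bootstrap_servers": "kafka.example.com:9091",  # hypothetical endpoint
    "security_protocol": "SSL",
    "ssl_context": context,
}
```

Verifying the server certificate is what protects against tampering by an intermediary, so the defaults of `ssl.create_default_context()` are left unchanged here.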

Business Scenarios

Collect browsing, click, and search data from large numbers of users of websites, devices, or applications for real-time analysis.
Aggregate telemetry data from distributed applications to facilitate system operations.
Connect to Spark Streaming provided by Baidu MapReduce to perform real-time streaming data analysis.
