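Summary: This article centers on high-availability architecture and walks through the full path from theory to production, covering redundancy design, fault isolation, load balancing, and other key techniques. It offers reusable architecture patterns for distributed and cloud-native scenarios to help developers build stable, reliable distributed systems.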
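In the era of distributed systems, service outages cost far more than most teams expect. In an e-commerce system, for example, one minute of downtime can mean transaction losses on the order of millions; in a core financial system, every second of failure can put funds on the order of tens of millions at risk. Through redundancy, fault isolation, and related mechanisms, a high-availability architecture raises system availability to 99.99% (about 52.6 minutes of downtime per year) or even 99.999% (about 5.3 minutes per year), which makes it a foundation of enterprise digital transformation.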
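The downtime figures attached to each availability target follow from simple arithmetic, sketched below. The class name `AvailabilityBudget` and the choice of a 365.25-day year are assumptions of this sketch, not something the article specifies.

```java
/** Converts an availability target into the yearly downtime it allows. */
final class AvailabilityBudget {
    // Average year length in minutes (365.25 days); an assumption of this sketch.
    static final double MINUTES_PER_YEAR = 365.25 * 24 * 60;

    /** Returns the maximum downtime per year, in minutes, for an availability in percent. */
    static double downtimeMinutesPerYear(double availabilityPercent) {
        if (availabilityPercent < 0 || availabilityPercent > 100) {
            throw new IllegalArgumentException("availability must be within [0, 100]");
        }
        return MINUTES_PER_YEAR * (1.0 - availabilityPercent / 100.0);
    }
}
```

Calling `AvailabilityBudget.downtimeMinutesPerYear(99.99)` gives roughly 52.6 minutes, and each additional nine divides the budget by ten.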
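Typical scenarios: a core banking system requires a recovery time objective (RTO) of at most 2 seconds and a recovery point objective (RPO) of 0, whereas a social media application can tolerate RTO ≤ 5 minutes and RPO ≤ 1 minute.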
Code example: key parameters for MySQL master-slave replication

    [mysqld]
    server-id=1
    log-bin=mysql-bin
    binlog-format=ROW
    sync_binlog=1
    @SentinelResource(value = "getUser", blockHandler = "handleBlock")
    public User getUser(Long id) {
        // business logic
    }
    upstream backend {
        least_conn;
        server 10.0.0.1:80;
        server 10.0.0.2:80;
    }
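The `least_conn` directive sends each request to the backend with the fewest active connections. The selection rule can be sketched as follows. `LeastConnBalancer` is a hypothetical class written for illustration, not nginx's implementation, and unlike nginx it ignores server weights.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical least-connections picker; illustrates the rule, not nginx internals. */
final class LeastConnBalancer {
    // Active connection count per server, kept in configuration order.
    private final Map<String, Integer> active = new LinkedHashMap<>();

    LeastConnBalancer(String... servers) {
        for (String s : servers) {
            active.put(s, 0);
        }
    }

    /** Picks the server with the fewest active connections and counts the new one. */
    synchronized String acquire() {
        if (active.isEmpty()) {
            throw new IllegalStateException("no servers configured");
        }
        String best = null;
        for (Map.Entry<String, Integer> e : active.entrySet()) {
            // Strict comparison: ties go to the server listed first.
            if (best == null || e.getValue() < active.get(best)) {
                best = e.getKey();
            }
        }
        active.merge(best, 1, Integer::sum);
        return best;
    }

    /** Marks one connection to the given server as finished. */
    synchronized void release(String server) {
        active.computeIfPresent(server, (k, v) -> Math.max(0, v - 1));
    }
}
```

A caller would pair every `acquire()` with a `release(server)` once the request completes, so the counts track in-flight work and not total traffic.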
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      annotations:
        nginx.ingress.kubernetes.io/canary: "true"
        nginx.ingress.kubernetes.io/canary-weight: "20"
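A canary weight of 20 means roughly one request in five should reach the canary release. The following is a minimal sketch of that split, assuming the router draws a uniform integer in the range 0 to 99 per request. `CanaryRouter` and the labels `"canary"` and `"stable"` are hypothetical names; the ingress controller's real implementation may differ.

```java
/** Hypothetical weight-based canary split; not the ingress controller's code. */
final class CanaryRouter {
    /**
     * Returns "canary" when roll falls below the canary weight, otherwise "stable".
     * roll is expected to be a uniform random integer in [0, 100).
     */
    static String route(int canaryWeightPercent, int roll) {
        if (canaryWeightPercent < 0 || canaryWeightPercent > 100) {
            throw new IllegalArgumentException("weight must be within [0, 100]");
        }
        if (roll < 0 || roll >= 100) {
            throw new IllegalArgumentException("roll must be within [0, 100)");
        }
        return roll < canaryWeightPercent ? "canary" : "stable";
    }
}
```

In practice the roll would come from something like `ThreadLocalRandom.current().nextInt(100)`; passing it in as a parameter keeps the rule itself deterministic and easy to check.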
    HystrixCommandProperties.Setter()
        .withCircuitBreakerEnabled(true)
        .withCircuitBreakerRequestVolumeThreshold(20)
        .withCircuitBreakerErrorThresholdPercentage(50);
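The two thresholds work together: the breaker is only evaluated once the rolling window has seen at least 20 requests, and it trips when at least 50% of them failed. The sketch below shows only that decision rule. `BreakerDecision` is a hypothetical helper, not part of Hystrix, and it leaves out the rolling-window bookkeeping and the half-open recovery state.

```java
/** Hypothetical helper mirroring the two thresholds; not part of Hystrix. */
final class BreakerDecision {
    private final int requestVolumeThreshold;
    private final int errorThresholdPercentage;

    BreakerDecision(int requestVolumeThreshold, int errorThresholdPercentage) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
    }

    /** Returns true when the counts for the current window should open the breaker. */
    boolean shouldOpen(int totalRequests, int failedRequests) {
        if (totalRequests <= 0 || totalRequests < requestVolumeThreshold) {
            return false; // too little traffic to judge
        }
        int errorPercentage = (int) (100.0 * failedRequests / totalRequests);
        return errorPercentage >= errorThresholdPercentage;
    }
}
```

The volume threshold is what keeps a single failed request on a quiet service from opening the breaker: 19 failures out of 19 requests do not trip it, while 10 out of 20 do.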
    @Retryable(value = {RemoteAccessException.class},
               maxAttempts = 3,
               backoff = @Backoff(delay = 1000, multiplier = 2))
    public void callRemoteService() {
        // remote call
    }
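With `maxAttempts = 3`, a delay of 1000 ms, and a multiplier of 2, the call is tried at most three times, with waits of about 1 second and then 2 seconds between attempts. The schedule can be computed as in the sketch below. `BackoffSchedule` is a hypothetical helper for illustration only, and it ignores any maximum-delay cap the retry framework may apply.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the waits between retry attempts. */
final class BackoffSchedule {
    /** Returns the waits, in milliseconds, between consecutive attempts. */
    static List<Long> delays(int maxAttempts, long initialDelayMs, double multiplier) {
        List<Long> waits = new ArrayList<>();
        double delay = initialDelayMs;
        // There is one wait fewer than there are attempts.
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add((long) delay);
            delay *= multiplier;
        }
        return waits;
    }
}
```

For the configuration above, `BackoffSchedule.delays(3, 1000, 2)` yields two waits, 1000 ms and 2000 ms, so the worst case adds about three seconds before the failure surfaces to the caller.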
    @Bean
    public Tracer jaegerTracer() {
        return new Configuration(
                "service-name",
                new Configuration.SamplerConfiguration(ProbabilityBasedSampler.TYPE, 1.0),
                new Configuration.ReporterConfiguration(true, "localhost", 6831, 1000))
                .getTracer();
    }
    group_replication_group_name="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    group_replication_start_on_boot=OFF      # join the group manually first
    group_replication_bootstrap_group=OFF    # set to ON only on the primary node
rs.addArb("arbiter.example.net:27017")
readPreference="secondaryPreferred"
CLUSTER SETSLOT 1000 IMPORTING source-node-id
min-slaves-to-write 1 and min-slaves-max-lag 10
Ketama consistent hashing, Java implementation example:
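    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Map;
    import java.util.TreeMap;

    public class KetamaNodeLocator {
        // Hash ring: ring position -> server. Server is the application's own node type;
        // its toString() is assumed to return a stable identifier such as host:port.
        private final TreeMap<Long, Server> circle = new TreeMap<>();

        public void addServer(Server server, int weight) {
            int points = 160 * weight; // 160 virtual nodes per unit of weight
            for (int i = 0; i < points; i++) {
                circle.put(hash(server + "#" + i), server);
            }
        }

        public Server getServer(String key) {
            if (circle.isEmpty()) {
                return null; // no servers registered
            }
            // First virtual node clockwise from the key's position, wrapping around the ring.
            Map.Entry<Long, Server> entry = circle.ceilingEntry(hash(key));
            return (entry != null ? entry : circle.firstEntry()).getValue();
        }

        // Ring position: first four bytes of the MD5 digest as an unsigned 32-bit value.
        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
                return ((long) (d[3] & 0xFF) << 24) | ((long) (d[2] & 0xFF) << 16)
                        | ((long) (d[1] & 0xFF) << 8) | (d[0] & 0xFF);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 unavailable", e);
            }
        }
    }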
    eureka:
      server:
        enable-self-preservation: false
        renewal-percent-threshold: 0.85
    collector.servers=127.0.0.1:11800
    agent.service_name=order-service
    agent.sample_n_per_3_secs=-1
restartPolicy: Always and livenessProbe
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: orders
    spec:
      hosts:
      - orders
      http:
      - route:
        - destination:
            host: orders
            subset: v1
          weight: 90
        mirror:
          host: orders
          subset: v2
        mirrorPercentage:
          value: 10.0
    chaos.monkey.enabled=true
    chaos.monkey.watcher.instanceTypes=ASG
    chaos.monkey.scheduler.frequency=12
    chaos.monkey.scheduler.timeZone=UTC
rpl_semi_sync_master_wait_for_slave_count must be ≥ 1
resources.requests.cpu is recommended to be ≤ 500m
    {
      "FunctionName": "order-processor",
      "ProvisionedConcurrencyConfig": {
        "ProvisionedConcurrentExecutions": 100
      }
    }
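Conclusion: High availability is a continuously evolving practice, and the technology stack should be chosen to fit the business. Start with the core transaction path, build out monitoring step by step, and work toward an intelligent architecture that is "highly available by design, automated in operations, and self-healing on failure." For practical rollout, AWS's design principles for five-nines availability, or the high-availability module in Alibaba Cloud's ACP certification program, can serve as references for systematically strengthening architectural skills.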