Common Issues
Common debugging tools
- To view control-plane logs at debug level, edit the istiod Deployment and add --log_output_level=default:debug
- Change the istio-proxy log level in the business Pod to make the Envoy logs easier to inspect
# 1. Change the log level (e.g. error, warn, info, debug, trace)
istioctl -n <namespace> proxy-config log <pod> --level <level>
# 2. Change the log level inside Envoy directly
kubectl -n default exec -it <pod-name> -c istio-proxy -- bash
# Change the log level of all Envoy components
curl -s -X POST 127.1:15000/logging?level=trace
# Change the log level of Envoy's lua component
curl -s -X POST 127.1:15000/logging?lua=debug
- View the Envoy xDS configuration
# Run inside the istio-proxy container
curl localhost:15000/config_dump > config_dump
- Dump all the CRD (Istio configuration) information held by the istiod control plane
curl -s 127.1:15014/debug/configz
- Get the control plane's internal state and metrics
curl -s 127.1:15014/metrics
- Get the service-discovery instances and their status metrics as received by the data-plane Envoy
curl -s 127.1:15000/clusters
503 errors after applying a DestinationRule
Troubleshooting:
Check whether the DestinationRule conflicts with the global TLS policy; this error is only seen when automatic mutual TLS was disabled during installation.
Usage notes:
Whenever you apply a DestinationRule, make sure its trafficPolicy TLS mode is consistent with the global configuration.
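A minimal sketch (the reviews host is only an example) of keeping a DestinationRule's TLS mode aligned with a mesh that has mutual TLS enabled:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # keep this consistent with the mesh-wide mTLS setting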
The control plane reports the pilot_eds_no_instances metric
Cause:
The Service selector labels may not match any Pod, or the Pods behind the Service are not running.
Troubleshooting:
Check whether the Pod labels match the Service selector, whether the DestinationRule subset labels are configured correctly, and whether the Endpoints object actually references the Pod. A few commands are sketched below.
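Commands for these checks (all names are placeholders):
# Compare the Service selector with the Pod labels
kubectl -n <namespace> get svc <service> -o jsonpath='{.spec.selector}'
kubectl -n <namespace> get pods --show-labels
# Check whether the Endpoints object actually contains the Pod address
kubectl -n <namespace> get endpoints <service> -o yaml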
Case:
The istioctl ps output is as follows:
Manually edit the DestinationRule for reviews and change the version label of subset v1 to v1-bug. Reload the page in a browser; the reviews section now shows an error.
Quick triage:
1. Check the istiod control plane: the pilot_eds_no_instances metric becomes abnormal.
2. Check the istio-proxy logs of the productpage Pod.
3. Exec into the istio-proxy container of the productpage Pod and inspect the clusters: Envoy has no cluster for reviews v1, while the v2/v3 clusters are present.
curl localhost:15000/clusters | grep reviews
4. Dump the config_dump of the istio-proxy in the productpage Pod.
kubectl -n default exec -it productpage-v1-7795568889-npt8f -c istio-proxy -- curl localhost:15000/config_dump > config_dump
5. Use the istioctl pc and ps commands to inspect the Envoy runtime configuration: all, bootstrap, cluster, endpoint, listener, log, route, secret. Edit the reviews DestinationRule back and the page recovers.
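For reference, the istioctl commands used in step 5 look roughly like this (the pod name is a placeholder):
istioctl proxy-status                                                        # ps: xDS sync status of every proxy
istioctl -n default proxy-config cluster <productpage-pod> | grep reviews    # pc: clusters known to the sidecar
istioctl -n default proxy-config endpoint <productpage-pod> | grep reviews   # pc: endpoints pushed to the sidecar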
The control plane reports the pilot_conflict_inbound_listener metric
Cause:
This metric is caused by a protocol conflict.
Troubleshooting:
Focus on the Kubernetes Service resources and check whether the same port is declared with different protocols. A hypothetical example follows.
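A hypothetical example of such a conflict: two Services select the same Pods and, through the port name, declare different protocols on the same port. Making the protocol declarations consistent removes the conflict.
# Hypothetical example: both Services select app: demo
apiVersion: v1
kind: Service
metadata:
  name: demo-http
spec:
  selector:
    app: demo
  ports:
  - name: http-web        # parsed as HTTP
    port: 8080
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: demo-tcp
spec:
  selector:
    app: demo
  ports:
  - name: tcp-web         # parsed as TCP, conflicting with the HTTP declaration above
    port: 8080
    targetPort: 8080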
Traffic is not managed as expected in a canary-release scenario
Troubleshooting:
- Check whether the VirtualService host field is wrong, whether its match rules are wrong, whether the subset configuration is wrong, and whether the labels and subsets in the DestinationRule match.
- 503 errors can appear while configuration is being added or removed, because the configuration has not finished propagating. To avoid this, when adding configuration, update the DestinationRule first and only then update the VirtualService to reference the new subsets; when removing a subset, first remove every VirtualService reference to it, then update the DestinationRule to delete the unused subset (see the sketch after this list).
- A VirtualService that combines fault injection with retry or timeout policies may not work as expected: Istio does not support configuring fault injection together with retries or timeouts on the same VirtualService. If you need this, inject the fault through an EnvoyFilter instead.
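A minimal sketch of the recommended ordering, using the bookinfo reviews service as an example: apply the DestinationRule with its subsets first, then the VirtualService that references them.
# 1. DestinationRule first: define the subsets
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
# 2. Then the VirtualService that references those subsets
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10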
A ServiceEntry that registers an external service into the mesh does not take effect
Troubleshooting:
DNS proxying may not be enabled in the mesh. Check the proxyMetadata field of the istio ConfigMap in the istio-system namespace, as sketched below.
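A minimal sketch of the relevant part of the istio ConfigMap (meshConfig), assuming missing DNS proxying is the cause:
defaultConfig:
  proxyMetadata:
    ISTIO_META_DNS_CAPTURE: "true"         # let the sidecar resolve ServiceEntry hosts
    ISTIO_META_DNS_AUTO_ALLOCATE: "true"   # auto-allocate VIPs for hosts without addresses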
Sidecar auto-injection fails
Troubleshooting:
- First make sure the Pod is not in the kube-system or kube-public namespace; Pods in these namespaces are ignored by sidecar auto-injection.
- Make sure the Pod spec does not contain hostNetwork: true; Pods with hostNetwork: true are ignored by sidecar auto-injection.
- Check whether the injection-related labels used by the webhook are wrong, or whether the default injection policy in the istio-sidecar-injector ConfigMap in the istio-system namespace is wrong.
- Check whether the Pod label sidecar.istio.io/inject: "true" is set correctly; true forces injection and false prevents it.
- Run kubectl -n <namespace> describe deployment <name> on the Deployment of the Pod that failed injection; the events usually show why the call to the injection webhook failed. A few checks are sketched below.
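Commands for these checks, assuming a default installation where the webhook is named istio-sidecar-injector (all other names are placeholders):
# Is the namespace labeled for injection?
kubectl get namespace <namespace> --show-labels | grep istio-injection
# Inspect the injection webhook's namespace/object selectors
kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml | grep -A5 namespaceSelector
# Look for webhook call failures in the Deployment events
kubectl -n <namespace> describe deployment <name>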
The control plane reports the pilot_total_rejected_configs metric
Symptom:
Two Gateways coexist in the mesh. Requests to https://test1.example.com return 404 as expected, which shows the gateway is reachable, while requests to https://test2.example.com fail.
Cause:
istiod monitoring shows an abnormal pilot_total_rejected_configs metric indicating that the default/test2 configuration was rejected. The reason is that each host may declare TLS only once per port; here test1.example.com is configured with TLS on port 443 in both Gateways, so one of them is rejected, and monitoring confirms the rejected one is test2. Because test2.example.com and test1.example.com share the same Server in the test2 Gateway, the rejected Server breaks requests to test2.example.com. Removing the duplicate test1.example.com entry resolves the problem; after the removal, requests return to normal. A hypothetical reconstruction of the conflicting Gateways follows.
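A hypothetical reconstruction of the conflicting configuration (Gateway names, namespace, and credentialName are assumptions):
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: test1
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: example-cert
    hosts:
    - test1.example.com
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: test2
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: example-cert
    hosts:
    - test1.example.com   # duplicate TLS declaration for this host on port 443, so the Server is rejected
    - test2.example.com   # rejected together with the Server, so test2.example.com breaks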
Envoy logs show a "headers size exceeds limit" error
Background:
After a customer onboarded the mesh, one API call started failing. Envoy handles both the inbound and the outbound direction; the error details are shown below.
Error log:
2022-12-31T14:24:27.501043Z debug envoy client [C4689] protocol error: headers size exceeds limit
Related issue:
https://github.com/envoyproxy/envoy/issues/13827
Solution:
Raise the default limits with an EnvoyFilter.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: max-header
namespace: istio-system
spec:
configPatches:
- applyTo: NETWORK_FILTER
match:
context: ANY
listener:
filterChain:
filter:
name: "envoy.http_connection_manager"
patch:
operation: MERGE
value:
typed_config:
"@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
max_request_headers_kb: 80
common_http_protocol_options:
max_headers_count: 500
- applyTo: CLUSTER
match:
context: ANY
cluster:
portNumber: 5000
patch:
operation: MERGE
value: # cluster specification
common_http_protocol_options:
max_headers_count: 500
Summary:
Cause 1: the port declared on the Pod does not match the port the application actually listens on, which leads to 503 errors.
Cause 2: Envoy's own limit on the number of headers leads to 502 protocol errors. See: https://github.com/envoyproxy/envoy/issues/13827
Troubleshooting notes:
For 503 errors, consider a mismatch between the port the application listens on and the port configured on the Pod.
For 502 errors, consider an excessive number of headers; this can be observed in the istio-proxy debug logs.
References:
- https://github.com/envoyproxy/envoy/issues/13827
- https://www.gl.sh.cn/2022/04/11/network_filter-httpconnectionmanager.html
- https://github.com/envoyproxy/envoy/issues/11774
- https://github.com/envoyproxy/envoy/pull/12216
How to enable Envoy access logs
Envoy access logs can be enabled in three ways: the Telemetry API, Mesh Config, or an EnvoyFilter.
Telemetry API
cat <<EOF > ./telemetry-example.yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
spec:
selector:
matchLabels:
app: helloworld
accessLogging:
- providers:
- name: envoy
EOF
kubectl apply -f ./telemetry-example.yaml
Check the istio-proxy logs of the helloworld Pod; the Telemetry configuration has taken effect:
$ kubectl -n test logs helloworld-v2-7d55d87964-6tncj -c istio-proxy --tail=10
2023-04-17T09:27:13.557757Z info xdsproxy connected to upstream XDS server: 192.168.0.6:15012
2023-04-17T09:57:43.387553Z info xdsproxy connected to upstream XDS server: 192.168.0.6:15012
2023-04-17T10:30:52.106631Z info xdsproxy connected to upstream XDS server: 192.168.0.6:15012
2023-04-17T10:59:38.532144Z info xdsproxy connected to upstream XDS server: 192.168.0.6:15012
[2023-04-17T11:02:59.500Z] "GET /hello HTTP/1.1" 200 - via_upstream - "-" 0 60 119 119 "-" "curl/7.87.0-DEV" "4d166eba-cb47-4147-a182-7a3896643d06" "helloworld:5000" "x.x.x.x:5000" inbound|5000|| x.x.x.x:37311 x.x.x.x:5000 x.x.x.x:54644 outbound_.5000_._.helloworld.test.svc.cluster.local default
[2023-04-17T11:03:00.923Z] "GET /hello HTTP/1.1" 200 - via_upstream - "-" 0 60 123 122 "-" "curl/7.87.0-DEV" "fbd2b87a-913c-4d6f-8f8a-c625f1aaf80e" "helloworld:5000" "x.x.x.x:5000" inbound|5000|| x.x.x.x:38235 x.x.x.x:5000 x.x.x.x:54656 outbound_.5000_._.helloworld.test.svc.cluster.local default
[2023-04-17T11:03:39.534Z] "GET /hello HTTP/1.1" 200 - via_upstream - "-" 0 60 119 119 "-" "curl/7.87.0-DEV" "bc364db9-dc03-4331-95db-ffea73fe5c0a" "helloworld:5000" "x.x.x.x:5000" inbound|5000|| x.x.x.x:52206 x.x.x.x:5000 x.x.x.x:54644 outbound_.5000_._.helloworld.test.svc.cluster.local default
[2023-04-17T11:03:42.281Z] "GET /hello HTTP/1.1" 200 - via_upstream - "-" 0 60 116 116 "-" "curl/7.87.0-DEV" "b88fa88a-5ec1-4747-945b-8ace802db94f" "helloworld:5000" "x.x.x.x:5000" inbound|5000|| x.x.x.x:33526 x.x.x.x:5000 x.x.x.x:54644 outbound_.5000_._.helloworld.test.svc.cluster.local default
[2023-04-17T11:03:43.371Z] "GET /hello HTTP/1.1" 200 - via_upstream - "-" 0 60 120 119 "-" "curl/7.87.0-DEV" "dd20b26a-af35-411b-a2e5-bc7f7f847c87" "helloworld:5000" "x.x.x.x:5000" inbound|5000|| x.x.x.x:55067 x.x.x.x:5000 x.x.x.x:54656 outbound_.5000_._.helloworld.test.svc.cluster.local default
[2023-04-17T11:03:44.500Z] "GET /hello HTTP/1.1" 200 - via_upstream - "-" 0 60 116 115 "-" "curl/7.87.0-DEV" "be3a1993-0502-4f50-ad2d-c1672594962c" "helloworld:5000" "x.x.x.x:5000" inbound|5000|| x.x.x.x:34296 x.x.x.x:5000 x.x.x.x:54644 outbound_.5000_._.helloworld.test.svc.cluster.local default
Using Mesh Config
Edit the istio ConfigMap:
kubectl -n istio-system edit configmap istio
Add the following content:
apiVersion: v1
data:
mesh: |-
# Globally enable Envoy access log output
accessLogEncoding: JSON
accessLogFile: /dev/stdout
accessLogFormat: ""
defaultConfig:
holdApplicationUntilProxyStarts: true
rootNamespace: istio-system
kind: ConfigMap
metadata:
name: istio
accessLogEncoding: the access log output format. Istio predefines TEXT and JSON; TEXT is the default, but JSON is commonly used because it is more readable and easier to collect.
accessLogFile: where the access log is written. It is usually set to /dev/stdout (standard output) so that the logs can be viewed with kubectl logs and collected easily.
accessLogFormat: if you do not want Istio's predefined access log format, use this field to customize the output format.
Using an EnvoyFilter
cat <<EOF > ./envoyfilter-example.yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: enable-accesslog
spec:
configPatches:
- applyTo: NETWORK_FILTER
match:
context: ANY
listener:
filterChain:
filter:
name: envoy.http_connection_manager
patch:
operation: MERGE
value:
typed_config:
"@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
access_log:
- name: envoy.access_loggers.file
typed_config:
"@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
path: "/dev/stdout"
log_format:
json_format:
authority: "%REQ(:AUTHORITY)%"
bytes_received: "%BYTES_RECEIVED%"
bytes_sent: "%BYTES_SENT%"
downstream_local_address: "%DOWNSTREAM_LOCAL_ADDRESS%"
downstream_remote_address: "%DOWNSTREAM_REMOTE_ADDRESS%"
duration: "%DURATION%"
method: "%REQ(:METHOD)%"
path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
protocol: "%PROTOCOL%"
request_id: "%REQ(X-REQUEST-ID)%"
requested_server_name: "%REQUESTED_SERVER_NAME%"
response_code: "%RESPONSE_CODE%"
response_flags: "%RESPONSE_FLAGS%"
route_name: "%ROUTE_NAME%"
start_time: "%START_TIME%"
upstream_cluster: "%UPSTREAM_CLUSTER%"
upstream_host: "%UPSTREAM_HOST%"
upstream_local_address: "%UPSTREAM_LOCAL_ADDRESS%"
upstream_service_time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
upstream_transport_failure_reason: "%UPSTREAM_TRANSPORT_FAILURE_REASON%"
user_agent: "%REQ(USER-AGENT)%"
x_forwarded_for: "%REQ(X-FORWARDED-FOR)%"
EOF
kubectl apply -f ./envoyfilter-example.yaml
Enabling access logs for specific workloads only
To enable access logs only for specific workloads, add a workloadSelector to the EnvoyFilter:
spec:
workloadSelector:
labels:
app: helloworld
Check the istio-proxy logs of the helloworld Pod; the configuration takes effect once the EnvoyFilter is applied:
{"authority":"helloworld:5000","method":"GET","user_agent":"curl/7.87.0-DEV","bytes_received":0,"route_name":"default","upstream_host":"x.x.x.x:5000","upstream_service_time":"117","request_id":"ba543d08-28a0-4dc3-a490-6547474ff858","upstream_local_address":"x.x.x.x:39069","bytes_sent":60,"duration":117,"x_forwarded_for":null,"path":"/hello","upstream_cluster":"inbound|5000||","requested_server_name":"outbound_.5000_._.helloworld.test.svc.cluster.local","upstream_transport_failure_reason":null,"downstream_remote_address":"x.x.x.x:46866","downstream_local_address":"x.x.x.x:5000","start_time":"2023-04-17T11:22:56.569Z","response_flags":"-","protocol":"HTTP/1.1","response_code":200}
Using the Sidecar CRD to reduce data-plane Envoy resource consumption
Goal
Add Sidecar resources to Istio to reduce the resource consumption of the data-plane Envoy.
Single-namespace example
The Sidecar configuration for a single-namespace setup is as follows:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
name: test-sidecar
namespace: test
spec:
egress:
- hosts:
- test/* # namespace of the workloads
- istio-system-csm-testtest/* # namespace where istiod runs (the CSM instance)
Multi-namespace example
- Namespaces involved: test, test-1
The Sidecar configurations are as follows:
# Sidecar example for namespace test
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
name: test-sidecar
namespace: test
spec:
egress:
- hosts:
- test/* # namespace of the workloads
- test-1/* # namespace of the workloads
- istio-system-csm-testtest/* # namespace where istiod runs (the CSM instance)
# Sidecar for namespace test-1
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
name: test-sidecar
namespace: test-1
spec:
egress:
- hosts:
- test/* # namespace of the workloads
- test-1/* # namespace of the workloads
- istio-system-csm-testtest/* # namespace where istiod runs (the CSM instance)
Removing certain headers with Envoy
Symptom
After the Envoy sidecar is injected into the workload, responses carry x-envoy headers; the business needs the x-envoy-decorator-operation header removed.
Cause
Envoy generates these headers by default.
Solution
Remove the headers with an EnvoyFilter. A reference configuration is shown below:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: response-headers-filter
spec:
configPatches:
- applyTo: NETWORK_FILTER
match:
context: ANY
listener:
filterChain:
filter:
name: "envoy.filters.network.http_connection_manager"
patch:
operation: MERGE
value:
typed_config:
"@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
server_header_transformation: PASS_THROUGH
- applyTo: HTTP_ROUTE
match:
context: SIDECAR_INBOUND
patch:
operation: MERGE
value:
decorator:
propagate: false # removes the decorator header
response_headers_to_remove:
- x-envoy-upstream-service-time
- x-powered-by
- server
References
https://janine.pcfe.net/posts/2022-10-10-istio-envoyfilter-strip-headers/
The primary cluster's Istiod keeps restarting
Symptom
Istiod in the primary cluster keeps getting OOM-killed, so its Pod restarts and the data plane cannot connect to the istiod control plane.
Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "podName" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: HTTP status code 503 , message: "2023-03-31T01:30:00.231989Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2023-03-31T01:31:00.457233Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: HTTP status code 503\nError: timeout waiting for Envoy proxy to become ready. Last error: HTTP status code 503\n"
Cause
The Kubernetes cluster where istiod runs has too much endpoint metadata (nodes, Pods, and so on), which makes xDS pushes frequent and resource hungry, so istiod easily runs out of memory.
Solution
Option 1: add nodes to the primary cluster so that istiod can be scheduled with enough resources.
Option 2: use a Sidecar resource to limit the scope istiod has to compute. Example:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
name: healthsvc-sidecar
namespace: healthsvc-xxx
spec:
egress:
- hosts:
- healthsvc-xxx/* # namespace of the workloads
- istio-system-xxx/* # namespace where istiod runs
References
- https://istio.io/latest/docs/tasks/observability/distributed-tracing/jaeger/
- https://istio.io/latest/docs/tasks/observability/metrics/customize-metrics/
- https://github.com/istio/istio/issues/44266
An online API becomes unreachable after a service joins the mesh
Symptom
After an online service joined the mesh, one of its APIs returns 502.
Works: curl -I 'http://x.x.x.x:8080/test/hospitalImjjh/card3/s?word=西安冠心病' -H 'host: xxxx.baidu.com'
Fails: curl -I 'http://x.x.x.x:8080/test/hospitalImjjh/card3/s?word=%E8%A5%BF%E5%AE%89%E5%86%A0%E5%BF%83%E7%97%85' -H 'host: xxxx.baidu.com'
The Envoy error log is as follows:
2023-03-31T02:56:51.929765Z trace envoy filter [C5806] upstream connection received 14240 bytes, end_stream=false
2023-03-31T02:56:51.929785Z trace envoy connection [C5806] writing 14240 bytes, end_stream false
2023-03-31T02:56:51.929795Z trace envoy connection [C5807] socket event: 2
2023-03-31T02:56:51.929798Z trace envoy connection [C5807] write ready
2023-03-31T02:56:51.929801Z trace envoy connection [C5806] socket event: 2
2023-03-31T02:56:51.929802Z trace envoy connection [C5806] write ready
2023-03-31T02:56:51.929844Z trace envoy connection [C5806] write returns: 14240
2023-03-31T02:56:51.930000Z trace envoy connection [C5805] socket event: 3
2023-03-31T02:56:51.930015Z trace envoy connection [C5805] write ready
2023-03-31T02:56:51.930020Z trace envoy connection [C5805] read ready. dispatch_buffered_data=false
2023-03-31T02:56:51.930039Z trace envoy connection [C5805] read returns: 14333
2023-03-31T02:56:51.930049Z trace envoy connection [C5805] read error: Resource temporarily unavailable
2023-03-31T02:56:51.930055Z trace envoy http [C5805] parsing 14333 bytes
2023-03-31T02:56:51.930063Z trace envoy http [C5805] message begin
2023-03-31T02:56:51.930076Z trace envoy http [C5805] completed header: key=Content-Type value=text/html;charset=utf-8
2023-03-31T02:56:51.930089Z trace envoy http [C5805] completed header: key=Transfer-Encoding value=chunked
2023-03-31T02:56:51.930107Z trace envoy http [C5805] completed header: key=Connection value=close
2023-03-31T02:56:51.930118Z trace envoy http [C5805] completed header: key=Vary value=Accept-Encoding
2023-03-31T02:56:51.930129Z trace envoy http [C5805] completed header: key=Server value=nginx/1.8.0
2023-03-31T02:56:51.930139Z trace envoy http [C5805] completed header: key=Date value=Fri, 31 Mar 2023 02:56:51 GMT
2023-03-31T02:56:51.930143Z trace envoy http [C5805] completed header: key=Vary value=Accept-Encoding
2023-03-31T02:56:51.930146Z trace envoy http [C5805] completed header: key=Set-Cookie value=58D4878A25DC3F9C0E2FB62870D096D3:FG=1; max-age=31536000; expires=Sat, 30-Mar-24 02:56:50 GMT; path=/; version=1; comment=bd
2023-03-31T02:56:51.930156Z debug envoy client [C5805] Error dispatching received data: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
2023-03-31T02:56:51.930160Z debug envoy connection [C5805] closing data_to_write=0 type=1
2023-03-31T02:56:51.930162Z debug envoy connection [C5805] closing socket: 1
2023-03-31T02:56:51.930199Z trace envoy connection [C5805] raising connection event 1
2023-03-31T02:56:51.930206Z debug envoy client [C5805] disconnect. resetting 1 pending requests
2023-03-31T02:56:51.930210Z debug envoy client [C5805] request reset
2023-03-31T02:56:51.930213Z trace envoy main item added to deferred deletion list (size=1)
2023-03-31T02:56:51.930222Z debug envoy router [C5804][S6039679106682162641] upstream reset: reset reason: protocol error, transport failure reason:
2023-03-31T02:56:51.930231Z trace envoy connection [C5806] socket event: 3
2023-03-31T02:56:51.930244Z trace envoy connection [C5806] write ready
2023-03-31T02:56:51.930247Z trace envoy connection [C5806] read ready. dispatch_buffered_data=false
Cause
- Why does the first request succeed while the second fails?
The first request works because its traffic is treated as plain TCP.
- The second request fails because of the following:
80058 2023-03-31T06:40:27.095936Z trace envoy connection [C27090] read error: Resource temporarily unavailable
80059 2023-03-31T06:40:27.095939Z trace envoy filter [C27091] downstream connection received 0 bytes, end_stream=true
80060 2023-03-31T06:40:27.095944Z trace envoy http [C27090] parsing 1191 bytes
80061 2023-03-31T06:40:27.095971Z trace envoy http [C27090] message begin
80062 2023-03-31T06:40:27.095990Z trace envoy http [C27090] completed header: key=Content-Type value=text/html;charse t=utf-8
80063 2023-03-31T06:40:27.095945Z trace envoy connection [C27092] writing 0 bytes, end_stream true
80064 2023-03-31T06:40:27.096006Z trace envoy connection [C27092] socket event: 2
80065 2023-03-31T06:40:27.096009Z trace envoy connection [C27092] write ready
80066 2023-03-31T06:40:27.096028Z trace envoy connection [C27092] socket event: 2
80067 2023-03-31T06:40:27.096029Z trace envoy connection [C27092] write ready
80068 2023-03-31T06:40:27.096096Z trace envoy http [C27090] completed header: key=Connection value=close
80069 2023-03-31T06:40:27.096126Z trace envoy http [C27090] completed header: key=Vary value=Accept-Encoding
80070 2023-03-31T06:40:27.096133Z trace envoy http [C27090] completed header: key=Server value=nginx/1.8.0
80071 2023-03-31T06:40:27.096140Z trace envoy http [C27090] completed header: key=Date value=Fri, 31 Mar 2023 06:40:2 7 GMT
80072 2023-03-31T06:40:27.096146Z trace envoy http [C27090] completed header: key=Vary value=Accept-Encoding
80073 2023-03-31T06:40:27.096150Z trace envoy http [C27090] completed header: key=Set-Cookie value=C590DB1FDF A19EB891E90DB75916FF92:FG=1; max-age=31536000; expires=Sat, 30-Mar-24 06:40:25 GMT; path=/; version=1; comment=bd
80074 2023-03-31T06:40:27.096162Z debug envoy client [C27090] Error dispatching received data: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
80075 2023-03-31T06:40:27.096172Z debug envoy connection [C27090] closing data_to_write=0 type=1
80076 2023-03-31T06:40:27.096174Z debug envoy connection [C27090] closing socket: 1
80077 2023-03-31T06:40:27.096214Z trace envoy connection [C27090] raising connection event 1
80078 2023-03-31T06:40:27.096222Z debug envoy client [C27090] disconnect. resetting 1 pending requests
80079 2023-03-31T06:40:27.096229Z debug envoy client [C27090] request reset
80080 2023-03-31T06:40:27.096232Z trace envoy main item added to deferred deletion list (size=1)
80081 2023-03-31T06:40:27.096245Z debug envoy router [C27089][S17527852526966740131] upstream reset: reset reason: prot ocol error, transport failure reason:
80082 2023-03-31T06:40:27.096296Z debug envoy http [C27089][S17527852526966740131] Sending local reply with details u pstream_reset_before_response_started{protocol_error}
Packet capture of the business traffic
0x03d0: 7061 7468 3d2f 0d0a 5374 7269 6374 2d54 path=/..Strict-T
0x03e0: 7261 6e73 706f 7274 2d53 6563 7572 6974 ransport-Securit
0x03f0: 793a 206d 6178 2d61 6765 3d31 3732 3830 y:.max-age=17280
0x0400: 300d 0a54 696d 6520 3a20 5475 6520 4f63 0..Time.:.Tue.Oc
0x0410: 7420 3138 2031 313a 3234 3a35 3020 4353 t.18.11:24:50.CS
0x0420: 5420 3230 3232 0d0a 5472 6163 6569 643a T.2022..Traceid:
0x0430: 2031 3638 3033 3535 3131 3530 3537 3335 .168035511505735
0x0440: 3937 3435 3038 3739 3139 3332 3734 3131 9745087919327411
0x0450: 3330 3639 3832 3438 0d0a 5661 7279 3a20 30698248..Vary:.
0x0460: 4163 6365 7074 2d45 6e63 6f64 696e 670d Accept-Encoding.
0x0470: 0a58 2d47 732d 466c 6167 3a20 3078 300d .X-Gs-Flag:.0x0.
The problematic header key:
0x0400: 300d 0a54 696d 6520 3a20 5475 6520 4f63 0..Time.:.Tue.Oc
"Time :" 这里有个空格,不符合 header 规范,触发envoy检测header报错。
Error dispatching received data: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
Reproducing and debugging offline
Download Envoy (version 1.18.3) and start it with the minimal configuration below, forwarding traffic to the business service.
admin:
address:
socket_address: { address: 127.0.0.1, port_value: 9901 }
static_resources:
listeners:
- name: listener_0
address:
socket_address: { address: 127.0.0.1, port_value: 10000 }
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
codec_type: AUTO
route_config:
name: local_route
virtual_hosts:
- name: local_service
domains: ["*"]
routes:
- match: { prefix: "/" }
route: { cluster: some_service }
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
- name: some_service
connect_timeout: 0.25s
type: STATIC
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: some_service
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: <backend-service-ip>
port_value: <backend-service-port>
Send a business request through Envoy and the following error appears (with the Envoy log level set to debug).
[2023-04-01 18:14:44.358][1535551][trace][connection] [source/common/network/raw_buffer_socket.cc:38] [C17] read error: Resource temporarily unavailable
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:555] [C17] parsing 1079 bytes
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:834] [C17] message begin
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=idcinfo value=hba
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Bdqid value=43e2823cd8d4f5df
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Bdqid value=43e2823cd8d4f5df
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Bdsvrtm value=2374
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Cache-Control value=no-cache
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Content-Type value=text/html;charset=utf-8
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Date value=Sat, 01 Apr 2023 10:14:44 GMT
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=P3p value=CP=" OTI DSP COR IVA OUR IND COM "
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Rpql value=1
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Server value=apache
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Set-Cookie value=7F014F6421F98673E55D5BA0F02EB03D:FG=1; max-age=31536000; expires=Sun, 31-Mar-24 10:14:42 GMT; path=/; version=1; comment=bd
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Set-Cookie value=delPer=0; path=/
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Set-Cookie value=search_mapping=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Set-Cookie value=bzt_device_sn=deleted; expires=Thu, 01-Jan-1970 17:00:00 GMT; path=/
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Set-Cookie value=bzt_username=deleted; expires=Thu, 01-Jan-1970 17:00:00 GMT; path=/
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Set-Cookie value=BDSVRTM=2374; path=/
[2023-04-01 18:14:44.358][1535551][trace][http] [source/common/http/http1/codec_impl.cc:475] [C17] completed header: key=Set-Cookie value=PSINO=1; path=/
[2023-04-01 18:14:44.358][1535551][debug][client] [source/common/http/codec_client.cc:149] [C17] Error dispatching received data: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
[2023-04-01 18:14:44.358][1535551][debug][connection] [source/common/network/connection_impl.cc:133] [C17] closing data_to_write=0 type=1
[2023-04-01 18:14:44.358][1535551][debug][connection] [source/common/network/connection_impl.cc:243] [C17] closing socket: 1
Verifying that "Time " followed by a colon does trigger the error
Pick any service that does not trigger the error, start a local nginx that proxies to it and adds the header "Time : xxx", then point the local Envoy configuration at that nginx. The error reproduces, and toggling the header on and off confirms it is the cause.
location / {
    add_header "Time " "xxx";
    proxy_pass http://<upstream-service>;
}
Locating the error in the Envoy source
Debug log lines:
source/common/http/http1/codec_impl.cc:475 [C17] completed header: key=Set-Cookie value=PSINO=1; d
source/common/http/codec_client.cc:149 [C17] Error dispatching received data: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
The header validation happens in http_parser: https://github.com/nodejs/http-parser/blob/5c5b3ac62662736de9e71640a8dc16da45b32503/http_parser.c
http_parser.c
Solution
Fix: modify the relevant business interface code so that the header keys are well-formed.
References
- https://github.com/istio/istio/issues/36711
- https://github.com/istio/istio/issues/31458
- https://stackoverflow.com/questions/72808664/istio-error-dispatching-received-data-http-1-1-protocol-error-unsupported-tra
- https://github.com/istio/istio/issues/39706
An API becomes unreachable after a service joins the mesh
Symptom
After the service joined the mesh, some APIs return 502.
Calling the following API returns 502 directly.
curl -v 'http://x.x.x.x:80/api/v1/hub/meta/attachment?attachmentId=20230303035842-xhchy&fileName=%E7%A7%81%E5%AF%86%E9%A1%B9%E7%9B%AE%E5%8F%82%E4%B8%8E%E4%BA%BA%E5%91%98%E4%BF%9D%E5%AF%86%E6%89%BF%E8%AF%BA%E4%B9%A6.pdf' -H'cookie: xxx' -H'host:xxx.baidu.com'
The Envoy trace log is:
2023-03-06T13:23:40.384836Z debug envoy client [C14158] Error dispatching received data: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
2023-03-06T13:23:40.381998Z debug envoy router [C14938][S1043862392353535809] upstream headers complete: end_stream=false
'date', 'Mon, 06 Mar 2023 13:23:40 GMT'
'content-type', 'application/pdf'
'content-length', '108860'
'accept-ranges', 'bytes'
'content-md5', '+e4pDfRHe+JDTsww5MGeOw=='
'etag', '"f9ee290df4477be2434ecc30e4c19e3b"'
'expires', 'Thu, 09 Mar 2023 13:23:40 GMT'
'last-modified', 'Thu, 02 Mar 2023 19:58:42 GMT'
'server', 'envoy'
'x-bce-expiration-date', '2022-11-25T00:00:00Z'
'x-bce-request-id', '358f4bdc-5d3e-4209-92c5-8156b65a1a06'
'x-bce-storage-class', 'STANDARD'
'x-envoy-upstream-service-time', '52'
2023-03-06T13:23:40.384836Z debug envoy client [C14158] Error dispatching received data: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
2023-03-06T13:23:40.384855Z debug envoy connection [C14158] closing data_to_write=0 type=1
2023-03-06T13:23:40.384904Z debug envoy client [C14158] request reset
2023-03-06T13:23:40.384913Z debug envoy router [C14935][S15319585049759823265] upstream reset: reset reason: protocol error, transport failure reason:
2023-03-06T13:23:40.384978Z debug envoy http [C14935][S15319585049759823265] Sending local reply with details upstream_reset_before_response_started{protocol_error}
2023-03-06T13:23:40.385016Z debug envoy http [C14935][S15319585049759823265] encoding headers via codec (end_stream=false):
':status', '502'
'content-length', '87'
'content-type', 'text/plain'
'date', 'Mon, 06 Mar 2023 13:23:39 GMT'
'server', 'istio-envoy'
'x-envoy-decorator-operation', ':0/*'
Cause
Root cause: the business code adds a response header containing the file name, which is not properly encoded, and this causes the 502 error.
Solution
Fix: modify the relevant business interface code to encode fileName correctly.
Mesh configuration has no effect when a workload accesses a certain service
Symptom
A workload configured an EnvoyFilter to add a header to requests. Requests to in-cluster services carry the header, but requests to services outside the cluster do not.
Cause
The mesh has no information about the accessed service, so the outbound traffic is passed through and the user's EnvoyFilter logic never runs.
Solution
Register the accessed service in the mesh with a ServiceEntry; see https://istio.io/latest/docs/reference/config/networking/service-entry/. A minimal sketch follows.
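A minimal ServiceEntry sketch, assuming the external dependency is a plain-HTTP API at api.external.example.com (a hypothetical host):
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-api
spec:
  hosts:
  - api.external.example.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 80
    name: http
    protocol: HTTP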
Istiod reports an error
Symptom
The following error appears while istiod is starting.
Warning FailedMount 8s (x6 over 24s) kubelet MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the API server does not have TokenRequest endpoints enabled
Cause
The Kubernetes API server does not support the TokenRequest API (third-party service account tokens), while the installation uses jwtPolicy=third-party-jwt, so the istio-token projected volume cannot be mounted.
Solution
Adjust the values.global.jwtPolicy setting according to whether the Kubernetes platform supports third-party JWTs.
istioctl manifest generate --set values.global.jwtPolicy=first-party-jwt > first-party-jwt.yaml
istioctl manifest generate --set values.global.jwtPolicy=third-party-jwt > third-party-jwt.yaml
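To decide which policy applies, the Istio documentation suggests checking whether the cluster exposes the TokenRequest API (requires jq):
kubectl get --raw /api/v1 | jq '.resources[] | select(.name | index("serviceaccounts/token"))'
If the command prints a resource, third-party tokens are supported; if it prints nothing, fall back to first-party-jwt.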
A Pod in the Terminating state loses traffic
Problem
While a Pod is in the Terminating state, traffic routed to it is lost.
Cause
During a rollout, the service side needs several minutes to bring up the new Pods, and during that window the old Pods are in the Terminating state. The business uses a preStop hook so that the terminating Pods keep accepting traffic until the rollout completes. After the sidecar is injected, however, the sidecar container rejects traffic as soon as the Pod enters the Terminating state, so all traffic directed to the old Pods during those minutes is lost.
Solution
Workaround
Add a preStop hook directly in the sidecar injection template ConfigMap:
{{- else if $holdProxy }}
lifecycle:
postStart:
exec:
command:
- pilot-agent
- wait
preStop:
exec:
command:
- sh
- -c
- |
while true; do
if curl -s http://127.0.0.1:8080/health; then
sleep 5
else
break
fi
done
{{- end }}
Preferred solution
terminationDrainDuration
https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/
Set this parameter in the istio ConfigMap (it was set to 30 minutes in the test).
In testing, while the main container is still alive, curl requests to the main container's health port are rejected, which matches the documented behavior: new requests are rejected.
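A minimal sketch of the preferred setting in the istio ConfigMap (meshConfig); 1800s corresponds to the 30 minutes used in the test:
defaultConfig:
  terminationDrainDuration: 1800s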
References
- https://github.com/istio/istio/issues/36517
- https://github.com/istio/istio/pull/35059
Istio's OpenTelemetry provider collects no trace data
Problem
OpenTelemetry does not collect any trace data.
Cause
The collector address configured in the istio ConfigMap is not reachable from the mesh.
Solution
Configure a correct, reachable trace collector address.
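A minimal sketch of a working setup, assuming an OTLP collector reachable at otel-collector.observability on port 4317: declare the provider in meshConfig and reference it from a Telemetry resource.
# meshConfig (istio ConfigMap)
extensionProviders:
- name: otel
  opentelemetry:
    service: otel-collector.observability.svc.cluster.local
    port: 4317
---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: otel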
References
https://istio.io/latest/docs/tasks/observability/logs/otel-provider/
istio-proxy produces no access_log output
Problem
Access logging is configured in the istio ConfigMap in the istio-system namespace, but istio-proxy shows no access logs.
The configuration is as follows:
apiVersion: v1
data:
mesh: |-
accessLogEncoding: JSON
accessLogFile: /dev/stdout
defaultConfig:
discoveryAddress: istiod/istio-system.svc:15012
holdApplicationUntilProxyStarts: true
meshId: csm-u7r86kt8
proxyMetadata:
ISTIO_META_DNS_AUTO_ALLOCATE: "true"
ISTIO_META_DNS_CAPTURE: "true"
proxyStatsMatcher:
inclusionPrefixes:
- thrift
- dubbo
- kafka
- meta_protocol
inclusionRegexps:
- .*dubbo.*
- .*thrift.*
- .*kafka.*
- .*zookeeper.*
- .*meta_protocol.*
tracing:
zipkin:
address: zipkin.istio-system:9411
enableTracing: true
extensionProviders:
- name: otel
opentelemetry:
port: 31504
service: x.x.x.x
enablePrometheusMerge: true
rootNamespace: istio-system
trustDomain: cluster.local
meshNetworks: 'networks: {}'
kind: ConfigMap
metadata:
creationTimestamp: "2023-08-11T07:31:14Z"
labels:
install.operator.istio.io/owning-resource: unknown
istio.io/rev: default
operator.istio.io/component: Pilot
release: istio
name: istio
namespace: istio-system
resourceVersion: "1536380"
uid: d89e1f61-f9af-4f5e-9eaa-b26616fb267e
Cause
discoveryAddress: istiod/istio-system.svc:15012 is wrong; the error is visible in the istiod control-plane logs.
Solution
Change discoveryAddress to the correct address: discoveryAddress: istiod.istio-system.svc:15012.
References
https://cloud.tencent.com/developer/article/2172765
A Pod with sidecar injection enabled fails to start
Problem: following the CSM operations manual, a Pod fails to start after sidecar injection is enabled. The istio-init container in the Pod logs the following error:
2023-11-02T02:25:27.723319Z info Running command: iptables-restore --noflush /tmp/iptables-rules-1698891927723191413.txt835755000
2023-11-02T02:25:27.724588Z error Command error output: xtables parameter problem: iptables-restore: unable to initialize table 'raw'
Error occurred at line: 1
Try `iptables-restore -h' or 'iptables-restore --help' for more information.
2023-11-02T02:25:27.724604Z error Failed to execute: iptables-restore --noflush /tmp/iptables-rules-1698891927723191413.txt835755000, exit status 2
Cause: the CCE cluster may contain nodes whose OS or kernel is not supported by Istio, such as BaiduLinux or CentOS 8; see the Istio prerequisites for details. CentOS 8 and some other Red Hat family distributions use iptables-nft rather than the legacy iptables modules. Istio intercepts traffic by adding NAT rules with iptables, so Linux must have the required netfilter kernel modules enabled.
Fix (permanent; requires rebooting the CCE node): run the following on the node:
cat >/etc/modules-load.d/99-istio-modules.conf <<EOF
br_netfilter
nf_nat
nf_nat_redirect
xt_REDIRECT
xt_owner
iptable_nat
iptable_mangle
iptable_filter
EOF
# reboot the node
reboot
Temporary fix (no node reboot needed; lost after the node restarts): run the following on the node:
modprobe br_netfilter
modprobe nf_nat
modprobe nf_nat_redirect
modprobe xt_REDIRECT
modprobe xt_owner
modprobe iptable_nat
modprobe iptable_mangle
modprobe iptable_filter