Added serialization from etcd error metric #114376

Merged


baomingwang
Contributor

@baomingwang baomingwang commented Dec 9, 2022

What type of PR is this?

/kind bug
/sig api-machinery

What this PR does / why we need it:

Revives the previous PR #90612 to capture the data corruption issue described in #69579.

Verified manually by flipping one bit: changing protoEncodingPrefix from []byte{0x6b, 0x38, 0x73, 0x00} to []byte{0x6b, 0x38, 0x73, 0x01} causes the target metric storage_decode_errors to be emitted.
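For readers who want to see why a single flipped bit in the prefix matters, here is a minimal sketch. It models the mismatch from the data side (corrupting the stored bytes) rather than by editing the constant as done above. `hasProtoPrefix` and `flipBit` are hypothetical helpers written for illustration; they are not functions from the Kubernetes code base, and the real decoder does more than a prefix check.

```go
package main

import (
	"bytes"
	"fmt"
)

// protoEncodingPrefix is the four-byte marker quoted in the description
// above: "k8s" followed by 0x00.
var protoEncodingPrefix = []byte{0x6b, 0x38, 0x73, 0x00}

// hasProtoPrefix is a hypothetical helper that reports whether data starts
// with the expected marker.
func hasProtoPrefix(data []byte) bool {
	return bytes.HasPrefix(data, protoEncodingPrefix)
}

// flipBit returns a copy of data with one bit inverted; the input slice is
// left untouched.
func flipBit(data []byte, byteIndex int, bit uint) []byte {
	out := append([]byte(nil), data...)
	out[byteIndex] ^= 1 << bit
	return out
}

func main() {
	// "payload" is placeholder content standing in for a serialized object.
	stored := append(append([]byte(nil), protoEncodingPrefix...), []byte("payload")...)
	corrupted := flipBit(stored, 3, 0) // the last prefix byte 0x00 becomes 0x01

	fmt.Println(hasProtoPrefix(stored))    // true
	fmt.Println(hasProtoPrefix(corrupted)) // false
	fmt.Printf("% x\n", corrupted[:4])     // 6b 38 73 01
}
```

Once the marker no longer matches, the stored bytes are no longer recognized as the expected encoding, which is the situation the new metric is meant to surface.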

$ kubectl get pods nginx
Error from server: illegal base64 data at input

$ kubectl get --raw /metrics | grep storage_decode_errors
# HELP apiserver_storage_decode_errors_total [ALPHA] Number of stored object decode errors split by object type
# TYPE apiserver_storage_decode_errors_total counter
apiserver_storage_decode_errors_total{resource="pods"} 1

kube-apiserver log output with --v=4:

I1208 07:56:17.397972 10 store.go:181] Decoding pods "/pods/default/nginx" failed: illegal base64 data at input byte 3

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kube-apiserver: errors decoding objects in etcd are now recorded in an `apiserver_storage_decode_errors_total` counter metric

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/apiserver labels Dec 9, 2022
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 9, 2022
@k8s-ci-robot
Contributor

Hi @baomingwang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@baomingwang baomingwang marked this pull request as ready for review December 9, 2022 04:05
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2022
@baomingwang
Contributor Author

/assign @deads2k @lavalamp
cc @jingyih @micahhausler, who were working on the previous PR

@alexzielenski
Contributor

/triage accepted
/cc @jpbetz @jingyih
/sig instrumentation

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 13, 2022
@dims
Member

dims commented Dec 13, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 13, 2022
@dims
Member

dims commented Dec 13, 2022

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 13, 2022
@liggitt
Member

liggitt commented Feb 1, 2023

should we have a test that flips a bit in etcd, then calls storage accessor methods and ensures this metric increments?

also, we decode in paths other than get and list (on every write, in watch, etc). should this metric get incremented if decode errors on existing data are encountered in those methods?
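A rough sketch of what such a test could look like, under heavy assumptions: `fakeStore` is a hypothetical in-memory stand-in for the etcd-backed store, JSON stands in for the real codec, and a plain map stands in for the counter. None of these names come from the actual storage package. The flipped bit is chosen so that the stand-in codec is guaranteed to reject the value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod is a tiny placeholder object type.
type pod struct {
	Name string `json:"name"`
}

// fakeStore is a hypothetical in-memory stand-in for the etcd-backed store.
type fakeStore struct {
	resource     string
	data         map[string][]byte // raw bytes as they would sit in etcd
	decodeErrors map[string]int    // stand-in for the counter, keyed by resource
}

func newFakeStore(resource string) *fakeStore {
	return &fakeStore{
		resource:     resource,
		data:         map[string][]byte{},
		decodeErrors: map[string]int{},
	}
}

// put serializes p and stores the raw bytes under key.
func (s *fakeStore) put(key string, p pod) error {
	raw, err := json.Marshal(p)
	if err != nil {
		return err
	}
	s.data[key] = raw
	return nil
}

// flipBit corrupts the stored value for key in place by inverting one bit.
func (s *fakeStore) flipBit(key string, byteIndex int, bit uint) {
	s.data[key][byteIndex] ^= 1 << bit
}

// decode counts every failure, whichever accessor triggered it.
func (s *fakeStore) decode(raw []byte) (pod, error) {
	var p pod
	if err := json.Unmarshal(raw, &p); err != nil {
		s.decodeErrors[s.resource]++
		return pod{}, err
	}
	return p, nil
}

// Get decodes the single value stored under key.
func (s *fakeStore) Get(key string) (pod, error) {
	raw, ok := s.data[key]
	if !ok {
		return pod{}, fmt.Errorf("key %q not found", key)
	}
	return s.decode(raw)
}

// List decodes every stored value and stops at the first failure.
func (s *fakeStore) List() ([]pod, error) {
	pods := make([]pod, 0, len(s.data))
	for _, raw := range s.data {
		p, err := s.decode(raw)
		if err != nil {
			return nil, err
		}
		pods = append(pods, p)
	}
	return pods, nil
}

func main() {
	s := newFakeStore("pods")
	if err := s.put("/pods/default/nginx", pod{Name: "nginx"}); err != nil {
		panic(err)
	}
	s.flipBit("/pods/default/nginx", 0, 0) // '{' (0x7b) becomes 'z' (0x7a)

	_, err := s.Get("/pods/default/nginx")
	fmt.Println(err != nil, s.decodeErrors["pods"]) // true 1
	_, err = s.List()
	fmt.Println(err != nil, s.decodeErrors["pods"]) // true 2
}
```

A real test would also need to cover the write and watch paths mentioned above, and would corrupt the data through the etcd client rather than a map.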

@baomingwang
Contributor Author

baomingwang commented Feb 1, 2023

> should we have a test that flips a bit in etcd, then calls storage accessor methods and ensures this metric increments?

I did some local testing: changing protoEncodingPrefix from []byte{0x6b, 0x38, 0x73, 0x00} to []byte{0x6b, 0x38, 0x73, 0x01} causes the target metric storage_decode_errors to be emitted.

$ kubectl get pods nginx
Error from server: illegal base64 data at input

$ kubectl get --raw /metrics | grep storage_decode_errors
# HELP apiserver_storage_decode_errors_total [ALPHA] Number of stored object decode errors split by object type
# TYPE apiserver_storage_decode_errors_total counter
apiserver_storage_decode_errors_total{resource="pods"} 1

I1208 07:56:17.397972 10 store.go:181] Decoding pods "/pods/default/nginx" failed: illegal base64 data at input byte 3

Is that enough?

> also, we decode in paths other than get and list (on every write, in watch, etc). should this metric get incremented if decode errors on existing data are encountered in those methods?

Good callout. That reminds me it would be better to record the decode error inside the decode() function, so that every codec.Decode error is captured. Does that make sense to you?
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L998-L1001

@liggitt

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2023
@baomingwang
Contributor Author

Comments addressed in the updated commit. Could you please take another look?
Let me know if anything doesn't look right to you.
@liggitt @dims

@dims
Member

dims commented Feb 2, 2023

/lgtm

will wait for @liggitt to remove hold!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d9afc39922b772c077743e9e96bc034811c69380

Member

@liggitt liggitt left a comment


plumbing data to decode and recording the metric / error there makes sense. If we're intending to log the actual path in etcd, we should use preparedKey in a bunch of places
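To illustrate the shape of that plumbing, here is a minimal sketch and not the merged implementation. The `store` type, its `recordDecodeError` hook, `jsonCodec`, and the `/registry` path prefix are all assumptions made for the example. The point is only that the resource name and the prepared key travel into one `decode` function, which records the metric and annotates the error for every caller.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decoder is a hypothetical stand-in for the storage codec.
type decoder func(raw []byte, into any) error

// jsonCodec is an illustrative codec; the real store does not use plain JSON.
func jsonCodec(raw []byte, into any) error {
	return json.Unmarshal(raw, into)
}

// store carries what decode needs to report a failure: the resource name
// for the metric label and a hook that records the error.
type store struct {
	resource          string
	pathPrefix        string // assumed prefix; "/registry" in the example below
	codec             decoder
	recordDecodeError func(resource string)
}

// prepareKey mimics turning an API key into the full storage path.
func (s *store) prepareKey(key string) string {
	return s.pathPrefix + key
}

// decode is the single place where failures are counted and annotated with
// the full storage path.
func (s *store) decode(preparedKey string, raw []byte, into any) error {
	if err := s.codec(raw, into); err != nil {
		s.recordDecodeError(s.resource)
		return fmt.Errorf("decoding %s %q failed: %w", s.resource, preparedKey, err)
	}
	return nil
}

// get and handleWatchEvent stand for two of the paths that decode stored
// data; both go through decode, so neither needs its own metric code.
func (s *store) get(key string, raw []byte, into any) error {
	return s.decode(s.prepareKey(key), raw, into)
}

func (s *store) handleWatchEvent(key string, raw []byte, into any) error {
	return s.decode(s.prepareKey(key), raw, into)
}

func main() {
	counts := map[string]int{}
	s := &store{
		resource:          "pods",
		pathPrefix:        "/registry",
		codec:             jsonCodec,
		recordDecodeError: func(resource string) { counts[resource]++ },
	}

	var obj map[string]any
	fmt.Println(s.get("/pods/default/nginx", []byte(`{"name":"nginx"}`), &obj))
	fmt.Println(s.handleWatchEvent("/pods/default/nginx", []byte("not json"), &obj))
	fmt.Println(counts["pods"])
}
```

Passing the prepared key rather than the API key means the error message names the path as it exists in etcd, which is what an operator would need when inspecting the corrupted entry.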

staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go (seven review threads, all outdated and resolved)
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 6, 2023
@liggitt
Member

liggitt commented Feb 7, 2023

/retest

@liggitt
Member

liggitt commented Feb 7, 2023

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 7, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 3ba83b0c0ca16d0f6403c6eaad6c68188c7aca18

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: baomingwang, dims, liggitt, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit dfb976e into kubernetes:master Feb 7, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Feb 7, 2023
wongma7 pushed a commit to wongma7/kubernetes that referenced this pull request Jul 24, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.