Aggregating Prometheus Alertmanager Alert Messages Using PrometheusAlert

Deploy PrometheusAlert

git clone https://github.com/feiyu563/PrometheusAlert.git
cd PrometheusAlert/example/helm/prometheusalert
# Update config/app.conf to set login user info and database configuration
helm install -n monitoring .

Create a WeChat Work Group Robot

After creating a WeChat Work group, right-click the group → “Add Group Robot”. This will generate a webhook URL for the robot. Record this URL for later use.
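
To verify the webhook before wiring it into PrometheusAlert, you can push a test message with curl, assuming the standard WeChat Work group-robot API (replace the key placeholder with the one from your robot's webhook URL):

# Send a test text message to the group robot (YOUR-ROBOT-KEY is a placeholder)
curl -s -H "Content-Type: application/json" \
  -d '{"msgtype": "text", "text": {"content": "PrometheusAlert webhook test"}}' \
  "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR-ROBOT-KEY"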

Develop a Kubernetes cluster backup strategy

Backups are something every internet company's technical team has to deal with, and we are no exception. Here I'll share my own strategy for backing up production Kubernetes clusters.

My primary goals for Kubernetes backups are to prevent:

  • Accidental deletion of a namespace within the cluster
  • Accidental misconfiguration causing resource anomalies (e.g., deployments, configmaps)
  • Accidental deletion of partial resources in the cluster
  • Loss of etcd data

Backing Up etcd

Backing up etcd guards against a cluster-level catastrophe or the loss of etcd data, either of which can render the entire cluster unusable. In such cases, only a full cluster restore from a backup can bring services back.
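
A minimal snapshot command looks like the following; the endpoint and certificate paths assume a kubeadm-style etcd and may differ in other environments:

# Take an etcd snapshot (paths assume kubeadm defaults)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-$(date +%Y%m%d%H%M).db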

How to quickly set up a Greenplum cluster

Recently, our internal project has been supporting a big data initiative, requiring the simulation of customer scenarios using Greenplum (older version 4.2.2.4). Below is a record of the Greenplum cluster setup process—note that the procedure for higher versions of GP remains largely identical.

Building Base Image

CentOS 6 Dockerfile:

FROM centos:6

# CentOS 6 is EOL, so point yum at the Aliyun vault mirror
RUN mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
RUN curl -o /etc/yum.repos.d/CentOS-Base.repo https://www.xmpan.com/Centos-6-Vault-Aliyun.repo
RUN yum -y update; yum clean all
# Utilities needed by the Greenplum installer and for basic troubleshooting
RUN yum install -y \
    net-tools \
    ntp \
    openssh-server \
    openssh-clients \
    less \
    iproute \
    lsof \
    wget \
    ed \
    which; yum clean all
# SSH host key, plus the gpadmin user (password: gpadmin) that Greenplum runs as
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
RUN groupadd gpadmin
RUN useradd gpadmin -g gpadmin
RUN echo gpadmin | passwd gpadmin --stdin
# Run sshd in the foreground; Greenplum hosts communicate with each other over SSH
ENTRYPOINT ["/usr/sbin/sshd", "-D"]

Build image:
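
Building the image is just a standard docker build; the tag below is an arbitrary name:

# Build the base image from the Dockerfile above
docker build -t centos6-gpdb-base:latest .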

Alibaba Cloud Shared GPU Solution Testing

I. Deploy GPU Sharing Plugin in Kubernetes

Before deployment, ensure that nvidia-driver and nvidia-docker are installed on your Kubernetes nodes, and Docker’s default runtime has been set to nvidia.

# cat /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
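
After changing daemon.json, restart Docker so the nvidia default runtime takes effect:

systemctl daemon-reload
systemctl restart docker
# Verify that the default runtime is now nvidia
docker info | grep -i runtime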

1. Install gpushare-device-plugin via Helm

$ git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
$ cd gpushare-scheduler-extender/deployer/chart
$ helm install --name gpushare --namespace kube-system --set masterCount=3 gpushare-installer

2. Label GPU Nodes

$ kubectl label node sd-cluster-04 gpushare=true
$ kubectl label node sd-cluster-05 gpushare=true

3. Install kubectl-inspect-gpushare

Ensure kubectl is already installed (omitted here).
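
The inspection tool is a standalone kubectl plugin binary. One way to install it (the release URL and version below are assumptions; check the gpushare-device-plugin releases page for the current one):

# Download the inspection plugin and put it on PATH (version/URL may differ)
cd /usr/local/bin
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod +x kubectl-inspect-gpushare
# Show per-node shared-GPU memory allocation
kubectl inspect gpushare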

Deploying a High-Availability Kubernetes Cluster with kubeadm

To support later verification of an on-premises deployment, I needed to stand up a Kubernetes cluster quickly in the internal network. For larger clusters I have typically used Kubeasz or Kubespray; for a small cluster like this one, kubeadm is more efficient.

Below is the recorded process for deploying with kubeadm:

Cluster Nodes:

192.168.1.206 sd-cluster-206 node
192.168.1.207 sd-cluster-207 master,etcd
192.168.1.208 sd-cluster-208 master,etcd,haproxy,keepalived
192.168.1.209 sd-cluster-209 master,etcd,haproxy,keepalived

Image Versions:

docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.2
docker pull registry.cn-shanghai.aliyuncs.com/gcr-k8s/flannel:v0.14.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:v0.48.1

I. Basic Environment Setup

1. Install Docker and Configure Hosts

yum install -y yum-utils device-mapper-persistent-data lvm2 git
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum install docker-ce -y
systemctl start docker
systemctl enable docker
systemctl status docker

2. Configure /etc/hosts

cat >> /etc/hosts << hhhh
192.168.1.207 sd-cluster-207
192.168.1.208 sd-cluster-208
192.168.1.209 sd-cluster-209
hhhh

3. Disable Firewall and Set SELinux

systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/sysconfig/selinux

4. Disable Swap

Kubernetes 1.8+ requires disabling swap. If not disabled, kubelet will fail to start by default.
Option 1: Use --fail-swap-on=false in kubelet startup args.
Option 2: Disable system swap.
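
A minimal sketch of Option 2 (run on every node):

# Turn swap off immediately and keep it off across reboots
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab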

Implementing Internal DNS with Alibaba Cloud PrivateZone + Bind9 + Dnsmasq

Requirements:

  • Alibaba Cloud cluster can resolve internal domain names
  • Office network resolves internal domain names + internet access resolution

Solution:

  • For the first requirement, directly use Alibaba Cloud PrivateZone for resolution.
  • For the second requirement, configure internal domain zones in PrivateZone, then synchronize them to the office network’s bind9 server using Alibaba Cloud’s synchronization tool. Use Dnsmasq as the DNS entry point for the office network: forward public queries to public DNS servers, and forward internal domain queries to the bind9 server.

Some may wonder: why not let bind9 alone handle all internal resolution? The main reason is that, in practice, bind9 shows performance problems when it has to forward to multiple upstream DNS servers at once, with occasional timeouts, whereas Dnsmasq handles this scenario noticeably better.
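
As a minimal sketch of the Dnsmasq side (the internal domain, bind9 address, and public resolvers below are placeholders, not our real values):

# /etc/dnsmasq.conf on the office DNS entry point
# Ignore /etc/resolv.conf and use only the servers defined here
no-resolv
# Internal zone (synced from PrivateZone) -> the office bind9 server
server=/internal.example.com/192.168.10.53
# Everything else -> public DNS
server=223.5.5.5
server=114.114.114.114
cache-size=10000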

Getting Started with Argo Events

Previously, we introduced how to install Argo Workflow and trigger tasks. In this article, we focus on a new tool:

What is Argo Events?

Argo Events is an event-driven workflow automation framework for Kubernetes. It supports more than 20 event sources, such as webhooks, S3 event notifications, calendar/cron schedules, and message queues like Kafka, GCP Pub/Sub, SNS, and SQS.

Features:

  • Supports events from over 20 event sources and more than 10 trigger types.
  • Enables customization of business-level constraints for workflow automation.
  • Manages everything from simple, linear, real-time workflows to complex, multi-source event scenarios.
  • Complies with the CloudEvents specification.

Components:

  • EventSource (similar to a gateway; sends messages to the event bus; a minimal webhook EventSource is sketched after this list)
  • EventBus (the event message queue; the default implementation uses NATS Streaming, a high-performance distributed messaging system. Note that NATS Streaming is deprecated, with support ending in 2023, so architectural changes can be expected here)
  • EventSensor (subscribes to the event bus, parameterizes events, and filters them)
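
As a rough illustration of the EventSource piece, the manifest below follows the upstream webhook example; field names may differ slightly between argo-events versions:

# A webhook EventSource: exposes port 12000 and turns POSTs to /example into events
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: webhook
spec:
  service:
    ports:
      - port: 12000
        targetPort: 12000
  webhook:
    # the key "example" becomes the event name
    example:
      port: "12000"
      endpoint: /example
      method: POST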

Deploying Argo Events

Deploy argo-events:

kubectl create ns argo-events
kubectl apply -n argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.2.3/manifests/install.yaml

Deploy argo-eventbus:

kubectl apply -n argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/stable/examples/eventbus/native.yaml

RBAC Account Authorization

Create operate-workflow-sa account

Grant operate-workflow-sa permission to create Argo Workflows within the argo-events namespace — required for EventSensor to automatically create workflows later.
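
A sketch of the ServiceAccount plus Role/RoleBinding using plain kubectl; the object names and verb list here are illustrative, and the Argo Events examples ship an equivalent YAML manifest:

# ServiceAccount used by the Sensor to submit Workflows
kubectl create serviceaccount operate-workflow-sa -n argo-events
# Role allowing the creation of Argo Workflow objects in argo-events
kubectl create role operate-workflow-role -n argo-events \
  --verb=create,get,list,watch --resource=workflows.argoproj.io
kubectl create rolebinding operate-workflow-rb -n argo-events \
  --role=operate-workflow-role \
  --serviceaccount=argo-events:operate-workflow-sa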

Argo Workflow Practice I: Installation and Deployment

Introduction & Architecture

Argo Workflows is an open-source, container-native workflow engine designed to orchestrate parallel jobs on Kubernetes. It is implemented entirely with Kubernetes Custom Resource Definitions (CRDs), including Workflow, Workflow Template, and Cron Workflow.

What Can Argo Workflows Do?

  • Define workflows where each step is a container.
  • Model multi-step workflows as a sequence of tasks or capture task dependencies using Directed Acyclic Graphs (DAGs).
  • Easily run compute-intensive jobs—such as machine learning or data processing—on Kubernetes within a short time frame.
  • Run CI/CD pipelines natively on Kubernetes without configuring complex software development tooling.

Key Features of Argo Workflows:

  • Workflow: Orchestrates multiple workflow templates with customizable execution order.
  • Workflow Template: A reusable template definition for workflows; it can be referenced by other workflows or templates within the same namespace.
  • Cluster Workflow Template: A cluster-scoped workflow template accessible across all namespaces via ClusterRole permissions.
  • Cron Workflow: Scheduled workflow type, equivalent to an advanced version of Kubernetes CronJob.
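
To make the Workflow resource described above concrete, here is the canonical hello-world example from the Argo Workflows docs:

# hello-world.yaml: a single-step Workflow running one container
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      container:
        image: docker/whalesay
        command: [cowsay]
        args: ["hello world"]

Once the installation below is done, it can be submitted with "argo submit hello-world.yaml" or "kubectl create -f hello-world.yaml"; the generateName field lets you run it repeatedly.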

Installation & Configuration

Install Argo Workflows

We are installing the stable version 2.12.10. The installation process sets up ServiceAccount, Role, ClusterRole, Deployment, and other necessary components.
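
For reference, the installation boils down to two commands; the manifest URL assumes the upstream repository layout for the v2.12.10 tag:

kubectl create namespace argo
# Install the v2.12.10 manifests (URL assumes the upstream repo layout)
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/v2.12.10/manifests/install.yaml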