Platform Architecture

Technology Selection

OpenObserve (formerly ZincObserve, often abbreviated O2) is an open-source observability platform with significant advantages over traditional ELK stacks:

  • Low resource consumption: Memory usage is only 1/10 of Elasticsearch, with 90% storage space saved
  • High performance: Written in Rust, a single node can process 5TB+ logs/day
  • Full-featured: Integrates logs, metrics, and distributed tracing, replacing Grafana+Loki+Tempo
  • Strong compatibility: Supports Elasticsearch API, Prometheus API, and OpenTelemetry
  • Easy deployment: Single binary file with no external dependencies

Fluent Bit is a cloud-native log collector:

  • Lightweight: Memory usage <1MB, suitable for embedded and container environments
  • High performance: Multi-threaded asynchronous processing, throughput up to 100MB/s
  • Rich plugins: Supports 200+ input/output/filter/parser plugins
  • Kubernetes native: Automatic Pod discovery, label injection, metadata association

Architecture Design

┌─────────────────────────────────────────────────────────────────┐
│                   Business Application Layer                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Java App │  │Go Service│  │Nginx Logs│  │ Sys Logs │         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│  stdout/stderr  stdout/stderr   /var/log/*   /var/log/*         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
        │             │             │             │
┌───────┴─────────────┴─────────────┴─────────────┴───────────────┐
│                      Kubernetes Cluster                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │              DaemonSet: Fluent Bit                      │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │   │
│  │  │  Tail Input  │→ │  Filter/Parse│→ │  Buffer      │  │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  │   │
│  └─────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────┘
                               │ HTTP/HTTPS
┌──────────────────────────────┴──────────────────────────────────┐
│                    OpenObserve Cluster                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │  Ingest Node │→ │  Data Storage│→ │  Query Engine│          │
│  │  (Write Opt) │  │  (Parquet)   │  │  (Full Text) │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
└─────────────────────────────────────────────────────────────────┘

Component Versions

Component     Version   Description
OpenObserve   0.15.0+   Stable release; supports distributed mode
Fluent Bit    4.0.0+    Latest stable release
Kubernetes    1.24+     Production environment
Helm          3.0+      Deployment tool

OpenObserve Deployment

Prerequisites

# 1. Create Namespace
kubectl create namespace logging

# 2. Add OpenObserve Helm repository
helm repo add openobserve https://charts.openobserve.ai
helm repo update

# 3. Create PV/PVC (use distributed storage for production)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openobserve-data
  namespace: logging
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: zstack-rbd  # Adjust according to actual SC
  resources:
    requests:
      storage: 500Gi
EOF

Helm Deployment of OpenObserve

Create values file openobserve-values.yaml:

# openobserve-values.yaml

image:
  repository: public.ecr.aws/zinclabs/openobserve
  tag: "0.15.1"
  pullPolicy: IfNotPresent

replicaCount: 3  # 3 nodes recommended for HA in production

# Resource configuration (adjust based on log volume)
resources:
  limits:
    cpu: "8"
    memory: 16Gi
  requests:
    cpu: "4"
    memory: 8Gi

# Data persistence
persistence:
  enabled: true
  existingClaim: openobserve-data
  mountPath: /data

# Environment variables
env:
  - name: ZO_ROOT_USER_EMAIL
    value: "admin@example.com"
  - name: ZO_ROOT_USER_PASSWORD
    valueFrom:
      secretKeyRef:
        name: openobserve-secret
        key: password
  - name: ZO_DATA_DIR
    value: "/data"
  - name: ZO_HTTP_PORT
    value: "5080"
  - name: ZO_MEMORY_CACHE_ENABLED
    value: "true"
  - name: ZO_MEMORY_CACHE_MAX_SIZE
    value: "4096"  # MB
  - name: ZO_COMPRESSION_ENABLED
    value: "true"
  - name: ZO_COMPRESSION_FORMAT
    value: "zstd"
  - name: ZO_PARQUET_COMPRESSION
    value: "zstd"
  - name: ZO_META_STORE
    value: "sqlite"  # Use PostgreSQL for production
  - name: ZO_METRICS_ENABLED
    value: "true"

# Configure PostgreSQL metadata store (recommended for production)
envFrom:
  - secretRef:
      name: postgres-connection

# Service configuration
service:
  type: ClusterIP
  port: 5080
  targetPort: 5080
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"

# Ingress configuration
ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  hosts:
    - host: logs.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: logs-tls
      hosts:
        - logs.example.com

# Pod scheduling
nodeSelector: {}
tolerations: []
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - openobserve
          topologyKey: kubernetes.io/hostname

# Monitoring configuration
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    namespace: logging

Deploy OpenObserve

# 1. Create password Secret
kubectl create secret generic openobserve-secret \
  --from-literal=password='your-secure-password' \
  -n logging

# 2. Deploy
helm upgrade --install openobserve openobserve/openobserve \
  -n logging \
  -f openobserve-values.yaml \
  --version 0.10.2 \
  --timeout 15m

# 3. Wait for Pods to be ready
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=openobserve \
  -n logging \
  --timeout=300s

# 4. Check status
kubectl get pods -n logging
kubectl logs -f deployment/openobserve -n logging

# 5. Access Web UI
echo "Access URL: https://logs.example.com"
echo "Default user: admin@example.com"

Fluent Bit Deployment

Install Helm Chart

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm pull fluent/fluent-bit --version 0.50.0
helm install fluent-bit ./fluent-bit-0.50.0.tgz --namespace uganda-prod

Adjust DaemonSet Configuration

spec:
  volumes:
    - name: config
      configMap:
        name: fluent-bit
        defaultMode: 420
  containers:
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi
    volumeMounts:
      - name: config
        mountPath: /fluent-bit/etc/conf/custom_parsers.conf
        subPath: custom_parsers.conf
      - name: config
        mountPath: /fluent-bit/etc/conf/multiline_parsers.conf
        subPath: multiline_parsers.conf

Fluent Bit Collection Rule Adjustments

Complete ConfigMap Configuration

kind: ConfigMap
apiVersion: v1
metadata:
  name: fluent-bit-config
  namespace: uganda-prod
  labels:
    app: fluent-bit
    tier: logging
data:
  # ---------------------------------------------------------------------------
  # 1. Custom Parsers
  # Used to extract specific log fields, e.g., extract log level (INFO, ERROR, etc.) from Java logs
  # ---------------------------------------------------------------------------
  custom_parsers.conf: |
    [PARSER]
        Name        java_log_level
        Format      regex
        # Match format: 2023-10-27 10:00:00.123 INFO ...
        # The 'level' capture group will contain the log level
        Regex       ^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}\s+(?<level>[A-Z]+)    

  # ---------------------------------------------------------------------------
  # 2. Multiline Parsers
  # Specifically for handling multi-line logs like Java stack traces to prevent stacks from being split into multiple records
  # ---------------------------------------------------------------------------
  multiline_parsers.conf: |
    [MULTILINE_PARSER]
        Name          java_md_multiline
        Type          regex
        # Rule 1: If line starts with date (e.g., 2023-...), consider it start of new log
        rule      "start_state"   "/^\d{4}-\d{2}-\d{2}/"           "cont"
        # Rule 2: If line doesn't start with date, consider it continuation of previous line (stack part)
        rule      "cont"        "/^(?!\d{4}-\d{2}-\d{2}).+/"     "cont"
        # Flush timeout (seconds), force output current buffered multi-line log after timeout
        flush_timeout 5    

  # ---------------------------------------------------------------------------
  # 3. Main Configuration File
  # ---------------------------------------------------------------------------
  fluent-bit.conf: |
    [SERVICE]
        Daemon              Off
        Flush               1
        Log_Level           info
        # Load standard and custom parsers
        Parsers_File        /fluent-bit/etc/conf/parsers.conf
        Parsers_File        /fluent-bit/etc/conf/custom_parsers.conf
        Parsers_File        /fluent-bit/etc/conf/multiline_parsers.conf
        
        # Enable built-in HTTP server for health checks and metrics exposure (/api/v1/metrics)
        HTTP_Server         On
        HTTP_Listen         0.0.0.0
        HTTP_Port           2020
        Health_Check        On

        # File system buffer configuration (prevent data loss when backend is unavailable)
        storage.path              /var/log/flb_storage
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 200M

    # -----------------------------------------------------------------------
    # [INPUT] Tail file collection
    # Collect standard container logs generated by Docker/Containerd
    # -----------------------------------------------------------------------
    [INPUT]
        Name                tail
        Path                /var/log/containers/*.log
        # Enable built-in multiline parsing (docker, cri format)
        multiline.parser    docker, cri
        Tag                 kube.*
        Mem_Buf_Limit       500MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        # Persist offset database, resume reading from checkpoint after restart
        DB                  /var/log/flb_kube.db
        DB.Sync             Normal
        Rotate_Wait         30
        Read_from_Head      Off

    # -----------------------------------------------------------------------
    # [FILTER] Stage 1: Multiline merging
    # Merge multiline logs before getting K8s metadata to ensure stack traces are handled as single records
    # -----------------------------------------------------------------------
    [FILTER]
        Name                multiline
        Match               kube.*
        multiline.key_content log
        multiline.parser    java_md_multiline

    # -----------------------------------------------------------------------
    # [FILTER] Stage 2: K8s metadata enhancement
    # Call K8s API to get Pod details (Namespace, Pod Name, Labels, etc.)
    # -----------------------------------------------------------------------
    [FILTER]
        Name                kubernetes
        Match               kube.*
        # Don't try to parse the log field as JSON (avoids overhead and parse errors)
        Merge_Log           Off
        # Keep the original log field
        Keep_Log            On
        # Allow a Pod annotation to specify the parser
        K8S-Logging.Parser  On
        # Don't exclude logs carrying the exclude annotation
        K8S-Logging.Exclude Off

    # -----------------------------------------------------------------------
    # [FILTER] Stage 3: Log content parsing
    # Use custom parser to extract 'level' field
    # -----------------------------------------------------------------------
    [FILTER]
        Name                parser
        Match               kube.*
        Key_Name            log
        Parser              java_log_level
        # Keep all other fields of the record
        Reserve_Data        On
        # Keep the original 'log' key after parsing
        Preserve_Key        On

    # -----------------------------------------------------------------------
    # [FILTER] Stage 4: Field cleaning
    # Copy extracted level field, remove unnecessary temporary fields
    # -----------------------------------------------------------------------
    [FILTER]
        Name                modify
        Match               kube.*
        Copy                level level
        Remove              _p
        Remove              stream
        Remove              time

    # -----------------------------------------------------------------------
    # [FILTER] Stage 5: Structure restructuring (Nest Lift)
    # Lift fields under kubernetes object to top level with 'k8s_' prefix
    # Purpose: Flatten data structure for easier rule matching
    # -----------------------------------------------------------------------
    [FILTER]
        Name                nest
        Match               kube.*
        Operation           lift
        Nested_under        kubernetes
        Add_prefix          k8s_

    # -----------------------------------------------------------------------
    # [FILTER] Stage 6: Remove redundant metadata
    # Delete large fields not needed to be sent to backend (e.g., annotations, docker_id)
    # -----------------------------------------------------------------------
    [FILTER]
        Name                modify
        Match               kube.*
        Remove              k8s_annotations
        Remove              k8s_docker_id
        Remove              k8s_container_hash

    # -----------------------------------------------------------------------
    # [FILTER] Stage 7: Structure restructuring (Nest Nest)
    # Repackage all 'k8s_' prefixed fields back under 'kubernetes' object, removing prefix
    # Purpose: Restore clean nested structure while cleaning up useless fields
    # -----------------------------------------------------------------------
    [FILTER]
        Name                nest
        Match               kube.*
        Operation           nest
        Wildcard            k8s_*
        Nested_under        kubernetes
        Remove_prefix       k8s_

    # =======================================================================
    # [FILTER] Stage 8: Dynamic routing (Rewrite Tag) - First hop: Split by environment
    # Modify Tag based on Namespace name to route logs to different processing streams
    # Syntax: Rule $field_name regex newTag keep_original(boolean)
    # =======================================================================
    
    # Route: uganda-uat environment
    [FILTER]
        Name                rewrite_tag
        Match               kube.*
        Rule                $kubernetes['namespace_name'] ^uganda-uat$ uganda-uat.temp false
        Emitter_Name        re_emitted_uganda-uat-temp

    # Route: uganda-test environment
    [FILTER]
        Name                rewrite_tag
        Match               kube.*
        Rule                $kubernetes['namespace_name'] ^uganda-test$ uganda-test.temp false
        Emitter_Name        re_emitted_uganda-test-temp

    # Route: uganda-prod environment (current deployment namespace)
    [FILTER]
        Name                rewrite_tag
        Match               kube.*
        Rule                $kubernetes['namespace_name'] ^uganda-prod$ uganda-prod.temp false
        Emitter_Name        re_emitted_uganda-prod-temp

    # Route: uganda-offline environment
    [FILTER]
        Name                rewrite_tag
        Match               kube.*
        Rule                $kubernetes['namespace_name'] ^uganda-offline$ uganda-offline.temp false
        Emitter_Name        re_emitted_uganda-offline-temp

    # =======================================================================
    # [FILTER] Stage 9: Dynamic routing (Rewrite Tag) - Second hop: Split by service
    # Secondary routing for specific containers in specific environments for fine-grained index isolation
    # =======================================================================

    # Example: UAT environment -> lms-backend service
    [FILTER]
        Name                rewrite_tag
        Match               uganda-uat.temp
        Rule                $kubernetes['container_name'] ^lms-backend$ uganda-uat-lms-backend false
        Emitter_Name        re_emitted_uganda-uat-lms-backend

    # Example: UAT environment -> other-service service (needs complete configuration)
    [FILTER]
        Name                rewrite_tag
        Match               uganda-uat.temp
        Rule                $kubernetes['container_name'] ^other-service$ uganda-uat-other-service false
        Emitter_Name        re_emitted_uganda-uat-other-service

    # Note: Production and other environments need similar container name filtering rules added
    # For example:
    # [FILTER]
    #     Name rewrite_tag
    #     Match uganda-prod.temp
    #     Rule $kubernetes['container_name'] ^payment-service$ uganda-prod-payment-service false
    #     Emitter_Name re_emitted_uganda-prod-payment-service

    # =======================================================================
    # [OUTPUT] Output plugin configuration
    # Send filtered logs to OpenObserve
    # =======================================================================

    # Output: UAT LMS Backend
    [OUTPUT]
        Name                http
        Match               uganda-uat-lms-backend
        URI                 /api/39NVPcXSEBOwGM5UnceQ35hQFNB/lms_backend/_json
        Host                openobserve.uganda-uat.svc.cluster.local
        Port                5080
        tls                 Off
        Format              json
        Json_date_key       _timestamp
        Json_date_format    iso8601
        
        HTTP_User           ops@test.com
        HTTP_Passwd         ROPe50N4BJjovJiT 
        
        compress            gzip
        # 'False' disables the retry limit (retry until successful)
        Retry_Limit         False
        net.connect_timeout 10
        net.io_timeout      30

    # Output: UAT Other Service (example)
    [OUTPUT]
        Name                http
        Match               uganda-uat-other-service
        URI                 /api/39NVPcXSEBOwGM5UnceQ35hQFNB/other_service/_json
        Host                openobserve.uganda-uat.svc.cluster.local
        Port                5080
        tls                 Off
        Format              json
        Json_date_key       _timestamp
        Json_date_format    iso8601
        HTTP_User           devops@test.com
        HTTP_Passwd         ROPe50N4BJjovJiT
        compress            gzip
        Retry_Limit         False

    # Default output (optional): Capture logs that don't match any specific rules to prevent data loss
    # [OUTPUT]
    #     Name http
    #     Match uganda-prod.temp
    #     URI /api/.../default/_json
    #     ...    

Configuration Explanation

  1. Routing: rewrite_tag filters split logs by Namespace and container name
  2. Output: Filtered logs are sent over HTTP to OpenObserve in each environment
  3. Multiline Logs: The multiline filter merges Java exception stacks into single records
  4. Transport Optimization: gzip compression, retry on failure, and TCP timeout configuration
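The java_log_level parser's regex can be sanity-checked locally before shipping the ConfigMap. A quick Python sketch (Python's named-group syntax `(?P<...>)` differs slightly from the `(?<...>)` form Fluent Bit uses; the pattern itself is the same):

```python
import re

# Same pattern as the java_log_level parser, in Python named-group syntax
pattern = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}\s+(?P<level>[A-Z]+)")

m = pattern.match("2023-10-27 10:00:00.123 INFO  Application started")
level = m.group("level") if m else None   # extracts "INFO"

# A stack-trace continuation line has no leading timestamp, so no match
cont = pattern.match("    at com.example.Svc.run(Svc.java:42)")
```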

Multiline Merge Rule Explanation

State         Matching Rule                 Description
start_state   /^\d{4}-\d{2}-\d{2}/          Starts with a date → new log begins, enter cont state
cont          /^(?!\d{4}-\d{2}-\d{2}).+/    Doesn't start with a date → append to the previous log
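The two-state machine can be simulated in a few lines of Python (an illustrative sketch, not Fluent Bit's actual implementation):

```python
import re

START = re.compile(r"^\d{4}-\d{2}-\d{2}")

def merge_multiline(lines):
    """Merge continuation lines (e.g. Java stack traces) into the
    preceding dated line, mimicking the start_state/cont rules."""
    records = []
    for line in lines:
        if START.match(line) or not records:
            records.append(line)           # start_state: a new record begins
        else:
            records[-1] += "\n" + line     # cont: append to the previous record
    return records

logs = [
    "2023-10-27 10:00:00.123 ERROR failed to charge order",
    "java.lang.NullPointerException: null",
    "    at com.example.PaymentService.charge(PaymentService.java:42)",
    "2023-10-27 10:00:01.456 INFO retry scheduled",
]
merged = merge_multiline(logs)   # the stack trace collapses into record 1
```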

Note: After modifying the ConfigMap, the fluent-bit service needs to be restarted.

Data Management

Create Data Stream

# Method 1: Create via API
curl -X POST "https://logs.example.com/api/demo/streams" \
  -u "admin@example.com:password" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "app-logs",
    "storage_type": "memory",
    "stream_type": "logs"
  }'

# Method 2: Auto-create (first write auto-creates, we use this method)
# Fluent Bit will auto-create new stream on first write

Set Data Retention Policy

OpenObserve’s dashboard allows setting retention policies per stream (index). The main strategies are:

  • Time-based retention
  • Size-based retention
  • Hybrid strategy (whichever of the time or size limit is reached first)

We use the hybrid approach, set to 3650 days or 30TB: since this is a financial service, logs are retained for 10 years.
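To sanity-check which limit of a hybrid policy binds first, a rough back-of-the-envelope calculation helps (illustrative only; the ingest rates and compression ratio below are assumptions, not measurements):

```python
def retention_trigger_days(daily_raw_gb, compression_ratio,
                           max_days=3650, max_tb=30):
    """Days of data retained before either cap of a hybrid
    (time-or-size) policy is reached."""
    stored_tb_per_day = daily_raw_gb / compression_ratio / 1024
    days_to_size_cap = max_tb / stored_tb_per_day
    return min(max_days, int(days_to_size_cap))

# At 100 GB/day raw and 10:1 compression, the 30 TB cap binds first
heavy = retention_trigger_days(daily_raw_gb=100, compression_ratio=10)
# At 10 GB/day the full 10-year window fits comfortably
light = retention_trigger_days(daily_raw_gb=10, compression_ratio=10)
```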

Data Archive Configuration

# Enable S3 archiving
env:
  - name: ZO_S3_STORE_ENABLED
    value: "true"
  - name: ZO_S3_STORE_BUCKET
    value: "logs-archive"
  - name: ZO_S3_STORE_REGION
    value: "us-east-1"
  - name: ZO_S3_STORE_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: s3-credentials
        key: access-key
  - name: ZO_S3_STORE_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: s3-credentials
        key: secret-key

Data Compression

# OpenObserve auto-compression
env:
  - name: ZO_COMPRESSION_ENABLED
    value: "true"
  - name: ZO_COMPRESSION_FORMAT
    value: "zstd"  # zstd/gzip/snappy
  - name: ZO_COMPRESSION_LEVEL
    value: "3"  # 1-19, higher compression ratio but higher CPU consumption
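zstd bindings are not in Python's standard library, but the CPU-vs-size tradeoff that the compression level controls can be illustrated with stdlib zlib (absolute numbers differ from zstd; the principle is the same):

```python
import zlib

# Repetitive log lines compress extremely well at any level
data = ("2023-10-27 10:00:00.123 INFO GET /api/orders 200 12ms\n" * 1000).encode()

fast  = zlib.compress(data, level=1)   # low CPU, larger output
small = zlib.compress(data, level=9)   # high CPU, smaller output

ratio = len(data) / len(small)         # well above 10x on data like this
```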

Query and Analysis

OpenObserve provides a powerful query engine supporting SQL mode as well as built-in full-text search functions (such as match_all and str_match). The following introduces basic queries and aggregation queries for different scenarios to help you quickly locate key logs.

💡 Core Tips

  • Mode Switch: Aggregation queries are recommended to be executed in SQL mode.
  • Table Name Convention: Stream names (table names) must be enclosed in double quotes, for example "lms".
  • Time Range: All queries are limited by the time selector in the upper right corner of the UI, please ensure the time range covers the target data.

1. Basic Queries 🔍

Quickly locate a single log or specific event.

1.1 Full-text Search

Automatically scans all text fields, suitable for fuzzy searching.

match_all('timeout')
  • Scenario: Not sure which field the error appears in, quickly search for keywords globally.

1.2 Field Matching

Perform exact or substring matching on specific fields for better performance.

str_match(log, 'tenant-backend')
  • Scenario: Filter logs for specific services (like tenant-backend) or specific levels (str_match(level, 'ERROR')).

1.3 Multi-condition Combination Query

Use logical operators (AND, OR, NOT) to build complex filtering rules.

(level = 'INFO' OR level = 'ERROR') 
AND str_match(log, 'AuthInterceptor')
  • Scenario: Only view logs from the authentication module, including both normal and error processes.

2. Aggregation Queries 🚀

Through SQL’s GROUP BY and aggregate functions (COUNT, SUM, AVG, MAX, MIN), massive logs can be converted into statistical chart data.

2.1 Count by Log Level

Count the number of different log levels to quickly determine system health.

SELECT level, COUNT(*) as log_count 
FROM "lms" 
GROUP BY level 
ORDER BY log_count DESC
  • Output Example (ordered by log_count DESC):

      level   log_count
      INFO    15420
      WARN    105
      ERROR   32

2.2 Time Window Trend Analysis

Combine with the date_trunc function to count logs within a time unit, used for drawing trend charts.

SELECT date_trunc('minute', _timestamp) as time_window, COUNT(*) as hits
FROM "lms"
WHERE str_match(log, 'payment')
GROUP BY time_window
ORDER BY time_window ASC
  • Scenario: Observe the fluctuation of “payment” related logs per minute to identify sudden traffic or failure time points.
  • Note: 'minute' can be replaced with 'hour', 'day', etc.

2.3 Group by Service/Container

Analyze which microservice generates the most logs or errors.

SELECT kubernetes.container_name, COUNT(*) as total_logs
FROM "lms"
WHERE level = 'ERROR'
GROUP BY kubernetes.container_name
ORDER BY total_logs DESC
LIMIT 5
  • Scenario: Find the top 5 containers generating the most errors, prioritize troubleshooting.

2.4 Multi-dimensional Cross Analysis

Group by two or more dimensions simultaneously for deep dive analysis.

SELECT 
  date_trunc('hour', _timestamp) as hour,
  level,
  COUNT(*) as count
FROM "lms"
WHERE str_match(log, 'database')
GROUP BY hour, level
ORDER BY hour ASC, level
  • Scenario: View the distribution of INFO and ERROR database-related logs per hour.

2.5 Numeric Field Statistics

If logs contain extracted numeric fields (like response time duration), mathematical calculations can be performed.

SELECT 
  AVG(duration) as avg_latency,
  MAX(duration) as max_latency,
  MIN(duration) as min_latency
FROM "lms"
WHERE str_match(log, 'API_REQUEST')
  • Scenario: Calculate average and maximum latency for specific API requests to evaluate performance bottlenecks.

3. Advanced Techniques and Best Practices

Technique             Description                                     Example
Limit Result Set      Avoid returning so much data the browser lags   Add LIMIT 100
Deduplication Count   Count unique error message types                COUNT(DISTINCT log)
Alias Optimization    Make output column names more readable          COUNT(*) as "Total Errors"
Null Value Handling   Exclude records with empty fields               WHERE level IS NOT NULL
Regex Matching        More flexible matching than str_match           REGEXP_MATCH(log, 'Error.*\d+')
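These techniques follow standard SQL semantics, so they can be checked against any engine, for example stdlib SQLite (illustrative only; OpenObserve-specific functions such as str_match are not available there):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lms (level TEXT, log TEXT)")
conn.executemany("INSERT INTO lms VALUES (?, ?)", [
    ("ERROR", "Error: upstream timeout 503"),
    ("ERROR", "Error: upstream timeout 503"),   # duplicate message
    ("ERROR", "Error: connection refused"),
    ("INFO",  "request ok"),
    (None,    "line without a parsed level"),
])

# Deduplicated count, null handling and a result limit, per the table above
unique_errors, = conn.execute(
    "SELECT COUNT(DISTINCT log) FROM lms"
    " WHERE level IS NOT NULL AND level = 'ERROR' LIMIT 100"
).fetchone()   # 2 distinct error messages
```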

⚠️ Performance Recommendations

  1. Filter before aggregating: Be sure to narrow down the data range as much as possible in the WHERE clause (specifying time, keywords) before performing GROUP BY, which can significantly improve query speed.
  2. Time granularity: When querying over large time ranges (like 7 days), use date_trunc('hour', ...) or date_trunc('day', ...), avoid using 'second' which causes too many data points.
  3. Field indexing: For fields commonly used in GROUP BY (like level, container_name), it’s recommended to enable indexing in Stream Settings for best performance.

OpenObserve Optimization

# Memory cache
env:
  - name: ZO_MEMORY_CACHE_ENABLED
    value: "true"
  - name: ZO_MEMORY_CACHE_MAX_SIZE
    value: "8192"  # 8GB

# Query cache
  - name: ZO_QUERY_CACHE_ENABLED
    value: "true"
  - name: ZO_QUERY_CACHE_MAX_SIZE
    value: "4096"  # 4GB

# Parallel queries
  - name: ZO_QUERY_WORKER_THREADS
    value: "8"

Storage Optimization

# Parquet columnar storage
env:
  - name: ZO_PARQUET_COMPRESSION
    value: "zstd"
  - name: ZO_PARQUET_ROW_GROUP_SIZE
    value: "1048576"  # 1MB

# Data partitioning
  - name: ZO_PARTITION_ENABLED
    value: "true"
  - name: ZO_PARTITION_FIELDS
    value: "_timestamp,kubernetes.namespace_name"

Monitoring and Alerting

Monitoring covers two parts: the logging stack itself (OpenObserve and Fluent Bit) and the business logs it carries. The services themselves can be scraped directly with Prometheus; for core business order logs, OpenObserve’s built-in alerting mechanism meets the requirements.

Prometheus Metrics Collection

# Fluent Bit ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  endpoints:
    - port: metrics
      interval: 30s

---
# OpenObserve ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openobserve
  namespace: logging
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: openobserve
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: logging-alerts
  namespace: logging
spec:
  groups:
    - name: logging
      rules:
        # Fluent Bit log loss alert
        - alert: FluentBitLogsDropping
          expr: rate(fluentbit_output_proc_records_failed_total[5m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Fluent Bit Log Loss"
            description: "Lost {{ $value }} logs in the last 5 minutes"

        # OpenObserve storage alert
        - alert: OpenObserveDiskSpaceLow
          expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) < 0.1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "OpenObserve Disk Space Low"
            description: "Less than 10% disk space remaining"

        # Log ingestion delay
        - alert: LoggingIngestLag
          expr: openobserve_ingest_delay_seconds > 300
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Log Ingestion Lag"
            description: "Log lag is {{ $value }} seconds"

Business Log Alerts

This involves several key parts:

Alert Message Template

Settings –> Templates –> Add template. Define the Webhook alert message format here; note that the message body must be adapted to the chat tool in use.

Create: cdi-prod-template

    {
      "msgtype": "text",
      "text": {
        "content": "🚨 [Alert Notification]\nAlert Name: {alert_name}\nStream Name: {stream_name}\nSeverity: Error\nLog Info: {rows:1}\nTrigger Time: {alert_trigger_time_str}"
      }
    }

Note: The message format differs between chat tools; the configuration above targets WeChat Work (WeCom). See https://openobserve.ai/docs/user-guide/management/templates/ for guidance on building a template that suits your needs.
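As a sketch of what OpenObserve substitutes into that template, the WeCom payload can be assembled and posted manually (the alert names, row content, and webhook URL below are hypothetical):

```python
import json

def build_wecom_alert(alert_name, stream_name, first_row, trigger_time):
    """Assemble the WeChat Work text-message body produced by the
    template above (field values here are illustrative)."""
    content = ("🚨 [Alert Notification]\n"
               f"Alert Name: {alert_name}\n"
               f"Stream Name: {stream_name}\n"
               "Severity: Error\n"
               f"Log Info: {first_row}\n"
               f"Trigger Time: {trigger_time}")
    return json.dumps({"msgtype": "text", "text": {"content": content}})

payload = build_wecom_alert("lms-error-burst", "lms",
                            "NullPointerException in PaymentService",
                            "2023-10-27 10:00:00")
# POST `payload` to the group bot's webhook URL obtained from WeCom,
# e.g. requests.post(webhook_url, data=payload,
#                    headers={"Content-Type": "application/json"})
```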

Alert Address

Add a message bot to the WeChat Work group to obtain its Webhook push URL, then register that URL in the OpenObserve UI under Settings –> Address –> Add Address.

Alert Rule Configuration

Alert rules should be agreed with the internal R&D teams based on how business logs are matched. Routine alerts should match only ERROR-level logs; for core order and risk-control flows, phone-call alerts are recommended:

(Screenshot: OpenObserve alert rule configuration)

OpenObserve Open Source Edition Risks and Considerations

RBAC Permission Control

Open Source Edition Limitations:

  • No granular RBAC control (supported in the Enterprise edition)
  • All users have the same permissions
  • Cannot isolate data by organization/project

Solutions:

# 1. Use multi-tenancy (Stream-level isolation)
# Create a separate stream for each team
curl -X POST "https://logs.example.com/api/demo/streams" \
  -H "Content-Type: application/json" \
  -d '{"name":"team-a-logs"}'
curl -X POST "https://logs.example.com/api/demo/streams" \
  -H "Content-Type: application/json" \
  -d '{"name":"team-b-logs"}'

# 2. Use a reverse proxy for permission control
# Nginx routes each team to its own stream (add authentication as needed)
location /api/team-a/ {
  auth_basic           "team-a";
  auth_basic_user_file /etc/nginx/team-a.htpasswd;
  proxy_pass           http://openobserve.logging.svc/api/team-a-logs/;
}

# 3. Use API keys for simple authentication
# Generate a separate API key for each team

Since we have many projects, permission rules are usually enforced at the reverse proxy, for example by restricting the source IP and Host header for POST (ingestion) requests.
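As a sketch of such a proxy rule (the CIDR and upstream address are placeholders), nginx's `limit_except` can confine write access to known collector addresses:

```nginx
location /api/ {
    # Reads pass through; writes (POST etc.) only from collector addresses
    limit_except GET HEAD {
        allow 10.0.0.0/8;   # placeholder: cluster / collector CIDR
        deny  all;
    }
    proxy_set_header Host $host;
    proxy_pass http://openobserve.logging.svc;
}
```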

Data Security

Risk Points:

  • Open source edition has no field-level encryption
  • Sensitive information may be stored in plain text
  • Missing audit logs

Best Practices:

# 1. Fluent Bit data masking: drop sensitive keys
# (the modify filter takes one key per Remove entry)
[FILTER]
    Name                modify
    Match               kube.*
    Remove              password
    Remove              token
    Remove              secret
    Remove              key
    Remove              ssn
    Rename              message  log_message

# 2. Use Lua filter for masking
[FILTER]
    Name                lua
    Match               kube.*
    Script              mask.lua
    Call                mask_sensitive

# mask.lua:
function mask_sensitive(tag, timestamp, record)
    if record["message"] ~= nil then
        record["message"] = string.gsub(record["message"], "password=%S+", "password=***")
    end
    return 1, timestamp, record
end
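The Lua pattern can be sanity-checked locally; a Python equivalent of the same substitution (Lua's `%S+` corresponds to `\S+` in Python regex):

```python
import re

def mask_sensitive(message: str) -> str:
    # Mask everything after "password=" up to the next whitespace,
    # mirroring string.gsub(msg, "password=%S+", "password=***") in Lua
    return re.sub(r"password=\S+", "password=***", message)

print(mask_sensitive("login ok user=bob password=hunter2 ip=10.0.0.8"))
# → login ok user=bob password=*** ip=10.0.0.8
```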

High Availability Risks

Single Point of Failure:

  • Open source edition has no automatic failover
  • Node downtime may cause data loss

Solutions:

# 1. Deploy multiple replicas
replicaCount: 3

# 2. Use shared storage
persistence:
  enabled: true
  storageClass: "nfs-client"  # use a distributed storage class with shared access (NFS shown; CephFS also works)

# 3. Configure pod anti-affinity
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - openobserve
        topologyKey: kubernetes.io/hostname

Capacity Planning

Log Volume    Fluent Bit Resources    OpenObserve Resources    Storage Space/Month
10GB/day      100m CPU / 100Mi        2 CPU / 4Gi              100GB
100GB/day     200m CPU / 200Mi        4 CPU / 8Gi              1TB
1TB/day       500m CPU / 500Mi        8 CPU / 16Gi             10TB
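The storage column is consistent with roughly 30-day retention at about a 3:1 compression ratio; both figures are assumptions used here to reproduce the table, not documented constants:

```python
def monthly_storage_gb(daily_gb: float, retention_days: int = 30,
                       compression_ratio: float = 3.0) -> float:
    """Estimated monthly storage = daily ingest x retention / compression."""
    return daily_gb * retention_days / compression_ratio

for daily in (10, 100, 1024):  # GB/day, matching the table rows
    print(f"{daily} GB/day -> {monthly_storage_gb(daily):.0f} GB/month")
```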

Summary

The log platform solution based on OpenObserve and Fluent Bit has the following advantages:

  1. Cost Advantage: Compared to ELK solutions, storage costs are reduced by 90% and computing resources by 70%
  2. High Performance: A single node supports 5TB+/day log ingestion with query responses <100ms
  3. Simple and Easy to Use: Deployment time <30 minutes with low learning curve
  4. Cloud Native: Kubernetes native integration with automatic scaling

It has been running stably in production for half a year with a 5TB+ data volume, and query responses hold steady within 200ms, which meets most business needs. If you have hard requirements for RBAC, buy the Enterprise edition or choose a different solution.