Bare Metal Cloud Server Quality Control

Overview

The Bare Metal Cloud Server Quality Control process ensures reliability across three stages:

  1. Pre-deployment stress testing – validates hardware stability before shipment.

  2. Production-entry acceptance testing – verifies all components before joining the production pool.

  3. In-system monitoring and alerting – maintains long-term health and operational consistency.

1. Server Stress Testing

When we purchase new servers or when a server is first deployed at an edge node, a comprehensive stress testing procedure is conducted to ensure hardware stability and reliability.

The testing focuses on memory integrity and system-level performance stress. We primarily use the following tools:

1.1 MemTest86 — Memory Stability and Error Detection

  • Purpose: MemTest86 is a standalone memory testing utility that runs directly from boot, without requiring an operating system. It detects bit errors, latency anomalies, and compatibility issues in memory modules.

  • Integrated testing: Memory is tested together with the full server hardware, instead of isolating memory modules on dedicated test rigs. This approach better reflects real-world stability by capturing interactions among memory, motherboard, CPU, and power components.

  • Testing phases:

    • Each run must complete at least one full test phase, ensuring all address ranges and access patterns are validated.

    • Test results are recorded and archived based on the serial number (SN) of each memory module for traceability and long-term quality tracking.

  • Dual testing stages:

    • Pre-shipment Test – conducted after server assembly and before shipment to ensure no defective memory modules are shipped.

    • Post-rack Test – performed again after rack installation to verify integrity after transportation, especially critical in distributed edge environments.

  • Official website: https://www.memtest86.com.
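
The per-module serial numbers used to archive these results can be read from SMBIOS data. Below is a minimal sketch of that collection step, assuming a Linux environment with dmidecode available and root privileges; how the results are actually stored against each SN is omitted.

```python
import re
import subprocess

# Read SMBIOS memory-device records (requires root and the dmidecode tool).
out = subprocess.run(["dmidecode", "-t", "memory"],
                     capture_output=True, text=True, check=True).stdout

# Collect the serial number of every populated DIMM slot so that MemTest86
# results can be archived per module (storage/upload step omitted).
serials = []
for block in out.split("\n\n"):
    if "Memory Device" not in block:
        continue
    size = re.search(r"Size:\s*(.+)", block)
    sn = re.search(r"Serial Number:\s*(.+)", block)
    if size and "No Module Installed" not in size.group(1) and sn:
        serials.append(sn.group(1).strip())

print("DIMM serial numbers:", serials)
```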

1.2 stress-ng — CPU, Memory, Disk, and I/O Stress Testing

  • Purpose: stress-ng (https://manpages.ubuntu.com/manpages/focal/man1/stress-ng.1.html) is a comprehensive Linux stress-testing utility supporting hundreds of stressors to test CPU, memory, disk, and I/O subsystems.

  • Duration: Each server undergoes stress testing for no less than 4 hours; high-performance or GPU nodes may extend to 24 hours.

  • Metrics monitored during pre-shipment testing:

    • CPU temperature profile

    • Fan speed and control response

    • Power supply stability (voltage and wattage)

  • Goal: Identify early signs of thermal, fan, or power instability to ensure only fully stable units enter production.
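
A minimal sketch of such a burn-in run, assuming stress-ng is installed; the stressor mix below is illustrative rather than our exact production parameters, and the timeout can be raised to 24h for high-performance or GPU nodes.

```python
import subprocess

# Illustrative 4-hour burn-in: CPU, memory, and disk/I/O stressors in parallel.
# "--cpu 0" starts one CPU stressor per online core; "--verify" makes stress-ng
# check results where a stressor supports verification.
cmd = [
    "stress-ng",
    "--cpu", "0",
    "--vm", "2", "--vm-bytes", "75%",
    "--hdd", "2",
    "--iomix", "2",
    "--timeout", "4h",
    "--verify",
    "--metrics-brief",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    raise SystemExit("stress-ng reported failures:\n" + result.stderr)
```

In practice, the temperature, fan, and power metrics listed above are sampled while the stressors run (for example via IPMI, as described in section 3.3).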

2. Post-Deployment Quality Control

After installation in the data center, each server undergoes a full hardware acceptance process before entering the production inventory (production pool).

This ensures the system’s hardware matches the design specification and is in verified, healthy condition.

2.1 CPU Verification

  • Verify that the installed CPU model matches the expected configuration and performance profile.
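
A minimal sketch of this check, assuming the expected model string comes from the configuration record (the value below is a placeholder):

```python
import subprocess

EXPECTED_MODEL = "Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz"  # placeholder value

# lscpu prints a single "Model name:" line describing the installed CPU.
out = subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout
model = next((line.split(":", 1)[1].strip()
              for line in out.splitlines() if line.startswith("Model name")), "")
assert model == EXPECTED_MODEL, f"CPU model mismatch: got {model!r}"
```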

2.2 Memory Verification

  • Confirm total installed memory matches specifications with no missing or extra DIMMs.

  • Validate each DIMM’s capacity and detection accuracy.

  • Check memory symmetry across CPU channels for balanced NUMA performance.

  • Ensure OS-reported capacity equals physical installation.

  • For high-performance configurations, verify memory channels are fully populated.

  • Confirm that all DIMMs share the same brand, frequency, and rank specifications.
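
A minimal sketch of the count, capacity, and uniformity checks, assuming dmidecode is available and run as root; the expected values are placeholders for whatever the configuration record specifies, and the channel-population and NUMA-symmetry checks are omitted for brevity.

```python
import re
import subprocess

EXPECTED_DIMMS = 16       # placeholder: from the configuration record
EXPECTED_TOTAL_GB = 512   # placeholder

out = subprocess.run(["dmidecode", "-t", "memory"],
                     capture_output=True, text=True, check=True).stdout

dimms = []
for block in out.split("\n\n"):
    if "Memory Device" not in block:
        continue
    size = re.search(r"\n\s*Size:\s*(\d+)\s*(MB|GB)", block)
    if not size:  # empty slots report "No Module Installed"
        continue
    gb = int(size.group(1)) / 1024 if size.group(2) == "MB" else int(size.group(1))
    speed = re.search(r"\n\s*Speed:\s*(.+)", block).group(1).strip()
    vendor = re.search(r"\n\s*Manufacturer:\s*(.+)", block).group(1).strip()
    dimms.append((gb, speed, vendor))

assert len(dimms) == EXPECTED_DIMMS, f"DIMM count {len(dimms)} != {EXPECTED_DIMMS}"
assert sum(d[0] for d in dimms) == EXPECTED_TOTAL_GB, "total capacity mismatch"
# Every DIMM must share the same capacity, frequency, and manufacturer.
assert len(set(dimms)) == 1, f"mixed DIMM population: {sorted(set(dimms))}"
```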

2.3 Disk Verification

  • Confirm disk type (SSD/HDD), capacity, and interface to match model specs.

  • Verify the correct disk count.

  • Review SMART attributes and replace any drive below health thresholds.
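
A minimal sketch of the type, capacity, and count checks using lsblk; the expected values are placeholders, and the SMART health review itself is illustrated in the provisioning-phase sketch in section 3.1 rather than repeated here.

```python
import json
import subprocess

EXPECTED = {"count": 4, "min_size_tb": 3.8, "ssd_only": True}  # placeholder spec

out = subprocess.run(
    ["lsblk", "-d", "-b", "--json", "-o", "NAME,TYPE,SIZE,ROTA,TRAN"],
    capture_output=True, text=True, check=True).stdout
disks = [d for d in json.loads(out)["blockdevices"] if d["type"] == "disk"]

assert len(disks) == EXPECTED["count"], f"unexpected disk count: {len(disks)}"
for d in disks:
    size_tb = int(d["size"]) / 1e12
    assert size_tb >= EXPECTED["min_size_tb"], f"{d['name']}: only {size_tb:.2f} TB"
    if EXPECTED["ssd_only"]:
        # rota == 0/"0"/False means non-rotational, i.e. SSD or NVMe.
        assert d["rota"] in (0, "0", False), f"{d['name']} is a rotational disk"
```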

2.4 Server Chassis Verification

  • Confirm that factory models match configuration records.

  • Verify redundant (dual) power supplies.

  • Check BIOS and BMC firmware versions against approved baselines.
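
A minimal sketch of the firmware-baseline portion of this check, assuming local dmidecode and ipmitool access; the approved versions are placeholders, and the chassis-model and PSU-redundancy checks (typically read from the BMC) are omitted here.

```python
import subprocess

APPROVED = {"bios": "2.10.2", "bmc": "5.10"}  # placeholder baseline versions

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# SMBIOS exposes the BIOS version directly (requires root).
bios = run(["dmidecode", "-s", "bios-version"]).strip()

# "ipmitool mc info" prints a "Firmware Revision" line for the BMC.
bmc = next(line.split(":", 1)[1].strip()
           for line in run(["ipmitool", "mc", "info"]).splitlines()
           if line.startswith("Firmware Revision"))

assert bios == APPROVED["bios"], f"BIOS {bios} is not at the approved baseline"
assert bmc == APPROVED["bmc"], f"BMC firmware {bmc} is not at the approved baseline"
```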

2.5 GPU Verification

  • Confirm GPU count matches the defined model.

  • Verify GPU firmware consistency across all devices.
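
A minimal sketch of the count and firmware-consistency checks, assuming NVIDIA GPUs with nvidia-smi installed (the vendor is an illustrative assumption); the expected count is a placeholder.

```python
import subprocess

EXPECTED_GPUS = 8  # placeholder: taken from the node's defined model

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,vbios_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True).stdout
gpus = [tuple(line.split(", ")) for line in out.strip().splitlines()]

assert len(gpus) == EXPECTED_GPUS, f"GPU count {len(gpus)} != {EXPECTED_GPUS}"
# Every GPU should report the same model name and VBIOS/firmware version.
assert len(set(gpus)) == 1, f"inconsistent GPU population: {set(gpus)}"
```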

2.6 Network Adapter Verification

  • Validate the actual negotiated link speed against the bandwidth of the configured model (e.g., a 25G NIC connected to a 10G switch must be labeled as a 10G model).

  • Ensure NIC firmware versions match approved standards.
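
A minimal sketch of the link-speed check using the kernel's sysfs interface (ethtool reports the same value); the interface name and labeled bandwidth are placeholders.

```python
import pathlib

IFACE = "eth0"            # placeholder: the business-facing NIC
LABELED_MBPS = 10_000     # placeholder: the bandwidth the model is labeled with

# The kernel reports the negotiated link speed in Mb/s (-1 if the link is down).
speed = int(pathlib.Path(f"/sys/class/net/{IFACE}/speed").read_text())
assert speed >= LABELED_MBPS, (
    f"{IFACE} negotiated {speed} Mb/s, below the labeled {LABELED_MBPS} Mb/s")
```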

3. Runtime Monitoring

Once in production, continuous quality assurance is maintained through runtime monitoring and lifecycle validation, including:

  1. Hardware validation during provisioning and deprovisioning

  2. Linux-level and IPMI-based hardware telemetry collection

  3. Multi-layer monitoring and alerting

3.1 Provisioning & Deprovisioning Phase – Instance Lifecycle Checks

During every instance creation (provisioning) or instance termination (deprovisioning) via the console, automated hardware checks ensure the node’s reliability before and after user workloads run on it.

Workflow:

  1. Our control plane puts the server into installation mode to initiate automated OS deployment.

  2. Hardware information upload: the system collects hardware data (via lshw or equivalent tools) and uploads it to the central platform.

    1. Includes CPU, memory, disks, NICs, motherboard, and RAID controller details.

    2. Used for conformity validation and alert correlation in the SRE platform.

    3. Because IPMI and Linux expose different hardware-information dimensions, Linux-level data is also collected during this process to complement IPMI telemetry.

    4. For security reasons, no agent is installed; collection occurs periodically, not in real time.

    5. This ensures complete hardware visibility and consistency during each provisioning and deprovisioning cycle.

  3. Disk Health Verification: prevents degraded hardware from entering or remaining in production.

    1. The program invokes tools such as smartctl, MegaCLI, and NVMe utilities to read disk health metrics. If any drive’s health is abnormal, it is marked as degraded: the server is flagged as Installation Failed and added to the Fault Device List, the installation is halted, and the event is reported to the monitoring and alerting systems.
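
A minimal sketch of this health gate for SATA/NVMe devices reachable by smartctl (RAID members behind MegaCLI would need the corresponding controller tool); the device list and the reporting hooks are placeholders.

```python
import json
import subprocess

def disk_is_healthy(device: str) -> bool:
    """SMART overall-health check; smartctl handles both SATA and NVMe devices."""
    out = subprocess.run(["smartctl", "-H", "--json", device],
                         capture_output=True, text=True).stdout
    return json.loads(out).get("smart_status", {}).get("passed", False)

# Placeholder device list; in practice it comes from the uploaded hardware inventory.
degraded = [dev for dev in ("/dev/sda", "/dev/sdb", "/dev/nvme0n1")
            if not disk_is_healthy(dev)]
if degraded:
    # Mark the node as Installation Failed, add it to the Fault Device List,
    # and stop the automated OS deployment (reporting hooks omitted here).
    raise SystemExit(f"degraded disks detected, halting installation: {degraded}")
```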

3.2 Deprovisioning Phase – Recycling Checks

When a server is reclaimed after an instance is released, the same disk health check performed during the provisioning phase is repeated.

3.3 Monitoring Data Collection & Alerting System

Once servers are in full production, observability is maintained across three telemetry layers:

  • IPMI-based hardware monitoring

  • Business IP network monitoring

  • Switch port–level monitoring

All collected telemetry is also displayed in each instance’s Health Dashboard within the console, giving users visibility into key hardware and network metrics.

1. IPMI-Based Monitoring

  • IPMI IP ICMP (Latency / Packet Loss): monitors BMC network connectivity.

  • Sensor Data:

    • Processor – temperature, power, health

    • PSU – voltage, current, redundancy

    • Memory – ECC error count, DIMM health

    • Disk Slot – insertion/removal state, faults

    • Fan – speed and control response

    • Temperature – real-time heat distribution

  • SEL (System Event Log): captures events such as Critical Interrupts, power or fan failures, and triggers alerts automatically.
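
A minimal sketch of an out-of-band collection cycle using ipmitool over the BMC network; the BMC address and credentials are placeholders, and the parsing is deliberately simple.

```python
import subprocess

def ipmi(*args):
    # Out-of-band access to the BMC; host and credentials are placeholders.
    base = ["ipmitool", "-I", "lanplus", "-H", "10.0.0.10",
            "-U", "admin", "-P", "changeme"]
    return subprocess.run(base + list(args),
                          capture_output=True, text=True, check=True).stdout

# Sensor readings (fans, temperatures, voltages, PSU state) with their status field.
suspect = []
for line in ipmi("sdr", "elist").splitlines():
    fields = [f.strip() for f in line.split("|")]
    # fields[2] is the sensor status; "ok" is healthy, "ns" means no reading.
    if len(fields) >= 3 and fields[2] not in ("ok", "ns"):
        suspect.append(line)

# System Event Log: critical interrupts, power or fan failures, etc.
sel_events = ipmi("sel", "list")

if suspect:
    print("sensors needing attention:\n" + "\n".join(suspect))
```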

2. Business IP-Based Monitoring

  • WAN IP ICMP (Latency / Packet Loss): monitors external network reachability and latency. Alerts are triggered if thresholds are exceeded.
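
A minimal sketch of the latency / packet-loss probe; the target address and thresholds are placeholders for the per-node values used in production.

```python
import re
import subprocess

HOST = "203.0.113.10"                      # placeholder business (WAN) IP
LOSS_LIMIT_PCT, RTT_LIMIT_MS = 1.0, 50.0   # placeholder alert thresholds

# 20 probes, 0.2 s apart; parse the summary printed by iputils ping.
out = subprocess.run(["ping", "-c", "20", "-i", "0.2", HOST],
                     capture_output=True, text=True).stdout
loss = float(re.search(r"([\d.]+)% packet loss", out).group(1))
rtt = re.search(r"= [\d.]+/([\d.]+)/", out)            # min/avg/max/mdev line
avg_rtt = float(rtt.group(1)) if rtt else float("inf")

if loss > LOSS_LIMIT_PCT or avg_rtt > RTT_LIMIT_MS:
    print(f"ALERT: {HOST} loss={loss}% avg_rtt={avg_rtt} ms")
```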

3. Switch Port-Level Monitoring

  • Incoming / Outgoing Discards: detect packet loss or buffer overflows.

  • Incoming / Outgoing Errors: identify physical link faults (CRC, alignment errors).
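
A minimal sketch of polling these counters from the switch with the net-snmp snmpget CLI and the standard IF-MIB OIDs; the switch address, community string, and interface index are placeholders.

```python
import subprocess

SWITCH = "192.0.2.2"   # placeholder switch management IP
COMMUNITY = "public"   # placeholder SNMPv2c community string
IF_INDEX = "10"        # placeholder ifIndex of the server-facing port

# Standard IF-MIB counters for per-port discards and errors.
OIDS = {
    "ifInDiscards":  "1.3.6.1.2.1.2.2.1.13." + IF_INDEX,
    "ifInErrors":    "1.3.6.1.2.1.2.2.1.14." + IF_INDEX,
    "ifOutDiscards": "1.3.6.1.2.1.2.2.1.19." + IF_INDEX,
    "ifOutErrors":   "1.3.6.1.2.1.2.2.1.20." + IF_INDEX,
}

for name, oid in OIDS.items():
    # -Oqv prints only the counter value.
    value = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH, oid],
        capture_output=True, text=True, check=True).stdout.strip()
    print(name, value)
```

A typical setup samples these counters periodically and alerts on their rate of increase rather than on absolute values.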
