Notice The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS"...
This document describes how to collect logs, diagnose faults, upgrade software, perform preventive maintenance and common operations, and collect the information required to troubleshoot Huawei E9000, E6000, X6000, X8000, X6800, rack, heterogeneous, Atlas 800 AI inference (model 3010), and Atlas 800 AI training (model 9010) servers.
Added information about the FusionServer Pro G5500 server.
2017-12-14: This issue is the ninth official release. Added a description of the CX916 switch module of the E9000 server.
2017-08-08: This issue is the eighth official release. Modified 4.4.1.1 Connecting a PC to the Ethernet Switching Plane.
4.1 Collecting Basic Information
4.2 Collecting OS Logs
4.3 Collecting Hardware Logs
4.4 Collecting Switch Module Logs (for E9000+MM910)
4.4.1 Preparing for Log Collection
4.4.1.1 Connecting a PC to the Ethernet Switching Plane
4.4.1.2 Querying the Software Version of the Ethernet Switching Plane
4.4.2 Collecting Switch Module Logs...
Take protective measures against radio interference before operating the device in residential areas. Personal Safety ● Only personnel certified or authorized by Huawei are allowed to install equipment or its components. ● Discontinue any dangerous operations and take protective measures. Report anything that could cause personal injury or equipment damage to a project supervisor.
Management management and routine management regulations Regulations maintenance. during onsite maintenance. Software Tools Table 3-2 lists the software tools required for routine maintenance of Huawei servers. Table 3-2 Tools for routine maintenance Tool Server and Description Version FusionServer Huawei-...
FC switching plane of a switch versions module. You can obtain the tool from the Internet. Hardware Tools Table 3-3 lists the hardware tools required for routine maintenance of Huawei servers. Table 3-3 Hardware tools required for routine maintenance Tool Description Floating nut hook Used to guide floating nuts to the holes in the mounting bars of a rack.
Use SmartKit to collect hardware logs and Windows/Linux logs. For details, see the FusionServer Tools 2.0 SmartKit User Guide.
4.4 Collecting Switch Module Logs (for E9000+MM910)
4.4.1 Preparing for Log Collection
4.4.1.1 Connecting a PC to the Ethernet Switching Plane
Connect a PC to the Ethernet switching plane before logging in to the switching plane.
Plane
4.4.2.2 Using SmartKit to Collect Switch Module Logs
For details about how to use SmartKit to collect logs for the E9000 switch module, see "Collecting Server Logs" in the FusionServer Tools 2.0 SmartKit User Guide.
4.4.2.3 Using the V5 Switch Module CLI to Collect Ethernet Switching Plane...
View the log file in the FTP directory on the PC.
----End
4.4.2.4 Using the V8 Switch Module CLI to Collect Ethernet Switching Plane Information
Operation Scenario
Use the CLI of an E9000 switch module to collect the following information about the V8 platform:
● Logs
●...
4 Collecting Information
4.5 Collecting Switch Module Logs (for E9000+MM910/MM921)
Using SmartKit
For details about how to use SmartKit to collect logs for the E9000 switch module, see "Collecting Server Logs" in the FusionServer Tools 2.0 SmartKit User Guide.
Using FusionDirector
●...
Collect Emulex HBA logs when an NIC is faulty. Use the official tool OneCapture to collect Emulex HBA logs. This tool may affect services.
● For details about how to collect screen recording information, see "Video Play" in the Huawei Server iMana 200 User Guide or iBMC User Guide.
Issue 20 (2020-09-25)
Troubleshooting 5 Diagnosing and Rectifying Faults
Table 5-2 Methods for handling alarms
Server Type | Reference
E9000 | See the FusionServer Pro E9000 Server V100R001 HMM Alarm Handling
To check switch module alarms, run the following commands on the Ethernet switching plane:
●...
5-14, Table 5-15, and Table 5-16 describe the meanings of the MM910 management module indicator, E9000 fan module indicator, and E9000 switch module indicator, and the corresponding handling procedures.
● Table 5-17 and Table 5-18 describe the meanings of the fan module indicator and network port indicator on the Atlas 800 training server (model 9010), and the corresponding handling procedures.
Huawei Servers Troubleshooting 5 Diagnosing and Rectifying Faults
Indicator Available Only on the E9000
Table 5-14 MM910 management module indicators
Indicator | Status | Meaning | Diagnosis
Power indicator (PWR) on the MM910 | Steady green | The MM910 has been powered on.
Blinking green...
Table 5-15 E9000 fan module indicators
Indicator | Status | Meaning | Diagnosis Procedure
Fan module operating status indicator on an E9000 | Blinking green (once every 2 seconds) | The fan module is operating properly.
Blinking green...
Table 5-16 E9000 switch module indicators
Indicator | Status | Meaning | Diagnosis Procedure
Stack status indicator (STAT) | Steady green | A switch module that can be stacked is active in stacking mode or is not stacked, and is operating properly.
Indicators to Locate Faults. affected. output NOTE ● If no, contact Huawei and the ● For E9000 servers, record technical support. health alarms on the MM910 WebUI. 2. Replace the faulty PSU with indicator 2. Check whether an "AC lost"...
"Quick Recovery Method". ● For more fault symptoms and solutions, see the Computing Case Library. The Computing Case Library is available only to Huawei engineers and partners. If the KVM connection is abnormal, you are advised to use Independent Remote Console for login.
200 or iBMC. Then NOTE reconnect power cables. ● The CH140 and CH140 V3 3. Replace the mainboard or compute nodes of the E9000 BMC board. do not provide any serial ports. Directly ping the IP For an E9000 server, perform...
2. Replace the mainboard or the power supply link to the PSU backplane. mainboard has failed. NOTE For an E9000 server, you are advised to use the MM910 for one-click log collection. 4. Check the power supply unit (PSU) backplane and the mainboard.
PCIe bus are installed properly. b. (Optional) Check the mapping between the HBAs and switch modules for E9000 and E6000 servers. c. Check the FC link between the HBA and the switch by checking the optical module power, optical fiber, and optical module compatibility.
If the fault persists, apply for spare HBAs to replace the faulty ones. 3. Before contacting Huawei technical support, it is recommended that you migrate services and collect switch module logs, OS logs, LLD networking information, and device time differences.
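Recording device time differences makes it easier to line up log entries from different devices. The following is a minimal sketch, assuming that the clock of each device was read at the same moment as a reference clock (for example, the maintenance PC). The function name, device names, and timestamps are hypothetical and are not part of any Huawei tool.

```python
from datetime import datetime
from typing import Dict


def clock_offsets(reference: datetime, readings: Dict[str, datetime]) -> Dict[str, float]:
    """Return each device's clock offset from the reference clock, in seconds.

    A positive value means the device clock is ahead of the reference.
    """
    return {
        device: (reading - reference).total_seconds()
        for device, reading in readings.items()
    }


# Hypothetical readings, all taken at the same moment as the reference clock.
reference_time = datetime(2020, 9, 25, 10, 0, 0)
device_readings = {
    "compute-node-1": datetime(2020, 9, 25, 10, 0, 12),
    "switch-module-2X": datetime(2020, 9, 25, 9, 59, 30),
}

offsets = clock_offsets(reference_time, device_readings)
for device, seconds in offsets.items():
    print(f"{device}: {seconds:+.0f} s relative to the reference clock")
```

Attach the resulting offsets to the collected logs so that timestamps from different devices can be compared.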
Administrator user as soon as possible. 7.2.2 Inspecting Indicators The front and rear panels of Huawei servers provide indicators and buttons, including the UID button/indicator, health status indicator, network port status indicators, fan module indicators, and power button/indicator. You can observe the indicators on a server to determine the server status.
Supports batch upgrade for BMC, BIOS, CPLD, and Smart Provisioning firmware of rack servers, high-density servers, blade servers, KunLun servers, and heterogeneous servers. ● Supports firmware bundle upgrade by using the E9000 active management module. ● Supports batch configuration for PSUs, BIOSs, BMCs, and RAID controller cards of rack servers, high-density servers, blade servers, KunLun servers, and heterogeneous servers.
Inspection Conclusions and Suggestions Huawei's preventive maintenance engineers will perform a comprehensive inspection of your Huawei servers to quickly detect any potential problems. These engineers will then submit a detailed inspection report, and suggestions, to help improve your service availability.
Overview A serial number (SN) or equipment serial number (ESN) uniquely identifies a server and is required when you apply to Huawei for technical support. NOTE Check the first two digits of the product SN before reading the following information.
SN ID (two characters), which is 21. Material identification code (eight digits), that is, processing code. Vendor code (two characters). The value 10 indicates Huawei and other values indicate outsourcing vendors. Year and month (two characters). ● The first character indicates the year. The digits 1...
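The field layout described above can be sketched as a simple string split. This is an illustration only: it assumes that the four fields are contiguous, appear in the listed order, and start at the first character of the SN. The year and month characters are left undecoded, and the sample SN is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SerialNumberFields:
    sn_id: str          # two characters, expected to be "21"
    material_code: str  # eight characters (material identification code)
    vendor_code: str    # two characters; "10" indicates Huawei
    year_month: str     # two characters; not decoded in this sketch
    remainder: str      # everything after the four fields described above


def split_serial_number(sn: str) -> SerialNumberFields:
    """Split an SN that starts with 21 into the fields described above."""
    sn = sn.strip().upper()
    if len(sn) < 14:
        raise ValueError("SN is shorter than the 14 characters covered by these fields")
    if not sn.startswith("21"):
        raise ValueError("this layout applies only to SNs whose SN ID is 21")
    return SerialNumberFields(
        sn_id=sn[0:2],
        material_code=sn[2:10],
        vendor_code=sn[10:12],
        year_month=sn[12:14],
        remainder=sn[14:],
    )


def is_huawei_vendor(fields: SerialNumberFields) -> bool:
    """Vendor code 10 indicates Huawei; other values indicate outsourcing vendors."""
    return fields.vendor_code == "10"


# Made-up SN used for illustration only.
sample = split_serial_number("210231123410K5000042")
print(sample)
print("Huawei vendor code:", is_huawei_vendor(sample))
```

The split is positional, so an SN with a different SN ID must not be passed to it. That is the reason for checking the first two digits first.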
● View the product label. A product label is attached to each Huawei server. You can view the product label to obtain its ESN. The product label position varies with the Huawei server model. For details, see the user guide of a specific server.
(2) is the product label of a server node. Figure 8-6 Product SN of an X6800 – Figure 8-7 shows the product SN of an E9000. In Figure 8-7, (1) is the product label of the server, and (2) is the product label of a compute node.
● X8000 server node: DH310 V2, DH320 V2, DH321 V2, DH620 V2, DH621 V2, DH626 V2, and DH628 V2 ● E9000 compute node: CH121, CH140, CH220, CH221, CH222, CH240, CH242, and CH242 V3 Log in to the iMana 200 WebUI. For details, see 8.8 Logging In to the...
Figure 8-10 Product SN ● Use the MM910 WebUI. NOTE This method applies only to E9000 servers whose MM910 version is (U54) 2.20 or later. Log in to the MM910 WebUI. For details, see 8.11 Logging In to the MM910 WebUI.
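The version requirement in the preceding note can be checked before you attempt this method. The following is a minimal sketch, assuming that the MM910 version is reported as dotted numbers, optionally preceded by a tag such as (U54), and that the components compare numerically. The function names are hypothetical.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the trailing dotted version number, for example "(U54) 2.20" -> (2, 20)."""
    match = re.search(r"(\d+(?:\.\d+)+)\s*$", text)
    if match is None:
        raise ValueError(f"no dotted version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


# Minimum MM910 version stated in the note above.
MINIMUM_MM910_VERSION = parse_version("2.20")


def supports_sn_query_on_webui(mm910_version: str) -> bool:
    """Return True if the MM910 version is 2.20 or later."""
    return parse_version(mm910_version) >= MINIMUM_MM910_VERSION


for version in ("(U54) 2.20", "(U54) 2.19", "3.01"):
    print(version, supports_sn_query_on_webui(version))
```

Comparing tuples of integers avoids the mistake of comparing version strings character by character.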
Figure 8-12 Product SN ● Use the FusionDirector WebUI. NOTE ● This method applies only to E9000 servers whose management module is the MM920/MM921. ● Before the operations, add the MM920/MM921 to FusionDirector. Log in to the FusionDirector WebUI. For details, see 8.12 Logging In to...
MM910 Operation Scenario You can use the Serial over LAN (SOL) function of the management module to access a compute node, passthrough module, or switch module in a chassis for remote maintenance of the E9000. Prerequisites Conditions ●...
9 Other Resources Huawei Technical Support If a fault persists after you take the troubleshooting measures specified in the documents, contact technical support at your local Huawei office. If your local Huawei office is not available, contact Huawei technical support as follows: ●...
9 Other Resources Table 9-3 Software tools for routine maintenance Name Server and Description Version FusionServer See the Only Huawei FusionServer V2 & V3 servers FusionServer Tools Toolkit are supported. Diagnoses and configures Tools 2.0 servers. Toolkit User Download link:...