
IT Operations Engineer Interview Guide 02 - Monitoring and Optimization
2024/3/5
Key Job Responsibilities
Monitor IT resource usage and optimize costs; propose cost-saving measures such as resource consolidation and cloud-service optimization; review system logs regularly to keep systems stable and to detect and resolve potential problems in time.
I. Monitoring System Deployment: Detailed Operations Guide
1.1 Zabbix Monitoring System Deployment
Zabbix Server installation and configuration:
# 1. Prepare the system environment (CentOS 8)
dnf update -y
dnf install -y wget curl vim
# 2. Install the Zabbix repository
rpm -Uvh https://repo.zabbix.com/zabbix/6.0/rhel/8/x86_64/zabbix-release-6.0-4.el8.noarch.rpm
dnf clean all
# 3. Install the Zabbix server and web frontend
dnf install -y zabbix-server-mysql zabbix-web-mysql zabbix-apache-conf zabbix-sql-scripts zabbix-selinux-policy
# 4. Install the MariaDB database
dnf install -y mariadb-server mariadb
systemctl start mariadb
systemctl enable mariadb
# 5. Configure the database
mysql_secure_installation
mysql -uroot -p
CREATE DATABASE zabbix CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
CREATE USER 'zabbix'@'localhost' IDENTIFIED BY 'Zabbix@123!';
GRANT ALL PRIVILEGES ON zabbix.* TO 'zabbix'@'localhost';
SET GLOBAL log_bin_trust_function_creators = 1;
quit;
# 6. Import the database schema
zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | mysql --default-character-set=utf8mb4 -uzabbix -p zabbix
# After the import, the setting enabled above can be reverted:
# mysql -uroot -p -e "SET GLOBAL log_bin_trust_function_creators = 0;"
# 7. Configure the Zabbix server
vi /etc/zabbix/zabbix_server.conf
DBPassword=Zabbix@123!
DBHost=localhost
DBName=zabbix
DBUser=zabbix
# Log level and log file size
LogLevel=3
LogFile=/var/log/zabbix/zabbix_server.log
LogFileSize=100
# Cache sizes
CacheSize=128M
HistoryCacheSize=64M
HistoryIndexCacheSize=32M
TrendCacheSize=32M
# 8. Configure PHP
vi /etc/php-fpm.d/zabbix.conf
# Uncomment and set the timezone
php_value[date.timezone] = Asia/Shanghai
# 9. Start the services
systemctl restart zabbix-server zabbix-agent httpd php-fpm
systemctl enable zabbix-server zabbix-agent httpd php-fpm
# 10. Configure the firewall
firewall-cmd --add-service=http --permanent
firewall-cmd --add-port=10051/tcp --permanent
firewall-cmd --reload
# 11. Set up the web interface
# Visit http://server_ip/zabbix
# Default username: Admin, password: zabbix

Zabbix Agent installation and configuration:
# 1. Install the agent on the monitored host
dnf install -y zabbix-agent2
# 2. Configure the agent
vi /etc/zabbix/zabbix_agent2.conf
# Server settings
Server=192.168.1.100
ServerActive=192.168.1.100
Hostname=web-server-01
# Security settings: key rules are checked in order of appearance and the
# first match wins, so the DenyKey must come before the broad AllowKey
DenyKey=system.run[rm *]
AllowKey=system.run[*]
# Monitoring settings
RefreshActiveChecks=60
BufferSend=5
BufferSize=100
# 3. Start the agent service
systemctl start zabbix-agent2
systemctl enable zabbix-agent2
# 4. Configure the firewall
firewall-cmd --add-port=10050/tcp --permanent
firewall-cmd --reload
# 5. Custom monitoring script
mkdir -p /etc/zabbix/scripts
# CPU temperature script
cat > /etc/zabbix/scripts/cpu_temp.sh << 'EOF'
#!/bin/bash
# Read the CPU temperature (requires the lm_sensors package)
TEMP=$(sensors | grep 'Core 0' | awk '{print $3}' | sed 's/+//g; s/°C//g')
echo $TEMP
EOF
chmod +x /etc/zabbix/scripts/cpu_temp.sh
# Add a user parameter to zabbix_agent2.conf
echo "UserParameter=cpu.temperature,/etc/zabbix/scripts/cpu_temp.sh" >> /etc/zabbix/zabbix_agent2.conf
# Restart the agent service
systemctl restart zabbix-agent2
1.2 Prometheus + Grafana Monitoring Stack
Prometheus installation and configuration:
# 1. Create the prometheus user
useradd --no-create-home --shell /bin/false prometheus
# 2. Create directories
mkdir /etc/prometheus /var/lib/prometheus
chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
# 3. Download and install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar -xzf prometheus-2.40.0.linux-amd64.tar.gz
cp prometheus-2.40.0.linux-amd64/prometheus /usr/local/bin/
cp prometheus-2.40.0.linux-amd64/promtool /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
# 4. Configure Prometheus
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '192.168.1.101:9100'
          - '192.168.1.102:9100'
  - job_name: 'mysql-exporter'
    static_configs:
      - targets: ['192.168.1.101:9104']
  - job_name: 'nginx-exporter'
    static_configs:
      - targets: ['192.168.1.101:9113']
EOF
chown prometheus:prometheus /etc/prometheus/prometheus.yml
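# Optional: validate the configuration before starting (promtool ships with Prometheus)
promtool check config /etc/prometheus/prometheus.yml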
# 5. Create the systemd service
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.enable-lifecycle \
    --storage.tsdb.retention.time=30d

[Install]
WantedBy=multi-user.target
EOF
# 6. Start the service
systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
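# Optional: confirm the server is up; the endpoint should report it is healthy
curl -s http://localhost:9090/-/healthy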
# 7. Configure the firewall
firewall-cmd --add-port=9090/tcp --permanent
firewall-cmd --reload

Node Exporter installation:
# 1. Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.5.0.linux-amd64.tar.gz
cp node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/node_exporter
# 2. Create the systemd service
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --web.listen-address=0.0.0.0:9100

[Install]
WantedBy=multi-user.target
EOF
# 3. Start the service
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
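# Optional: confirm metrics are being exposed
curl -s http://localhost:9100/metrics | head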
# 4. Configure the firewall
firewall-cmd --add-port=9100/tcp --permanent
firewall-cmd --reload

Grafana installation and configuration:
# 1. Add the Grafana repository
cat > /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
# 2. Install Grafana
dnf install -y grafana
# 3. Configure Grafana
vi /etc/grafana/grafana.ini
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = localhost
root_url = http://localhost:3000/
[database]
# sqlite3 is the default and needs no host/user/password;
# those settings only apply when switching to mysql or postgres
type = sqlite3
[security]
admin_user = admin
admin_password = Grafana@123!
# replace with a random value of your own
secret_key = SW2YcwTIb9zpOOhoPsMm
# 4. Start the service
systemctl start grafana-server
systemctl enable grafana-server
# 5. Configure the firewall
firewall-cmd --add-port=3000/tcp --permanent
firewall-cmd --reload
# 6. Configure the data source (via the web UI)
# Visit http://server_ip:3000
# Add a Prometheus data source: http://localhost:9090
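# Alternative (a sketch): provision the data source from a file instead of
# clicking through the UI; Grafana loads files under
# /etc/grafana/provisioning/datasources/ at startup
cat > /etc/grafana/provisioning/datasources/prometheus.yaml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
systemctl restart grafana-server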
1.3 Alerting Configuration
Alertmanager installation and configuration:
# 1. Download and install Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.25.0.linux-amd64.tar.gz
cp alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
cp alertmanager-0.25.0.linux-amd64/amtool /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/alertmanager /usr/local/bin/amtool
# 2. Create the configuration and data directories
mkdir /etc/alertmanager /var/lib/alertmanager
chown prometheus:prometheus /etc/alertmanager /var/lib/alertmanager
# 3. Configure Alertmanager
cat > /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  smtp_smarthost: 'smtp.company.com:587'   # replace with your mail relay
  smtp_from: 'monitor@company.com'
  smtp_auth_username: 'monitor@company.com'
  smtp_auth_password: 'password123'

route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@company.com'
        subject: 'Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:8080/webhook'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
chown prometheus:prometheus /etc/alertmanager/alertmanager.yml
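# Optional: validate the configuration before starting
amtool check-config /etc/alertmanager/alertmanager.yml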
# 4. Create the systemd service
cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
    --config.file /etc/alertmanager/alertmanager.yml \
    --storage.path /var/lib/alertmanager/ \
    --web.listen-address=0.0.0.0:9093

[Install]
WantedBy=multi-user.target
EOF
# 5. Start the service
systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager

Prometheus alerting rules configuration:
# 1. Create the rules directory
mkdir -p /etc/prometheus/rules
# 2. System-level rules
cat > /etc/prometheus/rules/system.yml << 'EOF'
groups:
  - name: system.rules
    rules:
      # High CPU usage
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
      # High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% on {{ $labels.instance }}"
      # Low disk space
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage detected"
          description: "Disk usage is above 90% on {{ $labels.instance }} filesystem {{ $labels.mountpoint }}"
      # High system load (load15 normalized by CPU count)
      - alert: HighSystemLoad
        expr: node_load15 / count by(instance) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load detected"
          description: "15-minute load per CPU is above 2 on {{ $labels.instance }}"
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance down"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute"
EOF
# 3. Application-level rules
cat > /etc/prometheus/rules/application.yml << 'EOF'
groups:
  - name: application.rules
    rules:
      # Slow HTTP responses
      - alert: HighHttpResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP response time"
          description: "95% of HTTP requests take more than 0.5s on {{ $labels.instance }}"
      # High HTTP error rate
      - alert: HighHttpErrorRate
        expr: rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate"
          description: "HTTP error rate is above 5% on {{ $labels.instance }}"
      # High database connection usage
      - alert: HighDatabaseConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database connections"
          description: "Database connection usage is above 80% on {{ $labels.instance }}"
EOF
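# Optional: validate the rule files before reloading
promtool check rules /etc/prometheus/rules/*.yml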
# 4. Reload the Prometheus configuration (works because --web.enable-lifecycle is set)
curl -X POST http://localhost:9090/-/reload

II. Log Management: Detailed Operations Guide
2.1 ELK Stack Deployment
Elasticsearch installation and configuration:
# 1. Install Java
dnf install -y java-11-openjdk java-11-openjdk-devel
# 2. Add the Elasticsearch repository
rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
cat > /etc/yum.repos.d/elasticsearch.repo << 'EOF'
[elasticsearch]
name=Elasticsearch repository for 8.x packages
baseurl=https://artifacts.elastic.co/packages/8.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=0
autorefresh=1
type=rpm-md
EOF
# 3. Install Elasticsearch
dnf install -y --enablerepo=elasticsearch elasticsearch
# 4. Configure Elasticsearch
vi /etc/elasticsearch/elasticsearch.yml
cluster.name: production-cluster
node.name: node-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node
# Security settings
xpack.security.enabled: true
xpack.security.enrollment.enabled: true
xpack.security.http.ssl:
  enabled: false
xpack.security.transport.ssl:
  enabled: false
# JVM settings
vi /etc/elasticsearch/jvm.options
-Xms2g
-Xmx2g
# 5. Start the service
systemctl daemon-reload
systemctl enable elasticsearch
systemctl start elasticsearch
# 6. Set passwords (on recent 8.x releases this tool is deprecated; use
# /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic instead)
/usr/share/elasticsearch/bin/elasticsearch-setup-passwords interactive
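# Optional: verify the node responds (enter the elastic password when prompted)
curl -u elastic http://localhost:9200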
# 7. Configure the firewall
firewall-cmd --add-port=9200/tcp --permanent
firewall-cmd --reload

Logstash installation and configuration:
# 1. Install Logstash
dnf install -y --enablerepo=elasticsearch logstash
# 2. Configure a Logstash pipeline
cat > /etc/logstash/conf.d/beats-input.conf << 'EOF'
input {
  beats {
    port => 5044
  }
}
filter {
  if [fileset][module] == "system" {
    if [fileset][name] == "auth" {
      grok {
        match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:server} %{PROG:program}: %{GREEDYDATA:content}" }
      }
    }
    else if [fileset][name] == "syslog" {
      grok {
        match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:server} %{PROG:program}: %{GREEDYDATA:content}" }
      }
    }
  }
  # Filebeat below sets fields_under_root: true, so log_type sits at the
  # event root rather than under [fields]
  if [log_type] == "nginx_access" {
    grok {
      # NGINXACCESS is not a stock grok pattern; define it under patterns_dir
      # or substitute the built-in %{COMBINEDAPACHELOG}
      match => { "message" => "%{NGINXACCESS}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    mutate {
      convert => { "response" => "integer" }
      convert => { "bytes" => "integer" }
      convert => { "responsetime" => "float" }
    }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    user => "elastic"
    password => "password123"
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
}
EOF
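# Optional: test the pipeline syntax before starting
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/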
# 3. Configure the JVM
vi /etc/logstash/jvm.options
-Xms1g
-Xmx1g
# 4. Start the service
systemctl enable logstash
systemctl start logstash
# 5. Configure the firewall
firewall-cmd --add-port=5044/tcp --permanent
firewall-cmd --reload

Kibana installation and configuration:
# 1. Install Kibana
dnf install -y --enablerepo=elasticsearch kibana
# 2. Configure Kibana
vi /etc/kibana/kibana.yml
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://localhost:9200"]
elasticsearch.username: "kibana_system"
elasticsearch.password: "password123"
# Security settings (authentication itself is governed by Elasticsearch in 8.x;
# kibana.yml only needs the saved-objects encryption key, 32+ characters)
xpack.encryptedSavedObjects.encryptionKey: "fhjskloppd678ehkdfdlliverpoolfcr"
# 3. Start the service
systemctl enable kibana
systemctl start kibana
# 4. Configure the firewall
firewall-cmd --add-port=5601/tcp --permanent
firewall-cmd --reload

Filebeat configuration (on the client servers):
# 1. Install Filebeat
dnf install -y --enablerepo=elasticsearch filebeat
# 2. Configure Filebeat
vi /etc/filebeat/filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/messages
      - /var/log/secure
    fields:
      log_type: system
    fields_under_root: true
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    fields:
      log_type: nginx_access
    fields_under_root: true
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/error.log
    fields:
      log_type: nginx_error
    fields_under_root: true

output.logstash:
  hosts: ["192.168.1.100:5044"]

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
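# Optional: validate the configuration and the connection to Logstash
filebeat test config
filebeat test output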
# 3. Start the service
systemctl enable filebeat
systemctl start filebeat

2.2 Log Analysis and Alerting
Log analysis scripts:
# 1. System log analysis script
cat > /usr/local/bin/log_analyzer.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/messages"
ERROR_LOG="/var/log/log_errors.log"
DATE=$(date +%Y-%m-%d)
# Analyze error logs
echo "=== System error log analysis - $DATE ===" > $ERROR_LOG
# Disk errors
echo "Disk errors:" >> $ERROR_LOG
grep -i "disk\|I/O\|filesystem" $LOG_FILE | tail -20 >> $ERROR_LOG
# Memory errors
echo -e "\nMemory errors:" >> $ERROR_LOG
grep -i "memory\|oom\|killed" $LOG_FILE | tail -20 >> $ERROR_LOG
# Network errors
echo -e "\nNetwork errors:" >> $ERROR_LOG
grep -i "network\|connection\|timeout" $LOG_FILE | tail -20 >> $ERROR_LOG
# Failed SSH logins
echo -e "\nFailed SSH logins:" >> $ERROR_LOG
grep "Failed password" /var/log/secure | tail -20 >> $ERROR_LOG
# Error frequency by program name
echo -e "\nError statistics:" >> $ERROR_LOG
grep -i "error\|fail\|critical" $LOG_FILE | awk '{print $5}' | sort | uniq -c | sort -nr >> $ERROR_LOG
# Email the daily report
if [ -s $ERROR_LOG ]; then
    mail -s "System error log report - $DATE" admin@company.com < $ERROR_LOG
fi
EOF
chmod +x /usr/local/bin/log_analyzer.sh
# 2. Schedule it (append so existing crontab entries are preserved)
(crontab -l 2>/dev/null; echo "0 1 * * * /usr/local/bin/log_analyzer.sh") | crontab -
# 3. Web access log analysis
cat > /usr/local/bin/web_log_analyzer.sh << 'EOF'
#!/bin/bash
NGINX_LOG="/var/log/nginx/access.log"
REPORT_FILE="/tmp/web_access_report.txt"
DATE=$(date +%Y-%m-%d)
echo "=== Web access log analysis - $DATE ===" > $REPORT_FILE
# Total requests
echo "Total requests:" >> $REPORT_FILE
wc -l $NGINX_LOG >> $REPORT_FILE
# Top client IPs
echo -e "\nTop 10 client IPs:" >> $REPORT_FILE
awk '{print $1}' $NGINX_LOG | sort | uniq -c | sort -nr | head -10 >> $REPORT_FILE
# Top requested pages
echo -e "\nTop 10 requested pages:" >> $REPORT_FILE
awk '{print $7}' $NGINX_LOG | sort | uniq -c | sort -nr | head -10 >> $REPORT_FILE
# Status code distribution
echo -e "\nStatus code distribution:" >> $REPORT_FILE
awk '{print $9}' $NGINX_LOG | sort | uniq -c | sort -nr >> $REPORT_FILE
# 404 pages
echo -e "\nTop 10 404 pages:" >> $REPORT_FILE
awk '$9==404 {print $7}' $NGINX_LOG | sort | uniq -c | sort -nr | head -10 >> $REPORT_FILE
# Abnormal IP detection (more than 1000 requests)
echo -e "\nAbnormal client IPs:" >> $REPORT_FILE
awk '{print $1}' $NGINX_LOG | sort | uniq -c | awk '$1>1000 {print $2 " - " $1 " requests"}' >> $REPORT_FILE
cat $REPORT_FILE
EOF
chmod +x /usr/local/bin/web_log_analyzer.sh

III. Performance Monitoring: Detailed Operations Guide
3.1 System Performance Monitoring
System performance monitoring script:
# 1. Comprehensive performance monitoring script (iostat requires sysstat)
cat > /usr/local/bin/performance_monitor.sh << 'EOF'
#!/bin/bash
HOSTNAME=$(hostname)
DATE=$(date '+%Y-%m-%d %H:%M:%S')
LOG_FILE="/var/log/performance.log"
echo "=== System performance report - $HOSTNAME - $DATE ===" | tee -a $LOG_FILE
# CPU
echo -e "\n### CPU usage ###" | tee -a $LOG_FILE
top -bn1 | grep "Cpu(s)" | tee -a $LOG_FILE
echo "Load average:" | tee -a $LOG_FILE
uptime | tee -a $LOG_FILE
# Memory
echo -e "\n### Memory usage ###" | tee -a $LOG_FILE
free -h | tee -a $LOG_FILE
echo "Memory usage:" | tee -a $LOG_FILE
free | awk 'NR==2{printf "%.2f%%\n", $3*100/$2}' | tee -a $LOG_FILE
# Disks
echo -e "\n### Disk usage ###" | tee -a $LOG_FILE
df -h | tee -a $LOG_FILE
# Network connections
echo -e "\n### Network connection statistics ###" | tee -a $LOG_FILE
ss -s | tee -a $LOG_FILE
# Processes
echo -e "\n### Top resource consumers ###" | tee -a $LOG_FILE
echo "Top 5 by CPU:" | tee -a $LOG_FILE
ps aux --sort=-%cpu | head -6 | tee -a $LOG_FILE
echo "Top 5 by memory:" | tee -a $LOG_FILE
ps aux --sort=-%mem | head -6 | tee -a $LOG_FILE
# I/O statistics
echo -e "\n### Disk I/O statistics ###" | tee -a $LOG_FILE
iostat -x 1 1 | tee -a $LOG_FILE
# Alert checks
check_alerts() {
    # CPU usage: derive total usage from the idle field of top's Cpu(s) line
    # (field layout varies with locale; this expects the "... id," token)
    CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)[ %]*id.*/\1/')
    CPU_USAGE=$(echo "100 - $CPU_IDLE" | bc)
    if (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
        echo "ALERT: CPU usage above 80%: $CPU_USAGE%" | tee -a $LOG_FILE
        echo "CPU usage alert: $CPU_USAGE%" | mail -s "CPU alert - $HOSTNAME" admin@company.com
    fi
    # Memory usage
    MEM_USAGE=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
    if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then
        echo "ALERT: memory usage above 85%: $MEM_USAGE%" | tee -a $LOG_FILE
        echo "Memory usage alert: $MEM_USAGE%" | mail -s "Memory alert - $HOSTNAME" admin@company.com
    fi
    # Disk usage (root filesystem)
    DISK_USAGE=$(df -h | awk '$NF=="/"{printf "%d", $5}')
    if [ "$DISK_USAGE" -gt 90 ]; then
        echo "ALERT: root filesystem usage above 90%: $DISK_USAGE%" | tee -a $LOG_FILE
        echo "Disk space alert: $DISK_USAGE%" | mail -s "Disk alert - $HOSTNAME" admin@company.com
    fi
}
check_alerts
echo "======================================" | tee -a $LOG_FILE
EOF
chmod +x /usr/local/bin/performance_monitor.sh
# 2. Schedule it every 5 minutes (append to the existing crontab)
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/performance_monitor.sh") | crontab -

3.2 Database Performance Monitoring
MySQL performance monitoring:
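The script below assumes a dedicated monitoring account matching its MYSQL_USER/MYSQL_PASS defaults; a minimal sketch for creating one (the privilege list here is an assumption - adjust to your environment):

mysql -uroot -p << 'EOF'
CREATE USER 'monitor'@'localhost' IDENTIFIED BY 'Monitor@123';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'monitor'@'localhost';
EOF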
# 1. MySQL performance monitoring script
cat > /usr/local/bin/mysql_monitor.sh << 'EOF'
#!/bin/bash
MYSQL_USER="monitor"
MYSQL_PASS="Monitor@123"
MYSQL_HOST="localhost"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
LOG_FILE="/var/log/mysql_performance.log"
echo "=== MySQL performance monitoring - $DATE ===" | tee -a $LOG_FILE
# Connections
echo -e "\n### Connection status ###" | tee -a $LOG_FILE
mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Threads_running';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';" | tee -a $LOG_FILE
# Query statistics
echo -e "\n### Query statistics ###" | tee -a $LOG_FILE
mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "
SHOW STATUS LIKE 'Queries';
SHOW STATUS LIKE 'Questions';
SHOW STATUS LIKE 'Slow_queries';
SHOW STATUS LIKE 'Com_select';
SHOW STATUS LIKE 'Com_insert';
SHOW STATUS LIKE 'Com_update';
SHOW STATUS LIKE 'Com_delete';" | tee -a $LOG_FILE
# InnoDB status
echo -e "\n### InnoDB status ###" | tee -a $LOG_FILE
mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "
SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW STATUS LIKE 'Innodb_buffer_pool_reads';
SHOW STATUS LIKE 'Innodb_buffer_pool_pages_total';
SHOW STATUS LIKE 'Innodb_buffer_pool_pages_free';" | tee -a $LOG_FILE
# Lock status (information_schema.INNODB_LOCKS was removed in MySQL 8.0;
# query performance_schema.data_locks there instead)
echo -e "\n### Lock status ###" | tee -a $LOG_FILE
mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "
SHOW STATUS LIKE 'Table_locks_waited';
SHOW STATUS LIKE 'Table_locks_immediate';
SELECT * FROM information_schema.INNODB_LOCKS;" | tee -a $LOG_FILE
# Long-running queries
echo -e "\n### Queries running for more than 5s ###" | tee -a $LOG_FILE
mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "
SELECT ID, USER, HOST, DB, COMMAND, TIME, STATE, INFO
FROM information_schema.PROCESSLIST
WHERE COMMAND != 'Sleep' AND TIME > 5
ORDER BY TIME DESC;" | tee -a $LOG_FILE
# Alert checks
check_mysql_alerts() {
    # Connection usage
    CONNECTIONS=$(mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "SHOW STATUS LIKE 'Threads_connected';" | awk 'NR==2 {print $2}')
    MAX_CONNECTIONS=$(mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "SHOW VARIABLES LIKE 'max_connections';" | awk 'NR==2 {print $2}')
    CONNECTION_USAGE=$(echo "scale=2; $CONNECTIONS * 100 / $MAX_CONNECTIONS" | bc)
    if (( $(echo "$CONNECTION_USAGE > 80" | bc -l) )); then
        echo "ALERT: MySQL connection usage above 80%: $CONNECTION_USAGE%" | tee -a $LOG_FILE
        echo "MySQL connection usage alert: $CONNECTION_USAGE%" | mail -s "MySQL connection alert" admin@company.com
    fi
    # Slow queries (a cumulative counter since server start)
    SLOW_QUERIES=$(mysql -u$MYSQL_USER -p$MYSQL_PASS -h$MYSQL_HOST -e "SHOW STATUS LIKE 'Slow_queries';" | awk 'NR==2 {print $2}')
    if [ "$SLOW_QUERIES" -gt 100 ]; then
        echo "ALERT: too many slow queries: $SLOW_QUERIES" | tee -a $LOG_FILE
    fi
}
check_mysql_alerts
echo "======================================" | tee -a $LOG_FILE
EOF
chmod +x /usr/local/bin/mysql_monitor.sh
# 2. Schedule it (append to the existing crontab)
(crontab -l 2>/dev/null; echo "*/10 * * * * /usr/local/bin/mysql_monitor.sh") | crontab -

IV. Cost Optimization: Detailed Operations Guide
4.1 Resource Usage Analysis
Server resource analysis script:
# 1. Resource usage analysis script
cat > /usr/local/bin/resource_analyzer.sh << 'EOF'
#!/bin/bash
HOSTNAME=$(hostname)
DATE=$(date '+%Y-%m-%d %H:%M:%S')
REPORT_FILE="/tmp/resource_report_$(date +%Y%m%d).txt"
echo "=== Server resource usage report - $HOSTNAME - $DATE ===" > $REPORT_FILE
# CPU analysis
echo -e "\n### CPU analysis ###" >> $REPORT_FILE
echo "CPU cores: $(nproc)" >> $REPORT_FILE
echo "CPU model: $(lscpu | grep 'Model name' | awk -F: '{print $2}' | xargs)" >> $REPORT_FILE
# Current CPU usage (a 1-second sar sample)
echo "Current CPU usage:" >> $REPORT_FILE
sar -u 1 1 | tail -1 | awk '{printf "user: %.2f%%, system: %.2f%%, idle: %.2f%%\n", $3, $5, $8}' >> $REPORT_FILE
# Memory analysis
echo -e "\n### Memory analysis ###" >> $REPORT_FILE
echo "Total memory: $(free -h | awk 'NR==2{print $2}')" >> $REPORT_FILE
echo "Used: $(free -h | awk 'NR==2{print $3}')" >> $REPORT_FILE
echo "Available: $(free -h | awk 'NR==2{print $7}')" >> $REPORT_FILE
echo "Usage: $(free | awk 'NR==2{printf "%.2f%%\n", $3*100/$2}')" >> $REPORT_FILE
# Top 10 processes by memory
echo -e "\nTop 10 processes by memory:" >> $REPORT_FILE
ps aux --sort=-%mem | head -11 | awk '{printf "%-10s %-8s %-8s %s\n", $1, $4"%", $6/1024"MB", $11}' >> $REPORT_FILE
# Disk analysis
echo -e "\n### Disk analysis ###" >> $REPORT_FILE
df -h >> $REPORT_FILE
# Large files (>100MB)
echo -e "\nLarge files (>100MB):" >> $REPORT_FILE
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | head -10 >> $REPORT_FILE
# Network analysis
echo -e "\n### Network analysis ###" >> $REPORT_FILE
echo "Network interfaces:" >> $REPORT_FILE
ip addr show | grep -E '^[0-9]+:' >> $REPORT_FILE
# Services
echo -e "\n### Running services ###" >> $REPORT_FILE
systemctl list-units --type=service --state=running | wc -l | awk '{print "Running services: " $1}' >> $REPORT_FILE
# Recommendations
echo -e "\n### Optimization recommendations ###" >> $REPORT_FILE
# CPU
CPU_IDLE=$(sar -u 1 1 | tail -1 | awk '{print $8}')
if (( $(echo "$CPU_IDLE < 20" | bc -l) )); then
    echo "- CPU usage is high; consider upgrading the CPU or optimizing applications" >> $REPORT_FILE
elif (( $(echo "$CPU_IDLE > 80" | bc -l) )); then
    echo "- CPU is underutilized; consider downsizing or consolidating workloads" >> $REPORT_FILE
fi
# Memory
MEM_USAGE=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then
    echo "- Memory usage is high; consider adding memory or optimizing applications" >> $REPORT_FILE
elif (( $(echo "$MEM_USAGE < 30" | bc -l) )); then
    echo "- Memory usage is low; a smaller memory configuration may suffice" >> $REPORT_FILE
fi
# Disk
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
    echo "- Disk space is running low; clean up or expand storage" >> $REPORT_FILE
elif [ "$DISK_USAGE" -lt 30 ]; then
    echo "- Disk usage is low; the allocated capacity may be oversized" >> $REPORT_FILE
fi
echo -e "\nReport generated at: $DATE" >> $REPORT_FILE
cat $REPORT_FILE
EOF
chmod +x /usr/local/bin/resource_analyzer.sh
# 2. Automatic cleanup script
cat > /usr/local/bin/auto_cleanup.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/cleanup.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
echo "=== Automatic cleanup started - $DATE ===" | tee -a $LOG_FILE
# Trim the systemd journal (keep 30 days)
echo "Trimming the systemd journal..." | tee -a $LOG_FILE
journalctl --vacuum-time=30d 2>&1 | tee -a $LOG_FILE
# Remove rotated log files older than 30 days
echo "Removing old log files..." | tee -a $LOG_FILE
find /var/log -name "*.log.*" -mtime +30 -delete 2>&1 | tee -a $LOG_FILE
find /var/log -name "*.gz" -mtime +30 -delete 2>&1 | tee -a $LOG_FILE
# Remove temporary files older than 7 days
echo "Removing temporary files..." | tee -a $LOG_FILE
find /tmp -type f -mtime +7 -delete 2>&1 | tee -a $LOG_FILE
find /var/tmp -type f -mtime +7 -delete 2>&1 | tee -a $LOG_FILE
# Clean the package cache
echo "Cleaning the package cache..." | tee -a $LOG_FILE
dnf clean all 2>&1 | tee -a $LOG_FILE
# Remove old kernels (keep the latest 3)
echo "Removing old kernels..." | tee -a $LOG_FILE
dnf remove $(dnf repoquery --installonly --latest-limit=-3 -q) -y 2>&1 | tee -a $LOG_FILE
# Report the result
echo "Cleanup summary:" | tee -a $LOG_FILE
df -h / | tee -a $LOG_FILE
echo "=== Automatic cleanup finished - $(date '+%Y-%m-%d %H:%M:%S') ===" | tee -a $LOG_FILE
EOF
chmod +x /usr/local/bin/auto_cleanup.sh
# 3. Schedule the weekly cleanup (append to the existing crontab)
(crontab -l 2>/dev/null; echo "0 2 * * 0 /usr/local/bin/auto_cleanup.sh") | crontab -

4.2 Cloud Service Cost Optimization
Alibaba Cloud cost analysis script:
# 1. Alibaba Cloud cost analysis script (requires a configured aliyun CLI and jq)
cat > /usr/local/bin/aliyun_cost_analyzer.sh << 'EOF'
#!/bin/bash
# aliyun CLI region
REGION="cn-hangzhou"
REPORT_FILE="/tmp/aliyun_cost_report_$(date +%Y%m%d).txt"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
echo "=== Alibaba Cloud resource cost report - $DATE ===" > $REPORT_FILE
# ECS instances
echo -e "\n### ECS instances ###" >> $REPORT_FILE
aliyun ecs DescribeInstances --RegionId $REGION | jq -r '
.Instances.Instance[] |
"Instance: \(.InstanceId), type: \(.InstanceType), status: \(.Status), created: \(.CreationTime)"
' >> $REPORT_FILE
# ECS instance count
ECS_COUNT=$(aliyun ecs DescribeInstances --RegionId $REGION | jq '.Instances.Instance | length')
echo "Total ECS instances: $ECS_COUNT" >> $REPORT_FILE
# RDS instances
echo -e "\n### RDS instances ###" >> $REPORT_FILE
aliyun rds DescribeDBInstances --RegionId $REGION | jq -r '
.Items.DBInstance[] |
"Instance: \(.DBInstanceId), class: \(.DBInstanceClass), status: \(.DBInstanceStatus), storage: \(.DBInstanceStorage)GB"
' >> $REPORT_FILE
# SLB instances
echo -e "\n### Load balancer instances ###" >> $REPORT_FILE
aliyun slb DescribeLoadBalancers --RegionId $REGION | jq -r '
.LoadBalancers.LoadBalancer[] |
"Instance: \(.LoadBalancerId), spec: \(.LoadBalancerSpec), status: \(.LoadBalancerStatus)"
' >> $REPORT_FILE
# Recommendations
echo -e "\n### Cost optimization recommendations ###" >> $REPORT_FILE
# Stopped ECS instances
STOPPED_ECS=$(aliyun ecs DescribeInstances --RegionId $REGION | jq '.Instances.Instance[] | select(.Status=="Stopped") | .InstanceId' | wc -l)
if [ "$STOPPED_ECS" -gt 0 ]; then
    echo "- Found $STOPPED_ECS stopped ECS instances; consider releasing them or switching to pay-as-you-go" >> $REPORT_FILE
fi
# Low-utilization instances (requires monitoring data)
echo "- Review instance CPU and memory utilization regularly and right-size accordingly" >> $REPORT_FILE
echo "- Consider reserved instances or subscription billing for discounts" >> $REPORT_FILE
echo "- Evaluate spot (preemptible) instances for cost reduction" >> $REPORT_FILE
echo "- Set budgets and billing alerts to avoid unexpected overspend" >> $REPORT_FILE
cat $REPORT_FILE
EOF
chmod +x /usr/local/bin/aliyun_cost_analyzer.sh
# 2. Resource utilization monitoring script
cat > /usr/local/bin/resource_utilization.sh << 'EOF'
#!/bin/bash
REPORT_FILE="/tmp/utilization_report_$(date +%Y%m%d).txt"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
echo "=== Resource utilization report - $DATE ===" > $REPORT_FILE
# Average utilization over today's sar records
echo -e "\n### Resource utilization (today's sar records) ###" >> $REPORT_FILE
# CPU utilization: 100 minus the %idle on sar's Average line
# (sar's daily data file only covers the current day)
CPU_AVG=$(sar -u | awk '/^Average/ {print 100 - $NF}')
echo "Average CPU utilization: ${CPU_AVG}%" >> $REPORT_FILE
# Memory utilization
MEM_TOTAL=$(free -b | awk 'NR==2{print $2}')
MEM_USED=$(free -b | awk 'NR==2{print $3}')
MEM_USAGE=$(echo "scale=2; $MEM_USED * 100 / $MEM_TOTAL" | bc)
echo "Current memory utilization: ${MEM_USAGE}%" >> $REPORT_FILE
# Disk utilization
echo -e "\nDisk utilization:" >> $REPORT_FILE
df -h | awk 'NR>1 {print $6 ": " $5}' >> $REPORT_FILE
# Network counters since boot
echo -e "\nNetwork statistics:" >> $REPORT_FILE
awk 'NR>2 {gsub(/:/,"",$1); printf "Interface %s: RX %.1f MB, TX %.1f MB\n", $1, $2/1024/1024, $10/1024/1024}' /proc/net/dev >> $REPORT_FILE
# Recommendations
echo -e "\n### Resource optimization recommendations ###" >> $REPORT_FILE
if (( $(echo "$CPU_AVG < 20" | bc -l) )); then
    echo "- CPU utilization is low (${CPU_AVG}%); consider downsizing or consolidating services" >> $REPORT_FILE
fi
if (( $(echo "$MEM_USAGE < 30" | bc -l) )); then
    echo "- Memory utilization is low (${MEM_USAGE}%); consider reducing memory" >> $REPORT_FILE
fi
# Long-running processes with very low CPU usage (cmd is listed last so the
# %cpu and time fields stay addressable; the time test is a rough string match)
echo -e "\nLong-running low-CPU processes:" >> $REPORT_FILE
ps -eo pid,ppid,%cpu,time,cmd --sort=-%cpu | awk '$3<1.0 && $4>"01:00:00"' | head -10 >> $REPORT_FILE
cat $REPORT_FILE
EOF
chmod +x /usr/local/bin/resource_utilization.sh

Key Interview Questions, Explained
Q1: How would you design a complete monitoring system?
A: Key points of monitoring system design:
1. Monitoring layers:
- Infrastructure: hardware, operating system, network
- Application layer: middleware, databases, application services
- Business layer: key business metrics, user experience
2. Tool selection:
- System monitoring: Zabbix/Prometheus + Grafana
- Log monitoring: ELK Stack/EFK Stack
- APM: SkyWalking/Pinpoint
- Business monitoring: custom metric collection
3. Alerting strategy (see the routing sketch after this list):
- Tiered alerts: Critical/Warning/Info
- Alert aggregation: avoid alert storms
- Notification channels: email/SMS/WeChat/DingTalk
4. Monitoring coverage:
- Availability: service health checks
- Performance: response time, throughput
- Resources: CPU, memory, disk, network
- Security: abnormal logins, permission changes
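A minimal sketch of tiered routing in Alertmanager, for illustration only (the receiver names are made up; a real setup would bind them to actual email/SMS/webhook configurations):

route:
  receiver: default-mail
  group_by: ['alertname']
  routes:
    - match:
        severity: critical
      receiver: oncall-pager      # page immediately, re-notify often
      repeat_interval: 30m
    - match:
        severity: warning
      receiver: team-mail         # batch into mail, re-notify rarely
      repeat_interval: 4h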
Q2: How do you handle the huge data volumes a monitoring system produces?
A: Strategies for handling large-scale monitoring data:
1. Collection:
- Sampling strategy: collect key metrics at high frequency, ordinary metrics at low frequency
- Compression: efficient data formats and compression algorithms
- Local buffering: agents cache locally and send in batches
2. Storage:
- Time-series databases: InfluxDB, TimescaleDB
- Tiered storage: hot data on SSD, cold data on HDD
- Data lifecycle: automatic expiry of old data
3. Queries:
- Index design: sensible time-series indexes
- Pre-aggregation: precompute common query results (see the recording-rule sketch after this list)
- Caching: cache query results
4. Architecture:
- Horizontal scaling: multi-node clusters
- Read/write separation: split query and ingest load
- Load balancing: distributed storage and querying
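As an example of pre-aggregation, a Prometheus recording rule can precompute an expensive expression once per evaluation interval (a sketch; the rule name follows the common level:metric:operation convention):

groups:
  - name: precompute.rules
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100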
Q3: How do you run an effective cost optimization program?
A: Cost optimization strategies:
1. Resource usage analysis:
- Regular utilization reports
- Identify under-used resources
- Analyze usage trends
2. Technical optimization:
- Server consolidation: virtualization
- Automated operations: reduce manual effort
- Performance tuning: raise resource utilization
3. Procurement strategy:
- Reserved instances: long-term, stable workloads
- Spot instances: non-critical workloads
- Subscription billing: discounted rates
4. Monitoring and control:
- Cost monitoring tools: real-time spend tracking
- Budget control: spending alerts
- Regular reviews: monthly cost analysis
Q4: How do you design a log management strategy?
A: Log management strategy design:
1. Classification:
- System logs: operating system and hardware logs
- Application logs: application runtime logs
- Access logs: web server access records
- Security logs: audit and security-event logs
2. Collection:
- Centralized collection: a unified log platform
- Real-time shipping: real-time or near-real-time transport
- Normalization: a standard log format
3. Storage:
- Tiered storage: hot, warm, and cold tiers
- Compression: save storage space
- Backups: off-site copies of critical logs
4. Analysis:
- Full-text search: locate problems quickly
- Statistics: trend analysis, anomaly detection
- Visualization: present results as charts
5. Lifecycle management (see the logrotate sketch after this list):
- Retention policy: retention periods per log type
- Automatic cleanup: delete expired logs
- Compliance: meet regulatory requirements
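A minimal logrotate sketch implementing such a retention policy (the path is illustrative):

/var/log/app/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
}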
Hands-On Practice Suggestions
Skill-building path
- Monitoring practice: build a complete Zabbix/Prometheus monitoring environment
- Log analysis practice: deploy the ELK Stack and analyze real log data
- Automation scripting: write monitoring, alerting, and cleanup scripts
- Performance tuning practice: run load tests and optimizations in a test environment
Interview preparation essentials
- Metric design: be able to design a sensible set of monitoring metrics
- Alerting strategy: be able to define an effective alerting policy
- Cost control experience: have concrete cost-optimization cases
- Incident handling experience: use monitoring to locate and resolve problems quickly
Practice these topics in a real working environment to build solid monitoring and operations experience.