We use several tools to gain insight into performance at each level of our infrastructure.
Metrics and statistics are collected with collectd on host systems, which feeds instances of InfluxDB. We then visualize this data with Grafana. A variety of collectd plugins gather data about Ceph, system performance, network throughput, switch interfaces (SNMP plugin), and more.
Log collection and aggregation use the “ELK” stack, with Filebeat shipping logs to Elasticsearch:
Log collection and processing in Logstash
Log storage in an Elasticsearch cluster
Visualization in Kibana, and also in Grafana for data processed as time series.
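The processing step in the list above can be pictured with a small sketch that turns a raw log line into a structured document of the kind that would then be stored and searched. This is not our Logstash configuration: the line format, the field names other than `@timestamp` and `message`, and the `parse_failure` tag are assumptions made for illustration.

```python
import json
import re

# Hypothetical line format: "<ISO timestamp> <host> <program>: <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) (?P<program>[^:\s]+): (?P<message>.*)$"
)


def parse_line(line: str) -> dict:
    """Split one raw log line into named fields, or tag it if it does not match."""
    text = line.strip()
    match = LINE_RE.match(text)
    if match is None:
        # Keep the raw text so nothing is lost, and mark it for later inspection.
        return {"message": text, "tags": ["parse_failure"]}
    fields = match.groupdict()
    return {
        "@timestamp": fields["timestamp"],
        "host": fields["host"],
        "program": fields["program"],
        "message": fields["message"],
    }


if __name__ == "__main__":
    raw = "2017-07-14T02:40:00Z storage-node-01 ceph-osd: slow request detected"
    # Print the structured form as JSON, the document format Elasticsearch stores.
    print(json.dumps(parse_line(raw), indent=2))
```

Once lines are split into fields like these, a visualization layer can filter by host or program and chart event counts over time.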