We use several tools to gain insight into performance at each level of our infrastructure.
Metrics are collected by collectd on the host systems and fed to InfluxDB instances; we then visualize the data with Grafana. A variety of collectd plugins gather data about Ceph, system performance, network throughput, switch interfaces (via the SNMP plugin), and more.
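To make the metrics path more concrete, here is a minimal Python sketch of what a single data point looks like once it is in a form InfluxDB accepts (the InfluxDB line protocol). It is illustrative only: the helper names `_escape` and `to_line_protocol`, the measurement name `cpu_value`, the host `ceph01`, and the tag names are assumptions for the example, not taken from our configuration, and the sketch handles float fields only.

```python
from typing import Mapping


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character of `chars` where it occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, float],
    timestamp_s: int,
) -> str:
    """Format one point as an InfluxDB line protocol string.

    Hypothetical helper for illustration; all field values are written as floats.
    """
    # Measurement names escape commas and spaces.
    name = _escape(measurement, ", ")
    # Tag keys and values escape commas, equals signs, and spaces.
    # Tags are sorted by key so the output is deterministic.
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={float(v)}" for k, v in fields.items()
    )
    # Line protocol timestamps default to nanosecond precision.
    return f"{name}{tag_part} {field_part} {int(timestamp_s) * 10**9}"


if __name__ == "__main__":
    # Example values are made up for the sketch.
    print(
        to_line_protocol(
            "cpu_value",
            {"host": "ceph01", "type_instance": "idle"},
            {"value": 97.5},
            1500000000,
        )
    )
    # prints: cpu_value,host=ceph01,type_instance=idle value=97.5 1500000000000000000
```

Tags such as the host name are what Grafana dashboards would typically filter and group on, which is why the sketch keeps them separate from the numeric field.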
Log collection and aggregation uses the “ELK” stack (Elasticsearch, Logstash, Kibana), with Filebeat shipping logs to Elasticsearch.