Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.
The OSiRIS Ceph deployment spans WSU, MSU, and UM. We have currently deployed approximately 900 OSDs. Our OSDs are 8 TB or 10 TB disks, for a total of about 8 PB of raw storage.
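As a rough sanity check on the figures above, the sketch below bounds the raw capacity by assuming all disks are the smaller size, then all the larger size. It assumes decimal units (1 PB = 1000 TB) and exactly 900 OSDs; the helper name `raw_capacity_pb` is hypothetical and only for illustration.

```python
def raw_capacity_pb(osd_count: int, disk_tb: float) -> float:
    """Raw capacity in PB for osd_count identical disks (decimal: 1 PB = 1000 TB)."""
    return osd_count * disk_tb / 1000


OSD_COUNT = 900  # approximate count quoted in the text

low = raw_capacity_pb(OSD_COUNT, 8)    # every disk is 8 TB
high = raw_capacity_pb(OSD_COUNT, 10)  # every disk is 10 TB

print(f"raw capacity between {low:.1f} PB and {high:.1f} PB")
```

The quoted total of about 8 PB falls between these two bounds, which is what a mix of 8 TB and 10 TB disks would produce.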
All of our components are deployed and managed with a Puppet module forked from a module started by the OpenStack group. The module code is available on GitHub: https://github.com/MI-OSiRIS/puppet-ceph
To gather Ceph metrics we use Collectd with a plugin that reads from the daemon admin sockets. Collectd feeds into InfluxDB, which can ingest Collectd UDP data directly. We also gather system stats such as CPU, I/O time, memory, threads, etc. For an overview of this toolchain, please have a look at our monitoring and logging overview.
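To illustrate what the admin socket data looks like, here is a minimal Python sketch, not the Collectd plugin's actual code. A command such as `ceph daemon osd.0 perf dump` returns cumulative counters as JSON; for an OSD, the `osd` section includes an `op` counter and an `op_latency` entry with `avgcount` and `sum` fields. Rates and average latencies over an interval can be derived from the difference between two samples. The sample values and the function name `interval_stats` below are hypothetical.

```python
from typing import Optional


def interval_stats(prev: dict, curr: dict, seconds: float) -> dict:
    """Derive ops/s and mean op latency from two cumulative perf dump samples."""
    p, c = prev["osd"], curr["osd"]

    # Operations completed during the interval.
    ops = c["op"] - p["op"]

    # op_latency is a (count, total seconds) pair; difference both parts.
    d_count = c["op_latency"]["avgcount"] - p["op_latency"]["avgcount"]
    d_sum = c["op_latency"]["sum"] - p["op_latency"]["sum"]

    # No completed operations means the average latency is undefined.
    latency: Optional[float] = d_sum / d_count if d_count > 0 else None

    return {"ops_per_sec": ops / seconds, "avg_latency_s": latency}


# Hypothetical samples taken 10 seconds apart (values are made up).
sample_t0 = {"osd": {"op": 1000, "op_latency": {"avgcount": 1000, "sum": 2.0}}}
sample_t1 = {"osd": {"op": 1600, "op_latency": {"avgcount": 1600, "sum": 3.5}}}

stats = interval_stats(sample_t0, sample_t1, seconds=10)
print(stats)
```

Differencing cumulative counters rather than reading them directly keeps the result meaningful for any sampling interval, which is the form a time-series database needs for plotting.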
We can then visualize this data with Grafana. For example, here are two simple dashboards showing OSD operation latency and operations per second.
We can also combine plots to make dashboards that give us an overview of our cluster.