Chronosphere enables organizations to operate reliably at scale and make precise, data-driven decisions

50 Most Admired Companies to Watch 2022

CIO Bulletin

You cannot neglect observability as the technology landscape keeps growing more complex. Cloud, DevOps, microservices, containers and serverless are a few examples. The benefits of observability tools are clear: they increase velocity and reduce the time it takes to learn about issues that could arise at any point from staging to production. Observability is a reliable way to master complexity, offering holistic insight into the changes taking place within your company’s software and systems.

Determining the actual state of a system by examining its external outputs, in order to find the root cause of issues and what can be done to fix them at that level, is essential to maintaining and enhancing infrastructure. Observability does this, providing insight into the whole infrastructure.

In short, observability entails assembling log fragments and the output of monitoring tools, then organizing them to derive actionable knowledge of the whole environment. That is what gives you proper insight.

Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. Engineering organizations around the world, from startups to well-known global brands in the Fortune 500, trust Chronosphere to help them operate scalable, highly available and resilient applications.

The story behind the establishment of Chronosphere

Chronosphere’s founders and engineers architected, developed and scaled Uber’s monitoring platform. As Uber’s cloud-native journey began and adoption of microservices and container-based infrastructure grew, it became obvious that no solution, open source or vendored, was scalable, reliable or cost-efficient enough.

The answer was to build from within, with a focus on tackling the unique challenges cloud-native applications created for monitoring. The outcome was the metrics engine – M3 – which was developed in open source from day one to ensure it would not only solve the problem for a single company, but help the broader community as well.

M3 scaled to power one of the largest production monitoring use cases in the world, ingesting billions of data points per second and serving hundreds of thousands of dashboards and alerts. It was also adopted by other household brands such as Walmart, FedEx and Comcast.

As the open source community grew, so did the next layer of questions from large organizations and fast-growing tech companies that needed more than the open source project had to offer.

It only felt right to take this as an opportunity to build upon the technology and the experience gained to go on a mission to create the world’s most scalable, reliable and customizable cloud monitoring solution for the rest of the companies that are embarking on their cloud-native journeys.

Mission of the company

The company’s mission is to redefine monitoring for the cloud-native world by building the world’s most scalable, reliable and customizable monitoring platform.

To successfully achieve the mission, Chronosphere is focused on 3 main goals:

Develop amazing software
Build a world-class team
Partner with great customers

The 3 phases of observability

Understanding and embracing the three phases of observability is the best way to respond to the questions that arise when something goes wrong. During each phase, the focus is on alleviating the customer impact, that is, remediating the problem, as fast as possible.

Remediation is the act of alleviating customer pain and restoring the service to acceptable levels of availability and performance. At each phase, the engineer is looking for enough information to remediate the issue, even if they don’t yet understand the root cause.

Phase 1: Know about the problem

The first step to resolving an issue is knowing the issue exists — ideally before it impacts any customers. Sometimes, just knowing an issue is occurring is enough to trigger a remediation. For example, if you deploy a new version of a service and an alert triggers for that service, rolling back the deployment is the quickest path to remediating the issue without needing to understand the full impact or diagnose the root cause during the incident. Those can be examined after the issue is remediated, when there isn’t active customer impact.

Introducing changes to a system is the largest source of production issues, so knowing about problems and the scope of the impact as these changes are introduced is key.
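The Phase 1 rollback rule can be sketched in a few lines of Python. This is only an illustration, not Chronosphere’s implementation: the `Deployment` and `Alert` types, the `should_roll_back` function and the 30-minute window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """A hypothetical record of a service release."""
    service: str
    version: str
    deployed_at: datetime


@dataclass
class Alert:
    """A hypothetical alert raised for a service."""
    service: str
    fired_at: datetime


def should_roll_back(
    alert: Alert,
    last_deploy: Optional[Deployment],
    window: timedelta = timedelta(minutes=30),  # assumed threshold
) -> bool:
    """Suggest a rollback when an alert fires shortly after a deploy
    of the same service, without diagnosing the root cause first."""
    if last_deploy is None or last_deploy.service != alert.service:
        return False
    elapsed = alert.fired_at - last_deploy.deployed_at
    return timedelta(0) <= elapsed <= window
```

In practice the window would be tuned per service, and the decision would still go through whatever approval process the team uses for rollbacks.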

Phase 2: Triage the problem

The goal of this phase is to quickly understand the context and impact of an issue. Once an alert goes off, if it is not immediately obvious that a recent change to the system needs to be rolled back, the next step is to understand the business impact and the severity. Often, understanding the scope of the issue can lead to remediation.

To help triage issues, you need to be able to quickly put an alert into context: how many customers or systems are impacted, and to what degree. Great observability allows you to dissect and pivot highly granular data, shining a spotlight on the contextualized telemetry you need to diagnose issues.
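As a rough sketch of what pivoting granular data during triage might look like, the Python below groups request events by an arbitrary dimension and reports how many failed. The event fields (`customer`, `region`, `status`), the sample data and the rule that a status of 500 or above counts as a failure are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Set

# Hypothetical request events; real telemetry would come from a metrics store.
SAMPLE_EVENTS = [
    {"customer": "acme", "region": "us-east", "status": 200},
    {"customer": "acme", "region": "us-east", "status": 503},
    {"customer": "globex", "region": "eu-west", "status": 200},
    {"customer": "initech", "region": "us-east", "status": 500},
]


def is_failure(event: Mapping) -> bool:
    """Assumed failure rule: any 5xx status counts as a failed request."""
    return event["status"] >= 500


def summarize_impact(events: Iterable[Mapping], dimension: str) -> Dict[str, dict]:
    """Pivot events by one dimension and count totals and failures per value."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if is_failure(event):
            failures[key] += 1
    return {
        key: {
            "total": totals[key],
            "failed": failures[key],
            "error_rate": failures[key] / totals[key],
        }
        for key in totals
    }


def affected_customers(events: Iterable[Mapping]) -> Set[str]:
    """Return the set of customers that saw at least one failed request."""
    return {event["customer"] for event in events if is_failure(event)}
```

Calling `summarize_impact(SAMPLE_EVENTS, "region")` and then `summarize_impact(SAMPLE_EVENTS, "customer")` shows the same incident from two angles, which is the kind of pivot that helps judge scope and severity.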

Phase 3: Understand the problem

This phase ideally occurs after remediation, when you can take the time to locate and understand the underlying root cause of issues without the pressure of a ticking clock of customer expectations. With an ever-increasing volume of microservices, doing a postmortem on an incident is often an exercise in navigating a twisted web of dependencies and trying to determine which service owner you need to work with.

Great observability gives a direct line of sight linking your metrics and alerts to the potential culprits. Additionally, it provides insights that can help fix underlying problems to prevent recurrence of incidents.
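One way to picture the path from an alerting service to the owners worth contacting is a breadth-first walk over a dependency map. The sketch below is illustrative only: the `dependency_candidates` function, the shape of the `depends_on` and `owners` mappings and the "unknown" fallback are assumptions, not part of any Chronosphere API.

```python
from collections import deque
from typing import Dict, List, Tuple


def dependency_candidates(
    service: str,
    depends_on: Dict[str, List[str]],
    owners: Dict[str, str],
) -> List[Tuple[str, str, int]]:
    """Walk the dependencies of an alerting service breadth-first.

    Returns (dependency, owner, depth) tuples, nearest dependencies first,
    so the closest potential culprits and their owners are listed at the top.
    """
    seen = {service}
    queue = deque([(service, 0)])
    candidates: List[Tuple[str, str, int]] = []
    while queue:
        current, depth = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep in seen:
                continue  # skip shared dependencies already visited
            seen.add(dep)
            candidates.append((dep, owners.get(dep, "unknown"), depth + 1))
            queue.append((dep, depth + 1))
    return candidates
```

The walk only lists candidates in order of distance. Deciding which one actually caused the incident still depends on the metrics and alerts attached to each service.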

Martin Mao, CEO/Co-Founder

Martin Mao is the co-founder and CEO of Chronosphere. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Prior to that, he was a technical lead on the EC2 team at AWS and has also worked for Microsoft and Google. He and his family are based in Chronosphere’s Seattle hub and he enjoys playing soccer and eating meat pies in his spare time.