Even though log files are a wealth of information, combing through them is a daunting task when software suffers downtime.
Today’s software environments are complicated because they run numerous services, which makes observability tools essential. Yet even though observability is a $20-billion market, the cost of downtime still surpasses $100 billion. That is because most tools highlight when or where a software failure or incident happens but not why it occurred in the first place, according to Rod Bagg, founder and vice president of engineering at Zebrium Inc., whose mission is to eliminate the pain of root cause analysis.
“It’s great to know that something went wrong, but the root cause of why it’s happening is going to be buried in log files … to get there fast, you better automate or you’re just doomed for failure, and that’s where we come in,” Bagg stated.
Bagg and Larry Lancaster, founder and chief technology officer of Zebrium, spoke with theCUBE industry analyst Dave Vellante during a special broadcast from theCUBE and Zebrium as part of the “Root Cause as a Service: Never Dig Through Logs Again” event. They discussed what root cause as a service entails and how it eliminates the pain of going through logs.
In a separate interview, titled “How Cisco Validated RCaaS at 95.8% Accuracy,” Vellante spoke with Cisco Systems Inc. resident philosopher Atri Basu and Necati Çehreli, technical leader of the customer experience innovation, automation and disruption team. They discussed how, through a rigorous and extensive process, Cisco tested Zebrium’s RCaaS solution using 192 actual customer incidents and found the correct root cause indicators more than 95% of the time. (* Disclosure Below.)
RCaaS quickly finds the needle in the haystack
Dealing with logs is not a friendly affair, because getting the right context can take hours. Zebrium takes away the hassle of digging through logs by quickly providing the root cause analysis on its dashboard, according to Bagg.
“So if something is going wrong with your metrics, and that’s the indicator, or maybe it’s something with tracing that you’re sort of digging through now that you know something’s wrong, we will be right on that same dashboard,” he explained. “So we’re deployed as a SaaS service. You send us your logs, click on one of our integrations, integrate with all these tools, and when we detect anything that is a root cause report, it will show up on your dashboard in the same timeline as those blips in your metrics.”
In his session on “Introducing Root Cause as a Service,” Lancaster discussed the importance of observability when scrutinizing a system’s internal state. However, he pointed out that a person is limited to what they can filter manually, and that’s why automating the observer becomes a game-changer.
“Observability is a property of a system, but the problem is if it’s too complicated, you just push the bottleneck up to your eyeball,” he explained. “A great way to think about it is automating the observer. It means that you reduce your MTTR, meet your service-level objectives, and improve customer experience. People have been trying to figure out how to automate this human part of finding the root cause indicators for a long time, and until Zebrium came along, I would argue no one’s really done it right.”
Because most enterprises now depend on software, Bagg believes that resolving downtime quickly is what makes the difference. “It’s extremely important to our customers and most businesses out there to drive uptime and avoid as much downtime as possible,” he said.
Bagg went on to describe a specific use case that involved an AIOps client that decided to have one of its SREs sign up for Zebrium’s service in its SaaS environment, sending the logs from the client’s system to Zebrium.
“He hadn’t put that integration in, so it wasn’t in his dashboard when he had this incident, but it was certainly in ours,” Bagg said. “It literally would’ve saved him hours and hours. They had this issue going on for over 24 hours, and we had the answer right there in five minutes.”
Zebrium’s root cause as a service solution runs both on-prem and in the cloud, according to Lancaster.
“You can run it on-prem, just like we run it in our cloud. You can run it in your cloud or your own infrastructure,” he explained. “You’ll put us on your dashboard, and it doesn’t matter what kind of a dashboard it is. It could be Datadog, New Relic, Elastic, Dynatrace, Grafana, AppDynamics or ScienceLogic.”
Zebrium’s RCaaS has helped some big players, including Seagate Lyve Cloud, deal with outages quickly, according to Lancaster.
“We got a chance to work with Seagate Lyve Cloud … Zoom has their files stored on Lyve Cloud,” he stated. “What happened was they were in alpha, in their early access, and they had an outage, and it was pretty bad because it went on for longer than a day before they were completely restored. They did some research and saw Zebrium. They went into a staging environment, recreated the exact incident that they’d had, and what they saw was Zebrium popped up a root cause report that told them exactly the root cause they took over a day to find.”
Although Cisco was initially skeptical about Zebrium’s RCaaS solution because it sounded too good to be true, the test results showed how accurately the software, which is powered by unsupervised machine learning, pinpoints root causes, according to Lancaster.
“They took a couple of months, and they did a very detailed study … they got together 192 incidents across four product lines where they knew that the root cause was in the logs,” he explained. “So they ran that data through the Zebrium software, and what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like that blew us away.”
24,000 hours used to go down the drain doing log analysis daily
Introducing his session on “How Cisco Validated RCaaS at 95.8% Accuracy,” Basu disclosed how approximately 8,000 engineers under Cisco’s support arm, Technical Assistance Center, spent three hours each day doing log analysis, equating to 24,000 hours daily. RCaaS significantly helped reduce the human resources needed to complete the log analysis.
“When we started on this journey to augment our support engineers’ workflow with Zebrium’s solution, one of the things that we did was we went out and asked our engineers what their experience was like doing log analysis,” Basu said.
Of the 2.2 million support requests that Cisco TAC gets yearly, 44% are trivial, meaning that the remaining 56% are non-trivial and require digging through logs. RCaaS therefore helps free up engineers by taking much of the manual burden away, according to Basu.
“About 44% of these support requests are usually trivial and can be solved within a call or a day,” he said. “But the rest of TAC cases really involve getting into the network device, looking at logs. It’s a very technical job. You need to be conversant with network solutions, their designs, protocols, etc.”
Because Cisco’s internal automation system was proving hard to maintain, the goal was to automate 50% of the log analysis. After Cisco came across Zebrium, it tested RCaaS across some of its popular products, namely the Webex client, DNA Center, Identity Services Engine and Unified Computing System. The analysis turned out to be correct almost 95% of the time, according to Çehreli.
“We brought it, of course, to our management, and they said, ‘Okay, let’s try this with real users because the log being there is one thing, but the engineer reaching to that log is another take,’” he pointed out. “So we wanted to make sure that when we put it in front of our users and engineers, they can actually come to that log themselves. With a sample set of close to 200 SRs, we found out the majority of the time, almost 95% of the time the engineer could find the log they were looking for in Zebrium’s analysis.”
Another objective was for the automation to be unsupervised and low-noise, a gap that Zebrium filled, Çehreli added.
“We wanted this platform to be unsupervised … so none of the engineers needed to create rules, you know, label logs; this is bad, this is good,” he explained. “The other most important thing for us was we wanted this to be not noisy at all because what happens with noises, when your level of false positives is really high, your engineers start ignoring the good things between that noise. Ultimately, we wanted this new framework to be easily adaptable to our existing workflow, and we came up to Zebrium.”
Because software logs are terse and hard to read, analyzing them is cumbersome and demands significant expertise and a keen eye, according to Basu.
“Logs are very contrite … they try to pack a lot of information in a very little space. This is for performance reasons, storage reasons, etc., but the side effect of that is they’re very esoteric,” he explained. “So they’re hard to read if you’re not conversant, not the developer who wrote those logs, or you aren’t doing code deep dives. So it requires a lot of knowledge about the protocol that’s expected, because when you’re doing log analysis, what you’re really looking for is a needle in a haystack.”
Log analysis used to be a black-and-white affair, and Basu believes Zebrium has turned it into a colorful one.
“I think one statement that really summarizes how Zebrium’s impacted our workflow was from one of our users who said, ‘Well, you know, until you provide us with this tool, log analysis was a very black and white affair, but now it’s become really colorful,’” he said. “What Zebrium does is it provides a lot of color and context to the whole process. So now you’re able to quickly get to Word Cloud, using their interactive histogram, and using the summaries of every incident; you’re very quickly able to summarize what might be happening and what you need to look into.”
(* Disclosure: TheCUBE is a paid media partner for the “Root Cause as a Service” event. Neither Zebrium Inc., the sponsor of theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)