Oct 4th, 2019 - written by Kimserey.
Splunk is a log aggregation platform, much like Elasticsearch paired with Kibana. When I started using Splunk I immediately recognized its capabilities, but my usage was largely limited by my own knowledge of writing queries (which is still quite low). Every now and then I find myself needing to compose the same query I wrote the week before, having since forgotten how. So today we'll explore some nice Splunk functionalities.
The function I use the most is `timechart`. It plots a time series where we can specify a span for the precision, an aggregation function applied to the events falling in each bucket, and a split clause to group events.
```
... | timechart span=5m p99(upstream_response_time)
```
This gets us the `p99` of `upstream_response_time` over 5-minute spans across all our events, useful to monitor the overall latency of our service.
```
... | timechart span=5m p99(upstream_response_time) by host
```
Specifying a split clause `by host` will generate multiple time series, one per host, useful to monitor the latency of specific instances and identify potential issues specific to a particular host.
We can only specify a single split clause, but if we want to split on two fields, we can use `eval` to create a new property on the event and make use of it in our split clause.
```
...
| eval host_method=host+"@"+method
| timechart span=5m p99(upstream_response_time) by host_method
```
This adds a property `host_method` on each event, combining `host` and `method` and allowing a split on the combination.
Formatting the query over multiple lines is useful when we want to debug it, as we can comment out part of the query using the `comment` macro:
```
...
| eval host_method=host+"@"+method
`comment("| timechart span=5m p99(upstream_response_time) by host_method")`
```
`eval` can also be used to construct new properties using `if` or `case`.
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m count by stats_str
```
This tags `2xx` events with `OK` and `5xx` events with `ERROR`, removes the `4xx` status codes (and anything else `case` leaves null), then produces a timechart on it.
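Since only `case` is demonstrated above, here is a minimal sketch of the `if` variant on the same `status` field; note that unlike the `case` version, everything that is not `5xx` (including `4xx`) falls into `OK` here:

```
...
| eval stats_str=if(like(status, "5%"), "ERROR", "OK")
| timechart span=5m count by stats_str
```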
Splunk limits the number of split values and puts the rest into an `OTHER` bucket. We can lift that limit by specifying `limit=0`.
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m limit=0 count by stats_str
```
Another aspect of `timechart` is that it produces a table of split values indexed by time. For example, when we split `by stats_str`, we get a table whose first column is the time and whose remaining columns are the `stats_str` values.
Knowing that, we can compute the overall availability of our service using `stats_str`:
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m limit=0 count by stats_str
| eval success_rate = round((OK / (OK + ERROR)) * 100, 2)
| fields - ERROR OK
```
Once we generate the table with `timechart`, we use `eval` to compute the success rate and then `fields - [fields]` to remove the `ERROR` and `OK` columns from the table, leaving only the success rate, which we can then visualize directly.
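One caveat: if no event at all matched `ERROR` (or `OK`) in the search window, that column never exists and the division yields no result. A minimal sketch of a guard, assuming `fillnull` is given the column names explicitly so it creates any missing column with the fill value:

```
...
| timechart span=5m limit=0 count by stats_str
| fillnull value=0 OK ERROR
| eval success_rate = round((OK / (OK + ERROR)) * 100, 2)
| fields - ERROR OK
```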
Another useful functionality is filling empty values: `fillnull` and `filldown` can both be used to fill missing values. For example, if a value were missing in a bucket, we could use:
```
...
| timechart span=1m p99(upstream_response_time) as p99
| fillnull value=1000 p99
```
This fills the null values in `p99` with `1000`. Alternatively, we can use `filldown`, which reuses the previous value for missing values:
```
...
| timechart span=1m p99(upstream_response_time) as p99
| filldown
```
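`filldown` also accepts a field list when only specific columns should be filled; for example, restricted to `p99`:

```
...
| timechart span=1m p99(upstream_response_time) as p99
| filldown p99
```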
`timechart` can be seen as a shortcut to generate charts indexed by time. `chart` can be used to create charts where the row index isn't the time.
Just to understand how `chart` works, we will recreate the timechart using `chart`.
`chart` allows us to construct a table indexed by the first property provided after the `by` directive:
```
[ BY <row-split> <column-split> ]
```
This means that the first property given will be the row split and the next value will be the column split.
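For instance, reusing fields from earlier in this post, splitting rows by `host` and columns by `status` would look like this (a minimal sketch):

```
... | chart count by host status
```

Here each row is a `host`, each column a `status` value, and each cell the event count.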
Having that, we can combine it with `bin`, which gives us the possibility of replacing the `_time` value:
```
| bin _time span=10m
```
This replaces the `_time` property of each event with its respective bin of a 10-minute span; for example, an event at `8:23:24.227 AM` becomes `8:20:00.000 AM`, effectively making all events fit into bins.
We can then use `chart` to split rows by the bins and columns by the `stats_str` we defined earlier:
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| bin _time span=10m
| chart count by _time stats_str
```
We end up with a table:
| _time | ERROR | OK |
|---|---|---|
| 2019-10-01 07:00:00 | 0 | 5 |
| 2019-10-01 07:10:00 | 1 | 4 |
| 2019-10-01 07:20:00 | 1 | 4 |
This is essentially the same as:
```
...
| timechart span=10m count by stats_str
```
Another useful functionality is `table`, which allows us to display a table with the selected fields.
```
...
| table _time, status, upstream_response_time
```
Although quite limited, `table` is very useful for displaying data in a readable way on a dashboard, removing all noise from the events.
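For example, pairing `table` with `sort` can surface the slowest requests in a panel; a minimal sketch with the same fields:

```
...
| table _time, status, upstream_response_time
| sort - upstream_response_time
```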
Lastly, `stats` is used to group and aggregate events. Using `by`, we can group the aggregation by specific fields; it also accepts multiple fields to group by, separated by commas.
```
...
| stats count, p99(upstream_response_time) as p99 by status, host, request
```
In comparison to `chart`, `stats` uses the aggregations as columns and indexes by the split fields. We end up with the following table:
| status | host | request | count | p99 |
|---|---|---|---|---|
| 200 | host1 | POST /api/values | 10 | 2 |
| 200 | host2 | POST /api/values | 2 | 1 |
| 200 | host3 | POST /api/values | 5 | 2 |
| 500 | host1 | POST /api/values | 1 | 5 |
Today we looked at different Splunk displays. We started with `timechart`, exploring the different possibilities when combined with `eval` and `search`. We then moved on to `chart` and saw how we could replicate `timechart` using `bin`. We completed this post by looking into `table` and `stats`, where we saw that `stats` provides a way to apply aggregation functions on top of groupings of events. I hope you liked this post, and I'll see you in the next one!