Oct 4th, 2019 - written by Kimserey.
Splunk is a log aggregator, much in the same way Elasticsearch with Kibana can be used. When I started using Splunk I immediately recognized its capabilities, but my usage was largely limited by my own knowledge of writing queries (which is still very shallow). Every now and then I would find myself needing to compose the same query I wrote the week before but had since forgotten. So today we’ll explore some nice Splunk functionalities.
The function I use the most is timechart. It provides a way to plot a time series where we can specify a span for the precision, an aggregation function applied to the events falling in each bucket, and a split clause to group events.
... | timechart span=5m p99(upstream_response_time)
This gets us the p99 of upstream_response_time over 5-minute buckets across all our events, which is useful to monitor the overall latency of our service.
... | timechart span=5m p99(upstream_response_time) by host
Specifying a split clause by host will generate multiple time series, one per host, which is useful to monitor the latency of specific instances and identify potential issues specific to a particular host.
We can only specify a single split clause, but if we want to split on two fields, we can use eval, which creates a new property in the event, and make use of it in our split clause.
...
| eval host_method=host+"@"+method
| timechart span=5m p99(upstream_response_time) by host_method
This adds a host_method property to each event, combining the host and the method and allowing a split on the combination.
Formatting the query over multiple lines is useful when we want to debug it, as we are able to comment out part of the query using the comment macro:
...
| eval host_method=host+"@"+method
`comment("| timechart span=5m p99(upstream_response_time) by host_method")`
Eval can also be used to construct new properties using if or case.
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m count by stats_str
This filters out events whose status matches neither case (such as 4xx), tags 2xx events with OK and 5xx events with ERROR, then produces a timechart of the counts.
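When we only need a binary property, eval’s if function can be used instead of case. Here is a minimal sketch, assuming the same status field and tagging everything that isn’t a 5xx as OK:
...
| eval stats_str=if(like(status, "5%"), "ERROR", "OK")
| timechart span=5m count by stats_str
Unlike the case version above, no event is dropped here: 4xx and 3xx statuses all fall into the OK bucket.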
Splunk limits the number of split values and puts the rest into an OTHER bucket. We can lift that limit by specifying limit=0.
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m limit=0 count by stats_str
The other aspect of timechart is that it produces a table of split values, indexed by the time. For example, when we split by stats_str, we get a table with the time as the first column and the stats_str values as the remaining columns.
Knowing that, we can compute the overall availability of our service by using the stats_str:
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m limit=0 count by stats_str
| eval success_rate = round((OK / (OK + ERROR)) * 100, 2)
| fields - ERROR OK
Once we have generated the table with timechart, we use eval to compute the success rate, then use fields - [fields] to remove the ERROR and OK columns from the table, leaving only the success rate which we can visualize directly.
Another useful functionality is filling empty values: fillnull and filldown can be used to fill missing values. For example, if a value were missing in a bucket, we could use:
...
| timechart span=1m p99(upstream_response_time) as p99
| fillnull value=1000 p99
this will fill the null values in p99 with 1000, or we can use filldown which will carry the previous value forward for the missing ones:
...
| timechart span=1m p99(upstream_response_time) as p99
| filldown
Timechart can be seen as a shortcut for generating charts indexed by time. Chart can be used to create charts where the row index isn’t the time.
Just to understand how chart works, we will recreate the timechart using chart.
Chart allows us to construct a table indexed by the first property provided after the by directive,
[ BY <row-split> <column-split> ]
this means that the first property given will be the row split and the second will be the column split.
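The row split does not have to be time. As a minimal sketch, assuming the same access log fields as before, we could index rows by status with one column per host:
...
| chart count by status host
This produces one row per status value and one column per host, each cell containing the count of matching events.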
Having that, we can combine chart with bin, which gives us the possibility of replacing the _time value,
| bin _time span=10m
this will replace the _time property of each event with its respective bin of a 10-minute span; for example an event with a time of 8:23:24.227 AM will be changed to 8:20:00.000 AM, effectively making all events fit into bins.
We can then use chart with the bins as the row split and the stats_str we created earlier as the column split:
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| bin _time span=10m
| chart count by _time stats_str
We end up with a table:
| _time | ERROR | OK |
|---|---|---|
| 2019-10-01 07:00:00 | 0 | 5 |
| 2019-10-01 07:10:00 | 1 | 4 |
| 2019-10-01 07:20:00 | 1 | 4 |
This is essentially the same as:
...
| timechart span=10m count by stats_str
Another useful functionality is table, which allows us to display a table of selected fields.
...
| table _time, status, upstream_response_time
Although quite limited, table is very useful to display data in a readable way in a dashboard, removing all noise from the events.
Lastly, stats is used to group events and aggregate them. By using by we can group the aggregation by specific fields, and it also accepts multiple fields to group by, separated by commas.
...
| stats count, p99(upstream_response_time) as p99 by status, host, request
In comparison to chart, stats uses the aggregations as columns and indexes the rows by the split fields. We end up with the following table:
| status | host | request | count | p99 |
|---|---|---|---|---|
| 200 | host1 | POST /api/values | 10 | 2 |
| 200 | host2 | POST /api/values | 2 | 1 |
| 200 | host3 | POST /api/values | 5 | 2 |
| 500 | host1 | POST /api/values | 1 | 5 |
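For contrast, here is a sketch of what chart would give on similar fields; chart only accepts a row split and a column split, so the second field is pivoted into columns:
...
| chart p99(upstream_response_time) by host status
Here each row is a host and each column a status value, whereas stats keeps one row per (status, host, request) combination with the aggregations as columns.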
Today we looked at different Splunk displays. We started by looking at timechart, exploring the different possibilities when combined with eval and search. We then moved on to chart and saw how we could replicate timechart using bin. We completed this post by looking into table and stats, where we saw that stats provides a way to apply aggregation functions on top of groupings of events. I hope you liked this post and I’ll see you in the next one!