
Performance degradation in processing rate since version 0.32 #18819

Closed
davidgcameron opened this issue Oct 11, 2023 · 9 comments
Labels
domain: vrl (Anything related to the Vector Remap Language) · type: bug (A code related bug)

Comments

@davidgcameron


Problem

We use Vector to insert HAProxy logs into a ClickHouse database (kafka source to clickhouse sink). Since version 0.32 we have seen a significant degradation in the insert rate as well as an increase in CPU consumption.

To collect some numbers we ran local tests of the same pipeline with a socket source and a clickhouse sink, and we see a big difference between pre-0.32 and post-0.32 versions:

Version   Time to insert 500k rows (s)   Peak CPU (%)
0.30.0    41                             370
0.31.0    32                             450
0.32.2    65                             490
0.33.0    60                             520

To eliminate the effect of ClickHouse we repeated the same tests with a file sink. In this case the CPU consumption was fairly constant, but we saw the same difference in insert time:

Version   Time to insert 500k rows (s)   Peak CPU (%)
0.30.0    27                             520
0.31.0    27                             510
0.32.2    48                             540
0.33.0    41                             520

Configuration

No response

Version

0.32.2

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

davidgcameron added the type: bug label on Oct 11, 2023
@jszwedko (Member)

Hi @davidgcameron ,

Thanks for reporting this! Could you share your configuration so that we can attempt to reproduce what you are seeing? I took a quick look through the diff from v0.31.0 to v0.32.2 and the only change that jumped out was #18005.

@davidgcameron (Author)

I think I've narrowed down the problem to regex parsing. We have a transform which parses a log string into roughly 50 fields using parse_regex(). When I changed this to parse only the first couple of fields, the performance was identical between versions.

Here's the basic configuration with some details removed:

sources/socket_haproxy.toml:

type = "socket"
address = "0.0.0.0:${VECTOR_SOCKET_PORT}"
mode = "tcp"

transforms/to_haproxy_log.toml (with the long version of the regex)

inputs = ["socket_haproxy"]
type = "remap"
source = '''

  parsed, err = parse_regex(string!(.message), r'^(?P<log_at>\S+ +\S+ +\S+) (?P<log_host>[\w\.]+) (?P<log_domain>\w+)\[(?P<log_process>\d+)\]:( (?P<log_name>\w+))? (?P<client_ip>\S+) \- \- \[(?P<request_at_str>.*)\] "(?P<http_request_line>.*?)" (?P<http_status_code>\S+) (?P<response_size>\S+) "" "" (?P<client_port>\d+) (?P<request_at_ms>\d+) "(?P<hap_frontend>\S*)" "(?P<hap_backend>\S*)" "(?P<hap_server_name>\S*)" (?P<hap_TR>\S+) (?P<hap_Tw>\S+) (?P<hap_Tc>\S+) (?P<hap_Tr>\S+) (?P<hap_Ta>\S+) (?P<hap_tsc>[\w\-]{4}) (?P<hap_ac>\d+) (?P<hap_fc>\d+) (?P<hap_bc>\d+) (?P<hap_sc>\d+) (?P<hap_rc>\d+) (?P<hap_sq>\d+) (?P<hap_bq>\d+) "(?P<http_request_session_id>\S*)" "(?P<http_response_session_id>\S*)" "(?P<uid>\S*)" "(?P<http_host>[\w\-\.]*)" "(?P<appversion>\S*)" "(?P<http_via>(\S+ \w+|))" "(?P<request_id>\S*)" "(?P<hap_sslv>[\w\.\-]+)" "(?P<hap_sslc>\S*)" "(?P<hap_ssl_id>\S*)" "(?P<hap_ssl_sni>[\w\-\.]*)"( "(?P<hap_ssl_cn>\S+)")?( "(?P<ja3_hash>\w{32})?" "(?P<hap_net>\d+)")? +"(?P<hap_nr>[\w_\d\-:]*)"( ""(?P<hap_jsonblob>\{.*\})"")? \((?P<hostname>.*?)\)$')
 

  if err != null {
    .log_fmt = "broken"
  } else {
    . = merge(., parsed)
  }
'''

sinks/file_haproxy_log.toml

inputs = ["to_haproxy_log"]
type = "file"
path = "/tmp/vector-%Y-%m-%d.log"
encoding.codec = "text"
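
For illustration, a shortened variant of the transform described above (parsing only the first couple of fields) might look like the following; exactly which capture groups were kept in the actual comparison is an assumption:

transforms/to_haproxy_log_short.toml (hypothetical shortened variant)

inputs = ["socket_haproxy"]
type = "remap"
source = '''
  # Same structure as the full transform, but the regex captures only log_at and log_host.
  parsed, err = parse_regex(string!(.message), r'^(?P<log_at>\S+ +\S+ +\S+) (?P<log_host>[\w\.]+)')

  if err != null {
    .log_fmt = "broken"
  } else {
    . = merge(., parsed)
  }
'''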

@neuronull (Contributor)

Hi @davidgcameron, thanks for helping us narrow that down.

Would you be able to provide a sample input event that matches that regex, to aid in reproducing and root-causing the issue?

Thanks

@davidgcameron (Author)

Sure, here is one example:

Jul 21 02:51:46 localhost growth[2492806]: 111.11.11.1 - - [21/Jul/2023:02:51:46 +0000] \"GET /page-data/12345678.json HTTP/1.1\" 304 371 \"\" \"\" 33930 989 \"https~\" \"https-servers\" \"webhost-04\" 0 0 2 2 4 ---- 310 186 11 3 0 0 0 \"Session-Id=ZZZZ\" \"Session-Id=YYYY\" \"\" \"example.com\" \"\" \"\" \"8888\" \"TLSv1.3\" \"TLS_AES_256_GCM_SHA384\" \"X8X8X8X8X8X8X8X8X8X8=\" \"example.com\" \"\" \"0\" \"ja3_report\" \"\"{\"\"ja3_hash\"\":\"\"2F2F2F2F2F2F2F2F2F2F2F2F2F2F2F2F\"\",\"\"test\"\":0,\"\"blocklist\"\":0}\"\" (webhost-04.example.com)

@neuronull (Contributor)

🤔 Hmm so far I have not been able to reproduce this locally.

I tried both the socket source and the file source. In each case I tried v0.31.0, v0.32.0, and v0.33.0, and in all cases 500k copies of that example log line took 22-23 seconds on an M1 Max.

I'm wondering if system differences could account for it. What are you running this on?

Another possibility is that this particular example event doesn't reproduce the behavior. Could you confirm that repeating this line 500k times produces the behavior for you?

@neuronull (Contributor)

This is the command I used for the socket source:

socat -dd OPEN:data.txt TCP:localhost:1234

@davidgcameron (Author)

I re-ran the test sending that exact line and it gave the same result. I'm generating the data with a simple Python script that creates the lines and sends them to the Vector socket.
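
For reference, a minimal sketch of that kind of generator script, assuming the port from the socat example above and a placeholder for the sample line (this is not the reporter's actual script):

#!/usr/bin/env python3
# Hypothetical load generator: repeats a sample HAProxy log line and streams it
# to Vector's TCP socket source. Host, port, and line count are assumptions.
import socket

# Shortened placeholder; substitute the full example log line quoted above.
SAMPLE_LINE = 'Jul 21 02:51:46 localhost growth[2492806]: ...\n'.encode()
NUM_LINES = 500_000

with socket.create_connection(("127.0.0.1", 1234)) as sock:
    for _ in range(NUM_LINES):
        sock.sendall(SAMPLE_LINE)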

I'm testing on a Dell laptop (8-core i7, 16 GB RAM), but we first noticed the issue when upgrading Vector on a production server: there was suddenly a huge increase in CPU consumption and some delays in our data pipeline. That machine is a high-end Xeon Gold server with 96 cores and 384 GB of RAM.

@dsmith3197 (Contributor)

I wanted to note that Vector v0.32.0 includes #17874, which bumps the regex crate to v1.9.0; that release includes a rewrite of the crate's internals. The performance difference in Vector is likely explained by that change.

Now, given that, I'm not sure how to improve the performance for this particular use case. That would require further digging.

cc @fuchsnj. I know you have some more experience with the regex crate. Does anything pop to mind here?

dsmith3197 added the domain: vrl label, and added and then removed the meta: regression label, on Dec 21, 2023
@jszwedko (Member)

Closing since unfortunately I don't think there is anything we can do here. This appears to be due to the regex crate bump.
