# Kernel gets stuck and does not compute any comment if too many simultaneous requests are sent #227
Note: the following contributors may be suitable for this task: whilefoo, gitcoindev, gentlementlegen
Thanks for sharing your research. It seems that a final solution isn't clear. I suppose that research can continue under this conversation thread. @whilefoo please look into this.
Basically, it seems that once the token reaches its limit, the calls all hang due to the use of […]. As a temporary solution, I could suggest filtering some events, like the ones related to […].
This is easily reproducible on a local setup as well, and gives more information about failures. Even the […]
It seems the primary rate limit is 5000 per hour per org/repo, but we are hitting the secondary rate limit, which happens when too many requests happen at once (100 at once, or too many per minute). One solution would be to space out events: for example, if we receive 100 events at once, we need to process them one by one and not all at once, keeping priority in mind too, as events that need an instant response like commands must be processed immediately while others like text embeddings can be processed later.
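The spacing idea above can be sketched as a tiny in-memory queue. This is a hypothetical illustration (the class and field names are invented, not the kernel's actual API): events are processed one at a time with a fixed delay between them, and higher-priority events such as commands jump ahead of lower-priority work like text embeddings.

```typescript
// Hypothetical sketch: a priority-aware queue that spaces out event processing.
type QueuedEvent = { name: string; priority: number; run: () => Promise<void> };

export class SpacedEventQueue {
  private queue: QueuedEvent[] = [];
  private draining = false;

  constructor(private delayMs: number) {}

  enqueue(event: QueuedEvent): void {
    this.queue.push(event);
    // Higher priority runs first (commands before embeddings, for example).
    this.queue.sort((a, b) => b.priority - a.priority);
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.draining) return;
    this.draining = true;
    while (this.queue.length > 0) {
      const event = this.queue.shift() as QueuedEvent;
      // Log-and-continue so one failing event does not stall the queue.
      await event.run().catch(() => {});
      if (this.queue.length > 0) {
        await new Promise((resolve) => setTimeout(resolve, this.delayMs));
      }
    }
    this.draining = false;
  }
}
```

A real implementation would need persistence (a Worker instance can be recycled), but the sketch shows the shape of the idea: newly arrived high-priority events overtake queued low-priority ones.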
But Worker plugins would still fail to execute, as they use the same installation token passed from the kernel, wouldn't they? Our Octokit uses the plugin-retry plugin, which should retry requests after the rate limit is over, but I think that […]. @gentlementlegen Didn't we try running the kernel on Azure, or was that reverted?
Yes, it happens because whenever we receive an event we do the following: […]
On average, one event involves ~10 calls to the GitHub API (and subsequently we use the same token in all the plugins, which can make tons of calls as well). Plus, this is multiplied by each external organization using our bot. The more plugins we have, the more calls are made, and this happens for literally any event, which is why I was suggesting filtering out […]. We could have some event queue indeed, but we can't delay the calls for too long without the worker shutting down. And as you said, we need commands to still be responsive. I think we should find a way to avoid fetching the manifest for all the plugins on each run, which would help lower API calls.
Yes, it seems to be, but then the kernel should work fine again after waiting for some time (when the limit resets); instead it works right away when restarting the kernel instance, somehow. We have an Azure instance up and running, only configured for […]
This secondary rate limit seems like a mess to deal with. So is the plan now to make an event queue and handle events with a static delay timer between each, so we can avoid the rate limit?
Like @whilefoo said, the problem is when commands that need an immediate response have to be handled, because if you have 100+ events in the queue they would take too long to be triggered. My suggestion as an immediate fix would be to filter events we do not use, like […]
I suggested multiple priority queues, but thinking about this more I realized that it's not feasible, because you don't know the priority if you don't know the config; for example, a […]
We should definitely do that. I'm not sure about the credentials: they need to use the installation's token, but they can't get it by themselves. The most obvious fix is to not store the config in GitHub but somewhere else, like a database. You could build queues on top of this by fetching the config from the database (each plugin would have a priority level) and putting it in the queue for that priority level. Cloudflare Queues also have retries, so if the rate limit is hit you can schedule a retry after X time, and they also have a 30 min time limit compared to 30 seconds on normal Workers. However, I think this idea won't be liked because it moves away from GitHub and creates a dependency on Cloudflare.
Okay, then I will start filtering events, and link the changes here. For the credentials, we share […]. The queue seems to introduce a lot of complexity and fragile logic; I think we'd be better off avoiding it for now.
You're saying that each organization that uses our bot should create their own Github App and share credentials to the plugin via environment variables? |
Yes that would be my suggestion, but I do agree that it would add friction. Doesn't it feel dangerous that a third party can create its own plugin with our token elevations though? And yes it would count against our own token if all the requests they do in their plugin use our token. |
That was a concern from the start, but it is not that critical, because they can't access other organizations with that token, only the one that installed the plugin, and that organization has to trust the plugin, otherwise they wouldn't install it. I understand now that our GitHub App would be used only to fetch configs and manifests and dispatch workflows, and the organization's App would be used for the plugin, which would alleviate the problem with rate limits. However, I feel like this would add too much friction.
In my mind, the following would happen: […]

I think this would be beneficial for two main reasons: […]
Possible but we haven't had to elevate permissions in a very long time. I think we have it mostly covered. Worst case scenario: if they aren't doing anything payment related they can simply make a GitHub action. The only secret sauce we should be focusing on is providing the infrastructure to essentially map any webhook event to a financial reward and to allow the distribution of that reward.
This is only true if we 1) accept their changes in a pull request to the kernel and/or 2) install that plugin on our repos.
New updates regarding the quota: […]
Setting in the environment seems appropriate! Perhaps we can set an array of values and any org/repo slug can be ignored
Come to think of it though, we may even be able to deprecate the issues being opened in that repository, because now we simply aggregate them into a JSON object, although it is kind of nice to see the confirmation, when the link back occurs, that it is in the directory.
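The env-based ignore list suggested above could look like the following minimal sketch. The `EVENT_IGNORE_LIST` variable name and the slug format (comma-separated, either `org` or `org/repo`) are assumptions for illustration:

```typescript
// Hypothetical: EVENT_IGNORE_LIST="ubiquibot/production,some-org"
// Any matching slug (whole org, or a specific org/repo) causes the event to be skipped.

export function parseIgnoreList(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((slug) => slug.trim().toLowerCase())
      .filter((slug) => slug.length > 0)
  );
}

export function isIgnored(ignoreList: Set<string>, owner: string, repo: string): boolean {
  const org = owner.toLowerCase();
  // An org-level entry ignores every repo; an org/repo entry ignores only that repo.
  return ignoreList.has(org) || ignoreList.has(`${org}/${repo.toLowerCase()}`);
}
```

The kernel would call `isIgnored` right after parsing the webhook payload and return early, before any GitHub API call is made.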
But the consumer of a third-party plugin would have to install their app, so you would end up with many apps installed. But I think we're still a long way from third-party plugins, so we can think about this later.
I don't like the idea of installing custom apps for plugins. It's not a good approach |
It seems that lowering the amount of calls didn't really solve the problem; the kernel still gets stuck often (today particularly, because GitHub servers seem to be partially down). The rate limit in the logs always shows 5k+ calls remaining (primary rate limit). When stuck, it usually stays at "trying to fetch configuration from" or "dispatching event for" and then nothing, meaning the Octokit call never made it. However, no errors or logs are shown afterwards. The next changes I will try: […]
I tried what I mentioned above and the following: […]

The problem is the same; I can see a […]
Every time this happens, redeploying the kernel makes it work again for around 1 h, and then it gets stuck again. I don't think this is due to the secondary rate limit either, because the number of requests per second averages […]
Would you say it's safe to blame Cloudflare then? Have you considered A/B testing the kernel on another platform, like Azure or something?
@0x4007 I should definitely try with another service, yes. I don't think Cloudflare is to blame, because requests using […]
Is it possible that the 10 ms limit is reached and Cloudflare shuts down the worker? But it's weird that it only happens after 1 hour.
Updates on the monitoring: very often the code gets stuck with the following logs (this was a […]), which corresponds to the following source code: ubiquity-os-kernel/src/github/utils/config.ts, line 190 in fbccd44
So I thought maybe the package used to read YAML was the culprit. I changed it to another one, and the issues seem to happen less often, but still at the same spot. I wondered if both of these libraries use a method that is not thread-safe on Cloudflare, since we use […]. It may be a coincidence that it always seems to break there, so with @0x4007 we considered trying a gigantic configuration, which I did (more than 40 plugins!), and it worked fine until one hour later, when it started to skip events. Having no logs truly doesn't help, but I still suspect CF of unexpectedly killing the worker, which gets silenced due to […]. Edit: re-reading these logs, I realize two GET requests were sent but only one file was parsed. Could this be some race condition, or a promise failing that kills the other? I will try to run these synchronously and see if there is any change.
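Running the two GET requests synchronously, as proposed in the edit above, also makes failures attributable: with `Promise.all`, one rejection rejects the combined promise and the sibling request's outcome is lost. Below is a minimal sketch of that change with an injected fetcher; the function and result names are illustrative, not the kernel's actual API:

```typescript
// Hypothetical sketch: fetch config files one after another, recording each outcome
// instead of letting a single rejection hide the other request's result.

type ConfigResult = { path: string; ok: boolean; body?: string; error?: string };

export async function fetchConfigsSequentially(
  paths: string[],
  fetchConfigFile: (path: string) => Promise<string>
): Promise<ConfigResult[]> {
  const results: ConfigResult[] = [];
  for (const path of paths) {
    try {
      results.push({ path, ok: true, body: await fetchConfigFile(path) });
    } catch (err) {
      // The failure is attached to the exact path that failed.
      results.push({ path, ok: false, error: String(err) });
    }
  }
  return results;
}
```

`Promise.allSettled` would give the same attributability while keeping the requests concurrent; the sequential form additionally rules out any race between the two parses.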
Maybe `node-compat` is not helping the problem.
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Task | 1 | 250 |
| Issue | Specification | 1 | 93.3 |
| Issue | Comment | 13 | 415.55 |
| Review | Comment | 1 | 0 |

Conversation Incentives

| Comment | Formatting | Relevance | Priority | Reward |
| --- | --- | --- | --- | --- |
| ## What happened When under heavy load, the kernel sometimes st… | 18.66 | 1 | 5 | 93.3 |
| Basically, it seems that once the token arrived to its limit, th… | 6.67 | 1 | 5 | 33.35 |
| This is easily reproducible on a local setup as well, and gives … | 2.92 | 1 | 5 | 14.6 |
| Yes it happens because whenever we receive an event we do the fo… | 14.06 | 1 | 5 | 70.3 |
| Like @whilefoo said the problem is when commands that need immed… | 3.75 | 1 | 5 | 18.75 |
| Okay then I will start filtering events, and link the changes he… | 7.47 | 1 | 5 | 37.35 |
| I realized that in my organization, I never had these events tri… | 7.25 | 0.5 | 5 | 18.125 |
| Yes that would be my suggestion, but I do agree that it would ad… | 3.02 | 0.5 | 5 | 7.55 |
| In my mind, the following would happen: - if an external organi… | 8.91 | 0.5 | 5 | 22.275 |
| New updates regarding the quota: - @0x4007 noticed that it runs… | 3.35 | 0.5 | 5 | 8.375 |
| It seems that lowering the amount of calls didn't really solve t… | 7.39 | 1 | 5 | 36.95 |
| I tried what I mentioned above and the following: - disabling r… | 5.28 | 1 | 5 | 26.4 |
| @0x4007 I should definitely try with another service yes. I don'… | 8.11 | 0.5 | 5 | 20.275 |
| Updates on the monitoring: Very often the code gets stuck with … | 20.25 | 1 | 5 | 101.25 |
| Resolves #227 - added more logs for better debugging - configu… | 2.5 | 0.8 | 5 | 0 |
[ 46.39 WXDAI ]

@0x4007

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 7 | 46.39 |

Conversation Incentives

| Comment | Formatting | Relevance | Priority | Reward |
| --- | --- | --- | --- | --- |
| Thanks for sharing your research. It seems that a final solution… | 1.75 | 0.6 | 5 | 5.25 |
| This secondary rate limit seems like a mess to deal with. So is … | 2.2 | 0.7 | 5 | 7.7 |
| Possible but we haven't had to elevate permissions in a very lon… | 4.8 | 0.3 | 5 | 7.2 |
| Setting in the environment seems appropriate! Perhaps we can set… | 3.84 | 0.8 | 5 | 15.36 |
| I don't like the idea of installing custom apps for plugins. It'… | 1.17 | 0.4 | 5 | 2.34 |
| Would you say it's safe to blame cloudflare then? Have you consi… | 1.54 | 0.9 | 5 | 6.93 |
| Maybe `node-compat` is not helping the problem | 0.46 | 0.7 | 5 | 1.61 |
[ 159.385 WXDAI ]

@whilefoo

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 6 | 153.31 |
| Review | Comment | 2 | 6.075 |

Conversation Incentives

| Comment | Formatting | Relevance | Priority | Reward |
| --- | --- | --- | --- | --- |
| It seems the primary rate limit is 5000 per hour per org/repo, b… | 13.1 | 1 | 5 | 65.5 |
| I suggested multiple priority queues but thinking about this mor… | 9.15 | 1 | 5 | 45.75 |
| You're saying that each organization that uses our bot should cr… | 5.05 | 0.5 | 5 | 12.625 |
| That was a concern from the start but it is not that critical be… | 6.91 | 0.5 | 5 | 17.275 |
| But the consumer of a third party plugin would have to install t… | 2.4 | 0.5 | 5 | 6 |
| Is it possible that the 10ms limit is reached and Cloudflare shu… | 1.54 | 0.8 | 5 | 6.16 |
| Hope this works | 0.25 | 0.1 | 5 | 0.125 |
| did you try this before changing to sync config fetch? it could … | 1.7 | 0.7 | 5 | 5.95 |
We had one instance of this problem occurring just now. The process seems to break at the same spot; either this is really bad luck to randomly always break there, or there is something wrong with the YAML package. I dug a bit in the repo and found a few relevant issues: […]

Maybe I will try switching to […]
Possibly a bad idea, but what about a novel implementation of a YAML parser? In exchange for simpler usage of YAML features, we could have a more reliable and simpler parser. Here's an o1 one-shot: https://chatgpt.com/share/678f3dc2-9a6c-8001-b527-8202efa456f3

```typescript
type YamlValue = string | number | boolean | YamlObject | YamlValue[];
type YamlObject = Record<string, YamlValue>;

/**
 * Parse a YAML string into a nested object.
 */
export function parseYaml(yaml: string): YamlObject {
  const lines = yaml.split(/\r?\n/);
  const { object: result } = parseBlock(lines, 0, 0);
  return result;
}

/**
 * Recursively parse lines to build YAML objects/arrays.
 * @param lines The array of lines from the YAML string.
 * @param currentIndent The current indentation level in spaces.
 * @param startIndex The line index to begin parsing.
 * @returns The parsed object and the next line index.
 */
function parseBlock(
  lines: string[],
  currentIndent: number,
  startIndex: number
): { object: YamlObject; nextLine: number } {
  const parsedObject: YamlObject = {};
  let i = startIndex;
  while (i < lines.length) {
    const line = lines[i];
    const lineIndent = getIndent(line);
    // Stop if indentation is shallower than current block
    if (line.trim() === '' || lineIndent < currentIndent) {
      break;
    }
    // Strip comments
    const noCommentLine = line.split('#')[0].trimEnd();
    if (noCommentLine.trim() === '') {
      i += 1;
      continue;
    }
    // Parse key/value or list item
    const { key, value, isListItem } = parseLine(noCommentLine);
    if (isListItem) {
      // It's a list item: collect all items in an array
      // The parent key is the last key we encountered
      const lastKey = Object.keys(parsedObject)[Object.keys(parsedObject).length - 1];
      if (typeof parsedObject[lastKey] === 'undefined') {
        parsedObject[lastKey] = [];
      }
      const arrayRef = parsedObject[lastKey];
      if (Array.isArray(arrayRef)) {
        // if value is missing, parse as nested block
        if (typeof value === 'undefined' || value === null) {
          const { object: nested, nextLine } = parseBlock(lines, lineIndent + 2, i + 1);
          arrayRef.push(nested);
          i = nextLine;
        } else {
          arrayRef.push(value);
          i += 1;
        }
      } else {
        i += 1;
      }
    } else if (key !== '' && typeof value !== 'undefined') {
      // Plain key: value
      parsedObject[key] = value;
      i += 1;
    } else if (key !== '') {
      // Key with possible nested block
      const nextLineIndent = getIndent(lines[i + 1] || '');
      if (nextLineIndent > lineIndent) {
        const { object: nested, nextLine } = parseBlock(lines, nextLineIndent, i + 1);
        parsedObject[key] = nested;
        i = nextLine;
      } else {
        parsedObject[key] = '';
        i += 1;
      }
    } else {
      i += 1;
    }
  }
  return { object: parsedObject, nextLine: i };
}

/**
 * Extract indentation width from the start of a line.
 */
function getIndent(line: string): number {
  let count = 0;
  for (let i = 0; i < line.length; i += 1) {
    if (line[i] === ' ') {
      count += 1;
    } else {
      break;
    }
  }
  return count;
}

/**
 * Parse a single line to discover a key-value pair or a list item.
 */
function parseLine(line: string): {
  key: string;
  value?: YamlValue;
  isListItem: boolean;
} {
  const isListItem = line.trimStart().startsWith('- ');
  if (isListItem) {
    // example: "- something"
    const itemValue = line.trimStart().slice(2).trim();
    if (itemValue.includes(': ')) {
      // example: "- key: value" (nested object in a list)
      const splitIndex = itemValue.indexOf(': ');
      const subKey = itemValue.slice(0, splitIndex).trim();
      const subValue = convertValue(itemValue.slice(splitIndex + 2).trim());
      return { key: subKey, value: subValue, isListItem: true };
    }
    return { key: '', value: convertValue(itemValue), isListItem: true };
  }
  // example: "key: value"
  const colonIndex = line.indexOf(':');
  if (colonIndex >= 0) {
    const rawKey = line.slice(0, colonIndex).trim();
    const rawValue = line.slice(colonIndex + 1).trim();
    if (rawValue !== '') {
      return { key: rawKey, value: convertValue(rawValue), isListItem: false };
    }
    return { key: rawKey, isListItem: false };
  }
  return { key: '', isListItem: false };
}

/**
 * Convert a raw string value to a typed value (boolean, number, or string).
 */
function convertValue(value: string): YamlValue {
  if (value === 'true') {
    return true;
  }
  if (value === 'false') {
    return false;
  }
  if (!Number.isNaN(Number(value))) {
    return Number(value);
  }
  return value;
}
```
We use anchors in the configuration, and overall I wouldn't be super confident about having a custom YAML parser; the plus side is that it would be easier to debug. I will check whether any lighter package exists, and try […]
## What happened
When under heavy load, the kernel sometimes stops forwarding events to plugins. We sometimes notice this when users try to invoke commands and nothing happens afterwards. It gets solved by redeploying, or by an event that unsticks the kernel.
After lots of tests, it seems to get stuck around these lines:
https://github.com/ubiquity-os/ubiquity-os-kernel/blob/development/src/github/utils/config.ts#L63-L66
## What was expected
The kernel should be able to handle heavy traffic, either delaying requests or cancelling them, and should not get stuck perpetually.
## How to reproduce
The best way I found to reproduce the issue is to post lots of comments simultaneously. Here is a script achieving that: […]
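The original script is not reproduced above; the following is a hypothetical sketch of the idea — firing many comment posts concurrently — with the poster injected so it can be exercised without the GitHub API. In the real script, `postComment` would wrap something like `octokit.rest.issues.createComment`:

```typescript
// Hypothetical reproduction sketch: fire `count` comment posts at the same time
// and report how many succeeded versus failed (e.g. with a 403 once rate-limited).

export async function floodComments(
  count: number,
  postComment: (body: string) => Promise<void>
): Promise<{ succeeded: number; failed: number }> {
  const results = await Promise.allSettled(
    Array.from({ length: count }, (_, i) => postComment(`load test comment #${i}`))
  );
  const succeeded = results.filter((r) => r.status === "fulfilled").length;
  return { succeeded, failed: count - succeeded };
}
```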
Let this run for a while before you notice no more response from the bot.
Further findings: it seems that it stops working once the limit of requests for the GitHub token has been reached. That's why commands like `/help` will not work although the comment is received: the kernel cannot post the comment back to the issue. Likewise, any plugin that needs an Action dispatched won't run. However, plugins that run as Workers through fetch will work fine.

If we remove the `waitUntil` function, we get the following error thrown by the Worker run: `the script will never generate a response`, which gets silenced within `waitUntil` when used. This will happen any time the `Octokit` instance is used, due to the limit being reached and no network call being able to be sent, resulting in a `403` error (thrown from the GitHub API side).
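The silencing behavior described above can be modeled in isolation. This is a minimal sketch, not the kernel's actual handler (the handler, logger, and context type are invented stand-ins): `waitUntil` detaches the background promise from the response, so a rejection such as the `403` never surfaces unless a `.catch` is attached to it explicitly.

```typescript
// Minimal model of the failure mode: the response succeeds even though the
// background dispatch rejected, because waitUntil() swallows the rejection.

type ExecutionContextLike = { waitUntil: (p: Promise<unknown>) => void };

export async function handleEvent(
  ctx: ExecutionContextLike,
  dispatch: () => Promise<void>,
  log: (msg: string) => void
): Promise<string> {
  // Attaching .catch means the rejection is at least logged instead of silenced.
  ctx.waitUntil(dispatch().catch((err) => log(`dispatch failed: ${err}`)));
  return "ok"; // the webhook response is already sent at this point
}
```

This is why the kernel appears healthy from the outside: every webhook gets a `200`, while the actual dispatch dies in the detached promise.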