# Kernel gets stuck and does not compute any comment if too many simultaneous requests are sent #227
Note: the following contributors may be suitable for this task: whilefoo, gitcoindev, gentlementlegen
Thanks for sharing your research. It seems that a final solution isn't clear. I suppose that research can continue under this conversation thread. @whilefoo please look into this.
Basically, it seems that once the token reaches its limit, the calls all hang due to the use of […]. As a temporary solution, I could suggest filtering some events, like the ones related to […].
This is easily reproducible on a local setup as well, and gives more information about failures. Even the […]
It seems the primary rate limit is 5000 per hour per org/repo, but we are hitting the secondary rate limit, which happens when too many requests happen at once (100 at once, or too many per minute). One solution would be to space out events: for example, if we receive 100 events at once, we need to process them one by one and not all at once, keeping priority in mind too, as events that need an instant response like commands must be processed immediately while others like text embeddings can be processed later.
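The spacing idea above can be sketched as a tiny in-memory queue. This is a hypothetical illustration (the class and field names are invented, not the kernel's actual API): events are processed one at a time with a fixed delay between them, and higher-priority events such as commands jump ahead of lower-priority work like text embeddings.

```typescript
// Hypothetical sketch: a priority-aware queue that spaces out event processing.
type QueuedEvent = { name: string; priority: number; run: () => Promise<void> };

export class SpacedEventQueue {
  private queue: QueuedEvent[] = [];
  private draining = false;

  constructor(private delayMs: number) {}

  enqueue(event: QueuedEvent): void {
    this.queue.push(event);
    // Higher priority runs first (commands before embeddings, for example).
    this.queue.sort((a, b) => b.priority - a.priority);
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.draining) return;
    this.draining = true;
    while (this.queue.length > 0) {
      const event = this.queue.shift() as QueuedEvent;
      // Log-and-continue so one failing event does not stall the queue.
      await event.run().catch(() => {});
      if (this.queue.length > 0) {
        await new Promise((resolve) => setTimeout(resolve, this.delayMs));
      }
    }
    this.draining = false;
  }
}
```

A real implementation would need persistence (a Worker instance can be recycled), but the sketch shows the shape of the idea: newly arrived high-priority events overtake queued low-priority ones.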
But Worker plugins would still fail to execute, as they use the same installation token passed from the kernel, wouldn't they? Our Octokit uses the plugin-retry plugin, which should retry requests after the rate limit is over, but I think that […]. @gentlementlegen Didn't we try running the kernel on Azure, or was that reverted?
Yes, it happens because whenever we receive an event we do the following: […]
On average, one event involves ~10 calls to the GitHub API (and subsequently we use the same token in all the plugins, which can make tons of calls as well). Plus, this is multiplied by each external organization using our bot. The more plugins we have, the more calls are made, and this happens for literally any event, which is why I was suggesting filtering out […]. We could have some event queue indeed, but we can't delay the calls for too long without the worker shutting down. And as you said, we need commands to still be responsive. I think we should find a way to avoid fetching the manifest for all the plugins on each run, which would help lower API calls.
Yes, it seems to be, but then the kernel should work fine again after waiting for some time (when the limit resets); instead it works right away when restarting the kernel instance, somehow. We have an Azure instance up and running, only configured for […]
This secondary rate limit seems like a mess to deal with. So is the plan now to make an event queue and handle events with a static delay timer between each, so we can avoid the rate limit?
Like @whilefoo said, the problem is when commands that need an immediate response have to be handled, because if you have 100+ events in the queue they would take too long to be triggered. My suggestion as an immediate fix would be to filter events we do not use, like […]
I suggested multiple priority queues, but thinking about this more I realized that it's not feasible, because you don't know the priority if you don't know the config; for example, a […]
We should definitely do that. I'm not sure about the credentials: they need to use the installation's token, but they can't get it by themselves. The most obvious fix is to not store the config in GitHub but somewhere else, like a database. You could build queues on top of this by fetching the config from the database (each plugin would have a priority level) and putting it in the queue for that priority level. Cloudflare Queues also have retries, so if the rate limit is hit you can schedule a retry after X time, and they also have a 30 min time limit compared to 30 seconds on normal Workers. However, I think this idea won't be liked because it moves away from GitHub and creates a dependency on Cloudflare.
Okay, then I will start filtering events, and link the changes here. For the credentials, we share […]. The queue seems to introduce a lot of complexity and fragile logic; I think we'd be better off avoiding it for now.
You're saying that each organization that uses our bot should create their own Github App and share credentials to the plugin via environment variables? |
Yes that would be my suggestion, but I do agree that it would add friction. Doesn't it feel dangerous that a third party can create its own plugin with our token elevations though? And yes it would count against our own token if all the requests they do in their plugin use our token. |
That was a concern from the start, but it is not that critical, because they can't access other organizations with that token, only the one that installed the plugin, and that organization has to trust the plugin, otherwise they wouldn't install it. I understand now that our GitHub App would be used only to fetch configs and manifests and dispatch workflows, and the organization's App would be used for the plugin, which would alleviate the problem with rate limits. However, I feel like this would add too much friction.
In my mind, the following would happen: […]

I think this would be beneficial for two main reasons: […]
Possible but we haven't had to elevate permissions in a very long time. I think we have it mostly covered. Worst case scenario: if they aren't doing anything payment related they can simply make a GitHub action. The only secret sauce we should be focusing on is providing the infrastructure to essentially map any webhook event to a financial reward and to allow the distribution of that reward.
This is only true if we 1) accept their changes in a pull request to the kernel and/or 2) install that plugin on our repos.
New updates regarding the quota: […]
Setting in the environment seems appropriate! Perhaps we can set an array of values and any org/repo slug can be ignored
Come to think of it though, we may even be able to deprecate the issues being opened in that repository, because now we simply aggregate them into a JSON object, although it is kind of nice to see the confirmation, when the link back occurs, that it is in the directory.
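The env-based ignore list suggested above could look like the following minimal sketch. The `EVENT_IGNORE_LIST` variable name and the slug format (comma-separated, either `org` or `org/repo`) are assumptions for illustration:

```typescript
// Hypothetical: EVENT_IGNORE_LIST="ubiquibot/production,some-org"
// Any matching slug (whole org, or a specific org/repo) causes the event to be skipped.

export function parseIgnoreList(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((slug) => slug.trim().toLowerCase())
      .filter((slug) => slug.length > 0)
  );
}

export function isIgnored(ignoreList: Set<string>, owner: string, repo: string): boolean {
  const org = owner.toLowerCase();
  // An org-level entry ignores every repo; an org/repo entry ignores only that repo.
  return ignoreList.has(org) || ignoreList.has(`${org}/${repo.toLowerCase()}`);
}
```

The kernel would call `isIgnored` right after parsing the webhook payload and return early, before any GitHub API call is made.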
But the consumer of a third-party plugin would have to install their app, so you would end up with many apps installed. But I think we're still a long way from third-party plugins, so we can think about this later.
I don't like the idea of installing custom apps for plugins. It's not a good approach |
It seems that lowering the amount of calls didn't really solve the problem; the kernel still gets stuck often (today particularly, because GitHub servers seem to be partially down). The rate limit in the logs always shows 5k+ calls remaining (primary rate limit). When stuck, it usually stays at "trying to fetch configuration from" or "dispatching event for" and then nothing, meaning the Octokit call never made it. However, no errors or logs are shown afterwards. The next changes I will try: […]
I tried what I mentioned above and the following: […]

The problem is the same; I can see a […]
Every time this happens, redeploying the kernel makes it work again for around 1 h, and then it gets stuck again. I don't think this is due to the secondary rate limit either, because the number of requests per second averages […]
Would you say it's safe to blame Cloudflare then? Have you considered A/B testing the kernel on another platform, like Azure or something?
@0x4007 I should definitely try with another service, yes. I don't think Cloudflare is to blame, because requests using […]
Is it possible that the 10 ms limit is reached and Cloudflare shuts down the worker? But it's weird that it only happens after 1 hour.
Updates on the monitoring: very often the code gets stuck with the following logs (this was a […]), which corresponds to the following source code: ubiquity-os-kernel/src/github/utils/config.ts, line 190 in fbccd44
So I thought maybe the package used to read YAML was the culprit. I changed it to another one, and the issues seem to happen less often, but still at the same spot. I wondered if both of these libraries use a method that is not thread-safe on Cloudflare, since we use […]. It may be a coincidence that it always seems to break there, so with @0x4007 we considered trying a gigantic configuration, which I did (more than 40 plugins!), and it worked fine until one hour later, when it started to skip events. Having no logs truly doesn't help, but I still suspect CF of unexpectedly killing the worker, which gets silenced due to […]. Edit: re-reading these logs, I realize two GET requests were sent but only one file was parsed. Could this be some race condition, or a promise failing that kills the other? I will try to run these synchronously and see if there is any change.
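Running the two GET requests synchronously, as proposed in the edit above, also makes failures attributable: with `Promise.all`, one rejection rejects the combined promise and the sibling request's outcome is lost. Below is a minimal sketch of that change with an injected fetcher; the function and result names are illustrative, not the kernel's actual API:

```typescript
// Hypothetical sketch: fetch config files one after another, recording each outcome
// instead of letting a single rejection hide the other request's result.

type ConfigResult = { path: string; ok: boolean; body?: string; error?: string };

export async function fetchConfigsSequentially(
  paths: string[],
  fetchConfigFile: (path: string) => Promise<string>
): Promise<ConfigResult[]> {
  const results: ConfigResult[] = [];
  for (const path of paths) {
    try {
      results.push({ path, ok: true, body: await fetchConfigFile(path) });
    } catch (err) {
      // The failure is attached to the exact path that failed.
      results.push({ path, ok: false, error: String(err) });
    }
  }
  return results;
}
```

`Promise.allSettled` would give the same attributability while keeping the requests concurrent; the sequential form additionally rules out any race between the two parses.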
Maybe `node-compat` is not helping the problem.
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Task | 1 | 250 |
| Issue | Specification | 1 | 93.3 |
| Issue | Comment | 13 | 415.55 |
| Review | Comment | 1 | 0 |

Conversation Incentives

| Comment | Formatting | Relevance | Priority | Reward |
| --- | --- | --- | --- | --- |
| ## What happened When under heavy load, the kernel sometimes st… | 18.66 | 1 | 5 | 93.3 |
| Basically, it seems that once the token arrived to its limit, th… | 6.67 | 1 | 5 | 33.35 |
| This is easily reproducible on a local setup as well, and gives … | 2.92 | 1 | 5 | 14.6 |
| Yes it happens because whenever we receive an event we do the fo… | 14.06 | 1 | 5 | 70.3 |
| Like @whilefoo said the problem is when commands that need immed… | 3.75 | 1 | 5 | 18.75 |
| Okay then I will start filtering events, and link the changes he… | 7.47 | 1 | 5 | 37.35 |
| I realized that in my organization, I never had these events tri… | 7.25 | 0.5 | 5 | 18.125 |
| Yes that would be my suggestion, but I do agree that it would ad… | 3.02 | 0.5 | 5 | 7.55 |
| In my mind, the following would happen: - if an external organi… | 8.91 | 0.5 | 5 | 22.275 |
| New updates regarding the quota: - @0x4007 noticed that it runs… | 3.35 | 0.5 | 5 | 8.375 |
| It seems that lowering the amount of calls didn't really solve t… | 7.39 | 1 | 5 | 36.95 |
| I tried what I mentioned above and the following: - disabling r… | 5.28 | 1 | 5 | 26.4 |
| @0x4007 I should definitely try with another service yes. I don'… | 8.11 | 0.5 | 5 | 20.275 |
| Updates on the monitoring: Very often the code gets stuck with … | 20.25 | 1 | 5 | 101.25 |
| Resolves #227 - added more logs for better debugging - configu… | 2.5 | 0.8 | 5 | 0 |
[ 46.39 WXDAI ]

@0x4007

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 7 | 46.39 |

Conversation Incentives

| Comment | Formatting | Relevance | Priority | Reward |
| --- | --- | --- | --- | --- |
| Thanks for sharing your research. It seems that a final solution… | 1.75 | 0.6 | 5 | 5.25 |
| This secondary rate limit seems like a mess to deal with. So is … | 2.2 | 0.7 | 5 | 7.7 |
| Possible but we haven't had to elevate permissions in a very lon… | 4.8 | 0.3 | 5 | 7.2 |
| Setting in the environment seems appropriate! Perhaps we can set… | 3.84 | 0.8 | 5 | 15.36 |
| I don't like the idea of installing custom apps for plugins. It'… | 1.17 | 0.4 | 5 | 2.34 |
| Would you say it's safe to blame cloudflare then? Have you consi… | 1.54 | 0.9 | 5 | 6.93 |
| Maybe `node-compat` is not helping the problem | 0.46 | 0.7 | 5 | 1.61 |
[ 159.385 WXDAI ]

@whilefoo

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 6 | 153.31 |
| Review | Comment | 2 | 6.075 |

Conversation Incentives

| Comment | Formatting | Relevance | Priority | Reward |
| --- | --- | --- | --- | --- |
| It seems the primary rate limit is 5000 per hour per org/repo, b… | 13.1 | 1 | 5 | 65.5 |
| I suggested multiple priority queues but thinking about this mor… | 9.15 | 1 | 5 | 45.75 |
| You're saying that each organization that uses our bot should cr… | 5.05 | 0.5 | 5 | 12.625 |
| That was a concern from the start but it is not that critical be… | 6.91 | 0.5 | 5 | 17.275 |
| But the consumer of a third party plugin would have to install t… | 2.4 | 0.5 | 5 | 6 |
| Is it possible that the 10ms limit is reached and Cloudflare shu… | 1.54 | 0.8 | 5 | 6.16 |
| Hope this works | 0.25 | 0.1 | 5 | 0.125 |
| did you try this before changing to sync config fetch? it could … | 1.7 | 0.7 | 5 | 5.95 |
We had one instance of this problem occurring just now. The process seems to break at the same spot; either this is really bad luck to randomly always break there, or there is something wrong with the YAML package. I dug a bit in the repo and found a few relevant issues: […]

Maybe I will try switching to […]
Possibly a bad idea, but what about a novel implementation of a YAML parser? In exchange for simpler usage of YAML features, we could have a more reliable and simpler parser. Here's an o1 one-shot: https://chatgpt.com/share/678f3dc2-9a6c-8001-b527-8202efa456f3

```typescript
type YamlValue = string | number | boolean | YamlObject | YamlValue[];
type YamlObject = Record<string, YamlValue>;

/**
 * Parse a YAML string into a nested object.
 */
export function parseYaml(yaml: string): YamlObject {
  const lines = yaml.split(/\r?\n/);
  const { object: result } = parseBlock(lines, 0, 0);
  return result;
}

/**
 * Recursively parse lines to build YAML objects/arrays.
 * @param lines The array of lines from the YAML string.
 * @param currentIndent The current indentation level in spaces.
 * @param startIndex The line index to begin parsing.
 * @returns The parsed object and the next line index.
 */
function parseBlock(
  lines: string[],
  currentIndent: number,
  startIndex: number
): { object: YamlObject; nextLine: number } {
  const parsedObject: YamlObject = {};
  let i = startIndex;
  while (i < lines.length) {
    const line = lines[i];
    const lineIndent = getIndent(line);
    // Stop if indentation is shallower than current block
    if (line.trim() === '' || lineIndent < currentIndent) {
      break;
    }
    // Strip comments
    const noCommentLine = line.split('#')[0].trimEnd();
    if (noCommentLine.trim() === '') {
      i += 1;
      continue;
    }
    // Parse key/value or list item
    const { key, value, isListItem } = parseLine(noCommentLine);
    if (isListItem) {
      // It's a list item: collect all items in an array
      // The parent key is the last key we encountered
      const lastKey = Object.keys(parsedObject)[Object.keys(parsedObject).length - 1];
      if (typeof parsedObject[lastKey] === 'undefined') {
        parsedObject[lastKey] = [];
      }
      const arrayRef = parsedObject[lastKey];
      if (Array.isArray(arrayRef)) {
        // if value is missing, parse as nested block
        if (typeof value === 'undefined' || value === null) {
          const { object: nested, nextLine } = parseBlock(lines, lineIndent + 2, i + 1);
          arrayRef.push(nested);
          i = nextLine;
        } else {
          arrayRef.push(value);
          i += 1;
        }
      } else {
        i += 1;
      }
    } else if (key !== '' && typeof value !== 'undefined') {
      // Plain key: value
      parsedObject[key] = value;
      i += 1;
    } else if (key !== '') {
      // Key with possible nested block
      const nextLineIndent = getIndent(lines[i + 1] || '');
      if (nextLineIndent > lineIndent) {
        const { object: nested, nextLine } = parseBlock(lines, nextLineIndent, i + 1);
        parsedObject[key] = nested;
        i = nextLine;
      } else {
        parsedObject[key] = '';
        i += 1;
      }
    } else {
      i += 1;
    }
  }
  return { object: parsedObject, nextLine: i };
}

/**
 * Extract indentation width from the start of a line.
 */
function getIndent(line: string): number {
  let count = 0;
  for (let i = 0; i < line.length; i += 1) {
    if (line[i] === ' ') {
      count += 1;
    } else {
      break;
    }
  }
  return count;
}

/**
 * Parse a single line to discover a key-value pair or a list item.
 */
function parseLine(line: string): {
  key: string;
  value?: YamlValue;
  isListItem: boolean;
} {
  const isListItem = line.trimStart().startsWith('- ');
  if (isListItem) {
    // example: "- something"
    const itemValue = line.trimStart().slice(2).trim();
    if (itemValue.includes(': ')) {
      // example: "- key: value" (nested object in a list)
      const splitIndex = itemValue.indexOf(': ');
      const subKey = itemValue.slice(0, splitIndex).trim();
      const subValue = convertValue(itemValue.slice(splitIndex + 2).trim());
      return { key: subKey, value: subValue, isListItem: true };
    }
    return { key: '', value: convertValue(itemValue), isListItem: true };
  }
  // example: "key: value"
  const colonIndex = line.indexOf(':');
  if (colonIndex >= 0) {
    const rawKey = line.slice(0, colonIndex).trim();
    const rawValue = line.slice(colonIndex + 1).trim();
    if (rawValue !== '') {
      return { key: rawKey, value: convertValue(rawValue), isListItem: false };
    }
    return { key: rawKey, isListItem: false };
  }
  return { key: '', isListItem: false };
}

/**
 * Convert a raw string value to a typed value (boolean, number, or string).
 */
function convertValue(value: string): YamlValue {
  if (value === 'true') {
    return true;
  }
  if (value === 'false') {
    return false;
  }
  if (!Number.isNaN(Number(value))) {
    return Number(value);
  }
  return value;
}
```
We use anchors in the configuration, and overall I wouldn't be super confident about having a custom YAML parser; the plus side is that it would be easier to debug. I will check whether any lighter package exists, and try […]
## What happened
When under heavy load, the kernel sometimes stops forwarding events to plugins. We sometimes notice this when users try to invoke commands and nothing happens afterwards. It gets solved by redeploying, or by an event that unsticks the kernel.
After lots of tests, it seems to get stuck around these lines:
https://github.com/ubiquity-os/ubiquity-os-kernel/blob/development/src/github/utils/config.ts#L63-L66
## What was expected
The kernel should be able to handle heavy traffic, either delaying requests or cancelling them, and should not get stuck perpetually.
## How to reproduce
The best way I found to reproduce the issue is to post lots of comments simultaneously. Here is a script achieving that: […]
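The original script is not reproduced above; the following is a hypothetical sketch of the idea — firing many comment posts concurrently — with the poster injected so it can be exercised without the GitHub API. In the real script, `postComment` would wrap something like `octokit.rest.issues.createComment`:

```typescript
// Hypothetical reproduction sketch: fire `count` comment posts at the same time
// and report how many succeeded versus failed (e.g. with a 403 once rate-limited).

export async function floodComments(
  count: number,
  postComment: (body: string) => Promise<void>
): Promise<{ succeeded: number; failed: number }> {
  const results = await Promise.allSettled(
    Array.from({ length: count }, (_, i) => postComment(`load test comment #${i}`))
  );
  const succeeded = results.filter((r) => r.status === "fulfilled").length;
  return { succeeded, failed: count - succeeded };
}
```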
Let this run for a while before you notice no more response from the bot.
Further findings: it seems that it stops working once the limit of requests for the GitHub token has been reached. That's why commands like `/help` will not work although the comment is received: the kernel cannot post the comment back to the issue. Likewise, any plugin that needs an Action dispatched won't run. However, plugins that run as Workers through fetch will work fine.

If we remove the `waitUntil` function, we get the following error thrown by the Worker run: `the script will never generate a response`, which gets silenced within `waitUntil` when used. This will happen any time the `Octokit` instance is used, due to the limit being reached and no network call being able to be sent, resulting in a `403` error (thrown from the GitHub API side).
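The silencing behavior described above can be modeled in isolation. This is a minimal sketch, not the kernel's actual handler (the handler, logger, and context type are invented stand-ins): `waitUntil` detaches the background promise from the response, so a rejection such as the `403` never surfaces unless a `.catch` is attached to it explicitly.

```typescript
// Minimal model of the failure mode: the response succeeds even though the
// background dispatch rejected, because waitUntil() swallows the rejection.

type ExecutionContextLike = { waitUntil: (p: Promise<unknown>) => void };

export async function handleEvent(
  ctx: ExecutionContextLike,
  dispatch: () => Promise<void>,
  log: (msg: string) => void
): Promise<string> {
  // Attaching .catch means the rejection is at least logged instead of silenced.
  ctx.waitUntil(dispatch().catch((err) => log(`dispatch failed: ${err}`)));
  return "ok"; // the webhook response is already sent at this point
}
```

This is why the kernel appears healthy from the outside: every webhook gets a `200`, while the actual dispatch dies in the detached promise.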