Talos 1.9.3: diskSelector is not working for disks behind a RAID controller #10292

Closed
eugene-marchanka opened this issue Feb 4, 2025 · 11 comments · Fixed by #10353

@eugene-marchanka

eugene-marchanka commented Feb 4, 2025

Bug Report

Talos 1.9.3 is reporting SSD disks TYPE as UNKNOWN behind a RAID controller

Description

I tried:

  • RAID-0 with 1 and 2 disks
  • RAID-1 with 2 disks

RAID-0 with 1 disk:

$ talosctl -n 172.18.0.52 -e 172.18.0.52 disks --insecure
DEV          MODEL            SERIAL   TYPE      UUID   WWID                                   MODALIAS      NAME   SIZE     BUS_PATH                                                                                       SUBSYSTEM          READ_ONLY   SYSTEM_DISK    
/dev/sda     PERC H740P Adp   -        UNKNOWN   -      naa.64cd98f0b4518d002f355bc93432dc00   scsi:t-0x00   -      480 GB   /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:0/0:2:0:0                                /sys/class/block               

RAID-0 with 2 disks:

$ talosctl -n 172.18.0.52 -e 172.18.0.52 disks --insecure                                                                                                                                                                              
DEV          MODEL            SERIAL   TYPE      UUID   WWID                                   MODALIAS      NAME   SIZE     BUS_PATH                                                                                       SUBSYSTEM          READ_ONLY   SYSTEM_DISK    
/dev/sda     PERC H740P Adp   -        UNKNOWN   -      naa.64cd98f0b4518d002f355b4e7c892598   scsi:t-0x00   -      959 GB   /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:0/0:2:0:0                                /sys/class/block                                    

RAID-1 with 2 disks:

$ talosctl -n 172.18.0.52 -e 172.18.0.52 disks --insecure                                                                                                                                                                              
DEV          MODEL            SERIAL   TYPE      UUID   WWID                                   MODALIAS      NAME   SIZE     BUS_PATH                                                                                       SUBSYSTEM          READ_ONLY   SYSTEM_DISK                                   
/dev/sda     PERC H740P Adp   -        UNKNOWN   -      naa.64cd98f0b4518d002f3556c1aeb59cf1   scsi:t-0x00   -      480 GB   /pci0000:17/0000:17:00.0/0000:18:00.0/host1/target1:2:0/1:2:0:0                                /sys/class/block                                    

Those two disks were wiped before I attempted to install Talos.
This is the error I got when trying to apply the config:

$ ./apply-config.sh 172.18.0.52 node001.yaml                                                                                                                                                                                           
error applying new configuration: rpc error: code = InvalidArgument desc = runtime configuration validation failed: 1 error occurred:                                                                                                                                           
        * no disks matched the expression: disk.size < 900000000000u && disk.transport != "" && !disk.readonly && !disk.cdrom

node001.yaml:

...
machine:
    install:
        diskSelector:
            size: < 1000GB
...

I also tried other diskSelector options against both the UNKNOWN-type disk (my SSD) and the HDD drives. None of them worked for me. Example using wwid:

$ talosctl -n 172.18.0.52 -e 172.18.0.52 disks --insecure
DEV          MODEL            SERIAL   TYPE      UUID   WWID                                   MODALIAS      NAME   SIZE     BUS_PATH                                                          SUBSYSTEM          READ_ONLY   SYSTEM_DISK
/dev/loop0   -                -        UNKNOWN   -      -                                      -             -      74 MB    /virtual                                                          /sys/class/block   *           
/dev/sda     PERC H740P Adp   -        UNKNOWN   -      naa.64cd98f0b4518d002f355bc93432dc00   scsi:t-0x00   -      480 GB   /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:0/0:2:0:0   /sys/class/block               *
/dev/sdb     PERC H740P Adp   -        HDD       -      naa.64cd98f0b4518d002f3556cec3146a4e   scsi:t-0x00   -      1.2 TB   /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:1/0:2:1:0   /sys/class/block               
/dev/sdc     PERC H740P Adp   -        HDD       -      naa.64cd98f0b4518d002f3556ddd8ac8af9   scsi:t-0x00   -      1.2 TB   /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:2/0:2:2:0   /sys/class/block               
/dev/sdd     PERC H740P Adp   -        HDD       -      naa.64cd98f0b4518d002f3556ecefdf9e2f   scsi:t-0x00   -      1.2 TB   /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:3/0:2:3:0   /sys/class/block

node001.yaml:

machine:
    install:
        diskSelector:
          wwid: naa.64cd98f0b4518d002f3556cec3146a4e

output:

$ ./apply-config.sh 172.18.0.52 node001.yaml 
error applying new configuration: rpc error: code = InvalidArgument desc = runtime configuration validation failed: 1 error occurred:
        * no disks matched the expression: glob("naa.64cd98f0b4518d002f3556cec3146a4e", disk.wwid) && disk.transport != "" && !disk.readonly && !disk.cdrom
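
One way to narrow down which clause of this expression fails is to check the glob clause in isolation. The sketch below uses Python's `fnmatch.fnmatchcase` as a stand-in for the selector's `glob()` (an assumption; the matcher Talos actually uses may behave differently) together with the WWIDs from the `talosctl disks` output above.

```python
from fnmatch import fnmatchcase

# WWIDs copied from the `talosctl disks` output above.
disks = {
    "/dev/sda": "naa.64cd98f0b4518d002f355bc93432dc00",
    "/dev/sdb": "naa.64cd98f0b4518d002f3556cec3146a4e",
    "/dev/sdc": "naa.64cd98f0b4518d002f3556ddd8ac8af9",
    "/dev/sdd": "naa.64cd98f0b4518d002f3556ecefdf9e2f",
}

# Pattern taken from the diskSelector in node001.yaml.
pattern = "naa.64cd98f0b4518d002f3556cec3146a4e"

# Assumption: fnmatchcase approximates the selector's glob() semantics.
matches = [dev for dev, wwid in disks.items() if fnmatchcase(wwid, pattern)]
print(matches)
```

Under that assumption the pattern does match `/dev/sdb`, which suggests the failure comes from one of the remaining clauses rather than from the WWID itself.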

It worked when I set /dev/sda directly as my install disk, even though /dev/sda is reported with type UNKNOWN. See the disks output above.

machine:
    install:
        disk: /dev/sda

Logs

Environment

  • Talos version: [1.9.3]
  • Kubernetes version: [1.32.1]
  • Platform: baremetal
@smira
Member

smira commented Feb 5, 2025

Please attach full output of talosctl get disks -o yaml.

@eugene-marchanka
Author

Please attach full output of talosctl get disks -o yaml.

node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: loop0
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-04T23:54:05Z
    updated: 2025-02-04T23:54:05Z
spec:
    dev_path: /dev/loop0
    size: 73703424
    pretty_size: 74 MB
    io_size: 512
    sector_size: 512
    readonly: true
    cdrom: false
    bus_path: /virtual
    sub_system: /sys/class/block
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sda
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-04T23:54:08Z
    updated: 2025-02-04T23:54:08Z
spec:
    dev_path: /dev/sda
    size: 479559942144
    pretty_size: 480 GB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f355bc93432dc00
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:0/0:2:0:0
    sub_system: /sys/class/block
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sdb
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-04T23:54:08Z
    updated: 2025-02-04T23:54:08Z
spec:
    dev_path: /dev/sdb
    size: 1199638052864
    pretty_size: 1.2 TB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f3556cec3146a4e
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:1/0:2:1:0
    sub_system: /sys/class/block
    rotational: true
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sdc
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-04T23:54:08Z
    updated: 2025-02-04T23:54:08Z
spec:
    dev_path: /dev/sdc
    size: 1199638052864
    pretty_size: 1.2 TB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f3556ddd8ac8af9
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:2/0:2:2:0
    sub_system: /sys/class/block
    rotational: true
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sdd
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-04T23:54:08Z
    updated: 2025-02-04T23:54:08Z
spec:
    dev_path: /dev/sdd
    size: 1199638052864
    pretty_size: 1.2 TB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f3556ecefdf9e2f
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:3/0:2:3:0
    sub_system: /sys/class/block
    rotational: true

@smira
Member

smira commented Feb 5, 2025

Yes, the core issue is that Talos doesn't detect the transport, so it assumes it's a virtual device.

Would you be able to provide some debugging output from a running/installed machine (it can be an older version of Talos as well)? I don't know yet exactly what I'm looking for, but I will ask once I have that info.
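
The diagnosis above can be illustrated by evaluating each clause of the size-based expression from the original error against the values in the 1.9.3 `get disks` output. This is a rough Python sketch, not Talos code: the `Disk` class and the `failed_clauses` helper are hypothetical names, and a missing `transport` key in the YAML is assumed to mean an empty string.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    dev_path: str
    size: int
    transport: str = ""  # assumption: absent in the YAML means empty
    readonly: bool = False
    cdrom: bool = False


# Values copied from the `talosctl get disks -o yaml` output (Talos 1.9.3).
disks = [
    Disk("/dev/loop0", 73703424, readonly=True),
    Disk("/dev/sda", 479559942144),
    Disk("/dev/sdb", 1199638052864),
    Disk("/dev/sdc", 1199638052864),
    Disk("/dev/sdd", 1199638052864),
]


def failed_clauses(disk: Disk, max_size: int = 900_000_000_000) -> list[str]:
    """Return the clauses of the reported expression that the disk fails."""
    clauses = {
        "disk.size < 900000000000u": disk.size < max_size,
        'disk.transport != ""': disk.transport != "",
        "!disk.readonly": not disk.readonly,
        "!disk.cdrom": not disk.cdrom,
    }
    return [name for name, ok in clauses.items() if not ok]


for disk in disks:
    print(disk.dev_path, failed_clauses(disk))
```

With these assumptions `/dev/sda` passes every clause except `disk.transport != ""`, and every listed disk fails that same clause.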

@eugene-marchanka
Author

Yes, the core issue is that Talos doesn't detect the transport, so it assumes it's a virtual device.

You are correct. Those are virtual drives.

Here is the get disks output for the same node booted from the Talos 1.8.4 ISO:

node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: loop0
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-05T18:29:04Z
    updated: 2025-02-05T18:29:04Z
spec:
    dev_path: /dev/loop0
    size: 75091968
    pretty_size: 75 MB
    io_size: 512
    sector_size: 512
    readonly: true
    cdrom: false
    bus_path: /virtual
    sub_system: /sys/class/block
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sda
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-05T18:29:07Z
    updated: 2025-02-05T18:29:07Z
spec:
    dev_path: /dev/sda
    size: 479559942144
    pretty_size: 480 GB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f36411e978ff951
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:0/0:2:0:0
    sub_system: /sys/class/block
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sdb
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-05T18:29:07Z
    updated: 2025-02-05T18:29:07Z
spec:
    dev_path: /dev/sdb
    size: 1199638052864
    pretty_size: 1.2 TB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f3556cec3146a4e
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:1/0:2:1:0
    sub_system: /sys/class/block
    rotational: true
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sdc
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-05T18:29:07Z
    updated: 2025-02-05T18:29:07Z
spec:
    dev_path: /dev/sdc
    size: 1199638052864
    pretty_size: 1.2 TB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f3556ddd8ac8af9
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:2/0:2:2:0
    sub_system: /sys/class/block
    rotational: true
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sdd
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-05T18:29:07Z
    updated: 2025-02-05T18:29:07Z
spec:
    dev_path: /dev/sdd
    size: 1199638052864
    pretty_size: 1.2 TB
    io_size: 262144
    sector_size: 512
    readonly: false
    cdrom: false
    model: PERC H740P Adp
    modalias: scsi:t-0x00
    wwid: naa.64cd98f0b4518d002f3556ecefdf9e2f
    bus_path: /pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:3/0:2:3:0
    sub_system: /sys/class/block
    rotational: true
---
node:
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: sr0
    version: 1
    owner: block.DisksController
    phase: running
    created: 2025-02-05T18:29:08Z
    updated: 2025-02-05T18:29:08Z
spec:
    dev_path: /dev/sr0
    size: 105445376
    pretty_size: 105 MB
    io_size: 2048
    sector_size: 2048
    readonly: false
    cdrom: true
    model: Virtual CD/DVD
    modalias: scsi:t-0x05
    bus_path: /pci0000:00/0000:00:14.0/usb1/1-14/1-14.4/1-14.4.1/1-14.4.1:1.0/host1/target1:0:0/1:0:0:0
    sub_system: /sys/class/block
    transport: usb
    rotational: true

@eugene-marchanka
Author

eugene-marchanka commented Feb 5, 2025

I was able to successfully apply the same config to a node booted from the 1.8.4 ISO 🤷‍♂

@eugene-marchanka
Author

Possibly related to this

@smira
Member

smira commented Feb 7, 2025

If you could please provide the following output from a machine with a similar controller:

talosctl ls -lr /sys/dev/block
talosctl ls -lr /sys/class
talosctl ls -lr /sys/bus
talosctl ls -lr /sys/block

I can try to find out what the problem is.

@smira smira added the kind/bug label Feb 7, 2025
@eugene-marchanka
Copy link
Author

@smira
Member

smira commented Feb 7, 2025

Thank you, we will take a look and figure out a solution!

@eugene-marchanka eugene-marchanka changed the title Talos 1.9.3 is reporting SSD disks TYPE as UNKNOWN behind a RAID controller Talos 1.9.3: diskSelector is not working for disks behind a RAID controller Feb 8, 2025
@smira smira self-assigned this Feb 11, 2025
@smira
Member

smira commented Feb 12, 2025

@eugene-marchanka to make sure that I found the bug correctly, can you please do:

talosctl ls -l /sys/bus/scsi/devices/0:2:0:0
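
The device directory in this command appears to correspond to the last component of the disk's `bus_path` from the earlier output, which follows the Linux SCSI host:channel:target:lun naming. A small Python sketch of that mapping (the helper name `scsi_device_dir` is hypothetical):

```python
from pathlib import PurePosixPath


def scsi_device_dir(bus_path: str) -> str:
    """Map a disk bus_path to its directory under /sys/bus/scsi/devices."""
    # The final path component is expected to be host:channel:target:lun.
    address = PurePosixPath(bus_path).name
    parts = address.split(":")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a SCSI address: {address!r}")
    return f"/sys/bus/scsi/devices/{address}"


# bus_path of /dev/sda as reported by `talosctl get disks -o yaml`.
sda_bus_path = "/pci0000:17/0000:17:00.0/0000:18:00.0/host0/target0:2:0/0:2:0:0"
print(scsi_device_dir(sda_bus_path))
```

The same helper rejects paths such as `/virtual`, which do not end in a SCSI address.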

@eugene-marchanka
Author

$ talosctl -n 172.18.0.17 -e 172.18.0.8 ls -l /sys/bus/scsi/devices/0:2:0:0
NODE          MODE         UID   GID   SIZE(B)   LASTMOD           LABEL   NAME
172.18.0.17   drwxr-xr-x   0     0     0         Feb 12 17:19:30           .
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           blacklist
172.18.0.17   drwxr-xr-x   0     0     0         Feb 12 17:19:30           block
172.18.0.17   drwxr-xr-x   0     0     0         Feb 12 17:19:31           bsg
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:38:42           cdl_enable
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           cdl_supported
172.18.0.17   --w-------   0     0     4096      Feb 12 17:38:42           delete
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           device_blocked
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           device_busy
172.18.0.17   Lrwxrwxrwx   0     0     0         Feb 12 17:38:42           driver -> ../../../../../../../bus/scsi/drivers/sd
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:38:42           eh_timeout
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           evt_capacity_change_reported
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           evt_inquiry_change_reported
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           evt_lun_change_reported
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           evt_media_change
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           evt_mode_parameter_change_reported
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           evt_soft_threshold_reached
172.18.0.17   Lrwxrwxrwx   0     0     0         Feb 12 17:38:42           generic -> scsi_generic/sg0
172.18.0.17   -r--r--r--   0     0     0         Feb 12 17:38:42           inquiry
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           iocounterbits
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           iodone_cnt
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           ioerr_cnt
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           iorequest_cnt
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           iotmo_cnt
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           modalias
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           model
172.18.0.17   drwxr-xr-x   0     0     0         Feb 12 17:38:42           power
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:38:42           queue_depth
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:38:42           queue_ramp_up_period
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:38:42           queue_type
172.18.0.17   --w-------   0     0     4096      Feb 12 17:38:42           rescan
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           rev
172.18.0.17   drwxr-xr-x   0     0     0         Feb 12 17:19:31           scsi_device
172.18.0.17   drwxr-xr-x   0     0     0         Feb 12 17:19:31           scsi_disk
172.18.0.17   drwxr-xr-x   0     0     0         Feb 12 17:19:31           scsi_generic
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           scsi_level
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:38:42           state
172.18.0.17   Lrwxrwxrwx   0     0     0         Feb 12 17:19:31           subsystem -> ../../../../../../../bus/scsi
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:38:42           timeout
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:19:31           type
172.18.0.17   -rw-r--r--   0     0     4096      Feb 12 17:19:31           uevent
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:19:31           vendor
172.18.0.17   -r--r--r--   0     0     0         Feb 12 17:38:42           vpd_pg0
172.18.0.17   -r--r--r--   0     0     0         Feb 12 17:38:42           vpd_pg80
172.18.0.17   -r--r--r--   0     0     0         Feb 12 17:38:42           vpd_pg83
172.18.0.17   -r--r--r--   0     0     0         Feb 12 17:38:42           vpd_pgb0
172.18.0.17   -r--r--r--   0     0     0         Feb 12 17:38:42           vpd_pgb1
172.18.0.17   -r--r--r--   0     0     4096      Feb 12 17:38:42           wwid

smira added a commit to smira/talos that referenced this issue Feb 13, 2025
Fixes siderolabs#10292

This pulls in fixes from go-blockdevice library:

* siderolabs/go-blockdevice#127
* siderolabs/go-blockdevice#128

Also allow `megaraid` emulation in `talosctl cluster create`.

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit 8531d91)
samip5 pushed a commit to skyssolutions/talos that referenced this issue Feb 13, 2025