“Argument list too long” when publishing an image

My daily backup script is failing when trying to publish a snapshot image. Specifically, the command

lxc publish production/210226

is failing with the error

Error: argument list too long

This is usually the message produced when an exec call exceeds the system limit ARG_MAX, so I am assuming it comes from some external program invocation made as part of the “publishing” procedure for some reason. Of course, it could also be some program that emits the same error message simply for invalid arguments…
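
For clarity, this is the generic behaviour I mean (nothing LXD-specific, just a quick shell demonstration): the kernel rejects an execve() whose combined arguments and environment exceed the limit, and the shell reports it with the same wording.

getconf ARG_MAX                # per-exec limit on argument + environment size
/bin/true $(seq 1 2000000)     # deliberately exceed it on a typical system
# bash: /bin/true: Argument list too long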

I have played around with strace and extrace to try to figure out which command invocation goes wrong, but I don’t think I’m seeing the relevant system call when this error occurs (handled at the library level without actually making the syscall?). I also tried attaching gdb to the running lxd process to try to break on exec() calls, but it just puked all over itself with errors like

/build/gdb-veKdC1/gdb-8.1.1/gdb/nat/x86-linux-dregs.c:146: internal-error: void x86_linux_update_debug_registers(lwp_info*): Assertion `lwp_is_stopped (lwp)' failed.
A problem internal to GDB has been detected,
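
For the record, my strace attempt looked roughly like this (a sketch; adjust the PID lookup and log path to taste). One thing worth keeping in mind is that “argument list too long” is just the kernel’s text for errno E2BIG, and syscalls other than execve can return it, so grepping a full trace for E2BIG may be more telling than tracing execve alone:

sudo strace -f -tt -s 256 -o /tmp/lxd-trace.log -p "$(pidof lxd)"
# in another shell, trigger the failure:
lxc publish production/210226
# then look for the syscall that actually returned E2BIG:
grep -n E2BIG /tmp/lxd-trace.log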

Any great suggestions on how to track down this error? I guess there may be underlying issues with the btrfs filesystem or something like that which indirectly lead to the problem, but I don’t know where to start.
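
On the btrfs theory, the only starting point I can think of is a handful of read-only health checks along these lines (the pool path below is only my guess for a deb-installed LXD; lxc storage show default should point at the real source):

POOL=/var/lib/lxd/storage-pools/default   # assumed mount point
sudo btrfs filesystem usage "$POOL"       # space and metadata usage
sudo btrfs device stats "$POOL"           # per-device error counters
sudo dmesg | grep -i btrfs | tail -n 50   # recent btrfs kernel messages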

lxc monitor produces the following output:

metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0
  level: dbug
  message: handling
timestamp: "2021-03-04T10:53:02.066483163+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: 'New event listener: 96982f24-88cb-4972-82df-421021ae21fb'
timestamp: "2021-03-04T10:53:02.07124664+01:00"
type: logging


metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0/events?project=
  level: dbug
  message: handling
timestamp: "2021-03-04T10:53:02.071113498+01:00"
type: logging


metadata:
  context:
    ip: '@'
    method: POST
    url: /1.0/images
  level: dbug
  message: handling
timestamp: "2021-03-04T10:53:02.072726849+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: 'New task operation: 51ed24be-6de8-4134-8fe9-d6bc4ebabb79'
timestamp: "2021-03-04T10:53:02.086101313+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: 'Started task operation: 51ed24be-6de8-4134-8fe9-d6bc4ebabb79'
timestamp: "2021-03-04T10:53:02.086237679+01:00"
type: logging


metadata:
  class: task
  created_at: "2021-03-04T10:53:02.073507372+01:00"
  description: Downloading image
  err: ""
  id: 51ed24be-6de8-4134-8fe9-d6bc4ebabb79
  may_cancel: false
  metadata: null
  resources: null
  status: Running
  status_code: 103
  updated_at: "2021-03-04T10:53:02.073507372+01:00"
timestamp: "2021-03-04T10:53:02.086282745+01:00"
type: operation


metadata:
  class: task
  created_at: "2021-03-04T10:53:02.073507372+01:00"
  description: Downloading image
  err: ""
  id: 51ed24be-6de8-4134-8fe9-d6bc4ebabb79
  may_cancel: false
  metadata: null
  resources: null
  status: Pending
  status_code: 105
  updated_at: "2021-03-04T10:53:02.073507372+01:00"
timestamp: "2021-03-04T10:53:02.086179486+01:00"
type: operation


metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0/operations/51ed24be-6de8-4134-8fe9-d6bc4ebabb79
  level: dbug
  message: handling
timestamp: "2021-03-04T10:53:02.087969154+01:00"
type: logging


metadata:
  context:
    created: 2021-02-26 03:30:01 +0100 CET
    ephemeral: "false"
    name: production/210226
    used: 1970-01-01 01:00:00 +0100 CET
  level: info
  message: Exporting container
timestamp: "2021-03-04T10:53:02.090946236+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: Initializing BTRFS storage volume for snapshot "production/210226" on storage
    pool "default"
timestamp: "2021-03-04T10:53:02.093332746+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: Mounting BTRFS storage pool "default"
timestamp: "2021-03-04T10:53:02.093443037+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: Initialized BTRFS storage volume for snapshot "production/210226" on storage
    pool "default"
timestamp: "2021-03-04T10:53:12.769503386+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: Stopping BTRFS storage volume for snapshot "production/210226" on storage
    pool "default"
timestamp: "2021-03-04T10:53:16.64802313+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: Mounting BTRFS storage pool "default"
timestamp: "2021-03-04T10:53:16.64803394+01:00"
type: logging


metadata:
  context:
    created: 2021-02-26 03:30:01 +0100 CET
    ephemeral: "false"
    name: production/210226
    used: 1970-01-01 01:00:00 +0100 CET
  level: eror
  message: Failed exporting container
timestamp: "2021-03-04T10:53:16.647989589+01:00"
type: logging


metadata:
  context: {}
  level: dbug
  message: Stopped BTRFS storage volume for snapshot "production/210226" on storage
    pool "default"
timestamp: "2021-03-04T10:53:17.969753374+01:00"
type: logging


metadata:
  class: task
  created_at: "2021-03-04T10:53:02.073507372+01:00"
  description: Downloading image
  err: argument list too long
  id: 51ed24be-6de8-4134-8fe9-d6bc4ebabb79
  may_cancel: false
  metadata: null
  resources: null
  status: Failure
  status_code: 400
  updated_at: "2021-03-04T10:53:02.073507372+01:00"
timestamp: "2021-03-04T10:53:17.970154432+01:00"
type: operation


metadata:
  context: {}
  level: dbug
  message: 'Failure for task operation: 51ed24be-6de8-4134-8fe9-d6bc4ebabb79: argument
    list too long'
timestamp: "2021-03-04T10:53:17.970097498+01:00"
type: logging

What LXD version is that?

I’m failing to find Failed exporting container in our current code bases.

This is lxd 3.0.3 from Ubuntu 18.04. The package version is 3.0.3-0ubuntu1~18.04.1

Could “Failed exporting container” be a constructed message? There is an earlier message there that says “Exporting container”.

Any chance you could test on 4.11 or at least 4.0.x?
The entire storage layer was rewritten for 4.0 so there’s a very good chance that this issue is gone.

3.0.x is the previous LTS release so only gets security fixes at this stage.

There are twenty occurrences of the same Failed exporting container message in LXD 3.0.3 when trying to export a container. See https://github.com/lxc/lxd/blob/lxd-3.0.3/lxd/container_lxc.go#L4836
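
If you want to confirm that locally, something along these lines should list all of them (just a sketch):

git clone --depth 1 --branch lxd-3.0.3 https://github.com/lxc/lxd.git /tmp/lxd-3.0.3
grep -rn "Failed exporting container" /tmp/lxd-3.0.3/lxd/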

I will think about that. This is a production server that I inherited, so I can’t just upgrade recklessly. And migrating the container image to another server would probably run into this same problem. Anyway, if I can reproduce the problem on a test server, can LXD 3.0.3 simply be upgraded to 4.x while keeping the existing containers?

Yeah, the main bump for you will be switching from the deb to the snap; after that, things will be very smooth when dealing with further updates.

To migrate from the deb to the snap, you just need to do:

  • sudo snap install lxd --channel=4.0/stable
  • sudo lxd.migrate

The first command will install LXD 4.0 alongside your existing 3.0 (no impact on anything at this stage); the second will transfer all data from the deb to the snap and start it up so it can upgrade.
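
So the whole thing, with a sanity check at the end, looks roughly like this (the final lxc list is only there to confirm everything came back):

sudo snap install lxd --channel=4.0/stable   # installs the snap next to the existing deb
sudo lxd.migrate                             # moves the data over and starts the snap LXD
lxc list                                     # confirm the containers are back up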

The migration process does shut down all containers, so you need to plan a maintenance window for it. Since no data is copied, just moved, the migration is very quick even with a large number of containers on the system.

We test upgrades daily from the LXD 2.0 and LXD 3.0 debs to the LXD 2.0, 3.0, 4.0 and latest snaps, and it’s also the process that automatically happens when people upgrade their Ubuntu 18.04 systems to Ubuntu 20.04, so it’s pretty well-used code :slight_smile: