"lxc publish" strange behaviour (not working)

lxd

(berzas) #1

On LXD 2.0.10, executing lxc publish container/name_backup --alias name_backup gets stuck without giving any errors. It does, however, create folders with random numbers under /var/lib/lxd/images/, such as lxd_build_884612142, each with a file inside like lxd_post_516577205.

The filesystem used is ZFS, and there is enough space in the zpool.

Any clue about what is happening?

thanks in advance


#2

It can take a few minutes for lxc publish to complete. Did you wait long enough for it to finish?
I suppose that if you kill it with Ctrl+C, you will get remnants like those directories.


(berzas) #3

Yes, in fact we noticed because our backup script suddenly stopped working:

lxc snapshot "${vm}" "${vm}"_backup 
lxc publish "${vm}"/"${vm}"_backup --alias "${vm}"_backup

"lxc publish" also does not work if I leave it running for a long time, and neither does the script in the daily cron jobs.


(berzas) #4

Starting a new test container and trying to publish an image of it also gets stuck:

  lxc publish test --alias test

There are also no traces in /var/log/lxd/lxd.log.


(berzas) #5

Running lxc publish test/test --alias test --debug does not seem to give any error, but it never ends:

DBUG[11-26|[some-time]] Raw response: {"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":{"config":{"core.https_address":"0.0.0.0:8443","core.trust_password":true,"storage.zfs_pool_name":"lxd"},"api_extensions":["id_map"],"api_status":"stable","api_version":"1.0","auth":"trusted","public":false,"environment":{"addresses":

DBUG[11-26|[some-time]] POST {"properties":null,"public":false,"source": {"name":"test/test","type":"snapshot"}}
to http://unix.socket/1.0/images
DBUG[11-26|[some-time]] Raw response: {"type":"async","status":"Operation created","status_code":100,"operation":"/1.0/operations/[some-id]","error_code":0,"error":"","metadata":{"id":"[some-id]","class":"task","created_at":"[some-time]","updated_at":"[some-time]","status":"Running","status_code":103,"resources":null,"metadata":null,"may_cancel":false,"err":""}}

Anything else to troubleshoot this issue?


(Stéphane Graber) #6

It could be that your container contains a very large sparse file which would cause tar to take ages to create the tarball (and eat a lot of disk space).

du -x --apparent-size -sch /

Running that in the container you're about to publish should tell you how large a temporary tarball you'll end up with.
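To illustrate the sparse-file point (a sketch only; the file name sparse.img is hypothetical):

```shell
# A sparse file demo: tar has to read the full apparent size, even though
# the file allocates almost no blocks on disk. "sparse.img" is hypothetical.
truncate -s 1G sparse.img
du -sh sparse.img                  # allocated size: close to zero
du -sh --apparent-size sparse.img  # apparent size: 1.0G (what tar sees)
rm sparse.img
```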


(berzas) #7

Thanks for the advice @stgraber, but as I said before, "lxc publish" does not work even for a new test container; in that case the result of du -x --apparent-size -sch / is 371MB. It also fails for all the other containers that were working until now, with sizes ranging from a few megabytes up to 10-12GB; "lxc publish" gets stuck on all of them.


(berzas) #8

What else could we check to determine why "lxc publish" stopped working?

We have found no way to make "lxc publish" work, and we need it badly since our backups depend on it. Is there a way to reproduce what "lxc publish" does by hand?

I see that the contents of an "lxc export" file are the ones under /var/lib/lxd/containers/, so wouldn't it work to compress each container folder?

Or is there any method to back up containers individually?

What we did until now is snapshot > image publish > export. I know a snapshot is good for a fast local restore, but we need remote backups. I also tested backing up the whole /var/lib/lxd plus lxd.db, but that doesn't suit our need to recover containers individually. However, manually running tar on /var/lib/lxd/containers/ plus sqlite3 /var/lib/lxd/lxd.db .dump > lxdbak.db, and then restoring both on another host (without containers), appears to work. Is this a correct workaround, or could it cause problems?

Any help would be much appreciated, thanks.


#9

You may try sysdig to diagnose what is going on,
https://blog.simos.info/how-to-use-sysdig-and-falco-with-lxd-containers/

Your alternative backup plan looks OK, though let's get a second opinion on this.


(Stéphane Graber) #10

You may want to run lxc monitor at the same time as you run lxc publish, that may give you more details on what's going on inside LXD. You could also look for temporary files in /var/lib/lxd/images as that's where lxc publish will be writing during publish.
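That could look like this (a sketch; the log path and the container name "test" are placeholders):

```shell
# Capture LXD's event stream while publishing; run lxc monitor in the
# background (or a second shell) so both commands overlap.
lxc monitor > /tmp/lxd-monitor.log 2>&1 &
MON_PID=$!
lxc publish test --alias test
kill "$MON_PID"

# Meanwhile, watch for temporary lxd_build_*/lxd_post_* files:
ls -la /var/lib/lxd/images/
```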


(berzas) #11

lxc monitor shows:

metadata:
  context: {}
  level: dbug
  message: 'New events listener: [some-id]'
  timestamp: [date and time]
  type: logging

metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0
  level: dbug
  message: handling
  timestamp: [date and time]
  type: logging

metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0/containers/test
  level: dbug
  message: handling
  timestamp: [date and time]
  type: logging

metadata:
  context: {}
  level: dbug
  message: 'New task operation: [some-id]'
  timestamp: [date and time]
  type: logging

metadata:
  context:
    ip: '@'
    method: POST
    url: /1.0/images
  level: dbug
  message: handling
  timestamp: [date and time]
  type: logging

metadata:
  class: task
  created_at: [date and time]
  err: ""
  id: [some-id]
  may_cancel: false
  metadata: null
  resources: null
  status: Pending
  status_code: 105
  updated_at: [date and time]
  timestamp: [date and time]
  type: operation

metadata:
  context: {}
  level: dbug
  message: 'Started task operation: [some-id]'
  timestamp: [date and time]
  type: logging

metadata:
  class: task
  created_at: [date and time]
  err: ""
  id: [some-id]
  may_cancel: false
  metadata: null
  resources: null
  status: Running
  status_code: 103
  updated_at: [date and time]
  timestamp: [date and time]
  type: operation

metadata:
  context:
    ip: '@'
    method: GET
    url: /1.0/operations/[some-id]
  level: dbug
  message: handling
  timestamp: [date and time]
  type: logging

Here it hangs. In the files under /var/lib/lxd/images/ I see strings like:

{"properties":null,"public":false,"source":{"name":"container_name/snapshot_name","type":"snapshot"}}

I don't see any errors here. I ran lxc monitor while trying to publish a new, empty, stopped container named test; lxc monitor shows similar results if I try to publish an image from a snapshot.


(berzas) #12

Thanks so much for the suggestion. A few days ago I read your post and started digging into this tool. Great discovery; let's see if through it I can determine something about this weird issue.

On the other hand, if we back up the full /var/lib/lxd, or each directory under /var/lib/lxd/ individually, it is easy to end up with system inconsistency on restore, right? Apart from lxc publish, what is the best approach for backing up LXD on top of ZFS?


(berzas) #13

Using csysdig -pc with lstrace gives no output for a new lxc publish process, and attaching gdb to it gives:

attaching to process 17550
[New LWP 17551]
[New LWP 17552]
[New LWP 17553]
[New LWP 17554]
[New LWP 17555]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00000000004a1db3 in ?? ()

The lxc publish test --alias test process has been hanging like this for the last few days; through csysdig I sent kill -9 and the process stopped as expected. But that is not the case for the processes still hanging from the older lxc publish runs on production containers: when I send kill -9 to them, another process just starts.

This is weird; how should we proceed?


#14

You would need to install debug symbol packages in order to get meaningful results from gdb,
https://wiki.ubuntu.com/Debug%20Symbol%20Packages

sysdig output can be messy to make sense of. You can use https://github.com/draios/sysdig-inspect to make sense of the sysdig capture.