vicidial.org

by **marcin** » Tue Jul 09, 2019 10:28 am

Perhaps someone can help me.
New installation of vicibox 8.1 on a Dell R610 server with 6 SSD drives in a RAID 10 configuration.
uname -a output:
Linux pepper 4.4.179-99-default #1 SMP Tue May 14 18:07:16 UTC 2019 (c775d39) x86_64 x86_64 x86_64 GNU/Linux

Once I put a significant load on it (15 agents, predictive, 3:1), asterisk segfaults.
dmesg shows:

[ 3249.645571] ------------[ cut here ]------------
[ 3249.645581] WARNING: CPU: 4 PID: 9773 at ../kernel/sched/core.c:8172 __might_sleep+0x76/0x80()
[ 3249.645585] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810cb7cb>] prepare_to_wait+0x2b/0x80
[ 3249.645632] Modules linked in: ip_set_hash_net(O) ip_set_hash_ip(O) ip_set(O) nfnetlink dahdi(O) crc_ccitt af_packet iscsi_ibft iscsi_boot_sysfs xt_tcpudp msr iptable_filter ip_tables x_tables intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drbg iTCO_wdt ansi_cprng ipmi_ssif iTCO_vendor_support gpio_ich lpc_ich aesni_intel aes_x86_64 lrw gf128mul igb ptp pps_core pcspkr mfd_core ipmi_si ipmi_devintf wmi fjes joydev bnx2 acpi_cpufreq dcdbas dca glue_helper i7core_edac button ablk_helper ipmi_msghandler edac_core cryptd shpchp processor ext4 crc16 jbd2 mbcache sr_mod cdrom uas hid_generic ata_generic usb_storage usbhid i2c_algo_bit drm_kms_helper(O) syscopyarea sysfillrect uhci_hcd ehci_pci sysimgblt fb_sys_fops ehci_hcd ttm(O) ata_piix
[ 3249.645641] sd_mod usbcore libata drm(O) serio_raw usb_common megaraid_sas sg scsi_mod autofs4
[ 3249.645644] CPU: 4 PID: 9773 Comm: asterisk Tainted: G IO 4.4.179-99-default #1
[ 3249.645645] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.4.0 07/23/2013
[ 3249.645647] 0000000000000000 ffffffff81349e57 ffff880e11757cc8 ffffffff81a27b6e
[ 3249.645649] ffffffff810863a1 ffffffffa047a2d0 ffff880e11757d18 0000000000001743
[ 3249.645651] 0000000000000000 0000000000000002 ffffffff8108641c ffffffff81a18070
[ 3249.645651] Call Trace:
[ 3249.645667] [<ffffffff8101b0a9>] dump_trace+0x59/0x350
[ 3249.645671] [<ffffffff8101b49a>] show_stack_log_lvl+0xfa/0x180
[ 3249.645674] [<ffffffff8101c291>] show_stack+0x21/0x40
[ 3249.645680] [<ffffffff81349e57>] dump_stack+0x5c/0x85
[ 3249.645687] [<ffffffff810863a1>] warn_slowpath_common+0x81/0xb0
[ 3249.645691] [<ffffffff8108641c>] warn_slowpath_fmt+0x4c/0x50
[ 3249.645694] [<ffffffff810abfa6>] __might_sleep+0x76/0x80
[ 3249.645702] [<ffffffff811cb5a4>] __might_fault+0x34/0x40
[ 3249.645712] [<ffffffffa047204b>] dahdi_chanandpseudo_ioctl+0x3db/0x1790 [dahdi]
[ 3249.645729] [<ffffffffa047455a>] dahdi_unlocked_ioctl+0x31a/0x14e0 [dahdi]
[ 3249.645735] [<ffffffff8122eef7>] do_vfs_ioctl+0x337/0x5f0
[ 3249.645748] [<ffffffff8122f224>] SyS_ioctl+0x74/0x80
[ 3249.645755] [<ffffffff8164bb25>] entry_SYSCALL_64_fastpath+0x24/0xed
[ 3249.648761] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x24/0xed

[ 3249.648762] Leftover inexact backtrace:

[ 3249.648779] ---[ end trace 4f0560146804eb1e ]---
[ 5165.553988] perf interrupt took too long (2508 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 9074.460721] asterisk[26757]: segfault at 0 ip 000000000052917a sp 00007f7c2072fe90 error 4 in asterisk[400000+2b2000]

The asterisk segfault happened much later than the kernel warning, but I think it points to a CPU issue.
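For reference, the `segfault at 0 ... error 4` line above can be decoded mechanically. The following is a minimal Python sketch, not a definitive tool. It assumes the usual x86 page-fault error-code bits (bit 0 set = protection violation, clear = page not present; bit 1 set = write; bit 2 set = user mode), and the function name `decode_segfault` is made up for illustration.

```python
import re

# Shape of the x86 "segfault at ..." line the kernel writes to dmesg.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Decode one dmesg segfault line into a dict; return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    info = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        # Assumed x86 page-fault error-code layout (see the note above)
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "offset": None,
    }
    if m.group("base"):
        # Offset of the faulting instruction inside the mapped object
        info["offset"] = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return info


line = ("[ 9074.460721] asterisk[26757]: segfault at 0 ip 000000000052917a "
        "sp 00007f7c2072fe90 error 4 in asterisk[400000+2b2000]")
info = decode_segfault(line)
print(info["mode"], info["access"], info["cause"], hex(info["offset"]))
```

Read this way, the line describes a user-mode read from address 0, which is the usual signature of a NULL pointer dereference inside the asterisk binary. On its own it does not single out a particular CPU.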

The last 2 threads of the asterisk core dump backtrace show:

[New LWP 2932]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/asterisk -vvvvvvvvvvvvvvvvvvvvvgcT'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000052917a in ast_frame_adjust_volume ()
[Current thread is 1 (Thread 0x7f7c20734700 (LWP 26757))]
#0 0x000000000052917a in ast_frame_adjust_volume ()
No symbol table info available.
#1 0x00007f7c3bbcf666 in conf_run () from /usr/lib64/asterisk/modules/app_meetme.so
No symbol table info available.
#2 0x00007f7c3bbd3363 in conf_exec () from /usr/lib64/asterisk/modules/app_meetme.so
No symbol table info available.
#3 0x000000000058d479 in pbx_exec ()
No symbol table info available.
#4 0x0000000000581e7c in pbx_extension_helper.constprop ()
No symbol table info available.
#5 0x0000000000583e8a in __ast_pbx_run ()
No symbol table info available.
#6 0x0000000000586d1d in ast_pbx_run ()
No symbol table info available.
#7 0x000000000047b8b5 in ast_bridge_run_after_goto ()
No symbol table info available.
#8 0x00000000004734ad in bridge_channel_ind_thread ()
No symbol table info available.
#9 0x00000000005fa47a in dummy_start ()
No symbol table info available.
#10 0x00007f7cec2a0724 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#11 0x00007f7ceb80ae8d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 188 (Thread 0x7f7c226c3700 (LWP 2932)):
#0 0x00007f7cec2a50ff in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f7c43b54532 in iax2_process_thread () from /usr/lib64/asterisk/modules/chan_iax2.so
#2 0x00000000005fa47a in dummy_start ()
#3 0x00007f7cec2a0724 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f7ceb80ae8d in clone () from /lib64/libc.so.6

I ran a stress test on all cores for 30 minutes with no new dmesg errors.
I stressed the RAM as well, again without new dmesg errors.

Is my hardware bad? If so, which part is it: CPU 4?

by **williamconley** » Tue Jul 09, 2019 11:50 pm

1) If you can't reliably reproduce the segfault in successive runs, it can't be traced to hardware or software and may be a fluke or the result of an external force.

2) ALWAYS post your Full Vicidial Version with build. This is a requirement for all posts on this Free Forum provided by The Vicidial Group. Thank them by adhering to that one rule.

3) Try to also post your Full Installer Version, i.e. 8.X.X, not 8.X. It makes a difference, especially with non-vicidial-script faults.

4) Recently it has been found that some versions of Asterisk are prone to attack when not whitelisted which can result in a segfault. Whitelist your system if it has not already been whitelisted.

5) We noload several modules after having bumped into some segfault and other issues in the distant past.

/etc/asterisk/modules.conf

noload => res_config_sqlite.so
noload => res_config_sqlite3.so
noload => cdr_sqlite.so
noload => cel_sqlite3_custom.so
noload => cdr_sqlite3_custom.so
noload => chan_ooh323.so
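A quick way to confirm that these lines are really active in a given modules.conf is sketched below in Python. This is only an illustration: the helper name `missing_noloads` and the sample text are made up, and it assumes the usual Asterisk config conventions (`;` starts a comment, and both `noload => x` and `noload = x` are accepted).

```python
def missing_noloads(conf_text, wanted):
    """Return modules from `wanted` that have no active `noload` line in conf_text."""
    disabled = set()
    for raw in conf_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop ';' comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "noload":
            # Accept both "noload => x" and "noload = x"
            disabled.add(value.lstrip(">").strip())
    return [mod for mod in wanted if mod not in disabled]


# The modules listed in the post above
WANTED = [
    "res_config_sqlite.so",
    "res_config_sqlite3.so",
    "cdr_sqlite.so",
    "cel_sqlite3_custom.so",
    "cdr_sqlite3_custom.so",
    "chan_ooh323.so",
]

# Hypothetical file content, for demonstration only
sample = """\
[modules]
autoload=yes
noload => res_config_sqlite.so
;noload => chan_ooh323.so
noload => cdr_sqlite.so
"""

print(missing_noloads(sample, WANTED))
```

On a real system the text would come from reading /etc/asterisk/modules.conf instead of the sample string; an empty result means every listed module is noloaded.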

by **marcin** » Wed Jul 10, 2019 8:30 am

This is a new installation using vicibox 8.1.2 iso VERSION: 2.14-711a BUILD: 190607-1525 SVN Version: 3112
All listed modules are noload by default, and the server is behind a firewall and whitelisted.
This is a single server installation.
I have seen segfaults caused by g729 before, but this crash happened while g729 was disabled.
The asterisk core dump did not show any malloc, calloc, or realloc frames, which suggests that RAM was not the issue.
The asterisk messages log shows 30 or so lines like this around the time of the dmesg warning:
chan_iax2.c: Resyncing the jb. last_delay -34, this delay 1159, threshold 1118, new offset -1211
and a bunch of
WARNING[26795][C-000002f6] func_hangupcause.c: Unable to find information for channel
right before the crash.

Hope this info helps.

Edit:
I spotted 2 more potential issues:
MDS CPU bug present and SMT on, data leak possible
and
mtrr: your CPUs had inconsistent variable MTRR settings
mtrr: probably your BIOS does not setup all CPUs

by **williamconley** » Wed Jul 10, 2019 11:54 am

WARNING[26795][C-000002f6] func_hangupcause.c: Unable to find information for channel

irrelevant. feature not bug. lol. ie: lazy programming ... no check for a channel before sending a command related to that channel which is ignored because the channel was already terminated. could they put in a check for the channel right before the command and not issue the command if the channel is already gone? sure. but that's just extra cpu usage ... on a machine with 600 live channels constantly churning that extra cpu usage may be notable whereas the "ignore this" takes no cpu at all and is what happens. that warning should be a notice and it should be a preference to disable it.

how often has it segfaulted? if it's only once, move along. it may never happen again. if you get segfault less than once per quarter ... it's just linux. you can try to chase it, but you'll need to set the core to dump and learn to read the dump.
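As a starting point for reading such a dump, here is a small Python sketch that pulls the frame number, function name, and shared object out of gdb backtrace text like the one in the first post. It is only an illustration: the name `parse_frames` is made up, and the regular expression only covers the simple frame lines gdb prints when no symbol table is available.

```python
import re

# One gdb frame line, e.g.
# "#1 0x00007f7c3bbcf666 in conf_run () from /usr/lib64/asterisk/modules/app_meetme.so"
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?:(?P<pc>0x[0-9a-f]+) in )?(?P<func>\S+) \(.*?\)"
    r"(?: from (?P<obj>\S+))?"
)


def parse_frames(bt_text):
    """Return (frame number, function, shared object or None) for each frame line."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group("num")), m.group("func"), m.group("obj")))
    return frames


backtrace = """\
#0 0x000000000052917a in ast_frame_adjust_volume ()
No symbol table info available.
#1 0x00007f7c3bbcf666 in conf_run () from /usr/lib64/asterisk/modules/app_meetme.so
No symbol table info available.
#2 0x00007f7c3bbd3363 in conf_exec () from /usr/lib64/asterisk/modules/app_meetme.so
"""

for num, func, obj in parse_frames(backtrace):
    # Frames without a "from ..." part belong to the main executable
    print(num, func, obj or "(main binary)")
```

The top few frames of the crashing thread are usually the part worth posting; the frames without a shared object are in the main asterisk executable.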

by **marcin** » Wed Jul 10, 2019 1:06 pm

It happened 4 times within the first hour of using the new system.
The next day, during extensive load testing, it happened again after only 5 minutes.
Load was about 15 agents and 30-35 calls.
I'm replacing the hardware and doing an entirely new installation.
Will update on the outcome.
The asterisk core dump was in my first post.

by **williamconley** » Wed Jul 10, 2019 2:20 pm

possibly a wise move. we generally start with disabling all modules we are not using. then memory. but definitely clear any BIOS, POST or boot errors "just in case".

we did have an old issue where a sip or iax phone would try to register as the wrong type and crash asterisk. but i think they fixed that in the core.

happy hunting. 8-)

by **marcin** » Thu Jul 11, 2019 7:50 pm

I have replaced the entire hardware (I simply used a different server) and the issue persists. There is an identical dmesg message about pseudo dahdi, and asterisk crashed with no particular error in the core dump.
To replicate, one may use sipp and point it at a meetme conference. After about 10 minutes (400 channels) or so, dmesg shows the pseudo dahdi warning, and 2-3 minutes later asterisk 13 crashes.
I ran the very same test on asterisk 11 and saw neither a dahdi crash nor an asterisk crash.

BTW, zypper ref comes back with errors and is unable to refresh. It appears that openSUSE-Leap-42.3-Server-Database.repo is no longer available, and that breaks the mariadb installation. The vicibox-ast11 script also attempts to update the repo and distribution, which breaks mariadb as well.

Anyway, please provide a vicibox version and download link that uses asterisk 11 by default.
Thank you.

by **williamconley** » Thu Jul 11, 2019 9:48 pm

Vicibox 7 uses asterisk 11. To get to the older vicibox downloads, go to the download for the latest Vicibox, then just delete the .iso filename from the URL. When you surf to that location you'll be viewing a file structure with an archive folder that has ALL the prior Vicibox .iso images.

We have Asterisk 13 running in several machines with lots of memory and lots of cores and have not had this issue at all.

I wonder if you are making some sort of changes to the OS or other packages that have a conflict of some sort. But if you don't care about the Viciphone, asterisk 13 and asterisk 11 have no differences and if dropping back to asterisk 11 will resolve your issue that is NOT a bad idea at all.

I will say that if Zypper presents an issue ... Don't Do It. We have had many machines over the years that generated a problem when attempting to zypper up, and the solution has *always* been just to NOT zypper up and use the OS as it arrived from the .iso installer. As long as you're whitelisted and do not employ a hacker, this doesn't present a problem. The last bug fix version of each major distro is always stable (historically). So Vicibox 7.0.4 is your best bet.

by **marcin** » Fri Jul 12, 2019 5:12 pm

William, thank you for all your help. I have been using vici for years and I have shown my appreciation more than once over those years.
You are an exceptional programmer/code writer and I am very happy to work with you from time to time, even if it is only via forum messages.
I am sure it is not easy to keep your cool reading and responding to some of the messages and questions you are seeing here.
Now I am done kissing your ass and I wish to present my findings.
dahdi 2.11.1 crashes using kernel 4.4.155, 4.4.179 and 4.4.180 (from my experience only).
When I downgraded the vicibox install to 7.0.4 with kernel 4.1.36, also using dahdi 2.11.1, the crashes did not happen.
All this may be due to the hardware I am using: a Dell R610 server with an H700 RAID controller, dual X5690 and 96 GB of RAM. All the additional hardware not in use has been removed from the server.
I do not have an explanation, only experience.

Thank you again for all your effort.

by **williamconley** » Fri Jul 12, 2019 7:23 pm

If you're going there, you'll need to compare ... all packages. Especially the asterisk modules.

Unless, of course, you have zero interest in ViciPhone, in which case you should just go with the "non-crashing" version and call it a day. We've found several situations where "wait a version" and the upgrade fixes whatever problem was happening, and the client has zero interest in digging down the actual fault. We've also had several that "had to have" the version in question, and that's how they found their way to us: After failing miserably and pulling out a lot of hair.

Pitbull Programmers. 8-)

Often the problem will be an arcane, unused, or even obsolete package that can merely be discarded. On one occasion we even had to violate one of the Prime Directives of Vicidial to get the systems online. But if it works, it works. lol

by **exile** » Sun Jul 21, 2019 1:30 pm

Marcin, I'm curious how the downgrade worked out for you. We are experiencing the exact same issue as you: a segfault crashing asterisk. The only difference between your issue and mine is that I'm running a cluster. I assumed it was bad hardware in one of the servers, so I migrated everyone to a fresh asterisk server and faced the same problem, a segfault crashing asterisk. Asterisk comes right back up after the segfault, but it's happening quite often. Also running on 8.1.2.

asterisk segfault at 0
