Re: gdb help: debugging a segfault in boost::shared_ptr



phear wrote:

Valgrind reports quite a few errors/warnings, most of them occurring in
the oracle library on connect, which I think is normal. I will try
using mysql, to see if I can replicate the error there.
If you're using OCI, I feel sorry for you. I have plenty of valgrind reports and core dumps, all segfaulting deep in Oracle code (mostly Oracle's internal resolver libraries, judging by the function names). For me, it appeared to get better after switching to connection pools, setting up the connection manually, and keeping the environment handles open instead of relying on the easier OCILogon.


Valgrind also reports two errors that are likely related:

==9500== 994 errors in context 184 of 186:
==9500== Invalid read of size 4
==9500== at 0x43FD729: boost::detail::shared_count::shared_count(boost::detail::shared_count const&) (shared_count.hpp:165)
==9500== by 0x43FD9DB: boost::shared_ptr<IDatabaseAdapter>::shared_ptr(boost::shared_ptr<IDatabaseAdapter> const&) (shared_ptr.hpp:106)

==9500== Address 0x591D60C is 20 bytes inside a block of size 24 free'd
==9500== at 0x401D268: operator delete(void*) (vg_replace_malloc.c:246)
==9500== by 0x43FDF68: __gnu_cxx::new_allocator<std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject> >::deallocate(std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>*, unsigned) (new_allocator.h:94)
==9500== by 0x43FDF9D:

Well, at first glance this looks like a race condition in the handling of the shared_ptr: perhaps the object containing the shared_ptr was destroyed while something was still accessing it. Are you using shared_ptr's get() function instead of taking an explicit (counted) reference, or passing the object (or one of its members) to a function by reference instead of by value? If so, are you sure the object's lifetime is guaranteed for as long as any code can use the pointer returned by get() or the reference?
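
To make the lifetime question concrete, here is a minimal sketch of the pattern the trace hints at: a shared_ptr stored inside a std::list node that gets freed while someone still wants the adapter. Everything in it is an assumption for illustration. It uses std::shared_ptr as a stand-in for boost::shared_ptr, and Pool, PoolObject and FakeAdapter are made-up names, not your actual classes.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>

// Hypothetical stand-ins for the types named in the valgrind trace.
struct IDatabaseAdapter {
    virtual ~IDatabaseAdapter() = default;
    virtual int id() const = 0;
};

struct FakeAdapter : IDatabaseAdapter {
    explicit FakeAdapter(int i) : id_(i) {}
    int id() const override { return id_; }
    int id_;
};

struct PoolObject {
    std::shared_ptr<IDatabaseAdapter> adapter;
};

class Pool {
public:
    void add(int id) {
        objects_.push_back(PoolObject{std::make_shared<FakeAdapter>(id)});
    }

    // Risky: the returned reference points into a list node. If the node is
    // erased before the caller copies from it, the copy constructor reads
    // freed memory (the "Invalid read" in shared_count's copy constructor).
    const std::shared_ptr<IDatabaseAdapter>& frontByRef() const {
        return objects_.front().adapter;
    }

    // Safer: the caller receives its own counted copy.
    std::shared_ptr<IDatabaseAdapter> frontByValue() const {
        return objects_.front().adapter;
    }

    void dropFront() { objects_.pop_front(); }
    std::size_t size() const { return objects_.size(); }

private:
    std::list<PoolObject> objects_;
};

// Takes a counted copy first, then lets the pool free the node.
int useAfterDrop(Pool& pool) {
    std::shared_ptr<IDatabaseAdapter> held = pool.frontByValue();
    pool.dropFront();   // list node is deleted; 'held' keeps the adapter alive
    return held->id();  // still valid
}
```

If several threads touch the pool, taking the copy and erasing the node would also have to happen under the same lock; the sketch leaves that out.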

These only confirm what I already know though. The count shows
that it happens almost every time, but oddly enough it never results in
a segfault when running it through valgrind.
That is to be expected: valgrind takes the hit (addresses stay valid longer than they normally would, so valgrind can monitor them), so no segfault occurs. I believe you can tune how long valgrind holds on to deallocated blocks (Memcheck's --freelist-vol option, which can also be set through the VALGRIND_OPTS environment variable).

Without valgrind, the "Invalid read" only causes a segfault if that address happens to lie in a part of the heap that has already been returned to the system. The allocator usually doesn't hand freed memory back to the OS right away, so the bad read often goes unnoticed and the crash shows up only some of the time.


You mentioned doing some intrusive code changes to analyze the
corruption. What did you have in mind?

Using Boost's intrusive pointers (boost::intrusive_ptr). They keep the reference count (the thing that caused the actual segfault) inside the objects you are counting. That also saves quite a few allocations (the reference count isn't allocated separately), but it does require you to modify the classes to hold a reference count and to supply the counting functions (intrusive_ptr_add_ref and intrusive_ptr_release).

(Older Boost implementations only required you to derive from an intrusive pointer class, and shared_ptr would automatically adapt. I guess they figured that was too easy to use and that it needed to be more complex.)
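
A rough sketch of what the intrusive approach involves, in case it helps. The IntrusivePtr template below is a hand-rolled stand-in written only for this example (it is not Boost's implementation) so the snippet compiles without Boost; the two free functions use the names boost::intrusive_ptr looks for, so the same class should work with the real thing. Adapter and the destroyed flag are hypothetical, and the plain int counter is not thread-safe.

```cpp
#include <cassert>
#include <utility>

// Hypothetical class: the reference count lives inside the object itself.
class Adapter {
public:
    explicit Adapter(bool* destroyedFlag) : refs_(0), destroyed_(destroyedFlag) {}
    ~Adapter() { *destroyed_ = true; }
    int refs() const { return refs_; }

private:
    // Counting functions, found through argument-dependent lookup.
    friend void intrusive_ptr_add_ref(Adapter* a) { ++a->refs_; }
    friend void intrusive_ptr_release(Adapter* a) {
        if (--a->refs_ == 0) delete a;
    }

    int refs_;         // plain int: would need to be atomic for threaded use
    bool* destroyed_;  // lets the demo observe the destructor
};

// Minimal stand-in for boost::intrusive_ptr, just enough to show the mechanism.
template <typename T>
class IntrusivePtr {
public:
    explicit IntrusivePtr(T* p = nullptr) : p_(p) {
        if (p_) intrusive_ptr_add_ref(p_);
    }
    IntrusivePtr(const IntrusivePtr& other) : p_(other.p_) {
        if (p_) intrusive_ptr_add_ref(p_);
    }
    IntrusivePtr& operator=(IntrusivePtr other) {
        std::swap(p_, other.p_);
        return *this;
    }
    ~IntrusivePtr() {
        if (p_) intrusive_ptr_release(p_);
    }
    T* get() const { return p_; }
    T* operator->() const { return p_; }

private:
    T* p_;
};

// Returns true if the count behaved as expected and the object was destroyed
// exactly when the last pointer went away.
bool demoIntrusive() {
    bool destroyed = false;
    {
        IntrusivePtr<Adapter> a(new Adapter(&destroyed));
        if (a->refs() != 1) return false;
        {
            IntrusivePtr<Adapter> b(a);
            if (a->refs() != 2) return false;
        }
        if (a->refs() != 1 || destroyed) return false;
    }
    return destroyed;
}
```

Because the count sits inside the object, there is no separately allocated count block that can be freed out from under a pointer copy, and you can print refs() from gdb directly on the object while hunting the corruption.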


