Hi
I sometimes use Csound for basic analysis, and I think there is quite a bit of scope for developing more in-depth approaches. Here is a fairly basic example of the sort of thing I do:
http://userfiles.1bpm.net/cs_example/analysis/analysis_example.html
(and the raw CSD - http://userfiles.1bpm.net/cs_example/analysis/analysis_example.csd )
The input file for this example is http://userfiles.1bpm.net/cs_example/analysis/test.wav
The example CSD outputs the following text:
Start: 0.124000 ; End: 4.124000 ; Pitch min: 188.235294, max: 657.534247, avg: 623.896755 ; Centroid min: 2492.601178, max: 5333.084601, avg: 3276.696148
Start: 5.857333 ; End: 9.074667 ; Pitch min: 436.363636, max: 5333.333333, avg: 937.918167 ; Centroid min: 2173.124842, max: 11985.012861, avg: 2550.169149
Start: 9.076000 ; End: 13.569333 ; Pitch min: 436.363636, max: 2181.818182, avg: 2133.588831 ; Centroid min: 3206.290090, max: 6883.542139, avg: 5076.361830
Start: 13.570667 ; End: 18.561333 ; Pitch min: 2181.818182, max: 8000.000000, avg: 2885.677004 ; Centroid min: 3214.463470, max: 6472.805549, avg: 3679.556923
It borrows from the pvsbufread manual page example to perform the analysis operations in a single k-cycle. First the min/max/mean RMS of the input file are determined, so that a ratio can then be used to segment the file into parts. Then, for each part, the min/max/mean pitch and centroid are extracted, along with the start and end times.
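To make the segmentation step a bit more concrete, here is a minimal sketch of the same flow in Python rather than Csound. It is not the CSD itself: the function names, the non-overlapping frames, and the threshold formula (min + ratio * (max - min)) are all assumptions made for illustration, and the linked CSD may well compute its threshold differently.

```python
import math


def frame_rms(samples, frame_size):
    """Return the RMS of each consecutive, non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def summarise(values):
    """Return (min, max, mean) of a non-empty list of numbers."""
    return min(values), max(values), sum(values) / len(values)


def segment_by_rms(rms, ratio, frame_size, sample_rate):
    """Return (start, end) times in seconds of runs of frames above a threshold.

    ASSUMPTION: threshold = min + ratio * (max - min); the CSD may differ.
    """
    lowest, highest, _mean = summarise(rms)
    threshold = lowest + ratio * (highest - lowest)
    segments, start = [], None
    for i, value in enumerate(rms):
        if value > threshold and start is None:
            start = i  # a segment opens at this frame
        elif value <= threshold and start is not None:
            segments.append((start * frame_size / sample_rate,
                             i * frame_size / sample_rate))
            start = None  # the segment closed just before frame i
    if start is not None:  # the input ended while still inside a segment
        segments.append((start * frame_size / sample_rate,
                         len(rms) * frame_size / sample_rate))
    return segments


def tone(n, freq=50.0, amp=0.5, sample_rate=1000):
    """Hypothetical test signal: n samples of a sine wave."""
    return [amp * math.sin(2 * math.pi * freq * t / sample_rate)
            for t in range(n)]


# Synthetic input: silence, tone, silence, tone (1000 Hz sample rate).
samples = [0.0] * 200 + tone(400) + [0.0] * 200 + tone(300)
rms = frame_rms(samples, 100)
for start, end in segment_by_rms(rms, 0.5, 100, 1000):
    print(f"Start: {start:.6f} ; End: {end:.6f}")
```

The same summarise() helper could then be applied to the per-frame pitch and centroid values that fall inside each segment to get the min/max/avg figures.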
I think this idea could be extended to analyse the changes between the identified segments (e.g. whether the pitch differs by more than a certain threshold), peaks and so on, perhaps with transient detection, and that could then be used to get more of the detailed information you are looking for. There are a lot of possibilities and parameters to try, but hopefully this helps in some way.
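As one possible starting point for the pitch-change idea, here is a small Python sketch (again not Csound, and not part of the example CSD). It takes the average pitches printed in the output above and flags consecutive segments whose averages differ by more than a threshold. Measuring the difference in semitones and using 6 semitones as the threshold are both arbitrary assumptions, and pitch_changes is a hypothetical helper name.

```python
import math


def pitch_changes(avg_pitches_hz, threshold_semitones):
    """Return (index, interval) for each pair of consecutive segments whose
    average pitch differs by more than the threshold (in semitones).

    index is the position of the first segment of the pair; interval is
    signed, so a rise in pitch is positive and a fall is negative.
    """
    changes = []
    for i in range(len(avg_pitches_hz) - 1):
        interval = 12 * math.log2(avg_pitches_hz[i + 1] / avg_pitches_hz[i])
        if abs(interval) > threshold_semitones:
            changes.append((i, interval))
    return changes


# Average pitches (Hz) of the four segments in the example output.
avg_pitches = [623.896755, 937.918167, 2133.588831, 2885.677004]
for index, interval in pitch_changes(avg_pitches, 6.0):
    print(f"Segment {index} -> {index + 1}: {interval:+.2f} semitones")
```

The same comparison could be run on the centroid averages, or on min/max values, depending on what kind of change is of interest.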
All the best
Richard