Once we finally, rolled back a set of features Solr seemed to be behaving optimally. Below is what we were seeing as we looked at our search servers CPU:
Throughout the day we had periods where we saw large CPU spikes, but they didn’t really seem to affect throughput or average latency of the server. None the less we suspected there was still an issue, and started looking for a root cause.
Kill -3 To The Rescue
If you’ve never used kill -3, its perhaps one of the most useful Java debugging utilities around. It tells the JVM to produce a full thread dump, which it will then print to the STDOUT of the process. I became familiar with this when trying to hunt down treads in a Tomcat container that were blocking the process from exiting. Issuing kill -3 would give you enough information to find the problematic thread, and work with development to fix it.
In this case, I was hunting for a hint as to what went wrong with our search. I issued kill -3 during a spike, and got something like this:
Looking at the the output, I noticed that we had a lot threads calling FuzzyTermEnum. I thought this was strange, and sounded like an expensive search method. I talked with the developer, and we expected that the tilde character was being ignored by edismax. At the very least being escaped by our library, since it was included in the characters to escape. I checked the request logs, and we had people looking for exact titles that contained ~. This turned a 300ms query into a query that timed out, due to the size of our index. Further inspection of the thread dump revealed that we were also allowing the * to be used in query terms as well. Terms like *s ended up being equally problematic.
A Solr Surprize
We hadn’t sufficiently tested edismax, and we’re surprised that it ran ~,+,^, and * when escaped. I didn’t find any documentation that stated this directly, but I didn’t really expect to. We double checked our Solr library to see if that it was properly escaping the special characters in the query, but they we’re still being processed by Solr. On a hunch we tried double escaping the characters, which resolved the issue.
I’m not sure if this is a well known problem with edismax, but if you’re seeing odd CPU spikes this is definitely worth checking for. In addition, when trying to get to a root of a tough problem kill -3 can be a great shortcut. It saved me a bunch of painful debugging, and really eliminated almost all my guess work.
Comments
One response to “Solr Upgrade Surprise and Using Kill To Debug It”
Thanks Geoffrey for this article, this is really helpful as I am going to do migration to 4.0 myself.
A note about kill -3. First of all it does not kill the process, it just send a “signal” to the process. Many times people are afraid to use it because they think it is going to cause the process to terminate.
Second, it sends the output to the standard out of the process you are sending the signal to, i.e. your java/tomcat, which may be redirected somewhere, like catalina.out
Third, instead of kill -3 you can use jstack . Jstack is bundled with jdk, although I am not sure if it is a standard across differents JDKs implementations or specific to Oracle’s.
Jstack is different in the way that it prints thread dump to its own standard out, i.e. current shell, which is usually more convenient.