Successful ECI: beware the tuning

Published on: 16/01/2013
News

Hi,

Right2Water.eu is hosted on a dedicated server running Linux + MySQL + GlassFish. About a week ago, we started getting an order of magnitude more traffic and signatures than we (and the other ECIs) had experienced previously.

We are currently getting more than 30'000 signatures a day, with peaks of 3000 signatures per hour.

Unfortunately, the server doesn't handle the load very well. It has crashed or become extremely slow many times, and we have had to restart the application server at least 50 times. We know for a fact that this downtime has cost thousands of signatures.

The team at the European Commission has been extremely supportive and we are working together to identify the bottlenecks and remove them. We have probably improved the situation a bit, but as of today, we are still not OK.

If you plan to host the OCS software yourself, set aside a lot of time to read all the best practices and tuning suggestions before you start. Obviously, you won't be able to know in advance whether the tuning actually made it more scalable, so expect to need a lot of system administrator time when your ECI becomes successful.

If you are an expert in GlassFish tuning and monitoring, could you please contact me? I would love for my team and me to go back to normal working hours ;)

It's too early to say whether the bottleneck is in the OCS software (at least 2 requests could be avoided per visit, for instance) or the configuration of GlassFish, probably both. I'll update this post when we have solved it.

Edit: we have now passed one million signatures, and it only crashes every 10 days or so. Our peak was nearly 100'000 signatures in a single day.

The OCS team at the European Commission found the cause of the problem: the application had a session timeout of 10 hours. We reverted to the default (30 min) and now the system works under load.
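To get a feel for why the session timeout matters so much, here is a back-of-the-envelope sketch in Python. It is only an estimate under stated assumptions, not something taken from the OCS code: it assumes each visit opens one server-side session that stays in memory until the timeout expires, and it uses the peak of 3000 signatures per hour as a lower bound for new sessions per hour (not every visitor signs). `live_sessions` is a hypothetical helper name.

```python
def live_sessions(new_sessions_per_hour: float, timeout_minutes: float) -> float:
    """Little's law: average number in the system = arrival rate * time in system.

    Assumption: a session lives for roughly the timeout, because visitors
    leave without logging out and the server only drops the session on expiry.
    """
    return new_sessions_per_hour * (timeout_minutes / 60.0)


PEAK_PER_HOUR = 3000  # peak signatures per hour quoted in the post

for timeout_minutes in (600, 30):  # 10 hours vs. the 30 min default
    estimate = live_sessions(PEAK_PER_HOUR, timeout_minutes)
    print(f"timeout {timeout_minutes:3d} min -> about {estimate:.0f} sessions held in memory")
# timeout 600 min -> about 30000 sessions held in memory
# timeout  30 min -> about 1500 sessions held in memory
```

Under these assumptions the 10 hour timeout keeps about 20 times more sessions alive at peak than the 30 min default, which would fit with the server only falling over under sustained load.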

I'm not sure a patch has been created yet, but if you host OCS, I'd suggest you make the change manually (you need to change the config and redeploy).

X+

Comments

Wed, 16/01/2013 - 21:23

 

> It's too early to say whether the bottleneck is in the OCS software (at least 2
> requests could be avoided per visit, for instance) or the configuration of
> GlassFish, probably both.

I wish you had realised this before running such a violent campaign on social networks against OCS' stability.
From someone who
1) makes money using OCS
2) gets free technical support from the Commission to optimize your paid OCS setup
3) lacks the technical knowledge to distinguish between problems caused by OCS and problems caused by your own hosting environment,
I would have expected a more professional attitude.

Thu, 17/01/2013 - 03:38


Posted by Lionel Antunes on January 16, 2013 at 21:23
> It's too early to say whether the bottleneck is in the OCS software (at least 2
> requests could be avoided per visit, for instance) or the configuration of
> GlassFish, probably both.

I wish you had realised this before running such a violent campaign on social networks against OCS' stability.
From someone who
1) makes money using OCS
2) gets free technical support from the Commission to optimize your paid OCS setup
3) lacks the technical knowledge to distinguish between problems caused by OCS and problems caused by your own hosting environment,
I would have expected a more professional attitude.

 

Dear Lionel,

I'm afraid you vastly underestimate what a violent campaign against OCS stability would look like if I were to run one ;) This is a factual description of the issues we face, an attempt to share the information we have as we progress.

As for your points:

1) We are working on a fixed budget, so all the extra time I spend on the OCS is money I lose, not money I make. It's 3am and I'm not invoicing anyone for the time I just spent on OCS, for instance. Besides, I'm not sure I understand how whether I make money has any impact on the intrinsic stability of OCS.

2) You might have skipped that point, but I did acknowledge the support from some of the staff at the Commission and I mentioned that I highly appreciated it. I also mentioned that despite their time, expertise and work, we still haven't been able to solve the problem. And unless you are trying to imply that because they provided free support they didn't do their best, I'm not sure what impact that would have on the stability of OCS either.

3) I know we have the technical knowledge to build and host the main website, which handles vastly more traffic than OCS and works flawlessly with zero downtime. I know as well that I have applied the same expertise to the OCS server, which has roughly the same hardware, and we did follow all the infrastructure recommendations about the configuration as published in the EC-provided documentation.

As for distinguishing the source of the problem, I know for a fact that the OCS software generates more requests than necessary. So I know that, all things being equal, a different design would have allowed it to handle a higher number of signatures on similar infrastructure.

Finally, I am deeply interested in finding out which part of the problem comes from the OCS and which from the infrastructure, not least so that we stop losing thousands of signatures daily and so I can get back to full nights of sleep. That is why we have asked numerous times whether the European Commission had run any load tests in realistic conditions on an open source environment (Linux + GlassFish + MySQL).

Specifically:

  • full workflow (forced redirect to home + home + "support" click + country choice + form filled in with an error + correct form submitted)
  • requesting the included resources (CSS + images + JS + captcha)
  • at least 1000 concurrent visitors
  • tested on at least 50k signatures spread unevenly over at least half a day
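
As a rough illustration, the profile in the list above could be turned into a per-hour test plan like the Python sketch below. Only the 50k total, the half day and the workflow steps come from the list; the hourly weights are made up for the sake of the example, `hourly_plan` is a hypothetical helper, and the 1000 concurrent visitors and the static resources are not modelled here.

```python
# Page requests in one complete visit, taken from the workflow bullet above.
WORKFLOW = [
    "forced redirect to home",
    "home",
    '"support" click',
    "country choice",
    "form submitted with an error",
    "correct form submitted",
]

TOTAL_SIGNATURES = 50_000  # from the list: at least 50k signatures
# Hypothetical hourly weights for a 12-hour run with a midday peak (assumption).
WEIGHTS = [1, 1, 2, 3, 5, 8, 8, 5, 3, 2, 1, 1]


def hourly_plan(total: int, weights: list[int]) -> list[int]:
    """Split `total` signatures across the hours in proportion to `weights`."""
    scale = sum(weights)
    plan = [total * w // scale for w in weights]
    # Integer division can leave a remainder; give it to the busiest hour.
    plan[weights.index(max(weights))] += total - sum(plan)
    return plan


plan = hourly_plan(TOTAL_SIGNATURES, WEIGHTS)
for hour, signatures in enumerate(plan):
    # Page requests only; CSS, images, JS and captcha come on top of this.
    page_requests = signatures * len(WORKFLOW)
    print(f"hour {hour:2d}: {signatures:6d} signatures, {page_requests:6d} page requests")
```

With these made-up weights the busiest hour carries 10'000 signatures, i.e. 60'000 page requests before any static resources are counted.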

I haven't found any documentation about the tests the Commission has done, and no one at the Commission has been able to provide any load test results.

In the absence of such tests, and because we are the only ones to have ever experienced such high volumes on OCS, I'm not sure how anyone could categorically rule out that the problem is caused, at least partly, by OCS.

By all means, if you have conducted such a test or are aware of one, please do publish it on the internet so all the ECIs can benefit from it. If you were able to share the details of the hosting environment used and the configuration of GlassFish, that would solve our problem and let us focus on a successful first ECI. I'm sure that's what we both want and hope for.

And as a professional, I will then put the blame on our environment and help you run a violent campaign on social networks in support of OCS stability ;)

X+

P.S. Talking about professional attitude, it's important to avoid ad hominem attacks. I don't think I have publicly doubted the technical knowledge of anyone working around the OCS. It would help if we all did the same, especially when we don't know the details of each other's technical knowledge.

Thu, 17/01/2013 - 12:22

@Xavier
It sounds like you're living through the hell we've all been predicting since the first release of the OCS. I highly appreciate your patience and dedication towards making it work...
We all know how you feel right now, chin up + see you in Brussels next week!

@lionel
If you don't understand what people like Xavier and us (both being "commercial" entities) are trying to achieve here, then you probably haven't been following the conversations and discussions. Both of us have attended numerous events and meetings with the Commission, had email exchanges with people in the EC and put in A LOT of unpaid work for our clients. We don't even earn an average wage with the OCS-related work.

Calling 3.000 signatures an hour a "violent campaign" is rather cute. Everyone working in the field of digital campaigning (like Xavier and us) knows that 3k signatures an hour is not even the bottom end of the stress testing we do on decent campaigning tools and petition sites.


The OCS does a lot of things that make stress testing super annoying, and you probably know that. Without the key to "unlock" it you can't even access the public page. The form validation is a pain, you have to hack your way around the OCS to even get to a tool you can stress test (and it probably behaves differently once live), and GlassFish itself has a lot of bugs (also security issues and scalability issues) that play into it. Now, if you tell me that you've done all the stress testing we usually do with servers and websites for our clients, it would be very nice of you to share info on your hacks and your setup.

Cheers,
Florian

Thu, 17/01/2013 - 15:21

I can only confirm: Lionel's money argument is so far from reality it's not even funny anymore.

 

About whether the OCS can be blamed for instability: 

Xavier is absolutely right, this should have been tested during development. Then today we would be having a different discussion.

Thu, 17/01/2013 - 15:30

And btw we might even have done more testing ourselves if the software didn't make this terribly difficult. It only works over HTTPS, it requires a special document to be uploaded, and the test mode requires a login. Not to mention the time troubleshooting eats up, so one is happy if it works at all.

 

Given all this, it is hard not to blame the OCS when it comes to instability.

Thu, 17/01/2013 - 20:47

 

 

Apparently some people didn't understand what I was referring to, so here's a small excerpt from Xavier's Twitter account over the last few days:

---

15/01: The #ECI software doesn't handle the load and is a pita to deal with.

 

15/01: Well, it exploded so often under the load that I probably had to restart #right2water >50 times since fri.

 

14/01: another long day trying to make the #EC software for #ECI working. Seems that defective by design is an official position...

 

14/01: I'm focussed on the f*$$ of the #EC that has been crashing non stop under the load.

 

13/01: Guess what? the #EC software goes crazy and crashes all the time. glassfish expert? we need you

---

Very graceful. Then people help you fix your problems, then you come here with “tuning is important” and "it's too early to know where the problems come from". Before criticizing an application as harshly as you did, I would first learn to read its logs and then make sure that my problems really come from that application. You seem to prefer the "shoot first, check later" way.

 

> Talking about professional attitude, it's important to avoid ad hominem attacks.

 

Pretty hilarious coming from someone who just called me "a troll or a moron":

https://twitter.com/eucampaign/status/291913804206448640

 

Split joinup/twitter personality maybe?

 

Apparently you see no problem with your behaviour, so I'll leave it here and instead go back trying to improve OCS and helping organisers to go online. Most of them are quite happy with EC's free software, and I'm sure Right2water organisers are delighted to have been able to collect 250k signatures thanks to OCS. That's all that matters to me.

Fri, 18/01/2013 - 08:44

Dear Lionel,

Thank you for clarifying that your definition of a "violent campaign on social networks against OCS' stability" is 5 tweets where I mention I have problems with the OCS software crashing. And thank you for noticing that the tone might be more colloquial in a Twitter conversation than on Joinup.

As for your concern about my split personality, might it simply be that some public places are better for blowing off steam or making private jokes with friends, while others focus more on discussing OCS development?

You might be interested to know that thanks to an idea from one of the people at the European Commission and the debugging and tuning session I had with them this morning, the OCS software for Right2Water seems to be behaving normally again. It's too early to get a definitive answer but so far, so good.

Would you like to bet with us on whether the problem was in the configuration of the server or in the OCS software?

To be continued....

X+