PcPerf.fr
PcPerf bot

PS3 servers


Our FAH servers for PS3's are getting hit pretty hard right now. We are looking into whether this is a client problem (failure to back off correctly during high loads) or a server issue. We added a server last night to help and will add more this morning. We are actively working on this one right now.
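
To illustrate what "backing off correctly" under load usually means, here is a minimal sketch of one common scheme, exponential backoff with jitter. It is not the actual FAH or PS3 client code; the function names, the 2-second base delay and the one-hour cap are all hypothetical.

```python
import random


def backoff_bounds(max_attempts, base=2.0, cap=3600.0):
    """Upper bound (seconds) of the wait window before each retry.

    The window doubles after every failed attempt and is capped, so a
    client that keeps failing contacts the server less and less often.
    `base` and `cap` are illustrative values, not FAH settings.
    """
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_attempts)]


def backoff_delay(attempt, base=2.0, cap=3600.0, rng=random):
    """Randomized wait (seconds) before retry number `attempt` (0-based).

    The delay is drawn uniformly from [0, window] so that many clients
    that failed at the same moment do not all retry at the same moment.
    """
    window = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, window)
```

With these hypothetical defaults, the wait windows for the first five retries top out at 2, 4, 8, 16 and 32 seconds. A client that retries at a fixed short interval instead keeps adding load exactly when the server is least able to handle it.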

 

 

 


We've been trying various server-side changes to improve the PS3 situation and have been in close contact with Sony. We have some ideas which we will be implementing. So far, the situation has gotten better (at least based on our statistics), but it's still not good, and we're working to improve it.

 

 

 


We've been working constantly over the last few days to improve the PS3 situation. Based on our statistics, the situation has gotten better. A few days ago (at its worst), it was very bad (~80% failure, which means 5-6 retries on average or more to get WU's back). We're seeing more like 20-35% (1-3 retries) right now, and the trend is getting better and better. We have added some additional PS3 servers and made some major code changes. A new client should fix this issue, which was introduced in 1.3, so that it doesn't happen again. For now, our server-side changes should take care of this in time, although it may take a few more days to settle down to, say, 1% failures (which is more typical). We have also extended deadlines to compensate for this problem.
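
As a rough check on how a failure rate translates into retries, here is a small sketch. It assumes that every attempt fails independently with the same probability, which is a simplification (failures on an overloaded server are likely correlated), and the function name is hypothetical. Under that assumption the average number of attempts needed is 1 / (1 - p).

```python
def expected_attempts(failure_rate):
    """Average number of attempts until one succeeds.

    Assumes each attempt fails independently with probability
    `failure_rate` (a geometric distribution); this is a modelling
    assumption, not something measured on the FAH servers.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


# Failure rates quoted in the update above.
for p in (0.80, 0.35, 0.20, 0.01):
    print(f"{p:.0%} failure -> {expected_attempts(p):.2f} attempts on average")
```

Under this simple model, 80% failure works out to about 5 attempts per returned WU, and 20-35% failure to roughly 1.25-1.5 attempts, which is the same ballpark as the figures quoted in the update.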

 

 

 

Edited by Thor: merged the posts about the PS3 problem.

We know this is an issue of concern, so I'll be publishing daily updates until it's taken care of for good. As posted in update #3 yesterday, we expect it to take a couple of days for our fixes to truly kick in. So far, so good -- the WU failure rate is now down to about 35% across the board and the servers have stabilized somewhat. There's still a ways to go from 35% to the normal values, but at least everything is going as expected for this fix. A client update from Sony would fix this issue for good without the extreme server side machinations we've had to do, and we are working with them on that as well.

 

Translation (originally posted in French):

We know this is a source of concern, so I will publish daily updates until the problem is resolved for good.

As stated in update #3 yesterday, we expect it to take a few days for the fixes to fully take effect.

So far, so good -- the WU failure rate has now dropped to around 35% and the servers have stabilized somewhat.

There is still some way to go to get from 35% back to normal values, but at least everything is going as planned with this fix.

A client update from Sony should resolve the problem for good, without the very complicated server-side mechanisms we have had to use, and we are working with them on this.

 

Edited by Thor: added the translation.

Some good news. The servers have continued to improve (25-30% failure now, which means that at least 1 out of 2 attempts should work, so only ~2 retries should be needed). We've been implementing and testing lots of different strategies and I think we've found one that works best. We've also been in close contact with Sony and they have some ideas on the client side and are working on revisions there.

 

 

 

We've also drastically brought down the assigns on vsp06, the server which was assigning large WU's, to make sure that it's not loaded when those WU's need to come back (which should be about now). So, the bottom line is that the server load is still extremely heavy, but the situation is continuing to improve. Most importantly, the client mods should prevent this from happening in the future.

 

 

 


We have been working aggressively to find the root of this problem, investigating all possibilities (server, client, network, etc). Our investigation has found that this issue is due to a problem in the client, and we have identified a specific issue that's causing it. We have given Sony the info on how to fix this, and we are hoping that they can come up with an updated client soon. Unfortunately, with the PS3 we cannot update the client ourselves; otherwise we would release a client update, as we have done in the past as needed. Therefore, over the last few days, we have worked on server-side tweaks (the only part we can work on) while Sony works on the client. Until the client gets updated, we expect that the situation will continue to improve gradually, but with the issues we have been seeing still present.

 

 

 

It looks like there are some misconceptions about the situation, based on the comments here. This is not an issue of the FAH servers not being powerful enough -- any server network would show this same problem given the client issue that's going on (in fact, the FAH servers serving the PS3 are more powerful than those in other parts of FAH, which are operating just fine with far more clients); indeed, the PS3 backend has been spec'd out to handle 1M PS3 clients and we are way under that. This is also not an issue of the PS3's being too fast. The server load depends on the number of clients and the number of bytes they send back; in FAH, the PS3's compute faster, and that speed is used to do more complex calculations, not to send more bytes, and thus does not create a greater server load.
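
The point about load can be sketched with a toy model. It assumes that server load is simply the average number of bytes returned per second, and that a faster client is handed proportionally more work per WU. Every name and number below is hypothetical and chosen only for illustration.

```python
def upload_rate(clients, result_bytes, wu_work, client_speed):
    """Average bytes per second returned to the servers (toy model).

    clients      -- number of active clients
    result_bytes -- size of one returned result, in bytes
    wu_work      -- amount of computation in one WU (arbitrary units)
    client_speed -- computation per second per client (same units)
    """
    if wu_work <= 0 or client_speed <= 0:
        raise ValueError("wu_work and client_speed must be positive")
    seconds_per_wu = wu_work / client_speed
    return clients * result_bytes / seconds_per_wu


# Hypothetical slow client: 28,800 units of work at 1 unit per second.
slow = upload_rate(30_000, 500_000, wu_work=28_800, client_speed=1)
# Hypothetical client that is 10x faster and gets a 10x more complex WU.
fast = upload_rate(30_000, 500_000, wu_work=288_000, client_speed=10)
# Same hardware as `slow`, but twice as many clients.
more_clients = upload_rate(60_000, 500_000, wu_work=28_800, client_speed=1)
```

In this model `slow` and `fast` are identical, because the extra speed goes into a harder calculation rather than into more frequent or larger uploads, while doubling the number of clients doubles the rate.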

 

 

 

The client update would fix how the client interacts with the servers to stop the problem we're seeing right now, including the issues with assigns (getting new WU's), accepts (returning WU's), and points. Sony is a large company and the development team likely cannot publicly give out ETA's on when this will be fixed, etc, but it's important to stress that they are working on this and know this is a critically important update to make.

 

 

 

 


We have been working aggressively to find the root of this problem, investigating all possibilities (server, client, network, etc). We have been debugging the entire FAH system on the PS3 over the last few days (as there could be several causes for what we're seeing), examining especially how the server is interacting with the clients, what the clients are doing, and how the Stanford network is handling the situation. Several Stanford network engineers have looked into the problem to see if this is a network issue, but that does not appear to be the case. The servers are running extremely well right now. Our investigation so far has found that this issue is due to a problem in the client, and we have identified a specific issue that's causing this problem.

 

 

 

With the completion of this investigation Saturday night, we gave Sony the results of our debugging and our plan for how to fix this last night, and we are hoping that they can come up with an updated client soon. Unfortunately, with the PS3 we cannot update the client ourselves; otherwise we would release a client update, as we have done in the past as needed. Therefore, over the last few days, we have worked on server-side tweaks (the only part we can work on) while Sony works on the client. Until the client gets updated, we expect that the situation will continue to improve gradually, but with the issues we have been seeing still present.

 

 

 

It looks like there are some misconceptions about the situation, based on the comments posted here. This is not an issue of the FAH servers not being powerful enough -- any server network would show this same problem given the client issue that's going on (in fact, the FAH servers serving the PS3 are more powerful than those in other parts of FAH, which are operating just fine with far more clients); indeed, the PS3 backend has been spec'd out to handle 1M PS3 clients and we are way under that. This is also not an issue of the PS3's being too fast. The server load depends on the number of clients and the number of bytes they send back; in FAH, the PS3's compute faster, and that speed is used to do more complex calculations, not to send more bytes, and thus does not create a greater server load.

 

 

 

The client update would fix how the client interacts with the servers to stop the problem we're seeing right now, including the issues with assigns (getting new WU's), accepts (returning WU's), and points. Sony is a large company and the development team likely cannot publicly give out ETA's on when this will be fixed, etc, but it's important to stress that they know this is a critically important update to make and are working aggressively to fix it.

 

 

 

 


We had a conference call with Sony today to brainstorm a short- and medium-term plan. We came up with some new ideas to try to help things immediately (i.e. later today if we can get the coding done, otherwise hopefully tomorrow), and also planned for updates to the client. The call was very productive and the Sony team is very eager to get this fixed as well. I think we have a good idea to fix this, but we'll see when we implement it.

 

 

 


We had a long meeting with Sony yesterday to brainstorm fixes that we can do *before* a new client is released. We've come up with a plan, coded it yesterday, and started to roll it out today. The result is that clients will wait a bit to get work, but while they wait, uploads and downloads for those that do get through will go smoothly. The new client will have these waits set up for general cases, but we think we can generate them in an ad hoc manner with what we've set up. If all goes well, this fix should solve the issue in about a day or two. If not, we will do more brainstorming until the new client goes out.

 

 

 


We've been trying several creative fixes to see what the servers can do with the existing clients to improve the situation immediately, rather than waiting for a new client to come out. So far, they have largely been unsuccessful, but I think we have learned more about what's going on. We have a new idea, which will require some more coding, and then we'll roll that out tomorrow.

 

 

 

However, people should see a major points increase due to some work over here, in particular for the large-point WU's.

 

 

 


We had another meeting with Sony to discuss the results of our recent server code change to improve the situation. We have a working model of the situation and we will continue to see what we can do server side before the client patch comes out.

 

 

 

We will be testing an approach in which many clients (99%) won't get assigns for a while, allowing the remaining 1% to get work. Once they have work, they'll go away crunching, allowing a new 1% to get work. If all goes well, every client should get work without all of them pounding the server simultaneously.

 

 

 

The new client will take care of this automatically (as is already done in the non-PS3 clients), but until the new client gets out we will handle it "manually" from the Assignment Server. Once we get over the hump, we should be ok.
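
The 1%-at-a-time idea described above can be simulated with a short sketch. This is only an illustration of the effect, not the Assignment Server's actual logic (a real server would presumably decide request by request); the function name, the fixed seed and the client count are hypothetical.

```python
import random


def admit_in_batches(client_ids, fraction=0.01, seed=0):
    """Yield successive batches of clients that are granted work.

    Each round admits `fraction` of the original population (at least
    one client); everyone else is told to wait. Admitted clients go
    away to crunch, so the next round serves a fresh batch.
    """
    rng = random.Random(seed)
    waiting = list(client_ids)
    rng.shuffle(waiting)  # which clients get through first is arbitrary
    batch_size = max(1, round(len(waiting) * fraction))
    for start in range(0, len(waiting), batch_size):
        yield waiting[start:start + batch_size]


# Hypothetical population of 30,000 clients, admitted 1% at a time.
batches = list(admit_in_batches(range(30_000)))
```

With 30,000 clients and a 1% fraction this yields 100 rounds of 300 clients each, so every client eventually gets work while the server only ever handles a small slice of them at once.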

 

 

 


Sony has been working on a new client to fix several client-side bugs which have come to light during the last few days. Once that new client is in, we can set the servers back to normal and all should be well, assuming the client addresses the issues we're facing. For now, we (Stanford) have been trying to do what we can server-side to work around some critical bugs in the 1.3 client to allow for both client uploads and downloads. It's easy to get uploads or downloads working, but with the current state of 1.3, it's hard to get both. We have had to rewrite server code to work around these 1.3 bugs and I think we've made some progress.

 

 

 

The bugs in the 1.3 code are very subtle and the sort that only come up once a critical mass of 1.3 clients exists and starts to hit the server collectively in incorrect ways. Sony QA didn't catch this, as one can't QA 30,000 clients. However, we have discussed in detail with Sony some ideas for helping to prevent this in the client in the future, and this behavior is being removed for the upcoming patch.

 

 

 

I want to address some misconceptions in the comments posted here. There is a problem in the client, and the server changes were made to address this and work around the bug (the server changes did not create the problem, but were done in response to the problem). Also, Sony QA is extremely rigorous, but these sorts of things can't be found until clients hit critical mass. Finally, we have tried several server code changes, and the reason they didn't help appears to be that multiple aspects of the client didn't behave as expected (e.g. client bugs), which rendered our changes less useful than one would have expected at first (forcing us to reconsider what one can do server-side).

 

 

 

Last night, we tried a new strategy where we use the Assignment Server (AS) to help control the weight of clients going to the work servers (WS). This seems to be working, as we're getting lots of uploads right now as well as a steady stream of downloads. We hope that this is a reasonable balance; if not, we will continue to see what we can do from here until the new client comes out.
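
One way to picture the AS controlling the weight of clients going to each WS is weighted random selection. The sketch below is an assumption about the general technique, not the real AS code; apart from vsp06, which is mentioned in an earlier update, the server names and the weights are hypothetical.

```python
import random
from collections import Counter


def assign_servers(weights, requests, seed=0):
    """Pick a work server for each of `requests` assignment requests.

    `weights` maps server name -> relative weight. A server with weight
    0 receives no new assignments, which leaves it free to accept
    results coming back from clients.
    """
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=requests)
    return Counter(picks)


# Hypothetical weights: vsp06 is taken out of rotation for new work.
counts = assign_servers({"vsp06": 0, "ws_a": 1, "ws_b": 3}, requests=10_000)
```

Lowering a server's weight reduces the share of new assignments it hands out, so it has more capacity left for uploads, and raising the weight again restores the flow of downloads.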

 

 

 


We're seeing signs that our most recent modifications are helping and we may have the PS3 network of machines back on track. Nevertheless, we're still working to see how we can improve it from here (there are lots of aspects that are non-ideal) before the client revision from Sony.

 

 

 


It looks like the changes we made have been helping. The production is now getting back to the pace we had before all of this mess started. I don't want to get too excited about this just yet, but I think we may be in ok shape until the Sony client comes out. Here's a summary of production for the PS3 (default) donor.

 

 

 

[Image: ps02112008_2.png]

 

 

 

 


It looks like the system has stabilized a bit, although there is still lots of room for improvement (that will come especially with a new client). For now, here's an update on our performance: it looks like the PPD on the PS3 donor name has settled back down to the levels we saw before this mess. We will continue monitoring and trying to see what we can do to improve wait times (wait times for WU's can still get long), but so far so good!

[Image: ps3production021508_2.png]

 

 

 

 

