PcPerf.fr

PcPerf bot

PcPerfonaute
  • Content Count

    378

Everything posted by PcPerf bot

  1. It looks like the changes we made have been helping. Production is now getting back to the pace we had before all of this mess started. I don't want to get too excited about this just yet, but I think we may be in OK shape until the Sony client comes out. Here's a summary of production for the PS3 (default) donor. Read the full article
  2. We're seeing signs that our most recent modifications are helping and that we may have the PS3 network of machines back on track. Nevertheless, we're still working to see how we can improve it from here (there are lots of aspects that are non-ideal) before the client revision from Sony. Read the full article
  3. Sony has been working on a new client to fix several client-side bugs that have come to light during the last few days. Once that new client is in, we can set the servers back to normal and all should be well, assuming the client addresses the issues we're facing. For now, we (Stanford) have been trying to do what we can server-side to work around some critical bugs in the 1.3 client so that both client uploads and downloads work. It's easy to get uploads or downloads working, but with the current state of 1.3, it's hard to get both. We have had to rewrite server code to work around these 1.3 bugs and I think we've made some progress. The bugs in the 1.3 code are very subtle, the sort that only show up once a critical mass of 1.3 clients exists and starts to hit the server collectively in incorrect ways. Sony QA didn't catch this, as one can't QA 30,000 clients. However, we have discussed in detail with Sony some ideas for helping to prevent this in the client in the future, and this behavior is being removed for the upcoming patch. I want to address some misconceptions in the comments posted here. There is a problem in the client, and the server changes were made to work around that bug (the server changes did not create the problem; they were made in response to it). Also, Sony QA is extremely rigorous, but these sorts of things can't be found until clients hit critical mass. Finally, we have tried several server code changes, and the reason they didn't help appears to be that the client had multiple aspects that didn't behave as expected (e.g. client bugs), which made our changes less useful than one would have expected at first and forced us to reconsider what can be done server-side. Last night, we tried a new strategy where we use the AS (assignment server) to help control the weight of clients going to the WS (work servers). This seems to be working, as we're getting lots of uploads right now as well as a steady stream of downloads. We hope that this is a reasonable balance; if not, we will continue to see what we can do from here until the new client comes out. Read the full article
  4. We ran the test I discussed in update #11 from 6:00 AM to 7:30 AM PST this morning. It was useful for debugging, but we've decided to hold off on that plan for now. Read the full article
  5. We had another meeting with Sony to discuss the results of our recent server code change to improve the situation. We have a working model of the situation and we will continue to see what we can do server-side before the client patch comes out. We will be testing an approach in which most clients (99%) won't get assigns for a while, allowing the remaining 1% to get work. Once they have work, they'll go away crunching, allowing a new 1% to get work. If all goes well, every client should get work without all of them pounding the server simultaneously. The new client will take care of this automatically (as the non-PS3 clients already do), but until it is out we will handle this "manually" from the Assignment Server. Once we get over the hump, we should be OK. Read the full article
  6. P.S. It looks like, with the recent credit update, the points have largely gotten back on track. Check out some third-party stats (e.g. http://folding.extremeoverclocking.com/use...s=&u=207511 for the latest). Read the full article
  7. We've been trying several creative fixes to see what the servers can do with the existing clients to improve the situation immediately, rather than waiting for a new client to come out. So far, I think they have largely been unsuccessful, but I think we learned more about what's going on. We have a new idea, which will require some more coding, and we'll roll that out tomorrow. However, people should see a major points increase due to some work over here, in particular for the large-point WUs. Read the full article
  8. We had a long meeting with Sony yesterday to brainstorm fixes that we can do *before* a new client is released. We came up with a plan, coded it yesterday, and started to roll it out today. The result is that clients will wait a bit to get work, but while they wait, uploads and downloads for those that do get through will go smoothly. The new client will have these waits set up for general cases, but we think we can generate one in an ad hoc manner with what we've set up. If all goes well, this fix should solve the issue in a day or two. If not, we will do more brainstorming until the new client goes out. Read the full article
  9. We had a conference call with Sony today to brainstorm short- and medium-term plans. We came up with some new ideas to try to help things immediately (i.e. later today if we can get the coding done, otherwise hopefully tomorrow), and also planned for updates to the client. The call was very productive and the Sony team is very eager to get this fixed as well. I think we have a good idea for fixing this, but we'll see when we implement it. Read the full article
  10. We have been working aggressively to find the root of this problem, investigating all possibilities (server, client, network, etc.). Our investigation has found that this issue is due to a problem in the client, and we have identified the specific issue that's causing it. We have given Sony the information on how to fix this and we are hoping that they can come up with an updated client soon. Unfortunately, with the PS3 we cannot update the client ourselves; otherwise we would release a client update, as we have done in the past when needed. Therefore, over the last few days, we have worked on server-side tweaks (the only part we can work on) while Sony works on the client. Until the client gets updated, we expect that the situation will continue to improve gradually, though not without the issues we have been seeing. It looks like there are some misconceptions about the situation, based on the comments here. This is not an issue of the FAH servers not being powerful enough: any server network would show this same issue given the client problem that's going on (in fact, the FAH servers serving the PS3 are more powerful than those in other parts of FAH, which are operating just fine with far more clients); indeed, the PS3 backend has been spec'd to handle 1M PS3 clients and we are way under that. This is also not an issue of the PS3s being too fast. The server load depends on the number of clients and the number of bytes they send back; in FAH, the PS3s compute faster, and that speed is used to do more complex calculations, not to send more bytes, and thus does not create a greater server load. The client update would fix how the client interacts with the servers to stop the problem we're seeing right now, including the issues with assigns (getting new WUs), accepts (returning WUs), and points. Sony is a large company and the development team likely cannot publicly give out ETAs on when this will be fixed, but it's important to stress that they are working on this and know it is a critically important update to make. Read the full article
  11. We have been working aggressively to find the root of this problem, investigating all possibilities (server, client, network, etc.). We have been debugging the entire FAH system on the PS3 over the last few days (as there could be several causes for what we're seeing), examining especially how the server is interacting with the clients, what the clients are doing, and how the Stanford network is handling the situation. Several Stanford network engineers have looked into whether this is a network issue, but that does not appear to be the case. The servers are running extremely well right now. Our investigation so far has found that this issue is due to a problem in the client, and we have identified the specific issue that's causing it. With the investigation completed Saturday night, we gave Sony the results of our debugging and our plan for fixing this last night, and we are hoping that they can come up with an updated client soon. Unfortunately, with the PS3 we cannot update the client ourselves; otherwise we would release a client update, as we have done in the past when needed. Therefore, over the last few days, we have worked on server-side tweaks (the only part we can work on) while Sony works on the client. Until the client gets updated, we expect that the situation will continue to improve gradually, though not without the issues we have been seeing. It looks like there are some misconceptions about the situation, based on the comments posted here. This is not an issue of the FAH servers not being powerful enough: any server network would show this same issue given the client problem that's going on (in fact, the FAH servers serving the PS3 are more powerful than those in other parts of FAH, which are operating just fine with far more clients); indeed, the PS3 backend has been spec'd to handle 1M PS3 clients and we are way under that. This is also not an issue of the PS3s being too fast. The server load depends on the number of clients and the number of bytes they send back; in FAH, the PS3s compute faster, and that speed is used to do more complex calculations, not to send more bytes, and thus does not create a greater server load. The client update would fix how the client interacts with the servers to stop the problem we're seeing right now, including the issues with assigns (getting new WUs), accepts (returning WUs), and points. Sony is a large company and the development team likely cannot publicly give out ETAs on when this will be fixed, but it's important to stress that they know this is a critically important update to make and are working aggressively to fix it. Read the full article
  12. Some good news. The servers have continued to improve (25-30% failure now, which means that at least 1 out of 2 attempts should work, so only ~2 retries should be needed). We've been implementing and testing lots of different strategies and I think we've found one that works best. We've also been in close contact with Sony; they have some ideas on the client side and are working on revisions there. We've also drastically brought down the assigns on vsp06, the server which was assigning large WUs, to make sure that it's not loaded when those WUs need to come back (which should be about now). So, the bottom line is that the server load is still extremely heavy, but the situation is continuing to improve. Most importantly, the client mods should prevent this from happening in the future. Read the full article
  13. We know this is an issue of concern, so I'll be publishing daily updates until it's taken care of for good. As posted in update #3 yesterday, we expect it to take a couple of days for our fixes to truly kick in. So far, so good: the WU failure rate is now down to about 35% across the board and the servers have stabilized somewhat. There's still a way to go from 35% to the normal values, but at least everything is going as expected for this fix. A client update from Sony would fix this issue for good without the extreme server-side machinations we've had to do, and we are working with them on that as well. Read the full article
  14. The beta clients will be expiring tomorrow, but new ones are up. There's a 6-month expiration on these, although we expect several of them to go final (and thus have no expiration date) much sooner than that, since the beta has gone very smoothly. You can find the new beta clients on the download page. Read the full article
  15. We've been working constantly the last few days to improve the PS3 situation. Based on our statistics, the situation has gotten better. A few days ago, at its worst, it was very bad (~80% failure, which means 5-6 retries on average or more to get WUs back). We're seeing more like 20-35% (1-3 retries) right now and the trend keeps improving. We have added some additional PS3 servers and made some major code changes. A new client should fix this issue, which was introduced in 1.3, so this doesn't happen again. For now, our server-side changes should take care of this in time, although it may take a few more days to settle down to, say, 1% failures (more typical). We have also extended deadlines to compensate for this problem. Read the full article
  16. The network was down for only about an hour and it looks like everything is back up. That server room's networking has now been upgraded to the new high-speed Stanford trunk. Read the full article
  17. Our network provider has scheduled a maintenance window from 4:00 AM to 6:00 AM PST on January 31 for one of our primary server rooms. Access to parts of Folding@home may be interrupted during this period. This includes the stats, the stats web page, and the primary AS, although the main web page, the backup AS, and many work servers will not be affected. Read the full article
  18. We've been trying various server-side changes to improve the PS3 situation and have been in close contact with Sony. We have some ideas which we will be implementing. So far, the situation has gotten better (at least based on our statistics), but it's still not good, and we're working to improve it. Read the full article
  19. Our FAH servers for PS3s are getting hit pretty hard right now. We are looking into whether this is a client problem (failure to back off correctly during high loads) or a server issue. We added a server last night to help and will add more this morning. We are actively working on this one right now. Read the full article
  20. We're updating the hardware of our backup assignment server. The switchover to the new hardware should occur today. Note that this involves a DNS change, so we expect it may take some time for the DNS to propagate. We will keep both servers up, so donors should not see any interruption in service. However, if you do see something strange related to this backup AS, please report it in our forum (http://foldingforum.org). Read the full article
  21. The server room that went down is back up, and Del and Dan got all of the servers back up (no small feat). We have the servers running FAH, but often there are one or two that have issues coming back up, and we're looking into that. If you find any problems, please feel free to post a report in our forum (foldingforum.org). Read the full article
  22. Here are some code development updates on some important clients and cores. GPU core: we've got the GPU core running in house, and we found and fixed some bugs in our QA stage. We're now continuing QA to see if we find any more bugs. Right now, the GPU core is running on all new ATI cards, so we're excited to roll it out. We are using CAL now (ATI's hardware abstraction layer), which seems to make life a lot easier and should also make running a GPU client a lot easier for donors, as the driver issues and complexities should now be resolved. We are still looking into an NVIDIA client. The NVIDIA GPUs are very different to program, so a port isn't a simple thing to do. We are looking into this, though. SMP core: right now, SMP on Linux and OSX is behaving fairly well, whereas Windows is giving some issues. This is perhaps not a surprise, since the SMP code must use MPI, which has its origins on UNIX and is a newcomer to Windows. We have been working with Windows MPI developers to improve the situation, but they tell us this isn't a simple fix. Since we are in the business of studying proteins, not writing MPI libraries for Windows, we will wait until the MPI experts improve the Windows MPI before we make any claims of improvement there. Finally, beta clients will be expiring soon, and we are in the process of QA for new clients. We will also extend the expiration deadlines in future clients to give some more time, since the clients appear to be maturing. Read the full article
  23. There will be a planned power outage in one of our server rooms this Saturday, starting in the morning (8 AM PST) and likely lasting until 5 PM PST. This affects only one of our server rooms, so we will re-route around it for new assigns and do not expect any problems. It is a bit annoying, as this is exactly the same room that had to go down a few weeks ago to fix this very item; it could not be fixed completely then, hence another shutdown. Read the full article
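Several of the updates above quote failure rates alongside retry counts (items 12 and 15), and item 19 mentions clients failing to back off under high load. The Python sketch below is only an illustration of those two ideas, not the actual FAH client or server code. It assumes each attempt fails independently with the same probability, so the mean number of attempts is 1 / (1 - failure rate), and it uses capped exponential backoff with random jitter as one common way to keep many clients from retrying at the same moment. The function names and the default `base` and `cap` values are hypothetical.

```python
import random


def expected_attempts(failure_rate: float) -> float:
    """Mean number of attempts until the first success.

    Assumes every attempt fails independently with probability
    `failure_rate` (a simplification; real server load is correlated).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 3600.0,
                  rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`, and the
    actual delay is drawn uniformly below that bound so that clients
    spread out instead of hitting the server in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * 2.0 ** min(attempt, 30))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for rate in (0.80, 0.35, 0.25, 0.01):
        mean = expected_attempts(rate)
        print(f"{rate:.0%} failure: about {mean:.2f} attempts on average")
    for attempt in range(5):
        print(f"retry {attempt}: wait {backoff_delay(attempt):.1f} s")
```

Under the independence assumption, an 80% failure rate works out to about five attempts per successful transfer, which is in the same range as the figure quoted in item 15. The jitter matters as much as the doubling: without it, clients that failed together would retry together.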