PcPerf.fr

PcPerf bot

PcPerfonaute
  • Content count

    399
  • Joined

  • Last visited

Everything posted by PcPerf bot

  1. We've been trying several creative server-side fixes to improve the situation immediately with the existing clients, rather than waiting for a new client to come out. So far, I think they have been largely unsuccessful, but I think we have learned more about what's going on. We have a new idea, which will require some more coding, and we'll roll that out tomorrow. However, people should see a major points increase due to some work over here, in particular for the large-point WU's.
  2. We had a long meeting with Sony yesterday to brainstorm fixes that we can do *before* a new client is released. We've come up with a plan, coded it yesterday, and started to roll it out today. The result is that clients will wait a bit to get work, but while they wait, uploads and downloads for those that do get through will go smoothly. The new client will have these waits set up for general cases, but we think we can generate one in an ad hoc manner with what we've set up. If all goes well, this fix should solve the issue in about a day or two. If not, we will do more brainstorming until the new client goes out.
  3. We had a conference call with Sony today to brainstorm a short- and medium-term plan. We came up with some new ideas to try to help things immediately (i.e. later today if we can get the coding done, otherwise hopefully tomorrow), and also to plan for updates to the client. The call was very productive and the Sony team is very eager to get this fixed as well. I think we have a good idea for fixing this, but we'll see when we implement it.
  4. We have been working aggressively to find the root of this problem, investigating all possibilities (server, client, network, etc.). Our investigation has found that the problem lies in the client, and we have identified a specific issue that's causing it. We have given Sony the information on how to fix this, and we are hoping that they can come up with an updated client soon. Unfortunately, with the PS3 we cannot update the client ourselves; otherwise we would release a client update, as we have done in the past when needed. Therefore, over the last few days, we have worked on server-side tweaks (the only part we can work on) while Sony works on the client. Until the client gets updated, we expect that the situation will continue to improve gradually, though still with the issues we have been seeing. It looks like there are some misconceptions about the situation, based on the comments here. This is not an issue of the FAH servers not being powerful enough -- any server network would show the same problem given the client issue that's going on (in fact, the FAH servers serving the PS3 are more powerful than those in other parts of FAH, which are operating just fine with far more clients); indeed, the PS3 backend has been spec'd out to handle 1M PS3 clients and we are way under that. Nor is this an issue of the PS3's being too fast. The server load depends on the number of clients and the number of bytes they send back; in FAH, the PS3's compute faster, and that speed is used to do more complex calculations, not to send more bytes, so it does not create a greater server load. The client update would fix how the client interacts with the servers to stop the problem we're seeing right now, including the issues with assigns (getting new WU's), accepts (returning WU's), and points. Sony is a large company and the development team likely cannot publicly give out ETA's on when this will be fixed, etc., but it's important to stress that they are working on this and know this is a very, very critically important update to make.
  5. We have been working aggressively to find the root of this problem, investigating all possibilities (server, client, network, etc.). We have been debugging the entire FAH system on the PS3 over the last few days (as there could be several causes for what we're seeing), examining especially how the server is interacting with the clients, what the clients are doing, and how the Stanford network is handling the situation. Several Stanford network engineers have looked into the problem to see if this is a network issue, but that does not appear to be the case. The servers are running extremely well right now. Our investigation so far has found that the problem lies in the client, and we have identified a specific issue that's causing it. With this investigation completed Saturday night, we gave Sony the results of our debugging and our plan for fixing this last night, and we are hoping that they can come up with an updated client soon. Unfortunately, with the PS3 we cannot update the client ourselves; otherwise we would release a client update, as we have done in the past when needed. Therefore, over the last few days, we have worked on server-side tweaks (the only part we can work on) while Sony works on the client. Until the client gets updated, we expect that the situation will continue to improve gradually, though still with the issues we have been seeing. It looks like there are some misconceptions about the situation, based on the comments posted here. This is not an issue of the FAH servers not being powerful enough -- any server network would show the same problem given the client issue that's going on (in fact, the FAH servers serving the PS3 are more powerful than those in other parts of FAH, which are operating just fine with far more clients); indeed, the PS3 backend has been spec'd out to handle 1M PS3 clients and we are way under that. Nor is this an issue of the PS3's being too fast. The server load depends on the number of clients and the number of bytes they send back; in FAH, the PS3's compute faster, and that speed is used to do more complex calculations, not to send more bytes, so it does not create a greater server load. The client update would fix how the client interacts with the servers to stop the problem we're seeing right now, including the issues with assigns (getting new WU's), accepts (returning WU's), and points. Sony is a large company and the development team likely cannot publicly give out ETA's on when this will be fixed, etc., but it's important to stress that they know this is a very, very critically important update to make, and they are working aggressively to fix it.
  6. Some good news. The servers have continued to improve (25-30% failure now, which means that at least 1 out of 2 attempts should work, so only ~2 retries should be needed). We've been implementing and testing lots of different strategies and I think we've found one that works best. We've also been in close contact with Sony; they have some ideas on the client side and are working on revisions there. We've also drastically brought down the assigns on vsp06, the server which was assigning large WU's, to make sure that it's not loaded when those WU's need to come back (which should be about now). So, the bottom line is that the server load is still extremely heavy, but the situation is continuing to improve. Most importantly, the client mods should prevent this from happening in the future.
  7. We know this is an issue of concern, so I'll be publishing daily updates until it's taken care of for good. As posted in update #3 yesterday, we expect it to take a couple of days for our fixes to truly kick in. So far, so good -- the WU failure rate is now down to about 35% across the board and the servers have stabilized somewhat. There's still a ways to go from 35% to the normal values, but at least everything is going as expected for this fix. A client update from Sony would fix this issue for good without the extreme server-side machinations we've had to do, and we are working with them on that as well.
  8. The beta clients will be expiring tomorrow, but new ones are up. There's a 6-month expiration on these, although we are expecting several of them to go final much sooner than that (and thus have no expiration date) since the beta has gone very smoothly. You can find the new beta clients on the download page.
  9. We've been working constantly the last few days to improve the PS3 situation. Based on our statistics, the situation has gotten better. A few days ago (at its worst), it was very bad (~80% failure, which means 5-6 retries on average or more to get WU's back). We're seeing more like 20-35% (1-3 retries) right now and the trend is getting better and better. We have added some additional PS3 servers and made some major code changes. A new client should fix this issue, which was introduced in 1.3, so this doesn't happen again. For now, our server-side changes should take care of this in time, although it may take a few more days to settle down to, say, 1% failures (more typical). We have also extended deadlines to compensate for this problem.
  10. The net was only down for about an hour and it looks like everything is back up. That server room's networking has now been upgraded to the new high speed Stanford trunk.
  11. Our network provider has scheduled a maintenance window for 4:00 AM to 6:00 AM, PST, on January 31 for one of our primary server rooms. Access to parts of Folding@home may be interrupted during this period. This includes the stats, stats web page, and primary AS, although the main web page, backup AS and many work servers will not be affected.
  12. We've been trying various server-side changes to improve the PS3 situation and have been in close contact with Sony. We have some ideas which we will be implementing. So far, the situation has gotten better (at least based on our statistics), but it's still not good, and we're working to improve it.
  13. Our FAH servers for PS3's are getting hit pretty hard right now. We are looking into whether this is a client problem (failure to back off correctly during high loads) or a server issue. We added a server last night to help and will add more this morning. We are actively working on this one right now.
  14. We're updating the hardware of our backup assignment server. The switchover to the new hardware should occur today. Note that this involves a DNS change, so we expect it may take some time for the DNS to propagate. However, we will keep both servers up, so donors should not see any interruption in service. If you do see something strange related to this backup AS, please report it in our forum (http://foldingforum.org).
  15. The server room that went down is back up, and Del and Dan got all of the servers back up (no small feat). We have the servers running FAH, but often there are one or two that may have issues coming back up, and we're looking into that. If you find any problems, please feel free to post a report in our forum (foldingforum.org).
  16. Here are some code development updates on some important clients/cores. GPU core: we've got the GPU core running in house, and we found and fixed some bugs in our QA stage. We're now continuing QA to see if we find any more bugs. Right now, the GPU core is running on all new ATI cards, so we're excited to roll it out. We are using CAL now (ATI's hardware abstraction layer) and that seems to make life a lot easier, and should also make running a GPU client a lot easier from the point of view of donors, as the driver issues and complexities should now be resolved. We are still looking into an NVIDIA client. The NVIDIA GPUs are very different to program, so a port isn't a simple thing to do. We are looking into this, though. SMP core: right now, SMP on Linux and OSX is behaving fairly well, whereas Windows is giving some issues. This is perhaps not a surprise, since the SMP code must use MPI, which has its origins on UNIX and is a newcomer to Windows. We have been working with Windows MPI developers to improve the situation, but they tell us this isn't a simple fix. Since we are in the business of studying proteins, not writing MPI libraries for Windows, we will wait until the MPI experts improve the Windows MPI before we make any claims of improvement there. Finally, beta clients will be expiring soon, and we are in the process of QA for new clients. We will also extend the expiration deadlines in future clients to give some more time, since the clients appear to be maturing.
  17. There will be a planned power outage in one of our server rooms this Saturday, starting in the morning (8am PST) and likely lasting until 5pm PST. This affects only one of our server rooms, so we will re-route around it for new assigns, and do not expect any problems. It is a bit annoying, as this is exactly the same room which had to go down a few weeks ago to fix this very item, which could not be fixed completely then, hence another shutdown.
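
Several of the updates above quote a WU failure rate alongside a rough number of retries, and mention clients backing off under load. A minimal sketch of both ideas follows, under stated assumptions: every attempt is taken to fail independently with the same probability (which the posts do not claim), and the function names `expected_attempts` and `backoff_delay`, the delay parameters, and the jittered exponential scheme are all hypothetical illustrations, not the actual FAH client or server logic.

```python
import random


def expected_attempts(failure_rate: float) -> float:
    """Mean number of attempts until the first success.

    Assumes each attempt fails independently with probability
    `failure_rate` (a geometric distribution); this is an
    illustrative assumption, not something stated in the posts.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 3600.0) -> float:
    """Randomized exponential backoff delay in seconds (hypothetical scheme).

    The upper bound doubles with every failed attempt, up to `cap`,
    and the actual delay is drawn uniformly below that bound so that
    many clients do not all retry at the same moment.
    """
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, upper)


if __name__ == "__main__":
    # Failure rates mentioned in the updates: worst case, later values, typical.
    for rate in (0.80, 0.35, 0.25, 0.01):
        print(f"{rate:.0%} failure: {expected_attempts(rate):.2f} attempts on average")
```

Under the independence assumption, an 80% failure rate works out to about five attempts per returned WU, the same order as the figure quoted in item 9. Real retries under heavy load are unlikely to be independent, so the posted numbers need not match this model exactly.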