PcPerf.fr
Everything posted by PcPerf bot

  1. We've been working behind the scenes for a while to find the right way to share the raw data results from Folding@home. We've partnered with the Simbios National Biomedical Computing Center to provide data for download. If you're curious, check out our first data set project page https://simtk.org/home/foldvillin We'll be releasing more data as time goes on. Our hope is that making the raw data openly available will greatly supplement the published results. View the full article
  2. We've made a modification to how the Assignment Server (AS) code works. We've done some initial testing and are now releasing it to the backup AS (assign2.stanford.edu). If that checks out, we'll release it to the main AS. The change involves how we assign SMP WUs. If you're seeing something strange there (e.g. SMP WUs going to non-SMP clients or vice versa), please let us know in the forum (http://foldingforum.org). View the full article
  3. FAH/SMP Q & A
    There was a good question in the forum that I thought others would be curious to hear: "From Vijay's blog entries it would seem that the SMP client has some fundamental advantages over running multiple single-core clients, but I can't really think of how that might be. Do you know of some architectural overview of how the MPI stuff is being used in this context?" We could just run multiple independent clients, but this would be throwing away a lot of power. What makes an SMP machine special is that it is more than just the sum of the individual parts (CPU cores), since those cores can talk to each other very fast. In FAH, machines talk to each other when they return WUs to a central server, say once a day. On SMP this happens once a millisecond or so (or faster). That roughly 86,000,000x speed-up in communication can be very useful, even if there isn't 100% utilization of the cores themselves. The easy route would have been to run multiple single-CPU FAH cores (this is what other projects do), but that would be a big loss for the science, as this throws away a very, very powerful resource (fast interconnects between CPUs). Indeed, it is this sort of fast interconnect which makes a supercomputer "super", since the CPUs in supercomputers (e.g. BlueGene) are pretty slow, but the communication between cores is very, very fast. We've done a lot to develop algorithms for FAH-style internet connections between CPUs, but there are some calculations which require fast interconnects, and that's where the FAH/SMP client is particularly important (a minimal sketch of this kind of per-step communication is given after this list). By allowing us to do calculations that we couldn't do otherwise, the science is pushed forward significantly (and we thus reward SMP donors with a points bonus, due to this extra science done and the extra hassle involved in running the SMP client). I guess it remains to be seen if we can pull off MPI on FAH to the point where it works effortlessly, but so far Linux and OSX look pretty good, so we're close. The A2 core should hopefully seal the deal. Now, the main task is getting Windows/SMP behaving well ... View the full article
  4. We're rolling out an automatic core upgrade to v1.06 for GPU2/NVIDIA clients. Please post in the forum if you're having any new problems with your GPU2 client after this (although the 1.06 core has been tested very thoroughly so far). View the full article
  5. We've upgraded the code on our main Assignment Server to improve some load balancing issues. This upgrade also has new code for how the URL for new core downloads is sent. We tested this code on less important AS's first, but if you start seeing problems with core downloads, please make a post in our main forum (http://foldingforum.org), ideally with some of the log run with -verbosity 9 to show the core download URL. View the full article
  6. The GPU2 client has been out for a while now for our newest platform, NVIDIA, and I wanted to give an update. We're making great progress on several fronts of the beta testing of this client, with improvements to the CPU utilization and the visualization (which is currently pretty much broken) coming soon. We are also working to support multi-GPU configurations. These are our highest non-science priorities. On the science side, we're scrubbing the GPU clients to make sure the results make sense. GPU programming is challenging for many reasons, especially due to reduced precision and the complexity of keeping lots of threads in flight, so it's important to make sure the results are accurate (a small single- vs double-precision sketch is given after this list). So far, the results look promising. Once the GPU2 cores are completely validated and these client issues are addressed, we'll take the client out of beta and make a push to get even greater adoption of this new client platform. KNOWN ISSUES FOR GPU2/NVIDIA: (1) The viewer doesn't work yet (coming soon); this will require a core upgrade, which is in the works. (2) Driver versions: for pre-GTX280 cards, we recommend version 174.55 of the CUDA driver; for GTX280 cards, we recommend 177.35. (3) CPU usage can be strange (we're looking into this): the CPU utilization can spike on certain machines; we have an idea of what the issue is and Scott LeGrand at NVIDIA is working on a fix. (4) UNSTABLE_MACHINE error if there are too many EUEs (not really a bad thing -- a deliberate feature of the client): if you see this error in your client log, it means that there is some problem with your configuration. View the full article
  7. We've forced a core upgrade to version 1.05 for GPU2/NVIDIA. We did this since there is important new code there. In particular, there's code that will better handle unstable or incorrectly configured machines. This core now generates an UNSTABLE_MACHINE error, and the 6.12beta6 client will trap this error, putting the client to sleep for 24 hours once 5 such UNSTABLE_MACHINE errors are reached. If you are having problems running your client after this forced core upgrade, it most likely means that there's a problem with the client installation, typically with the startup aliases. If this is the case, please check out the FAQ and in particular note that the Start In and Target directories have to be different:
    In Windows XP:
    Target: "C:\Program Files\Folding@home\Folding@home-gpu\Folding@home.exe" -verbosity 9 (or whatever flags you use instead of -verbosity 9)
    Start in: "C:\Documents and Settings\<your_windows_username>\Application Data\Folding@home-gpu"
    In Windows Vista:
    Target: "C:\Program Files (x86)\Folding@home\Folding@home-gpu\Folding@home.exe" -verbosity 9 (or whatever flags you use instead of -verbosity 9)
    Start in: "C:\Users\<your_windows_username>\AppData\Roaming\Folding@home-gpu"
    Please see our FAQ for more details: http://folding.stanford.edu/English/FAQ-NVIDIA View the full article
  8. Since the GPU2 client 6.12beta6 is behaving well, we've put it on our main download page: http://folding.stanford.edu/English/DownloadWinOther We'll post updates there as they work their way through QA. View the full article
  9. Due to the new GPU2/NVIDIA client release today, we got cited on Digg, which sent all the traffic to the forum (which can't take too heavy a load), instead of, say, http://folding.stanford.edu (which can). We're working with our forum's Internet service provider to sort this out ASAP. View the full article
  10. We've started our open beta release. See the post here: http://foldingforum.org/viewtopic.php?f=42&t=3188 We will put it on our main download page when it gets a bit further through the beta QA process. This is an early release: note that this is still a very early beta and there's lots to fix. In particular, we need to correct some issues with the visualization and the EUE handling. However, we've got that on our roadmap and should be rolling it out over the next few weeks or so (hopefully sooner, but it may be longer depending on how tricky these bugs are to handle). If this release runs well for most people, we'll put it on our main download page. View the full article
  11. Please go to our forum (http://foldingforum.org) for up-to-the-second details of the new client. I'll make a post here when there is more to say. View the full article
  12. Our SMP core takes a very different approach from the way other distributed computing projects handle multi-core CPUs, and I thought it might be interesting for the FAH community to hear about the differences, pro and con. As I think most people interested in computers know, Moore's law, which states that the transistor count in CPUs doubles every 1.5 years, has continued to hold for decades. Most people think of Moore's law in terms of the speed of CPUs, but this isn't what Moore originally had in mind. In the past, more transistors have led to greater CPU speeds, but that essentially ended (at least for traditional CPUs) a few years ago. But if Moore's law is still marching along (as it is), what do all those transistors do? Over the last few years, more transistors have translated into more CPU cores, i.e. more CPUs on a chip. While this is not what we wanted, it is not necessarily a disaster, if one can use these multiple CPUs to get faster calculations. If we want not simply more calculations (i.e. multiple Work Units, or WUs, simultaneously) but faster calculations (a WU completed in less time), distributed computing runs into the same problem that faces supercomputers: how to scale to lots and lots of processors, i.e. how to use all these processors to complete a single calculation faster overall. In FAH, we've taken a different approach to multi-core CPUs. Instead of just doing more WUs (e.g. doing 8 WUs simultaneously), we are applying methods to do a single WU faster. This is typically much more valuable to a scientific project and it's important to us. However, it comes with new challenges. Getting a calculation to scale to lots of cores can be a challenge, as can running complex multi-core calculations originally meant for supercomputers on operating systems not meant for this (e.g. Windows); a back-of-the-envelope scaling example is given after this list. Right now, our SMP client seems to be running fairly well under Linux and OSX -- operating systems based on UNIX, as found on supercomputers. We use a standard supercomputing library (MPI) to run these WUs, and MPI behaves well on Unix-based machines. MPI does not run well on Windows, and we've been running into problems there. However, as Windows MPI implementations mature, our SMP/Windows app will behave better. Along the way, we also have a few tricks up our sleeve which may help as well. However, if we can't get it to run as well as we'd like on Windows, we may choose to overhaul the whole code, as we did with the GPU1 client (which was really hard to run). We're very excited about what the SMP client has been able to do so far. One of our recent papers (#53 on our papers web site http://folding.stanford.edu/English/Papers) would have been impossible without the SMP client and represents a landmark calculation in the simulation of protein folding. We're looking forward to more exciting results like that in the years to come! View the full article
  13. We've set an automatic core upgrade to v1.03 for the GPU2 client. The 1.03 core behaves much better on R600 cards and in general seems much more stable than previous cores. View the full article
  14. There have been several questions regarding the GPU1 client and why we decided to shut it down. I hope I can shed some light here, at least on why we're doing what we're doing, such that even if people disagree with our decisions, they can at least see where we're coming from. Some people have asked "why shut down the client if it's working?" The bottom line here is that the GPU1 results are no longer scientifically useful. It's pretty clear now that DirectX (DX) is not sufficiently reliable for scientific calculations. This was not known before (and some people wouldn't believe it until we proved it). With the GPU1 results, we can now show what the limitations are pretty decisively. GPU1 also helped us a lot in developing its successor and in understanding what's needed to run GPUs in a distributed computing fashion. The good news here is that GPU2 is behaving very well, on both ATI and NVIDIA hardware, and this is a direct result of what we've learned with GPU1 WUs. In the end, however, GPU1 will not be able to help us understand protein misfolding, Alzheimer's Disease, etc. due to these unresolvable limitations. We could keep GPU1 alive, just crunching away in its current form, but that would be wasting people's electricity at this point, as we've learned everything we can from what those cards can do. In the past, we had a somewhat similar shutdown situation, i.e. when the QMD core projects stopped. In that case, donors were left hanging since we didn't give any warning before stopping QMD projects. We did try (perhaps unsuccessfully) to handle the GPU1 situation better than QMD. In QMD, we stopped needing that core and so we stopped the calculation without warning, not realizing the impact that would cause. With GPU1, we gave a several-month warning (indeed, note that GPU1 is still actively running, so all of this is information given in advance of shutting down GPU1). We tried to avoid the QMD situation by giving advance warning, but it looks like donors would like even more advance warning. However, there are limits to how far in advance we know the situation ourselves. Indeed, the knowledge that it made sense to end GPU1 came reasonably recently to us. We had been working on CAL for a while and it seemed that CAL might be a solution, but we only knew once we got some testing "in the wild." DirectX (DX -- what GPU1 is based on) works much better in the lab than in the wild, and it was possible that CAL behaved that way too. After seeing that CAL behaved well in the wild, it became clear that the GPU1 path was obsolete. However, this is a relatively recent finding, and we made the announcement about the situation relatively shortly thereafter. It was a tough decision. Some suggested we just leave GPU1 running, even though people's electricity really would be going to waste, other than generating points. I didn't think that was a good idea. We did know it would be a tough PR hit, but when people talk about the history of FAH, I want to make it clear that we're here to address AD and other diseases, not just running calculations for the sake of points and nothing more (which has been the critique of some other distributed computing projects). So, what's the right thing to do? I guess it comes to this: would GPU1 donors be happier if we just kept the GPU1 servers running, doing work with no scientific value, for the sake of points? We could do that, at the cost of taking personnel away from improving existing clients, keeping existing servers going, etc. for the sake of keeping GPU1 running. However, that's not what FAH is for, and I think it's important that FAH not devolve into a big points game, losing sight of why we're doing what we're doing. PS Some further discussion can be found here. View the full article
  15. There are several new developments regarding the GPU cores and clients. We've been working with NVIDIA to develop a GPU2 core for NVIDIA hardware. So far, the code is progressing well and the new GPU2/NVIDIA core is now in closed beta testing. It's hard to tell if there will be any showstoppers (there are lots of things that could go wrong in distributed computing on GPUs), but so far so good. We're very excited about the performance (more details on that later). We hope to have a public beta in the next few weeks. We are nearing the end of the GPU1 project. Our plan is to deactivate the GPU1 client on June 2. We would like to thank everyone who has contributed to that project. FAH GPU1 was a landmark in computing, being the first distributed computing project on GPUs as well as the first major molecular dynamics calculation to be performed on GPUs. We have learned a lot from GPU1, and those lessons have been used to architect GPU2, which will be faster, more reliable, and much more scientifically useful. We are beta testing a new visualization for GPU2/ATI. This new code provides real-time visualization for the GPU2 core, similar to what we have for the PS3. More information can be found in our FAQ. We plan to release a similar visualizer for the GPU2/NVIDIA core as well when it's ready. So, we've definitely had our hands full on the GPU front. We look forward to taking these next steps forward! View the full article
  16. On Friday, Stanford launched the Pervasive Parallelism Lab (PPL). There's been lots of press describing it. The general plan for the lab is to develop a common paradigm for programming new architectures like GPUs, the Cell processor, and Intel's Larrabee, as well as multi-core CPUs. This is something we at FAH are very interested in, as we have had to maintain a unique code path for each of these (i.e. separate code for the high performance part of the ATI GPU, NVIDIA GPU, PS3, and SMP clients). Having a single code path for all of them would be very exciting and would keep FAH code development on new hardware going smoothly. View the full article
  17. The server that went down was reset by our sysadmins just now (thanks to them for coming in on a Sunday evening) and we've got the server code running on it again. View the full article
  18. It looks like one of our key servers went down, so regular FAH clients (non-adv, non-PS3, non-SMP, non-GPU) will be low on work until the sysadmins get the machine back on-line. The other platforms (PS3, GPU, SMP, and adv settings) should have plenty of jobs, and some even have their own assignment servers (in the case of GPU and PS3). The sysadmins work M-F, so we expect that they will do a reboot on Monday morning. In the meantime, we have added some new servers on line with jobs, but they are getting hit hard at the moment. Finally, we are prepping a series of servers to add 1 *million* jobs (I always imagine Dr. Evil saying that) across multiple servers, hopefully this week, so being low on work won't be an ongoing problem after they're up. However, until Monday morning (i.e. the next 16-18 hours or so), it will likely be tight (for non-SMP, non-adv, non-GPU clients). View the full article
  19. We put the user and team file updates (used by 3rd party stats) back to every 3 hours, so we're now back to regular behavior. It looks like we should be OK from here on out. There are two upshots of this mess. First, we've developed some new emergency procedures to deal with such backlogs better in the future. We also have plans for how to refactor the stats input code to potentially speed up the process by 5x (at least 2x). That should help in general (perhaps letting us go back to hourly updates). View the full article
  20. We've got the stats back to their regular 2-hour updates, although we're still keeping the 3rd party stats at every 12 hours until Sunday morning PST. We've cleaned up some aspects of how the internal stats scripts work and see a way to speed up the stats input significantly (perhaps 4x), but we'll wait until this mess has blown over before we start making any further improvements (to avoid introducing any new errors at a sensitive time). View the full article
  21. Update: slow stats pt 3: The last stats update (started at 8pm PST) just finished (8:50pm PST), which is very good news. The next update should be much closer to normal. If that goes well, we will turn the external stats access back on tonight PST (i.e. in about an hour or two). View the full article
  22. We've been working on several internal FAH scripts. Most changes are behind the scenes (mostly to aid the FAH team in detecting server problems in real time). However, some of these changes have led to a streamlining of the serverstat page. In particular, we've cleared out some older servers and removed some columns that we don't use very much. The goal is to have the page show just the most critical information, making it more obvious what the issues are. View the full article
  23. It looks like our stats update has worked well. The last update started at 8pm PST and, after our modifications, finished in about an hour. The next stats update has more stats in it (almost 24 hours of WU data), but it looks like it will go much faster (hopefully 2-3 hours). That means the update after that will only have 2-3 hours of stats info in it and should hopefully only take ~20 minutes, and then we're caught up. View the full article
  24. Update: slow stats pt 2: The last stats update just finished (taking about 6 hours). The next one should be a lot faster since there are fewer stats built up, and the one after that faster yet. We'll keep the outside access to the db down until we're back to updates every 2 hours, though. It looks like that will be tomorrow morning (Sunday morning PST). View the full article
  25. As I've posted below, we have a backlog of stats to input into our db for donors to access their scores. To speed up this process, we've made some temporary changes. We've disabled stats updates from our web site. We've also limited our updates of the external stats pages for teams and donors to once every 12 hours, at 6am and 6pm PST. We're trying to streamline the process so that we can get the backlog through and get back to business as normal. This is just temporary, but it will be a big help. Once the backlog is through, the points will be up and hopefully all will be back to normal. We expect this may take as long as 2 days, and we will give updates along the way. View the full article
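
Illustration referenced in item 3 above. This is not FAH's actual SMP core code, just a minimal MPI sketch (in C, under assumed toy conditions) of a simulation loop in which every core must exchange a value with every other core at every timestep; the "simulation" work itself is a placeholder.

```c
/* Minimal sketch (not FAH core code): a toy "simulation" in which every MPI
 * rank must combine a partial result with all other ranks at every timestep.
 * Build/run with any MPI installation, e.g.:
 *   mpicc toy_step.c -o toy_step && mpirun -np 4 ./toy_step
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double total = 0.0;
    for (int step = 0; step < 1000; step++) {
        /* each rank advances its own piece of the system (placeholder work) */
        double local = 0.001 * step * (rank + 1);

        /* every step, all ranks must agree on a global quantity before they
         * can continue -- this is the per-millisecond communication that
         * only makes sense over a fast interconnect */
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("toy run on %d ranks finished, last total = %f\n", nranks, total);

    MPI_Finalize();
    return 0;
}
```

On an SMP box this per-step MPI_Allreduce costs on the order of microseconds; routed over the internet through a central server it would take anywhere from seconds to a day, which is the communication gap the post describes.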
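
Illustration referenced in item 6 above. This is not the project's actual validation procedure, just a tiny self-contained sketch of why reduced precision on GPUs needs checking: it sums many small force-like contributions in single and in double precision and reports the drift.

```c
/* Minimal sketch (not FAH's validation code): accumulate many small terms in
 * single and double precision and compare. GPU hardware of this era worked
 * mostly in single precision, so results had to be checked against a trusted
 * reference in this spirit. */
#include <stdio.h>

int main(void) {
    float  sum_f = 0.0f;
    double sum_d = 0.0;

    for (int i = 0; i < 10000000; i++) {
        float term = 1e-4f;          /* a small "force contribution" */
        sum_f += term;               /* single-precision accumulation */
        sum_d += (double)term;       /* double-precision reference */
    }

    printf("single precision: %.6f\n", sum_f);
    printf("double precision: %.6f\n", sum_d);
    printf("relative error  : %.2e\n", (sum_d - sum_f) / sum_d);
    return 0;
}
```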
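
Illustration referenced in item 12 above. This is a back-of-the-envelope worked example with made-up numbers (not project measurements) of why doing one WU faster (strong scaling) is harder than doing more WUs: with an assumed 5% serial/communication fraction, Amdahl's law caps the achievable speedup well below the ideal.

```c
/* Illustrative only (not FAH data): Amdahl's law for strong scaling.
 * If a fraction s of a WU's work cannot be parallelized (or is spent on
 * communication), the best possible speedup on n cores is 1/(s + (1-s)/n).
 * Running n independent WUs instead always gives n times the throughput,
 * which is why "one WU faster" is the harder, more valuable goal. */
#include <stdio.h>

int main(void) {
    double s = 0.05;                     /* assumed serial/communication share */
    int cores[] = {1, 2, 4, 8};
    int ncases = (int)(sizeof(cores) / sizeof(cores[0]));

    for (int i = 0; i < ncases; i++) {
        int n = cores[i];
        double speedup = 1.0 / (s + (1.0 - s) / n);
        printf("%d cores: ideal %dx, Amdahl limit %.2fx\n", n, n, speedup);
    }
    return 0;
}
```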