All content posted by PcPerf bot

  1. One of the more common questions I get asked is how we do our client/server/core programming and backend system administration. Others were also curious about updates on various core projects. So, I thought it made sense to answer both in one post, since the answers are related. This will be a bit of a long answer to several short questions, but hopefully it will help give some insight into how we do what we do.

First, some history. When we started in 2001, I personally wrote most of the code (client, server, scientific code integration, etc.), with some help from a summer student (Dr. Jarod Chapman) and from Adam Beberg on general distributed computing issues and the use of his Cosm networking library. I was just starting out as a professor then, with a relatively small group (4 people at the time), so it was common for the leader of the lab to do a lot of hands-on work. As time went on, the group matured and grew, with increasing funding from NIH and NSF, reaching about 10 people in 2005. At this point, many of the duties were handed to different people in the lab: server code development was done by (now Prof.) Young Min Rhee and later by Dr. Guha Jayachandran. Client development was done by Siraj Khaliq, then Guha, with help from several people (including Adam Beberg as well as volunteers, such as Uncle Fungus). Core development was done by Dr. Rhee, (now Prof.) Michael Shirts, and others. This model worked reasonably well, with each team member giving a significant, but not overly onerous, amount of his/her time (e.g. 10% to 20%) to FAH development. These key developers were able to add a lot of functionality, to aid both the science and the donor experience.

However, in time, this model became unscalable and unsustainable. The individual developers graduated (in academic research, the work is done by graduate students or postdoctoral scholars, neither of whom stays longer than, say, 3-5 years). While the original team was able to build a powerful and complex system, maintaining that system across new generations of students and postdocs became unsustainable. The code was well understood by its original authors, but it was complex enough that new developers found it hard to maintain without introducing errors. In parallel with these code development efforts, we were also maturing in terms of our server backend. We went from a few small servers (10GB hard drives!) to a very large, enterprise-style backend with hundreds of terabytes of storage. This too became a major challenge for the scientific group to manage.

A new plan. Therefore, in 2007, I started a new plan to migrate these duties (code development and system administration) to professional programmers and system administrators. Today, most FAH code development is done by professional programmers, and in time I expect all of it will be. The desire to start with a clean code base led to new projects, such as the v5 server code, the second-generation GPU code (GPU2), the second-generation SMP code (SMP2), and a new client (the v7 client, in the works), all of which have been developed with a clean slate. There are some differences in how donors will see the fruits of these efforts. I have found that while the programmers write much cleaner code (much more modular, systematic, and maintainable), the code development is typically slower. While the scientific group can often make certain changes in, say, a month, the professional programmers may take 2 or 3. What we get for that extra time is more cleanly written code, no hacks, and a plan for long-term sustainability (clean, well-documented code, high-level programming practices, etc.). Some projects are still done by the scientific staff (e.g. Dr. Peter Kasson continues to do great things with the SMP client as well as work towards SMP2), but I expect that in time this will all be done by programmers. System administration has likewise been handed to a professional group at Stanford. They too are more careful and methodical, but slower to respond as a result. My hope is that as we migrate away from our older legacy hardware and they set up clean installs with the v5 server code, the issues of servers needing restarts should be greatly reduced. This infrastructure changeover has been much slower than I expected, in part due to the practices used by the sysadmin team to avoid hackish tricks and to keep a well-organized, uniform framework across all of the servers (e.g. scripting and automating common tasks).

One important piece of good news is that the people we've brought on are very good. I'm very happy to be working with some very strong programmers, including Peter Eastman, Mark Friedrichs, and Chris Bruns (GPU2/OpenMM code), Scott Legrand and Mike Houston (contacts at NVIDIA and ATI, respectively, for GPU2 issues), and Joe Coffland and his coworkers (v5 server, Protomol core, Desmond/SMP2 core, v7 client). System administration is also now done professionally, via Miles Davis' admin group at Stanford Computer Science. Also, since she has help desk experience, Terri Fedelin (who handles University admin duties for me personally) has been working on the forum, helping triage issues.

Where are we now? Much of this work is behind the scenes, and we generally only talk about big news when we're ready to release, but if you're curious, you can see some of it publicly, such as tracking GPU2 development via the OpenMM project (http://simtk.org/home/openmm) and the Gromacs/SMP2 core via the http://gromacs.org cvs (look for updates involving threads, since what is new about SMP2 is the use of threads instead of MPI). You can also follow some of the nitty-gritty details on my Twitter feed (http://twitter.com/vijaypande), where I plan to give more day-to-day updates, albeit in a simpler (and less grammatically correct) form; the hope is to have more frequent updates, even if they are smaller and simpler. As the GPU2 code base has matured in functionality, GPU2 core development has mainly been bug fixes, which is a good thing. SMP2 has been in testing in house for a while, and I expect it will still take a few weeks. The main issue is making sure we get good scalability with thread-based solutions, removing bottlenecks, etc. The SMP2 initiative led to two different cores, one for the Desmond code from DE Shaw Research and another for a Gromacs variant (a variant of the A4 core). We have been testing both in single-CPU-core form (the A4 Gromacs core is a single-core version of what will become SMP2), and we hope to release a set of single-core Desmond jobs in a week or two. If those look good, multi-core versions via threads (not MPI) will follow thereafter. The v5 system rollout is continuing, with the plan to run a parallel v5 infrastructure (set up by the new sysadmins) alongside our current one and have the science team migrate new projects to the new infrastructure. The v5 code has been running in a few tests for a while, and we expect one of the GPU servers to migrate this week, with one or two servers migrating every week as time goes on. The new code does not crash or hang the way the v3/v4 code does (the old code hung under high load and needed its process to be killed), so we expect much more robust behavior from it. Also, Joe Coffland has been great about responding to code changes and bug fixes.

So, the upshot of this new scheme is that donors will likely see more mature software, which also means slower revisions between cores, both because fewer revisions are needed in the new model (a lot of issues are simplified by the cleaner code base) and because each revision now involves a lot of internal QA and testing and more careful, methodical programming. The long-term upshot for FAH is better, more sustainable software. It's taking time to get it done, but based on the results so far (e.g. GPU2 vs GPU), I think it has been worth the wait (though we still have a fair way to go before we can see all of the fruits of this work). View full article
  2. We've had a rough night with GPU servers. One has been down hard since yesterday (it crashed and now can't find its / partition -- the admins are attempting a rescue-disk fsck this morning). Two more went down last night (PST) due to the heavy load, but those were easy to get back up (they are up now). We are stretched a bit thin as we are implementing the new server infrastructure in parallel with the old one. The upshot is that once the new infrastructure has been deployed, we will have much more functional collection servers (CS's) and work servers (WS's) that should not need to be restarted nearly as frequently when under heavy load. We are beginning the rollout of the new WS (v5) code onto GPU servers this week, although these issues have slowed us down a bit. View full article
  3. MD Workshop

     Next week, we will be having a Molecular Dynamics (MD) workshop (see http://simbios.stanford.edu/calendar.htm for details) covering three main pieces of Folding@home-related software: OpenMM (the GPU acceleration code behind the FAH GPU client), MSMBuilder (the software used to parallelize calculations over all of Folding@home, i.e. how to make all the individual donors' calculations work together), and OpenMM/Zephyr (a tool intended to make molecular dynamics simulation easier for non-experts, built from the accelerated OpenMM codebase). We're very excited not only to make the key software pieces that drive FAH available to other researchers, but also to teach others how they work and how to use them efficiently. You can learn more about these projects at their simtk.org web pages as well:
     OpenMM: http://simtk.org/home/openmm
     MSMBuilder: https://simtk.org/home/msmbuilder
     Zephyr: https://simtk.org/home/zephyr
     View full article
  4. We are seeing the network getting worse right now, especially for the GPU assignment server (e.g. traceroutes to it and to other machines on related networks are failing). We have tickets out to IT support to deal with this issue. If this cannot be resolved in a timely manner (24-36 hours) by the IT dept, we will start taking more aggressive measures ourselves. View full article
  5. We have had some issues with the GPU assignment server (AS) migration to new hardware. Several issues have been resolved, including:
     - issues with ATI GPU client work assignments
     - some work servers without work
     But there are two remaining issues for which we have filed tickets with Stanford IT:
     - port 80 forwarding for the GPU AS
     - allowing assignments to 171.64.122.70
     Of these remaining issues, the port 80 issue is a big one for those who have port 8080 blocked, and we've been pushing IT to get it resolved ASAP. The second issue (going to 171.64.122.70) is less time critical (that server can still receive completed WUs from clients), but we would like to get more GPU work servers on line in general. View full article
  6. We hit a few snags with bringing back one of the servers, so we kept the stats update down until we could resolve the problem. It looks like we have a reasonable fix for now, so we're going to turn the stats update back on. This means the stats will likely take a while to update due to the backlog, but all the points from the last 10 hours or so should be coming in. We will likely take the stats update offline again tomorrow morning so we can make a longer-term fix for the issue that slowed us down today. In general, though, the outage went OK, especially considering it included a major migration of hardware. The new hardware should be much more reliable (half a year old vs 5 years old) and also much more versatile and sophisticated. Almost all of the migration is done; the last main bit of infrastructure that needs to be migrated is the stats system, for which we're waiting on a new server to come in. View full article
  7. Our planned shutdown today is proceeding as scheduled. We hope to have the main systems migrated, except for the stats db (which will be down) and the stats web pages. The main web pages, work servers, and AS's should be up (although the AS's may have issues if the migration isn't working correctly). We are monitoring the situation very closely. Thanks for your patience here! View full article
  8. There is going to be some major maintenance at Stanford on the morning of Monday, May 18 (Pacific time). Most of FAH will be up, but right now it looks like the stats web site will be down. It is also possible that the GPU AS and PS3 AS will be down, although we are working on migrating those this week to avoid downtime. The data servers (work servers) should all be up. Stats updates will be suspended briefly during this period as well, but will start back up (with no WUs lost) after we're done. I'll post more as we have more to say. At worst, the affected servers should only be down for about 3 hours, assuming all goes well with the infrastructure work being done. View full article
  9. The main AS is down. We are working on it right now. Assuming nothing serious comes up, it should be back up in an hour. I'll post updates as we know them. The stats update is on hold until this is resolved, but the update will include all the latest points once it is run. View full article
  10. There was a pretty strange power surge today at about 3:45pm Pacific time. Right now, it looks like it has affected some servers. Our sysadmins are working on it. In general, most of FAH is up and running fine, but we expect some issues later today and possibly overnight. View full article
  11. In order to speed up the stats updates, we've suspended the updates of the per-project WU counts. This does not affect WU counts or points, just the stats pages which tell donors how many WUs they have contributed to each project. We have polled donors in the past regarding this issue, and there was overwhelming support for this change, especially if it could significantly increase stats update speeds. After doing some in-house tests, we expect that this will greatly enhance stats update speeds, so we have rolled it out. Even if this turns out to be the main culprit behind the slow stats updates, we may bring it back in time, once we work out a better scheme for how to update this information and/or get hardware better suited for it. The main issue here is that this is a lot of information, with data going back many, many years, and it has become unwieldy. For now, this should lead to a pretty dramatic difference in stats update times. It looks like we can now get an update done in about 20 minutes, rather than the ~2 hours it was taking previously, for 3 hours' worth of accumulated stats. If this looks good, we will likely switch stats updates to occur more frequently in the future, starting with every 2 hours. View full article
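As a purely illustrative aside on the post above: one "better scheme" for keeping per-project WU counts fast is to maintain an incremental counter when a WU is credited, instead of recounting the full history at every stats update. The sketch below is hypothetical -- it is not the FAH stats code, and the table names, column names, and sample values are made up -- it only contrasts the two approaches.

```python
import sqlite3

# Hypothetical schema for illustration only -- this is not the real FAH stats database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wu_log (donor TEXT, project INTEGER, credited_at TEXT);
    CREATE TABLE donor_project_counts (
        donor TEXT, project INTEGER, wu_count INTEGER,
        PRIMARY KEY (donor, project));
""")

def full_recount(conn):
    """Slow scheme: rebuild every donor/project WU count from the entire history."""
    conn.execute("DELETE FROM donor_project_counts")
    conn.execute("""
        INSERT INTO donor_project_counts (donor, project, wu_count)
        SELECT donor, project, COUNT(*) FROM wu_log GROUP BY donor, project""")

def credit_wu(conn, donor, project, when):
    """Faster scheme: log the WU and bump a cached counter in the same transaction."""
    with conn:
        conn.execute("INSERT INTO wu_log VALUES (?, ?, ?)", (donor, project, when))
        conn.execute("INSERT OR IGNORE INTO donor_project_counts VALUES (?, ?, 0)",
                     (donor, project))
        conn.execute("""UPDATE donor_project_counts SET wu_count = wu_count + 1
                        WHERE donor = ? AND project = ?""", (donor, project))

# Example: credit one (made-up) WU and show the cached counter.
credit_wu(conn, "example_donor", 2671, "2009-06-01T12:00:00Z")
print(conn.execute("SELECT * FROM donor_project_counts").fetchall())
```

The tradeoff is the usual one: the incremental counter spreads the cost over every WU credit, so the periodic stats update no longer has to aggregate years of history.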
  12. FLOPS

     There has been much interest in our stats page (osstats) detailing the different OS's and the FLOPS they produce: http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats We've been trying to come up with a way to standardize these numbers so that they can be more easily compared to each other. That has resulted in a new FAQ: http://folding.stanford.edu/English/FAQ-flops We also plan on updating the osstats page to include both the Native FLOPS (which is on there now) and the more common x86 FLOPS, which allows for a more "apples to apples" comparison of FLOPS. View full article
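To make the "apples to apples" idea above concrete, here is a toy sketch of the bookkeeping involved in converting per-OS native FLOPS into x86-equivalent FLOPS. The conversion factors and teraflop figures below are placeholders, not the official values from the FAH FLOPS FAQ or the osstats page.

```python
# Illustrative only: the scaling factors and TFLOPS numbers are placeholders,
# not the official values from the FAH FLOPS FAQ or the osstats page.
NATIVE_TO_X86 = {
    "Windows/CPU":  1.0,   # x86 clients are already measured in x86 FLOPS
    "GPU (ATI)":    0.5,   # hypothetical: one native GPU FLOP counted as 0.5 x86 FLOPS
    "GPU (NVIDIA)": 0.5,
    "PS3/Cell":     0.5,
}

native_teraflops = {
    "Windows/CPU":  250.0,   # placeholder per-platform native TFLOPS
    "GPU (ATI)":    1200.0,
    "GPU (NVIDIA)": 2200.0,
    "PS3/Cell":     1300.0,
}

def x86_equivalent(native, factors):
    """Convert per-platform native TFLOPS into x86-equivalent TFLOPS and total them."""
    converted = {name: tflops * factors[name] for name, tflops in native.items()}
    return converted, sum(converted.values())

per_os, total = x86_equivalent(native_teraflops, NATIVE_TO_X86)
for name, tflops in per_os.items():
    print(f"{name:>12}: {tflops:8.1f} x86 TFLOPS")
print(f"{'Total':>12}: {total:8.1f} x86 TFLOPS")
```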
  13. We're happy to announce a new paper (#63 at http://folding.stanford.edu/English/Papers). This paper describes the code behind the Folding@home GPU clients, detailing how they work, how we achieved such a significant speed-up on GPUs, and other implementation details. For those curious about the technical details, I've pasted our technical abstract below: ABSTRACT. We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. Also, this software is now available for general use (for scientific research outside of FAH). Please go to http://simtk.org/home/openmm for more details. View full article
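Since the post notes that OpenMM is now available for general use, here is a minimal sketch of setting up a short implicit-solvent simulation with OpenMM's Python application layer and running it on a GPU when one is present. This reflects the present-day Python API rather than the C++ interface described in the 2009 paper, and the input PDB file name and run length are placeholders.

```python
from openmm import app, unit
import openmm

# Load a structure; "input.pdb" is a placeholder for any protein PDB file.
pdb = app.PDBFile("input.pdb")

# AMBER force field with OBC implicit solvent (implicit solvent is what the paper describes).
forcefield = app.ForceField("amber99sb.xml", "amber99_obc.xml")
system = forcefield.createSystem(pdb.topology,
                                 nonbondedMethod=app.NoCutoff,
                                 constraints=app.HBonds)

integrator = openmm.LangevinIntegrator(300 * unit.kelvin,
                                       1.0 / unit.picosecond,
                                       2.0 * unit.femtoseconds)

# Prefer a CUDA GPU if one is available; otherwise fall back to the slow reference platform.
try:
    platform = openmm.Platform.getPlatformByName("CUDA")
except Exception:
    platform = openmm.Platform.getPlatformByName("Reference")

simulation = app.Simulation(pdb.topology, system, integrator, platform)
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()
simulation.step(1000)  # a very short run; production trajectories are vastly longer
```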
  14. Based on our FLOP estimate (see http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats), Folding@home has recently passed the 5 petaflop mark. To put this in context, traditional supercomputers have only just broken the 1 petaflop mark, and even that level of performance is very challenging to aggregate. The use of GPUs and Cell processors has been key to this, and in fact the NVIDIA numbers alone have just passed 2 petaflops. Thanks to all who have contributed, and we look forward to the next major milestones to be crossed! View full article
  15. The machine which handles the main assignment server (AS) and the stats update is currently down. We expect that this machine will be back up by 10am Pacific time. Since this has happened before fairly recently, we are looking into transferring some duties off this machine, both to keep it less loaded and to lessen the impact if it goes down again. This means that the stats update will be down until we get the machine back up, but the backup AS should handle all classic clients, and the PS3 and GPU clients are unaffected (as their AS is on a different machine). View full article
  16. It's still early (since this paper was just accepted), but I wanted to give FAH donors a heads-up on our work on Huntington's Disease aggregation, which is just about to come out in the Journal of Molecular Biology. I'll comment on it more in a future post. See our papers page for more details. View full article
  17. We've made a code change on the two main assignment servers (assign.stanford.edu & assign2.stanford.edu -- but not the PS3 or GPU AS's) to handle a potential problem people have been having with getting through firewalls. For those who are very experienced with running FAH (e.g. are familiar with the logs and how to interpret them): if you have had problems in the past with your firewall, could you try again now and see whether you can get a server assignment? You likely won't be able to get to the work server itself (we've only updated the AS code so far), but just knowing whether this fix helped will tell me whether we should update all of the work servers ASAP. If everything was working fine before but now it isn't, please make a post in the forum: http://foldingforum.org View full article
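If you want a quick way to check whether your firewall is what's in the way, the unofficial sketch below simply tests whether the two assignment servers named in the post accept a TCP connection on port 8080 and on port 80. It only checks reachability; it does not speak the FAH protocol or request an assignment.

```python
import socket

# Hostnames taken from the post above; the ports are the ones discussed there.
ASSIGNMENT_SERVERS = ["assign.stanford.edu", "assign2.stanford.edu"]
PORTS = [8080, 80]

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in ASSIGNMENT_SERVERS:
    for port in PORTS:
        status = "reachable" if can_connect(host, port) else "blocked or down"
        print(f"{host}:{port} -> {status}")
```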
  18. Today (January 8, 2008) at 4pm, I'll be giving a talk at PARC with an update about Folding@home. If anyone is local to the Palo Alto area, you can be there in person. Note that they also do live streaming and should have a video online afterwards (check out http://www.parc.com/forum for details). View full article
  19. We have made some updates to the stats code to make it faster and more useful. The recent changes speed up web page creation in general and also cache team pages so that they can be read during stats updates. This is the first stage of further stats improvements to come in 2009. View full article
  20. We're happy to announce a new Pande Group paper (paper #61 at http://folding.stanford.edu/English/Papers). This paper describes a new computational screen to identify important mutations in influenza: Combining Mutual Information with Structural Analysis to Screen for Functionally Important Residues in Influenza Hemagglutinin. Peter M. Kasson and Vijay S. Pande. Pacific Symposium on Biocomputing 14:492-503 (2009). Download URL: http://psb.stanford.edu/psb-online/proceed...sb09/kasson.pdf The influenza hemagglutinin protein performs several important functions, including attaching the virus to the cells it will infect and releasing the viral genome into the interior of the cell. Most protective antibodies against influenza also bind to the hemagglutinin protein. We wish to understand how mutations to hemagglutinin affect viral function, including what keeps avian influenza ("bird flu") from being readily transmissible between humans. In this paper, we have applied a technique from information theory known as mutual information to genetic sequence data to predict important mutation sites on the hemagglutinin protein. In follow-up work, we are combining this technique with other methods to refine these predictions and test some of them using Folding@home. P.S. For those curious about more details, check out the paper (see link above) or the technical abstract: Influenza hemagglutinin mediates both cell-surface binding and cell entry by the virus. Mutations to hemagglutinin are thus critical in determining host species specificity and viral infectivity. Previous approaches have primarily considered point mutations and sequence conservation; here we develop a complementary approach using mutual information to examine concerted mutations. For hemagglutinin, several overlapping selective pressures can cause such concerted mutations, including the host immune response, ligand recognition and host specificity, and functional requirements for pH-induced activation and membrane fusion. Using sequence mutual information as a metric, we extracted clusters of concerted mutation sites and analyzed them in the context of crystallographic data. Comparison of influenza isolates from two subtypes—human H3N2 strains and human and avian H5N1 strains—yielded substantial differences in spatial localization of the clustered residues. We hypothesize that the clusters on the globular head of H3N2 hemagglutinin may relate to antibody recognition (as many protective antibodies are known to bind in that region), while the clusters in common to H3N2 and H5N1 hemagglutinin may indicate shared functional roles. We propose that these shared sites may be particularly fruitful for mutagenesis studies in understanding the infectivity of this common human pathogen. The combination of sequence mutual information and structural analysis thus helps generate novel functional hypotheses that would not be apparent via either method alone. View full article
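For readers who want to see what the mutual-information metric used in this paper looks like in practice, here is a bare-bones sketch that computes the mutual information between two columns of a toy sequence alignment. It illustrates only the metric itself, not the authors' full pipeline (which also clusters high-MI pairs and analyzes them against crystallographic structures), and the residue columns below are made up.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two aligned sequence columns:
    I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    assert len(col_i) == len(col_j)
    n = len(col_i)
    counts_i = Counter(col_i)              # marginal counts at position i
    counts_j = Counter(col_j)              # marginal counts at position j
    counts_ij = Counter(zip(col_i, col_j)) # joint counts for residue pairs
    mi = 0.0
    for (a, b), c_ab in counts_ij.items():
        p_ab = c_ab / n
        # p_ab / (p_a * p_b) simplifies to c_ab * n / (c_a * c_b)
        mi += p_ab * math.log2(c_ab * n / (counts_i[a] * counts_j[b]))
    return mi

# Toy alignment columns (one residue per sequence); real input would be
# hemagglutinin sequences from influenza isolates. Positions are hypothetical.
column_50 = list("KKKRRRKKRR")
column_120 = list("EEEDDDEEDD")

# These two columns co-vary perfectly, so the result is 1 bit.
print(f"MI(50, 120) = {mutual_information(column_50, column_120):.3f} bits")
```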
  21. Jim Clark has been a major donor to Stanford, and his great contributions have had a huge impact on my group's work in general and on Folding@home in particular. Jim Clark was a professor of CS at Stanford, but subsequently was involved in many successful major Silicon Valley companies (see his Wikipedia page for all the details). He also donated over $100M to Stanford to build the Clark Center, a university-wide center for interdisciplinary biology. Today, Dr. Clark, along with John Hennessy (the President of Stanford University), visited our offices to hear about our recent work. They were both heavily involved in computer architecture in the past, so they were interested to hear about our work with GPUs and the success we are seeing there (in particular, the significant speed increases). Also, they are both interested in neuroscience, so I was excited to tell them about our recent Alzheimer's work. Anyway, I was excited to give them both an update and some idea of where we're going, and it was great fun for me to tell them all about how much we've done. View full article
  22. Stanford will be closed for the next two weeks for the Winter Holidays (ending January 5, 2009). FAH will still be up (FAH is always up), but we will be running with a reduced staff. FAH team members have staggered their vacations so that there will always be someone around, but responses to problems will likely be slower than normal. However, we have been working to add lots of jobs, clear out lots of hard drive space on servers, and get all the servers up, so we should hopefully be in good shape even if there are some problems. We'd like to wish all the FAH donors a happy holiday and to thank all of you for your great help with our project. View full article
  23. This is very preliminary news, but something I'm very, very excited about, so I'll give some advance notice. On Tuesday, we presented our results regarding possible new drugs (small molecule leads) to fight Alzheimer's Disease at a recent meeting at Stanford. This meeting was part of the NIH Roadmap Nanomedicine center (http://proteinfoldingcenter.org/) retreat and was supported by NIH grants to Folding@home. It's very early (so we are not publicly talking about the details until this has passed peer review), but we are very excited that it looks like we may have multiple small molecules which appear to inhibit the toxicity of Abeta, the protein which is the toxic element in Alzheimer's Disease. This is exciting in many ways. It's been a long road for FAH to get to this point, but we are starting to see the possibility of having these results published well before our 10th birthday (October 2010). Considering all the technology development that had to be done in the first five years, these results have come very quickly (in the last 3 years), which is exciting. In particular, we are now looking to apply these methods to other protein misfolding diseases (we have pilot projects for Huntington's Disease underway). Finally, I should stress that while we're very excited about this, it's still early and a lot can go wrong between where we are and having a drug that doctors can prescribe. Over the holidays, we will be double-checking the experimental data, crossing t's and dotting i's to make sure nothing has been missed before we think about submitting this for peer-reviewed publication. Also, it is still a long way from an interesting possible drug (where we are now) to something which has passed FDA clinical trials (where we'd love to be), and a lot can go wrong in clinical trials in particular. Still, this is an important milestone for FAH, and we are very grateful to all who have contributed. Happy holidays to all! View full article
  24. During the next two weeks (December 20 - January 4), the Backbone Networking group plans to schedule a backbone maintenance window every morning from 4-8am Pacific time to implement improvements in the network, as they did last year. In most cases, the changes should not affect the connectivity of the networks used by Folding@home. In cases where they might, any interruption in service should be under 5 minutes. View full article