Big Computing

Sunday, May 31, 2015

Coursera Data Science compared to Data Camp courses for R

Recently Data Camp has really expanded its offering of R tutorials, so I thought the time had come to take another look at the Data Science courses offered on Coursera compared to those offered on Data Camp. This review is based solely on my own experience.

I took the courses in the Coursera Data Science series last year, usually enrolling in one or two at a time. Each course in the nine-course series was a month long and consisted of a series of lectures, some quizzes, and one or two projects in which the student worked on a piece of data and submitted the results for evaluation, either by automated grading or by peer review. The series started with setting up R, GitHub, and RStudio, then went on to cover data visualization, data manipulation, regression, and some machine learning. I found the courses to be a good basic overview of the skills and tools needed to work in R as a data scientist. My greatest concern is that the hardest parts are the first few classes, which set everything up. After that I found the classes to be pretty easy. In fact, I suspect that many of the students who fail to finish the series do so because they cannot even get started. The forums are a good source of information and help, and I found them really important when I got stuck. The other downside was the peer review of projects. I found that few reviewers spent much time or effort on this part of the class, and their reviews were in many cases not helpful or just plain wrong. In one case I did a project incorrectly, yet all my reviewers gave me full credit. In another, I did a project correctly but differently from many of the other students and received poor reviews because my work did not look like my reviewers' projects.

The Data Camp courses differ from the Coursera classes in that you run R inside the Data Camp environment. This is a benefit because you do not have to go through the work of setting up everything you need, but it has the corresponding downside that you are really only learning how to do this on the Data Camp site and not in the real world. I did enjoy the interactive, step-by-step method of learning by example that is the core of the Data Camp approach. I did not like that the interface requires your work to be in the exact format the teacher used. This could be very vexing at times.

At this stage I would more strongly recommend the Coursera classes because they really get you ready to do real work. However, if you are frustrated or getting stuck with the Coursera series, do some modules on Data Camp. They will function as a remedial trainer and build your skill and confidence to take on more challenging and more independent tasks.

Friday, May 29, 2015

Fake Data used to make fake research papers and studies

Recently there has been a lot of attention given to Michael LaCour, a UCLA graduate student, for making up data and other details in a widely referenced academic paper. It is certainly an embarrassment for his co-author and the publisher of the article. I think it is time to be honest: this is not a rare problem, and that has been well known for some time.

The first time I heard about faked research papers was in a blog post by Andrew Gelman. I later heard him give a talk on the same topic. His methods were quickly able to weed out a number of Chinese studies that were obviously faked. I think this is where the current system of research papers falls down: it only meets the standard of peer review. There is no systematic effort to analyze or reproduce the results and findings of a study. Peer review is not a high enough standard because we are, at our core, human, which means prone to error and to our own biases. We need to raise the standard of acceptance for scholarly articles.

Recent Blog post by Gelman on Research Fraud (there are others)

It is even better that Jeff Leek is involved in this discussion with Gelman:

"I had a brief email exchange with Jeff Leek regarding our recent discussions of replication, criticism, and the self-correcting process of science.

Jeff writes:

(1) I can see the problem with serious, evidence-based criticisms not being published in the same journal (and linked to) studies that are shown to be incorrect. I have been mostly seeing these sorts of things show up in blogs. But I’m not sure that is a bad thing. I think people read blogs more than they read the literature. I wonder if this means that blogs will eventually be a sort of “shadow literature”?

(2) I think there is a ton of bad literature out there, just like there is a ton of bad stuff on Google. If we focus too much on the bad stuff we will be paralyzed. I still manage to find good papers despite all the bad papers.

(3) I think one positive solution to this problem is to incentivize/publish referee reports and give people credit for a good referee report just like they get credit for a good paper. Then, hopefully the criticisms will be directly published with the paper, plus it will improve peer review.

A key decision point is what to do when we encounter bad research that gets publicity. Should we hype it up (the “Psychological Science” strategy), slam it (which is often what I do), ignore it (Jeff’s suggestion), or do further research to contextualize it (as Dan Kahan sometimes does)?
OK, I’m not planning to take that last option any time soon: research requires work, and I have enough work to do already. And we’re not in the business of hype here (unless the topic is Stan). So let’s talk about the other two options: slamming bad research or ignoring it. Slamming can be fun but it can carry an unpleasant whiff of vigilantism. So maybe ignoring the bad stuff is the better option. As I wrote earlier:

Ultimately, though, I don’t know if the approach of “the critics” (including myself) is the right one. What if, every time someone pointed me to a bad paper, I were to just ignore it and instead post on something good? Maybe that would be better. The good news blog, just like the happy newspaper that only prints stories of firemen who rescue cats stuck in trees and cures for cancer. But . . . the only trouble is that newspapers, even serious newspapers, can have low standards for reporting “cures for cancer” etc. For example, here’s the Washington Post and here’s the New York Times. Unfortunately, these major news organizations seem often to follow the “if it’s published in a top journal, it must be correct” rule.

Still and all, maybe it would be best for me, Ivan Oransky, Uri Simonsohn, and all the rest of us to just turn the other cheek, ignore the bad stuff and just resolutely focus on good news. It would be a reasonable choice, I think, and I would fully respect someone who were to blog just on stuff that he or she likes.

Why, then?
Why, then, do I spend time criticizing research mistakes and misconduct, given that it could even be counterproductive by drawing attention to sorry efforts that otherwise might be more quickly forgotten?
The easiest answer is education. When certain mistakes are made over and over, I can make a contribution by naming, exploring, and understanding the error (as in this famous example or, indeed, many of the items on the lexicon).
Beyond this, exploring errors can be a useful research direction. For example, our criticism in 2007 of the notorious beauty-and-sex-ratio study led in 2009 to a more general exploration of the issue of statistical significance, which in turn led to a currently-in-the-revise-and-resubmit-stage article on a new approach to design analysis.
Similarly, the anti-plagiarism rants of Thomas Basbøll and myself led to a paper on the connection between plagiarism and ideas of statistical evidence, and another paper storytelling as model checking. So, for me, criticism can open doors to new research.
But it’s not just about research
One more thing, and it’s a biggie. People talk about the self-correcting nature of the scientific process. But this self-correction only happens if people do the correction. And, in the meantime, bad ideas can have consequences.
The most extreme example was the infamous Excel error by Reinhardt and Rogoff, which may well have influenced government macroeconomic policy. In a culture of open data and open criticism, the problem might well have been caught. Recall that the paper was published in 2009, its errors came to light in 2013, but as early as 2010, Dean Baker was publicly asking for the data.
Scientific errors and misrepresentations can also have indirect influences. Consider …, where Stephen Jay Gould notoriously… And evolutionary psychology continues to be a fertile area for pseudoscience. Just the other day, Tyler Cowen posted, on a paper called “Money, Status, and the Ovulatory Cycle,” which he labeled as the “politically incorrect paper of the month.”
The trouble is that the first two authors are Kristina Durante, Vladas Griskevicius, and I can’t really believe anything that comes out of that research team, given they earlier published the ridiculous claim that among women in relationships, 40% in the ovulation period supported Romney, compared to 23% in the non-fertile part of their cycle. (For more on this issue, see section 5 of this paper.)
Does publication and publicity of ridiculous research cause problems (besides wasting researchers’ time)? Maybe so. Two malign effects that I can certainly imagine coming from this sort of work are (a) a reinforcing of gender stereotypes, and (b) a cynical attitude about voting and political participation. Some stereotypes reflect reality, I’m sure of that—and I’m with Steven Pinker on not wanting to stop people from working in controversial areas. But I don’t think anything is gained from the sort of noise-mining that allows researchers to find whatever they want. At this point we as statisticians can contribute usefully be stepping in and saying: Hey, this stuff is bogus! There ain’t no 24% vote swings. If you think it’s important to demonstrate that people are affected in unexpected ways by hormones, then fine, do it. But do some actual scientific research. Finding “p less than 0.05″ patterns in a non representative between-subjects study doesn’t cut it, if your goal is to estimate within-person effects.
What about meeeeeeeee?
Should I be spending time on this? That’s another question. All sorts of things are worth doing by somebody but not necessarily by me. Maybe I’d be doing more for humanity by working on Stan, or studying public opinion trends in more detail, or working harder on pharmacokinetic modeling, or figuring out survey weighting, or go into cancer research. Or maybe I should chuck it all and do direct services with poor people, or get a million-dollar job, make a ton of money, and then give it all away. Lots of possibilities. For this, all I can say is that these little investigations can be interesting and fruitful for my general understanding of statistics (see the items under the heading “Why then” above). But, sure, too much criticism would be too much.
“Bumblers and pointers”
A few months ago after I published an article criticizing some low-quality published research, I received the following email:

There are two kinds of people in science: bumblers and pointers. Bumblers are the people who get up every morning and make mistakes, trying to find truth but mainly tripping over their own feet, occasionally getting it right but typically getting it wrong. Pointers are the people who stand on the sidelines, point at them, and say “You bumbled, you bumbled.” These are our only choices in life.

The sad thing is, this email came from a psychology professor! Pretty sad to think that he thought those were our two choices in life. I hope he doesn’t teach this to his students. I like to do both, indeed at the same time: When I do research (“bumble”), I aim criticism at myself, poking holes in everything I do (“pointing”). And when I criticize (“pointing”), I do so in the spirit of trying to find truth (“bumbling”).
If you’re a researcher and think you can do only one or the other of these two things, you’re really missing out."

Article from Buzz Feed on Michael LaCour

"A study claiming that gay people advocating same-sex marriage can change voters’ minds has been retracted due to fraud.

What’s more, the funding agencies credited with supporting the study deny having any involvement.

The study was published last December in Science, and received lots of media attention (including from BuzzFeed News). It found that a 20-minute, one-on-one conversation with a gay political canvasser could steer California voters in favor of same-sex marriage. Not only that, but these changed opinions lasted for months and influenced other people in the voter’s household, the study found.

Donald Green, the senior author on the study, retracted it shortly after learning that his co-author, UCLA graduate student Michael LaCour, had faked the results of surveys supposedly taken by voters. On Thursday afternoon, Science posted an official retraction, citing funding discrepancies and “statistical irregularities.”

“I am deeply embarrassed by this turn of events and apologize to the editors, reviewers, and readers of Science,” Green, a professor of political science at Columbia University, said in his retraction letter to the journal, as posted on the Retraction Watch blog.

“There was an incredible mountain of fabrications with the most baroque and ornate ornamentation. There were stories, there were anecdotes, my dropbox is filled with graphs and charts, you’d think no one would do this except to explore a very real data set,” Green told Ira Glass, host of the This American Life radio program, last week. This American Life had featured the study in an episode in April.

“I stand by the findings,” LaCour told BuzzFeed News by email. He also said he will provide “a definitive response” by May 29.

The problems came to light after three other researchers tried, and failed, to replicate the study. David Broockman, of Stanford, Joshua Kalla, of the University of California, Berkeley, and Peter Aronow of Yale found eight statistical irregularities in the data set. No one of these would by itself be proof of wrongdoing, they wrote, but all of them collectively suggest that “the data were not collected as described.”

Broockman, Kalla, and Aronow told Green about the paper’s “irregularities” and sent him a summary of their concerns. According to his retraction letter, Green then contacted Lynn Vavreck, LaCour’s adviser at UCLA, who confronted him. LaCour couldn’t come up with the raw data of his survey results. He claimed that he accidentally deleted the file, but a representative from Qualtrics — the online survey software program he used — told UCLA that there was no evidence of such a deletion. What’s more, according to what Green told Politico, the company didn’t know anything about the project and “denied having the capabilities” to do the survey.

Vavreck also asked LaCour for the contact information of the survey respondents. He didn’t have it, and apparently confessed that he hadn’t used any of the study’s grant money to conduct any of the surveys.

What happened, apparently, is that people from the Los Angeles LGBT Center — more than 1,000 volunteers, according to This American Life — really did go out and talk to people about same-sex marriage; it’s just that those people were never actually surveyed about their opinions. As one of the canvassers told Ira Glass: “LaCour gave them lists of people he claimed to have signed up for the online survey. Then canvassers did their jobs and went to those houses. This took hundreds of hours.”

David Fleischer, a leader of the LGBT Center, sent BuzzFeed News a statement about the study:

“We were shocked and disheartened when we learned yesterday of the apparent falsification of data by independent researcher Michael LaCour,” Fleischer stated.

“We are not in a position to fully interpret or assess the apparent irregularities in the research as we do not have access to the full body of information and, by design, have maintained an arms-length relationship with the evaluation of the project,” Fleischer added. “We support Donald Green’s retraction of the Science article and are grateful that the problems with LaCour’s research have been exposed.”

In the study’s acknowledgements, LaCour states that he received funding from three organizations — the Ford Foundation, Williams Institute at UCLA, and the Evelyn and Walter Haas, Jr., Fund. But when contacted by BuzzFeed News, all three funders denied having any involvement with LaCour and his work. (In 2012, the Haas, Jr. Fund gave a grant to the Los Angeles LGBT Center related to their canvassing work, but the Center said that LaCour’s involvement did not begin until 2013.) Science cited this fabrication in its official retraction.

There are at least two CVs that were reportedly published on LaCour’s website but have since been taken down. Both list hundreds of thousands of dollars in grants for his work. One of these listings, a $160,000 grant in 2014 from the Jay and Rose Phillips Family Foundation of Minnesota, was made up, according to reporting by Jesse Singal at The Science of Us.

Political scientists are shocked and disappointed by the news of the fabrication, especially because the study was so celebrated.

“The whole episode is tragic,” David Nickerson, an associate professor of political science at Notre Dame, told BuzzFeed News.

The tainted study was some of the strongest evidence to date for the 60-year-old “contact hypothesis,” which says that the best way to reduce prejudice against individuals in a minority group is to boost interactions between them and the majority.

“It’s pretty clear that what you think about the world, policy issues, can be shaped by who you come into contact with,” Ryan Enos, an assistant professor of government at Harvard, told BuzzFeed News.

Before LaCour and Green’s study, there was a lot of survey-based evidence for the contact hypothesis. In a study in 2011, for example, Gregory Lewis from Georgia State University compiled data from 27 national surveys and found that “people who know LGBs are much more likely to support gay rights.” And last year, a study by Andrew Flores of UCLA found that the higher the population of gay people in legislative districts, the more likely those districts will support rights for same-sex couples.

The trouble with these survey-based studies, though, is that it’s impossible to determine causality: Does having gay friends make you more supportive of them, or does being supportive make you more friendly?

“That’s a very, very difficult hypothesis to tease out using plain old survey data,” Patrick Egan, a political scientist at NYU, told BuzzFeed News. LaCour and Green’s study, in contrast, was a field experiment that could compare the opinions of the same group of people before and after having contact with a gay person.

That’s why the study’s fabrication is so disappointing, Egan said. “It’s a real loss to knowledge that we don’t actually have real data coming out of this experiment.”

Same-sex marriage advocates say they will still be pushing this kind of “field persuasion.”

It’s “really disheartening to see that someone apparently tainted a study,” Marc Solomon, national campaign director of Freedom to Marry, told BuzzFeed News by email. But this approach has long been a key component of strategy for the LGBT movement, and there is other evidence for it beyond this one study, he said.

In Maine, for example, the organization found that about one-quarter of opponents to same-sex marriage became more supportive after having an in-depth conversation. Freedom To Marry has worked closely with social scientists “to ensure that we can prove that what we’re doing works!” Solomon added. “The efficacy of it has been proven multiple times.”"

Wednesday, May 20, 2015

Evidence that Data Scientists should never be singers: The Overfitting music video

Finally, irrefutable proof that Data Scientists should never become entertainers. Yes, the evil that is the Overfitting song has been released upon us. The words are scary and set to the overplayed Michael Jackson song Thriller. I know that many hang on to the magic that was the SVD song, but I promise you that song was purely an outlier in the realm of geek songs. The Overfitting song is more the norm.

So if you are a data scientist, predictive modeler, or simple R programmer, please do not sing! Just build better models and try not to overfit.

The Importance of Time when Building a Cohort for Healthcare Data

In the video Temporal Relativity in Cohort Builds, Dr Eran Bellin explains this important aspect of correctly building a cohort. The simple fact is that doing this correctly can produce very different results from doing it incorrectly. As Dr Bellin says himself, "It is important to understand the temporal relationships between condition lines used to build a cohort. Further, the index event line is especially important to identify the condition line whose event is going to define the index date of the cohort. By recapitulating a study from the medical literature we will demonstrate how a temporally aware cohort object is built."
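The idea in the quote can be sketched in a few lines. This is a minimal illustration in Python, not Dr Bellin's method or software: the patient records, the event names (`diabetes_dx`, `statin_rx`), and the 365-day look-back window are all hypothetical. It shows how the event chosen as the index event sets each patient's index date, and how a second condition is then tested relative to that date rather than anywhere in the record.

```python
from datetime import date, timedelta

# Hypothetical event records: (patient_id, event_name, event_date).
EVENTS = [
    ("p1", "statin_rx", date(2012, 3, 1)),
    ("p1", "diabetes_dx", date(2012, 6, 15)),
    ("p2", "diabetes_dx", date(2011, 1, 10)),
    ("p2", "statin_rx", date(2013, 5, 20)),   # falls after the index date
    ("p3", "statin_rx", date(2010, 2, 2)),    # patient has no index event
]

def first_event_dates(events, name):
    """Earliest date of the named event for each patient."""
    first = {}
    for patient, event, when in events:
        if event == name and (patient not in first or when < first[patient]):
            first[patient] = when
    return first

def build_cohort(events, index_event, prior_event, lookback_days):
    """Patients with prior_event inside the look-back window before their index date."""
    index_dates = first_event_dates(events, index_event)
    window = timedelta(days=lookback_days)
    cohort = {}
    for patient, event, when in events:
        index_date = index_dates.get(patient)
        if index_date is None or event != prior_event:
            continue
        if index_date - window <= when < index_date:
            cohort[patient] = index_date
    return cohort

def build_cohort_ignoring_time(events, index_event, other_event):
    """Naive build: both events anywhere in the record, in any order."""
    with_index = set(first_event_dates(events, index_event))
    with_other = set(first_event_dates(events, other_event))
    return with_index & with_other

temporal = build_cohort(EVENTS, "diabetes_dx", "statin_rx", lookback_days=365)
naive = build_cohort_ignoring_time(EVENTS, "diabetes_dx", "statin_rx")
print(sorted(temporal))  # patients who meet the time-aware definition
print(sorted(naive))     # patients who merely have both events
```

On this made-up data the two definitions select different patients, which is the point of the quote: the same condition lines give a different cohort once their order relative to the index date is taken into account.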

Dr Bellin is an authority on how to do predictive analytics on historical patient data. He has been doing it for over 20 years. This video is one in a series of videos he produced. He also wrote the book Riddles in Accountable Healthcare: A Primer to Develop Analytic Intuition for Medical Homes and Population Health. This book is a critical read for anyone trying to understand how to do predictive analytics on historical patient data.

Tuesday, May 19, 2015

Anatomy of a Simple Multiple Event Cohort

Building a multiple event cohort can be essential to any sort of analysis of patient data in healthcare. Here Dr Bellin gives an example of building a simple multiple event cohort with patient data.

Dr Bellin is a thought leader in Healthcare Analytics, having worked with patient data for over 20 years. His work has improved the performance of healthcare systems and the outcomes of patients. He also recently published a book on Healthcare Analytics called Riddles in Accountable Healthcare.

This is the second video in the series. I posted the first video earlier. Here is the Link to that video.

Monday, May 18, 2015

Video of Max Kuhn's talk at the NYC Data Science Academy on Applied Predictive Modeling

While I was working on another post I came across this video of Max Kuhn giving a talk to a Meetup of Data Scientists about his book Applied Predictive Modeling. The video itself is quite long, running just over an hour. However, Max's talks tend to be well worth it. They cover a variety of classification models, which are the basis of his R package called Caret.

Link to Slides from Max Kuhn's talk on Random Forest and Caret

Sunday, May 17, 2015

Video slides of the R Caret package including Random Forest (RF)

By far the most visited page on my blog is the example of Random Forest that I posted about a year ago. I wrote it when I was taking the Coursera Data Science classes, which use the Caret package in their Machine Learning section. A few months back Max Kuhn, the creator of the Caret package, gave a talk for the Orange County R Users group on Caret, which was recorded. I am posting it here for anyone who is interested in a deeper dive into Caret, because Caret does so much more than just Random Forest (RF). It is fairly lengthy at about an hour long, but well worth it.

Also here is the link to my simple example of using Caret for Random Forest on the Iris data set.

Caret is so much more than just Random Forest. It can do a lot of preprocessing, such as centering and scaling. There are also almost 200 classification models in Caret other than Random Forest. Caret has really become the de facto tool for establishing a baseline for the predictive power of a dataset and building out a superior parsimonious model.
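To make "centering and scaling" concrete, here is a minimal sketch in plain Python rather than R. It is not Caret's API: the function name `center_and_scale` and the sample measurements are made up for illustration. Centering subtracts each column's mean and scaling divides by its standard deviation, so every predictor ends up with mean 0 and standard deviation 1.

```python
from statistics import mean, stdev

def center_and_scale(rows):
    """Standardize each column of a list of numeric rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample standard deviation (n - 1)
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical measurements: two predictors on very different scales.
data = [
    [5.1, 1400.0],
    [4.9, 1500.0],
    [6.3, 2300.0],
    [5.8, 1900.0],
]

scaled = center_and_scale(data)

# After the transform each column has mean ~0 and standard deviation ~1.
for col in zip(*scaled):
    print(round(mean(col), 6), round(stdev(col), 6))
```

In a real modeling workflow the means and standard deviations would be computed on the training data only and then reused to transform any new data, so that the model sees both on the same scale.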
