The age-old Python or R debate always rages here at CAN. While we have a pretty impressive staff of data scientists who all have their individual quirks (some like to run in their spare time, some bird-watch, some binge-watch obscure sci-fi), they have something in common. They work hard, around the clock if they have to, to complete projects and put their best foot forward for clients.
If you’re completely new to the computer programming discussion
Webopedia defines a computer programming language as “A vocabulary and set of grammatical rules for instructing a computer to perform specific tasks.” How does one talk to computers? In code. It gets tricky, however, because there are a lot of different codes that computers can understand. There are not just 10, 20, or 30 different computer languages. There are hundreds and hundreds of them. You can browse a full list here. Python and R are just two of the most popular for data science.
For some additional help, we’ve compiled a list of terms that will help you understand the background of this topic (inspired by LinkedIn).
Programmatic thinking. It’s exactly what it sounds like. It’s a way of thinking that you have to turn on when you learn computer programming. It means seeing the large problem as a series of smaller steps. It also requires being able to transcribe ideas into a code that computers understand.
Compiled and interpreted languages. Compiled languages require the user to compile and build code before it can run. Interpreted languages are run directly by an interpreter, with no separate compile step.
API. API stands for application programming interface. Basically, it’s the set of functions and rules that a program’s designers publish so that other code can access the features of a language or piece of software.
Pseudocode. It’s like code, but not. It’s shorthand for standard code and helps programmers with outlining before they dig into bigger coding tasks.
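To make programmatic thinking and pseudocode a little more concrete, here is a minimal sketch. A small problem (finding average monthly sales) is first outlined as pseudocode in comments, then written as Python. The function name and the sample figures are made up for illustration.

```python
# Pseudocode (an outline in plain words, not runnable):
#   start a running total at zero
#   for each monthly sales figure, add it to the total
#   divide the total by the number of months
#   report the result

def average_sales(monthly_sales):
    """Return the mean of a list of monthly sales figures."""
    total = 0                          # step 1: start at zero
    for amount in monthly_sales:       # step 2: add each figure to the total
        total += amount
    return total / len(monthly_sales)  # step 3: divide by the number of months


# Hypothetical sample data, for illustration only
sales = [1200, 1500, 900]
print(average_sales(sales))  # step 4: report the result, prints 1200.0
```

Because Python is an interpreted language, a file like this can be run directly (for example with `python average.py`, where the file name is whatever you saved it as) without a separate build step.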
Armed with a few definitions, let’s jump into the debate.
Python v. R: Where to Start
First, we’re going to hit you with the hard truth. In order to succeed in the data science world, you need to be familiar with both languages (or at least good at one and familiar with the other). Particularly in Omaha, where CAN is headquartered and data analyst jobs are highly competitive, knowledge of both languages gives you a leg up on the competition. In fact, we have training classes through the Omaha Data Science Academy that teach both.
But that’s not what you want to hear, we know that. So we’re still going to break the two down and tear them apart in comparison.
Both Python and R are good at . . .
Python and R are both free to download, and the learning curve is about the same once you’ve already mastered some basic programming skills. They’re both impressive to master, so in that way you can’t go wrong. No one will shame you for mastering one and not the other.
Positives of Python
Python is known for data munging, data wrangling, website scraping, web app building, and data engineering.
Let’s say you’re tackling a project with a lot of disparate data. Maybe you’re collecting sales data from the past 5 years for a company to help them predict new trends. The problem is that the company has had several changes in management, and the data is stored in multiple locations. Python would be more helpful in this situation. It excels at gathering data from many sources and combining it into a single dataset.
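As a rough sketch of what that gathering step can look like, the example below pulls sales rows from two different places, a CSV export and a SQL database, and combines them into one list. Everything here is an assumption made for illustration: the column names, the figures, and the in-memory SQLite database standing in for a real company database. It uses only Python’s standard library.

```python
import csv
import io
import sqlite3

# Source 1: a CSV export (inlined as a string; hypothetical data)
csv_export = io.StringIO(
    "year,region,sales\n"
    "2019,East,100\n"
    "2020,East,120\n"
)

# Source 2: a SQL database (in-memory SQLite standing in for a real one)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2021, "West", 90), (2022, "West", 110)],
)


def load_all_sales(csv_file, db_conn):
    """Gather rows from both sources into one list of uniform dicts."""
    rows = []
    for record in csv.DictReader(csv_file):
        # CSV values arrive as strings, so convert the numeric fields
        rows.append({
            "year": int(record["year"]),
            "region": record["region"],
            "sales": int(record["sales"]),
        })
    for year, region, sales in db_conn.execute(
        "SELECT year, region, sales FROM sales"
    ):
        rows.append({"year": year, "region": region, "sales": sales})
    # Sort by year so the combined data reads as one timeline
    return sorted(rows, key=lambda r: r["year"])


combined = load_all_sales(csv_export, conn)
conn.close()
for row in combined:
    print(row)
```

In a real project the sources would be actual files and database connections, and many teams would reach for a data library rather than plain lists, but the shape of the task is the same: read each source, normalize the fields, and merge.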
If you already know Java or C, Python is going to come more naturally to you, because the languages share many of the same concepts.
Python is an object-oriented programming language, so it lends itself to writing large-scale, robust code. Some people also point to data suggesting that more business owners are looking for people proficient in Python than in other languages.
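For readers who haven’t met object-oriented code before, here is a minimal sketch of the idea: data and the operations on it are bundled into a class, and a second class reuses the first instead of repeating it. The class names and numbers are hypothetical and only meant to illustrate the style.

```python
class SalesReport:
    """Bundle sales figures with the operations that act on them."""

    def __init__(self, region, figures):
        self.region = region
        self.figures = list(figures)

    def total(self):
        return sum(self.figures)

    def best_month(self):
        # Months are numbered from 1, in list order
        return self.figures.index(max(self.figures)) + 1


class QuarterlyReport(SalesReport):
    """Reuse everything from SalesReport and add one more method."""

    def average(self):
        return self.total() / len(self.figures)


report = QuarterlyReport("East", [100, 140, 120])
print(report.total())       # 360
print(report.best_month())  # 2
print(report.average())     # 120.0
```

Organizing code this way is one reason larger programs stay manageable: each class can be tested and changed on its own.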
Positives of R
R has better visualization tools than Python. It also grew out of the S language, which dates back to the 1970s, so its online support communities are long established. There are over 5,000 add-on packages you can find on the internet to run alongside R to boost its capabilities.
R is known for being great at statistical modeling, graphing, and converting math to code.
Perhaps you’re working on a project for a company that has a nice and neat database. The problem is, it’s difficult for most people to look at a bunch of numbers and understand trends. R is the most helpful in these situations, as it can turn data into graphs and pictures that others can understand.
Let’s talk to CAN
In an attempt to settle this debate, we’ve brought in some professional opinions.
Matt Hoover, Director of Data Visualization, Flywheel: Matt sees R used as a more efficient math language, emphasis on the word “math.” It can achieve in one line of code what Python needs several lines to accomplish. R’s specialty is research, statistics, and data analysis, so it’s more efficient on the stats side. He continues, “Python is way more flexible as a language overall and can be used to do a wider range of things.” Matt sees R used more in learning settings than in the field, and sees Python used for more high-level data science.
Essentially, R is easier to learn and better on the math/statistics side, but overall Python has more capabilities.
Gordon Summers, Senior Data Scientist, CAN: Gordon’s advice is a bit more far-reaching. He says, “The hardest thing about picking between Python and R isn’t choosing which one to start learning, it is in choosing when it is time to stop learning it.” Basically, Gordon’s advice is to not focus so much on which language to master, but instead realize that something new could come along at any time, so don’t invest too much time in one.
If you work consistently with clean data, and your goal is to dissect the data and create visualizations from it, go with R. If you have messy data that you need to “wrangle,” Python is more helpful.
Still stuck? Answer the following questions to help you navigate the Python v. R world.
- What are your teammates using? Maybe you just got a job in data science and can’t decide which one to learn. Look around. What are your friends and fellow employees using? Are they successful in their work?
- What are the data trends of your job market? It wouldn’t be inappropriate for you to call up a company that just posted a data science job and ask what they would prefer. Get a feel for the market, and decide from there.
- Whose data are you working with? Is the data messy and in need of gathering? Python is your answer. Is your data clean and in need of visualizing? Go with R.
You can’t go wrong
Neither Python nor R is perfect. Both have downsides, but packages exist to help alleviate those pains. Examples of such libraries can be found at https://elitedatascience.com/r-vs-python-for-data-science.
To summarize more thoughts by Gordon Summers, the IT world is changing. He says, “To do development is to use the application and to use the application is to do development. There is no IT person and no business user. The person is both a developer and a business user. One of the reasons that larger organizations have struggled to embrace Python and R is that frequently there is an organizational barrier between IT and Business.” When you enter the programming language, data science, or IT world, be ready to be flexible. Businesses are still struggling to figure out where IT fits in their company. The best advice is to be adaptable and to understand where you are going so you can understand the best way to get there.
Oh, and not to complicate the entire argument, but about the time we get the R v. Python debate settled, Scala might just come from the back of the pack to win the whole thing. After all, Twitter is in part written in Scala, and Apache Spark, a staple of the Hadoop ecosystem, is written in Scala. Social media speed and big data prowess? Perhaps this dark horse isn’t such a long shot after all.