Hyper Talks: Data Science Languages

Hola! My name is José Pérez-Parras and I am a Data Scientist at Hyper Group.

I was born in Córdoba, Spain. It is a fantastic city with the best dish in the world, “salmorejo”, but do not go in summer if you want to keep living 🔥.

How have I ended up in Yorkshire? I have been learning English since I was a child, and during my software engineering degree I did an Erasmus exchange in Sheffield. I had an amazing time and loved this country and the tiny towns that make up the Yorkshire landscape.

So, after my Master’s in Data Science and some experience in the research world I decided to move into the business world and come here looking for a job, finding an amazing team and company in Leeds!
Apart from being a data geek, what I love the most is travelling (hopefully I can resume this activity soon) and seeing the world, or just discovering the hidden gems of this country.

Top Languages for Aspiring Data Scientists to Learn

When entering the exciting field of data science and machine learning, it can be hard to know where to start learning. To make life easier, I’m comparing popular and up-and-coming languages, following the 2019 Stack Overflow survey. I will summarise the top 5 languages predominantly used now and suggest which ones I expect to become popular in the field of data science.

Current Top 5 Languages:


Python is an extremely popular general-purpose, dynamic language, and is widely used within the data science community. It is often described as one of the easiest programming languages to learn, thanks to its relatively accessible syntax.

It is used not only in data science but across the programming world, from DevOps to web programming. This is largely due to the huge community it has amassed, which supports it with open source projects and libraries.

Another advantage of Python is its easy integration with C and C++. Performance-critical code can be written in those languages and called from Python, which speeds up some tasks considerably without adding much complexity. This is a key reason why it is used so widely for data science and machine learning.
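As a rough illustration of that integration, the sketch below uses the standard library's ctypes module to call the C function strlen directly from Python. It assumes a POSIX system (Linux or macOS) where the C standard library can be loaded, and the helper name c_strlen is made up for this example. In practice, most people get the benefit indirectly through libraries that already wrap C or C++ code.

```python
import ctypes
import ctypes.util

# Load the C standard library (assumption: a POSIX system such as Linux or macOS).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Describe the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Return the byte length of text as computed by C's strlen (hypothetical helper)."""
    return libc.strlen(text.encode("utf-8"))


print(c_strlen("salmorejo"))
```

The Python code only declares the argument and return types; the counting itself happens in compiled C code.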

Some of the top libraries used in machine learning are Pandas, Scikit-learn, TensorFlow and Keras. All of these are available from Python and have been developed by open source contributors or major companies, such as Google in the case of TensorFlow.

For statistics and research, R is one of the most frequently used tools. R is an open source language and software environment supported by the R Foundation for Statistical Computing. It is commonly used through RStudio, which allows you to visualise and transform the data in the same place.

R has a syntax that is relatively easy to understand, and it has the advantage of working directly with data frames, the equivalent of tables in SQL.

The number of libraries for machine learning and data science in R is huge. A drawback of these libraries is that there is very little consistency between them, as R does not have the same widely adopted coding standards as other languages such as Python or JavaScript. This is down to it being used more by the research and mathematics communities than by software engineers and data scientists.


Scala (scalable language) is a general-purpose, open source programming language which runs on the Java Virtual Machine (JVM), the same platform as Java. Scala is an ideal language for those working with high-volume data sets because Apache Spark, a distributed computing engine, is written in it. Spark splits the data into partitions spread across several machines and works on them in parallel. Scala has full support for functional programming and a strong static type system, which can make code easier to read and maintain.

Scala also has access to machine learning libraries such as Spark's MLlib, which gives the user many algorithms that run in parallel on a distributed architecture.
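The idea of splitting data and working on the pieces in parallel can be sketched in a few lines. The example below is plain Python using only the standard library, not Spark or Scala, and the function names are invented for illustration. It shows the split, process and combine pattern rather than real cluster performance (threads in standard Python do not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Split a list into n contiguous chunks of roughly equal size."""
    size, remainder = divmod(len(values), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `remainder` partitions take one extra element each.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(values[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """Work done independently on each partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(values, n_partitions=4):
    """Process each partition in its own worker, then combine the partial results."""
    partitions = split_into_partitions(values, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_sums = list(pool.map(sum_of_squares, partitions))
    return sum(partial_sums)


print(parallel_sum_of_squares(list(range(1, 11))))
```

In a real Spark job the partitions live on different machines and the engine handles the splitting and combining for you.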


SQL, which stands for Structured Query Language, is by far the most popular language used for working with relational databases. It’s considered an essential language for every data scientist, as data is mostly housed in structured databases, such as MySQL, SQLite and PostgreSQL, or newer cloud-based technologies such as Snowflake, Amazon Redshift and Google BigQuery.

SQL allows you to query, modify and create tables of data. It is often part of an ETL (extract, transform, load) process because databases handle structured data quickly, even at large volumes. Once an ETL process has completed, the work often continues in a general-purpose, high-level language such as Python or R.
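To make “query, modify and create” concrete, here is a minimal sketch that runs SQL from Python using the built-in sqlite3 module and an in-memory SQLite database. The table name, columns and rows are made up purely for illustration.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create a table (hypothetical schema for this example).
conn.execute("CREATE TABLE languages (name TEXT PRIMARY KEY, licence TEXT)")

# Modify it by inserting rows; the ? placeholders keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO languages (name, licence) VALUES (?, ?)",
    [("Python", "open source"), ("R", "open source"),
     ("Scala", "open source"), ("SAS", "commercial")],
)

# Query it: which languages are open source, in alphabetical order?
rows = conn.execute(
    "SELECT name FROM languages WHERE licence = ? ORDER BY name",
    ("open source",),
).fetchall()
open_source_names = [name for (name,) in rows]

print(open_source_names)
conn.close()
```

The same three kinds of statement (CREATE, INSERT and SELECT) work with only small changes on the other databases mentioned above.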


Like R, SAS was created for statistical analysis. However, while R is open source and therefore free, SAS is a commercial product which must be licensed for use.

SAS developed its software for advanced analytics, predictive modelling and business intelligence. It has been on the market for many years and is still core to numerous organisations, which makes it difficult for them to switch to another technology or language. It also offers a wide range of libraries for several tasks in machine learning and data science.

 

The up and coming languages to watch out for:


Learning Julia is made easier by Jupyter Notebooks, a tool frequently used by data scientists to code in different languages and to visualise and interact with the code in real time. Julia has been fully integrated into this environment, allowing a smoother learning curve.

As this language is still young, it does not have the same community as other languages and therefore has fewer libraries to use. However, the community seems to be growing every day.

Further detail on Julia’s performance and capabilities can be found in this article.


The language with the cute mascot. Known as Go or Golang, it was invented by a trio of software engineers at Google, originally to make large-scale software development faster and simpler than it was with the languages in use there at the time.

It was designed to be simpler and more concise than many other languages, with a small syntax and clear documentation. It is now used in multiple disciplines such as web development, DevOps and data analysis, with the benefit of having one language to perform many tasks rather than switching from one language to another. It’s predicted that this language will have a major impact in the near future.

Benchmarking Languages

Below we see the compute speed for different algorithms and functions performed by many of the languages covered in this blog. C and Julia have the shortest run times for almost all the algorithms and computations. This is taken from a comparison on Julia’s website.
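If you want to get a feel for how this kind of comparison is measured, the sketch below times a small function in Python with the standard timeit module. The choice of a recursive Fibonacci function and the helper name time_call are my own assumptions for illustration, not a reproduction of the benchmark above, and the timings will vary from machine to machine.

```python
import timeit


def fib(n):
    """Naive recursive Fibonacci, deliberately slow so there is something to measure."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several single runs."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeats, number=1))


elapsed = time_call(fib, 20)
print(f"fib(20) = {fib(20)}, best of 5 runs: {elapsed * 1000:.2f} ms")
```

Taking the best of several runs reduces the noise from whatever else the computer is doing at the time.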

Thanks for reading. I hope you found this top-level overview interesting and informative.

Jose

 

