Python Pandas at Extreme Performance
Yaron Haviv | August 8, 2019
Today we all choose between the simplicity of Python tools (pandas, scikit-learn), the scalability of Spark and Hadoop, and the operational readiness of Kubernetes. We end up using them all. We keep separate teams of Python-oriented data scientists, Java and Scala Spark masters, and an army of DevOps engineers to manage those siloed solutions.
Data scientists explore with pandas. Then teams of data engineers re-code the same logic to make it work at scale, or to make it work with live streams using Spark. We repeat that cycle every time a data scientist needs to change the logic or use a different data set for their model.
In addition to taking care of the business logic, we build clusters on Hadoop, Kubernetes, or even both, and manage them manually along with an entire CI/CD pipeline. The bottom line is that we’re all working hard, without enough business impact to show for it…
What if you could write simple code in Python and run it faster than with Spark, without any re-coding and without the DevOps overhead of deployment, scaling, and monitoring?
Continue reading on Towards Data Science.