Most modern data architectures employ many different data stores and processing engines: Hadoop, Cassandra, HBase, Spark, Storm, Phoenix. Data analysts looking to unearth insights within these data stores must move data back and forth between different systems and different data formats. As the number of new open source projects continues to grow geometrically, this data fragmentation is likely to worsen.
Apache Arrow is a new open-source project that helps data analysts wrestle diverse data sets into a single format. It is a collaborative effort that spans many of the largest providers and users of data infrastructure today, including Amazon, Cloudera, Databricks, DataStax, Dremio, Hortonworks, MapR, Salesforce, Trifacta and Twitter. That so many different companies can collaborate on one initiative to improve data analysis industry-wide is a testament to the power of open source to inspire and engender great change.
I'm really excited about this project. I write many of the analyses for this blog in R and I've seen this data fragmentation problem for myself and across many different companies. It's one of the major reasons Redpoint invested in Dremio: to solve fragmentation for data engineers. As Wes McKinney, author of Pandas, the most widely distributed Python data analysis toolkit, says, “Arrow will enable Python and R to become first class languages across the entire Big Data Stack.”
Apache Arrow is one of the fastest projects to attain Top Level Project status, a fact that underscores the need for the technology, the strength and breadth of the coalition supporting it, and its potential to change the way data analysts work today.