Big data CSV parser plugin

Mentor: Hernán Morales Durand
Second mentor:
Level: Intermediate
Invited students: Jan Kricka, Jiri Srejber
Students interested: Jan Kricka (very), Jiri Srejber (very), Igor Lyakh, Le Nam (no biography!!!)


With the advent of inexpensive DNA microarray technology, big data is now available to many small and medium-sized laboratories that perform statistical analysis based on microarray experiments. Most of the time, the data produced by genotyping services is delivered in CSV format, as it is a cross-platform "standard" that is easily readable and still used in hundreds of business applications. Smalltalk has several CSV parsers, but their performance is far from competitive with libraries implemented in other languages. The goal of this project is to measure execution time and build a plugin that accesses CSV data in a fast and competitive way.

Technical Details

Several open source projects currently implement C functions to access CSV data. The challenge of this project is to learn tools such as VMMaker and the Interpreter Plugin classes in order to develop an internal or external plugin for Squeak/Pharo.
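
To give a feel for the kind of C routine such a plugin might wrap, here is a minimal sketch of an in-place CSV record splitter. It is an illustration only and is not taken from any of the existing projects: the function name csv_split_line and its signature are hypothetical, and it assumes a comma delimiter, double-quote quoting with "" as the escape, and records that do not span multiple lines.

```c
#include <stddef.h>

/* Hypothetical helper (not part of any named library).
 * Splits one CSV record in place: each field is NUL-terminated inside
 * `line`, and pointers to the fields are stored in `fields`.
 * A quoted field may contain commas; "" inside quotes means one quote.
 * Returns the number of fields stored (at most max_fields). */
size_t csv_split_line(char *line, char **fields, size_t max_fields)
{
    size_t count = 0;
    char *read = line;

    while (count < max_fields) {
        char *write = read;      /* unescaped text overwrites the input */
        int more;

        fields[count++] = write;

        if (*read == '"') {
            read++;              /* skip the opening quote */
            while (*read != '\0') {
                if (*read == '"') {
                    if (read[1] == '"') {   /* escaped quote */
                        *write++ = '"';
                        read += 2;
                        continue;
                    }
                    read++;      /* closing quote */
                    break;
                }
                *write++ = *read++;
            }
        }

        /* Copy unquoted data up to the delimiter or the end of the record. */
        while (*read != '\0' && *read != ',' && *read != '\n' && *read != '\r')
            *write++ = *read++;

        more = (*read == ',');   /* remember this before overwriting it */
        *write = '\0';
        if (!more)
            break;
        read++;                  /* step past the comma */
    }
    return count;
}
```

Because the record is split in place and no memory is allocated, a plugin primitive built around a routine like this would only need to copy the resulting fields into Smalltalk objects. How that hand-off is done depends on the plugin design chosen during the project.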

Benefits to the Student

The student will learn about interfacing highly efficient libraries to Smalltalk.

Benefits to the Community

The Smalltalk community will gain a winning library for an extremely common task: dealing with CSV files.

Updated: 18.3.2012