<p><em>Eliseo Papa’s thoughts, ideas and projects. Here so they can be safely ignored.</em></p>

<h1 id="repair-sparsebundles">How I fixed my corrupted sparsebundle image</h1>
<p><em>2015-11-02</em></p>

<p>I save all my paper scans and important documents in an encrypted sparsebundle. In lieu of a proper backup, I keep the file in my Dropbox. However, when I got overconfident and tried to use the same image from more than one computer connected to Dropbox, the image was corrupted and could not be mounted anymore. Panic.</p>
<p>How do I repair the sparsebundle image I had in Dropbox?</p>
<p>Apparently this has happened more than once. A quick Google search returns a bunch of questions on Stack Overflow trying to find a way around it, not many with encouraging results. Sparsebundles are treated as directories by Dropbox, and it’s not possible to restore entire directories at once from the graphical interface. But encrypted bundles are definitely not ordinary directories, and as soon as two computers change a chunk at the same time, mayhem occurs.</p>
<p>After a while, however, I did come up with something: there is a Python tool called <a href="https://github.com/clark800/dropbox-restore">dropbox-restore</a> which interfaces with the Dropbox API and retrieves every individual file in a directory (or sparsebundle) as it existed at a certain timestamp.</p>
<p>I haven’t tested it, because I ended up installing an alternative called <a href="https://www.revisionsapp.com">Revisions</a>, which puts a graphical interface on the same functionality and offers many more ways of interacting with the Dropbox API. It was free to install, and after 2–3 hours of indexing work I was able to successfully roll my encrypted sparsebundle back to a date when it was not corrupted.</p>
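<p>At heart, both tools perform the same per-file operation: list a file’s revision history via the Dropbox API and restore the newest revision at or before the chosen timestamp, for every band file inside the bundle. Here is a minimal sketch of just that selection step — the function name and data layout are illustrative, not either tool’s actual code:</p>

```python
from datetime import datetime

def pick_revision(revisions, cutoff):
    """Return the id of the newest revision at or before `cutoff`.

    `revisions` is a list of (rev_id, modified_datetime) pairs, as you
    would get back from a file-revisions API call. Returns None if every
    revision is newer than the cutoff.
    """
    eligible = [(ts, rev) for rev, ts in revisions if ts <= cutoff]
    return max(eligible)[1] if eligible else None

# Example: the bundle chunk was last good on Oct 31, corrupted on Nov 1.
revs = [
    ("a1", datetime(2015, 10, 28)),
    ("b2", datetime(2015, 10, 31)),  # last good state
    ("c3", datetime(2015, 11, 1)),   # corrupted by the sync conflict
]
print(pick_revision(revs, datetime(2015, 10, 31)))  # -> b2
```

<p>The real work the tools do is repeating this for the thousands of band files a sparsebundle contains, which is why the indexing step takes hours.</p>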
<p>This near disaster taught me the importance of <a href="http://lifehacker.com/psa-dropbox-shouldnt-be-your-sole-backup-for-your-file-1612803794">splitting syncing and backup</a>.</p>

<h1 id="dcjs_to_visualize_food_diaries">You Are What You Eat - an interactive visualization of food data</h1>
<p><em>2015-06-02</em></p>

<p>I have created a <a href="http://elipapa.github.io/youarewhatyoueat/">visual dashboard</a> for exploring daily food diaries. I wanted to try some of the interactive capabilities of <a href="https://d3js.org/">D3.js</a> and <a href="http://square.github.io/crossfilter/">Crossfilter</a>, and ended up leveraging the <a href="https://dc-js.github.io/dc.js/">dc.js</a> library, which integrates the two with good templates.</p>
<p>The data is scraped from 200 users of the MyFitnessPal app who shared their food logs publicly. I used the <a href="http://um.ai">um.ai</a> food ontology to categorize each food item into its food group and to estimate the calories and macronutrients consumed.</p>
<p>While quite easy to use and visually appealing, dc.js still required a bit of attention. Sure, my JavaScript skills are minimal, but there were a number of little details that required digging into Stack Overflow.
It’s a good tool for creating and exposing stable dashboards, but probably not great for quickly prototyping different metrics.</p>

<h1 id="slimebook">SLiMEbook</h1>
<p><em>2015-06-01</em></p>

<p>I published an IBD paper <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0039242">in PLoS ONE</a> showing one of the first applications of ML methods to microbiome data. We dubbed the code used for the analysis SLiME.</p>
<p>The SLiME code (written in <a href="https://www.r-project.org/">R</a>) became outdated pretty quickly. We decided that the code, as it was packaged for the publication of the paper, should not be in the open anymore.</p>
<p>In its place, <a href="http://nbviewer.ipython.org/github/elipapa/SLiMEbook/blob/master/SLiMEbook.ipynb">this IPython notebook</a> will serve as a repository of the analysis. I hope it can be the starting point for others trying to follow the same approach and improve upon it. It leverages Python, scikit-learn and pandas.</p>
<p>Check the full <a href="https://github.com/elipapa/SLiMEbook/">repo</a>, which also contains the <a href="https://github.com/elipapa/SLiMEbook/tree/master/data/chimp">raw data</a>.</p>

<h1 id="fastest-way-to-find-dups">What is the fastest way to find duplicate pictures?</h1>
<p><em>2015-02-19</em></p>

<p>I needed to clean up duplicate photos from my personal library, and because I could not choose which duplicate finder to try, I decided to test them all. Amongst the free tools that correctly identified all duplicates in my test, <a href="https://github.com/jvirkki/dupd">dupd</a> was the fastest. Other <a href="https://www.python.org/">Python</a>- and <a href="http://www.perl.org/">Perl</a>-based solutions also did very well, often better than their C/C++ counterparts.</p>
<h2 id="why-a-duplicate-file-finder">Why a duplicate file finder?</h2>
<p>Since becoming a parent, my picture collection has mushroomed in size. I do try to save everything in one place and keep backups, but it is difficult to keep track of it all. Too afraid of losing family memories, I often end up downloading pictures from my phone multiple times, “just in case”. Now that I have reached the limits of my hard drive space, I figured the best thing to do was to remove some duplicate pictures.</p>
<p>One could try with iPhoto or some other graphical interface, but the approach is simply too slow for a large library (over 100GB). The obvious choice was to search for a command line tool.</p>
<p>I quickly realized that <strong>there are far too many tools</strong> which have been written for this task. And no easy way to find which one is best. So I decided to compare the speed of most of them.</p>
<h2 id="a-speed-comparison">A speed comparison</h2>
<p>Wikipedia has a pretty complete <a href="http://en.wikipedia.org/wiki/List_of_duplicate_file_finders">list of duplicate file finders</a>. From there I downloaded and installed all free/open source command line tools.</p>
<p>I excluded from this comparison those tools which could not be readily installed on my MacBook (e.g. the Makefile would have needed fixing) or which appear to have very limited support (fewer than 10 stars on GitHub, slow or stagnant development, etc.).
I also did not take into consideration tools that lack the option of a “dry run” or a simple listing of duplicates, and instead attempt to delete or hardlink the duplicate files. I find this behaviour too aggressive for most users. Definitely too risky to run on the folder containing the pictures of my kids.</p>
<p>Finally, I excluded those tools that failed to find all the correct duplicates in my test folder. The folder was designed to contain 1195 duplicates in 325 clusters. Here are the results, tested on a 7GB mixture of pictures, videos, symlinks, small files and recursively nested files:</p>
<iframe width="700" height="450" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1zEaULNBhh-ynwpDnLDZT56rAOeupGKgoSblyREEfSbw/pubchart?oid=377382474&format=interactive"></iframe>
<p>Since I keep my pictures on a separate NAS, it is also useful to see how much each of these methods leans on memory or CPU:</p>
<iframe width="700" height="450" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1zEaULNBhh-ynwpDnLDZT56rAOeupGKgoSblyREEfSbw/pubchart?oid=980882643&format=interactive"></iframe>
<iframe width="700" height="450" seamless="" frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/1zEaULNBhh-ynwpDnLDZT56rAOeupGKgoSblyREEfSbw/pubchart?oid=903055661&format=interactive"></iframe>
<h2 id="conclusion">Conclusion</h2>
<p><a href="https://github.com/jvirkki/dupd">dupd</a> was the clear speed winner, also with an acceptable memory footprint. <a href="http://code.google.com/p/liten/">liten</a> and <a href="http://ssokolow.com/scripts/#fastdupes.py">fastdupes</a> come in a close second and may be slightly more portable, as they do not need to be compiled. Compiling the C/C++ tools tends to be a little fragile outside the main UNIX distros, which is a problem when working on a NAS. <strong>I ended up using <a href="http://ssokolow.com/scripts/#fastdupes.py">fastdupes</a></strong>.</p>
<p>It is interesting to see how the (arguably) best-known solution, <a href="https://code.google.com/p/fdupes/">fdupes</a>, was also the slowest — though it remains one of the only tools that can do a byte-by-byte comparison. Both the fastest and second-fastest tools rely on <a href="http://www.sqlite.org/">SQLite databases</a> and let you explore duplicates interactively after they run.</p>
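<p>Whatever the language, most of these tools share the same core strategy: bucket files by size first (a cheap filesystem lookup), then confirm the surviving candidates with a hash or a byte-by-byte comparison (expensive reads). A minimal Python sketch of that idea — illustrative only, not how any specific tool above is implemented:</p>

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Return clusters of duplicate file paths under `root`.

    Files are bucketed by size first; only same-size candidates are
    read and hashed, which is what makes the approach fast.
    """
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    clusters = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
        clusters.extend(group for group in by_hash.values() if len(group) > 1)
    return clusters
```

<p>Real tools refine this with a partial hash of the first few KB before the full hash, and stream large files in chunks instead of reading them whole — those refinements are where much of the speed difference between the tools comes from.</p>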
<p>Please let me know if I forgot any other tool which should have been in this list. The commands included in the analysis were:</p>
<ul>
<li><a href="http://duff.dreda.org/">duff</a> (C)</li>
<li><a href="http://www.beautylabs.net/software/dupseek.html">dupseek</a> (Perl)</li>
<li><a href="http://ssokolow.com/scripts/#fastdupes.py">fastdupes.py</a> (Python)</li>
<li><a href="http://www.iredale.net/p/by-permalink/449447C8-EACB-11DC-B77B-21EB6225452B/">fdf</a> (perl)</li>
<li><a href="http://freshmeat.net/projects/fdupe/">fdupe</a> (perl)</li>
<li><a href="https://github.com/adrianlopezroche/fdupes">fdupes</a> (C)</li>
<li><a href="https://github.com/jbruchon/fdupes-jody">fdupes-jody</a>, a speedier fork of fdupes (C)</li>
<li><a href="http://code.google.com/p/liten/">liten</a> (Python)</li>
<li><a href="http://code.google.com/p/liten2/">liten2</a> (Python)</li>
<li><a href="http://rdfind.pauldreik.se/">rdfind</a> (C)</li>
<li><a href="http://ssdeep.sourceforge.net/">ssdeep</a> (C) also does partial matches</li>
<li><a href="http://code.activestate.com/recipes/551777/">python active state recipe</a> (Python)</li>
</ul>

<h1 id="ruby_openssl_eerror">Fix ruby and bundler after upgrading openssl</h1>
<p><em>2014-12-19</em></p>

<p>I had trouble running <a href="http://bundler.io">bundler</a> on this blog because of a Ruby error. Though I am not sure what I broke for this to occur, it appears that Ruby could not find the OpenSSL libraries anymore. A quick search on <a href="http://stackoverflow.com/questions/25492787/ruby-bundle-symbol-not-found-sslv2-client-method-loaderror">Stack Overflow</a> pointed me in the <a href="https://github.com/sstephenson/rbenv/issues/610">right direction</a>: apparently upgrading OpenSSL via Homebrew breaks Ruby dependencies. What I had to do was install another Ruby version with my Ruby manager of choice (<a href="https://github.com/sstephenson/rbenv">rbenv</a>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CFLAGS='-g -O2' RUBY_CONFIGURE_OPTS=--with-openssl-dir=`brew --prefix openssl` rbenv install 2.2.0
rbenv global 2.2.0
rbenv rehash
gem install bundler
</code></pre></div></div>
<p>And that was enough to get everything working again.</p>

<h1 id="why-i-switched-to-markdown-for-my-cv">Why I switched to markdown for my CV</h1>
<p><em>2012-09-20</em></p>

<p>I have previously used <a href="http://www.latex-project.org/">LaTeX</a> to typeset my <em>curriculum vitae</em>, as it invariably produces a beautiful-looking document. I have now become frustrated at how long it takes me to relearn LaTeX from scratch every time I want to change something. Plus my <code class="language-plaintext highlighter-rouge">.tex</code> source file was becoming completely unreadable. In an attempt to clear up the mess, I decided to put all the content in a markdown file and use CSS to style it. The result is <a href="https://elipapa.github.io/markdown-cv/">markdown-cv</a>, which can be forked and used as a template by anyone who wants to do the same thing.</p>
<p>I have kept my <em>curriculum vitae</em> in <a href="http://www.latex-project.org/">LaTeX</a> for a long time. My workflow included a plain text <a href="http://www.gnu.org/software/emacs/">editor</a> and a <a href="http://www.tug.org/mactex/">Mac installation of LaTeX</a>. LaTeX’s elegant typography meant my CV would always look a tad better than its corresponding version in Word, so I stuck with it. I also enjoyed dealing only with plain text files, saving me from converting files from one word processor to another. Not to mention that having a single file to update, rather than a series of word processor files scattered everywhere, kept me more disciplined about updating it regularly.</p>
<p>However, it was definitely not perfect. For a document as short as my CV, LaTeX was probably overkill. One or two hours spent in Word can easily produce a new style or version of the CV, while doing the same thing in LaTeX would always require much more tinkering.
I began to get tired of sprinkling small LaTeX commands throughout the file, adding <code class="language-plaintext highlighter-rouge">\vskip</code> and page breaks to maintain the overall appearance. Albeit plain text, <em>the LaTeX file which contained my CV became increasingly less readable.</em> Though the primary purpose of LaTeX should be to separate content from presentation, that division was becoming less and less evident to me. The last straw was when I tried to update the overall look of my resume: I ended up having to rearrange paragraphs and dates, while making up obscure LaTeX macros just to move the years to a different place or change the color of a particular element.</p>
<p>Having recently worked with static blog engines, I immediately thought the division between content and presentation was much more elegant there. The content is usually kept in <a href="http://daringfireball.net/projects/markdown/">Markdown</a>, which is very easily readable in any text editor and can be readily converted to HTML. Styling the content is done using CSS and can be done differently for print media or for visualization in a web browser using <a href="http://www.alistapart.com/articles/goingtoprint/">media queries</a>.</p>
<p>So I proceeded to transfer my CV into Markdown format, created a simple HTML template and rendered it with <a href="https://github.com/mojombo/jekyll">jekyll</a>, the engine used by <a href="http://www.github.com">GitHub</a> to render static pages (and what I use to render this blog). I created a CSS stylesheet to render the page in a style inspired by <a href="http://kjhealy.github.com/kjh-vita/">kjhealy’s vita template</a>.</p>
<p>To obtain a <a href="https://elipapa.github.io/downloads/cv.pdf">pdf</a> version of the cv, I am using the great <a href="http://code.google.com/p/wkhtmltopdf/">wkhtmltopdf</a>, which I call by means of a <a href="http://rake.rubyforge.org/">rake</a> task: <code class="language-plaintext highlighter-rouge">rake pdf</code>.</p>
<p>You can see the various pieces of the workflow arranged on <a href="https://github.com/elipapa/markdown-cv">GitHub</a> (hosting on GitHub also means I can maintain different versions of the same file, e.g. one resume for tech jobs and another for academic purposes). I am quite happy with the <a href="https://elipapa.github.io/downloads/cv.pdf">final result</a>.</p>
<p>I can’t say that this workflow was completely successful in doing away with the tinkering, but at the very least the main content of my curriculum vitae is now available in plain, readable text. I can easily edit it without having to remember any LaTeX commands. That, and I needed the CSS practice…</p>

<h1 id="phd-defense-talk">Ph.D Defense talk</h1>
<p><em>2012-05-16</em></p>

<p>I’ve seen many people share slides on <a href="http://www.slideshare.net">SlideShare</a> before, but I was impressed with the ease and the clean interface of <a href="https://speakerdeck.com/">Speaker Deck</a>. To test it out, I have uploaded my <a href="https://speakerdeck.com/elipapa/phd-defense-talk-1">Ph.D defense talk</a>.</p>
<p>Not only is the permalink quite intuitive, but one can also embed the presentation in a page quite easily:</p>
<script async="" class="speakerdeck-embed" data-id="4fb3bab625741e001f029e0a" data-ratio="1.3333333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>Needless to say, I am impressed. This may be a good way to store presentations: arrive in the lecture hall, use whatever computer is connected to the projector, browse to <a href="https://speakerdeck.com/">Speaker Deck</a>, click fullscreen and go. No more thumb drives or lost video connectors.</p>

<h1 id="poster-host-microbiome-interactions-conf">Poster for the host-microbiome interactions conference at the Sanger Institute</h1>
<p><em>2012-05-08</em></p>

<p>My abstract was selected for the <a href="https://registration.hinxton.wellcome.ac.uk/display_info.asp?id=271">Host-Microbiome Interactions in Health and Disease</a> conference at the Sanger Institute, which is great news. Only a few hours of the meeting have gone by, and I don’t think I have ever talked so much microbiome in so little time. Here is a <a href="https://elipapa.github.io/downloads/documents/ibdML-poster.pdf">low-resolution pdf of my poster</a>.</p>
<p><img src="https://elipapa.github.io/assets/images/ibdML-poster.png" alt="poster preview" width="300" style="align:center;" /></p>
<p>I would probably design it differently now that I have had the chance to see it on the wall: too much text, and the font was too tiny.</p>