Mad Data Science

Data Scientists From Tumblr, Kickstarter Confess One Big Goof

When spam filters get out of control.

The dataists gather. (Photo: Livestream)

DataGotham is currently unfolding downtown at NYU Stern, and around lunchtime, a roundtable gathered for a discussion of what it’s like to be the first data scientist at a company. Panelists included Tumblr’s Adam Laiacano, Kickstarter’s Fred Benenson, and Etsy’s Roberto Medri. The common denominators, according to moderator Hilary Mason? “A love of math, a curiosity, and a lot of stubbornness.”

Much of the discussion revolved around the weediest of data science topics, dwelling on R and SQL and so forth. But the best part was when each of the panelists–at the prompting of Ms. Mason–admitted to something that had gone horribly awry. Not just because everyone loves a good blooper reel, but because they provide a pretty good snapshot of what data scientists actually do.

Mr. Laiacano–who, prior to joining up with the microblogging site, designed atomic clocks–admitted that Tumblr has a slight spam problem. He has written some “pretty good” classifiers for finding what does crop up, though there are false positives. But every now and then there’s a batch where he thinks, “I’m sure this is all spam.” And one time, he confessed, “I accidentally suspended hundreds, maybe a thousand users all in one day.”

Perhaps we’ve just solved the mystery of the missing NSFW sites!

“I’m sorry,” he added, looking as sheepish as a bearded adult possibly can. (Which is to say, very.)

Mr. Benenson confessed that he once spent a couple of hours panicked that various departments and people within Kickstarter had been confusing the numbers in its internal report–the median pledge–with the numbers provided to the outside world–the popular pledge.

“I’m like, oh, this is, I hope they’re the same.” After pulling the numbers he was reassured, but “it was one of those moments like–communication! We need to be clear,” he said.
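
For readers wondering how those two figures could drift apart, here is a minimal sketch in Python. It assumes that the “popular pledge” means the most frequently chosen amount (the statistical mode), which the panel did not spell out, and the pledge amounts are invented purely for illustration; they are not Kickstarter data.

```python
from statistics import median, mode

# Hypothetical pledge amounts in dollars, made up for this example.
pledges = [10, 25, 25, 25, 50, 50, 100, 100, 250]

# Median: the middle value once the pledges are sorted.
median_pledge = median(pledges)

# "Popular" pledge, assumed here to mean the most common amount (the mode).
popular_pledge = mode(pledges)

print(f"median pledge:  ${median_pledge}")
print(f"popular pledge: ${popular_pledge}")
```

With these made-up numbers the two statistics disagree, so a report that labels one as the other would mislead its readers. That is the kind of mix-up Mr. Benenson was worried about.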

Etsy’s Mr. Medri (who, besides his datalogical prowess, majored “in dead languages” as an undergraduate) realized their internal reports featured what might be the least helpful data point of all time: The page with the largest “conversion rate” was the help page, because people who’ve ordered something tend to look up additional details. It conveyed little in the way of actionable information. They adjusted accordingly.

Big ups to Mr. Laiacano for being the only man bold enough not to softball his answer.

However, the panelists didn’t merely embarrass themselves. They also got a chance to offer an example of how their ultra-wonky skills had helped make a difference.

Mr. Laiacano admitted that much of what he’s done has been internal, rather than stuff the world can see. However, he helped track customer use data, i.e. the way users moved through the site, so as to demonstrate that the best way to reorganize the settings would be to put them in one place. It sounds simple, but it required churning through a whole lot of data. “People have been responding much better–it’s much easier to use the site, change your password and change your picture, stuff like that,” he said.

Mr. Benenson’s big move drew on a Kickstarted project he did called Emoji Dick, where he used Mechanical Turk to translate the beginning of Moby Dick into–you guessed it–emojis. It occurred to him that the process could be neatly applied to get the “training data” for a system to classify the site’s many, many campaigns.

Mr. Medri said he’d once been given a weekend to figure out the lifetime value of Etsy customers–something the company apparently hadn’t previously calculated and which a prospective investor wanted to see before making a commitment. (They got the money.)

DataGotham runs through the remainder of the afternoon; you can catch the livestream here.

Follow Kelly Faircloth on Twitter or via RSS. kfaircloth@observer.com