PHP UK Conference 2013

How I Avoided Going Crazy on Twitter Thanks to Machine Learning

Andrei Zmievski

Presentation

Video

Transcript

Excerpt from the automatic transcript of the video generated by YouTube.

I work at a company called AppDynamics. Just briefly: we have a product that does production-level monitoring and troubleshooting of mission-critical business apps, and that's why there's a picture of a superhero there, because customers consider us superheroes. We're working on a PHP version of our product right now, which is a PHP extension you drop into your production environment, and it starts collecting and analyzing all sorts of data about your app, helping you instrument, troubleshoot and get to the root cause of problems really quickly. So if you're interested in that, come talk to me afterwards.

But we're not here to talk about that, we're here to talk about math. These slides will contain some math, but don't be afraid of it, because it's awesome math and it lets you do some really cool stuff.

This is my Twitter handle, for those of you who don't know already, and this is both a very cool and a very horrible thing. I got it in October 2008. I used to have a different one before, but then through various shenanigans I was able to acquire this one, and I didn't have the foresight to see what was going to come next. I know a bunch of other single-letter Twitter people, one of whom is here: Helgi, if you know him, he is @h. But I know a bunch of others.

So, the advantages of having a single-letter Twitter handle. Fame: apparently people like to do all sorts of profiles on single-letter Twitter handles, like from A to Z. The Atlantic did that; the Wall Street Journal came and did an article, with photographs and stuff like that. I'm like, what, the Wall Street Journal is interested in this stuff? Seriously? Followers: I have like fifty-three, fifty-four thousand followers right now. I have no idea who they are. I mean, I know a couple of hundred maybe, but that's it. Another advantage is maximum reply space, right? A hundred and forty characters minus the length of "@a" and a space gives friends and other people 137 to reply back to me. I feel like I'm doing them [a favor].

The cons: this is what my reply stream looks like most of the time. It is just full of garbage, and if you look at the times here, they are all very closely spaced together, so it's pretty much impossible to filter anything visually. I thought for a while about how to solve this. It's like, well, maybe I'll write some sort of list of rules by which to reject or accept some of these, and so on. But it's really hard to do anything hard-coded; the system is not going to be able to adapt. So I decided to use machine learning, because I'm interested in that stuff, and the best way to learn something is to apply it and do some sort of hack.

So, after I wrote this little tool called Reply Cleaner: this may not look that much different to you, but in my estimate it's like eighty percent better already. First of all, the tweets are much less closely spaced together, so I can actually visually filter through them, just scroll on my phone and so on, and the actual quality is improved. The way this thing works is that it uses a model trained on a number of tweets that I've collected to classify them into good and bad ones, and for bad ones it just blocks the user, so they disappear from my view completely. That's because Twitter doesn't really give you a way to individually exclude single tweets from your reply stream. And so it looks much cleaner. This is an example on my phone where somebody replied to me and I was actually able to look at it, even though I don't follow them, and have a conversation with the person.

But if you look at this, even though it's readable, there's still some junk there. You know, people use @a in very random ways. And then, yeah, I can, you know, filter and refine stuff like that. People don't believe that single-letter handles exist, or there's this stuff: how can people like @a and @dickc, who is the CEO of Twitter, have a million fewer followers. So, conclusion: I still hate humanity, just a little bit less.

So, machine learning. Machine learning is a branch of artificial intelligence. There's no really widely accepted definition of what it is, but I kind of like this one, which was given by Arthur Samuel in 1959: machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.

They learn based on the data that you feed them, based on experience, so to say. It's used in a variety of different industries, especially with the proliferation of data, like big data and data-driven analytics and everything like that. It's used for things like spam filtering, which is very helpful because nobody wants to read spam. This is just a screenshot of what ends up in my spam folder; I'm very thankful to Gmail for that. Things like recommendations, based on what you buy and what merchants like Amazon think you'll want to buy based on your patterns. Machine translation: Google Translate actually uses statistical machine translation; it's not rule-based or example-based. They have a big body of examples of a source language and a target language, and they do a whole bunch of analysis and machine learning on that in order to handle the, whatever, 58 languages that they support right now. Things like clustering: this is a Google News example, where it looks at the incoming news stories, figures out what topics they belong to automatically, and then clusters them around those topics. So this is all very cool stuff, and related to big data.

One of the fundamental things to understand about machine learning is that it's based on the concept of features. A feature is an individual property of something that you are observing or measuring. The really easy way to explain this is to look at the nutrition facts label on some type of food. You can see things like, you know, fats, cholesterol, sodium, carbohydrates, and they all have numeric values next to them. You can think of these as features that describe that particular type of food. If we take all of those and put them in an array, then we get what's called a feature vector. It's basically a set of features for a certain observation.

As an example, if we were trying to build a model that predicts house prices based on some parameters or features of the house, we may come up with things like: how many rooms does it have, what's the square footage of the house, how old is it, does it have a yard, and so on. When we have that feature vector, we also derive some sort of weights that apply to each feature. Let's say these weights are prices, in units of like a hundred thousand dollars. So each room increases the price of a house by approximately one hundred and two thousand dollars, or each year of the house's age decreases it by ten thousand, and so on.

So you take the features and the weights. But before you do that, there's one more thing that is usually done: you pad them. You basically prepend a one to the feature vector, because that corresponds to this sort of offset weight, a starting weight. Then you use the dot product to multiply them and you get your prediction.

Now here's where a little bit of math comes in; this is the first slide. For those of you who don't remember what the dot product is, it looks something like this, right? We have our input array of various features with the one in front, and theta usually denotes the weights for those features; they're also called parameters. It's actually really easy: all it is is you take the pairs of those, you multiply them together and you [add up the results]. That's it.
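
As a rough illustration of the prediction step just described (not code from the talk), here is a minimal Python sketch. It assumes a house described by four features, prepends a 1 for the offset weight, and takes the dot product with the weights. The function names `dot` and `predict`, the house's feature values, and most of the weights are hypothetical; only the per-room weight of about 1.02 and the per-year weight of -0.1, in units of $100,000, echo the figures mentioned above.

```python
# Illustrative sketch only: the feature values and most weights are invented.

def dot(xs, thetas):
    """Multiply the two vectors pairwise and add up the products."""
    if len(xs) != len(thetas):
        raise ValueError("vectors must have the same length")
    return sum(x * t for x, t in zip(xs, thetas))


def predict(features, theta):
    """Prepend 1 for the offset weight, then take the dot product."""
    return dot([1.0] + list(features), theta)


# Hypothetical house: rooms, square footage, age in years, has a yard (1 = yes).
house = [3, 1500, 10, 1]

# Hypothetical weights in units of $100,000; theta[0] is the offset weight.
theta = [0.5, 1.02, 0.001, -0.1, 0.2]

price = predict(house, theta)
print(f"predicted price: ${price * 100_000:,.0f}")
```

Keeping the offset as `theta[0]` paired with a constant 1 means the whole prediction is a single dot product, which is presumably why the padding step is done before multiplying.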

That's a dot product. With me so far? OK, good, I see nodding heads.

Now, the goal is to take a body of training data that you already have examples of. So if you're trying to do spam filtering, it would be a whole bunch of emails, and then some sort of label that says whether each one is a good or bad email. You feed that into some sort of learning algorithm. There are various types of learning algorithms; we'll mention just one in this talk, one of the simplest

[ ... ]

Note: the other 4,118 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.