About

The Challenges

Let me paint you a few pictures of various situations that can come up for a business or a person.

The Solutions

With enough carefully selected data, all of these things can be answered quickly and with some degree of accuracy. That is what this site does. To do this, you will need two things: examples from previous attempts, and the cases you want to predict. Both the examples and the cases to predict should have as many features as possible.

What's a feature? It is some detail about the thing you are trying to get answers on. If you are trying to predict the sale price of a house, features would be things like square footage, number of stories, previous sale price, age of the house, size of the yard, how long ago it sold, etc. And, of course, you need how much those homes sold for, for as many homes in the area as you can get. You then need the same details for the homes you are trying to get prices on. If you are trying to predict the odds someone will get the flu, you would want things like age, sex, height, household income, a homeschooled indicator, time of year, whether they got their immunization, whether they have problems with their immune system, etc.
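
To make the idea of "examples" and "what you want to predict" concrete, here is a rough sketch in Python. The column names, the numbers, and the missing_features helper are all made up for illustration; they are not the format this site requires. The point is only that every row you want a prediction for should carry the same features as the examples, minus the answer.

```python
# Hypothetical example rows: homes that already sold.
# Each row holds the features plus the known answer (sale_price).
examples = [
    {"square_feet": 1800, "stories": 2, "age_years": 12,
     "yard_sq_ft": 4000, "sale_price": 310000},
    {"square_feet": 1200, "stories": 1, "age_years": 35,
     "yard_sq_ft": 2500, "sale_price": 205000},
]

# Homes you want a price for: same features, but no sale_price yet.
to_predict = [
    {"square_feet": 1500, "stories": 1, "age_years": 20, "yard_sq_ft": 3000},
]

TARGET = "sale_price"  # the thing being predicted (name is an assumption)


def missing_features(examples, to_predict, target):
    """List (row index, missing feature names) for rows that lack a feature.

    The expected features are taken from the first example row,
    leaving out the target column.
    """
    expected = set(examples[0]) - {target}
    problems = []
    for i, row in enumerate(to_predict):
        missing = expected - set(row)
        if missing:
            problems.append((i, sorted(missing)))
    return problems


print(missing_features(examples, to_predict, TARGET))
```

An empty list means every row to predict has the same details as the examples.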

How it works

Data mining, which is of course MAGIC! No wait... Fairies! Actually, it is done with computer science and math. You can check out kaggle.com for more on the subject of data mining. The point of this site is to make it so you don't have to hire a data scientist or learn to do data mining yourself, as long as you have the data. If you are a data scientist, you can also use it for a quick and dirty estimate. Which is to say, the estimates you get from this site might be good, but it's likely a human will piece the data together in a better way. But of course they will want a salary too :)

How far can you take it?

The imagination is the limit. The big thing is to have real indicators and good generalizations. For example, if you put a date in your data, the code is going to try to solve the problem using that date. This is generally not what you want, since you are predicting the future and the dates you care about will never show up in your past examples. It's better to stick with the age, or with the month and/or day of the month something occurred in. This makes the results more general and prevents the program from solving for the wrong thing. You can produce, for example, a heat map of where auto theft might occur, but you have to be really careful about what kind of data you use. Average home price might be an indicator, but a better one might be the home price as a percentage of the average home price for the area. This would allow you to use data from one city in another. Most data scientists spend a lot of their time trying to figure out the best way to organize their data to get a good result. You can do that too; there just might be a bit of a learning curve.
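
Here is a small sketch of the two generalizations described above: swapping a raw date for month, day of month, and age, and swapping a raw home price for a percentage of the area average. The generalize_sale function, its field names, and the sample numbers are hypothetical, just one way you might prepare a row before uploading it.

```python
from datetime import date


def generalize_sale(sale_date, built_year, price, area_average_price):
    """Turn raw values into more general features.

    The raw date is replaced by month, day of month, and the age of the
    home at sale time. The raw price is replaced by a percentage of the
    area's average price, so rows from different cities are comparable.
    """
    return {
        "sale_month": sale_date.month,
        "sale_day_of_month": sale_date.day,
        "age_years": sale_date.year - built_year,
        "price_pct_of_area_avg": round(100 * price / area_average_price, 1),
    }


# Made-up numbers: a home built in 1990, sold for 240,000 in an area
# where the average price is 300,000.
row = generalize_sale(date(2015, 6, 14), 1990, 240000, 300000)
print(row)
```

Notice the year of the sale never appears in the result on its own, so the program has nothing date-specific to latch on to.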

Where can I get my data?

Well, I mean that's the rub in the end. You have to find good sources for whatever it is you are trying to work with. A few good public sources are

You will likely have to compile multiple files into one file, unless you get lucky.
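
If your data comes as several CSV files, a short script can stack them into one. This is only a sketch using Python's standard csv module; the combine_csv_files name is made up, and it assumes every file has a header row. Columns that are missing from a file are simply left blank for that file's rows.

```python
import csv


def combine_csv_files(input_paths, output_path):
    """Stack several CSV files into one, using the union of their columns.

    Returns the number of data rows written.
    """
    rows = []
    columns = []
    for path in input_paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            # Remember each column name the first time it is seen.
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)

    with open(output_path, "w", newline="") as f:
        # restval fills in blanks for columns a row does not have;
        # extrasaction="ignore" skips stray values beyond the header.
        writer = csv.DictWriter(f, fieldnames=columns, restval="",
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

You would call it with something like combine_csv_files(["2014.csv", "2015.csv"], "all.csv"), where the file names are whatever you downloaded.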