Passion is like genius; a miracle. – Page 22 – Blog on Software, Statistics, and Quant

Chebyshev’s Inequality

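http://en.wikipedia.org/wiki/Chebyshev’s_inequality The fraction of data lying k sigma or more away from the mean is at most 1/k^2. For example, the fraction of data lying 6 sigma or more away from the mean is at most 1/36. Chebyshev's inequality is not a very tight bound, so in practice fewer data points than this bound suggests lie k sigma away from the mean.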
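The looseness of the bound can be checked empirically. The following is a minimal sketch, not part of the original post: it assumes a skewed exponential sample (an arbitrary choice) and compares the observed tail fraction with 1/k^2. The helper name `tail_fraction` is hypothetical.

```python
import random
import statistics

def tail_fraction(data, k):
    """Fraction of points lying at least k standard deviations from the mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population standard deviation of the sample
    return sum(abs(x - mu) >= k * sigma for x in data) / len(data)

random.seed(0)  # fixed seed so the run is repeatable
# Assumption: an exponential sample, chosen only because it is clearly non-normal.
sample = [random.expovariate(1.0) for _ in range(100_000)]

for k in (2, 3, 6):
    observed = tail_fraction(sample, k)
    print(f"k={k}: observed {observed:.5f}, Chebyshev bound {1 / k**2:.5f}")
```

For every k the observed fraction stays at or below the bound, and for data like this it is usually far below it.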

October 5, 2012

Tags:

statistics
A Few Useful Things to Know about Machine Learning

http://www.kdnuggets.com/2012/09/pedro-domingos-useful-things-about-machine-learning.html An article that collects folk knowledge about machine learning. It was published in Communications of the ACM, and the linked page also links to a free version.

October 1, 2012

Tags:

statistics
A Multivariate Analysis Tool for Mac

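The tool is called Wizard for Mac, and it looks quite convenient. Tools like this only do what can also be done easily in R, but its strength is that the various analysis techniques are well organized, and so are the visualizations and statistics that go with each technique. For example, ordered probit and multinomial logit are offered as radio buttons.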

September 26, 2012

Tags:

statistics
MySQL on OSX

After installing mysql51 and mysql51-server using macports, we need to secure the installation by running /local/lib/mysql51/bin/mysql_secure_installation. But before that, since the default installation does not set a root password, the root password should be set first: And then run mysql_secure_installation. Here’s how to start and stop mysql: ‘mysqlstop’ requires you to enter the mysql root password. If that’s annoying,…

September 1, 2012

Tags:

software
Topic Sensitive Pagerank

Topic sensitive pagerank is a way of computing a pagerank per topic instead of using just one pagerank for all pages. The book Mining of Massive Datasets introduces a biased random walk algorithm for this. In the algorithm, we let the random surfer jump to a page on the specific topic whenever it teleports. That…
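The biased teleport can be sketched as a small power iteration. This is an illustrative sketch under stated assumptions, not the post's own code: the four-page graph, the topic set, the damping value `beta=0.8`, and the function name `topic_sensitive_pagerank` are all hypothetical, and every page is assumed to have at least one out-link.

```python
def topic_sensitive_pagerank(links, topic_pages, beta=0.8, iterations=50):
    """Power iteration in which teleports land only on pages in topic_pages.

    links: dict mapping each page to the list of pages it links to.
    Assumes there are no dead ends (every page has an out-link).
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    # Teleport distribution: uniform over the topic set, zero elsewhere.
    teleport = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
                for p in pages}
    for _ in range(iterations):
        # With probability 1 - beta the surfer teleports to a topic page.
        new_rank = {p: (1.0 - beta) * teleport[p] for p in pages}
        # With probability beta the surfer follows a random out-link.
        for src, outs in links.items():
            share = beta * rank[src] / len(outs)
            for dst in outs:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical four-page web; B and D are assumed to be the topic pages.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
ranks = topic_sensitive_pagerank(links, topic_pages={"B", "D"})
uniform = topic_sensitive_pagerank(links, topic_pages=set(links))
print(ranks)
print(uniform)
```

Passing the full page set as `topic_pages` reduces this to ordinary pagerank with uniform teleporting, which makes it easy to see how much the topic set shifts the ranks.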

August 22, 2012

Tags:

software
Certification from ml-class.org

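My second certification from course.org. The class was taught by Andrew Ng, who is well known in the machine learning field, and it was also a good chance to learn Octave properly. I made some time to finish all of the extra-credit programming as well, and ended up with a score above full marks. A new class opens at ml-class.org on August 20 (US time), so sign up if you are interested.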

August 20, 2012

Tags:

statistics
Scaling Up Machine Learning – LinkedIn Techtalk

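LinkedIn is one of the companies most often mentioned these days when people talk about machine learning or data analysis, and it has recently opened a Techtalk channel on YouTube where it uploads videos. The video in this post covers machine learning with parallelization at each of these levels: P2P, virtual cluster, HPC cluster, multicore, GPU, and FPGA. The slides shown in the video are available at http://hunch.net/~large_scale_survey/. They were also used in a KDD2011 tutorial.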

August 19, 2012

Tags:

statistics
Bagged tree imputation for missing values using caret

Output: It was just luck that we reached 100% accuracy. Running this multiple times may show 96% accuracy, too.

August 10, 2012

Tags:

statistics
Partial Least Squares

http://en.wikipedia.org/wiki/Partial_least_squares_regression Similar to PCA, but unlike PCA, which only tries to capture the variance of the data, it also takes the Y values into account: it builds new orthogonal features and responses from X and Y, and then performs linear regression on them. It is more useful than OLS (Ordinary Least Squares) when there are many variables relative to the amount of data. Tutorial: http://en.wikipedia.org/wiki/Partial_least_squares_regression Book: http://www.maths.bath.ac.uk/~jjf23/LMR/
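To make the "orthogonal new features" idea concrete, here is a minimal pure-Python sketch of a NIPALS-style PLS1 fit for a single response. It is an illustration under stated assumptions, not the method of any particular library: the function name `pls1`, the toy data (5 variables, only 4 observations), and the choice of two components are all hypothetical, and there is no guard for the case where the residual is already zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1(X, y, n_components):
    """NIPALS-style PLS1: returns the score vectors and the fitted values."""
    n, p = len(X), len(X[0])
    # Center the columns of X and the response y.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - col_means[j] for j in range(p)] for row in X]
    f = [v - y_mean for v in y]
    fitted = [y_mean] * n
    scores = []
    for _ in range(n_components):
        # Weights: the direction in X-space with maximal covariance with y.
        w = [dot([E[i][j] for i in range(n)], f) for j in range(p)]
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # new feature (score vector)
        tt = dot(t, t)
        load = [dot([E[i][j] for i in range(n)], t) / tt for j in range(p)]
        q = dot(f, t) / tt  # regress the current residual of y on the score
        # Deflate X and y so that the next score is orthogonal to this one.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        fitted = [fitted[i] + q * t[i] for i in range(n)]
        scores.append(t)
    return scores, fitted

# Hypothetical data with more variables (5) than observations (4).
X = [[1.0, 2.0, 0.5, 3.0, 1.5],
     [2.0, 1.0, 1.5, 2.0, 0.5],
     [3.0, 4.0, 2.5, 1.0, 2.5],
     [4.0, 3.0, 3.0, 0.0, 1.0]]
y = [1.0, 2.0, 3.5, 4.0]

scores, fitted = pls1(X, y, n_components=2)
print("t1 . t2 =", dot(scores[0], scores[1]))  # close to zero
print("fitted  =", fitted)
```

Each score vector is a linear combination of the original variables chosen with Y in mind, and the deflation step is what keeps successive scores orthogonal.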

August 9, 2012

Tags:

statistics
htop, a Replacement for the Traditional Process Viewer top

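Lately I often run applications as multiple processes or multiple threads to make full use of the growing number of CPU cores. With top, though, it is hard to get a quick view of the overall state of the system in that situation. So I looked around and found htop. More screenshots are available at http://htop.sourceforge.net/, and on OSX it can be installed with macports. I have not tried it, but on Linux it should also be easy to install with yum or apt-get. Its advantage is that it shows CPU load per core by default, and, compared with top, by default…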

July 25, 2012

Tags:

software