Thursday, December 2, 2010

Letter size: DVI > PS > PDF

By default, LaTeX generates output for A4-size paper.

To get letter size instead, pass the paper size to dvips:

dvips -t letter mydoc.dvi

--> this will produce a letter-size .ps file.

Or just use dvipdf:

--> dvipdf calls dvips behind the scenes, so if you pass it "-t letter", you should end up with an 8.5"x11" (letter-size) PDF document.
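
For reference, the full DVI > PS > PDF chain might look like this (the ps2pdf step uses Ghostscript; whether your dvipdf forwards "-t letter" to dvips can vary by distribution, so treat the one-step version as something to verify):

latex mydoc.tex
dvips -t letter mydoc.dvi -o mydoc.ps
ps2pdf mydoc.ps mydoc.pdf

or, in one step:

dvipdf -t letter mydoc.dvi mydoc.pdf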

Friday, August 27, 2010

LaTeX for beginner: useful tips to avoid headache!

Maybe not for a good beginner, but I just got a massive headache dealing with table and figure placement in LaTeX. This is the cost of learning by doing (yeah, serves me right). To avoid repeating the same headache, I've listed the tips & related links here.

1. LaTeX/Floats, Figures and Captions by WikiBooks

2. TeX Resources by A.J. Hildebrand

3. LaTeX Tips: Basic tips (also by A.J. Hildebrand)
Below is a copy of the page (just in case it becomes unavailable).

------------------------------------------------------------------------------------------------
LaTeX Tips: Basic tips

Books

  • G. Gratzer, Math into LaTeX. A must-have for anyone using LaTeX for mathematical typing, this is the only book describing, in detail, the AMS enhancements to LaTeX ("ams-latex") that greatly facilitate the typesetting of mathematical material. It is suitable for beginners, but is also an indispensable reference for experienced TeX users.
  • H. Kopka and P.W. Daly, A guide to LaTeX. The best general LaTeX book. For intermediate users.
  • M. Goossens and F. Mittelbach, The LaTeX Companion; M. Goossens, S. Raatz, and F. Mittelbach, The LaTeX Graphics Companion. This pair of books covers many of the "packages" and add-ons that are available for LaTeX. For intermediate to advanced users, these books complement Gratzer and Kopka/Daly.

General

  • Avoid manual formatting commands. Of all the mistakes people make when typesetting with LaTeX, attempting to format text manually instead of using predefined LaTeX macros is probably the most common, and the most frustrating for a publisher. Manual formatting includes inserting vertical or horizontal spacing with \bigskip, \vskip, \vfill, etc., setting explicit line breaks (\\, \newline), preventing paragraph indentation with \noindent's, setting theorems via explicit font instructions ({\bf Theorem:}), coding section headings manually (\centerline{\bf Introduction}), etc. Avoid such commands, and use instead proper LaTeX constructs such as \section{...} or \begin{theorem}... \end{theorem}. Leave the formatting up to LaTeX, which does a very good job at that. The output obtained by letting TeX decide on the amount of spacing almost always looks better than what an author could achieve by inserting spacing commands; if you want a different "look", change to a different documentclass (e.g., use article instead of amsart, or vice versa), or change parameters globally (e.g., setting \parskip=8pt adds a bit of vertical space between paragraphs). The latter, however, should be used only sparingly, if at all.
    There are situations where manual formatting commands are appropriate; for example, the "bibitems" in the "thebibliography" environment must be formatted manually. However, those situations are very rare.
  • Avoid using nonstandard documentclasses; use article or amsart (or, for book-size documents, book) as documentclass. The "article" and "amsart" documentclasses are all-purpose documentclasses that are part of the standard TeX distribution and which can be used for almost everything, not just "articles." The two classes format articles differently (e.g., section headings are set in different font sizes), so pick whichever you like best. You might want to pick one class and then stick with it, rather than switching back and forth. One reason for this is that the syntax for the topmatter material (author, title, etc.) is slightly different in the two classes, so if you want to switch from one class to the other, you will need to edit that part of the document. Some publishers have their own customized documentclasses and ask authors to submit papers written in those classes. Papers written with such nonstandard documentclasses are not "portable", since the documentclasses are not part of the standard TeX distribution, and others will not be able to compile the tex file (at least not without going through the trouble of having to download the customized documentclass from the publisher's website). For this reason, I would suggest writing your paper in one of the standard documentclasses (article or amsart), and use that version for posting on websites or submitting to preprint servers, or for circulating by email. I would only convert the paper to the publisher's documentclass (a process that is usually straightforward, in contrast to a conversion in the other direction - from a customized documentclass to one of the standard classes), when the paper is in final form, accepted, and the publisher or editor is asking for the tex file of the paper.
  • If you use the article class, be sure to load the ams packages with \usepackage{amsmath, amsthm}. Adding this line after \documentclass makes the standard ams latex enhancements (such as align and theorem/proof environments) available. (With the amsart documentclass these packages are loaded automatically, so this isn't necessary.) For most purposes you won't need any of the additional ams packages; an exception is the "amssymb" package, which you may need to load if you require special symbols. (A minimal preamble along these lines is sketched after this list.)
  • To enlarge or scale a LaTeX document, increase the font size by adding the option "[12pt]" or "[11pt]" to the documentclass. The plain TeX \magnification command does not work with LaTeX.
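
As promised above, a minimal preamble following these tips (the 11pt option and the package list are just one reasonable choice):

\documentclass[11pt]{article}
\usepackage{amsmath, amsthm} % standard ams enhancements: align, theorem/proof
\usepackage{amssymb}         % only if you need special symbols

\begin{document}
\section{Introduction}
...
\end{document}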

Math

  • Use align instead of eqnarray for multiline displayed equations. Align is part of the ams-latex package and is available whenever the amsmath package has been loaded (see above). There rarely is a need for using anything else (other than the variant align* which works just like align, except that it does not generate equation numbers). In particular, align supersedes the eqnarray environment: it is easier to use, and it does a much better job of displaying mathematics than eqnarray. For example, one annoying problem with eqnarray is that in displays with long lines, equation numbers may get partially overwritten. Align is much smarter in handling this situation: if there is not enough room for an equation number, the equation number automatically gets moved up or down. (See the combined example at the end of this list.)
  • Other ams-latex display environments. Ams-latex provides several other environments for multiline displays, such as gather, multline, aligned, split, etc. The variety of options may be confusing, but none of these is particularly important, and you can get by with just using align or align*.
  • Align multiline displays right before equal signs or their equivalents (e.g., a "less than" symbol). If that's not possible and you have to break up an expression in the middle, move the continuation line a bit to the right by placing a \qquad right after the alignment symbol.
  • Use \quad or \qquad for spacing in displayed math material. Usually it's best to leave the spacing up to TeX. However, if explicit horizontal spacing is needed (for example, to set an expression like "(n \to \infty)" apart from the rest of the display, or to separate two equations on the same line), \quad (or, occasionally, \qquad which equals two \quad's) in most cases generates the right amount of space. Don't try to create spacing with a bunch of explicit spaces ("\ "); the spacing generated in this way is usually not optimal, and the explicit spaces will likely have to be removed (and possibly replaced by \quad) when the paper is typeset at the publisher's end.
  • Avoid blank lines before or after a display, unless you really want to start a new paragraph. It is tempting to surround displayed math material by blank lines in the source file, to make them stand out and easier to locate. However, this is usually wrong, since blank lines are interpreted as paragraph breaks, may generate some additional vertical spacing and cause the next line of text to be indented - something you usually don't want. If you want to set off displays in your source file, do so by inserting a line with comment symbols, such as "%%%%% equation 3.1 %%%%%%%%%%%%%%" before and/or after the display.
  • Use the bracket pair \[ and \] instead of double dollar signs ($$). In TeX and amstex the double dollar symbol is used to delimit displayed math material. This still works in LaTeX (and a lot of people, including myself, still use it since old habits are hard to get rid of), but its use is discouraged, and it is possible that in future versions it may no longer work.
  • Use "\tag" for manually set equation numbers. Parentheses are generated automatically, so to get "(4a)", you'd use \tag{4a}.
  • Use "\eqref" instead of "\ref" for references to (labelled) equations. This ams-latex command works just like \ref, but it automatically creates parentheses, which makes it easier to use.
  • Use \substack{...} for multiline subscripts on sums or integrals. \substack is provided by the ams-latex package and works much like the \sb ... \endsb pair in amstex. It is much easier to use, and produces better looking output, than an array environment or a construct using \atop (derived from plain tex).
  • Declare theorems with \newtheorem or \newtheorem*. If you don't want to use the automatic numbering mechanism, just add one \newtheorem* declaration for each theorem, lemma, etc. to the preamble, using some simple labeling scheme. (Use the asterisk version, \newtheorem*, to prevent theorem numbers from being generated.) E.g., \newtheorem*{theoremA1}{Theorem A1}, \newtheorem*{theoremA2}{Theorem A2}, etc. Note that theorem declarations can contain numbers and punctuation symbols, in contrast to ordinary macros; thus you can give "Theorem A.2" the label "theoremA.2".
  • Use the \begin{proof} ... \end{proof} environment for proofs. This is part of the ams-latex package and works much like the \demo ... \enddemo pair in amstex. In particular, it adds a bit of space before and after the proof, and a "qed symbol" (a hollow square) at the end of the proof. Placement of the qed symbol: an important rule is that you should not leave a blank line before "\end{proof}" since that would indicate a paragraph break and would cause the qed symbol to be placed one line below where you want it. If the proof ends with a displayed equation, then "\end{proof}" would normally place the symbol one line below the display, which looks odd. To place the symbol on the same line as the display, add "\qedhere" at the end of the display. (This is explained in Gratzer's book.)
  • Use \operatorname{...} or \DeclareMathOperator for "math operators" that are not predefined. Most common functions and operators in mathematics have predefined macros (such as \sin, \arctan, \max, \limsup, \mod) that automatically print the "operator" in an upright (rather than italic) font when used in math mode; this is the desired look. However, if you need an operator that is not predefined, say "rank", it will not look right if you just type $rank(A)$. What you should do is replace "rank" by "\operatorname{rank}"; if you need this more than a few times, it is worth defining a new operator, say \rank, with the \DeclareMathOperator macro (see the Gratzer book for details).
  • Use \left and \right for delimiters surrounding "large" expressions (like sums or fractions). An expression like $(\sum_{i=1}^na_i)^2$, surrounded by ordinary parentheses, looks very poor when typeset. Preceding the two parentheses by \left and \right causes TeX to automatically size the parentheses. Note that \left and \right must occur in pairs and you cannot break lines, or put an alignment symbol, inside such a pair. A rather common, but hard to diagnose, error arises when this rule is not followed.
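
The sketch below pulls several of these tips together (align with \eqref, \qquad spacing, \DeclareMathOperator, a starred theorem, \left/\right sizing, and \qedhere); the mathematics itself is just filler:

\DeclareMathOperator{\rank}{rank}
\newtheorem*{thmA}{Theorem A}

\begin{align}
f(x) &= x^2 + 1, \label{eq:f} \\
g(x) &= x^3 - x \qquad (x \to \infty).
\end{align}

\begin{thmA}
With $f$ as in \eqref{eq:f}, $\rank(A) \le \left( \sum_{i=1}^{n} a_i \right)^2$.
\end{thmA}

\begin{proof}
By \eqref{eq:f},
\[
\rank(A) \le \left( \sum_{i=1}^{n} a_i \right)^2. \qedhere
\]
\end{proof}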

Tables, pictures, and graphics

  • Use the options [h], [!h], etc., to fine-tune the placement of tables and figures. LaTeX uses sophisticated algorithms to decide where to place tables and figures enclosed in \begin{table} ... \end{table} or \begin{figure} ... \end{figure} environments. Usually, this works just fine, but occasionally (especially with documents that have lots of figures or tables), this results in a poor placement - for example, in the middle of a bibliography. To correct this, first try adding one of the options [h], [t], or [b], to \begin{table} or \begin{figure}; e.g., \begin{table}[h] asks for placement of the table "here" (i.e., at the place where the table appears in the document). (Similarly, the options "[t]" and "[b]" ask for placement at the top resp. bottom of the page.) If this does not work, add an exclamation mark to the option (e.g., "[!h]"). As a last resort, you could insert a pagebreak with \clearpage at a place where you want the table to appear, possibly combined with one or more "\suppressfloats" instructions at places where you don't want the table to appear.
  • Use the "graphicx" package to include graphics produced by external programs. The ideal graphics format for inclusion in a LaTeX document is "encapsulated postscript" or eps. Files in this format usually can be recognized by a filename with a ".eps" extension. Nearly all picture generating programs (including Mathematica, xfig, or Windows/Mac tools like MS Word, Paint, etc.) have the ability to save the graphics as an eps file. Once you have your graphics in eps format, use the "graphicx" package to import these files into your TeX document, by (i) adding \usepackage{graphicx} (note the "x" at the end!) to the preamble to load the graphicx package, and (ii) adding an instruction of the form \includegraphics{file.eps} at the place you want the graphics to appear, for each such file. For (much) more on this, consult the "epslatex" documentation, which you can call up and print out with the command "texdoc epslatex.ps". Other methods of including graphics, such as the "epsf" or "epsfig" packages, are considered obsolete and their use is discouraged (though they may still work).
  • Commutative diagrams. Simple diagrams can be created with the "CD" environment, provided by the "amscd" package (to load this package, add the line "\usepackage{amscd}" to the preamble). This environment is derived from the \CD ... \endCD environment in amstex, and the syntax is basically the same. For more complex diagrams, there is the "xy-pic" program (to be loaded by the line "\usepackage{xy}"), an amazingly powerful and versatile tool, with which you can draw pretty much every diagram that you might encounter in mathematics. The program is documented in a short guide "xyguide" and a comprehensive reference manual "xyrefer", which you can call up and print out with "texdoc xyguide.ps" and "texdoc xyrefer.ps".
  • Drawing figures by hand. If you need to draw a picture by hand, use "xfig", which is available on the math department's Unix system. xfig is powerful, yet easy to use and intuitive, and it comes with extensive documentation and help files. Once you have created a picture in xfig, save it in eps format and import it into the LaTeX document as shown above.
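
A minimal figure inclusion following the tips above (the filename, width, and the [!h] option are placeholders to adapt):

\usepackage{graphicx} % in the preamble

\begin{figure}[!h]
  \centering
  \includegraphics[width=0.8\textwidth]{picture.eps}
  \caption{A picture created in xfig and saved in eps format.}
  \label{fig:example}
\end{figure}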

Miscellany

  • Putting TeX documents on the web. The cleanest, easiest, and quickest way to make a TeX document available on the web is to convert it to pdf format and then post the pdf file. The conversion from LaTeX to pdf is a painless one step process: just say "pdflatex file.tex" to generate a pdf file "file.pdf" directly from a LaTeX file "file.tex". I have been using this method to make class materials available to students, and I have converted hundreds of documents in this way, without encountering a single problem. Other approaches, such as converting TeX files to html files with embedded gif's, are more cumbersome to use, more prone to errors, and the resulting web pages generally look rather poor.
  • Printing dvi files from within xdvi. Unfortunately, direct printing from the xdvi screen is not possible. You need to exit xdvi and then use dvips to get a printed copy of the document. If the dvi file is one that you have created and which is therefore readily accessible, this is no problem, but you might find yourself in a situation where a program such as netscape or texdoc calls up xdvi to view a file, and you have no idea where the file is located or what the file name is. Here is what you can do in these cases: In netscape, if a postscript or pdf version of the document is provided (this is the case with, for example, MathSciNet or the ArXiv), click on the links corresponding to those versions. If only a dvi version is provided, download the file (rather than just clicking on the link), then use dvips on the downloaded copy of this file to get a printout. With texdoc, try to specify the filename with a ps extension; e.g., "texdoc thesis.ps" instead of "texdoc thesis". If there are both a ps and a dvi version of the document on our system, then specifying the ps extension will force the ps version to be displayed with ghostview or gv, and you can print the file directly from ghostview. If there is a dvi version, but no ps version (as in the case of "fancyhdr"), find the parent directory of the dvi version using the "locate" tool, then use dvips with the full pathname to the dvi file as argument. In the fancyhdr example, you'd say "locate fancyhdr.dvi" to find the location of fancyhdr.dvi (underneath /usr/local/encap/teTeX/), and then use "dvips /usr/local/encap/teTeX/share/texmf/doc/latex/fancyhdr/fancyhdr.dvi".
  • Citations. The advice against manual coding applies here as well. Use the built-in cite mechanism of LaTeX: instead of "[3]" or "[Wi96]" use "\cite{3}" or "\cite{Wi96}" to reference bibliography items. This has a number of advantages, the most important of which is that it makes adding or deleting a bibliography item a painless process since LaTeX automatically renumbers references. (When you do this, be sure to run latex on the file (at least) twice, since the renumbering process requires two (or more) passes.) Another advantage of using the \cite mechanism is that it makes it easy to change citation styles: if the bibliography is generated by bibtex, all you need to do is replace one bibliography style by another, e.g., "\bibliographystyle{amsplain}" by "\bibliographystyle{amsalpha}". If the bibliography is set with \bibitem's, change the optional argument in \bibitem to whatever you want the label for the corresponding record in the bibliography to show, regardless of the citation key. E.g., \bibitem[Wiles1995]{wi95} produces a record with label "[Wiles1995]" but can be cited with \cite{wi95}.
    Arguments in citations. Often one needs to refer to a specific theorem, section, page, etc., in a reference. The standard way to do this is by saying something like "by [5, Theorem 3.5] we have ..."; the proper way to code this using the \cite mechanism is to include the page/theorem/etc. reference in brackets, as an argument to \cite: "by \cite[Theorem 3.5]{5} we have ...".
  • Bibliographies set with bibitems. There are two ways to generate bibliographies in LaTeX: either by coding each reference as a "\bibitem", placed inside a \begin{thebibliography} ... \end{thebibliography} environment within the main tex file; or by creating a separate database with bibliography records, and using a program called "bibtex" to process that database. The BibTeX approach is much more complicated and has a steep learning curve (it takes up an entire chapter in Gratzer's book), so I would recommend that beginners stick to the "\bibitem" method. I would also recommend using the "\bibitem" method for documents that have only a few bibliography items; for short bibliographies creating a bibtex database is overkill. There are several commonly accepted ways to format bibliography records with \bibitem's (e.g., putting titles inside \emph{...} and setting journal names in ordinary (Roman) font, or vice versa); look at some examples from Gratzer's book, but whatever style you choose, be sure to be consistent and format all records in the same manner. (A minimal example appears after this list.)
  • Bibliographies set with BibTeX. The alternative method to create a bibliography is to create a separate file containing the bibliography records (with the standard extension ".bib"), and read this file into the main tex file with a "\bibliography{...}" command. The bib file has to be formatted according to rather rigid specifications; learning the proper syntax of the bib records takes some time, and keying in bibliography records in this syntax takes longer than keying the same records as \bibitems. However, the advantage of this approach is that you have to do this only once; if you write another paper that references some of the same records, you can use the same bibliography file. In fact, you can create a bibliographical database of all literature items that are of interest to you and use this database as a master database for your papers. Another significant advantage of using BibTeX is that you can download BibTeX formatted citations from MathSciNet and add these to your bibtex database. This saves you from having to enter the records manually and, more importantly, it ensures that the citations follow the standard conventions for journal abbreviations, punctuation, etc.
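
A minimal \bibitem-style bibliography and matching citation, following the tips above (the label and key are the ones used in the example earlier):

% in the text:
By \cite[Theorem 3.5]{wi95} we have ...

% at the end of the document:
\begin{thebibliography}{9}
\bibitem[Wiles1995]{wi95} A. Wiles, \emph{Modular elliptic curves and
Fermat's Last Theorem}, Ann. of Math. 141 (1995), 443--551.
\end{thebibliography}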

Tuesday, June 15, 2010

Evaluation: Validation, Metrics, Graphical Measures

Source: malibu (machine learning workbench)
http://proteomics.bioengr.uic.edu/malibu/docs/evaluation.html

Evaluation

malibu implements a number of algorithms and measures to evaluate the performance of a two-class supervised learning tool. This is necessary to both select the best model and evaluate the performance of the learner.



Validation Algorithms

To facilitate understanding of each validation algorithm, there is a graphical representation illustrating the proportion of data from the dataset used for training, testing, or neither. The circles below represent some proportion of the dataset. For each example, they are divided into 12 slices. The coloring system goes as follows: red slices represent the proportion of the dataset used for testing, blue for training, and yellow for unused or not yet used data. The one exception is the bootstrap method, where dark blue represents repeated examples, such that the training set size equals the size of the dataset.

Holdout Validation

The best method to determine the true generalization ability of the learning algorithm is to evaluate its performance on unseen examples. The holdout method divides the dataset into two portions, usually 2/3 for training and 1/3 for testing. For small datasets, this method is often repeated over random partitions of the dataset.

Resubstitution

The resubstitution or self-consistency test evaluates the classifier on the same set of examples used for training (the training set). This test often gives an overly optimistic estimate of the true generalization error. This method is not often used to evaluate the performance of a classifier.

Crossvalidation

When the dataset is small, holdout fails to give an accurate estimate of the classifier's performance. Instead, cross-validation has been shown to outperform holdout in this case. n-fold cross-validation divides the dataset into n parts; n-1 parts are used for training and the left-out part is used for testing. This process is iterated so that every part is used for testing once. An extreme case of cross-validation is leave-one-out cross-validation, where every instance except one is used for training and the left-out instance for testing. This procedure is repeated for every instance in the dataset. A single run of 10-fold cross-validation is often used in practice (to obtain a valid confidence interval).


Progressive Validation

Another method used to improve the holdout estimate is progressive validation. The generalization error estimate is as good as that of a single holdout; however, progressive validation uses only half the number of examples, on average, for training. Progressive validation is performed by splitting the dataset into two parts, training the classifier on the first part, and testing a single instance from the second part. This instance is then added to the training set, and the procedure continues for each instance in the second part.


Bootstrap Validation

The bootstrap method samples the dataset with replacement, generating a training set the size of the original dataset. Any examples not used in the training set are used for testing, and this procedure is repeated. Approximately 1/3 of the examples are left out of any given bootstrap round and used for testing. In many cases bootstrap is superior to cross-validation; however, there seem to be just as many cases where cross-validation is superior.
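
The 1/3 figure comes from a short calculation: the chance that a given example is never drawn in n samples with replacement is

\[ \left(1 - \tfrac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368. \]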

Kohavi, Ron. "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection." Paper presented at the International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995.

Blum, Avrim, Adam Kalai, and John Langford. "Beating the Hold-Out: Bounds for K-Fold and Progressive Cross-Validation." Paper presented at the Twelfth Annual Conference on Computational Learning Theory, Santa Cruz, California, 1999.

Efron, Bradley. "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation." Journal of the American Statistical Association 78, no. 382 (1983): 316-31.



Metrics

A scalar metric can easily be tabulated in terms of successes, failures, and draws. Keep in mind, however, the circumstances that a single scalar represents.

Threshold

A threshold metric reflects the performance of a classifier at a particular threshold, usually 0. These metrics are derived from a confusion matrix (or contingency table), which counts the number of correct/incorrect predictions for a particular class. For two classes:

  • True positive (TP): a prediction that is positive and correct
  • True negative (TN): a prediction that is negative and correct
  • False positive (FP): a prediction that is positive and incorrect
  • False negative (FN): a prediction that is negative and incorrect
                    Predicted Positive   Predicted Negative
  Actual Positive          TP                   FN
  Actual Negative          FP                   TN

Accuracy (ACC)

The accuracy measures the ratio of correct predictions to the total number of predictions: P(ŷ = y).
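
In terms of the confusion matrix counts above:

\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \]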

Sensitivity (SEN)

The sensitivity measures the proportion of positive cases correctly predicted as positive. In terms of conditional probability, sensitivity is the probability a case will be predicted positive given the case is positive. In information retrieval, sensitivity is known as recall and measures the fraction of relevant material returned by a search. It is also known as the true positive rate (TPR).
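
Equivalently:

\[ \mathrm{SEN} = \frac{TP}{TP + FN} = P(\hat{y} = 1 \mid y = 1) \]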

Specificity (SPE)

The specificity measures the proportion of negative cases correctly predicted as negative. In terms of conditional probability, specificity is the probability a case will be predicted negative given the case is negative. A high specificity means a low Type I error.
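
In confusion-matrix terms:

\[ \mathrm{SPE} = \frac{TN}{TN + FP} = P(\hat{y} = 0 \mid y = 0) \]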

Positive Predictive Value (PPV)

The positive predictive value measures the proportion of cases predicted positive which are correctly predicted. In terms of conditional probability, it is the probability that a case is truly positive given it is predicted positive. In terms of information retrieval, positive predictive value is known as precision and measures the fraction of documents returned that are relevant.
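
That is:

\[ \mathrm{PPV} = \frac{TP}{TP + FP} \]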

Negative Predictive Value (NPV)

The negative predictive value measures the proportion of cases predicted negative which are correctly predicted. In terms of conditional probability, it is the probability that a case is truly negative given it is predicted negative.
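
That is:

\[ \mathrm{NPV} = \frac{TN}{TN + FN} \]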

False Positive Rate (FPR)

The false positive rate measures the proportion of negative cases that are incorrectly predicted positive. It is also known as the Type I error, i.e., the error of rejecting a hypothesis that should have been accepted.
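
In confusion-matrix terms:

\[ \mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \mathrm{SPE} \]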

Lift (LFT)

The lift measures the accuracy of prediction for the top p% of predictions; here, it measures the accuracy of the top 25%. This is a typical measure in database marketing.
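
The source gives no formula, but a common formulation (stated here as an assumption) is the precision within the top p% of predictions relative to the overall fraction of positives, with N the total number of cases:

\[ \mathrm{LFT} = \frac{\mathrm{PPV}_{\text{top } p\%}}{(TP + FN)/N} \]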

F-score (FSC)

The f-score measures the weighted harmonic mean of precision and recall. This is a common metric in information retrieval.
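
With equal weights on precision (PPV) and recall (SEN), this is:

\[ \mathrm{FSC} = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{SEN}}{\mathrm{PPV} + \mathrm{SEN}} \]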

Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient measures the accuracy of prediction, where a predictor that is always right has the value 1 and a predictor that is always wrong has the value -1. A random guess will have a value of 0.
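
In confusion-matrix terms:

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \]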

Ranking/Ordering

Ranking (ordering) metrics measure the classifier's ability to correctly order predictions. These metrics do not depend on the relative values of the predictions or the threshold of prediction.

Area Under the ROC Curve (AUR)

A ROC plot compares the true positive rate vs. the false positive rate as the threshold is swept from 0 to 1. An area under the ROC of 1 means a perfect prediction, and 0.5 a random guess. Another interpretation of the area under the ROC is that it measures how badly sorted a set of predictions is in terms of the class value. That is, it measures the number of swaps needed to repair the sort, normalized by the number of positive cases times the number of negative cases.
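
Equivalently, with P positive cases, N negative cases, and prediction scores s_i (ignoring ties), the AUR is the probability that a randomly chosen positive case is ranked above a randomly chosen negative case:

\[ \mathrm{AUR} = \frac{1}{PN} \sum_{i : y_i = 1} \; \sum_{j : y_j = 0} \mathbf{1}[s_i > s_j] \]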

Area Under the Precision Recall Curve (AUP)

The precision recall curve poses an alternative for tasks with a large skew in the class distribution and is often used in information retrieval. A classifier that is near optimal in ROC space may not be optimal in precision/recall space.

Break Even Point (BEP)

The breakeven point is defined as the point where the precision and recall are equal. This is a common metric in information retrieval.

Area Under the Cost Curve (AUC)

The area under the cost curve measures the expected cost of a classifier, assuming all possible probability-cost values are equally likely. A lower value is better.

Probability/Regression

The probability (regression) metrics measure how closely the predicted values match the target values. The target values T are chosen to be 0 or 1, depending on the case.

Root Mean Squared Error (RMSE)

The root mean squared error measures how closely predicted values match target values: the sum of squared errors divided by the number of cases, under a square root.
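
That is:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \]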

Mean Cross Entropy (CXE)

The cross entropy (log loss) measures how closely predicted values match target values; it assumes each predicted value is a probability in [0,1] that the class of the example is 1. The cross entropy is normalized by the size of the dataset, giving the mean cross entropy. Note that the cross entropy is infinite when a prediction of exactly 0 or 1 turns out to be wrong.
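
That is, with predicted probabilities ŷ_i and targets y_i in {0,1}:

\[ \mathrm{CXE} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \bigr] \]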

Joachims, Thorsten. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features." Universität Dortmund, 1997.

Matthews, B. W. "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme." Biochimica et biophysica acta 405, no. 2 (1975): 442–51.



Graphical Measures

A single scalar cannot account for both types of error that a binary classifier may make. Moreover, a scalar cannot account for other circumstances under which one classifier is superior to another; graphical measures (curves) can. There are several methods to generate such curves. The method used below sweeps the threshold from 0 to 1 for a classifier with real-valued outputs.

ROC Curves

The receiver operating characteristic (ROC) curve depicts the performance of a classifier in terms of the true positive rate (sensitivity) and the false positive rate (1 - specificity). The closer the curve is to the left side and the top, the better the classifier.

The ROC Convex Hull: Tom Fawcett

Precision Recall Curves

The precision recall curve depicts the performance of a classifier in terms of precision (y-axis) and recall (x-axis). The precision recall curve poses an alternative for tasks with a large skew in the class distribution and is often used in information retrieval. A classifier that is near optimal in ROC space may not be optimal in precision/recall space. An optimal precision recall curve will be in the upper right-hand corner.

Sometimes you need to see the precision/recall curve with the classes reversed. The curve shown below has the negative predictive value on the y-axis and the specificity on the x-axis.

Cost Curves

The cost curve depicts the performance of a classifier in terms of expected cost (or error rate) over the full range of class distributions and misclassification costs. The x-axis measures the class distribution (the probability-cost value) and the y-axis measures the normalized expected cost (or error). A single cost line corresponds to a point in ROC space: the left y-intercept measures the false positive rate, and the right y-intercept measures one minus the true positive rate. The combination of the cost lines forms the lower envelope. One classifier is superior to another classifier if its lower envelope is lower than the other classifier's.

Using the above cost curve, it would be hard to compare classifiers. Note the white space formed below the lines; its boundary forms a curve called the lower envelope. The lower envelope captures much of the information of the cost curve while allowing intuitive comparison, like a ROC curve.

Another problem with the above cost curve is that it is hard to see individual cost lines. The plot below shows the lower envelope with some important cost lines. Likewise, it is easier to see over what range of the lower envelope this classifier outperforms the trivial classifiers (0,0), (1,1) and (1,0),(0,1).

Lift Curves

The lift curve depicts the performance of a classifier's probability estimates in terms of lift (y-axis) and the probability of predicting positive (x-axis).

Davis, Jesse, and Mark Goadrich. "The Relationship between Precision-Recall and ROC Curves." Paper presented at the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, 2006.

Peterson, W.W., T.G. Birdsall, and W.C. Fox. "The Theory of Signal Detectability." Transactions of the IRE Professional Group on Information Theory 2, no. 4 (1954): 171-212.

Drummond, Chris, and Robert C. Holte. "Cost Curves: An Improved Method for Visualizing Classifier Performance." Machine Learning 65, no. 1 (2006): 95-130.


Thursday, March 4, 2010

PDF to Word Document

Case: I wrote a journal paper in LaTeX; the output is a PDF. It was rejected, and I want to send it to another publisher, but they don't accept LaTeX, only PDF or .doc, and they limit papers to 20 pages. Mine is 24 pages, and there is a lot of blank space in the paper (especially on the figure & table pages) because of the LaTeX formatting. I'm thinking of redoing it in Word so I can adjust things manually, which is easier. But I don't want to copy and reformat everything, which is very time-consuming.

Solution: The best found so far: PDF to Word Converter
1. Upload the PDF file
2. Fill in your email
3. They will email you the converted file. Allow some time for the queuing and converting process (~4 hours in my case); it's worth the wait.


Monday, January 4, 2010

Face Dataset

Face Recognition Homepage - Databases
http://www.face-rec.org/databases/

LFWcrop Face Dataset
http://www.itee.uq.edu.au/~conrad/lfwcrop/

Psychological Image Collection at Stirling (PICS) Image Database
http://pics.psych.stir.ac.uk/cgi-bin/PICS/New/pics.cgi

Resources for Face Detection
http://vision.ai.uiuc.edu/mhyang/face-detection-survey.html


Simple steps for nose and mouth detection (a rough code sketch follows the steps below):
Source: From here (credits to Alan Balkany)
1. Use your existing algorithm to get the bounds of the face.
2. Replace each pixel with a number proportional to the variance in the neighborhood of that pixel. (There will be more variance in the regions of the eyes, mouth, and around the nose.)
3. Use the Open operation (Erode morphological operator followed by Dilate) to eliminate stray isolated pixels with high variance.
4. Use the Close operation (Dilate morphological operator followed by Erode) to fill in the gaps.
5. You should have four regions remaining: Two eyes, nose, and mouth.
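
A minimal sketch of steps 2-5 in Python with OpenCV; the library choice, window size, threshold, and kernel size are all my assumptions (the source names no tools), and "face.png" stands in for a face crop produced by your existing detector in step 1:

import cv2
import numpy as np

# Hypothetical face crop from step 1.
face = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Step 2: local variance map, Var = E[x^2] - (E[x])^2 over a small neighborhood.
k = (7, 7)
mean = cv2.blur(face, k)
mean_of_sq = cv2.blur(face * face, k)
variance = mean_of_sq - mean * mean

# Keep only pixels with unusually high variance (eyes, nose, mouth regions).
_, mask = cv2.threshold(variance, variance.mean() + variance.std(), 255,
                        cv2.THRESH_BINARY)
mask = mask.astype(np.uint8)

# Step 3: Open (erode then dilate) removes stray isolated high-variance pixels.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

# Step 4: Close (dilate then erode) fills gaps inside the remaining regions.
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

# Step 5: the remaining connected components should include the two eyes,
# the nose, and the mouth (label 0 is the background).
num_regions, labels = cv2.connectedComponents(mask)
print(num_regions - 1, "candidate regions")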