I want to sing like the birds sing, not worrying about who hears or what they think. Rumi

Uploaded by
Rumi

119 downloads 3087 Views 441KB Size

Great leaders are authentic leaders

Seek knowledge from cradle to the grave. Prophet Muhammad (Peace be upon him)

Social and Emotional Learning in Out-of-School Time Settings

It always seems impossible until it is done. Nelson Mandela

Future Leaders

How wonderful it is that nobody need wait a single moment before starting to improve the world. Anne

Sustainability Leaders

How wonderful it is that nobody need wait a single moment before starting to improve the world. Anne

Out of the Alps

The best time to plant a tree was 20 years ago. The second best time is now. Chinese Proverb

Imaginaries Out of Place

The greatest of richness is the richness of the soul. Prophet Muhammad (Peace be upon him)

OUT OF THE DUST

I tried to make sense of the Four Books, until love arrived, and it all became a single syllable. Yunus

Learning Out of Leaders Gerard Kerkyacharian, Mathilde Mougeot, Dominique Picard, Karine Tribouley

To cite this version: Gerard Kerkyacharian, Mathilde Mougeot, Dominique Picard, Karine Tribouley. Learning Out of Leaders. 2010.

HAL Id: hal-00445690 https://hal.archives-ouvertes.fr/hal-00445690 Submitted on 11 Jan 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Learning Out of Leaders G´ erard KERKYACHARIANa , Mathilde MOUGEOTa , Dominique PICARDa and Karine TRIBOULEYb a

Universit´e Paris-Diderot, CNRS LPMA, 175 rue du Chevaleret 75013 Paris, France kerkmath.jussieu.fr, mougeotmath.jussieu.fr, picardmath.jussieu.fr b

Universit´e Nanterre Paris Ouest, MODALX & LPMA Rue de la r´epublique 92001 Nanterre, France karine.tribouleyu-paris10.fr

R´ esum´ e This paper investigates the problem of selection and estimation in a high dimensional regression-type model. We propose a procedure with no optimization called LOL, for Learning Out of Leaders. LOL is an auto-driven algorithm with two thresholding steps. A first adaptive thresholding helps to select leaders among the initial regressors in such a way to reduce the dimensionality. Then a second thresholding follows the estimations and predictions performed by linear regression on the leaders. Theoretical results are proved. As an estimation procedure, LOL is optimal since the upper exponential bounds are achieved. Rates of convergence are provided and show that LOL is also consistent as a selection procedure. An extensive computational experiment is conducted to emphasize the practical good performances of LOL.

1

Introduction The general linear model is considered in this paper, with a focus on cases where the number

p of regressors is large compared to the number n of the observations (although there is no such restrictions). These type of models have lots of practical applications in many areas of science and engineering including collaborative filtering, machine learning, control, remote sensing, and computer vision just to name a few. Examples in statistical signal processing and nonparametric estimation include the recovery of a continuous-time curve or surface from a finite number of noisy samples. Other interesting fields of application are radiology and biomedical imaging when fewer measurements about an image are available compared to the unknown number of pixels collected. In biostatistics, high dimensional data frequently arise in genomics to study gene expression given a huge number of initial genes and a relatively low number of observations. 1

A considerable amount of work have been produced in this domain in the last years, which has been a large source of inspiration for this paper. We have especially considered the algorithms coming from the learning framework ([Barron et al., 2008], [Binev et al., 2005], [Binev et al., 2007a], [Binev et al., 2007b]), as well as the extraordinary explosive domain of ℓ1 penalties (among many others [Tibshirani, 1996], [Candes and Tao, 2007], [Bickel et al., 2007], [Bunea et al., 2007a], [Bunea et al., 2007b], [Fan and Lv, 2008] and [Cand`es and Plan, 2008]. See also [Lounici, 2008] and [Alquier and Hebiri, 2009] ). The essential motivation of this work is to provide one of the simplest procedures which achieves, in the same time, good performances. LOL algorithm (for Learning Out of Leaders) consists in a two steps thresholding procedure. As we do not perform any optimization step, it is important to address in which domains the procedure is competitive to more sophisticated algorithms and especially to algorithms performing a one or two steps ℓ1 minimization. One of our aim is to delimit where LOL is performant and where its simplicity induces a slight lack of efficiency from a theoretical point of view as from a practical aspect. Let us start by introducing the ideas of the emergence of LOL algorithm. This simple procedure can be viewed as an ’explanation’ or as a ’cartoon’ of ℓ1 minimizations. It is well known that when the regressors are exactly normalized and orthogonal, ℓ1 minimization corresponds to soft thresholding which itself is close to hard thresholding. Hence, it is quite natural to expect that thresholding should perform well, at least in cases not so far from these orthonormal conditions which correspond, as noted below, to small coherence conditions. A tricky problem occurs when the regressors are not orthonormal or when the number of regressors is large. Then, the minimum least squares estimator has a non unique solution and is very unstable. This stays the main difficulty for the ℓ1 minimizers or more generally for all methods based on sparsity assumptions. Moreover, this is the part of the algorithms where the computation cost shows up. Obviously a simple thresholding would not fit, but assuming some sparsity conditions, in this case, ensures that it is possible to choose some regressors and exclude some others. LOL algorithm solves the difficult problem of the choice of the regressors in a quite crude way by adaptively selecting N regressors which are the most correlated to the target : this defines the first step thresholding of LOL, determining the N leaders. The number N is chosen using a fine tuning depending on the coherence, and it has to be emphasized that the choice is auto driven. In a second thresholding step, LOL regresses on the previous leaders and thresholds the result to take into account the noise of the model. Properties of LOL procedure are investigated through two different points of view : the prediction problem and the estimation problem. More precisely, it is established that LOL procedure has a prediction error which is going to zero in probability with exponential rates. These types of

2

results are often called Bahadur type efficiency. Although Bahadur efficiency of test and estimation procedures goes back to the sixties (see [Bahadur, 1960]), it has seen recently a revival in learning theory, where the rates of convergence (preferably exponential) of being at some fixed distance of the target are investigated and compared to optimality. This is also the connection to learning theory which guides here the choice to measure LOL performances as the mean of the empirical quadratic distance between the observations and the predicted values. We also establish that LOL procedure works quite well regarding the detection since the number of false negative as well as false positive are going to zero in probability with pretty fast rates. Of course, because of the simplicity of the method, some loss of efficiency can be expected compare to more elaborate and costly procedures. But even when there is a loss, the limitations of the procedure could be an interesting information on the ℓ1 minimizers themselves. From both a theoretical and a practical point of view, when the coherence is small, LOL procedure is as powerful as the best procedures. Also when there is a loss in the rate, a positive aspect of the method is that the practitioner is informed of the possible instability since the coherence is provided by the observations. An intensive calculation program is performed to show the advantages and the limitations of LOL procedure in several practical aspects. In Section 6, the case where the regressors are forming a random design matrix with i.i.d. entries is investigated. Different laws of the entries are considered (Gaussian, Uniform, Bernoulli or Student laws) inducing specific coherence for the design matrix. Several interesting features are discussed in this section. The impact of the sparsity and the undetermination of the regression on LOL performances are studied. A comparison with two others two-step procedures namely [Fan and Lv, 2008] and [Cand`es and Plan, 2008] is also provided and shows the additional benefits brought by LOL. The most interesting conclusion being that the practical results are even better and more comforting than the theoretical ones in the sense that even when the coherence is pretty high, LOL procedure shows good performances. The paper is organized as follows. In Section 2, the general model and the notations are presented. In Section 3, LOL procedure is detailed as other procedures with a ℓ1 optimization step ; practical comparisons with other procedures are later discussed in Section 5. In Section 4, after stating the hypotheses needed in the model, theoretical results are established. Practical performances of the LOL procedure are investigated in Section 6 and the proofs are detailed in Section ??.

2

Description of the models In this part, the model of interest is presented with a focus on two specific cases : the random

matrices design and the functional regression.

3

2.1

General model

A Gaussian (or sub-gaussian) high dimensional linear model is here considered and more precisely data Y = (Y1 , . . . , Yn )t are observed coming from the following regression model Y = Φα + u + ε

(1)

where the parameter α ∈ Rp is the unknown vector to be estimated and

– the vector ε = (ε1 , . . . , εn )t is a (non observed) vector of random errors. It is assumed to be independent Gaussian variables N (0, σ 2 ) but essentially comparable results can be obtained in the case of zero mean subgaussian errors (see the remark before Lemma ??). – the vector u = (u1 , . . . , un )t is a non observed vector of (possibly) random errors. Its amplitude is assumed to be small. The differences between the two previously described ”errors” lies in the fact that the εi ’s are centered but unbounded and independent, while the ui ’s are only bounded. The importance of introducing these two types of errors becomes clear in the functional regression example (see section 2.3). – Φ is a n × p known matrix. This paper focuses on the interesting case where p >> n but it is not necessary. We assume that Φ has normalized columns (or normalize them) in the following sense :

n

1X 2 Φiℓ = 1, n i=1

2.2

∀ ℓ = 1 . . . , p.

(2)

Coherence

The following Gram p × p matrix is M :=

1 t Φ Φ. n

The quantity

n

1X Φiℓ Φim | τn = sup |Mℓm | = sup | ℓ6=m ℓ6=m n i=1

is called the coherence of the matrix M . This quantity is important because it induces a bound on the size of the invertible matrices built with the columns of M . More precisely, fix 0 < ν < 1 and let C be a subset of indices of {1, . . . , p} with cardinality m. Denote ΦC the matrix restricted to the columns of Φ whose indices are in C. If 2τn ≤ ν, the associated Gram matrix M (C) :=

4

1 t Φ ΦC n C

is almost diagonal as soon as m is smaller than ⌊ν/τn ⌋ in the sense that it satisfies the following so called Restricted Isometry Property (RIP).

∀x ∈ Rm , kxk2l2 (m) (1 − ν) ≤ xt M(C)x ≤ kxk2l2 (m) (1 + ν),

(3)

This proves in particular that the matrix M (C) is invertible.

2.3

Models of interest

Although these results apply in the general case, two typical cases of applications are especially considered. The first application concerns a random matrix Φ composed of n independent random vectors of size p. The important role played by the distribution of these random vectors is detailed in the simulation study, Section 6. The second application is the learning (also called functional regression) framework Yi = f (Xi ) + εi , i = 1 . . . n

(4)

where f is the functional parameter of interest to be estimated. The Xi′ s are i.i.d. random variables living in a compact domain of Rd . The errors ε′i s, are i.i.d. standard Gaussian random variables and independent of the Xi′ s (or centered sub-gaussian variables). ρ denotes the common (unknown) law of the (1 + d)−vectors Zi = (Xi , Yi )′ s. To relate this framework to our model, let us consider a dictionary D of size p, of real functions

defined on Rd . Assume that f can be reasonably well approximated using the elements of the dictionary which means that there exists a sequence {αg , g ∈ D} such that X f= αg g + h g∈D

where h is hopefully small. Then the regression model becomes X Yi = αg g(Xi ) + h(Xi ) + εi , i = 1, . . . , n g∈D

which coincides with the general model Y = Φα + u + ε, setting ui = h(Xi ) for any i = 1, . . . , n and Φ being the matrix with general terms Φiℓ = gℓ (Xi ) (after choosing an enumeration of D). Again, the dictionary has to be normalized and (2) translates

here as

n

1X 2 g (Xi ) = 1, ∀ g ∈ D. n i=1

5

3

Estimation procedures As explained in the introduction, the essential motivation of this work is to provide one of

the simplest procedures, finding its inspiration among a lot of works around the same theme. First, the estimation of the unknown parameter α using LOL is described. The procedure has the particularity to perform a selection method of the regressors in the same time. Next, a short review on the procedures directly connected to LOL is proposed. Once for all, the constant ν is fixed. This constant will obviously be related to the precision of LOL main procedure (for instance ν = 1/2 can be taken as default value).

3.1

LOL Procedure

Once τn (or a bound for τn ) is evaluated and N = ⌊ν/τn ⌋ is computed, LOL procedure has

three successive steps : Find N leaders, Regress on the leaders, Threshold. 1. Find the leaders : • For some constant T1 > 0, fix a threshold λn (1) = T1

• Compute the ’correlations’

log p n

1/2

∨ τn

!

.

(5)

n

Kℓ = |

1X Φiℓ Yi | n i=1

and consider the ordered sequence K(1) ≥ K(2) ≥ . . . ≥ K(N ) of the N largest, and the

associated set of indices K = {κ(1) , κ(2) , . . . , κ(N ) }.

• The final set of the leaders is defined by the following set of columns Φℓ of the matrix Φ : B = {Φℓ , ℓ ∈ K and Kℓ ≥ λn (1)} and B denotes the set of their indices (which might of course be different from K). It is clear

from this construction that N appears as a bound for the number of leaders (equal to the cardinal of B). 2. Regress on the leaders : • Consider the pseudo-regression model : Yi =

X

Φiℓ αℓ + ei

ℓ∈B

and define the extracted matrix ΦB by (ΦB )ℓ, i = Φiℓ

for any ℓ ∈ B and i ∈ {1, . . . , n}. 6

(6)

• Let α b(B) = (c αℓ (B), ℓ ∈ B) be the minimum least square error in this model : ! n X X α ˆ (B) = Arg min (Yi − Φiℓ αℓ )2 = (ΦtB ΦB )−1 ΦtB Y . α=(αℓ )ℓ∈B

i=1

ℓ∈B

• Define the vector α ˆ of Rp by

3. Threshold :

αℓ := c

cℓ (B) α 0

if ℓ ∈ B if ℓ ∈ 6 B

For some constant T2 > 0, fix a threshold λn (2) = T2

log n n

1/2

(7)

and threshold again the estimated coefficients to obtain the final predictor α ˆ ∗ whose coordinates are α ˆ ∗ℓ = c αℓ I{|c αℓ | ≥ λn (2)}.

The selected regressors are then the columns of Φ whose indices belong to L = {ℓ = 1, . . . , p, α cℓ ∗ 6= 0}

Notice that the formula (5) and (7) are the ’default’ values for the tuning sequences λn (1) and λn (2) given for the procedure. However, the presentation as well as the theoretical results in sequel are given for arbitrary sequences λn (1) and λn (2).

3.2

Several inspirations

Although it is impossible to be exhaustive in such a productive domain, some of the works directly in relation to our construction are hereafter mentioned. We apologize in advance for all the works that are not mentioned but still in connection. For a comprehensive overview, we refer to [Fan and Lv, 2009]. In the context of the learning theory (second application), various methods are already been proposed, including kernel methods and search within dictionaries. Let us especially mention following works providing greedy algorithms [Barron et al., 2008], or adjusting tree algorithms [Binev et al., 2005], [Binev et al., 2007a]. A one step algorithm rough version of LOL is given in [Kerkyacharian and Picard, 2007] as well as in [Kerkyacharian et al., 2009] for the case p ≤ n. [Bunea, 2009] also proposes an estimation procedure based on the lasso and derives a selection

procedure by keeping the non zero estimated coefficients. 7

In the context of the linear regression (first application), several authors propose procedures to solve the selection problem and the estimation problem in the case where the vector α has only a small number of non zero components, and (often) when the design matrix Φ is composed of i.i.d. random vectors : see among many others [Tibshirani, 1996], [Candes and Tao, 2007], [Bickel et al., 2007], [Bunea et al., 2007a] and [Bunea et al., 2007b]. We especially refer to the 2-steps procedures which are also commonly used. Apparently, as soon as in 1959 such a procedure is already discussed (see [Satterthwaite, 1959]). In [Candes and Tao, 2007] and [Cand`es and Plan, 2008], the leaders are selected with (respectively) the Danzig procedure and the lasso procedure. Then, the estimated coefficients are obtained via a linear regression on the leaders. Using an intensive simulation program, [Fan and Lv, 2008] show that it could be unfavorable to use the procedures lasso or Danzig before the reduction of the dimension. They also provide a search among leaders called Sure Independence Screening (SIS) procedure. This procedure is very close to the one discussed in this paper : the leaders are the N = ⌊γn n⌋ columns of Φ with largest

correlations to the target variable Y (γn is a tuning sequence tending to zero). This step is followed

with a subsequent estimation procedure using Danzig or lasso. All these methods focus on the complexity of the algorithms.

4

Main theoretical results This section states the theoretical results of the procedure LOL. First, the assumptions on the

model are described. Next, the quantities allowing to measure the performances of the procedures are defined. The consistency of LOL is shown using two different points of view : the prediction problem and the estimation problem.

4.1

Sparsity conditions on the model

Recall that the model specifies a gaussian (or sub-gaussian) observation of the following form : Y = Φα + u + ε. The following sparsity conditions are assumed. There exist S ≤ N and constants

8

M, c1 , ct , c′t , c0 , such that the sequences {αℓ }ℓ≤p and {ui }i≤n satisfy the following conditions 1/2 S (8) sup |ui | ≤ c1 n i=1,...,n p X ℓ=1

|αℓ | ≤ M,

(9)

# {ℓ ∈ {1, . . . , p}, |αℓ | ≥ λn (2)/2} ≤ S X S 1/2 |α(ℓ) | ≤ ct nτn

(10) (11)

(ℓ)>N

X

(ℓ)>N p X ℓ=1

|α(ℓ) |2 ≤ c′2 t

S n

(12)

|αℓ |2 I{|αℓ | ≤ 2λn (1)} ≤ c20

S n

(13)

Recall that (α(ℓ ) ) is the ordered sequence (for the modulus) |α(1) | ≥ |α(2) | ≥ . . . |α(p) |. For S, M > 0,

V (S, M ) denotes the class of models of type (1) satisfying the sparsity conditions (9), (8), (10), (11), (13). A very important example of such a class occurs when all the coefficients of α are 0 except S coefficients (with S ≤ N ) with a modulus greater that λn (2)/2 but bounded : Spars(S, M ) denotes such a class.

The conditions (9)–(13) are also satisfied if the lq conditions are assumed, as in [Raskutti et al., 2009] which provide upper and lower bounds. More precisely, for q ∈ (0, 1], define the lq -balls as the sets p

Bq (M ) := {α ∈ R ,

p X j=1

|αj |q ≤ M q }.

(14)

It is not difficult to prove that if α belongs to Bq (M ) then (9)–(13) are verified for S ≥ λn (2)−q ∨ nλn (1)2−q ∨ nτn(2−q)/q . In particular, in order to compare our results to the lower bounds q in[Raskutti et al., 2009], it is log p important to verify that the conditions are verified for τn = O and for the defaults values n for λn (1) and λn (2). In this precise case, this means that S/n has to be of order τn2−q .

In the context of the learning theory (second application), the sparsity conditions are required on the target function f . The above assumptions are easily translated by replacing the condition (8) by the following one : khk2∞

1/2 S ≤ c1 . n

The other conditions are quite usual in functional analysis and relate to Lorentz spaces. 9

4.2

Measures of performances

First, let us define loss functions to measure the difference between the true value α ∈ Rp and

the result α ˆ ∗ of LOL procedure. Denote Φi the i−th line of the matrix Φ and recall that the i−th observation is given by the model Yi = Φi α + ui + εi .

The predicted i−th observation is Ybi = Φi α ˆ ∗ . The empirical quadratic distance between the predicted observations and the expected value is here considered

p n n 2 1X X ∗ 1 Xb Yi − EYi = d(ˆ α∗ , α)2 = (ˆ αℓ − αℓ )Φiℓ + ui n n i=1

i=1

ℓ=1

!2

.

(15)

Notice that in the functional regression case, this error coincides with the L2 error with respect to the empirical measure

1X δXi n where δx denotes the Dirac measure at point x. Indeed, we get ρˆ =

n 2 1 X ˆ f (Xi ) − f (Xi ) = kfˆ − f k2ρˆ. d(ˆ α , α) = n ∗

2

i=1

With a slight abuse of notations, we also write the distance defined in (15) in the general model d(ˆ α∗ , α) := k

p X (ˆ α∗ℓ − αℓ )Φ•ℓ + u• kρˆ l=1

where Φ•ℓ is the ℓ−th row of Φ. The first measure of performance under consideration is issued from the Bahadur efficiency of test and estimation procedures and is defined for any tolerance η > 0 as AC n (LOL, η) = P (d(ˆ α∗ , α) > η) .

(16)

Obviously, if the tolerance is low (smaller than a critical value ηn ), this quantity is large. In the opposite, for η ≥ ηn , the quality of the procedure is given by the rate of convergence of AC n (LOL, η)

towards zero. Observe that the value of the critical value ηn is essential since it yields, as a conse-

quence, bounds for Ed(ˆ α∗ , α) which is another (more standard) measure of performance of the procedure. More generally, in the learning framework, given priors Θ on the class of probability distributions generating the observations, it has been defined in [DeVore et al., 2006] the accuracy confidence function of the procedure fˆ : AC n (Θ, fˆ, η) := sup ρ⊗n {kf − fˆkρX > η}. ρ∈Θ

10

(17)

This quantity measures a uniform confidence (over the class Θ) that the estimator fˆ is accurate to the tolerance η. In most examples, there exist a phase transition and a critical value ηn depending on n and Θ such that AC n (Θ, fˆ, η) decreases exponentially for any η > ηn . More precisely, in terms of lower bound, it is proved in [DeVore et al., 2006] q ¯ (Θ, η)e−cnη2 , inf AC n (Θ, fˆ, η) ≥ C N fˆ

(18)

¯ (Θ, η) is the tight entropy analogue of the Sobolev covering numbers. The results in where N [DeVore et al., 2006] are obtained in the learning framework ; however identical bounds can eap sily be obtained in the setting (1) of this paper, leading to ηn = O( S/n).

If the focus is made on the case where α ∈ Spars(S, M ), it could be interesting to adopt the

point of view of the ”detection” instead of the ”prediction”. Two quantities become then crucial in view to measure the ”similarity” between the true value and its estimator. The number of False Positive decisions (FP) and the number of False Negative decisions (FN) are given by F P :=

p X ℓ=1

I{αℓ = 0}I{ˆ α∗ℓ 6= 0}

and

F N :=

p X ℓ=1

I{αℓ 6= 0}I{ˆ α∗ℓ = 0}.

In order to evaluate the performances of LOL selection procedure using these distances between α and α ˆ , the quantity P (F P > pη) + P (F N > pη) for η ≥ 0 is studied. A selection procedure is said

consistent if P ( {ℓ, αℓ 6= 0} = {ℓ, α ˆ ∗ℓ 6= 0}

4.3

) is tending towards 1.

Performances of the procedure LOL

The performances of the LOL procedure are summarized in the following theorems. In Theorem 1, we establish that LOL procedure is a good procedure for estimation since the prediction error is going to zero in probability with exponential rates. Indeed, the LOL estimator is optimal (up to p a logarithmic factor) in terms of the critical value ηn ∼ S/n, as well as in terms of exponential

rates if the coherence is small enough (see the discussion below). In Theorem 2, we establish that

LOL procedure works also quite well for detection since quantities F N and F P are going to zero in probability with pretty fast rates. Theorem 1. Let S, M > 0 and fix ν in ]0, 1[. Suppose p ≤ na , for some constant a > 0 and choose the thresholds λn (1) and λn (2) such that λn (1) ≥

T11

log p n

1/2

∨ T12 τn

11

!

and

λn (2) ≤ λn (1)

√ √ 1/2 for T11 = 16 2σ 2 /(1 + ν) and T12 = M (1−ν) 2 . Then, the model is of class V (S, M ) ∨ 4 4 defined above, there exist positive constants D and γ, such that S 2 ≥D −γnη2 for η 4e n ∨ sup P (d(ˆ α∗ , α) > η) ≤

V (S,M )

1

for

η2

≤D

S n

∨

S| log τn | n

∨ Sτn2 ,

S| log τn | n

Sτn2

∨

(19)

Observe that the result given in Theorem 1 is concerning LOL procedures associated with more general thresholds than λn (1), λn (2) than those prescribed in (5) and (7). It is interesting to notice the very few conditions on the threshold λn (2) (λn (2) ≤ λn (1) and Condition (10) relating to the considered set of α’s ).

The constants D and γ are precisely given at the end of the proof of Theorem 1. For a sake of completeness, precision on the constants is given. However, it is obvious that the constants provided here are not optimal : for instance in the proof, in order to avoid unnecessary technicalities, most of the events are divided as if they had equal importance, leading to constants which are each time divided by 2. Obviously there is room for improvement at any of these stages. An elementary consequence of Theorem 1 is the following corollary which details the behavior of the expectation of d(ˆ α∗ , α). Notice also that we did not give here explicite oracle inequalities, which however could be derived from the proof of Theorem 1. Corollary 1. For r ≥ 1 arbitrary, under the same assumptions as in Theorem 1, we get r/2 S| log τn | ∗ r ′ S 2 sup Ed(ˆ α , α) ≤ D ∨ ∨ Sτn n n V (S,M ) for some positive constant D ′ . Notice that in the case of the lq balls Bq (M ) for q ∈ (0, 1]) (see (14)) and taking the defaults

values for λn (1) and λn (2), LOL procedure has optimal rates in the minimax sense (compare the q log p . upper bound to the lower bounds in [Raskutti et al., 2009]) as soon as τn = O n Let us now focus on the selection point of view. As usual, an additional assumption is needed

on the non zero coefficients : they have to be large enough to be detected. Theorem 2 establishes that LOL procedure is consistent as a selection procedure. Theorem 2. Let k be a given positive number. Let S, M > 0 and fix ν in ]0, 1[. Suppose p ≤ na ,

for some constant a > 0, choose λn (1) ≥ λn (2) and assume that the model is of class Spars(S, M ) described above, then

– False Positive : Assume that min |αℓ |I{αℓ 6= 0} ≥ µn

ℓ=1,...,p

12

where µn satisfies µn = T3

λn (2) ∨ τn

r

S ∨ k

r

S ∨ nk

r

S| log τn | nk

!

where T3 is a constant large enough. Then there exists a constant c > 0 such that P (F P > k) ≤ c exp{−c knλ2n (2)}. – False Negative : Choose the thresholds such that λn (1) ≥

T11

log p n

1/2

∨ T12 τn

!

where the constants T11 , T12 defined as in Theorem 1 and √ (1 + ν)1/2 λn (2) ≥ σ 32 c′1 c′ ∨ 256c1 (1 − ν)1/2

!r

S . nk

There exists some constant c > 0 such that P (F N > k) ≤ c exp{−c knλ2n (2)} As for Theorem 1, Theorem 2 states for general thresholds λn (1), λn (2) (which are valid for (5) and (7) but also more widely). Observe that the choice of λn (2) is crucial from a detection point of view. For the specific choices (5) and (7), we get Corollary 2. Assume that min |αℓ | I{αℓ 6= 0} ≥ O

ℓ=1,...,p

log n √ n

.

Let S, M > 0 and fix ν in ]0, 1[. Suppose p ≤ na , for some constant a > 0 and assume that the

model is of class Spars(S, M ) described above. The LOL procedure the specific choices (5) and (7)

satisfies ′

P (F N + F P > k) ≤ c′ n−c k . for k larger than O(S/ log n). Note that LOL procedure works better and better as S gets smaller, as it is confirmed by the practical simulations.

13

5

Discussion and Comparisons Comparison with other theoretical results in the literature are hereafter presented with a spe-

cific focus on domains where LOL is competitive to more sophisticated algorithms and where its simplicity induces a slight lack of efficiency. To summarize, the great benefits of LOL is to produce a very simple and auto driven algorithm with no optimization step, and with quite elementary assumptions leading to optimal exponential rates.

5.1

Estimation bounds in learning theory

As mentioned in the previous section, LOL finds its inspiration in the learning framework, especially in [Barron et al., 2008], [Binev et al., 2005], [Binev et al., 2007a],[Binev et al., 2007b]. In all these papers, consistency results are obtained under fewer assumptions but with no exponential bounds and a higher cost in implementation. In the learning context, [Temlyakov, 2008] provides optimal critical value ηn as well as exponential bounds with fewer assumptions since there is no coherence restriction. However, the procedure is very difficult to implement for large values of p and n (N -P hard).

5.2

Comparison with other penalization procedure and coherence conditions

Comparisons has to be conducted with various procedures affiliated to the Lasso or Danzig procedures for instance [Tibshirani, 1996], [Candes and Tao, 2007], [Bickel et al., 2007], [Bunea et al., 2007a], [Bunea et al., 2007b]. First, the normalization needs to be stressed since it plays a crucial role. In many papers, the model is Y = Xβ + ε and the columns of X are normalized. For comparison, our model needs to be identified in the following way Φ X := √ , n

β :=

√ nα.

Of course, each normalization brings its own benefit. Our choice has a natural interpretation in terms of prediction in the functional learning model. However, it is interesting to notice that precisely because of this normalization, the sparsity conditions on the function (model V (S, M )) are lighter for LOL. LOL estimation bounds are compared with the lower bounds produced in [DeVore et al., 2006], p LOL procedure gives optimal results when the coherence satisfies τn ≤ O( log n/n). This is to be

compared with conditions of type τn ≤ O(S −1 ) (see for instance [Bickel et al., 2007], [Bunea et al., 2007a],

[Bunea et al., 2007b]) which are lighter except for large S, or τn ≤ O(1/ log p) in [Cand`es and Plan, 2008]

which is better. However, in these papers, there is generally additional assumptions

14

– either on the matrix X itself which generally are not possible to verify in practice. In the opposite, notice that the coherence can always be calculated. – or on the way X as well as the β coefficients are produced, namely all these values are in fact random and independent. In our case, it can allow to less drastic coherence conditions. We p infer that conditions of type τn ≤ O( S log n/n) could suffice in this case, but these precise types of models are not the scope of this paper.

5.3

Selection properties

[Meinshausen and Buhlmann, 2006] show that the selection by lasso type algorithm is consistent in graphical models, under assumptions that are tailored to models for which the vector (Y, Φ1 , . . . , Φp ) is gaussian. Basically, to establish the consistency property of the selection procedure, a minimal size of the (non zero) coordinates of α is required : it is generally assumed that there exists some sequence υn > 0 such that min |αℓ | ≥ O(υn ).

ℓ∈I ∗

(20)

[Zhao and Yu, 2006] establish the consistency of selection for fixed design linear regression models, assuming that Hypothesis (20) holds for υn = n−κ for some κ ∈ (0, 1/2). Under the same hypothesis,

[Fan and Lv, 2008] prove that Sure Independence Screening (SIS) is accurate in the sense that SIS selects (with large probability) at least the regressors which have to be selected. They need to assume that there exists some τ > 0 which is the indicator of the growth of the largest eigenvalue of the variance matrix Σ of Φ defined by λmax (Σ) ≤ O(nτ ). The main advantage of [Fan and Lv, 2008]

is that their results are basically concerning a linear model in ultra high dimension p = exp(cnξ )

for constants c, ξ > 0 with the restriction ξ ∈ (0, 1 − 2κ). Practical inconvenient is that the tuning sequence γn is not auto driven since it has to verify n1−2κ−τ −→ ∞. The selection procedure of

[Bunea, 2009] proposed in the learning framework is also shown to be consistent. Hypothesis (20) p is required for υn = S log n/n imposing some restriction on S because υn is supposed to tend to zero. Finally, [Cand`es and Plan, 2008] prove consistency results as soon as Hypothesis (20) is √ satisfied for υn = 8σ 2 log p and if S ≤ O(p/[kΦk2 log p]) (in a non asymptotical framework, for any

dimension p). When the selection procedure is derived from an estimation procedure, a coherence

restriction could be asked. In [Bunea, 2009] and [Bunea et al., 2007b], it is assumed in addition that

n

1X |Φiℓ Φim | ≤ O(S −1 ). ∗ ∗ n ℓ∈I , m6∈I i=1 P √ p and an exponential bound (tending to zero) is established for P αℓ − αℓ | > S η when ℓ=1 |ˆ p p η ≥ S log p/n. In [Cand`es and Plan, 2008], if τn ≤ O(c/ log p) and again η ≥ S log p/n, it is sup

15

proved that P (d(ˆ αℓ , αℓ ) > η) ≤ 6p−2 log 2 − p−1 (2π log p)−1/2 . [Temlyakov, 2008] provides optimal critical value ηn as well as exponential bounds with fewer assumptions : there is no coherence restriction and the setting is the learning framework. In [Fan and Lv, 2008], under some hypothesis of RIP type, the procedure SIS-D (SIS followed by danzig) is asymptotically consistent P

p X ℓ=1

6

(ˆ αℓSIS−D − αℓ )2 > S

p

log N

!

−→ 0

Practical results In this section, an extensive computational experiment is conducted using LOL. The procedure

is dedicated to find sparse solutions of linear models assuming that the target variable Y is a linear combination of only S predictors among p. The performances of LOL procedure are studied over various ranges of level of indeterminacy δ = 1 − n/p and over various ranges of sparsity rates

ρ = S/n (see [Maleki and Donoho, 2009]). The influence of the choice of the distribution family for the design matrix is analyzed through the performances. LOL procedure is finally compared to some others two-steps procedures described in Section 3.2.

6.1

Experimental design

The design matrix Φ has p i.i.d. columns of size n. Different distributions are studied : Gaussian, Uniform, Bernoulli, or Student laws. It is important to notice that this choice of laws yields different values of the coherence τn and then different behaviors of the procedure. Each column vector of Φ is normalized to have unit norms. Given Φ, the target observations are Y = Φα + ε for ε i.i.d. variables with a normal distribution N (0, σε ), σε chosen such that the signal over noise ratio is close to 2. The vector α is built as follows : all coordinates are zero except S non zero coordinates with αℓ = (−1)b |z| where b is drawn from a Bernoulli distribution with parameter 0.5 and z from

a N (2, 1) (see [Fan and Lv, 2008]).

To evaluate the quality of the prediction, the relative l2 error EY is computed on the target Y and the relative quadratic error Eα is computed on the α coefficients EY = kY − Ye k22 /kY k22

and

Eα = kα − α ˆ k22 /kαk22 .

The sparsity S is estimated by the cardinal of L = {ℓ = 1, . . . , p, α b∗ 6= 0} where α b∗ is the LOL

estimator. The number of False Positive and of False Negative as defined in Section 4.2 are also 16

computed. All these quantities are estimated by averaging the results obtained over K = 200 replications of the experiment.

6.2

Algorithm

Let us explain how to determine in a really adaptive way the thresholds λn (1) and λn (2). These are critical values quite hard to tune practically since they depend on inaccessible constants (see the theoretical results). Since the first threshold λn (1) is used to select the candidates to the regression, the aim is to split the set of ’correlations’ {Kℓ , ℓ = 1, . . . , p}, in two clusters in such a way to pick

up the regression candidates in one group. Here, the sparsity assumption is used : some predictors are more correlated to the target Y than some others associated to a weak correlation value, close to zero. This remark implies that the distribution of correlations (in absolute value) should be distributed in two clusters : one for the leaders (high correlations) and one for the others (very small correlations). The frontier between the clusters is adaptively computed by minimizing the deviance of the absolute value correlations for two classes as described in [Kerkyacharian et al., 2009]. The same procedure is used to threshold adaptively the estimated coefficients α bℓ obtained by

linear regression on the leaders. Indeed, notice that the distribution of the α bℓ provides two clusters : one cluster associated to the largest coefficients (in absolute value) corresponding to the non zero coefficients and one cluster composed of coefficients closed to zero, which should not be involved

in the model. The frontier between the two clusters, which defines λn (2), is again computed by minimizing the deviance between the two classes of regression coefficients. Finally, an improvement for LOL is proposed. It seems more appropriate to perform a second regression using the final set L of selected predictors involved in the model : the estimators of the

(non zero) coefficients should be more accurate. This updating procedure is denoted LOL+ in the sequel.

6.3

Results with random gaussian design matrices

First, the design matrix Φ is defined with i.i.d. gaussian variables. The computed coherence is also τn = 0.33 (see Figure 5). As we are interested in quantifying LOL performance in an overwhelming majority of cases, we study the impact of the level of indeterminacy δ from 0 to 0.9 by 0.05 step and the impact of the the sparsity rate ρ from 0.01 to 0.16 by 20 steps. p = 1000 is chosen and for specific studies n = 250. Influence of the indeterminacy level : Figure 1 studies LOL prediction and estimation performances when the indeterminacy level is varying (p = 1000, n varying). Both errors EY and Eα continuously increase with the indeterminacy δ, as the number of available observations

17

decreases compared to the number of variables. For a given value of δ, EY decreases as the sparsity does. For δ ≤ 0.75 (n ≥ 0.25p), the prediction error is weak, below 5%. In this case, the estimation

error on the coefficients is less than 10%. When the number of available observations is at least higher than half of the number of potential predictors (δ < 0.5), the prediction and the estimation errors are negligible : LOL performances are in this case exceptionally good. For a given number of observations and potential predictors, the prediction is more accurate as the sparsity rate decreases. For a fixed number observations, regarding the joint values of both indeterminacy and sparsity parameters, the errors tends to be null as δ and/or ρ decrease. Influence of the sparsity rate : Figure 2 illustrates LOL prediction and estimation performances when the sparsity rate is varying. For small values of sparsity rate (ρ ≤ 5%), both prediction

and estimation errors are very good (less than 5%). For an extreme level of sparsity (ρ ≤ 2%), the

performances are, as expected, excellent. As observed before, for a given sparsity rate value, the performances are improved as the indeterminacy decreases. Estimator of the Sparsity S : Figure 3 shows the estimated sparsity as a function of the

effective sparsity S. For weak sparsity values (ρ ≤ 5%), LOL procedure is excellent because it estimates exactly (with no error) the sparsity S and that for all studied indeterminacy levels.

As the sparsity increases, LOL procedure tends to underestimate the parameter S. For a given sparsity value, the underestimation becomes weaker as the indeterminacy level δ decreases. This observation is detailed in Table 1 where the False Positive and False Negative numbers are computed for different values of sparsity. Two different cases of indeterminacy are presented (δ = 0.75, 0.5). For each indeterminacy level, we observe that False Negative and False Positive numbers increase with S both in mean and variability. As the indeterminacy level decreases from δ = 0.75 to δ = 0.5, meaning that more observations are available relatively to the number of potential predictors, the detection of True Positive is improved. Estimator of the coefficients : Figure 4 presents the improvements provided by LOL+ compared to LOL as a function of sparsity rate for the prediction error. For all indeterminacy and sparsity values, the prediction error decreases using LOL+ procedure instead of LOL. Improvements are stronger as both sparsity rate and indeterminacy level increase. The prediction improvements are observed as ρ increases given all studied indeterminacy levels δ. Obviously, the estimated sparsity in the same for both procedures LOL and LOL+ (see Table 1).

6.4

Impact of the variable distribution in the design matrix

This section investigates the impact of the law of the regressor variables. Eight different distributions are studied : Gaussian (N (0, 1)), Uniform (U [−1, 1]), Bernoulli (B{−1, +1}) and Student

18

(T (m) with m ∈ {5, 4, 3, 2, 1}). The column of the design matrix Φ are empirically normalized.

Figure 5 shows the empirical density of the coherence τn computed for each law. Similar distri-

butions are observed for Gaussian, Uniform or Bernoulli laws with a mode of the coherence equal to τn = 0.30. For Student’s families, a shift of the mode of the empirical distributions can be observed from left to right equaled to 0.36 for T (5), 0.47 for T(4), 0.68 for T(3), 0.92 for T(2) to 0.99 for T(1). Figure 6 studies the estimation of S as a function of the sparsity rate ρ for those distributions. All the curves, except the one for the Student law T(1), are confounded and show similar evolution as the one observed for gaussian predictors (see Figure 3 for δ = 0.25). LOL provides similar results for Gaussian, Uniform, Bernoulli, or Student laws, T(m) with m large enough. It is amazing to observe that the procedure works fine even when the empirical coherence of the distribution τn reaches large values closed to 0.99. But LOL procedure does not work fine for heavy tailed variables as for T (1). Figure 7 shows the coherence of the matrix restricted to the N leaders. This ”restricted” coherence is much lower than the coherence computed on all the predictors. For the Student T (1) law, τn = 0.99 (see Figure 5) while the coherence computed just on the leaders is 0.3 (see Figure 7 by instance for S = 10). LOL procedure provides also good results even when the global coherence approaches 1 : it seems that the practical results are much more optimistic (although they do show some deterioration under high coherence). Conclusions would be that it could be interesting to find new measures of collinearity to best reflect the performances of the method. This is true in general, for all the methods concerned with high dimension. Table 2 shows the false detections F P and F N estimated for different distributions and values of sparsity S. For a given distribution, they increase with sparsity. This increment is stronger for distributions with high coherence. For a given sparsity number S, False Positive and False Negative increase as the coherence τn does. LOL tends to underestimate the number of non-zero coefficients. The underestimation is stronger as the coherence of the predictors increases.

6.5

Comparison with other two-steps procedures

In this part, the performances of LOL and LOL+ are compared with the performances of two two-step procedures. The first one referred as SIS-Lasso is coming from [Fan and Lv, 2008] : the selection step called SIS is followed by the Lasso procedure. The second one, called Lasso-Reg, is proposed in [Cand`es and Plan, 2008]. First, the Lasso algorithm performs the selection of the leaders and then the coefficients are estimated by regression. The performances of the four procedures (LOL, LOL+, SIS-Lasso, Lasso-Reg) are studied over a large range of sparsity rates in order to merge previous results already presented in [Fan and Lv, 2008]

19

and [Cand`es and Plan, 2008]. In this section, the sparsity S varies from 5 to 50 in 10 steps and the number of initial predictors is p = 1000. This experimental design let us analyze extreme sparsity values (0.02 ≤ ρ ≤ 0.05) (as in [Fan and Lv, 2008]) as values as large as 1/log(p) (ρ = 0.20) (as

in [Cand`es and Plan, 2008]). For the Lasso procedures, the regularization parameter is chosen by

crossvalidation. Figure 8 presents the prediction error for the different design matrices distributions presented in the previous section. For extreme sparsity levels, ρ < 5%, all the procedures performs extremely well. For middle sparsity levels (5% ≤ ρ ≤ 15%), the Lasso-Reg performs better than the others

ones, as the design matrix is defined with Gaussian, Uniform, Bernoulli or Student distributions (m = 4, 5). For this range of sparsity levels, the Lasso-Reg procedure seems to be more efficient to

select the leaders than the SIS-Lasso and the LOL procedures. For largest values of the sparsity level ρ ≥ 0.15, it appears that SIS-Lasso and LOL are better than Lasso-Reg. A phase transition

can be observed for the Lasso-Reg procedure as described in [Maleki and Donoho, 2009]. As the coherence of the design matrix increases, the phase transition appears sooner for smallest ρ values. The performances of the SIS-Lasso and LOL are globally similar. Note that LOL+ procedure improves continuously the performances compared to LOL and SIS-Lasso.

R´ ef´ erences [Alquier and Hebiri, 2009] Alquier, P. and Hebiri, M. (2009). Transductive versions of the lasso and the dantzig selector. [Bahadur, 1960] Bahadur, R. R. (1960). On the asymptotic efficiency of tests and estimates. Sankhy¯ a, 22 :229–252. [Barron et al., 2008] Barron, A. R., Cohen, A., Dahmen, W., and DeVore, R. A. (2008). Approximation and learning by greedy algorithms. Ann. Statist., 36(1) :64–94. [Bickel et al., 2007] Bickel, P. J., Ritov, Y., and Tsybakov, A. (2007). Simultaneous analysis of lasso and dantzig selector. Preprint Submitted to the Annals of Statistics. [Binev et al., 2007a] Binev, P., Cohen, A., Dahmen, W., and DeVore, R. (2007a). Universal algorithms for learning theory. II. Piecewise polynomial functions. Constr. Approx., 26(2) :127–152. [Binev et al., 2007b] Binev, P., Cohen, A., Dahmen, W., and DeVore, R. (2007b). Universal piecewise polynomial estimators for machine learning. In Curve and surface design : Avignon 2006, Mod. Methods Math., pages 48–77. Nashboro Press, Brentwood, TN.

20

[Binev et al., 2005] Binev, P., Cohen, A., Dahmen, W., DeVore, R., and Temlyakov, V. (2005). Universal algorithms for learning theory. I. Piecewise constant functions. J. Mach. Learn. Res., 6 :1297–1321 (electronic). [Bunea, 2009] Bunea, F. (2009). Consistent selection via the lasso for high dimensional approximation regression models. [Bunea et al., 2007a] Bunea, F., Tsybakov, A., and Wegkamp, M. (2007a). Sparsity oracle inequalities for the Lasso. Electron. J. Stat., 1 :169–194 (electronic). [Bunea et al., 2007b] Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2007b). Sparse density estimation with ℓ1 penalties. In Learning theory, volume 4539 of Lecture Notes in Comput. Sci., pages 530–543. Springer, Berlin. [Cand`es and Plan, 2008] Cand`es, E. and Plan, Y. (2008). Near-ideal model selection by ℓ1 minimization. [Candes and Tao, 2007] Candes, E. and Tao, T. (2007). The Dantzig selector : statistical estimation when p is much larger than n. Ann. Statist., 35(6) :2313–2351. [DeVore et al., 2006] DeVore, R., Kerkyacharian, G., Picard, D., and Temlyakov, V. (2006). Approximation methods for supervised learning. Found. Comput. Math., 6(1) :3–58. [Fan and Lv, 2008] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Statist. Soc. B, 70 :849–911. [Fan and Lv, 2009] Fan, J. and Lv, J. (2009). A selective overview of variable selection in high dimensional feature space. [Kerkyacharian et al., 2009] Kerkyacharian, G., Mougeot, M., Picard, D., and Tribouley, K. (2009). Learning out leaders : exponential rates of convergence in high dimensional regression. [Kerkyacharian and Picard, 2007] Kerkyacharian, G. and Picard, D. (2007). Thresholding in learning theory. Constr. Approx., 26(2) :173–203. [Lounici, 2008] Lounici, K. (2008). High-dimensional stochastic optimization with the generalized dantzig estimator. [Maleki and Donoho, 2009] Maleki, A. and Donoho, D. L. (2009). Optimally tuned iterative thresholding algorithm for compressed sensing. IEEE journal in signal processing, in press. [Meinshausen and Buhlmann, 2006] Meinshausen, N. and Buhlmann, P. (2006). High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34 :1436–1462. [Raskutti et al., 2009] Raskutti, G., Wainwright, M. J., and Yu, B. (2009). Minimax rates of estimation for high-dimensional linear regression over $ ell q$-balls. 21

[Satterthwaite, 1959] Satterthwaite, F. E. (1959). Random balance experimentation. Technometrics, 1 :111–137. [Temlyakov, 2008] Temlyakov, V. N. (2008). Approximation in learning theory. Constr. Approx., 27(1) :33–74. [Tibshirani, 1996] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58(1) :267–288. [Zhao and Yu, 2006] Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Journal, 7 :2541–2563.

22

0.35

0.8

0.7

0.3

0.6 0.25 0.5 0.2 0.4 0.15 0.3 0.1 0.2

0.05

0

0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Fig. 1 – X−axis : indeterminacy level δ, Y −axis : Prediction error (left) and estimation error (right). S = 10 (solid line-red) ; S = 12 (dot dash line-blue) ; S = 15 (dashed line -green) ; S = 20 (dot line-black).

23

0.5 0.3 0.45

0.4

0.25

0.35 0.2

0.3

0.25 0.15 0.2 0.1

0.15

0.1 0.05 0.05

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

Fig. 2 – X−axis : sparsity rate ρ, Y −axis : Prediction error (left) and estimation error (right). δ = 0.4 (dot line-black) ; δ = 0.7 (dot dash line-blue) ; δ = 0.75 (solid line-red) ; δ = 0.875, (dashed line-green).

24

0.16

0

5

10

15

20

25

30

35

40 40

0.14

35

0.12

30

0.1

25

0.08

20

0.06

15

0.04

10

0.02

5

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0 0.16

Fig. 3 – LOL Sparsity Estimation (ρ : bottom, left ; S : right, top). δ = 0.875 (dashed line-green) ; δ = 0.75 (solid line-red) ; δ = 0.7 (dot dash line-blue) ; δ = 0.4 (dot line-black). The columns of Φ are Gaussian of size n = 250.

δ 0.5 0.5 0.5 0.5 0.5 0.75 0.75 0.75 0.75 0.75

S (ρ) 5 10 15 20 25 5 10 15 20 25

4.98 9.88 14.54 18.78 22.74 4.98 9.90 14.57 18.77 21.94

TP (0.14) (0.35) (0.66) (1.04) (1.42) (0.12) (0.32) (0.56) (0.95) (1.91)

0.00 0.03 0.24 0.76 1.67 0.00 0.04 0.30 1.03 2.81

FN (0.00) (0.18) (0.47) (0.88) (1.26) (0.00) (0.19) (0.51) (0.89) (1.90)

0.00 0.00 0.01 0.03 0.07 0.00 0.00 0.01 0.05 0.19

FP (0.00) (0.00) (0.09) (0.17) (0.25) (0.00) (0.00) (0.12) (0.24) (0.48)

Tab. 1 – Detection, n = 250. The columns of Φ are i.i.d. gaussian. True Positive, False positive and False negative. Means over K = 200 replications, variances into the brackets.

25

0.35

0.3

0.25

0.2

0.15

0.1

0.05

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

Fig. 4 – Error. X−axis : sparsity rate ρ. Y −axis : Prediction errors for LOL (dot lines) and LOL+ (solid lines). δ = 0.4 (blue color) ; δ = 0.75 (red color) ; δ = 0.875 (green color). The columns of Φ are Gaussian of size n = 250. S 5 10 15 20 25 30 35 40

FP FN FP FN FP FN FP FN FP FN FP FN FP FN FP FN

0.00 0.00 0.01 0.04 0.03 0.41 0.09 1.26 0.19 2.78 0.39 5.90 0.70 9.44 1.24 14.73

G (0.0) (0.0) (0.1) (0.2) (0.2) (0.6) (0.3) (1.0) (0.4) (1.5) (0.8) (2.9) (1.5) (3.7) (1.5) (4.5)

0.00 0.00 0.00 0.05 0.02 0.35 0.07 1.26 0.10 2.93 0.42 6.05 0.61 9.19 1.21 14.98

U (0.0) (0.0) (0.0) (0.2) (0.1) (0.6) (0.3) (1.1) (0.3) (1.8) (0.9) (3.0) (1.0) (3.9) (1.5) (4.8)

0.00 0.00 0.00 0.05 0.02 0.36 0.08 1.25 0.17 2.61 0.39 5.45 0.78 9.54 1.18 14.72

B (0.0) (0.0) (0.0) (0.2) (0.1) (0.6) (0.3) (1.0) (0.4) (1.7) (0.6) (2.5) (1.3) (3.6) (1.4) (5.0)

0.00 0.00 0.00 0.06 0.03 0.31 0.07 1.24 0.17 2.69 0.34 5.93 0.68 9.63 1.06 15.15

T(5) (0.0) (0.0) (0.0) (0.3) (0.2) (0.6) (0.3) (1.0) (0.4) (1.8) (0.6) (2.8) (1.1) (4.3) (1.5) (4.6)

0.00 0.00 0.00 0.04 0.01 0.30 0.14 1.37 0.21 2.84 0.36 5.29 0.63 10.02 1.15 14.56

T(4) (0.0) (0.0) (0.0) (0.2) (0.1) (0.5) (0.4) (1.1) (0.5) (1.7) (0.7) (2.8) (1.0) (4.0) (1.5) (4.9)

0.00 0.00 0.01 0.06 0.03 0.34 0.08 1.25 0.23 2.75 0.41 5.42 0.84 9.73 1.31 15.53

T(3) (0.0) (0.0) (0.1) (0.2) (0.2) (0.6) (0.3) (1.0) (0.6) (1.8) (0.7) (2.7) (1.7) (4.1) (1.7) (4.3)

0.00 0.00 0.01 0.04 0.11 0.35 0.39 1.26 0.53 2.92 0.83 5.47 1.02 10.01 1.60 15.34

T(2) (0.0) (0.0) (0.1) (0.2) (0.4) (0.6) (0.6) (1.2) (0.7) (1.9) (1.0) (3.0) (1.3) (3.9) (2.1) (5.1)

0.01 0.01 0.19 0.16 1.08 0.56 1.91 1.59 3.92 4.12 4.69 8.76 5.71 14.77 6.24 21.70

Tab. 2 – False Detection, n = 250, p = 1000. First line : Common law of the columns of Φ. First column : Sparsity S. First lines : False positive, Second lines : False negative. Means over K = 200 replications, variances into the brackets. 26

T(1) (0.2) (0.1) (0.8) (0.8) (1.9) (1.3) (2.2) (2.8) (2.7) (3.9) (2.7) (7.4) (3.0) (8.6) (3.0) (9.0)

50 40 30 20 10 0

0.4

0.6

0.8

1.0

Fig. 5 – n = 250, p = 1000. Empirical densities of the coherence. The columns of Φ are Gaussian (solid line-red) ; uniform (solid line-blue) ; Bernoulli (solid line-green) ; Student 5, 4, 3, 2, 1 black lines from left to right.

27

0.16

0

5

10

15

20

25

30

35

40 40

0.14

35

0.12

30

0.1

25

0.08

20

0.06

15

0.04

10

0.02

5

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0 0.16

Fig. 6 – LOL Sparsity estimation for different families of laws for the predictors. Gauss (solid line-red) ; U nif orm (solid line-blue) ; Bernoulli : (solid line-green) ; T(1-5) (black-lines). n = 250, p = 1000. (K = 200)

28

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0

10

20

30

40

50

60

70

80

Fig. 7 – Coherence computed for the N selected Leaders. Gauss (solid line-red) ; U nif orm (solid line-blue) ; Bernoulli : (solid line-green) ; T(1) (dot line-black). n = 250, p = 1000. (K = 200)

29

N(0,1)

U[−1,1]

B(−1,1)

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0

0

0

0

0.1

0.2

0

0.1

T(4)

0.2

T(3) 0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0

0

0

0.1

0.2

0

0.1

0.1

0.2

T(2)

0.8

0

0

0.2

0

0.1

0.2

Fig. 8 – X-axis : Sparsity rate. Y-Axis : Prediction error for different design matrices. LOL (red solid line), LOL+ (red dotted lines), SIS − Lasso (green solid lines), and Lasso − Reg (blue solid line). n = 250, p = 1000. (K = 200) 30

To cite this version: Gerard Kerkyacharian, Mathilde Mougeot, Dominique Picard, Karine Tribouley. Learning Out of Leaders. 2010.

HAL Id: hal-00445690 https://hal.archives-ouvertes.fr/hal-00445690 Submitted on 11 Jan 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Learning Out of Leaders G´ erard KERKYACHARIANa , Mathilde MOUGEOTa , Dominique PICARDa and Karine TRIBOULEYb a

Universit´e Paris-Diderot, CNRS LPMA, 175 rue du Chevaleret 75013 Paris, France kerkmath.jussieu.fr, mougeotmath.jussieu.fr, picardmath.jussieu.fr b

Universit´e Nanterre Paris Ouest, MODALX & LPMA Rue de la r´epublique 92001 Nanterre, France karine.tribouleyu-paris10.fr

R´ esum´ e This paper investigates the problem of selection and estimation in a high dimensional regression-type model. We propose a procedure with no optimization called LOL, for Learning Out of Leaders. LOL is an auto-driven algorithm with two thresholding steps. A first adaptive thresholding helps to select leaders among the initial regressors in such a way to reduce the dimensionality. Then a second thresholding follows the estimations and predictions performed by linear regression on the leaders. Theoretical results are proved. As an estimation procedure, LOL is optimal since the upper exponential bounds are achieved. Rates of convergence are provided and show that LOL is also consistent as a selection procedure. An extensive computational experiment is conducted to emphasize the practical good performances of LOL.

1

Introduction The general linear model is considered in this paper, with a focus on cases where the number

p of regressors is large compared to the number n of the observations (although there is no such restrictions). These type of models have lots of practical applications in many areas of science and engineering including collaborative filtering, machine learning, control, remote sensing, and computer vision just to name a few. Examples in statistical signal processing and nonparametric estimation include the recovery of a continuous-time curve or surface from a finite number of noisy samples. Other interesting fields of application are radiology and biomedical imaging when fewer measurements about an image are available compared to the unknown number of pixels collected. In biostatistics, high dimensional data frequently arise in genomics to study gene expression given a huge number of initial genes and a relatively low number of observations. 1

A considerable amount of work have been produced in this domain in the last years, which has been a large source of inspiration for this paper. We have especially considered the algorithms coming from the learning framework ([Barron et al., 2008], [Binev et al., 2005], [Binev et al., 2007a], [Binev et al., 2007b]), as well as the extraordinary explosive domain of ℓ1 penalties (among many others [Tibshirani, 1996], [Candes and Tao, 2007], [Bickel et al., 2007], [Bunea et al., 2007a], [Bunea et al., 2007b], [Fan and Lv, 2008] and [Cand`es and Plan, 2008]. See also [Lounici, 2008] and [Alquier and Hebiri, 2009] ). The essential motivation of this work is to provide one of the simplest procedures which achieves, in the same time, good performances. LOL algorithm (for Learning Out of Leaders) consists in a two steps thresholding procedure. As we do not perform any optimization step, it is important to address in which domains the procedure is competitive to more sophisticated algorithms and especially to algorithms performing a one or two steps ℓ1 minimization. One of our aim is to delimit where LOL is performant and where its simplicity induces a slight lack of efficiency from a theoretical point of view as from a practical aspect. Let us start by introducing the ideas of the emergence of LOL algorithm. This simple procedure can be viewed as an ’explanation’ or as a ’cartoon’ of ℓ1 minimizations. It is well known that when the regressors are exactly normalized and orthogonal, ℓ1 minimization corresponds to soft thresholding which itself is close to hard thresholding. Hence, it is quite natural to expect that thresholding should perform well, at least in cases not so far from these orthonormal conditions which correspond, as noted below, to small coherence conditions. A tricky problem occurs when the regressors are not orthonormal or when the number of regressors is large. Then, the minimum least squares estimator has a non unique solution and is very unstable. This stays the main difficulty for the ℓ1 minimizers or more generally for all methods based on sparsity assumptions. Moreover, this is the part of the algorithms where the computation cost shows up. Obviously a simple thresholding would not fit, but assuming some sparsity conditions, in this case, ensures that it is possible to choose some regressors and exclude some others. LOL algorithm solves the difficult problem of the choice of the regressors in a quite crude way by adaptively selecting N regressors which are the most correlated to the target : this defines the first step thresholding of LOL, determining the N leaders. The number N is chosen using a fine tuning depending on the coherence, and it has to be emphasized that the choice is auto driven. In a second thresholding step, LOL regresses on the previous leaders and thresholds the result to take into account the noise of the model. Properties of LOL procedure are investigated through two different points of view : the prediction problem and the estimation problem. More precisely, it is established that LOL procedure has a prediction error which is going to zero in probability with exponential rates. These types of

2

results are often called Bahadur type efficiency. Although Bahadur efficiency of test and estimation procedures goes back to the sixties (see [Bahadur, 1960]), it has seen recently a revival in learning theory, where the rates of convergence (preferably exponential) of being at some fixed distance of the target are investigated and compared to optimality. This is also the connection to learning theory which guides here the choice to measure LOL performances as the mean of the empirical quadratic distance between the observations and the predicted values. We also establish that LOL procedure works quite well regarding the detection since the number of false negative as well as false positive are going to zero in probability with pretty fast rates. Of course, because of the simplicity of the method, some loss of efficiency can be expected compare to more elaborate and costly procedures. But even when there is a loss, the limitations of the procedure could be an interesting information on the ℓ1 minimizers themselves. From both a theoretical and a practical point of view, when the coherence is small, LOL procedure is as powerful as the best procedures. Also when there is a loss in the rate, a positive aspect of the method is that the practitioner is informed of the possible instability since the coherence is provided by the observations. An intensive calculation program is performed to show the advantages and the limitations of LOL procedure in several practical aspects. In Section 6, the case where the regressors are forming a random design matrix with i.i.d. entries is investigated. Different laws of the entries are considered (Gaussian, Uniform, Bernoulli or Student laws) inducing specific coherence for the design matrix. Several interesting features are discussed in this section. The impact of the sparsity and the undetermination of the regression on LOL performances are studied. A comparison with two others two-step procedures namely [Fan and Lv, 2008] and [Cand`es and Plan, 2008] is also provided and shows the additional benefits brought by LOL. The most interesting conclusion being that the practical results are even better and more comforting than the theoretical ones in the sense that even when the coherence is pretty high, LOL procedure shows good performances. The paper is organized as follows. In Section 2, the general model and the notations are presented. In Section 3, LOL procedure is detailed as other procedures with a ℓ1 optimization step ; practical comparisons with other procedures are later discussed in Section 5. In Section 4, after stating the hypotheses needed in the model, theoretical results are established. Practical performances of the LOL procedure are investigated in Section 6 and the proofs are detailed in Section ??.

2

Description of the models In this part, the model of interest is presented with a focus on two specific cases : the random

matrices design and the functional regression.

3

2.1

General model

A Gaussian (or sub-gaussian) high dimensional linear model is here considered and more precisely data Y = (Y1 , . . . , Yn )t are observed coming from the following regression model Y = Φα + u + ε

(1)

where the parameter α ∈ Rp is the unknown vector to be estimated and

– the vector ε = (ε1 , . . . , εn )t is a (non observed) vector of random errors. It is assumed to be independent Gaussian variables N (0, σ 2 ) but essentially comparable results can be obtained in the case of zero mean subgaussian errors (see the remark before Lemma ??). – the vector u = (u1 , . . . , un )t is a non observed vector of (possibly) random errors. Its amplitude is assumed to be small. The differences between the two previously described ”errors” lies in the fact that the εi ’s are centered but unbounded and independent, while the ui ’s are only bounded. The importance of introducing these two types of errors becomes clear in the functional regression example (see section 2.3). – Φ is a n × p known matrix. This paper focuses on the interesting case where p >> n but it is not necessary. We assume that Φ has normalized columns (or normalize them) in the following sense :

n

1X 2 Φiℓ = 1, n i=1

2.2

∀ ℓ = 1 . . . , p.

(2)

Coherence

The following Gram p × p matrix is M :=

1 t Φ Φ. n

The quantity

n

1X Φiℓ Φim | τn = sup |Mℓm | = sup | ℓ6=m ℓ6=m n i=1

is called the coherence of the matrix M . This quantity is important because it induces a bound on the size of the invertible matrices built with the columns of M . More precisely, fix 0 < ν < 1 and let C be a subset of indices of {1, . . . , p} with cardinality m. Denote ΦC the matrix restricted to the columns of Φ whose indices are in C. If 2τn ≤ ν, the associated Gram matrix M (C) :=

4

1 t Φ ΦC n C

is almost diagonal as soon as m is smaller than ⌊ν/τn ⌋ in the sense that it satisfies the following so called Restricted Isometry Property (RIP).

∀x ∈ Rm , kxk2l2 (m) (1 − ν) ≤ xt M(C)x ≤ kxk2l2 (m) (1 + ν),

(3)

This proves in particular that the matrix M (C) is invertible.

2.3

Models of interest

Although these results apply in the general case, two typical cases of applications are especially considered. The first application concerns a random matrix Φ composed of n independent random vectors of size p. The important role played by the distribution of these random vectors is detailed in the simulation study, Section 6. The second application is the learning (also called functional regression) framework Yi = f (Xi ) + εi , i = 1 . . . n

(4)

where f is the functional parameter of interest to be estimated. The Xi′ s are i.i.d. random variables living in a compact domain of Rd . The errors ε′i s, are i.i.d. standard Gaussian random variables and independent of the Xi′ s (or centered sub-gaussian variables). ρ denotes the common (unknown) law of the (1 + d)−vectors Zi = (Xi , Yi )′ s. To relate this framework to our model, let us consider a dictionary D of size p, of real functions

defined on Rd . Assume that f can be reasonably well approximated using the elements of the dictionary which means that there exists a sequence {αg , g ∈ D} such that X f= αg g + h g∈D

where h is hopefully small. Then the regression model becomes X Yi = αg g(Xi ) + h(Xi ) + εi , i = 1, . . . , n g∈D

which coincides with the general model Y = Φα + u + ε, setting ui = h(Xi ) for any i = 1, . . . , n and Φ being the matrix with general terms Φiℓ = gℓ (Xi ) (after choosing an enumeration of D). Again, the dictionary has to be normalized and (2) translates

here as

n

1X 2 g (Xi ) = 1, ∀ g ∈ D. n i=1

5

3

Estimation procedures As explained in the introduction, the essential motivation of this work is to provide one of

the simplest procedures, finding its inspiration among a lot of works around the same theme. First, the estimation of the unknown parameter α using LOL is described. The procedure has the particularity to perform a selection method of the regressors in the same time. Next, a short review on the procedures directly connected to LOL is proposed. Once for all, the constant ν is fixed. This constant will obviously be related to the precision of LOL main procedure (for instance ν = 1/2 can be taken as default value).

3.1

LOL Procedure

Once τn (or a bound for τn ) is evaluated and N = ⌊ν/τn ⌋ is computed, LOL procedure has

three successive steps : Find N leaders, Regress on the leaders, Threshold. 1. Find the leaders : • For some constant T1 > 0, fix a threshold λn (1) = T1

• Compute the ’correlations’

log p n

1/2

∨ τn

!

.

(5)

n

Kℓ = |

1X Φiℓ Yi | n i=1

and consider the ordered sequence K(1) ≥ K(2) ≥ . . . ≥ K(N ) of the N largest, and the

associated set of indices K = {κ(1) , κ(2) , . . . , κ(N ) }.

• The final set of the leaders is defined by the following set of columns Φℓ of the matrix Φ : B = {Φℓ , ℓ ∈ K and Kℓ ≥ λn (1)} and B denotes the set of their indices (which might of course be different from K). It is clear

from this construction that N appears as a bound for the number of leaders (equal to the cardinal of B). 2. Regress on the leaders : • Consider the pseudo-regression model : Yi =

X

Φiℓ αℓ + ei

ℓ∈B

and define the extracted matrix ΦB by (ΦB )ℓ, i = Φiℓ

for any ℓ ∈ B and i ∈ {1, . . . , n}. 6

(6)

• Let α b(B) = (c αℓ (B), ℓ ∈ B) be the minimum least square error in this model : ! n X X α ˆ (B) = Arg min (Yi − Φiℓ αℓ )2 = (ΦtB ΦB )−1 ΦtB Y . α=(αℓ )ℓ∈B

i=1

ℓ∈B

• Define the vector α ˆ of Rp by

3. Threshold :

αℓ := c

cℓ (B) α 0

if ℓ ∈ B if ℓ ∈ 6 B

For some constant T2 > 0, fix a threshold λn (2) = T2

log n n

1/2

(7)

and threshold again the estimated coefficients to obtain the final predictor α ˆ ∗ whose coordinates are α ˆ ∗ℓ = c αℓ I{|c αℓ | ≥ λn (2)}.

The selected regressors are then the columns of Φ whose indices belong to L = {ℓ = 1, . . . , p, α cℓ ∗ 6= 0}

Notice that the formula (5) and (7) are the ’default’ values for the tuning sequences λn (1) and λn (2) given for the procedure. However, the presentation as well as the theoretical results in sequel are given for arbitrary sequences λn (1) and λn (2).

3.2

Several inspirations

Although it is impossible to be exhaustive in such a productive domain, some of the works directly in relation to our construction are hereafter mentioned. We apologize in advance for all the works that are not mentioned but still in connection. For a comprehensive overview, we refer to [Fan and Lv, 2009]. In the context of the learning theory (second application), various methods are already been proposed, including kernel methods and search within dictionaries. Let us especially mention following works providing greedy algorithms [Barron et al., 2008], or adjusting tree algorithms [Binev et al., 2005], [Binev et al., 2007a]. A one step algorithm rough version of LOL is given in [Kerkyacharian and Picard, 2007] as well as in [Kerkyacharian et al., 2009] for the case p ≤ n. [Bunea, 2009] also proposes an estimation procedure based on the lasso and derives a selection

procedure by keeping the non zero estimated coefficients. 7

In the context of the linear regression (first application), several authors propose procedures to solve the selection problem and the estimation problem in the case where the vector α has only a small number of non zero components, and (often) when the design matrix Φ is composed of i.i.d. random vectors : see among many others [Tibshirani, 1996], [Candes and Tao, 2007], [Bickel et al., 2007], [Bunea et al., 2007a] and [Bunea et al., 2007b]. We especially refer to the 2-steps procedures which are also commonly used. Apparently, as soon as in 1959 such a procedure is already discussed (see [Satterthwaite, 1959]). In [Candes and Tao, 2007] and [Cand`es and Plan, 2008], the leaders are selected with (respectively) the Danzig procedure and the lasso procedure. Then, the estimated coefficients are obtained via a linear regression on the leaders. Using an intensive simulation program, [Fan and Lv, 2008] show that it could be unfavorable to use the procedures lasso or Danzig before the reduction of the dimension. They also provide a search among leaders called Sure Independence Screening (SIS) procedure. This procedure is very close to the one discussed in this paper : the leaders are the N = ⌊γn n⌋ columns of Φ with largest

correlations to the target variable Y (γn is a tuning sequence tending to zero). This step is followed

with a subsequent estimation procedure using Danzig or lasso. All these methods focus on the complexity of the algorithms.

4

Main theoretical results This section states the theoretical results of the procedure LOL. First, the assumptions on the

model are described. Next, the quantities allowing to measure the performances of the procedures are defined. The consistency of LOL is shown using two different points of view : the prediction problem and the estimation problem.

4.1

Sparsity conditions on the model

Recall that the model specifies a gaussian (or sub-gaussian) observation of the following form : Y = Φα + u + ε. The following sparsity conditions are assumed. There exist S ≤ N and constants

8

M, c1 , ct , c′t , c0 , such that the sequences {αℓ }ℓ≤p and {ui }i≤n satisfy the following conditions 1/2 S (8) sup |ui | ≤ c1 n i=1,...,n p X ℓ=1

|αℓ | ≤ M,

(9)

# {ℓ ∈ {1, . . . , p}, |αℓ | ≥ λn (2)/2} ≤ S X S 1/2 |α(ℓ) | ≤ ct nτn

(10) (11)

(ℓ)>N

X

(ℓ)>N p X ℓ=1

|α(ℓ) |2 ≤ c′2 t

S n

(12)

|αℓ |2 I{|αℓ | ≤ 2λn (1)} ≤ c20

S n

(13)

Recall that (α(ℓ ) ) is the ordered sequence (for the modulus) |α(1) | ≥ |α(2) | ≥ . . . |α(p) |. For S, M > 0,

V (S, M ) denotes the class of models of type (1) satisfying the sparsity conditions (9), (8), (10), (11), (13). A very important example of such a class occurs when all the coefficients of α are 0 except S coefficients (with S ≤ N ) with a modulus greater that λn (2)/2 but bounded : Spars(S, M ) denotes such a class.

The conditions (9)–(13) are also satisfied if the lq conditions are assumed, as in [Raskutti et al., 2009] which provide upper and lower bounds. More precisely, for q ∈ (0, 1], define the lq -balls as the sets p

Bq (M ) := {α ∈ R ,

p X j=1

|αj |q ≤ M q }.

(14)

It is not difficult to prove that if α belongs to Bq (M ) then (9)–(13) are verified for S ≥ λn (2)−q ∨ nλn (1)2−q ∨ nτn(2−q)/q . In particular, in order to compare our results to the lower bounds q in[Raskutti et al., 2009], it is log p important to verify that the conditions are verified for τn = O and for the defaults values n for λn (1) and λn (2). In this precise case, this means that S/n has to be of order τn2−q .

In the context of the learning theory (second application), the sparsity conditions are required on the target function f . The above assumptions are easily translated by replacing the condition (8) by the following one : khk2∞

1/2 S ≤ c1 . n

The other conditions are quite usual in functional analysis and relate to Lorentz spaces. 9

4.2

Measures of performances

First, let us define loss functions to measure the difference between the true value α ∈ Rp and

the result α ˆ ∗ of LOL procedure. Denote Φi the i−th line of the matrix Φ and recall that the i−th observation is given by the model Yi = Φi α + ui + εi .

The predicted i−th observation is Ybi = Φi α ˆ ∗ . The empirical quadratic distance between the predicted observations and the expected value is here considered

p n n 2 1X X ∗ 1 Xb Yi − EYi = d(ˆ α∗ , α)2 = (ˆ αℓ − αℓ )Φiℓ + ui n n i=1

i=1

ℓ=1

!2

.

(15)

Notice that in the functional regression case, this error coincides with the L2 error with respect to the empirical measure

1X δXi n where δx denotes the Dirac measure at point x. Indeed, we get ρˆ =

n 2 1 X ˆ f (Xi ) − f (Xi ) = kfˆ − f k2ρˆ. d(ˆ α , α) = n ∗

2

i=1

With a slight abuse of notations, we also write the distance defined in (15) in the general model d(ˆ α∗ , α) := k

p X (ˆ α∗ℓ − αℓ )Φ•ℓ + u• kρˆ l=1

where Φ•ℓ is the ℓ−th row of Φ. The first measure of performance under consideration is issued from the Bahadur efficiency of test and estimation procedures and is defined for any tolerance η > 0 as AC n (LOL, η) = P (d(ˆ α∗ , α) > η) .

(16)

Obviously, if the tolerance is low (smaller than a critical value ηn ), this quantity is large. In the opposite, for η ≥ ηn , the quality of the procedure is given by the rate of convergence of AC n (LOL, η)

towards zero. Observe that the value of the critical value ηn is essential since it yields, as a conse-

quence, bounds for Ed(ˆ α∗ , α) which is another (more standard) measure of performance of the procedure. More generally, in the learning framework, given priors Θ on the class of probability distributions generating the observations, it has been defined in [DeVore et al., 2006] the accuracy confidence function of the procedure fˆ : AC n (Θ, fˆ, η) := sup ρ⊗n {kf − fˆkρX > η}. ρ∈Θ

10

(17)

This quantity measures a uniform confidence (over the class Θ) that the estimator fˆ is accurate to the tolerance η. In most examples, there exist a phase transition and a critical value ηn depending on n and Θ such that AC n (Θ, fˆ, η) decreases exponentially for any η > ηn . More precisely, in terms of lower bound, it is proved in [DeVore et al., 2006] q ¯ (Θ, η)e−cnη2 , inf AC n (Θ, fˆ, η) ≥ C N fˆ

(18)

¯ (Θ, η) is the tight entropy analogue of the Sobolev covering numbers. The results in where N [DeVore et al., 2006] are obtained in the learning framework ; however identical bounds can eap sily be obtained in the setting (1) of this paper, leading to ηn = O( S/n).

If the focus is made on the case where α ∈ Spars(S, M ), it could be interesting to adopt the

point of view of the ”detection” instead of the ”prediction”. Two quantities become then crucial in view to measure the ”similarity” between the true value and its estimator. The number of False Positive decisions (FP) and the number of False Negative decisions (FN) are given by F P :=

p X ℓ=1

I{αℓ = 0}I{ˆ α∗ℓ 6= 0}

and

F N :=

p X ℓ=1

I{αℓ 6= 0}I{ˆ α∗ℓ = 0}.

In order to evaluate the performances of LOL selection procedure using these distances between α and α ˆ , the quantity P (F P > pη) + P (F N > pη) for η ≥ 0 is studied. A selection procedure is said

consistent if P ( {ℓ, αℓ 6= 0} = {ℓ, α ˆ ∗ℓ 6= 0}

4.3

) is tending towards 1.

Performances of the procedure LOL

The performances of the LOL procedure are summarized in the following theorems. In Theorem 1, we establish that LOL procedure is a good procedure for estimation since the prediction error is going to zero in probability with exponential rates. Indeed, the LOL estimator is optimal (up to p a logarithmic factor) in terms of the critical value ηn ∼ S/n, as well as in terms of exponential

rates if the coherence is small enough (see the discussion below). In Theorem 2, we establish that

LOL procedure works also quite well for detection since quantities F N and F P are going to zero in probability with pretty fast rates. Theorem 1. Let S, M > 0 and fix ν in ]0, 1[. Suppose p ≤ na , for some constant a > 0 and choose the thresholds λn (1) and λn (2) such that λn (1) ≥

T11

log p n

1/2

∨ T12 τn

11

!

and

λn (2) ≤ λn (1)

√ √ 1/2 for T11 = 16 2σ 2 /(1 + ν) and T12 = M (1−ν) 2 . Then, the model is of class V (S, M ) ∨ 4 4 defined above, there exist positive constants D and γ, such that S 2 ≥D −γnη2 for η 4e n ∨ sup P (d(ˆ α∗ , α) > η) ≤

V (S,M )

1

for

η2

≤D

S n

∨

S| log τn | n

∨ Sτn2 ,

S| log τn | n

Sτn2

∨

(19)

Observe that the result given in Theorem 1 is concerning LOL procedures associated with more general thresholds than λn (1), λn (2) than those prescribed in (5) and (7). It is interesting to notice the very few conditions on the threshold λn (2) (λn (2) ≤ λn (1) and Condition (10) relating to the considered set of α’s ).

The constants D and γ are precisely given at the end of the proof of Theorem 1. For a sake of completeness, precision on the constants is given. However, it is obvious that the constants provided here are not optimal : for instance in the proof, in order to avoid unnecessary technicalities, most of the events are divided as if they had equal importance, leading to constants which are each time divided by 2. Obviously there is room for improvement at any of these stages. An elementary consequence of Theorem 1 is the following corollary which details the behavior of the expectation of d(ˆ α∗ , α). Notice also that we did not give here explicite oracle inequalities, which however could be derived from the proof of Theorem 1. Corollary 1. For r ≥ 1 arbitrary, under the same assumptions as in Theorem 1, we get r/2 S| log τn | ∗ r ′ S 2 sup Ed(ˆ α , α) ≤ D ∨ ∨ Sτn n n V (S,M ) for some positive constant D ′ . Notice that in the case of the lq balls Bq (M ) for q ∈ (0, 1]) (see (14)) and taking the defaults

values for λn (1) and λn (2), LOL procedure has optimal rates in the minimax sense (compare the q log p . upper bound to the lower bounds in [Raskutti et al., 2009]) as soon as τn = O n Let us now focus on the selection point of view. As usual, an additional assumption is needed

on the non zero coefficients : they have to be large enough to be detected. Theorem 2 establishes that LOL procedure is consistent as a selection procedure. Theorem 2. Let k be a given positive number. Let S, M > 0 and fix ν in ]0, 1[. Suppose p ≤ na ,

for some constant a > 0, choose λn (1) ≥ λn (2) and assume that the model is of class Spars(S, M ) described above, then

– False Positive : Assume that min |αℓ |I{αℓ 6= 0} ≥ µn

ℓ=1,...,p

12

where µn satisfies µn = T3

λn (2) ∨ τn

r

S ∨ k

r

S ∨ nk

r

S| log τn | nk

!

where T3 is a constant large enough. Then there exists a constant c > 0 such that P (F P > k) ≤ c exp{−c knλ2n (2)}. – False Negative : Choose the thresholds such that λn (1) ≥

T11

log p n

1/2

∨ T12 τn

!

where the constants T11 , T12 defined as in Theorem 1 and √ (1 + ν)1/2 λn (2) ≥ σ 32 c′1 c′ ∨ 256c1 (1 − ν)1/2

!r

S . nk

There exists some constant c > 0 such that P (F N > k) ≤ c exp{−c knλ2n (2)} As for Theorem 1, Theorem 2 states for general thresholds λn (1), λn (2) (which are valid for (5) and (7) but also more widely). Observe that the choice of λn (2) is crucial from a detection point of view. For the specific choices (5) and (7), we get Corollary 2. Assume that min |αℓ | I{αℓ 6= 0} ≥ O

ℓ=1,...,p

log n √ n

.

Let S, M > 0 and fix ν in ]0, 1[. Suppose p ≤ na , for some constant a > 0 and assume that the

model is of class Spars(S, M ) described above. The LOL procedure the specific choices (5) and (7)

satisfies ′

P (F N + F P > k) ≤ c′ n−c k . for k larger than O(S/ log n). Note that LOL procedure works better and better as S gets smaller, as it is confirmed by the practical simulations.

13

5

Discussion and Comparisons Comparison with other theoretical results in the literature are hereafter presented with a spe-

cific focus on domains where LOL is competitive to more sophisticated algorithms and where its simplicity induces a slight lack of efficiency. To summarize, the great benefits of LOL is to produce a very simple and auto driven algorithm with no optimization step, and with quite elementary assumptions leading to optimal exponential rates.

5.1

Estimation bounds in learning theory

As mentioned in the previous section, LOL finds its inspiration in the learning framework, especially in [Barron et al., 2008], [Binev et al., 2005], [Binev et al., 2007a],[Binev et al., 2007b]. In all these papers, consistency results are obtained under fewer assumptions but with no exponential bounds and a higher cost in implementation. In the learning context, [Temlyakov, 2008] provides optimal critical value ηn as well as exponential bounds with fewer assumptions since there is no coherence restriction. However, the procedure is very difficult to implement for large values of p and n (N -P hard).

5.2

Comparison with other penalization procedure and coherence conditions

Comparisons has to be conducted with various procedures affiliated to the Lasso or Danzig procedures for instance [Tibshirani, 1996], [Candes and Tao, 2007], [Bickel et al., 2007], [Bunea et al., 2007a], [Bunea et al., 2007b]. First, the normalization needs to be stressed since it plays a crucial role. In many papers, the model is Y = Xβ + ε and the columns of X are normalized. For comparison, our model needs to be identified in the following way Φ X := √ , n

β :=

√ nα.

Of course, each normalization brings its own benefit. Our choice has a natural interpretation in terms of prediction in the functional learning model. However, it is interesting to notice that precisely because of this normalization, the sparsity conditions on the function (model V (S, M )) are lighter for LOL. LOL estimation bounds are compared with the lower bounds produced in [DeVore et al., 2006], p LOL procedure gives optimal results when the coherence satisfies τn ≤ O( log n/n). This is to be

compared with conditions of type τn ≤ O(S −1 ) (see for instance [Bickel et al., 2007], [Bunea et al., 2007a],

[Bunea et al., 2007b]) which are lighter except for large S, or τn ≤ O(1/ log p) in [Cand`es and Plan, 2008]

which is better. However, in these papers, there is generally additional assumptions

14

– either on the matrix X itself which generally are not possible to verify in practice. In the opposite, notice that the coherence can always be calculated. – or on the way X as well as the β coefficients are produced, namely all these values are in fact random and independent. In our case, it can allow to less drastic coherence conditions. We p infer that conditions of type τn ≤ O( S log n/n) could suffice in this case, but these precise types of models are not the scope of this paper.

5.3

Selection properties

[Meinshausen and Buhlmann, 2006] show that the selection by lasso type algorithm is consistent in graphical models, under assumptions that are tailored to models for which the vector (Y, Φ1 , . . . , Φp ) is gaussian. Basically, to establish the consistency property of the selection procedure, a minimal size of the (non zero) coordinates of α is required : it is generally assumed that there exists some sequence υn > 0 such that min |αℓ | ≥ O(υn ).

ℓ∈I ∗

(20)

[Zhao and Yu, 2006] establish the consistency of selection for fixed design linear regression models, assuming that Hypothesis (20) holds for υn = n−κ for some κ ∈ (0, 1/2). Under the same hypothesis,

[Fan and Lv, 2008] prove that Sure Independence Screening (SIS) is accurate in the sense that SIS selects (with large probability) at least the regressors which have to be selected. They need to assume that there exists some τ > 0 which is the indicator of the growth of the largest eigenvalue of the variance matrix Σ of Φ defined by λmax (Σ) ≤ O(nτ ). The main advantage of [Fan and Lv, 2008]

is that their results are basically concerning a linear model in ultra high dimension p = exp(cnξ )

for constants c, ξ > 0 with the restriction ξ ∈ (0, 1 − 2κ). Practical inconvenient is that the tuning sequence γn is not auto driven since it has to verify n1−2κ−τ −→ ∞. The selection procedure of

[Bunea, 2009] proposed in the learning framework is also shown to be consistent. Hypothesis (20) p is required for υn = S log n/n imposing some restriction on S because υn is supposed to tend to zero. Finally, [Cand`es and Plan, 2008] prove consistency results as soon as Hypothesis (20) is √ satisfied for υn = 8σ 2 log p and if S ≤ O(p/[kΦk2 log p]) (in a non asymptotical framework, for any

dimension p). When the selection procedure is derived from an estimation procedure, a coherence

restriction could be asked. In [Bunea, 2009] and [Bunea et al., 2007b], it is assumed in addition that

n

1X |Φiℓ Φim | ≤ O(S −1 ). ∗ ∗ n ℓ∈I , m6∈I i=1 P √ p and an exponential bound (tending to zero) is established for P αℓ − αℓ | > S η when ℓ=1 |ˆ p p η ≥ S log p/n. In [Cand`es and Plan, 2008], if τn ≤ O(c/ log p) and again η ≥ S log p/n, it is sup

15

proved that P (d(ˆ αℓ , αℓ ) > η) ≤ 6p−2 log 2 − p−1 (2π log p)−1/2 . [Temlyakov, 2008] provides optimal critical value ηn as well as exponential bounds with fewer assumptions : there is no coherence restriction and the setting is the learning framework. In [Fan and Lv, 2008], under some hypothesis of RIP type, the procedure SIS-D (SIS followed by danzig) is asymptotically consistent P

p X ℓ=1

6

(ˆ αℓSIS−D − αℓ )2 > S

p

log N

!

−→ 0

Practical results In this section, an extensive computational experiment is conducted using LOL. The procedure

is dedicated to find sparse solutions of linear models assuming that the target variable Y is a linear combination of only S predictors among p. The performances of LOL procedure are studied over various ranges of level of indeterminacy δ = 1 − n/p and over various ranges of sparsity rates

ρ = S/n (see [Maleki and Donoho, 2009]). The influence of the choice of the distribution family for the design matrix is analyzed through the performances. LOL procedure is finally compared to some others two-steps procedures described in Section 3.2.

6.1

Experimental design

The design matrix Φ has p i.i.d. columns of size n. Different distributions are studied : Gaussian, Uniform, Bernoulli, or Student laws. It is important to notice that this choice of laws yields different values of the coherence τn and then different behaviors of the procedure. Each column vector of Φ is normalized to have unit norms. Given Φ, the target observations are Y = Φα + ε for ε i.i.d. variables with a normal distribution N (0, σε ), σε chosen such that the signal over noise ratio is close to 2. The vector α is built as follows : all coordinates are zero except S non zero coordinates with αℓ = (−1)b |z| where b is drawn from a Bernoulli distribution with parameter 0.5 and z from

a N (2, 1) (see [Fan and Lv, 2008]).

To evaluate the quality of the prediction, the relative l2 error EY is computed on the target Y and the relative quadratic error Eα is computed on the α coefficients EY = kY − Ye k22 /kY k22

and

Eα = kα − α ˆ k22 /kαk22 .

The sparsity S is estimated by the cardinal of L = {ℓ = 1, . . . , p, α b∗ 6= 0} where α b∗ is the LOL

estimator. The number of False Positive and of False Negative as defined in Section 4.2 are also 16

computed. All these quantities are estimated by averaging the results obtained over K = 200 replications of the experiment.

6.2

Algorithm

Let us explain how to determine in a really adaptive way the thresholds λn (1) and λn (2). These are critical values quite hard to tune practically since they depend on inaccessible constants (see the theoretical results). Since the first threshold λn (1) is used to select the candidates to the regression, the aim is to split the set of ’correlations’ {Kℓ , ℓ = 1, . . . , p}, in two clusters in such a way to pick

up the regression candidates in one group. Here, the sparsity assumption is used : some predictors are more correlated to the target Y than some others associated to a weak correlation value, close to zero. This remark implies that the distribution of correlations (in absolute value) should be distributed in two clusters : one for the leaders (high correlations) and one for the others (very small correlations). The frontier between the clusters is adaptively computed by minimizing the deviance of the absolute value correlations for two classes as described in [Kerkyacharian et al., 2009]. The same procedure is used to threshold adaptively the estimated coefficients α bℓ obtained by

linear regression on the leaders. Indeed, notice that the distribution of the α bℓ provides two clusters : one cluster associated to the largest coefficients (in absolute value) corresponding to the non zero coefficients and one cluster composed of coefficients closed to zero, which should not be involved

in the model. The frontier between the two clusters, which defines λn (2), is again computed by minimizing the deviance between the two classes of regression coefficients. Finally, an improvement for LOL is proposed. It seems more appropriate to perform a second regression using the final set L of selected predictors involved in the model : the estimators of the

(non zero) coefficients should be more accurate. This updating procedure is denoted LOL+ in the sequel.

6.3

Results with random gaussian design matrices

First, the design matrix Φ is defined with i.i.d. gaussian variables. The computed coherence is also τn = 0.33 (see Figure 5). As we are interested in quantifying LOL performance in an overwhelming majority of cases, we study the impact of the level of indeterminacy δ from 0 to 0.9 by 0.05 step and the impact of the the sparsity rate ρ from 0.01 to 0.16 by 20 steps. p = 1000 is chosen and for specific studies n = 250. Influence of the indeterminacy level : Figure 1 studies LOL prediction and estimation performances when the indeterminacy level is varying (p = 1000, n varying). Both errors EY and Eα continuously increase with the indeterminacy δ, as the number of available observations

17

decreases compared to the number of variables. For a given value of δ, EY decreases as the sparsity does. For δ ≤ 0.75 (n ≥ 0.25p), the prediction error is weak, below 5%. In this case, the estimation

error on the coefficients is less than 10%. When the number of available observations is at least higher than half of the number of potential predictors (δ < 0.5), the prediction and the estimation errors are negligible : LOL performances are in this case exceptionally good. For a given number of observations and potential predictors, the prediction is more accurate as the sparsity rate decreases. For a fixed number observations, regarding the joint values of both indeterminacy and sparsity parameters, the errors tends to be null as δ and/or ρ decrease. Influence of the sparsity rate : Figure 2 illustrates LOL prediction and estimation performances when the sparsity rate is varying. For small values of sparsity rate (ρ ≤ 5%), both prediction

and estimation errors are very good (less than 5%). For an extreme level of sparsity (ρ ≤ 2%), the

performances are, as expected, excellent. As observed before, for a given sparsity rate value, the performances are improved as the indeterminacy decreases. Estimator of the Sparsity S : Figure 3 shows the estimated sparsity as a function of the

effective sparsity S. For weak sparsity values (ρ ≤ 5%), LOL procedure is excellent because it estimates exactly (with no error) the sparsity S and that for all studied indeterminacy levels.

As the sparsity increases, LOL procedure tends to underestimate the parameter S. For a given sparsity value, the underestimation becomes weaker as the indeterminacy level δ decreases. This observation is detailed in Table 1 where the False Positive and False Negative numbers are computed for different values of sparsity. Two different cases of indeterminacy are presented (δ = 0.75, 0.5). For each indeterminacy level, we observe that False Negative and False Positive numbers increase with S both in mean and variability. As the indeterminacy level decreases from δ = 0.75 to δ = 0.5, meaning that more observations are available relatively to the number of potential predictors, the detection of True Positive is improved. Estimator of the coefficients : Figure 4 presents the improvements provided by LOL+ compared to LOL as a function of sparsity rate for the prediction error. For all indeterminacy and sparsity values, the prediction error decreases using LOL+ procedure instead of LOL. Improvements are stronger as both sparsity rate and indeterminacy level increase. The prediction improvements are observed as ρ increases given all studied indeterminacy levels δ. Obviously, the estimated sparsity in the same for both procedures LOL and LOL+ (see Table 1).

6.4

Impact of the variable distribution in the design matrix

This section investigates the impact of the law of the regressor variables. Eight different distributions are studied : Gaussian (N (0, 1)), Uniform (U [−1, 1]), Bernoulli (B{−1, +1}) and Student

18

(T (m) with m ∈ {5, 4, 3, 2, 1}). The column of the design matrix Φ are empirically normalized.

Figure 5 shows the empirical density of the coherence τn computed for each law. Similar distri-

butions are observed for Gaussian, Uniform or Bernoulli laws with a mode of the coherence equal to τn = 0.30. For Student’s families, a shift of the mode of the empirical distributions can be observed from left to right equaled to 0.36 for T (5), 0.47 for T(4), 0.68 for T(3), 0.92 for T(2) to 0.99 for T(1). Figure 6 studies the estimation of S as a function of the sparsity rate ρ for those distributions. All the curves, except the one for the Student law T(1), are confounded and show similar evolution as the one observed for gaussian predictors (see Figure 3 for δ = 0.25). LOL provides similar results for Gaussian, Uniform, Bernoulli, or Student laws, T(m) with m large enough. It is amazing to observe that the procedure works fine even when the empirical coherence of the distribution τn reaches large values closed to 0.99. But LOL procedure does not work fine for heavy tailed variables as for T (1). Figure 7 shows the coherence of the matrix restricted to the N leaders. This ”restricted” coherence is much lower than the coherence computed on all the predictors. For the Student T (1) law, τn = 0.99 (see Figure 5) while the coherence computed just on the leaders is 0.3 (see Figure 7 by instance for S = 10). LOL procedure provides also good results even when the global coherence approaches 1 : it seems that the practical results are much more optimistic (although they do show some deterioration under high coherence). Conclusions would be that it could be interesting to find new measures of collinearity to best reflect the performances of the method. This is true in general, for all the methods concerned with high dimension. Table 2 shows the false detections F P and F N estimated for different distributions and values of sparsity S. For a given distribution, they increase with sparsity. This increment is stronger for distributions with high coherence. For a given sparsity number S, False Positive and False Negative increase as the coherence τn does. LOL tends to underestimate the number of non-zero coefficients. The underestimation is stronger as the coherence of the predictors increases.

6.5

Comparison with other two-steps procedures

In this part, the performances of LOL and LOL+ are compared with the performances of two two-step procedures. The first one referred as SIS-Lasso is coming from [Fan and Lv, 2008] : the selection step called SIS is followed by the Lasso procedure. The second one, called Lasso-Reg, is proposed in [Cand`es and Plan, 2008]. First, the Lasso algorithm performs the selection of the leaders and then the coefficients are estimated by regression. The performances of the four procedures (LOL, LOL+, SIS-Lasso, Lasso-Reg) are studied over a large range of sparsity rates in order to merge previous results already presented in [Fan and Lv, 2008]

19

and [Cand`es and Plan, 2008]. In this section, the sparsity S varies from 5 to 50 in 10 steps and the number of initial predictors is p = 1000. This experimental design let us analyze extreme sparsity values (0.02 ≤ ρ ≤ 0.05) (as in [Fan and Lv, 2008]) as values as large as 1/log(p) (ρ = 0.20) (as

in [Cand`es and Plan, 2008]). For the Lasso procedures, the regularization parameter is chosen by

crossvalidation. Figure 8 presents the prediction error for the different design matrices distributions presented in the previous section. For extreme sparsity levels, ρ < 5%, all the procedures performs extremely well. For middle sparsity levels (5% ≤ ρ ≤ 15%), the Lasso-Reg performs better than the others

ones, as the design matrix is defined with Gaussian, Uniform, Bernoulli or Student distributions (m = 4, 5). For this range of sparsity levels, the Lasso-Reg procedure seems to be more efficient to

select the leaders than the SIS-Lasso and the LOL procedures. For largest values of the sparsity level ρ ≥ 0.15, it appears that SIS-Lasso and LOL are better than Lasso-Reg. A phase transition

can be observed for the Lasso-Reg procedure as described in [Maleki and Donoho, 2009]. As the coherence of the design matrix increases, the phase transition appears sooner for smallest ρ values. The performances of the SIS-Lasso and LOL are globally similar. Note that LOL+ procedure improves continuously the performances compared to LOL and SIS-Lasso.

R´ ef´ erences [Alquier and Hebiri, 2009] Alquier, P. and Hebiri, M. (2009). Transductive versions of the lasso and the dantzig selector. [Bahadur, 1960] Bahadur, R. R. (1960). On the asymptotic efficiency of tests and estimates. Sankhy¯ a, 22 :229–252. [Barron et al., 2008] Barron, A. R., Cohen, A., Dahmen, W., and DeVore, R. A. (2008). Approximation and learning by greedy algorithms. Ann. Statist., 36(1) :64–94. [Bickel et al., 2007] Bickel, P. J., Ritov, Y., and Tsybakov, A. (2007). Simultaneous analysis of lasso and dantzig selector. Preprint Submitted to the Annals of Statistics. [Binev et al., 2007a] Binev, P., Cohen, A., Dahmen, W., and DeVore, R. (2007a). Universal algorithms for learning theory. II. Piecewise polynomial functions. Constr. Approx., 26(2) :127–152. [Binev et al., 2007b] Binev, P., Cohen, A., Dahmen, W., and DeVore, R. (2007b). Universal piecewise polynomial estimators for machine learning. In Curve and surface design : Avignon 2006, Mod. Methods Math., pages 48–77. Nashboro Press, Brentwood, TN.

20

[Binev et al., 2005] Binev, P., Cohen, A., Dahmen, W., DeVore, R., and Temlyakov, V. (2005). Universal algorithms for learning theory. I. Piecewise constant functions. J. Mach. Learn. Res., 6 :1297–1321 (electronic). [Bunea, 2009] Bunea, F. (2009). Consistent selection via the lasso for high dimensional approximation regression models. [Bunea et al., 2007a] Bunea, F., Tsybakov, A., and Wegkamp, M. (2007a). Sparsity oracle inequalities for the Lasso. Electron. J. Stat., 1 :169–194 (electronic). [Bunea et al., 2007b] Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2007b). Sparse density estimation with ℓ1 penalties. In Learning theory, volume 4539 of Lecture Notes in Comput. Sci., pages 530–543. Springer, Berlin. [Cand`es and Plan, 2008] Cand`es, E. and Plan, Y. (2008). Near-ideal model selection by ℓ1 minimization. [Candes and Tao, 2007] Candes, E. and Tao, T. (2007). The Dantzig selector : statistical estimation when p is much larger than n. Ann. Statist., 35(6) :2313–2351. [DeVore et al., 2006] DeVore, R., Kerkyacharian, G., Picard, D., and Temlyakov, V. (2006). Approximation methods for supervised learning. Found. Comput. Math., 6(1) :3–58. [Fan and Lv, 2008] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Statist. Soc. B, 70 :849–911. [Fan and Lv, 2009] Fan, J. and Lv, J. (2009). A selective overview of variable selection in high dimensional feature space. [Kerkyacharian et al., 2009] Kerkyacharian, G., Mougeot, M., Picard, D., and Tribouley, K. (2009). Learning out leaders : exponential rates of convergence in high dimensional regression. [Kerkyacharian and Picard, 2007] Kerkyacharian, G. and Picard, D. (2007). Thresholding in learning theory. Constr. Approx., 26(2) :173–203. [Lounici, 2008] Lounici, K. (2008). High-dimensional stochastic optimization with the generalized dantzig estimator. [Maleki and Donoho, 2009] Maleki, A. and Donoho, D. L. (2009). Optimally tuned iterative thresholding algorithm for compressed sensing. IEEE journal in signal processing, in press. [Meinshausen and Buhlmann, 2006] Meinshausen, N. and Buhlmann, P. (2006). High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34 :1436–1462. [Raskutti et al., 2009] Raskutti, G., Wainwright, M. J., and Yu, B. (2009). Minimax rates of estimation for high-dimensional linear regression over $ ell q$-balls. 21

[Satterthwaite, 1959] Satterthwaite, F. E. (1959). Random balance experimentation. Technometrics, 1 :111–137. [Temlyakov, 2008] Temlyakov, V. N. (2008). Approximation in learning theory. Constr. Approx., 27(1) :33–74. [Tibshirani, 1996] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58(1) :267–288. [Zhao and Yu, 2006] Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Journal, 7 :2541–2563.

22

0.35

0.8

0.7

0.3

0.6 0.25 0.5 0.2 0.4 0.15 0.3 0.1 0.2

0.05

0

0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Fig. 1 – X−axis : indeterminacy level δ, Y −axis : Prediction error (left) and estimation error (right). S = 10 (solid line-red) ; S = 12 (dot dash line-blue) ; S = 15 (dashed line -green) ; S = 20 (dot line-black).

23

0.5 0.3 0.45

0.4

0.25

0.35 0.2

0.3

0.25 0.15 0.2 0.1

0.15

0.1 0.05 0.05

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

Fig. 2 – X−axis : sparsity rate ρ, Y −axis : Prediction error (left) and estimation error (right). δ = 0.4 (dot line-black) ; δ = 0.7 (dot dash line-blue) ; δ = 0.75 (solid line-red) ; δ = 0.875, (dashed line-green).

24

0.16

0

5

10

15

20

25

30

35

40 40

0.14

35

0.12

30

0.1

25

0.08

20

0.06

15

0.04

10

0.02

5

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0 0.16

Fig. 3 – LOL Sparsity Estimation (ρ : bottom, left ; S : right, top). δ = 0.875 (dashed line-green) ; δ = 0.75 (solid line-red) ; δ = 0.7 (dot dash line-blue) ; δ = 0.4 (dot line-black). The columns of Φ are Gaussian of size n = 250.

δ 0.5 0.5 0.5 0.5 0.5 0.75 0.75 0.75 0.75 0.75

S (ρ) 5 10 15 20 25 5 10 15 20 25

4.98 9.88 14.54 18.78 22.74 4.98 9.90 14.57 18.77 21.94

TP (0.14) (0.35) (0.66) (1.04) (1.42) (0.12) (0.32) (0.56) (0.95) (1.91)

0.00 0.03 0.24 0.76 1.67 0.00 0.04 0.30 1.03 2.81

FN (0.00) (0.18) (0.47) (0.88) (1.26) (0.00) (0.19) (0.51) (0.89) (1.90)

0.00 0.00 0.01 0.03 0.07 0.00 0.00 0.01 0.05 0.19

FP (0.00) (0.00) (0.09) (0.17) (0.25) (0.00) (0.00) (0.12) (0.24) (0.48)

Tab. 1 – Detection, n = 250. The columns of Φ are i.i.d. gaussian. True Positive, False positive and False negative. Means over K = 200 replications, variances into the brackets.

25

0.35

0.3

0.25

0.2

0.15

0.1

0.05

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

Fig. 4 – Error. X−axis : sparsity rate ρ. Y −axis : Prediction errors for LOL (dot lines) and LOL+ (solid lines). δ = 0.4 (blue color) ; δ = 0.75 (red color) ; δ = 0.875 (green color). The columns of Φ are Gaussian of size n = 250. S 5 10 15 20 25 30 35 40

FP FN FP FN FP FN FP FN FP FN FP FN FP FN FP FN

0.00 0.00 0.01 0.04 0.03 0.41 0.09 1.26 0.19 2.78 0.39 5.90 0.70 9.44 1.24 14.73

G (0.0) (0.0) (0.1) (0.2) (0.2) (0.6) (0.3) (1.0) (0.4) (1.5) (0.8) (2.9) (1.5) (3.7) (1.5) (4.5)

0.00 0.00 0.00 0.05 0.02 0.35 0.07 1.26 0.10 2.93 0.42 6.05 0.61 9.19 1.21 14.98

U (0.0) (0.0) (0.0) (0.2) (0.1) (0.6) (0.3) (1.1) (0.3) (1.8) (0.9) (3.0) (1.0) (3.9) (1.5) (4.8)

0.00 0.00 0.00 0.05 0.02 0.36 0.08 1.25 0.17 2.61 0.39 5.45 0.78 9.54 1.18 14.72

B (0.0) (0.0) (0.0) (0.2) (0.1) (0.6) (0.3) (1.0) (0.4) (1.7) (0.6) (2.5) (1.3) (3.6) (1.4) (5.0)

0.00 0.00 0.00 0.06 0.03 0.31 0.07 1.24 0.17 2.69 0.34 5.93 0.68 9.63 1.06 15.15

T(5) (0.0) (0.0) (0.0) (0.3) (0.2) (0.6) (0.3) (1.0) (0.4) (1.8) (0.6) (2.8) (1.1) (4.3) (1.5) (4.6)

0.00 0.00 0.00 0.04 0.01 0.30 0.14 1.37 0.21 2.84 0.36 5.29 0.63 10.02 1.15 14.56

T(4) (0.0) (0.0) (0.0) (0.2) (0.1) (0.5) (0.4) (1.1) (0.5) (1.7) (0.7) (2.8) (1.0) (4.0) (1.5) (4.9)

0.00 0.00 0.01 0.06 0.03 0.34 0.08 1.25 0.23 2.75 0.41 5.42 0.84 9.73 1.31 15.53

T(3) (0.0) (0.0) (0.1) (0.2) (0.2) (0.6) (0.3) (1.0) (0.6) (1.8) (0.7) (2.7) (1.7) (4.1) (1.7) (4.3)

0.00 0.00 0.01 0.04 0.11 0.35 0.39 1.26 0.53 2.92 0.83 5.47 1.02 10.01 1.60 15.34

T(2) (0.0) (0.0) (0.1) (0.2) (0.4) (0.6) (0.6) (1.2) (0.7) (1.9) (1.0) (3.0) (1.3) (3.9) (2.1) (5.1)

0.01 0.01 0.19 0.16 1.08 0.56 1.91 1.59 3.92 4.12 4.69 8.76 5.71 14.77 6.24 21.70

Tab. 2 – False Detection, n = 250, p = 1000. First line : Common law of the columns of Φ. First column : Sparsity S. First lines : False positive, Second lines : False negative. Means over K = 200 replications, variances into the brackets. 26

T(1) (0.2) (0.1) (0.8) (0.8) (1.9) (1.3) (2.2) (2.8) (2.7) (3.9) (2.7) (7.4) (3.0) (8.6) (3.0) (9.0)

50 40 30 20 10 0

0.4

0.6

0.8

1.0

Fig. 5 – n = 250, p = 1000. Empirical densities of the coherence. The columns of Φ are Gaussian (solid line-red) ; uniform (solid line-blue) ; Bernoulli (solid line-green) ; Student 5, 4, 3, 2, 1 black lines from left to right.

27

0.16

0

5

10

15

20

25

30

35

40 40

0.14

35

0.12

30

0.1

25

0.08

20

0.06

15

0.04

10

0.02

5

0

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0 0.16

Fig. 6 – LOL Sparsity estimation for different families of laws for the predictors. Gauss (solid line-red) ; U nif orm (solid line-blue) ; Bernoulli : (solid line-green) ; T(1-5) (black-lines). n = 250, p = 1000. (K = 200)

28

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0

10

20

30

40

50

60

70

80

Fig. 7 – Coherence computed for the N selected Leaders. Gauss (solid line-red) ; U nif orm (solid line-blue) ; Bernoulli : (solid line-green) ; T(1) (dot line-black). n = 250, p = 1000. (K = 200)

29

N(0,1)

U[−1,1]

B(−1,1)

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0

0

0

0

0.1

0.2

0

0.1

T(4)

0.2

T(3) 0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0

0

0

0.1

0.2

0

0.1

0.1

0.2

T(2)

0.8

0

0

0.2

0

0.1

0.2

Fig. 8 – X-axis : Sparsity rate. Y-Axis : Prediction error for different design matrices. LOL (red solid line), LOL+ (red dotted lines), SIS − Lasso (green solid lines), and Lasso − Reg (blue solid line). n = 250, p = 1000. (K = 200) 30

*When life gives you a hundred reasons to cry, show life that you have a thousand reasons to smile*

© Copyright 2015 - 2021 PDFFOX.COM - All rights reserved.