Title: | Canonical Correlation for Survey Data |
---|---|
Description: | Performs canonical correlation analysis for survey data, including multiple tests of significance for secondary canonical correlations. A key feature of this package is that it incorporates the survey data structure directly in a novel test of significance via a sequence of simple linear regression models on the canonical variates. Reference: Cruz-Cano, Cohen, and Mead-Morse (2024), "Canonical Correlation Analysis of Survey data: the SurveyCC R package", The R Journal, under review. |
Authors: | Raul Cruz-Cano [aut, cre] |
Maintainer: | Raul Cruz-Cano <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.1 |
Built: | 2025-02-01 04:43:41 UTC |
Source: | https://github.com/237triangle/surveycc |
Plot cc object
## S3 method for class 'surveycc'
plot(x, dim1 = 1, dim2 = 2, ...)
x | the survey cc object, produced by [surveycc()] |
dim1, dim2 | determine which canonical variates serve as the horizontal and vertical axes in the plot. Default is dim1 = 1, dim2 = 2. Neither may exceed the total number of canonical variates in the survey cc object. |
... | Not used. |
the plots
# PATH example
design_object <- survey::svrepdesign(
  id = ~PERSONID,
  weights = ~R01_A_PWGT,
  repweights = "R01_A_PWGT[1-9]+",
  type = "Fay",
  rho = 0.3,
  data = reducedPATHdata,
  mse = TRUE
)
var.x <- c("R01_AC1022", "R01_AE1022", "R01_AG1022CG")
var.y <- c("R01_AX0075", "R01_AX0076")
howmany <- 2
out <- surveycc(design_object, var.x, var.y, howmany = howmany, selection = "ROWS")
plot(out, dim1 = 1, dim2 = 2)
A subset of data from the National Youth Tobacco Survey (NYTS) Study
reducedNYTS2021data
## 'reducedNYTS2021data' A data frame with 1150 rows and 24 columns:
Primary sampling unit
Strata information
...
<https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html>
A subset of data from the Population Assessment of Tobacco and Health (PATH) Study
reducedPATHdata
## 'reducedPATHdata' A data frame with 132 rows and 107 columns:
Participant ID
Ever smoked a cigarette
...
<https://www.icpsr.umich.edu/web/NAHDAP/studies/36498>
This function extends the functionality of [candisc::cancor] by calculating the test statistics, degrees of freedom, and p-values needed to estimate and interpret the statistical significance of the secondary canonical correlations according to Wilks' lambda, Pillai's trace, and the Hotelling-Lawley trace (Caliński et al., 2006) and Roy's largest root (Johnstone, 2009). 'surveycc' can also draw the units and variables graphs (Gittins, 1986), further complementing the information reported by the existing 'cancor'.
Moreover, 'surveycc' implements an algorithm (Cruz-Cano, Cohen, and Mead-Morse, 2024) that allows the inclusion of complex survey design elements, e.g. strata, clusters, and replicate weights, in the estimation of the statistical significance of the canonical correlations. The core idea of the algorithm is to reduce the problem of finding the correlations among the canonical variates, and their statistical significance, to fitting an equivalent sequence of univariate linear regressions. This switch lets the user take advantage of the existing theoretical and computational resources that integrate complex survey design elements into these regression models (Valliant and Dever, 2018). Hence, the algorithm can handle the same complex design elements as 'survey'.
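The regression idea can be illustrated without any survey machinery. The sketch below is not the package's implementation: it uses base R's stats::cancor on simulated data (all variable names are made up for illustration) and shows that regressing the first canonical variate of one set on the first canonical variate of the other recovers the first canonical correlation as the square root of R-squared. In the survey setting, the same regression would presumably be fitted with a design-aware routine such as survey::svyglm, which is how strata, clusters, and replicate weights could enter the significance test.

```r
# Illustration only: simulated data, base R, no survey design.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
# Make the y set depend on the x set so the correlations are non-trivial.
Y <- cbind(
  y1 = X[, 1] + rnorm(n),
  y2 = X[, 2] - X[, 3] + rnorm(n)
)

cc <- stats::cancor(X, Y)

# First pair of canonical variates, centered the same way cancor centers.
U1 <- drop(sweep(X, 2, cc$xcenter) %*% cc$xcoef[, 1])
V1 <- drop(sweep(Y, 2, cc$ycenter) %*% cc$ycoef[, 1])

# Simple linear regression of one variate on the other.
fit <- lm(U1 ~ V1)
r_from_regression <- sqrt(summary(fit)$r.squared)

# Compare the regression-based value with the first canonical correlation.
print(c(cancor = cc$cor[1], regression = r_from_regression))

# The slope's t-test is the unweighted analogue of a design-based test.
slope_p_value <- summary(fit)$coefficients["V1", "Pr(>|t|)"]
```

The two printed values agree up to floating-point error, because the R-squared of a simple linear regression is the squared correlation between response and predictor.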
surveycc(
  design_object,
  var.x,
  var.y,
  howmany = NA,
  selection = c("FREQ", "ROWS")
)
design_object | a survey design object generated by package 'survey', e.g. [survey::svydesign] |
var.x | the first set of variables; a vector of names |
var.y | the second set of variables; a vector of names |
howmany | positive integer; the number of canonical correlations for which the statistical significance statistics are displayed. Defaults to the minimum of 'length(var.x)' and 'length(var.y)', and cannot exceed this value. |
selection | determines whether the sample size used is the number of rows in the data set ('ROWS') or the sum of the survey weights ('FREQ'). |
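As a small illustration of the last two arguments, the sketch below computes the two candidate sample sizes and the upper bound for 'howmany' on a toy data frame. The data and column names are hypothetical, and the code only mirrors the argument descriptions above; it does not call the package.

```r
# Toy data: four rows with hypothetical survey weights in column `w`.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 1, 4, 3),
  c = c(0, 1, 0, 1),
  w = c(10.5, 20, 15.5, 4)
)
var.x <- c("a", "b")
var.y <- c("c")

n_rows <- nrow(toy)   # sample size in the sense of selection = "ROWS"
n_freq <- sum(toy$w)  # sample size in the sense of selection = "FREQ"

# Default and upper bound for `howmany` per the argument description.
howmany_max <- min(length(var.x), length(var.y))
```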
An object with S3 class "surveycc": a list containing the canonical correlation object, the dimensions for plotting, and tables of the various tests of significance. These tables include the test statistics, degrees of freedom, and p-values for:
* Wilks' lambda
* Pillai's trace
* Hotelling-Lawley trace
* Roy's greatest root
* the Cruz-Cano algorithm using the survey design object
NOTE: For more information on the statistics presented, i.e. test statistic, df1, df2, Chi-Sq/F, and p-val, see the documentation of [candisc::cancor] for Wilks' Lambda, Pillai's Trace, and the Hotelling-Lawley Trace (although the present package uses a Chi-squared approximation to the F-distribution), and the documentation of [survey::svyglm] for the Weighted/Complex Survey Design regression.
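For orientation, the classical statistics can be written directly in terms of the canonical correlations. The helper below is a hypothetical sketch using the textbook definitions, not code from the package; it omits the Chi-squared and F approximations, and it reports Roy's root in its squared-correlation form (some references instead use the ratio rho^2 / (1 - rho^2)). Setting k > 1 drops the first k - 1 correlations, which is how the secondary correlations are tested.

```r
# Classical multivariate statistics computed from canonical correlations.
# `rho`: canonical correlations in decreasing order.
# `k`:   index of the first correlation included (k = 1 tests all of them).
cc_test_stats <- function(rho, k = 1) {
  stopifnot(k >= 1, k <= length(rho), all(rho >= 0), all(rho < 1))
  r2 <- rho[k:length(rho)]^2
  c(
    wilks     = prod(1 - r2),        # Wilks' lambda
    pillai    = sum(r2),             # Pillai's trace
    hotelling = sum(r2 / (1 - r2)),  # Hotelling-Lawley trace
    roy       = max(r2)              # Roy's root, squared-correlation form
  )
}

rho <- c(0.8, 0.5)                      # made-up canonical correlations
stats_all <- cc_test_stats(rho)         # all correlations jointly
stats_second <- cc_test_stats(rho, k = 2)  # secondary correlation only
print(stats_all)
print(stats_second)
```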
* Cruz-Cano, Cohen, and Mead-Morse. Canonical Correlation Analysis of Survey data: The SurveyCC R package. The R Journal, under review; 2024.
* Gentzke AS, Wang TW, Cornelius M, Park-Lee E, Ren C, Sawdey MD, Cullen KA, Loretan C, Jamal A, Homa DM. Tobacco Product Use and Associated Factors among Middle and High School Students - National Youth Tobacco Survey, United States, 2021. MMWR Surveill Summ. 2022;71(5):1-29. doi: 10.15585/mmwr.ss7105a1. PubMed PMID: 35271557.
* Gittins R. Canonical Analysis: A Review with Applications in Ecology. Springer Berlin Heidelberg; 1986.
* Caliński T, Krzyśko M, Wołyński W. A Comparison of Some Tests for Determining the Number of Nonzero Canonical Correlations. Communications in Statistics - Simulation and Computation. 2006;35(3):727-749. DOI: 10.1080/03610 6290.
* Hyland A, Ambrose BK, Conway KP, et al. Design and methods of the Population Assessment of Tobacco and Health (PATH) Study. Tobacco Control. 2017;26:371-378.
* Johnstone IM. Approximate Null Distribution of the largest root in a Multivariate Analysis. Ann Appl Stat. 2009;3(4):1616-1633. doi: 10.1214/08-AOAS220. PMID: 20526465; PMCID: PMC2880335.
* Valliant R, Dever JA. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press; 2018. ISBN-13: 978-1-59718-260-7.
# PATH example
design_object <- survey::svrepdesign(
  id = ~PERSONID,
  weights = ~R01_A_PWGT,
  repweights = "R01_A_PWGT[1-9]+",
  type = "Fay",
  rho = 0.3,
  data = reducedPATHdata,
  mse = TRUE
)
var.x <- c("R01_AC1022", "R01_AE1022", "R01_AG1022CG")
var.y <- c("R01_AX0075", "R01_AX0076")
howmany <- 2
surveycc(design_object, var.x, var.y, howmany = howmany, selection = "ROWS")

# NYTS example
design_object <- survey::svydesign(
  ids = ~psu2,
  nest = TRUE,
  strata = ~v_stratum2,
  weights = ~finwgt,
  data = reducedNYTS2021data
)
var.x <- c("qn9", "qn38", "qn40", "qn53", "qn54", "qn64", "qn69", "qn74",
           "qn76", "qn78", "qn80", "qn82", "qn85", "qn88", "qn89")
var.y <- c("qn128", "qn129", "qn130", "qn131", "qn132", "qn134")
howmany <- 3
surveycc(design_object = design_object, var.x = var.x, var.y = var.y,
         howmany = howmany, selection = "ROWS")