haven/0000755000176200001440000000000013230073603011350 5ustar liggesusers
haven/inst/0000755000176200001440000000000013227731765012345 5ustar liggesusers
haven/inst/examples/0000755000176200001440000000000012743423326014154 5ustar liggesusers
haven/inst/examples/iris.sas7bdat0000644000176200001440000040000012743423326016546 0ustar liggesusers
haven/inst/examples/iris.dta0000644000176200001440000002002512743423326015613 0ustar liggesusers
haven/inst/examples/iris.sav0000644000176200001440000001504212743423326015637 0ustar liggesusers
haven/inst/doc/0000755000176200001440000000000013227731764013111 5ustar liggesusers
haven/inst/doc/semantics.Rmd0000644000176200001440000001445313224443423015537 0ustar liggesusers
---
title: "Conversion semantics"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Conversion semantics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(haven)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

There are some differences between the way that R, SAS, SPSS, and Stata represent labelled data and missing values. While SAS, SPSS, and Stata share some obvious similarities, R is a little different. This vignette explores the differences, and shows you how haven bridges the gap.

## Value labels

Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This, however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:

* SPSS and SAS can label numeric and character values, not just integer values.

* The value labels do not need to be exhaustive. It is common to label the special missing values (e.g. `.D` = did not respond, `.N` = not applicable), while leaving other values as is.

Value labels in SAS are a little different again. In SAS, labels are just a special case of general formats.
Formats include currencies and dates, but user-defined formats just assign labels to individual values (including special missing values). Formats have names and exist independently of the variables they are associated with. You create a named format with `PROC FORMAT` and then associate it with variables in a `DATA` step (the names of character formats always start with `$`).

### `labelled()`

To allow you to import labelled vectors into R, haven provides the S3 labelled class, created with `labelled()`. This class allows you to associate arbitrary labels with numeric or character vectors:

```{r}
x1 <- labelled(
  sample(1:5),
  c(Good = 1, Bad = 5)
)
x1

x2 <- labelled(
  c("M", "F", "F", "F", "M"),
  c(Male = "M", Female = "F")
)
x2
```

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate data structure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels:

```{r}
as_factor(x1)
zap_labels(x1)

as_factor(x2)
zap_labels(x2)
```

See the documentation for `as_factor()` for more options to control exactly what the factor uses for levels.

Both `as_factor()` and `zap_labels()` have data frame methods if you want to apply the same strategy to every column in a data frame:

```{r}
df <- tibble::data_frame(x1, x2, z = 1:5)
df

zap_labels(df)
as_factor(df)
```

## Missing values

All three tools provide a global "system missing value" which is displayed as `.`. This is roughly equivalent to R's `NA`, although neither Stata nor SAS propagates missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. `-inf`), and Stata treats it as the largest possible number (i.e. `inf`).

Each tool also provides a mechanism for recording multiple types of missingness:

* Stata has "extended" missing values, `.A` through `.Z`.

* SAS has "special" missing values, `.A` through `.Z` plus `._`.
* SPSS has per-column "user" missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.

Stata and SAS only support tagged missing values for numeric columns. SPSS supports up to three distinct values for character columns. Generally, operations involving a user-missing type return a system missing value.

Haven models these missing values in two different ways:

* For SAS and Stata, haven provides "tagged" missing values which extend R's regular `NA` to add a single character label.

* For SPSS, haven provides a subclass of `labelled` that also provides user defined values and ranges.

### Tagged missing values

To support Stata's extended and SAS's special missing values, haven implements a tagged NA. It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identically to NA in regular R operations, while still preserving the value of the tag.

The R interface for creating tagged NAs is a little clunky because generally they'll be created by haven for you. But you can create your own with `tagged_na()`:

```{r}
x <- c(1:3, tagged_na("a", "z"), 3:1)
x
```

Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use `print_tagged_na()`:

```{r}
print_tagged_na(x)
```

To test if a value is a tagged NA, use `is_tagged_na()`, and to extract the value of the tag, use `na_tag()`:

```{r}
is_tagged_na(x)
is_tagged_na(x, "a")

na_tag(x)
```

My expectation is that tagged missings are most often used in conjunction with labels (described above), so labelled vectors print the tags for you, and `as_factor()` knows how to relabel:

```{r}
y <- labelled(x, c("Not home" = tagged_na("a"), "Refused" = tagged_na("z")))
y

as_factor(y)
```

### User defined missing values

SPSS's user-defined values work differently to SAS and Stata.
Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides `labelled_spss()` as a subclass of `labelled()` to model these additional user-defined missings.

```{r}
x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

x1
x2
```

These objects are somewhat dangerous to work with in R because most R functions don't know those values are missing:

```{r}
mean(x1)
```

Because of that danger, the default behaviour of `read_spss()` is to return regular labelled objects where user-defined missing values have been converted to `NA`s. To get `read_spss()` to return `labelled_spss()` objects, you'll need to set `user_na = TRUE`.

I've defined an `is.na()` method so you can find them yourself:

```{r}
is.na(x1)
```

And the presence of that method does mean many functions with an `na.rm` argument will work correctly:

```{r}
mean(x1, na.rm = TRUE)
```

But generally you should either convert to a factor, convert to regular missing values, or strip all the labels:

```{r}
as_factor(x1)
zap_missing(x1)
zap_labels(x1)
```

haven/inst/doc/datetimes.html0000644000176200001440000001325213227731762015757 0ustar liggesusers
Dates and times

Dates and times

Formats

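There are three common formats across SAS, SPSS and Stata.

Date (number of days):

  • SAS: MMDDYY, DDMMYY, YYMMDD, DATE

  • SPSS: n/a

  • Stata: %td

Time (number of seconds):

  • SAS: TIME, HHMM, TOD

  • SPSS: TIME, DTIME

  • Stata: n/a

DateTime (number of seconds):

  • SAS: DATETIME

  • SPSS: DATE, ADATE, SDATE, DATETIME (as milliseconds)

  • Stata: %tc, %tC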

Offsets

Dates and date times use a different offset from R:

  • SAS: 1960-01-01 (3653 days)

  • SPSS: 1582-10-14 (141428 days)

  • Stata: 1960-01-01 (3653 days)
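
To make the offsets concrete, here is a minimal sketch, in base R only, of shifting raw stored values onto R's 1970-01-01 origin by hand. The helper names `sas_date_to_r()` and `spss_datetime_to_r()` are made up for illustration and are not haven functions; haven is expected to handle this conversion for you when it reads a file.

```r
# Hypothetical helpers (not part of haven): convert raw stored numbers
# to R date classes by naming each tool's own origin.

# SAS and Stata dates are day counts from 1960-01-01.
sas_date_to_r <- function(days) {
  as.Date(days, origin = "1960-01-01")
}

# SPSS date-times are second counts from 1582-10-14.
spss_datetime_to_r <- function(secs) {
  as.POSIXct(secs, origin = "1582-10-14", tz = "UTC")
}

sas_date_to_r(c(0, 3653))
spss_datetime_to_r(141428 * 86400)
```

Day 3653 on the SAS/Stata scale and second 141428 * 86400 on the SPSS scale both correspond to R's own origin, 1970-01-01.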

References

haven/inst/doc/semantics.R0000644000176200001440000000337213227731764015227 0ustar liggesusers
## ---- include = FALSE----------------------------------------------------
library(haven)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

## ------------------------------------------------------------------------
x1 <- labelled(
  sample(1:5),
  c(Good = 1, Bad = 5)
)
x1

x2 <- labelled(
  c("M", "F", "F", "F", "M"),
  c(Male = "M", Female = "F")
)
x2

## ------------------------------------------------------------------------
as_factor(x1)
zap_labels(x1)

as_factor(x2)
zap_labels(x2)

## ------------------------------------------------------------------------
df <- tibble::data_frame(x1, x2, z = 1:5)
df

zap_labels(df)
as_factor(df)

## ------------------------------------------------------------------------
x <- c(1:3, tagged_na("a", "z"), 3:1)
x

## ------------------------------------------------------------------------
print_tagged_na(x)

## ------------------------------------------------------------------------
is_tagged_na(x)
is_tagged_na(x, "a")

na_tag(x)

## ------------------------------------------------------------------------
y <- labelled(x, c("Not home" = tagged_na("a"), "Refused" = tagged_na("z")))
y

as_factor(y)

## ------------------------------------------------------------------------
x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

x1
x2

## ------------------------------------------------------------------------
mean(x1)

## ------------------------------------------------------------------------
is.na(x1)

## ------------------------------------------------------------------------
mean(x1, na.rm = TRUE)

## ------------------------------------------------------------------------
as_factor(x1)
zap_missing(x1)
zap_labels(x1)

haven/inst/doc/datetimes.Rmd0000644000176200001440000000242213006437521015521 0ustar liggesusers
---
title: "Dates and times"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Date times}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Formats

There are three common formats across SAS, SPSS and Stata.

Date (number of days):

* SAS: MMDDYY, DDMMYY, YYMMDD, DATE
* SPSS: n/a
* Stata: %td

Time (number of seconds):

* SAS: TIME, HHMM, TOD
* SPSS: TIME, DTIME
* Stata: n/a

DateTime (number of seconds):

* SAS: DATETIME
* SPSS: DATE, ADATE, SDATE, DATETIME (as milliseconds)
* Stata: %tc, %tC

## Offsets

Dates and date times use a different offset from R:

* SAS: 1960-01-01 (`r -as.integer(as.Date("1960-01-01"))` days)
* SPSS: 1582-10-14 (`r -as.integer(as.Date("1582-10-14"))` days)
* Stata: 1960-01-01 (`r -as.integer(as.Date("1960-01-01"))` days)

## References

* SAS: ,
* SPSS:
* Stata:

haven/inst/doc/semantics.html0000644000176200001440000007212613227731764015775 0ustar liggesusers
Conversion semantics

Conversion semantics

There are some differences between the way that R, SAS, SPSS, and Stata represent labelled data and missing values. While SAS, SPSS, and Stata share some obvious similarities, R is a little different. This vignette explores the differences, and shows you how haven bridges the gap.

Value labels

Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This, however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:

  • SPSS and SAS can label numeric and character values, not just integer values.

  • The value labels do not need to be exhaustive. It is common to label the special missing values (e.g. .D = did not respond, .N = not applicable), while leaving other values as is.
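
As a quick base R sketch of why a factor is not a general-purpose value label: its underlying data is always a set of consecutive integer codes, and every value has to be one of the declared levels. The variable names below are illustrative only.

```r
# A factor stores integer codes plus a "levels" attribute.
f <- factor(c("low", "high", "low"), levels = c("low", "high"))

typeof(f)      # the underlying storage type
as.integer(f)  # codes that index into levels(f)
levels(f)

# A value that is not one of the levels cannot be left "as is":
# it is turned into NA.
g <- factor(c("low", "medium"), levels = c("low", "high"))
is.na(g)
```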

Value labels in SAS are a little different again. In SAS, labels are just a special case of general formats. Formats include currencies and dates, but user-defined formats just assign labels to individual values (including special missing values). Formats have names and exist independently of the variables they are associated with. You create a named format with PROC FORMAT and then associate it with variables in a DATA step (the names of character formats always start with $).

labelled()

To allow you to import labelled vectors into R, haven provides the S3 labelled class, created with labelled(). This class allows you to associate arbitrary labels with numeric or character vectors:

x1 <- labelled(
sample(1:5),
c(Good = 1, Bad = 5)
)
x1
#> <Labelled integer>
#> [1] 3 4 1 5 2
#>
#> Labels:
#> value label
#> 1 Good
#> 5 Bad
x2 <- labelled(
c("M", "F", "F", "F", "M"),
c(Male = "M", Female = "F")
)
x2
#> <Labelled character>
#> [1] M F F F M
#>
#> Labels:
#> value label
#> M Male
#> F Female

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate data structure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels:

as_factor(x1)
#> [1] 3 4 Good Bad 2
#> Levels: Good 2 3 4 Bad
zap_labels(x1)
#> [1] 3 4 1 5 2
as_factor(x2)
#> [1] Male Female Female Female Male
#> Levels: Female Male
zap_labels(x2)
#> [1] "M" "F" "F" "F" "M"

See the documentation for as_factor() for more options to control exactly what the factor uses for levels.
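
For instance, the `levels` argument changes what the factor levels look like. The sketch below reuses a vector from the package's own `as_factor()` tests and only exercises the `"default"` and `"both"` settings that those tests cover; treat any other setting as something to confirm in `?as_factor` for your installed version.

```r
library(haven)

# Three labels are declared, but only the value 1 occurs in the data.
s1 <- labelled(rep(1, 3), c("A" = 1, "B" = 2, "C" = 3))

# Default: levels come from the labels, including the unused ones.
f_default <- as_factor(s1)
levels(f_default)

# "both": each level combines the value and its label, e.g. "[1] A".
f_both <- as_factor(s1, levels = "both")
levels(f_both)
```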

Both as_factor() and zap_labels() have data frame methods if you want to apply the same strategy to every column in a data frame:

df <- tibble::data_frame(x1, x2, z = 1:5)
df
#> # A tibble: 5 x 3
#> x1 x2 z
#> <int+lbl> <chr+lbl> <int>
#> 1 3 M 1
#> 2 4 F 2
#> 3 1 F 3
#> 4 5 F 4
#> 5 2 M 5
zap_labels(df)
#> # A tibble: 5 x 3
#> x1 x2 z
#> <int> <chr> <int>
#> 1 3 M 1
#> 2 4 F 2
#> 3 1 F 3
#> 4 5 F 4
#> 5 2 M 5
as_factor(df)
#> # A tibble: 5 x 3
#> x1 x2 z
#> <fct> <fct> <int>
#> 1 3 Male 1
#> 2 4 Female 2
#> 3 Good Female 3
#> 4 Bad Female 4
#> 5 2 Male 5

Missing values

All three tools provide a global “system missing value” which is displayed as .. This is roughly equivalent to R’s NA, although neither Stata nor SAS propagate missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. -inf), and Stata treats it as the largest possible number (i.e. inf).

Each tool also provides a mechanism for recording multiple types of missingness:

  • Stata has “extended” missing values, .A through .Z.

  • SAS has “special” missing values, .A through .Z plus ._.

  • SPSS has per-column “user” missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.

Stata and SAS only support tagged missing values for numeric columns. SPSS supports up to three distinct values for character columns. Generally, operations involving a user-missing type return a system missing value.

Haven models these missing values in two different ways:

  • For SAS and Stata, haven provides “tagged” missing values which extend R’s regular NA to add a single character label.

  • For SPSS, haven provides a subclass of labelled that also provides user defined values and ranges.

Tagged missing values

To support Stata’s extended and SAS’s special missing values, haven implements a tagged NA. It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identically to NA in regular R operations, while still preserving the value of the tag.

The R interface for creating tagged NAs is a little clunky because generally they’ll be created by haven for you. But you can create your own with tagged_na():

x <- c(1:3, tagged_na("a", "z"), 3:1)
x
#> [1] 1 2 3 NA NA 3 2 1

Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use print_tagged_na():

print_tagged_na(x)
#> [1] 1 2 3 NA(a) NA(z) 3 2 1

To test if a value is a tagged NA, use is_tagged_na(), and to extract the value of the tag, use na_tag():

is_tagged_na(x)
#> [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
is_tagged_na(x, "a")
#> [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
na_tag(x)
#> [1] NA NA NA "a" "z" NA NA NA
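
A small additional sketch of the claim that tagged NAs behave like ordinary missing values in base R while still carrying their tag. It uses only functions already shown in this vignette plus base R; the vector itself is made up.

```r
library(haven)

x <- c(1, 2, tagged_na("a"), tagged_na("z"))

# Base R sees ordinary missing values ...
is.na(x)
mean(x, na.rm = TRUE)

# ... while haven can still tell the two kinds of missingness apart.
na_tag(x)
is_tagged_na(x, "z")
```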

My expectation is that tagged missings are most often used in conjunction with labels (described above), so labelled vectors print the tags for you, and as_factor() knows how to relabel:

y <- labelled(x, c("Not home" = tagged_na("a"), "Refused" = tagged_na("z")))
y
#> <Labelled double>
#> [1] 1 2 3 NA(a) NA(z) 3 2 1
#>
#> Labels:
#> value label
#> NA(a) Not home
#> NA(z) Refused
as_factor(y)
#> [1] 1 2 3 Not home Refused 3 2 1
#> Levels: 1 2 3 Not home Refused

User defined missing values

SPSS’s user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides labelled_spss() as a subclass of labelled() to model these additional user-defined missings.

x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))
x1
#> <Labelled SPSS double>
#> [1] 1 2 3 4 5 6 7 8 9 10 99
#> Missing values: 99
#>
#> Labels:
#> value label
#> 99 Missing
x2
#> <Labelled SPSS double>
#> [1] 1 2 3 4 5 6 7 8 9 10 99
#> Missing range: [90, Inf]
#>
#> Labels:
#> value label
#> 99 Missing

These objects are somewhat dangerous to work with in R because most R functions don’t know those values are missing:

mean(x1)
#> [1] 14

Because of that danger, the default behaviour of read_spss() is to return regular labelled objects where user-defined missing values have been converted to NAs. To get read_spss() to return labelled_spss() objects, you’ll need to set user_na = TRUE.
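
A sketch of the two calling conventions, assuming the example file `iris.sav` that ships in the package's `inst/examples` directory is installed. That file may not declare any user-defined missing values, in which case both results will look the same; the point here is only how `user_na` is passed.

```r
library(haven)

# Example file from the package's inst/examples directory.
path <- system.file("examples", "iris.sav", package = "haven")

# Default: user-defined missing values are converted to NA.
iris_default <- read_spss(path)

# Opt in to labelled_spss() columns that keep user-missing declarations.
iris_user_na <- read_spss(path, user_na = TRUE)

dim(iris_default)
dim(iris_user_na)
```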

I’ve defined an is.na() method so you can find them yourself:

is.na(x1)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

And the presence of that method does mean many functions with an na.rm argument will work correctly:

mean(x1, na.rm = TRUE)
#> [1] 5.5

But generally you should either convert to a factor, convert to regular missing values, or strip all the labels:

as_factor(x1)
#> [1] 1 2 3 4 5 6 7 8
#> [9] 9 10 Missing
#> Levels: 1 2 3 4 5 6 7 8 9 10 Missing
zap_missing(x1)
#> <Labelled double>
#> [1] 1 2 3 4 5 6 7 8 9 10 NA
#>
#> Labels:
#> value label
#> 99 Missing
zap_labels(x1)
#> [1] 1 2 3 4 5 6 7 8 9 10 NA
haven/tests/0000755000176200001440000000000013227731765012532 5ustar liggesusers
haven/tests/testthat.R0000644000176200001440000000006612464715603014512 0ustar liggesusers
library(testthat)
library(haven)

test_check("haven")

haven/tests/testthat/0000755000176200001440000000000013230073603014352 5ustar liggesusers
haven/tests/testthat/hadley.sas7bdat0000644000176200001440000040000012466441773017265 0ustar liggesusers
haven/tests/testthat/test-utils.R0000644000176200001440000000071013227416744016625 0ustar liggesusers
context("test-utils.R")

# max_level_lengths -------------------------------------------------------

test_that("works with NA levels", {
  x <- factor(c("a", "abc", NA), exclude = NULL)
  expect_equal(max_level_length(x), 3)
})

test_that("works with empty factors", {
  x <- factor(character(), levels = character())
  expect_equal(max_level_length(x), 0)

  x <- factor(character(), levels = c(NA_character_))
  expect_equal(max_level_length(x), 0)
})

haven/tests/testthat/helper-lump.R0000644000176200001440000000012712754135646016747 0ustar liggesusers
lump_test <- function(x) {
  ifelse(in_smallest(x), "other", letters[seq_along(x)])
}

haven/tests/testthat/test-write-dta.R0000644000176200001440000000516413042171172017361 0ustar liggesusers
context("write_dta")

test_that("can roundtrip basic types", {
  x <- runif(10)
  expect_equal(roundtrip_var(x, "dta"), x)
  expect_equal(roundtrip_var(1:10, "dta"), 1:10)
  expect_equal(roundtrip_var(c(TRUE, FALSE), "dta"), c(1, 0))
  expect_equal(roundtrip_var(letters, "dta"), letters)
})

test_that("can roundtrip missing values (as much as possible)", {
  expect_equal(roundtrip_var(NA, "dta"), NA_integer_)
  expect_equal(roundtrip_var(NA_real_, "dta"), NA_real_)
  expect_equal(roundtrip_var(NA_integer_, "dta"), NA_integer_)
  expect_equal(roundtrip_var(NA_character_, "dta"), "")
})

test_that("can roundtrip date times", {
  x1 <- c(as.Date("2010-01-01"), NA)
  x2 <- as.POSIXct(x1)
  attr(x2, "tzone") <- "UTC"

  expect_equal(roundtrip_var(x1, "dta"), x1)
  expect_equal(roundtrip_var(x2, "dta"), x2)
})

test_that("infinity gets converted to NA", {
  expect_equal(roundtrip_var(c(Inf, 0, -Inf), "dta"), c(NA, 0, NA))
})

test_that("factors become labelleds", {
  f <- factor(c("a", "b"), levels = letters[1:3])
  rt <- roundtrip_var(f, "dta")

  expect_is(rt, "labelled")
  expect_equal(as.vector(rt), 1:2)
  expect_equal(attr(rt, "labels"), c(a = 1, b = 2, c = 3))
})

test_that("labels are preserved", {
  x <- 1:10
  attr(x, "label") <- "abc"

  expect_equal(attr(roundtrip_var(x, "dta"), "label"), "abc")
})

test_that("labelleds are round tripped", {
  int <- labelled(c(1L, 2L), c(a = 1L, b = 3L))
  num <- labelled(c(1, 2), c(a = 1, b = 3))
  chr <- labelled(c("a", "b"), c(a = "b", b = "a"))

  expect_equal(roundtrip_var(int, "dta"), int)
  # FIXME!
  # expect_equal(roundtrip_var(chr, "dta"), chr)
})

test_that("labels are converted to utf-8", {
  labels_utf8 <- c("\u00e9\u00e8", "\u00e0", "\u00ef")
  labels_latin1 <- iconv(labels_utf8, "utf-8", "latin1")

  v_utf8 <- labelled(3:1, setNames(1:3, labels_utf8))
  v_latin1 <- labelled(3:1, setNames(1:3, labels_latin1))

  expect_equal(names(attr(roundtrip_var(v_utf8, "dta"), "labels")), labels_utf8)
  expect_equal(names(attr(roundtrip_var(v_latin1, "dta"), "labels")), labels_utf8)
})

test_that("throws error on invalid variable names", {
  df <- data.frame(1)
  names(df) <- "x y"
  expect_error(write_dta(df, tempfile()), "not valid Stata variables: `x y`")
})

test_that("throws error on labelled numerics", {
  df <- data.frame(labelled(c(1, 2, 3), c("a" = 1)))
  names(df) <- "x"
  expect_error(write_dta(df, tempfile()), "Problems: `x`")
})

haven/tests/testthat/labelled-str.sav0000755000176200001440000000101513042170233017435 0ustar liggesusers
haven/tests/testthat/tagged-na.sas7bdat0000644000176200001440000040000012743423326017636 0ustar liggesusers
haven/tests/testthat/test-replace_with.R0000644000176200001440000000062512743423326020134 0ustar liggesusers
context("replace_with")

test_that("updates numeric values", {
  x <- 1:5

  expect_equal(replace_with(x, -1, 5), x)
  expect_equal(replace_with(x, 1, 5), c(5, 2:5))
  expect_equal(replace_with(x, 5, 1), c(1:4, 1))
  expect_equal(replace_with(x, 1:5, rep(1, 5)), rep(1, 5))
})

test_that("updates tagged NAs", {
  x <- c(tagged_na("a"), 1:3)
  expect_equal(replace_with(x, tagged_na("a"), 0), 0:3)
})

haven/tests/testthat/labelled-num-na.sav0000755000176200001440000000102712465157740020043 0ustar liggesusers
haven/tests/testthat/types.dta0000644000176200001440000000427612743423326016232 0ustar liggesusers
117LSF 1 Dec 2015 03:48
Eb$S! vfloatvdoublevlong01vint001vbyte01vstrvdatevdatetime%9.0g%9.0g%9.0g%9.0g%9.0g%13s%td0g%tcgwQuuBU@QtvB0T@QuuBU@QtvB0T@QuuBU@QtvB0T@QuuBU@QtvB0T@QuuBU@QtvB0T@QuuBU@QtvB0T@QuuBU@QtvB0T@QuuBU@QtvB0T@H@Q @2lloB@u~yB@@Hello, World!O
haven/tests/testthat/tagged-na-int.dta0000755000176200001440000000366112743423326017505 0ustar liggesusers
118LSF 7 Jun 2016 11:09
haven/tests/testthat/test-as-factor.R0000644000176200001440000000555413224443423017346 0ustar liggesusers
context("as_factor")

# Base types --------------------------------------------------------------

test_that("leaves factors unchanged", {
  f <- factor(letters, ordered = TRUE)

  expect_equal(as_factor(f), f)
})

test_that("converts characters to factors", {
  expect_equal(as_factor(letters), factor(letters))
})

test_that("variable label is kept when converting characters to factors (#178)", {
  s1 <- structure(letters, "label" = "letters")

  expect_identical(attr(as_factor(s1), "label"), "letters")
})

# Labelled values ---------------------------------------------------------

test_that("all labels (implicit missing values) are preserved when levels is 'default' or 'both' (#172)", {
  s1 <- labelled(rep(1, 3), c("A" = 1, "B" = 2, "C" = 3))

  exp <- factor(rep("A", 3), levels = c("A", "B", "C"))
  expect_equal(as_factor(s1), exp)

  exp <- factor(rep("[1] A", 3), levels = c("[1] A", "[2] B", "[3] C"))
  expect_equal(as_factor(s1, levels = "both"), exp)
})

test_that("all labels (existing and missing) are sorted by values (#172)", {
  s1 <- labelled(c(1, 4), c("Agree" = 1, "Neutral" = 2, "Disagree" = 3, "Don't know" = 5))
  exp <- factor(c("Agree", "4"), levels = c("Agree", "Neutral", "Disagree", "4", "Don't know"))
  expect_equal(as_factor(s1), exp)
})

test_that("all values are preserved", {
  s1 <- labelled(1:3, c("A" = 2))
  exp <- factor(c("1", "A", "3"), levels = c("1", "A", "3"))
  expect_equal(as_factor(s1), exp)
})

test_that("character labelled converts to factor", {
  s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F"))
  exp <- factor(c("Male", "Male", "Female"), levels = c("Female", "Male"))
  expect_equal(as_factor(s1), exp)
})

test_that("converts tagged NAs", {
  s1 <- labelled(c(1:2, tagged_na("a", "b")), c("Apple" = tagged_na("a")))
  exp <- factor(c("1", "2", "Apple", NA))
  expect_equal(as_factor(s1), exp)
})

# Both

test_that("both combines values and levels", {
  s1 <- labelled(2:1,
c("A" = 1)) exp <- factor(c("2", "[1] A"), levels = c("[1] A", "2")) expect_equal(as_factor(s1, "both"), exp) }) # Values test_that("character labelled uses values when requested", { s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F")) exp <- factor(c("M", "M", "F"), levels = c("M", "F")) expect_equal(as_factor(s1, "values"), exp) }) # Labels test_that("labels preserves all label values", { var <- labelled(1L, c(female = 1L, male = 2L)) expect_equal(as_factor(var, "labels"), factor("female", levels = c("female", "male"))) }) test_that("order of labels doesn't matter", { var <- labelled(1L, c(female = 2L, male = 1L)) expect_equal(as_factor(var, "labels"), factor("male", levels = c("female", "male"))) }) # Variable label test_that("variable label is kept when converting labelled to factor (#178)", { s1 <- labelled(1:3, c("A" = 2)) attr(s1, "label") <- "labelled" expect_identical(attr(as_factor(s1), "label"), "labelled") }) haven/tests/testthat/variable-label.sav0000644000176200001440000000076012464767250017752 0ustar liggesusers$FL2@(#) IBM SPSS STATISTICS MS Windows 22.0.0.0 Y@05 Feb 1516:12:01 SEX Gender ?female   SEX=sexsex:$@Role('0' )UTF-8ehaven/tests/testthat/test-zap_widths.R0000644000176200001440000000056513227701174017643 0ustar liggesuserscontext("test-zap_widths.R") test_that("can zap width attribute from vector", { x <- structure(1:5, display_width = 10) y <- zap_widths(x) expect_null(attributes(y)) }) test_that("can zap width attribute from vector in data frame", { x <- structure(1:5, display_width = 10) df <- data.frame(x = x) out <- zap_widths(df) expect_null(attributes(out$x)) }) haven/tests/testthat/test-tagged_na.R0000644000176200001440000000255112752362660017402 0ustar liggesuserscontext("tagged_na") test_that("tagged_na is NA (but not NaN)", { x <- tagged_na("a") expect_true(is.na(x)) expect_false(is.nan(x)) }) # tag_na ------------------------------------------------------------------ test_that("can extract value of tagged na", { 
expect_equal(na_tag(tagged_na(letters)), letters) }) test_that("tag of system NA is NA", { expect_equal(na_tag(NA_real_), NA_character_) }) test_that("tag of non-NA is NA", { expect_equal(na_tag(1), NA_character_) }) # is_tagged_na ------------------------------------------------------------ test_that("regular NA isn't tagged", { expect_false(is_tagged_na(NA_real_)) }) test_that("non-missing isn't tagged", { expect_false(is_tagged_na(1)) }) test_that("tagged values are tagged", { x <- tagged_na(c("a", "z")) expect_equal(is_tagged_na(x), c(TRUE, TRUE)) }) test_that("values are checked if required", { x <- tagged_na(c("a", "z")) expect_equal(is_tagged_na(x, "a"), c(TRUE, FALSE)) }) # character output ----------------------------------------------------------- test_that("format_tagged_na displays tagged NA's specially", { x <- c(1, tagged_na("a"), NA) expect_equal(format_tagged_na(x), c( " 1", "NA(a)", " NA" )) }) test_that("print_tagged_na is stable", { x <- c(1:100, tagged_na(letters), NA) expect_output_file(print_tagged_na(x), "tagged-na.txt") }) haven/tests/testthat/tagged-na.txt0000644000176200001440000000147413227703236016760 0ustar liggesusers [1] 1 2 3 4 5 6 7 8 9 10 11 12 [13] 13 14 15 16 17 18 19 20 21 22 23 24 [25] 25 26 27 28 29 30 31 32 33 34 35 36 [37] 37 38 39 40 41 42 43 44 45 46 47 48 [49] 49 50 51 52 53 54 55 56 57 58 59 60 [61] 61 62 63 64 65 66 67 68 69 70 71 72 [73] 73 74 75 76 77 78 79 80 81 82 83 84 [85] 85 86 87 88 89 90 91 92 93 94 95 96 [97] 97 98 99 100 NA(a) NA(b) NA(c) NA(d) NA(e) NA(f) NA(g) NA(h) [109] NA(i) NA(j) NA(k) NA(l) NA(m) NA(n) NA(o) NA(p) NA(q) NA(r) NA(s) NA(t) [121] NA(u) NA(v) NA(w) NA(x) NA(y) NA(z) NA haven/tests/testthat/test-labelled_spss.R0000644000176200001440000000211713224443423020273 0ustar liggesuserscontext("labelled_spss") test_that("constructor checks na_value", { expect_error(labelled_spss(1:10, na_values = "a"), "must be same type") }) test_that("constructor checks na_range", { 
expect_error(labelled_spss(1:10, na_range = "a"), "must be a numeric vector") expect_error(labelled_spss(1:10, na_range = 1:3), "of length two") expect_error( labelled_spss("a", c(a = "a"), na_range = 1:2), "only applicable for labelled numeric" ) }) test_that("printed output is stable", { x <- labelled_spss( 1:5, c("Good" = 1, "Bad" = 5), na_value = c(1, 2), na_range = c(3, Inf) ) expect_output_file(print(x), "labelled-spss-output.txt") }) # is.na ------------------------------------------------------------------- test_that("values in na_range flagged as missing", { x <- labelled_spss(1:5, c("a" = 1), na_range = c(1, 3)) expect_equal(is.na(x), c(TRUE, TRUE, TRUE, FALSE, FALSE)) }) test_that("values in na_values flagged as missing", { x <- labelled_spss(1:5, c("a" = 1), na_values = c(1, 3, 5)) expect_equal(is.na(x), c(TRUE, FALSE, TRUE, FALSE, TRUE)) }) haven/tests/testthat/labelled-output.txt0000644000176200001440000000032213227703236020222 0ustar liggesusers [1] 1 2 3 4 5 NA NA(x) NA(y) NA(z) Labels: value label 1 Good 5 Bad NA(x) Not Applicable NA(y) Refused to answer haven/tests/testthat/test-read-connection.R0000644000176200001440000000043112743423326020531 0ustar liggesuserscontext("connections") test_that("connections are read", { file_conn <- file("hadley.sas7bdat") expect_identical(read_sas(file_conn), read_sas("hadley.sas7bdat")) }) test_that("zip files are read", { expect_identical(read_sas("hadley.zip"), read_sas("hadley.sas7bdat")) }) haven/tests/testthat/hadley.zip0000644000176200001440000000226312743423326016360 0ustar liggesusersPK)UF{ hadley.sas7bdatUT TPWux kUIj"M  jD&HIM4hJPAАlM&'+zՃJA^HBv, ZNիZ ó!t3c#coDmtdz$ɚ>_GXҕwb}!{W^z譓gMKg/n]Ο~8H.g7-cqCVygZV֭HxUOxϧygJ_ a(]\O\͝<]>{&m=f}]whk:ֹRdQC_Oc]}l~gߥަ=qOSߑ]q}vx0 7CQ[ Ł,poU*{6d_нK{VC} ^zCǾ{mW*p$?#P?ˡUlWZWBh z7si ՘%t{syˏLQ_h L|)4܎׺GMFGG.̅pa>乗'ϾS\+8ع6|1ZX^(on 3хb3ډ;jfW6fg6sZfy- 3bRX abyTz2:_\6+z1)ݜ;{lW;|YεsmIC\ӹt.ydz.s7w{nIJGCӝ -MjV{^jI9uy;jIiJ4vܵ{7g'Y͏qxL_{wYP~/PK)UF{ hadley.sas7bdatUTTux 
PKUHhaven/tests/testthat/datetime-d.dta0000644000176200001440000000047712743423326017102 0ustar liggesuserss`%`$r$~r$BB 2 Nov 2015 16:07date%d0g''wwwwhwTFhaven/tests/testthat/test-zap_missing.R0000644000176200001440000000132012743423326020002 0ustar liggesuserscontext("zap_missing") test_that("strips na tags", { x1 <- labelled(tagged_na("a", "b"), c(a = tagged_na("a"), b = 1)) x2 <- zap_missing(x1) expect_equal(na_tag(x2), c(NA_character_, NA)) expect_equal(attr(x2, "labels"), c(b = 1)) }) test_that("converts user-defined missings", { x1 <- labelled_spss(c(1, 2, 99), c(missing = 99), na_values = 99) x2 <- zap_missing(x1) expect_equal(x2[[3]], NA_real_) expect_s3_class(x2, "labelled") }) test_that("converts data frame", { x1 <- labelled(tagged_na("a", "b"), c(a = tagged_na("a"), b = 1)) df1 <- tibble::data_frame(x1 = 1, x2 = 2:1) df2 <- zap_missing(df1) expect_equal(na_tag(df1$x1), c(NA_character_, NA)) expect_equal(df1$x2, df2$x2) }) haven/tests/testthat/labelled-num.sav0000755000176200001440000000077312465157740017456 0ustar liggesusers$FL2@(#) IBM SPSS STATISTICS MS Windows 22.0.0.0 Y@06 Feb 1514:33:36 VAR00002? This is one   VAR00002=VAR00002VAR00002:$@Role('0' )UTF-8ehaven/tests/testthat/datetime.sas7bdat0000644000176200001440000001200012465223672017605 0ustar liggesusers`Ͻ 1""2"2""2"22"">SAS FILEDATETIME DATA M3AM3A9.0301M0XP_PROMJ"J"J"ChhM3ABhh\   0 dD < P 4 4X 4$ 4 4 4 . 
A@@@@@@@dAC@C@C@@A@@@@D@heck the spelling of all wordsCheck spelling and suggest correct wordsSuggest correct word for misspelled wordRemember misspelled word as a correct wordAdd misspelled word to a dictionaryInclude a dictionary to be used for spell checkingClose an included dictionaryCreate a dictionaryAGSRDIFC "Enter dictionary name: DICT CREATE @1" Enter dictionary name: DICT FREE @1"Enter dictionary name: DICT INCLUDE @1" Enter dictionary name: SPELL ADD @1"Enter dictionary name: SPELL REMEMBER @1 Spell suggestSpell all suggest Spell all% {qf gSh h h h h h h f g f gf gf g f gf gf gf gSProgram Editor Log Output Graph Results Explorer Contents Only My Favorite Folders   (1?]^_n`p p"View the Program Editor windowView the Log windowView the Output windowView the Graph windowView the Results Navigator windowView the Explorer windowView the Explorer showing only its contentsExplore Favorite FoldersPLOaexcy  exproot filesdmsexpEXPLORER odsresultsGRAPH1LISTINGLOGPGM "f g~h$h$h$h$h$h$h$ f g f gf gf gf gf gf gf g f g~ ?Query Table Editor Graphics Editor ODS Graphics Designer Report Editor Image Editor Text Editor New Library New File Shortcut  #9GT`lwx}z{|p  p pOpen the query toolOpen the table editorOpen the graphics editorOpen the ODS graphics designerTH @ 4((h D 0$0<DPX DATASTEPVAR1DATETIMEVAR2MMDDYYVAR3DATEVAR4WEEKDATEVAR5TIMEX 0"( ?Bhh  Lhaven/tests/testthat/test-write-sas.R0000644000176200001440000000216313227702451017400 0ustar liggesuserscontext("write_sas") test_that("can roundtrip basic types", { x <- runif(10) expect_equal(roundtrip_var(x, "sas"), x) expect_equal(roundtrip_var(1:10, "sas"), 1:10) expect_equal(roundtrip_var(c(TRUE, FALSE), "sas"), c(1, 0)) expect_equal(roundtrip_var(letters, "sas"), letters) }) test_that("can roundtrip missing values (as much as possible)", { expect_equal(roundtrip_var(NA, "sas"), NA_integer_) expect_equal(roundtrip_var(NA_real_, "sas"), NA_real_) expect_equal(roundtrip_var(NA_integer_, "sas"), NA_integer_) 
  expect_equal(roundtrip_var(NA_character_, "sas"), "")
})

test_that("can roundtrip date times", {
  x1 <- c(as.Date("2010-01-01"), NA)
  x2 <- as.POSIXct(x1)
  attr(x2, "tzone") <- "UTC"

  expect_equal(roundtrip_var(x1, "sas"), x1)
  expect_equal(roundtrip_var(x2, "sas"), x2)
})

test_that("can roundtrip format attribute", {
  df <- data.frame(x = structure(1:5, format.sas = "xyz"))
  path <- tempfile()

  write_sas(df, path)
  out <- read_sas(path)

  expect_equal(df$x, out$x)
})

test_that("infinity gets converted to NA", {
  expect_equal(roundtrip_var(c(Inf, 0, -Inf), "sas"), c(NA, 0, NA))
})
haven/tests/testthat/test-read-stata.R0000644000176200001440000000260712752362700017513 0ustar liggesusers
context("read_stata")

test_that("stata data types read into expected types (#45)", {
  df <- read_stata("types.dta")
  types <- vapply(df, typeof, character(1))

  expect_equal(types, c(
    vfloat = "double",
    vdouble = "double",
    vlong = "double",
    vint = "double",
    vbyte = "double",
    vstr = "character",
    vdate = "double",
    vdatetime = "double"
  ))
})

test_that("Stata %td (date) and %tc (datetime) read into expected classes", {
  df <- read_stata("types.dta")

  expect_is(df$vdate, "Date")
  expect_is(df$vdatetime, "POSIXct")
})

test_that("Old %d format read into Date class", {
  df <- zap_formats(read_stata(test_path("datetime-d.dta")))
  expect_equal(df$date, as.Date("2015-11-02"))
})

test_that("tagged double missings are read correctly", {
  x <- read_dta(test_path("tagged-na-double.dta"))$x
  expect_equal(na_tag(x), c(rep(NA, 5), "a", "h", "z"))

  labels <- attr(x, "labels")
  expect_equal(na_tag(labels), c("a", "z"))
})

test_that("tagged integer missings are read correctly", {
  x <- read_dta(test_path("tagged-na-int.dta"))$x
  expect_equal(na_tag(x), c(rep(NA, 5), "a", "h", "z"))

  labels <- attr(x, "labels")
  expect_equal(na_tag(labels), c("a", "z"))
})

test_that("file label and notes stored as attributes", {
  df <- read_dta(test_path("notes.dta"))

  expect_equal(attr(df, "label"), "This is a test dataset.")
  expect_length(attr(df,
"notes"), 2) }) haven/tests/testthat/datetime.sav0000644000176200001440000000123212474132452016666 0ustar liggesusers$FL2@(#) IBM SPSS STATISTICS DATA FILE MS Windows 19.0.0 Y@27 Feb 1512:05:48   DATE DATE.POS  TIME    'DATE=date DATE.POS=date.posix TIME=time;date:$@Role('0' )/date.posix:$@Role('0' )/time:$@Role('0' ) windows-1252c BVBGk@0c B6c BQ@haven/tests/testthat/test-zap_labels.R0000644000176200001440000000131313224443423017570 0ustar liggesuserscontext("zap_labels") test_that("zap_labels strips labelled attributes", { var <- labelled(c(1L, 98L, 99L), c(not_answered = 98L, not_applicable = 99L)) exp <- c(1L, 98L, 99L) expect_equal(zap_labels(var), exp) }) test_that("zap_labels returns variables not of class('labelled') unmodified", { var <- c(1L, 98L, 99L) expect_equal(zap_labels(var), var) }) test_that("zap_labels is applied to every column in data frame", { df <- tibble::data_frame(x = 1:10, y = labelled(10:1, c("good" = 1))) expect_equal(zap_labels(df)$y, 10:1) }) test_that("replaces user-defined missings for spss", { x <- labelled_spss(1:5, c(a = 1), na_values = c(2, 4)) expect_equal(zap_labels(x), c(1, NA, 3, NA, 5)) }) haven/tests/testthat/test-labelled.R0000644000176200001440000000172713224443423017231 0ustar liggesuserscontext("Labelled") test_that("x must be numeric or character", { expect_error(labelled(TRUE), "must be a numeric or a character vector") }) test_that("x and labels must be compatible", { expect_error(labelled(1, "a"), "must be same type") expect_error(labelled(1, c(female = 2L, male = 1L)), NA) expect_error(labelled(1L, c(female = 2, male = 1)), NA) }) test_that("labels must have names", { expect_error(labelled(1, 1), "must have names") }) # methods ----------------------------------------------------------------- test_that("printed output is stable", { x <- labelled( c(1:5, NA, tagged_na("x", "y", "z")), c( Good = 1, Bad = 5, "Not Applicable" = tagged_na("x"), "Refused to answer" = tagged_na("y") ) ) expect_output_file(print(x), 
"labelled-output.txt")
})

test_that("given correct name in data frame", {
  x <- labelled(1:3, c(a = 1))

  expect_named(data.frame(x), "x")
  expect_named(data.frame(y = x), "y")
})
haven/tests/testthat/tagged-na-double.dta0000755000176200001440000000372112743423326020162 0ustar liggesusers
118LSF 7 Jun 2016 18:03
haven/tests/testthat/test-zap-empty.R0000644000176200001440000000021712743423326017411 0ustar liggesuserscontext("zap_empty") test_that("empty strings replaced with missing", { x <- c("", "a", NA) expect_equal(zap_empty(x), c(NA, "a", NA)) }) haven/tests/testthat/formats.sas7bcat0000644000176200001440000004200012466441772017472 0ustar liggesuserscϽ 1""332""3323#3SAS FILEFORMATS CATALOG x2AOMd3A9.0401M1X64_8PROx^^^xx2AxHSYSRESR PGBITMAP`FORMATC GENDER8FORMAT WORKSHOPxKXLCH(XLSRshy2A/Ld3A  XLSRLd3ALd3AxO0001000100060000 xF  XLSRVMd3AMd3AxO0010001000030000 xF  XLSRM$AfN$A  < IOM Cache Service XLSR 8N$AfN$A   < Base A. M. for Catalog's XLSR M$AfN$A   < Clipboard Access Method XLSR 8N$AfN$A   < Communication Ports XLSR M$AfN$A   < Base A.M. for URL XLSR 8N$AfN$A   < Dynamic Data Exchange XLSR8N$AfN$A  < Base A. M. for Disk files XLSR 8N$AfN$A   < Drive Map access method XLSR 8N$AfN$A   < Base A. M. for Dummy files XLSR 8N$AfN$A   < Base A. M. for EMAIL XLSR 8N$AfN$A   <  FTP A. M. xjx`Kx$GENDER Ld3A ( @ f  m  Female MalexWORKSHOPVMd3A(-q=`0ѿ0hRSAS  haven/tests/testthat/test-read-xpt.R0000644000176200001440000000045113227425301017200 0ustar liggesuserscontext("test-read-xpt.R") test_that("can read date/times", { x <- as.Date("2018-01-01") df <- data.frame(date = x, datetime = as.POSIXct(x)) path <- tempfile() write_xpt(df, path) res <- read_xpt(path) expect_s3_class(res$date, "Date") expect_s3_class(res$datetime, "POSIXct") }) haven/tests/testthat/test-read-sav.R0000644000176200001440000000622413042751476017174 0ustar liggesuserscontext("read_sav") test_that("variable label stored as attributes", { df <- read_spss("variable-label.sav") expect_equal(attr(df$sex, "label"), "Gender") }) test_that("value labels stored as labelled class", { num <- zap_formats(read_spss(test_path("labelled-num.sav"))) str <- zap_formats(read_spss(test_path("labelled-str.sav"))) expect_equal(num[[1]], labelled(1, c("This is one" = 1))) expect_equal(str[[1]], labelled(c("M", "F"), c(Female = 
"F", Male = "M")))
})

test_that("value labels read in as same type as vector", {
  df <- read_spss("variable-label.sav")
  num <- read_spss("labelled-num.sav")
  str <- read_spss("labelled-str.sav")

  expect_equal(typeof(df$sex), typeof(attr(df$sex, "labels")))
  expect_equal(typeof(num[[1]]), typeof(attr(num[[1]], "labels")))
  expect_equal(typeof(str[[1]]), typeof(attr(str[[1]], "labels")))
})

test_that("non-ASCII labels converted to utf-8", {
  x <- read_spss("umlauts.sav")[[1]]

  expect_equal(attr(x, "label"), "This is an \u00e4-umlaut")
  expect_equal(names(attr(x, "labels"))[1], "the \u00e4 umlaut")
})

test_that("datetime variables converted to the correct class", {
  df <- read_spss("datetime.sav")
  expect_true(inherits(df$date, "Date"))
  expect_true(inherits(df$date.posix, "POSIXct"))
  expect_true(inherits(df$time, "hms"))
})

test_that("datetime values correctly imported (offset)", {
  df <- read_spss("datetime.sav")
  expect_equal(df$date[1], as.Date("2014-09-22"))
  expect_equal(df$date.posix[2], as.POSIXct("2014-09-23 15:59:20", tz = "UTC"))
  expect_equal(as.integer(df$time[1]), 43870)
})

test_that("formats roundtrip", {
  df <- tibble::data_frame(
    a = structure(c(1, 1, 2), format.spss = "F1.0"),
    b = structure(4:6, format.spss = "F2.1"),
    c = structure(7:9, format.spss = "N2"),
    d = structure(c("Text", "Text", ""), format.spss = "A100")
  )

  tmp <- tempfile()
  on.exit(unlink(tmp))

  write_sav(df, tmp)
  df2 <- read_sav(tmp)

  expect_equal(df$a, df$a)
  expect_equal(df$b, df$b)
  expect_equal(df$c, df$c)
  expect_equal(df$d, df$d)
})

test_that("widths roundtrip", {
  df <- tibble::data_frame(
    a = structure(c(1, 1, 2), display_width = 10),
    b = structure(4:6, display_width = 11),
    c = structure(7:9, display_width = 12),
    d = structure(c("Text", "Text", ""), display_width = 10)
  )

  tmp <- tempfile()
  on.exit(unlink(tmp))

  write_sav(df, tmp)
  df2 <- read_sav(tmp)

  expect_equal(df$a, df$a)
  expect_equal(df$b, df$b)
  expect_equal(df$c, df$c)
  expect_equal(df$d, df$d)
})

# User-defined missings
---------------------------------------------------

test_that("user-defined missing values read as missing by default", {
  num <- read_spss(test_path("labelled-num-na.sav"))[[1]]
  expect_equal(num[[2]], NA_real_)
})

test_that("user-defined missing values can be preserved", {
  num <- read_spss(test_path("labelled-num-na.sav"), user_na = TRUE)[[1]]

  expect_s3_class(num, "labelled_spss")
  expect_equal(num[[2]], 9)

  expect_equal(attr(num, "na_values"), 9)
  expect_equal(attr(num, "na_range"), NULL)

  num
})

test_that("system missings read as NA", {
  df <- tibble::tibble(x = c(1, NA))
  out <- roundtrip_sav(df)

  expect_identical(df$x, c(1, NA))
})
haven/tests/testthat/test-read-sas.R0000644000176200001440000000262513224443423017162 0ustar liggesusers
context("read_sas")

test_that("variable label stored as attributes", {
  df <- read_sas("hadley.sas7bdat")

  expect_equal(attr(df$gender, "label"), NULL)
  expect_equal(attr(df$q1, "label"), "The instructor was well prepared")
})

test_that("value labels parsed from bcat file", {
  df <- read_sas("hadley.sas7bdat", "formats.sas7bcat")

  expect_is(df$gender, "labelled")
  expect_equal(attr(df$gender, "labels"), c(Female = "f", Male = "m"))
  expect_equal(attr(df$workshop, "labels"), c(R = 1, SAS = 2))
})

test_that("value labels read in as same type as vector", {
  df <- read_sas("hadley.sas7bdat", "formats.sas7bcat")

  expect_equal(typeof(df$gender), typeof(attr(df$gender, "labels")))
  expect_equal(typeof(df$workshop), typeof(attr(df$workshop, "labels")))
})

test_that("date times are converted into corresponding R types", {
  df <- read_sas(test_path("datetime.sas7bdat"))
  expect_equal(df$VAR1[1], ISOdatetime(2015, 02, 02, 14, 42, 12, "UTC"))
  expect_equal(df$VAR2[1], as.Date("2015-02-02"))
  expect_equal(df$VAR3[1], as.Date("2015-02-02"))
  expect_equal(df$VAR4[1], as.Date("2015-02-02"))
  expect_equal(df$VAR5[1], hms::hms(52932))
})

test_that("tagged missings are read correctly", {
  x <- read_sas(test_path("tagged-na.sas7bdat"), test_path("tagged-na.sas7bcat"))$x
  expect_equal(na_tag(x), c(rep(NA, 5), "a", "h", "z"))

  labels <- attr(x, "labels")
  expect_equal(na_tag(labels), c("a", "z"))
})
haven/tests/testthat/test-write-sav.R0000644000176200001440000000617513227444042017411 0ustar liggesusers
context("write_sav")

test_that("can roundtrip basic types", {
  x <- runif(10)
  expect_equal(roundtrip_var(x, "sav"), x)
  expect_equal(roundtrip_var(1:10, "sav"), 1:10)
  expect_equal(roundtrip_var(c(TRUE, FALSE), "sav"), c(1, 0))
  expect_equal(roundtrip_var(letters, "sav"), letters)
})

test_that("can roundtrip missing values (as much as possible)", {
  expect_equal(roundtrip_var(NA, "sav"), NA_integer_)
  expect_equal(roundtrip_var(NA_real_, "sav"), NA_real_)
  expect_equal(roundtrip_var(NA_integer_, "sav"), NA_integer_)
  expect_equal(roundtrip_var(NA_character_, "sav"), "")
})

test_that("can roundtrip date times", {
  x1 <- c(as.Date("2010-01-01"), NA)
  x2 <- as.POSIXct(x1)
  attr(x2, "tzone") <- "UTC"

  expect_equal(roundtrip_var(x1, "sav"), x1)
  expect_equal(roundtrip_var(x2, "sav"), x2)
})

test_that("can roundtrip times", {
  x <- hms::hms(c(1, NA, 86400))
  expect_equal(roundtrip_var(x, "sav"), x)
})

test_that("infinity gets converted to NA", {
  expect_equal(roundtrip_var(c(Inf, 0, -Inf), "sav"), c(NA, 0, NA))
})

test_that("factors become labelleds", {
  f <- factor(c("a", "b"), levels = letters[1:3])
  rt <- roundtrip_var(f, "sav")

  expect_is(rt, "labelled")
  expect_equal(as.vector(rt), 1:2)
  expect_equal(attr(rt, "labels"), c(a = 1, b = 2, c = 3))
})

test_that("labels are preserved", {
  x <- 1:10
  attr(x, "label") <- "abc"

  expect_equal(attr(roundtrip_var(x, "sav"), "label"), "abc")
})

test_that("labelleds are round tripped", {
  int <- labelled(c(1L, 2L), c(a = 1L, b = 3L))
  num <- labelled(c(1, 2), c(a = 1, b = 3))
  chr <- labelled(c("a", "b"), c(a = "b", b = "a"))

  expect_equal(roundtrip_var(int, "sav"), int)
  expect_equal(roundtrip_var(num, "sav"), num)
  expect_equal(roundtrip_var(chr, "sav"), chr)
})

test_that("spss labelleds are round tripped", {
  df <- tibble(
    x =
      labelled_spss(
        c(1, 2, 1, 9),
        labels = c(no = 1, yes = 2, unknown = 9),
        na_values = 9,
        na_range = c(80, 90)
      )
  )

  path <- tempfile()
  write_sav(df, path)

  df2 <- read_sav(path)
  expect_s3_class(df2$x, "labelled")

  df3 <- read_sav(path, user_na = TRUE)
  expect_s3_class(df3$x, "labelled_spss")
  expect_equal(attr(df3$x, "na_values"), attr(df$x, "na_values"))
  expect_equal(attr(df3$x, "na_range"), attr(df$x, "na_range"))
})

test_that("factors become labelleds", {
  f <- factor(c("a", "b"), levels = letters[1:3])
  rt <- roundtrip_var(f, "sav")

  expect_is(rt, "labelled")
  expect_equal(as.vector(rt), 1:2)
  expect_equal(attr(rt, "labels"), c(a = 1, b = 2, c = 3))
})

test_that("labels are converted to utf-8", {
  labels_utf8 <- c("\u00e9\u00e8", "\u00e0", "\u00ef")
  labels_latin1 <- iconv(labels_utf8, "utf-8", "latin1")

  v_utf8 <- labelled(3:1, setNames(1:3, labels_utf8))
  v_latin1 <- labelled(3:1, setNames(1:3, labels_latin1))

  expect_equal(names(attr(roundtrip_var(v_utf8, "sav"), "labels")), labels_utf8)
  expect_equal(names(attr(roundtrip_var(v_latin1, "sav"), "labels")), labels_utf8)
})

test_that("complain about long factor labels", {
  x <- paste(rep("a", 200), collapse = "")
  df <- data.frame(x = factor(x))

  expect_error(roundtrip_sav(df), "levels with <= 120 characters")
})
haven/tests/testthat/helper-roundtrip.R0000644000176200001440000000161113227442102017777 0ustar liggesusers
roundtrip_sav <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))

  write_sav(x, tmp)
  zap_formats(read_sav(tmp))
}

roundtrip_dta <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))

  write_dta(x, tmp)
  zap_formats(read_dta(tmp))
}

roundtrip_sas <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))

  write_sas(x, tmp)
  zap_formats(read_sas(tmp))
}

roundtrip_xpt <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))

  write_xpt(x, tmp)
  zap_formats(read_xpt(tmp))
}

roundtrip_var <- function(x, type = "sav") {
  df <- tibble::tibble(x = x)

  # Forces xpt files to be correct length even when ending with
  # empty character
strings if (type == "xpt") { df$y <- seq_along(x) } switch(type, sav = roundtrip_sav(df)$x, dta = roundtrip_dta(df)$x, sas = roundtrip_sas(df)$x, xpt = roundtrip_xpt(df)$x, stop("Unsupported type") ) } haven/tests/testthat/labelled-spss-output.txt0000644000176200001440000000020413227703236021207 0ustar liggesusers [1] 1 2 3 4 5 Missing values: 1, 2 Missing range: [3, Inf] Labels: value label 1 Good 5 Bad haven/tests/testthat/notes.dta0000644000176200001440000000224612746727743016226 0ustar liggesuserssThis is a test dataset.VX2П!xZП,";29 Jul 2016 14:34idfaceHeight #include #define CK_HASH_KEY_SIZE 128 typedef struct ck_hash_entry_s { char key[CK_HASH_KEY_SIZE]; const void *value; } ck_hash_entry_t; typedef struct ck_hash_table_s { uint64_t capacity; uint64_t count; ck_hash_entry_t *entries; } ck_hash_table_t; int ck_str_hash_insert(const char *key, const void *value, ck_hash_table_t *table); const void *ck_str_hash_lookup(const char *key, ck_hash_table_t *table); int ck_float_hash_insert(float key, const void *value, ck_hash_table_t *table); const void *ck_float_hash_lookup(float key, ck_hash_table_t *table); int ck_double_hash_insert(double key, const void *value, ck_hash_table_t *table); const void *ck_double_hash_lookup(double key, ck_hash_table_t *table); ck_hash_table_t *ck_hash_table_init(size_t size); void ck_hash_table_wipe(ck_hash_table_t *table); int ck_hash_table_grow(ck_hash_table_t *table); void ck_hash_table_free(ck_hash_table_t *table); uint64_t ck_hash_str(const char *str); haven/src/readstat/readstat_convert.c0000644000176200001440000000217113227731765017502 0ustar liggesusers #include #include "readstat.h" #include "readstat_iconv.h" #include "readstat_convert.h" readstat_error_t readstat_convert(char *dst, size_t dst_len, const char *src, size_t src_len, iconv_t converter) { /* strip off spaces from the input because the programs use ASCII space * padding even with non-ASCII encoding. 
     */
    while (src_len && src[src_len-1] == ' ') {
        src_len--;
    }
    if (converter) {
        size_t dst_left = dst_len;
        char *dst_end = dst;
        size_t status = iconv(converter, (readstat_iconv_inbuf_t)&src, &src_len, &dst_end, &dst_left);
        if (status == (size_t)-1) {
            if (errno == E2BIG) {
                return READSTAT_ERROR_CONVERT_LONG_STRING;
            } else if (errno == EILSEQ) {
                return READSTAT_ERROR_CONVERT_BAD_STRING;
            } else if (errno != EINVAL) { /* EINVAL indicates improper truncation; accept it */
                return READSTAT_ERROR_CONVERT;
            }
        }
        dst[dst_len - dst_left] = '\0';
    } else {
        memcpy(dst, src, src_len);
        dst[src_len] = '\0';
    }
    return READSTAT_OK;
}
haven/src/readstat/spss/0000755000176200001440000000000013227731765014756 5ustar liggesusers
haven/src/readstat/spss/readstat_sav_read.c0000644000176200001440000014237413227731765020610 0ustar liggesusers

#include
#include
#include
#include
#include
#include
#include
#include

#include "../readstat.h"
#include "../readstat_bits.h"
#include "../readstat_iconv.h"
#include "../readstat_convert.h"
#include "../readstat_malloc.h"

#include "readstat_sav.h"
#include "readstat_sav_parse.h"
#include "readstat_sav_parse_timestamp.h"

#define DATA_BUFFER_SIZE            65536

/* Others defined in table below */

/* See http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx */
static readstat_charset_entry_t _charset_table[] = {
    { .code = 1,     .name = "EBCDIC-US" },
    { .code = 2,     .name = "WINDOWS-1252" }, /* supposed to be ASCII, but some files are miscoded */
    { .code = 3,     .name = "WINDOWS-1252" },
    { .code = 4,     .name = "DEC-KANJI" },
    { .code = 437,   .name = "CP437" },
    { .code = 708,   .name = "ASMO-708" },
    { .code = 737,   .name = "CP737" },
    { .code = 775,   .name = "CP775" },
    { .code = 850,   .name = "CP850" },
    { .code = 852,   .name = "CP852" },
    { .code = 855,   .name = "CP855" },
    { .code = 857,   .name = "CP857" },
    { .code = 858,   .name = "CP858" },
    { .code = 860,   .name = "CP860" },
    { .code = 861,   .name = "CP861" },
    { .code = 862,   .name = "CP862" },
    { .code = 863,   .name = "CP863" },
    { .code = 864,
      .name = "CP864" },
    { .code = 865,   .name = "CP865" },
    { .code = 866,   .name = "CP866" },
    { .code = 869,   .name = "CP869" },
    { .code = 874,   .name = "CP874" },
    { .code = 932,   .name = "SHIFT-JIS" },
    { .code = 936,   .name = "ISO-IR-58" },
    { .code = 949,   .name = "ISO-IR-149" },
    { .code = 950,   .name = "BIG-5" },
    { .code = 1200,  .name = "UTF-16LE" },
    { .code = 1201,  .name = "UTF-16BE" },
    { .code = 1250,  .name = "WINDOWS-1250" },
    { .code = 1251,  .name = "WINDOWS-1251" },
    { .code = 1252,  .name = "WINDOWS-1252" },
    { .code = 1253,  .name = "WINDOWS-1253" },
    { .code = 1254,  .name = "WINDOWS-1254" },
    { .code = 1255,  .name = "WINDOWS-1255" },
    { .code = 1256,  .name = "WINDOWS-1256" },
    { .code = 1257,  .name = "WINDOWS-1257" },
    { .code = 1258,  .name = "WINDOWS-1258" },
    { .code = 1361,  .name = "CP1361" },
    { .code = 10000, .name = "MACROMAN" },
    { .code = 10004, .name = "MACARABIC" },
    { .code = 10005, .name = "MACHEBREW" },
    { .code = 10006, .name = "MACGREEK" },
    { .code = 10007, .name = "MACCYRILLIC" },
    { .code = 10010, .name = "MACROMANIA" },
    { .code = 10017, .name = "MACUKRAINE" },
    { .code = 10021, .name = "MACTHAI" },
    { .code = 10029, .name = "MACCENTRALEUROPE" },
    { .code = 10079, .name = "MACICELAND" },
    { .code = 10081, .name = "MACTURKISH" },
    { .code = 10082, .name = "MACCROATIAN" },
    { .code = 12000, .name = "UTF-32LE" },
    { .code = 12001, .name = "UTF-32BE" },
    { .code = 20127, .name = "US-ASCII" },
    { .code = 20866, .name = "KOI8-R" },
    { .code = 20932, .name = "EUC-JP" },
    { .code = 21866, .name = "KOI8-U" },
    { .code = 28591, .name = "ISO-8859-1" },
    { .code = 28592, .name = "ISO-8859-2" },
    { .code = 28593, .name = "ISO-8859-3" },
    { .code = 28594, .name = "ISO-8859-4" },
    { .code = 28595, .name = "ISO-8859-5" },
    { .code = 28596, .name = "ISO-8859-6" },
    { .code = 28597, .name = "ISO-8859-7" },
    { .code = 28598, .name = "ISO-8859-8" },
    { .code = 28599, .name = "ISO-8859-9" },
    { .code = 28603, .name = "ISO-8859-13" },
    { .code = 28605, .name = "ISO-8859-15" },
    { .code = 50220, .name =
"ISO-2022-JP" }, { .code = 50221, .name = "ISO-2022-JP" }, // same as above? { .code = 50222, .name = "ISO-2022-JP" }, // same as above? { .code = 50225, .name = "ISO-2022-KR" }, { .code = 50229, .name = "ISO-2022-CN" }, { .code = 51932, .name = "EUC-JP" }, { .code = 51936, .name = "GBK" }, { .code = 51949, .name = "EUC-KR" }, { .code = 52936, .name = "HZ-GB-2312" }, { .code = 54936, .name = "GB18030" }, { .code = 65000, .name = "UTF-7" }, { .code = 65001, .name = "UTF-8" } }; #define SAV_LABEL_NAME_PREFIX "labels" typedef struct value_label_s { char value[8]; unsigned char label_len; char label[256*4+1]; } value_label_t; static readstat_error_t sav_update_progress(sav_ctx_t *ctx); static readstat_error_t sav_read_data(sav_ctx_t *ctx); static readstat_error_t sav_read_compressed_data(sav_ctx_t *ctx); static readstat_error_t sav_read_uncompressed_data(sav_ctx_t *ctx); static readstat_error_t sav_skip_variable_record(sav_ctx_t *ctx); static readstat_error_t sav_read_variable_record(sav_ctx_t *ctx); static readstat_error_t sav_skip_document_record(sav_ctx_t *ctx); static readstat_error_t sav_read_document_record(sav_ctx_t *ctx); static readstat_error_t sav_skip_value_label_record(sav_ctx_t *ctx); static readstat_error_t sav_read_value_label_record(sav_ctx_t *ctx); static readstat_error_t sav_read_dictionary_termination_record(sav_ctx_t *ctx); static readstat_error_t sav_parse_machine_floating_point_record(const void *data, size_t size, size_t count, sav_ctx_t *ctx); static readstat_error_t sav_store_variable_display_parameter_record(const void *data, size_t size, size_t count, sav_ctx_t *ctx); static readstat_error_t sav_parse_variable_display_parameter_record(sav_ctx_t *ctx); static readstat_error_t sav_parse_machine_integer_info_record(const void *data, size_t data_len, sav_ctx_t *ctx); static readstat_error_t sav_parse_long_value_labels_record(const void *data, size_t data_len, sav_ctx_t *ctx); static void sav_tag_missing_double(readstat_value_t *value, sav_ctx_t 
*ctx) { double fp_value = value->v.double_value; uint64_t long_value = 0; memcpy(&long_value, &fp_value, 8); if (long_value == ctx->missing_double) value->is_system_missing = 1; if (long_value == ctx->lowest_double) value->is_system_missing = 1; if (long_value == ctx->highest_double) value->is_system_missing = 1; if (isnan(fp_value)) value->is_system_missing = 1; } static readstat_error_t sav_update_progress(sav_ctx_t *ctx) { readstat_io_t *io = ctx->io; return io->update(ctx->file_size, ctx->progress_handler, ctx->user_ctx, io->io_ctx); } static readstat_error_t sav_skip_variable_record(sav_ctx_t *ctx) { sav_variable_record_t variable; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if (io->read(&variable, sizeof(sav_variable_record_t), io->io_ctx) < sizeof(sav_variable_record_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (variable.has_var_label) { uint32_t label_len; if (io->read(&label_len, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } label_len = ctx->bswap ? byteswap4(label_len) : label_len; uint32_t label_capacity = (label_len + 3) / 4 * 4; if (io->seek(label_capacity, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } } if (variable.n_missing_values) { int n_missing_values = ctx->bswap ? byteswap4(variable.n_missing_values) : variable.n_missing_values; if (io->seek(abs(n_missing_values) * sizeof(double), READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } } cleanup: return retval; } static readstat_error_t sav_read_variable_label(spss_varinfo_t *info, sav_ctx_t *ctx) { readstat_io_t *io = ctx->io; readstat_error_t retval = READSTAT_OK; uint32_t label_len, label_capacity; size_t out_label_len; char *label_buf = NULL; if (io->read(&label_len, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } label_len = ctx->bswap ? 
byteswap4(label_len) : label_len; label_capacity = (label_len + 3) / 4 * 4; if ((label_buf = readstat_malloc(label_capacity)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } out_label_len = (size_t)label_len*4+1; if ((info->label = readstat_malloc(out_label_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (io->read(label_buf, label_capacity, io->io_ctx) < label_capacity) { retval = READSTAT_ERROR_READ; free(info->label); info->label = NULL; goto cleanup; } retval = readstat_convert(info->label, out_label_len, label_buf, label_len, ctx->converter); if (retval != READSTAT_OK) goto cleanup; cleanup: if (label_buf) free(label_buf); return retval; } static readstat_error_t sav_read_variable_missing_values(spss_varinfo_t *info, sav_ctx_t *ctx) { readstat_io_t *io = ctx->io; readstat_error_t retval = READSTAT_OK; int i; if (info->n_missing_values < 0) { info->missing_range = 1; info->n_missing_values = abs(info->n_missing_values); } else { info->missing_range = 0; } if (info->n_missing_values > 3) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (io->read(info->missing_values, info->n_missing_values * sizeof(double), io->io_ctx) < info->n_missing_values * sizeof(double)) { retval = READSTAT_ERROR_READ; goto cleanup; } for (i=0; i<info->n_missing_values; i++) { if (ctx->bswap) { info->missing_values[i] = byteswap_double(info->missing_values[i]); } uint64_t long_value = 0; memcpy(&long_value, &info->missing_values[i], 8); if (long_value == ctx->missing_double) info->missing_values[i] = NAN; if (long_value == ctx->lowest_double) info->missing_values[i] = -HUGE_VAL; if (long_value == ctx->highest_double) info->missing_values[i] = HUGE_VAL; } cleanup: return retval; } static readstat_error_t sav_read_variable_record(sav_ctx_t *ctx) { readstat_io_t *io = ctx->io; sav_variable_record_t variable; readstat_error_t retval = READSTAT_OK; if (ctx->var_index == ctx->varinfo_capacity) { if ((ctx->varinfo = readstat_realloc(ctx->varinfo, (ctx->varinfo_capacity 
*= 2) * sizeof(spss_varinfo_t))) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } } if (io->read(&variable, sizeof(sav_variable_record_t), io->io_ctx) < sizeof(sav_variable_record_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } variable.print = ctx->bswap ? byteswap4(variable.print) : variable.print; variable.write = ctx->bswap ? byteswap4(variable.write) : variable.write; int32_t type = ctx->bswap ? byteswap4(variable.type) : variable.type; if (type < 0) { if (ctx->var_index == 0) { return READSTAT_ERROR_PARSE; } ctx->var_offset++; spss_varinfo_t *prev = &ctx->varinfo[ctx->var_index-1]; prev->width++; return 0; } spss_varinfo_t *info = &ctx->varinfo[ctx->var_index]; memset(info, 0, sizeof(spss_varinfo_t)); info->width = 1; info->n_segments = 1; info->index = ctx->var_index; info->offset = ctx->var_offset; info->labels_index = -1; retval = readstat_convert(info->name, sizeof(info->name), variable.name, sizeof(variable.name), ctx->converter); if (retval != READSTAT_OK) goto cleanup; retval = readstat_convert(info->longname, sizeof(info->longname), variable.name, sizeof(variable.name), ctx->converter); if (retval != READSTAT_OK) goto cleanup; info->print_format.decimal_places = (variable.print & 0x000000FF); info->print_format.width = (variable.print & 0x0000FF00) >> 8; info->print_format.type = (variable.print & 0x00FF0000) >> 16; info->write_format.decimal_places = (variable.write & 0x000000FF); info->write_format.width = (variable.write & 0x0000FF00) >> 8; info->write_format.type = (variable.write & 0x00FF0000) >> 16; if (type > 0 || info->print_format.type == SPSS_FORMAT_TYPE_A || info->write_format.type == SPSS_FORMAT_TYPE_A) { info->type = READSTAT_TYPE_STRING; } else { info->type = READSTAT_TYPE_DOUBLE; } if (variable.has_var_label) { if ((retval = sav_read_variable_label(info, ctx)) != READSTAT_OK) { goto cleanup; } } if (variable.n_missing_values) { info->n_missing_values = ctx->bswap ? 
byteswap4(variable.n_missing_values) : variable.n_missing_values; if ((retval = sav_read_variable_missing_values(info, ctx)) != READSTAT_OK) { goto cleanup; } } ctx->var_index++; ctx->var_offset++; cleanup: return retval; } static readstat_error_t sav_skip_value_label_record(sav_ctx_t *ctx) { uint32_t label_count; uint32_t rec_type; uint32_t var_count; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if (io->read(&label_count, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) label_count = byteswap4(label_count); int i; for (i=0; i<label_count; i++) { value_label_t vlabel; if (io->read(&vlabel, 9, io->io_ctx) < 9) { retval = READSTAT_ERROR_READ; goto cleanup; } size_t label_len = (vlabel.label_len + 8) / 8 * 8 - 1; if (io->seek(label_len, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } } if (io->read(&rec_type, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) rec_type = byteswap4(rec_type); if (rec_type != 4) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (io->read(&var_count, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) var_count = byteswap4(var_count); if (io->seek(var_count * sizeof(uint32_t), READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } cleanup: return retval; } static readstat_error_t sav_submit_value_labels(value_label_t *value_labels, int32_t label_count, readstat_type_t value_type, sav_ctx_t *ctx) { char label_name_buf[256]; readstat_error_t retval = READSTAT_OK; int32_t i; snprintf(label_name_buf, sizeof(label_name_buf), SAV_LABEL_NAME_PREFIX "%d", ctx->value_labels_count); for (i=0; i<label_count; i++) { value_label_t *vlabel = &value_labels[i]; double val_d = 0.0; char unpadded_val[8*4+1]; readstat_value_t value = { .type = value_type }; if (value_type == READSTAT_TYPE_DOUBLE) { memcpy(&val_d, vlabel->value, 8); if (ctx->bswap) val_d = byteswap_double(val_d); value.v.double_value = val_d; sav_tag_missing_double(&value, ctx); } else { retval = readstat_convert(unpadded_val, sizeof(unpadded_val), vlabel->value, 8, 
ctx->converter); if (retval != READSTAT_OK) break; value.v.string_value = unpadded_val; } if (ctx->value_label_handler(label_name_buf, value, vlabel->label, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } cleanup: return retval; } static readstat_error_t sav_read_value_label_record(sav_ctx_t *ctx) { uint32_t label_count; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; uint32_t *vars = NULL; uint32_t var_count; int32_t rec_type; readstat_type_t value_type = READSTAT_TYPE_STRING; char label_buf[256]; value_label_t *value_labels = NULL; if (io->read(&label_count, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) label_count = byteswap4(label_count); if (label_count && (value_labels = readstat_malloc(label_count * sizeof(value_label_t))) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } int i; for (i=0; i<label_count; i++) { value_label_t *vlabel = &value_labels[i]; if (io->read(vlabel, 9, io->io_ctx) < 9) { retval = READSTAT_ERROR_READ; goto cleanup; } size_t label_len = (vlabel->label_len + 8) / 8 * 8 - 1; if (io->read(label_buf, label_len, io->io_ctx) < label_len) { retval = READSTAT_ERROR_READ; goto cleanup; } retval = readstat_convert(vlabel->label, sizeof(vlabel->label), label_buf, label_len, ctx->converter); if (retval != READSTAT_OK) goto cleanup; } if (io->read(&rec_type, sizeof(int32_t), io->io_ctx) < sizeof(int32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) rec_type = byteswap4(rec_type); if (rec_type != 4) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (io->read(&var_count, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) var_count = byteswap4(var_count); if (var_count && (vars = readstat_malloc(var_count * sizeof(uint32_t))) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (io->read(vars, var_count * sizeof(uint32_t), io->io_ctx) < var_count * sizeof(uint32_t)) { retval = 
READSTAT_ERROR_READ; goto cleanup; } for (i=0; i<var_count; i++) { uint32_t var_offset = vars[i]; if (ctx->bswap) var_offset = byteswap4(var_offset); var_offset--; // Why subtract 1???? spss_varinfo_t *var = bsearch(&var_offset, ctx->varinfo, ctx->var_index, sizeof(spss_varinfo_t), &spss_varinfo_compare); if (var) { var->labels_index = ctx->value_labels_count; value_type = var->type; } } if (ctx->value_label_handler) { sav_submit_value_labels(value_labels, label_count, value_type, ctx); } ctx->value_labels_count++; cleanup: if (vars) free(vars); if (value_labels) free(value_labels); return retval; } static readstat_error_t sav_skip_document_record(sav_ctx_t *ctx) { uint32_t n_lines; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if (io->read(&n_lines, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) n_lines = byteswap4(n_lines); if (io->seek(n_lines * SPSS_DOC_LINE_SIZE, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } cleanup: return retval; } static readstat_error_t sav_read_document_record(sav_ctx_t *ctx) { if (!ctx->note_handler) return sav_skip_document_record(ctx); uint32_t n_lines; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if (io->read(&n_lines, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) n_lines = byteswap4(n_lines); char raw_buffer[SPSS_DOC_LINE_SIZE]; char utf8_buffer[4*SPSS_DOC_LINE_SIZE+1]; int i; for (i=0; i<n_lines; i++) { if (io->read(raw_buffer, SPSS_DOC_LINE_SIZE, io->io_ctx) < SPSS_DOC_LINE_SIZE) { retval = READSTAT_ERROR_READ; goto cleanup; } retval = readstat_convert(utf8_buffer, sizeof(utf8_buffer), raw_buffer, sizeof(raw_buffer), ctx->converter); if (retval != READSTAT_OK) goto cleanup; if (ctx->note_handler(i, utf8_buffer, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } cleanup: return retval; } static readstat_error_t 
sav_read_dictionary_termination_record(sav_ctx_t *ctx) { int32_t filler; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if (io->read(&filler, sizeof(int32_t), io->io_ctx) < sizeof(int32_t)) { retval = READSTAT_ERROR_READ; } return retval; } static readstat_error_t sav_read_data(sav_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int longest_string = 256; int i; for (i=0; i<ctx->var_count; i++) { spss_varinfo_t *info = &ctx->varinfo[i]; if (info->string_length > longest_string) { longest_string = info->string_length; } } ctx->raw_string_len = longest_string + sizeof(SAV_EIGHT_SPACES)-2; ctx->raw_string = readstat_malloc(ctx->raw_string_len); ctx->utf8_string_len = 4*longest_string+1 + sizeof(SAV_EIGHT_SPACES)-2; ctx->utf8_string = readstat_malloc(ctx->utf8_string_len); if (ctx->raw_string == NULL || ctx->utf8_string == NULL) { retval = READSTAT_ERROR_MALLOC; goto done; } if (ctx->data_is_compressed) { retval = sav_read_compressed_data(ctx); } else { retval = sav_read_uncompressed_data(ctx); } if (retval != READSTAT_OK) goto done; if (ctx->record_count != -1 && ctx->current_row != ctx->row_limit) { retval = READSTAT_ERROR_ROW_COUNT_MISMATCH; } done: return retval; } static readstat_error_t sav_process_row(unsigned char *buffer, size_t buffer_len, sav_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; double fp_value; int offset = 0; readstat_off_t data_offset = 0; size_t raw_str_used = 0; int segment_offset = 0; int var_index = 0, col = 0; while (data_offset < buffer_len && col < ctx->var_index) { spss_varinfo_t *col_info = &ctx->varinfo[col]; spss_varinfo_t *var_info = &ctx->varinfo[var_index]; readstat_value_t value = { .type = var_info->type }; if (offset > 31) { retval = READSTAT_ERROR_PARSE; goto done; } if (var_info->type == READSTAT_TYPE_STRING) { if (raw_str_used + 8 <= ctx->raw_string_len) { memcpy(ctx->raw_string + raw_str_used, &buffer[data_offset], 8); raw_str_used += 8; } if (++offset == col_info->width) { if (++segment_offset < 
var_info->n_segments) { raw_str_used--; } offset = 0; col++; } if (segment_offset == var_info->n_segments) { if (!ctx->variables[var_info->index]->skip) { retval = readstat_convert(ctx->utf8_string, ctx->utf8_string_len, ctx->raw_string, raw_str_used, ctx->converter); if (retval != READSTAT_OK) goto done; value.v.string_value = ctx->utf8_string; if (ctx->value_handler(ctx->current_row, ctx->variables[var_info->index], value, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto done; } } raw_str_used = 0; segment_offset = 0; var_index += var_info->n_segments; } } else if (var_info->type == READSTAT_TYPE_DOUBLE) { if (!ctx->variables[var_info->index]->skip) { memcpy(&fp_value, &buffer[data_offset], 8); if (ctx->bswap) { fp_value = byteswap_double(fp_value); } value.v.double_value = fp_value; sav_tag_missing_double(&value, ctx); if (ctx->value_handler(ctx->current_row, ctx->variables[var_info->index], value, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto done; } } var_index += var_info->n_segments; col++; } data_offset += 8; } ctx->current_row++; done: return retval; } static readstat_error_t sav_read_uncompressed_data(sav_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; unsigned char *buffer = NULL; size_t bytes_read = 0; size_t buffer_len = ctx->var_offset * 8; buffer = readstat_malloc(buffer_len); while (ctx->row_limit == -1 || ctx->current_row < ctx->row_limit) { retval = sav_update_progress(ctx); if (retval != READSTAT_OK) goto done; if ((bytes_read = io->read(buffer, buffer_len, io->io_ctx)) != buffer_len) goto done; retval = sav_process_row(buffer, buffer_len, ctx); if (retval != READSTAT_OK) goto done; } done: if (buffer) free(buffer); return retval; } static readstat_error_t sav_read_compressed_data(sav_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; unsigned char chunk[8]; int i; double fp_value; uint64_t missing_value = 
ctx->missing_double; readstat_off_t data_offset = 0; unsigned char buffer[DATA_BUFFER_SIZE]; int buffer_used = 0; size_t uncompressed_row_len = ctx->var_offset * 8; readstat_off_t uncompressed_offset = 0; unsigned char *uncompressed_row = NULL; int bswap = ctx->bswap; ctx->bswap = 0; if (uncompressed_row_len && (uncompressed_row = readstat_malloc(uncompressed_row_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto done; } while (1) { if (data_offset >= buffer_used) { retval = sav_update_progress(ctx); if (retval != READSTAT_OK) goto done; if ((buffer_used = io->read(buffer, sizeof(buffer), io->io_ctx)) == -1 || buffer_used == 0 || (buffer_used % 8) != 0) goto done; data_offset = 0; } memcpy(chunk, &buffer[data_offset], 8); data_offset += 8; for (i=0; i<8; i++) { switch (chunk[i]) { case 0: break; case 252: goto done; case 253: if (data_offset >= buffer_used) { if ((buffer_used = io->read(buffer, sizeof(buffer), io->io_ctx)) == -1 || buffer_used == 0 || (buffer_used % 8) != 0) goto done; data_offset = 0; } memcpy(&uncompressed_row[uncompressed_offset], &buffer[data_offset], 8); uncompressed_offset += 8; data_offset += 8; break; case 254: memcpy(&uncompressed_row[uncompressed_offset], SAV_EIGHT_SPACES, 8); uncompressed_offset += 8; break; case 255: memcpy(&uncompressed_row[uncompressed_offset], &missing_value, sizeof(uint64_t)); uncompressed_offset += 8; break; default: fp_value = chunk[i] - 100.0; memcpy(&uncompressed_row[uncompressed_offset], &fp_value, sizeof(double)); uncompressed_offset += 8; break; } if (uncompressed_offset == uncompressed_row_len) { retval = sav_process_row(uncompressed_row, uncompressed_row_len, ctx); if (retval != READSTAT_OK) goto done; uncompressed_offset = 0; } if (ctx->current_row == ctx->row_limit) goto done; } } done: if (uncompressed_row) free(uncompressed_row); ctx->bswap = bswap; return retval; } static readstat_error_t sav_parse_machine_integer_info_record(const void *data, size_t data_len, sav_ctx_t *ctx) { if (data_len != 32) 
return READSTAT_ERROR_PARSE; const char *src_charset = NULL; const char *dst_charset = ctx->output_encoding; sav_machine_integer_info_record_t record; memcpy(&record, data, data_len); if (ctx->bswap) { record.character_code = byteswap4(record.character_code); } if (ctx->input_encoding) { src_charset = ctx->input_encoding; } else { int i; for (i=0; i<sizeof(_charset_table)/sizeof(_charset_table[0]); i++) { if (record.character_code == _charset_table[i].code) { src_charset = _charset_table[i].name; break; } } if (src_charset == NULL) { if (ctx->error_handler) { char error_buf[1024]; snprintf(error_buf, sizeof(error_buf), "Unsupported character set: %d\n", record.character_code); ctx->error_handler(error_buf, ctx->user_ctx); } return READSTAT_ERROR_UNSUPPORTED_CHARSET; } ctx->input_encoding = src_charset; } if (src_charset && dst_charset && strcmp(src_charset, dst_charset) != 0) { iconv_t converter = iconv_open(dst_charset, src_charset); if (converter == (iconv_t)-1) { return READSTAT_ERROR_UNSUPPORTED_CHARSET; } ctx->converter = converter; } return READSTAT_OK; } static readstat_error_t sav_parse_machine_floating_point_record(const void *data, size_t size, size_t count, sav_ctx_t *ctx) { if (size != 8 || count != 3) return READSTAT_ERROR_PARSE; sav_machine_floating_point_info_record_t fp_info; memcpy(&fp_info, data, sizeof(sav_machine_floating_point_info_record_t)); ctx->missing_double = ctx->bswap ? byteswap8(fp_info.sysmis) : fp_info.sysmis; ctx->highest_double = ctx->bswap ? byteswap8(fp_info.highest) : fp_info.highest; ctx->lowest_double = ctx->bswap ? byteswap8(fp_info.lowest) : fp_info.lowest; return READSTAT_OK; } /* We don't yet know how many real variables there are, so store the values in the record * and make sense of them later. 
*/ static readstat_error_t sav_store_variable_display_parameter_record(const void *data, size_t size, size_t count, sav_ctx_t *ctx) { if (size != 4) return READSTAT_ERROR_PARSE; const uint32_t *data_ptr = data; int i; ctx->variable_display_values = readstat_realloc(ctx->variable_display_values, count * sizeof(uint32_t)); if (count > 0 && ctx->variable_display_values == NULL) return READSTAT_ERROR_MALLOC; ctx->variable_display_values_count = count; for (i=0; i<count; i++) { ctx->variable_display_values[i] = ctx->bswap ? byteswap4(data_ptr[i]) : data_ptr[i]; } return READSTAT_OK; } static readstat_error_t sav_parse_variable_display_parameter_record(sav_ctx_t *ctx) { if (!ctx->variable_display_values) return READSTAT_OK; int i; long count = ctx->variable_display_values_count; if (count != 2 * ctx->var_count && count != 3 * ctx->var_count) { return READSTAT_ERROR_PARSE; } int has_display_width = ctx->var_count > 0 && (count / ctx->var_count == 3); int offset = 0; for (i=0; i<ctx->var_index;) { spss_varinfo_t *info = &ctx->varinfo[i]; info->measure = spss_measure_to_readstat_measure(ctx->variable_display_values[offset++]); if (has_display_width) { info->display_width = ctx->variable_display_values[offset++]; } info->alignment = spss_alignment_to_readstat_alignment(ctx->variable_display_values[offset++]); i += info->n_segments; } return READSTAT_OK; } static readstat_error_t sav_parse_long_value_labels_record(const void *data, size_t data_len, sav_ctx_t *ctx) { if (!ctx->value_label_handler) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; uint32_t label_name_len = 0; uint32_t label_count = 0; uint32_t i = 0; const char *data_ptr = data; const char *data_end = data_ptr + data_len; char var_name_buf[256*4+1]; char label_name_buf[256]; char *value_buffer = NULL; char *label_buffer = NULL; memset(label_name_buf, '\0', sizeof(label_name_buf)); if (data_ptr + sizeof(uint32_t) > data_end) { retval = READSTAT_ERROR_PARSE; goto cleanup; } memcpy(&label_name_len, data_ptr, sizeof(uint32_t)); if 
(ctx->bswap) label_name_len = byteswap4(label_name_len); data_ptr += sizeof(uint32_t); if (data_ptr + label_name_len > data_end) { retval = READSTAT_ERROR_PARSE; goto cleanup; } retval = readstat_convert(var_name_buf, sizeof(var_name_buf), data_ptr, label_name_len, ctx->converter); if (retval != READSTAT_OK) goto cleanup; data_ptr += label_name_len; for (i=0; i<ctx->var_index;) { spss_varinfo_t *info = &ctx->varinfo[i]; if (strcmp(var_name_buf, info->longname) == 0) { info->labels_index = ctx->value_labels_count++; snprintf(label_name_buf, sizeof(label_name_buf), SAV_LABEL_NAME_PREFIX "%d", info->labels_index); break; } i += info->n_segments; } if (label_name_buf[0] == '\0') { retval = READSTAT_ERROR_PARSE; goto cleanup; } data_ptr += sizeof(uint32_t); if (data_ptr + sizeof(uint32_t) > data_end) { retval = READSTAT_ERROR_PARSE; goto cleanup; } memcpy(&label_count, data_ptr, sizeof(uint32_t)); if (ctx->bswap) label_count = byteswap4(label_count); data_ptr += sizeof(uint32_t); for (i=0; i<label_count; i++) { uint32_t value_len = 0, label_len = 0; uint32_t value_buffer_len = 0, label_buffer_len = 0; if (data_ptr + sizeof(uint32_t) > data_end) { retval = READSTAT_ERROR_PARSE; goto cleanup; } memcpy(&value_len, data_ptr, sizeof(uint32_t)); if (ctx->bswap) value_len = byteswap4(value_len); data_ptr += sizeof(uint32_t); value_buffer_len = value_len*4+1; value_buffer = readstat_realloc(value_buffer, value_buffer_len); if (value_buffer == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (data_ptr + value_len > data_end) { retval = READSTAT_ERROR_PARSE; goto cleanup; } retval = readstat_convert(value_buffer, value_buffer_len, data_ptr, value_len, ctx->converter); if (retval != READSTAT_OK) goto cleanup; data_ptr += value_len; if (data_ptr + sizeof(uint32_t) > data_end) { retval = READSTAT_ERROR_PARSE; goto cleanup; } memcpy(&label_len, data_ptr, sizeof(uint32_t)); if (ctx->bswap) label_len = byteswap4(label_len); data_ptr += sizeof(uint32_t); label_buffer_len = label_len*4+1; label_buffer = readstat_realloc(label_buffer, label_buffer_len); if (label_buffer == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } 
if (data_ptr + label_len > data_end) { retval = READSTAT_ERROR_PARSE; goto cleanup; } retval = readstat_convert(label_buffer, label_buffer_len, data_ptr, label_len, ctx->converter); if (retval != READSTAT_OK) goto cleanup; data_ptr += label_len; readstat_value_t value = { .type = READSTAT_TYPE_STRING }; value.v.string_value = value_buffer; if (ctx->value_label_handler(label_name_buf, value, label_buffer, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } cleanup: if (value_buffer) free(value_buffer); if (label_buffer) free(label_buffer); return retval; } static readstat_error_t sav_parse_records_pass1(sav_ctx_t *ctx) { char data_buf[4096]; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; while (1) { uint32_t rec_type; uint32_t extra_info[3]; size_t data_len = 0; int i; int done = 0; if (io->read(&rec_type, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) { rec_type = byteswap4(rec_type); } switch (rec_type) { case SAV_RECORD_TYPE_VARIABLE: retval = sav_skip_variable_record(ctx); if (retval != READSTAT_OK) goto cleanup; break; case SAV_RECORD_TYPE_VALUE_LABEL: retval = sav_skip_value_label_record(ctx); if (retval != READSTAT_OK) goto cleanup; break; case SAV_RECORD_TYPE_DOCUMENT: retval = sav_skip_document_record(ctx); if (retval != READSTAT_OK) goto cleanup; break; case SAV_RECORD_TYPE_DICT_TERMINATION: done = 1; break; case SAV_RECORD_TYPE_HAS_DATA: if (io->read(extra_info, sizeof(extra_info), io->io_ctx) < sizeof(extra_info)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) { for (i=0; i<3; i++) extra_info[i] = byteswap4(extra_info[i]); } uint32_t subtype = extra_info[0]; size_t size = extra_info[1]; size_t count = extra_info[2]; data_len = size * count; if (subtype == SAV_RECORD_SUBTYPE_INTEGER_INFO) { if (data_len > sizeof(data_buf)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (io->read(data_buf, data_len, 
io->io_ctx) < data_len) { retval = READSTAT_ERROR_PARSE; goto cleanup; } retval = sav_parse_machine_integer_info_record(data_buf, data_len, ctx); if (retval != READSTAT_OK) goto cleanup; } else { if (io->seek(data_len, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } } break; default: retval = READSTAT_ERROR_PARSE; goto cleanup; break; } if (done) break; } cleanup: return retval; } static readstat_error_t sav_parse_records_pass2(sav_ctx_t *ctx) { void *data_buf = NULL; size_t data_buf_capacity = 4096; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if ((data_buf = readstat_malloc(data_buf_capacity)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } while (1) { uint32_t rec_type; uint32_t extra_info[3]; size_t data_len = 0; int i; int done = 0; if (io->read(&rec_type, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) { rec_type = byteswap4(rec_type); } switch (rec_type) { case SAV_RECORD_TYPE_VARIABLE: if ((retval = sav_read_variable_record(ctx)) != READSTAT_OK) goto cleanup; break; case SAV_RECORD_TYPE_VALUE_LABEL: if ((retval = sav_read_value_label_record(ctx)) != READSTAT_OK) goto cleanup; break; case SAV_RECORD_TYPE_DOCUMENT: if ((retval = sav_read_document_record(ctx)) != READSTAT_OK) goto cleanup; break; case SAV_RECORD_TYPE_DICT_TERMINATION: if ((retval = sav_read_dictionary_termination_record(ctx)) != READSTAT_OK) goto cleanup; done = 1; break; case SAV_RECORD_TYPE_HAS_DATA: if (io->read(extra_info, sizeof(extra_info), io->io_ctx) < sizeof(extra_info)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (ctx->bswap) { for (i=0; i<3; i++) extra_info[i] = byteswap4(extra_info[i]); } uint32_t subtype = extra_info[0]; size_t size = extra_info[1]; size_t count = extra_info[2]; data_len = size * count; if (data_buf_capacity < data_len) { if ((data_buf = readstat_realloc(data_buf, data_buf_capacity = data_len)) == NULL) { retval = 
READSTAT_ERROR_MALLOC; goto cleanup; } } if (io->read(data_buf, data_len, io->io_ctx) < data_len) { retval = READSTAT_ERROR_PARSE; goto cleanup; } switch (subtype) { case SAV_RECORD_SUBTYPE_INTEGER_INFO: /* parsed in pass 1 */ break; case SAV_RECORD_SUBTYPE_FP_INFO: retval = sav_parse_machine_floating_point_record(data_buf, size, count, ctx); if (retval != READSTAT_OK) goto cleanup; break; case SAV_RECORD_SUBTYPE_VAR_DISPLAY: retval = sav_store_variable_display_parameter_record(data_buf, size, count, ctx); if (retval != READSTAT_OK) goto cleanup; break; case SAV_RECORD_SUBTYPE_LONG_VAR_NAME: retval = sav_parse_long_variable_names_record(data_buf, count, ctx); if (retval != READSTAT_OK) goto cleanup; break; case SAV_RECORD_SUBTYPE_VERY_LONG_STR: retval = sav_parse_very_long_string_record(data_buf, count, ctx); if (retval != READSTAT_OK) goto cleanup; break; case SAV_RECORD_SUBTYPE_LONG_VALUE_LABELS: retval = sav_parse_long_value_labels_record(data_buf, count, ctx); if (retval != READSTAT_OK) goto cleanup; default: /* misc. 
info */ break; } break; default: retval = READSTAT_ERROR_PARSE; goto cleanup; break; } if (done) break; } cleanup: if (data_buf) free(data_buf); return retval; } static void sav_set_n_segments_and_var_count(sav_ctx_t *ctx) { int i; ctx->var_count = 0; for (i=0; i<ctx->var_index;) { spss_varinfo_t *info = &ctx->varinfo[i]; if (info->string_length) { info->n_segments = (info->string_length + 251) / 252; } info->index = ctx->var_count++; i += info->n_segments; } ctx->variables = readstat_calloc(ctx->var_count, sizeof(readstat_variable_t *)); } static readstat_error_t sav_handle_variables(readstat_parser_t *parser, sav_ctx_t *ctx) { int i; int index_after_skipping = 0; readstat_error_t retval = READSTAT_OK; if (!parser->variable_handler) return retval; for (i=0; i<ctx->var_index;) { char label_name_buf[256]; spss_varinfo_t *info = &ctx->varinfo[i]; ctx->variables[info->index] = spss_init_variable_for_info(info, index_after_skipping); snprintf(label_name_buf, sizeof(label_name_buf), SAV_LABEL_NAME_PREFIX "%d", info->labels_index); int cb_retval = parser->variable_handler(info->index, ctx->variables[info->index], info->labels_index == -1 ? 
NULL : label_name_buf, ctx->user_ctx); if (cb_retval == READSTAT_HANDLER_ABORT) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } if (cb_retval == READSTAT_HANDLER_SKIP_VARIABLE) { ctx->variables[info->index]->skip = 1; } else { index_after_skipping++; } i += info->n_segments; } cleanup: return retval; } static readstat_error_t sav_handle_fweight(readstat_parser_t *parser, sav_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int i; if (parser->fweight_handler && ctx->fweight_index >= 0) { for (i=0; i<ctx->var_index;) { spss_varinfo_t *info = &ctx->varinfo[i]; if (info->offset == ctx->fweight_index - 1) { if (parser->fweight_handler(ctx->variables[info->index], ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } break; } i += info->n_segments; } } cleanup: return retval; } readstat_error_t sav_parse_timestamp(sav_ctx_t *ctx, sav_file_header_record_t *header) { readstat_error_t retval = READSTAT_OK; struct tm timestamp = { .tm_isdst = -1 }; if ((retval = sav_parse_time(header->creation_time, sizeof(header->creation_time), &timestamp, ctx->error_handler, ctx->user_ctx)) != READSTAT_OK) goto cleanup; if ((retval = sav_parse_date(header->creation_date, sizeof(header->creation_date), &timestamp, ctx->error_handler, ctx->user_ctx)) != READSTAT_OK) goto cleanup; ctx->timestamp = mktime(&timestamp); cleanup: return retval; } readstat_error_t readstat_parse_sav(readstat_parser_t *parser, const char *path, void *user_ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = parser->io; sav_file_header_record_t header; sav_ctx_t *ctx = NULL; size_t file_size = 0; if (io->open(path, io->io_ctx) == -1) { return READSTAT_ERROR_OPEN; } file_size = io->seek(0, READSTAT_SEEK_END, io->io_ctx); if (file_size == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } if (io->seek(0, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } if (io->read(&header, sizeof(sav_file_header_record_t), io->io_ctx) < 
sizeof(sav_file_header_record_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } ctx = sav_ctx_init(&header, io); if (ctx == NULL) { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->progress_handler = parser->progress_handler; ctx->error_handler = parser->error_handler; ctx->note_handler = parser->note_handler; ctx->value_handler = parser->value_handler; ctx->value_label_handler = parser->value_label_handler; ctx->input_encoding = parser->input_encoding; ctx->output_encoding = parser->output_encoding; ctx->user_ctx = user_ctx; ctx->file_size = file_size; if (parser->row_limit > 0 && (parser->row_limit < ctx->record_count || ctx->record_count == -1)) { ctx->row_limit = parser->row_limit; } else { ctx->row_limit = ctx->record_count; } if ((retval = sav_parse_timestamp(ctx, &header)) != READSTAT_OK) goto cleanup; if ((retval = sav_parse_records_pass1(ctx)) != READSTAT_OK) goto cleanup; if (io->seek(sizeof(sav_file_header_record_t), READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } if ((retval = sav_update_progress(ctx)) != READSTAT_OK) goto cleanup; if ((retval = sav_parse_records_pass2(ctx)) != READSTAT_OK) goto cleanup; sav_set_n_segments_and_var_count(ctx); if (ctx->var_count == 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (parser->info_handler) { if (parser->info_handler(ctx->record_count == -1 ? 
-1 : ctx->row_limit, ctx->var_count, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } if (parser->metadata_handler) { if ((retval = readstat_convert(ctx->file_label, sizeof(ctx->file_label), header.file_label, sizeof(header.file_label), ctx->converter)) != READSTAT_OK) goto cleanup; if (parser->metadata_handler(ctx->file_label, ctx->input_encoding, ctx->timestamp, 2, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } sav_parse_variable_display_parameter_record(ctx); if ((retval = sav_handle_variables(parser, ctx)) != READSTAT_OK) goto cleanup; if ((retval = sav_handle_fweight(parser, ctx)) != READSTAT_OK) goto cleanup; if (ctx->value_handler) { retval = sav_read_data(ctx); } cleanup: io->close(io->io_ctx); if (ctx) sav_ctx_free(ctx); return retval; } haven/src/readstat/spss/readstat_sav_parse_timestamp.h0000644000176200001440000000043513227731765023066 0ustar liggesusers readstat_error_t sav_parse_time(const char *data, size_t len, struct tm *timestamp, readstat_error_handler error_cb, void *user_ctx); readstat_error_t sav_parse_date(const char *data, size_t len, struct tm *timestamp, readstat_error_handler error_cb, void *user_ctx); haven/src/readstat/spss/readstat_sav_write.c0000644000176200001440000011141513227731765021017 0ustar liggesusers #include #include #include #include #include #include #include #include #include #include "../readstat.h" #include "../readstat_iconv.h" #include "../readstat_bits.h" #include "../readstat_writer.h" #include "readstat_sav.h" #include "readstat_spss_parse.h" #define MAX_STRING_SIZE 255 #define MAX_LABEL_SIZE 256 #define MAX_VALUE_LABEL_SIZE 120 static long readstat_label_set_number_short_variables(readstat_label_set_t *r_label_set) { long count = 0; int j; for (j=0; jvariables_count; j++) { readstat_variable_t *r_variable = readstat_get_label_set_variable(r_label_set, j); if (r_variable->storage_width <= 8) { count++; } } return count; } 
static int readstat_label_set_needs_short_value_labels_record(readstat_label_set_t *r_label_set) { return readstat_label_set_number_short_variables(r_label_set) > 0; } static int readstat_label_set_needs_long_value_labels_record(readstat_label_set_t *r_label_set) { return readstat_label_set_number_short_variables(r_label_set) < r_label_set->variables_count; } static int32_t sav_encode_format(spss_format_t *spss_format) { return ((spss_format->type << 16) | (spss_format->width << 8) | spss_format->decimal_places); } static readstat_error_t sav_encode_variable_format(int32_t *out_code, readstat_variable_t *r_variable) { spss_format_t spss_format; readstat_error_t retval = spss_format_for_variable(r_variable, &spss_format); if (retval == READSTAT_OK && out_code) *out_code = sav_encode_format(&spss_format); return retval; } static size_t sav_format_variable_name(char *output, size_t output_len, int i) { snprintf(output, output_len, "VAR%d", (unsigned int)i % 100000); return strlen(output); } static readstat_error_t sav_emit_header(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; time_t now = writer->timestamp; struct tm *time_s = localtime(&now); sav_file_header_record_t header = { { 0 } }; memcpy(header.rec_type, "$FL2", sizeof("$FL2")-1); memset(header.prod_name, ' ', sizeof(header.prod_name)); memcpy(header.prod_name, "@(#) SPSS DATA FILE - " READSTAT_PRODUCT_URL, sizeof("@(#) SPSS DATA FILE - " READSTAT_PRODUCT_URL)-1); header.layout_code = 2; header.nominal_case_size = writer->row_len / 8; header.compressed = (writer->compression == READSTAT_COMPRESS_ROWS); if (writer->fweight_variable) { int32_t dictionary_index = 1 + writer->fweight_variable->offset / 8; header.weight_index = dictionary_index; } else { header.weight_index = 0; } header.ncases = writer->row_count; header.bias = 100.0; /* There are portability issues with strftime so hack something up */ char months[][4] = { "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", 
"Nov", "Dec" }; char creation_date[sizeof(header.creation_date)+1]; snprintf(creation_date, sizeof(creation_date), "%02d %3.3s %02d", (unsigned int)time_s->tm_mday % 100, months[time_s->tm_mon], (unsigned int)time_s->tm_year % 100); strncpy(header.creation_date, creation_date, sizeof(header.creation_date)); char creation_time[sizeof(header.creation_time)+1]; snprintf(creation_time, sizeof(creation_time), "%02d:%02d:%02d", (unsigned int)time_s->tm_hour % 100, (unsigned int)time_s->tm_min % 100, (unsigned int)time_s->tm_sec % 100); strncpy(header.creation_time, creation_time, sizeof(header.creation_time)); memset(header.file_label, ' ', sizeof(header.file_label)); size_t file_label_len = strlen(writer->file_label); if (file_label_len > sizeof(header.file_label)) file_label_len = sizeof(header.file_label); if (writer->file_label[0]) memcpy(header.file_label, writer->file_label, file_label_len); retval = readstat_write_bytes(writer, &header, sizeof(header)); return retval; } static readstat_error_t sav_emit_variable_label(readstat_writer_t *writer, readstat_variable_t *r_variable) { readstat_error_t retval = READSTAT_OK; const char *title_data = r_variable->label; size_t title_data_len = strlen(title_data); if (title_data_len > 0) { char padded_label[MAX_LABEL_SIZE]; int32_t label_len = title_data_len; if (label_len > sizeof(padded_label)) label_len = sizeof(padded_label); retval = readstat_write_bytes(writer, &label_len, sizeof(label_len)); if (retval != READSTAT_OK) goto cleanup; strncpy(padded_label, title_data, (label_len + 3) / 4 * 4); retval = readstat_write_bytes(writer, padded_label, (label_len + 3) / 4 * 4); if (retval != READSTAT_OK) goto cleanup; } cleanup: return retval; } static readstat_error_t sav_n_missing_values(int *out_n_missing_values, readstat_variable_t *r_variable) { int n_missing_ranges = readstat_variable_get_missing_ranges_count(r_variable); int n_missing_values = n_missing_ranges; int has_missing_range = 0; int j; for (j=0; j 3) { return 
READSTAT_ERROR_TOO_MANY_MISSING_VALUE_DEFINITIONS; } if (out_n_missing_values) *out_n_missing_values = has_missing_range ? -n_missing_values : n_missing_values; return READSTAT_OK; } static readstat_error_t sav_emit_variable_missing_values(readstat_writer_t *writer, readstat_variable_t *r_variable) { readstat_error_t retval = READSTAT_OK; int n_missing_values = 0; int n_missing_ranges = readstat_variable_get_missing_ranges_count(r_variable); /* ranges */ int j; for (j=0; jindex); retval = readstat_write_bytes(writer, &rec_type, sizeof(rec_type)); if (retval != READSTAT_OK) goto cleanup; sav_variable_record_t variable = {0}; if (r_variable->type == READSTAT_TYPE_STRING) { variable.type = r_variable->user_width > MAX_STRING_SIZE ? MAX_STRING_SIZE : r_variable->user_width; } variable.has_var_label = (r_variable->label[0] != '\0'); retval = sav_n_missing_values(&variable.n_missing_values, r_variable); if (retval != READSTAT_OK) goto cleanup; retval = sav_encode_variable_format(&variable.print, r_variable); if (retval != READSTAT_OK) goto cleanup; variable.write = variable.print; memset(variable.name, ' ', sizeof(variable.name)); if (name_data_len > 0 && name_data_len <= sizeof(variable.name)) memcpy(variable.name, name_data, name_data_len); retval = readstat_write_bytes(writer, &variable, sizeof(variable)); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_variable_label(writer, r_variable); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_variable_missing_values(writer, r_variable); if (retval != READSTAT_OK) goto cleanup; int extra_fields = r_variable->storage_width / 8 - 1; if (extra_fields > 31) extra_fields = 31; retval = sav_emit_blank_variable_records(writer, extra_fields); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t sav_emit_ghost_variable_record(readstat_writer_t *writer, const char *name, size_t user_width) { readstat_error_t retval = READSTAT_OK; int32_t rec_type = SAV_RECORD_TYPE_VARIABLE; 
size_t name_len = strlen(name); retval = readstat_write_bytes(writer, &rec_type, sizeof(rec_type)); if (retval != READSTAT_OK) goto cleanup; sav_variable_record_t variable = {0}; variable.type = user_width; memset(variable.name, ' ', sizeof(variable.name)); if (name_len > 0 && name_len <= sizeof(variable.name)) memcpy(variable.name, name, name_len); retval = readstat_write_bytes(writer, &variable, sizeof(variable)); if (retval != READSTAT_OK) goto cleanup; int extra_fields = (user_width + 7) / 8 - 1; if (extra_fields > 31) extra_fields = 31; retval = sav_emit_blank_variable_records(writer, extra_fields); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t sav_emit_full_variable_record(readstat_writer_t *writer, readstat_variable_t *r_variable) { readstat_error_t retval = READSTAT_OK; char name_data[9]; sav_format_variable_name(name_data, sizeof(name_data), r_variable->index); retval = sav_emit_base_variable_record(writer, r_variable); if (retval != READSTAT_OK) goto cleanup; if (r_variable->type == READSTAT_TYPE_STRING) { size_t n_segments = 1; if (r_variable->user_width > MAX_STRING_SIZE) { n_segments = (r_variable->user_width + 251) / 252; } int i; for (i=1; iuser_width - (n_segments - 1) * 252); } retval = sav_emit_ghost_variable_record(writer, name_data, storage_size); if (retval != READSTAT_OK) goto cleanup; } } cleanup: return retval; } static readstat_error_t sav_emit_variable_records(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int i; for (i=0; ivariables_count; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); retval = sav_emit_full_variable_record(writer, r_variable); if (retval != READSTAT_OK) goto cleanup; } cleanup: return retval; } static readstat_error_t sav_emit_value_label_records(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int i, j; for (i=0; ilabel_sets_count; i++) { readstat_label_set_t *r_label_set = readstat_get_label_set(writer, i); 
if (!readstat_label_set_needs_short_value_labels_record(r_label_set)) continue; readstat_type_t user_type = r_label_set->type; int32_t label_count = r_label_set->value_labels_count; int32_t rec_type = 0; if (label_count) { rec_type = SAV_RECORD_TYPE_VALUE_LABEL; retval = readstat_write_bytes(writer, &rec_type, sizeof(rec_type)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &label_count, sizeof(label_count)); if (retval != READSTAT_OK) goto cleanup; for (j=0; jstring_key_len; if (key_len > sizeof(value)) key_len = sizeof(value); memset(value, ' ', sizeof(value)); memcpy(value, r_value_label->string_key, key_len); } else if (user_type == READSTAT_TYPE_DOUBLE) { double num_val = r_value_label->double_key; memcpy(value, &num_val, sizeof(double)); } else if (user_type == READSTAT_TYPE_INT32) { double num_val = r_value_label->int32_key; memcpy(value, &num_val, sizeof(double)); } retval = readstat_write_bytes(writer, value, sizeof(value)); const char *label_data = r_value_label->label; char label_len = r_value_label->label_len; if (label_len > MAX_VALUE_LABEL_SIZE) label_len = MAX_VALUE_LABEL_SIZE; retval = readstat_write_bytes(writer, &label_len, sizeof(label_len)); if (retval != READSTAT_OK) goto cleanup; char label[MAX_VALUE_LABEL_SIZE+8]; memset(label, ' ', sizeof(label)); memcpy(label, label_data, label_len); retval = readstat_write_bytes(writer, label, (label_len + sizeof(label_len) + 7) / 8 * 8 - sizeof(label_len)); if (retval != READSTAT_OK) goto cleanup; } rec_type = SAV_RECORD_TYPE_VALUE_LABEL_VARIABLES; int32_t var_count = readstat_label_set_number_short_variables(r_label_set); retval = readstat_write_bytes(writer, &rec_type, sizeof(rec_type)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &var_count, sizeof(var_count)); if (retval != READSTAT_OK) goto cleanup; for (j=0; jvariables_count; j++) { readstat_variable_t *r_variable = readstat_get_label_set_variable(r_label_set, j); if 
(r_variable->storage_width > 8) continue; int32_t dictionary_index = 1 + r_variable->offset / 8; retval = readstat_write_bytes(writer, &dictionary_index, sizeof(dictionary_index)); if (retval != READSTAT_OK) goto cleanup; } } } cleanup: return retval; } static readstat_error_t sav_emit_document_record(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int32_t rec_type = SAV_RECORD_TYPE_DOCUMENT; int32_t n_lines = writer->notes_count; if (n_lines == 0) goto cleanup; retval = readstat_write_bytes(writer, &rec_type, sizeof(rec_type)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &n_lines, sizeof(n_lines)); if (retval != READSTAT_OK) goto cleanup; int i; for (i=0; i<writer->notes_count; i++) { size_t len = strlen(writer->notes[i]); if (len > SPSS_DOC_LINE_SIZE) { retval = READSTAT_ERROR_NOTE_IS_TOO_LONG; goto cleanup; } retval = readstat_write_bytes(writer, writer->notes[i], len); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_spaces(writer, SPSS_DOC_LINE_SIZE - len); if (retval != READSTAT_OK) goto cleanup; } cleanup: return retval; } static readstat_error_t sav_emit_integer_info_record(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; sav_info_record_t info_header = {0}; info_header.rec_type = SAV_RECORD_TYPE_HAS_DATA; info_header.subtype = SAV_RECORD_SUBTYPE_INTEGER_INFO; info_header.size = 4; info_header.count = 8; retval = readstat_write_bytes(writer, &info_header, sizeof(info_header)); if (retval != READSTAT_OK) goto cleanup; sav_machine_integer_info_record_t machine_info = {0}; machine_info.version_major = 1; machine_info.version_minor = 0; machine_info.version_revision = 0; machine_info.machine_code = -1; machine_info.floating_point_rep = SAV_FLOATING_POINT_REP_IEEE; machine_info.compression_code = 1; machine_info.endianness = machine_is_little_endian() ? 
SAV_ENDIANNESS_LITTLE : SAV_ENDIANNESS_BIG; machine_info.character_code = 65001; /* utf-8 */ retval = readstat_write_bytes(writer, &machine_info, sizeof(machine_info)); cleanup: return retval; } static readstat_error_t sav_emit_floating_point_info_record(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; sav_info_record_t info_header = {0}; info_header.rec_type = SAV_RECORD_TYPE_HAS_DATA; info_header.subtype = SAV_RECORD_SUBTYPE_FP_INFO; info_header.size = 8; info_header.count = 3; retval = readstat_write_bytes(writer, &info_header, sizeof(info_header)); if (retval != READSTAT_OK) goto cleanup; sav_machine_floating_point_info_record_t fp_info = {0}; fp_info.sysmis = SAV_MISSING_DOUBLE; fp_info.highest = SAV_HIGHEST_DOUBLE; fp_info.lowest = SAV_LOWEST_DOUBLE; retval = readstat_write_bytes(writer, &fp_info, sizeof(fp_info)); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t sav_emit_variable_display_record(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int i; sav_info_record_t info_header = {0}; info_header.rec_type = SAV_RECORD_TYPE_HAS_DATA; info_header.subtype = SAV_RECORD_SUBTYPE_VAR_DISPLAY; info_header.size = sizeof(int32_t); info_header.count = 3 * writer->variables_count; retval = readstat_write_bytes(writer, &info_header, sizeof(info_header)); if (retval != READSTAT_OK) goto cleanup; for (i=0; i<writer->variables_count; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); readstat_measure_t measure = readstat_variable_get_measure(r_variable); int32_t sav_measure = spss_measure_from_readstat_measure(measure); retval = readstat_write_bytes(writer, &sav_measure, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; int32_t sav_display_width = readstat_variable_get_display_width(r_variable); if (sav_display_width <= 0) sav_display_width = 8; retval = readstat_write_bytes(writer, &sav_display_width, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; 
readstat_alignment_t alignment = readstat_variable_get_alignment(r_variable); int32_t sav_alignment = spss_alignment_from_readstat_alignment(alignment); retval = readstat_write_bytes(writer, &sav_alignment, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; } cleanup: return retval; } static readstat_error_t sav_emit_long_var_name_record(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int i; sav_info_record_t info_header = {0}; info_header.rec_type = SAV_RECORD_TYPE_HAS_DATA; info_header.subtype = SAV_RECORD_SUBTYPE_LONG_VAR_NAME; info_header.size = 1; info_header.count = 0; for (i=0; i<writer->variables_count; i++) { char name_data[9]; size_t name_data_len = sav_format_variable_name(name_data, sizeof(name_data), i); readstat_variable_t *r_variable = readstat_get_variable(writer, i); const char *title_data = r_variable->name; size_t title_data_len = strlen(title_data); if (title_data_len > 0 && name_data_len > 0) { if (title_data_len > 64) title_data_len = 64; info_header.count += name_data_len; info_header.count += sizeof("=")-1; info_header.count += title_data_len; info_header.count += sizeof("\x09")-1; } } if (info_header.count > 0) { info_header.count--; /* no trailing 0x09 */ retval = readstat_write_bytes(writer, &info_header, sizeof(info_header)); if (retval != READSTAT_OK) goto cleanup; int is_first = 1; for (i=0; i<writer->variables_count; i++) { char name_data[9]; sav_format_variable_name(name_data, sizeof(name_data), i); readstat_variable_t *r_variable = readstat_get_variable(writer, i); const char *title_data = r_variable->name; size_t title_data_len = strlen(title_data); char kv_separator = '='; char tuple_separator = 0x09; if (title_data_len > 0) { if (title_data_len > 64) title_data_len = 64; if (!is_first) { retval = readstat_write_bytes(writer, &tuple_separator, sizeof(tuple_separator)); if (retval != READSTAT_OK) goto cleanup; } retval = readstat_write_string(writer, name_data); if (retval != READSTAT_OK) goto cleanup; retval = 
readstat_write_bytes(writer, &kv_separator, sizeof(kv_separator)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, title_data, title_data_len); if (retval != READSTAT_OK) goto cleanup; is_first = 0; } } } cleanup: return retval; } static readstat_error_t sav_emit_very_long_string_record(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int i; char tuple_separator[2] = { 0x00, 0x09 }; sav_info_record_t info_header = {0}; info_header.rec_type = SAV_RECORD_TYPE_HAS_DATA; info_header.subtype = SAV_RECORD_SUBTYPE_VERY_LONG_STR; info_header.size = 1; info_header.count = 0; for (i=0; i<writer->variables_count; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); if (r_variable->user_width <= MAX_STRING_SIZE) continue; char name_data[9]; sav_format_variable_name(name_data, sizeof(name_data), i); char kv_data[8+1+5+1]; snprintf(kv_data, sizeof(kv_data), "%.8s=%05d", name_data, (unsigned int)r_variable->user_width % 100000); info_header.count += strlen(kv_data) + sizeof(tuple_separator); } if (info_header.count == 0) return READSTAT_OK; retval = readstat_write_bytes(writer, &info_header, sizeof(info_header)); if (retval != READSTAT_OK) goto cleanup; for (i=0; i<writer->variables_count; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); if (r_variable->user_width <= MAX_STRING_SIZE) continue; char name_data[9]; sav_format_variable_name(name_data, sizeof(name_data), i); char kv_data[8+1+5+1]; snprintf(kv_data, sizeof(kv_data), "%.8s=%05d", name_data, (unsigned int)r_variable->user_width % 100000); retval = readstat_write_string(writer, kv_data); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, tuple_separator, sizeof(tuple_separator)); if (retval != READSTAT_OK) goto cleanup; } cleanup: return retval; } static readstat_error_t sav_emit_long_value_labels_records(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int i, j, k; char *space_buffer = NULL; 
sav_info_record_t info_header = {0}; info_header.rec_type = SAV_RECORD_TYPE_HAS_DATA; info_header.subtype = SAV_RECORD_SUBTYPE_LONG_VALUE_LABELS; info_header.size = 1; info_header.count = 0; for (i=0; ilabel_sets_count; i++) { readstat_label_set_t *r_label_set = readstat_get_label_set(writer, i); if (!readstat_label_set_needs_long_value_labels_record(r_label_set)) continue; int32_t label_count = r_label_set->value_labels_count; int32_t var_count = r_label_set->variables_count; for (k=0; kname); int32_t storage_width = readstat_variable_get_storage_width(r_variable); if (storage_width <= 8) continue; space_buffer = realloc(space_buffer, storage_width); memset(space_buffer, ' ', storage_width); info_header.count += sizeof(int32_t); // name length info_header.count += name_len; info_header.count += sizeof(int32_t); // variable width info_header.count += sizeof(int32_t); // label count for (j=0; jlabel_len; if (label_len > MAX_VALUE_LABEL_SIZE) label_len = MAX_VALUE_LABEL_SIZE; info_header.count += sizeof(int32_t); // value length info_header.count += storage_width; info_header.count += sizeof(int32_t); // label length info_header.count += label_len; } retval = readstat_write_bytes(writer, &info_header, sizeof(info_header)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &name_len, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, r_variable->name, name_len); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &storage_width, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &label_count, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; for (j=0; jstring_key_len; int32_t label_len = r_value_label->label_len; if (label_len > MAX_VALUE_LABEL_SIZE) label_len = MAX_VALUE_LABEL_SIZE; retval = readstat_write_bytes(writer, &storage_width, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = 
readstat_write_bytes(writer, r_value_label->string_key, value_len); if (retval != READSTAT_OK) goto cleanup; if (value_len < storage_width) { retval = readstat_write_bytes(writer, space_buffer, storage_width - value_len); if (retval != READSTAT_OK) goto cleanup; } retval = readstat_write_bytes(writer, &label_len, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, r_value_label->label, label_len); if (retval != READSTAT_OK) goto cleanup; } } } cleanup: if (space_buffer) free(space_buffer); return retval; } static readstat_error_t sav_emit_termination_record(readstat_writer_t *writer) { sav_dictionary_termination_record_t termination_record = { .rec_type = SAV_RECORD_TYPE_DICT_TERMINATION }; return readstat_write_bytes(writer, &termination_record, sizeof(termination_record)); } static readstat_error_t sav_write_int8(void *row, const readstat_variable_t *var, int8_t value) { double dval = value; memcpy(row, &dval, sizeof(double)); return READSTAT_OK; } static readstat_error_t sav_write_int16(void *row, const readstat_variable_t *var, int16_t value) { double dval = value; memcpy(row, &dval, sizeof(double)); return READSTAT_OK; } static readstat_error_t sav_write_int32(void *row, const readstat_variable_t *var, int32_t value) { double dval = value; memcpy(row, &dval, sizeof(double)); return READSTAT_OK; } static readstat_error_t sav_write_float(void *row, const readstat_variable_t *var, float value) { double dval = value; memcpy(row, &dval, sizeof(double)); return READSTAT_OK; } static readstat_error_t sav_write_double(void *row, const readstat_variable_t *var, double value) { double dval = value; memcpy(row, &dval, sizeof(double)); return READSTAT_OK; } static readstat_error_t sav_write_string(void *row, const readstat_variable_t *var, const char *value) { memset(row, ' ', var->storage_width); if (value != NULL && value[0] != '\0') { size_t value_len = strlen(value); off_t row_offset = 0; off_t val_offset = 0; unsigned char 
*row_bytes = (unsigned char *)row; if (value_len > var->storage_width) return READSTAT_ERROR_STRING_VALUE_IS_TOO_LONG; while (value_len - val_offset > 255) { memcpy(&row_bytes[row_offset], &value[val_offset], 255); row_offset += 256; val_offset += 255; } memcpy(&row_bytes[row_offset], &value[val_offset], value_len - val_offset); } return READSTAT_OK; } static readstat_error_t sav_write_missing_string(void *row, const readstat_variable_t *var) { memset(row, ' ', var->storage_width); return READSTAT_OK; } static readstat_error_t sav_write_missing_number(void *row, const readstat_variable_t *var) { uint64_t missing_val = SAV_MISSING_DOUBLE; memcpy(row, &missing_val, sizeof(uint64_t)); return READSTAT_OK; } static size_t sav_variable_width(readstat_type_t type, size_t user_width) { if (type == READSTAT_TYPE_STRING) { if (user_width > MAX_STRING_SIZE) { size_t n_segments = (user_width + 251) / 252; size_t last_segment_width = ((user_width - (n_segments - 1) * 252) + 7)/8*8; return (n_segments-1)*256 + last_segment_width; } if (user_width == 0) { return 8; } return (user_width + 7) / 8 * 8; } return 8; } static readstat_error_t sav_begin_data(void *writer_ctx) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; readstat_error_t retval = READSTAT_OK; if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; retval = sav_emit_header(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_variable_records(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_value_label_records(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_document_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_integer_info_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_floating_point_info_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_variable_display_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = 
sav_emit_long_var_name_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_very_long_string_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_long_value_labels_records(writer); if (retval != READSTAT_OK) goto cleanup; retval = sav_emit_termination_record(writer); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t sav_write_compressed_row(void *writer_ctx, void *row, size_t len) { readstat_error_t retval = READSTAT_OK; readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; int i; size_t output_len = len + (len/8 + 8)/8*8; unsigned char *output = malloc(output_len); char *input = (char *)row; off_t input_offset = 0; off_t output_offset = 8; off_t control_offset = 0; memset(&output[control_offset], 0, 8); for (i=0; i<writer->variables_count; i++) { readstat_variable_t *variable = readstat_get_variable(writer, i); if (variable->type == READSTAT_TYPE_STRING) { size_t width = variable->storage_width; while (width > 0) { if (memcmp(&input[input_offset], SAV_EIGHT_SPACES, 8) == 0) { output[control_offset++] = 254; } else { output[control_offset++] = 253; memcpy(&output[output_offset], &input[input_offset], 8); output_offset += 8; } if (control_offset % 8 == 0) { control_offset = output_offset; memset(&output[control_offset], 0, 8); output_offset += 8; } input_offset += 8; width -= 8; } } else { uint64_t int_value; memcpy(&int_value, &input[input_offset], 8); if (int_value == SAV_MISSING_DOUBLE) { output[control_offset++] = 255; } else { double fp_value; memcpy(&fp_value, &input[input_offset], 8); if (fp_value > -100 && fp_value < 152 && (int)fp_value == fp_value) { output[control_offset++] = (int)fp_value + 100; } else { output[control_offset++] = 253; memcpy(&output[output_offset], &input[input_offset], 8); output_offset += 8; } } if (control_offset % 8 == 0) { control_offset = output_offset; memset(&output[control_offset], 0, 8); output_offset += 8; } input_offset += 8; } } if 
(writer->current_row + 1 == writer->row_count) output[control_offset] = 252; retval = readstat_write_bytes(writer, output, output_offset); free(output); return retval; } readstat_error_t readstat_begin_writing_sav(readstat_writer_t *writer, void *user_ctx, long row_count) { writer->callbacks.variable_width = &sav_variable_width; writer->callbacks.write_int8 = &sav_write_int8; writer->callbacks.write_int16 = &sav_write_int16; writer->callbacks.write_int32 = &sav_write_int32; writer->callbacks.write_float = &sav_write_float; writer->callbacks.write_double = &sav_write_double; writer->callbacks.write_string = &sav_write_string; writer->callbacks.write_missing_string = &sav_write_missing_string; writer->callbacks.write_missing_number = &sav_write_missing_number; writer->callbacks.begin_data = &sav_begin_data; if (writer->compression == READSTAT_COMPRESS_ROWS) { writer->callbacks.write_row = &sav_write_compressed_row; } else if (writer->compression == READSTAT_COMPRESS_NONE) { /* void */ } else { return READSTAT_ERROR_UNSUPPORTED_COMPRESSION; } return readstat_begin_writing_file(writer, user_ctx, row_count); } haven/src/readstat/spss/readstat_sav_parse.c0000644000176200001440000010555213227731765021004 0ustar liggesusers #line 1 "src/spss/readstat_sav_parse.rl" #include #include "../readstat.h" #include "../readstat_iconv.h" #include "../readstat_malloc.h" #include "readstat_sav.h" #include "readstat_sav_parse.h" typedef struct varlookup { char name[8*4+1]; int index; } varlookup_t; static int compare_key_varlookup(const void *elem1, const void *elem2) { const char *key = (const char *)elem1; const varlookup_t *v = (const varlookup_t *)elem2; return strcmp(key, v->name); } static int compare_varlookups(const void *elem1, const void *elem2) { const varlookup_t *v1 = (const varlookup_t *)elem1; const varlookup_t *v2 = (const varlookup_t *)elem2; return strcmp(v1->name, v2->name); } static int count_vars(sav_ctx_t *ctx) { int i; spss_varinfo_t *last_info = NULL; int 
var_count = 0; for (i=0; i<ctx->var_index; i++) { spss_varinfo_t *info = &ctx->varinfo[i]; if (last_info == NULL || strcmp(info->name, last_info->name) != 0) { var_count++; } last_info = info; } return var_count; } static varlookup_t *build_lookup_table(int var_count, sav_ctx_t *ctx) { varlookup_t *table = readstat_malloc(var_count * sizeof(varlookup_t)); int offset = 0; int i; spss_varinfo_t *last_info = NULL; for (i=0; i<ctx->var_index; i++) { spss_varinfo_t *info = &ctx->varinfo[i]; if (last_info == NULL || strcmp(info->name, last_info->name) != 0) { varlookup_t *entry = &table[offset++]; memcpy(entry->name, info->name, sizeof(info->name)); entry->index = info->index; } last_info = info; } qsort(table, var_count, sizeof(varlookup_t), &compare_varlookups); return table; } #line 65 "src/spss/readstat_sav_parse.c" static const char _sav_long_variable_parse_actions[] = { 0, 1, 3, 1, 5, 2, 4, 1, 3, 6, 2, 0 }; static const short _sav_long_variable_parse_key_offsets[] = { 0, 0, 8, 23, 38, 53, 68, 83, 98, 113, 114, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 266, 268, 270, 272, 274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298, 300, 302, 304, 306, 308, 310, 312, 314, 316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340, 342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366, 368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392, 394, 396, 398, 400, 402, 404, 406, 408, 410, 412, 414, 416, 418, 420, 422, 424, 426, 428, 430, 432, 434, 436, 438, 440, 442, 444, 446, 448, 450, 452, 454, 456, 458, 460, 462, 464, 466, 468, 470, 472, 474, 476, 478, 480, 482, 484, 486, 488, 490, 492, 494, 496, 498, 500, 502, 504, 506, 508, 510, 512, 514, 
516, 518, 520, 522, 524, 526, 528, 530, 532, 534, 536, 538, 540, 542, 544, 546, 548, 550, 552, 554, 563, 571, 580, 589, 598, 607, 616, 625, 634, 643, 652, 661, 670, 679, 688, 697, 706, 715, 724, 733, 742, 751, 760, 769, 778, 787, 796, 805, 814, 823, 832, 841, 850, 859, 868, 877, 886, 895, 904, 913, 922, 931, 940, 949, 958, 967, 976, 985, 994, 1003, 1012, 1021, 1030, 1039, 1048, 1057, 1066, 1075, 1084, 1093, 1102, 1111, 1120, 1129 }; static const unsigned char _sav_long_variable_parse_trans_keys[] = { 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 61u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 
128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 
239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 
240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 32u, 126u, 192u, 223u, 224u, 239u, 240u, 247u, 9u, 0 }; static const char _sav_long_variable_parse_single_lengths[] = { 0, 0, 3, 3, 3, 3, 3, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }; static const 
char _sav_long_variable_parse_range_lengths[] = { 0, 4, 6, 6, 6, 6, 6, 6, 6, 0, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0 }; static const short _sav_long_variable_parse_index_offsets[] = { 0, 0, 5, 15, 25, 35, 45, 55, 65, 75, 77, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 266, 268, 270, 272, 274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298, 300, 302, 304, 306, 308, 310, 312, 314, 316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340, 342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366, 368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392, 394, 396, 398, 400, 402, 404, 406, 408, 410, 412, 414, 416, 418, 420, 422, 424, 426, 428, 430, 432, 434, 436, 438, 440, 442, 444, 446, 448, 450, 452, 454, 456, 458, 460, 462, 464, 466, 468, 470, 
472, 474, 476, 478, 480, 482, 484, 486, 488, 490, 492, 494, 496, 498, 500, 502, 504, 506, 508, 510, 512, 514, 520, 525, 531, 537, 543, 549, 555, 561, 567, 573, 579, 585, 591, 597, 603, 609, 615, 621, 627, 633, 639, 645, 651, 657, 663, 669, 675, 681, 687, 693, 699, 705, 711, 717, 723, 729, 735, 741, 747, 753, 759, 765, 771, 777, 783, 789, 795, 801, 807, 813, 819, 825, 831, 837, 843, 849, 855, 861, 867, 873, 879, 885, 891, 897 }; static const short _sav_long_variable_parse_indicies[] = { 0, 2, 3, 4, 1, 5, 6, 5, 5, 5, 5, 7, 8, 9, 1, 10, 6, 10, 10, 10, 10, 11, 12, 13, 1, 14, 6, 14, 14, 14, 14, 15, 16, 17, 1, 18, 6, 18, 18, 18, 18, 19, 20, 21, 1, 22, 6, 22, 22, 22, 22, 23, 24, 25, 1, 26, 6, 26, 26, 26, 26, 27, 28, 29, 1, 30, 6, 30, 30, 30, 30, 31, 32, 33, 1, 6, 1, 34, 35, 36, 37, 1, 38, 1, 39, 1, 40, 1, 41, 1, 42, 1, 43, 1, 44, 1, 45, 1, 46, 1, 47, 1, 48, 1, 49, 1, 50, 1, 51, 1, 52, 1, 53, 1, 54, 1, 55, 1, 56, 1, 57, 1, 58, 1, 59, 1, 60, 1, 61, 1, 62, 1, 63, 1, 64, 1, 65, 1, 66, 1, 67, 1, 68, 1, 69, 1, 70, 1, 71, 1, 72, 1, 73, 1, 74, 1, 75, 1, 76, 1, 77, 1, 78, 1, 79, 1, 80, 1, 81, 1, 82, 1, 83, 1, 84, 1, 85, 1, 86, 1, 87, 1, 88, 1, 89, 1, 90, 1, 91, 1, 92, 1, 93, 1, 94, 1, 95, 1, 96, 1, 97, 1, 98, 1, 99, 1, 100, 1, 101, 1, 102, 1, 103, 1, 104, 1, 105, 1, 106, 1, 107, 1, 108, 1, 109, 1, 110, 1, 111, 1, 112, 1, 113, 1, 114, 1, 115, 1, 116, 1, 117, 1, 118, 1, 119, 1, 120, 1, 121, 1, 122, 1, 123, 1, 124, 1, 125, 1, 126, 1, 127, 1, 128, 1, 129, 1, 130, 1, 131, 1, 132, 1, 133, 1, 134, 1, 135, 1, 136, 1, 137, 1, 138, 1, 139, 1, 140, 1, 141, 1, 142, 1, 143, 1, 144, 1, 145, 1, 146, 1, 147, 1, 148, 1, 149, 1, 150, 1, 151, 1, 152, 1, 153, 1, 154, 1, 155, 1, 156, 1, 157, 1, 158, 1, 159, 1, 160, 1, 161, 1, 162, 1, 163, 1, 164, 1, 165, 1, 166, 1, 167, 1, 168, 1, 169, 1, 170, 1, 171, 1, 172, 1, 173, 1, 174, 1, 175, 1, 176, 1, 177, 1, 178, 1, 179, 1, 180, 1, 181, 1, 182, 1, 183, 1, 184, 1, 185, 1, 186, 1, 187, 1, 188, 1, 189, 1, 190, 1, 191, 1, 192, 1, 193, 1, 194, 1, 195, 1, 196, 1, 
197, 1, 198, 1, 199, 1, 200, 1, 201, 1, 202, 1, 203, 1, 204, 1, 205, 1, 206, 1, 207, 1, 208, 1, 209, 1, 210, 1, 211, 1, 212, 1, 213, 1, 214, 1, 215, 1, 216, 1, 217, 1, 218, 1, 219, 1, 220, 1, 221, 1, 222, 1, 223, 1, 224, 1, 225, 1, 226, 1, 227, 1, 228, 1, 229, 1, 230, 1, 231, 1, 232, 1, 30, 1, 31, 1, 32, 1, 26, 1, 27, 1, 28, 1, 22, 1, 23, 1, 24, 1, 18, 1, 19, 1, 20, 1, 14, 1, 15, 1, 16, 1, 10, 1, 11, 1, 12, 1, 5, 1, 7, 1, 8, 1, 233, 227, 228, 229, 234, 1, 0, 2, 3, 4, 1, 233, 224, 225, 226, 235, 1, 233, 221, 222, 223, 236, 1, 233, 218, 219, 220, 237, 1, 233, 215, 216, 217, 238, 1, 233, 212, 213, 214, 239, 1, 233, 209, 210, 211, 240, 1, 233, 206, 207, 208, 241, 1, 233, 203, 204, 205, 242, 1, 233, 200, 201, 202, 243, 1, 233, 197, 198, 199, 244, 1, 233, 194, 195, 196, 245, 1, 233, 191, 192, 193, 246, 1, 233, 188, 189, 190, 247, 1, 233, 185, 186, 187, 248, 1, 233, 182, 183, 184, 249, 1, 233, 179, 180, 181, 250, 1, 233, 176, 177, 178, 251, 1, 233, 173, 174, 175, 252, 1, 233, 170, 171, 172, 253, 1, 233, 167, 168, 169, 254, 1, 233, 164, 165, 166, 255, 1, 233, 161, 162, 163, 256, 1, 233, 158, 159, 160, 257, 1, 233, 155, 156, 157, 258, 1, 233, 152, 153, 154, 259, 1, 233, 149, 150, 151, 260, 1, 233, 146, 147, 148, 261, 1, 233, 143, 144, 145, 262, 1, 233, 140, 141, 142, 263, 1, 233, 137, 138, 139, 264, 1, 233, 134, 135, 136, 265, 1, 233, 131, 132, 133, 266, 1, 233, 128, 129, 130, 267, 1, 233, 125, 126, 127, 268, 1, 233, 122, 123, 124, 269, 1, 233, 119, 120, 121, 270, 1, 233, 116, 117, 118, 271, 1, 233, 113, 114, 115, 272, 1, 233, 110, 111, 112, 273, 1, 233, 107, 108, 109, 274, 1, 233, 104, 105, 106, 275, 1, 233, 101, 102, 103, 276, 1, 233, 98, 99, 100, 277, 1, 233, 95, 96, 97, 278, 1, 233, 92, 93, 94, 279, 1, 233, 89, 90, 91, 280, 1, 233, 86, 87, 88, 281, 1, 233, 83, 84, 85, 282, 1, 233, 80, 81, 82, 283, 1, 233, 77, 78, 79, 284, 1, 233, 74, 75, 76, 285, 1, 233, 71, 72, 73, 286, 1, 233, 68, 69, 70, 287, 1, 233, 65, 66, 67, 288, 1, 233, 62, 63, 64, 289, 1, 233, 59, 60, 61, 290, 
1, 233, 56, 57, 58, 291, 1, 233, 53, 54, 55, 292, 1, 233, 50, 51, 52, 293, 1, 233, 47, 48, 49, 294, 1, 233, 44, 45, 46, 295, 1, 233, 41, 42, 43, 296, 1, 233, 1, 0 }; static const short _sav_long_variable_parse_trans_targs[] = { 2, 0, 11, 12, 13, 3, 10, 224, 225, 226, 4, 221, 222, 223, 5, 218, 219, 220, 6, 215, 216, 217, 7, 212, 213, 214, 8, 209, 210, 211, 9, 206, 207, 208, 227, 203, 204, 205, 2, 11, 12, 291, 14, 15, 290, 17, 18, 289, 20, 21, 288, 23, 24, 287, 26, 27, 286, 29, 30, 285, 32, 33, 284, 35, 36, 283, 38, 39, 282, 41, 42, 281, 44, 45, 280, 47, 48, 279, 50, 51, 278, 53, 54, 277, 56, 57, 276, 59, 60, 275, 62, 63, 274, 65, 66, 273, 68, 69, 272, 71, 72, 271, 74, 75, 270, 77, 78, 269, 80, 81, 268, 83, 84, 267, 86, 87, 266, 89, 90, 265, 92, 93, 264, 95, 96, 263, 98, 99, 262, 101, 102, 261, 104, 105, 260, 107, 108, 259, 110, 111, 258, 113, 114, 257, 116, 117, 256, 119, 120, 255, 122, 123, 254, 125, 126, 253, 128, 129, 252, 131, 132, 251, 134, 135, 250, 137, 138, 249, 140, 141, 248, 143, 144, 247, 146, 147, 246, 149, 150, 245, 152, 153, 244, 155, 156, 243, 158, 159, 242, 161, 162, 241, 164, 165, 240, 167, 168, 239, 170, 171, 238, 173, 174, 237, 176, 177, 236, 179, 180, 235, 182, 183, 234, 185, 186, 233, 188, 189, 232, 191, 192, 231, 194, 195, 230, 197, 198, 229, 200, 201, 227, 203, 204, 228, 202, 199, 196, 193, 190, 187, 184, 181, 178, 175, 172, 169, 166, 163, 160, 157, 154, 151, 148, 145, 142, 139, 136, 133, 130, 127, 124, 121, 118, 115, 112, 109, 106, 103, 100, 97, 94, 91, 88, 85, 82, 79, 76, 73, 70, 67, 64, 61, 58, 55, 52, 49, 46, 43, 40, 37, 34, 31, 28, 25, 22, 19, 16 }; static const char _sav_long_variable_parse_trans_actions[] = { 1, 0, 1, 1, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; static const char _sav_long_variable_parse_eof_actions[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 }; static const int sav_long_variable_parse_start = 1; static const int sav_long_variable_parse_en_main = 1; #line 65 "src/spss/readstat_sav_parse.rl" readstat_error_t sav_parse_long_variable_names_record(void *data, int count, sav_ctx_t *ctx) { unsigned char *c_data = (unsigned char *)data; int var_count = count_vars(ctx); readstat_error_t retval = READSTAT_OK; char temp_key[4*8+1]; char temp_val[4*64+1]; unsigned char *str_start = NULL; size_t str_len = 0; char error_buf[8192]; unsigned char *p = 
NULL; unsigned char *pe = NULL; unsigned char *output_buffer = NULL; varlookup_t *table = build_lookup_table(var_count, ctx); if (ctx->converter) { size_t input_len = count; size_t output_len = input_len * 4; pe = p = output_buffer = readstat_malloc(output_len); size_t status = iconv(ctx->converter, (readstat_iconv_inbuf_t)&data, &input_len, (char **)&pe, &output_len); if (status == (size_t)-1) { free(output_buffer); return READSTAT_ERROR_PARSE; } } else { p = c_data; pe = c_data + count; } unsigned char *eof = pe; int cs; #line 659 "src/spss/readstat_sav_parse.c" { cs = sav_long_variable_parse_start; } #line 664 "src/spss/readstat_sav_parse.c" { int _klen; unsigned int _trans; const char *_acts; unsigned int _nacts; const unsigned char *_keys; if ( p == pe ) goto _test_eof; if ( cs == 0 ) goto _out; _resume: _keys = _sav_long_variable_parse_trans_keys + _sav_long_variable_parse_key_offsets[cs]; _trans = _sav_long_variable_parse_index_offsets[cs]; _klen = _sav_long_variable_parse_single_lengths[cs]; if ( _klen > 0 ) { const unsigned char *_lower = _keys; const unsigned char *_mid; const unsigned char *_upper = _keys + _klen - 1; while (1) { if ( _upper < _lower ) break; _mid = _lower + ((_upper-_lower) >> 1); if ( (*p) < *_mid ) _upper = _mid - 1; else if ( (*p) > *_mid ) _lower = _mid + 1; else { _trans += (unsigned int)(_mid - _keys); goto _match; } } _keys += _klen; _trans += _klen; } _klen = _sav_long_variable_parse_range_lengths[cs]; if ( _klen > 0 ) { const unsigned char *_lower = _keys; const unsigned char *_mid; const unsigned char *_upper = _keys + (_klen<<1) - 2; while (1) { if ( _upper < _lower ) break; _mid = _lower + (((_upper-_lower) >> 1) & ~1); if ( (*p) < _mid[0] ) _upper = _mid - 2; else if ( (*p) > _mid[1] ) _lower = _mid + 2; else { _trans += (unsigned int)((_mid - _keys)>>1); goto _match; } } _trans += _klen; } _match: _trans = _sav_long_variable_parse_indicies[_trans]; cs = _sav_long_variable_parse_trans_targs[_trans]; if ( 
_sav_long_variable_parse_trans_actions[_trans] == 0 ) goto _again; _acts = _sav_long_variable_parse_actions + _sav_long_variable_parse_trans_actions[_trans]; _nacts = (unsigned int) *_acts++; while ( _nacts-- > 0 ) { switch ( *_acts++ ) { case 0: #line 104 "src/spss/readstat_sav_parse.rl" { varlookup_t *found = bsearch(temp_key, table, var_count, sizeof(varlookup_t), &compare_key_varlookup); if (found) { memcpy(ctx->varinfo[found->index].longname, temp_val, str_len); ctx->varinfo[found->index].longname[str_len] = '\0'; } else if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Failed to find %s", temp_key); ctx->error_handler(error_buf, ctx->user_ctx); } } break; case 1: #line 115 "src/spss/readstat_sav_parse.rl" { memcpy(temp_key, str_start, str_len); temp_key[str_len] = '\0'; } break; case 2: #line 120 "src/spss/readstat_sav_parse.rl" { memcpy(temp_val, str_start, str_len); temp_val[str_len] = '\0'; } break; case 3: #line 131 "src/spss/readstat_sav_parse.rl" { str_start = p; } break; case 4: #line 131 "src/spss/readstat_sav_parse.rl" { str_len = p - str_start; } break; case 5: #line 133 "src/spss/readstat_sav_parse.rl" { str_start = p; } break; case 6: #line 133 "src/spss/readstat_sav_parse.rl" { str_len = p - str_start; } break; #line 781 "src/spss/readstat_sav_parse.c" } } _again: if ( cs == 0 ) goto _out; if ( ++p != pe ) goto _resume; _test_eof: {} if ( p == eof ) { const char *__acts = _sav_long_variable_parse_actions + _sav_long_variable_parse_eof_actions[cs]; unsigned int __nacts = (unsigned int) *__acts++; while ( __nacts-- > 0 ) { switch ( *__acts++ ) { case 0: #line 104 "src/spss/readstat_sav_parse.rl" { varlookup_t *found = bsearch(temp_key, table, var_count, sizeof(varlookup_t), &compare_key_varlookup); if (found) { memcpy(ctx->varinfo[found->index].longname, temp_val, str_len); ctx->varinfo[found->index].longname[str_len] = '\0'; } else if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Failed to find %s", temp_key); 
ctx->error_handler(error_buf, ctx->user_ctx); } } break; case 2: #line 120 "src/spss/readstat_sav_parse.rl" { memcpy(temp_val, str_start, str_len); temp_val[str_len] = '\0'; } break; case 6: #line 133 "src/spss/readstat_sav_parse.rl" { str_len = p - str_start; } break; #line 821 "src/spss/readstat_sav_parse.c" } } } _out: {} } #line 141 "src/spss/readstat_sav_parse.rl" if (cs < 227|| p != pe) { if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error parsing string \"%.*s\" around byte #%ld/%d, character %c", count, (char *)data, (long)(p - c_data), count, *p); ctx->error_handler(error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_PARSE; } if (table) free(table); if (output_buffer) free(output_buffer); /* suppress warning */ (void)sav_long_variable_parse_en_main; return retval; } #line 853 "src/spss/readstat_sav_parse.c" static const char _sav_very_long_string_parse_actions[] = { 0, 1, 0, 1, 2, 1, 3, 2, 4, 1, 2, 5, 2 }; static const unsigned char _sav_very_long_string_parse_key_offsets[] = { 0, 0, 8, 23, 38, 53, 68, 83, 98, 113, 114, 116, 119, 121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 145, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 169 }; static const unsigned char _sav_very_long_string_parse_trans_keys[] = { 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 46u, 61u, 95u, 35u, 36u, 48u, 57u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 61u, 48u, 57u, 0u, 48u, 57u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 
191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 128u, 191u, 0u, 9u, 64u, 90u, 192u, 223u, 224u, 239u, 240u, 247u, 0 }; static const char _sav_very_long_string_parse_single_lengths[] = { 0, 0, 3, 3, 3, 3, 3, 3, 3, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0 }; static const char _sav_very_long_string_parse_range_lengths[] = { 0, 4, 6, 6, 6, 6, 6, 6, 6, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 4 }; static const unsigned char _sav_very_long_string_parse_index_offsets[] = { 0, 0, 5, 15, 25, 35, 45, 55, 65, 75, 77, 79, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 133 }; static const char _sav_very_long_string_parse_indicies[] = { 0, 2, 3, 4, 1, 5, 6, 5, 5, 5, 5, 7, 8, 9, 1, 10, 6, 10, 10, 10, 10, 11, 12, 13, 1, 14, 6, 14, 14, 14, 14, 15, 16, 17, 1, 18, 6, 18, 18, 18, 18, 19, 20, 21, 1, 22, 6, 22, 22, 22, 22, 23, 24, 25, 1, 26, 6, 26, 26, 26, 26, 27, 28, 29, 1, 30, 6, 30, 30, 30, 30, 31, 32, 33, 1, 6, 1, 34, 1, 35, 36, 1, 37, 1, 38, 1, 39, 1, 30, 1, 31, 1, 32, 1, 26, 1, 27, 1, 28, 1, 22, 1, 23, 1, 24, 1, 18, 1, 19, 1, 20, 1, 14, 1, 15, 1, 16, 1, 10, 1, 11, 1, 12, 1, 5, 1, 7, 1, 8, 1, 40, 41, 1, 0, 2, 3, 4, 1, 0 }; static const char _sav_very_long_string_parse_trans_targs[] = { 2, 0, 12, 13, 14, 3, 10, 33, 34, 35, 4, 30, 31, 32, 5, 27, 28, 29, 6, 24, 25, 26, 7, 21, 22, 23, 8, 18, 19, 20, 9, 15, 16, 17, 11, 36, 11, 2, 12, 13, 36, 37 }; static const char _sav_very_long_string_parse_trans_actions[] = { 5, 0, 5, 5, 5, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 1, 3, 0, 0, 0, 0, 0 }; static const int sav_very_long_string_parse_start = 1; static const int sav_very_long_string_parse_en_main = 1; #line 167 
"src/spss/readstat_sav_parse.rl" readstat_error_t sav_parse_very_long_string_record(void *data, int count, sav_ctx_t *ctx) { unsigned char *c_data = (unsigned char *)data; int var_count = count_vars(ctx); readstat_error_t retval = READSTAT_OK; char temp_key[8*4+1]; int temp_val = 0; unsigned char *str_start = NULL; size_t str_len = 0; size_t error_buf_len = 1024 + count; char *error_buf = readstat_malloc(error_buf_len); unsigned char *p = NULL; unsigned char *pe = NULL; unsigned char *output_buffer = NULL; varlookup_t *table = build_lookup_table(var_count, ctx); if (ctx->converter) { size_t input_len = count; size_t output_len = input_len * 4; pe = p = output_buffer = readstat_malloc(output_len); size_t status = iconv(ctx->converter, (readstat_iconv_inbuf_t)&data, &input_len, (char **)&pe, &output_len); if (status == (size_t)-1) { free(output_buffer); return READSTAT_ERROR_PARSE; } } else { p = c_data; pe = c_data + count; } int cs; #line 1004 "src/spss/readstat_sav_parse.c" { cs = sav_very_long_string_parse_start; } #line 1009 "src/spss/readstat_sav_parse.c" { int _klen; unsigned int _trans; const char *_acts; unsigned int _nacts; const unsigned char *_keys; if ( p == pe ) goto _test_eof; if ( cs == 0 ) goto _out; _resume: _keys = _sav_very_long_string_parse_trans_keys + _sav_very_long_string_parse_key_offsets[cs]; _trans = _sav_very_long_string_parse_index_offsets[cs]; _klen = _sav_very_long_string_parse_single_lengths[cs]; if ( _klen > 0 ) { const unsigned char *_lower = _keys; const unsigned char *_mid; const unsigned char *_upper = _keys + _klen - 1; while (1) { if ( _upper < _lower ) break; _mid = _lower + ((_upper-_lower) >> 1); if ( (*p) < *_mid ) _upper = _mid - 1; else if ( (*p) > *_mid ) _lower = _mid + 1; else { _trans += (unsigned int)(_mid - _keys); goto _match; } } _keys += _klen; _trans += _klen; } _klen = _sav_very_long_string_parse_range_lengths[cs]; if ( _klen > 0 ) { const unsigned char *_lower = _keys; const unsigned char *_mid; const unsigned 
char *_upper = _keys + (_klen<<1) - 2; while (1) { if ( _upper < _lower ) break; _mid = _lower + (((_upper-_lower) >> 1) & ~1); if ( (*p) < _mid[0] ) _upper = _mid - 2; else if ( (*p) > _mid[1] ) _lower = _mid + 2; else { _trans += (unsigned int)((_mid - _keys)>>1); goto _match; } } _trans += _klen; } _match: _trans = _sav_very_long_string_parse_indicies[_trans]; cs = _sav_very_long_string_parse_trans_targs[_trans]; if ( _sav_very_long_string_parse_trans_actions[_trans] == 0 ) goto _again; _acts = _sav_very_long_string_parse_actions + _sav_very_long_string_parse_trans_actions[_trans]; _nacts = (unsigned int) *_acts++; while ( _nacts-- > 0 ) { switch ( *_acts++ ) { case 0: #line 209 "src/spss/readstat_sav_parse.rl" { varlookup_t *found = bsearch(temp_key, table, var_count, sizeof(varlookup_t), &compare_key_varlookup); if (found) { ctx->varinfo[found->index].string_length = temp_val; } } break; case 1: #line 216 "src/spss/readstat_sav_parse.rl" { memcpy(temp_key, str_start, str_len); temp_key[str_len] = '\0'; } break; case 2: #line 221 "src/spss/readstat_sav_parse.rl" { if ((*p) != '\0') { temp_val = 10 * temp_val + ((*p) - '0'); } } break; case 3: #line 233 "src/spss/readstat_sav_parse.rl" { str_start = p; } break; case 4: #line 233 "src/spss/readstat_sav_parse.rl" { str_len = p - str_start; } break; case 5: #line 235 "src/spss/readstat_sav_parse.rl" { temp_val = 0; } break; #line 1119 "src/spss/readstat_sav_parse.c" } } _again: if ( cs == 0 ) goto _out; if ( ++p != pe ) goto _resume; _test_eof: {} _out: {} } #line 243 "src/spss/readstat_sav_parse.rl" if (cs < 36 || p != pe) { if (ctx->error_handler) { snprintf(error_buf, error_buf_len, "Parsed %ld of %ld bytes. 
Remaining bytes: %.*s", (long)(p - c_data), (long)(pe - c_data), (int)(pe - p), p);
            ctx->error_handler(error_buf, ctx->user_ctx);
        }
        retval = READSTAT_ERROR_PARSE;
    }

    if (table)
        free(table);
    if (output_buffer)
        free(output_buffer);
    if (error_buf)
        free(error_buf);

    /* suppress warning */
    (void)sav_very_long_string_parse_en_main;

    return retval;
}
haven/src/readstat/spss/readstat_spss.h0000644000176200001440000000715213227731765020013 0ustar liggesusers

#define SPSS_FORMAT_TYPE_A        1
#define SPSS_FORMAT_TYPE_AHEX     2
#define SPSS_FORMAT_TYPE_COMMA    3
#define SPSS_FORMAT_TYPE_DOLLAR   4
#define SPSS_FORMAT_TYPE_F        5
#define SPSS_FORMAT_TYPE_IB       6
#define SPSS_FORMAT_TYPE_PIBHEX   7
#define SPSS_FORMAT_TYPE_P        8
#define SPSS_FORMAT_TYPE_PIB      9
#define SPSS_FORMAT_TYPE_PK       10
#define SPSS_FORMAT_TYPE_RB       11
#define SPSS_FORMAT_TYPE_RBHEX    12
#define SPSS_FORMAT_TYPE_Z        15
#define SPSS_FORMAT_TYPE_N        16
#define SPSS_FORMAT_TYPE_E        17
#define SPSS_FORMAT_TYPE_DATE     20
#define SPSS_FORMAT_TYPE_TIME     21
#define SPSS_FORMAT_TYPE_DATETIME 22
#define SPSS_FORMAT_TYPE_ADATE    23
#define SPSS_FORMAT_TYPE_JDATE    24
#define SPSS_FORMAT_TYPE_DTIME    25
#define SPSS_FORMAT_TYPE_WKDAY    26
#define SPSS_FORMAT_TYPE_MONTH    27
#define SPSS_FORMAT_TYPE_MOYR     28
#define SPSS_FORMAT_TYPE_QYR      29
#define SPSS_FORMAT_TYPE_WKYR     30
#define SPSS_FORMAT_TYPE_PCT      31
#define SPSS_FORMAT_TYPE_DOT      32
#define SPSS_FORMAT_TYPE_CCA      33
#define SPSS_FORMAT_TYPE_CCB      34
#define SPSS_FORMAT_TYPE_CCC      35
#define SPSS_FORMAT_TYPE_CCD      36
#define SPSS_FORMAT_TYPE_CCE      37
#define SPSS_FORMAT_TYPE_EDATE    38
#define SPSS_FORMAT_TYPE_SDATE    39

#define spss_format_is_date(type) \
    (type == SPSS_FORMAT_TYPE_DATE || \
     type == SPSS_FORMAT_TYPE_DATETIME || \
     type == SPSS_FORMAT_TYPE_ADATE || \
     type == SPSS_FORMAT_TYPE_JDATE || \
     type == SPSS_FORMAT_TYPE_SDATE || \
     type == SPSS_FORMAT_TYPE_EDATE || \
     type == SPSS_FORMAT_TYPE_QYR || \
     type == SPSS_FORMAT_TYPE_MOYR || \
     type == SPSS_FORMAT_TYPE_WKYR)

#define SPSS_DOC_LINE_SIZE 80

#define SAV_HIGHEST_DOUBLE \
0x7FEFFFFFFFFFFFFFUL
#define SAV_MISSING_DOUBLE 0xFFEFFFFFFFFFFFFFUL
#define SAV_LOWEST_DOUBLE  0xFFEFFFFFFFFFFFFEUL

#define SAV_MEASURE_UNKNOWN 0
#define SAV_MEASURE_NOMINAL 1
#define SAV_MEASURE_ORDINAL 2
#define SAV_MEASURE_SCALE   3

#define SAV_ALIGNMENT_LEFT   0
#define SAV_ALIGNMENT_RIGHT  1
#define SAV_ALIGNMENT_CENTER 2

typedef struct spss_format_s {
    int type;
    int width;
    int decimal_places;
} spss_format_t;

typedef struct spss_varinfo_s {
    readstat_type_t type;
    int labels_index;
    int index;
    int offset;
    int width;
    int string_length;
    spss_format_t print_format;
    spss_format_t write_format;
    int n_segments;
    int n_missing_values;
    int missing_range;
    double missing_values[3];
    char name[8*4+1];
    char longname[64*4+1];
    char *label;
    readstat_measure_t measure;
    readstat_alignment_t alignment;
    int display_width;
} spss_varinfo_t;

int spss_format(char *buffer, size_t len, spss_format_t *format);
int spss_varinfo_compare(const void *elem1, const void *elem2);

readstat_missingness_t spss_missingness_for_info(spss_varinfo_t *info);
readstat_variable_t *spss_init_variable_for_info(spss_varinfo_t *info, int index_after_skipping);

uint64_t spss_64bit_value(readstat_value_t value);

uint32_t spss_measure_from_readstat_measure(readstat_measure_t measure);
readstat_measure_t spss_measure_to_readstat_measure(uint32_t sav_measure);
uint32_t spss_alignment_from_readstat_alignment(readstat_alignment_t alignment);
readstat_alignment_t spss_alignment_to_readstat_alignment(uint32_t sav_alignment);
readstat_error_t spss_format_for_variable(readstat_variable_t *r_variable, spss_format_t *spss_format);
haven/src/readstat/spss/readstat_por_parse.c0000644000176200001440000001376213227731765021014 0ustar liggesusers

#line 1 "src/spss/readstat_por_parse.rl"
#include
#include "../readstat.h"
#include "readstat_por_parse.h"

#line 10 "src/spss/readstat_por_parse.c"
static const char _por_field_parse_actions[] = {
    0, 1, 0, 1, 1, 1, 5, 1, 8, 1, 9, 1, 10, 2, 2, 0,
    2, 3, 1, 2, 5, 10, 2, 7, 10, 3, 4, 2, 0, 3,
6, 2, 0 }; static const char _por_field_parse_key_offsets[] = { 0, 0, 8, 9, 14, 18, 23, 31, 35, 40, 44, 48, 55 }; static const char _por_field_parse_trans_keys[] = { 32, 42, 45, 46, 48, 57, 65, 84, 46, 46, 48, 57, 65, 84, 48, 57, 65, 84, 47, 48, 57, 65, 84, 43, 45, 46, 47, 48, 57, 65, 84, 48, 57, 65, 84, 47, 48, 57, 65, 84, 48, 57, 65, 84, 48, 57, 65, 84, 43, 45, 47, 48, 57, 65, 84, 0 }; static const char _por_field_parse_single_lengths[] = { 0, 4, 1, 1, 0, 1, 4, 0, 1, 0, 0, 3, 0 }; static const char _por_field_parse_range_lengths[] = { 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0 }; static const char _por_field_parse_index_offsets[] = { 0, 0, 7, 9, 13, 16, 20, 27, 30, 34, 37, 40, 46 }; static const char _por_field_parse_trans_targs[] = { 1, 2, 3, 4, 6, 6, 0, 12, 0, 4, 6, 6, 0, 5, 5, 0, 12, 5, 5, 0, 7, 9, 10, 12, 6, 6, 0, 8, 8, 0, 12, 8, 8, 0, 8, 8, 0, 11, 11, 0, 7, 9, 12, 11, 11, 0, 0, 0 }; static const char _por_field_parse_trans_actions[] = { 0, 9, 0, 0, 13, 13, 0, 11, 0, 7, 25, 25, 0, 16, 16, 0, 11, 3, 3, 0, 5, 5, 5, 19, 1, 1, 0, 13, 13, 0, 22, 1, 1, 0, 29, 29, 0, 16, 16, 0, 0, 0, 11, 3, 3, 0, 0, 0 }; static const int por_field_parse_start = 1; static const int por_field_parse_en_main = 1; #line 9 "src/spss/readstat_por_parse.rl" ssize_t readstat_por_parse_double(const char *data, size_t len, double *result, readstat_error_handler error_cb, void *user_ctx) { ssize_t retval = 0; double val = 0.0; double denom = 30.0; double temp_frac = 0.0; uint64_t num = 0; uint64_t exp = 0; uint64_t temp_val = 0; const unsigned char *p = (const unsigned char *)data; const unsigned char *pe = p + len; int cs; int is_negative = 0, exp_is_negative = 0; int success = 0; #line 94 "src/spss/readstat_por_parse.c" { cs = por_field_parse_start; } #line 99 "src/spss/readstat_por_parse.c" { int _klen; unsigned int _trans; const char *_acts; unsigned int _nacts; const char *_keys; if ( p == pe ) goto _test_eof; if ( cs == 0 ) goto _out; _resume: _keys = _por_field_parse_trans_keys + 
_por_field_parse_key_offsets[cs]; _trans = _por_field_parse_index_offsets[cs]; _klen = _por_field_parse_single_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + _klen - 1; while (1) { if ( _upper < _lower ) break; _mid = _lower + ((_upper-_lower) >> 1); if ( (*p) < *_mid ) _upper = _mid - 1; else if ( (*p) > *_mid ) _lower = _mid + 1; else { _trans += (unsigned int)(_mid - _keys); goto _match; } } _keys += _klen; _trans += _klen; } _klen = _por_field_parse_range_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + (_klen<<1) - 2; while (1) { if ( _upper < _lower ) break; _mid = _lower + (((_upper-_lower) >> 1) & ~1); if ( (*p) < _mid[0] ) _upper = _mid - 2; else if ( (*p) > _mid[1] ) _lower = _mid + 2; else { _trans += (unsigned int)((_mid - _keys)>>1); goto _match; } } _trans += _klen; } _match: cs = _por_field_parse_trans_targs[_trans]; if ( _por_field_parse_trans_actions[_trans] == 0 ) goto _again; _acts = _por_field_parse_actions + _por_field_parse_trans_actions[_trans]; _nacts = (unsigned int) *_acts++; while ( _nacts-- > 0 ) { switch ( *_acts++ ) { case 0: #line 30 "src/spss/readstat_por_parse.rl" { if ((*p) >= '0' && (*p) <= '9') { temp_val = 30 * temp_val + ((*p) - '0'); } else if ((*p) >= 'A' && (*p) <= 'T') { temp_val = 30 * temp_val + (10 + (*p) - 'A'); } } break; case 1: #line 38 "src/spss/readstat_por_parse.rl" { if ((*p) >= '0' && (*p) <= '9') { temp_frac += ((*p) - '0') / denom; } else if ((*p) >= 'A' && (*p) <= 'T') { temp_frac += (10 + (*p) - 'A') / denom; } denom *= 30.0; } break; case 2: #line 47 "src/spss/readstat_por_parse.rl" { temp_val = 0; } break; case 3: #line 49 "src/spss/readstat_por_parse.rl" { temp_frac = 0.0; } break; case 4: #line 53 "src/spss/readstat_por_parse.rl" { is_negative = 1; } break; case 5: #line 53 "src/spss/readstat_por_parse.rl" { num = temp_val; } break; case 6: #line 54 "src/spss/readstat_por_parse.rl" { 
exp_is_negative = 1; } break; case 7: #line 54 "src/spss/readstat_por_parse.rl" { exp = temp_val; } break; case 8: #line 56 "src/spss/readstat_por_parse.rl" { is_negative = 1; } break; case 9: #line 58 "src/spss/readstat_por_parse.rl" { val = NAN; } break; case 10: #line 60 "src/spss/readstat_por_parse.rl" { success = 1; {p++; goto _out; } } break; #line 229 "src/spss/readstat_por_parse.c" } } _again: if ( cs == 0 ) goto _out; if ( ++p != pe ) goto _resume; _test_eof: {} _out: {} } #line 64 "src/spss/readstat_por_parse.rl" if (!isnan(val)) { val = 1.0 * num + temp_frac; if (exp_is_negative) exp *= -1; if (exp) { val *= pow(10.0, exp); } if (is_negative) val *= -1; } if (!success) { retval = -1; if (error_cb) { char error_buf[1024]; snprintf(error_buf, sizeof(error_buf), "Read bytes: %ld String: %.*s Ending state: %d", (long)(p - (const unsigned char *)data), (int)len, data, cs); error_cb(error_buf, user_ctx); } } if (retval == 0) { if (result) *result = val; retval = (p - (const unsigned char *)data); } /* suppress warning */ (void)por_field_parse_en_main; return retval; } haven/src/readstat/spss/readstat_por.c0000644000176200001440000001302313227731765017610 0ustar liggesusers#include #include #include "../readstat.h" #include "../CKHashTable.h" #include "../readstat_convert.h" #include "readstat_spss.h" #include "readstat_por.h" int8_t por_ascii_lookup[256] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ', '.', '<', '(', '+', '|', '&', '[', ']', '!', '$', '*', ')', ';', '^', '-', '/', '|', ',', '%', '_', '>', 
'?', '`', ':', '#', '@', '\'', '=', '"', 0, 0, 0, 0, 0, 0, '~', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '{', '}', '\\', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; uint16_t por_unicode_lookup[256] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ', '.', '<', '(', '+', '|', '&', '[', ']', '!', '$', '*', ')', ';', '^', '-', '/', 0x00A3, ',', '%', '_', '>', '?', 0x2018, ':', 0x00A6, '@', 0x2019, '=', '"', 0x2264, 0x25A1, 0x00B1, 0x25A0, 0x00B0, 0x2020, '~', 0x2013, 0x2514, 0x250C, 0x2265, 0x2070, 0x2071, 0x00B2, 0x00B3, 0x2074, 0x2075, 0x2076, 0x2077, 0x2078, 0x2079, 0x2518, 0x2510, 0x2260, 0x2014, 0x207D, 0x207E, 0x2E38, '{', '}', '\\', 0x00A2, 0x2022, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; por_ctx_t *por_ctx_init() { por_ctx_t *ctx = calloc(1, sizeof(por_ctx_t)); ctx->space = ' '; ctx->base30_precision = 20; ctx->var_dict = ck_hash_table_init(1024); return ctx; } void por_ctx_free(por_ctx_t *ctx) { if (ctx->string_buffer) free(ctx->string_buffer); if (ctx->varinfo) { int i; for (i=0; ivar_count; i++) { if (ctx->varinfo[i].label) free(ctx->varinfo[i].label); } free(ctx->varinfo); } if (ctx->variables) { int i; for (i=0; ivar_count; i++) { if (ctx->variables[i]) 
free(ctx->variables[i]); } free(ctx->variables); } if (ctx->var_dict) ck_hash_table_free(ctx->var_dict); if (ctx->converter) iconv_close(ctx->converter); free(ctx); } ssize_t por_utf8_encode(const unsigned char *input, size_t input_len, char *output, size_t output_len, uint16_t lookup[256]) { int offset = 0; int i; for (i=0; i output_len) return offset; output[offset++] = codepoint; } else { if (codepoint <= 0x07FF) { if (offset + 2 > output_len) return offset; } else /* if (codepoint <= 0xFFFF) */{ if (offset + 3 > output_len) return offset; } /* TODO - For some reason that replacement character isn't recognized * by some systems, so be prepared to insert an ASCII space instead */ int printed = sprintf(output + offset, "%lc", codepoint); if (printed > 0) { offset += printed; } else { output[offset++] = ' '; } } } return offset; } ssize_t por_utf8_decode( const char *input, size_t input_len, char *output, size_t output_len, uint8_t *lookup, size_t lookup_len) { int offset = 0; wchar_t codepoint = 0; while (1) { int char_len = 0; if (offset + 1 > output_len) return offset; unsigned char val = *input; if (val >= 0x20 && val < 0x7F) { if (!lookup[val]) return -1; output[offset++] = lookup[val]; input++; } else { int conversions = sscanf(input, "%lc%n", &codepoint, &char_len); if (conversions == 0 || codepoint >= lookup_len || lookup[codepoint] == 0) { return -1; } output[offset++] = lookup[codepoint]; input += char_len; } } return offset; } haven/src/readstat/spss/readstat_sav.c0000644000176200001440000000425413227731765017607 0ustar liggesusers// // sav.c // #include #include #include #include #include #include #include #include #include "../readstat.h" #include "../readstat_bits.h" #include "../readstat_iconv.h" #include "../readstat_malloc.h" #include "readstat_sav.h" #define SAV_VARINFO_INITIAL_CAPACITY 512 sav_ctx_t *sav_ctx_init(sav_file_header_record_t *header, readstat_io_t *io) { sav_ctx_t *ctx = NULL; if ((ctx = readstat_malloc(sizeof(sav_ctx_t))) == NULL) { 
return NULL; } memset(ctx, 0, sizeof(sav_ctx_t)); ctx->bswap = !(header->layout_code == 2 || header->layout_code == 3); ctx->data_is_compressed = (header->compressed != 0); ctx->record_count = ctx->bswap ? byteswap4(header->ncases) : header->ncases; ctx->fweight_index = ctx->bswap ? byteswap4(header->weight_index) : header->weight_index; ctx->missing_double = SAV_MISSING_DOUBLE; ctx->lowest_double = SAV_LOWEST_DOUBLE; ctx->highest_double = SAV_HIGHEST_DOUBLE; double bias = ctx->bswap ? byteswap_double(header->bias) : header->bias; if (bias != 100.0) { sav_ctx_free(ctx); return NULL; } ctx->varinfo_capacity = SAV_VARINFO_INITIAL_CAPACITY; if ((ctx->varinfo = readstat_calloc(ctx->varinfo_capacity, sizeof(spss_varinfo_t))) == NULL) { sav_ctx_free(ctx); return NULL; } ctx->io = io; return ctx; } void sav_ctx_free(sav_ctx_t *ctx) { if (ctx->varinfo) { int i; for (i=0; i<ctx->var_count; i++) { if (ctx->varinfo[i].label) free(ctx->varinfo[i].label); } free(ctx->varinfo); } if (ctx->variables) { int i; for (i=0; i<ctx->var_count; i++) { if (ctx->variables[i]) free(ctx->variables[i]); } free(ctx->variables); } if (ctx->raw_string) free(ctx->raw_string); if (ctx->utf8_string) free(ctx->utf8_string); if (ctx->converter) { iconv_close(ctx->converter); } if (ctx->variable_display_values) { free(ctx->variable_display_values); } free(ctx); } haven/src/readstat/spss/readstat_por_read.c0000644000176200001440000006743713227731765020625 0ustar liggesusers// // readstat_por.c // #include #include #include #include #include #include #include #include #include "../readstat.h" #include "../readstat_iconv.h" #include "../readstat_convert.h" #include "../readstat_malloc.h" #include "../CKHashTable.h" #include "readstat_por_parse.h" #include "readstat_spss.h" #include "readstat_por.h" #define POR_LINE_LENGTH 80 #define POR_LABEL_NAME_PREFIX "labels" #define MAX_VARS 1000000 #define MAX_WIDTH 1000000 #define MAX_LINES 1000000 #define MAX_STRINGS 1000000 #define MAX_LABELS 1000000 static ssize_t
read_bytes(por_ctx_t *ctx, void *dst, size_t len); static readstat_error_t read_string(por_ctx_t *ctx, char *data, size_t len); static readstat_error_t por_update_progress(por_ctx_t *ctx) { readstat_io_t *io = ctx->io; return io->update(ctx->file_size, ctx->progress_handler, ctx->user_ctx, io->io_ctx); } static ssize_t read_bytes(por_ctx_t *ctx, void *dst, size_t len) { char *dst_pos = (char *)dst; readstat_io_t *io = ctx->io; char byte; while (dst_pos < (char *)dst + len) { if (ctx->num_spaces) { *dst_pos++ = ctx->space; ctx->num_spaces--; continue; } ssize_t bytes_read = io->read(&byte, 1, io->io_ctx); if (bytes_read == 0) { break; } if (bytes_read == -1) { return -1; } if (byte == '\r' || byte == '\n') { if (byte == '\r') { bytes_read = io->read(&byte, 1, io->io_ctx); if (bytes_read == 0 || bytes_read == -1 || byte != '\n') return -1; } ctx->num_spaces = POR_LINE_LENGTH - ctx->pos; ctx->pos = 0; continue; } else if (ctx->pos == POR_LINE_LENGTH) { return -1; } *dst_pos++ = byte; ctx->pos++; } return (int)(dst_pos - (char *)dst); } static uint16_t read_tag(por_ctx_t *ctx) { unsigned char tag; if (read_bytes(ctx, &tag, 1) != 1) { return -1; } return ctx->byte2unicode[tag]; } static readstat_error_t read_double_with_peek(por_ctx_t *ctx, double *out_double, unsigned char peek) { readstat_error_t retval = READSTAT_OK; double value = NAN; unsigned char buffer[100]; char utf8_buffer[300]; char error_buf[1024]; int64_t len = 0; ssize_t bytes_read = 0; buffer[0] = peek; bytes_read = read_bytes(ctx, &buffer[1], 1); if (bytes_read != 1) return READSTAT_ERROR_PARSE; if (ctx->byte2unicode[buffer[0]] == '*' && ctx->byte2unicode[buffer[1]] == '.') { if (out_double) *out_double = NAN; return READSTAT_OK; } int64_t i=2; while (i<sizeof(buffer) && ctx->byte2unicode[buffer[i-1]] != '/') { bytes_read = read_bytes(ctx, &buffer[i], 1); if (bytes_read != 1) return READSTAT_ERROR_PARSE; i++; } if (i == sizeof(buffer)) { return READSTAT_ERROR_PARSE; } len = por_utf8_encode(buffer, i, utf8_buffer,
sizeof(utf8_buffer), ctx->byte2unicode); if (len == -1) { if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error converting double string (length=%" PRId64 "): %.*s", i, (int)i, buffer); ctx->error_handler(error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_CONVERT; goto cleanup; } bytes_read = readstat_por_parse_double(utf8_buffer, len, &value, ctx->error_handler, ctx->user_ctx); if (bytes_read == -1) { if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error parsing double string (length=%" PRId64 "): %.*s [%s]", len, (int)len, utf8_buffer, buffer); ctx->error_handler(error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_PARSE; goto cleanup; } cleanup: if (out_double) *out_double = value; return retval; } static readstat_error_t read_double(por_ctx_t *ctx, double *out_double) { unsigned char peek; size_t bytes_read = read_bytes(ctx, &peek, 1); if (bytes_read != 1) return READSTAT_ERROR_PARSE; return read_double_with_peek(ctx, out_double, peek); } static readstat_error_t read_integer_in_range(por_ctx_t *ctx, int min, int max, int *out_integer) { double dval = NAN; readstat_error_t retval = read_double(ctx, &dval); if (retval != READSTAT_OK) return retval; if (isnan(dval) || dval < min || dval > max) return READSTAT_ERROR_PARSE; if (out_integer) *out_integer = (int)dval; return READSTAT_OK; } static readstat_error_t maybe_read_double(por_ctx_t *ctx, double *out_double, int *out_finished) { unsigned char peek; size_t bytes_read = read_bytes(ctx, &peek, 1); if (bytes_read != 1) return READSTAT_ERROR_PARSE; if (ctx->byte2unicode[peek] == 'Z') { if (out_double) *out_double = NAN; if (out_finished) *out_finished = 1; return READSTAT_OK; } if (out_finished) *out_finished = 0; return read_double_with_peek(ctx, out_double, peek); } static readstat_error_t maybe_read_string(por_ctx_t *ctx, char *data, size_t len, int *out_finished) { readstat_error_t retval = READSTAT_OK; double value; int finished = 0; char error_buf[1024]; size_t 
string_length = 0; retval = maybe_read_double(ctx, &value, &finished); if (retval != READSTAT_OK || finished) { if (out_finished) *out_finished = finished; return retval; } if (value <= 0 || value > 20000 || isnan(value)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } string_length = (size_t)value; if (string_length > ctx->string_buffer_len) { ctx->string_buffer_len = string_length; ctx->string_buffer = realloc(ctx->string_buffer, ctx->string_buffer_len); } if (read_bytes(ctx, ctx->string_buffer, string_length) == -1) { retval = READSTAT_ERROR_READ; goto cleanup; } size_t bytes_encoded = por_utf8_encode(ctx->string_buffer, string_length, data, len - 1, ctx->byte2unicode); if (bytes_encoded == -1) { if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error converting string: %.*s", (int)string_length, ctx->string_buffer); ctx->error_handler(error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_CONVERT; goto cleanup; } data[bytes_encoded] = '\0'; if (out_finished) *out_finished = 0; cleanup: return retval; } static readstat_error_t read_string(por_ctx_t *ctx, char *data, size_t len) { int finished = 0; readstat_error_t retval = maybe_read_string(ctx, data, len, &finished); if (retval == READSTAT_OK && finished) { return READSTAT_ERROR_PARSE; } return retval; } static readstat_error_t read_variable_count_record(por_ctx_t *ctx) { int value; readstat_error_t retval = READSTAT_OK; if (ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if ((retval = read_integer_in_range(ctx, 0, MAX_VARS, &value)) != READSTAT_OK) { goto cleanup; } ctx->var_count = value; ctx->variables = readstat_calloc(ctx->var_count, sizeof(readstat_variable_t *)); ctx->varinfo = readstat_calloc(ctx->var_count, sizeof(spss_varinfo_t)); if (ctx->variables == NULL || ctx->varinfo == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (ctx->info_handler) { if (ctx->info_handler(-1, ctx->var_count, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = 
READSTAT_ERROR_USER_ABORT; goto cleanup; } } cleanup: return retval; } static readstat_error_t read_precision_record(por_ctx_t *ctx) { int precision = 0; readstat_error_t error = read_integer_in_range(ctx, 0, 100, &precision); if (error == READSTAT_OK) ctx->base30_precision = precision; return error; } static readstat_error_t read_case_weight_record(por_ctx_t *ctx) { return read_string(ctx, ctx->fweight_name, sizeof(ctx->fweight_name)); } static readstat_error_t read_variable_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int value; int i; spss_varinfo_t *varinfo = NULL; spss_format_t *formats[2]; ctx->var_offset++; if (ctx->var_offset == ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } varinfo = &ctx->varinfo[ctx->var_offset]; formats[0] = &varinfo->print_format; formats[1] = &varinfo->write_format; varinfo->labels_index = -1; if ((retval = read_integer_in_range(ctx, 0, MAX_WIDTH, &value)) != READSTAT_OK) { goto cleanup; } varinfo->width = value; if (varinfo->width == 0) { varinfo->type = READSTAT_TYPE_DOUBLE; } else { varinfo->type = READSTAT_TYPE_STRING; } if ((retval = read_string(ctx, varinfo->name, sizeof(varinfo->name))) != READSTAT_OK) { goto cleanup; } ck_str_hash_insert(varinfo->name, varinfo, ctx->var_dict); for (i=0; itype = value; if ((retval = read_integer_in_range(ctx, 0, 100, &value)) != READSTAT_OK) { goto cleanup; } format->width = value; if ((retval = read_integer_in_range(ctx, 0, 100, &value)) != READSTAT_OK) { goto cleanup; } format->decimal_places = value; } cleanup: return retval; } static readstat_error_t read_missing_value_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; double value; char string[256]; spss_varinfo_t *varinfo = NULL; if (ctx->var_offset < 0 || ctx->var_offset >= ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } varinfo = &ctx->varinfo[ctx->var_offset]; if (varinfo->type == READSTAT_TYPE_DOUBLE) { if ((retval = read_double(ctx, &value)) != READSTAT_OK) { goto 
cleanup; } varinfo->missing_values[varinfo->n_missing_values++] = value; if (varinfo->n_missing_values > 3) { retval = READSTAT_ERROR_PARSE; goto cleanup; } } else { if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { goto cleanup; } } cleanup: return retval; } static readstat_error_t read_missing_value_range_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; double value; char string[256]; spss_varinfo_t *varinfo = NULL; if (ctx->var_offset < 0 || ctx->var_offset == ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } varinfo = &ctx->varinfo[ctx->var_offset]; if (varinfo->type == READSTAT_TYPE_DOUBLE) { varinfo->missing_range = 1; if ((retval = read_double(ctx, &value)) != READSTAT_OK) { goto cleanup; } varinfo->missing_values[0] = value; if ((retval = read_double(ctx, &value)) != READSTAT_OK) { goto cleanup; } varinfo->missing_values[1] = value; varinfo->n_missing_values = 2; } else { if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { goto cleanup; } if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { goto cleanup; } } cleanup: return retval; } static readstat_error_t read_missing_value_lo_range_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; double value; char string[256]; spss_varinfo_t *varinfo = NULL; if (ctx->var_offset < 0 || ctx->var_offset == ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } varinfo = &ctx->varinfo[ctx->var_offset]; if (varinfo->type == READSTAT_TYPE_DOUBLE) { varinfo->missing_range = 1; if ((retval = read_double(ctx, &value)) != READSTAT_OK) { goto cleanup; } varinfo->missing_values[0] = -HUGE_VAL; varinfo->missing_values[1] = value; varinfo->n_missing_values = 2; } else { if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { goto cleanup; } } cleanup: return retval; } static readstat_error_t read_missing_value_hi_range_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; double value; 
char string[256]; spss_varinfo_t *varinfo = NULL; if (ctx->var_offset < 0 || ctx->var_offset == ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } varinfo = &ctx->varinfo[ctx->var_offset]; if (varinfo->type == READSTAT_TYPE_DOUBLE) { varinfo->missing_range = 1; if ((retval = read_double(ctx, &value)) != READSTAT_OK) { goto cleanup; } varinfo->missing_values[0] = value; varinfo->missing_values[1] = HUGE_VAL; varinfo->n_missing_values = 2; } else { if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { goto cleanup; } } cleanup: return retval; } static readstat_error_t read_document_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; char string[256]; int i; int line_count = 0; if ((retval = read_integer_in_range(ctx, 0, MAX_LINES, &line_count)) != READSTAT_OK) { goto cleanup; } for (i=0; i<line_count; i++) { if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { goto cleanup; } if (ctx->note_handler) { if (ctx->note_handler(i, string, ctx->user_ctx) != READSTAT_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } } cleanup: return retval; } static readstat_error_t read_variable_label_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; char string[256]; spss_varinfo_t *varinfo = NULL; if (ctx->var_offset < 0 || ctx->var_offset == ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } varinfo = &ctx->varinfo[ctx->var_offset]; if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { goto cleanup; } varinfo->label = malloc(strlen(string) + 1); strcpy(varinfo->label, string); cleanup: return retval; } static readstat_error_t read_value_label_record(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; double dval; int i; char string[256]; int count = 0, label_count = 0; char label_name_buf[256]; char label_buf[256]; snprintf(label_name_buf, sizeof(label_name_buf), POR_LABEL_NAME_PREFIX "%d", ctx->labels_offset); readstat_type_t value_type = READSTAT_TYPE_DOUBLE; if ((retval = read_integer_in_range(ctx, 0, MAX_STRINGS, &count)) != READSTAT_OK) { goto cleanup; } for (i=0;
ivar_dict); if (info) { value_type = info->type; info->labels_index = ctx->labels_offset; } } if ((retval = read_integer_in_range(ctx, 0, MAX_LABELS, &label_count)) != READSTAT_OK) { goto cleanup; } for (i=0; ivalue_label_handler(label_name_buf, value, label_buf, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } ctx->labels_offset++; cleanup: return retval; } static readstat_error_t read_por_file_data(por_ctx_t *ctx) { int i; char input_string[256]; char output_string[4*256+1]; char error_buf[1024]; readstat_error_t rs_retval = READSTAT_OK; if (ctx->var_count == 0) return READSTAT_OK; while (1) { int finished = 0; for (i=0; i<ctx->var_count; i++) { spss_varinfo_t *info = &ctx->varinfo[i]; readstat_value_t value = { .type = info->type }; if (info->type == READSTAT_TYPE_STRING) { rs_retval = maybe_read_string(ctx, input_string, sizeof(input_string), &finished); if (rs_retval != READSTAT_OK) { if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error in %s (row=%d)", info->name, ctx->obs_count+1); ctx->error_handler(error_buf, ctx->user_ctx); } goto cleanup; } else if (finished) { if (i != 0) rs_retval = READSTAT_ERROR_PARSE; goto cleanup; } rs_retval = readstat_convert(output_string, sizeof(output_string), input_string, strlen(input_string), ctx->converter); if (rs_retval != READSTAT_OK) { goto cleanup; } value.v.string_value = output_string; } else if (info->type == READSTAT_TYPE_DOUBLE) { rs_retval = maybe_read_double(ctx, &value.v.double_value, &finished); if (rs_retval != READSTAT_OK) { if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error in %s (row=%d)", info->name, ctx->obs_count+1); ctx->error_handler(error_buf, ctx->user_ctx); } goto cleanup; } else if (finished) { if (i != 0) rs_retval = READSTAT_ERROR_PARSE; goto cleanup; } value.is_system_missing = isnan(value.v.double_value); } if (ctx->value_handler && !ctx->variables[i]->skip) { if (ctx->value_handler(ctx->obs_count,
ctx->variables[i], value, ctx->user_ctx) != READSTAT_HANDLER_OK) { rs_retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } } ctx->obs_count++; rs_retval = por_update_progress(ctx); if (rs_retval != READSTAT_OK) break; if (ctx->obs_count == ctx->row_limit) break; } cleanup: return rs_retval; } readstat_error_t read_version_and_timestamp(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; char string[256]; struct tm timestamp = { .tm_isdst = -1 }; unsigned char version; if (read_bytes(ctx, &version, sizeof(version)) != sizeof(version)) { retval = READSTAT_ERROR_READ; goto cleanup; } if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { /* creation date */ goto cleanup; } if (sscanf(string, "%04d%02d%02d", &timestamp.tm_year, &timestamp.tm_mon, &timestamp.tm_mday) != 3) { retval = READSTAT_ERROR_BAD_TIMESTAMP; goto cleanup; } if ((retval = read_string(ctx, string, sizeof(string))) != READSTAT_OK) { /* creation time */ goto cleanup; } if (sscanf(string, "%02d%02d%02d", &timestamp.tm_hour, &timestamp.tm_min, &timestamp.tm_sec) != 3) { retval = READSTAT_ERROR_BAD_TIMESTAMP; goto cleanup; } timestamp.tm_year -= 1900; timestamp.tm_mon--; ctx->timestamp = mktime(&timestamp); ctx->version = ctx->byte2unicode[version] - 'A'; cleanup: return retval; } readstat_error_t handle_variables(por_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int i; int index_after_skipping = 0; for (i=0; i<ctx->var_count; i++) { char label_name_buf[256]; spss_varinfo_t *info = &ctx->varinfo[i]; info->index = i; ctx->variables[i] = spss_init_variable_for_info(info, index_after_skipping); snprintf(label_name_buf, sizeof(label_name_buf), POR_LABEL_NAME_PREFIX "%d", info->labels_index); int cb_retval = READSTAT_HANDLER_OK; if (ctx->variable_handler) { cb_retval = ctx->variable_handler(i, ctx->variables[i], info->labels_index == -1 ?
NULL : label_name_buf, ctx->user_ctx); } if (cb_retval == READSTAT_HANDLER_ABORT) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } if (cb_retval == READSTAT_HANDLER_SKIP_VARIABLE) { ctx->variables[i]->skip = 1; } else { index_after_skipping++; } } if (ctx->fweight_handler && ctx->fweight_name[0]) { for (i=0; i<ctx->var_count; i++) { spss_varinfo_t *info = &ctx->varinfo[i]; if (strcmp(info->name, ctx->fweight_name) == 0) { if (ctx->fweight_handler(ctx->variables[i], ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } break; } } } cleanup: return retval; } readstat_error_t readstat_parse_por(readstat_parser_t *parser, const char *path, void *user_ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = parser->io; unsigned char reverse_lookup[256]; char vanity[5][40]; char file_label[21]; char error_buf[1024]; por_ctx_t *ctx = por_ctx_init(); ctx->info_handler = parser->info_handler; ctx->metadata_handler = parser->metadata_handler; ctx->note_handler = parser->note_handler; ctx->fweight_handler = parser->fweight_handler; ctx->variable_handler = parser->variable_handler; ctx->value_handler = parser->value_handler; ctx->value_label_handler = parser->value_label_handler; ctx->error_handler = parser->error_handler; ctx->progress_handler = parser->progress_handler; ctx->user_ctx = user_ctx; ctx->io = io; ctx->row_limit = parser->row_limit; if (parser->output_encoding) { if (strcmp(parser->output_encoding, "UTF-8") != 0) ctx->converter = iconv_open(parser->output_encoding, "UTF-8"); if (ctx->converter == (iconv_t)-1) { ctx->converter = NULL; retval = READSTAT_ERROR_UNSUPPORTED_CHARSET; goto cleanup; } } if (io->open(path, io->io_ctx) == -1) { retval = READSTAT_ERROR_OPEN; goto cleanup; } if ((ctx->file_size = io->seek(0, READSTAT_SEEK_END, io->io_ctx)) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } if (io->seek(0, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } if
(read_bytes(ctx, vanity, sizeof(vanity)) != sizeof(vanity)) { retval = READSTAT_ERROR_READ; goto cleanup; } readstat_convert(file_label, sizeof(file_label), vanity[1] + 20, 20, NULL); if (read_bytes(ctx, reverse_lookup, sizeof(reverse_lookup)) != sizeof(reverse_lookup)) { retval = READSTAT_ERROR_READ; goto cleanup; } ctx->space = reverse_lookup[126]; int i; for (i=0; i<256; i++) { if (por_ascii_lookup[i]) { ctx->byte2unicode[reverse_lookup[i]] = por_ascii_lookup[i]; } else if (por_unicode_lookup[i]) { ctx->byte2unicode[reverse_lookup[i]] = por_unicode_lookup[i]; } } ctx->byte2unicode[reverse_lookup[64]] = por_unicode_lookup[64]; unsigned char check[8]; char tr_check[8]; if (read_bytes(ctx, check, sizeof(check)) != sizeof(check)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (por_utf8_encode(check, sizeof(check), tr_check, sizeof(tr_check), ctx->byte2unicode) == -1) { if (ctx->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error converting check string: %.*s", (int)sizeof(check), check); ctx->error_handler(error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_CONVERT; goto cleanup; } if (strncmp("SPSSPORT", tr_check, sizeof(tr_check)) != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->var_offset = -1; char string[256]; retval = read_version_and_timestamp(ctx); if (retval != READSTAT_OK) goto cleanup; if (ctx->metadata_handler) { if (ctx->metadata_handler(file_label, NULL, ctx->timestamp, ctx->version, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } while (1) { uint16_t tr_tag = read_tag(ctx); switch (tr_tag) { case '1': /* product ID */ case '2': /* author ID */ case '3': /* sub-product ID */ retval = read_string(ctx, string, sizeof(string)); break; case '4': /* variable count */ retval = read_variable_count_record(ctx); break; case '5': /* precision */ retval = read_precision_record(ctx); break; case '6': /* case weight */ retval = read_case_weight_record(ctx); break; case '7': /* variable */ 
retval = read_variable_record(ctx); break; case '8': /* missing value */ retval = read_missing_value_record(ctx); break; case 'B': /* missing value range */ retval = read_missing_value_range_record(ctx); break; case '9': /* LO THRU x */ retval = read_missing_value_lo_range_record(ctx); break; case 'A': /* x THRU HI */ retval = read_missing_value_hi_range_record(ctx); break; case 'C': /* variable label */ retval = read_variable_label_record(ctx); break; case 'D': /* value label */ retval = read_value_label_record(ctx); break; case 'E': /* document record */ retval = read_document_record(ctx); break; case 'F': /* file data */ if (ctx->var_offset != ctx->var_count - 1) { retval = READSTAT_ERROR_COLUMN_COUNT_MISMATCH; goto cleanup; } retval = handle_variables(ctx); if (retval != READSTAT_OK) goto cleanup; if (ctx->value_handler) { retval = read_por_file_data(ctx); } goto cleanup; default: retval = READSTAT_ERROR_PARSE; goto cleanup; } if (retval != READSTAT_OK) break; } cleanup: io->close(io->io_ctx); por_ctx_free(ctx); return retval; } haven/src/readstat/spss/readstat_spss_parse.h0000644000176200001440000000012613227731765021177 0ustar liggesusers readstat_error_t spss_parse_format(const char *data, int count, spss_format_t *fmt); haven/src/readstat/spss/readstat_sav.h0000644000176200001440000001001513227731765017604 0ustar liggesusers// // readstat_sav.h // #include "readstat_spss.h" #pragma pack(push, 1) // SAV files typedef struct sav_file_header_record_s { char rec_type[4]; char prod_name[60]; int32_t layout_code; int32_t nominal_case_size; int32_t compressed; int32_t weight_index; int32_t ncases; double bias; /* TODO is this portable? 
*/ char creation_date[9]; char creation_time[8]; char file_label[64]; char padding[3]; } sav_file_header_record_t; typedef struct sav_variable_record_s { int32_t type; int32_t has_var_label; int32_t n_missing_values; int32_t print; int32_t write; char name[8]; } sav_variable_record_t; typedef struct sav_info_record_header_s { int32_t rec_type; int32_t subtype; int32_t size; int32_t count; } sav_info_record_t; typedef struct sav_machine_integer_info_record_s { int32_t version_major; int32_t version_minor; int32_t version_revision; int32_t machine_code; int32_t floating_point_rep; int32_t compression_code; int32_t endianness; int32_t character_code; } sav_machine_integer_info_record_t; typedef struct sav_machine_floating_point_info_record_s { uint64_t sysmis; uint64_t highest; uint64_t lowest; } sav_machine_floating_point_info_record_t; typedef struct sav_dictionary_termination_record_s { int32_t rec_type; int32_t filler; } sav_dictionary_termination_record_t; #pragma pack(pop) typedef struct sav_ctx_s { readstat_error_handler error_handler; readstat_progress_handler progress_handler; readstat_note_handler note_handler; readstat_value_handler value_handler; readstat_value_label_handler value_label_handler; size_t file_size; readstat_io_t *io; void *user_ctx; spss_varinfo_t *varinfo; size_t varinfo_capacity; readstat_variable_t **variables; const char *input_encoding; const char *output_encoding; char file_label[4*64+1]; time_t timestamp; uint32_t *variable_display_values; size_t variable_display_values_count; iconv_t converter; int var_index; int var_offset; int var_count; int record_count; int row_limit; int current_row; int value_labels_count; int fweight_index; char *raw_string; size_t raw_string_len; char *utf8_string; size_t utf8_string_len; uint64_t missing_double; uint64_t lowest_double; uint64_t highest_double; unsigned int data_is_compressed:1; unsigned int bswap:1; } sav_ctx_t; #define SAV_RECORD_TYPE_VARIABLE 2 #define SAV_RECORD_TYPE_VALUE_LABEL 3 #define 
SAV_RECORD_TYPE_VALUE_LABEL_VARIABLES 4 #define SAV_RECORD_TYPE_DOCUMENT 6 #define SAV_RECORD_TYPE_HAS_DATA 7 #define SAV_RECORD_TYPE_DICT_TERMINATION 999 #define SAV_RECORD_SUBTYPE_INTEGER_INFO 3 #define SAV_RECORD_SUBTYPE_FP_INFO 4 #define SAV_RECORD_SUBTYPE_PRODUCT_INFO 10 #define SAV_RECORD_SUBTYPE_VAR_DISPLAY 11 #define SAV_RECORD_SUBTYPE_LONG_VAR_NAME 13 #define SAV_RECORD_SUBTYPE_VERY_LONG_STR 14 #define SAV_RECORD_SUBTYPE_NUMBER_OF_CASES 16 #define SAV_RECORD_SUBTYPE_DATA_FILE_ATTRS 17 #define SAV_RECORD_SUBTYPE_VARIABLE_ATTRS 18 #define SAV_RECORD_SUBTYPE_CHAR_ENCODING 20 #define SAV_RECORD_SUBTYPE_LONG_VALUE_LABELS 21 #define SAV_FLOATING_POINT_REP_IEEE 1 #define SAV_FLOATING_POINT_REP_IBM 2 #define SAV_FLOATING_POINT_REP_VAX 3 #define SAV_ENDIANNESS_BIG 1 #define SAV_ENDIANNESS_LITTLE 2 #define SAV_EIGHT_SPACES " " sav_ctx_t *sav_ctx_init(sav_file_header_record_t *header, readstat_io_t *io); void sav_ctx_free(sav_ctx_t *ctx); haven/src/readstat/spss/readstat_spss.c0000644000176200001440000001716313227731765020011 0ustar liggesusers #include #include #include "../readstat.h" #include "readstat_spss.h" #include "readstat_spss_parse.h" static char spss_type_strings[][16] = { [SPSS_FORMAT_TYPE_A] = "A", [SPSS_FORMAT_TYPE_AHEX] = "AHEX", [SPSS_FORMAT_TYPE_COMMA] = "COMMA", [SPSS_FORMAT_TYPE_DOLLAR] = "DOLLAR", [SPSS_FORMAT_TYPE_F] = "F", [SPSS_FORMAT_TYPE_IB] = "IB", [SPSS_FORMAT_TYPE_PIBHEX] = "PIBHEX", [SPSS_FORMAT_TYPE_P] = "P", [SPSS_FORMAT_TYPE_PIB] = "PIB", [SPSS_FORMAT_TYPE_PK] = "PK", [SPSS_FORMAT_TYPE_RB] = "RB", [SPSS_FORMAT_TYPE_RBHEX] = "RBHEX", [SPSS_FORMAT_TYPE_Z] = "Z", [SPSS_FORMAT_TYPE_N] = "N", [SPSS_FORMAT_TYPE_E] = "E", [SPSS_FORMAT_TYPE_DATE] = "DATE", [SPSS_FORMAT_TYPE_TIME] = "TIME", [SPSS_FORMAT_TYPE_DATETIME] = "DATETIME", [SPSS_FORMAT_TYPE_ADATE] = "ADATE", [SPSS_FORMAT_TYPE_JDATE] = "JDATE", [SPSS_FORMAT_TYPE_DTIME] = "DTIME", [SPSS_FORMAT_TYPE_WKDAY] = "WKDAY", [SPSS_FORMAT_TYPE_MONTH] = "MONTH", [SPSS_FORMAT_TYPE_MOYR] = "MOYR", 
[SPSS_FORMAT_TYPE_QYR] = "QYR", [SPSS_FORMAT_TYPE_WKYR] = "WKYR", [SPSS_FORMAT_TYPE_PCT] = "PCT", [SPSS_FORMAT_TYPE_DOT] = "DOT", [SPSS_FORMAT_TYPE_CCA] = "CCA", [SPSS_FORMAT_TYPE_CCB] = "CCB", [SPSS_FORMAT_TYPE_CCC] = "CCC", [SPSS_FORMAT_TYPE_CCD] = "CCD", [SPSS_FORMAT_TYPE_CCE] = "CCE", [SPSS_FORMAT_TYPE_EDATE] = "EDATE", [SPSS_FORMAT_TYPE_SDATE] = "SDATE" }; int spss_format(char *buffer, size_t len, spss_format_t *format) { if (format->type < 0 || format->type >= sizeof(spss_type_strings)/sizeof(spss_type_strings[0]) || spss_type_strings[format->type][0] == '\0') { return 0; } char *string = spss_type_strings[format->type]; if (format->decimal_places || format->type == SPSS_FORMAT_TYPE_F) { snprintf(buffer, len, "%s%d.%d", string, format->width, format->decimal_places); } else if (format->width) { snprintf(buffer, len, "%s%d", string, format->width); } else { snprintf(buffer, len, "%s", string); } return 1; } int spss_varinfo_compare(const void *elem1, const void *elem2) { int offset = *(int *)elem1; const spss_varinfo_t *v = (const spss_varinfo_t *)elem2; if (offset < v->offset) return -1; return (offset > v->offset); } static readstat_value_t spss_boxed_value(double fp_value) { readstat_value_t value = { .type = READSTAT_TYPE_DOUBLE, .v = { .double_value = fp_value }, .is_system_missing = isnan(fp_value) }; return value; } uint64_t spss_64bit_value(readstat_value_t value) { double dval = readstat_double_value(value); uint64_t special_val; memcpy(&special_val, &dval, sizeof(double)); if (isinf(dval)) { if (dval < 0.0) { special_val = SAV_LOWEST_DOUBLE; } else { special_val = SAV_HIGHEST_DOUBLE; } } else if (isnan(dval)) { special_val = SAV_MISSING_DOUBLE; } return special_val; } readstat_missingness_t spss_missingness_for_info(spss_varinfo_t *info) { readstat_missingness_t missingness; memset(&missingness, 0, sizeof(readstat_missingness_t)); if (info->missing_range) { missingness.missing_ranges_count++; missingness.missing_ranges[0] = 
spss_boxed_value(info->missing_values[0]); missingness.missing_ranges[1] = spss_boxed_value(info->missing_values[1]); if (info->n_missing_values == 3) { missingness.missing_ranges_count++; missingness.missing_ranges[2] = missingness.missing_ranges[3] = spss_boxed_value(info->missing_values[2]); } } else if (info->n_missing_values > 0) { missingness.missing_ranges_count = info->n_missing_values; int i=0; for (i=0; i<info->n_missing_values; i++) { missingness.missing_ranges[2*i] = missingness.missing_ranges[2*i+1] = spss_boxed_value(info->missing_values[i]); } } return missingness; } readstat_variable_t *spss_init_variable_for_info(spss_varinfo_t *info, int index_after_skipping) { readstat_variable_t *variable = calloc(1, sizeof(readstat_variable_t)); variable->index = info->index; variable->index_after_skipping = index_after_skipping; variable->type = info->type; if (info->string_length) { variable->storage_width = info->string_length; } else { variable->storage_width = 8 * info->width; } if (info->longname[0]) { snprintf(variable->name, sizeof(variable->name), "%s", info->longname); } else { snprintf(variable->name, sizeof(variable->name), "%s", info->name); } if (info->label) { snprintf(variable->label, sizeof(variable->label), "%s", info->label); } spss_format(variable->format, sizeof(variable->format), &info->print_format); variable->missingness = spss_missingness_for_info(info); variable->measure = info->measure; variable->display_width = info->display_width; return variable; } uint32_t spss_measure_from_readstat_measure(readstat_measure_t measure) { uint32_t sav_measure = SAV_MEASURE_UNKNOWN; if (measure == READSTAT_MEASURE_NOMINAL) { sav_measure = SAV_MEASURE_NOMINAL; } else if (measure == READSTAT_MEASURE_ORDINAL) { sav_measure = SAV_MEASURE_ORDINAL; } else if (measure == READSTAT_MEASURE_SCALE) { sav_measure = SAV_MEASURE_SCALE; } return sav_measure; } readstat_measure_t spss_measure_to_readstat_measure(uint32_t sav_measure) { if (sav_measure == 
SAV_MEASURE_NOMINAL) return READSTAT_MEASURE_NOMINAL; if (sav_measure == SAV_MEASURE_ORDINAL) return READSTAT_MEASURE_ORDINAL; if (sav_measure == SAV_MEASURE_SCALE) return READSTAT_MEASURE_SCALE; return READSTAT_MEASURE_UNKNOWN; } uint32_t spss_alignment_from_readstat_alignment(readstat_alignment_t alignment) { uint32_t sav_alignment = 0; if (alignment == READSTAT_ALIGNMENT_LEFT) { sav_alignment = SAV_ALIGNMENT_LEFT; } else if (alignment == READSTAT_ALIGNMENT_CENTER) { sav_alignment = SAV_ALIGNMENT_CENTER; } else if (alignment == READSTAT_ALIGNMENT_RIGHT) { sav_alignment = SAV_ALIGNMENT_RIGHT; } return sav_alignment; } readstat_alignment_t spss_alignment_to_readstat_alignment(uint32_t sav_alignment) { if (sav_alignment == SAV_ALIGNMENT_LEFT) return READSTAT_ALIGNMENT_LEFT; if (sav_alignment == SAV_ALIGNMENT_CENTER) return READSTAT_ALIGNMENT_CENTER; if (sav_alignment == SAV_ALIGNMENT_RIGHT) return READSTAT_ALIGNMENT_RIGHT; return READSTAT_ALIGNMENT_UNKNOWN; } readstat_error_t spss_format_for_variable(readstat_variable_t *r_variable, spss_format_t *spss_format) { readstat_error_t retval = READSTAT_OK; memset(spss_format, 0, sizeof(spss_format_t)); if (r_variable->type == READSTAT_TYPE_STRING) { spss_format->type = SPSS_FORMAT_TYPE_A; if (r_variable->user_width) { spss_format->width = r_variable->user_width; } else { spss_format->width = r_variable->storage_width; } } else { spss_format->type = SPSS_FORMAT_TYPE_F; spss_format->width = 8; if (r_variable->type == READSTAT_TYPE_DOUBLE || r_variable->type == READSTAT_TYPE_FLOAT) { spss_format->decimal_places = 2; } } if (r_variable->format[0]) { spss_format->decimal_places = 0; const char *fmt = r_variable->format; if (spss_parse_format(fmt, strlen(fmt), spss_format) != READSTAT_OK) { retval = READSTAT_ERROR_BAD_FORMAT_STRING; goto cleanup; } } cleanup: return retval; } haven/src/readstat/spss/readstat_por_parse.h0000644000176200001440000000025313227731765021010 0ustar liggesusers// // readstat_por_parse.h // ssize_t 
readstat_por_parse_double(const char *data, size_t len, double *result, readstat_error_handler error_cb, void *user_ctx); haven/src/readstat/spss/readstat_sav_parse.h0000644000176200001440000000032113227731765020775 0ustar liggesusers// // sav_parse.h // readstat_error_t sav_parse_long_variable_names_record(void *data, int count, sav_ctx_t *ctx); readstat_error_t sav_parse_very_long_string_record(void *data, int count, sav_ctx_t *ctx); haven/src/readstat/spss/readstat_spss_parse.c0000644000176200001440000004416713227731765021207 0ustar liggesusers #line 1 "src/spss/readstat_spss_parse.rl" #include #include "../readstat.h" #include "readstat_spss.h" #include "readstat_spss_parse.h" #line 12 "src/spss/readstat_spss_parse.c" static const char _spss_format_parser_actions[] = { 0, 1, 1, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9, 1, 10, 1, 11, 1, 12, 1, 13, 1, 14, 1, 15, 1, 16, 1, 17, 1, 18, 1, 19, 1, 20, 1, 21, 1, 22, 1, 23, 1, 24, 1, 25, 1, 26, 1, 27, 1, 28, 1, 29, 1, 30, 1, 31, 1, 32, 1, 33, 1, 34, 1, 35, 1, 36, 1, 37, 1, 38, 1, 39, 1, 40, 2, 0, 1, 3, 4, 0, 1, 3, 5, 0, 1, 3, 6, 0, 1, 3, 7, 0, 1, 3, 8, 0, 1, 3, 9, 0, 1, 3, 10, 0, 1, 3, 11, 0, 1, 3, 12, 0, 1, 3, 13, 0, 1, 3, 14, 0, 1, 3, 15, 0, 1, 3, 16, 0, 1, 3, 17, 0, 1, 3, 18, 0, 1, 3, 19, 0, 1, 3, 20, 0, 1, 3, 21, 0, 1, 3, 22, 0, 1, 3, 23, 0, 1, 3, 24, 0, 1, 3, 25, 0, 1, 3, 26, 0, 1, 3, 27, 0, 1, 3, 28, 0, 1, 3, 29, 0, 1, 3, 30, 0, 1, 3, 31, 0, 1, 3, 32, 0, 1, 3, 33, 0, 1, 3, 34, 0, 1, 3, 35, 0, 1, 3, 36, 0, 1, 3, 37, 0, 1, 3, 38, 0, 1, 3, 39, 0, 1, 3, 40, 0, 1 }; static const short _spss_format_parser_key_offsets[] = { 0, 0, 34, 36, 38, 40, 42, 44, 46, 50, 60, 62, 64, 66, 72, 74, 76, 78, 80, 82, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 118, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 172, 174, 176, 178, 180, 182, 184, 186, 188, 194, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 
231, 233, 235, 237, 239, 241, 243, 245, 247, 255, 257, 261, 263, 265, 267, 271, 273, 275, 277, 279, 281, 283 }; static const char _spss_format_parser_trans_keys[] = { 65, 67, 68, 69, 70, 73, 74, 77, 78, 80, 81, 82, 83, 84, 87, 89, 90, 97, 99, 100, 101, 102, 105, 106, 109, 110, 112, 113, 114, 115, 116, 119, 121, 122, 48, 57, 65, 97, 84, 116, 69, 101, 69, 101, 88, 120, 67, 79, 99, 111, 65, 66, 67, 68, 69, 97, 98, 99, 100, 101, 77, 109, 77, 109, 65, 97, 65, 79, 84, 97, 111, 116, 84, 116, 69, 101, 73, 105, 77, 109, 69, 101, 76, 84, 108, 116, 76, 108, 65, 97, 82, 114, 73, 105, 77, 109, 69, 101, 65, 97, 84, 116, 69, 101, 66, 98, 68, 100, 65, 97, 84, 116, 69, 101, 79, 84, 111, 116, 78, 89, 110, 121, 84, 116, 72, 104, 82, 114, 73, 105, 77, 109, 69, 101, 84, 116, 66, 98, 69, 101, 88, 120, 89, 121, 82, 114, 66, 98, 69, 101, 88, 120, 68, 100, 65, 97, 84, 116, 69, 101, 73, 105, 77, 109, 69, 101, 75, 107, 68, 89, 100, 121, 65, 97, 89, 121, 82, 114, 77, 109, 68, 100, 72, 104, 77, 109, 83, 115, 68, 72, 100, 104, 48, 57, 46, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 84, 116, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 68, 100, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 67, 73, 75, 99, 105, 107, 48, 57, 48, 57, 72, 104, 48, 57, 48, 57, 48, 57, 48, 57, 72, 104, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 48, 57, 0 }; static const char _spss_format_parser_single_lengths[] = { 0, 34, 0, 2, 2, 2, 2, 2, 4, 10, 2, 2, 2, 6, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0 }; static const char _spss_format_parser_range_lengths[] = { 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }; static const short _spss_format_parser_index_offsets[] = { 0, 0, 35, 37, 40, 43, 46, 49, 52, 57, 68, 71, 74, 77, 84, 87, 90, 93, 96, 99, 104, 107, 110, 113, 116, 119, 122, 125, 128, 131, 134, 137, 140, 143, 146, 151, 156, 159, 162, 165, 168, 171, 174, 177, 180, 183, 186, 189, 192, 195, 198, 201, 204, 207, 210, 213, 216, 219, 222, 225, 230, 233, 236, 239, 242, 245, 248, 251, 254, 260, 263, 265, 267, 269, 271, 273, 275, 277, 279, 281, 285, 287, 289, 291, 293, 297, 299, 301, 303, 305, 307, 309, 311, 313, 321, 323, 327, 329, 331, 333, 337, 339, 341, 343, 345, 347, 349 }; static const unsigned char _spss_format_parser_indicies[] = { 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1, 18, 1, 19, 19, 1, 20, 20, 1, 21, 21, 1, 22, 22, 1, 23, 23, 1, 24, 25, 24, 25, 1, 26, 27, 28, 29, 30, 26, 27, 28, 29, 30, 1, 31, 31, 1, 32, 32, 1, 33, 33, 1, 34, 35, 36, 34, 35, 36, 1, 37, 37, 1, 38, 38, 1, 39, 39, 1, 40, 40, 1, 41, 41, 1, 42, 43, 42, 43, 1, 44, 44, 1, 45, 45, 1, 46, 46, 1, 47, 47, 1, 48, 48, 1, 49, 49, 1, 50, 50, 1, 51, 51, 1, 52, 52, 1, 53, 53, 1, 54, 54, 1, 55, 55, 1, 56, 56, 1, 57, 57, 1, 58, 59, 58, 59, 1, 60, 61, 60, 61, 1, 62, 62, 1, 63, 63, 1, 64, 64, 1, 65, 65, 1, 66, 66, 1, 67, 67, 1, 68, 68, 1, 69, 69, 1, 70, 70, 1, 71, 71, 1, 72, 72, 1, 73, 73, 1, 74, 74, 1, 75, 75, 1, 76, 76, 1, 77, 77, 1, 78, 78, 1, 79, 79, 1, 80, 80, 1, 81, 81, 1, 82, 82, 1, 83, 83, 1, 84, 84, 1, 85, 86, 85, 86, 1, 87, 87, 1, 88, 88, 1, 89, 89, 1, 90, 90, 1, 91, 91, 1, 92, 92, 1, 93, 93, 1, 94, 94, 1, 96, 97, 96, 97, 95, 1, 98, 99, 1, 100, 1, 101, 1, 102, 1, 103, 1, 104, 1, 105, 1, 106, 1, 107, 1, 108, 1, 110, 110, 109, 1, 111, 1, 112, 1, 113, 1, 114, 1, 116, 116, 115, 1, 117, 1, 118, 1, 119, 1, 120, 1, 121, 1, 122, 1, 123, 1, 124, 1, 126, 127, 128, 126, 127, 128, 
125, 1, 129, 1, 131, 131, 130, 1, 132, 1, 133, 1, 134, 1, 136, 136, 135, 1, 137, 1, 138, 1, 139, 1, 140, 1, 141, 1, 142, 1, 143, 1, 0 }; static const char _spss_format_parser_trans_targs[] = { 68, 0, 8, 13, 84, 86, 29, 30, 34, 92, 93, 46, 48, 51, 55, 58, 63, 106, 70, 4, 5, 71, 7, 72, 9, 10, 73, 74, 75, 76, 77, 11, 12, 78, 14, 19, 23, 15, 79, 17, 18, 80, 20, 82, 21, 22, 81, 24, 25, 83, 27, 28, 85, 87, 31, 32, 33, 88, 35, 39, 36, 38, 37, 89, 90, 40, 41, 91, 94, 95, 45, 96, 47, 98, 99, 50, 100, 52, 53, 54, 101, 56, 57, 102, 59, 60, 62, 61, 103, 104, 64, 65, 66, 67, 105, 69, 3, 6, 2, 69, 70, 69, 69, 69, 69, 69, 69, 69, 69, 69, 16, 69, 69, 69, 69, 69, 26, 69, 69, 69, 69, 69, 69, 69, 69, 69, 42, 43, 97, 69, 69, 44, 69, 69, 69, 69, 49, 69, 69, 69, 69, 69, 69, 69 }; static const unsigned char _spss_format_parser_trans_actions[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 81, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 84, 0, 0, 3, 1, 1, 160, 88, 204, 208, 212, 216, 220, 92, 144, 0, 152, 96, 200, 168, 140, 0, 224, 100, 104, 164, 180, 184, 172, 136, 112, 0, 0, 0, 196, 116, 0, 108, 120, 188, 124, 0, 128, 228, 148, 176, 192, 156, 132 }; static const unsigned char _spss_format_parser_eof_actions[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 3, 5, 45, 9, 67, 69, 71, 73, 75, 11, 37, 41, 13, 65, 49, 35, 77, 15, 17, 47, 55, 57, 51, 33, 21, 63, 23, 19, 25, 59, 27, 29, 79, 39, 53, 61, 43, 31 }; static const int spss_format_parser_start = 1; static const int spss_format_parser_en_main = 1; #line 11 "src/spss/readstat_spss_parse.rl" /* TODO - SPSS v24 introduced the MTIME and YMDHMS formats; until their numeric * codes are 
known, map them to DTIME and DATETIME, respectively. */ readstat_error_t spss_parse_format(const char *data, int count, spss_format_t *fmt) { unsigned char *p = (unsigned char *)data; unsigned char *pe = (unsigned char *)data + count; unsigned char *eof = pe; int cs; unsigned int integer = 0; #line 279 "src/spss/readstat_spss_parse.c" { cs = spss_format_parser_start; } #line 284 "src/spss/readstat_spss_parse.c" { int _klen; unsigned int _trans; const char *_acts; unsigned int _nacts; const char *_keys; if ( p == pe ) goto _test_eof; if ( cs == 0 ) goto _out; _resume: _keys = _spss_format_parser_trans_keys + _spss_format_parser_key_offsets[cs]; _trans = _spss_format_parser_index_offsets[cs]; _klen = _spss_format_parser_single_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + _klen - 1; while (1) { if ( _upper < _lower ) break; _mid = _lower + ((_upper-_lower) >> 1); if ( (*p) < *_mid ) _upper = _mid - 1; else if ( (*p) > *_mid ) _lower = _mid + 1; else { _trans += (unsigned int)(_mid - _keys); goto _match; } } _keys += _klen; _trans += _klen; } _klen = _spss_format_parser_range_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + (_klen<<1) - 2; while (1) { if ( _upper < _lower ) break; _mid = _lower + (((_upper-_lower) >> 1) & ~1); if ( (*p) < _mid[0] ) _upper = _mid - 2; else if ( (*p) > _mid[1] ) _lower = _mid + 2; else { _trans += (unsigned int)((_mid - _keys)>>1); goto _match; } } _trans += _klen; } _match: _trans = _spss_format_parser_indicies[_trans]; cs = _spss_format_parser_trans_targs[_trans]; if ( _spss_format_parser_trans_actions[_trans] == 0 ) goto _again; _acts = _spss_format_parser_actions + _spss_format_parser_trans_actions[_trans]; _nacts = (unsigned int) *_acts++; while ( _nacts-- > 0 ) { switch ( *_acts++ ) { case 0: #line 25 "src/spss/readstat_spss_parse.rl" { integer = 0; } break; case 1: #line 29 "src/spss/readstat_spss_parse.rl" { 
integer = 10 * integer + ((*p) - '0'); } break; case 2: #line 33 "src/spss/readstat_spss_parse.rl" { fmt->width = integer; } break; case 4: #line 41 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_A; } break; case 5: #line 42 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_AHEX; } break; case 6: #line 43 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_COMMA; } break; case 7: #line 44 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DOLLAR; } break; case 8: #line 45 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_F; } break; case 9: #line 46 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_IB; } break; case 10: #line 47 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PIBHEX; } break; case 11: #line 48 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_P; } break; case 12: #line 49 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PIB; } break; case 13: #line 50 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PK; } break; case 14: #line 51 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_RB; } break; case 15: #line 52 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_RBHEX; } break; case 16: #line 53 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_Z; } break; case 17: #line 54 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_N; } break; case 18: #line 55 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_E; } break; case 19: #line 56 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DATE; } break; case 20: #line 57 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_TIME; } break; case 21: #line 58 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DATETIME; } break; case 22: #line 59 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DATETIME; } break; case 23: #line 60 
"src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_ADATE; } break; case 24: #line 61 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_JDATE; } break; case 25: #line 62 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DTIME; } break; case 26: #line 63 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DTIME; } break; case 27: #line 64 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_WKDAY; } break; case 28: #line 65 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_MONTH; } break; case 29: #line 66 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_MOYR; } break; case 30: #line 67 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_QYR; } break; case 31: #line 68 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_WKYR; } break; case 32: #line 69 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PCT; } break; case 33: #line 70 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DOT; } break; case 34: #line 71 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCA; } break; case 35: #line 72 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCB; } break; case 36: #line 73 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCC; } break; case 37: #line 74 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCD; } break; case 38: #line 75 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCE; } break; case 39: #line 76 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_EDATE; } break; case 40: #line 77 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_SDATE; } break; #line 524 "src/spss/readstat_spss_parse.c" } } _again: if ( cs == 0 ) goto _out; if ( ++p != pe ) goto _resume; _test_eof: {} if ( p == eof ) { const char *__acts = _spss_format_parser_actions + _spss_format_parser_eof_actions[cs]; unsigned int __nacts 
= (unsigned int) *__acts++; while ( __nacts-- > 0 ) { switch ( *__acts++ ) { case 2: #line 33 "src/spss/readstat_spss_parse.rl" { fmt->width = integer; } break; case 3: #line 37 "src/spss/readstat_spss_parse.rl" { fmt->decimal_places = integer; } break; case 4: #line 41 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_A; } break; case 5: #line 42 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_AHEX; } break; case 6: #line 43 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_COMMA; } break; case 7: #line 44 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DOLLAR; } break; case 8: #line 45 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_F; } break; case 9: #line 46 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_IB; } break; case 10: #line 47 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PIBHEX; } break; case 11: #line 48 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_P; } break; case 12: #line 49 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PIB; } break; case 13: #line 50 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PK; } break; case 14: #line 51 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_RB; } break; case 15: #line 52 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_RBHEX; } break; case 16: #line 53 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_Z; } break; case 17: #line 54 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_N; } break; case 18: #line 55 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_E; } break; case 19: #line 56 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DATE; } break; case 20: #line 57 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_TIME; } break; case 21: #line 58 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DATETIME; } break; 
case 22: #line 59 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DATETIME; } break; case 23: #line 60 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_ADATE; } break; case 24: #line 61 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_JDATE; } break; case 25: #line 62 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DTIME; } break; case 26: #line 63 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DTIME; } break; case 27: #line 64 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_WKDAY; } break; case 28: #line 65 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_MONTH; } break; case 29: #line 66 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_MOYR; } break; case 30: #line 67 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_QYR; } break; case 31: #line 68 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_WKYR; } break; case 32: #line 69 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_PCT; } break; case 33: #line 70 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_DOT; } break; case 34: #line 71 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCA; } break; case 35: #line 72 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCB; } break; case 36: #line 73 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCC; } break; case 37: #line 74 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCD; } break; case 38: #line 75 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_CCE; } break; case 39: #line 76 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_EDATE; } break; case 40: #line 77 "src/spss/readstat_spss_parse.rl" { fmt->type = SPSS_FORMAT_TYPE_SDATE; } break; #line 700 "src/spss/readstat_spss_parse.c" } } } _out: {} } #line 90 "src/spss/readstat_spss_parse.rl" /* suppress warning */ 
(void)spss_format_parser_en_main; if (cs < 68 || p != eof) { return READSTAT_ERROR_PARSE; } return READSTAT_OK; } haven/src/readstat/spss/readstat_sav_parse_timestamp.c0000644000176200001440000002741513227731765023070 0ustar liggesusers #line 1 "src/spss/readstat_sav_parse_timestamp.rl" #include #include "../readstat.h" #include "../readstat_iconv.h" #include "readstat_sav.h" #include "readstat_sav_parse_timestamp.h" #line 13 "src/spss/readstat_sav_parse_timestamp.c" static const char _sav_time_parse_actions[] = { 0, 1, 0, 1, 2, 1, 3, 1, 4, 2, 1, 0 }; static const char _sav_time_parse_key_offsets[] = { 0, 0, 2, 4, 5, 7, 9, 10, 12, 14 }; static const char _sav_time_parse_trans_keys[] = { 48, 57, 48, 57, 58, 48, 57, 48, 57, 58, 48, 57, 48, 57, 0 }; static const char _sav_time_parse_single_lengths[] = { 0, 0, 0, 1, 0, 0, 1, 0, 0, 0 }; static const char _sav_time_parse_range_lengths[] = { 0, 1, 1, 0, 1, 1, 0, 1, 1, 0 }; static const char _sav_time_parse_index_offsets[] = { 0, 0, 2, 4, 6, 8, 10, 12, 14, 16 }; static const char _sav_time_parse_trans_targs[] = { 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9, 0, 0, 0 }; static const char _sav_time_parse_trans_actions[] = { 9, 0, 1, 0, 3, 0, 9, 0, 1, 0, 5, 0, 9, 0, 1, 0, 0, 0 }; static const char _sav_time_parse_eof_actions[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 7 }; static const int sav_time_parse_start = 1; static const int sav_time_parse_en_main = 1; #line 12 "src/spss/readstat_sav_parse_timestamp.rl" readstat_error_t sav_parse_time(const char *data, size_t len, struct tm *timestamp, readstat_error_handler error_cb, void *user_ctx) { readstat_error_t retval = READSTAT_OK; char error_buf[8192]; const char *p = data; const char *pe = p + len; const char *eof = pe; int cs; int temp_val = 0; #line 79 "src/spss/readstat_sav_parse_timestamp.c" { cs = sav_time_parse_start; } #line 84 "src/spss/readstat_sav_parse_timestamp.c" { int _klen; unsigned int _trans; const char *_acts; unsigned int _nacts; const char *_keys; if ( p == pe ) goto 
_test_eof; if ( cs == 0 ) goto _out; _resume: _keys = _sav_time_parse_trans_keys + _sav_time_parse_key_offsets[cs]; _trans = _sav_time_parse_index_offsets[cs]; _klen = _sav_time_parse_single_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + _klen - 1; while (1) { if ( _upper < _lower ) break; _mid = _lower + ((_upper-_lower) >> 1); if ( (*p) < *_mid ) _upper = _mid - 1; else if ( (*p) > *_mid ) _lower = _mid + 1; else { _trans += (unsigned int)(_mid - _keys); goto _match; } } _keys += _klen; _trans += _klen; } _klen = _sav_time_parse_range_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + (_klen<<1) - 2; while (1) { if ( _upper < _lower ) break; _mid = _lower + (((_upper-_lower) >> 1) & ~1); if ( (*p) < _mid[0] ) _upper = _mid - 2; else if ( (*p) > _mid[1] ) _lower = _mid + 2; else { _trans += (unsigned int)((_mid - _keys)>>1); goto _match; } } _trans += _klen; } _match: cs = _sav_time_parse_trans_targs[_trans]; if ( _sav_time_parse_trans_actions[_trans] == 0 ) goto _again; _acts = _sav_time_parse_actions + _sav_time_parse_trans_actions[_trans]; _nacts = (unsigned int) *_acts++; while ( _nacts-- > 0 ) { switch ( *_acts++ ) { case 0: #line 24 "src/spss/readstat_sav_parse_timestamp.rl" { temp_val = 10 * temp_val + ((*p) - '0'); } break; case 1: #line 28 "src/spss/readstat_sav_parse_timestamp.rl" { temp_val = 0; } break; case 2: #line 30 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_hour = temp_val; } break; case 3: #line 32 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_min = temp_val; } break; #line 175 "src/spss/readstat_sav_parse_timestamp.c" } } _again: if ( cs == 0 ) goto _out; if ( ++p != pe ) goto _resume; _test_eof: {} if ( p == eof ) { const char *__acts = _sav_time_parse_actions + _sav_time_parse_eof_actions[cs]; unsigned int __nacts = (unsigned int) *__acts++; while ( __nacts-- > 0 ) { switch ( *__acts++ ) { case 4: 
#line 34 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_sec = temp_val; } break; #line 195 "src/spss/readstat_sav_parse_timestamp.c" } } } _out: {} } #line 40 "src/spss/readstat_sav_parse_timestamp.rl" if (cs < 9|| p != pe) { if (error_cb) { snprintf(error_buf, sizeof(error_buf), "Invalid time string (length=%d): %.*s", (int)len, (int)len, data); error_cb(error_buf, user_ctx); } retval = READSTAT_ERROR_BAD_TIMESTAMP; } (void)sav_time_parse_en_main; return retval; } #line 221 "src/spss/readstat_sav_parse_timestamp.c" static const char _sav_date_parse_actions[] = { 0, 1, 0, 1, 1, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9, 1, 10, 1, 11, 1, 12, 1, 13, 1, 14, 1, 15, 2, 2, 0 }; static const char _sav_date_parse_key_offsets[] = { 0, 0, 2, 4, 5, 13, 17, 18, 19, 21, 23, 24, 25, 26, 27, 29, 30, 31, 32, 34, 35, 36, 37, 41, 42, 43, 45, 46, 47, 48, 50, 52, 54, 55, 56, 58, 60, 61, 62, 63, 65, 66, 67, 68, 70, 71, 72, 73 }; static const char _sav_date_parse_trans_keys[] = { 48, 57, 48, 57, 32, 65, 68, 70, 74, 77, 78, 79, 83, 80, 85, 112, 117, 82, 32, 48, 57, 48, 57, 71, 32, 114, 103, 69, 101, 67, 32, 99, 69, 101, 66, 32, 98, 65, 85, 97, 117, 78, 32, 76, 78, 32, 32, 110, 108, 110, 65, 97, 82, 89, 32, 32, 114, 121, 79, 111, 86, 32, 118, 67, 99, 84, 32, 116, 69, 101, 80, 32, 112, 0 }; static const char _sav_date_parse_single_lengths[] = { 0, 0, 0, 1, 8, 4, 1, 1, 0, 0, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 0 }; static const char _sav_date_parse_range_lengths[] = { 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; static const unsigned char _sav_date_parse_index_offsets[] = { 0, 0, 2, 4, 6, 15, 20, 22, 24, 26, 28, 30, 32, 34, 36, 39, 41, 43, 45, 48, 50, 52, 54, 59, 61, 63, 66, 68, 70, 72, 75, 78, 81, 83, 85, 88, 91, 93, 95, 97, 100, 102, 104, 106, 109, 111, 113, 115 }; static const char 
_sav_date_parse_trans_targs[] = { 2, 0, 3, 0, 4, 0, 5, 14, 18, 22, 30, 35, 39, 43, 0, 6, 10, 12, 13, 0, 7, 0, 8, 0, 9, 0, 47, 0, 11, 0, 8, 0, 7, 0, 11, 0, 15, 17, 0, 16, 0, 8, 0, 16, 0, 19, 21, 0, 20, 0, 8, 0, 20, 0, 23, 25, 28, 29, 0, 24, 0, 8, 0, 26, 27, 0, 8, 0, 8, 0, 24, 0, 26, 27, 0, 31, 34, 0, 32, 33, 0, 8, 0, 8, 0, 32, 33, 0, 36, 38, 0, 37, 0, 8, 0, 37, 0, 40, 42, 0, 41, 0, 8, 0, 41, 0, 44, 46, 0, 45, 0, 8, 0, 45, 0, 0, 0 }; static const char _sav_date_parse_trans_actions[] = { 31, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 0, 31, 0, 1, 0, 0, 0, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 19, 0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 0, 0, 0, 0, 0, 0, 0, 0, 25, 0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0, 0, 0, 0 }; static const char _sav_date_parse_eof_actions[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3 }; static const int sav_date_parse_start = 1; static const int sav_date_parse_en_main = 1; #line 59 "src/spss/readstat_sav_parse_timestamp.rl" readstat_error_t sav_parse_date(const char *data, size_t len, struct tm *timestamp, readstat_error_handler error_cb, void *user_ctx) { readstat_error_t retval = READSTAT_OK; char error_buf[8192]; const char *p = data; const char *pe = p + len; const char *eof = pe; int cs; int temp_val = 0; #line 342 "src/spss/readstat_sav_parse_timestamp.c" { cs = sav_date_parse_start; } #line 347 "src/spss/readstat_sav_parse_timestamp.c" { int _klen; unsigned int _trans; const char *_acts; unsigned int _nacts; const char *_keys; if ( p == pe ) goto _test_eof; if ( cs == 0 ) goto _out; _resume: _keys = _sav_date_parse_trans_keys + _sav_date_parse_key_offsets[cs]; _trans = _sav_date_parse_index_offsets[cs]; _klen = _sav_date_parse_single_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char 
*_mid; const char *_upper = _keys + _klen - 1; while (1) { if ( _upper < _lower ) break; _mid = _lower + ((_upper-_lower) >> 1); if ( (*p) < *_mid ) _upper = _mid - 1; else if ( (*p) > *_mid ) _lower = _mid + 1; else { _trans += (unsigned int)(_mid - _keys); goto _match; } } _keys += _klen; _trans += _klen; } _klen = _sav_date_parse_range_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + (_klen<<1) - 2; while (1) { if ( _upper < _lower ) break; _mid = _lower + (((_upper-_lower) >> 1) & ~1); if ( (*p) < _mid[0] ) _upper = _mid - 2; else if ( (*p) > _mid[1] ) _lower = _mid + 2; else { _trans += (unsigned int)((_mid - _keys)>>1); goto _match; } } _trans += _klen; } _match: cs = _sav_date_parse_trans_targs[_trans]; if ( _sav_date_parse_trans_actions[_trans] == 0 ) goto _again; _acts = _sav_date_parse_actions + _sav_date_parse_trans_actions[_trans]; _nacts = (unsigned int) *_acts++; while ( _nacts-- > 0 ) { switch ( *_acts++ ) { case 0: #line 71 "src/spss/readstat_sav_parse_timestamp.rl" { temp_val = 10 * temp_val + ((*p) - '0'); } break; case 2: #line 83 "src/spss/readstat_sav_parse_timestamp.rl" { temp_val = 0; } break; case 3: #line 85 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mday = temp_val; } break; case 4: #line 90 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 0; } break; case 5: #line 91 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 1; } break; case 6: #line 92 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 2; } break; case 7: #line 93 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 3; } break; case 8: #line 94 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 4; } break; case 9: #line 95 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 5; } break; case 10: #line 96 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 6; } break; case 11: #line 97 
"src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 7; } break; case 12: #line 98 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 8; } break; case 13: #line 99 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 9; } break; case 14: #line 100 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 10; } break; case 15: #line 101 "src/spss/readstat_sav_parse_timestamp.rl" { timestamp->tm_mon = 11; } break; #line 482 "src/spss/readstat_sav_parse_timestamp.c" } } _again: if ( cs == 0 ) goto _out; if ( ++p != pe ) goto _resume; _test_eof: {} if ( p == eof ) { const char *__acts = _sav_date_parse_actions + _sav_date_parse_eof_actions[cs]; unsigned int __nacts = (unsigned int) *__acts++; while ( __nacts-- > 0 ) { switch ( *__acts++ ) { case 1: #line 75 "src/spss/readstat_sav_parse_timestamp.rl" { if (temp_val < 70) { timestamp->tm_year = 100 + temp_val; } else { timestamp->tm_year = temp_val; } } break; #line 508 "src/spss/readstat_sav_parse_timestamp.c" } } } _out: {} } #line 107 "src/spss/readstat_sav_parse_timestamp.rl" if (cs < 47|| p != pe) { if (error_cb) { snprintf(error_buf, sizeof(error_buf), "Invalid date string (length=%d): %.*s", (int)len, (int)len, data); error_cb(error_buf, user_ctx); } retval = READSTAT_ERROR_BAD_TIMESTAMP; } (void)sav_date_parse_en_main; return retval; } haven/src/readstat/spss/readstat_por.h0000644000176200001440000000320713227731765017620 0ustar liggesusers extern int8_t por_ascii_lookup[256]; extern uint16_t por_unicode_lookup[256]; typedef struct por_ctx_s { readstat_info_handler info_handler; readstat_metadata_handler metadata_handler; readstat_variable_handler variable_handler; readstat_note_handler note_handler; readstat_fweight_handler fweight_handler; readstat_value_handler value_handler; readstat_value_label_handler value_label_handler; readstat_error_handler error_handler; readstat_progress_handler progress_handler; size_t file_size; void *user_ctx; int pos; readstat_io_t *io; 
char space; long num_spaces; time_t timestamp; long version; char fweight_name[9]; uint16_t byte2unicode[256]; size_t base30_precision; iconv_t converter; unsigned char *string_buffer; size_t string_buffer_len; int labels_offset; int obs_count; int var_count; int var_offset; int row_limit; readstat_variable_t **variables; spss_varinfo_t *varinfo; ck_hash_table_t *var_dict; } por_ctx_t; por_ctx_t *por_ctx_init(); void por_ctx_free(por_ctx_t *ctx); ssize_t por_utf8_encode(const unsigned char *input, size_t input_len, char *output, size_t output_len, uint16_t lookup[256]); ssize_t por_utf8_decode( const char *input, size_t input_len, char *output, size_t output_len, uint8_t *lookup, size_t lookup_len); haven/src/readstat/spss/readstat_por_write.c0000644000176200001440000005667213227731765021043 0ustar liggesusers #include #include #include #include #include #include "../readstat.h" #include "../CKHashTable.h" #include "../readstat_writer.h" #include "readstat_spss.h" #include "readstat_por.h" #define POR_BASE30_PRECISION 50 typedef struct por_write_ctx_s { unsigned char *unicode2byte; size_t unicode2byte_len; } por_write_ctx_t; static readstat_error_t por_finish(readstat_writer_t *writer) { return readstat_write_line_padding(writer, 'Z', 80, "\r\n"); } static readstat_error_t por_write_bytes(readstat_writer_t *writer, const void *bytes, size_t len) { return readstat_write_bytes_as_lines(writer, bytes, len, 80, "\r\n"); } static readstat_error_t por_write_string_n(readstat_writer_t *writer, por_write_ctx_t *ctx, const char *string, size_t input_len) { char error_buf[1024]; readstat_error_t retval = READSTAT_OK; char *por_string = malloc(input_len); ssize_t output_len = por_utf8_decode(string, input_len, por_string, input_len, ctx->unicode2byte, ctx->unicode2byte_len); if (output_len == -1) { if (writer->error_handler) { snprintf(error_buf, sizeof(error_buf), "Error converting string (length=%" PRId64 "): %.*s", (int64_t)input_len, (int)input_len, string); 
writer->error_handler(error_buf, writer->user_ctx); } retval = READSTAT_ERROR_CONVERT; goto cleanup; } retval = por_write_bytes(writer, por_string, output_len); cleanup: if (por_string) free(por_string); return retval; } static readstat_error_t por_write_tag(readstat_writer_t *writer, por_write_ctx_t *ctx, char tag) { char string[2]; string[0] = tag; string[1] = '\0'; return por_write_string_n(writer, ctx, string, 1); } static ssize_t por_write_double_to_buffer(char *string, size_t buffer_len, double value, long precision) { int offset = 0; if (isnan(value)) { string[offset++] = '*'; string[offset++] = '.'; } else if (isinf(value)) { if (value < 0.0) { string[offset++] = '-'; } string[offset++] = '1'; string[offset++] = '+'; string[offset++] = 'T'; string[offset++] = 'T'; string[offset++] = '/'; } else { long integers_printed = 0; double integer_part; double fraction = modf(fabs(value), &integer_part); int64_t integer = integer_part; if (value < 0.0) { string[offset++] = '-'; } if (integer == 0) { string[offset++] = '0'; } else { int start = offset; int end = offset; while (integer) { int64_t remainder = integer % 30; if (remainder < 0) { return -1; } else if (remainder < 10) { string[offset++] = '0' + remainder; } else { string[offset++] = 'A' + (remainder - 10); } integer /= 30; integers_printed++; } end = offset; offset--; while (offset > start) { char tmp = string[start]; string[start] = string[offset]; string[offset] = tmp; offset--; start++; } offset = end; } /* should use exponents for efficiency, but this works */ if (fraction) { string[offset++] = '.'; } while (fraction && integers_printed < precision) { fraction = modf(fraction * 30, &integer_part); integer = integer_part; if (integer < 0) { return -1; } else if (integer < 10) { string[offset++] = '0' + integer; } else { string[offset++] = 'A' + (integer - 10); } integers_printed++; } string[offset++] = '/'; } string[offset] = '\0'; return offset; } static readstat_error_t 
por_write_double(readstat_writer_t *writer, por_write_ctx_t *ctx, double value) { char error_buf[1024]; char string[256]; ssize_t bytes_written = por_write_double_to_buffer(string, sizeof(string), value, POR_BASE30_PRECISION); if (bytes_written == -1) { if (writer->error_handler) { snprintf(error_buf, sizeof(error_buf), "Unable to encode number: %lf", value); writer->error_handler(error_buf, writer->user_ctx); } return READSTAT_ERROR_WRITE; } return por_write_string_n(writer, ctx, string, bytes_written); } static readstat_error_t por_write_string_field_n(readstat_writer_t *writer, por_write_ctx_t *ctx, const char *string, size_t len) { readstat_error_t error = por_write_double(writer, ctx, len); if (error != READSTAT_OK) return error; return por_write_string_n(writer, ctx, string, len); } static readstat_error_t por_write_string_field(readstat_writer_t *writer, por_write_ctx_t *ctx, const char *string) { return por_write_string_field_n(writer, ctx, string, strlen(string)); } static por_write_ctx_t *por_write_ctx_init() { por_write_ctx_t *ctx = calloc(1, sizeof(por_write_ctx_t)); uint16_t max_unicode = 0; int i; for (i=0; i max_unicode) max_unicode = por_unicode_lookup[i]; } ctx->unicode2byte = malloc(max_unicode+1); ctx->unicode2byte_len = max_unicode+1; for (i=0; iunicode2byte[por_unicode_lookup[i]] = por_ascii_lookup[i]; } if (por_ascii_lookup[i]) { ctx->unicode2byte[por_ascii_lookup[i]] = por_ascii_lookup[i]; } } return ctx; } static void por_write_ctx_free(por_write_ctx_t *ctx) { if (ctx->unicode2byte) free(ctx->unicode2byte); free(ctx); } static readstat_error_t por_emit_header(readstat_writer_t *writer, por_write_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; size_t file_label_len = strlen(writer->file_label); char vanity[5][40]; memset(vanity, '0', sizeof(vanity)); strncpy(vanity[1], "ASCII SPSS PORT FILE", 20); strncpy(vanity[1] + 20, writer->file_label, 20); if (file_label_len < 20) memset(vanity[1] + 20 + file_label_len, ' ', 20 - file_label_len); 
por_write_bytes(writer, vanity, sizeof(vanity)); char lookup[256]; int i; memset(lookup, '0', sizeof(lookup)); for (i=0; itimestamp); if ((retval = por_write_tag(writer, ctx, 'A')) != READSTAT_OK) goto cleanup; char date[9]; snprintf(date, sizeof(date), "%04d%02d%02d", timestamp->tm_year + 1900, timestamp->tm_mon + 1, timestamp->tm_mday); if ((retval = por_write_string_field(writer, ctx, date)) != READSTAT_OK) goto cleanup; char time[7]; snprintf(time, sizeof(time), "%02d%02d%02d", timestamp->tm_hour, timestamp->tm_min, timestamp->tm_sec); if ((retval = por_write_string_field(writer, ctx, time)) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t por_emit_identification_records(readstat_writer_t *writer, por_write_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; if ((retval = por_write_tag(writer, ctx, '1')) != READSTAT_OK) goto cleanup; if ((retval = por_write_string_field(writer, ctx, READSTAT_PRODUCT_NAME)) != READSTAT_OK) goto cleanup; if ((retval = por_write_tag(writer, ctx, '3')) != READSTAT_OK) goto cleanup; if ((retval = por_write_string_field(writer, ctx, READSTAT_PRODUCT_URL)) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t por_emit_variable_count_record(readstat_writer_t *writer, por_write_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; if ((retval = por_write_tag(writer, ctx, '4')) != READSTAT_OK) goto cleanup; if ((retval = por_write_double(writer, ctx, writer->variables_count)) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t por_emit_precision_record(readstat_writer_t *writer, por_write_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; if ((retval = por_write_tag(writer, ctx, '5')) != READSTAT_OK) goto cleanup; if ((retval = por_write_double(writer, ctx, POR_BASE30_PRECISION)) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t por_emit_case_weight_variable_record(readstat_writer_t *writer, por_write_ctx_t *ctx) { if 
(!writer->fweight_variable) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; if ((retval = por_write_tag(writer, ctx, '6')) != READSTAT_OK) goto cleanup; if ((retval = por_write_string_field(writer, ctx, readstat_variable_get_name(writer->fweight_variable))) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t por_emit_format(readstat_writer_t *writer, por_write_ctx_t *ctx, spss_format_t *format) { readstat_error_t error = READSTAT_OK; if ((error = por_write_double(writer, ctx, format->type)) != READSTAT_OK) goto cleanup; if ((error = por_write_double(writer, ctx, format->width)) != READSTAT_OK) goto cleanup; if ((error = por_write_double(writer, ctx, format->decimal_places)) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t validate_variable_name(const char *name) { size_t len = strlen(name); if (len < 1 || len > 8) return READSTAT_ERROR_NAME_IS_TOO_LONG; int i; for (i=0; name[i]; i++) { if (name[i] >= 'A' && name[i] <= 'Z') continue; if (name[i] >= '0' && name[i] <= '9') continue; if (name[i] == '@' || name[i] == '#' || name[i] == '$') continue; if (name[i] == '_' || name[i] == '.') continue; return READSTAT_ERROR_NAME_CONTAINS_ILLEGAL_CHARACTER; } if (!(name[0] >= 'A' && name[0] <= 'Z') && name[0] != '@') return READSTAT_ERROR_NAME_BEGINS_WITH_ILLEGAL_CHARACTER; return READSTAT_OK; } static readstat_error_t por_emit_variable_label_record(readstat_writer_t *writer, por_write_ctx_t *ctx, readstat_variable_t *r_variable) { const char *label = readstat_variable_get_label(r_variable); readstat_error_t retval = READSTAT_OK; if (!label) return READSTAT_OK; if ((retval = por_write_tag(writer, ctx, 'C')) != READSTAT_OK) goto cleanup; if ((retval = por_write_string_field(writer, ctx, label)) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t por_emit_missing_values_records(readstat_writer_t *writer, por_write_ctx_t *ctx, readstat_variable_t *r_variable) { readstat_error_t retval 
= READSTAT_OK; int n_missing_values = 0; int n_missing_ranges = readstat_variable_get_missing_ranges_count(r_variable); /* ranges */ int j; for (j=0; j 3) retval = READSTAT_ERROR_TOO_MANY_MISSING_VALUE_DEFINITIONS; cleanup: return retval; } static readstat_error_t por_emit_variable_records(readstat_writer_t *writer, por_write_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int i; for (i=0; i<writer->variables_count; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); const char *variable_name = readstat_variable_get_name(r_variable); spss_format_t print_format; if ((retval = por_write_tag(writer, ctx, '7')) != READSTAT_OK) break; retval = por_write_double(writer, ctx, (r_variable->type == READSTAT_TYPE_STRING) ? r_variable->user_width : 0); if (retval != READSTAT_OK) break; if ((retval = por_write_string_field(writer, ctx, variable_name)) != READSTAT_OK) break; if ((retval = spss_format_for_variable(r_variable, &print_format)) != READSTAT_OK) break; if ((retval = por_emit_format(writer, ctx, &print_format)) != READSTAT_OK) break; if ((retval = por_emit_format(writer, ctx, &print_format)) != READSTAT_OK) break; if ((retval = por_emit_missing_values_records(writer, ctx, r_variable)) != READSTAT_OK) break; if ((retval = por_emit_variable_label_record(writer, ctx, r_variable)) != READSTAT_OK) break; } return retval; } static readstat_error_t por_emit_value_label_records(readstat_writer_t *writer, por_write_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int i, j; for (i=0; i<writer->label_sets_count; i++) { readstat_label_set_t *r_label_set = readstat_get_label_set(writer, i); readstat_type_t user_type = r_label_set->type; if (r_label_set->value_labels_count == 0 || r_label_set->variables_count == 0) continue; if ((retval = por_write_tag(writer, ctx, 'D')) != READSTAT_OK) goto cleanup; if ((retval = por_write_double(writer, ctx, r_label_set->variables_count)) != READSTAT_OK) goto cleanup; for (j=0; j<r_label_set->variables_count; j++) { readstat_variable_t *r_variable 
= readstat_get_label_set_variable(r_label_set, j); if ((retval = por_write_string_field(writer, ctx, readstat_variable_get_name(r_variable))) != READSTAT_OK) goto cleanup; } if ((retval = por_write_double(writer, ctx, r_label_set->value_labels_count)) != READSTAT_OK) goto cleanup; for (j=0; j<r_label_set->value_labels_count; j++) { readstat_value_label_t *r_value_label = readstat_get_value_label(r_label_set, j); if (user_type == READSTAT_TYPE_STRING) { retval = por_write_string_field_n(writer, ctx, r_value_label->string_key, r_value_label->string_key_len); } else if (user_type == READSTAT_TYPE_DOUBLE) { retval = por_write_double(writer, ctx, r_value_label->double_key); } else if (user_type == READSTAT_TYPE_INT32) { retval = por_write_double(writer, ctx, r_value_label->int32_key); } if (retval != READSTAT_OK) goto cleanup; if ((retval = por_write_string_field_n(writer, ctx, r_value_label->label, r_value_label->label_len)) != READSTAT_OK) goto cleanup; } } cleanup: return retval; } static readstat_error_t por_emit_document_record(readstat_writer_t *writer, por_write_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; if ((retval = por_write_tag(writer, ctx, 'E')) != READSTAT_OK) goto cleanup; if ((retval = por_write_double(writer, ctx, writer->notes_count)) != READSTAT_OK) goto cleanup; int i; for (i=0; i<writer->notes_count; i++) { size_t len = strlen(writer->notes[i]); if (len > SPSS_DOC_LINE_SIZE) { retval = READSTAT_ERROR_NOTE_IS_TOO_LONG; goto cleanup; } if ((retval = por_write_string_field_n(writer, ctx, writer->notes[i], len)) != READSTAT_OK) goto cleanup; } cleanup: return retval; } static readstat_error_t por_emit_data_tag(readstat_writer_t *writer, por_write_ctx_t *ctx) { return por_write_tag(writer, ctx, 'F'); } static readstat_error_t por_begin_data(void *writer_ctx) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; por_write_ctx_t *ctx = por_write_ctx_init(); readstat_error_t retval = READSTAT_OK; if ((retval = por_emit_header(writer, ctx)) != READSTAT_OK) goto 
cleanup; if ((retval = por_emit_version_and_timestamp(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_identification_records(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_variable_count_record(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_precision_record(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_case_weight_variable_record(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_variable_records(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_value_label_records(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_document_record(writer, ctx)) != READSTAT_OK) goto cleanup; if ((retval = por_emit_data_tag(writer, ctx)) != READSTAT_OK) goto cleanup; cleanup: if (retval != READSTAT_OK) { por_write_ctx_free(ctx); } else { writer->module_ctx = ctx; } return retval; } static readstat_error_t por_end_data(void *writer_ctx) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; readstat_error_t error = READSTAT_OK; if ((error = por_write_tag(writer, writer->module_ctx, 'Z')) != READSTAT_OK) goto cleanup; if ((error = por_finish(writer)) != READSTAT_OK) goto cleanup; cleanup: por_write_ctx_free(writer->module_ctx); return error; } static size_t por_variable_width(readstat_type_t type, size_t user_width) { if (type == READSTAT_TYPE_STRING) { return POR_BASE30_PRECISION + 4 + user_width; } return POR_BASE30_PRECISION + 4; // minus sign + period + plus/minus + slash } static readstat_error_t por_variable_ok(readstat_variable_t *variable) { return validate_variable_name(readstat_variable_get_name(variable)); } static readstat_error_t por_write_double_value(void *row, const readstat_variable_t *var, double value) { if (por_write_double_to_buffer(row, POR_BASE30_PRECISION + 4, value, POR_BASE30_PRECISION) == -1) { return READSTAT_ERROR_WRITE; } return READSTAT_OK; } static readstat_error_t por_write_int8_value(void *row, const 
readstat_variable_t *var, int8_t value) { return por_write_double_value(row, var, value); } static readstat_error_t por_write_int16_value(void *row, const readstat_variable_t *var, int16_t value) { return por_write_double_value(row, var, value); } static readstat_error_t por_write_int32_value(void *row, const readstat_variable_t *var, int32_t value) { return por_write_double_value(row, var, value); } static readstat_error_t por_write_float_value(void *row, const readstat_variable_t *var, float value) { return por_write_double_value(row, var, value); } static readstat_error_t por_write_missing_number(void *row, const readstat_variable_t *var) { return por_write_double_value(row, var, NAN); } static readstat_error_t por_write_missing_string(void *row, const readstat_variable_t *var) { return por_write_double_value(row, var, 0); } static readstat_error_t por_write_string_value(void *row, const readstat_variable_t *var, const char *string) { size_t len = strlen(string); if (len == 0) { string = " "; len = 1; } size_t storage_width = readstat_variable_get_storage_width(var); if (len > storage_width) { len = storage_width; } ssize_t bytes_written = por_write_double_to_buffer(row, POR_BASE30_PRECISION + 4, len, POR_BASE30_PRECISION); if (bytes_written == -1) { return READSTAT_ERROR_WRITE; } strncpy(((char *)row) + bytes_written, string, len); return READSTAT_OK; } static readstat_error_t por_write_row(void *writer_ctx, void *row, size_t row_len) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; char *row_chars = (char *)row; int offset = 0, output = 0; for (offset=0; offsetmodule_ctx, row_chars, output); } readstat_error_t readstat_begin_writing_por(readstat_writer_t *writer, void *user_ctx, long row_count) { if (writer->compression != READSTAT_COMPRESS_NONE) return READSTAT_ERROR_UNSUPPORTED_COMPRESSION; writer->callbacks.variable_width = &por_variable_width; writer->callbacks.variable_ok = &por_variable_ok; writer->callbacks.write_int8 = 
&por_write_int8_value; writer->callbacks.write_int16 = &por_write_int16_value; writer->callbacks.write_int32 = &por_write_int32_value; writer->callbacks.write_float = &por_write_float_value; writer->callbacks.write_double = &por_write_double_value; writer->callbacks.write_string = &por_write_string_value; writer->callbacks.write_missing_string = &por_write_missing_string; writer->callbacks.write_missing_number = &por_write_missing_number; writer->callbacks.begin_data = &por_begin_data; writer->callbacks.write_row = &por_write_row; writer->callbacks.end_data = &por_end_data; return readstat_begin_writing_file(writer, user_ctx, row_count); } haven/src/readstat/readstat_variable.c0000644000176200001440000000667613227731765017625 0ustar liggesusers #include #include "readstat.h" static readstat_value_t make_blank_value(); static readstat_value_t make_double_value(double dval); static readstat_value_t make_blank_value() { readstat_value_t value = { .is_system_missing = 1, .v = { .double_value = NAN }, .type = READSTAT_TYPE_DOUBLE }; return value; } static readstat_value_t make_double_value(double dval) { readstat_value_t value = { .v = { .double_value = dval }, .type = READSTAT_TYPE_DOUBLE }; return value; } const char *readstat_variable_get_name(const readstat_variable_t *variable) { if (variable->name[0]) return variable->name; return NULL; } const char *readstat_variable_get_label(const readstat_variable_t *variable) { if (variable->label[0]) return variable->label; return NULL; } const char *readstat_variable_get_format(const readstat_variable_t *variable) { if (variable->format[0]) return variable->format; return NULL; } readstat_type_t readstat_variable_get_type(const readstat_variable_t *variable) { return variable->type; } readstat_type_class_t readstat_variable_get_type_class(const readstat_variable_t *variable) { return readstat_type_class(variable->type); } int readstat_variable_get_index(const readstat_variable_t *variable) { return variable->index; } int 
readstat_variable_get_index_after_skipping(const readstat_variable_t *variable) { return variable->index_after_skipping; } size_t readstat_variable_get_storage_width(const readstat_variable_t *variable) { return variable->storage_width; } readstat_measure_t readstat_variable_get_measure(const readstat_variable_t *variable) { return variable->measure; } readstat_alignment_t readstat_variable_get_alignment(const readstat_variable_t *variable) { return variable->alignment; } int readstat_variable_get_display_width(const readstat_variable_t *variable) { return variable->display_width; } int readstat_variable_get_missing_ranges_count(const readstat_variable_t *variable) { return variable->missingness.missing_ranges_count; } readstat_value_t readstat_variable_get_missing_range_lo(const readstat_variable_t *variable, int i) { if (i < variable->missingness.missing_ranges_count && 2*i+1 < sizeof(variable->missingness.missing_ranges)/sizeof(variable->missingness.missing_ranges[0])) { return variable->missingness.missing_ranges[2*i]; } return make_blank_value(); } readstat_value_t readstat_variable_get_missing_range_hi(const readstat_variable_t *variable, int i) { if (i < variable->missingness.missing_ranges_count && 2*i+1 < sizeof(variable->missingness.missing_ranges)/sizeof(variable->missingness.missing_ranges[0])) { return variable->missingness.missing_ranges[2*i+1]; } return make_blank_value(); } void readstat_variable_add_missing_double_value(readstat_variable_t *variable, double value) { readstat_variable_add_missing_double_range(variable, value, value); } void readstat_variable_add_missing_double_range(readstat_variable_t *variable, double lo, double hi) { int i = readstat_variable_get_missing_ranges_count(variable); if (2*i < sizeof(variable->missingness.missing_ranges)/sizeof(variable->missingness.missing_ranges[0])) { variable->missingness.missing_ranges[2*i] = make_double_value(lo); variable->missingness.missing_ranges[2*i+1] = make_double_value(hi); 
variable->missingness.missing_ranges_count++; } } haven/src/readstat/readstat_writer.c0000644000176200001440000005600313227731765017341 0ustar liggesusers #include #include #include "readstat.h" #include "readstat_writer.h" #define VARIABLES_INITIAL_CAPACITY 50 #define LABEL_SETS_INITIAL_CAPACITY 50 #define NOTES_INITIAL_CAPACITY 50 #define VALUE_LABELS_INITIAL_CAPACITY 10 #define STRING_REFS_INITIAL_CAPACITY 100 #define LABEL_SET_VARIABLES_INITIAL_CAPACITY 2 static readstat_error_t readstat_write_row_default_callback(void *writer_ctx, void *bytes, size_t len) { return readstat_write_bytes((readstat_writer_t *)writer_ctx, bytes, len); } static int readstat_compare_string_refs(const void *elem1, const void *elem2) { readstat_string_ref_t *ref1 = *(readstat_string_ref_t **)elem1; readstat_string_ref_t *ref2 = *(readstat_string_ref_t **)elem2; if (ref1->first_v == ref2->first_v) return ref1->first_o - ref2->first_o; return ref1->first_v - ref2->first_v; } readstat_string_ref_t *readstat_string_ref_init(const char *string) { size_t len = strlen(string) + 1; readstat_string_ref_t *ref = calloc(1, sizeof(readstat_string_ref_t) + len); ref->first_o = -1; ref->first_v = -1; ref->len = len; memcpy(&ref->data[0], string, len); return ref; } readstat_writer_t *readstat_writer_init() { readstat_writer_t *writer = calloc(1, sizeof(readstat_writer_t)); writer->variables = calloc(VARIABLES_INITIAL_CAPACITY, sizeof(readstat_variable_t *)); writer->variables_capacity = VARIABLES_INITIAL_CAPACITY; writer->label_sets = calloc(LABEL_SETS_INITIAL_CAPACITY, sizeof(readstat_label_set_t *)); writer->label_sets_capacity = LABEL_SETS_INITIAL_CAPACITY; writer->notes = calloc(NOTES_INITIAL_CAPACITY, sizeof(char *)); writer->notes_capacity = NOTES_INITIAL_CAPACITY; writer->string_refs = calloc(STRING_REFS_INITIAL_CAPACITY, sizeof(readstat_string_ref_t *)); writer->string_refs_capacity = STRING_REFS_INITIAL_CAPACITY; writer->timestamp = time(NULL); writer->is_64bit = 1; 
writer->callbacks.write_row = &readstat_write_row_default_callback; return writer; } static void readstat_variable_free(readstat_variable_t *variable) { free(variable); } static void readstat_label_set_free(readstat_label_set_t *label_set) { int i; for (i=0; i<label_set->value_labels_count; i++) { readstat_value_label_t *value_label = readstat_get_value_label(label_set, i); if (value_label->label) free(value_label->label); if (value_label->string_key) free(value_label->string_key); } free(label_set->value_labels); free(label_set->variables); free(label_set); } static void readstat_copy_label(readstat_value_label_t *value_label, const char *label) { if (label && strlen(label)) { value_label->label_len = strlen(label); value_label->label = malloc(value_label->label_len); strncpy(value_label->label, label, value_label->label_len); } } static readstat_value_label_t *readstat_add_value_label(readstat_label_set_t *label_set, const char *label) { if (label_set->value_labels_count == label_set->value_labels_capacity) { label_set->value_labels_capacity *= 2; label_set->value_labels = realloc(label_set->value_labels, label_set->value_labels_capacity * sizeof(readstat_value_label_t)); } readstat_value_label_t *new_value_label = &label_set->value_labels[label_set->value_labels_count++]; memset(new_value_label, 0, sizeof(readstat_value_label_t)); readstat_copy_label(new_value_label, label); return new_value_label; } static readstat_error_t readstat_begin_writing_data(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; size_t row_len = 0; int i; for (i=0; i<writer->variables_count; i++) { readstat_variable_t *variable = readstat_get_variable(writer, i); variable->storage_width = writer->callbacks.variable_width(variable->type, variable->user_width); variable->offset = row_len; row_len += variable->storage_width; } if (writer->callbacks.variable_ok) { for (i=0; i<writer->variables_count; i++) { readstat_variable_t *variable = readstat_get_variable(writer, i); retval = 
writer->callbacks.variable_ok(variable); if (retval != READSTAT_OK) goto cleanup; } } if (writer->callbacks.begin_data) { retval = writer->callbacks.begin_data(writer); } writer->row_len = row_len; writer->row = malloc(writer->row_len); cleanup: return retval; } void readstat_writer_free(readstat_writer_t *writer) { int i; if (writer) { if (writer->callbacks.module_ctx_free && writer->module_ctx) { writer->callbacks.module_ctx_free(writer->module_ctx); } if (writer->variables) { for (i=0; i<writer->variables_count; i++) { readstat_variable_free(writer->variables[i]); } free(writer->variables); } if (writer->label_sets) { for (i=0; i<writer->label_sets_count; i++) { readstat_label_set_free(writer->label_sets[i]); } free(writer->label_sets); } if (writer->notes) { for (i=0; i<writer->notes_count; i++) { free(writer->notes[i]); } free(writer->notes); } if (writer->string_refs) { for (i=0; i<writer->string_refs_count; i++) { free(writer->string_refs[i]); } free(writer->string_refs); } if (writer->row) { free(writer->row); } free(writer); } } readstat_error_t readstat_set_data_writer(readstat_writer_t *writer, readstat_data_writer data_writer) { writer->data_writer = data_writer; return READSTAT_OK; } readstat_error_t readstat_write_bytes(readstat_writer_t *writer, const void *bytes, size_t len) { size_t bytes_written = writer->data_writer(bytes, len, writer->user_ctx); if (bytes_written < len) { return READSTAT_ERROR_WRITE; } writer->bytes_written += bytes_written; return READSTAT_OK; } readstat_error_t readstat_write_bytes_as_lines(readstat_writer_t *writer, const void *bytes, size_t len, size_t line_len, const char *line_sep) { size_t line_sep_len = strlen(line_sep); readstat_error_t retval = READSTAT_OK; size_t bytes_written = 0; while (bytes_written < len) { ssize_t bytes_left_in_line = line_len - (writer->bytes_written % (line_len + line_sep_len)); if (len - bytes_written < bytes_left_in_line) { retval = readstat_write_bytes(writer, ((const char *)bytes) + bytes_written, len - bytes_written); 
bytes_written = len; } else { retval = readstat_write_bytes(writer, ((const char *)bytes) + bytes_written, bytes_left_in_line); bytes_written += bytes_left_in_line; } if (retval != READSTAT_OK) break; if (writer->bytes_written % (line_len + line_sep_len) == line_len) { if ((retval = readstat_write_bytes(writer, line_sep, line_sep_len)) != READSTAT_OK) break; } } return retval; } readstat_error_t readstat_write_line_padding(readstat_writer_t *writer, char pad, size_t line_len, const char *line_sep) { size_t line_sep_len = strlen(line_sep); if (writer->bytes_written % (line_len + line_sep_len) == 0) return READSTAT_OK; readstat_error_t error = READSTAT_OK; ssize_t bytes_left_in_line = line_len - (writer->bytes_written % (line_len + line_sep_len)); char *bytes = malloc(bytes_left_in_line); memset(bytes, pad, bytes_left_in_line); if ((error = readstat_write_bytes(writer, bytes, bytes_left_in_line)) != READSTAT_OK) goto cleanup; if ((error = readstat_write_bytes(writer, line_sep, line_sep_len)) != READSTAT_OK) goto cleanup; cleanup: if (bytes) free(bytes); /* Propagate any write failure instead of unconditionally reporting success */ return error; } readstat_error_t readstat_write_string(readstat_writer_t *writer, const char *bytes) { return readstat_write_bytes(writer, bytes, strlen(bytes)); } static readstat_error_t readstat_write_repeated_byte(readstat_writer_t *writer, char byte, size_t len) { if (len == 0) return READSTAT_OK; char zeros[len]; memset(zeros, byte, len); return readstat_write_bytes(writer, zeros, len); } readstat_error_t readstat_write_zeros(readstat_writer_t *writer, size_t len) { return readstat_write_repeated_byte(writer, '\0', len); } readstat_error_t readstat_write_spaces(readstat_writer_t *writer, size_t len) { return readstat_write_repeated_byte(writer, ' ', len); } readstat_label_set_t *readstat_add_label_set(readstat_writer_t *writer, readstat_type_t type, const char *name) { if (writer->label_sets_count == writer->label_sets_capacity) { writer->label_sets_capacity *= 2; writer->label_sets = 
realloc(writer->label_sets, writer->label_sets_capacity * sizeof(readstat_label_set_t *)); } readstat_label_set_t *new_label_set = calloc(1, sizeof(readstat_label_set_t)); writer->label_sets[writer->label_sets_count++] = new_label_set; new_label_set->type = type; strncpy(new_label_set->name, name, sizeof(new_label_set->name)); new_label_set->value_labels = calloc(VALUE_LABELS_INITIAL_CAPACITY, sizeof(readstat_value_label_t)); new_label_set->value_labels_capacity = VALUE_LABELS_INITIAL_CAPACITY; new_label_set->variables = calloc(LABEL_SET_VARIABLES_INITIAL_CAPACITY, sizeof(readstat_variable_t *)); new_label_set->variables_capacity = LABEL_SET_VARIABLES_INITIAL_CAPACITY; return new_label_set; } readstat_label_set_t *readstat_get_label_set(readstat_writer_t *writer, int index) { if (index < writer->label_sets_count) { return writer->label_sets[index]; } return NULL; } void readstat_sort_label_set(readstat_label_set_t *label_set, int (*compare)(const readstat_value_label_t *, const readstat_value_label_t *)) { qsort(label_set->value_labels, label_set->value_labels_count, sizeof(readstat_value_label_t), (int (*)(const void *, const void *))compare); } readstat_value_label_t *readstat_get_value_label(readstat_label_set_t *label_set, int index) { if (index < label_set->value_labels_count) { return &label_set->value_labels[index]; } return NULL; } readstat_variable_t *readstat_get_label_set_variable(readstat_label_set_t *label_set, int index) { if (index < label_set->variables_count) { return ((readstat_variable_t **)label_set->variables)[index]; } return NULL; } void readstat_label_double_value(readstat_label_set_t *label_set, double value, const char *label) { readstat_value_label_t *new_value_label = readstat_add_value_label(label_set, label); new_value_label->double_key = value; new_value_label->int32_key = value; } void readstat_label_int32_value(readstat_label_set_t *label_set, int32_t value, const char *label) { readstat_value_label_t *new_value_label = 
readstat_add_value_label(label_set, label); new_value_label->double_key = value; new_value_label->int32_key = value; } void readstat_label_string_value(readstat_label_set_t *label_set, const char *value, const char *label) { readstat_value_label_t *new_value_label = readstat_add_value_label(label_set, label); if (value && strlen(value)) { new_value_label->string_key_len = strlen(value); new_value_label->string_key = malloc(new_value_label->string_key_len); strncpy(new_value_label->string_key, value, new_value_label->string_key_len); } } void readstat_label_tagged_value(readstat_label_set_t *label_set, char tag, const char *label) { readstat_value_label_t *new_value_label = readstat_add_value_label(label_set, label); new_value_label->tag = tag; } readstat_variable_t *readstat_add_variable(readstat_writer_t *writer, const char *name, readstat_type_t type, size_t width) { if (writer->variables_count == writer->variables_capacity) { writer->variables_capacity *= 2; writer->variables = realloc(writer->variables, writer->variables_capacity * sizeof(readstat_variable_t *)); } readstat_variable_t *new_variable = calloc(1, sizeof(readstat_variable_t)); new_variable->index = writer->variables_count++; writer->variables[new_variable->index] = new_variable; new_variable->user_width = width; new_variable->type = type; if (readstat_variable_get_type_class(new_variable) == READSTAT_TYPE_CLASS_STRING) { new_variable->alignment = READSTAT_ALIGNMENT_LEFT; } else { new_variable->alignment = READSTAT_ALIGNMENT_RIGHT; } new_variable->measure = READSTAT_MEASURE_UNKNOWN; if (name) { snprintf(new_variable->name, sizeof(new_variable->name), "%s", name); } return new_variable; } static void readstat_append_string_ref(readstat_writer_t *writer, readstat_string_ref_t *ref) { if (writer->string_refs_count == writer->string_refs_capacity) { writer->string_refs_capacity *= 2; writer->string_refs = realloc(writer->string_refs, writer->string_refs_capacity * sizeof(readstat_string_ref_t *)); } 
writer->string_refs[writer->string_refs_count++] = ref; } readstat_string_ref_t *readstat_add_string_ref(readstat_writer_t *writer, const char *string) { readstat_string_ref_t *ref = readstat_string_ref_init(string); readstat_append_string_ref(writer, ref); return ref; } void readstat_add_note(readstat_writer_t *writer, const char *note) { if (writer->notes_count == writer->notes_capacity) { writer->notes_capacity *= 2; writer->notes = realloc(writer->notes, writer->notes_capacity * sizeof(const char *)); } char *note_copy = malloc(strlen(note) + 1); strcpy(note_copy, note); writer->notes[writer->notes_count++] = note_copy; } void readstat_variable_set_label(readstat_variable_t *variable, const char *label) { if (label) { snprintf(variable->label, sizeof(variable->label), "%s", label); } else { memset(variable->label, '\0', sizeof(variable->label)); } } void readstat_variable_set_format(readstat_variable_t *variable, const char *format) { if (format) { snprintf(variable->format, sizeof(variable->format), "%s", format); } else { memset(variable->format, '\0', sizeof(variable->format)); } } void readstat_variable_set_measure(readstat_variable_t *variable, readstat_measure_t measure) { variable->measure = measure; } void readstat_variable_set_alignment(readstat_variable_t *variable, readstat_alignment_t alignment) { variable->alignment = alignment; } void readstat_variable_set_display_width(readstat_variable_t *variable, int display_width) { variable->display_width = display_width; } void readstat_variable_set_label_set(readstat_variable_t *variable, readstat_label_set_t *label_set) { variable->label_set = label_set; if (label_set) { if (label_set->variables_count == label_set->variables_capacity) { label_set->variables_capacity *= 2; label_set->variables = realloc(label_set->variables, label_set->variables_capacity * sizeof(readstat_variable_t *)); } ((readstat_variable_t **)label_set->variables)[label_set->variables_count++] = variable; } } readstat_variable_t 
*readstat_get_variable(readstat_writer_t *writer, int index) { if (index < writer->variables_count) { return writer->variables[index]; } return NULL; } readstat_string_ref_t *readstat_get_string_ref(readstat_writer_t *writer, int index) { if (index < writer->string_refs_count) { return writer->string_refs[index]; } return NULL; } readstat_error_t readstat_writer_set_file_label(readstat_writer_t *writer, const char *file_label) { snprintf(writer->file_label, sizeof(writer->file_label), "%s", file_label); return READSTAT_OK; } readstat_error_t readstat_writer_set_file_timestamp(readstat_writer_t *writer, time_t timestamp) { writer->timestamp = timestamp; return READSTAT_OK; } readstat_error_t readstat_writer_set_fweight_variable(readstat_writer_t *writer, const readstat_variable_t *variable) { if (readstat_variable_get_type_class(variable) == READSTAT_TYPE_CLASS_STRING) return READSTAT_ERROR_BAD_FREQUENCY_WEIGHT; writer->fweight_variable = variable; return READSTAT_OK; } readstat_error_t readstat_writer_set_file_format_version(readstat_writer_t *writer, long version) { writer->version = version; return READSTAT_OK; } readstat_error_t readstat_writer_set_file_format_is_64bit(readstat_writer_t *writer, int is_64bit) { writer->is_64bit = is_64bit; return READSTAT_OK; } readstat_error_t readstat_writer_set_compression(readstat_writer_t *writer, readstat_compress_t compression) { writer->compression = compression; return READSTAT_OK; } readstat_error_t readstat_writer_set_error_handler(readstat_writer_t *writer, readstat_error_handler error_handler) { writer->error_handler = error_handler; return READSTAT_OK; } readstat_error_t readstat_begin_writing_file(readstat_writer_t *writer, void *user_ctx, long row_count) { writer->row_count = row_count; writer->user_ctx = user_ctx; writer->initialized = 1; return READSTAT_OK; } readstat_error_t readstat_begin_row(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; if (!writer->initialized) return 
READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (writer->current_row == 0) retval = readstat_begin_writing_data(writer); memset(writer->row, '\0', writer->row_len); return retval; } // Then call one of these for each variable readstat_error_t readstat_insert_int8_value(readstat_writer_t *writer, const readstat_variable_t *variable, int8_t value) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type != READSTAT_TYPE_INT8) return READSTAT_ERROR_VALUE_TYPE_MISMATCH; return writer->callbacks.write_int8(&writer->row[variable->offset], variable, value); } readstat_error_t readstat_insert_int16_value(readstat_writer_t *writer, const readstat_variable_t *variable, int16_t value) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type != READSTAT_TYPE_INT16) return READSTAT_ERROR_VALUE_TYPE_MISMATCH; return writer->callbacks.write_int16(&writer->row[variable->offset], variable, value); } readstat_error_t readstat_insert_int32_value(readstat_writer_t *writer, const readstat_variable_t *variable, int32_t value) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type != READSTAT_TYPE_INT32) return READSTAT_ERROR_VALUE_TYPE_MISMATCH; return writer->callbacks.write_int32(&writer->row[variable->offset], variable, value); } readstat_error_t readstat_insert_float_value(readstat_writer_t *writer, const readstat_variable_t *variable, float value) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type != READSTAT_TYPE_FLOAT) return READSTAT_ERROR_VALUE_TYPE_MISMATCH; return writer->callbacks.write_float(&writer->row[variable->offset], variable, value); } readstat_error_t readstat_insert_double_value(readstat_writer_t *writer, const readstat_variable_t *variable, double value) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type != READSTAT_TYPE_DOUBLE) return READSTAT_ERROR_VALUE_TYPE_MISMATCH; return 
writer->callbacks.write_double(&writer->row[variable->offset], variable, value); } readstat_error_t readstat_insert_string_value(readstat_writer_t *writer, const readstat_variable_t *variable, const char *value) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type != READSTAT_TYPE_STRING) return READSTAT_ERROR_VALUE_TYPE_MISMATCH; return writer->callbacks.write_string(&writer->row[variable->offset], variable, value); } readstat_error_t readstat_insert_string_ref(readstat_writer_t *writer, const readstat_variable_t *variable, readstat_string_ref_t *ref) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type != READSTAT_TYPE_STRING_REF) return READSTAT_ERROR_VALUE_TYPE_MISMATCH; if (!writer->callbacks.write_string_ref) return READSTAT_ERROR_STRING_REFS_NOT_SUPPORTED; if (ref && ref->first_o == -1 && ref->first_v == -1) { ref->first_o = writer->current_row; ref->first_v = variable->index; } return writer->callbacks.write_string_ref(&writer->row[variable->offset], variable, ref); } readstat_error_t readstat_insert_missing_value(readstat_writer_t *writer, const readstat_variable_t *variable) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (variable->type == READSTAT_TYPE_STRING) { return writer->callbacks.write_missing_string(&writer->row[variable->offset], variable); } if (variable->type == READSTAT_TYPE_STRING_REF) { return readstat_insert_string_ref(writer, variable, NULL); } return writer->callbacks.write_missing_number(&writer->row[variable->offset], variable); } readstat_error_t readstat_insert_tagged_missing_value(readstat_writer_t *writer, const readstat_variable_t *variable, char tag) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (!writer->callbacks.write_missing_tagged) { /* Write out a missing number but return an error */ writer->callbacks.write_missing_number(&writer->row[variable->offset], variable); return 
READSTAT_ERROR_TAGGED_VALUES_NOT_SUPPORTED; } return writer->callbacks.write_missing_tagged(&writer->row[variable->offset], variable, tag); } readstat_error_t readstat_end_row(readstat_writer_t *writer) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; readstat_error_t error = writer->callbacks.write_row(writer, writer->row, writer->row_len); if (error == READSTAT_OK) writer->current_row++; return error; } readstat_error_t readstat_end_writing(readstat_writer_t *writer) { if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; if (writer->current_row != writer->row_count) return READSTAT_ERROR_ROW_COUNT_MISMATCH; if (writer->row_count == 0) { readstat_error_t retval = readstat_begin_writing_data(writer); if (retval != READSTAT_OK) return retval; } /* Sort if out of order */ int i; for (i=1; i<writer->string_refs_count; i++) { if (readstat_compare_string_refs(&writer->string_refs[i-1], &writer->string_refs[i]) > 0) { qsort(writer->string_refs, writer->string_refs_count, sizeof(readstat_string_ref_t *), &readstat_compare_string_refs); break; } } if (!writer->callbacks.end_data) return READSTAT_OK; return writer->callbacks.end_data(writer); } haven/src/readstat/sas/0000755000176200001440000000000013227731765014554 5ustar liggesusershaven/src/readstat/sas/readstat_sas7bcat_write.c0000644000176200001440000001475013227731765021537 0ustar liggesusers #include #include #include #include #include "../readstat.h" #include "../readstat_writer.h" #include "readstat_sas.h" #include "readstat_sas_rle.h" typedef struct sas7bcat_block_s { size_t len; char data[1]; // Flexible array; use [1] for C++-98 compatibility } sas7bcat_block_t; static sas7bcat_block_t *sas7bcat_block_for_label_set(readstat_label_set_t *r_label_set) { size_t len = 0; size_t name_len = strlen(r_label_set->name); int j; char name[32]; len += 106; if (name_len > 8) { len += 32; // long name if (name_len > 32) { name_len = 32; } } memcpy(&name[0], r_label_set->name, name_len); for 
(j=0; j<r_label_set->value_labels_count; j++) { readstat_value_label_t *value_label = readstat_get_value_label(r_label_set, j); len += 30; // Value: 14-byte header + 16-byte padded value len += 8 + 2 + value_label->label_len + 1; } sas7bcat_block_t *block = calloc(1, sizeof(sas7bcat_block_t) + len); block->len = len; off_t begin = 106; int32_t count = r_label_set->value_labels_count; memcpy(&block->data[38], &count, sizeof(int32_t)); memcpy(&block->data[42], &count, sizeof(int32_t)); if (name_len > 8) { block->data[2] = (char)0x80; memcpy(&block->data[8], name, 8); memset(&block->data[106], ' ', 32); memcpy(&block->data[106], name, name_len); begin += 32; } else { memset(&block->data[8], ' ', 8); memcpy(&block->data[8], name, name_len); } char *lbp1 = &block->data[begin]; char *lbp2 = &block->data[begin+r_label_set->value_labels_count*30]; for (j=0; j<r_label_set->value_labels_count; j++) { readstat_value_label_t *value_label = readstat_get_value_label(r_label_set, j); lbp1[2] = 24; // size - 6 int32_t index = j; memcpy(&lbp1[10], &index, sizeof(int32_t)); if (r_label_set->type == READSTAT_TYPE_STRING) { size_t string_len = value_label->string_key_len; if (string_len > 16) string_len = 16; memset(&lbp1[14], ' ', 16); memcpy(&lbp1[14], value_label->string_key, string_len); } else { uint64_t big_endian_value; double double_value = -1.0 * value_label->double_key; memcpy(&big_endian_value, &double_value, sizeof(double)); if (machine_is_little_endian()) { big_endian_value = byteswap8(big_endian_value); } memcpy(&lbp1[22], &big_endian_value, sizeof(uint64_t)); } int16_t label_len = value_label->label_len; memcpy(&lbp2[8], &label_len, sizeof(int16_t)); memcpy(&lbp2[10], value_label->label, label_len); lbp1 += 30; lbp2 += 8 + 2 + value_label->label_len + 1; } return block; } static readstat_error_t sas7bcat_emit_header(readstat_writer_t *writer, sas_header_info_t *hinfo) { sas_header_start_t header_start = { .a2 = hinfo->u64 ? 
SAS_ALIGNMENT_OFFSET_4 : SAS_ALIGNMENT_OFFSET_0, .a1 = SAS_ALIGNMENT_OFFSET_0, .endian = machine_is_little_endian() ? SAS_ENDIAN_LITTLE : SAS_ENDIAN_BIG, .file_format = SAS_FILE_FORMAT_UNIX, .encoding = 20, /* UTF-8 */ .file_type = "SAS FILE", .file_info = "CATALOG " }; memcpy(&header_start.magic, sas7bcat_magic_number, sizeof(header_start.magic)); strncpy(header_start.file_label, writer->file_label, sizeof(header_start.file_label)); return sas_write_header(writer, hinfo, header_start); } static readstat_error_t sas7bcat_begin_data(void *writer_ctx) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; readstat_error_t retval = READSTAT_OK; int i; sas_header_info_t *hinfo = sas_header_info_init(writer, 0); /* Array of block pointers, so size each element as a pointer */ sas7bcat_block_t **blocks = malloc(writer->label_sets_count * sizeof(sas7bcat_block_t *)); char *page = malloc(hinfo->page_size); for (i=0; i<writer->label_sets_count; i++) { blocks[i] = sas7bcat_block_for_label_set(writer->label_sets[i]); } hinfo->page_count = 4; // Header retval = sas7bcat_emit_header(writer, hinfo); if (retval != READSTAT_OK) goto cleanup; // Page 0 retval = readstat_write_zeros(writer, hinfo->page_size); if (retval != READSTAT_OK) goto cleanup; memset(page, '\0', hinfo->page_size); // Page 1 char *xlsr = &page[856]; int16_t block_idx, block_off; block_idx = 4; block_off = 16; for (i=0; i<writer->label_sets_count; i++) { if (xlsr + 212 > page + hinfo->page_size) break; memcpy(&xlsr[0], "XLSR", 4); memcpy(&xlsr[4], &block_idx, sizeof(int16_t)); memcpy(&xlsr[8], &block_off, sizeof(int16_t)); xlsr[50] = 'O'; block_off += blocks[i]->len; xlsr += 212; } retval = readstat_write_bytes(writer, page, hinfo->page_size); if (retval != READSTAT_OK) goto cleanup; // Page 2 retval = readstat_write_zeros(writer, hinfo->page_size); if (retval != READSTAT_OK) goto cleanup; // Page 3 memset(page, '\0', hinfo->page_size); char block_header[16]; block_off = 16; for (i=0; i<writer->label_sets_count; i++) { if (block_off + sizeof(block_header) + blocks[i]->len > hinfo->page_size) 
break; memset(block_header, '\0', sizeof(block_header)); int32_t next_page = 0; int16_t next_off = 0; int16_t block_len = blocks[i]->len; memcpy(&block_header[0], &next_page, sizeof(int32_t)); memcpy(&block_header[4], &next_off, sizeof(int16_t)); memcpy(&block_header[6], &block_len, sizeof(int16_t)); memcpy(&page[block_off], block_header, sizeof(block_header)); block_off += sizeof(block_header); memcpy(&page[block_off], blocks[i]->data, blocks[i]->len); block_off += blocks[i]->len; } retval = readstat_write_bytes(writer, page, hinfo->page_size); if (retval != READSTAT_OK) goto cleanup; cleanup: for (i=0; i<writer->label_sets_count; i++) { free(blocks[i]); } free(blocks); free(hinfo); free(page); return retval; } readstat_error_t readstat_begin_writing_sas7bcat(readstat_writer_t *writer, void *user_ctx) { if (writer->version == 0) writer->version = SAS_DEFAULT_FILE_VERSION; writer->callbacks.begin_data = &sas7bcat_begin_data; return readstat_begin_writing_file(writer, user_ctx, 0); } haven/src/readstat/sas/readstat_sas_rle.h0000644000176200001440000000055613227731765020252 0ustar liggesusers ssize_t sas_rle_decompress(void *output_buf, size_t output_len, const void *input_buf, size_t input_len); ssize_t sas_rle_compress(void *output_buf, size_t output_len, const void *input_buf, size_t input_len); ssize_t sas_rle_decompressed_len(const void *input_buf, size_t input_len); ssize_t sas_rle_compressed_len(const void *bytes, size_t len); haven/src/readstat/sas/readstat_sas.c0000644000176200001440000003311513227731765017400 0ustar liggesusers #include #include #include #include #include #include #include #include "readstat_sas.h" #include "../readstat_iconv.h" #include "../readstat_convert.h" #include "../readstat_writer.h" #define HEADER_SIZE 1024 #define PAGE_SIZE 4096 #define SAS_DEFAULT_STRING_ENCODING "WINDOWS-1252" unsigned char sas7bdat_magic_number[32] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc2, 0xea, 0x81, 0x60, 0xb3, 0x14, 0x11, 
0xcf, 0xbd, 0x92, 0x08, 0x00, 0x09, 0xc7, 0x31, 0x8c, 0x18, 0x1f, 0x10, 0x11 }; unsigned char sas7bcat_magic_number[32] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc2, 0xea, 0x81, 0x63, 0xb3, 0x14, 0x11, 0xcf, 0xbd, 0x92, 0x08, 0x00, 0x09, 0xc7, 0x31, 0x8c, 0x18, 0x1f, 0x10, 0x11 }; /* This table is cobbled together from extant files and: * https://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002607278.htm * * Discrepancies from the official documentation are noted with a comment. It * appears that in some instances SAS software uses a newer encoding than * what's listed in the docs. In these cases the encoding used by ReadStat * represents the author's best guess. */ static readstat_charset_entry_t _charset_table[] = { { .code = 0, .name = SAS_DEFAULT_STRING_ENCODING }, { .code = 20, .name = "UTF-8" }, { .code = 28, .name = "US-ASCII" }, { .code = 29, .name = "ISO-8859-1" }, { .code = 30, .name = "ISO-8859-2" }, { .code = 31, .name = "ISO-8859-3" }, { .code = 34, .name = "ISO-8859-6" }, { .code = 35, .name = "ISO-8859-7" }, { .code = 36, .name = "ISO-8859-8" }, { .code = 39, .name = "ISO-8859-11" }, { .code = 40, .name = "ISO-8859-9" }, { .code = 60, .name = "WINDOWS-1250" }, { .code = 61, .name = "WINDOWS-1251" }, { .code = 62, .name = "WINDOWS-1252" }, { .code = 63, .name = "WINDOWS-1253" }, { .code = 64, .name = "WINDOWS-1254" }, { .code = 65, .name = "WINDOWS-1255" }, { .code = 66, .name = "WINDOWS-1256" }, { .code = 67, .name = "WINDOWS-1257" }, { .code = 68, .name = "WINDOWS-1258" }, { .code = 119, .name = "EUC-TW" }, { .code = 123, .name = "BIG-5" }, { .code = 125, .name = "GB18030" }, // "euc-cn" in SAS { .code = 134, .name = "EUC-JP" }, { .code = 138, .name = "CP932" }, // "shift-jis" in SAS { .code = 140, .name = "EUC-KR" } }; uint64_t sas_read8(const char *data, int bswap) { uint64_t tmp; memcpy(&tmp, data, 8); return bswap ? 
byteswap8(tmp) : tmp; } uint32_t sas_read4(const char *data, int bswap) { uint32_t tmp; memcpy(&tmp, data, 4); return bswap ? byteswap4(tmp) : tmp; } uint16_t sas_read2(const char *data, int bswap) { uint16_t tmp; memcpy(&tmp, data, 2); return bswap ? byteswap2(tmp) : tmp; } time_t sas_convert_time(double time, time_t epoch) { time += epoch; if (isnan(time)) return 0; if (time > 1.0 * INT64_MAX) return INT64_MAX; if (time < 1.0 * INT64_MIN) return INT64_MIN; return time; } readstat_error_t sas_read_header(readstat_io_t *io, sas_header_info_t *hinfo, readstat_error_handler error_handler, void *user_ctx) { sas_header_start_t header_start; sas_header_end_t header_end; int retval = READSTAT_OK; char error_buf[1024]; struct tm epoch_tm = { .tm_year = 60, .tm_mday = 1 }; time_t epoch = mktime(&epoch_tm); if (io->read(&header_start, sizeof(sas_header_start_t), io->io_ctx) < sizeof(sas_header_start_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (memcmp(header_start.magic, sas7bdat_magic_number, sizeof(sas7bdat_magic_number)) != 0 && memcmp(header_start.magic, sas7bcat_magic_number, sizeof(sas7bcat_magic_number)) != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (header_start.a1 == SAS_ALIGNMENT_OFFSET_4) { hinfo->pad1 = 4; } if (header_start.a2 == SAS_ALIGNMENT_OFFSET_4) { hinfo->u64 = 1; } int bswap = 0; if (header_start.endian == SAS_ENDIAN_BIG) { bswap = machine_is_little_endian(); hinfo->little_endian = 0; } else if (header_start.endian == SAS_ENDIAN_LITTLE) { bswap = !machine_is_little_endian(); hinfo->little_endian = 1; } else { retval = READSTAT_ERROR_PARSE; goto cleanup; } int i; for (i=0; i<sizeof(_charset_table)/sizeof(_charset_table[0]); i++) { if (header_start.encoding == _charset_table[i].code) { hinfo->encoding = _charset_table[i].name; break; } } if (hinfo->encoding == NULL) { if (error_handler) { snprintf(error_buf, sizeof(error_buf), "Unsupported character set code: %d", header_start.encoding); error_handler(error_buf, user_ctx); } retval = READSTAT_ERROR_UNSUPPORTED_CHARSET; goto cleanup; } memcpy(hinfo->file_label, header_start.file_label, 
sizeof(header_start.file_label)); if (io->seek(hinfo->pad1, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } double creation_time, modification_time; if (io->read(&creation_time, sizeof(double), io->io_ctx) < sizeof(double)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (bswap) creation_time = byteswap_double(creation_time); if (io->read(&modification_time, sizeof(double), io->io_ctx) < sizeof(double)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (bswap) modification_time = byteswap_double(modification_time); hinfo->creation_time = sas_convert_time(creation_time, epoch); hinfo->modification_time = sas_convert_time(modification_time, epoch); if (io->seek(16, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } uint32_t header_size, page_size; if (io->read(&header_size, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (io->read(&page_size, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } hinfo->header_size = bswap ? byteswap4(header_size) : header_size; hinfo->page_size = bswap ? byteswap4(page_size) : page_size; if (hinfo->header_size < 1024 || hinfo->page_size < 1024) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (hinfo->header_size > (1<<20) || hinfo->page_size > (1<<24)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (hinfo->u64) { hinfo->page_header_size = SAS_PAGE_HEADER_SIZE_64BIT; hinfo->subheader_pointer_size = SAS_SUBHEADER_POINTER_SIZE_64BIT; } else { hinfo->page_header_size = SAS_PAGE_HEADER_SIZE_32BIT; hinfo->subheader_pointer_size = SAS_SUBHEADER_POINTER_SIZE_32BIT; } if (hinfo->u64) { uint64_t page_count; if (io->read(&page_count, sizeof(uint64_t), io->io_ctx) < sizeof(uint64_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } hinfo->page_count = bswap ? 
byteswap8(page_count) : page_count; } else { uint32_t page_count; if (io->read(&page_count, sizeof(uint32_t), io->io_ctx) < sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } hinfo->page_count = bswap ? byteswap4(page_count) : page_count; } if (hinfo->page_count > (1<<24)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (io->seek(8, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; if (error_handler) { snprintf(error_buf, sizeof(error_buf), "ReadStat: Failed to seek forward by %d", 8); error_handler(error_buf, user_ctx); } goto cleanup; } if (io->read(&header_end, sizeof(sas_header_end_t), io->io_ctx) < sizeof(sas_header_end_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } int major, minor, revision; if (sscanf(header_end.release, "%1d.%04dM%1d", &major, &minor, &revision) == 3) { hinfo->major_version = major; hinfo->minor_version = minor; hinfo->revision = revision; } if (major == 9 && minor == 0 && revision == 0) { /* A bit of a hack, but most SAS installations are running a minor update */ hinfo->vendor = READSTAT_VENDOR_STAT_TRANSFER; } else { hinfo->vendor = READSTAT_VENDOR_SAS; } if (io->seek(hinfo->header_size, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; if (error_handler) { snprintf(error_buf, sizeof(error_buf), "ReadStat: Failed to seek to position %" PRId64, hinfo->header_size); error_handler(error_buf, user_ctx); } goto cleanup; } cleanup: return retval; } readstat_error_t sas_write_header(readstat_writer_t *writer, sas_header_info_t *hinfo, sas_header_start_t header_start) { readstat_error_t retval = READSTAT_OK; struct tm epoch_tm = { .tm_year = 60, .tm_mday = 1 }; time_t epoch = mktime(&epoch_tm); sas_header_end_t header_end = { .host = "W32_VSPRO" }; strncpy(header_start.file_label, writer->file_label, sizeof(header_start.file_label)); retval = readstat_write_bytes(writer, &header_start, sizeof(sas_header_start_t)); if (retval != READSTAT_OK) goto cleanup; retval = 
readstat_write_zeros(writer, hinfo->pad1); if (retval != READSTAT_OK) goto cleanup; double creation_time = hinfo->creation_time - epoch; retval = readstat_write_bytes(writer, &creation_time, sizeof(double)); if (retval != READSTAT_OK) goto cleanup; double modification_time = hinfo->modification_time - epoch; retval = readstat_write_bytes(writer, &modification_time, sizeof(double)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_zeros(writer, 16); if (retval != READSTAT_OK) goto cleanup; uint32_t header_size = hinfo->header_size; uint32_t page_size = hinfo->page_size; retval = readstat_write_bytes(writer, &header_size, sizeof(uint32_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &page_size, sizeof(uint32_t)); if (retval != READSTAT_OK) goto cleanup; if (hinfo->u64) { uint64_t page_count = hinfo->page_count; retval = readstat_write_bytes(writer, &page_count, sizeof(uint64_t)); } else { uint32_t page_count = hinfo->page_count; retval = readstat_write_bytes(writer, &page_count, sizeof(uint32_t)); } if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_zeros(writer, 8); if (retval != READSTAT_OK) goto cleanup; char release[32]; snprintf(release, sizeof(release), "%1ld.%04ldM0", writer->version / 10000, writer->version % 10000); strncpy(header_end.release, release, sizeof(header_end.release)); retval = readstat_write_bytes(writer, &header_end, sizeof(sas_header_end_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_zeros(writer, hinfo->header_size-writer->bytes_written); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } sas_header_info_t *sas_header_info_init(readstat_writer_t *writer, int is_64bit) { sas_header_info_t *hinfo = calloc(1, sizeof(sas_header_info_t)); hinfo->creation_time = writer->timestamp; hinfo->modification_time = writer->timestamp; hinfo->header_size = HEADER_SIZE; hinfo->page_size = PAGE_SIZE; hinfo->u64 = !!is_64bit; if (hinfo->u64) { 
hinfo->page_header_size = SAS_PAGE_HEADER_SIZE_64BIT; hinfo->subheader_pointer_size = SAS_SUBHEADER_POINTER_SIZE_64BIT; } else { hinfo->page_header_size = SAS_PAGE_HEADER_SIZE_32BIT; hinfo->subheader_pointer_size = SAS_SUBHEADER_POINTER_SIZE_32BIT; } return hinfo; } readstat_error_t sas_fill_page(readstat_writer_t *writer, sas_header_info_t *hinfo) { if ((writer->bytes_written - hinfo->header_size) % hinfo->page_size) { size_t num_zeros = (hinfo->page_size - (writer->bytes_written - hinfo->header_size) % hinfo->page_size); return readstat_write_zeros(writer, num_zeros); } return READSTAT_OK; } static readstat_error_t sas_validate_name(const char *name) { int j; for (j=0; name[j]; j++) { if (name[j] != '_' && !(name[j] >= 'a' && name[j] <= 'z') && !(name[j] >= 'A' && name[j] <= 'Z') && !(name[j] >= '0' && name[j] <= '9')) { return READSTAT_ERROR_NAME_CONTAINS_ILLEGAL_CHARACTER; } } char first_char = name[0]; if (first_char != '_' && !(first_char >= 'a' && first_char <= 'z') && !(first_char >= 'A' && first_char <= 'Z')) { return READSTAT_ERROR_NAME_BEGINS_WITH_ILLEGAL_CHARACTER; } if (strcmp(name, "_N_") == 0 || strcmp(name, "_ERROR_") == 0 || strcmp(name, "_NUMERIC_") == 0 || strcmp(name, "_CHARACTER_") == 0 || strcmp(name, "_ALL_") == 0) { return READSTAT_ERROR_NAME_IS_RESERVED_WORD; } if (strlen(name) > 32) return READSTAT_ERROR_NAME_IS_TOO_LONG; return READSTAT_OK; } readstat_error_t sas_validate_variable(readstat_variable_t *variable) { return sas_validate_name(readstat_variable_get_name(variable)); } haven/src/readstat/sas/readstat_sas7bcat_read.c0000644000176200001440000003514313227731765021317 0ustar liggesusers#include #include #include #include #include #include "readstat_sas.h" #include "../readstat_iconv.h" #include "../readstat_convert.h" #include "../readstat_malloc.h" #define SAS_CATALOG_FIRST_INDEX_PAGE 1 #define SAS_CATALOG_USELESS_PAGES 3 typedef struct sas7bcat_ctx_s { readstat_metadata_handler metadata_handler; readstat_value_label_handler 
value_label_handler; void *user_ctx; readstat_io_t *io; int u64; int pad1; int bswap; int64_t page_count; int64_t page_size; int64_t header_size; uint64_t *block_pointers; int block_pointers_used; int block_pointers_capacity; const char *input_encoding; const char *output_encoding; iconv_t converter; } sas7bcat_ctx_t; static void sas7bcat_ctx_free(sas7bcat_ctx_t *ctx) { if (ctx->converter) iconv_close(ctx->converter); if (ctx->block_pointers) free(ctx->block_pointers); free(ctx); } static readstat_error_t sas7bcat_parse_value_labels(const char *value_start, size_t value_labels_len, int label_count_used, int label_count_capacity, const char *name, sas7bcat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int i; const char *lbp1 = value_start; uint32_t *value_offset = readstat_calloc(label_count_used, sizeof(uint32_t)); /* Doubles appear to be stored as big-endian, always */ int bswap_doubles = machine_is_little_endian(); int is_string = (name[0] == '$'); if (value_offset == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } /* Pass 1 -- find out the offset of the labels */ for (i=0; i value_labels_len || lbp1[2] < 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (ipad1+4] - value_start > value_labels_len) { retval = READSTAT_ERROR_PARSE; goto cleanup; } uint32_t label_pos = sas_read4(&lbp1[10+ctx->pad1], ctx->bswap); if (label_pos >= label_count_used) { retval = READSTAT_ERROR_PARSE; goto cleanup; } value_offset[label_pos] = lbp1 - value_start; } lbp1 += 6 + lbp1[2]; } const char *lbp2 = lbp1; /* Pass 2 -- parse pairs of values & labels */ for (i=0; i value_labels_len || &lbp2[10] - value_start > value_labels_len) { retval = READSTAT_ERROR_PARSE; goto cleanup; } size_t label_len = sas_read2(&lbp2[8], ctx->bswap); size_t value_entry_len = 6 + lbp1[2]; const char *label = &lbp2[10]; char string_val[4*16+1]; readstat_value_t value = { .type = is_string ? 
READSTAT_TYPE_STRING : READSTAT_TYPE_DOUBLE }; if (is_string) { retval = readstat_convert(string_val, sizeof(string_val), &lbp1[value_entry_len-16], 16, ctx->converter); if (retval != READSTAT_OK) goto cleanup; value.v.string_value = string_val; } else { uint64_t val = sas_read8(&lbp1[22], bswap_doubles); double dval = NAN; if ((val | 0xFF0000000000) == 0xFFFFFFFFFFFF) { value.tag = (val >> 40); if (value.tag) { value.is_tagged_missing = 1; } else { value.is_system_missing = 1; } } else { memcpy(&dval, &val, 8); dval *= -1.0; } value.v.double_value = dval; } if (ctx->value_label_handler) { if (ctx->value_label_handler(name, value, label, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } lbp2 += 8 + 2 + label_len + 1; } cleanup: if (value_offset) free(value_offset); return retval; } static readstat_error_t sas7bcat_parse_block(const char *data, size_t data_size, sas7bcat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; size_t pad = 0; int label_count_capacity = 0; int label_count_used = 0; char name[4*32+1]; if (data_size < 50) goto cleanup; pad = (data[2] & 0x08) ? 
4 : 0; // might be 0x10, not sure
    label_count_capacity = sas_read4(&data[38+pad], ctx->bswap);
    label_count_used = sas_read4(&data[42+pad], ctx->bswap);

    if ((retval = readstat_convert(name, sizeof(name), &data[8], 8, ctx->converter)) != READSTAT_OK)
        goto cleanup;

    if (pad) {
        pad += 16;
    }

    if ((data[2] & 0x80)) { // has long name
        if (data_size < 106 + pad + 32)
            goto cleanup;

        retval = readstat_convert(name, sizeof(name), &data[106+pad], 32, ctx->converter);
        if (retval != READSTAT_OK)
            goto cleanup;
        pad += 32;
    }

    if (data_size < 106 + pad)
        goto cleanup;

    if ((retval = sas7bcat_parse_value_labels(&data[106+pad], data_size - 106 - pad,
                    label_count_used, label_count_capacity, name, ctx)) != READSTAT_OK)
        goto cleanup;

cleanup:
    return retval;
}

static readstat_error_t sas7bcat_augment_index(const char *index, size_t len, sas7bcat_ctx_t *ctx) {
    const char *xlsr = index;
    readstat_error_t retval = READSTAT_OK;
    while (xlsr + 212 <= index + len) {
        if (memcmp(xlsr, "XLSR", 4) != 0) // some block pointers seem to have 8 bytes of extra padding
            xlsr += 8;
        if (memcmp(xlsr, "XLSR", 4) != 0)
            break;

        if (xlsr[50+ctx->pad1] == 'O')
            ctx->block_pointers[ctx->block_pointers_used++] = ((uint64_t)sas_read2(&xlsr[4], ctx->bswap) << 32) + sas_read2(&xlsr[8], ctx->bswap);

        if (ctx->block_pointers_used == ctx->block_pointers_capacity) {
            ctx->block_pointers = readstat_realloc(ctx->block_pointers, (ctx->block_pointers_capacity *= 2) * sizeof(uint64_t));
            if (ctx->block_pointers == NULL) {
                retval = READSTAT_ERROR_MALLOC;
                goto cleanup;
            }
        }

        xlsr += 212 + ctx->pad1;
    }

cleanup:
    return retval;
}

static int compare_block_pointers(const void *elem1, const void *elem2) {
    uint64_t v1 = *(const uint64_t *)elem1;
    uint64_t v2 = *(const uint64_t *)elem2;
    /* Compare explicitly: returning v1 - v2 would truncate a 64-bit difference to int */
    return (v1 > v2) - (v1 < v2);
}

static void sas7bcat_sort_index(sas7bcat_ctx_t *ctx) {
    if (ctx->block_pointers_used == 0)
        return;

    int i;
    for (i=1; i<ctx->block_pointers_used; i++) {
        if (ctx->block_pointers[i] < ctx->block_pointers[i-1]) {
            qsort(ctx->block_pointers,
                    ctx->block_pointers_used, sizeof(uint64_t), &compare_block_pointers);
            break;
        }
    }
}

static void sas7bcat_uniq_index(sas7bcat_ctx_t *ctx) {
    if (ctx->block_pointers_used == 0)
        return;

    int i;
    int out_i = 1;
    for (i=1; i<ctx->block_pointers_used; i++) {
        if (ctx->block_pointers[i] != ctx->block_pointers[i-1]) {
            if (out_i != i) {
                ctx->block_pointers[out_i] = ctx->block_pointers[i];
            }
            out_i++;
        }
    }
    ctx->block_pointers_used = out_i;
}

static int sas7bcat_block_size(int start_page, int start_page_pos, sas7bcat_ctx_t *ctx, readstat_error_t *outError) {
    readstat_error_t retval = READSTAT_OK;
    readstat_io_t *io = ctx->io;
    int next_page = start_page;
    int next_page_pos = start_page_pos;
    int link_count = 0;

    int buffer_len = 0;
    int chain_link_len = 0;

    char chain_link[16];

    // calculate buffer size needed
    while (next_page > 0 && next_page_pos > 0 && next_page <= ctx->page_count && link_count++ < ctx->page_count) {
        if (io->seek(ctx->header_size+(next_page-1)*ctx->page_size+next_page_pos, READSTAT_SEEK_SET, io->io_ctx) == -1) {
            retval = READSTAT_ERROR_SEEK;
            goto cleanup;
        }
        if (io->read(chain_link, sizeof(chain_link), io->io_ctx) < sizeof(chain_link)) {
            retval = READSTAT_ERROR_READ;
            goto cleanup;
        }

        next_page = sas_read4(&chain_link[0], ctx->bswap);
        next_page_pos = sas_read2(&chain_link[4], ctx->bswap);
        chain_link_len = sas_read2(&chain_link[6], ctx->bswap);

        buffer_len += chain_link_len;
    }

cleanup:
    if (outError)
        *outError = retval;

    return retval == READSTAT_OK ?
buffer_len : -1; } static readstat_error_t sas7bcat_read_block(char *buffer, size_t buffer_len, int start_page, int start_page_pos, sas7bcat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; int next_page = start_page; int next_page_pos = start_page_pos; int chain_link_len = 0; int buffer_offset = 0; char chain_link[16]; while (next_page > 0 && next_page_pos > 0) { if (io->seek(ctx->header_size+(next_page-1)*ctx->page_size+next_page_pos, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } if (io->read(chain_link, sizeof(chain_link), io->io_ctx) < sizeof(chain_link)) { retval = READSTAT_ERROR_READ; goto cleanup; } next_page = sas_read4(&chain_link[0], ctx->bswap); next_page_pos = sas_read2(&chain_link[4], ctx->bswap); chain_link_len = sas_read2(&chain_link[6], ctx->bswap); if (buffer_offset + chain_link_len > buffer_len) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (io->read(buffer + buffer_offset, chain_link_len, io->io_ctx) < chain_link_len) { retval = READSTAT_ERROR_READ; goto cleanup; } buffer_offset += chain_link_len; } cleanup: return retval; } readstat_error_t readstat_parse_sas7bcat(readstat_parser_t *parser, const char *path, void *user_ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = parser->io; int64_t i; char *page = NULL; char *buffer = NULL; sas7bcat_ctx_t *ctx = calloc(1, sizeof(sas7bcat_ctx_t)); sas_header_info_t *hinfo = calloc(1, sizeof(sas_header_info_t)); ctx->block_pointers = malloc((ctx->block_pointers_capacity = 200) * sizeof(uint64_t)); ctx->value_label_handler = parser->value_label_handler; ctx->metadata_handler = parser->metadata_handler; ctx->input_encoding = parser->input_encoding; ctx->output_encoding = parser->output_encoding; ctx->user_ctx = user_ctx; ctx->io = io; if (io->open(path, io->io_ctx) == -1) { retval = READSTAT_ERROR_OPEN; goto cleanup; } if ((retval = sas_read_header(io, hinfo, parser->error_handler, user_ctx)) != READSTAT_OK) { goto 
        cleanup;
    }

    ctx->u64 = hinfo->u64;
    ctx->pad1 = hinfo->pad1;
    ctx->bswap = machine_is_little_endian() ^ hinfo->little_endian;
    ctx->header_size = hinfo->header_size;
    ctx->page_count = hinfo->page_count;
    ctx->page_size = hinfo->page_size;
    if (ctx->input_encoding == NULL) {
        ctx->input_encoding = hinfo->encoding;
    }

    if (ctx->u64) {
        retval = READSTAT_ERROR_PARSE;
        goto cleanup;
    }

    if (ctx->input_encoding && ctx->output_encoding && strcmp(ctx->input_encoding, ctx->output_encoding) != 0) {
        iconv_t converter = iconv_open(ctx->output_encoding, ctx->input_encoding);
        if (converter == (iconv_t)-1) {
            retval = READSTAT_ERROR_UNSUPPORTED_CHARSET;
            goto cleanup;
        }
        ctx->converter = converter;
    }

    if (parser->metadata_handler) {
        char file_label[4*64+1];
        retval = readstat_convert(file_label, sizeof(file_label),
                hinfo->file_label, sizeof(hinfo->file_label), ctx->converter);
        if (retval != READSTAT_OK)
            goto cleanup;

        if (ctx->metadata_handler(file_label, hinfo->encoding, hinfo->modification_time,
                    10000 * hinfo->major_version + hinfo->minor_version, ctx->user_ctx) != READSTAT_HANDLER_OK) {
            retval = READSTAT_ERROR_USER_ABORT;
            goto cleanup;
        }
    }

    if ((page = readstat_malloc(ctx->page_size)) == NULL) {
        retval = READSTAT_ERROR_MALLOC;
        goto cleanup;
    }
    if (io->seek(ctx->header_size+SAS_CATALOG_FIRST_INDEX_PAGE*ctx->page_size, READSTAT_SEEK_SET, io->io_ctx) == -1) {
        retval = READSTAT_ERROR_SEEK;
        goto cleanup;
    }
    if (io->read(page, ctx->page_size, io->io_ctx) < ctx->page_size) {
        retval = READSTAT_ERROR_READ;
        goto cleanup;
    }

    retval = sas7bcat_augment_index(&page[856+2*ctx->pad1], ctx->page_size - 856 - 2*ctx->pad1, ctx);
    if (retval != READSTAT_OK)
        goto cleanup;

    // Pass 1 -- find the XLSR entries
    for (i=SAS_CATALOG_USELESS_PAGES; i<ctx->page_count; i++) {
        if (io->seek(ctx->header_size+i*ctx->page_size, READSTAT_SEEK_SET, io->io_ctx) == -1) {
            retval = READSTAT_ERROR_SEEK;
            goto cleanup;
        }
        if (io->read(page, ctx->page_size, io->io_ctx) < ctx->page_size) {
            retval = READSTAT_ERROR_READ;
            goto cleanup;
        }
        if
(memcmp(&page[16], "XLSR", sizeof("XLSR")-1) == 0) {
            retval = sas7bcat_augment_index(&page[16], ctx->page_size - 16, ctx);
            if (retval != READSTAT_OK)
                goto cleanup;
        }
    }

    sas7bcat_sort_index(ctx);
    sas7bcat_uniq_index(ctx);

    // Pass 2 -- look up the individual block pointers
    for (i=0; i<ctx->block_pointers_used; i++) {
        int start_page = ctx->block_pointers[i] >> 32;
        int start_page_pos = (ctx->block_pointers[i]) & 0xFFFF;
        int buffer_len = sas7bcat_block_size(start_page, start_page_pos, ctx, &retval);
        if (buffer_len == -1) {
            goto cleanup;
        } else if (buffer_len == 0) {
            continue;
        }
        if ((buffer = readstat_realloc(buffer, buffer_len)) == NULL) {
            retval = READSTAT_ERROR_MALLOC;
            goto cleanup;
        }
        if ((retval = sas7bcat_read_block(buffer, buffer_len, start_page, start_page_pos, ctx)) != READSTAT_OK)
            goto cleanup;
        if ((retval = sas7bcat_parse_block(buffer, buffer_len, ctx)) != READSTAT_OK)
            goto cleanup;
    }

cleanup:
    io->close(io->io_ctx);
    if (page)
        free(page);
    if (buffer)
        free(buffer);
    if (ctx)
        sas7bcat_ctx_free(ctx);
    if (hinfo)
        free(hinfo);

    return retval;
}
haven/src/readstat/sas/readstat_xport_read.c0000644000176200001440000005177113227731765020771 0ustar liggesusers#include #include #include #include #include #include #include "../readstat.h"
#include "../readstat_iconv.h"
#include "../readstat_convert.h"
#include "../readstat_malloc.h"

#include "readstat_sas.h"
#include "readstat_xport.h"
#include "ieee.h"

#define LINE_LEN 80

typedef struct xport_ctx_s {
    readstat_info_handler info_handler;
    readstat_metadata_handler metadata_handler;
    readstat_note_handler note_handler;
    readstat_variable_handler variable_handler;
    readstat_fweight_handler fweight_handler;
    readstat_value_handler value_handler;
    readstat_value_label_handler value_label_handler;
    readstat_error_handler error_handler;
    readstat_progress_handler progress_handler;
    size_t file_size;

    void *user_ctx;
    readstat_io_t *io;
    time_t timestamp;

    int obs_count;
    int var_count;
    int row_limit;
    size_t row_length;
    int parsed_row_count;
    readstat_variable_t **variables;

    int version;
} xport_ctx_t;

static readstat_error_t xport_update_progress(xport_ctx_t *ctx) {
    readstat_io_t *io = ctx->io;
    return io->update(ctx->file_size, ctx->progress_handler, ctx->user_ctx, io->io_ctx);
}

static xport_ctx_t *xport_ctx_init() {
    xport_ctx_t *ctx = calloc(1, sizeof(xport_ctx_t));
    return ctx;
}

static void xport_ctx_free(xport_ctx_t *ctx) {
    if (ctx->variables) {
        int i;
        for (i=0; i<ctx->var_count; i++) {
            if (ctx->variables[i])
                free(ctx->variables[i]);
        }
        free(ctx->variables);
    }
    free(ctx);
}

static ssize_t read_bytes(xport_ctx_t *ctx, void *dst, size_t dst_len) {
    readstat_io_t *io = (readstat_io_t *)ctx->io;
    return io->read(dst, dst_len, io->io_ctx);
}

static readstat_error_t xport_skip_record(xport_ctx_t *ctx) {
    readstat_io_t *io = (readstat_io_t *)ctx->io;
    if (io->seek(LINE_LEN, READSTAT_SEEK_CUR, io->io_ctx) == -1)
        return READSTAT_ERROR_SEEK;
    return READSTAT_OK;
}

static readstat_error_t xport_skip_rest_of_record(xport_ctx_t *ctx) {
    readstat_io_t *io = (readstat_io_t *)ctx->io;
    off_t pos = io->seek(0, READSTAT_SEEK_CUR, io->io_ctx);
    if (pos == -1)
        return READSTAT_ERROR_SEEK;
    if (pos % LINE_LEN) {
        if (io->seek(LINE_LEN - (pos % LINE_LEN), READSTAT_SEEK_CUR, io->io_ctx) == -1)
            return READSTAT_ERROR_SEEK;
    }
    return READSTAT_OK;
}

static readstat_error_t xport_read_record(xport_ctx_t *ctx, char *record) {
    ssize_t bytes_read = read_bytes(ctx, record, LINE_LEN);
    if (bytes_read < LINE_LEN)
        return READSTAT_ERROR_READ;
    record[LINE_LEN] = '\0';
    return READSTAT_OK;
}

static readstat_error_t xport_read_header_record(xport_ctx_t *ctx, xport_header_record_t *xrecord) {
    char line[LINE_LEN+1];
    readstat_error_t retval = READSTAT_OK;

    retval = xport_read_record(ctx, line);
    if (retval != READSTAT_OK)
        return retval;

    memset(xrecord, 0, sizeof(xport_header_record_t));
    int matches = sscanf(line, "HEADER RECORD*******%8s HEADER RECORD!!!!!!!"
"%05d%05d%05d" "%05d%05d%05d", xrecord->name, &xrecord->num1, &xrecord->num2, &xrecord->num3, &xrecord->num4, &xrecord->num5, &xrecord->num6); if (matches < 2) { return READSTAT_ERROR_PARSE; } return READSTAT_OK; } static readstat_error_t xport_expect_header_record(xport_ctx_t *ctx, const char *v5_name, const char *v8_name) { readstat_error_t retval = READSTAT_OK; xport_header_record_t xrecord; retval = xport_read_header_record(ctx, &xrecord); if (retval != READSTAT_OK) goto cleanup; if (ctx->version == 5 && strcmp(xrecord.name, v5_name) != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } else if (ctx->version == 8 && strcmp(xrecord.name, v8_name) != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } cleanup: return retval; } static readstat_error_t xport_read_file_label_record(xport_ctx_t *ctx) { char line[LINE_LEN+1]; char label[40*4+1]; readstat_error_t retval = READSTAT_OK; retval = xport_read_record(ctx, line); if (retval != READSTAT_OK) goto cleanup; retval = readstat_convert(label, sizeof(label), &line[32], 40, NULL); if (retval != READSTAT_OK) goto cleanup; if (ctx->metadata_handler) { if (ctx->metadata_handler(label, NULL, ctx->timestamp, ctx->version, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } cleanup: return retval; } static readstat_error_t xport_read_library_record(xport_ctx_t *ctx) { xport_header_record_t xrecord; readstat_error_t retval = xport_read_header_record(ctx, &xrecord); if (retval != READSTAT_OK) goto cleanup; if (strcmp(xrecord.name, "LIBRARY") == 0) { ctx->version = 5; } else if (strcmp(xrecord.name, "LIBV8") == 0) { ctx->version = 8; } else { retval = READSTAT_ERROR_UNSUPPORTED_FILE_FORMAT_VERSION; goto cleanup; } cleanup: return retval; } static readstat_error_t xport_read_timestamp_record(xport_ctx_t *ctx) { char line[LINE_LEN+1]; readstat_error_t retval = READSTAT_OK; struct tm ts = { .tm_isdst = -1 }; char month[4]; int i; retval = xport_read_record(ctx, line); if (retval != 
READSTAT_OK) goto cleanup; sscanf(line, "%02d%3s%02d:%02d:%02d:%02d", &ts.tm_mday, month, &ts.tm_year, &ts.tm_hour, &ts.tm_min, &ts.tm_sec); for (i=0; itimestamp = mktime(&ts); cleanup: return retval; } static readstat_error_t xport_read_namestr_header_record(xport_ctx_t *ctx) { xport_header_record_t xrecord; readstat_error_t retval = READSTAT_OK; retval = xport_read_header_record(ctx, &xrecord); if (retval != READSTAT_OK) goto cleanup; if (ctx->version == 5 && strcmp(xrecord.name, "NAMESTR") != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } else if (ctx->version == 8 && strcmp(xrecord.name, "NAMSTV8") != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->var_count = xrecord.num2; ctx->variables = readstat_calloc(ctx->var_count, sizeof(readstat_variable_t *)); if (ctx->variables == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (ctx->info_handler) { if (ctx->info_handler(-1, ctx->var_count, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } cleanup: return retval; } static readstat_error_t xport_read_obs_header_record(xport_ctx_t *ctx) { return xport_expect_header_record(ctx, "OBS", "OBSV8"); } static readstat_error_t xport_construct_format(char *dst, size_t dst_len, const char *src, size_t src_len, int width, int decimals) { char format[4*src_len+1]; readstat_error_t retval = readstat_convert(format, sizeof(format), src, src_len, NULL); if (decimals) { snprintf(dst, dst_len, "%s%d.%d", format, width, decimals); } else if (width) { snprintf(dst, dst_len, "%s%d", format, width); } else { strcpy(dst, format); } return retval; } static readstat_error_t xport_read_labels_v8(xport_ctx_t *ctx, int label_count) { readstat_error_t retval = READSTAT_OK; uint16_t labeldef[3]; int i; for (i=0; i= ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } char name[name_len+1]; char label[label_len+1]; readstat_variable_t *variable = ctx->variables[index]; if (read_bytes(ctx, name, name_len) != name_len 
|| read_bytes(ctx, label, label_len) != label_len) { retval = READSTAT_ERROR_READ; goto cleanup; } retval = readstat_convert(variable->name, sizeof(variable->name), name, name_len, NULL); if (retval != READSTAT_OK) goto cleanup; retval = readstat_convert(variable->label, sizeof(variable->label), label, label_len, NULL); if (retval != READSTAT_OK) goto cleanup; } retval = xport_skip_rest_of_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_read_obs_header_record(ctx); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t xport_read_labels_v9(xport_ctx_t *ctx, int label_count) { readstat_error_t retval = READSTAT_OK; uint16_t labeldef[5]; int i; for (i=0; i= ctx->var_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } char name[name_len+1]; char format[format_len+1]; char informat[informat_len+1]; char label[label_len+1]; readstat_variable_t *variable = ctx->variables[index]; if (read_bytes(ctx, name, name_len) != name_len || read_bytes(ctx, format, format_len) != format_len || read_bytes(ctx, informat, informat_len) != informat_len || read_bytes(ctx, label, label_len) != label_len) { retval = READSTAT_ERROR_READ; goto cleanup; } retval = readstat_convert(variable->name, sizeof(variable->name), name, name_len, NULL); if (retval != READSTAT_OK) goto cleanup; retval = readstat_convert(variable->label, sizeof(variable->label), label, label_len, NULL); if (retval != READSTAT_OK) goto cleanup; retval = xport_construct_format(variable->format, sizeof(variable->format), format, format_len, variable->display_width, variable->decimals); if (retval != READSTAT_OK) goto cleanup; } retval = xport_skip_rest_of_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_read_obs_header_record(ctx); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t xport_read_variables(xport_ctx_t *ctx) { int i; readstat_error_t retval = READSTAT_OK; for (i=0; ivar_count; i++) { 
        xport_namestr_t namestr;
        ssize_t bytes_read = read_bytes(ctx, &namestr, sizeof(xport_namestr_t));
        if (bytes_read < sizeof(xport_namestr_t)) {
            retval = READSTAT_ERROR_READ;
            goto cleanup;
        }
        xport_namestr_bswap(&namestr);

        readstat_variable_t *variable = calloc(1, sizeof(readstat_variable_t));
        if (variable == NULL) {
            retval = READSTAT_ERROR_MALLOC;
            goto cleanup;
        }
        /* Store the pointer right away so xport_ctx_free() releases it if an error follows */
        ctx->variables[i] = variable;

        variable->index = i;
        variable->type = namestr.ntype == SAS_COLUMN_TYPE_CHR ? READSTAT_TYPE_STRING : READSTAT_TYPE_DOUBLE;
        variable->storage_width = namestr.nlng;
        variable->display_width = namestr.nfl;
        variable->decimals = namestr.nfd;
        variable->alignment = namestr.nfj ? READSTAT_ALIGNMENT_RIGHT : READSTAT_ALIGNMENT_LEFT;

        /* Capture each return value; the checks below were previously testing a stale retval */
        retval = readstat_convert(variable->name, sizeof(variable->name),
                namestr.nname, sizeof(namestr.nname), NULL);
        if (retval != READSTAT_OK)
            goto cleanup;

        retval = readstat_convert(variable->label, sizeof(variable->label),
                namestr.nlabel, sizeof(namestr.nlabel), NULL);
        if (retval != READSTAT_OK)
            goto cleanup;

        retval = xport_construct_format(variable->format, sizeof(variable->format),
                namestr.nform, sizeof(namestr.nform),
                variable->display_width, variable->decimals);
        if (retval != READSTAT_OK)
            goto cleanup;
    }

    retval = xport_skip_rest_of_record(ctx);
    if (retval != READSTAT_OK)
        goto cleanup;

    if (ctx->version == 5) {
        retval = xport_read_obs_header_record(ctx);
        if (retval != READSTAT_OK)
            goto cleanup;
    } else {
        xport_header_record_t xrecord;
        retval = xport_read_header_record(ctx, &xrecord);
        if (retval != READSTAT_OK)
            goto cleanup;

        if (strcmp(xrecord.name, "OBSV8") == 0) {
            /* void */
        } else if (strcmp(xrecord.name, "LABELV8") == 0) {
            retval = xport_read_labels_v8(ctx, xrecord.num1);
        } else if (strcmp(xrecord.name, "LABELV9") == 0) {
            retval = xport_read_labels_v9(ctx, xrecord.num1);
        }
        if (retval != READSTAT_OK)
            goto cleanup;
    }

    ctx->row_length = 0;

    int index_after_skipping = 0;

    for (i=0; i<ctx->var_count; i++) {
        readstat_variable_t *variable = ctx->variables[i];
        variable->index_after_skipping = index_after_skipping;

        int cb_retval = READSTAT_HANDLER_OK;
        if
(ctx->variable_handler) {
            cb_retval = ctx->variable_handler(i, variable, variable->format, ctx->user_ctx);
        }
        if (cb_retval == READSTAT_HANDLER_ABORT) {
            retval = READSTAT_ERROR_USER_ABORT;
            goto cleanup;
        }

        if (cb_retval == READSTAT_HANDLER_SKIP_VARIABLE) {
            variable->skip = 1;
        } else {
            index_after_skipping++;
        }

        ctx->row_length += variable->storage_width;
    }

cleanup:
    return retval;
}

static readstat_error_t xport_process_row(xport_ctx_t *ctx, const char *row, size_t row_length) {
    readstat_error_t retval = READSTAT_OK;
    int i;
    off_t pos = 0;
    char *string = NULL;
    for (i=0; i<ctx->var_count; i++) {
        readstat_variable_t *variable = ctx->variables[i];
        readstat_value_t value = { .type = variable->type };

        if (variable->type == READSTAT_TYPE_STRING) {
            string = readstat_realloc(string, 4*variable->storage_width+1);
            if (string == NULL) {
                retval = READSTAT_ERROR_MALLOC;
                goto cleanup;
            }
            retval = readstat_convert(string, 4*variable->storage_width+1,
                    &row[pos], variable->storage_width, NULL);
            if (retval != READSTAT_OK)
                goto cleanup;

            value.v.string_value = string;
        } else {
            double dval = NAN;
            if (variable->storage_width <= XPORT_MAX_DOUBLE_SIZE &&
                    variable->storage_width >= XPORT_MIN_DOUBLE_SIZE) {
                char full_value[8] = { 0 };
                if (memcmp(&full_value[1], &row[pos+1], variable->storage_width - 1) == 0 &&
                        (row[pos] == '_' || row[pos] == '.'
                         || (row[pos] >= 'A' && row[pos] <= 'Z'))) {
                    if (row[pos] == '.') {
                        value.is_system_missing = 1;
                    } else {
                        value.tag = row[pos];
                        value.is_tagged_missing = 1;
                    }
                } else {
                    memcpy(full_value, &row[pos], variable->storage_width);
                    int rc = cnxptiee(full_value, CN_TYPE_XPORT, &dval, CN_TYPE_NATIVE);
                    if (rc != 0) {
                        retval = READSTAT_ERROR_CONVERT;
                        goto cleanup;
                    }
                }
            }

            value.v.double_value = dval;
        }
        pos += variable->storage_width;

        if (ctx->value_handler(ctx->parsed_row_count, variable, value, ctx->user_ctx) != READSTAT_HANDLER_OK) {
            retval = READSTAT_ERROR_USER_ABORT;
            goto cleanup;
        }
    }

cleanup:
    free(string);
    return retval;
}

static readstat_error_t xport_read_data(xport_ctx_t *ctx) {
    if (!ctx->row_length)
        return READSTAT_OK;

    if (!ctx->value_handler)
        return READSTAT_OK;

    readstat_error_t retval = READSTAT_OK;
    char *row = readstat_malloc(ctx->row_length);
    char *blank_row = readstat_malloc(ctx->row_length);
    int num_blank_rows = 0;

    if (row == NULL || blank_row == NULL) {
        retval = READSTAT_ERROR_MALLOC;
        goto cleanup;
    }

    memset(blank_row, ' ', ctx->row_length);
    while (1) {
        ssize_t bytes_read = read_bytes(ctx, row, ctx->row_length);
        if (bytes_read == -1) {
            retval = READSTAT_ERROR_READ;
            goto cleanup;
        } else if (bytes_read < ctx->row_length) {
            break;
        }

        off_t pos = 0;

        int row_is_blank = 1;
        for (pos=0; pos<ctx->row_length; pos++) {
            if (row[pos] != ' ') {
                row_is_blank = 0;
                break;
            }
        }
        if (row_is_blank) {
            num_blank_rows++;
            continue;
        }

        while (num_blank_rows) {
            retval = xport_process_row(ctx, blank_row, ctx->row_length);
            if (retval != READSTAT_OK)
                goto cleanup;

            if (++(ctx->parsed_row_count) == ctx->row_limit)
                goto cleanup;

            num_blank_rows--;
        }

        retval = xport_process_row(ctx, row, ctx->row_length);
        if (retval != READSTAT_OK)
            goto cleanup;

        retval = xport_update_progress(ctx);
        if (retval != READSTAT_OK)
            goto cleanup;

        if (++(ctx->parsed_row_count) == ctx->row_limit)
            break;
    }

cleanup:
    if (row)
        free(row);
    if (blank_row)
        free(blank_row);

    return retval;
}

readstat_error_t
readstat_parse_xport(readstat_parser_t *parser, const char *path, void *user_ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = parser->io; xport_ctx_t *ctx = xport_ctx_init(); ctx->info_handler = parser->info_handler; ctx->metadata_handler = parser->metadata_handler; ctx->note_handler = parser->note_handler; ctx->variable_handler = parser->variable_handler; ctx->value_handler = parser->value_handler; ctx->value_label_handler = parser->value_label_handler; ctx->error_handler = parser->error_handler; ctx->progress_handler = parser->progress_handler; ctx->user_ctx = user_ctx; ctx->io = io; ctx->row_limit = parser->row_limit; if (io->open(path, io->io_ctx) == -1) { retval = READSTAT_ERROR_OPEN; goto cleanup; } if ((ctx->file_size = io->seek(0, READSTAT_SEEK_END, io->io_ctx)) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } if (io->seek(0, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } retval = xport_read_library_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_skip_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_read_timestamp_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_expect_header_record(ctx, "MEMBER", "MEMBV8"); if (retval != READSTAT_OK) goto cleanup; retval = xport_expect_header_record(ctx, "DSCRPTR", "DSCPTV8"); if (retval != READSTAT_OK) goto cleanup; retval = xport_skip_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_read_file_label_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_read_namestr_header_record(ctx); if (retval != READSTAT_OK) goto cleanup; retval = xport_read_variables(ctx); if (retval != READSTAT_OK) goto cleanup; if (ctx->row_length) { retval = xport_read_data(ctx); if (retval != READSTAT_OK) goto cleanup; } cleanup: io->close(io->io_ctx); xport_ctx_free(ctx); return retval; } haven/src/readstat/sas/readstat_sas7bdat_read.c0000644000176200001440000011470513227731765021322 0ustar 
liggesusers #include #include #include #include #include #include #include "readstat_sas.h"
#include "readstat_sas_rle.h"
#include "../readstat_iconv.h"
#include "../readstat_convert.h"
#include "../readstat_malloc.h"

#define SAS_COMPRESSION_SIGNATURE_RLE "SASYZCRL"
#define SAS_COMPRESSION_SIGNATURE_RDC "SASYZCR2"

typedef struct col_info_s {
    sas_text_ref_t name_ref;
    sas_text_ref_t format_ref;
    sas_text_ref_t label_ref;

    int index;
    int offset;
    int width;
    int type;
} col_info_t;

typedef struct sas7bdat_ctx_s {
    readstat_info_handler info_handler;
    readstat_metadata_handler metadata_handler;
    readstat_variable_handler variable_handler;
    readstat_value_handler value_handler;
    readstat_error_handler error_handler;
    readstat_progress_handler progress_handler;
    int64_t file_size;
    int little_endian;
    int u64;
    int vendor;
    void *user_ctx;
    readstat_io_t *io;
    int bswap;
    int did_submit_columns;

    uint32_t row_length;
    uint32_t page_row_count;
    uint32_t parsed_row_count;
    uint32_t column_count;
    uint32_t row_limit;

    uint64_t header_size;
    uint64_t page_count;
    uint64_t page_size;
    char *page;

    uint64_t page_header_size;
    uint64_t subheader_pointer_size;

    int text_blob_count;
    size_t *text_blob_lengths;
    char **text_blobs;

    int col_names_count;
    int col_attrs_count;
    int col_formats_count;

    size_t max_col_width;
    char *scratch_buffer;
    size_t scratch_buffer_len;

    int col_info_count;
    col_info_t *col_info;

    readstat_variable_t **variables;

    const char *input_encoding;
    const char *output_encoding;
    iconv_t converter;

    time_t timestamp;
    int version;
    char file_label[4*64+1];
    char error_buf[2048];
} sas7bdat_ctx_t;

static void sas7bdat_ctx_free(sas7bdat_ctx_t *ctx) {
    int i;
    if (ctx->text_blobs) {
        for (i=0; i<ctx->text_blob_count; i++) {
            free(ctx->text_blobs[i]);
        }
        free(ctx->text_blobs);
        free(ctx->text_blob_lengths);
    }
    if (ctx->variables) {
        for (i=0; i<ctx->column_count; i++) {
            if (ctx->variables[i])
                free(ctx->variables[i]);
        }
        free(ctx->variables);
    }
    if (ctx->col_info)
        free(ctx->col_info);

    if (ctx->scratch_buffer)
free(ctx->scratch_buffer); if (ctx->page) free(ctx->page); if (ctx->converter) iconv_close(ctx->converter); free(ctx); } static readstat_error_t sas7bdat_update_progress(sas7bdat_ctx_t *ctx) { readstat_io_t *io = ctx->io; return io->update(ctx->file_size, ctx->progress_handler, ctx->user_ctx, io->io_ctx); } static readstat_error_t sas7bdat_parse_column_text_subheader(const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; size_t signature_len = ctx->u64 ? 8 : 4; uint16_t remainder = sas_read2(&subheader[signature_len], ctx->bswap); char *blob = NULL; if (remainder != len - (4+2*signature_len)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->text_blob_count++; ctx->text_blobs = readstat_realloc(ctx->text_blobs, ctx->text_blob_count * sizeof(char *)); ctx->text_blob_lengths = readstat_realloc(ctx->text_blob_lengths, ctx->text_blob_count * sizeof(ctx->text_blob_lengths[0])); if (ctx->text_blobs == NULL || ctx->text_blob_lengths == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if ((blob = readstat_malloc(len-signature_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } memcpy(blob, subheader+signature_len, len-signature_len); ctx->text_blob_lengths[ctx->text_blob_count-1] = len-signature_len; ctx->text_blobs[ctx->text_blob_count-1] = blob; /* another bit of a hack */ if (len-signature_len > 12 + sizeof(SAS_COMPRESSION_SIGNATURE_RDC)-1 && strncmp(blob + 12, SAS_COMPRESSION_SIGNATURE_RDC, sizeof(SAS_COMPRESSION_SIGNATURE_RDC)-1) == 0) { retval = READSTAT_ERROR_UNSUPPORTED_COMPRESSION; goto cleanup; } cleanup: return retval; } static readstat_error_t sas7bdat_realloc_col_info(sas7bdat_ctx_t *ctx, size_t count) { if (ctx->col_info_count < count) { ctx->col_info_count = count; ctx->col_info = readstat_realloc(ctx->col_info, ctx->col_info_count * sizeof(col_info_t)); if (ctx->col_info == NULL) { return READSTAT_ERROR_MALLOC; } } return READSTAT_OK; } static readstat_error_t 
sas7bdat_parse_column_size_subheader(const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { uint64_t col_count; readstat_error_t retval = READSTAT_OK; if (ctx->column_count) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (len < (ctx->u64 ? 16 : 8)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (ctx->u64) { col_count = sas_read8(&subheader[8], ctx->bswap); } else { col_count = sas_read4(&subheader[4], ctx->bswap); } ctx->column_count = col_count; retval = sas7bdat_realloc_col_info(ctx, ctx->column_count); cleanup: return retval; } static readstat_error_t sas7bdat_parse_row_size_subheader(const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; uint64_t total_row_count; uint64_t row_length, page_row_count; if (len < (ctx->u64 ? 128 : 64)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (ctx->u64) { row_length = sas_read8(&subheader[40], ctx->bswap); total_row_count = sas_read8(&subheader[48], ctx->bswap); page_row_count = sas_read8(&subheader[120], ctx->bswap); } else { row_length = sas_read4(&subheader[20], ctx->bswap); total_row_count = sas_read4(&subheader[24], ctx->bswap); page_row_count = sas_read4(&subheader[60], ctx->bswap); } ctx->row_length = row_length; ctx->page_row_count = page_row_count; if (ctx->row_limit == 0 || total_row_count < ctx->row_limit) ctx->row_limit = total_row_count; cleanup: return retval; } static sas_text_ref_t sas7bdat_parse_text_ref(const char *data, sas7bdat_ctx_t *ctx) { sas_text_ref_t ref; ref.index = sas_read2(&data[0], ctx->bswap); ref.offset = sas_read2(&data[2], ctx->bswap); ref.length = sas_read2(&data[4], ctx->bswap); return ref; } static readstat_error_t sas7bdat_copy_text_ref(char *out_buffer, size_t out_buffer_len, sas_text_ref_t text_ref, sas7bdat_ctx_t *ctx) { if (text_ref.index >= ctx->text_blob_count) return READSTAT_ERROR_PARSE; if (text_ref.length == 0) { out_buffer[0] = '\0'; return READSTAT_OK; } char *blob = ctx->text_blobs[text_ref.index]; if 
(text_ref.offset + text_ref.length > ctx->text_blob_lengths[text_ref.index]) return READSTAT_ERROR_PARSE; return readstat_convert(out_buffer, out_buffer_len, &blob[text_ref.offset], text_ref.length, ctx->converter); } static readstat_error_t sas7bdat_parse_column_name_subheader(const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; size_t signature_len = ctx->u64 ? 8 : 4; int cmax = ctx->u64 ? (len-28)/8 : (len-20)/8; int i; const char *cnp = &subheader[signature_len+8]; uint16_t remainder = sas_read2(&subheader[signature_len], ctx->bswap); if (remainder != len - (4+2*signature_len)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->col_names_count += cmax; if ((retval = sas7bdat_realloc_col_info(ctx, ctx->col_names_count)) != READSTAT_OK) goto cleanup; for (i=ctx->col_names_count-cmax; i < ctx->col_names_count; i++) { ctx->col_info[i].name_ref = sas7bdat_parse_text_ref(cnp, ctx); cnp += 8; } cleanup: return retval; } static readstat_error_t sas7bdat_parse_column_attributes_subheader(const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; size_t signature_len = ctx->u64 ? 8 : 4; int cmax = ctx->u64 ? 
(len-28)/16 : (len-20)/12; int i; const char *cap = &subheader[signature_len+8]; uint16_t remainder = sas_read2(&subheader[signature_len], ctx->bswap); if (remainder != len - (4+2*signature_len)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->col_attrs_count += cmax; if ((retval = sas7bdat_realloc_col_info(ctx, ctx->col_attrs_count)) != READSTAT_OK) goto cleanup; for (i=ctx->col_attrs_count-cmax; i < ctx->col_attrs_count; i++) { if (ctx->u64) { ctx->col_info[i].offset = sas_read8(&cap[0], ctx->bswap); } else { ctx->col_info[i].offset = sas_read4(&cap[0], ctx->bswap); } readstat_off_t off=4; if (ctx->u64) off=8; ctx->col_info[i].width = sas_read4(&cap[off], ctx->bswap); if (ctx->col_info[i].width > ctx->max_col_width) ctx->max_col_width = ctx->col_info[i].width; if (cap[off+6] == SAS_COLUMN_TYPE_NUM) { ctx->col_info[i].type = READSTAT_TYPE_DOUBLE; if (ctx->col_info[i].width > 8 || ctx->col_info[i].width < 3) { retval = READSTAT_ERROR_PARSE; goto cleanup; } } else if (cap[off+6] == SAS_COLUMN_TYPE_CHR) { ctx->col_info[i].type = READSTAT_TYPE_STRING; if (ctx->col_info[i].width < 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } } else { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->col_info[i].index = i; cap += off+8; } cleanup: return retval; } static readstat_error_t sas7bdat_parse_column_format_subheader(const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; if (len < (ctx->u64 ? 58 : 46)) { retval = READSTAT_ERROR_PARSE; goto cleanup; } ctx->col_formats_count++; if ((retval = sas7bdat_realloc_col_info(ctx, ctx->col_formats_count)) != READSTAT_OK) goto cleanup; ctx->col_info[ctx->col_formats_count-1].format_ref = sas7bdat_parse_text_ref( ctx->u64 ? &subheader[46] : &subheader[34], ctx); ctx->col_info[ctx->col_formats_count-1].label_ref = sas7bdat_parse_text_ref( ctx->u64 ? 
&subheader[52] : &subheader[40], ctx); cleanup: return retval; } static readstat_error_t sas7bdat_handle_data_value(readstat_variable_t *variable, col_info_t *col_info, const char *col_data, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int cb_retval = 0; readstat_value_t value; memset(&value, 0, sizeof(readstat_value_t)); value.type = col_info->type; if (col_info->type == READSTAT_TYPE_STRING) { retval = readstat_convert(ctx->scratch_buffer, ctx->scratch_buffer_len, col_data, col_info->width, ctx->converter); if (retval != READSTAT_OK) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Error converting string to specified encoding: %.*s", col_info->width, col_data); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } value.v.string_value = ctx->scratch_buffer; } else if (col_info->type == READSTAT_TYPE_DOUBLE) { uint64_t val = 0; double dval = NAN; if (ctx->little_endian) { int k; for (k=0; k < col_info->width; k++) { val = (val << 8) | (unsigned char)col_data[col_info->width-1-k]; } } else { int k; for (k=0; k < col_info->width; k++) { val = (val << 8) | (unsigned char)col_data[k]; } } val <<= (8-col_info->width)*8; memcpy(&dval, &val, 8); if (isnan(dval)) { value.v.double_value = NAN; value.tag = ~((val >> 40) & 0xFF); if (value.tag) { value.is_tagged_missing = 1; } else { value.is_system_missing = 1; } } else { value.v.double_value = dval; } } cb_retval = ctx->value_handler(ctx->parsed_row_count, variable, value, ctx->user_ctx); if (cb_retval != READSTAT_HANDLER_OK) retval = READSTAT_ERROR_USER_ABORT; cleanup: return retval; } static readstat_error_t sas7bdat_parse_single_row(const char *data, sas7bdat_ctx_t *ctx) { if (ctx->parsed_row_count == ctx->row_limit) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; int j; if (ctx->value_handler) { ctx->scratch_buffer_len = 4*ctx->max_col_width+1; ctx->scratch_buffer = readstat_realloc(ctx->scratch_buffer, ctx->scratch_buffer_len); if (ctx->scratch_buffer == NULL) { 
retval = READSTAT_ERROR_MALLOC; goto cleanup; } for (j=0; j < ctx->column_count; j++) { col_info_t *col_info = &ctx->col_info[j]; readstat_variable_t *variable = ctx->variables[j]; if (variable->skip) continue; if (col_info->offset < 0 || col_info->offset + col_info->width > ctx->row_length) { retval = READSTAT_ERROR_PARSE; goto cleanup; } retval = sas7bdat_handle_data_value(variable, col_info, &data[col_info->offset], ctx); if (retval != READSTAT_OK) { goto cleanup; } } } ctx->parsed_row_count++; cleanup: return retval; } static readstat_error_t sas7bdat_parse_rows(const char *data, size_t len, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int i; size_t row_offset=0; for (i=0; i < ctx->page_row_count && ctx->parsed_row_count < ctx->row_limit; i++) { if (row_offset + ctx->row_length > len) { retval = READSTAT_ERROR_ROW_WIDTH_MISMATCH; goto cleanup; } if ((retval = sas7bdat_parse_single_row(&data[row_offset], ctx)) != READSTAT_OK) goto cleanup; row_offset += ctx->row_length; } cleanup: return retval; } static readstat_error_t sas7bdat_parse_subheader_rle(const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { if (ctx->row_limit == ctx->parsed_row_count) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; char *buffer = NULL; ssize_t bytes_decompressed = 0; if ((buffer = readstat_malloc(ctx->row_length)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } bytes_decompressed = sas_rle_decompress( buffer, ctx->row_length, subheader, len); if (bytes_decompressed != ctx->row_length) { retval = READSTAT_ERROR_ROW_WIDTH_MISMATCH; if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Row #%d decompressed to %ld bytes (expected %d bytes)", ctx->parsed_row_count, (long)(bytes_decompressed), ctx->row_length); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } retval = sas7bdat_parse_single_row(buffer, ctx); cleanup: if (buffer) free(buffer); return retval; } static readstat_error_t 
sas7bdat_parse_subheader(uint32_t signature, const char *subheader, size_t len, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; if (len < 6) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (signature == SAS_SUBHEADER_SIGNATURE_ROW_SIZE) { retval = sas7bdat_parse_row_size_subheader(subheader, len, ctx); } else if (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_SIZE) { retval = sas7bdat_parse_column_size_subheader(subheader, len, ctx); } else if (signature == SAS_SUBHEADER_SIGNATURE_COUNTS) { /* void */ } else if (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_TEXT) { retval = sas7bdat_parse_column_text_subheader(subheader, len, ctx); } else if (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_NAME) { retval = sas7bdat_parse_column_name_subheader(subheader, len, ctx); } else if (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_ATTRS) { retval = sas7bdat_parse_column_attributes_subheader(subheader, len, ctx); } else if (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_FORMAT) { retval = sas7bdat_parse_column_format_subheader(subheader, len, ctx); } else if (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_LIST) { /* void */ } else if ((signature & SAS_SUBHEADER_SIGNATURE_COLUMN_MASK) == SAS_SUBHEADER_SIGNATURE_COLUMN_MASK) { /* void */ } else { retval = READSTAT_ERROR_PARSE; } cleanup: return retval; } static readstat_variable_t *sas7bdat_init_variable(sas7bdat_ctx_t *ctx, int i, int index_after_skipping, readstat_error_t *out_retval) { readstat_error_t retval = READSTAT_OK; readstat_variable_t *variable = readstat_calloc(1, sizeof(readstat_variable_t)); variable->index = i; variable->index_after_skipping = index_after_skipping; variable->type = ctx->col_info[i].type; variable->storage_width = ctx->col_info[i].width; if ((retval = sas7bdat_copy_text_ref(variable->name, sizeof(variable->name), ctx->col_info[i].name_ref, ctx)) != READSTAT_OK) { goto cleanup; } if ((retval = sas7bdat_copy_text_ref(variable->format, sizeof(variable->format), ctx->col_info[i].format_ref, ctx)) != 
READSTAT_OK) { goto cleanup; } if ((retval = sas7bdat_copy_text_ref(variable->label, sizeof(variable->label), ctx->col_info[i].label_ref, ctx)) != READSTAT_OK) { goto cleanup; } cleanup: if (retval != READSTAT_OK) { if (out_retval) *out_retval = retval; if (retval == READSTAT_ERROR_CONVERT_BAD_STRING) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Error converting variable #%d info to specified encoding: %s %s (%s)", i, variable->name, variable->format, variable->label); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } } /* free only after the error message has read the variable's fields */ free(variable); return NULL; } return variable; } static readstat_error_t sas7bdat_submit_columns(sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; if (ctx->info_handler) { if (ctx->info_handler(ctx->row_limit, ctx->column_count, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } if (ctx->metadata_handler) { if (ctx->metadata_handler(ctx->file_label, ctx->input_encoding, ctx->timestamp, ctx->version, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } ctx->variables = readstat_calloc(ctx->column_count, sizeof(readstat_variable_t *)); if (ctx->variables == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } int i; int index_after_skipping = 0; for (i=0; i < ctx->column_count; i++) { ctx->variables[i] = sas7bdat_init_variable(ctx, i, index_after_skipping, &retval); if (ctx->variables[i] == NULL) break; int cb_retval = READSTAT_HANDLER_OK; if (ctx->variable_handler) { cb_retval = ctx->variable_handler(i, ctx->variables[i], ctx->variables[i]->format, ctx->user_ctx); } if (cb_retval == READSTAT_HANDLER_ABORT) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } if (cb_retval == READSTAT_HANDLER_SKIP_VARIABLE) { ctx->variables[i]->skip = 1; } else { index_after_skipping++; } } cleanup: return retval; } static readstat_error_t sas7bdat_submit_columns_if_needed(sas7bdat_ctx_t *ctx) { readstat_error_t retval = 
READSTAT_OK; if (!ctx->did_submit_columns) { if ((retval = sas7bdat_submit_columns(ctx)) != READSTAT_OK) { goto cleanup; } ctx->did_submit_columns = 1; } cleanup: return retval; } static int sas7bdat_signature_is_recognized(uint32_t signature) { return (signature == SAS_SUBHEADER_SIGNATURE_ROW_SIZE || signature == SAS_SUBHEADER_SIGNATURE_COLUMN_SIZE || signature == SAS_SUBHEADER_SIGNATURE_COUNTS || signature == SAS_SUBHEADER_SIGNATURE_COLUMN_FORMAT || (signature & SAS_SUBHEADER_SIGNATURE_COLUMN_MASK) == SAS_SUBHEADER_SIGNATURE_COLUMN_MASK); } /* First, extract column text */ static readstat_error_t sas7bdat_parse_page_pass1(const char *page, size_t page_size, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; uint16_t subheader_count = sas_read2(&page[ctx->page_header_size-4], ctx->bswap); int i; const char *shp = &page[ctx->page_header_size]; int lshp = ctx->subheader_pointer_size; if (ctx->page_header_size + subheader_count*lshp > ctx->page_size) { retval = READSTAT_ERROR_PARSE; goto cleanup; } for (i=0; i < subheader_count; i++) { uint64_t offset = 0, len = 0; uint32_t signature = 0; unsigned char compression = 0; size_t signature_len = ctx->u64 ? 
8 : 4; if (ctx->u64) { offset = sas_read8(&shp[0], ctx->bswap); len = sas_read8(&shp[8], ctx->bswap); compression = shp[16]; } else { offset = sas_read4(&shp[0], ctx->bswap); len = sas_read4(&shp[4], ctx->bswap); compression = shp[8]; } if (len > 0 && compression != SAS_COMPRESSION_TRUNC) { if (offset > page_size || offset + len > page_size || offset < ctx->page_header_size+subheader_count*lshp) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (compression == SAS_COMPRESSION_NONE) { if (len < signature_len) { retval = READSTAT_ERROR_PARSE; goto cleanup; } signature = sas_read4(page + offset, ctx->bswap); if (!ctx->little_endian && signature == -1 && signature_len == 8) { signature = sas_read4(page + offset + 4, ctx->bswap); } if (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_TEXT) { if ((retval = sas7bdat_parse_subheader(signature, page + offset, len, ctx)) != READSTAT_OK) { goto cleanup; } } } else if (compression == SAS_COMPRESSION_ROW) { /* void */ } else { retval = READSTAT_ERROR_UNSUPPORTED_COMPRESSION; goto cleanup; } } shp += lshp; } cleanup: return retval; } static readstat_error_t sas7bdat_parse_page_pass2(const char *page, size_t page_size, sas7bdat_ctx_t *ctx) { uint16_t page_type; readstat_error_t retval = READSTAT_OK; page_type = sas_read2(&page[ctx->page_header_size-8], ctx->bswap); const char *data = NULL; if ((page_type & SAS_PAGE_TYPE_MASK) == SAS_PAGE_TYPE_DATA) { ctx->page_row_count = sas_read2(&page[ctx->page_header_size-6], ctx->bswap); data = &page[ctx->page_header_size]; } else if (!(page_type & SAS_PAGE_TYPE_COMP)) { uint16_t subheader_count = sas_read2(&page[ctx->page_header_size-4], ctx->bswap); int i; const char *shp = &page[ctx->page_header_size]; for (i=0; i < subheader_count; i++) { uint64_t offset = 0, len = 0; uint32_t signature = 0; unsigned char compression = 0; unsigned char is_compressed_data = 0; int lshp = ctx->subheader_pointer_size; if (ctx->u64) { offset = sas_read8(&shp[0], ctx->bswap); len = sas_read8(&shp[8], ctx->bswap); compression = shp[16]; is_compressed_data = shp[17]; } else { offset = sas_read4(&shp[0], ctx->bswap); len = sas_read4(&shp[4], ctx->bswap); compression = shp[8]; 
is_compressed_data = shp[9]; } if (len > 0 && compression != SAS_COMPRESSION_TRUNC) { if (offset > page_size || offset + len > page_size || offset < ctx->page_header_size+subheader_count*lshp) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (compression == SAS_COMPRESSION_NONE) { signature = sas_read4(page + offset, ctx->bswap); if (!ctx->little_endian && signature == -1 && ctx->u64) { signature = sas_read4(page + offset + 4, ctx->bswap); } if (is_compressed_data && !sas7bdat_signature_is_recognized(signature)) { if (len != ctx->row_length) { retval = READSTAT_ERROR_ROW_WIDTH_MISMATCH; goto cleanup; } if ((retval = sas7bdat_submit_columns_if_needed(ctx)) != READSTAT_OK) { goto cleanup; } if ((retval = sas7bdat_parse_single_row(page + offset, ctx)) != READSTAT_OK) { goto cleanup; } } else { if (signature != SAS_SUBHEADER_SIGNATURE_COLUMN_TEXT) { if ((retval = sas7bdat_parse_subheader(signature, page + offset, len, ctx)) != READSTAT_OK) { goto cleanup; } } } } else if (compression == SAS_COMPRESSION_ROW) { if ((retval = sas7bdat_submit_columns_if_needed(ctx)) != READSTAT_OK) { goto cleanup; } if ((retval = sas7bdat_parse_subheader_rle(page + offset, len, ctx)) != READSTAT_OK) { goto cleanup; } } else { retval = READSTAT_ERROR_UNSUPPORTED_COMPRESSION; goto cleanup; } } shp += lshp; } if ((page_type & SAS_PAGE_TYPE_MASK) == SAS_PAGE_TYPE_MIX) { /* HACK - this is supposed to obey 8-byte boundaries but * some files created by Stat/Transfer don't. 
So verify that the * padding is { 0, 0, 0, 0 } or { ' ', ' ', ' ', ' ' } (or that * the file is not from Stat/Transfer) before skipping it */ if ((shp-page)%8 == 4 && (*(uint32_t *)shp == 0x00000000 || *(uint32_t *)shp == 0x20202020 || ctx->vendor != READSTAT_VENDOR_STAT_TRANSFER)) { data = shp + 4; } else { data = shp; } } } if (data) { if ((retval = sas7bdat_submit_columns_if_needed(ctx)) != READSTAT_OK) { goto cleanup; } if (ctx->value_handler) { retval = sas7bdat_parse_rows(data, page + page_size - data, ctx); } } cleanup: return retval; } static readstat_error_t sas7bdat_parse_meta_pages_pass1(sas7bdat_ctx_t *ctx, int64_t *outLastExaminedPage) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; int64_t i; /* look for META and MIX pages at beginning... */ for (i=0; i < ctx->page_count; i++) { if (io->seek(ctx->header_size + i*ctx->page_size, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Failed to seek to position %" PRId64 " (= %" PRId64 " + %" PRId64 "*%" PRId64 ")", ctx->header_size + i*ctx->page_size, ctx->header_size, i, ctx->page_size); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } readstat_off_t off = 0; if (ctx->u64) off = 16; size_t head_len = off + 16 + 2; size_t tail_len = ctx->page_size - head_len; if (io->read(ctx->page, head_len, io->io_ctx) < head_len) { retval = READSTAT_ERROR_READ; goto cleanup; } uint16_t page_type = sas_read2(&ctx->page[off+16], ctx->bswap); if ((page_type & SAS_PAGE_TYPE_MASK) == SAS_PAGE_TYPE_DATA) break; if ((page_type & SAS_PAGE_TYPE_COMP)) continue; if (io->read(ctx->page + head_len, tail_len, io->io_ctx) < tail_len) { retval = READSTAT_ERROR_READ; goto cleanup; } if ((retval = sas7bdat_parse_page_pass1(ctx->page, ctx->page_size, ctx)) != READSTAT_OK) { if (ctx->error_handler && retval != READSTAT_ERROR_USER_ABORT) { int64_t pos = io->seek(0, READSTAT_SEEK_CUR, io->io_ctx); 
snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Error parsing page %" PRId64 ", bytes %" PRId64 "-%" PRId64, i, pos - ctx->page_size, pos-1); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } } cleanup: if (outLastExaminedPage) *outLastExaminedPage = i; return retval; } static readstat_error_t sas7bdat_parse_amd_pages_pass1(int64_t last_examined_page_pass1, sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; uint64_t i; uint64_t amd_page_count = 0; /* ...then AMD pages at the end */ for (i=ctx->page_count-1; i>last_examined_page_pass1; i--) { if (io->seek(ctx->header_size + i*ctx->page_size, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Failed to seek to position %" PRId64 " (= %" PRId64 " + %" PRId64 "*%" PRId64 ")", ctx->header_size + i*ctx->page_size, ctx->header_size, i, ctx->page_size); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } readstat_off_t off = 0; if (ctx->u64) off = 16; size_t head_len = off + 16 + 2; size_t tail_len = ctx->page_size - head_len; if (io->read(ctx->page, head_len, io->io_ctx) < head_len) { retval = READSTAT_ERROR_READ; goto cleanup; } uint16_t page_type = sas_read2(&ctx->page[off+16], ctx->bswap); if ((page_type & SAS_PAGE_TYPE_MASK) == SAS_PAGE_TYPE_DATA) { /* Usually AMD pages are at the end but sometimes data pages appear after them */ if (amd_page_count > 0) break; continue; } if ((page_type & SAS_PAGE_TYPE_COMP)) continue; if (io->read(ctx->page + head_len, tail_len, io->io_ctx) < tail_len) { retval = READSTAT_ERROR_READ; goto cleanup; } if ((retval = sas7bdat_parse_page_pass1(ctx->page, ctx->page_size, ctx)) != READSTAT_OK) { if (ctx->error_handler && retval != READSTAT_ERROR_USER_ABORT) { int64_t pos = io->seek(0, READSTAT_SEEK_CUR, io->io_ctx); snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Error parsing page %" 
PRId64 ", bytes %" PRId64 "-%" PRId64, i, pos - ctx->page_size, pos-1); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } amd_page_count++; } cleanup: return retval; } static readstat_error_t sas7bdat_parse_all_pages_pass2(sas7bdat_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; int64_t i; for (i=0; i < ctx->page_count; i++) { if ((retval = sas7bdat_update_progress(ctx)) != READSTAT_OK) { goto cleanup; } if (io->read(ctx->page, ctx->page_size, io->io_ctx) < ctx->page_size) { retval = READSTAT_ERROR_READ; goto cleanup; } if ((retval = sas7bdat_parse_page_pass2(ctx->page, ctx->page_size, ctx)) != READSTAT_OK) { if (ctx->error_handler && retval != READSTAT_ERROR_USER_ABORT) { int64_t pos = io->seek(0, READSTAT_SEEK_CUR, io->io_ctx); snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Error parsing page %" PRId64 ", bytes %" PRId64 "-%" PRId64, i, pos - ctx->page_size, pos-1); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } if (ctx->parsed_row_count == ctx->row_limit) break; } cleanup: return retval; } readstat_error_t readstat_parse_sas7bdat(readstat_parser_t *parser, const char *path, void *user_ctx) { int64_t last_examined_page_pass1 = 0; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = parser->io; sas7bdat_ctx_t *ctx = calloc(1, sizeof(sas7bdat_ctx_t)); sas_header_info_t *hinfo = calloc(1, sizeof(sas_header_info_t)); ctx->info_handler = parser->info_handler; ctx->metadata_handler = parser->metadata_handler; ctx->variable_handler = parser->variable_handler; ctx->value_handler = parser->value_handler; ctx->error_handler = parser->error_handler; ctx->progress_handler = parser->progress_handler; ctx->input_encoding = parser->input_encoding; ctx->output_encoding = parser->output_encoding; ctx->user_ctx = user_ctx; ctx->io = parser->io; ctx->row_limit = parser->row_limit; if (io->open(path, io->io_ctx) == -1) { retval = READSTAT_ERROR_OPEN; goto cleanup; } if ((ctx->file_size = 
io->seek(0, READSTAT_SEEK_END, io->io_ctx)) == -1) { retval = READSTAT_ERROR_SEEK; if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Failed to seek to end of file"); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } if (io->seek(0, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Failed to seek to beginning of file"); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } if ((retval = sas_read_header(io, hinfo, ctx->error_handler, user_ctx)) != READSTAT_OK) { goto cleanup; } ctx->u64 = hinfo->u64; ctx->little_endian = hinfo->little_endian; ctx->vendor = hinfo->vendor; ctx->bswap = machine_is_little_endian() ^ hinfo->little_endian; ctx->header_size = hinfo->header_size; ctx->page_count = hinfo->page_count; ctx->page_size = hinfo->page_size; ctx->page_header_size = hinfo->page_header_size; ctx->subheader_pointer_size = hinfo->subheader_pointer_size; ctx->timestamp = hinfo->modification_time; ctx->version = 10000 * hinfo->major_version + hinfo->minor_version; if (ctx->input_encoding == NULL) { ctx->input_encoding = hinfo->encoding; } if ((ctx->page = readstat_malloc(ctx->page_size)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (ctx->input_encoding && ctx->output_encoding && strcmp(ctx->input_encoding, ctx->output_encoding) != 0) { iconv_t converter = iconv_open(ctx->output_encoding, ctx->input_encoding); if (converter == (iconv_t)-1) { retval = READSTAT_ERROR_UNSUPPORTED_CHARSET; goto cleanup; } ctx->converter = converter; } if ((retval = readstat_convert(ctx->file_label, sizeof(ctx->file_label), hinfo->file_label, sizeof(hinfo->file_label), ctx->converter)) != READSTAT_OK) { goto cleanup; } if ((retval = sas7bdat_parse_meta_pages_pass1(ctx, &last_examined_page_pass1)) != READSTAT_OK) { goto cleanup; } if ((retval = sas7bdat_parse_amd_pages_pass1(last_examined_page_pass1, ctx)) 
!= READSTAT_OK) { goto cleanup; } if (io->seek(ctx->header_size, READSTAT_SEEK_SET, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Failed to seek to position %" PRId64, ctx->header_size); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } if ((retval = sas7bdat_parse_all_pages_pass2(ctx)) != READSTAT_OK) { goto cleanup; } if ((retval = sas7bdat_submit_columns_if_needed(ctx)) != READSTAT_OK) { goto cleanup; } if (ctx->value_handler && ctx->parsed_row_count != ctx->row_limit) { retval = READSTAT_ERROR_ROW_COUNT_MISMATCH; if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: Expected %d rows in file, found %d", ctx->row_limit, ctx->parsed_row_count); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } goto cleanup; } if ((retval = sas7bdat_update_progress(ctx)) != READSTAT_OK) { goto cleanup; } cleanup: io->close(io->io_ctx); if (retval == READSTAT_ERROR_OPEN || retval == READSTAT_ERROR_READ || retval == READSTAT_ERROR_SEEK) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "ReadStat: %s (retval = %d): %s (errno = %d)", readstat_error_message(retval), retval, strerror(errno), errno); ctx->error_handler(ctx->error_buf, user_ctx); } } if (ctx) sas7bdat_ctx_free(ctx); if (hinfo) free(hinfo); return retval; } haven/src/readstat/sas/readstat_sas_rle.c0000644000176200001440000002204413227731765020241 0ustar liggesusers #include #include #include #include "readstat_sas_rle.h" #define SAS_RLE_COMMAND_COPY64 0 #define SAS_RLE_COMMAND_INSERT_BYTE18 4 #define SAS_RLE_COMMAND_INSERT_AT17 5 #define SAS_RLE_COMMAND_INSERT_BLANK17 6 #define SAS_RLE_COMMAND_INSERT_ZERO17 7 #define SAS_RLE_COMMAND_COPY1 8 #define SAS_RLE_COMMAND_COPY17 9 #define SAS_RLE_COMMAND_COPY33 10 #define SAS_RLE_COMMAND_COPY49 11 #define SAS_RLE_COMMAND_INSERT_BYTE3 12 #define SAS_RLE_COMMAND_INSERT_AT2 13 #define SAS_RLE_COMMAND_INSERT_BLANK2 
14 #define SAS_RLE_COMMAND_INSERT_ZERO2 15 static size_t command_lengths[16] = { [SAS_RLE_COMMAND_COPY64] = 1, [SAS_RLE_COMMAND_INSERT_BYTE18] = 2, [SAS_RLE_COMMAND_INSERT_AT17] = 1, [SAS_RLE_COMMAND_INSERT_BLANK17] = 1, [SAS_RLE_COMMAND_INSERT_ZERO17] = 1, [SAS_RLE_COMMAND_INSERT_BYTE3] = 1 }; ssize_t sas_rle_decompressed_len(const void *input_buf, size_t input_len) { return sas_rle_decompress(NULL, 0, input_buf, input_len); } ssize_t sas_rle_decompress(void *output_buf, size_t output_len, const void *input_buf, size_t input_len) { unsigned char *buffer = (unsigned char *)output_buf; unsigned char *output = buffer; size_t output_written = 0; const unsigned char *input = (const unsigned char *)input_buf; while (input < (const unsigned char *)input_buf + input_len) { unsigned char control = *input++; unsigned char command = (control & 0xF0) >> 4; unsigned char length = (control & 0x0F); int copy_len = 0; int insert_len = 0; unsigned char insert_byte = '\0'; if (input + command_lengths[command] > (const unsigned char *)input_buf + input_len) { return -1; } switch (command) { case SAS_RLE_COMMAND_COPY64: copy_len = (*input++) + 64 + length * 256; break; case SAS_RLE_COMMAND_INSERT_BYTE18: insert_len = (*input++) + 18 + length * 256; insert_byte = *input++; break; case SAS_RLE_COMMAND_INSERT_AT17: insert_len = (*input++) + 17 + length * 256; insert_byte = '@'; break; case SAS_RLE_COMMAND_INSERT_BLANK17: insert_len = (*input++) + 17 + length * 256; insert_byte = ' '; break; case SAS_RLE_COMMAND_INSERT_ZERO17: insert_len = (*input++) + 17 + length * 256; insert_byte = '\0'; break; case SAS_RLE_COMMAND_COPY1: copy_len = length + 1; break; case SAS_RLE_COMMAND_COPY17: copy_len = length + 17; break; case SAS_RLE_COMMAND_COPY33: copy_len = length + 33; break; case SAS_RLE_COMMAND_COPY49: copy_len = length + 49; break; case SAS_RLE_COMMAND_INSERT_BYTE3: insert_byte = *input++; insert_len = length + 3; break; case SAS_RLE_COMMAND_INSERT_AT2: insert_byte = '@'; insert_len = 
length + 2; break; case SAS_RLE_COMMAND_INSERT_BLANK2: insert_byte = ' '; insert_len = length + 2; break; case SAS_RLE_COMMAND_INSERT_ZERO2: insert_byte = '\0'; insert_len = length + 2; break; default: /* error out here? */ break; } if (copy_len) { if (output_written + copy_len > output_len) { return -1; } if (input + copy_len > (const unsigned char *)input_buf + input_len) { return -1; } if (output) { memcpy(&output[output_written], input, copy_len); } input += copy_len; output_written += copy_len; } if (insert_len) { if (output_written + insert_len > output_len) { return -1; } if (output) { memset(&output[output_written], insert_byte, insert_len); } output_written += insert_len; } } return output_written; } static size_t sas_rle_measure_copy_run(size_t copy_run) { return (copy_run > 64) + (copy_run > 0) + copy_run; } static size_t sas_rle_copy_run(unsigned char *output_buf, size_t offset, const unsigned char *copy, size_t copy_run) { unsigned char *out = output_buf + offset; if (output_buf == NULL) return sas_rle_measure_copy_run(copy_run); if (copy_run > 64) { int length = (copy_run - 64) / 256; unsigned char rem = (copy_run - 64) % 256; *out++ = (SAS_RLE_COMMAND_COPY64 << 4) + (length & 0x0F); *out++ = rem; } else if (copy_run >= 49) { *out++ = (SAS_RLE_COMMAND_COPY49 << 4) + (copy_run - 49); } else if (copy_run >= 33) { *out++ = (SAS_RLE_COMMAND_COPY33 << 4) + (copy_run - 33); } else if (copy_run >= 17) { *out++ = (SAS_RLE_COMMAND_COPY17 << 4) + (copy_run - 17); } else if (copy_run >= 1) { *out++ = (SAS_RLE_COMMAND_COPY1 << 4) + (copy_run - 1); } memcpy(out, copy, copy_run); out += copy_run; return out - (output_buf + offset); } static int sas_rle_is_special_byte(unsigned char last_byte) { return (last_byte == '@' || last_byte == ' ' || last_byte == '\0'); } static size_t sas_rle_measure_insert_run(unsigned char last_byte, size_t insert_run) { if (sas_rle_is_special_byte(last_byte)) return insert_run > 17 ? 2 : 1; return insert_run > 18 ? 
3 : 2; } static size_t sas_rle_insert_run(unsigned char *output_buf, size_t offset, unsigned char last_byte, size_t insert_run) { unsigned char *out = output_buf + offset; if (output_buf == NULL) return sas_rle_measure_insert_run(last_byte, insert_run); if (sas_rle_is_special_byte(last_byte)) { if (insert_run > 17) { int length = (insert_run - 17) / 256; unsigned char rem = (insert_run - 17) % 256; if (last_byte == '@') { *out++ = (SAS_RLE_COMMAND_INSERT_AT17 << 4) + (length & 0x0F); } else if (last_byte == ' ') { *out++ = (SAS_RLE_COMMAND_INSERT_BLANK17 << 4) + (length & 0x0F); } else if (last_byte == '\0') { *out++ = (SAS_RLE_COMMAND_INSERT_ZERO17 << 4) + (length & 0x0F); } *out++ = rem; } else if (insert_run >= 2) { if (last_byte == '@') { *out++ = (SAS_RLE_COMMAND_INSERT_AT2 << 4) + (insert_run - 2); } else if (last_byte == ' ') { *out++ = (SAS_RLE_COMMAND_INSERT_BLANK2 << 4) + (insert_run - 2); } else if (last_byte == '\0') { *out++ = (SAS_RLE_COMMAND_INSERT_ZERO2 << 4) + (insert_run - 2); } } } else if (insert_run > 18) { int length = (insert_run - 18) / 256; unsigned char rem = (insert_run - 18) % 256; *out++ = (SAS_RLE_COMMAND_INSERT_BYTE18 << 4) + (length & 0x0F); *out++ = rem; *out++ = last_byte; } else if (insert_run >= 3) { *out++ = (SAS_RLE_COMMAND_INSERT_BYTE3 << 4) + (insert_run - 3); *out++ = last_byte; } return out - (output_buf + offset); } static int sas_rle_is_insert_run(unsigned char last_byte, size_t insert_run) { if (sas_rle_is_special_byte(last_byte)) return (insert_run > 1); return (insert_run > 2); } ssize_t sas_rle_compressed_len(const void *bytes, size_t len) { return sas_rle_compress(NULL, 0, bytes, len); } ssize_t sas_rle_compress(void *output_buf, size_t output_len, const void *input_buf, size_t input_len) { /* TODO bounds check */ const unsigned char *p = (const unsigned char *)input_buf; const unsigned char *pe = p + input_len; const unsigned char *copy = p; unsigned char *out = (unsigned char *)output_buf; size_t insert_run = 0; 
size_t copy_run = 0; size_t out_written = 0; unsigned char last_byte = 0; while (p < pe) { unsigned char c = *p; if (insert_run == 0) { insert_run = 1; } else if (c == last_byte) { insert_run++; } else { if (sas_rle_is_insert_run(last_byte, insert_run)) { out_written += sas_rle_copy_run(out, out_written, copy, copy_run); out_written += sas_rle_insert_run(out, out_written, last_byte, insert_run); copy_run = 0; copy = p; } else { copy_run += insert_run; } insert_run = 1; } last_byte = c; p++; } if (sas_rle_is_insert_run(last_byte, insert_run)) { out_written += sas_rle_copy_run(out, out_written, copy, copy_run); out_written += sas_rle_insert_run(out, out_written, last_byte, insert_run); } else { out_written += sas_rle_copy_run(out, out_written, copy, copy_run + insert_run); } return out_written; } haven/src/readstat/sas/ieee.h0000644000176200001440000000026213227731765015634 0ustar liggesusers#define CN_TYPE_NATIVE 0 #define CN_TYPE_XPORT 1 #define CN_TYPE_IEEEB 2 #define CN_TYPE_IEEEL 3 int cnxptiee(const void *from_bytes, int fromtype, void *to_bytes, int totype); haven/src/readstat/sas/readstat_sas7bdat_write.c0000644000176200001440000007305413227731765021542 0ustar liggesusers #include #include #include #include #include "../readstat.h" #include "../readstat_writer.h" #include "readstat_sas.h" #include "readstat_sas_rle.h" typedef struct sas7bdat_subheader_s { uint32_t signature; char *data; size_t len; int is_row_data; int is_row_data_compressed; } sas7bdat_subheader_t; typedef struct sas7bdat_subheader_array_s { int64_t count; int64_t capacity; sas7bdat_subheader_t **subheaders; } sas7bdat_subheader_array_t; typedef struct sas7bdat_column_text_s { char *data; size_t capacity; size_t used; int64_t index; } sas7bdat_column_text_t; typedef struct sas7bdat_column_text_array_s { int64_t count; sas7bdat_column_text_t **column_texts; } sas7bdat_column_text_array_t; typedef struct sas7bdat_write_ctx_s { sas_header_info_t *hinfo; sas7bdat_subheader_array_t *sarray; } 
sas7bdat_write_ctx_t; static size_t sas7bdat_variable_width(readstat_type_t type, size_t user_width); static int32_t sas7bdat_count_meta_pages(readstat_writer_t *writer) { sas7bdat_write_ctx_t *ctx = (sas7bdat_write_ctx_t *)writer->module_ctx; sas_header_info_t *hinfo = ctx->hinfo; sas7bdat_subheader_array_t *sarray = ctx->sarray; int i; int pages = 1; size_t bytes_left = hinfo->page_size - hinfo->page_header_size; size_t shp_ptr_size = hinfo->subheader_pointer_size; for (i=sarray->count-1; i>=0; i--) { sas7bdat_subheader_t *subheader = sarray->subheaders[i]; if (subheader->len + shp_ptr_size > bytes_left) { bytes_left = hinfo->page_size - hinfo->page_header_size; pages++; } bytes_left -= (subheader->len + shp_ptr_size); } return pages; } static size_t sas7bdat_row_length(readstat_writer_t *writer) { int i; size_t len = 0; for (i=0; i < writer->variables_count; i++) { readstat_variable_t *variable = readstat_get_variable(writer, i); len += sas7bdat_variable_width(readstat_variable_get_type(variable), readstat_variable_get_storage_width(variable)); } return len; } static int32_t sas7bdat_rows_per_page(readstat_writer_t *writer, sas_header_info_t *hinfo) { size_t row_length = sas7bdat_row_length(writer); return (hinfo->page_size - hinfo->page_header_size) / row_length; } static int32_t sas7bdat_count_data_pages(readstat_writer_t *writer, sas_header_info_t *hinfo) { if (writer->compression == READSTAT_COMPRESS_ROWS) return 0; int32_t rows_per_page = sas7bdat_rows_per_page(writer, hinfo); return (writer->row_count + (rows_per_page - 1)) / rows_per_page; } static sas7bdat_column_text_t *sas7bdat_column_text_init(int64_t index, size_t len) { sas7bdat_column_text_t *column_text = calloc(1, sizeof(sas7bdat_column_text_t)); column_text->data = malloc(len); column_text->capacity = len; column_text->index = index; return column_text; } static void sas7bdat_column_text_free(sas7bdat_column_text_t *column_text) { free(column_text->data); free(column_text); } static void 
sas7bdat_column_text_array_free(sas7bdat_column_text_array_t *column_text_array) { int i; for (i=0; i < column_text_array->count; i++) { sas7bdat_column_text_free(column_text_array->column_texts[i]); } free(column_text_array->column_texts); free(column_text_array); } static sas_text_ref_t sas7bdat_make_text_ref(sas7bdat_column_text_array_t *column_text_array, const char *string) { size_t len = strlen(string); size_t padded_len = (len + 3) / 4 * 4; sas7bdat_column_text_t *column_text = column_text_array->column_texts[ column_text_array->count-1]; if (column_text->used + padded_len > column_text->capacity) { column_text_array->count++; column_text_array->column_texts = realloc(column_text_array->column_texts, sizeof(sas7bdat_column_text_t *) * column_text_array->count); column_text = sas7bdat_column_text_init(column_text_array->count-1, column_text->capacity); column_text_array->column_texts[column_text_array->count-1] = column_text; } sas_text_ref_t text_ref = { .index = column_text->index, .offset = column_text->used + 28, .length = len }; strncpy(&column_text->data[column_text->used], string, padded_len); column_text->used += padded_len; return text_ref; } static readstat_error_t sas7bdat_emit_header(readstat_writer_t *writer, sas_header_info_t *hinfo) { sas_header_start_t header_start = { .a2 = hinfo->u64 ? SAS_ALIGNMENT_OFFSET_4 : SAS_ALIGNMENT_OFFSET_0, .a1 = SAS_ALIGNMENT_OFFSET_0, .endian = machine_is_little_endian() ? 
SAS_ENDIAN_LITTLE : SAS_ENDIAN_BIG, .file_format = SAS_FILE_FORMAT_UNIX, .encoding = 20, /* UTF-8 */ .file_type = "SAS FILE", .file_info = "DATA ~ ~" }; memcpy(&header_start.magic, sas7bdat_magic_number, sizeof(header_start.magic)); memset(header_start.file_label, ' ', sizeof(header_start.file_label)); size_t file_label_len = strlen(writer->file_label); if (file_label_len > sizeof(header_start.file_label)) file_label_len = sizeof(header_start.file_label); if (file_label_len) { memcpy(header_start.file_label, writer->file_label, file_label_len); } else { memcpy(header_start.file_label, "DATASET", sizeof("DATASET")-1); } return sas_write_header(writer, hinfo, header_start); } static sas7bdat_subheader_t *sas7bdat_subheader_init(uint32_t signature, size_t len) { sas7bdat_subheader_t *subheader = calloc(1, sizeof(sas7bdat_subheader_t)); subheader->signature = signature; subheader->len = len; subheader->data = calloc(1, len); return subheader; } static sas7bdat_subheader_t *sas7bdat_row_size_subheader_init(readstat_writer_t *writer, sas_header_info_t *hinfo) { sas7bdat_subheader_t *subheader = sas7bdat_subheader_init( SAS_SUBHEADER_SIGNATURE_ROW_SIZE, hinfo->u64 ? 
128 : 64); if (hinfo->u64) { int64_t row_length = sas7bdat_row_length(writer); int64_t row_count = writer->row_count; int64_t page_size = hinfo->page_size; memcpy(&subheader->data[40], &row_length, sizeof(int64_t)); memcpy(&subheader->data[48], &row_count, sizeof(int64_t)); memcpy(&subheader->data[104], &page_size, sizeof(int64_t)); /* memset(&subheader->data[128], 0xFF, 16); */ } else { int32_t row_length = sas7bdat_row_length(writer); int32_t row_count = writer->row_count; int32_t page_size = hinfo->page_size; memcpy(&subheader->data[20], &row_length, sizeof(int32_t)); memcpy(&subheader->data[24], &row_count, sizeof(int32_t)); memcpy(&subheader->data[52], &page_size, sizeof(int32_t)); /* memset(&subheader->data[64], 0xFF, 8); */ } return subheader; } static sas7bdat_subheader_t *sas7bdat_col_size_subheader_init(readstat_writer_t *writer, sas_header_info_t *hinfo) { sas7bdat_subheader_t *subheader = sas7bdat_subheader_init( SAS_SUBHEADER_SIGNATURE_COLUMN_SIZE, hinfo->u64 ? 24 : 12); if (hinfo->u64) { int64_t col_count = writer->variables_count; memcpy(&subheader->data[8], &col_count, sizeof(int64_t)); } else { int32_t col_count = writer->variables_count; memcpy(&subheader->data[4], &col_count, sizeof(int32_t)); } return subheader; } static size_t sas7bdat_col_name_subheader_length(readstat_writer_t *writer, sas_header_info_t *hinfo) { return (hinfo->u64 ? 28+8*writer->variables_count : 20+8*writer->variables_count); } static sas7bdat_subheader_t *sas7bdat_col_name_subheader_init(readstat_writer_t *writer, sas_header_info_t *hinfo, sas7bdat_column_text_array_t *column_text_array) { size_t len = sas7bdat_col_name_subheader_length(writer, hinfo); size_t signature_len = hinfo->u64 ? 
8 : 4; uint16_t remainder = len - (4+2*signature_len); sas7bdat_subheader_t *subheader = sas7bdat_subheader_init( SAS_SUBHEADER_SIGNATURE_COLUMN_NAME, len); memcpy(&subheader->data[signature_len], &remainder, sizeof(uint16_t)); int i; char *ptrs = &subheader->data[signature_len+8]; for (i=0; i < writer->variables_count; i++) { readstat_variable_t *variable = readstat_get_variable(writer, i); const char *name = readstat_variable_get_name(variable); sas_text_ref_t text_ref = sas7bdat_make_text_ref(column_text_array, name); memcpy(&ptrs[0], &text_ref.index, sizeof(uint16_t)); memcpy(&ptrs[2], &text_ref.offset, sizeof(uint16_t)); memcpy(&ptrs[4], &text_ref.length, sizeof(uint16_t)); ptrs += 8; } return subheader; } static size_t sas7bdat_col_attrs_subheader_length(readstat_writer_t *writer, sas_header_info_t *hinfo) { return (hinfo->u64 ? 28+16*writer->variables_count : 20+12*writer->variables_count); } static sas7bdat_subheader_t *sas7bdat_col_attrs_subheader_init(readstat_writer_t *writer, sas_header_info_t *hinfo) { size_t len = sas7bdat_col_attrs_subheader_length(writer, hinfo); size_t signature_len = hinfo->u64 ? 8 : 4; uint16_t remainder = len - (4+2*signature_len); sas7bdat_subheader_t *subheader = sas7bdat_subheader_init( SAS_SUBHEADER_SIGNATURE_COLUMN_ATTRS, len); memcpy(&subheader->data[signature_len], &remainder, sizeof(uint16_t)); char *ptrs = &subheader->data[signature_len+8]; uint64_t offset = 0; int i; for (i=0; i < writer->variables_count; i++) { readstat_variable_t *variable = readstat_get_variable(writer, i); const char *name = readstat_variable_get_name(variable); readstat_type_t type = readstat_variable_get_type(variable); uint16_t name_length_flag = strlen(name) <= 8 ? 
4 : 2048; uint32_t width = 0; if (hinfo->u64) { memcpy(&ptrs[0], &offset, sizeof(uint64_t)); ptrs += sizeof(uint64_t); } else { uint32_t offset32 = offset; memcpy(&ptrs[0], &offset32, sizeof(uint32_t)); ptrs += sizeof(uint32_t); } if (type == READSTAT_TYPE_STRING) { ptrs[6] = SAS_COLUMN_TYPE_CHR; width = readstat_variable_get_storage_width(variable); } else { ptrs[6] = SAS_COLUMN_TYPE_NUM; width = 8; } memcpy(&ptrs[0], &width, sizeof(uint32_t)); memcpy(&ptrs[4], &name_length_flag, sizeof(uint16_t)); offset += width; ptrs += 8; } return subheader; } static sas7bdat_subheader_t *sas7bdat_col_format_subheader_init(readstat_variable_t *variable, sas_header_info_t *hinfo, sas7bdat_column_text_array_t *column_text_array) { sas7bdat_subheader_t *subheader = sas7bdat_subheader_init( SAS_SUBHEADER_SIGNATURE_COLUMN_FORMAT, hinfo->u64 ? 64 : 52); const char *format = readstat_variable_get_format(variable); const char *label = readstat_variable_get_label(variable); off_t format_offset = hinfo->u64 ? 46 : 34; off_t label_offset = hinfo->u64 ? 52 : 40; if (format) { sas_text_ref_t text_ref = sas7bdat_make_text_ref(column_text_array, format); memcpy(&subheader->data[format_offset+0], &text_ref.index, sizeof(uint16_t)); memcpy(&subheader->data[format_offset+2], &text_ref.offset, sizeof(uint16_t)); memcpy(&subheader->data[format_offset+4], &text_ref.length, sizeof(uint16_t)); } if (label) { sas_text_ref_t text_ref = sas7bdat_make_text_ref(column_text_array, label); memcpy(&subheader->data[label_offset+0], &text_ref.index, sizeof(uint16_t)); memcpy(&subheader->data[label_offset+2], &text_ref.offset, sizeof(uint16_t)); memcpy(&subheader->data[label_offset+4], &text_ref.length, sizeof(uint16_t)); } return subheader; } static sas7bdat_subheader_t *sas7bdat_col_text_subheader_init(readstat_writer_t *writer, sas_header_info_t *hinfo, sas7bdat_column_text_t *column_text) { size_t signature_len = hinfo->u64 ? 
8 : 4; size_t len = signature_len + 28 + column_text->used; sas7bdat_subheader_t *subheader = sas7bdat_subheader_init( SAS_SUBHEADER_SIGNATURE_COLUMN_TEXT, len); uint16_t used = len - (4+2*signature_len); memcpy(&subheader->data[signature_len], &used, sizeof(uint16_t)); memset(&subheader->data[signature_len+12], ' ', 8); memcpy(&subheader->data[signature_len+28], column_text->data, column_text->used); return subheader; } static sas7bdat_subheader_array_t *sas7bdat_subheader_array_init(readstat_writer_t *writer, sas_header_info_t *hinfo) { sas7bdat_subheader_t *row_size_subheader = NULL; sas7bdat_subheader_t *col_size_subheader = NULL; sas7bdat_subheader_t *col_name_subheader = NULL; sas7bdat_subheader_t *col_attrs_subheader = NULL; sas7bdat_column_text_array_t *column_text_array = calloc(1, sizeof(sas7bdat_column_text_array_t)); column_text_array->count = 1; column_text_array->column_texts = malloc(sizeof(sas7bdat_column_text_t *)); column_text_array->column_texts[0] = sas7bdat_column_text_init(0, hinfo->page_size - hinfo->page_header_size - hinfo->subheader_pointer_size); row_size_subheader = sas7bdat_row_size_subheader_init(writer, hinfo); col_size_subheader = sas7bdat_col_size_subheader_init(writer, hinfo); col_name_subheader = sas7bdat_col_name_subheader_init(writer, hinfo, column_text_array); col_attrs_subheader = sas7bdat_col_attrs_subheader_init(writer, hinfo); sas7bdat_subheader_array_t *sarray = calloc(1, sizeof(sas7bdat_subheader_array_t)); sarray->count = 4+writer->variables_count; sarray->subheaders = calloc(sarray->count, sizeof(sas7bdat_subheader_t *)); long idx = 0; sarray->subheaders[idx++] = row_size_subheader; sarray->subheaders[idx++] = col_size_subheader; sarray->subheaders[idx++] = col_name_subheader; sarray->subheaders[idx++] = col_attrs_subheader; int i; for (i=0; i < writer->variables_count; i++) { readstat_variable_t *variable = readstat_get_variable(writer, i); sarray->subheaders[idx++] = sas7bdat_col_format_subheader_init(variable, hinfo, 
column_text_array); } sarray->count += column_text_array->count; sarray->subheaders = realloc(sarray->subheaders, sarray->count * sizeof(sas7bdat_subheader_t *)); for (i=0; i < column_text_array->count; i++) { sarray->subheaders[idx++] = sas7bdat_col_text_subheader_init(writer, hinfo, column_text_array->column_texts[i]); } sas7bdat_column_text_array_free(column_text_array); sarray->capacity = sarray->count; if (writer->compression == READSTAT_COMPRESS_ROWS) { sarray->capacity = (sarray->count + writer->row_count); sarray->subheaders = realloc(sarray->subheaders, sarray->capacity * sizeof(sas7bdat_subheader_t *)); } return sarray; } static void sas7bdat_subheader_free(sas7bdat_subheader_t *subheader) { if (!subheader) return; if (subheader->data) free(subheader->data); free(subheader); } static void sas7bdat_subheader_array_free(sas7bdat_subheader_array_t *sarray) { int i; for (i=0; i < sarray->count; i++) { sas7bdat_subheader_free(sarray->subheaders[i]); } free(sarray->subheaders); free(sarray); } static int sas7bdat_subheader_type(uint32_t signature) { return (signature == SAS_SUBHEADER_SIGNATURE_COLUMN_TEXT || signature == SAS_SUBHEADER_SIGNATURE_COLUMN_NAME || signature == SAS_SUBHEADER_SIGNATURE_COLUMN_ATTRS || signature == SAS_SUBHEADER_SIGNATURE_COLUMN_LIST); } static readstat_error_t sas7bdat_emit_meta_pages(readstat_writer_t *writer) { sas7bdat_write_ctx_t *ctx = (sas7bdat_write_ctx_t *)writer->module_ctx; sas_header_info_t *hinfo = ctx->hinfo; sas7bdat_subheader_array_t *sarray = ctx->sarray; readstat_error_t retval = READSTAT_OK; int16_t page_type = SAS_PAGE_TYPE_META; char *page = malloc(hinfo->page_size); int64_t shp_written = 0; while (sarray->count > shp_written) { memset(page, 0, hinfo->page_size); int16_t shp_count = 0; size_t shp_data_offset = hinfo->page_size; size_t shp_ptr_offset = hinfo->page_header_size; size_t shp_ptr_size = hinfo->subheader_pointer_size; memcpy(&page[hinfo->page_header_size-8], &page_type, sizeof(int16_t)); if (sarray->subheaders[shp_written]->len + 
shp_ptr_size > shp_data_offset - shp_ptr_offset) { retval = READSTAT_ERROR_ROW_IS_TOO_WIDE_FOR_PAGE; goto cleanup; } while (sarray->count > shp_written && sarray->subheaders[shp_written]->len + shp_ptr_size <= shp_data_offset - shp_ptr_offset) { sas7bdat_subheader_t *subheader = sarray->subheaders[shp_written]; uint32_t signature32 = subheader->signature; /* copy ptr */ if (hinfo->u64) { uint64_t offset = shp_data_offset - subheader->len; uint64_t len = subheader->len; memcpy(&page[shp_ptr_offset], &offset, sizeof(uint64_t)); memcpy(&page[shp_ptr_offset+8], &len, sizeof(uint64_t)); if (subheader->is_row_data) { if (subheader->is_row_data_compressed) { page[shp_ptr_offset+16] = SAS_COMPRESSION_ROW; } else { page[shp_ptr_offset+16] = SAS_COMPRESSION_NONE; } page[shp_ptr_offset+17] = 1; } else { page[shp_ptr_offset+17] = sas7bdat_subheader_type(subheader->signature); if (signature32 >= 0xFF000000) { int64_t signature64 = (int32_t)signature32; memcpy(&subheader->data[0], &signature64, sizeof(int64_t)); } else { memcpy(&subheader->data[0], &signature32, sizeof(int32_t)); } } } else { uint32_t offset = shp_data_offset - subheader->len; uint32_t len = subheader->len; memcpy(&page[shp_ptr_offset], &offset, sizeof(uint32_t)); memcpy(&page[shp_ptr_offset+4], &len, sizeof(uint32_t)); if (subheader->is_row_data) { if (subheader->is_row_data_compressed) { page[shp_ptr_offset+8] = SAS_COMPRESSION_ROW; } else { page[shp_ptr_offset+8] = SAS_COMPRESSION_NONE; } page[shp_ptr_offset+9] = 1; } else { page[shp_ptr_offset+9] = sas7bdat_subheader_type(subheader->signature); memcpy(&subheader->data[0], &signature32, sizeof(int32_t)); } } shp_ptr_offset += shp_ptr_size; /* copy data */ shp_data_offset -= subheader->len; memcpy(&page[shp_data_offset], subheader->data, subheader->len); shp_written++; shp_count++; } if (hinfo->u64) { memcpy(&page[34], &shp_count, sizeof(int16_t)); memcpy(&page[36], &shp_count, sizeof(int16_t)); } else { memcpy(&page[18], &shp_count, sizeof(int16_t)); 
memcpy(&page[20], &shp_count, sizeof(int16_t)); } retval = readstat_write_bytes(writer, page, hinfo->page_size); if (retval != READSTAT_OK) goto cleanup; } cleanup: free(page); return retval; } static int sas7bdat_page_is_too_small(readstat_writer_t *writer, sas_header_info_t *hinfo, size_t row_length) { size_t page_length = hinfo->page_size - hinfo->page_header_size; if (writer->compression == READSTAT_COMPRESS_NONE && page_length < row_length) return 1; if (writer->compression == READSTAT_COMPRESS_ROWS && page_length < row_length + hinfo->subheader_pointer_size) return 1; if (page_length < sas7bdat_col_name_subheader_length(writer, hinfo) + hinfo->subheader_pointer_size) return 1; if (page_length < sas7bdat_col_attrs_subheader_length(writer, hinfo) + hinfo->subheader_pointer_size) return 1; return 0; } static sas7bdat_write_ctx_t *sas7bdat_write_ctx_init(readstat_writer_t *writer) { sas7bdat_write_ctx_t *ctx = calloc(1, sizeof(sas7bdat_write_ctx_t)); sas_header_info_t *hinfo = sas_header_info_init(writer, writer->is_64bit); size_t row_length = sas7bdat_row_length(writer); while (sas7bdat_page_is_too_small(writer, hinfo, row_length)) { hinfo->page_size <<= 1; } ctx->hinfo = hinfo; ctx->sarray = sas7bdat_subheader_array_init(writer, hinfo); return ctx; } static void sas7bdat_write_ctx_free(sas7bdat_write_ctx_t *ctx) { free(ctx->hinfo); sas7bdat_subheader_array_free(ctx->sarray); free(ctx); } static readstat_error_t sas7bdat_emit_header_and_meta_pages(readstat_writer_t *writer) { sas7bdat_write_ctx_t *ctx = (sas7bdat_write_ctx_t *)writer->module_ctx; readstat_error_t retval = READSTAT_OK; if (sas7bdat_row_length(writer) == 0) { retval = READSTAT_ERROR_ROW_IS_EMPTY; goto cleanup; } if (writer->compression == READSTAT_COMPRESS_NONE && sas7bdat_rows_per_page(writer, ctx->hinfo) == 0) { retval = READSTAT_ERROR_ROW_IS_TOO_WIDE_FOR_PAGE; goto cleanup; } ctx->hinfo->page_count = sas7bdat_count_meta_pages(writer) + sas7bdat_count_data_pages(writer, ctx->hinfo); retval = 
sas7bdat_emit_header(writer, ctx->hinfo); if (retval != READSTAT_OK) goto cleanup; retval = sas7bdat_emit_meta_pages(writer); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t sas7bdat_begin_data(void *writer_ctx) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; readstat_error_t retval = READSTAT_OK; writer->module_ctx = sas7bdat_write_ctx_init(writer); if (writer->compression == READSTAT_COMPRESS_NONE) { retval = sas7bdat_emit_header_and_meta_pages(writer); if (retval != READSTAT_OK) goto cleanup; } cleanup: if (retval != READSTAT_OK) { if (writer->module_ctx) { sas7bdat_write_ctx_free(writer->module_ctx); writer->module_ctx = NULL; } } return retval; } static readstat_error_t sas7bdat_end_data(void *writer_ctx) { readstat_error_t retval = READSTAT_OK; readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; sas7bdat_write_ctx_t *ctx = (sas7bdat_write_ctx_t *)writer->module_ctx; if (writer->compression == READSTAT_COMPRESS_ROWS) { retval = sas7bdat_emit_header_and_meta_pages(writer); } else { retval = sas_fill_page(writer, ctx->hinfo); } return retval; } static void sas7bdat_module_ctx_free(void *module_ctx) { sas7bdat_write_ctx_free(module_ctx); } static readstat_error_t sas7bdat_write_double(void *row, const readstat_variable_t *var, double value) { memcpy(row, &value, sizeof(double)); return READSTAT_OK; } static readstat_error_t sas7bdat_write_float(void *row, const readstat_variable_t *var, float value) { return sas7bdat_write_double(row, var, value); } static readstat_error_t sas7bdat_write_int32(void *row, const readstat_variable_t *var, int32_t value) { return sas7bdat_write_double(row, var, value); } static readstat_error_t sas7bdat_write_int16(void *row, const readstat_variable_t *var, int16_t value) { return sas7bdat_write_double(row, var, value); } static readstat_error_t sas7bdat_write_int8(void *row, const readstat_variable_t *var, int8_t value) { return sas7bdat_write_double(row, var, 
value); } static readstat_error_t sas7bdat_write_missing_tagged_raw(void *row, const readstat_variable_t *var, char tag) { union { double dval; char chars[8]; } nan_value; nan_value.dval = NAN; nan_value.chars[5] = ~tag; return sas7bdat_write_double(row, var, nan_value.dval); } static readstat_error_t sas7bdat_write_missing_tagged(void *row, const readstat_variable_t *var, char tag) { if (tag == '_' || (tag >= 'A' && tag <= 'Z')) return sas7bdat_write_missing_tagged_raw(row, var, tag); return READSTAT_ERROR_TAGGED_VALUE_IS_OUT_OF_RANGE; } static readstat_error_t sas7bdat_write_missing_numeric(void *row, const readstat_variable_t *var) { return sas7bdat_write_missing_tagged_raw(row, var, 0); } static readstat_error_t sas7bdat_write_string(void *row, const readstat_variable_t *var, const char *value) { size_t max_len = readstat_variable_get_storage_width(var); if (value == NULL || value[0] == '\0') { memset(row, '\0', max_len); } else { size_t value_len = strlen(value); if (value_len > max_len) return READSTAT_ERROR_STRING_VALUE_IS_TOO_LONG; strncpy((char *)row, value, max_len); } return READSTAT_OK; } static readstat_error_t sas7bdat_write_missing_string(void *row, const readstat_variable_t *var) { return sas7bdat_write_string(row, var, NULL); } static size_t sas7bdat_variable_width(readstat_type_t type, size_t user_width) { if (type == READSTAT_TYPE_STRING) { return user_width; } return 8; } static readstat_error_t sas7bdat_write_row_uncompressed(readstat_writer_t *writer, sas7bdat_write_ctx_t *ctx, void *bytes, size_t len) { readstat_error_t retval = READSTAT_OK; sas_header_info_t *hinfo = ctx->hinfo; int32_t rows_per_page = sas7bdat_rows_per_page(writer, hinfo); if (writer->current_row % rows_per_page == 0) { retval = sas_fill_page(writer, ctx->hinfo); if (retval != READSTAT_OK) goto cleanup; int16_t page_type = SAS_PAGE_TYPE_DATA; int16_t page_row_count = (writer->row_count - writer->current_row < rows_per_page ? 
writer->row_count - writer->current_row : rows_per_page); char header[hinfo->page_header_size]; memset(header, 0, sizeof(header)); memcpy(&header[hinfo->page_header_size-6], &page_row_count, sizeof(int16_t)); memcpy(&header[hinfo->page_header_size-8], &page_type, sizeof(int16_t)); retval = readstat_write_bytes(writer, header, hinfo->page_header_size); if (retval != READSTAT_OK) goto cleanup; } retval = readstat_write_bytes(writer, bytes, len); cleanup: return retval; } /* We don't actually write compressed data out at this point; the file header * requires a page count, so instead we collect the compressed subheaders in * memory and write the entire file at the end, once the page count can be * determined. */ static readstat_error_t sas7bdat_write_row_compressed(readstat_writer_t *writer, sas7bdat_write_ctx_t *ctx, void *bytes, size_t len) { readstat_error_t retval = READSTAT_OK; size_t compressed_len = sas_rle_compressed_len(bytes, len); sas7bdat_subheader_t *subheader = NULL; if (compressed_len < len) { subheader = sas7bdat_subheader_init(0, compressed_len); subheader->is_row_data = 1; subheader->is_row_data_compressed = 1; size_t actual_len = sas_rle_compress(subheader->data, subheader->len, bytes, len); if (actual_len != compressed_len) { retval = READSTAT_ERROR_ROW_WIDTH_MISMATCH; goto cleanup; } } else { subheader = sas7bdat_subheader_init(0, len); subheader->is_row_data = 1; memcpy(subheader->data, bytes, len); } ctx->sarray->subheaders[ctx->sarray->count++] = subheader; cleanup: if (retval != READSTAT_OK) sas7bdat_subheader_free(subheader); return retval; } static readstat_error_t sas7bdat_write_row(void *writer_ctx, void *bytes, size_t len) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; sas7bdat_write_ctx_t *ctx = (sas7bdat_write_ctx_t *)writer->module_ctx; readstat_error_t retval = READSTAT_OK; if (writer->compression == READSTAT_COMPRESS_NONE) { retval = sas7bdat_write_row_uncompressed(writer, ctx, bytes, len); } else if 
(writer->compression == READSTAT_COMPRESS_ROWS) { retval = sas7bdat_write_row_compressed(writer, ctx, bytes, len); } return retval; } readstat_error_t readstat_begin_writing_sas7bdat(readstat_writer_t *writer, void *user_ctx, long row_count) { if (writer->compression != READSTAT_COMPRESS_NONE && writer->compression != READSTAT_COMPRESS_ROWS) return READSTAT_ERROR_UNSUPPORTED_COMPRESSION; if (writer->version == 0) writer->version = SAS_DEFAULT_FILE_VERSION; writer->callbacks.write_int8 = &sas7bdat_write_int8; writer->callbacks.write_int16 = &sas7bdat_write_int16; writer->callbacks.write_int32 = &sas7bdat_write_int32; writer->callbacks.write_float = &sas7bdat_write_float; writer->callbacks.write_double = &sas7bdat_write_double; writer->callbacks.write_string = &sas7bdat_write_string; writer->callbacks.write_missing_string = &sas7bdat_write_missing_string; writer->callbacks.write_missing_number = &sas7bdat_write_missing_numeric; writer->callbacks.write_missing_tagged = &sas7bdat_write_missing_tagged; writer->callbacks.variable_width = &sas7bdat_variable_width; writer->callbacks.variable_ok = &sas_validate_variable; writer->callbacks.begin_data = &sas7bdat_begin_data; writer->callbacks.end_data = &sas7bdat_end_data; writer->callbacks.module_ctx_free = &sas7bdat_module_ctx_free; writer->callbacks.write_row = &sas7bdat_write_row; return readstat_begin_writing_file(writer, user_ctx, row_count); } haven/src/readstat/sas/readstat_xport.h0000644000176200001440000000155613227731765017777 0ustar liggesusers typedef struct xport_header_record_s { char name[9]; int num1; int num2; int num3; int num4; int num5; int num6; } xport_header_record_t; extern char _xport_months[12][4]; #pragma pack(push, 1) typedef struct xport_namestr_s { uint16_t ntype; uint16_t nhfun; uint16_t nlng; uint16_t nvar0; char nname[8]; char nlabel[40]; char nform[8]; uint16_t nfl; uint16_t nfd; uint16_t nfj; char nfill[2]; char niform[8]; uint16_t nifl; uint16_t nifd; uint32_t npos; char longname[32]; 
uint16_t labeln; char rest[18]; } xport_namestr_t; #pragma pack(pop) #define XPORT_MIN_DOUBLE_SIZE 3 #define XPORT_MAX_DOUBLE_SIZE 8 void xport_namestr_bswap(xport_namestr_t *namestr); haven/src/readstat/sas/readstat_sas.h0000644000176200001440000000710413227731765017404 0ustar liggesusers #include "../readstat.h" #include "../readstat_bits.h" #pragma pack(push, 1) typedef struct sas_header_start_s { unsigned char magic[32]; unsigned char a2; unsigned char mystery1[2]; unsigned char a1; unsigned char mystery2[1]; unsigned char endian; unsigned char mystery3[1]; char file_format; unsigned char mystery4[30]; unsigned char encoding; unsigned char mystery5[13]; char file_type[8]; char file_label[64]; char file_info[8]; } sas_header_start_t; typedef struct sas_header_end_s { char release[8]; char host[16]; char version[16]; char os_vendor[16]; char os_name[16]; char extra[48]; } sas_header_end_t; #pragma pack(pop) typedef struct sas_header_info_s { int little_endian; int u64; int vendor; int major_version; int minor_version; int revision; int pad1; int64_t page_size; int64_t page_header_size; int64_t subheader_pointer_size; int64_t page_count; int64_t header_size; time_t creation_time; time_t modification_time; char file_label[64]; char *encoding; } sas_header_info_t; enum { READSTAT_VENDOR_STAT_TRANSFER, READSTAT_VENDOR_SAS }; typedef struct sas_text_ref_s { uint16_t index; uint16_t offset; uint16_t length; } sas_text_ref_t; #define SAS_ENDIAN_BIG 0x00 #define SAS_ENDIAN_LITTLE 0x01 #define SAS_FILE_FORMAT_UNIX '1' #define SAS_FILE_FORMAT_WINDOWS '2' #define SAS_ALIGNMENT_OFFSET_0 0x22 #define SAS_ALIGNMENT_OFFSET_4 0x33 #define SAS_COLUMN_TYPE_NUM 0x01 #define SAS_COLUMN_TYPE_CHR 0x02 #define SAS_SUBHEADER_SIGNATURE_ROW_SIZE 0xF7F7F7F7 #define SAS_SUBHEADER_SIGNATURE_COLUMN_SIZE 0xF6F6F6F6 #define SAS_SUBHEADER_SIGNATURE_COUNTS 0xFFFFFC00 #define SAS_SUBHEADER_SIGNATURE_COLUMN_FORMAT 0xFFFFFBFE #define SAS_SUBHEADER_SIGNATURE_COLUMN_MASK 0xFFFFFFF8 /* Seen in the 
wild: FA (unknown), F8 (locale?) */ #define SAS_SUBHEADER_SIGNATURE_COLUMN_ATTRS 0xFFFFFFFC #define SAS_SUBHEADER_SIGNATURE_COLUMN_TEXT 0xFFFFFFFD #define SAS_SUBHEADER_SIGNATURE_COLUMN_LIST 0xFFFFFFFE #define SAS_SUBHEADER_SIGNATURE_COLUMN_NAME 0xFFFFFFFF #define SAS_PAGE_TYPE_META 0x0000 #define SAS_PAGE_TYPE_DATA 0x0100 #define SAS_PAGE_TYPE_MIX 0x0200 #define SAS_PAGE_TYPE_AMD 0x0400 #define SAS_PAGE_TYPE_MASK 0x0F00 #define SAS_PAGE_TYPE_META2 0x4000 #define SAS_PAGE_TYPE_COMP 0x9000 #define SAS_SUBHEADER_POINTER_SIZE_32BIT 12 #define SAS_SUBHEADER_POINTER_SIZE_64BIT 24 #define SAS_PAGE_HEADER_SIZE_32BIT 24 #define SAS_PAGE_HEADER_SIZE_64BIT 40 #define SAS_COMPRESSION_NONE 0x00 #define SAS_COMPRESSION_TRUNC 0x01 #define SAS_COMPRESSION_ROW 0x04 #define SAS_DEFAULT_FILE_VERSION 90101 extern unsigned char sas7bdat_magic_number[32]; extern unsigned char sas7bcat_magic_number[32]; uint64_t sas_read8(const char *data, int bswap); uint32_t sas_read4(const char *data, int bswap); uint16_t sas_read2(const char *data, int bswap); readstat_error_t sas_read_header(readstat_io_t *io, sas_header_info_t *ctx, readstat_error_handler error_handler, void *user_ctx); sas_header_info_t *sas_header_info_init(readstat_writer_t *writer, int is_64bit); readstat_error_t sas_write_header(readstat_writer_t *writer, sas_header_info_t *hinfo, sas_header_start_t header_start); readstat_error_t sas_fill_page(readstat_writer_t *writer, sas_header_info_t *hinfo); readstat_error_t sas_validate_variable(readstat_variable_t *variable); haven/src/readstat/sas/readstat_xport_write.c0000644000176200001440000004127513227731765021206 0ustar liggesusers #include <stdio.h> #include <string.h> #include "../readstat.h" #include "../readstat_writer.h" #include "readstat_sas.h" #include "readstat_xport.h" #include "ieee.h" #define XPORT_DEFAULT_VERISON 8 #define RECORD_LEN 80 static void copypad(char *dst, size_t dst_len, const char *src) { strncpy(dst, src, dst_len); if (strlen(src) < dst_len) 
memset(&dst[strlen(src)], ' ', dst_len-strlen(src)); } static readstat_error_t xport_write_bytes(readstat_writer_t *writer, const void *bytes, size_t len) { return readstat_write_bytes_as_lines(writer, bytes, len, RECORD_LEN, ""); } static readstat_error_t xport_finish_record(readstat_writer_t *writer) { return readstat_write_line_padding(writer, ' ', RECORD_LEN, ""); } static readstat_error_t xport_write_record(readstat_writer_t *writer, const char *record) { size_t len = strlen(record); readstat_error_t retval = READSTAT_OK; retval = xport_write_bytes(writer, record, len); if (retval != READSTAT_OK) goto cleanup; retval = xport_finish_record(writer); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t xport_write_header_record_v8(readstat_writer_t *writer, xport_header_record_t *xrecord) { char record[RECORD_LEN+1]; snprintf(record, sizeof(record), "HEADER RECORD*******%-8sHEADER RECORD!!!!!!!%-30d", xrecord->name, xrecord->num1); return xport_write_record(writer, record); } static readstat_error_t xport_write_header_record(readstat_writer_t *writer, xport_header_record_t *xrecord) { char record[RECORD_LEN+1]; snprintf(record, sizeof(record), "HEADER RECORD*******%-8sHEADER RECORD!!!!!!!" 
"%05d%05d%05d" "%05d%05d%05d", xrecord->name, xrecord->num1, xrecord->num2, xrecord->num3, xrecord->num4, xrecord->num5, xrecord->num6); return xport_write_record(writer, record); } static size_t xport_variable_width(readstat_type_t type, size_t user_width) { if (type == READSTAT_TYPE_STRING) return user_width; if (user_width >= XPORT_MAX_DOUBLE_SIZE || user_width == 0) return XPORT_MAX_DOUBLE_SIZE; if (user_width <= XPORT_MIN_DOUBLE_SIZE) return XPORT_MIN_DOUBLE_SIZE; return user_width; } static readstat_error_t xport_write_variables(readstat_writer_t *writer) { readstat_error_t retval = READSTAT_OK; int i; long offset = 0; int num_long_labels = 0; int any_has_long_format = 0; for (i=0; i < writer->variables_count; i++) { int needs_long_record = 0; readstat_variable_t *variable = readstat_get_variable(writer, i); size_t width = xport_variable_width(variable->type, variable->user_width); xport_namestr_t namestr = { .nvar0 = i, .nlng = width, .npos = offset }; if (readstat_variable_get_type_class(variable) == READSTAT_TYPE_CLASS_STRING) { namestr.ntype = SAS_COLUMN_TYPE_CHR; } else { namestr.ntype = SAS_COLUMN_TYPE_NUM; } copypad(namestr.nname, sizeof(namestr.nname), variable->name); copypad(namestr.nlabel, sizeof(namestr.nlabel), variable->label); if (variable->format[0]) { int decimals = 0; int width = 0; char name[24]; sscanf(variable->format, "%s%d.%d", name, &width, &decimals); copypad(namestr.nform, sizeof(namestr.nform), name); namestr.nfl = width; namestr.nfd = decimals; copypad(namestr.niform, sizeof(namestr.niform), name); namestr.nifl = width; namestr.nifd = decimals; if (strlen(name) > 8) { any_has_long_format = 1; needs_long_record = 1; } } namestr.nfj = (variable->alignment == READSTAT_ALIGNMENT_RIGHT); if (writer->version == 8) { copypad(namestr.longname, sizeof(namestr.longname), variable->name); size_t label_len = strlen(variable->label); if (label_len > 40) { needs_long_record = 1; } namestr.labeln = label_len; } if (needs_long_record) { num_long_labels++; } 
        offset += width;

        xport_namestr_bswap(&namestr);

        retval = xport_write_bytes(writer, &namestr, sizeof(xport_namestr_t));
        if (retval != READSTAT_OK)
            goto cleanup;
    }

    retval = xport_finish_record(writer);
    if (retval != READSTAT_OK)
        goto cleanup;

    if (writer->version == 8 && num_long_labels) {
        xport_header_record_t header = {
            .name = "LABELV8",
            .num1 = num_long_labels
        };
        if (any_has_long_format) {
            strcpy(header.name, "LABELV9");
        }
        retval = xport_write_header_record_v8(writer, &header);
        if (retval != READSTAT_OK)
            goto cleanup;

        for (i=0; i<writer->variables_count; i++) {
            readstat_variable_t *variable = readstat_get_variable(writer, i);
            size_t label_len = strlen(variable->label);
            size_t name_len = strlen(variable->name);
            int has_long_label = 0;
            int has_long_format = 0;
            int format_len = 0;
            char format_name[24];

            memset(format_name, 0, sizeof(format_name));

            has_long_label = (label_len > 40);

            if (variable->format[0]) {
                int decimals = 2;
                int width = 8;

                int matches = sscanf(variable->format, "%s%d.%d", format_name, &width, &decimals);
                if (matches < 1) {
                    retval = READSTAT_ERROR_BAD_FORMAT_STRING;
                    goto cleanup;
                }
                format_len = strlen(format_name);
                if (format_len > 8) {
                    has_long_format = 1;
                }
            }

            if (has_long_format) {
                uint16_t labeldef[5] = { i, name_len, format_len, format_len, label_len };

                if (machine_is_little_endian()) {
                    labeldef[0] = byteswap2(labeldef[0]);
                    labeldef[1] = byteswap2(labeldef[1]);
                    labeldef[2] = byteswap2(labeldef[2]);
                    labeldef[3] = byteswap2(labeldef[3]);
                    labeldef[4] = byteswap2(labeldef[4]);
                }

                retval = readstat_write_bytes(writer, labeldef, sizeof(labeldef));
                if (retval != READSTAT_OK)
                    goto cleanup;

                retval = readstat_write_string(writer, variable->name);
                if (retval != READSTAT_OK)
                    goto cleanup;

                retval = readstat_write_string(writer, format_name);
                if (retval != READSTAT_OK)
                    goto cleanup;

                retval = readstat_write_string(writer, format_name);
                if (retval != READSTAT_OK)
                    goto cleanup;

                retval = readstat_write_string(writer, variable->label);
                if (retval != READSTAT_OK)
                    goto cleanup;
} else if (has_long_label) { uint16_t labeldef[3] = { i, name_len, label_len }; if (machine_is_little_endian()) { labeldef[0] = byteswap2(labeldef[0]); labeldef[1] = byteswap2(labeldef[1]); labeldef[2] = byteswap2(labeldef[2]); } retval = readstat_write_bytes(writer, labeldef, sizeof(labeldef)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_string(writer, variable->name); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_string(writer, variable->label); if (retval != READSTAT_OK) goto cleanup; } } retval = xport_finish_record(writer); if (retval != READSTAT_OK) goto cleanup; } cleanup: return retval; } static readstat_error_t xport_write_first_header_record(readstat_writer_t *writer) { xport_header_record_t xrecord = { .name = "LIBRARY" }; if (writer->version == 8) { strcpy(xrecord.name, "LIBV8"); } return xport_write_header_record(writer, &xrecord); } static readstat_error_t xport_write_first_real_header_record(readstat_writer_t *writer, const char *timestamp) { char real_record[RECORD_LEN+1]; snprintf(real_record, sizeof(real_record), "%-8.8s" "%-8.8s" "%-8.8s" "%-8.8s" "%-8.8s" "%-24.24s" "%16.16s", "SAS", "SAS", "SASLIB", "6.06", "bsd4.2", "", timestamp); return xport_write_record(writer, real_record); } static readstat_error_t xport_write_member_header_record(readstat_writer_t *writer) { xport_header_record_t xrecord = { .name = "MEMBER", .num4 = 160, .num6 = 140 }; if (writer->version == 8) { strcpy(xrecord.name, "MEMBV8"); } return xport_write_header_record(writer, &xrecord); } static readstat_error_t xport_write_descriptor_header_record(readstat_writer_t *writer) { xport_header_record_t xrecord = { .name = "DSCRPTR" }; if (writer->version == 8) { strcpy(xrecord.name, "DSCPTV8"); } return xport_write_header_record(writer, &xrecord); } static readstat_error_t xport_write_member_record_v8(readstat_writer_t *writer, char *timestamp) { char member_header[RECORD_LEN+1]; snprintf(member_header, sizeof(member_header), "%-8.8s" 
            "%-32.32s"
            "%-8.8s" "%-8.8s" "%-8.8s"
            "%16.16s",
            "SAS", "DATASET", "SASDATA", "6.06", "bsd4.2", timestamp);
    return xport_write_record(writer, member_header);
}

static readstat_error_t xport_write_member_record(readstat_writer_t *writer,
        char *timestamp) {
    if (writer->version == 8)
        return xport_write_member_record_v8(writer, timestamp);

    char member_header[RECORD_LEN+1];
    snprintf(member_header, sizeof(member_header),
            "%-8.8s" "%-8.8s" "%-8.8s" "%-8.8s" "%-8.8s" "%-24.24s" "%16.16s",
            "SAS", "DATASET", "SASDATA", "6.06", "bsd4.2", "", timestamp);
    return xport_write_record(writer, member_header);
}

static readstat_error_t xport_write_file_label_record(readstat_writer_t *writer,
        char *timestamp) {
    char member_header[RECORD_LEN+1];
    snprintf(member_header, sizeof(member_header),
            "%16.16s" "%16.16s" "%-40.40s" "%-8.8s",
            timestamp, "", writer->file_label, "" /* dstype? */);
    return xport_write_record(writer, member_header);
}

static readstat_error_t xport_write_namestr_header_record(readstat_writer_t *writer) {
    xport_header_record_t xrecord = {
        .name = "NAMESTR",
        .num2 = writer->variables_count
    };
    if (writer->version == 8) {
        strcpy(xrecord.name, "NAMSTV8");
    }
    return xport_write_header_record(writer, &xrecord);
}

static readstat_error_t xport_write_obs_header_record(readstat_writer_t *writer) {
    xport_header_record_t xrecord = {
        .name = "OBS"
    };
    if (writer->version == 8) {
        strcpy(xrecord.name, "OBSV8");
    }
    return xport_write_header_record(writer, &xrecord);
}

static void xport_format_timestamp(char *output, size_t output_len, time_t timestamp) {
    struct tm *ts = localtime(&timestamp);

    snprintf(output, output_len,
            "%02d%3.3s%02d:%02d:%02d:%02d",
            (unsigned int)ts->tm_mday % 100,
            _xport_months[ts->tm_mon],
            (unsigned int)ts->tm_year % 100,
            (unsigned int)ts->tm_hour % 100,
            (unsigned int)ts->tm_min % 100,
            (unsigned int)ts->tm_sec % 100
            );
}

static readstat_error_t xport_begin_data(void *writer_ctx) {
    readstat_writer_t *writer = (readstat_writer_t *)writer_ctx;
    readstat_error_t retval =
READSTAT_OK; char timestamp[17]; xport_format_timestamp(timestamp, sizeof(timestamp), writer->timestamp); retval = xport_write_first_header_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_first_real_header_record(writer, timestamp); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_record(writer, timestamp); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_member_header_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_descriptor_header_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_member_record(writer, timestamp); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_file_label_record(writer, timestamp); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_namestr_header_record(writer); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_variables(writer); if (retval != READSTAT_OK) goto cleanup; retval = xport_write_obs_header_record(writer); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t xport_end_data(void *writer_ctx) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; readstat_error_t retval = READSTAT_OK; retval = xport_finish_record(writer); return retval; } static readstat_error_t xport_write_row(void *writer_ctx, void *row, size_t row_len) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; return xport_write_bytes(writer, row, row_len); } static readstat_error_t xport_write_double(void *row, const readstat_variable_t *var, double value) { char full_value[8]; int rc = cnxptiee(&value, CN_TYPE_NATIVE, full_value, CN_TYPE_XPORT); if (rc) return READSTAT_ERROR_CONVERT; memcpy(row, full_value, var->storage_width); return READSTAT_OK; } static readstat_error_t xport_write_float(void *row, const readstat_variable_t *var, float value) { return xport_write_double(row, var, value); } static readstat_error_t xport_write_int32(void *row, const 
readstat_variable_t *var, int32_t value) { return xport_write_double(row, var, value); } static readstat_error_t xport_write_int16(void *row, const readstat_variable_t *var, int16_t value) { return xport_write_double(row, var, value); } static readstat_error_t xport_write_int8(void *row, const readstat_variable_t *var, int8_t value) { return xport_write_double(row, var, value); } static readstat_error_t xport_write_string(void *row, const readstat_variable_t *var, const char *string) { memset(row, ' ', var->storage_width); if (string != NULL && string[0]) { size_t value_len = strlen(string); if (value_len > var->storage_width) return READSTAT_ERROR_STRING_VALUE_IS_TOO_LONG; memcpy(row, string, value_len); } return READSTAT_OK; } static readstat_error_t xport_write_missing_numeric(void *row, const readstat_variable_t *var) { char *row_bytes = (char *)row; row_bytes[0] = 0x2e; return READSTAT_OK; } static readstat_error_t xport_write_missing_string(void *row, const readstat_variable_t *var) { return xport_write_string(row, var, NULL); } static readstat_error_t xport_write_missing_tagged(void *row, const readstat_variable_t *var, char tag) { char *row_bytes = (char *)row; if (tag == '_' || (tag >= 'A' && tag <= 'Z')) { row_bytes[0] = tag; return READSTAT_OK; } return READSTAT_ERROR_TAGGED_VALUE_IS_OUT_OF_RANGE; } readstat_error_t readstat_begin_writing_xport(readstat_writer_t *writer, void *user_ctx, long row_count) { if (writer->version == 0) writer->version = XPORT_DEFAULT_VERISON; if (writer->version != 5 && writer->version != 8) return READSTAT_ERROR_UNSUPPORTED_FILE_FORMAT_VERSION; writer->callbacks.write_int8 = &xport_write_int8; writer->callbacks.write_int16 = &xport_write_int16; writer->callbacks.write_int32 = &xport_write_int32; writer->callbacks.write_float = &xport_write_float; writer->callbacks.write_double = &xport_write_double; writer->callbacks.write_string = &xport_write_string; writer->callbacks.write_missing_string = &xport_write_missing_string; 
writer->callbacks.write_missing_number = &xport_write_missing_numeric; writer->callbacks.write_missing_tagged = &xport_write_missing_tagged; writer->callbacks.variable_width = &xport_variable_width; writer->callbacks.variable_ok = &sas_validate_variable; writer->callbacks.begin_data = &xport_begin_data; writer->callbacks.end_data = &xport_end_data; writer->callbacks.write_row = &xport_write_row; return readstat_begin_writing_file(writer, user_ctx, row_count); } haven/src/readstat/sas/readstat_xport.c0000644000176200001440000000147413227731765017771 0ustar liggesusers#include #include "readstat_xport.h" #include "../readstat_bits.h" char _xport_months[12][4] = { "JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC" }; void xport_namestr_bswap(xport_namestr_t *namestr) { if (!machine_is_little_endian()) return; namestr->ntype = byteswap2(namestr->ntype); namestr->nhfun = byteswap2(namestr->nhfun); namestr->nlng = byteswap2(namestr->nlng); namestr->nvar0 = byteswap2(namestr->nvar0); namestr->nfl = byteswap2(namestr->nfl); namestr->nfd = byteswap2(namestr->nfd); namestr->nfj = byteswap2(namestr->nfj); namestr->nifl = byteswap2(namestr->nifl); namestr->nifd = byteswap2(namestr->nifd); namestr->npos = byteswap4(namestr->npos); namestr->labeln = byteswap2(namestr->labeln); } haven/src/readstat/sas/ieee.c0000644000176200001440000003433713227731765015641 0ustar liggesusers#include #include #include "ieee.h" #include "../readstat_bits.h" /* These routines are modified versions of those found in SAS publication TS-140, * "RECORD LAYOUT OF A SAS VERSION 5 OR 6 DATA SET IN SAS TRANSPORT (XPORT) FORMAT" * https://support.sas.com/techsup/technote/ts140.pdf * * Modifications include using stdint.h and supporting infinite IEEE values. 
*/ static void xpt2ieee(unsigned char *xport, unsigned char *ieee); static void ieee2xpt(unsigned char *ieee, unsigned char *xport); #ifndef FLOATREP #define FLOATREP get_native() int get_native(); #endif void memreverse(void *intp_void, int l) { if (!machine_is_little_endian()) return; int i,j; char save; char *intp = (char *)intp_void; j = l/2; for (i=0;i=0;i--) { temp[7-i] = from[i]; } from = temp; fromtype = CN_TYPE_IEEEB; /* Break intentionally omitted. */ case CN_TYPE_IEEEB : /* Break intentionally omitted. */ case CN_TYPE_XPORT : break; default: return(-1); } if (totype == CN_TYPE_NATIVE) { totype = FLOATREP; } switch(totype) { case CN_TYPE_XPORT : case CN_TYPE_IEEEB : case CN_TYPE_IEEEL : break; default: return(-2); } if (fromtype == totype) { memcpy(to,from,8); return(0); } switch(fromtype) { case CN_TYPE_IEEEB : if (totype == CN_TYPE_XPORT) ieee2xpt(from,to); else memcpy(to,from,8); break; case CN_TYPE_XPORT : xpt2ieee(from,to); break; } if (totype == CN_TYPE_IEEEL) { memcpy(temp,to,8); for (i=7;i>=0;i--) { to[7-i] = temp[i]; } } return(0); } int get_native() { static unsigned char float_reps[][8] = { {0x41,0x10,0x00,0x00,0x00,0x00,0x00,0x00}, {0x3f,0xf0,0x00,0x00,0x00,0x00,0x00,0x00}, {0x00,0x00,0x00,0x00,0x00,0x00,0xf0,0x3f} }; static double one = 1.00; int i,j; j = sizeof(float_reps)/8; for (i=0;i>= shift; ieee2 = (xport2 >> shift) | ((xport1 & 0x00000007) << (29 + (3 - shift))); } /* clear the 1 bit to the left of the binary point */ ieee1 &= 0xffefffff; /* set the exponent of the ieee number to be the actual */ /* exponent plus the shift count + 1023. Or this into the */ /* first half of the ieee number. The ibm exponent is excess */ /* 64 but is adjusted by 65 since during conversion to ibm */ /* format the exponent is incremented by 1 and the fraction */ /* bits left 4 positions to the right of the radix point. 
*/ ieee1 |= (((((int32_t)(*temp & 0x7f) - 65) * 4) + shift + 1023) << 20) | (xport1 & 0x80000000); doret: memreverse(&ieee1,sizeof(uint32_t)); memcpy(ieee,&ieee1,sizeof(uint32_t)); memreverse(&ieee2,sizeof(uint32_t)); memcpy(ieee+4,&ieee2,sizeof(uint32_t)); return; } /*-------------------------------------------------------------*/ /* Name: ieee2xpt */ /* Purpose: converts IEEE to transport */ /* Usage: rc = ieee2xpt(to_ieee,p_data); */ /* Notes: this routine is an adaptation of the wzctdbl routine */ /* from the Apollo. */ /*-------------------------------------------------------------*/ void ieee2xpt(unsigned char *ieee, unsigned char *xport) { register int shift; unsigned char misschar; int ieee_exp; uint32_t xport1,xport2; uint32_t ieee1 = 0; uint32_t ieee2 = 0; char ieee8[8]; memcpy(ieee8,ieee,8); /*------get 2 longs for shifting------------------------------*/ memcpy(&ieee1,ieee8,sizeof(uint32_t)); memreverse(&ieee1,sizeof(uint32_t)); memcpy(&ieee2,ieee8+4,sizeof(uint32_t)); memreverse(&ieee2,sizeof(uint32_t)); memset(xport,0,8); /*-----if IEEE value is missing (1st 2 bytes are FFFF)-----*/ if (*ieee8 == (char)0xff && ieee8[1] == (char)0xff) { misschar = ~ieee8[2]; *xport = (misschar == 0xD2) ? 0x6D : misschar; return; } /**************************************************************/ /* Translate IEEE floating point number into IBM format float */ /* */ /* IEEE format: */ /* */ /* 6 5 0 */ /* 3 1 0 */ /* */ /* SEEEEEEEEEEEMMMM ........ MMMM */ /* */ /* Sign bit, 11 bit exponent, 52 fraction. Exponent is excess */ /* 1023. The fraction is multiplied by a power of 2 of the */ /* actual exponent. Normalized floating point numbers are */ /* represented with the binary point immediately to the left */ /* of the fraction with an implied "1" to the left of the */ /* binary point. */ /* */ /* IBM format: */ /* */ /* 6 5 0 */ /* 3 5 0 */ /* */ /* SEEEEEEEMMMM ......... MMMM */ /* */ /* Sign bit, 7 bit exponent, 56 bit fraction. Exponent is */ /* excess 64. 
The fraction is multiplied by a power of 16 of */ /* of the actual exponent. Normalized floating point numbers */ /* are presented with the radix point immediately to the left */ /* of the high order hex fraction digit. */ /* */ /* How do you translate from local to IBM format? */ /* */ /* The ieee format gives you a number that has a power of 2 */ /* exponent and a fraction of the form "1.". */ /* The first step is to get that "1" bit back into the */ /* fraction. Right shift it down 1 position, set the high */ /* order bit and reduce the binary exponent by 1. Now we have */ /* a fraction that looks like ".1" and it's */ /* ready to be shoved into ibm format. The ibm fraction has 4 */ /* more bits than the ieee, the ieee fraction must therefore */ /* be shifted left 4 positions before moving it in. We must */ /* also correct the fraction bits to account for the loss of 2*/ /* bits when converting from a binary exponent to a hex one */ /* (>> 2). We must shift the fraction left for 0, 1, 2, or 3 */ /* positions to maintain the proper magnitude. Doing */ /* conversion this way would tend to lose bits in the fraction*/ /* which is not desirable or necessary if we cheat a bit. */ /* First of all, we know that we are going to have to shift */ /* the ieee fraction left 4 places to put it in the right */ /* position; we won't do that, we'll just leave it where it is*/ /* and increment the ibm exponent by one, this will have the */ /* same effect and we won't have to do any shifting. Now, */ /* since we have 4 bits in front of the fraction to work with,*/ /* we won't lose any bits. We set the bit to the left of the */ /* fraction which is the implicit "1" in the ieee fraction. We*/ /* then adjust the fraction to account for the loss of bits */ /* when going to a hex exponent. This adjustment will never */ /* involve shifting by more than 3 positions so no bits are */ /* lost. 
*/
    /* Get ieee number less the exponent into the first half of */
    /* the ibm number */
    xport1 = ieee1 & 0x000fffff;

    /* get the second half of the number into the second half of */
    /* the ibm number and see if both halves are 0. If so, ibm is */
    /* also 0 and we just return */
    if ((!(xport2 = ieee2)) && !ieee1) {
        ieee_exp = 0;
        goto doret;
    }

    /* get the actual exponent value out of the ieee number. The */
    /* ibm fraction is a power of 16 and the ieee fraction a power*/
    /* of 2 (16 ** n == 2 ** 4n). Save the low order 2 bits since */
    /* they will get lost when we divide the exponent by 4 (right */
    /* shift by 2) and we will have to shift the fraction by the */
    /* appropriate number of bits to keep the proper magnitude. */
    shift = (int)
        (ieee_exp = (int)(((ieee1 >> 16) & 0x7ff0) >> 4) - 1023)
        & 3;

    /* the ieee format has an implied "1" immediately to the left */
    /* of the binary point. Show it in here. */
    xport1 |= 0x00100000;

    if (shift) {
        /* set the first half of the ibm number by shifting it left */
        /* the appropriate number of bits and oring in the bits */
        /* from the lower half that would have been shifted in (if */
        /* we could shift a double). The shift count can never */
        /* exceed 3, so all we care about are the high order 3 */
        /* bits. We don't want sign extension so make sure it's an */
        /* unsigned char. We'll shift either 5, 6, or 7 places to */
        /* keep 3, 2, or 1 bits. After that, shift the second half */
        /* of the number the right number of places. We always get */
        /* zero fill on left shifts. */
        xport1 = (xport1 << shift) |
            ((unsigned char) (((ieee2 >> 24) & 0xE0) >>
                              (5 + (3 - shift))));

        xport2 <<= shift;
    }

    /* Now set the ibm exponent and the sign of the fraction. The */
    /* power of 2 ieee exponent must be divided by 4 and made */
    /* excess 64 (we add 65 here because of the position of the */
    /* fraction bits, essentially 4 positions lower than they */
    /* should be so we increment the ibm exponent).
*/ xport1 |= (((ieee_exp >>2) + 65) | ((ieee1 >> 24) & 0x80)) << 24; /* If the ieee exponent is greater than 248 or less than -260, */ /* then it cannot fit in the ibm exponent field. Send back the */ /* appropriate flag. */ doret: if (ieee_exp < -260) { memset(xport,0x00,8); } else if (ieee_exp > 248) { memset(xport+1,0xFF,7); *xport = 0x7F | ((ieee1 >> 24) & 0x80); } else { memreverse(&xport1,sizeof(uint32_t)); memcpy(xport,&xport1,sizeof(uint32_t)); memreverse(&xport2,sizeof(uint32_t)); memcpy(xport+4,&xport2,sizeof(uint32_t)); } return; } haven/src/readstat/readstat_io_unistd.c0000644000176200001440000000540113227731765020016 0ustar liggesusers #include #include #include #include "readstat.h" #include "readstat_io_unistd.h" #if defined _WIN32 || defined __CYGWIN__ #define UNISTD_OPEN_OPTIONS O_RDONLY | O_BINARY #elif defined _AIX #define UNISTD_OPEN_OPTIONS O_RDONLY | O_LARGEFILE #else #define UNISTD_OPEN_OPTIONS O_RDONLY #endif #if defined _WIN32 || defined _AIX #define lseek lseek64 #endif int unistd_open_handler(const char *path, void *io_ctx) { int fd = open(path, UNISTD_OPEN_OPTIONS); ((unistd_io_ctx_t*) io_ctx)->fd = fd; return fd; } int unistd_close_handler(void *io_ctx) { int fd = ((unistd_io_ctx_t*) io_ctx)->fd; if (fd != -1) return close(fd); else return 0; } readstat_off_t unistd_seek_handler(readstat_off_t offset, readstat_io_flags_t whence, void *io_ctx) { int flag = 0; switch(whence) { case READSTAT_SEEK_SET: flag = SEEK_SET; break; case READSTAT_SEEK_CUR: flag = SEEK_CUR; break; case READSTAT_SEEK_END: flag = SEEK_END; break; default: return -1; } int fd = ((unistd_io_ctx_t*) io_ctx)->fd; return lseek(fd, offset, flag); } ssize_t unistd_read_handler(void *buf, size_t nbyte, void *io_ctx) { int fd = ((unistd_io_ctx_t*) io_ctx)->fd; ssize_t out = read(fd, buf, nbyte); return out; } readstat_error_t unistd_update_handler(long file_size, readstat_progress_handler progress_handler, void *user_ctx, void *io_ctx) { if (!progress_handler) return 
READSTAT_OK;

    int fd = ((unistd_io_ctx_t*) io_ctx)->fd;
    long current_offset = lseek(fd, 0, SEEK_CUR);

    if (current_offset == -1)
        return READSTAT_ERROR_SEEK;

    if (progress_handler(1.0 * current_offset / file_size, user_ctx))
        return READSTAT_ERROR_USER_ABORT;

    return READSTAT_OK;
}

readstat_error_t unistd_io_init(readstat_parser_t *parser) {
    readstat_error_t retval = READSTAT_OK;
    unistd_io_ctx_t *io_ctx = NULL;

    if ((retval = readstat_set_open_handler(parser, unistd_open_handler)) != READSTAT_OK)
        return retval;

    if ((retval = readstat_set_close_handler(parser, unistd_close_handler)) != READSTAT_OK)
        return retval;

    if ((retval = readstat_set_seek_handler(parser, unistd_seek_handler)) != READSTAT_OK)
        return retval;

    if ((retval = readstat_set_read_handler(parser, unistd_read_handler)) != READSTAT_OK)
        return retval;

    /* Store the result in retval so that a failure is actually reported. */
    if ((retval = readstat_set_update_handler(parser, unistd_update_handler)) != READSTAT_OK)
        return retval;

    io_ctx = calloc(1, sizeof(unistd_io_ctx_t));
    if (io_ctx == NULL)
        return READSTAT_ERROR_MALLOC;
    io_ctx->fd = -1;

    retval = readstat_set_io_ctx(parser, (void*) io_ctx);
    parser->io->io_ctx_needs_free = 1;

    return retval;
}
haven/src/readstat/readstat_bits.c0000644000176200001440000000316513227731765016767 0ustar liggesusers
//
// readstat_bits.c - Bit-twiddling utility functions
//

#include <sys/types.h>
#include <stdint.h>
#include <string.h>

#include "readstat_bits.h"

int machine_is_little_endian() {
    int test_byte_order = 1;
    return ((char *)&test_byte_order)[0];
}

char ones_to_twos_complement1(char num) {
    return num < 0 ? num+1 : num;
}

int16_t ones_to_twos_complement2(int16_t num) {
    return num < 0 ? num+1 : num;
}

int32_t ones_to_twos_complement4(int32_t num) {
    return num < 0 ? num+1 : num;
}

char twos_to_ones_complement1(char num) {
    return num < 0 ? num-1 : num;
}

int16_t twos_to_ones_complement2(int16_t num) {
    return num < 0 ? num-1 : num;
}

int32_t twos_to_ones_complement4(int32_t num) {
    return num < 0 ?
num-1 : num; } uint16_t byteswap2(uint16_t num) { return ((num & 0xFF00) >> 8) | ((num & 0x00FF) << 8); } uint32_t byteswap4(uint32_t num) { num = ((num & 0xFFFF0000) >> 16) | ((num & 0x0000FFFF) << 16); return ((num & 0xFF00FF00) >> 8) | ((num & 0x00FF00FF) << 8); } uint64_t byteswap8(uint64_t num) { num = ((num & 0xFFFFFFFF00000000) >> 32) | ((num & 0x00000000FFFFFFFF) << 32); num = ((num & 0xFFFF0000FFFF0000) >> 16) | ((num & 0x0000FFFF0000FFFF) << 16); return ((num & 0xFF00FF00FF00FF00) >> 8) | ((num & 0x00FF00FF00FF00FF) << 8); } float byteswap_float(float num) { uint32_t answer = 0; memcpy(&answer, &num, 4); answer = byteswap4(answer); memcpy(&num, &answer, 4); return num; } double byteswap_double(double num) { uint64_t answer = 0; memcpy(&answer, &num, 8); answer = byteswap8(answer); memcpy(&num, &answer, 8); return num; } haven/src/readstat/readstat_malloc.h0000644000176200001440000000020613227731765017273 0ustar liggesusers void *readstat_malloc(size_t size); void *readstat_calloc(size_t count, size_t size); void *readstat_realloc(void *ptr, size_t len); haven/src/readstat/readstat_error.c0000644000176200001440000001156713227731765017164 0ustar liggesusers #include "readstat.h" const char *readstat_error_message(readstat_error_t error_code) { if (error_code == READSTAT_OK) return NULL; if (error_code == READSTAT_ERROR_OPEN) return "Unable to open file"; if (error_code == READSTAT_ERROR_READ) return "Unable to read from file"; if (error_code == READSTAT_ERROR_MALLOC) return "Unable to allocate memory"; if (error_code == READSTAT_ERROR_USER_ABORT) return "The parsing was aborted (callback returned non-zero value)"; if (error_code == READSTAT_ERROR_PARSE) return "Invalid file, or file has unsupported features"; if (error_code == READSTAT_ERROR_UNSUPPORTED_COMPRESSION) return "File has unsupported compression scheme"; if (error_code == READSTAT_ERROR_UNSUPPORTED_CHARSET) return "File has an unsupported character set"; if (error_code == 
READSTAT_ERROR_COLUMN_COUNT_MISMATCH) return "File did not contain the expected number of columns"; if (error_code == READSTAT_ERROR_ROW_COUNT_MISMATCH) return "File did not contain the expected number of rows"; if (error_code == READSTAT_ERROR_ROW_WIDTH_MISMATCH) return "A row in the file was not the expected length"; if (error_code == READSTAT_ERROR_BAD_FORMAT_STRING) return "A provided format string could not be understood"; if (error_code == READSTAT_ERROR_VALUE_TYPE_MISMATCH) return "A provided value was incompatible with the variable's declared type"; if (error_code == READSTAT_ERROR_WRITE) return "Unable to write data"; if (error_code == READSTAT_ERROR_WRITER_NOT_INITIALIZED) return "The writer object was not properly initialized (call and check return value of readstat_begin_writing_XXX)"; if (error_code == READSTAT_ERROR_SEEK) return "Unable to seek within file"; if (error_code == READSTAT_ERROR_CONVERT) return "Unable to convert string to the requested encoding"; if (error_code == READSTAT_ERROR_CONVERT_BAD_STRING) return "Unable to convert string to the requested encoding (invalid byte sequence)"; if (error_code == READSTAT_ERROR_CONVERT_SHORT_STRING) return "Unable to convert string to the requested encoding (incomplete byte sequence)"; if (error_code == READSTAT_ERROR_CONVERT_LONG_STRING) return "Unable to convert string to the requested encoding (output buffer too small)"; if (error_code == READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE) return "A provided numeric value was outside the range of representable values in the specified file format"; if (error_code == READSTAT_ERROR_TAGGED_VALUE_IS_OUT_OF_RANGE) return "A provided tag value was outside the range of allowed values in the specified file format"; if (error_code == READSTAT_ERROR_STRING_VALUE_IS_TOO_LONG) return "A provided string value was longer than the available storage size of the specified column"; if (error_code == READSTAT_ERROR_TAGGED_VALUES_NOT_SUPPORTED) return "The file format does 
not support character tags for missing values";

    if (error_code == READSTAT_ERROR_UNSUPPORTED_FILE_FORMAT_VERSION)
        return "This version of the file format is not supported";

    if (error_code == READSTAT_ERROR_NAME_BEGINS_WITH_ILLEGAL_CHARACTER)
        return "A provided column name begins with an illegal character (must be a letter or underscore)";

    if (error_code == READSTAT_ERROR_NAME_CONTAINS_ILLEGAL_CHARACTER)
        return "A provided column name contains an illegal character (must be a letter, number, or underscore)";

    if (error_code == READSTAT_ERROR_NAME_IS_RESERVED_WORD)
        return "A provided column name is a reserved word";

    if (error_code == READSTAT_ERROR_NAME_IS_TOO_LONG)
        return "A provided column name is too long for the file format";

    if (error_code == READSTAT_ERROR_BAD_TIMESTAMP)
        return "The file's timestamp string is invalid";

    if (error_code == READSTAT_ERROR_BAD_FREQUENCY_WEIGHT)
        return "The provided variable can't be used as a frequency weight";

    if (error_code == READSTAT_ERROR_TOO_MANY_MISSING_VALUE_DEFINITIONS)
        return "The number of defined missing values exceeds the format limit";

    if (error_code == READSTAT_ERROR_NOTE_IS_TOO_LONG)
        return "The provided note is too long for the file format";

    if (error_code == READSTAT_ERROR_STRING_REFS_NOT_SUPPORTED)
        return "This version of the file format does not support string references";

    if (error_code == READSTAT_ERROR_STRING_REF_IS_REQUIRED)
        return "The provided value was not a valid string reference";

    if (error_code == READSTAT_ERROR_ROW_IS_TOO_WIDE_FOR_PAGE)
        return "A row of data will not fit into the file format";

    if (error_code == READSTAT_ERROR_ROW_IS_EMPTY)
        return "One or more columns must be provided";

    return "Unknown error";
}
haven/src/readstat/readstat_io_unistd.h0000644000176200001440000000103113227731765020016 0ustar liggesusers
typedef struct unistd_io_ctx_s {
    int fd;
} unistd_io_ctx_t;

int unistd_open_handler(const char *path, void *io_ctx);
int unistd_close_handler(void *io_ctx);
readstat_off_t
unistd_seek_handler(readstat_off_t offset, readstat_io_flags_t whence, void *io_ctx); ssize_t unistd_read_handler(void *buf, size_t nbytes, void *io_ctx); readstat_error_t unistd_update_handler(long file_size, readstat_progress_handler progress_handler, void *user_ctx, void *io_ctx); readstat_error_t unistd_io_init(readstat_parser_t *parser); haven/src/readstat/readstat.h0000644000176200001440000006014513227731765015754 0ustar liggesusers// // readstat.h - API and internal data structures for ReadStat // // Copyright Evan Miller and ReadStat authors (see LICENSE) // #ifndef INCLUDE_READSTAT_H #define INCLUDE_READSTAT_H #ifdef __cplusplus extern "C" { #endif #include #include #include #include #include enum { READSTAT_HANDLER_OK, READSTAT_HANDLER_ABORT, READSTAT_HANDLER_SKIP_VARIABLE }; typedef enum readstat_type_e { READSTAT_TYPE_STRING, READSTAT_TYPE_INT8, READSTAT_TYPE_INT16, READSTAT_TYPE_INT32, READSTAT_TYPE_FLOAT, READSTAT_TYPE_DOUBLE, READSTAT_TYPE_STRING_REF } readstat_type_t; typedef enum readstat_type_class_e { READSTAT_TYPE_CLASS_STRING, READSTAT_TYPE_CLASS_NUMERIC } readstat_type_class_t; typedef enum readstat_measure_e { READSTAT_MEASURE_UNKNOWN, READSTAT_MEASURE_NOMINAL = 1, READSTAT_MEASURE_ORDINAL, READSTAT_MEASURE_SCALE } readstat_measure_t; typedef enum readstat_alignment_e { READSTAT_ALIGNMENT_UNKNOWN, READSTAT_ALIGNMENT_LEFT = 1, READSTAT_ALIGNMENT_CENTER, READSTAT_ALIGNMENT_RIGHT } readstat_alignment_t; typedef enum readstat_compress_e { READSTAT_COMPRESS_NONE, READSTAT_COMPRESS_ROWS } readstat_compress_t; typedef enum readstat_error_e { READSTAT_OK, READSTAT_ERROR_OPEN = 1, READSTAT_ERROR_READ, READSTAT_ERROR_MALLOC, READSTAT_ERROR_USER_ABORT, READSTAT_ERROR_PARSE, READSTAT_ERROR_UNSUPPORTED_COMPRESSION, READSTAT_ERROR_UNSUPPORTED_CHARSET, READSTAT_ERROR_COLUMN_COUNT_MISMATCH, READSTAT_ERROR_ROW_COUNT_MISMATCH, READSTAT_ERROR_ROW_WIDTH_MISMATCH, READSTAT_ERROR_BAD_FORMAT_STRING, READSTAT_ERROR_VALUE_TYPE_MISMATCH, READSTAT_ERROR_WRITE, 
READSTAT_ERROR_WRITER_NOT_INITIALIZED, READSTAT_ERROR_SEEK, READSTAT_ERROR_CONVERT, READSTAT_ERROR_CONVERT_BAD_STRING, READSTAT_ERROR_CONVERT_SHORT_STRING, READSTAT_ERROR_CONVERT_LONG_STRING, READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE, READSTAT_ERROR_TAGGED_VALUE_IS_OUT_OF_RANGE, READSTAT_ERROR_STRING_VALUE_IS_TOO_LONG, READSTAT_ERROR_TAGGED_VALUES_NOT_SUPPORTED, READSTAT_ERROR_UNSUPPORTED_FILE_FORMAT_VERSION, READSTAT_ERROR_NAME_BEGINS_WITH_ILLEGAL_CHARACTER, READSTAT_ERROR_NAME_CONTAINS_ILLEGAL_CHARACTER, READSTAT_ERROR_NAME_IS_RESERVED_WORD, READSTAT_ERROR_NAME_IS_TOO_LONG, READSTAT_ERROR_BAD_TIMESTAMP, READSTAT_ERROR_BAD_FREQUENCY_WEIGHT, READSTAT_ERROR_TOO_MANY_MISSING_VALUE_DEFINITIONS, READSTAT_ERROR_NOTE_IS_TOO_LONG, READSTAT_ERROR_STRING_REFS_NOT_SUPPORTED, READSTAT_ERROR_STRING_REF_IS_REQUIRED, READSTAT_ERROR_ROW_IS_TOO_WIDE_FOR_PAGE, READSTAT_ERROR_ROW_IS_EMPTY } readstat_error_t; const char *readstat_error_message(readstat_error_t error_code); typedef struct readstat_value_s { union { float float_value; double double_value; int8_t i8_value; int16_t i16_value; int32_t i32_value; const char *string_value; } v; readstat_type_t type; char tag; unsigned int is_system_missing:1; unsigned int is_tagged_missing:1; } readstat_value_t; /* Internal data structures */ typedef struct readstat_value_label_s { double double_key; int32_t int32_key; char tag; char *string_key; size_t string_key_len; char *label; size_t label_len; } readstat_value_label_t; typedef struct readstat_label_set_s { readstat_type_t type; char name[256]; readstat_value_label_t *value_labels; long value_labels_count; long value_labels_capacity; void *variables; long variables_count; long variables_capacity; } readstat_label_set_t; typedef struct readstat_missingness_s { readstat_value_t missing_ranges[32]; long missing_ranges_count; } readstat_missingness_t; typedef struct readstat_variable_s { readstat_type_t type; int index; char name[300]; char format[256]; char label[1024]; 
readstat_label_set_t *label_set; off_t offset; size_t storage_width; size_t user_width; readstat_missingness_t missingness; readstat_measure_t measure; readstat_alignment_t alignment; int display_width; int decimals; int skip; int index_after_skipping; } readstat_variable_t; /* Value accessors */ readstat_type_t readstat_value_type(readstat_value_t value); readstat_type_class_t readstat_value_type_class(readstat_value_t value); /* Values can be missing in one of three ways: * 1. "System missing", delivered to value handlers as NaN. Occurs in all file * types. The most common kind of missing value. * 2. Tagged missing, also delivered as NaN, but with a single-character tag * accessible via readstat_value_tag(). The tag might be 'a', 'b', etc., * corresponding to Stata's .a, .b, etc. values. Occurs only in Stata and * SAS files. * 3. Defined missing. The value is a real number but is to be treated as * missing according to the variable's missingness rules (such as "value < 0 || * value == 999"). Occurs only in SPSS files. Access the rules via: * * readstat_variable_get_missing_ranges_count() * readstat_variable_get_missing_range_lo() * readstat_variable_get_missing_range_hi() * * Note that "ranges" include individual values where lo == hi. 
* * readstat_value_is_missing() is equivalent to: * * (readstat_value_is_system_missing() * || readstat_value_is_tagged_missing() * || readstat_value_is_defined_missing()) */ int readstat_value_is_missing(readstat_value_t value, readstat_variable_t *variable); int readstat_value_is_system_missing(readstat_value_t value); int readstat_value_is_tagged_missing(readstat_value_t value); int readstat_value_is_defined_missing(readstat_value_t value, readstat_variable_t *variable); char readstat_value_tag(readstat_value_t value); char readstat_int8_value(readstat_value_t value); int16_t readstat_int16_value(readstat_value_t value); int32_t readstat_int32_value(readstat_value_t value); float readstat_float_value(readstat_value_t value); double readstat_double_value(readstat_value_t value); const char *readstat_string_value(readstat_value_t value); readstat_type_class_t readstat_type_class(readstat_type_t type); /* Accessor methods for use inside variable handlers */ int readstat_variable_get_index(const readstat_variable_t *variable); int readstat_variable_get_index_after_skipping(const readstat_variable_t *variable); const char *readstat_variable_get_name(const readstat_variable_t *variable); const char *readstat_variable_get_label(const readstat_variable_t *variable); const char *readstat_variable_get_format(const readstat_variable_t *variable); readstat_type_t readstat_variable_get_type(const readstat_variable_t *variable); readstat_type_class_t readstat_variable_get_type_class(const readstat_variable_t *variable); size_t readstat_variable_get_storage_width(const readstat_variable_t *variable); int readstat_variable_get_display_width(const readstat_variable_t *variable); readstat_measure_t readstat_variable_get_measure(const readstat_variable_t *variable); readstat_alignment_t readstat_variable_get_alignment(const readstat_variable_t *variable); int readstat_variable_get_missing_ranges_count(const readstat_variable_t *variable); readstat_value_t 
readstat_variable_get_missing_range_lo(const readstat_variable_t *variable, int i); readstat_value_t readstat_variable_get_missing_range_hi(const readstat_variable_t *variable, int i); /* Callbacks should return 0 (aka READSTAT_HANDLER_OK) on success and 1 (aka READSTAT_HANDLER_ABORT) to abort. */ /* If the variable handler returns READSTAT_HANDLER_SKIP_VARIABLE, the value handler will not be called on * the associated variable. (Note that subsequent variables will retain their original index values.) */ typedef int (*readstat_info_handler)(int obs_count, int var_count, void *ctx); typedef int (*readstat_metadata_handler)(const char *file_label, const char *orig_encoding, time_t timestamp, long format_version, void *ctx); typedef int (*readstat_note_handler)(int note_index, const char *note, void *ctx); typedef int (*readstat_variable_handler)(int index, readstat_variable_t *variable, const char *val_labels, void *ctx); typedef int (*readstat_fweight_handler)(readstat_variable_t *variable, void *ctx); typedef int (*readstat_value_handler)(int obs_index, readstat_variable_t *variable, readstat_value_t value, void *ctx); typedef int (*readstat_value_label_handler)(const char *val_labels, readstat_value_t value, const char *label, void *ctx); typedef void (*readstat_error_handler)(const char *error_message, void *ctx); typedef int (*readstat_progress_handler)(double progress, void *ctx); #if defined _WIN32 || defined __CYGWIN__ typedef _off64_t readstat_off_t; #elif defined _AIX typedef off64_t readstat_off_t; #else typedef off_t readstat_off_t; #endif typedef enum readstat_io_flags_e { READSTAT_SEEK_SET, READSTAT_SEEK_CUR, READSTAT_SEEK_END } readstat_io_flags_t; typedef int (*readstat_open_handler)(const char *path, void *io_ctx); typedef int (*readstat_close_handler)(void *io_ctx); typedef readstat_off_t (*readstat_seek_handler)(readstat_off_t offset, readstat_io_flags_t whence, void *io_ctx); typedef ssize_t (*readstat_read_handler)(void *buf, size_t nbyte, void 
*io_ctx); typedef readstat_error_t (*readstat_update_handler)(long file_size, readstat_progress_handler progress_handler, void *user_ctx, void *io_ctx); typedef struct readstat_io_s { readstat_open_handler open; readstat_close_handler close; readstat_seek_handler seek; readstat_read_handler read; readstat_update_handler update; void *io_ctx; int io_ctx_needs_free; } readstat_io_t; typedef struct readstat_parser_s { readstat_info_handler info_handler; readstat_metadata_handler metadata_handler; readstat_note_handler note_handler; readstat_variable_handler variable_handler; readstat_fweight_handler fweight_handler; readstat_value_handler value_handler; readstat_value_label_handler value_label_handler; readstat_error_handler error_handler; readstat_progress_handler progress_handler; readstat_io_t *io; const char *input_encoding; const char *output_encoding; long row_limit; } readstat_parser_t; readstat_parser_t *readstat_parser_init(void); void readstat_parser_free(readstat_parser_t *parser); void readstat_io_free(readstat_io_t *io); readstat_error_t readstat_set_info_handler(readstat_parser_t *parser, readstat_info_handler info_handler); readstat_error_t readstat_set_metadata_handler(readstat_parser_t *parser, readstat_metadata_handler metadata_handler); readstat_error_t readstat_set_note_handler(readstat_parser_t *parser, readstat_note_handler note_handler); readstat_error_t readstat_set_variable_handler(readstat_parser_t *parser, readstat_variable_handler variable_handler); readstat_error_t readstat_set_fweight_handler(readstat_parser_t *parser, readstat_fweight_handler fweight_handler); readstat_error_t readstat_set_value_handler(readstat_parser_t *parser, readstat_value_handler value_handler); readstat_error_t readstat_set_value_label_handler(readstat_parser_t *parser, readstat_value_label_handler value_label_handler); readstat_error_t readstat_set_error_handler(readstat_parser_t *parser, readstat_error_handler error_handler); readstat_error_t 
readstat_set_progress_handler(readstat_parser_t *parser, readstat_progress_handler progress_handler); readstat_error_t readstat_set_open_handler(readstat_parser_t *parser, readstat_open_handler open_handler); readstat_error_t readstat_set_close_handler(readstat_parser_t *parser, readstat_close_handler close_handler); readstat_error_t readstat_set_seek_handler(readstat_parser_t *parser, readstat_seek_handler seek_handler); readstat_error_t readstat_set_read_handler(readstat_parser_t *parser, readstat_read_handler read_handler); readstat_error_t readstat_set_update_handler(readstat_parser_t *parser, readstat_update_handler update_handler); readstat_error_t readstat_set_io_ctx(readstat_parser_t *parser, void *io_ctx); // Usually inferred from the file, but sometimes a manual override is desirable. // In particular, pre-14 Stata uses the system encoding, which is usually Win 1252 // but could be anything. `encoding' should be an iconv-compatible name. readstat_error_t readstat_set_file_character_encoding(readstat_parser_t *parser, const char *encoding); // Defaults to UTF-8. Pass in NULL to disable transliteration. 
readstat_error_t readstat_set_handler_character_encoding(readstat_parser_t *parser, const char *encoding); readstat_error_t readstat_set_row_limit(readstat_parser_t *parser, long row_limit); readstat_error_t readstat_parse_dta(readstat_parser_t *parser, const char *path, void *user_ctx); readstat_error_t readstat_parse_sav(readstat_parser_t *parser, const char *path, void *user_ctx); readstat_error_t readstat_parse_por(readstat_parser_t *parser, const char *path, void *user_ctx); readstat_error_t readstat_parse_sas7bdat(readstat_parser_t *parser, const char *path, void *user_ctx); readstat_error_t readstat_parse_sas7bcat(readstat_parser_t *parser, const char *path, void *user_ctx); readstat_error_t readstat_parse_xport(readstat_parser_t *parser, const char *path, void *user_ctx); /* Internal module callbacks */ typedef struct readstat_string_ref_s { int64_t first_v; int64_t first_o; size_t len; char data[1]; // Flexible array; using [1] for C++98 compatibility } readstat_string_ref_t; typedef size_t (*readstat_variable_width_callback)(readstat_type_t type, size_t user_width); typedef readstat_error_t (*readstat_variable_ok_callback)(readstat_variable_t *variable); typedef readstat_error_t (*readstat_write_int8_callback)(void *row_data, const readstat_variable_t *variable, int8_t value); typedef readstat_error_t (*readstat_write_int16_callback)(void *row_data, const readstat_variable_t *variable, int16_t value); typedef readstat_error_t (*readstat_write_int32_callback)(void *row_data, const readstat_variable_t *variable, int32_t value); typedef readstat_error_t (*readstat_write_float_callback)(void *row_data, const readstat_variable_t *variable, float value); typedef readstat_error_t (*readstat_write_double_callback)(void *row_data, const readstat_variable_t *variable, double value); typedef readstat_error_t (*readstat_write_string_callback)(void *row_data, const readstat_variable_t *variable, const char *value); typedef readstat_error_t 
(*readstat_write_string_ref_callback)(void *row_data, const readstat_variable_t *variable, readstat_string_ref_t *ref); typedef readstat_error_t (*readstat_write_missing_callback)(void *row_data, const readstat_variable_t *variable); typedef readstat_error_t (*readstat_write_tagged_callback)(void *row_data, const readstat_variable_t *variable, char tag); typedef readstat_error_t (*readstat_begin_data_callback)(void *writer); typedef readstat_error_t (*readstat_write_row_callback)(void *writer, void *row_data, size_t row_len); typedef readstat_error_t (*readstat_end_data_callback)(void *writer); typedef void (*readstat_module_ctx_free_callback)(void *module_ctx); typedef struct readstat_writer_callbacks_s { readstat_variable_width_callback variable_width; readstat_variable_ok_callback variable_ok; readstat_write_int8_callback write_int8; readstat_write_int16_callback write_int16; readstat_write_int32_callback write_int32; readstat_write_float_callback write_float; readstat_write_double_callback write_double; readstat_write_string_callback write_string; readstat_write_string_ref_callback write_string_ref; readstat_write_missing_callback write_missing_string; readstat_write_missing_callback write_missing_number; readstat_write_tagged_callback write_missing_tagged; readstat_begin_data_callback begin_data; readstat_write_row_callback write_row; readstat_end_data_callback end_data; readstat_module_ctx_free_callback module_ctx_free; } readstat_writer_callbacks_t; /* You'll need to define one of these to get going. 
Should return # bytes written, * or -1 on error, a la write(2) */ typedef ssize_t (*readstat_data_writer)(const void *data, size_t len, void *ctx); typedef struct readstat_writer_s { readstat_data_writer data_writer; size_t bytes_written; long version; int is_64bit; // SAS only readstat_compress_t compression; time_t timestamp; readstat_variable_t **variables; long variables_count; long variables_capacity; readstat_label_set_t **label_sets; long label_sets_count; long label_sets_capacity; char **notes; long notes_count; long notes_capacity; readstat_string_ref_t **string_refs; long string_refs_count; long string_refs_capacity; unsigned char *row; size_t row_len; int row_count; int current_row; char file_label[100]; const readstat_variable_t *fweight_variable; readstat_writer_callbacks_t callbacks; readstat_error_handler error_handler; void *module_ctx; void *user_ctx; int initialized; } readstat_writer_t; /* Writer API */ // First call this... readstat_writer_t *readstat_writer_init(void); // Then specify a function that will handle the output bytes... readstat_error_t readstat_set_data_writer(readstat_writer_t *writer, readstat_data_writer data_writer); // Next define your value labels, if any. Create as many named sets as you'd like. readstat_label_set_t *readstat_add_label_set(readstat_writer_t *writer, readstat_type_t type, const char *name); void readstat_label_double_value(readstat_label_set_t *label_set, double value, const char *label); void readstat_label_int32_value(readstat_label_set_t *label_set, int32_t value, const char *label); void readstat_label_string_value(readstat_label_set_t *label_set, const char *value, const char *label); void readstat_label_tagged_value(readstat_label_set_t *label_set, char tag, const char *label); // Now define your variables. 
Note that `storage_width' is used for: // * READSTAT_TYPE_STRING variables in all formats // * READSTAT_TYPE_DOUBLE variables, but only in the SAS XPORT format (valid values 3-8, defaults to 8) readstat_variable_t *readstat_add_variable(readstat_writer_t *writer, const char *name, readstat_type_t type, size_t storage_width); void readstat_variable_set_label(readstat_variable_t *variable, const char *label); void readstat_variable_set_format(readstat_variable_t *variable, const char *format); void readstat_variable_set_label_set(readstat_variable_t *variable, readstat_label_set_t *label_set); void readstat_variable_set_measure(readstat_variable_t *variable, readstat_measure_t measure); void readstat_variable_set_alignment(readstat_variable_t *variable, readstat_alignment_t alignment); void readstat_variable_set_display_width(readstat_variable_t *variable, int display_width); void readstat_variable_add_missing_double_value(readstat_variable_t *variable, double value); void readstat_variable_add_missing_double_range(readstat_variable_t *variable, double lo, double hi); readstat_variable_t *readstat_get_variable(readstat_writer_t *writer, int index); // "Notes" appear in the file metadata. In SPSS these are stored as // lines in the Document Record; in Stata these are stored using // the "notes" feature. // // Note that the line length in SPSS is 80 characters; ReadStat will // produce a write error if a note is longer than this limit. void readstat_add_note(readstat_writer_t *writer, const char *note); // String refs are used for creating a READSTAT_TYPE_STRING_REF column, // which is only supported in Stata. String references can be shared // across columns, and inserted with readstat_insert_string_ref(). 
readstat_string_ref_t *readstat_add_string_ref(readstat_writer_t *writer, const char *string); readstat_string_ref_t *readstat_get_string_ref(readstat_writer_t *writer, int index); // Optional metadata readstat_error_t readstat_writer_set_file_label(readstat_writer_t *writer, const char *file_label); readstat_error_t readstat_writer_set_file_timestamp(readstat_writer_t *writer, time_t timestamp); readstat_error_t readstat_writer_set_fweight_variable(readstat_writer_t *writer, const readstat_variable_t *variable); readstat_error_t readstat_writer_set_file_format_version(readstat_writer_t *writer, long file_format_version); // e.g. 104-118 for DTA; 5 or 8 for SAS Transport readstat_error_t readstat_writer_set_file_format_is_64bit(readstat_writer_t *writer, int is_64bit); // applies only to SAS files; defaults to 1=true readstat_error_t readstat_writer_set_compression(readstat_writer_t *writer, readstat_compress_t compression); // applies only to SAS and SAV files // Optional error handler readstat_error_t readstat_writer_set_error_handler(readstat_writer_t *writer, readstat_error_handler error_handler); // Call one of these at any time before the first invocation of readstat_begin_row readstat_error_t readstat_begin_writing_dta(readstat_writer_t *writer, void *user_ctx, long row_count); readstat_error_t readstat_begin_writing_por(readstat_writer_t *writer, void *user_ctx, long row_count); readstat_error_t readstat_begin_writing_sas7bcat(readstat_writer_t *writer, void *user_ctx); readstat_error_t readstat_begin_writing_sas7bdat(readstat_writer_t *writer, void *user_ctx, long row_count); readstat_error_t readstat_begin_writing_sav(readstat_writer_t *writer, void *user_ctx, long row_count); readstat_error_t readstat_begin_writing_xport(readstat_writer_t *writer, void *user_ctx, long row_count); // Start a row of data (that is, a case or observation) readstat_error_t readstat_begin_row(readstat_writer_t *writer); // Then call one of these for each variable 
readstat_error_t readstat_insert_int8_value(readstat_writer_t *writer, const readstat_variable_t *variable, int8_t value); readstat_error_t readstat_insert_int16_value(readstat_writer_t *writer, const readstat_variable_t *variable, int16_t value); readstat_error_t readstat_insert_int32_value(readstat_writer_t *writer, const readstat_variable_t *variable, int32_t value); readstat_error_t readstat_insert_float_value(readstat_writer_t *writer, const readstat_variable_t *variable, float value); readstat_error_t readstat_insert_double_value(readstat_writer_t *writer, const readstat_variable_t *variable, double value); readstat_error_t readstat_insert_string_value(readstat_writer_t *writer, const readstat_variable_t *variable, const char *value); readstat_error_t readstat_insert_string_ref(readstat_writer_t *writer, const readstat_variable_t *variable, readstat_string_ref_t *ref); readstat_error_t readstat_insert_missing_value(readstat_writer_t *writer, const readstat_variable_t *variable); readstat_error_t readstat_insert_tagged_missing_value(readstat_writer_t *writer, const readstat_variable_t *variable, char tag); // Finally, close out the row readstat_error_t readstat_end_row(readstat_writer_t *writer); // Once you've written all the rows, clean up after yourself readstat_error_t readstat_end_writing(readstat_writer_t *writer); void readstat_writer_free(readstat_writer_t *writer); #ifdef __cplusplus } #endif #endif haven/src/readstat/readstat_malloc.c0000644000176200001440000000127413227731765017274 0ustar liggesusers#include #define MAX_MALLOC_SIZE (1<<20) /* One megabyte ought to be enough for anyone */ void *readstat_malloc(size_t len) { if (len > MAX_MALLOC_SIZE || len == 0) { return NULL; } return malloc(len); } void *readstat_calloc(size_t count, size_t size) { if (count > MAX_MALLOC_SIZE || size > MAX_MALLOC_SIZE || count * size > MAX_MALLOC_SIZE) { return NULL; } if (count == 0 || size == 0) { return NULL; } return calloc(count, size); } void 
*readstat_realloc(void *ptr, size_t len) { if (len > MAX_MALLOC_SIZE || len == 0) { if (ptr) free(ptr); return NULL; } return realloc(ptr, len); } haven/src/readstat/readstat_convert.h0000644000176200001440000000016313227731765017506 0ustar liggesusers readstat_error_t readstat_convert(char *dst, size_t dst_len, const char *src, size_t src_len, iconv_t converter); haven/src/readstat/readstat_iconv.h0000644000176200001440000000037413227731765017148 0ustar liggesusers#include <iconv.h> #ifdef WINICONV_CONST typedef const char ** readstat_iconv_inbuf_t; #else typedef char ** readstat_iconv_inbuf_t; #endif typedef struct readstat_charset_entry_s { int code; char name[32]; } readstat_charset_entry_t; haven/src/readstat/readstat_value.c0000644000176200001440000001075013227731765017140 0ustar liggesusers #include "readstat.h" readstat_type_class_t readstat_type_class(readstat_type_t type) { if (type == READSTAT_TYPE_STRING || type == READSTAT_TYPE_STRING_REF) return READSTAT_TYPE_CLASS_STRING; return READSTAT_TYPE_CLASS_NUMERIC; } readstat_type_t readstat_value_type(readstat_value_t value) { return value.type; } readstat_type_class_t readstat_value_type_class(readstat_value_t value) { return readstat_type_class(value.type); } char readstat_value_tag(readstat_value_t value) { return value.tag; } int readstat_value_is_missing(readstat_value_t value, readstat_variable_t *variable) { if (value.is_system_missing || value.is_tagged_missing) return 1; if (variable) return readstat_value_is_defined_missing(value, variable); return 0; } int readstat_value_is_system_missing(readstat_value_t value) { return (value.is_system_missing); } int readstat_value_is_tagged_missing(readstat_value_t value) { return (value.is_tagged_missing); } int readstat_value_is_defined_missing(readstat_value_t value, readstat_variable_t *variable) { if (readstat_value_type_class(value) != READSTAT_TYPE_CLASS_NUMERIC || readstat_variable_get_type_class(variable) != READSTAT_TYPE_CLASS_NUMERIC) return 0; double fp_value = 
readstat_double_value(value); int count = readstat_variable_get_missing_ranges_count(variable); int i; for (i=0; i<count; i++) { double lo = readstat_double_value(readstat_variable_get_missing_range_lo(variable, i)); double hi = readstat_double_value(readstat_variable_get_missing_range_hi(variable, i)); if (fp_value >= lo && fp_value <= hi) { return 1; } } return 0; } char readstat_int8_value(readstat_value_t value) { if (readstat_value_is_system_missing(value)) return 0; if (value.type == READSTAT_TYPE_DOUBLE) return value.v.double_value; if (value.type == READSTAT_TYPE_FLOAT) return value.v.float_value; if (value.type == READSTAT_TYPE_INT32) return value.v.i32_value; if (value.type == READSTAT_TYPE_INT16) return value.v.i16_value; if (value.type == READSTAT_TYPE_INT8) return value.v.i8_value; return 0; } int16_t readstat_int16_value(readstat_value_t value) { if (readstat_value_is_system_missing(value)) return 0; if (value.type == READSTAT_TYPE_DOUBLE) return value.v.double_value; if (value.type == READSTAT_TYPE_FLOAT) return value.v.float_value; if (value.type == READSTAT_TYPE_INT32) return value.v.i32_value; if (value.type == READSTAT_TYPE_INT16) return value.v.i16_value; if (value.type == READSTAT_TYPE_INT8) return value.v.i8_value; return 0; } int32_t readstat_int32_value(readstat_value_t value) { if (readstat_value_is_system_missing(value)) return 0; if (value.type == READSTAT_TYPE_DOUBLE) return value.v.double_value; if (value.type == READSTAT_TYPE_FLOAT) return value.v.float_value; if (value.type == READSTAT_TYPE_INT32) return value.v.i32_value; if (value.type == READSTAT_TYPE_INT16) return value.v.i16_value; if (value.type == READSTAT_TYPE_INT8) return value.v.i8_value; return 0; } float readstat_float_value(readstat_value_t value) { if (readstat_value_is_system_missing(value)) return NAN; if (value.type == READSTAT_TYPE_DOUBLE) return value.v.double_value; if (value.type == READSTAT_TYPE_FLOAT) return value.v.float_value; if (value.type == READSTAT_TYPE_INT32) return value.v.i32_value; if (value.type == READSTAT_TYPE_INT16) return value.v.i16_value; if (value.type == READSTAT_TYPE_INT8) return value.v.i8_value; return value.v.float_value; } double 
readstat_double_value(readstat_value_t value) { if (readstat_value_is_system_missing(value)) return NAN; if (value.type == READSTAT_TYPE_DOUBLE) return value.v.double_value; if (value.type == READSTAT_TYPE_FLOAT) return value.v.float_value; if (value.type == READSTAT_TYPE_INT32) return value.v.i32_value; if (value.type == READSTAT_TYPE_INT16) return value.v.i16_value; if (value.type == READSTAT_TYPE_INT8) return value.v.i8_value; return NAN; } const char *readstat_string_value(readstat_value_t value) { if (readstat_value_type(value) == READSTAT_TYPE_STRING) return value.v.string_value; return NULL; } haven/src/readstat/readstat_parser.c0000644000176200001440000000756213227731765017327 0ustar liggesusers #include <stdlib.h> #include "readstat.h" #include "readstat_io_unistd.h" readstat_parser_t *readstat_parser_init() { readstat_parser_t *parser = calloc(1, sizeof(readstat_parser_t)); parser->io = calloc(1, sizeof(readstat_io_t)); if (unistd_io_init(parser) != READSTAT_OK) { readstat_parser_free(parser); return NULL; } parser->output_encoding = "UTF-8"; return parser; } void readstat_parser_free(readstat_parser_t *parser) { if (parser) { if (parser->io) { readstat_set_io_ctx(parser, NULL); free(parser->io); } free(parser); } } readstat_error_t readstat_set_info_handler(readstat_parser_t *parser, readstat_info_handler info_handler) { parser->info_handler = info_handler; return READSTAT_OK; } readstat_error_t readstat_set_metadata_handler(readstat_parser_t *parser, readstat_metadata_handler metadata_handler) { parser->metadata_handler = metadata_handler; return READSTAT_OK; } readstat_error_t readstat_set_note_handler(readstat_parser_t *parser, readstat_note_handler note_handler) { parser->note_handler = note_handler; return READSTAT_OK; } readstat_error_t readstat_set_variable_handler(readstat_parser_t *parser, readstat_variable_handler variable_handler) { parser->variable_handler = variable_handler; return READSTAT_OK; } readstat_error_t 
readstat_set_value_handler(readstat_parser_t *parser, readstat_value_handler value_handler) { parser->value_handler = value_handler; return READSTAT_OK; } readstat_error_t readstat_set_value_label_handler(readstat_parser_t *parser, readstat_value_label_handler label_handler) { parser->value_label_handler = label_handler; return READSTAT_OK; } readstat_error_t readstat_set_error_handler(readstat_parser_t *parser, readstat_error_handler error_handler) { parser->error_handler = error_handler; return READSTAT_OK; } readstat_error_t readstat_set_progress_handler(readstat_parser_t *parser, readstat_progress_handler progress_handler) { parser->progress_handler = progress_handler; return READSTAT_OK; } readstat_error_t readstat_set_fweight_handler(readstat_parser_t *parser, readstat_fweight_handler fweight_handler) { parser->fweight_handler = fweight_handler; return READSTAT_OK; } readstat_error_t readstat_set_open_handler(readstat_parser_t *parser, readstat_open_handler open_handler) { parser->io->open = open_handler; return READSTAT_OK; } readstat_error_t readstat_set_close_handler(readstat_parser_t *parser, readstat_close_handler close_handler) { parser->io->close = close_handler; return READSTAT_OK; } readstat_error_t readstat_set_seek_handler(readstat_parser_t *parser, readstat_seek_handler seek_handler) { parser->io->seek = seek_handler; return READSTAT_OK; } readstat_error_t readstat_set_read_handler(readstat_parser_t *parser, readstat_read_handler read_handler) { parser->io->read = read_handler; return READSTAT_OK; } readstat_error_t readstat_set_update_handler(readstat_parser_t *parser, readstat_update_handler update_handler) { parser->io->update = update_handler; return READSTAT_OK; } readstat_error_t readstat_set_io_ctx(readstat_parser_t *parser, void *io_ctx) { if (parser->io->io_ctx_needs_free) { free(parser->io->io_ctx); } parser->io->io_ctx = io_ctx; parser->io->io_ctx_needs_free = 0; return READSTAT_OK; } readstat_error_t 
readstat_set_file_character_encoding(readstat_parser_t *parser, const char *encoding) { parser->input_encoding = encoding; return READSTAT_OK; } readstat_error_t readstat_set_handler_character_encoding(readstat_parser_t *parser, const char *encoding) { parser->output_encoding = encoding; return READSTAT_OK; } readstat_error_t readstat_set_row_limit(readstat_parser_t *parser, long row_limit) { parser->row_limit = row_limit; return READSTAT_OK; } haven/src/readstat/stata/0000755000176200001440000000000013227731765015102 5ustar liggesusershaven/src/readstat/stata/readstat_dta_parse_timestamp.c0000644000176200001440000001752513227731765023174 0ustar liggesusers #line 1 "src/stata/readstat_dta_parse_timestamp.rl" #include #include "../readstat.h" #include "readstat_dta_parse_timestamp.h" #line 10 "src/stata/readstat_dta_parse_timestamp.c" static const char _dta_timestamp_parse_actions[] = { 0, 1, 0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9, 1, 10, 1, 11, 1, 12, 1, 13, 1, 14, 1, 15, 1, 16, 1, 17, 2, 1, 0 }; static const char _dta_timestamp_parse_key_offsets[] = { 0, 0, 3, 5, 8, 24, 28, 30, 31, 33, 36, 39, 42, 44, 46, 47, 49, 53, 54, 56, 58, 59, 63, 65, 66, 70, 71, 72, 74, 80, 81, 82, 84, 86, 87, 91, 93, 94, 96, 98, 99 }; static const char _dta_timestamp_parse_trans_keys[] = { 32, 48, 57, 48, 57, 32, 48, 57, 65, 68, 70, 74, 77, 78, 79, 83, 97, 100, 102, 106, 109, 110, 111, 115, 80, 85, 112, 117, 82, 114, 32, 48, 57, 32, 48, 57, 32, 48, 57, 58, 48, 57, 48, 57, 71, 103, 32, 69, 101, 67, 90, 99, 122, 32, 69, 101, 66, 98, 32, 65, 85, 97, 117, 78, 110, 32, 76, 78, 108, 110, 32, 32, 65, 97, 73, 82, 89, 105, 114, 121, 32, 32, 79, 111, 86, 118, 32, 67, 75, 99, 107, 84, 116, 32, 69, 101, 80, 112, 32, 48, 57, 0 }; static const char _dta_timestamp_parse_single_lengths[] = { 0, 1, 0, 1, 16, 4, 2, 1, 0, 1, 1, 1, 0, 2, 1, 2, 4, 1, 2, 2, 1, 4, 2, 1, 4, 1, 1, 2, 6, 1, 1, 2, 2, 1, 4, 2, 1, 2, 2, 1, 0 }; static const char _dta_timestamp_parse_range_lengths[] = { 0, 1, 1, 1, 0, 0, 0, 
0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 }; static const unsigned char _dta_timestamp_parse_index_offsets[] = { 0, 0, 3, 5, 8, 25, 30, 33, 35, 37, 40, 43, 46, 48, 51, 53, 56, 61, 63, 66, 69, 71, 76, 79, 81, 86, 88, 90, 93, 100, 102, 104, 107, 110, 112, 117, 120, 122, 125, 128, 130 }; static const char _dta_timestamp_parse_indicies[] = { 0, 2, 1, 2, 1, 3, 4, 1, 5, 6, 7, 8, 9, 10, 11, 12, 5, 6, 7, 8, 9, 10, 11, 12, 1, 13, 14, 13, 14, 1, 15, 15, 1, 16, 1, 17, 1, 18, 19, 1, 20, 21, 1, 23, 22, 1, 24, 1, 25, 25, 1, 26, 1, 27, 27, 1, 28, 28, 28, 28, 1, 29, 1, 30, 30, 1, 31, 31, 1, 32, 1, 33, 34, 33, 34, 1, 35, 35, 1, 36, 1, 37, 38, 37, 38, 1, 39, 1, 40, 1, 41, 41, 1, 42, 43, 42, 42, 43, 42, 1, 44, 1, 45, 1, 46, 46, 1, 47, 47, 1, 48, 1, 49, 49, 49, 49, 1, 50, 50, 1, 51, 1, 52, 52, 1, 53, 53, 1, 54, 1, 55, 1, 0 }; static const char _dta_timestamp_parse_trans_targs[] = { 2, 0, 3, 4, 3, 5, 15, 18, 21, 27, 31, 34, 37, 6, 13, 7, 8, 9, 10, 9, 10, 11, 11, 12, 40, 14, 8, 16, 17, 8, 19, 20, 8, 22, 24, 23, 8, 25, 26, 8, 8, 28, 29, 30, 8, 8, 32, 33, 8, 35, 36, 8, 38, 39, 8, 40 }; static const char _dta_timestamp_parse_trans_actions[] = { 0, 0, 35, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 35, 29, 1, 0, 35, 1, 31, 35, 0, 19, 0, 0, 27, 0, 0, 7, 0, 0, 0, 5, 0, 0, 17, 15, 0, 0, 0, 13, 9, 0, 0, 25, 0, 0, 23, 0, 0, 21, 1 }; static const char _dta_timestamp_parse_eof_actions[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 33 }; static const int dta_timestamp_parse_start = 1; static const int dta_timestamp_parse_en_main = 1; #line 9 "src/stata/readstat_dta_parse_timestamp.rl" readstat_error_t dta_parse_timestamp(const char *data, size_t len, struct tm *timestamp, readstat_error_handler error_handler, void *user_ctx) { readstat_error_t retval = READSTAT_OK; const char *p = data; const char *pe = p + len; const char *eof = pe; int cs; unsigned int temp_val 
= 0; #line 137 "src/stata/readstat_dta_parse_timestamp.c" { cs = dta_timestamp_parse_start; } #line 142 "src/stata/readstat_dta_parse_timestamp.c" { int _klen; unsigned int _trans; const char *_acts; unsigned int _nacts; const char *_keys; if ( p == pe ) goto _test_eof; if ( cs == 0 ) goto _out; _resume: _keys = _dta_timestamp_parse_trans_keys + _dta_timestamp_parse_key_offsets[cs]; _trans = _dta_timestamp_parse_index_offsets[cs]; _klen = _dta_timestamp_parse_single_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + _klen - 1; while (1) { if ( _upper < _lower ) break; _mid = _lower + ((_upper-_lower) >> 1); if ( (*p) < *_mid ) _upper = _mid - 1; else if ( (*p) > *_mid ) _lower = _mid + 1; else { _trans += (unsigned int)(_mid - _keys); goto _match; } } _keys += _klen; _trans += _klen; } _klen = _dta_timestamp_parse_range_lengths[cs]; if ( _klen > 0 ) { const char *_lower = _keys; const char *_mid; const char *_upper = _keys + (_klen<<1) - 2; while (1) { if ( _upper < _lower ) break; _mid = _lower + (((_upper-_lower) >> 1) & ~1); if ( (*p) < _mid[0] ) _upper = _mid - 2; else if ( (*p) > _mid[1] ) _lower = _mid + 2; else { _trans += (unsigned int)((_mid - _keys)>>1); goto _match; } } _trans += _klen; } _match: _trans = _dta_timestamp_parse_indicies[_trans]; cs = _dta_timestamp_parse_trans_targs[_trans]; if ( _dta_timestamp_parse_trans_actions[_trans] == 0 ) goto _again; _acts = _dta_timestamp_parse_actions + _dta_timestamp_parse_trans_actions[_trans]; _nacts = (unsigned int) *_acts++; while ( _nacts-- > 0 ) { switch ( *_acts++ ) { case 0: #line 20 "src/stata/readstat_dta_parse_timestamp.rl" { temp_val = 10 * temp_val + ((*p) - '0'); } break; case 1: #line 24 "src/stata/readstat_dta_parse_timestamp.rl" { temp_val = 0; } break; case 2: #line 26 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mday = temp_val; } break; case 3: #line 29 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 0; } 
break; case 4: #line 30 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 1; } break; case 5: #line 31 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 2; } break; case 6: #line 32 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 3; } break; case 7: #line 33 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 4; } break; case 8: #line 34 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 5; } break; case 9: #line 35 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 6; } break; case 10: #line 36 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 7; } break; case 11: #line 37 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 8; } break; case 12: #line 38 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 9; } break; case 13: #line 39 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 10; } break; case 14: #line 40 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_mon = 11; } break; case 15: #line 42 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_year = temp_val - 1900; } break; case 16: #line 44 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_hour = temp_val; } break; #line 286 "src/stata/readstat_dta_parse_timestamp.c" } } _again: if ( cs == 0 ) goto _out; if ( ++p != pe ) goto _resume; _test_eof: {} if ( p == eof ) { const char *__acts = _dta_timestamp_parse_actions + _dta_timestamp_parse_eof_actions[cs]; unsigned int __nacts = (unsigned int) *__acts++; while ( __nacts-- > 0 ) { switch ( *__acts++ ) { case 17: #line 46 "src/stata/readstat_dta_parse_timestamp.rl" { timestamp->tm_min = temp_val; } break; #line 306 "src/stata/readstat_dta_parse_timestamp.c" } } } _out: {} } #line 52 "src/stata/readstat_dta_parse_timestamp.rl" if (cs < 40|| p != pe) { char error_buf[1024]; if (error_handler) { snprintf(error_buf, sizeof(error_buf), "Invalid timestamp string (length=%d): %.*s", 
(int)len, (int)len, data); error_handler(error_buf, user_ctx); } retval = READSTAT_ERROR_BAD_TIMESTAMP; } (void)dta_timestamp_parse_en_main; return retval; } haven/src/readstat/stata/readstat_dta_write.c0000644000176200001440000013504313227731765021125 0ustar liggesusers #include #include #include #include #include #include #include #include "../readstat.h" #include "../readstat_bits.h" #include "../readstat_iconv.h" #include "../readstat_writer.h" #include "readstat_dta.h" #define DTA_DEFAULT_FORMAT_BYTE "8.0g" #define DTA_DEFAULT_FORMAT_INT16 "8.0g" #define DTA_DEFAULT_FORMAT_INT32 "12.0g" #define DTA_DEFAULT_FORMAT_FLOAT "9.0g" #define DTA_DEFAULT_FORMAT_DOUBLE "10.0g" #define DTA_DEFAULT_FILE_VERSION 118 #define DTA_OLD_MAX_WIDTH 128 #define DTA_111_MAX_WIDTH 244 #define DTA_117_MAX_WIDTH 2045 #define DTA_OLD_MAX_NAME_LEN 9 #define DTA_110_MAX_NAME_LEN 33 #define DTA_118_MAX_NAME_LEN 129 static readstat_error_t dta_113_write_missing_numeric(void *row, const readstat_variable_t *var); static readstat_error_t dta_write_tag(readstat_writer_t *writer, dta_ctx_t *ctx, const char *tag) { if (!ctx->file_is_xmlish) return READSTAT_OK; return readstat_write_string(writer, tag); } static readstat_error_t dta_write_chunk(readstat_writer_t *writer, dta_ctx_t *ctx, const char *start_tag, const void *bytes, size_t len, const char *end_tag) { readstat_error_t error = READSTAT_OK; if ((error = dta_write_tag(writer, ctx, start_tag)) != READSTAT_OK) goto cleanup; if ((error = readstat_write_bytes(writer, bytes, len)) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, end_tag)) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_emit_header_data_label(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; char *data_label = NULL; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; cleanup: if (data_label) free(data_label); return error; } static readstat_error_t 
dta_emit_header_time_stamp(readstat_writer_t *writer, dta_ctx_t *ctx) { if (!ctx->timestamp_len) return READSTAT_OK; readstat_error_t error = READSTAT_OK; time_t now = writer->timestamp; struct tm *time_s = localtime(&now); char *timestamp = calloc(1, ctx->timestamp_len); /* There are locale/portability issues with strftime so hack something up */ char months[][4] = { "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" }; uint8_t actual_timestamp_len = snprintf(timestamp, ctx->timestamp_len, "%02d %3s %04d %02d:%02d", time_s->tm_mday, months[time_s->tm_mon], time_s->tm_year + 1900, time_s->tm_hour, time_s->tm_min); if (actual_timestamp_len == 0) { error = READSTAT_ERROR_WRITE; goto cleanup; } if (ctx->file_is_xmlish) { if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; if ((error = readstat_write_bytes(writer, &actual_timestamp_len, sizeof(uint8_t))) != READSTAT_OK) goto cleanup; if ((error = readstat_write_bytes(writer, timestamp, actual_timestamp_len)) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; } else { error = readstat_write_bytes(writer, timestamp, ctx->timestamp_len); } cleanup: free(timestamp); return error; } static readstat_error_t dta_111_typecode_for_variable(readstat_variable_t *r_variable, uint16_t *out_typecode) { readstat_error_t retval = READSTAT_OK; size_t max_len = r_variable->storage_width; uint16_t typecode = 0; switch (r_variable->type) { case READSTAT_TYPE_INT8: typecode = DTA_111_TYPE_CODE_INT8; break; case READSTAT_TYPE_INT16: typecode = DTA_111_TYPE_CODE_INT16; break; case READSTAT_TYPE_INT32: typecode = DTA_111_TYPE_CODE_INT32; break; case READSTAT_TYPE_FLOAT: typecode = DTA_111_TYPE_CODE_FLOAT; break; case READSTAT_TYPE_DOUBLE: typecode = DTA_111_TYPE_CODE_DOUBLE; break; case READSTAT_TYPE_STRING: typecode = max_len; break; case READSTAT_TYPE_STRING_REF: retval = READSTAT_ERROR_STRING_REFS_NOT_SUPPORTED; break; } if 
(out_typecode && retval == READSTAT_OK) *out_typecode = typecode; return retval; } static readstat_error_t dta_117_typecode_for_variable(readstat_variable_t *r_variable, uint16_t *out_typecode) { readstat_error_t retval = READSTAT_OK; size_t max_len = r_variable->storage_width; uint16_t typecode = 0; switch (r_variable->type) { case READSTAT_TYPE_INT8: typecode = DTA_117_TYPE_CODE_INT8; break; case READSTAT_TYPE_INT16: typecode = DTA_117_TYPE_CODE_INT16; break; case READSTAT_TYPE_INT32: typecode = DTA_117_TYPE_CODE_INT32; break; case READSTAT_TYPE_FLOAT: typecode = DTA_117_TYPE_CODE_FLOAT; break; case READSTAT_TYPE_DOUBLE: typecode = DTA_117_TYPE_CODE_DOUBLE; break; case READSTAT_TYPE_STRING: typecode = max_len; break; case READSTAT_TYPE_STRING_REF: typecode = DTA_117_TYPE_CODE_STRL; break; } if (out_typecode) *out_typecode = typecode; return retval; } static readstat_error_t dta_old_typecode_for_variable(readstat_variable_t *r_variable, uint16_t *out_typecode) { readstat_error_t retval = READSTAT_OK; size_t max_len = r_variable->storage_width; uint16_t typecode = 0; switch (r_variable->type) { case READSTAT_TYPE_INT8: typecode = DTA_OLD_TYPE_CODE_INT8; break; case READSTAT_TYPE_INT16: typecode = DTA_OLD_TYPE_CODE_INT16; break; case READSTAT_TYPE_INT32: typecode = DTA_OLD_TYPE_CODE_INT32; break; case READSTAT_TYPE_FLOAT: typecode = DTA_OLD_TYPE_CODE_FLOAT; break; case READSTAT_TYPE_DOUBLE: typecode = DTA_OLD_TYPE_CODE_DOUBLE; break; case READSTAT_TYPE_STRING: typecode = max_len + 0x7F; break; case READSTAT_TYPE_STRING_REF: retval = READSTAT_ERROR_STRING_REFS_NOT_SUPPORTED; break; } if (out_typecode && retval == READSTAT_OK) *out_typecode = typecode; return retval; } static readstat_error_t dta_typecode_for_variable(readstat_variable_t *r_variable, int typlist_version, uint16_t *typecode) { if (typlist_version == 111) { return dta_111_typecode_for_variable(r_variable, typecode); } if (typlist_version == 117) { return dta_117_typecode_for_variable(r_variable, 
typecode); } return dta_old_typecode_for_variable(r_variable, typecode); } static readstat_error_t dta_emit_typlist(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; int i; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; for (i=0; i<ctx->nvar; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); uint16_t typecode = 0; error = dta_typecode_for_variable(r_variable, ctx->typlist_version, &typecode); if (error != READSTAT_OK) goto cleanup; ctx->typlist[i] = typecode; } for (i=0; i<ctx->nvar; i++) { if (ctx->typlist_entry_len == 1) { uint8_t byte = ctx->typlist[i]; error = readstat_write_bytes(writer, &byte, sizeof(uint8_t)); } else if (ctx->typlist_entry_len == 2) { uint16_t val = ctx->typlist[i]; error = readstat_write_bytes(writer, &val, sizeof(uint16_t)); } if (error != READSTAT_OK) goto cleanup; } if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_validate_name(const char *name, size_t max_len) { int j; for (j=0; name[j]; j++) { if (name[j] != '_' && !(name[j] >= 'a' && name[j] <= 'z') && !(name[j] >= 'A' && name[j] <= 'Z') && !(name[j] >= '0' && name[j] <= '9')) { return READSTAT_ERROR_NAME_CONTAINS_ILLEGAL_CHARACTER; } } char first_char = name[0]; if (first_char != '_' && !(first_char >= 'a' && first_char <= 'z') && !(first_char >= 'A' && first_char <= 'Z')) { return READSTAT_ERROR_NAME_BEGINS_WITH_ILLEGAL_CHARACTER; } if (strcmp(name, "_all") == 0 || strcmp(name, "_b") == 0 || strcmp(name, "byte") == 0 || strcmp(name, "_coef") == 0 || strcmp(name, "_cons") == 0 || strcmp(name, "double") == 0 || strcmp(name, "float") == 0 || strcmp(name, "if") == 0 || strcmp(name, "in") == 0 || strcmp(name, "int") == 0 || strcmp(name, "long") == 0 || strcmp(name, "_n") == 0 || strcmp(name, "_N") == 0 || strcmp(name, "_pi") == 0 || strcmp(name, "_pred") == 0 || strcmp(name, "_rc") == 0 || strcmp(name, "_skip") == 0 || strcmp(name, 
"strL") == 0 || strcmp(name, "using") == 0 || strcmp(name, "with") == 0) { return READSTAT_ERROR_NAME_IS_RESERVED_WORD; } int len; if (sscanf(name, "str%d", &len) == 1) return READSTAT_ERROR_NAME_IS_RESERVED_WORD; if (strlen(name) > max_len) return READSTAT_ERROR_NAME_IS_TOO_LONG; return READSTAT_OK; } static readstat_error_t dta_old_variable_ok(readstat_variable_t *variable) { return dta_validate_name(readstat_variable_get_name(variable), DTA_OLD_MAX_NAME_LEN); } static readstat_error_t dta_110_variable_ok(readstat_variable_t *variable) { return dta_validate_name(readstat_variable_get_name(variable), DTA_110_MAX_NAME_LEN); } static readstat_error_t dta_118_variable_ok(readstat_variable_t *variable) { return dta_validate_name(readstat_variable_get_name(variable), DTA_118_MAX_NAME_LEN); } static readstat_error_t dta_emit_varlist(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; int i; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; for (i=0; invar; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); strncpy(&ctx->varlist[ctx->variable_name_len*i], r_variable->name, ctx->variable_name_len); } if ((error = readstat_write_bytes(writer, ctx->varlist, ctx->varlist_len)) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_emit_srtlist(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; memset(ctx->srtlist, '\0', ctx->srtlist_len); if ((error = readstat_write_bytes(writer, ctx->srtlist, ctx->srtlist_len)) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_emit_fmtlist(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; int i; if ((error = 
dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; for (i=0; i<ctx->nvar; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); if (r_variable->format[0]) { strncpy(&ctx->fmtlist[ctx->fmtlist_entry_len*i], r_variable->format, ctx->fmtlist_entry_len); } else { char *format_spec = "9s"; if (r_variable->type == READSTAT_TYPE_INT8) { format_spec = DTA_DEFAULT_FORMAT_BYTE; } else if (r_variable->type == READSTAT_TYPE_INT16) { format_spec = DTA_DEFAULT_FORMAT_INT16; } else if (r_variable->type == READSTAT_TYPE_INT32) { format_spec = DTA_DEFAULT_FORMAT_INT32; } else if (r_variable->type == READSTAT_TYPE_FLOAT) { format_spec = DTA_DEFAULT_FORMAT_FLOAT; } else if (r_variable->type == READSTAT_TYPE_DOUBLE) { format_spec = DTA_DEFAULT_FORMAT_DOUBLE; } char format[64]; sprintf(format, "%%%s%s", r_variable->alignment == READSTAT_ALIGNMENT_LEFT ? "-" : "", format_spec); strncpy(&ctx->fmtlist[ctx->fmtlist_entry_len*i], format, ctx->fmtlist_entry_len); } } if ((error = readstat_write_bytes(writer, ctx->fmtlist, ctx->fmtlist_len)) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_emit_lbllist(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; int i; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; for (i=0; i<ctx->nvar; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); if (r_variable->label_set) { strncpy(&ctx->lbllist[ctx->lbllist_entry_len*i], r_variable->label_set->name, ctx->lbllist_entry_len); } else { memset(&ctx->lbllist[ctx->lbllist_entry_len*i], '\0', ctx->lbllist_entry_len); } } if ((error = readstat_write_bytes(writer, ctx->lbllist, ctx->lbllist_len)) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_emit_descriptors(readstat_writer_t *writer, dta_ctx_t 
*ctx) { readstat_error_t error = READSTAT_OK; error = dta_emit_typlist(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_varlist(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_srtlist(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_fmtlist(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_lbllist(writer, ctx); if (error != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_emit_variable_labels(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; int i; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; for (i=0; i<ctx->nvar; i++) { readstat_variable_t *r_variable = readstat_get_variable(writer, i); strncpy(&ctx->variable_labels[ctx->variable_labels_entry_len*i], r_variable->label, ctx->variable_labels_entry_len); } if ((error = readstat_write_bytes(writer, ctx->variable_labels, ctx->variable_labels_len)) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_emit_characteristics(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t error = READSTAT_OK; int i; char buffer[ctx->ch_metadata_len]; if (ctx->expansion_len_len == 0) return READSTAT_OK; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; for (i=0; i<writer->notes_count; i++) { if (ctx->file_is_xmlish) { error = dta_write_tag(writer, ctx, ""); } else { char data_type = 1; error = readstat_write_bytes(writer, &data_type, 1); } if (error != READSTAT_OK) goto cleanup; size_t len = strlen(writer->notes[i]); if (ctx->expansion_len_len == 2) { int16_t len16 = 2*ctx->ch_metadata_len + len + 1; error = readstat_write_bytes(writer, &len16, sizeof(len16)); } else if (ctx->expansion_len_len == 4) { int32_t len32 = 2*ctx->ch_metadata_len + len + 1; error = readstat_write_bytes(writer, &len32, sizeof(len32)); } if (error != 
READSTAT_OK) goto cleanup; strncpy(buffer, "_dta", ctx->ch_metadata_len); error = readstat_write_bytes(writer, buffer, ctx->ch_metadata_len); if (error != READSTAT_OK) goto cleanup; snprintf(buffer, ctx->ch_metadata_len, "note%d", i+1); error = readstat_write_bytes(writer, buffer, ctx->ch_metadata_len); if (error != READSTAT_OK) goto cleanup; error = readstat_write_bytes(writer, writer->notes[i], len + 1); if (error != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; } if (ctx->file_is_xmlish) { error = dta_write_tag(writer, ctx, ""); } else { error = readstat_write_zeros(writer, 1 + ctx->expansion_len_len); } if (error != READSTAT_OK) goto cleanup; cleanup: return error; } static readstat_error_t dta_117_emit_strl_header(readstat_writer_t *writer, readstat_string_ref_t *ref) { dta_117_strl_header_t header = { .v = ref->first_v, .o = ref->first_o, .type = DTA_GSO_TYPE_ASCII, .len = ref->len }; return readstat_write_bytes(writer, &header, sizeof(dta_117_strl_header_t)); } static readstat_error_t dta_118_emit_strl_header(readstat_writer_t *writer, readstat_string_ref_t *ref) { dta_118_strl_header_t header = { .v = ref->first_v, .o = ref->first_o, .type = DTA_GSO_TYPE_ASCII, .len = ref->len }; return readstat_write_bytes(writer, &header, sizeof(dta_118_strl_header_t)); } static readstat_error_t dta_emit_strls(readstat_writer_t *writer, dta_ctx_t *ctx) { if (!ctx->file_is_xmlish) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; retval = readstat_write_string(writer, ""); if (retval != READSTAT_OK) goto cleanup; int i; for (i=0; i<writer->string_refs_count; i++) { readstat_string_ref_t *ref = writer->string_refs[i]; retval = readstat_write_string(writer, "GSO"); if (retval != READSTAT_OK) goto cleanup; if (ctx->strl_o_len > 4) { retval = dta_118_emit_strl_header(writer, ref); } else { retval = dta_117_emit_strl_header(writer, ref); } if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, 
&ref->data[0], ref->len); if (retval != READSTAT_OK) goto cleanup; } retval = readstat_write_string(writer, ""); if (retval != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t dta_old_emit_value_labels(readstat_writer_t *writer, dta_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; int i, j; char labname[12+2]; char *label_buffer = NULL; for (i=0; i<writer->label_sets_count; i++) { readstat_label_set_t *r_label_set = readstat_get_label_set(writer, i); int32_t max_value = 0; for (j=0; j<r_label_set->value_labels_count; j++) { readstat_value_label_t *value_label = readstat_get_value_label(r_label_set, j); if (value_label->tag) { retval = READSTAT_ERROR_TAGGED_VALUES_NOT_SUPPORTED; goto cleanup; } if (value_label->int32_key < 0 || value_label->int32_key > 1024) { retval = READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; goto cleanup; } if (value_label->int32_key > max_value) { max_value = value_label->int32_key; } } int16_t table_len = 8*(max_value + 1); retval = readstat_write_bytes(writer, &table_len, sizeof(int16_t)); if (retval != READSTAT_OK) goto cleanup; memset(labname, 0, sizeof(labname)); strncpy(labname, r_label_set->name, ctx->value_label_table_labname_len); retval = readstat_write_bytes(writer, labname, ctx->value_label_table_labname_len + ctx->value_label_table_padding_len); if (retval != READSTAT_OK) goto cleanup; label_buffer = realloc(label_buffer, table_len); memset(label_buffer, 0, table_len); for (j=0; j<r_label_set->value_labels_count; j++) { readstat_value_label_t *value_label = readstat_get_value_label(r_label_set, j); size_t len = value_label->label_len; if (len > 8) len = 8; memcpy(&label_buffer[8*value_label->int32_key], value_label->label, len); } retval = readstat_write_bytes(writer, label_buffer, table_len); if (retval != READSTAT_OK) goto cleanup; } cleanup: if (label_buffer) free(label_buffer); return retval; } static int dta_compare_value_labels(const readstat_value_label_t *vl1, const readstat_value_label_t *vl2) { if (vl1->tag) { if (vl2->tag) 
{ return vl1->tag - vl2->tag; } return 1; } if (vl2->tag) { return -1; } return vl1->int32_key - vl2->int32_key; } static readstat_error_t dta_emit_value_labels(readstat_writer_t *writer, dta_ctx_t *ctx) { if (ctx->value_label_table_len_len == 2) return dta_old_emit_value_labels(writer, ctx); readstat_error_t retval = READSTAT_OK; int i, j; int32_t *off = NULL; int32_t *val = NULL; char *txt = NULL; char *labname = calloc(1, ctx->value_label_table_labname_len + ctx->value_label_table_padding_len); retval = dta_write_tag(writer, ctx, ""); if (retval != READSTAT_OK) goto cleanup; for (i=0; i<writer->label_sets_count; i++) { readstat_label_set_t *r_label_set = readstat_get_label_set(writer, i); int32_t n = r_label_set->value_labels_count; int32_t txtlen = 0; for (j=0; j<n; j++) { readstat_value_label_t *value_label = readstat_get_value_label(r_label_set, j); txtlen += value_label->label_len + 1; } retval = dta_write_tag(writer, ctx, ""); if (retval != READSTAT_OK) goto cleanup; int32_t table_len = 8 + 8*n + txtlen; retval = readstat_write_bytes(writer, &table_len, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; strncpy(labname, r_label_set->name, ctx->value_label_table_labname_len); retval = readstat_write_bytes(writer, labname, ctx->value_label_table_labname_len + ctx->value_label_table_padding_len); if (retval != READSTAT_OK) goto cleanup; if (txtlen == 0) { retval = readstat_write_bytes(writer, &txtlen, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &txtlen, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = dta_write_tag(writer, ctx, ""); if (retval != READSTAT_OK) goto cleanup; continue; } off = realloc(off, 4*n); val = realloc(val, 4*n); txt = realloc(txt, txtlen); readstat_off_t offset = 0; readstat_sort_label_set(r_label_set, &dta_compare_value_labels); for (j=0; j<n; j++) { readstat_value_label_t *value_label = readstat_get_value_label(r_label_set, j); const char *label = value_label->label; size_t label_data_len = value_label->label_len; off[j] = offset; if (value_label->tag) { if (writer->version < 113) { retval = READSTAT_ERROR_TAGGED_VALUES_NOT_SUPPORTED; goto cleanup; } val[j] = DTA_113_MISSING_INT32_A + 
(value_label->tag - 'a'); } else { val[j] = value_label->int32_key; } memcpy(txt + offset, label, label_data_len); offset += label_data_len; txt[offset++] = '\0'; } retval = readstat_write_bytes(writer, &n, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, &txtlen, sizeof(int32_t)); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, off, 4*n); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, val, 4*n); if (retval != READSTAT_OK) goto cleanup; retval = readstat_write_bytes(writer, txt, txtlen); if (retval != READSTAT_OK) goto cleanup; retval = dta_write_tag(writer, ctx, ""); if (retval != READSTAT_OK) goto cleanup; } retval = dta_write_tag(writer, ctx, ""); if (retval != READSTAT_OK) goto cleanup; cleanup: if (off) free(off); if (val) free(val); if (txt) free(txt); if (labname) free(labname); return retval; } static size_t dta_numeric_variable_width(readstat_type_t type, size_t user_width) { size_t len = 0; if (type == READSTAT_TYPE_DOUBLE) { len = 8; } else if (type == READSTAT_TYPE_FLOAT) { len = 4; } else if (type == READSTAT_TYPE_INT32) { len = 4; } else if (type == READSTAT_TYPE_INT16) { len = 2; } else if (type == READSTAT_TYPE_INT8) { len = 1; } return len; } static size_t dta_111_variable_width(readstat_type_t type, size_t user_width) { if (type == READSTAT_TYPE_STRING) { if (user_width > DTA_111_MAX_WIDTH || user_width == 0) user_width = DTA_111_MAX_WIDTH; return user_width; } return dta_numeric_variable_width(type, user_width); } static size_t dta_117_variable_width(readstat_type_t type, size_t user_width) { if (type == READSTAT_TYPE_STRING) { if (user_width > DTA_117_MAX_WIDTH || user_width == 0) user_width = DTA_117_MAX_WIDTH; return user_width; } if (type == READSTAT_TYPE_STRING_REF) return 8; return dta_numeric_variable_width(type, user_width); } static size_t dta_old_variable_width(readstat_type_t type, size_t user_width) { if (type == 
READSTAT_TYPE_STRING) { if (user_width > DTA_OLD_MAX_WIDTH || user_width == 0) user_width = DTA_OLD_MAX_WIDTH; return user_width; } return dta_numeric_variable_width(type, user_width); } static readstat_error_t dta_emit_header(readstat_writer_t *writer, dta_ctx_t *ctx, dta_header_t *header) { readstat_error_t error = READSTAT_OK; if (!ctx->file_is_xmlish) { error = readstat_write_bytes(writer, header, sizeof(dta_header_t)); if (error != READSTAT_OK) goto cleanup; error = dta_emit_header_data_label(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_header_time_stamp(writer, ctx); if (error != READSTAT_OK) goto cleanup; return READSTAT_OK; } if ((error = dta_write_tag(writer, ctx, "")) != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "
")) != READSTAT_OK) goto cleanup; char release[128]; snprintf(release, sizeof(release), "%d", header->ds_format); if ((error = readstat_write_string(writer, release)) != READSTAT_OK) goto cleanup; error = dta_write_chunk(writer, ctx, "", (header->byteorder == DTA_HILO) ? "MSF" : "LSF", sizeof("MSF")-1, ""); if (error != READSTAT_OK) goto cleanup; error = dta_write_chunk(writer, ctx, "", &header->nvar, sizeof(int16_t), ""); if (error != READSTAT_OK) goto cleanup; if (header->ds_format >= 118) { int64_t nobs = header->nobs; error = dta_write_chunk(writer, ctx, "", &nobs, sizeof(int64_t), ""); if (error != READSTAT_OK) goto cleanup; } else { error = dta_write_chunk(writer, ctx, "", &header->nobs, sizeof(int32_t), ""); if (error != READSTAT_OK) goto cleanup; } error = dta_emit_header_data_label(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_header_time_stamp(writer, ctx); if (error != READSTAT_OK) goto cleanup; if ((error = dta_write_tag(writer, ctx, "
")) != READSTAT_OK) goto cleanup; cleanup: return error; } static size_t dta_measure_tag(dta_ctx_t *ctx, const char *tag) { if (!ctx->file_is_xmlish) return 0; return strlen(tag); } static size_t dta_measure_map(dta_ctx_t *ctx) { return (dta_measure_tag(ctx, "") + 14 * sizeof(uint64_t) + dta_measure_tag(ctx, "")); } static size_t dta_measure_typlist(dta_ctx_t *ctx) { return (dta_measure_tag(ctx, "") + ctx->typlist_entry_len * ctx->nvar + dta_measure_tag(ctx, "")); } static size_t dta_measure_varlist(dta_ctx_t *ctx) { return (dta_measure_tag(ctx, "") + ctx->varlist_len + dta_measure_tag(ctx, "")); } static size_t dta_measure_srtlist(dta_ctx_t *ctx) { return (dta_measure_tag(ctx, "") + ctx->srtlist_len + dta_measure_tag(ctx, "")); } static size_t dta_measure_fmtlist(dta_ctx_t *ctx) { return (dta_measure_tag(ctx, "") + ctx->fmtlist_len + dta_measure_tag(ctx, "")); } static size_t dta_measure_lbllist(dta_ctx_t *ctx) { return (dta_measure_tag(ctx, "") + ctx->lbllist_len + dta_measure_tag(ctx, "")); } static size_t dta_measure_variable_labels(dta_ctx_t *ctx) { return (dta_measure_tag(ctx, "") + ctx->variable_labels_len + dta_measure_tag(ctx, "")); } static size_t dta_measure_characteristics(readstat_writer_t *writer, dta_ctx_t *ctx) { size_t characteristics_len = 0; int i; for (i=0; inotes_count; i++) { size_t ch_len = dta_measure_tag(ctx, "") + ctx->expansion_len_len + 2 * ctx->ch_metadata_len + strlen(writer->notes[i]) + 1 + dta_measure_tag(ctx, ""); characteristics_len += ch_len; } return (dta_measure_tag(ctx, "") + characteristics_len + dta_measure_tag(ctx, "")); } static size_t dta_measure_data(readstat_writer_t *writer, dta_ctx_t *ctx) { int i; for (i=0; invar; i++) { size_t max_len = 0; readstat_variable_t *r_variable = readstat_get_variable(writer, i); uint16_t typecode = 0; dta_typecode_for_variable(r_variable, ctx->typlist_version, &typecode); if (dta_type_info(typecode, ctx, &max_len, NULL) == READSTAT_OK) ctx->record_len += max_len; } return 
(dta_measure_tag(ctx, "") + ctx->record_len * ctx->nobs + dta_measure_tag(ctx, "")); } static size_t dta_measure_strls(readstat_writer_t *writer, dta_ctx_t *ctx) { int i; size_t strls_len = 0; for (i=0; i<writer->string_refs_count; i++) { readstat_string_ref_t *ref = writer->string_refs[i]; if (ctx->strl_o_len > 4) { strls_len += 20 + ref->len; } else { strls_len += 16 + ref->len; } } return (dta_measure_tag(ctx, "") + strls_len + dta_measure_tag(ctx, "")); } static size_t dta_measure_value_labels(readstat_writer_t *writer, dta_ctx_t *ctx) { size_t len = dta_measure_tag(ctx, ""); int i, j; for (i=0; i<writer->label_sets_count; i++) { readstat_label_set_t *r_label_set = readstat_get_label_set(writer, i); int32_t n = r_label_set->value_labels_count; int32_t txtlen = 0; for (j=0; j<n; j++) { readstat_value_label_t *value_label = readstat_get_value_label(r_label_set, j); txtlen += value_label->label_len + 1; } len += dta_measure_tag(ctx, ""); len += sizeof(int32_t); len += ctx->value_label_table_labname_len; len += ctx->value_label_table_padding_len; len += 8 + 8*n + txtlen; len += dta_measure_tag(ctx, ""); } len += dta_measure_tag(ctx, ""); return len; } static readstat_error_t dta_emit_map(readstat_writer_t *writer, dta_ctx_t *ctx) { if (!ctx->file_is_xmlish) return READSTAT_OK; uint64_t map[14]; map[0] = 0; /* */ map[1] = writer->bytes_written; /* */ map[2] = map[1] + dta_measure_map(ctx); /* */ map[3] = map[2] + dta_measure_typlist(ctx); /* */ map[4] = map[3] + dta_measure_varlist(ctx); /* */ map[5] = map[4] + dta_measure_srtlist(ctx); /* */ map[6] = map[5] + dta_measure_fmtlist(ctx); /* */ map[7] = map[6] + dta_measure_lbllist(ctx); /* */ map[8] = map[7] + dta_measure_variable_labels(ctx); /* */ map[9] = map[8] + dta_measure_characteristics(writer, ctx); /* */ map[10]= map[9] + dta_measure_data(writer, ctx); /* */ map[11]= map[10]+ dta_measure_strls(writer, ctx); /* */ map[12]= map[11]+ dta_measure_value_labels(writer, ctx); /* */ map[13]= map[12]+ dta_measure_tag(ctx, "
"); return dta_write_chunk(writer, ctx, "", map, sizeof(map), ""); } static readstat_error_t dta_begin_data(void *writer_ctx) { readstat_writer_t *writer = (readstat_writer_t *)writer_ctx; readstat_error_t error = READSTAT_OK; if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; dta_ctx_t *ctx = dta_ctx_alloc(NULL); dta_header_t header = {0}; header.ds_format = writer->version; header.byteorder = machine_is_little_endian() ? DTA_LOHI : DTA_HILO; header.filetype = 0x01; header.unused = 0x00; header.nvar = writer->variables_count; header.nobs = writer->row_count; error = dta_ctx_init(ctx, header.nvar, header.nobs, header.byteorder, header.ds_format, NULL, NULL); if (error != READSTAT_OK) goto cleanup; error = dta_emit_header(writer, ctx, &header); if (error != READSTAT_OK) goto cleanup; error = dta_emit_map(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_descriptors(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_variable_labels(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_characteristics(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_write_tag(writer, ctx, ""); if (error != READSTAT_OK) goto cleanup; cleanup: if (error != READSTAT_OK) { dta_ctx_free(ctx); } else { writer->module_ctx = ctx; } return error; } static readstat_error_t dta_write_raw_int8(void *row, int8_t value) { memcpy(row, &value, sizeof(char)); return READSTAT_OK; } static readstat_error_t dta_write_raw_int16(void *row, int16_t value) { memcpy(row, &value, sizeof(int16_t)); return READSTAT_OK; } static readstat_error_t dta_write_raw_int32(void *row, int32_t value) { memcpy(row, &value, sizeof(int32_t)); return READSTAT_OK; } static readstat_error_t dta_write_raw_int64(void *row, int64_t value) { memcpy(row, &value, sizeof(int64_t)); return READSTAT_OK; } static readstat_error_t dta_write_raw_float(void *row, float value) { memcpy(row, &value, sizeof(float)); return READSTAT_OK; } static 
readstat_error_t dta_write_raw_double(void *row, double value) { memcpy(row, &value, sizeof(double)); return READSTAT_OK; } static readstat_error_t dta_113_write_int8(void *row, const readstat_variable_t *var, int8_t value) { if (value > DTA_113_MAX_INT8) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } return dta_write_raw_int8(row, value); } static readstat_error_t dta_old_write_int8(void *row, const readstat_variable_t *var, int8_t value) { if (value > DTA_OLD_MAX_INT8) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } return dta_write_raw_int8(row, value); } static readstat_error_t dta_113_write_int16(void *row, const readstat_variable_t *var, int16_t value) { if (value > DTA_113_MAX_INT16) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } return dta_write_raw_int16(row, value); } static readstat_error_t dta_old_write_int16(void *row, const readstat_variable_t *var, int16_t value) { if (value > DTA_OLD_MAX_INT16) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } return dta_write_raw_int16(row, value); } static readstat_error_t dta_113_write_int32(void *row, const readstat_variable_t *var, int32_t value) { if (value > DTA_113_MAX_INT32) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } return dta_write_raw_int32(row, value); } static readstat_error_t dta_old_write_int32(void *row, const readstat_variable_t *var, int32_t value) { if (value > DTA_OLD_MAX_INT32) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } return dta_write_raw_int32(row, value); } static readstat_error_t dta_write_float(void *row, const readstat_variable_t *var, float value) { int32_t max_flt_i32 = DTA_113_MAX_FLOAT; float max_flt; memcpy(&max_flt, &max_flt_i32, sizeof(float)); if (value > max_flt) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } else if (isnan(value)) { return dta_113_write_missing_numeric(row, var); } return dta_write_raw_float(row, value); } static readstat_error_t dta_write_double(void *row, const readstat_variable_t *var, 
double value) { int64_t max_dbl_i64 = DTA_113_MAX_DOUBLE; double max_dbl; memcpy(&max_dbl, &max_dbl_i64, sizeof(double)); if (value > max_dbl) { return READSTAT_ERROR_NUMERIC_VALUE_IS_OUT_OF_RANGE; } else if (isnan(value)) { return dta_113_write_missing_numeric(row, var); } return dta_write_raw_double(row, value); } static readstat_error_t dta_write_string(void *row, const readstat_variable_t *var, const char *value) { size_t max_len = var->storage_width; if (value == NULL || value[0] == '\0') { memset(row, '\0', max_len); } else { size_t value_len = strlen(value); if (value_len > max_len) return READSTAT_ERROR_STRING_VALUE_IS_TOO_LONG; strncpy((char *)row, value, max_len); } return READSTAT_OK; } static readstat_error_t dta_118_write_string_ref(void *row, const readstat_variable_t *var, readstat_string_ref_t *ref) { if (ref == NULL) return READSTAT_ERROR_STRING_REF_IS_REQUIRED; int16_t v = ref->first_v; int64_t o = ref->first_o; char *row_bytes = (char *)row; memcpy(&row_bytes[0], &v, sizeof(int16_t)); if (!machine_is_little_endian()) { o <<= 16; } memcpy(&row_bytes[2], &o, 6); return READSTAT_OK; } static readstat_error_t dta_117_write_string_ref(void *row, const readstat_variable_t *var, readstat_string_ref_t *ref) { if (ref == NULL) return READSTAT_ERROR_STRING_REF_IS_REQUIRED; int32_t v = ref->first_v; int32_t o = ref->first_o; char *row_bytes = (char *)row; memcpy(&row_bytes[0], &v, sizeof(int32_t)); memcpy(&row_bytes[4], &o, sizeof(int32_t)); return READSTAT_OK; } static readstat_error_t dta_113_write_missing_numeric(void *row, const readstat_variable_t *var) { readstat_error_t retval = READSTAT_OK; if (var->type == READSTAT_TYPE_INT8) { retval = dta_write_raw_int8(row, DTA_113_MISSING_INT8); } else if (var->type == READSTAT_TYPE_INT16) { retval = dta_write_raw_int16(row, DTA_113_MISSING_INT16); } else if (var->type == READSTAT_TYPE_INT32) { retval = dta_write_raw_int32(row, DTA_113_MISSING_INT32); } else if (var->type == READSTAT_TYPE_FLOAT) { retval = 
dta_write_raw_int32(row, DTA_113_MISSING_FLOAT); } else if (var->type == READSTAT_TYPE_DOUBLE) { retval = dta_write_raw_int64(row, DTA_113_MISSING_DOUBLE); } return retval; } static readstat_error_t dta_old_write_missing_numeric(void *row, const readstat_variable_t *var) { readstat_error_t retval = READSTAT_OK; if (var->type == READSTAT_TYPE_INT8) { retval = dta_write_raw_int8(row, DTA_OLD_MISSING_INT8); } else if (var->type == READSTAT_TYPE_INT16) { retval = dta_write_raw_int16(row, DTA_OLD_MISSING_INT16); } else if (var->type == READSTAT_TYPE_INT32) { retval = dta_write_raw_int32(row, DTA_OLD_MISSING_INT32); } else if (var->type == READSTAT_TYPE_FLOAT) { retval = dta_write_raw_int32(row, DTA_OLD_MISSING_FLOAT); } else if (var->type == READSTAT_TYPE_DOUBLE) { retval = dta_write_raw_int64(row, DTA_OLD_MISSING_DOUBLE); } return retval; } static readstat_error_t dta_write_missing_string(void *row, const readstat_variable_t *var) { return dta_write_string(row, var, NULL); } static readstat_error_t dta_113_write_missing_tagged(void *row, const readstat_variable_t *var, char tag) { readstat_error_t retval = READSTAT_OK; if (tag < 'a' || tag > 'z') return READSTAT_ERROR_TAGGED_VALUE_IS_OUT_OF_RANGE; if (var->type == READSTAT_TYPE_INT8) { retval = dta_write_raw_int8(row, DTA_113_MISSING_INT8_A + (tag - 'a')); } else if (var->type == READSTAT_TYPE_INT16) { retval = dta_write_raw_int16(row, DTA_113_MISSING_INT16_A + (tag - 'a')); } else if (var->type == READSTAT_TYPE_INT32) { retval = dta_write_raw_int32(row, DTA_113_MISSING_INT32_A + (tag - 'a')); } else if (var->type == READSTAT_TYPE_FLOAT) { retval = dta_write_raw_int32(row, DTA_113_MISSING_FLOAT_A + ((tag - 'a') << 11)); } else if (var->type == READSTAT_TYPE_DOUBLE) { retval = dta_write_raw_int64(row, DTA_113_MISSING_DOUBLE_A + ((int64_t)(tag - 'a') << 40)); } else { retval = READSTAT_ERROR_TAGGED_VALUES_NOT_SUPPORTED; } return retval; } static readstat_error_t dta_end_data(void *writer_ctx) { readstat_writer_t *writer 
= (readstat_writer_t *)writer_ctx; dta_ctx_t *ctx = writer->module_ctx; readstat_error_t error = READSTAT_OK; if (!writer->initialized) return READSTAT_ERROR_WRITER_NOT_INITIALIZED; error = dta_write_tag(writer, ctx, ""); if (error != READSTAT_OK) goto cleanup; error = dta_emit_strls(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_emit_value_labels(writer, ctx); if (error != READSTAT_OK) goto cleanup; error = dta_write_tag(writer, ctx, ""); if (error != READSTAT_OK) goto cleanup; cleanup: return error; } static void dta_module_ctx_free(void *module_ctx) { dta_ctx_free(module_ctx); } readstat_error_t readstat_begin_writing_dta(readstat_writer_t *writer, void *user_ctx, long row_count) { if (writer->compression != READSTAT_COMPRESS_NONE) return READSTAT_ERROR_UNSUPPORTED_COMPRESSION; if (writer->version == 0) writer->version = DTA_DEFAULT_FILE_VERSION; if (writer->version >= 119 || writer->version < 104) { return READSTAT_ERROR_UNSUPPORTED_FILE_FORMAT_VERSION; } if (writer->version >= 117) { writer->callbacks.variable_width = &dta_117_variable_width; } else if (writer->version >= 111) { writer->callbacks.variable_width = &dta_111_variable_width; } else { writer->callbacks.variable_width = &dta_old_variable_width; } if (writer->version >= 118) { writer->callbacks.variable_ok = &dta_118_variable_ok; } else if (writer->version >= 110) { writer->callbacks.variable_ok = &dta_110_variable_ok; } else { writer->callbacks.variable_ok = &dta_old_variable_ok; } if (writer->version == 118) { writer->callbacks.write_string_ref = &dta_118_write_string_ref; } else if (writer->version == 117) { writer->callbacks.write_string_ref = &dta_117_write_string_ref; } if (writer->version >= 113) { writer->callbacks.write_int8 = &dta_113_write_int8; writer->callbacks.write_int16 = &dta_113_write_int16; writer->callbacks.write_int32 = &dta_113_write_int32; writer->callbacks.write_missing_number = &dta_113_write_missing_numeric; writer->callbacks.write_missing_tagged = 
&dta_113_write_missing_tagged; } else { writer->callbacks.write_int8 = &dta_old_write_int8; writer->callbacks.write_int16 = &dta_old_write_int16; writer->callbacks.write_int32 = &dta_old_write_int32; writer->callbacks.write_missing_number = &dta_old_write_missing_numeric; } writer->callbacks.write_float = &dta_write_float; writer->callbacks.write_double = &dta_write_double; writer->callbacks.write_string = &dta_write_string; writer->callbacks.write_missing_string = &dta_write_missing_string; writer->callbacks.begin_data = &dta_begin_data; writer->callbacks.end_data = &dta_end_data; writer->callbacks.module_ctx_free = &dta_module_ctx_free; return readstat_begin_writing_file(writer, user_ctx, row_count); } haven/src/readstat/stata/readstat_dta_parse_timestamp.h0000644000176200001440000000023113227731765023163 0ustar liggesusers readstat_error_t dta_parse_timestamp(const char *data, size_t len, struct tm *timestamp, readstat_error_handler error_handler, void *user_ctx); haven/src/readstat/stata/readstat_dta.c0000644000176200001440000002300513227731765017705 0ustar liggesusers#include #include #include #include #include #include "../readstat.h" #include "../readstat_iconv.h" #include "../readstat_malloc.h" #include "../readstat_bits.h" #include "readstat_dta.h" #define DTA_MIN_VERSION 104 #define DTA_MAX_VERSION 118 dta_ctx_t *dta_ctx_alloc(readstat_io_t *io) { dta_ctx_t *ctx = calloc(1, sizeof(dta_ctx_t)); if (ctx == NULL) { return NULL; } ctx->io = io; ctx->initialized = 0; return ctx; } readstat_error_t dta_ctx_init(dta_ctx_t *ctx, int16_t nvar, int32_t nobs, unsigned char byteorder, unsigned char ds_format, const char *input_encoding, const char *output_encoding) { readstat_error_t retval = READSTAT_OK; int machine_byteorder = DTA_HILO; if (ds_format < DTA_MIN_VERSION || ds_format > DTA_MAX_VERSION) return READSTAT_ERROR_UNSUPPORTED_FILE_FORMAT_VERSION; if (machine_is_little_endian()) { machine_byteorder = DTA_LOHI; } ctx->bswap = (byteorder != machine_byteorder); 
ctx->nvar = ctx->bswap ? byteswap2(nvar) : nvar; ctx->nobs = ctx->bswap ? byteswap4(nobs) : nobs; if (ctx->nvar) { if ((ctx->variables = readstat_calloc(ctx->nvar, sizeof(readstat_variable_t *))) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } } ctx->machine_is_twos_complement = READSTAT_MACHINE_IS_TWOS_COMPLEMENT; if (ds_format < 105) { ctx->fmtlist_entry_len = 7; } else if (ds_format < 114) { ctx->fmtlist_entry_len = 12; } else if (ds_format < 118) { ctx->fmtlist_entry_len = 49; } else { ctx->fmtlist_entry_len = 57; } if (ds_format >= 117) { ctx->typlist_version = 117; } else if (ds_format >= 111) { ctx->typlist_version = 111; } else { ctx->typlist_version = 0; } if (ds_format >= 118) { ctx->data_label_len_len = 2; ctx->strl_v_len = 2; ctx->strl_o_len = 6; } else if (ds_format >= 117) { ctx->data_label_len_len = 1; ctx->strl_v_len = 4; ctx->strl_o_len = 4; } if (ds_format < 105) { ctx->expansion_len_len = 0; } else if (ds_format < 110) { ctx->expansion_len_len = 2; } else { ctx->expansion_len_len = 4; } if (ds_format < 110) { ctx->lbllist_entry_len = 9; ctx->variable_name_len = 9; ctx->ch_metadata_len = 9; } else if (ds_format < 118) { ctx->lbllist_entry_len = 33; ctx->variable_name_len = 33; ctx->ch_metadata_len = 33; } else { ctx->lbllist_entry_len = 129; ctx->variable_name_len = 129; ctx->ch_metadata_len = 129; } if (ds_format < 108) { ctx->variable_labels_entry_len = 32; ctx->data_label_len = 32; } else if (ds_format < 118) { ctx->variable_labels_entry_len = 81; ctx->data_label_len = 81; } else { ctx->variable_labels_entry_len = 321; ctx->data_label_len = 321; } if (ds_format < 105) { ctx->timestamp_len = 0; ctx->value_label_table_len_len = 2; ctx->value_label_table_labname_len = 12; ctx->value_label_table_padding_len = 2; } else { ctx->timestamp_len = 18; ctx->value_label_table_len_len = 4; if (ds_format < 118) { ctx->value_label_table_labname_len = 33; } else { ctx->value_label_table_labname_len = 129; } ctx->value_label_table_padding_len = 3; } 
if (ds_format < 117) { ctx->typlist_entry_len = 1; ctx->file_is_xmlish = 0; } else { ctx->typlist_entry_len = 2; ctx->file_is_xmlish = 1; } if (ds_format < 113) { ctx->max_int8 = DTA_OLD_MAX_INT8; ctx->max_int16 = DTA_OLD_MAX_INT16; ctx->max_int32 = DTA_OLD_MAX_INT32; ctx->max_float = DTA_OLD_MAX_FLOAT; ctx->max_double = DTA_OLD_MAX_DOUBLE; } else { ctx->max_int8 = DTA_113_MAX_INT8; ctx->max_int16 = DTA_113_MAX_INT16; ctx->max_int32 = DTA_113_MAX_INT32; ctx->max_float = DTA_113_MAX_FLOAT; ctx->max_double = DTA_113_MAX_DOUBLE; ctx->supports_tagged_missing = 1; } if (output_encoding) { if (input_encoding) { ctx->converter = iconv_open(output_encoding, input_encoding); } else if (ds_format < 118) { ctx->converter = iconv_open(output_encoding, "WINDOWS-1252"); } else if (strcmp(output_encoding, "UTF-8") != 0) { ctx->converter = iconv_open(output_encoding, "UTF-8"); } if (ctx->converter == (iconv_t)-1) { ctx->converter = NULL; retval = READSTAT_ERROR_UNSUPPORTED_CHARSET; goto cleanup; } } ctx->srtlist_len = (ctx->nvar + 1) * sizeof(int16_t); if ((ctx->srtlist = readstat_malloc(ctx->srtlist_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (ctx->nvar > 0) { ctx->typlist_len = ctx->nvar * sizeof(uint16_t); ctx->varlist_len = ctx->variable_name_len * ctx->nvar * sizeof(char); ctx->fmtlist_len = ctx->fmtlist_entry_len * ctx->nvar * sizeof(char); ctx->lbllist_len = ctx->lbllist_entry_len * ctx->nvar * sizeof(char); ctx->variable_labels_len = ctx->variable_labels_entry_len * ctx->nvar * sizeof(char); if ((ctx->typlist = readstat_malloc(ctx->typlist_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if ((ctx->varlist = readstat_malloc(ctx->varlist_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if ((ctx->fmtlist = readstat_malloc(ctx->fmtlist_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if ((ctx->lbllist = readstat_malloc(ctx->lbllist_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if 
((ctx->variable_labels = readstat_malloc(ctx->variable_labels_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } } ctx->initialized = 1; cleanup: return retval; } void dta_ctx_free(dta_ctx_t *ctx) { if (ctx->typlist) free(ctx->typlist); if (ctx->varlist) free(ctx->varlist); if (ctx->srtlist) free(ctx->srtlist); if (ctx->fmtlist) free(ctx->fmtlist); if (ctx->lbllist) free(ctx->lbllist); if (ctx->variable_labels) free(ctx->variable_labels); if (ctx->converter) iconv_close(ctx->converter); if (ctx->data_label) free(ctx->data_label); if (ctx->variables) { int i; for (i=0; i<ctx->nvar; i++) { if (ctx->variables[i]) free(ctx->variables[i]); } free(ctx->variables); } if (ctx->strls) { int i; for (i=0; i<ctx->strls_count; i++) { free(ctx->strls[i]); } free(ctx->strls); } free(ctx); } readstat_error_t dta_type_info(uint16_t typecode, dta_ctx_t *ctx, size_t *max_len, readstat_type_t *out_type) { readstat_error_t retval = READSTAT_OK; size_t len = 0; readstat_type_t type = READSTAT_TYPE_STRING; if (ctx->typlist_version == 111) { switch (typecode) { case DTA_111_TYPE_CODE_INT8: len = 1; type = READSTAT_TYPE_INT8; break; case DTA_111_TYPE_CODE_INT16: len = 2; type = READSTAT_TYPE_INT16; break; case DTA_111_TYPE_CODE_INT32: len = 4; type = READSTAT_TYPE_INT32; break; case DTA_111_TYPE_CODE_FLOAT: len = 4; type = READSTAT_TYPE_FLOAT; break; case DTA_111_TYPE_CODE_DOUBLE: len = 8; type = READSTAT_TYPE_DOUBLE; break; default: len = typecode; type = READSTAT_TYPE_STRING; break; } } else if (ctx->typlist_version == 117) { switch (typecode) { case DTA_117_TYPE_CODE_INT8: len = 1; type = READSTAT_TYPE_INT8; break; case DTA_117_TYPE_CODE_INT16: len = 2; type = READSTAT_TYPE_INT16; break; case DTA_117_TYPE_CODE_INT32: len = 4; type = READSTAT_TYPE_INT32; break; case DTA_117_TYPE_CODE_FLOAT: len = 4; type = READSTAT_TYPE_FLOAT; break; case DTA_117_TYPE_CODE_DOUBLE: len = 8; type = READSTAT_TYPE_DOUBLE; break; case DTA_117_TYPE_CODE_STRL: len = 8; type = READSTAT_TYPE_STRING_REF; break; 
default: len = typecode; type = READSTAT_TYPE_STRING; break; } } else if (typecode < 0x7F) { switch (typecode) { case DTA_OLD_TYPE_CODE_INT8: len = 1; type = READSTAT_TYPE_INT8; break; case DTA_OLD_TYPE_CODE_INT16: len = 2; type = READSTAT_TYPE_INT16; break; case DTA_OLD_TYPE_CODE_INT32: len = 4; type = READSTAT_TYPE_INT32; break; case DTA_OLD_TYPE_CODE_FLOAT: len = 4; type = READSTAT_TYPE_FLOAT; break; case DTA_OLD_TYPE_CODE_DOUBLE: len = 8; type = READSTAT_TYPE_DOUBLE; break; default: retval = READSTAT_ERROR_PARSE; break; } } else { len = typecode - 0x7F; type = READSTAT_TYPE_STRING; } if (max_len) *max_len = len; if (out_type) *out_type = type; return retval; } haven/src/readstat/stata/readstat_dta_read.c0000644000176200001440000011634113227731765020706 0ustar liggesusers #include #include #include #include #include #include #include #include "../readstat.h" #include "../readstat_bits.h" #include "../readstat_iconv.h" #include "../readstat_convert.h" #include "../readstat_malloc.h" #include "readstat_dta.h" #include "readstat_dta_parse_timestamp.h" #define MAX_VALUE_LABEL_LEN 32000 static readstat_error_t dta_update_progress(dta_ctx_t *ctx); static readstat_error_t dta_read_descriptors(dta_ctx_t *ctx); static readstat_error_t dta_read_tag(dta_ctx_t *ctx, const char *tag); static readstat_error_t dta_read_expansion_fields(dta_ctx_t *ctx); static readstat_error_t dta_update_progress(dta_ctx_t *ctx) { double progress = 0.0; if (ctx->row_limit > 0) progress = 1.0 * ctx->current_row / ctx->row_limit; if (ctx->progress_handler && ctx->progress_handler(progress, ctx->user_ctx) != READSTAT_HANDLER_OK) return READSTAT_ERROR_USER_ABORT; return READSTAT_OK; } static readstat_variable_t *dta_init_variable(dta_ctx_t *ctx, int i, int index_after_skipping, readstat_type_t type, size_t max_len) { readstat_variable_t *variable = calloc(1, sizeof(readstat_variable_t)); variable->type = type; variable->index = i; variable->index_after_skipping = index_after_skipping; 
variable->storage_width = max_len; readstat_convert(variable->name, sizeof(variable->name), &ctx->varlist[ctx->variable_name_len*i], ctx->variable_name_len, ctx->converter); if (ctx->variable_labels[ctx->variable_labels_entry_len*i]) { readstat_convert(variable->label, sizeof(variable->label), &ctx->variable_labels[ctx->variable_labels_entry_len*i], ctx->variable_labels_entry_len, ctx->converter); } if (ctx->fmtlist[ctx->fmtlist_entry_len*i]) { readstat_convert(variable->format, sizeof(variable->format), &ctx->fmtlist[ctx->fmtlist_entry_len*i], ctx->fmtlist_entry_len, ctx->converter); if (variable->format[0] == '%') { if (variable->format[1] == '-') { variable->alignment = READSTAT_ALIGNMENT_LEFT; } else if (variable->format[1] == '~') { variable->alignment = READSTAT_ALIGNMENT_CENTER; } else { variable->alignment = READSTAT_ALIGNMENT_RIGHT; } } int display_width; if (sscanf(variable->format, "%%%ds", &display_width) == 1 || sscanf(variable->format, "%%-%ds", &display_width) == 1) { variable->display_width = display_width; } } return variable; } static readstat_error_t dta_read_chunk( dta_ctx_t *ctx, const char *start_tag, void *dst, size_t dst_len, const char *end_tag) { char *dst_buffer = (char *)dst; readstat_io_t *io = ctx->io; readstat_error_t retval = READSTAT_OK; if ((retval = dta_read_tag(ctx, start_tag)) != READSTAT_OK) goto cleanup; if (io->read(dst_buffer, dst_len, io->io_ctx) != dst_len) { retval = READSTAT_ERROR_READ; goto cleanup; } if ((retval = dta_read_tag(ctx, end_tag)) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t dta_read_map(dta_ctx_t *ctx) { if (!ctx->file_is_xmlish) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; uint64_t map_buffer[14]; if ((retval = dta_read_chunk(ctx, "", map_buffer, sizeof(map_buffer), "")) != READSTAT_OK) { goto cleanup; } ctx->data_offset = ctx->bswap ? byteswap8(map_buffer[9]) : map_buffer[9]; ctx->strls_offset = ctx->bswap ? 
byteswap8(map_buffer[10]) : map_buffer[10]; ctx->value_labels_offset = ctx->bswap ? byteswap8(map_buffer[11]) : map_buffer[11]; cleanup: return retval; } static readstat_error_t dta_read_descriptors(dta_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; size_t buffer_len = ctx->nvar * ctx->typlist_entry_len; unsigned char *buffer = NULL; int i; if (ctx->nvar && (buffer = readstat_malloc(buffer_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if ((retval = dta_read_chunk(ctx, "", buffer, buffer_len, "")) != READSTAT_OK) goto cleanup; if (ctx->typlist_entry_len == 1) { for (i=0; i<ctx->nvar; i++) { ctx->typlist[i] = buffer[i]; } } else if (ctx->typlist_entry_len == 2) { memcpy(ctx->typlist, buffer, buffer_len); if (ctx->bswap) { for (i=0; i<ctx->nvar; i++) { ctx->typlist[i] = byteswap2(ctx->typlist[i]); } } } if ((retval = dta_read_chunk(ctx, "", ctx->varlist, ctx->varlist_len, "")) != READSTAT_OK) goto cleanup; if ((retval = dta_read_chunk(ctx, "", ctx->srtlist, ctx->srtlist_len, "")) != READSTAT_OK) goto cleanup; if ((retval = dta_read_chunk(ctx, "", ctx->fmtlist, ctx->fmtlist_len, "")) != READSTAT_OK) goto cleanup; if ((retval = dta_read_chunk(ctx, "", ctx->lbllist, ctx->lbllist_len, "")) != READSTAT_OK) goto cleanup; if ((retval = dta_read_chunk(ctx, "", ctx->variable_labels, ctx->variable_labels_len, "")) != READSTAT_OK) goto cleanup; cleanup: if (buffer) free(buffer); return retval; } static readstat_error_t dta_read_expansion_fields(dta_ctx_t *ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; char *buffer = NULL; if (ctx->expansion_len_len == 0) return READSTAT_OK; if (ctx->file_is_xmlish && !ctx->note_handler) { if (io->seek(ctx->data_offset, READSTAT_SEEK_SET, io->io_ctx) == -1) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "Failed to seek to data section (offset=%" PRId64 ")", ctx->data_offset); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } return READSTAT_ERROR_SEEK; } return READSTAT_OK; 
} retval = dta_read_tag(ctx, ""); if (retval != READSTAT_OK) goto cleanup; while (1) { size_t len; char data_type; if (ctx->file_is_xmlish) { char start[4]; if (io->read(start, sizeof(start), io->io_ctx) != sizeof(start)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (memcmp(start, ""); if (retval != READSTAT_OK) goto cleanup; break; } else if (memcmp(start, "", sizeof(start)) != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } data_type = 1; } else { if (io->read(&data_type, 1, io->io_ctx) != 1) { retval = READSTAT_ERROR_READ; goto cleanup; } } if (ctx->expansion_len_len == 2) { uint16_t len16; if (io->read(&len16, sizeof(uint16_t), io->io_ctx) != sizeof(uint16_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } len = ctx->bswap ? byteswap2(len16) : len16; } else { uint32_t len32; if (io->read(&len32, sizeof(uint32_t), io->io_ctx) != sizeof(uint32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } len = ctx->bswap ? byteswap4(len32) : len32; } if (data_type == 0 && len == 0) break; if (data_type != 1 || len > (1<<20)) { retval = READSTAT_ERROR_NOTE_IS_TOO_LONG; goto cleanup; } if (ctx->note_handler && len >= 2 * ctx->ch_metadata_len) { if ((buffer = readstat_realloc(buffer, len + 1)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } buffer[len] = '\0'; if (io->read(buffer, len, io->io_ctx) != len) { retval = READSTAT_ERROR_READ; goto cleanup; } int index = 0; if (strncmp(&buffer[0], "_dta", 4) == 0 && sscanf(&buffer[ctx->ch_metadata_len], "note%d", &index) == 1) { if (ctx->note_handler(index, &buffer[2*ctx->ch_metadata_len], ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } } else { if (io->seek(len, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } } retval = dta_read_tag(ctx, ""); if (retval != READSTAT_OK) goto cleanup; } cleanup: if (buffer) free(buffer); return retval; } static readstat_error_t dta_read_tag(dta_ctx_t *ctx, const char *tag) { readstat_error_t 
retval = READSTAT_OK; if (ctx->initialized && !ctx->file_is_xmlish) return retval; char buffer[256]; size_t len = strlen(tag); if (ctx->io->read(buffer, len, ctx->io->io_ctx) != len) { retval = READSTAT_ERROR_READ; goto cleanup; } if (strncmp(buffer, tag, len) != 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } cleanup: return retval; } static int dta_compare_strls(const void *elem1, const void *elem2) { const dta_strl_t *key = (const dta_strl_t *)elem1; const dta_strl_t *target = *(const dta_strl_t **)elem2; if (key->v == target->v) return key->o - target->o; return key->v - target->v; } static dta_strl_t dta_interpret_strl_vo_bytes(dta_ctx_t *ctx, const unsigned char *vo_bytes) { dta_strl_t strl = {0}; int file_is_big_endian = (!machine_is_little_endian() ^ ctx->bswap); if (ctx->strl_v_len == 2) { if (file_is_big_endian) { strl.v = (vo_bytes[0] << 8) + vo_bytes[1]; strl.o = (((uint64_t)vo_bytes[2] << 40) + ((uint64_t)vo_bytes[3] << 32) + (vo_bytes[4] << 24) + (vo_bytes[5] << 16) + (vo_bytes[6] << 8) + vo_bytes[7]); } else { strl.v = vo_bytes[0] + (vo_bytes[1] << 8); strl.o = (vo_bytes[2] + (vo_bytes[3] << 8) + (vo_bytes[4] << 16) + (vo_bytes[5] << 24) + ((uint64_t)vo_bytes[6] << 32) + ((uint64_t)vo_bytes[7] << 40)); } } else if (ctx->strl_v_len == 4) { uint32_t v, o; memcpy(&v, &vo_bytes[0], sizeof(uint32_t)); memcpy(&o, &vo_bytes[4], sizeof(uint32_t)); strl.v = ctx->bswap ? byteswap4(v) : v; strl.o = ctx->bswap ? byteswap4(o) : o; } return strl; } static readstat_error_t dta_117_read_strl(dta_ctx_t *ctx, dta_strl_t *strl) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; dta_117_strl_header_t header; if (io->read(&header, sizeof(header), io->io_ctx) != sizeof(dta_117_strl_header_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } strl->v = ctx->bswap ? byteswap4(header.v) : header.v; strl->o = ctx->bswap ? byteswap4(header.o) : header.o; strl->type = header.type; strl->len = ctx->bswap ? 
byteswap4(header.len) : header.len; cleanup: return retval; } static readstat_error_t dta_118_read_strl(dta_ctx_t *ctx, dta_strl_t *strl) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; dta_118_strl_header_t header; if (io->read(&header, sizeof(header), io->io_ctx) != sizeof(dta_118_strl_header_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } strl->v = ctx->bswap ? byteswap4(header.v) : header.v; strl->o = ctx->bswap ? byteswap8(header.o) : header.o; strl->type = header.type; strl->len = ctx->bswap ? byteswap4(header.len) : header.len; cleanup: return retval; } static readstat_error_t dta_read_strl(dta_ctx_t *ctx, dta_strl_t *strl) { if (ctx->strl_o_len > 4) { return dta_118_read_strl(ctx, strl); } return dta_117_read_strl(ctx, strl); } static readstat_error_t dta_read_strls(dta_ctx_t *ctx) { if (!ctx->file_is_xmlish) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if (io->seek(ctx->strls_offset, READSTAT_SEEK_SET, io->io_ctx) == -1) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "Failed to seek to strls section (offset=%" PRId64 ")", ctx->strls_offset); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_SEEK; goto cleanup; } retval = dta_read_tag(ctx, ""); if (retval != READSTAT_OK) goto cleanup; ctx->strls_capacity = 100; ctx->strls = readstat_malloc(ctx->strls_capacity * sizeof(dta_strl_t *)); while (1) { char tag[3]; if (io->read(tag, sizeof(tag), io->io_ctx) != sizeof(tag)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (memcmp(tag, "GSO", sizeof(tag)) == 0) { dta_strl_t strl; retval = dta_read_strl(ctx, &strl); if (retval != READSTAT_OK) goto cleanup; if (strl.type != DTA_GSO_TYPE_ASCII) continue; if (ctx->strls_count == ctx->strls_capacity) { ctx->strls_capacity *= 2; if ((ctx->strls = readstat_realloc(ctx->strls, sizeof(dta_strl_t *) * ctx->strls_capacity)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } } dta_strl_t 
*strl_ptr = readstat_malloc(sizeof(dta_strl_t) + strl.len); if (strl_ptr == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } memcpy(strl_ptr, &strl, sizeof(dta_strl_t)); ctx->strls[ctx->strls_count++] = strl_ptr; if (io->read(&strl_ptr->data[0], strl_ptr->len, io->io_ctx) != strl_ptr->len) { retval = READSTAT_ERROR_READ; goto cleanup; } } else if (memcmp(tag, ""); if (retval != READSTAT_OK) goto cleanup; break; } else { retval = READSTAT_ERROR_PARSE; goto cleanup; } } cleanup: return retval; } static readstat_value_t dta_interpret_int8_bytes(dta_ctx_t *ctx, const unsigned char *buf) { readstat_value_t value = { .type = READSTAT_TYPE_INT8 }; int8_t byte = (int8_t)buf[0]; if (ctx->machine_is_twos_complement) { byte = ones_to_twos_complement1(byte); } if (byte > ctx->max_int8) { if (ctx->supports_tagged_missing && byte > DTA_113_MISSING_INT8) { value.tag = 'a' + (byte - DTA_113_MISSING_INT8_A); value.is_tagged_missing = 1; } else { value.is_system_missing = 1; } } value.v.i8_value = byte; return value; } static readstat_value_t dta_interpret_int16_bytes(dta_ctx_t *ctx, const unsigned char *buf) { readstat_value_t value = { .type = READSTAT_TYPE_INT16 }; int16_t num = 0; memcpy(&num, buf, sizeof(int16_t)); if (ctx->bswap) { num = byteswap2(num); } if (ctx->machine_is_twos_complement) { num = ones_to_twos_complement2(num); } if (num > ctx->max_int16) { if (ctx->supports_tagged_missing && num > DTA_113_MISSING_INT16) { value.tag = 'a' + (num - DTA_113_MISSING_INT16_A); value.is_tagged_missing = 1; } else { value.is_system_missing = 1; } } value.v.i16_value = num; return value; } static readstat_value_t dta_interpret_int32_bytes(dta_ctx_t *ctx, const unsigned char *buf) { readstat_value_t value = { .type = READSTAT_TYPE_INT32 }; int32_t num = 0; memcpy(&num, buf, sizeof(int32_t)); if (ctx->bswap) { num = byteswap4(num); } if (ctx->machine_is_twos_complement) { num = ones_to_twos_complement4(num); } if (num > ctx->max_int32) { if (ctx->supports_tagged_missing && num 
> DTA_113_MISSING_INT32) { value.tag = 'a' + (num - DTA_113_MISSING_INT32_A); value.is_tagged_missing = 1; } else { value.is_system_missing = 1; } } value.v.i32_value = num; return value; } static readstat_value_t dta_interpret_float_bytes(dta_ctx_t *ctx, const unsigned char *buf) { readstat_value_t value = { .type = READSTAT_TYPE_FLOAT }; float f_num = NAN; int32_t num = 0; memcpy(&num, buf, sizeof(int32_t)); if (ctx->bswap) { num = byteswap4(num); } if (num > ctx->max_float) { if (ctx->supports_tagged_missing && num > DTA_113_MISSING_FLOAT) { value.tag = 'a' + ((num - DTA_113_MISSING_FLOAT_A) >> 11); value.is_tagged_missing = 1; } else { value.is_system_missing = 1; } } else { memcpy(&f_num, &num, sizeof(int32_t)); } value.v.float_value = f_num; return value; } static readstat_value_t dta_interpret_double_bytes(dta_ctx_t *ctx, const unsigned char *buf) { readstat_value_t value = { .type = READSTAT_TYPE_DOUBLE }; double d_num = NAN; int64_t num = 0; memcpy(&num, buf, sizeof(int64_t)); if (ctx->bswap) { num = byteswap8(num); } if (num > ctx->max_double) { if (ctx->supports_tagged_missing && num > DTA_113_MISSING_DOUBLE) { value.tag = 'a' + ((num - DTA_113_MISSING_DOUBLE_A) >> 40); value.is_tagged_missing = 1; } else { value.is_system_missing = 1; } } else { memcpy(&d_num, &num, sizeof(int64_t)); } value.v.double_value = d_num; return value; } static readstat_error_t dta_handle_row(const unsigned char *buf, dta_ctx_t *ctx) { char str_buf[2048]; int j; readstat_off_t offset = 0; readstat_error_t retval = READSTAT_OK; for (j=0; j<ctx->nvar; j++) { size_t max_len; readstat_value_t value = { { 0 } }; retval = dta_type_info(ctx->typlist[j], ctx, &max_len, &value.type); if (retval != READSTAT_OK) goto cleanup; if (ctx->variables[j]->skip) { offset += max_len; continue; } if (offset + max_len > ctx->record_len) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if (value.type == READSTAT_TYPE_STRING) { retval = readstat_convert(str_buf, sizeof(str_buf), (const char *)&buf[offset], 
max_len, ctx->converter); if (retval != READSTAT_OK) goto cleanup; value.v.string_value = str_buf; } else if (value.type == READSTAT_TYPE_STRING_REF) { dta_strl_t key = dta_interpret_strl_vo_bytes(ctx, &buf[offset]); dta_strl_t **found = bsearch(&key, ctx->strls, ctx->strls_count, sizeof(dta_strl_t *), &dta_compare_strls); if (found) { value.v.string_value = (*found)->data; } value.type = READSTAT_TYPE_STRING; } else if (value.type == READSTAT_TYPE_INT8) { value = dta_interpret_int8_bytes(ctx, &buf[offset]); } else if (value.type == READSTAT_TYPE_INT16) { value = dta_interpret_int16_bytes(ctx, &buf[offset]); } else if (value.type == READSTAT_TYPE_INT32) { value = dta_interpret_int32_bytes(ctx, &buf[offset]); } else if (value.type == READSTAT_TYPE_FLOAT) { value = dta_interpret_float_bytes(ctx, &buf[offset]); } else if (value.type == READSTAT_TYPE_DOUBLE) { value = dta_interpret_double_bytes(ctx, &buf[offset]); } if (ctx->value_handler(ctx->current_row, ctx->variables[j], value, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } offset += max_len; } cleanup: return retval; } static readstat_error_t dta_handle_rows(dta_ctx_t *ctx) { readstat_io_t *io = ctx->io; unsigned char *buf = NULL; int i; readstat_error_t retval = READSTAT_OK; if (ctx->record_len && (buf = readstat_malloc(ctx->record_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } for (i=0; i<ctx->row_limit; i++) { if (io->read(buf, ctx->record_len, io->io_ctx) != ctx->record_len) { retval = READSTAT_ERROR_READ; goto cleanup; } if ((retval = dta_handle_row(buf, ctx)) != READSTAT_OK) { goto cleanup; } ctx->current_row++; if ((retval = dta_update_progress(ctx)) != READSTAT_OK) { goto cleanup; } } if (ctx->row_limit < ctx->nobs) { if (io->seek(ctx->record_len * (ctx->nobs - ctx->row_limit), READSTAT_SEEK_CUR, io->io_ctx) == -1) retval = READSTAT_ERROR_SEEK; } cleanup: if (buf) free(buf); return retval; } static readstat_error_t dta_read_data(dta_ctx_t *ctx) { 
readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if (!ctx->value_handler) { return READSTAT_OK; } if (io->seek(ctx->data_offset, READSTAT_SEEK_SET, io->io_ctx) == -1) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "Failed to seek to data section (offset=%" PRId64 ")", ctx->data_offset); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_SEEK; goto cleanup; } if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) goto cleanup; if ((retval = dta_update_progress(ctx)) != READSTAT_OK) goto cleanup; if ((retval = dta_handle_rows(ctx)) != READSTAT_OK) goto cleanup; if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) goto cleanup; cleanup: return retval; } static readstat_error_t dta_read_xmlish_preamble(dta_ctx_t *ctx, dta_header_t *header) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = ctx->io; if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } char ds_format[3]; if ((retval = dta_read_chunk(ctx, "", ds_format, sizeof(ds_format), "")) != READSTAT_OK) { goto cleanup; } header->ds_format = 100 * (ds_format[0] - '0') + 10 * (ds_format[1] - '0') + (ds_format[2] - '0'); char byteorder[3]; if ((retval = dta_read_chunk(ctx, "", byteorder, sizeof(byteorder), "")) != READSTAT_OK) { goto cleanup; } if (strncmp(byteorder, "MSF", 3) == 0) { header->byteorder = DTA_HILO; } else if (strncmp(byteorder, "LSF", 3) == 0) { header->byteorder = DTA_LOHI; } else { retval = READSTAT_ERROR_PARSE; goto cleanup; } if ((retval = dta_read_chunk(ctx, "", &header->nvar, sizeof(int16_t), "")) != READSTAT_OK) { goto cleanup; } if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } if (io->read(&header->nobs, sizeof(int32_t), io->io_ctx) != sizeof(int32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } if (header->ds_format >= 118) { /* Only support files < 4 billion rows for now */ if (header->byteorder == DTA_HILO) { if (io->read(&header->nobs, sizeof(int32_t), io->io_ctx) != sizeof(int32_t)) { retval = READSTAT_ERROR_READ; goto cleanup; } } else { if (io->seek(4, READSTAT_SEEK_CUR, io->io_ctx) == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } } } if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } cleanup: return retval; } static readstat_error_t dta_read_label_and_timestamp(dta_ctx_t *ctx) { readstat_io_t *io = ctx->io; readstat_error_t retval = READSTAT_OK; char *data_label_buffer = NULL; char *timestamp_buffer = NULL; uint16_t label_len = 0; unsigned char timestamp_len = 0; char last_data_label_char = 0; struct tm timestamp = { .tm_isdst = -1 }; if (ctx->file_is_xmlish) { if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } if (io->read(&timestamp_len, 1, io->io_ctx) != 1) { retval = READSTAT_ERROR_READ; goto cleanup; } } else { timestamp_len = ctx->timestamp_len; } if 
(timestamp_len) { if ((timestamp_buffer = readstat_malloc(timestamp_len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (io->read(timestamp_buffer, timestamp_len, io->io_ctx) != timestamp_len) { retval = READSTAT_ERROR_READ; goto cleanup; } if (!ctx->file_is_xmlish) timestamp_len--; if (timestamp_buffer[0]) { if (timestamp_buffer[timestamp_len-1] == '\0' && last_data_label_char != '\0') { /* Stupid hack for miswritten files with off-by-one timestamp, DTA 114 era? */ memmove(timestamp_buffer+1, timestamp_buffer, timestamp_len-1); timestamp_buffer[0] = last_data_label_char; } if ((retval = dta_parse_timestamp(timestamp_buffer, timestamp_len, &timestamp, ctx->error_handler, ctx->user_ctx)) != READSTAT_OK) { goto cleanup; } ctx->timestamp = mktime(&timestamp); } } if ((retval = dta_read_tag(ctx, "</timestamp>")) != READSTAT_OK) { goto cleanup; } cleanup: if (data_label_buffer) free(data_label_buffer); if (timestamp_buffer) free(timestamp_buffer); return retval; } static readstat_error_t dta_handle_variables(dta_ctx_t *ctx) { if (!ctx->variable_handler) return READSTAT_OK; readstat_error_t retval = READSTAT_OK; int i; int index_after_skipping = 0; for (i=0; i<ctx->nvar; i++) { size_t max_len; readstat_type_t type; retval = dta_type_info(ctx->typlist[i], ctx, &max_len, &type); if (retval != READSTAT_OK) goto cleanup; if (type == READSTAT_TYPE_STRING) max_len++; /* might append NULL */ if (type == READSTAT_TYPE_STRING_REF) { type = READSTAT_TYPE_STRING; max_len = 0; } ctx->variables[i] = dta_init_variable(ctx, i, index_after_skipping, type, max_len); const char *value_labels = NULL; if (ctx->lbllist[ctx->lbllist_entry_len*i]) value_labels = &ctx->lbllist[ctx->lbllist_entry_len*i]; int cb_retval = ctx->variable_handler(i, ctx->variables[i], value_labels, ctx->user_ctx); if (cb_retval == READSTAT_HANDLER_ABORT) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } if (cb_retval == READSTAT_HANDLER_SKIP_VARIABLE) { ctx->variables[i]->skip = 1; } else { index_after_skipping++; } } cleanup: return retval; } static readstat_error_t 
dta_handle_value_labels(dta_ctx_t *ctx) { readstat_io_t *io = ctx->io; readstat_error_t retval = READSTAT_OK; char *table_buffer = NULL; char *utf8_buffer = NULL; if (io->seek(ctx->value_labels_offset, READSTAT_SEEK_SET, io->io_ctx) == -1) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "Failed to seek to value labels section (offset=%" PRId64 ")", ctx->value_labels_offset); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_SEEK; goto cleanup; } if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } if (!ctx->value_label_handler) { return READSTAT_OK; } while (1) { size_t len = 0; char labname[129]; uint32_t i = 0, n = 0; if (ctx->value_label_table_len_len == 2) { int16_t table_header_len; if (io->read(&table_header_len, sizeof(int16_t), io->io_ctx) < sizeof(int16_t)) break; len = table_header_len; if (ctx->bswap) len = byteswap2(table_header_len); n = len / 8; } else { if (dta_read_tag(ctx, "") != READSTAT_OK) { break; } int32_t table_header_len; if (io->read(&table_header_len, sizeof(int32_t), io->io_ctx) < sizeof(int32_t)) break; len = table_header_len; if (ctx->bswap) len = byteswap4(table_header_len); } if (io->read(labname, ctx->value_label_table_labname_len, io->io_ctx) < ctx->value_label_table_labname_len) break; if (io->seek(ctx->value_label_table_padding_len, READSTAT_SEEK_CUR, io->io_ctx) == -1) break; if ((table_buffer = readstat_realloc(table_buffer, len)) == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (io->read(table_buffer, len, io->io_ctx) < len) { break; } if (ctx->value_label_table_len_len == 2) { for (i=0; iconverter); if (retval != READSTAT_OK) goto cleanup; if (label_buf[0] && ctx->value_label_handler(labname, value, label_buf, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } } else if (len > 8) { if ((retval = dta_read_tag(ctx, "")) != READSTAT_OK) { goto cleanup; } n = *(uint32_t *)table_buffer; uint32_t txtlen 
= *((uint32_t *)table_buffer+1); if (ctx->bswap) { n = byteswap4(n); txtlen = byteswap4(txtlen); } if (txtlen > len - 8 || n > (len - 8 - txtlen) / 8) { break; } uint32_t *off = (uint32_t *)table_buffer+2; uint32_t *val = (uint32_t *)table_buffer+2+n; char *txt = &table_buffer[8LL*n+8]; size_t utf8_buffer_len = 4*txtlen+1; if (txtlen > MAX_VALUE_LABEL_LEN+1) utf8_buffer_len = 4*MAX_VALUE_LABEL_LEN+1; utf8_buffer = realloc(utf8_buffer, utf8_buffer_len); /* Much bigger than we need but whatever */ if (utf8_buffer == NULL) { retval = READSTAT_ERROR_MALLOC; goto cleanup; } if (ctx->bswap) { for (i=0; imachine_is_twos_complement) { for (i=0; i= txtlen) { retval = READSTAT_ERROR_PARSE; goto cleanup; } readstat_value_t value = { .v = { .i32_value = val[i] }, .type = READSTAT_TYPE_INT32 }; size_t max_label_len = txtlen - off[i]; if (max_label_len > MAX_VALUE_LABEL_LEN) max_label_len = MAX_VALUE_LABEL_LEN; if (val[i] > ctx->max_int32) { if (ctx->supports_tagged_missing && val[i] > DTA_113_MISSING_INT32) { value.tag = 'a' + (val[i] - DTA_113_MISSING_INT32_A); value.is_tagged_missing = 1; } else{ value.is_system_missing = 1; } } retval = readstat_convert(utf8_buffer, utf8_buffer_len, &txt[off[i]], max_label_len, ctx->converter); if (retval != READSTAT_OK) goto cleanup; if (ctx->value_label_handler(labname, value, utf8_buffer, ctx->user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } } } cleanup: if (table_buffer) free(table_buffer); if (utf8_buffer) free(utf8_buffer); return retval; } readstat_error_t readstat_parse_dta(readstat_parser_t *parser, const char *path, void *user_ctx) { readstat_error_t retval = READSTAT_OK; readstat_io_t *io = parser->io; int i; dta_header_t header; dta_ctx_t *ctx; size_t file_size = 0; ctx = dta_ctx_alloc(io); if (io->open(path, io->io_ctx) == -1) { retval = READSTAT_ERROR_OPEN; goto cleanup; } char magic[4]; if (io->read(magic, 4, io->io_ctx) != 4) { retval = READSTAT_ERROR_READ; goto cleanup; } file_size = 
io->seek(0, READSTAT_SEEK_END, io->io_ctx); if (file_size == -1) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "Failed to seek to end of file"); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_SEEK; goto cleanup; } if (io->seek(0, READSTAT_SEEK_SET, io->io_ctx) == -1) { if (ctx->error_handler) { snprintf(ctx->error_buf, sizeof(ctx->error_buf), "Failed to seek to start of file"); ctx->error_handler(ctx->error_buf, ctx->user_ctx); } retval = READSTAT_ERROR_SEEK; goto cleanup; } if (strncmp(magic, "read(&header, sizeof(header), io->io_ctx) != sizeof(header)) { retval = READSTAT_ERROR_READ; goto cleanup; } } retval = dta_ctx_init(ctx, header.nvar, header.nobs, header.byteorder, header.ds_format, parser->input_encoding, parser->output_encoding); if (retval != READSTAT_OK) { goto cleanup; } ctx->user_ctx = user_ctx; ctx->file_size = file_size; ctx->error_handler = parser->error_handler; ctx->progress_handler = parser->progress_handler; ctx->note_handler = parser->note_handler; ctx->variable_handler = parser->variable_handler; ctx->value_handler = parser->value_handler; ctx->value_label_handler = parser->value_label_handler; ctx->row_limit = ctx->nobs; if (parser->row_limit > 0 && parser->row_limit < ctx->nobs) ctx->row_limit = parser->row_limit; retval = dta_update_progress(ctx); if (retval != READSTAT_OK) goto cleanup; if (parser->info_handler) { if (parser->info_handler(ctx->row_limit, ctx->nvar, user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } if ((retval = dta_read_label_and_timestamp(ctx)) != READSTAT_OK) goto cleanup; if ((retval = dta_read_tag(ctx, "
")) != READSTAT_OK) { goto cleanup; } if (parser->metadata_handler) { if (parser->metadata_handler(ctx->data_label, NULL, ctx->timestamp, header.ds_format, user_ctx) != READSTAT_HANDLER_OK) { retval = READSTAT_ERROR_USER_ABORT; goto cleanup; } } if ((retval = dta_read_map(ctx)) != READSTAT_OK) { retval = READSTAT_ERROR_READ; goto cleanup; } if ((retval = dta_read_descriptors(ctx)) != READSTAT_OK) { goto cleanup; } for (i=0; invar; i++) { size_t max_len; if ((retval = dta_type_info(ctx->typlist[i], ctx, &max_len, NULL)) != READSTAT_OK) goto cleanup; ctx->record_len += max_len; } if ((ctx->nvar > 0 || ctx->nobs > 0) && ctx->record_len == 0) { retval = READSTAT_ERROR_PARSE; goto cleanup; } if ((retval = dta_handle_variables(ctx)) != READSTAT_OK) goto cleanup; if ((retval = dta_read_expansion_fields(ctx)) != READSTAT_OK) goto cleanup; if (!ctx->file_is_xmlish) { ctx->data_offset = io->seek(0, READSTAT_SEEK_CUR, io->io_ctx); if (ctx->data_offset == -1) { retval = READSTAT_ERROR_SEEK; goto cleanup; } ctx->value_labels_offset = ctx->data_offset + ctx->record_len * ctx->nobs; } if ((retval = dta_read_strls(ctx)) != READSTAT_OK) goto cleanup; if ((retval = dta_read_data(ctx)) != READSTAT_OK) goto cleanup; if ((retval = dta_handle_value_labels(ctx)) != READSTAT_OK) goto cleanup; cleanup: io->close(io->io_ctx); if (ctx) dta_ctx_free(ctx); return retval; } haven/src/readstat/stata/readstat_dta.h0000644000176200001440000001304213227731765017712 0ustar liggesusers#pragma pack(push, 1) // DTA files typedef struct dta_header_s { unsigned char ds_format; unsigned char byteorder; unsigned char filetype; unsigned char unused; int16_t nvar; int32_t nobs; } dta_header_t; typedef struct dta_117_strl_header_s { uint32_t v; uint32_t o; unsigned char type; int32_t len; } dta_117_strl_header_t; typedef struct dta_118_strl_header_s { uint32_t v; uint64_t o; unsigned char type; int32_t len; } dta_118_strl_header_t; #pragma pack(pop) typedef struct dta_strl_s { uint16_t v; uint64_t o; unsigned 
char type; size_t len; char data[1]; // Flexible array; use [1] for C++98 compatibility } dta_strl_t; typedef struct dta_ctx_s { char *data_label; size_t data_label_len; size_t data_label_len_len; time_t timestamp; size_t timestamp_len; char typlist_version; size_t typlist_entry_len; uint16_t *typlist; size_t typlist_len; char *varlist; size_t varlist_len; int16_t *srtlist; size_t srtlist_len; char *fmtlist; size_t fmtlist_len; char *lbllist; size_t lbllist_len; char *variable_labels; size_t variable_labels_len; size_t variable_name_len; size_t fmtlist_entry_len; size_t lbllist_entry_len; size_t variable_labels_entry_len; size_t expansion_len_len; size_t ch_metadata_len; size_t value_label_table_len_len; size_t value_label_table_labname_len; size_t value_label_table_padding_len; size_t strl_v_len; size_t strl_o_len; int64_t data_offset; int64_t strls_offset; int64_t value_labels_offset; int nvar; int64_t nobs; size_t record_len; int64_t row_limit; int64_t current_row; int bswap; int machine_is_twos_complement; int file_is_xmlish; int supports_tagged_missing; int8_t max_int8; int16_t max_int16; int32_t max_int32; int32_t max_float; int64_t max_double; dta_strl_t **strls; size_t strls_count; size_t strls_capacity; readstat_variable_t **variables; iconv_t converter; readstat_error_handler error_handler; readstat_progress_handler progress_handler; readstat_note_handler note_handler; readstat_variable_handler variable_handler; readstat_value_handler value_handler; readstat_value_label_handler value_label_handler; size_t file_size; void *user_ctx; readstat_io_t *io; int initialized; char error_buf[256]; } dta_ctx_t; #define DTA_HILO 0x01 #define DTA_LOHI 0x02 #define DTA_OLD_MAX_INT8 0x7e #define DTA_OLD_MAX_INT16 0x7ffe #define DTA_OLD_MAX_INT32 0x7ffffffe #define DTA_OLD_MAX_FLOAT 0x7effffff // +1.7e38f #define DTA_OLD_MAX_DOUBLE 0x7fdfffffffffffffL // +8.9e307 #define DTA_OLD_MISSING_INT8 0x7F #define DTA_OLD_MISSING_INT16 0x7FFF #define DTA_OLD_MISSING_INT32 
0x7FFFFFFF #define DTA_OLD_MISSING_FLOAT 0x7F000000 #define DTA_OLD_MISSING_DOUBLE 0x7FE0000000000000L #define DTA_113_MAX_INT8 0x64 #define DTA_113_MAX_INT16 0x7fe4 #define DTA_113_MAX_INT32 0x7fffffe4 #define DTA_113_MAX_FLOAT 0x7effffff // +1.7e38f #define DTA_113_MAX_DOUBLE 0x7fdfffffffffffffL // +8.9e307 #define DTA_113_MISSING_INT8 0x65 #define DTA_113_MISSING_INT16 0x7FE5 #define DTA_113_MISSING_INT32 0x7FFFFFE5 #define DTA_113_MISSING_FLOAT 0x7F000000 #define DTA_113_MISSING_DOUBLE 0x7FE0000000000000L #define DTA_113_MISSING_INT8_A (DTA_113_MISSING_INT8+1) #define DTA_113_MISSING_INT16_A (DTA_113_MISSING_INT16+1) #define DTA_113_MISSING_INT32_A (DTA_113_MISSING_INT32+1) #define DTA_113_MISSING_FLOAT_A (DTA_113_MISSING_FLOAT+0x0800) #define DTA_113_MISSING_DOUBLE_A (DTA_113_MISSING_DOUBLE+0x010000000000) #define DTA_GSO_TYPE_BINARY 0x81 #define DTA_GSO_TYPE_ASCII 0x82 #define DTA_117_TYPE_CODE_INT8 0xFFFA #define DTA_117_TYPE_CODE_INT16 0xFFF9 #define DTA_117_TYPE_CODE_INT32 0xFFF8 #define DTA_117_TYPE_CODE_FLOAT 0xFFF7 #define DTA_117_TYPE_CODE_DOUBLE 0xFFF6 #define DTA_117_TYPE_CODE_STRL 0x8000 #define DTA_111_TYPE_CODE_INT8 0xFB #define DTA_111_TYPE_CODE_INT16 0xFC #define DTA_111_TYPE_CODE_INT32 0xFD #define DTA_111_TYPE_CODE_FLOAT 0xFE #define DTA_111_TYPE_CODE_DOUBLE 0xFF #define DTA_OLD_TYPE_CODE_INT8 'b' #define DTA_OLD_TYPE_CODE_INT16 'i' #define DTA_OLD_TYPE_CODE_INT32 'l' #define DTA_OLD_TYPE_CODE_FLOAT 'f' #define DTA_OLD_TYPE_CODE_DOUBLE 'd' dta_ctx_t *dta_ctx_alloc(readstat_io_t *io); readstat_error_t dta_ctx_init(dta_ctx_t *ctx, int16_t nvar, int32_t nobs, unsigned char byteorder, unsigned char ds_format, const char *input_encoding, const char *output_encoding); void dta_ctx_free(dta_ctx_t *ctx); readstat_error_t dta_type_info(uint16_t typecode, dta_ctx_t *ctx, size_t *max_len, readstat_type_t *out_type); haven/src/readstat/CKHashTable.c0000644000176200001440000001247213227731765016211 0ustar liggesusers// CKHashTable - A simple hash table // 
Copyright 2010 Evan Miller (see LICENSE) #include "CKHashTable.h" #include #include int ck_str_n_hash_insert(const char *key, size_t keylen, const void *value, ck_hash_table_t *table); const void *ck_str_n_hash_lookup(const char *key, size_t keylen, ck_hash_table_t *table); static inline void ck_float2str(float key, char keystr[6]); static inline void ck_double2str(double key, char keystr[11]); inline uint64_t ck_hash_str(const char *str) { uint64_t hash = 5381; int c; while ((c = *str++)) hash = ((hash << 5) + hash) + c; return hash; } static inline void ck_float2str(float key, char keystr[6]) { memcpy(keystr, &key, 4); keystr[4] = (0xF0 | (keystr[0] & 0x01) | (keystr[1] & 0x02) | (keystr[2] & 0x04) | (keystr[3] & 0x08)); keystr[0] |= 0x01; keystr[1] |= 0x02; keystr[2] |= 0x04; keystr[3] |= 0x08; keystr[5] = 0x00; } static inline void ck_double2str(double key, char keystr[11]) { memcpy(keystr, &key, 8); keystr[8] = (0xF0 | (keystr[0] & 0x01) | (keystr[1] & 0x02) | (keystr[2] & 0x04) | (keystr[3] & 0x08)); keystr[0] |= 0x01; keystr[1] |= 0x02; keystr[2] |= 0x04; keystr[3] |= 0x08; keystr[9] = (0xF0 | (keystr[4] & 0x01) | (keystr[5] & 0x02) | (keystr[6] & 0x04) | (keystr[7] & 0x08)); keystr[4] |= 0x01; keystr[5] |= 0x02; keystr[6] |= 0x04; keystr[7] |= 0x08; keystr[10] = 0x00; } const void *ck_float_hash_lookup(float key, ck_hash_table_t *table) { char keystr[6]; ck_float2str(key, keystr); return ck_str_n_hash_lookup(keystr, 5, table); } int ck_float_hash_insert(float key, const void *value, ck_hash_table_t *table) { char keystr[6]; ck_float2str(key, keystr); return ck_str_n_hash_insert(keystr, 5, value, table); } const void *ck_double_hash_lookup(double key, ck_hash_table_t *table) { char keystr[11]; ck_double2str(key, keystr); return ck_str_n_hash_lookup(keystr, 10, table); } int ck_double_hash_insert(double key, const void *value, ck_hash_table_t *table) { char keystr[11]; ck_double2str(key, keystr); return ck_str_n_hash_insert(keystr, 10, value, table); } const 
void *ck_str_hash_lookup(const char *key, ck_hash_table_t *table) { size_t keylen = strlen(key); if (keylen >= CK_HASH_KEY_SIZE) keylen = CK_HASH_KEY_SIZE-1; return ck_str_n_hash_lookup(key, keylen, table); } const void *ck_str_n_hash_lookup(const char *key, size_t keylen, ck_hash_table_t *table) { if (table->count == 0) return NULL; if (keylen == 0 || keylen >= CK_HASH_KEY_SIZE) return NULL; uint64_t hash_key = ck_hash_str(key); hash_key %= table->capacity; uint64_t end = (hash_key - 1) % table->capacity; while (hash_key != end && table->entries[hash_key].key[0] != '\0') { if (strncmp(table->entries[hash_key].key, key, keylen + 1) == 0) { return table->entries[hash_key].value; } hash_key++; hash_key %= table->capacity; } return NULL; } int ck_str_hash_insert(const char *key, const void *value, ck_hash_table_t *table) { size_t keylen = strlen(key); if (keylen >= CK_HASH_KEY_SIZE) keylen = CK_HASH_KEY_SIZE-1; return ck_str_n_hash_insert(key, keylen, value, table); } int ck_str_n_hash_insert(const char *key, size_t keylen, const void *value, ck_hash_table_t *table) { if (table->capacity == 0) return 0; if (keylen == 0 || keylen >= CK_HASH_KEY_SIZE) return 0; if (table->count >= 0.75 * table->capacity) { if (ck_hash_table_grow(table) == -1) { return 0; } } uint64_t hash_key = ck_hash_str(key); hash_key %= table->capacity; uint64_t end = (hash_key - 1) % table->capacity; while (hash_key != end) { if (table->entries[hash_key].key[0] == '\0') { table->count++; } if (table->entries[hash_key].key[0] == '\0' || strncmp(table->entries[hash_key].key, key, keylen + 1) == 0) { memcpy(table->entries[hash_key].key, key, keylen); memset(table->entries[hash_key].key + keylen, 0, 1); table->entries[hash_key].value = value; return 1; } hash_key++; hash_key %= table->capacity; } return 0; } ck_hash_table_t *ck_hash_table_init(size_t size) { ck_hash_table_t *table; if ((table = malloc(sizeof(ck_hash_table_t))) == NULL) return NULL; if ((table->entries = malloc(size * 
sizeof(ck_hash_entry_t))) == NULL) { free(table); return NULL; } table->capacity = size; table->count = 0; ck_hash_table_wipe(table); return table; } void ck_hash_table_free(ck_hash_table_t *table) { free(table->entries); free(table); } void ck_hash_table_wipe(ck_hash_table_t *table) { memset(table->entries, 0, table->capacity * sizeof(ck_hash_entry_t)); } int ck_hash_table_grow(ck_hash_table_t *table) { ck_hash_entry_t *old_entries = table->entries; uint64_t old_capacity = table->capacity; uint64_t new_capacity = 2 * table->capacity; int i; if ((table->entries = calloc(new_capacity, sizeof(ck_hash_entry_t))) == NULL) { return -1; } table->capacity = new_capacity; table->count = 0; for (i=0; i using namespace Rcpp; #include "readstat.h" #include "haven_types.h" ssize_t data_writer(const void *data, size_t len, void *ctx); inline const char* string_utf8(SEXP x, int i) { return Rf_translateCharUTF8(STRING_ELT(x, i)); } inline const bool string_is_missing(SEXP x, int i) { return STRING_ELT(x, i) == NA_STRING; } inline readstat_measure_e measureType(SEXP x) { if (Rf_inherits(x, "ordered")) { return READSTAT_MEASURE_ORDINAL; } else if (Rf_inherits(x, "factor")) { return READSTAT_MEASURE_NOMINAL; } else { switch(TYPEOF(x)) { case INTSXP: case REALSXP: return READSTAT_MEASURE_SCALE; case LGLSXP: case STRSXP: return READSTAT_MEASURE_NOMINAL; default: return READSTAT_MEASURE_UNKNOWN; } } } inline int displayWidth(RObject x) { RObject display_width_obj = x.attr("display_width"); switch(TYPEOF(display_width_obj)) { case INTSXP: return INTEGER(display_width_obj)[0]; case REALSXP: return REAL(display_width_obj)[0]; } return 0; } class Writer { FileType type_; List x_; readstat_writer_t* writer_; FILE* pOut_; public: Writer(FileType type, List x, std::string path): type_(type), x_(x) { pOut_ = fopen(path.c_str(), "wb"); if (pOut_ == NULL) stop("Failed to open '%s' for writing", path); writer_ = readstat_writer_init(); checkStatus(readstat_set_data_writer(writer_, data_writer)); 
} ~Writer() { try { fclose(pOut_); readstat_writer_free(writer_); } catch (...) {}; } void setVersion(int version) { readstat_writer_set_file_format_version(writer_, version); } void write() { int p = x_.size(); if (p == 0) return; CharacterVector names = as<CharacterVector>(x_.attr("names")); /* Define variables */ for (int j = 0; j < p; ++j) { RObject col = x_[j]; VarType type = numType(col); const char* name = string_utf8(names, j); const char* format = var_format(col, type); switch(TYPEOF(col)) { case LGLSXP: defineVariable(as<IntegerVector>(col), name, format); break; case INTSXP: defineVariable(as<IntegerVector>(col), name, format); break; case REALSXP: defineVariable(as<NumericVector>(col), name, format); break; case STRSXP: defineVariable(as<CharacterVector>(col), name, format); break; default: stop("Variables of type %s not supported yet", Rf_type2char(TYPEOF(col))); } } int n = Rf_length(x_[0]); switch(type_) { case HAVEN_SPSS: checkStatus(readstat_begin_writing_sav(writer_, this, n)); break; case HAVEN_STATA: checkStatus(readstat_begin_writing_dta(writer_, this, n)); break; case HAVEN_SAS: checkStatus(readstat_begin_writing_sas7bdat(writer_, this, n)); break; case HAVEN_XPT: checkStatus(readstat_begin_writing_xport(writer_, this, n)); break; } /* Write data */ for (int i = 0; i < n; ++i) { checkStatus(readstat_begin_row(writer_)); for (int j = 0; j < p; ++j) { RObject col = x_[j]; readstat_variable_t* var = readstat_get_variable(writer_, j); switch (TYPEOF(col)) { case LGLSXP: { int val = LOGICAL(col)[i]; insertValue(var, val, val == NA_LOGICAL); break; } case INTSXP: { int val = INTEGER(col)[i]; insertValue(var, (int) adjustDatetimeFromR(type_, col, val), val == NA_INTEGER); break; } case REALSXP: { double val = REAL(col)[i]; insertValue(var, adjustDatetimeFromR(type_, col, val), !R_finite(val)); break; } case STRSXP: { insertValue(var, string_utf8(col, i), string_is_missing(col, i)); break; } default: break; } } checkStatus(readstat_end_row(writer_)); } checkStatus(readstat_end_writing(writer_)); } // Define variables 
---------------------------------------------------------- const char* var_label(RObject x) { RObject label = x.attr("label"); if (label == R_NilValue) return NULL; return string_utf8(label, 0); } const char* var_format(RObject x, VarType varType) { /* Use attribute, if present */ RObject format = x.attr(formatAttribute(type_)); if (format != R_NilValue) return string_utf8(format, 0); switch(varType) { case HAVEN_DEFAULT: return NULL; case HAVEN_DATETIME: switch(type_) { case HAVEN_XPT: case HAVEN_SAS: return "DATETIME"; case HAVEN_SPSS: return "DATETIME"; case HAVEN_STATA: return "%tc"; } case HAVEN_DATE: switch(type_) { case HAVEN_XPT: case HAVEN_SAS: return "DATE"; case HAVEN_SPSS: return "DATE"; case HAVEN_STATA: return "%td"; } case HAVEN_TIME: switch(type_) { case HAVEN_XPT: case HAVEN_SAS: return "TIME"; case HAVEN_SPSS: return "TIME"; case HAVEN_STATA: return NULL; /* Stata doesn't have a pure time type */ } } return NULL; } void defineVariable(IntegerVector x, const char* name, const char* format = NULL) { readstat_label_set_t* labelSet = NULL; if (Rf_inherits(x, "factor")) { labelSet = readstat_add_label_set(writer_, READSTAT_TYPE_INT32, name); CharacterVector levels = as<CharacterVector>(x.attr("levels")); for (int i = 0; i < levels.size(); ++i) readstat_label_int32_value(labelSet, i + 1, string_utf8(levels, i)); } else if (Rf_inherits(x, "labelled")) { labelSet = readstat_add_label_set(writer_, READSTAT_TYPE_INT32, name); IntegerVector values = as<IntegerVector>(x.attr("labels")); CharacterVector labels = as<CharacterVector>(values.attr("names")); for (int i = 0; i < values.size(); ++i) readstat_label_int32_value(labelSet, values[i], string_utf8(labels, i)); } readstat_variable_t* var = readstat_add_variable(writer_, name, READSTAT_TYPE_INT32, 0); readstat_variable_set_format(var, format); readstat_variable_set_label(var, var_label(x)); readstat_variable_set_label_set(var, labelSet); readstat_variable_set_measure(var, measureType(x)); readstat_variable_set_display_width(var, displayWidth(x)); } void 
defineVariable(NumericVector x, const char* name, const char* format = NULL) { readstat_label_set_t* labelSet = NULL; if (Rf_inherits(x, "labelled")) { labelSet = readstat_add_label_set(writer_, READSTAT_TYPE_DOUBLE, name); NumericVector values = as<NumericVector>(x.attr("labels")); CharacterVector labels = as<CharacterVector>(values.attr("names")); for (int i = 0; i < values.size(); ++i) readstat_label_double_value(labelSet, values[i], string_utf8(labels, i)); } readstat_variable_t* var = readstat_add_variable(writer_, name, READSTAT_TYPE_DOUBLE, 0); readstat_variable_set_format(var, format); readstat_variable_set_label(var, var_label(x)); readstat_variable_set_label_set(var, labelSet); readstat_variable_set_measure(var, measureType(x)); readstat_variable_set_display_width(var, displayWidth(x)); if (Rf_inherits(x, "labelled_spss")) { SEXP na_range = x.attr("na_range"); if (TYPEOF(na_range) == REALSXP && Rf_length(na_range) == 2) { readstat_variable_add_missing_double_range(var, REAL(na_range)[0], REAL(na_range)[1]); } SEXP na_values = x.attr("na_values"); if (TYPEOF(na_values) == REALSXP) { int n = Rf_length(na_values); for (int i = 0; i < n; ++i) { readstat_variable_add_missing_double_value(var, REAL(na_values)[i]); } } } } void defineVariable(CharacterVector x, const char* name, const char* format = NULL) { readstat_label_set_t* labelSet = NULL; if (Rf_inherits(x, "labelled")) { labelSet = readstat_add_label_set(writer_, READSTAT_TYPE_STRING, name); CharacterVector values = as<CharacterVector>(x.attr("labels")); CharacterVector labels = as<CharacterVector>(values.attr("names")); for (int i = 0; i < values.size(); ++i) readstat_label_string_value(labelSet, string_utf8(values, i), string_utf8(labels, i)); } int max_length = 0; for (int i = 0; i < x.size(); ++i) { int length = strlen(string_utf8(x, i)); if (length > max_length) max_length = length; } readstat_variable_t* var = readstat_add_variable(writer_, name, READSTAT_TYPE_STRING, max_length); readstat_variable_set_format(var, format); readstat_variable_set_label(var, 
var_label(x)); readstat_variable_set_label_set(var, labelSet); readstat_variable_set_measure(var, measureType(x)); readstat_variable_set_display_width(var, displayWidth(x)); } // Value helper ------------------------------------------------------------- void insertValue(readstat_variable_t* var, int val, bool is_missing) { if (is_missing) { checkStatus(readstat_insert_missing_value(writer_, var)); } else { checkStatus(readstat_insert_int32_value(writer_, var, val)); } } void insertValue(readstat_variable_t* var, double val, bool is_missing) { if (is_missing) { checkStatus(readstat_insert_missing_value(writer_, var)); } else { checkStatus(readstat_insert_double_value(writer_, var, val)); } } void insertValue(readstat_variable_t* var, const char* val, bool is_missing) { if (is_missing) { checkStatus(readstat_insert_missing_value(writer_, var)); } else { checkStatus(readstat_insert_string_value(writer_, var, val)); } } // Misc ---------------------------------------------------------------------- void checkStatus(readstat_error_t err) { if (err == 0) return; stop("Writing failure: %s.", readstat_error_message(err)); } ssize_t write(const void *data, size_t len) { return fwrite(data, sizeof(char), len, pOut_); } }; ssize_t data_writer(const void *data, size_t len, void *ctx) { return ((Writer*) ctx)->write(data, len); } // [[Rcpp::export]] void write_sav_(List data, std::string path) { Writer(HAVEN_SPSS, data, path).write(); } // [[Rcpp::export]] void write_dta_(List data, std::string path, int version) { Writer writer(HAVEN_STATA, data, path); writer.setVersion(version); writer.write(); } // [[Rcpp::export]] void write_sas_(List data, std::string path) { Writer(HAVEN_SAS, data, path).write(); } // [[Rcpp::export]] void write_xpt_(List data, std::string path, int version) { Writer writer(HAVEN_XPT, data, path); writer.setVersion(version); writer.write(); } haven/src/haven_types.h0000644000176200001440000000124013227731765014652 0ustar liggesusers#ifndef __HAVEN_TYPES__ 
#define __HAVEN_TYPES__ #include #include enum FileType { HAVEN_SPSS, HAVEN_STATA, HAVEN_SAS, HAVEN_XPT }; enum VarType { HAVEN_DEFAULT, HAVEN_DATE, HAVEN_TIME, HAVEN_DATETIME }; std::string formatAttribute(FileType type); bool hasPrefix(std::string x, std::string prefix); VarType numType(SEXP x); VarType numType(FileType type, const char* var_format); // Value conversion ----------------------------------------------------------- int daysOffset(FileType type); double adjustDatetimeToR(FileType file, VarType var, double value); double adjustDatetimeFromR(FileType file, SEXP col, double value); #endif haven/src/haven_types.cpp0000644000176200001440000000745613227731765015224 0ustar liggesusers#include #include #include "haven_types.h" std::string formatAttribute(FileType type) { switch (type) { case HAVEN_STATA: return "format.stata"; case HAVEN_SPSS: return "format.spss"; case HAVEN_SAS: return "format.sas"; case HAVEN_XPT: return "format.xpt"; } return ""; } bool hasPrefix(std::string x, std::string prefix) { return x.compare(0, prefix.size(), prefix) == 0; } VarType numType(SEXP x) { if (Rf_inherits(x, "Date")) { return HAVEN_DATE; } else if (Rf_inherits(x, "POSIXct")) { return HAVEN_DATETIME; } else if (Rf_inherits(x, "hms")) { return HAVEN_TIME; } else { return HAVEN_DEFAULT; } } VarType numType(FileType type, const char* var_format) { if (var_format == NULL) return HAVEN_DEFAULT; std::string format(var_format); switch(type) { case HAVEN_XPT: case HAVEN_SAS: // http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000589916.htm if (hasPrefix(format,"DATETIME")) return HAVEN_DATETIME; else if (hasPrefix(format,"WEEKDATE")) return HAVEN_DATE; else if (hasPrefix(format,"MMDDYY")) return HAVEN_DATE; else if (hasPrefix(format,"DDMMYY")) return HAVEN_DATE; else if (hasPrefix(format,"YYMMDD")) return HAVEN_DATE; else if (hasPrefix(format,"DATE")) return HAVEN_DATE; else if (hasPrefix(format,"TIME")) return HAVEN_TIME; else if 
(hasPrefix(format,"HHMM")) return HAVEN_TIME; else return HAVEN_DEFAULT; case HAVEN_SPSS: // http://www-01.ibm.com/support/knowledgecenter/?lang=en#!/SSLVMB_20.0.0/com.ibm.spss.statistics.help/syn_date_and_time_date_time_formats.htm if (hasPrefix(format, "DATETIME")) return HAVEN_DATETIME; else if (hasPrefix(format, "DATE")) return HAVEN_DATE; else if (hasPrefix(format, "ADATE")) return HAVEN_DATE; else if (hasPrefix(format, "EDATE")) return HAVEN_DATE; else if (hasPrefix(format, "JDATE")) return HAVEN_DATE; else if (hasPrefix(format, "SDATE")) return HAVEN_DATE; else if (hasPrefix(format, "TIME")) return HAVEN_TIME; else if (hasPrefix(format, "DTIME")) return HAVEN_TIME; else return HAVEN_DEFAULT; case HAVEN_STATA: if (hasPrefix(format, "%tC")) return HAVEN_DATETIME; else if (hasPrefix(format, "%tc")) return HAVEN_DATETIME; else if (hasPrefix(format, "%td")) return HAVEN_DATE; else if (hasPrefix(format, "%d")) return HAVEN_DATE; else return HAVEN_DEFAULT; } return HAVEN_DEFAULT; } // Value conversion ----------------------------------------------------------- int daysOffset(FileType type) { switch(type) { case HAVEN_XPT: case HAVEN_SAS: return 3653; // 1960-01-01 case HAVEN_STATA: return 3653; case HAVEN_SPSS: return 141428; // 1582-01-01 } return 0; } double adjustDatetimeToR(FileType file, VarType var, double value) { if (std::isnan(value)) return value; double offset = daysOffset(file); switch(var) { case HAVEN_DATETIME: if (file == HAVEN_STATA) // stored in milliseconds value /= 1000; return value - offset * 86400; case HAVEN_DATE: if (file == HAVEN_SPSS) // stored in seconds value /= 86400; return value - offset; default: return value; } } double adjustDatetimeFromR(FileType file, SEXP col, double value) { if (std::isnan(value)) return value; double offset = daysOffset(file); switch(numType(col)) { case HAVEN_DATETIME: value += offset * 86400; if (file == HAVEN_STATA) // stored in milliseconds value *= 1000; return value; case HAVEN_DATE: value += offset; if 
(file == HAVEN_SPSS) // stored in seconds value *= 86400; return value; default: return value; } } haven/src/DfReader.cpp0000644000176200001440000005175013227731765014347 0ustar liggesusers#include #include #include #include using namespace Rcpp; #include "readstat.h" #include "haven_types.h" #include "tagged_na.h" double haven_double_value_udm(readstat_value_t value, readstat_variable_t* var, bool user_na) { if (readstat_value_is_tagged_missing(value)) { return make_tagged_na(tolower(readstat_value_tag(value))); } else if (!user_na && readstat_value_is_defined_missing(value, var)) { return NA_REAL; } else if (readstat_value_is_system_missing(value)) { return NA_REAL; } else { return readstat_double_value(value); } } double haven_double_value(readstat_value_t value) { if (readstat_value_is_tagged_missing(value)) { return make_tagged_na(tolower(readstat_value_tag(value))); } else { return readstat_double_value(value); } } // LabelSet ------------------------------------------------------------------- class LabelSet { std::vector labels_; std::vector values_s_; std::vector values_i_; std::vector values_d_; public: LabelSet() {} void add(const char* value, std::string label) { if (values_i_.size() > 0 || values_d_.size() > 0) stop("Can't add string to integer/double labelset"); values_s_.push_back(value); labels_.push_back(label); } void add(int value, std::string label) { if (values_d_.size() > 0 || values_s_.size() > 0) stop("Can't add integer to string/double labelset"); values_i_.push_back(value); labels_.push_back(label); } void add(double value, std::string label) { if (values_i_.size() > 0 || values_s_.size() > 0) stop("Can't add double to integer/string labelset"); values_d_.push_back(value); labels_.push_back(label); } size_t size() const { return labels_.size(); } RObject labels() const { RObject out; if (values_i_.size() > 0) { int n = values_i_.size(); IntegerVector values(n); CharacterVector labels(n); for (int i = 0; i < n; ++i) { values[i] = 
values_i_[i]; labels[i] = Rf_mkCharCE(labels_[i].c_str(), CE_UTF8); } values.attr("names") = labels; out = values; } else if (values_d_.size() > 0) { int n = values_d_.size(); NumericVector values(n); CharacterVector labels(n); for (int i = 0; i < n; ++i) { values[i] = values_d_[i]; labels[i] = Rf_mkCharCE(labels_[i].c_str(), CE_UTF8); } values.attr("names") = labels; out = values; } else { int n = values_s_.size(); CharacterVector values(n), labels(n); for (int i = 0; i < n; ++i) { values[i] = Rf_mkCharCE(values_s_[i].c_str(), CE_UTF8); labels[i] = Rf_mkCharCE(labels_[i].c_str(), CE_UTF8); } values.attr("names") = labels; out = values; } return out; } }; // DfReader ------------------------------------------------------------------ class DfReader { FileType type_; int nrows_, nrowsAlloc_; int ncols_; List output_; CharacterVector names_; bool user_na_; std::vector val_labels_; std::map label_sets_; std::vector var_types_; std::vector notes_; // If empty, assume all will be read in std::set colsOnly_; public: DfReader(FileType type, bool user_na = false): type_(type), nrows_(0), ncols_(0), user_na_(user_na) { } void restrictCols(const std::set& cols) { colsOnly_ = cols; } void setInfo(int obs_count, int var_count) { if (obs_count < 0) { // If unknown, start with 1e5, and use doubling strategy nrowsAlloc_ = 1e5; nrows_ = 0; } else { nrowsAlloc_ = nrows_ = obs_count; } ncols_ = colsOnly_.empty() ? 
var_count : colsOnly_.size(); output_ = List(ncols_); names_ = CharacterVector(ncols_); val_labels_.resize(ncols_); var_types_.resize(ncols_); } void setMetadata(const char *file_label, time_t timestamp, long format_version) { if (file_label != NULL && strcmp(file_label, "") != 0) { output_.attr("label") = CharacterVector::create(Rf_mkCharCE(file_label, CE_UTF8)); } } void setNote(int note_index, const char *note) { if (note != NULL && strcmp(note, "") != 0) { notes_.push_back(note); } } int createVariable(int index, readstat_variable_t *variable, const char *val_labels) { const char* name = readstat_variable_get_name(variable); if (!colsOnly_.empty() && colsOnly_.count(name) == 0) { return READSTAT_HANDLER_SKIP_VARIABLE; } int var_index = readstat_variable_get_index_after_skipping(variable); names_[var_index] = Rf_mkCharCE(name, CE_UTF8); switch(readstat_variable_get_type(variable)) { case READSTAT_TYPE_STRING_REF: case READSTAT_TYPE_STRING: output_[var_index] = CharacterVector(nrowsAlloc_); break; case READSTAT_TYPE_INT8: case READSTAT_TYPE_INT16: case READSTAT_TYPE_INT32: case READSTAT_TYPE_FLOAT: case READSTAT_TYPE_DOUBLE: output_[var_index] = NumericVector(nrowsAlloc_); break; } RObject col = output_[var_index]; const char* var_label = readstat_variable_get_label(variable); if (var_label != NULL && strcmp(var_label, "") != 0) { col.attr("label") = CharacterVector::create(Rf_mkCharCE(var_label, CE_UTF8)); } if (val_labels != NULL) val_labels_[var_index] = val_labels; const char* var_format = readstat_variable_get_format(variable); VarType var_type = numType(type_, var_format); // Rcout << name << ": " << var_format << " [" << var_type << "]\n"; var_types_[var_index] = var_type; switch(var_type) { case HAVEN_DATE: col.attr("class") = "Date"; break; case HAVEN_TIME: col.attr("class") = CharacterVector::create("hms", "difftime"); col.attr("units") = "secs"; break; case HAVEN_DATETIME: col.attr("class") = CharacterVector::create("POSIXct", "POSIXt"); 
      col.attr("tzone") = "UTC";
      break;
    default:
      break;
    }

    // User defined missing values
    int n_ranges = readstat_variable_get_missing_ranges_count(variable);
    if (user_na_ && n_ranges > 0) {
      switch(readstat_variable_get_type(variable)) {
      case READSTAT_TYPE_STRING_REF:
      case READSTAT_TYPE_STRING:
      {
        CharacterVector na_values(n_ranges);

        for (int i = 0; i < n_ranges; ++i) {
          readstat_value_t value = readstat_variable_get_missing_range_lo(variable, i);
          const char* str_value = readstat_string_value(value);
          // Store each missing value in its own slot (previously always slot 0)
          na_values[i] = str_value == NULL ? NA_STRING : Rf_mkCharCE(str_value, CE_UTF8);
        }

        col.attr("na_values") = na_values;
        break;
      }
      case READSTAT_TYPE_INT8:
      case READSTAT_TYPE_INT16:
      case READSTAT_TYPE_INT32:
      case READSTAT_TYPE_FLOAT:
      case READSTAT_TYPE_DOUBLE:
      {
        std::vector<double> na_values;
        NumericVector na_range(2);
        bool has_range = false;

        for (int i = 0; i < n_ranges; ++i) {
          readstat_value_t
            lo_value = readstat_variable_get_missing_range_lo(variable, i),
            hi_value = readstat_variable_get_missing_range_hi(variable, i);

          double
            lo = readstat_double_value(lo_value),
            hi = readstat_double_value(hi_value);

          if (lo == hi) {
            // Single value
            na_values.push_back(lo);
          } else {
            has_range = true;
            // Can only ever be one range
            na_range[0] = lo;
            na_range[1] = hi;
          }
        }

        if (na_values.size() > 0)
          col.attr("na_values") = Rcpp::wrap(na_values);
        if (has_range)
          col.attr("na_range") = na_range;
      }
      }

      col.attr("class") = CharacterVector::create("labelled_spss", "labelled");
    }

    // Store original format as attribute
    if (var_format != NULL && strcmp(var_format, "") != 0) {
      col.attr(formatAttribute(type_)) = Rf_ScalarString(Rf_mkCharCE(var_format, CE_UTF8));
    }

    // Store original display width as attribute if it differs from the default
    int display_width = readstat_variable_get_display_width(variable);
    if (type_ == HAVEN_SPSS && display_width != 8) {
      col.attr("display_width") = Rf_ScalarInteger(display_width);
    }

    return READSTAT_HANDLER_OK;
  }

  void setValue(int obs_index, readstat_variable_t *variable, readstat_value_t
                value) {
    int var_index = readstat_variable_get_index_after_skipping(variable);
    VarType var_type = var_types_[var_index];

    // Grow the columns with a doubling strategy when the row count is unknown
    if (obs_index >= nrowsAlloc_)
      resizeCols(nrowsAlloc_ * 2);

    if (obs_index >= nrows_)
      nrows_ = obs_index + 1;

    switch(value.type) {
    case READSTAT_TYPE_STRING_REF:
    case READSTAT_TYPE_STRING:
    {
      CharacterVector col = output_[var_index];

      const char* str_value = readstat_string_value(value);
      col[obs_index] = str_value == NULL ? NA_STRING : Rf_mkCharCE(str_value, CE_UTF8);
      break;
    }
    case READSTAT_TYPE_INT8:
    case READSTAT_TYPE_INT16:
    case READSTAT_TYPE_INT32:
    case READSTAT_TYPE_FLOAT:
    case READSTAT_TYPE_DOUBLE:
    {
      NumericVector col = output_[var_index];
      double val = haven_double_value_udm(value, variable, user_na_);
      col[obs_index] = adjustDatetimeToR(type_, var_type, val);
      break;
    }
    }
  }

  void setValueLabels(const char *val_labels, readstat_value_t value,
                      const char *label) {
    LabelSet& label_set = label_sets_[val_labels];
    std::string label_s(label);

    switch(value.type) {
    case READSTAT_TYPE_STRING:
      // Encoded to utf-8 on output
      label_set.add(readstat_string_value(value), label_s);
      break;
    case READSTAT_TYPE_INT8:
    case READSTAT_TYPE_INT16:
    case READSTAT_TYPE_INT32:
    case READSTAT_TYPE_DOUBLE:
      label_set.add(haven_double_value(value), label_s);
      break;
    default:
      // value.type is an enum, so print it as an integer rather than a string
      Rf_warning("Unsupported label type: %d", (int) value.type);
    }
  }

  bool hasLabel(int var_index) const {
    std::string label = val_labels_[var_index];
    if (label == "")
      return false;

    return label_sets_.count(label) > 0;
  }

  void resizeCols(int n) {
    // Rcout << "resizing to " << n << "\n";
    nrowsAlloc_ = n;

    for (int i = 0; i < ncols_; ++i) {
      Shield<SEXP> copy(Rf_lengthgets(output_[i], n));
      Rf_copyMostAttrib(output_[i], copy);
      output_[i] = copy;
    }
  }

  List output() {
    if (nrows_ != nrowsAlloc_)
      resizeCols(nrows_);

    for (int i = 0; i < output_.size(); ++i) {
      RObject col = output_[i];

      if (hasLabel(i)) {
        if (col.attr("class") == R_NilValue) {
          col.attr("class") = "labelled";
        }
        col.attr("labels") = label_sets_[val_labels_[i]].labels();
      }
    }

    int
        nNotes = notes_.size();
    if (nNotes > 0) {
      CharacterVector notes(nNotes);
      for (int i = 0; i < nNotes; ++i) {
        notes[i] = Rf_mkCharCE(notes_[i].c_str(), CE_UTF8);
      }
      // Attach the UTF-8 marked copy built above, not the raw std::string vector
      output_.attr("notes") = notes;
    }

    output_.attr("names") = names_;

    static Function as_tibble("as_tibble", Environment::namespace_env("tibble"));
    return as_tibble(output_);
  }

};

int dfreader_info(int obs_count, int var_count, void *ctx) {
  ((DfReader*) ctx)->setInfo(obs_count, var_count);
  return 0;
}
int dfreader_metadata(const char *file_label, const char *orig_encoding,
                      time_t timestamp, long format_version, void *ctx) {
  ((DfReader*) ctx)->setMetadata(file_label, timestamp, format_version);
  return 0;
}
int dfreader_note(int note_index, const char *note, void *ctx) {
  ((DfReader*) ctx)->setNote(note_index, note);
  return 0;
}
int dfreader_variable(int index, readstat_variable_t *variable,
                      const char *val_labels, void *ctx) {
  return ((DfReader*) ctx)->createVariable(index, variable, val_labels);
}
int dfreader_value(int obs_index, readstat_variable_t *variable,
                   readstat_value_t value, void *ctx) {
  // Check for user interrupts every 10,000 rows or cols
  if ((obs_index + 1) % 10000 == 0 || (variable->index + 1) % 10000 == 0)
    checkUserInterrupt();

  ((DfReader*) ctx)->setValue(obs_index, variable, value);
  return 0;
}
int dfreader_value_label(const char *val_labels, readstat_value_t value,
                         const char *label, void *ctx) {
  ((DfReader*) ctx)->setValueLabels(val_labels, value, label);
  return 0;
}

void print_error(const char* error_message, void* ctx) {
  Rcout << error_message << "\n";
}

// IO handling -----------------------------------------------------------

class DfReaderInput {
public:
  virtual ~DfReaderInput() {};

  virtual int open(void* io_ctx) = 0;
  virtual int close(void* io_ctx) = 0;
  virtual readstat_off_t seek(readstat_off_t offset, readstat_io_flags_t whence, void *io_ctx) = 0;
  virtual ssize_t read(void *buf, size_t nbyte, void *io_ctx) = 0;
};

template <typename Stream>
class DfReaderInputStream : public DfReaderInput {
protected:
Stream file_; public: readstat_off_t seek(readstat_off_t offset, readstat_io_flags_t whence, void *io_ctx) { std::ios_base::seekdir dir; switch(whence) { case READSTAT_SEEK_SET: dir = file_.beg; break; case READSTAT_SEEK_CUR: dir = file_.cur; break; case READSTAT_SEEK_END: default: dir = file_.end; break; } file_.seekg(offset, dir); return file_.tellg(); // returns -1 if failed } ssize_t read(void *buf, size_t nbyte, void *io_ctx) { file_.read((char*) buf, nbyte); return (file_.good() || file_.eof()) ? file_.gcount() : -1; } }; class DfReaderInputFile : public DfReaderInputStream { std::string filename_; public: DfReaderInputFile(Rcpp::List spec) { filename_ = as(spec[0]); } int open(void* io_ctx) { file_.open(filename_.c_str(), std::ifstream::binary); return file_.is_open() ? 0 : -1; } int close(void* io_ctx) { file_.close(); return file_.is_open() ? -1 : 0; } }; class DfReaderInputRaw : public DfReaderInputStream { public: DfReaderInputRaw(Rcpp::List spec) { Rcpp::RawVector raw_data(spec[0]); std::string string_data((char*) RAW(raw_data), Rf_length(raw_data)); file_.str(string_data); } int open(void* io_ctx) { return 0; } int close(void* io_ctx) { return 0; } }; int dfreader_open(const char* path, void *io_ctx) { return ((DfReaderInput*) io_ctx)->open(io_ctx); } int dfreader_close(void *io_ctx) { return ((DfReaderInput*) io_ctx)->close(io_ctx); } readstat_off_t dfreader_seek(readstat_off_t offset, readstat_io_flags_t whence, void* io_ctx) { return ((DfReaderInput*) io_ctx)->seek(offset, whence, io_ctx); } ssize_t dfreader_read(void* buf, size_t nbyte, void* io_ctx) { return ((DfReaderInput*) io_ctx)->read(buf, nbyte, io_ctx); } readstat_error_t dfreader_update(long file_size, readstat_progress_handler progress_handler, void *user_ctx, void *io_ctx) { return READSTAT_OK; } // Parser wrappers ------------------------------------------------------------- readstat_parser_t* haven_init_parser(std::string encoding = "") { readstat_parser_t* parser = 
readstat_parser_init(); readstat_set_info_handler(parser, dfreader_info); readstat_set_metadata_handler(parser, dfreader_metadata); readstat_set_note_handler(parser, dfreader_note); readstat_set_variable_handler(parser, dfreader_variable); readstat_set_value_handler(parser, dfreader_value); readstat_set_value_label_handler(parser, dfreader_value_label); readstat_set_error_handler(parser, print_error); if (encoding != "") { readstat_set_file_character_encoding(parser, encoding.c_str()); } return parser; } template void haven_init_io(readstat_parser_t* parser, InputClass &builder_input) { readstat_set_open_handler(parser, dfreader_open); readstat_set_close_handler(parser, dfreader_close); readstat_set_seek_handler(parser, dfreader_seek); readstat_set_read_handler(parser, dfreader_read); readstat_set_update_handler(parser, dfreader_update); readstat_set_io_ctx(parser, (void*) &builder_input); } std::string haven_error_message(Rcpp::List spec) { std::string source_class(as(spec.attr("class"))[0]); if (source_class == "source_raw") return "file"; else return as(spec[0]); } template List df_parse_spss(Rcpp::List spec, bool user_na = false, bool por = false) { DfReader builder(HAVEN_SPSS, user_na); InputClass builder_input(spec); readstat_parser_t* parser = haven_init_parser(); haven_init_io(parser, builder_input); readstat_error_t result; if (por) { result = readstat_parse_por(parser, "", &builder); } else { result = readstat_parse_sav(parser, "", &builder); } readstat_parser_free(parser); if (result != 0) { stop("Failed to parse %s: %s.", haven_error_message(spec), readstat_error_message(result)); } return builder.output(); } template List df_parse_dta(Rcpp::List spec, std::string encoding = "") { DfReader builder(HAVEN_STATA); InputClass builder_input(spec); readstat_parser_t* parser = haven_init_parser(encoding); haven_init_io(parser, builder_input); readstat_error_t result = readstat_parse_dta(parser, "", &builder); readstat_parser_free(parser); if (result != 0) { 
stop("Failed to parse %s: %s.", haven_error_message(spec), readstat_error_message(result)); } return builder.output(); } template List df_parse_xpt(Rcpp::List spec, std::string encoding = "") { DfReader builder(HAVEN_XPT); InputClass builder_input(spec); readstat_parser_t* parser = haven_init_parser(encoding); haven_init_io(parser, builder_input); readstat_error_t result = readstat_parse_xport(parser, "", &builder); readstat_parser_free(parser); if (result != 0) { stop("Failed to parse %s: %s.", haven_error_message(spec), readstat_error_message(result)); } return builder.output(); } template List df_parse_sas(Rcpp::List spec_b7dat, Rcpp::List spec_b7cat, std::string encoding, std::string catalog_encoding, std::vector cols_only) { DfReader builder(HAVEN_SAS); if (!cols_only.empty()) { std::set cols_set(cols_only.begin(), cols_only.end()); builder.restrictCols(cols_set); } InputClass builder_input_dat(spec_b7dat); readstat_parser_t* parser = haven_init_parser(); haven_init_io(parser, builder_input_dat); if (spec_b7cat.size() != 0) { InputClass builder_input_cat(spec_b7cat); readstat_set_io_ctx(parser, (void*) &builder_input_cat); if (catalog_encoding != "") { readstat_set_file_character_encoding(parser, catalog_encoding.c_str()); } readstat_error_t result = readstat_parse_sas7bcat(parser, "", &builder); if (result != 0) { readstat_parser_free(parser); stop("Failed to parse %s: %s.", haven_error_message(spec_b7cat), readstat_error_message(result)); } } readstat_set_io_ctx(parser, (void*) &builder_input_dat); if (encoding != "") { readstat_set_file_character_encoding(parser, encoding.c_str()); } readstat_error_t result = readstat_parse_sas7bdat(parser, "", &builder); readstat_parser_free(parser); if (result != 0) { stop("Failed to parse %s: %s.", haven_error_message(spec_b7dat), readstat_error_message(result)); } return builder.output(); } // # nocov start // [[Rcpp::export]] List df_parse_sas_file(Rcpp::List spec_b7dat, Rcpp::List spec_b7cat, std::string encoding, 
std::string catalog_encoding, std::vector cols_only) { return df_parse_sas(spec_b7dat, spec_b7cat, encoding, catalog_encoding, cols_only); } // [[Rcpp::export]] List df_parse_sas_raw(Rcpp::List spec_b7dat, Rcpp::List spec_b7cat, std::string encoding, std::string catalog_encoding, std::vector cols_only) { return df_parse_sas(spec_b7dat, spec_b7cat, encoding, catalog_encoding, cols_only); } // [[Rcpp::export]] List df_parse_xpt_file(Rcpp::List spec) { return df_parse_xpt(spec); } // [[Rcpp::export]] List df_parse_xpt_raw(Rcpp::List spec) { return df_parse_xpt(spec); } // [[Rcpp::export]] List df_parse_dta_file(Rcpp::List spec, std::string encoding) { return df_parse_dta(spec, encoding); } // [[Rcpp::export]] List df_parse_dta_raw(Rcpp::List spec, std::string encoding) { return df_parse_dta(spec, encoding); } // [[Rcpp::export]] List df_parse_sav_file(Rcpp::List spec, bool user_na) { return df_parse_spss(spec, user_na, false); } // [[Rcpp::export]] List df_parse_sav_raw(Rcpp::List spec, bool user_na) { return df_parse_spss(spec, user_na, false); } // [[Rcpp::export]] List df_parse_por_file(Rcpp::List spec, bool user_na) { return df_parse_spss(spec, user_na, true); } // [[Rcpp::export]] List df_parse_por_raw(Rcpp::List spec, bool user_na) { return df_parse_spss(spec, user_na, true); } // # nocov end haven/src/tagged_na.c0000644000176200001440000000635513227731765014245 0ustar liggesusers#define R_NO_REMAP #include #include #include // Scalar operators ------------------------------------------------------- // IEEE 754 defines binary64 as // * 1 bit : sign // * 11 bits: exponent // * 52 bits: significand // // R stores the value "1954" in the last 32 bits: this payload marks // the value as a NA, not a regular NaN. 
//
// (Note that this discussion, like most discussions of floating point on the
// web, assumes a big-endian architecture; on little-endian machines the sign
// bit is in the last byte)

typedef union {
  double value;           // 8 bytes
  char byte[8];           // 8 * 1 bytes
} ieee_double;


#ifdef WORDS_BIGENDIAN
// First two bytes are sign & exponent
// Last four bytes are 1954
const int TAG_BYTE = 3;
#else
const int TAG_BYTE = 4;
#endif

double make_tagged_na(char x) {
  ieee_double y;

  y.value = NA_REAL;
  y.byte[TAG_BYTE] = x;

  return y.value;
}

char tagged_na_value(double x) {
  ieee_double y;
  y.value = x;

  return y.byte[TAG_BYTE];
}

char first_char(SEXP x) {
  if (TYPEOF(x) != CHARSXP)
    return '\0';

  if (x == NA_STRING)
    return '\0';

  return CHAR(x)[0];
}

// Vectorised wrappers -----------------------------------------------------

SEXP tagged_na_(SEXP x) {
  if (TYPEOF(x) != STRSXP)
    Rf_errorcall(R_NilValue, "`x` must be a character vector");

  int n = Rf_length(x);
  SEXP out = PROTECT(Rf_allocVector(REALSXP, n));

  for (int i = 0; i < n; ++i) {
    char xi = first_char(STRING_ELT(x, i));
    REAL(out)[i] = make_tagged_na(xi);
  }

  UNPROTECT(1);
  return out;
}

SEXP na_tag_(SEXP x) {
  if (TYPEOF(x) != REALSXP)
    Rf_errorcall(R_NilValue, "`x` must be a double vector");

  int n = Rf_length(x);
  SEXP out = PROTECT(Rf_allocVector(STRSXP, n));

  for (int i = 0; i < n; ++i) {
    double xi = REAL(x)[i];

    if (!isnan(xi)) {
      SET_STRING_ELT(out, i, NA_STRING);
    } else {
      char tag = tagged_na_value(xi);
      if (tag == '\0') {
        SET_STRING_ELT(out, i, NA_STRING);
      } else {
        SET_STRING_ELT(out, i, Rf_mkCharLenCE(&tag, 1, CE_UTF8));
      }
    }
  }

  UNPROTECT(1);
  return out;
}

SEXP falses(int n) {
  SEXP out = PROTECT(Rf_allocVector(LGLSXP, n));
  for (int i = 0; i < n; ++i)
    LOGICAL(out)[i] = 0;

  UNPROTECT(1);
  return out;
}

SEXP is_tagged_na_(SEXP x, SEXP tag_) {
  if (TYPEOF(x) != REALSXP) {
    return falses(Rf_length(x));
  }

  bool has_tag;
  char check_tag;
  if (TYPEOF(tag_) == NILSXP) {
    has_tag = false;
    check_tag = '\0';
  } else if (TYPEOF(tag_) == STRSXP) {
    if (Rf_length(tag_) != 1)
Rf_errorcall(R_NilValue, "`tag` must be a character vector of length 1"); has_tag = true; check_tag = first_char(STRING_ELT(tag_, 0)); } else { Rf_errorcall(R_NilValue, "`tag` must be NULL or a character vector"); } int n = Rf_length(x); SEXP out = PROTECT(Rf_allocVector(LGLSXP, n)); for (int i = 0; i < n; ++i) { double xi = REAL(x)[i]; if (!isnan(xi)) { LOGICAL(out)[i] = false; } else { char tag = tagged_na_value(xi); if (tag == '\0') { LOGICAL(out)[i] = false; } else { if (has_tag) { LOGICAL(out)[i] = tag == check_tag; } else { LOGICAL(out)[i] = true; } } } } UNPROTECT(1); return out; } haven/src/Makevars.win0000644000176200001440000000025113227731765014445 0ustar liggesusersinclude Makevars # This is also defined in Makevars, but somehow the definition from there is not used OBJECTS = $(CFILES:.c=.o) $(CPPFILES:.cpp=.o) PKG_LIBS=-lRiconv haven/src/RcppExports.cpp0000644000176200001440000002157113227731765015162 0ustar liggesusers// Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include "haven_types.h" #include using namespace Rcpp; // df_parse_sas_file List df_parse_sas_file(Rcpp::List spec_b7dat, Rcpp::List spec_b7cat, std::string encoding, std::string catalog_encoding, std::vector cols_only); RcppExport SEXP _haven_df_parse_sas_file(SEXP spec_b7datSEXP, SEXP spec_b7catSEXP, SEXP encodingSEXP, SEXP catalog_encodingSEXP, SEXP cols_onlySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec_b7dat(spec_b7datSEXP); Rcpp::traits::input_parameter< Rcpp::List >::type spec_b7cat(spec_b7catSEXP); Rcpp::traits::input_parameter< std::string >::type encoding(encodingSEXP); Rcpp::traits::input_parameter< std::string >::type catalog_encoding(catalog_encodingSEXP); Rcpp::traits::input_parameter< std::vector >::type cols_only(cols_onlySEXP); rcpp_result_gen = Rcpp::wrap(df_parse_sas_file(spec_b7dat, spec_b7cat, 
encoding, catalog_encoding, cols_only)); return rcpp_result_gen; END_RCPP } // df_parse_sas_raw List df_parse_sas_raw(Rcpp::List spec_b7dat, Rcpp::List spec_b7cat, std::string encoding, std::string catalog_encoding, std::vector cols_only); RcppExport SEXP _haven_df_parse_sas_raw(SEXP spec_b7datSEXP, SEXP spec_b7catSEXP, SEXP encodingSEXP, SEXP catalog_encodingSEXP, SEXP cols_onlySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec_b7dat(spec_b7datSEXP); Rcpp::traits::input_parameter< Rcpp::List >::type spec_b7cat(spec_b7catSEXP); Rcpp::traits::input_parameter< std::string >::type encoding(encodingSEXP); Rcpp::traits::input_parameter< std::string >::type catalog_encoding(catalog_encodingSEXP); Rcpp::traits::input_parameter< std::vector >::type cols_only(cols_onlySEXP); rcpp_result_gen = Rcpp::wrap(df_parse_sas_raw(spec_b7dat, spec_b7cat, encoding, catalog_encoding, cols_only)); return rcpp_result_gen; END_RCPP } // df_parse_xpt_file List df_parse_xpt_file(Rcpp::List spec); RcppExport SEXP _haven_df_parse_xpt_file(SEXP specSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec(specSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_xpt_file(spec)); return rcpp_result_gen; END_RCPP } // df_parse_xpt_raw List df_parse_xpt_raw(Rcpp::List spec); RcppExport SEXP _haven_df_parse_xpt_raw(SEXP specSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec(specSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_xpt_raw(spec)); return rcpp_result_gen; END_RCPP } // df_parse_dta_file List df_parse_dta_file(Rcpp::List spec, std::string encoding); RcppExport SEXP _haven_df_parse_dta_file(SEXP specSEXP, SEXP encodingSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type 
spec(specSEXP); Rcpp::traits::input_parameter< std::string >::type encoding(encodingSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_dta_file(spec, encoding)); return rcpp_result_gen; END_RCPP } // df_parse_dta_raw List df_parse_dta_raw(Rcpp::List spec, std::string encoding); RcppExport SEXP _haven_df_parse_dta_raw(SEXP specSEXP, SEXP encodingSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec(specSEXP); Rcpp::traits::input_parameter< std::string >::type encoding(encodingSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_dta_raw(spec, encoding)); return rcpp_result_gen; END_RCPP } // df_parse_sav_file List df_parse_sav_file(Rcpp::List spec, bool user_na); RcppExport SEXP _haven_df_parse_sav_file(SEXP specSEXP, SEXP user_naSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec(specSEXP); Rcpp::traits::input_parameter< bool >::type user_na(user_naSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_sav_file(spec, user_na)); return rcpp_result_gen; END_RCPP } // df_parse_sav_raw List df_parse_sav_raw(Rcpp::List spec, bool user_na); RcppExport SEXP _haven_df_parse_sav_raw(SEXP specSEXP, SEXP user_naSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec(specSEXP); Rcpp::traits::input_parameter< bool >::type user_na(user_naSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_sav_raw(spec, user_na)); return rcpp_result_gen; END_RCPP } // df_parse_por_file List df_parse_por_file(Rcpp::List spec, bool user_na); RcppExport SEXP _haven_df_parse_por_file(SEXP specSEXP, SEXP user_naSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec(specSEXP); Rcpp::traits::input_parameter< bool >::type user_na(user_naSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_por_file(spec, 
user_na)); return rcpp_result_gen; END_RCPP } // df_parse_por_raw List df_parse_por_raw(Rcpp::List spec, bool user_na); RcppExport SEXP _haven_df_parse_por_raw(SEXP specSEXP, SEXP user_naSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::List >::type spec(specSEXP); Rcpp::traits::input_parameter< bool >::type user_na(user_naSEXP); rcpp_result_gen = Rcpp::wrap(df_parse_por_raw(spec, user_na)); return rcpp_result_gen; END_RCPP } // write_sav_ void write_sav_(List data, std::string path); RcppExport SEXP _haven_write_sav_(SEXP dataSEXP, SEXP pathSEXP) { BEGIN_RCPP Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type data(dataSEXP); Rcpp::traits::input_parameter< std::string >::type path(pathSEXP); write_sav_(data, path); return R_NilValue; END_RCPP } // write_dta_ void write_dta_(List data, std::string path, int version); RcppExport SEXP _haven_write_dta_(SEXP dataSEXP, SEXP pathSEXP, SEXP versionSEXP) { BEGIN_RCPP Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type data(dataSEXP); Rcpp::traits::input_parameter< std::string >::type path(pathSEXP); Rcpp::traits::input_parameter< int >::type version(versionSEXP); write_dta_(data, path, version); return R_NilValue; END_RCPP } // write_sas_ void write_sas_(List data, std::string path); RcppExport SEXP _haven_write_sas_(SEXP dataSEXP, SEXP pathSEXP) { BEGIN_RCPP Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type data(dataSEXP); Rcpp::traits::input_parameter< std::string >::type path(pathSEXP); write_sas_(data, path); return R_NilValue; END_RCPP } // write_xpt_ void write_xpt_(List data, std::string path, int version); RcppExport SEXP _haven_write_xpt_(SEXP dataSEXP, SEXP pathSEXP, SEXP versionSEXP) { BEGIN_RCPP Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type data(dataSEXP); Rcpp::traits::input_parameter< std::string >::type path(pathSEXP); 
Rcpp::traits::input_parameter< int >::type version(versionSEXP); write_xpt_(data, path, version); return R_NilValue; END_RCPP } RcppExport SEXP is_tagged_na_(SEXP, SEXP); RcppExport SEXP na_tag_(SEXP); RcppExport SEXP tagged_na_(SEXP); static const R_CallMethodDef CallEntries[] = { {"_haven_df_parse_sas_file", (DL_FUNC) &_haven_df_parse_sas_file, 5}, {"_haven_df_parse_sas_raw", (DL_FUNC) &_haven_df_parse_sas_raw, 5}, {"_haven_df_parse_xpt_file", (DL_FUNC) &_haven_df_parse_xpt_file, 1}, {"_haven_df_parse_xpt_raw", (DL_FUNC) &_haven_df_parse_xpt_raw, 1}, {"_haven_df_parse_dta_file", (DL_FUNC) &_haven_df_parse_dta_file, 2}, {"_haven_df_parse_dta_raw", (DL_FUNC) &_haven_df_parse_dta_raw, 2}, {"_haven_df_parse_sav_file", (DL_FUNC) &_haven_df_parse_sav_file, 2}, {"_haven_df_parse_sav_raw", (DL_FUNC) &_haven_df_parse_sav_raw, 2}, {"_haven_df_parse_por_file", (DL_FUNC) &_haven_df_parse_por_file, 2}, {"_haven_df_parse_por_raw", (DL_FUNC) &_haven_df_parse_por_raw, 2}, {"_haven_write_sav_", (DL_FUNC) &_haven_write_sav_, 2}, {"_haven_write_dta_", (DL_FUNC) &_haven_write_dta_, 3}, {"_haven_write_sas_", (DL_FUNC) &_haven_write_sas_, 2}, {"_haven_write_xpt_", (DL_FUNC) &_haven_write_xpt_, 3}, {"is_tagged_na_", (DL_FUNC) &is_tagged_na_, 2}, {"na_tag_", (DL_FUNC) &na_tag_, 1}, {"tagged_na_", (DL_FUNC) &tagged_na_, 1}, {NULL, NULL, 0} }; RcppExport void R_init_haven(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } haven/src/tagged_na.h0000644000176200001440000000027113227731765014241 0ustar liggesusers#ifndef __TAGGED_NA__ #define __TAGGED_NA__ #ifdef __cplusplus extern "C" { #endif double make_tagged_na(char x); char tagged_na_value(double x); #ifdef __cplusplus } #endif #endif haven/NAMESPACE0000644000176200001440000000242013165724163012600 0ustar liggesusers# Generated by roxygen2: do not edit by hand S3method("[",labelled) S3method(as.data.frame,labelled) S3method(as_factor,data.frame) S3method(as_factor,labelled) 
S3method(is.na,labelled_spss) S3method(print,labelled) S3method(print,labelled_spss) S3method(type_sum,labelled) S3method(zap_formats,data.frame) S3method(zap_formats,default) S3method(zap_labels,data.frame) S3method(zap_labels,default) S3method(zap_labels,labelled) S3method(zap_labels,labelled_spss) S3method(zap_missing,data.frame) S3method(zap_missing,default) S3method(zap_missing,labelled) S3method(zap_missing,labelled_spss) S3method(zap_widths,data.frame) S3method(zap_widths,default) export(as_factor) export(format_tagged_na) export(is.labelled) export(is_tagged_na) export(labelled) export(labelled_spss) export(na_tag) export(print_labels) export(print_tagged_na) export(read_dta) export(read_por) export(read_sas) export(read_sav) export(read_spss) export(read_stata) export(read_xpt) export(tagged_na) export(write_dta) export(write_sas) export(write_sav) export(write_xpt) export(zap_empty) export(zap_formats) export(zap_labels) export(zap_missing) export(zap_widths) importFrom(Rcpp,sourceCpp) importFrom(forcats,as_factor) importFrom(tibble,tibble) importFrom(tibble,type_sum) useDynLib(haven, .registration = TRUE) haven/NEWS.md0000644000176200001440000002175013227706175012470 0ustar liggesusers
# haven 1.1.1

* Update to latest readstat. Includes:

    * SPSS: empty character columns now read as character (#311)
    * SPSS: now write long strings (#266)
    * Stata: reorder labelled vectors on write (#327)
    * Stata: `encoding` now affects value labels (#325)
    * SAS: can now write wide/long rows (#272, #335).
* SAS: can now handle Windows Vietnamese character set (#336) * `read_por()` and `read_xpt()` now correctly preserve attributes if output needs to be reallocated (which is typical behaviour) (#313) * `read_sas()` recognises date/times format with trailing separator and width specifications (#324) * `read_sas()` gains a `catalog_encoding` argument so you can independently specify encoding of data and catalog (#312) * `write_*()` correctly measures lengths of non-ASCII labels (#258): this fixes the cryptic error "A provided string value was longer than the available storage size of the specified column." * `write_dta()` now checks for bad labels in all columns, not just the first (#326). * `write_sav()` no longer fails on empty factors or factors with an `NA` level (#301) and writes out more metadata for `labelled_spss` vectors (#334). # haven 1.1.0 * Update to latest readstat. Includes: * SAS: support Win baltic code page (#231) * SAS: better error messages instead of crashes (#234, #270) * SAS: fix "unable to read error" (#271) * SPSS: support uppercase time stamps (#230) * SPSS: fixes for 252-255 byte strings (#226) * SPSS: fixes for 0 byte strings (#245) * Share `as_factor()` with forcats package (#256) * `read_sav()` once again correctly returns system defined missings as `NA` (rather than `NaN`) (#223). `read_sav()` and `write_sav()` preserve SPSS's display widths (@ecortens). * `read_sas()` gains experimental `cols_only` argument to only read in specified columns (#248). * tibbles are created with `tibble::as_tibble()`, rather than by "hand" (#229). * `write_sav()` checks that factors don't have levels with >120 characters (#262) * `write_dta()` no longer checks that all value labels are at most 32 characters (since this is not a restriction of dta files) (#239). * All write methds now check that you're trying to write a data frame (#287). * Add support for reading (`read_xpt()`) and writing (`write_xpt()`) SAS transport files. 
* `write_*` functions turn ordered factors into labelled vectors (#285) # haven 1.0.0 * The ReadStat library is stored in a subdirectory of `src` (#209, @krlmlr). * Import tibble so that tibbles are printed consistently (#154, @krlmlr). * Update to latest ReadStat (#65). Includes: * Support for binary (aka Ross) compression for SAS (#31). * Support extended ASCII encoding for Stata (#71). * Support for Stata 14 files (#75, #212). * Support for SPSS value labels with more than 8 characters (#157). * More likely to get an error when attempting to create an invalid output file (#171). * Added support for reading and writing variable formats. Similarly to to variable labels, formats are stored as an attribute on the vector. Use `zap_formats()` if you want to remove these attributes. (@gorcha, #119, #123). * Added support for reading file "label" and "notes". These are not currently printed, but are stored in the attributes if you need to access them (#186). * Added support for "tagged" missing values (in Stata these are called "extended" and in SAS these are called "special") which carry an extra byte of information: a character label from "a" to "z". The downside of this change is that all integer columns are now converted to doubles, to support the encoding of the tag in the payload of a NaN. * New `labelled_spss()` is a subclass of `labelled()` that can model user missing values from SPSS. These can either be a set of distinct values, or for numeric vectors, a range. `zap_labels()` strips labels, and replaces user-defined missing values with `NA`. New `zap_missing()` just replaces user-defined missing vlaues with `NA`. `labelled_spss()` is potentially dangerous to work with in R because base functions don't know about `labelled_spss()` functions so will return the wrong result in the presence of user-defined missing values. For this reason, they will only be created by `read_spss()` when `user_na = TRUE` (normally user-defined missings are converted to NA). 
* `as_factor()` no longer drops the `label` attribute (variable label) when
  used (#177, @itsdalmo).

* Using `as_factor()` with `levels = "default"` or `levels = "both"` preserves
  unused labels (implicit missing) when converting (#172, @itsdalmo). Labels
  (and the resulting factor levels) are always sorted by values.

* `as_factor()` gains a new `levels = "default"` mechanism. This uses the
  labels where present, and otherwise uses the values. This is now the
  default, as it seems to map better to the semantics of labelled values in
  other statistical packages (#81). You can also use `levels = "both"` to
  combine the value and the label into a single string (#82). It also gains a
  method for data frames, so you can easily convert every labelled column to a
  factor in one function call.

* New `vignette("semantics", package = "haven")` discusses the semantics of
  missing values and labelling in SAS, SPSS, and Stata, and how they are
  translated into R.

* Support for `hms()` has been moved into the hms package (#162).
  Time variables now have class `c("hms", "difftime")` and a `units` attribute
  with value "secs" (#162).

* `labelled()` is less strict with its checks: you can mix double and integer
  values and labels (#86, #110, @lionel-), and `is.labelled()` is now exported
  (#124). Putting a labelled vector in a data frame now generates the correct
  column name (#193).

* `read_dta()` now recognises "%d" and custom date types (#80, #130).
  It also gains an encoding parameter which you can use to override the
  default encoding. This is particularly useful for Stata 13 and below, which
  did not store the encoding used in the file (#163).

* `read_por()` now actually works (#35).

* `read_sav()` now correctly recognises EDATE and JDATE formats as dates (#72).
  Variables with format DATE, ADATE, EDATE, JDATE or SDATE are imported as
  `Date` variables instead of `POSIXct`. You can now set `user_na = TRUE` to
  preserve user-defined missing values: they will be given class
  `labelled_spss`.

* `read_dta()`, `read_sas()`, and `read_sav()` have a better test for missing string values (#79). They can all read from connections and compressed files (@lionel-, #109) * `read_sas()` gains an encoding parameter to overide the encoding stored in the file if it is incorrect (#176). It gets better argument names (#214). * Added `type_sum()` method for labelled objects so they print nicely in tibbles. * `write_dta()` now verifies that variable names are valid Stata variables (#132), and throws an error if you attempt to save a labelled vector that is not an integer (#144). You can choose which `version` of Stata's file format to output (#217). * New `write_sas()` allows you to write data frames out to `sas7bdat` files. This is still somewhat experimental. * `write_sav()` writes hms variables to SPSS time variables, and the "measure" type is set for each variable (#133). * `write_dta()` and `write_sav()` support writing date and date/times (#25, #139, #145). Labelled values are always converted to UTF-8 before being written out (#87). Infinite values are now converted to missing values since SPSS and Stata don't support them (#149). Both use a better test for missing values (#70). * `zap_labels()` has been completely overhauled. It now works (@markriseley, #69), and only drops label attributes; it no longer replaces labelled values with `NA`s. It also gains a data frame method that zaps the labels from every column. * `print.labelled()` and `print.labelled_spss()` now display the type. # haven 0.2.0 * fixed a bug in `as_factor.labelled`, which generated 's and wrong labels for integer labels. * `zap_labels()` now leaves unlabelled vectors unchanged, making it easier to apply to all columns. * `write_dta()` and `write_sav()` take more care to always write output as UTF-8 (#36) * `write_dta()` and `write_sav()` won't crash if you give them invalid paths, and you can now use `~` to refer to your home directory (#37). 
* Byte variables are now correctly read into integers (not strings, #45), and missing values are captured correctly (#43). * Added `read_stata()` as alias to `read_dta()` (#52). * `read_spss()` uses extension to automatically choose between `read_sav()` and `read_por()` (#53) * Updates from ReadStat. Including fixes for various parsing bugs, more encodings, and better support for large files. * hms objects deal better with missings when printing. * Fixed bug causing labels for numeric variables to be read in as integers and associated error: ``Error: `x` and `labels` must be same type`` # haven 0.1.1 * Fixed memory initialisation problems found by valgrind. haven/R/0000755000176200001440000000000013227731765011571 5ustar liggesusershaven/R/zap_missing.R0000644000176200001440000000223212743423326014227 0ustar liggesusers#' Zap special missings to regular R missings #' #' This is useful if you want to convert tagged missing values from SAS or #' Stata, or user-defined missings from SPSS, to regular R \code{NA}. 
#' #' @param x A vector or data frame #' @export #' @examples #' x1 <- labelled( #' c(1, 5, tagged_na("a", "b")), #' c(Unknown = tagged_na("a"), Refused = tagged_na("b")) #' ) #' x1 #' zap_missing(x1) #' #' x2 <- labelled_spss( #' c(1, 2, 1, 99), #' c(missing = 99), #' na_value = 99 #' ) #' x2 #' zap_missing(x2) #' #' # You can also apply to data frames #' df <- tibble::data_frame(x1, x2, y = 4:1) #' df #' zap_missing(df) zap_missing <- function(x) { UseMethod("zap_missing") } #' @export zap_missing.default <- function(x) { x } #' @export zap_missing.labelled <- function(x) { x[is.na(x)] <- NA labels <- attr(x, "labels") labels <- labels[!is.na(labels)] attr(x, "labels") <- labels x } #' @export zap_missing.labelled_spss <- function(x) { is.na(x) <- is.na(x) attr(x, "na_values") <- NULL attr(x, "na_range") <- NULL class(x) <- "labelled" x } #' @export zap_missing.data.frame <- function(x) { x[] <- lapply(x, zap_missing) x } haven/R/utils.R0000644000176200001440000000016313227417055013045 0ustar liggesusersmax_level_length <- function(x) { if (!is.factor(x)) return(0L) max(0L, nchar(levels(x)), na.rm = TRUE) } haven/R/as_factor.R0000644000176200001440000000606713224443423013652 0ustar liggesusers#' Convert input to a factor. #' #' The base function \code{as.factor()} is not a generic, but this variant #' is. Methods are provided for factors, character vectors, labelled #' vectors, and data frames. By default, when applied to a data frame, #' it only affects \code{labelled} columns. #' #' @param x Object to coerce to a factor. #' @param ... Other arguments passed down to method. #' @param only_labelled Only apply to labelled columns? 
#' @export
#' @examples
#' x <- labelled(sample(5, 10, replace = TRUE), c(Bad = 1, Good = 5))
#'
#' # Default method uses labels where available, otherwise the values
#' as_factor(x)
#' # You can also extract just the labels
#' as_factor(x, "labels")
#' # Or just the values
#' as_factor(x, "values")
#' # Or combine value and label
#' as_factor(x, "both")
#' @importFrom forcats as_factor
#' @export
#' @name as_factor
NULL

#' @rdname as_factor
#' @export
as_factor.data.frame <- function(x, ..., only_labelled = TRUE) {
  if (only_labelled) {
    labelled <- vapply(x, is.labelled, logical(1))
    x[labelled] <- lapply(x[labelled], as_factor)
  } else {
    x[] <- lapply(x, as_factor)
  }

  x
}

#' @param ordered If \code{TRUE} create an ordered (ordinal) factor, if
#'   \code{FALSE} (the default) create a regular (nominal) factor.
#' @param levels How to create the levels of the generated factor:
#'
#'   \itemize{
#'   \item "default": uses labels where available, otherwise the values.
#'     Labels are sorted by value.
#'   \item "both": like "default", but pastes together the value and the label
#'   \item "labels": use only the labels; unlabelled values become \code{NA}
#'   \item "values": use only the values
#'   }
#' @rdname as_factor
#' @export
as_factor.labelled <- function(x, levels = c("default", "labels", "values", "both"),
                               ordered = FALSE, ...)
{ levels <- match.arg(levels) label <- attr(x, "label", exact = TRUE) labels <- attr(x, "labels") if (levels == "default" || levels == "both") { if (levels == "both") { names(labels) <- paste0("[", labels, "] ", names(labels)) } # Replace each value with its label vals <- unique(x) levs <- replace_with(vals, unname(labels), names(labels)) # Ensure all labels are preserved levs <- sort(c(stats::setNames(vals, levs), labels), na.last = TRUE) levs <- unique(names(levs)) x <- replace_with(x, unname(labels), names(labels)) x <- factor(x, levels = levs, ordered = ordered) } else { levs <- unname(labels) labs <- switch(levels, labels = names(labels), values = levs ) x <- factor(x, levs, labels = labs, ordered = ordered) } structure(x, label = label) } replace_with <- function(x, from, to) { stopifnot(length(from) == length(to)) out <- x # First replace regular values matches <- match(x, from, incomparables = NA) out[!is.na(matches)] <- to[matches[!is.na(matches)]] # Then tagged missing values tagged <- is_tagged_na(x) if (!any(tagged)) { return(out) } matches <- match(na_tag(x), na_tag(from), incomparables = NA) out[!is.na(matches)] <- to[matches[!is.na(matches)]] out } haven/R/zap_formats.R0000644000176200001440000000127112743423326014233 0ustar liggesusers#' Remove format attributes #' #' To provide some mild support for round-tripping variables between Stata/SPSS #' and R, haven stores variable formats in an attribute: \code{format.stata}, #' \code{format.spss}, or \code{format.sas}. If this causes problems for your #' code, you can get rid of them with \code{zap_formats}. #' #' @param x A vector or data frame. 
#' @family zappers #' @export zap_formats <- function(x) { UseMethod("zap_formats") } #' @export zap_formats.default <- function(x) { attr(x, "format.spss") <- NULL attr(x, "format.sas") <- NULL attr(x, "format.stata") <- NULL x } #' @export zap_formats.data.frame <- function(x) { x[] <- lapply(x, zap_formats) x } haven/R/update.R0000644000176200001440000000146613227700743013175 0ustar liggesusers# nocov start update_readstat <- function() { tmp <- tempfile() utils::download.file( "https://github.com/WizardMac/ReadStat/archive/master.zip", tmp, method = "wget" ) utils::unzip(tmp, exdir = tempdir()) zip_dir <- file.path(tempdir(), "ReadStat-master", "src") src <- dir(zip_dir, "\\.[ch]$", recursive = TRUE) # Drop test & bin ignore <- dirname(src) %in% c("test", "bin", "bin/modules", "bin/util", "fuzz") src <- src[!ignore] dirs <- file.path("src", "readstat", c("sas", "stata", "spss")) lapply(dirs, dir.create, showWarnings = FALSE, recursive = TRUE) ok <- file.copy(file.path(zip_dir, src), file.path("src", "readstat", src), overwrite = TRUE) if (any(!ok)) { stop("Failed to copy: ", paste(src[!ok], collapse = ", "), call. = FALSE) } invisible() } # nocov end haven/R/labelled_spss.R0000644000176200001440000000431613224443423014520 0ustar liggesusers#' Labelled vectors for SPSS #' #' This class is only used when \code{user_na = TRUE} in #' \code{\link{read_sav}()}. It is similar to the \code{\link{labelled}} class #' but it also models SPSS's user-defined missings, which can be up to #' three distinct values, or for numeric vectors a range. #' #' @param na_values A vector of values that should also be considered as missing. #' @param na_range A numeric vector of length two giving the (inclusive) extents #' of the range. Use \code{-Inf} and \code{Inf} if you want the range to be #' open ended. 
#' @inheritParams labelled
#' @export
#' @examples
#' x1 <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_values = c(9, 10))
#' is.na(x1)
#'
#' x2 <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_range = c(9, Inf))
#' is.na(x2)
labelled_spss <- function(x, labels, na_values = NULL, na_range = NULL) {
  if (!is.null(na_values)) {
    if (!is_coercible(x, na_values)) {
      stop("`x` and `na_values` must be same type", call. = FALSE)
    }
  }
  if (!is.null(na_range)) {
    if (!is.numeric(x)) {
      stop("`na_range` is only applicable for labelled numeric vectors", call. = FALSE)
    }
    if (!is.numeric(na_range) || length(na_range) != 2) {
      stop("`na_range` must be a numeric vector of length two.", call. = FALSE)
    }
  }

  structure(
    labelled(x, labels),
    na_values = na_values,
    na_range = na_range,
    class = c("labelled_spss", "labelled")
  )
}

#' @export
print.labelled_spss <- function(x, ...) {
  # Header line shows the underlying storage type
  cat("<Labelled SPSS ", typeof(x), ">\n", sep = "")

  xx <- x
  attributes(xx) <- NULL
  print(xx, quote = FALSE)

  na_values <- attr(x, "na_values")
  if (!is.null(na_values)) {
    cat("Missing values: ", paste(na_values, collapse = ", "), "\n", sep = "")
  }

  na_range <- attr(x, "na_range")
  if (!is.null(na_range)) {
    cat("Missing range:  [", paste(na_range, collapse = ", "), "]\n", sep = "")
  }

  print_labels(x)

  invisible()
}

#' @export
is.na.labelled_spss <- function(x) {
  # Start from regular NAs, then add user-defined missing values and ranges
  miss <- NextMethod()

  na_values <- attr(x, "na_values")
  if (!is.null(na_values)) {
    miss <- miss | x %in% na_values
  }

  na_range <- attr(x, "na_range")
  if (!is.null(na_range)) {
    miss <- miss | (x >= na_range[1] & x <= na_range[2])
  }

  miss
}
haven/R/haven.R0000644000176200001440000002040613227455610013007 0ustar liggesusers
#' @useDynLib haven, .registration = TRUE
#' @importFrom Rcpp sourceCpp
#' @importFrom tibble tibble
NULL

#' Read and write SAS files.
#'
#' Reading supports both sas7bdat files and the accompanying sas7bcat files
#' that SAS uses to record value labels. Writing value labels is not currently
#' supported.
#'
#' @param data_file,catalog_file Path to data and catalog files. The files are
#'   processed with \code{\link[readr]{datasource}()}.
#' @param data Data frame to write.
#' @param path Path to file where the data will be written.
#' @param encoding,catalog_encoding The character encoding used for the
#'   `data_file` and `catalog_file` respectively. A value of `NULL`
#'   uses the encoding specified in the file; use this argument to override it
#'   if it is incorrect.
#' @param cols_only A character vector giving an experimental way to read in
#'   only specified columns.
#' @return A tibble, data frame variant with nice defaults.
#'
#'   Variable labels are stored in the "label" attribute of each variable.
#'   It is not printed on the console, but the RStudio viewer will show it.
#' @export
#' @examples
#' path <- system.file("examples", "iris.sas7bdat", package = "haven")
#' read_sas(path)
read_sas <- function(data_file, catalog_file = NULL,
                     encoding = NULL, catalog_encoding = encoding,
                     cols_only = NULL) {
  if (is.null(encoding)) {
    encoding <- ""
  }
  if (is.null(cols_only)) {
    cols_only <- character()
  }

  spec_data <- readr::datasource(data_file)
  if (is.null(catalog_file)) {
    spec_cat <- list()
  } else {
    spec_cat <- readr::datasource(catalog_file)
  }

  switch(class(spec_data)[1],
    source_file = df_parse_sas_file(spec_data, spec_cat, encoding = encoding, catalog_encoding = catalog_encoding, cols_only = cols_only),
    source_raw = df_parse_sas_raw(spec_data, spec_cat, encoding = encoding, catalog_encoding = catalog_encoding, cols_only = cols_only),
    stop("This kind of input is not handled", call. = FALSE)
  )
}

#' @export
#' @rdname read_sas
write_sas <- function(data, path) {
  validate_sas(data)
  write_sas_(data, normalizePath(path, mustWork = FALSE))
}

#' Read and write SAS transport files
#'
#' The SAS transport format is an open format, as is required for submission
#' of the data to the FDA.
#' #' @inherit read_spss #' @export #' @examples #' tmp <- tempfile(fileext = ".xpt") #' write_xpt(mtcars, tmp) #' read_xpt(tmp) read_xpt <- function(file) { spec <- readr::datasource(file) switch(class(spec)[1], source_file = df_parse_xpt_file(spec), source_raw = df_parse_xpt_raw(spec), stop("This kind of input is not handled", call. = FALSE) ) } #' @export #' @rdname read_xpt #' @param version Version of transport file specification to use: either 5 or 8. write_xpt <- function(data, path, version = 8) { stopifnot(version %in% c(5, 8)) write_xpt_(data, normalizePath(path, mustWork = FALSE), version) } #' Read SPSS (SAV & POR) files. Write SAV files. #' #' Currently haven can read and write logical, integer, numeric, character #' and factors. See \code{\link{labelled_spss}} for how labelled variables in #' SPSS are handled in R. \code{read_spss} is an alias for \code{read_sav}. #' #' @inheritParams readr::datasource #' @param path Path to a file where the data will be written. #' @param data Data frame to write. #' @return A tibble, data frame variant with nice defaults. #' #' Variable labels are stored in the "label" attribute of each variable. #' It is not printed on the console, but the RStudio viewer will show it. #' @name read_spss #' @examples #' path <- system.file("examples", "iris.sav", package = "haven") #' read_sav(path) #' #' tmp <- tempfile(fileext = ".sav") #' write_sav(mtcars, tmp) #' read_sav(tmp) NULL #' @export #' @rdname read_spss read_sav <- function(file, user_na = FALSE) { spec <- readr::datasource(file) switch(class(spec)[1], source_file = df_parse_sav_file(spec, user_na), source_raw = df_parse_sav_raw(spec, user_na), stop("This kind of input is not handled", call. 
= FALSE) ) } #' @export #' @rdname read_spss read_por <- function(file, user_na = FALSE) { spec <- readr::datasource(file) switch(class(spec)[1], source_file = df_parse_por_file(spec, user_na), source_raw = df_parse_por_raw(spec, user_na), stop("This kind of input is not handled", call. = FALSE) ) } #' @export #' @rdname read_spss write_sav <- function(data, path) { validate_sav(data) write_sav_(data, normalizePath(path, mustWork = FALSE)) } #' @export #' @rdname read_spss #' @param user_na If \code{TRUE} variables with user defined missing will #' be read into \code{\link{labelled_spss}} objects. If \code{FALSE}, the #' default, user-defined missings will be converted to \code{NA}. read_spss <- function(file, user_na = FALSE) { ext <- tolower(tools::file_ext(file)) switch(ext, sav = read_sav(file, user_na = user_na), por = read_por(file, user_na = user_na), stop("Unknown extension '.", ext, "'", call. = FALSE) ) } #' Read and write Stata DTA files. #' #' Currently haven can read and write logical, integer, numeric, character #' and factors. See \code{\link{labelled}} for how labelled variables in #' Stata are handled in R. #' #' @inheritParams readr::datasource #' @inheritParams read_spss #' @param encoding The character encoding used for the file. This defaults to #' the encoding specified in the file, or UTF-8. But older versions of Stata #' (13 and earlier) did not store the encoding used, and you'll need to #' specify manually. A commonly used value is "windows-1252". #' @return A tibble, data frame variant with nice defaults. #' #' Variable labels are stored in the "label" attribute of each variable. #' It is not printed on the console, but the RStudio viewer will show it. 
#' @export #' @examples #' path <- system.file("examples", "iris.dta", package = "haven") #' read_dta(path) #' #' tmp <- tempfile(fileext = ".dta") #' write_dta(mtcars, tmp) #' read_dta(tmp) #' read_stata(tmp) read_dta <- function(file, encoding = NULL) { if (is.null(encoding)) { encoding <- "" } spec <- readr::datasource(file) switch(class(spec)[1], source_file = df_parse_dta_file(spec, encoding), source_raw = df_parse_dta_raw(spec, encoding), stop("This kind of input is not handled", call. = FALSE) ) } #' @export #' @rdname read_dta read_stata <- function(file, encoding = NULL) { read_dta(file, encoding) } #' @export #' @rdname read_dta #' @param version File version to use. Supports versions 8-14. write_dta <- function(data, path, version = 14) { validate_dta(data) write_dta_(data, normalizePath(path, mustWork = FALSE), version = stata_file_format(version) ) } stata_file_format <- function(version) { stopifnot(is.numeric(version), length(version) == 1) version <- as.integer(version) if (version == 14L) { 118 } else if (version == 13L) { 117 } else if (version == 12L) { 115 } else if (version %in% c(10L, 11L)) { 114 } else if (version %in% c(8L, 9L)) { 113 } else { stop("Version ", version, " not currently supported", call. = FALSE) } } validate_dta <- function(data) { stopifnot(is.data.frame(data)) # Check variable names bad_names <- !grepl("^[A-Za-z_]{1}[A-Za-z0-9_]{0,31}$", names(data)) if (any(bad_names)) { stop( "The following variable names are not valid Stata variables: ", var_names(data, bad_names), call. = FALSE ) } # Check for labelled double vectors is_labelled <- vapply(data, is.labelled, logical(1)) is_integer <- vapply(data, typeof, character(1)) == "integer" bad_labels <- is_labelled & !is_integer if (any(bad_labels)) { stop( "Stata only supports labelled integers.\nProblems: ", var_names(data, bad_labels), call. 
= FALSE ) } } validate_sav <- function(data) { stopifnot(is.data.frame(data)) # Check factor lengths level_lengths <- vapply(data, max_level_length, integer(1)) bad_lengths <- level_lengths > 120 if (any(bad_lengths)) { stop( "SPSS only supports levels with <= 120 characters\n", "Problems: ", var_names(data, bad_lengths), call. = FALSE ) } } validate_sas <- function(data) { stopifnot(is.data.frame(data)) } var_names <- function(data, i) { x <- names(data)[i] paste(encodeString(x, quote = "`"), collapse = ", ") } haven/R/RcppExports.R0000644000176200001440000000326713227426122014201 0ustar liggesusers# Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 df_parse_sas_file <- function(spec_b7dat, spec_b7cat, encoding, catalog_encoding, cols_only) { .Call(`_haven_df_parse_sas_file`, spec_b7dat, spec_b7cat, encoding, catalog_encoding, cols_only) } df_parse_sas_raw <- function(spec_b7dat, spec_b7cat, encoding, catalog_encoding, cols_only) { .Call(`_haven_df_parse_sas_raw`, spec_b7dat, spec_b7cat, encoding, catalog_encoding, cols_only) } df_parse_xpt_file <- function(spec) { .Call(`_haven_df_parse_xpt_file`, spec) } df_parse_xpt_raw <- function(spec) { .Call(`_haven_df_parse_xpt_raw`, spec) } df_parse_dta_file <- function(spec, encoding) { .Call(`_haven_df_parse_dta_file`, spec, encoding) } df_parse_dta_raw <- function(spec, encoding) { .Call(`_haven_df_parse_dta_raw`, spec, encoding) } df_parse_sav_file <- function(spec, user_na) { .Call(`_haven_df_parse_sav_file`, spec, user_na) } df_parse_sav_raw <- function(spec, user_na) { .Call(`_haven_df_parse_sav_raw`, spec, user_na) } df_parse_por_file <- function(spec, user_na) { .Call(`_haven_df_parse_por_file`, spec, user_na) } df_parse_por_raw <- function(spec, user_na) { .Call(`_haven_df_parse_por_raw`, spec, user_na) } write_sav_ <- function(data, path) { invisible(.Call(`_haven_write_sav_`, data, path)) } write_dta_ <- function(data, path, version) { 
    invisible(.Call(`_haven_write_dta_`, data, path, version))
}

write_sas_ <- function(data, path) {
    invisible(.Call(`_haven_write_sas_`, data, path))
}

write_xpt_ <- function(data, path, version) {
    invisible(.Call(`_haven_write_xpt_`, data, path, version))
}

haven/R/tagged_na.R0000644000176200001440000000366713165724161013632 0ustar liggesusers
#' "Tagged" missing values
#'
#' "Tagged" missing values work exactly like regular R missing values except
#' that they store one additional byte of information: a tag, which is usually
#' a letter ("a" to "z"). When loading a SAS or Stata file, the tagged
#' missing values always use lower case values.
#'
#' \code{format_tagged_na()} and \code{print_tagged_na()} format tagged
#' NA's as NA(a), NA(b), etc.
#'
#' @param ... Vectors containing single characters. The letter will be used to
#'   "tag" the missing value.
#' @param x A numeric vector
#' @param digits Number of digits to use in string representation
#' @export
#' @examples
#' x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)
#'
#' # Tagged NA's work identically to regular NAs
#' x
#' is.na(x)
#'
#' # To see that they're special, you need to use na_tag(),
#' # is_tagged_na(), or print_tagged_na():
#' is_tagged_na(x)
#' na_tag(x)
#' print_tagged_na(x)
#'
#' # You can test for specific tagged NAs with the second argument
#' is_tagged_na(x, "a")
#'
#' # Because the support for tagged NAs is somewhat tacked on to R,
#' # the left-most NA will tend to be preserved in arithmetic operations.
#' na_tag(tagged_na("a") + tagged_na("z"))
tagged_na <- function(...) {
  .Call(tagged_na_, c(...))
}

#' @rdname tagged_na
#' @export
na_tag <- function(x) {
  .Call(na_tag_, x)
}

#' @param tag If not \code{NULL}, will only return true if the tag has this
#'   value. If \code{NULL} (the default), any tagged NA returns true.
#' @rdname tagged_na #' @export is_tagged_na <- function(x, tag = NULL) { .Call(is_tagged_na_, x, tag) } #' @rdname tagged_na #' @export format_tagged_na <- function(x, digits = getOption("digits")) { out <- format(x, digits = digits) out[is_tagged_na(x)] <- paste0("NA(", na_tag(x[is_tagged_na(x)]), ")") # format again to make sure all elements have same width format(out, justify = "right") } #' @rdname tagged_na #' @export print_tagged_na <- function(x, digits = getOption("digits")) { print(format_tagged_na(x), quote = FALSE) } haven/R/labelled.R0000644000176200001440000000766613042172161013457 0ustar liggesusers#' Create a labelled vector. #' #' A labelled vector is a common data structure in other statistical #' environments, allowing you to assign text labels to specific values. #' This class makes it possible to import such labelled vectors in to R #' without loss of fidelity. This class provides few methods, as I #' expect you'll coerce to a standard R class (e.g. a \code{\link{factor}}) #' soon after importing. #' #' @param x A vector to label. Must be either numeric (integer or double) or #' character. #' @param labels A named vector. The vector should be the same type as #' \code{x}. Unlike factors, labels don't need to be exhaustive: only a fraction #' of the values might be labelled. #' @param ... Ignored #' @export #' @examples #' s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F")) #' s2 <- labelled(c(1, 1, 2), c(Male = 1, Female = 2)) #' #' # Unfortunately it's not possible to make as.factor work for labelled objects #' # so instead use as_factor. This works for all types of labelled vectors. 
#' as_factor(s1)
#' as_factor(s1, levels = "values")
#' as_factor(s2)
#'
#' # Other statistical software supports multiple types of missing values
#' s3 <- labelled(c("M", "M", "F", "X", "N/A"),
#'   c(Male = "M", Female = "F", Refused = "X", "Not applicable" = "N/A")
#' )
#' s3
#' as_factor(s3)
#'
#' # Often when you have a partially labelled numeric vector, labelled values
#' # are special types of missing. zap_labels() strips the labels and leaves
#' # the underlying values unchanged
#' x <- labelled(c(1, 2, 1, 2, 10, 9), c(Unknown = 9, Refused = 10))
#' zap_labels(x)
labelled <- function(x, labels) {
  if (!is.numeric(x) && !is.character(x)) {
    stop("`x` must be a numeric or a character vector", call. = FALSE)
  }
  if (!is_coercible(x, labels)) {
    stop("`x` and `labels` must be same type", call. = FALSE)
  }
  if (is.null(names(labels))) {
    stop("`labels` must have names", call. = FALSE)
  }

  structure(x,
    labels = labels,
    class = "labelled"
  )
}

# Values and labels are compatible if they share a type, or are both numeric
# (so integer values can carry double labels and vice versa)
is_coercible <- function(x, labels) {
  if (typeof(x) == typeof(labels)) {
    return(TRUE)
  }

  if (is.numeric(x) && is.numeric(labels)) {
    return(TRUE)
  }

  FALSE
}

#' @export
#' @rdname labelled
is.labelled <- function(x) inherits(x, "labelled")

#' @export
`[.labelled` <- function(x, ...) {
  labelled(NextMethod(), attr(x, "labels"))
}

#' @export
print.labelled <- function(x, ..., digits = getOption("digits")) {
  # Header line shows the underlying storage type
  cat("<Labelled ", typeof(x), ">\n", sep = "")

  if (is.double(x)) {
    print_tagged_na(x, digits = digits)
  } else {
    xx <- x
    attributes(xx) <- NULL
    print.default(xx, quote = FALSE)
  }

  print_labels(x)

  invisible()
}

#' Print the labels of a labelled vector
#'
#' This is a convenience function, useful to explore the variables of
#' a newly imported dataset.
#' @param x A labelled vector #' @param name The name of the vector (optional) #' @export #' @examples #' s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F")) #' s2 <- labelled(c(1, 1, 2), c(Male = 1, Female = 2)) #' labelled_df <- tibble::data_frame(s1, s2) #' #' for (var in names(labelled_df)) { #' print_labels(labelled_df[[var]], var) #' } print_labels <- function(x, name = NULL) { if (!is.labelled(x)) { stop("x must be a labelled vector", call. = FALSE) } labels <- attr(x, "labels") if (length(labels) == 0) { return(invisible(x)) } cat("\nLabels:", name, "\n", sep = "") value <- if (is.double(labels)) format_tagged_na(labels) else unname(labels) lab_df <- data.frame(value = value, label = names(labels)) print(lab_df, row.names = FALSE) invisible(x) } #' @export as.data.frame.labelled <- function(x, ...) { df <- list(x) class(df) <- "data.frame" attr(df, "row.names") <- .set_row_names(length(x)) df } label_length <- function(x) { if (!is.labelled(x)) { 0L } else { max(nchar(names(attr(x, "labels")))) } } #' @export #' @importFrom tibble type_sum type_sum.labelled <- function(x) { paste0(tibble::type_sum(unclass(x)), "+lbl") } haven/R/zap_empty.R0000644000176200001440000000050212743423326013712 0ustar liggesusers#' Convert empty strings into missing values. #' #' @param x A character vector #' @return A character vector with empty strings replaced by missing values. #' @family zappers #' @export #' @examples #' x <- c("a", "", "c") #' zap_empty(x) zap_empty <- function(x) { stopifnot(is.character(x)) x[x == ""] <- NA x } haven/R/zap_widths.R0000644000176200001440000000110613042170233014044 0ustar liggesusers#' Remove display width attributes #' #' To provide some mild support for round-tripping variables between SPSS #' and R, haven stores display widths in an attribute: \code{display_width}. If this #' causes problems for your code, you can get rid of them with \code{zap_widths}. #' #' @param x A vector or data frame. 
#' @family zappers
#' @export
zap_widths <- function(x) {
  UseMethod("zap_widths")
}

#' @export
zap_widths.default <- function(x) {
  attr(x, "display_width") <- NULL
  x
}

#' @export
zap_widths.data.frame <- function(x) {
  x[] <- lapply(x, zap_widths)
  x
}
haven/R/zzz.R0000644000176200001440000000014613227701014012532 0ustar liggesusers# nocov start
.onUnload <- function(libpath) {
  library.dynam.unload("haven", libpath)
}
# nocov end
haven/R/zap_labels.R0000644000176200001440000000213412743423326014021 0ustar liggesusers#' Zap labels
#'
#' Removes labels, leaving unlabelled vectors as is. Use this if you want to
#' simply drop all labelling from a data frame. Zapping labels from
#' \code{\link{labelled_spss}} also removes user-defined missing values,
#' replacing all with \code{NA}s.
#'
#' @param x A vector or data frame
#' @family zappers
#' @export
#' @examples
#' x1 <- labelled(1:5, c(good = 1, bad = 5))
#' x1
#' zap_labels(x1)
#'
#' x2 <- labelled_spss(c(1:4, 9), c(good = 1, bad = 5), na_values = 9)
#' x2
#' zap_labels(x2)
#'
#' # zap_labels also works with data frames
#' df <- tibble::data_frame(x1, x2)
#' df
#' zap_labels(df)
zap_labels <- function(x) {
  UseMethod("zap_labels")
}

#' @export
zap_labels.default <- function(x) {
  x
}

#' @export
zap_labels.labelled <- function(x) {
  attr(x, "labels") <- NULL
  class(x) <- NULL

  x
}

#' @export
zap_labels.labelled_spss <- function(x) {
  x[is.na(x)] <- NA
  attr(x, "labels") <- NULL
  attr(x, "na_values") <- NULL
  attr(x, "na_range") <- NULL
  class(x) <- NULL

  x
}

#' @export
zap_labels.data.frame <- function(x) {
  x[] <- lapply(x, zap_labels)
  x
}
haven/vignettes/0000755000176200001440000000000013227731765013400 5ustar liggesusershaven/vignettes/semantics.Rmd0000644000176200001440000001445313224443423016025 0ustar liggesusers---
title: "Conversion semantics"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Conversion semantics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(haven)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

There are some differences between the way that R, SAS, SPSS, and Stata represent labelled data and missing values. While SAS, SPSS, and Stata share some obvious similarities, R is a little different. This vignette explores the differences, and shows you how haven bridges the gap.

## Value labels

Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This, however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:

* SPSS and SAS can label numeric and character values, not just integer values.
* The values do not need to be exhaustive. It is common to label the special missing values (e.g. `.D` = did not respond, `.N` = not applicable), while leaving other values as is.

Value labels in SAS are a little different again. In SAS, labels are just a special case of general formats. Formats include currencies and dates, but user-defined formats just assign labels to individual values (including special missing values). Formats have names and exist independently of the variables they are associated with. You create a named format with `PROC FORMAT` and then associate it with variables in a `DATA` step (the names of character formats always start with `$`).

### `labelled()`

To allow you to import labelled vectors into R, haven provides the S3 labelled class, created with `labelled()`. This class allows you to associate arbitrary labels with numeric or character vectors:

```{r}
x1 <- labelled(
  sample(1:5),
  c(Good = 1, Bad = 5)
)
x1

x2 <- labelled(
  c("M", "F", "F", "F", "M"),
  c(Male = "M", Female = "F")
)
x2
```

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis.
The goal is to provide an intermediate data structure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels:

```{r}
as_factor(x1)
zap_labels(x1)

as_factor(x2)
zap_labels(x2)
```

See the documentation for `as_factor()` for more options to control exactly what the factor uses for levels.

Both `as_factor()` and `zap_labels()` have data frame methods if you want to apply the same strategy to every column in a data frame:

```{r}
df <- tibble::data_frame(x1, x2, z = 1:5)
df

zap_labels(df)
as_factor(df)
```

## Missing values

All three tools provide a global "system missing value" which is displayed as `.`. This is roughly equivalent to R's `NA`, although neither Stata nor SAS propagates missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. `-inf`), and Stata treats it as the largest possible number (i.e. `inf`).

Each tool also provides a mechanism for recording multiple types of missingness:

* Stata has "extended" missing values, `.A` through `.Z`.
* SAS has "special" missing values, `.A` through `.Z` plus `._`.
* SPSS has per-column "user" missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.

Stata and SAS only support tagged missing values for numeric columns. SPSS supports up to three distinct values for character columns. Generally, operations involving a user-missing type return a system missing value.

Haven models these missing values in two different ways:

* For SAS and Stata, haven provides "tagged" missing values which extend R's regular `NA` to add a single character label.
* For SPSS, haven provides a subclass of `labelled` that also provides user defined values and ranges.

### Tagged missing values

To support Stata's extended and SAS's special missing values, haven implements a tagged NA.
It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identically to NA in regular R operations, while still preserving the value of the tag.

The R interface for creating tagged NAs is a little clunky because generally they'll be created by haven for you. But you can create your own with `tagged_na()`:

```{r}
x <- c(1:3, tagged_na("a", "z"), 3:1)
x
```

Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use `print_tagged_na()`:

```{r}
print_tagged_na(x)
```

To test if a value is a tagged NA, use `is_tagged_na()`, and to extract the value of the tag, use `na_tag()`:

```{r}
is_tagged_na(x)
is_tagged_na(x, "a")

na_tag(x)
```

My expectation is that tagged missings are most often used in conjunction with labels (described above), so labelled vectors print the tags for you, and `as_factor()` knows how to relabel:

```{r}
y <- labelled(x, c("Not home" = tagged_na("a"), "Refused" = tagged_na("z")))
y

as_factor(y)
```

### User defined missing values

SPSS's user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides `labelled_spss()` as a subclass of `labelled()` to model these additional user-defined missings.

```{r}
x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

x1
x2
```

These objects are somewhat dangerous to work with in R because most R functions don't know those values are missing:

```{r}
mean(x1)
```

Because of that danger, the default behaviour of `read_spss()` is to return regular labelled objects where user-defined missing values have been converted to `NA`s. To get `read_spss()` to return `labelled_spss()` objects, you'll need to set `user_na = TRUE`.
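
As a minimal sketch of the `user_na` switch, assuming the example file `iris.sav` that ships in the package's `inst/examples` directory (the names `path`, `df_default` and `df_user_na` are illustrative, and that file may not declare any user-defined missing values, in which case both results look the same):

```r
library(haven)

# Path to the example SPSS file bundled with haven (inst/examples/iris.sav)
path <- system.file("examples", "iris.sav", package = "haven")

# Default: user-defined missing values are converted to NA
df_default <- read_spss(path)

# Keep user-defined missing values, returned as labelled_spss() vectors
df_user_na <- read_spss(path, user_na = TRUE)

# Check which columns, if any, carry the labelled_spss class
vapply(df_user_na, inherits, logical(1), what = "labelled_spss")
```

With your own data, substitute the path to a `.sav` file that actually declares user-defined missing values.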
I've defined an `is.na()` method so you can find them yourself:

```{r}
is.na(x1)
```

And the presence of that method does mean many functions with an `na.rm` argument will work correctly:

```{r}
mean(x1, na.rm = TRUE)
```

But generally you should either convert to a factor, convert to regular missing values, or strip all the labels:

```{r}
as_factor(x1)
zap_missing(x1)
zap_labels(x1)
```
haven/vignettes/releases/0000755000176200001440000000000013227731765015203 5ustar liggesusershaven/vignettes/releases/haven-1.0.0.Rmd0000644000176200001440000001012412775011450017365 0ustar liggesusers---
title: "haven 1.0.0"
date: "2016-09-30"
---

```{r, include = FALSE}
library(haven)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

I'm pleased to announce the release of haven. Haven is designed to facilitate the transfer of data between R and SAS, SPSS, and Stata. It makes it easy to read SAS, SPSS, and Stata file formats into R data frames, and makes it easy to save your R data frames into SAS, SPSS, and Stata formats if you need to collaborate with others using closed source statistical software. Install haven by running:

```R
install.packages("haven")
```

haven 1.0.0 is a major release, and indicates that haven is now largely feature complete and has been tested on many real world datasets. There are four major changes in this version of haven:

1. Improvements to the underlying ReadStat library
1. Better handling of "special" missing values
1. Improved date/time support
1. Support for other file metadata.

There were also a whole bunch of other minor improvements and bug fixes: you can see the complete list in the [release notes](http://haven.tidyverse.org/news/index.html#haven-1.0.0).

## ReadStat

Haven builds on top of the [ReadStat](http://github.com/WizardMac/ReadStat/issues) C library by [Evan Miller](http://www.evanmiller.org). This version of haven includes many improvements thanks to Evan's hard work on ReadStat:

* Can read binary/Ross compressed SAS files.
* Support for reading and writing Stata 14 data files.
* New `write_sas()` allows you to write data frames out to `sas7bdat` files. This is still somewhat experimental.
* `read_por()` now actually works.
* Many other bug fixes and minor improvements.

## Missing values

haven 1.0.0 includes comprehensive support for the "special" types of missing values found in SAS, SPSS, and Stata. All three tools provide a global "system missing value", displayed as `.`. This is roughly equivalent to R's `NA`, although neither Stata nor SAS propagates missingness in numeric comparisons (SAS treats the missing value as the smallest possible number and Stata treats it as the largest possible number).

Each tool also provides a mechanism for recording multiple types of missingness:

* Stata has "extended" missing values, `.A` through `.Z`.
* SAS has "special" missing values, `.A` through `.Z` plus `._`.
* SPSS has per-column "user" missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.

Stata and SAS only support tagged missing values for numeric columns. SPSS supports up to three distinct values for character columns. Generally, operations involving a user-missing type return a system missing value.

Haven models these missing values in two different ways:

* For SAS and Stata, haven provides `tagged_na()` which extends R's regular `NA` to add a single character label.
* For SPSS, haven provides `labelled_spss()` that also models user defined values and ranges.

Use `zap_missing()` if you just want to convert to R's regular `NA`s.

You can get more details in the [semantics vignette](http://haven.tidyverse.org/articles/semantics.html).

## Date/times

Support for date/times has substantially improved:

* `read_dta()` now recognises "%d" and custom date types.
* `read_sav()` now correctly recognises EDATE and JDATE formats as dates.
  Variables with format DATE, ADATE, EDATE, JDATE or SDATE are imported as `Date` variables instead of `POSIXct`.
* `write_dta()` and `write_sav()` support writing date/times.
* Support for `hms()` has been moved into the [hms](https://github.com/rstats-db/hms) package. Time variables now have class `c("hms", "difftime")` and a `units` attribute with value "secs".

## Other metadata

Haven is slowly adding support for other types of metadata:

* Variable formats can be read and written. Similarly to variable labels, formats are stored as an attribute on the vector. Use `zap_formats()` if you want to remove these attributes.
* Added support for reading file "label" and "notes". These are not currently printed, but are stored in the attributes if you need to access them.
haven/vignettes/releases/haven-0.1.0.Rmd0000644000176200001440000000746212773512125017403 0ustar liggesusers---
title: "haven 0.1.0"
date: "2015-03-04"
---

```{r, echo = FALSE}
knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
```

Haven makes it easy to read data from SAS, SPSS and Stata. Haven has the same goal as the [foreign](http://cran.r-project.org/package=foreign) package, but it:

* Can read binary SAS7BDAT files.
* Can read Stata13 files.
* Always returns a data frame.

(Haven also has experimental support for writing SPSS and Stata data. This still has some rough edges but please try it out and [report any problems](https://github.com/hadley/haven/issues) that you find.)

Haven is a binding to the [ReadStat](http://github.com/WizardMac/ReadStat/issues) C library by [Evan Miller](http://www.evanmiller.org). Haven wouldn't be possible without his hard work - thanks Evan! I'd also like to thank Matt Shotwell who spent a lot of time reverse engineering the SAS binary data format, and Dennis Fisher who tested the SAS code with thousands of SAS files.
## Usage

Using haven is easy:

* Install it, `install.packages("haven")`,
* Load it, `library(haven)`,
* Then pick the appropriate read function:

    * SAS: `read_sas()`
    * SPSS: `read_sav()` or `read_por()`
    * Stata: `read_dta()`.

These only need the path to the file. (`read_sas()` optionally also takes the path to a catalog file.)

## Output

All functions return a data frame:

* The output also has class `tbl_df` which will improve the default print method (to only show the first ten rows and the variables that fit on one screen) if you have dplyr loaded. If you don't use dplyr, it has no effect.
* Variable labels are attached as an attribute to each variable. These are not printed (because they tend to be long), but if you have a [preview version of RStudio](http://www.rstudio.com/products/rstudio/download/preview/), you'll see them in the [revamped viewer pane](http://blog.rstudio.org/2015/02/24/rstudio-v0-99-preview-data-viewer-improvements/).
* Missing values in numeric variables should be seamlessly converted. Missing values in character variables are converted to the empty string, `""`: if you want to convert them to missing values, use `zap_empty()`.
* Dates are converted into `Date`s, and datetimes to `POSIXct`s. Time variables are read into a new class called `hms` which represents an offset in seconds from midnight. It has `print()` and `format()` methods to nicely display times, but otherwise behaves like an integer vector.
* Variables with labelled values are turned into a new `labelled` class, as described next.

### Labelled variables

SAS, Stata and SPSS all have the notion of a "labelled" variable. These are similar to factors, but:

* Integer, numeric and character vectors can be labelled.
* Not every value must be associated with a label.

Factors, by contrast, are always integers and every integer value must be associated with a label.

Haven provides a `labelled` class to model these objects.
It doesn't implement any common methods, but instead focuses on ways to turn a labelled variable into a standard R variable:

* `as_factor()`: turns labelled integers into factors. Any values that don't have a label associated with them will become a missing value. (NB: there's no way to make `as.factor()` work with labelled variables, so you'll need to use this new function.)
* `zap_labels()`: turns any labelled values into missing values. This deals with the common pattern where you have a continuous variable that has missing values indicated by sentinel values.

If you have a use case that's not covered by these functions, please let me know.

## Development

Haven is still under very active development. If you have problems loading a dataset, please try the [development version](https://github.com/hadley/haven), and if that doesn't work, [file an issue](https://github.com/hadley/haven/issues).
haven/vignettes/rsconnect/0000755000176200001440000000000012726302003015354 5ustar liggesusershaven/vignettes/rsconnect/documents/0000755000176200001440000000000012726302003017355 5ustar liggesusershaven/vignettes/rsconnect/documents/semantics.Rmd/0000755000176200001440000000000012726302003022064 5ustar liggesusershaven/vignettes/rsconnect/documents/semantics.Rmd/rpubs.com/0000755000176200001440000000000012726302003023774 5ustar liggesusershaven/vignettes/rsconnect/documents/semantics.Rmd/rpubs.com/rpubs/0000755000176200001440000000000012726302003025127 5ustar liggesusershaven/vignettes/rsconnect/documents/semantics.Rmd/rpubs.com/rpubs/Document.dcf0000644000176200001440000000050012726302003027356 0ustar liggesusersname: Document
account: rpubs
server: rpubs.com
appId: https://api.rpubs.com/api/v1/document/188133/bfa3384136db4a71a621d364bedbca69
bundleId: https://api.rpubs.com/api/v1/document/188133/bfa3384136db4a71a621d364bedbca69
url: http://rpubs.com/publish/claim/188133/03a0f551d3a04c1da47cb56aa1a7e640
when: 1465484291.89364
haven/vignettes/datetimes.Rmd0000644000176200001440000000242213006437521016007 0ustar liggesusers---
title: "Dates and times"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Date times}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Formats

There are three common formats across SAS, SPSS and Stata.

Date (number of days):

* SAS: MMDDYY, DDMMYY, YYMMDD, DATE
* SPSS: n/a
* Stata: %td

Time (number of seconds):

* SAS: TIME, HHMM, TOD
* SPSS: TIME, DTIME
* Stata: n/a

DateTime (number of seconds):

* SAS: DATETIME
* SPSS: DATE, ADATE, SDATE, DATETIME (as milliseconds)
* Stata: %tc, %tC

## Offsets

Dates and date times use a different offset to R:

* SAS: 1960-01-01 (`r -as.integer(as.Date("1960-01-01"))` days)
* SPSS: 1582-10-14 (`r -as.integer(as.Date("1582-10-14"))` days)
* Stata: 1960-01-01 (`r -as.integer(as.Date("1960-01-01"))` days)

## References

* SAS: ,
* SPSS:
* Stata:
haven/README.md0000644000176200001440000000450513127733361012644 0ustar liggesusers# Haven

[![Travis-CI Build Status](https://travis-ci.org/tidyverse/haven.svg?branch=master)](https://travis-ci.org/tidyverse/haven)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/tidyverse/haven?branch=master&svg=true)](https://ci.appveyor.com/project/tidyverse/haven)
[![codecov](https://codecov.io/gh/tidyverse/haven/branch/master/graph/badge.svg)](https://codecov.io/gh/tidyverse/haven)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/haven)](https://cran.r-project.org/package=haven)

## Overview

Haven enables R to read and write various data formats used by other statistical packages by wrapping the fantastic [ReadStat](https://github.com/WizardMac/ReadStat) C library written by [Evan Miller](http://www.evanmiller.org). Haven is part of the [tidyverse](http://tidyverse.org). Currently it supports:

* __SAS__: `read_sas()` reads `.sas7bdat` + `.sas7bcat` files and `read_xpt()` reads SAS transport files (version 5 and version 8).
  `write_sas()` writes `.sas7bdat` files.
* __SPSS__: `read_sav()` reads `.sav` files and `read_por()` reads the older `.por` files. `write_sav()` writes `.sav` files.
* __Stata__: `read_dta()` reads `.dta` files (up to version 14). `write_dta()` writes `.dta` files (versions 8-14).

The output objects:

* Are [tibbles](http://github.com/hadley/tibble), which have a better print method for very long and very wide files.
* Translate value labels into a new `labelled()` class, which preserves the original semantics and can easily be coerced to factors with `as_factor()`. Special missing values are preserved. See `vignette("semantics")` for more details.
* Convert dates and times to R date/time classes. Character vectors are not converted to factors.

## Installation

```R
# The easiest way to get haven is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just haven:
install.packages("haven")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/haven")
```

## Usage

```R
library(haven)

# SAS
read_sas("mtcars.sas7bdat")
write_sas(mtcars, "mtcars.sas7bdat")

# SPSS
read_sav("mtcars.sav")
write_sav(mtcars, "mtcars.sav")

# Stata
read_dta("mtcars.dta")
write_dta(mtcars, "mtcars.dta")
```
haven/MD50000644000176200001440000002265013230073603011665 0ustar liggesusersf3b4bb4b33feeaabe869845943d3b44b *DESCRIPTION cf97b41bc7b0fbb83c7b4b381eed739f *LICENSE 6e2eccf5929d7f8930420b1d8fb3101c *NAMESPACE 3c736c62177ab366aabcdb8a62a00681 *NEWS.md 2938efd7ff3346b4a5f1f056af121f70 *R/RcppExports.R 64a0871119f365cf1f58bbebf730a701 *R/as_factor.R bdd4e4783dc0ae93030dfb445e2fc2e0 *R/haven.R cc700d2d50ed59baacf3537e61999ba5 *R/labelled.R 9bc30fdcc06a132826dbd99b8fad71c9 *R/labelled_spss.R 331f19673fea9928f19b2b876364ac13 *R/tagged_na.R 3cac71e3df49f46251542a4f59ddd240 *R/update.R 2b8e9a061d48c8562a53d0d085ac0324 *R/utils.R 45c4b7e4353d543a49dd17e993243941 *R/zap_empty.R
b92da8e8fa9cc5371e374d454e3200d4 *R/zap_formats.R 598ed97fb462c646b00b22ae9dd1e703 *R/zap_labels.R 2b5453ebca97b7d2d46ca3177d15583d *R/zap_missing.R b266c609a7d36e8158db997181cd90c0 *R/zap_widths.R 43cb29a8578e4e75f9470f81551c0fa4 *R/zzz.R ef545a1571507ee7a844caa0f73c3058 *README.md ecfe99d94b3df23a95816beec2d42c56 *build/vignette.rds f042585a32d10af7532a86f7dc23e402 *inst/doc/datetimes.Rmd 0dcfe8b1d08a50ed64a69e2205eb099c *inst/doc/datetimes.html 718d86f09ace4c9e6d7834d9e5c36b2a *inst/doc/semantics.R 7597ff3a37dd05812a625a401f75ab60 *inst/doc/semantics.Rmd 7206673800dbdeca023556c548bf911c *inst/doc/semantics.html 782776cdad132bd02616ab5e0ddbb1a6 *inst/examples/iris.dta 6d7292019b3784d97ba2e2ced4ce5bac *inst/examples/iris.sas7bdat 9eec419726af6bb92009b25c5224e32b *inst/examples/iris.sav 579f0da1427bb136f6c6c5d31e51ced1 *man/as_factor.Rd c76c8f96d13b2e1402be718c6ded7bb1 *man/figures/logo.png e81fd52da0361d9a30b1c6d680cc7f7d *man/labelled.Rd 26d12f030d67537d1fed0066623a9c54 *man/labelled_spss.Rd 137578b46e3c1e3cab43e122eb392a07 *man/print_labels.Rd 5644d432a4adb383f03055fb4ba7a965 *man/read_dta.Rd 41a227b501708f9462312c998661ba2f *man/read_sas.Rd 74fb24dd951b1016675207af95ff3d5a *man/read_spss.Rd 703bb1f883e810657048b89451407374 *man/read_xpt.Rd 63677c09f81f127d6d177f1578a51899 *man/tagged_na.Rd 26126b95c9f75103100a81eb28540b99 *man/zap_empty.Rd 81a71ca828f5fdd9f4f9d9fb868ae41f *man/zap_formats.Rd d291f02d1984e275368a4795cb96a153 *man/zap_labels.Rd 0ae106b088c1632323217b5eb50dfc6d *man/zap_missing.Rd 5c6a0fdab97125c8079d346b14ba1eed *man/zap_widths.Rd 465c03a29d22275b15ce8f5aff99e1ed *src/DfReader.cpp 984dd7e226f5713d00eaff48c26e9c48 *src/DfWriter.cpp 44c42f00a127ef1415a68cdea88d19a1 *src/Makevars aa8544e8051e9d30db10548246cee3d3 *src/Makevars.win b71507f1d9b230cc567d2b55ad9b350c *src/RcppExports.cpp a8fd1bcd1f0d891803886c86f202e5d1 *src/haven_types.cpp fb5b5688dfb00868de3a8a7749a94d76 *src/haven_types.h b6a70b3738f6c7c2e674d0c361799ff8 *src/readstat/CKHashTable.c 
9231e05d5d63880216aff4c90122df91 *src/readstat/CKHashTable.h 02c61c60b75fda545b8b04f5e01b13a5 *src/readstat/readstat.h 91afb7a9b9258fa50c7066b55ea97999 *src/readstat/readstat_bits.c ee0bb989382f7fde0b4ecd721c63437a *src/readstat/readstat_bits.h 1efee7b9fef772d5217fa754d6b1c0f2 *src/readstat/readstat_convert.c 94288276f3982ee9834e010e151d6645 *src/readstat/readstat_convert.h 90b3b0ee00711b1fa8943bbb0e82e572 *src/readstat/readstat_error.c f2f76beaef37c669245f0dc4cdccd662 *src/readstat/readstat_iconv.h 77f501957f84964511353fb77e2d645e *src/readstat/readstat_io_unistd.c 6d20f5534b536f5fbd48480cb1df9a7a *src/readstat/readstat_io_unistd.h 3949f78cb8fad4b234a72ba3bacf3294 *src/readstat/readstat_malloc.c 1e19c96593352dfd6acf55fd290101f4 *src/readstat/readstat_malloc.h ba3ca0521c68c97c7cb128c878e5ed26 *src/readstat/readstat_parser.c d316daca6ec0818738940ff36ce9846a *src/readstat/readstat_value.c 7fe324a298e29a923945c47d032ae6fa *src/readstat/readstat_variable.c 3743d04c08d04f715d4f55a0d17a4c47 *src/readstat/readstat_writer.c 5663f9ea2bcae2364a3948cbe872a905 *src/readstat/readstat_writer.h 70b27ed1d71dc01c682c05ae8ef25440 *src/readstat/sas/ieee.c a9f00e5b895054ef83c9bf82eb8f257f *src/readstat/sas/ieee.h edec7d1418791431a8a83f476a1bb373 *src/readstat/sas/readstat_sas.c 397321067501217054588d9007fe25cb *src/readstat/sas/readstat_sas.h 9eaab989ade1d617013b95de9a651e72 *src/readstat/sas/readstat_sas7bcat_read.c 1f418d50be6f90b658fdc3efc76cf42b *src/readstat/sas/readstat_sas7bcat_write.c 9782edce54bc924024d7426ff20feb4b *src/readstat/sas/readstat_sas7bdat_read.c 52d5b532f536eb9d5cd2a9f23ab53389 *src/readstat/sas/readstat_sas7bdat_write.c 47161634845f45bb45213a45b86aec42 *src/readstat/sas/readstat_sas_rle.c a54f285ec408075cfd4f16f30fc4dc8e *src/readstat/sas/readstat_sas_rle.h 10f0fb38bd48b60a82a1123c3d9ecfaf *src/readstat/sas/readstat_xport.c 1b1cf645d9e9e2a0704cbb2d46dca15c *src/readstat/sas/readstat_xport.h a6c003df6ba865240872483511ec08e4 *src/readstat/sas/readstat_xport_read.c 
ca189a94c7d2a0c487c8f852053900eb *src/readstat/sas/readstat_xport_write.c a644d0305cd087493c26ef0dc26ed62c *src/readstat/spss/readstat_por.c 2ea5c19f08afb7eeb089769344964542 *src/readstat/spss/readstat_por.h 45809db64274609cba8dfc46e56444e6 *src/readstat/spss/readstat_por_parse.c 8fef965008763eac54d94438d33f7408 *src/readstat/spss/readstat_por_parse.h a4af515e8e84dc4ecf32e152de4ba56f *src/readstat/spss/readstat_por_read.c 6c0f51cbc9bb93083cdff22d7e52545c *src/readstat/spss/readstat_por_write.c fff91f5e88472eb7d64152a4f78c05c8 *src/readstat/spss/readstat_sav.c 585dd0e896ee4cbd8a510afd8e784841 *src/readstat/spss/readstat_sav.h 93e68978dd1444c491cec91f04961e49 *src/readstat/spss/readstat_sav_parse.c 0153d271c5d495b5977057d9c19d9f01 *src/readstat/spss/readstat_sav_parse.h 7b77823945ac502a264f85b79b35f07f *src/readstat/spss/readstat_sav_parse_timestamp.c 204b84a989b74ca7b4aad4205b67c357 *src/readstat/spss/readstat_sav_parse_timestamp.h c426cc4064fda3abea07b01a7efe2a58 *src/readstat/spss/readstat_sav_read.c e4bbd2e8658ca2801e8575082af55b52 *src/readstat/spss/readstat_sav_write.c 9bf8f8b819f15d18b332e0d1cd031b01 *src/readstat/spss/readstat_spss.c 2a6bfdb7a163544dbd588f5f0a5fc0b8 *src/readstat/spss/readstat_spss.h dedd15436435a125dce812c465b7d747 *src/readstat/spss/readstat_spss_parse.c 751f5e801cfb6e74dd796bd3c6b032c2 *src/readstat/spss/readstat_spss_parse.h 81aabbd9550bade17b2c52fcb879bdf6 *src/readstat/stata/readstat_dta.c 715ab41b6fed8361d33a746f8e005f2d *src/readstat/stata/readstat_dta.h 1106c513215dc0c4608c97c42928bfbf *src/readstat/stata/readstat_dta_parse_timestamp.c cf98cd82614330671b97dc2ffa92b8f3 *src/readstat/stata/readstat_dta_parse_timestamp.h 888fe8dff4a201adf6d6a1f62bc9ae96 *src/readstat/stata/readstat_dta_read.c 57d20036cf233b63f42e082898d85a7f *src/readstat/stata/readstat_dta_write.c aae161361ab10ceb021df485c8cda71b *src/tagged_na.c f3a180e741a675c27b36d4ce86e1922e *src/tagged_na.h 130ef81e2dee1f48002137a5809e7765 *tests/testthat.R 
a906d1b8e57ae2fbdd5cc3bfbe84757c *tests/testthat/datetime-d.dta ae0a03c2f46d9d09e4888ee98f7faccc *tests/testthat/datetime.sas7bdat 3da136a3cf1086d4ec7132badfc3f9e0 *tests/testthat/datetime.sav f74b89b03972dd66a8ee7e253d91842f *tests/testthat/formats.sas7bcat c7d024dc776bd1891da13439799f8e9b *tests/testthat/hadley.sas7bdat afe9293b155fbd51fd20e0c8dd84927a *tests/testthat/hadley.zip ae9c1afa6aa61720fd3e38f39dde40ad *tests/testthat/helper-lump.R 8fd3c3c0b7e672a5c565b67f587442e2 *tests/testthat/helper-roundtrip.R 1df50141e26e230a726ff6d76da3583d *tests/testthat/labelled-num-na.sav 0446f805dc7bb9273b01d5e19ec042f7 *tests/testthat/labelled-num.sav a34297a2ab33f7d4a4c77dc381198b29 *tests/testthat/labelled-output.txt 89c1a054947d0533cd11a703828e9b58 *tests/testthat/labelled-spss-output.txt 8866383dd5d2c61f89270fbe4a6802a0 *tests/testthat/labelled-str.sav b4963a6dcfd5ea880600fcf70b997215 *tests/testthat/notes.dta 69eef7e56b7f7137bbd7f0e6d3526eb2 *tests/testthat/tagged-na-double.dta b5faae7735cdbdea22666cab82af6444 *tests/testthat/tagged-na-int.dta fb6f2b8e08c1ff0564e3fdcf4d2a824b *tests/testthat/tagged-na.sas7bcat 3c7429bbf497edf98dc73b193420b56c *tests/testthat/tagged-na.sas7bdat cd5be4255eba3124e952eccd43c9460e *tests/testthat/tagged-na.txt a5a96fdc615592e34f01f933cb33b765 *tests/testthat/test-as-factor.R d3f10eed8c73596895005ab5e3a12227 *tests/testthat/test-labelled.R d757fd1c9641ff00a134f4f12507dbd1 *tests/testthat/test-labelled_spss.R 4aa74368024795ad141caf1eba9389c0 *tests/testthat/test-read-connection.R 33d953deca0c08d9bbe1be52e5c00325 *tests/testthat/test-read-sas.R a142a18441f662bf530411b9b4cbe73f *tests/testthat/test-read-sav.R d7b2916a3065d311fdb75270ae910839 *tests/testthat/test-read-stata.R c7d4f8e157a3f07fa8c1be6bb15d565a *tests/testthat/test-read-xpt.R ce95913eb3b1f0559242dcba87740f6b *tests/testthat/test-replace_with.R de2462bf80ba152275e77b010ce219cb *tests/testthat/test-tagged_na.R 636e8a433960fe0c0aabfad557a0a0bd *tests/testthat/test-utils.R 
6f95c47461dc3995bc969fe4a94e2ec8 *tests/testthat/test-write-dta.R 6db104810e022716909f84996947573b *tests/testthat/test-write-sas.R c103ee080c172eb57487941bdf4a8110 *tests/testthat/test-write-sav.R 7c45b097a489d5d80258b4244760efb6 *tests/testthat/test-write-xpt.R 2b5f5167b375fec7ce4d326461986ed1 *tests/testthat/test-zap-empty.R 26e86dceb79f10837dd9289f73c179ce *tests/testthat/test-zap_labels.R f8a9c07282b38738fd3598aa9a8b2d49 *tests/testthat/test-zap_missing.R c6b21f003d0cb2fcb5aec3221a41f09f *tests/testthat/test-zap_widths.R 5781ee0b0dd56a2d8cb1ac287731b8f0 *tests/testthat/types.dta 2cb2c744d5da2db3f373f4a5dd1ae86e *tests/testthat/umlauts.sav 556e6b180ab9777783d8ea55a37e6178 *tests/testthat/variable-label.sav f042585a32d10af7532a86f7dc23e402 *vignettes/datetimes.Rmd f024be9b02b3acf60943e92f5908e1dd *vignettes/releases/haven-0.1.0.Rmd f04e742c015d25a7a8ba3a44cba5099c *vignettes/releases/haven-1.0.0.Rmd 5e3d6c6c8a4ecf75edac41aeb6ebc73e *vignettes/rsconnect/documents/semantics.Rmd/rpubs.com/rpubs/Document.dcf 7597ff3a37dd05812a625a401f75ab60 *vignettes/semantics.Rmd haven/build/0000755000176200001440000000000013227731764012466 5ustar liggesusershaven/build/vignette.rds0000644000176200001440000000034313227731764015025 0ustar liggesusersuK0C  C\m&PLHyrpW⢝w!Ld&2-H6ֱ6r!9DҒ(-s )14RŠ{"sބ ZpTTiSӠ.5HA2Y ÝQNhAtÍʱqR#U6&/r݅|C0smf(+Ў4pࠇhaven/DESCRIPTION0000644000176200001440000000233013230073603013054 0ustar liggesusersPackage: haven Title: Import and Export 'SPSS', 'Stata' and 'SAS' Files Version: 1.1.1 Authors@R: c( person("Hadley", "Wickham", , "hadley@rstudio.com", role = c("aut", "cre")), person("Evan", "Miller", role = c("aut", "cph"), comment = "Author of included ReadStat code" ), person("RStudio", role = c("cph", "fnd")) ) Description: Import foreign statistical formats into R via the embedded 'ReadStat' C library, . 
License: MIT + file LICENSE
URL: http://haven.tidyverse.org, https://github.com/tidyverse/haven, https://github.com/WizardMac/ReadStat
BugReports: https://github.com/tidyverse/haven/issues
Depends: R (>= 3.1)
Imports: forcats (>= 0.2.0), hms, Rcpp (>= 0.11.4), readr (>= 0.1.0), tibble
Suggests: covr, knitr, rmarkdown, testthat
LinkingTo: Rcpp
VignetteBuilder: knitr
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
SystemRequirements: GNU make
NeedsCompilation: yes
Packaged: 2018-01-17 20:37:09 UTC; hadley
Author: Hadley Wickham [aut, cre], Evan Miller [aut, cph] (Author of included ReadStat code), RStudio [cph, fnd]
Maintainer: Hadley Wickham
Repository: CRAN
Date/Publication: 2018-01-18 10:31:31 UTC
haven/man/0000755000176200001440000000000013227731765012143 5ustar liggesusershaven/man/figures/0000755000176200001440000000000013127733241013575 5ustar liggesusershaven/man/figures/logo.png0000644000176200001440000003570712773557643015277 0ustar liggesusers
haven/man/tagged_na.Rd0000644000176200001440000000311313165724163014334 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tagged_na.R
\name{tagged_na}
\alias{tagged_na}
\alias{na_tag}
\alias{is_tagged_na}
\alias{format_tagged_na}
\alias{print_tagged_na}
\title{"Tagged" missing values}
\usage{
tagged_na(...)

na_tag(x)

is_tagged_na(x, tag = NULL)

format_tagged_na(x, digits = getOption("digits"))

print_tagged_na(x, digits = getOption("digits"))
}
\arguments{
\item{...}{Vectors containing a single character. The letter will be used to
"tag" the missing value.}

\item{x}{A numeric vector}

\item{tag}{If not \code{NULL}, will only return true if the tag has this value.}

\item{digits}{Number of digits to use in string representation}
}
\description{
"Tagged" missing values work exactly like regular R missing values except
that they store one additional byte of information: a tag, which is usually
a letter ("a" to "z"). When loading a SAS or Stata file, the tagged missing
values always use lower case values.
}
\details{
\code{format_tagged_na()} and \code{print_tagged_na()} format tagged
NA's as NA(a), NA(b), etc.
}
\examples{
x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)

# Tagged NA's work identically to regular NAs
x
is.na(x)

# To see that they're special, you need to use na_tag(),
# is_tagged_na(), or print_tagged_na():
is_tagged_na(x)
na_tag(x)
print_tagged_na(x)

# You can test for specific tagged NAs with the second argument
is_tagged_na(x, "a")

# Because the support for tagged NAs is somewhat tacked on to R,
# the left-most NA will tend to be preserved in arithmetic operations.
na_tag(tagged_na("a") + tagged_na("z"))
}
haven/man/zap_widths.Rd0000644000176200001440000000112213042172412014561 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/zap_widths.R
\name{zap_widths}
\alias{zap_widths}
\title{Remove display width attributes}
\usage{
zap_widths(x)
}
\arguments{
\item{x}{A vector or data frame.}
}
\description{
To provide some mild support for round-tripping variables between SPSS
and R, haven stores display widths in an attribute: \code{display_width}. If this
causes problems for your code, you can get rid of them with \code{zap_widths}.
}
\seealso{
Other zappers: \code{\link{zap_empty}},
  \code{\link{zap_formats}}, \code{\link{zap_labels}}
}
haven/man/print_labels.Rd0000644000176200001440000000122413042170233015065 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/labelled.R
\name{print_labels}
\alias{print_labels}
\title{Print the labels of a labelled vector}
\usage{
print_labels(x, name = NULL)
}
\arguments{
\item{x}{A labelled vector}

\item{name}{The name of the vector (optional)}
}
\description{
This is a convenience function, useful to explore the variables of
a newly imported dataset.
}
\examples{
s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F"))
s2 <- labelled(c(1, 1, 2), c(Male = 1, Female = 2))
labelled_df <- tibble::data_frame(s1, s2)

for (var in names(labelled_df)) {
  print_labels(labelled_df[[var]], var)
}
}
haven/man/read_dta.Rd0000644000176200001440000000351613227455653014201 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/haven.R
\name{read_dta}
\alias{read_dta}
\alias{read_stata}
\alias{write_dta}
\title{Read and write Stata DTA files.}
\usage{
read_dta(file, encoding = NULL)

read_stata(file, encoding = NULL)

write_dta(data, path, version = 14)
}
\arguments{
\item{file}{Either a path to a file, a connection, or literal data
(either a single string or a raw vector).
Files ending in \code{.gz}, \code{.bz2}, \code{.xz}, or \code{.zip} will
be automatically uncompressed. Files starting with \code{http://},
\code{https://}, \code{ftp://}, or \code{ftps://} will be automatically
downloaded. Remote gz files can also be automatically downloaded and
decompressed.

Literal data is most useful for examples and tests. It must contain at
least one new line to be recognised as data (instead of a path).}

\item{encoding}{The character encoding used for the file. This defaults to
the encoding specified in the file, or UTF-8. But older versions of Stata
(13 and earlier) did not store the encoding used, and you'll need to
specify manually. A commonly used value is "windows-1252".}

\item{data}{Data frame to write.}

\item{path}{Path to a file where the data will be written.}

\item{version}{File version to use. Supports versions 8-14.}
}
\value{
A tibble, data frame variant with nice defaults.

Variable labels are stored in the "label" attribute of each variable.
It is not printed on the console, but the RStudio viewer will show it.
}
\description{
Currently haven can read and write logical, integer, numeric, character
and factors. See \code{\link{labelled}} for how labelled variables in
Stata are handled in R.
}
\examples{
path <- system.file("examples", "iris.dta", package = "haven")
read_dta(path)

tmp <- tempfile(fileext = ".dta")
write_dta(mtcars, tmp)
read_dta(tmp)
read_stata(tmp)
}
haven/man/zap_empty.Rd0000644000176200001440000000102313042170233014414 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/zap_empty.R
\name{zap_empty}
\alias{zap_empty}
\title{Convert empty strings into missing values.}
\usage{
zap_empty(x)
}
\arguments{
\item{x}{A character vector}
}
\value{
A character vector with empty strings replaced by missing values.
}
\description{
Convert empty strings into missing values.
}
\examples{
x <- c("a", "", "c")
zap_empty(x)
}
\seealso{
Other zappers: \code{\link{zap_formats}},
  \code{\link{zap_labels}}, \code{\link{zap_widths}}
}
haven/man/labelled_spss.Rd0000644000176200001440000000227313042170233015230 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/labelled_spss.R
\name{labelled_spss}
\alias{labelled_spss}
\title{Labelled vectors for SPSS}
\usage{
labelled_spss(x, labels, na_values = NULL, na_range = NULL)
}
\arguments{
\item{x}{A vector to label. Must be either numeric (integer or double) or
character.}

\item{labels}{A named vector. The vector should be the same type as
\code{x}. Unlike factors, labels don't need to be exhaustive: only a fraction
of the values might be labelled.}

\item{na_values}{A vector of values that should also be considered as missing.}

\item{na_range}{A numeric vector of length two giving the (inclusive) extents
of the range. Use \code{-Inf} and \code{Inf} if you want the range to be
open ended.}
}
\description{
This class is only used when \code{user_na = TRUE} in
\code{\link{read_sav}()}. It is similar to the \code{\link{labelled}} class
but it also models SPSS's user-defined missings, which can be up to
three distinct values, or for numeric vectors a range.
}
\examples{
x1 <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_values = c(9, 10))
is.na(x1)

x2 <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_range = c(9, Inf))
is.na(x2)
}
haven/man/read_spss.Rd0000644000176200001440000000346113111416757014412 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/haven.R
\name{read_spss}
\alias{read_spss}
\alias{read_sav}
\alias{read_por}
\alias{write_sav}
\alias{read_spss}
\title{Read SPSS (SAV & POR) files.
Write SAV files.}
\usage{
read_sav(file, user_na = FALSE)

read_por(file, user_na = FALSE)

write_sav(data, path)

read_spss(file, user_na = FALSE)
}
\arguments{
\item{file}{Either a path to a file, a connection, or literal data
(either a single string or a raw vector).

Files ending in \code{.gz}, \code{.bz2}, \code{.xz}, or \code{.zip} will
be automatically uncompressed. Files starting with \code{http://},
\code{https://}, \code{ftp://}, or \code{ftps://} will be automatically
downloaded. Remote gz files can also be automatically downloaded and
decompressed.

Literal data is most useful for examples and tests. It must contain at
least one new line to be recognised as data (instead of a path).}

\item{user_na}{If \code{TRUE} variables with user defined missing will
be read into \code{\link{labelled_spss}} objects. If \code{FALSE}, the
default, user-defined missings will be converted to \code{NA}.}

\item{data}{Data frame to write.}

\item{path}{Path to a file where the data will be written.}
}
\value{
A tibble, data frame variant with nice defaults.

Variable labels are stored in the "label" attribute of each variable.
It is not printed on the console, but the RStudio viewer will show it.
}
\description{
Currently haven can read and write logical, integer, numeric, character
and factors. See \code{\link{labelled_spss}} for how labelled variables in
SPSS are handled in R. \code{read_spss} is an alias for \code{read_sav}.
}
\examples{
path <- system.file("examples", "iris.sav", package = "haven")
read_sav(path)

tmp <- tempfile(fileext = ".sav")
write_sav(mtcars, tmp)
read_sav(tmp)
}
haven/man/zap_missing.Rd0000644000176200001440000000134513042170233014736 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/zap_missing.R
\name{zap_missing}
\alias{zap_missing}
\title{Zap special missings to regular R missings}
\usage{
zap_missing(x)
}
\arguments{
\item{x}{A vector or data frame}
}
\description{
This is useful if you want to convert tagged missing values from SAS or
Stata, or user-defined missings from SPSS, to regular R \code{NA}.
}
\examples{
x1 <- labelled(
  c(1, 5, tagged_na("a", "b")),
  c(Unknown = tagged_na("a"), Refused = tagged_na("b"))
)
x1
zap_missing(x1)

x2 <- labelled_spss(
  c(1, 2, 1, 99),
  c(missing = 99),
  na_values = 99
)
x2
zap_missing(x2)

# You can also apply to data frames
df <- tibble::data_frame(x1, x2, y = 4:1)
df
zap_missing(df)
}
haven/man/as_factor.Rd0000644000176200001440000000311213042172717014357 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/as_factor.R
\name{as_factor}
\alias{as_factor}
\alias{as_factor.data.frame}
\alias{as_factor.labelled}
\title{Convert input to a factor.}
\usage{
\method{as_factor}{data.frame}(x, ..., only_labelled = TRUE)

\method{as_factor}{labelled}(x, levels = c("default", "labels", "values",
  "both"), ordered = FALSE, ...)
}
\arguments{
\item{x}{Object to coerce to a factor.}

\item{...}{Other arguments passed down to method.}

\item{only_labelled}{Only apply to labelled columns?}

\item{levels}{How to create the levels of the generated factor:
\itemize{
\item "default": uses labels where available, otherwise the values.
Labels are sorted by value.
\item "both": like "default", but pastes together the level and value
\item "labels": use only the labels; unlabelled values become \code{NA}
\item "values": use only the values
}}

\item{ordered}{If \code{TRUE} create an ordered (ordinal) factor, if
\code{FALSE} (the default) create a regular (nominal) factor.}
}
\description{
The base function \code{as.factor()} is not a generic, but this variant
is. Methods are provided for factors, character vectors, labelled
vectors, and data frames. By default, when applied to a data frame,
it only affects \code{labelled} columns.
}
\examples{
x <- labelled(sample(5, 10, replace = TRUE), c(Bad = 1, Good = 5))

# Default method uses labels where available
as_factor(x)
# You can also extract just the labels
as_factor(x, "labels")
# Or just the values
as_factor(x, "values")
# Or combine value and label
as_factor(x, "both")
}
haven/man/labelled.Rd0000644000176200001440000000323613042170233014160 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/labelled.R
\name{labelled}
\alias{labelled}
\alias{is.labelled}
\title{Create a labelled vector.}
\usage{
labelled(x, labels)

is.labelled(x)
}
\arguments{
\item{x}{A vector to label. Must be either numeric (integer or double) or
character.}

\item{labels}{A named vector. The vector should be the same type as
\code{x}. Unlike factors, labels don't need to be exhaustive: only a fraction
of the values might be labelled.}

\item{...}{Ignored}
}
\description{
A labelled vector is a common data structure in other statistical
environments, allowing you to assign text labels to specific values.
This class makes it possible to import such labelled vectors into R
without loss of fidelity. This class provides few methods, as I
expect you'll coerce to a standard R class (e.g. a \code{\link{factor}})
soon after importing.
}
\examples{
s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F"))
s2 <- labelled(c(1, 1, 2), c(Male = 1, Female = 2))

# Unfortunately it's not possible to make as.factor work for labelled objects
# so instead use as_factor. This works for all types of labelled vectors.
as_factor(s1)
as_factor(s1, levels = "values")
as_factor(s2)

# Other statistical software supports multiple types of missing values
s3 <- labelled(c("M", "M", "F", "X", "N/A"),
  c(Male = "M", Female = "F", Refused = "X", "Not applicable" = "N/A")
)
s3
as_factor(s3)

# Often when you have a partially labelled numeric vector, labelled values
# are special types of missing. Use zap_labels to replace labels with missing
# values
x <- labelled(c(1, 2, 1, 2, 10, 9), c(Unknown = 9, Refused = 10))
zap_labels(x)
}
haven/man/zap_formats.Rd0000644000176200001440000000120013042170233014726 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/zap_formats.R
\name{zap_formats}
\alias{zap_formats}
\title{Remove format attributes}
\usage{
zap_formats(x)
}
\arguments{
\item{x}{A vector or data frame.}
}
\description{
To provide some mild support for round-tripping variables between Stata/SPSS
and R, haven stores variable formats in an attribute: \code{format.stata},
\code{format.spss}, or \code{format.sas}. If this causes problems for your
code, you can get rid of them with \code{zap_formats}.
}
\seealso{
Other zappers: \code{\link{zap_empty}},
  \code{\link{zap_labels}}, \code{\link{zap_widths}}
}
haven/man/zap_labels.Rd0000644000176200001440000000150113042170233014521 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/zap_labels.R
\name{zap_labels}
\alias{zap_labels}
\title{Zap labels}
\usage{
zap_labels(x)
}
\arguments{
\item{x}{A vector or data frame}
}
\description{
Removes labels, leaving unlabelled vectors as is. Use this if you want to
simply drop all labelling from a data frame.
Zapping labels from \code{\link{labelled_spss}} also removes user-defined
missing values, replacing all with \code{NA}s.
}
\examples{
x1 <- labelled(1:5, c(good = 1, bad = 5))
x1
zap_labels(x1)

x2 <- labelled_spss(c(1:4, 9), c(good = 1, bad = 5), na_values = 9)
x2
zap_labels(x2)

# zap_labels also works with data frames
df <- tibble::data_frame(x1, x2)
df
zap_labels(df)
}
\seealso{
Other zappers: \code{\link{zap_empty}},
  \code{\link{zap_formats}}, \code{\link{zap_widths}}
}
haven/man/read_xpt.Rd0000644000176200001440000000254013111416757014232 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/haven.R
\name{read_xpt}
\alias{read_xpt}
\alias{write_xpt}
\title{Read and write SAS transport files}
\usage{
read_xpt(file)

write_xpt(data, path, version = 8)
}
\arguments{
\item{file}{Either a path to a file, a connection, or literal data
(either a single string or a raw vector).

Files ending in \code{.gz}, \code{.bz2}, \code{.xz}, or \code{.zip} will
be automatically uncompressed. Files starting with \code{http://},
\code{https://}, \code{ftp://}, or \code{ftps://} will be automatically
downloaded. Remote gz files can also be automatically downloaded and
decompressed.

Literal data is most useful for examples and tests. It must contain at
least one new line to be recognised as data (instead of a path).}

\item{data}{Data frame to write.}

\item{path}{Path to a file where the data will be written.}

\item{version}{Version of transport file specification to use: either 5 or 8.}
}
\value{
A tibble, data frame variant with nice defaults.

Variable labels are stored in the "label" attribute of each variable.
It is not printed on the console, but the RStudio viewer will show it.
}
\description{
The SAS transport format is an open format, as is required for submission
of the data to the FDA.
}
\examples{
tmp <- tempfile(fileext = ".xpt")
write_xpt(mtcars, tmp)
read_xpt(tmp)
}
haven/man/read_sas.Rd0000644000176200001440000000251513227426277014215 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/haven.R
\name{read_sas}
\alias{read_sas}
\alias{write_sas}
\title{Read and write SAS files.}
\usage{
read_sas(data_file, catalog_file = NULL, encoding = NULL,
  catalog_encoding = encoding, cols_only = NULL)

write_sas(data, path)
}
\arguments{
\item{data_file, catalog_file}{Path to data and catalog files. The files are
processed with \code{\link[readr]{datasource}()}.}

\item{encoding, catalog_encoding}{The character encoding used for the
`data_file` and `catalog_file` respectively. A value of `NULL`
uses the encoding specified in the file; use this argument to override
it if it is incorrect.}

\item{cols_only}{A character vector giving an experimental way to read in
only specified columns.}

\item{data}{Data frame to write.}

\item{path}{Path to file where the data will be written.}
}
\value{
A tibble, data frame variant with nice defaults.

Variable labels are stored in the "label" attribute of each variable.
It is not printed on the console, but the RStudio viewer will show it.
}
\description{
Reading supports both sas7bdat files and the accompanying sas7bcat files
that SAS uses to record value labels. Writing value labels is not currently
supported.
}
\examples{
path <- system.file("examples", "iris.sas7bdat", package = "haven")
read_sas(path)
}
haven/LICENSE0000644000176200001440000000011313130177044012353 0ustar liggesusers
YEAR: 2013-2016
COPYRIGHT HOLDER: Hadley Wickham; RStudio; and Evan Miller