Changelog
Source: NEWS.md
orgdata 1.5.0
- Make HDIR-systems default, keep FHI as an option.
- Some internal functions made more general to be used in other packages
orgdata 1.4.9
- `geo_merge()` gains a `localtable` argument. It can now be used on mapping tables generated with `geo_map_multi`.
- Add `geo_map_multi()`, to generate a multiyear mapping table. 9b3e1a1
- Fix `geo_merge()` d163c9a
- Temporarily made it possible to run on both FHI and HDIR systems.
- Allow specification of encoding in `read_file` ba980a7
orgdata 1.4.7
- Can use different encodings for reading files and Access database. Encoding can be specified in `options` with either `orgdata.encoding.access` or `orgdata.encoding.csv`. Check config file.
- Fix geo splitting in `geo_recode()` using `fix=TRUE` as default argument.
- Change options name from `debug.row` to `debug.rows`.
- Fix problem updating global options when config file has changed (#320)
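
The encoding options named in the 1.4.7 notes can be set like any other R option. The snippet below is a minimal sketch using base R only. The option names come from the entry above, while the encoding values (`"UTF-8"`, `"latin1"`) are illustrative assumptions and not documented package defaults.

```r
# Option names are taken from the changelog; the values are assumptions
# chosen for illustration, not orgdata defaults.
old <- options(
  orgdata.encoding.csv = "UTF-8",
  orgdata.encoding.access = "latin1"
)

# Read back the value that is now active
csv_enc <- getOption("orgdata.encoding.csv")
csv_enc

# Restore whatever was set before
options(old)
```

Keeping the return value of `options()` in `old` makes it easy to restore the previous settings after a one-off run.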
orgdata 1.4.6
- Should not give unnecessary warning with `dummy_grk` (#318)
- Rename argument `year` to `year.geo` in `make_file()` to be more explicit (#317)
orgdata 1.4.5
- Have access to future data from API (#316).
- Use `pak` package for upgrade.
orgdata 1.4.1
- Refactor some code.
- Specify encoding via config.yml file.
- Use `norgeo` package version 2.3.0 from GitHub ie. dev, instead of CRAN.
orgdata 1.3.0
- Fix selecting files without `KOBLID` (#315)
- Add more testing.
- Avoid warning when connection to the database is already closed.
- Create environment for status flow ie. `orgEnv`.
- Can use `select` arg to ease choosing file(s) in `make_file()` instead of `KOBLID`. Read the documentation on how to use the argument.
- Refactor some code.
- Edit some documents.
- Fix `dbDisconnect` error (#312)
- Edit message (#313)
- Edit a few other messages as well.
- Manage encoding warning for R 4.1 and below.
- Filtering with `IBRUKTIL` and `IBRUKFRA` for selecting original files. `IBRUKTIL` uses `>=` and `IBRUKFRA` uses `<` of the specified date. (#309)
- Update vignette on debugging.
- Utilise input from config when mutating columns #ref
- Add welcome logo.
- Use integer (#310)
- Proper use of boolean (#311)
orgdata 1.2.0
- Update vignette.
- Use keyword `delete` or `slett` when recoding with RE options to `""`. This replaces `empty` and `tom` as in #285.
- Selecting geo levels when geo codes derived from two columns will use the maximum number of digits to find geo level (#307)
- Rewrite some code for speed. (#231)
- Add unidentified municipalities as `xx99` to `geo_map()`. Unidentified municipalities do not exist from API. (#304)
- Age group with `AgeCat` can use mix categories easily with `[x]`. See example in `find_age_category()`. (#305)
- Fix bugs when writing and reading data from warehouse ie. marked the `KONTROLLERT` column in Access (#298)
- Fix bugs when accessing or deleting data with `see_data()` from data warehouse (#300)
- Upgrade package version dependency.
- Create unknown municipality codes with `xx99` in `geo_recode()` function ie. geo record table in Access. (#302)
- Recode unknown municipality codes with `xx99` for known county. (#303)
- MANHEADER accepts regular expression to select columname in addition to column index (#301)
- Function `read_file()` accepts SPSS file too (#297)
- Function `read_file()` now accepts txt file extension.
orgdata 1.1.0
- Make geo abbreviation clearer in the global options and no more depending on the order of geo names (#286)
orgdata 1.0
Order does matter when using functions in EXTRA column (#294)
Fix S3 age category functions `find_age_category()` (#294)
Delete unwanted column in the output when using `AgeCat` function in EXTRA (#295)
Use global options in input argument when relevant instead of using `match.arg()` (#263)
Edit document here and there.
Use codebook to recode age category for better speed (#292)
Read or delete data in the data warehouse can use multiple koblid in `see_data()`. Use `"all"` in `koblid` argument to select all data on the chosen filegroup (#291)
Fixed duplicated age groups (#290).
Use either `empty` or `tom` to represent regular expression to replace to since Access makes symbol `""` to be invisible (#285)
Change function name from `see_org()` to `see_data()` for viewing data in the data warehouse.
Use symbol `|` to separate multiple arguments in column `EXTRA` (#288)
Group age to specific or specified interval with `AgeCat()`. This function can be used in table for filegroup under `EXTRA` column (#287 #289)
Delete rows of similar columns with multiple specifications (#282)
Different ways to recode of similar column ie. duplicated, with defined lesid will give error.
Show current installed version against the new release version.
Some minor text editing.
Delete older file of DuckDB automatically since new version can’t read the older files (#280)
Actively ask users to update with given options when new release version is available.
Control that all columns that should be numeric don’t contain any string. The columns to be controlled for are now dynamic (#281)
Rename function alias from `rf()` to `rdf()` since `rf()` is already in use in stats package.
Deleting row with specification of lesid and without lesid ie. common filegroup, create errors. This is not fixed (#279)
Need to update version whenever a new release is available to ensure everyone is using the latest release. Else users can’t load the package.
Filter active files with date was done pragmatically ie. all files with `BRUKTIL` date other than `01-01-9999` will be excluded. Now filtering with date will be compared against current date. (#272)
Delete dataset from the database when unmarking column `KONTROLLERT` instead of updating the dataset, since updating it is time consuming. The users have to mark the column to save or read the dataset in the database. (#278)
Change default for argument `raw` in `make_file()` to `FALSE` as in config file.
Some text editing here and there.
Function `is_colour_txt()` can specify symbol directly without needing to rely on the global options or to use `withr` package. Just for cosmetic purposes 😃
Standard columns should be uppercase. Using lowercase creates error when GEO comes from two separated columns especially when handling recode from codebook (#277)
Standardize arguments names in `see_org()` function.
Parallel processing is deactivated since it doesn’t work smoothly with Access and DuckDB connection ie. DBI package, due to the problem with not-exportable objects for future package.
Replace SQLite with DuckDB again due to the speed for DuckDB. But since DuckDB is under active development, the new version might not work with the file created by older version. In such situation older DuckDB database should be deleted and re-run with the newer DuckDB version.
When loading the package, users will be reminded whenever new version is available.
Show as data frame when warning `NAs with coercion` instead of just the GEO number where the coercion took place (#274)
Replace DuckDB with SQLite (#271)
Columnames are case insensitive (#115)
Use ellipsis for other arguments of `data.table::fwrite()` in `save_file()` function.
Implicit null for number of geo digits is moved to config file. This makes it easy to maintain and expand.
Use `future.apply` package conditionally to reduce package vulnerability.
Deactivate dependency packages for parallel processing. Use it conditionally ie. to install manually if needed.
Reshape multiple columns containing more than one `VAL` is handled properly. When reshaping multiple columns to be just one `VAL` then leaving `RESHAPE_VAL` empty ie. use all columns not defined in `RESHAPE_ID` should work as before (#269)
Give more information when reshaping data fails, especially long reshape. Most of the time the source of error is in defining `RESHAPE_KOL` and `RESHAPE_VAL` (#268)
Can reshape multiple columns when reshaping to LONG. Applicable when there are multiple columns representing the number of cases and one column representing the denominator for all these numerators (#266)
Use KOBLID to add or delete from Duck database (#261)
Depends for norgeo package refers directly to CRAN version instead of GitHub repo.
Use function `geo_merge()` to add geo granularity that aren’t from API. This can be a csv, xls or any other format that is accepted by `read_file` (#262)
Deactivate KONTROLLERT without the need to unmark from Access database by using argument `raw = TRUE` when using function `make_file()` (#264)
Use KOBLID as table name in DuckDB instead of FILID since FILID is not a unique number ie. can be used by multiple KOBLID.
Update text and website
Change function name from `geo_levels()` to `geo_map()` for mapping geo codes granularity.
Change helper function names for `read_file()`. (#250)
Fixed bugs when reading file from the web with https (#251)
`read_file()` accepts Stata file with `dta` extension (#252)
Use yaml format for global configuration file to ease update. The file is located in config repo (#256)
Raw data that have been controlled for are saved in DuckDB database format. This increase reading speed especially for big files. Mark column KONTROLLERT in the Access registration database to activate this function. Unmark to read from the original raw file instead. (#257)
Use `see_org()` to read the raw data in the database. Argument `action = "delete"` can be used to delete the data from the raw database.
Function `geo_merge()` for merging geo codes that aren’t available from API to the mapping table ie. tblGeo, in the geo database. The data could be in any file format accepted by `read_file()` function. The data to be merged must have column to be merged ie. `id.file`, that is equivalent to the column id in the database ie. `id.table`. The `id.file` must be unique.
Handle unbalanced parentheses in post recode whenever possible when `raw` is used, else give error message. #246
`read_file()` accepts filegroup name as argument in `file` to read the completed file after running `make_file()` function. #247
Debug functions can go deeper to show helper functions as well with `debug_opt("deep")` or `options(orgdata.debug = "deep")` #243
Post recode uses type `PS` in codebook. The function is used when there is a need to recode the value of a column after the dataset has been cleaned and aggregated. Specification to select the row to be recoded uses either standard expression or R syntax of `data.table` style. When using R syntax the value must have `raw` prefix eg. `raw(AAR %in% c(2000, 2005))`. #244 #245
Delete rows with `"-"` minus symbol in `TIL` column in the codebook is accepted for `do_recode_post()`.
Able to aggregate to country level in addition to other geographical levels (#240)
Download the geo code to aggregate with only once despite the different geographical levels the dataset to be aggregated into (#241)
Update package via function `update_orgdata()`. Basically it’s just a wrapper for `remotes::install_github()`.
.Can aggregate to country level as well (#240)
Speed up aggregate process (#241)
Fixed the bugs with reshape wide consisting multiple dimensions ie. more than one TAB columns. (#228)
Ensure melting columns are not converted to factor (#234)
Some columns should be numeric type and these columns are GEO, AAR, KJONN, ALDER and all VALs. They are also specified in global options `orgdata.num`. Ensure these columns are numeric and give warning as well as log when coercion introduces NA. (#235)
Use log file to check what or which codes that have problem. (#237)
Warn and give logfile if GEO can’t be converted to integer before geo recode. (#233 #236)
Able to read files that have no specific extension but they have to be comma or semicolon separated (#227)
Ensure all VAL columns are numeric type (#229)
Give explicit warning when GEO codes have character that can cause coercion resulting in converting GEO codes to `NA` (#229)
Log files for `code99` also include koblid. The files will be named as `code99_koblidxxx` (#222)
Running parallel processing is unstable. Not sure if the process crash due to MS Access that can’t handle constant requests or … Anyway, running parallel now will only use 50% of the available cores (#224)
User can specify more or less cores than 50% if needed. To use 75% of the cores, specify `parallel = 0.75` as an argument in `make_file()` or in the global options `orgdata.parallel` (#225)
Connect to both database with common function. (#212)
Use global options `orgdata.year` to specify production year if not using current year. (#216)
Use parallel processing in `make_file()` with argument `parallel = TRUE`. (#217)
Get feedback for successful and unsuccessful filegroups when running `make_filegroups()`. (#199)
Reshape wide is not limited to only 3 reshape columns as it was. (#200 #201)
Create log files in default orgdata folder instead of occupying the REM. (#202)
Delete old bydel codes ie. before 2003, except for Oslo. This is due to the unstable geographical recoding of bydel. Use `EXTRA` column on filegroup level with argument `DeleteOldBydel`. (#204 #206)
Log files use `KOBLID` to be more specific. (#208)
Reshape to wide format can now implement multiple reshape dimensions which are usually one or multiple TAB columns. For instance for LESEFERD of `tot_elev` is the product of both `mestringsnivå` and `klassetrinn`. (#188)
The output will be arranged in a standard column order. (#190)
Debugging will be easier with a wrapper function `debug_opt()`. (#196)
Fixed #147 to implement reshape wide format to create denominator. This function is only applicable to certain file structure received from SSB. For example filegroup LESEFERD where column `tot_elev` represents the total number of students with `mestringsnivå` and not the grand total of students. The number of students with `mestringsnivå` is represented in column `ant_elev`. Therefore the long format for `mestringsnivå` needs to be restructured to wide with value from the `ant_elev` to ensure summing up `tot_elev` when creating denominator will not create a grand total of students instead of the total number of students with `mestringsnivå`. (#184)
Fixed #162 when saving geographical codebook with `write = TRUE` and the table doesn’t exist in the geo-code database.
Stop looping the geographical levels when `orgdata.debug.geo` or `orgdata.debug.aggregate` are active and make the default to `kommune`. (#166)
Add batch date from codebook as reference when debugging. (#168)
Hide warning messages if file has been controlled for any possible errors. It means the column KONTROLLERT in table for original files is marked for the specific file. (#171)
Refactoring function for recode of geographical codes. (#174)
Geo codes for enumeration areas and towns that aren’t able to be merged will be converted to `unknown` codes with either `xxxx9999` or `xxxx99`. (#177)
Create default folder at `C:/Users/YourUserName/orgdata_logs` when `path` argument is not specified in `save_file()`. (#179)
Create unknown codes for enumeration areas and town codes if not able to recode. Most probably is due to the enumeration codes before 2002. The Excel file from SSB is not clean and unstable to be used for recoding. The unknown codes are based on municipality codes with added `xxxx99` or `xxxx9999`. As in #177 but recode is done on municipality codes before merging back to the original dataset. (#182)
Add `codeDelete` in `log` for geographical codes that aren’t able to be merged. The codes will be excluded in the dataset. To access all the deleted codes use `log$codeDelete`. (#149)
Give explicit error message when `path` is missing in `save_file()`. (#152)
Able to select any valid year to recode geographical codes and aggregate accordingly. But recoding geographical codes backward isn’t possible. (#153)
Error message will split a long vector if exist. (#153)
Add column with batch date when the geographical codes were downloaded from SSB when creating a geo codebook with `geo_map()` or `geo_recode()`. (#156)
Actively select only the first geo code when old geo code is split into multiple new geo codes. (#159)
Able to select a base year for geo recode based on the year available in the original data. This is available in column `AAR` in the dataset. Use argument `base` or global option `orgdata.recode.base` with logical input. `TRUE` will select the base year for recoding geographical code from the year of the original file to the current year. Default is `FALSE` ie. include all available geographical codes available in the codebook. (#157)
Fixed #139 for `orgdata.debug.geo` keep original geo codes for enumeration areas before adding 9999. (#140)
Fixed #142 show codes that have problem to recode directly instead of row numbers (#144)
Save all codes that have problem in `log` environment for easy access. To list the codes is either with `log$code00` or `log$codeShort`
Recode geo even when argument `aggregate = FALSE` in `make_file()` function.
Rename `make_filegroup` and `lag_filgruppe` to plural ie. `make_filegroups` and `lag_filgrupper`.
Use options `orgdata.debug.rows` to select only specific row(s) for further processing. It can be activated via global options with `options(orgdata.debug.rows = 20:50)` or via argument `row = 20:50` in `make_file()` to select row 20 to 50.
Make multiple filegroups via `make_filegroups`. (#137)
Fixed #132 LANDSSB must be string
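
The `orgdata.debug.rows` option described a few entries above is an ordinary R option holding a vector of row numbers. The following is a small base R sketch of what such a row selection amounts to. The option name and the `20:50` value come from the changelog. The toy data frame and the subsetting step are assumptions for illustration and do not call orgdata itself.

```r
# Set the option as shown in the changelog entry
old <- options(orgdata.debug.rows = 20:50)

rows <- getOption("orgdata.debug.rows")

# Illustration only: apply the same row selection to a toy data frame.
# How orgdata uses the option internally is not shown here.
toy <- data.frame(id = 1:100)
subset_rows <- toy[rows, , drop = FALSE]
nrow(subset_rows)

# Restore the previous option state
options(old)
```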
Convert whitespace to NA to be able to delete all rows with NA
Fixed #118 warning text when column(s) aren’t defined in FILGRUPPE and will be deleted (#128)
Edit error message for columns with existing NA value before aggregating. Total value will be NA and this will conflict with the already existing NA category in the aggregated column(s). Therefore existing NA value in the selected column(s) must be recoded to a valid value.
Use `options(orgdata.debug.geo = TRUE)` to keep old geo codes for debugging (#120)
Use `reset_options()` to reset to default options.
Warn when process discontinued due to debugging.
Add vignettes for Standardize git and Debugging
Change database filename to raw-database_BE.accdb
Fixed #108 #112 grunnkrets codes that have changed before 2002 are not available via API from SSB while code changes for municipality include changes from 1977. Check from SSB website. We use the municipality codes to create unspecified grunnkrets codes for data before 2002 (#109 #113)
Fixed #110 updating SQL code for new table name for codebook (#111)
Check columns to aggregate for any possible `NA` (#98). Columns that have `NA` should be recoded to `uoppgitt` or something equivalent since leaving the category to `NA` will conflict with `NA` representing total value when aggregating.
Fixed #100 for grunnkrets that ends with `00` have no corresponding codes from SSB API. Need to add it manually (#101)
Fixed #99 when geo codes fail to be recoded then the row index will be shown (#103)
Geo codes ending with 4 zeros `xxxx0000` neither have equivalent codes from SSB nor represent a correct coding structure as the so called `Delområde` that ends with 2 zeros `xxxxxx00`. To avoid missing the information, these geo codes are recoded to `xxxx9999` with function `is_grunnkrets_0000()` as in PR (#103).
`see_file()` accepts just a single numeric as well.
Fixed #85 `see_file()` list all the columns when columnames or column indexes are not specified. The variables are sorted whenever possible. (#87)
Add more function tests (#88)
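
The `xxxx0000` to `xxxx9999` recoding mentioned above can be pictured with a short base R sketch. The helper name `recode_grunnkrets_0000()` is hypothetical and is not the package's `is_grunnkrets_0000()`, whose implementation is not shown in this changelog. The sketch also assumes 8-character codes stored as strings.

```r
# Hypothetical helper, not part of orgdata: recode 8-digit codes that end
# with four zeros so they end with 9999 instead. Other codes are untouched.
recode_grunnkrets_0000 <- function(codes) {
  codes <- as.character(codes)
  ends_0000 <- grepl("^[0-9]{4}0000$", codes)
  codes[ends_0000] <- sub("0000$", "9999", codes[ends_0000])
  codes
}

# Toy codes, made up for illustration
recoded <- recode_grunnkrets_0000(c("03010000", "03010100", "0301"))
recoded
```

Codes that have lost a leading zero through integer conversion would need separate handling, which this sketch leaves out.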
Exclude `LANDSSB` in aggregate when split to `LANDBAK` and `INNVKAT`. This is because code `0` will be recoded to `20` when split and causes unnecessary more rows (#84)
Delete deprecated functions.
Fixed #93 when source level can’t be identified due to `NA`.
Fixed #95 for grunnkrets codes that aren’t missing but have less number of digits ie. less than 7 digits. Assuming these are codes for municipality then `9999` is added at the end of these codes (#96)
Gives row number for GEO codes that get coerced as `NA` when converted to integer. This will make it easy to check in the original raw data (#96)
Aggregate now gives total to all dimensions including those specified in `AGGKOL` (#82)
Function `see_file()` accepts column index as well (#83)
Recode variables using regular expression when defined in codebook with type `RE`. Finding pattern can either be written in ordinary regular expression ie. `\\d{4}.*` or with `rex()` package. (#78)
New feature for checking categories for variables with `see_file()` (#75)
Fixed #65 make TABS and VALS dynamic for easy extension for these columns (#66)
Fixed #63 implicit null includes all possible VAL columns when exist (#69)
Fixed #67 aggregate with total values for standard variables ie. `UTDANN`, `LANDSSB`, `LANDBAK` and `INNVKAT` (#72)
Fixed #61 use AGGKOL in Access registration database to specify other columns to aggregate other than the standard eg. `KJONN`, `TAB1`, `TAB2` etc. (#73)
Fixed #55 to recode standard variables via codebook instead of hard coded (#58)
Fixed #57 split column with duplicated values will keep the original column (#60)
Fixed #56 aggregate all VAL columns whenever specified and not only specific to `VAL1` (#62)
Edit verbose messages
Reshape dataset from wide to long. Reshape can have more than one `measure variables`. Please read how this is specified in Access registration database.
Split columns must have equal number of values to the defined `SPLITTIL`. Duplicate the value if it is less than the maximum `SPLITTIL`. For example for value `0` in column `LANDSSB` which will be split into `LANDBAK` and `INNVKAT`, the value will be duplicated into `00` to avoid split with value `NA`.
Recode for `LANDBAK` and `INNVKAT` after aggregating are done internally ie. hard coded, in `do_aggregate_recode_standard()`. Total is coded with `20`. Any future change should also look at other related functions such as `is_aggregate_standard_cols()` and `is_col_num()`.
Change argument parameter for `find_spec()` function.
Update text document in several places.
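
The `SPLITTIL` rule noted above (duplicating `0` into `00` before splitting `LANDSSB` into `LANDBAK` and `INNVKAT`) can be sketched in base R. The helper `split_value()` is hypothetical and not part of orgdata. It assumes that each target column receives exactly one character and that only single-character values are duplicated. The input values are made up.

```r
# Hypothetical helper, not part of orgdata: split a code into one
# character per target column, duplicating single-character values first
# so that no target column ends up empty.
split_value <- function(x, targets) {
  x <- as.character(x)
  n <- length(targets)
  single <- !is.na(x) & nchar(x) == 1L
  x[single] <- strrep(x[single], n)          # "0" becomes "00" for two targets
  parts <- lapply(seq_len(n), function(i) substr(x, i, i))
  names(parts) <- targets
  as.data.frame(parts, stringsAsFactors = FALSE)
}

# Toy input, made up for illustration
out <- split_value(c("0", "12"), c("LANDBAK", "INNVKAT"))
out
```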
Add colour type warn2 for warning message without `Warning:` prefix.
Request (#43) messages with specific colour
Fixed (#46) recode to string even though columns is type integer or numeric.
Unknown bydel ie. (uoppgitt) is added when enumeration areas codes ie. (grunnkrets) for bydel is `XXXX9999` in function `geo_level()`.
Add unknown grunnkrets for kommune when not available since some of the datasets have unknown grunnkrets that aren’t listed in API downloaded data (#39).
Exclude `TAB1`, `TAB2` and `TAB3` from being aggregated. (#44)
Recode for aggregated variables uses `AG` in TYPE column in the codebook instead of FILGRUPPE with `AGGREGATE` as it was implemented in ver 0.2.0. This will make it possible to specify FILGRUPPE and LESID to implement the principle for GENERAL, COMMON and SPECIFIC variables.
Change function name `do_aggregate_recode` to `do_aggregate_recode_standard` for standard variables.
Recode for aggregated categories can be defined in Recode form ie. codebook, and use `AGGREGATE` in the specification under FILGRUPPE
Delete rows when defined in codebook using minus symbol under TIL column. Similar principles are implemented for GENERAL, COMMON and SPECIFIC feature as in recode. Read detail in ver 0.0.5 - alpha.
Display both columnames to be recoded that are found in the dataset or those that aren’t found when defined as `ALLE` in the codebook so user will be aware of its existence.
Standardize some most used arguments to `read_file()` such as `nrows`, `header`, `skip`, `trimws` and `na`. Read details in `read_file()` function description.
Output to `read_file()` as data.table class.
Use standard columnames with `V1`, `V2` etc when argument `header = FALSE` is specified.
Error message with list of unmatched columns in `do_column_standard()`.
Give clearer message and debug message eg. `Execute: read_file()`.
Change `MAPPE` to `UTMAPPE` to make it more explicit for path specification to save file.
Defunct `orgdata.active` global options to use columnames from original dataset.
Use global options `options(orgdata.debug.nrow = TRUE)` to read only first 20 rows. Suitable for debug purposes.
Fix (#28) GEO derived from two columns with empty INNLESARG.
Add column `LEVEL` for granularity level ie. grunnkrets, fylke, kommune, bydel etc
orgdata 0.0.1 - alpha version
When MANHEADER is used then the new columname must be specified in the respective standard column (#21)
Deprecated arguments `geo` and `val` in `make_file`. Output data must use standard columnames instead of keeping the columnames from original dataset.
Rename functions `read_raw` or `lesraw` to `make_file` or `lag_fil` (#27)
All functions use underscore `_` for both English and Norwegian.
Add new columns if one of the standard columns is missing in the original data. The value to be inserted to the new column must use symbol less than `<` and more than `>`. For instance when column `KJONN` doesn’t exist in the original data, we can specify with `<2>` under column `KJONN` in the Access registration database. The output will add a new column `KJONN` with value `2`. (#15)
Default `orgdata.verbose` is `TRUE`.
Options for `orgdata.implicit.null` with default as `TRUE`. Use `options(orgdata.implicit.null = FALSE)` to deactivate (#19)
Use version specific for imported packages.
Rename standard column `LANDBAK` to `LANDSSB` for column in original data received from SOB containing information about country of origin.
Save file as specified in column `MAPPE` in Access registration database or specify in `path` argument for function `save_file`. (#12)
Changes in PR #8 are as follows:
- Create GEO code from two separate columns. This has to be defined in Access registration under `GEO` with comma separated eg. `nameGeoCol1, nameGeoCol2`.
- Order standard columns in the output dataset with this order for the first four columns: `GEO`, `AAR`, `ALDER`, `KJONN`
- Change Norwegian name for `save_file` from `lagfil` to `lagrefil`.
- Use column name `KOLNAVN` instead of `ADDKOL`.
- Rename function `do_addcols` and `get_addcols` to `do_colname` and `get_colname` to be consistent with the changes in Access registration database.
Changes in PR #4 are as follows:
- Recode variables from specification in `tbl_KodeBok` uses:
  - GENERAL variables are defined in FILGRUPPE as `ALLE` and are used to recode variables in all groups.
  - COMMON variables are when FILGRUPPE is specified but have empty LESID. This will recode variables within selected group.
  - SPECIFIC variables are when FILGRUPPE and LESID are specified. This will recode variables in that specified FILGRUPPE of the specified FILID.
- When all these three specification exist in `tbl_KodeBok`:
  - SPECIFIC variables will overrule COMMON variables
  - COMMON variables will overrule GENERAL variables
- Write as `<NA>` in codebook under column `FRA` when specifying missing variables indicating that a missing column to be recoded to value in column `TIL`. This will differentiate between real missing and a real column value of `NA`. (#5)
- Error message will be given if LESID is specified without FILGRUPPE since LESID is not unique ID.
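
The GENERAL, COMMON and SPECIFIC precedence described for `tbl_KodeBok` above can be pictured with a small base R sketch. The column names follow the changelog. The toy codebook rows, the filegroup names and the helper `pick_rule()` are hypothetical, and the real package may resolve the rules differently, for example per `FRA` value.

```r
# Toy codebook; values are made up for illustration.
codebook <- data.frame(
  FILGRUPPE = c("ALLE", "GRP1", "GRP1"),
  LESID     = c(NA, NA, "v1"),
  FRA       = c("M", "M", "M"),
  TIL       = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the most specific rule set that matches.
# SPECIFIC overrules COMMON, and COMMON overrules GENERAL.
pick_rule <- function(cb, filgruppe, lesid) {
  specific <- cb[cb$FILGRUPPE == filgruppe & !is.na(cb$LESID) & cb$LESID == lesid, ]
  common   <- cb[cb$FILGRUPPE == filgruppe & is.na(cb$LESID), ]
  general  <- cb[cb$FILGRUPPE == "ALLE", ]
  if (nrow(specific) > 0) return(specific)
  if (nrow(common) > 0) return(common)
  general
}

pick_rule(codebook, "GRP1", "v1")$TIL  # SPECIFIC rule applies
pick_rule(codebook, "GRP1", "v2")$TIL  # falls back to COMMON
pick_rule(codebook, "GRP2", "v1")$TIL  # falls back to GENERAL
```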
orgdata 0.0.4 - alpha
- MANHEADER and KOLNAVN use common helper function `is_col_separate()`
- MANHEADER `old` is converted to integer and used as index for columns
Changes are in PR #2