R tutorial: How to import data into R

This article is excerpted from the book "Practical R for Mass Communication and Journalism" with permission from the publisher. © 2019 Taylor & Francis Group, LLC.

Before you can analyze and visualize data, you have to get it into R. There are various ways to do this, depending on how your data is formatted and where it's located.

Usually, the function you use to import data depends on the data's file format. In base R, for example, you can import a CSV file with read.csv(). Hadley Wickham created a package called readxl that, as you might expect, has a function to read in Excel files. There's another package, googlesheets, for pulling in data from Google Sheets.

But if you don't want to remember all that, there's rio.

The magic of rio

"The aim of rio is to make data file I/O [import/output] in R as easy as possible by implementing three simple functions in Swiss-army knife style," according to the project's GitHub page. Those functions are import(), export(), and convert().

So, the rio package has just one function to read in many different types of files: import(). If you import("myfile.csv"), it knows to use a function to read a CSV file. import("myspreadsheet.xlsx") works the same way. In fact, rio handles more than two dozen formats, including tab-separated data (with the extension .tsv), JSON, Stata, and fixed-width format data (.fwf).

Packages needed for this tutorial

  • rio
  • htmltab
  • readxl
  • googlesheets
  • pacman
  • janitor
  • rmiscutils (on GitHub) or readr
  • tibble

Once you've analyzed your data, if you want to save the results as a CSV, an Excel spreadsheet, or another format, rio's export() function can handle that.

If you don't already have the rio package on your system, install it now with install.packages("rio").
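Here is a minimal sketch of that import/export round trip, assuming rio is installed. The data frame, its values, and the file name are invented for illustration; only export() and import() come from rio, and the code is skipped if the package is missing.

```r
# Illustrative only: snow_sample and its values are invented, not the Boston data
if (requireNamespace("rio", quietly = TRUE)) {
  snow_sample <- data.frame(
    Winter = c("Winter A", "Winter B", "Winter C"),
    Total  = c(40.5, 22.0, 61.3)
  )

  # Write to a temporary CSV; rio picks the format from the .csv extension
  csv_path <- file.path(tempdir(), "snow_sample.csv")
  rio::export(snow_sample, csv_path)

  # Read it back with the same one-size-fits-all import() function
  snow_back <- rio::import(csv_path)
  str(snow_back)
}
```

Changing the extension in csv_path to another format rio supports is all it should take to switch file types.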

I've set up some sample data with Boston winter snowfall totals. You could head to //bit.ly/BostonSnowfallCSV and right-click to save the file as BostonWinterSnowfalls.csv in your current R project working directory. But one of the points of scripting is to replace manual work, tedious or otherwise, with automation that is easy to reproduce. Instead of clicking to download, you can use R's download.file function with the syntax download.file("url", "destinationFileName.csv"):

download.file("//bit.ly/BostonSnowfallCSV", "BostonWinterSnowfalls.csv")

This assumes that your system will redirect from that Bit.ly URL shortcut and successfully find the real file URL, //raw.githubusercontent.com/smach/NICAR15data/master/BostonWinterSnowfalls.csv. I've occasionally had problems accessing web content on old Windows PCs. If you've got one of those and this Bit.ly link isn't working, you can swap in the actual URL for the Bit.ly link. Another option is upgrading your Windows PC to Windows 10, if possible, to see if that does the trick.

If you wish that rio could just import data directly from a URL, in fact it can, and I'll get to that in the next section. The point of this section is to get practice working with a local file.

Once you have the test file on your local system, you can load that data into an R object called snowdata with the code:

snowdata <- rio::import("BostonWinterSnowfalls.csv")

Note that it's possible rio will ask you to re-download the file in binary format, in which case you'll need to run

download.file("//bit.ly/BostonSnowfallCSV", "BostonWinterSnowfalls.csv", mode = "wb")

Make sure to use RStudio's tab completion options. If you type rio:: and wait, you'll get a list of all available functions. Type snow and wait, and you should see the full name of your object as an option. Use your up and down arrow keys to move between auto-completion suggestions. Once the option you want is highlighted, press the Tab key (or Enter) to have the full object or function name added to your script.

You should see the object snowdata appear in your Environment tab in the RStudio top right pane. (If that top right pane is showing your command History instead of your Environment, select the Environment tab.)

Taylor & Francis Group

snowdata should show that it has 76 "obs." (observations, or rows) and two variables, or columns. If you click the arrow to the left of snowdata to expand the listing, you'll see the two column names and the type of data each column holds. The Winter column is character strings and the Total column is numeric. You should also be able to see the first few values of each column in the Environment pane.

Click the word snowdata itself in the Environment tab for a more spreadsheet-like view of your data. You can get that same view from the R console with the command View(snowdata) (that has to be a capital V in View; view won't work). Note: snowdata is not in quotation marks because you are referring to the name of an R object in your environment. In the rio::import command earlier, BostonWinterSnowfalls.csv is in quotation marks because that's not an R object; it's a character string name of a file outside of R.

This view has a couple of spreadsheet-like behaviors. Click a column header to sort by that column's values in ascending order; click the same column header a second time to sort in descending order. There's a search box to find rows matching certain characters.

If you click the Filter icon, you get a filter for each column. The Winter character column works as you might expect, filtering for any rows that contain the characters you type in. If you click the filter for the numeric Total column, though, older versions of RStudio show a slider while newer ones show a histogram and a box for filtering.

Import a file from the web

If you want to download and import a file from the web, you can do so if it's publicly available and in a format such as Excel or CSV. Try

snowdata <- rio::import("//bit.ly/BostonSnowfallCSV", format = "csv")

A lot of systems can follow the redirect URL to the file even after first giving you an error message, as long as you specify the format as "csv", because the file name here doesn't include .csv. If yours won't work, use the URL //raw.githubusercontent.com/smach/R4JournalismBook/master/data/BostonSnowfall.csv instead.

rio can also import well-formatted HTML tables from web pages, but the tables have to be extremely well-formatted. Let's say you want to download the table that describes the National Weather Service's severity ratings for snowstorms. The National Centers for Environmental Information Regional Snowfall Index page has just one table, very well crafted, so code like this should work:

rsi_description <- rio::import("//www.ncdc.noaa.gov/snow-and-ice/rsi/", format = "html")

Note again that you need to include the format, in this case format = "html", because the URL itself doesn't give any indication as to what kind of file it is. If the URL included a file name with an .html extension, rio would know.

In real life, though, web data rarely appears in such neat, isolated form. A good option for cases that aren't quite as well crafted is often the htmltab package. Install it with install.packages("htmltab"). The package's function for reading an HTML table is also called htmltab. But if you run this:

library(htmltab)
citytable <- htmltab("//en.wikipedia.org/wiki/List_of_United_States_cities_by_population")
str(citytable)

you'll see that you don't have the correct table, because the data frame contains one object. Because I didn't specify which table, it pulled the first HTML table on the page. That didn't happen to be the one I want. I don't feel like importing every table on the page until I find the right one, but fortunately I have a Chrome extension called Table Capture that lets me view a list of the tables on a page.

The last time I checked, table 5, with more than 300 rows, was the one I wanted. If that doesn't work for you now, try installing Table Capture on a Chrome browser to check which table you want to download.

I'll try again, specifying table 5, and then see what column names are in the new citytable. Note that in the following code, I put the citytable <- htmltab() command on multiple lines. That's so it doesn't run over the margins; you can keep everything on a single line. If the table number has changed since this article was posted, replace which = 5 with the correct number.

Instead of using the page on Wikipedia, you can replace the Wikipedia URL with the URL of a copy of the file I created. That file is at //bit.ly/WikiCityList. To use that version, type bit.ly/WikiCityList into a browser, then copy the lengthy URL it redirects to and use that instead of the Wikipedia URL in the code below:

library(htmltab)
citytable <- htmltab("//en.wikipedia.org/wiki/List_of_United_States_cities_by_population",
                     which = 5)
colnames(citytable)

How did I know which was the argument I needed to specify the table number? I read the htmltab help file using the command ?htmltab. That included all the available arguments. I scanned the possibilities, and "which a vector of length one for identification of the table in the document" looked right.

Note, too, that I used colnames(citytable) instead of names(citytable) to see the column names. Either will work. Base R also has the rownames() function.

Anyway, those table results are a lot better, although you can see from running str(citytable) that a couple of columns that should be numbers came in as character strings. You can see this both from the chr next to the column name and from the quotation marks around values like 8,550,405.

This is one of R's small annoyances: R generally doesn't understand that 8,550 is a number. I dealt with this problem myself by writing my own function in my own rmiscutils package to turn all those "character strings" that are really numbers with commas back into numbers. Anyone can download the package from GitHub and use it.
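A minimal base R sketch of the problem and one way around it, assuming the only non-numeric characters are comma thousands separators. The values are examples, and this is not necessarily how number_with_commas() is implemented:

```r
# Character strings with comma thousands separators (example values)
pop_strings <- c("8,550,405", "3,971,883", "2,720,546")

# as.numeric() cannot parse the commas, so it returns NA (with a warning)
suppressWarnings(as.numeric(pop_strings))

# Strip the commas first, then convert
pop_numbers <- as.numeric(gsub(",", "", pop_strings, fixed = TRUE))
pop_numbers
```

This simple approach would mangle values that use a comma as the decimal mark, so it only suits US-style number formatting.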

The most popular way to install packages from GitHub is to use a package called devtools. devtools is an extremely powerful package designed mostly for people who want to write their own packages, and it includes a few ways to install packages from places other than CRAN. However, devtools usually requires a couple of extra steps to install compared to a typical package, and I want to leave annoying system-admin tasks until they're absolutely necessary.

However, the pacman package also installs packages from non-CRAN sources like GitHub. If you haven't yet, install pacman with install.packages("pacman").

pacman's p_install_gh("username/packagerepo") function installs from a GitHub repo.

p_load_gh("username/packagerepo") loads a package into memory if it already exists on your system; if the package doesn't exist locally, it first installs the package from GitHub and then loads it.

My rmisc utilities package can be found at smach/rmiscutils. Run pacman::p_load_gh("smach/rmiscutils") to install my rmiscutils package.

Note: An alternative package for installing packages from GitHub is called remotes, which you can install via install.packages("remotes"). Its main purpose is to install packages from remote repositories such as GitHub. You can look at the help file with help(package="remotes").

And, possibly the slickest of all is a package called githubinstall. It aims to guess the repo where a package resides. Install it via install.packages("githubinstall"); then you can install my rmiscutils package using githubinstall::gh_install_packages("rmiscutils"). You are asked if you want to install the package at smach/rmiscutils (you do).

Now that you've installed my collection of functions, you can use my number_with_commas() function to change those character strings that should be numbers back into numbers. I strongly suggest adding a new column to the data frame instead of modifying an existing column; that's good data analysis practice no matter what platform you're using.

In this example, I'll call the new column PopEst2017. (If the table has been updated since, use appropriate column names.)

library(rmiscutils)
citytable$PopEst2017 <- number_with_commas(citytable$`2017 estimate`)

By the way, my rmiscutils package isn't the only way to deal with imported numbers that have commas. After I created my rmiscutils package and its number_with_commas() function, the tidyverse readr package was born. readr also includes a function that turns character strings into numbers, parse_number().

After installing readr, you could generate numbers from the 2017 estimate column with readr:

citytable$PopEst2017 <- readr::parse_number(citytable$`2017 estimate`)

One advantage of readr::parse_number() is that you can define your own locale() to control things like encoding and decimal marks, which may be of interest to non-US-based readers. Run ?parse_number for more info.

Note: If you didn't use tab completion for the 2017 estimate column, you might have had a problem with that column name if it has a space in it at the time you're running this code. In my code above, notice there are backward single quote marks (`) around the column name. That's because the existing name had a space in it, which you're not supposed to have in R. That column name has another problem: It starts with a number, also generally an R no-no. RStudio knows this and automatically adds the needed back quotes around the name with tab autocomplete.
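To see the backtick rule in isolation, here is a small base R sketch. The two-row table and its values are invented stand-ins for the Wikipedia data, and citytable_demo is a hypothetical name:

```r
# Invented two-row table whose column name has a space and starts with a digit;
# check.names = FALSE stops data.frame() from rewriting the name
citytable_demo <- data.frame(
  City = c("City A", "City B"),
  `2017 estimate` = c("1,234,567", "890,123"),
  check.names = FALSE
)

names(citytable_demo)

# Backticks are required to refer to the non-syntactic column name
citytable_demo$`2017 estimate`

# make.names() shows the syntactic version base R would otherwise create
make.names("2017 estimate")
```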

Bonus tip: There's an R package (of course there is!) called janitor that can automatically fix troublesome column names imported from a non-R-friendly data source. Install it with install.packages("janitor"). Then, you can create new clean column names using janitor's clean_names() function.

Now, I'll create an entirely new data frame instead of altering column names on my original data frame, and run janitor's clean_names() on the original data. Then, check the data frame column names with names():

citytable_cleaned <- janitor::clean_names(citytable)

names(citytable_cleaned)

You'll see that the spaces have been changed to underscores, which are legal in R variable names (as are periods). And, all column names that used to start with a number now have an x at the beginning.

If you don't want to waste memory by having two copies of essentially the same data, you can remove an R object from your working session with the rm() function: rm(citytable).

Import data from packages

There are several packages that let you access data directly from R. One is quantmod, which allows you to pull some US government and financial data directly into R.

Another is the aptly named weatherData package on CRAN. It can pull data from the Weather Underground API, which has information for many countries around the world.

The rnoaa package, a project from the rOpenSci group, taps into several different US National Oceanic and Atmospheric Administration data sets, including daily climate, buoy, and storm information.

If you are interested in state or local government data in the US or Canada, you may want to check out RSocrata to see if an agency you're interested in posts data there. I've yet to find a complete list of all available Socrata data sets, but there's a search page at //www.opendatanetwork.com. Be careful, though: There are community-uploaded sets along with official government data, so check a data set's owner and upload source before relying on it for more than R practice. "ODN Dataset" in a result means it's a file uploaded by someone in the general public. Official government data sets tend to live at URLs such as //data.CityOrStateName.gov and //data.CityOrStateName.us.

For more data-import packages, see my searchable chart at //bit.ly/RDataPkgs. If you work with US government data, you might be particularly interested in censusapi and tidycensus, both of which tap into US Census Bureau data. Other useful government data packages include eu.us.opendata, from the US and European Union governments, which makes it easier to compare data in both regions, and cancensus for Canadian census data.

When the data's not ideally formatted

In all these sample data cases, the data has been not only well-formatted but ideal: Once I found it, it was perfectly structured for R. What do I mean by that? It was rectangular, with each cell having a single value instead of merged cells. And the first row had column headers, as opposed to, say, a title row in large font across multiple cells in order to look pretty, or no column headers at all.

Dealing with untidy data can, unfortunately, get pretty complicated. But there are a couple of common issues that are easy to fix.

Beginning rows that aren’t part of the data. If you know that the first few rows of an Excel spreadsheet don’t have data you want, you can tell rio to skip one or more lines. The syntax is rio::import("mySpreadsheet.xlsx", skip=3) to exclude the first three rows. skip takes an integer.

There are no column names in the spreadsheet. The default import assumes the first row of your sheet is the column names. If your data doesn’t have headers, the first row of your data may end up as your column headers. To avoid that, use rio::import("mySpreadsheet.xlsx", col_names = FALSE) so R will generate default headers of X0, X1, X2, and so on. Or, use a syntax such as rio::import("mySpreadsheet.xlsx", col_names = c("City", "State", "Population")) to set your own column names.

If there are multiple tabs in your spreadsheet, the which argument overrides the default of reading in the first worksheet. rio::import("mySpreadsheet.xlsx", which = 2) reads in the second worksheet.

What’s a data frame? And what can you do with one?

rio imports a spreadsheet or CSV file as an R data frame. How do you know whether you’ve got a data frame? In the case of snowdata, class(snowdata) returns the class, or type, of object it is. str(snowdata) also tells you the class and adds a bit more information. Much of the info you see with str() is similar to what you saw for this example in the RStudio environment pane: snowdata has 76 observations (rows) and two variables (columns).
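A quick sketch of those checks on a small, invented stand-in for snowdata (the real file has 76 rows; snow_demo and its values are hypothetical):

```r
# Small invented stand-in for snowdata
snow_demo <- data.frame(
  Winter = c("Winter A", "Winter B", "Winter C"),
  Total  = c(47.8, 23.9, 45.7),
  stringsAsFactors = FALSE
)

class(snow_demo)          # "data.frame"
is.data.frame(snow_demo)  # TRUE
str(snow_demo)            # class, dimensions, and each column's type
```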

Data frames are somewhat like spreadsheets in that they have columns and rows. However, data frames are more structured. Each column in a data frame is an R vector, which means that every item in a column has to be the same data type. One column can be all numbers and another column can be all strings, but within a column, the data has to be consistent.

If you’ve got a data frame column with the values 5, 7, 4, and “value to come,” R will not simply be unhappy and give you an error. Instead, it will coerce all your values to be the same data type. Because “value to come” can’t be turned into a number, 5, 7, and 4 will end up being turned into character strings of "5", "7", and "4". This isn’t usually what you want, so it’s important to be aware of what type of data is in each column. One stray character string value in a column of 1,000 numbers can turn the whole thing into characters. If you want numbers, make sure you have them!

R does have a way of referring to missing data that won’t screw up the rest of your columns: NA means “not available.”
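The coercion rule is easy to try on a plain vector, which behaves the same way as a data frame column. A minimal sketch using the values from the example above:

```r
# One stray string forces the whole vector to character
mixed <- c(5, 7, 4, "value to come")
class(mixed)    # "character"
mixed           # "5" "7" "4" "value to come"

# NA marks the missing value without changing the type
with_na <- c(5, 7, 4, NA)
class(with_na)  # "numeric"

# Many summary functions need to be told to skip NA values
mean(with_na, na.rm = TRUE)
```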

Data frames are rectangular: Each row must have the same number of entries (although some can be blank), and each column must have the same number of items.

Excel spreadsheet columns are typically referred to by letters: Column A, Column B, etc. You can refer to a data frame column with its name, by using the syntax dataFrameName$columnName. So, if you type snowdata$Total and press Enter, you see all the values in the Total column, as shown in the figure below. (That's why when you run the str(snowdata) command, there's a dollar sign before the name of each column.)

A reminder that those bracketed numbers at the left of the listing aren’t part of the data; they’re just telling you what position each line of data starts with. [1] means that line starts with the first item in the vector, [10] the tenth, etc.

RStudio tab completion works with data frame column names as well as object and function names. This is pretty useful to make sure you don’t misspell a column name and break your script—and it also saves typing if you’ve got long column names.

Type snowdata$ and wait, then you see a list of all the column names in snowdata.

It’s easy to add a column to a data frame. Currently, the Total column shows winter snowfall in inches. To add a column showing totals in meters, you can use this format:

snowdata$Meters <- snowdata$Total * 0.0254

The name of the new column is on the left, and there’s a formula on the right. In Excel, you might have used =A2 * 0.0254 and then copied the formula down the column. With a script, you don’t have to worry about whether you’ve applied the formula properly to all the values in the column.

Now look at your snowdata object in the Environment tab. It should have a third variable, Meters.

Because snowdata is a data frame, it has certain data-frame properties that you can access from the command line. nrow(snowdata) gives you the number of rows and ncol(snowdata) the number of columns. Yes, you can view this in the RStudio environment to see how many observations and variables there are, but there will probably be times when you’ll want to know this as part of a script. colnames(snowdata) or names(snowdata) gives you the names of snowdata columns. rownames(snowdata) gives you any row names (if none were set, it will default to character strings of the row number such as "1", "2", "3", etc.).

Some of these special dataframe functions, also known as methods, not only give you information but let you change characteristics of the data frame. So, names(snowdata) tells you the column names in the data frame, but

names(snowdata) <- c("Winter", "SnowInches", "SnowMeters")

changes the column names in the data frame.
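Put together in a script, those calls might look like the sketch below. It runs on an invented three-row stand-in (snow_demo is a hypothetical name, and the values are not the Boston data):

```r
# Invented stand-in for snowdata, with the Meters column added as described
snow_demo <- data.frame(
  Winter = c("Winter A", "Winter B", "Winter C"),
  Total  = c(47.8, 23.9, 45.7),
  stringsAsFactors = FALSE
)
snow_demo$Meters <- snow_demo$Total * 0.0254

nrow(snow_demo)      # 3
ncol(snow_demo)      # 3
rownames(snow_demo)  # "1" "2" "3"

# Assigning to names() replaces the column names in place
names(snow_demo) <- c("Winter", "SnowInches", "SnowMeters")
names(snow_demo)
```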

You probably won’t need to know all available methods for a data frame object, but if you’re curious, methods(class=class(snowdata)) displays them. To find out more about any method, run the usual help query with a question mark, such as ?merge or ?subset.

When a number’s not really a number

ZIP codes are a good example of “numbers” that shouldn’t really be treated as such. Although technically numeric, it doesn’t make sense to do things like add two ZIP codes together or take an average of ZIP codes in a community. If you import a ZIP-code column, R will likely turn it into a column of numbers. And if you’re dealing with areas in New England where ZIP codes start with 0, the 0 will disappear.

I have a tab-delimited file of Boston ZIP codes by neighborhood, downloaded from a Massachusetts government agency, at //raw.githubusercontent.com/smach/R4JournalismBook/master/data/bostonzips.txt. If I tried to import it with zips <- rio::import("bostonzips.txt"), the ZIP codes would come in as 2118, 2119, etc. and not 02118, 02119, and so on.

This is where it helps to know a little bit about the underlying function that rio’s import() function uses. You can find those underlying functions by reading the import help file at ?import. For pulling in tab-separated files, import uses either fread() from the data.table package or base R’s read.table() function. The ?read.table help says that you can specify column classes with the colClasses argument.

Create a data subdirectory in your current project directory, then download the bostonzips.txt file with

download.file("//raw.githubusercontent.com/smach/R4JournalismBook/master/data/bostonzips.txt", "data/bostonzips.txt")

If you import this file specifying both columns as character strings, the ZIP codes will come in properly formatted:

zips <- rio::import("data/bostonzips.txt", colClasses = c("character", "character"))
str(zips)

Note that the column classes have to be set using the c() function, c("character", "character"). If you tried colClasses = "character", "character", you’d get an error message. This is a typical error for R beginners, but it shouldn’t take long to get into the c() habit.
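The same effect can be reproduced without downloading anything, using base R's read.table() (one of the underlying functions mentioned above) on a tiny temporary file. The file contents and column names here are made up for the sketch:

```r
# Write a tiny tab-separated file with made-up contents to a temporary location
zip_path <- tempfile(fileext = ".txt")
writeLines(c("Neighborhood\tZIP", "Area One\t02118", "Area Two\t02119"), zip_path)

# Default import: the ZIP column is guessed to be integer and loses its leading 0
zips_default <- read.table(zip_path, header = TRUE, sep = "\t")
zips_default$ZIP        # 2118 2119

# Declaring both columns as character keeps the codes intact
zips_char <- read.table(zip_path, header = TRUE, sep = "\t",
                        colClasses = c("character", "character"))
zips_char$ZIP           # "02118" "02119"
```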

A save-yourself-some-typing tip: Writing out c("character", "character") isn’t all that arduous, but if you’ve got a spreadsheet with 16 columns where the first 14 need to be character strings, this can get annoying. R’s rep() function can help. rep(), as you might have guessed, repeats whatever item you give it however many times you tell it to, using the format rep(myitem, numtimes). rep("character", 2) is the same as c("character", "character"), so colClasses = rep("character", 2) is equivalent to colClasses = c("character", "character"). And colClasses = c(rep("character", 14), rep("numeric", 2)) sets the first 14 columns as character strings and the last two as numbers. All the names of column classes here need to be in quotation marks because names are character strings.

I suggest you play around a little with rep() so you get used to the format, since it’s a syntax that other R functions use, too.
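A few lines to experiment with, as a starting point. The col_types name is just for illustration, and the times and each arguments shown at the end are standard rep() options not covered above:

```r
# rep() with a count is the same as typing the item out that many times
identical(rep("character", 2), c("character", "character"))  # TRUE

# 14 character columns followed by 2 numeric columns
col_types <- c(rep("character", 14), rep("numeric", 2))
length(col_types)  # 16
table(col_types)

# rep() also works on longer vectors
rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"
```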

Easy sample data

R comes with some built-in data sets that are easy to use if you want to play around with new functions or other programming techniques. They’re also used a lot by people teaching R, since instructors can be sure that all students are starting off with the same data in exactly the same format.

Type data() to see available built-in data sets in base R and whatever installed packages are currently loaded. data(package = .packages(all.available = TRUE)) from base R displays all possible data sets from packages that are installed in your system, whether or not they’re loaded into memory in your current working session.

You can get more information about a data set the same way you get help with functions: ?datasetname or help("datasetname"). mtcars and iris are among those I’ve seen used very often.

If you type mtcars, the entire mtcars data set prints out in your console. You can use the head() function to look at the first few rows with head(mtcars).

You can store that data set in another variable if you want, with a format like cardata <- mtcars.

Or, running the data function with the data set name, such as data(mtcars), loads the data set into your working environment.
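For instance, a short session with the built-in sets might look like this sketch (cardata is the example name used above):

```r
# Peek at the first three rows instead of printing the whole data set
head(mtcars, 3)

# Keep a working copy under your own name; the built-in original is untouched
cardata <- mtcars
dim(cardata)   # 32 11

# data() puts a copy of a built-in set into your working environment
data(iris)
str(iris)
```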

One of the most interesting packages with sample data sets for journalists is the fivethirtyeight package, which has data from stories published on the FiveThirtyEight.com website. The package was created by several academics in consultation with FiveThirtyEight editors; it is designed to be a resource for teaching undergraduate statistics.

Prepackaged data can be useful—and in some cases fun. In the real world, though, you may not be using data that’s quite so conveniently packaged.

Create a data frame manually in R

Chances are, you’ll often be dealing with data that starts off outside of R and you import from a spreadsheet, CSV file, API, or other source. But sometimes you might just want to type a small amount of data directly into R, or otherwise create a data frame manually. So let’s take a quick look at how that works.

R data frames are assembled column by column by default, not one row at a time. If you wanted to assemble a quick data frame of town election results, you could create a vector of candidate names, a second vector with their party affiliation, and then a vector of their vote totals:

candidates <- c("Smith", "Jones", "Write-ins", "Blanks")

party <- c("Democrat", "Republican", "", "")

votes <- c(15248, 16723, 230, 5234)

Remember not to use commas in your numbers, like you might do in Excel.

To create a data frame from those columns, use the data.frame() function and the syntax data.frame(column1, column2, column3).

myresults <- data.frame(candidates, party, votes)

Check its structure with str():

str(myresults)

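While the candidates and party vectors are characters, in versions of R before 4.0.0 the candidates and party data frame columns are turned into a class of R objects called factors. (Since R 4.0.0, stringsAsFactors defaults to FALSE, so the columns stay character strings unless you ask for factors.) It’s a bit too in-the-weeds at this point to delve into how factors are different from characters, except to say that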

  1. Factors can be useful if you want to order items in a certain, nonalphabetical way for graphing and other purposes, such as Poor is less than Fair is less than Good is less than Excellent.
  2. Factors can behave differently than you might expect at times. I recommend sticking with character strings unless you have a good reason to specifically want factors.
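The ordering use case in the first point can be sketched in a few lines of base R. The ratings vector is invented for illustration:

```r
# Invented survey answers
ratings <- c("Good", "Poor", "Excellent", "Fair", "Good")

# Plain character strings sort alphabetically, which is not the order you want
sort(ratings)

# An ordered factor sorts by the level order you define
rating_levels <- c("Poor", "Fair", "Good", "Excellent")
ratings_f <- factor(ratings, levels = rating_levels, ordered = TRUE)
sort(ratings_f)               # Poor Fair Good Good Excellent

# Ordered factors also support comparisons
ratings_f[1] > ratings_f[2]   # TRUE: Good ranks above Poor
```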

You can keep your character strings intact when creating data frames by adding the argument stringsAsFactors = FALSE:

myresults <- data.frame(candidates, party, votes, stringsAsFactors = FALSE)
str(myresults)

Now, the values are what you expected.

There’s one more thing I need to warn you about when creating data frames this way: If one column is shorter than the other(s), R will sometimes repeat data from the shorter column—whether or not you want that to happen.

Say, for example, you created the election results columns for candidates and party but only entered votes results for Smith and Jones, not for Write-ins and Blanks. You might expect the data frame would show the other two entries as blank, but you’d be wrong. Try it and see, by creating a new votes vector with just two numbers, and using that new votes vector to create another data frame:

votes <- c(15248, 16723)

myresults2 <- data.frame(candidates, party, votes)

str(myresults2)

That’s right, R reused the first two numbers, which is definitely not what you’d want. If you try this with three numbers in the votes vector instead of two or four, R would throw an error. That’s because each entry couldn’t be recycled the same number of times.
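Both behaviors can be checked in one short script. This sketch reuses the vectors from above and wraps the three-value case in tryCatch() so the error is captured as a message instead of stopping the script; votes_short, recycled, and result are names introduced here for illustration:

```r
candidates <- c("Smith", "Jones", "Write-ins", "Blanks")
party <- c("Democrat", "Republican", "", "")

# Two values are recycled evenly to fill four rows
votes_short <- c(15248, 16723)
recycled <- data.frame(candidates, party, votes = votes_short)
recycled$votes   # 15248 16723 15248 16723

# Three values cannot be recycled evenly into four rows, so R raises an error
result <- tryCatch(
  data.frame(candidates, party, votes = c(15248, 16723, 230)),
  error = function(e) conditionMessage(e)
)
result
```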

If by now you’re thinking, “Why can’t I create data frames that don’t change strings into factors automatically? And why do I have to worry about data frames reusing one column’s data if I forget to complete all the data?” Hadley Wickham had the same thought. His tibble package creates an R class, also called tibble, that he says is a “modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.”

If this appeals to you, install the tibble package if it’s not on your system and then try to create a tibble with

myresults3 <- tibble::tibble(candidates, party, votes)

and you’ll get an error message that the votes column needs to be either four items long or one item long (tibble() will repeat a single item as many times as needed, but only for one item).

Put the votes column back to four entries if you’d like to create a tibble with this data:

library(tibble)

votes <- c(15248, 16723, 230, 5234)

myresults3 <- tibble(candidates, party, votes)

str(myresults3)

It looks similar to a data frame—in fact, it is a data frame, but with some special behaviors, such as how it prints. Also notice that the candidates column is character strings, not factors.

If you like this behavior, go ahead and use tibbles. However, given how prevalent conventional data frames remain in R, it’s still important to know about their default behaviors.

Exporting data

Often after you’ve wrangled your data in R, you want to save your results. Here are some of the ways to export your data that I tend to use most:

Save to a CSV file with rio::export(myObjectName, file="myFileName.csv") and to an Excel file with rio::export(myObjectName, file="myFileName.xlsx"). rio understands what file format you want based on the extension of the file name. There are several other available formats, including .tsv for tab-separated data, .json for JSON, and .xml for XML.

Save to an R binary object that makes it easy to load back into R in future sessions. There are two options.

Generic save() saves one or more objects into a file, such as save(objectName1, objectName2, file="myfilename.RData"). To read this data back into R, you just use the command load("myfilename.RData") and all the objects return with the same names in the same state they had before.

You can also save a single object into a file with saveRDS(myobject, file="filename.rds"). The logical assumption is that loadRDS would read the file back in, but instead the command is readRDS—and in this case, just the data has been stored, not the object name. So, you need to read the data into a new object name, such as mydata <- readRDS("filename.rds").
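A minimal sketch of the single-object version follows. The data frame is a made-up placeholder, and I again assume a temporary folder for the file.

```r
# A hypothetical object to store.
myobject <- data.frame(
  candidates = c("Candidate A", "Candidate B"),
  votes = c(120, 95)
)

# Save just the data; the name "myobject" is not stored in the file.
rds_file <- file.path(tempdir(), "filename.rds")
saveRDS(myobject, file = rds_file)

# Read it back under whatever name you like.
mydata <- readRDS(rds_file)
identical(mydata, myobject)
```

Because the name is not stored, this approach is handy when you want to load a saved object without overwriting something in your current session that happens to have the same name.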

There’s a third way of saving an R object specifically for R: generating the R commands that would recreate the object, rather than saving the object itself. The base R functions for generating an R file to recreate an object are dput() and dump(). However, I find rio::export(myobject, "mysavedfile.R") even easier to remember.
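The two base R functions behave a little differently, which this sketch tries to show: dput() writes only the code for the value, while dump() also writes the assignment to the object's name. The object and file names are hypothetical.

```r
# A hypothetical object to recreate later.
myobject <- data.frame(
  candidates = c("Candidate A", "Candidate B"),
  votes = c(120, 95)
)

# dput() writes R code for the value only; dget() evaluates it,
# and you choose the name to assign the result to.
dput_file <- file.path(tempdir(), "myobject_dput.R")
dput(myobject, file = dput_file)
rebuilt <- dget(dput_file)

# dump() takes the object's name as a string and writes "myobject <- ...",
# so sourcing the file recreates the object under its original name.
dump_file <- file.path(tempdir(), "myobject_dump.R")
dump("myobject", file = dump_file)
rm(myobject)
source(dump_file, local = TRUE)

all.equal(rebuilt, myobject)
```

Open either file in a text editor and you will see plain R code, which makes this format easy to paste into a question when you are asking for help.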

Finally, there are additional ways to save files that optimize for readability, speed, or compression, which I mention in the additional resources section at the end of this article.

You can also export an R object into your Windows or Mac clipboard with rio: rio::export(myObjectName, format = "clipboard"). And, you can import data into R from your clipboard the same way: rio::import(file = "clipboard").

Bonus: rio’s convert() function lets you—you guessed it—convert one file type to another without having to manually pull the data into and then out of R. See ?convert for more info.
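As a rough sketch, converting a CSV file to a tab-separated file could look like this. The file names and contents are made up for the example, and I write the starting CSV with base R so there is something to convert.

```r
library(rio)

# Create a small hypothetical CSV file to convert.
in_file <- file.path(tempdir(), "convert_demo.csv")
out_file <- file.path(tempdir(), "convert_demo.tsv")
write.csv(
  data.frame(id = 1:3, value = c("x", "y", "z")),
  in_file,
  row.names = FALSE
)

# convert() reads the input and writes the output in one step;
# both formats are inferred from the file extensions.
convert(in_file, out_file)

# Check the result with base R's tab-delimited reader.
converted <- read.delim(out_file)
converted
```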

Final point: RStudio lets you click to import a file, without having to write code at all. This isn’t something I recommend until you’re comfortable importing from the command line, because I think it’s important to understand the code behind importing. But, I admit this can be a handy shortcut.

In the Files tab of RStudio’s lower right pane, navigate to the file you want to import and click it. You’ll see an option to either View File or Import Dataset. Choose Import Dataset to see a dialog that previews the data, lets you modify how the data is imported, and previews the code that will be generated.

Make whatever changes you want and click Import, and your data will be pulled into R.

Additional resources

rio alternatives. While rio is a great Swiss Army knife of file handling, there may be times when you want a bit more control over how your data is pulled into or saved out of R. In addition, there have been times when I’ve had a challenging data file that rio choked on but another package could handle. Some other functions and packages you may want to explore:

  • Base R’s read.csv() and read.table() to import text files (use ?read.csv and ?read.table to get more information). In R versions before 4.0.0, stringsAsFactors = FALSE is needed with these if you want to keep your character strings as character strings; from R 4.0.0 onward, that is the default. write.csv() saves to CSV.
  • rio uses Hadley Wickham’s readxl package for reading Excel files. Another alternative for Excel is openxlsx, which can write to an Excel file as well as read one. Look at the openxlsx package vignettes for information about formatting your spreadsheets as you export.
  • Wickham’s readr package is also worth a look as part of the “tidyverse.” readr includes functions to read CSV, tab-separated, fixed-width, web logs, and several other types of files. readr prints out the type of data it has determined for each column—integer, character, double (non-whole numbers), etc. It creates tibbles.
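For the base R option in the list above, here is a short sketch of writing and reading a CSV file and of what the stringsAsFactors argument changes. The data and file name are invented for the example, and the argument is set explicitly both ways so the result does not depend on your R version's default.

```r
# A small hypothetical data set.
scores <- data.frame(
  name = c("Ana", "Ben"),
  score = c(90.5, 88),
  stringsAsFactors = FALSE
)

# write.csv() adds a column of row names unless you turn that off.
csv_file <- file.path(tempdir(), "scores.csv")
write.csv(scores, csv_file, row.names = FALSE)

# Read the same file twice, once each way.
as_strings <- read.csv(csv_file, stringsAsFactors = FALSE)
as_factors <- read.csv(csv_file, stringsAsFactors = TRUE)

class(as_strings$name)   # character
class(as_factors$name)   # factor
```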

Import directly from a Google spreadsheet. The googlesheets package lets you import data from a Google Sheets spreadsheet, even if it’s private, by authenticating your Google account. The package is available on CRAN; install it via install.packages("googlesheets"). After loading it with library("googlesheets"), read the excellent introductory vignette. At the time of this writing, the intro vignette was available in R at vignette("basic-usage", package="googlesheets"). If you don’t see it, try help(package="googlesheets") and click the User Guides, Package Vignettes and Other Documentation link for available vignettes, or look at the package information on GitHub at //github.com/jennybc/googlesheets.

Scrape data from Web pages with the rvest package and SelectorGadget browser extension or JavaScript bookmarklet. SelectorGadget helps you discover the CSS elements of data you want to copy that are on an HTML page; then rvest uses R to find and save that data. This is not a technique for raw beginners, but once you’ve got some R experience under your belt, you may want to come back and revisit this. I have some instructions and a video on how to do this at //bit.ly/Rscraping. RStudio has a webinar available on demand as well.
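To give a feel for what the rvest step looks like, here is a minimal sketch that parses a tiny HTML snippet instead of a live Web page. The table, its CSS classes, and the numbers are all made up; on a real page, SelectorGadget is what would help you find a selector such as td.total. This assumes rvest version 1.0.0 or later, which provides html_elements().

```r
library(rvest)

# A made-up HTML fragment standing in for a real Web page.
page <- read_html('
  <table>
    <tr><td class="winter">Winter A</td><td class="total">10.5</td></tr>
    <tr><td class="winter">Winter B</td><td class="total">22.0</td></tr>
  </table>
')

# Select every table cell with the CSS class "total" and pull out its text.
total_cells <- html_elements(page, "td.total")
totals <- as.numeric(html_text(total_cells))

# Do the same for the label cells.
winters <- html_text(html_elements(page, "td.winter"))

data.frame(winters, totals)
```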

Alternatives to base R’s save and read functions. If you are working with large data sets, speed may become important to you when saving and loading files. The data.table package has a speedy fread() function, but beware that resulting objects are data.tables and not plain data frames; some behaviors are different. If you want a conventional data frame, you can get one with the as.data.frame(mydatatable) syntax. The data.table package’s fwrite() function is aimed at writing to a CSV file considerably faster than base R’s write.csv().
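Here is a small sketch of that data.table workflow. The data and file name are placeholders; with a file this tiny you will not notice any speed difference, so treat it only as a demonstration of the syntax and of the class of object you get back.

```r
library(data.table)

# A small hypothetical data set written with fwrite().
demo <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
csv_file <- file.path(tempdir(), "fwrite_demo.csv")
fwrite(demo, csv_file)

# fread() returns a data.table, which is also a data frame
# but behaves differently in some situations.
mydatatable <- fread(csv_file)
class(mydatatable)

# Convert to a conventional data frame if that is what you want.
plain_df <- as.data.frame(mydatatable)
class(plain_df)
```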

Two other packages might be of interest for storing and retrieving data. The feather package saves in a binary format that can be read into either R or Python. And the fst package’s read.fst() and write.fst() offer fast saving and loading of R data frame objects, plus the option to compress the file.