[Reading Notes] Machine Learning for Hackers (1)


Chapter 1: Using R

#Machine Learning for Hackers
#Chapter 1

library(ggplot2)
library(plyr)

  

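#.tsv files are tab-delimited
#Strings are converted to factors by default (in R before 4.0), so stringsAsFactors is set to FALSE to prevent the conversion
#header is set to FALSE so the first row is not treated as a header
#Empty strings are defined as NA: na.strings = ""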

ufo <- read.delim("ML_for_Hackers/01-Introduction/data/ufo/ufo_awesome.tsv", 
                  sep = "\t", stringsAsFactors = FALSE, header = FALSE, 
                  na.strings = "")
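
As a minimal sketch of what these read.delim arguments do, the block below parses a small inline string instead of the real file. The two sample rows are made up for illustration and are not taken from the dataset.

```r
# Hypothetical two-row sample standing in for the tab-separated file
sample.text <- "19951009\t19951009\tIowa City, IA\n19950101\t\tMilwaukee, WI"

# text = lets read.delim parse a string instead of a file path
toy <- read.delim(text = sample.text, sep = "\t", stringsAsFactors = FALSE,
                  header = FALSE, na.strings = "")

names(toy)       # "V1" "V2" "V3", generated because header = FALSE
is.na(toy$V2)    # FALSE TRUE: the empty field in row 2 was read as NA
class(toy$V3)    # "character", not factor
```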

  

View the first 6 rows of the dataset with head(ufo)

tail() shows the last 6 rows

 

#names() can both set and read column names

names(ufo) <- c("DateOccurred", "DateReported", "Location", 
                "ShortDescription", "Duration", "LongDescription")

  

#as.Date converts a string to a Date object; the format can be specified, see the help page
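
A short sketch of as.Date with an explicit format, using made-up input strings rather than values from the dataset:

```r
# Parse a compact YYYYMMDD string into a Date object
d <- as.Date("19951009", format = "%Y%m%d")
format(d)     # "1995-10-09"
class(d)      # "Date"

# With a format supplied, a string that does not match yields NA
bad <- as.Date("not-a-date", format = "%Y%m%d")
is.na(bad)    # TRUE
```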

 

#Error: the input string is too long, which suggests malformed data
#Handling the malformed data

head(ufo[which(nchar(ufo$DateOccurred) != 8 
               | nchar(ufo$DateReported) != 8), 1])

  

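#Build a new logical vector in which FALSE marks the rows that do not meet the requirement
#Count the rows that fail, then keep only the rows that pass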

good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 
                    | nchar(ufo$DateReported) != 8, FALSE, TRUE)
length(which(!good.rows))
ufo <- ufo[good.rows, ]

  Running this gives 731 rows, whereas the book reports 371; the book appears to be wrong
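
The same filtering idea on a toy vector, as a sketch with made-up strings. Since nchar is vectorized, the comparison already yields a logical vector, so wrapping it in ifelse(..., FALSE, TRUE) is equivalent to negating the condition.

```r
# Made-up sample: the second entry is malformed (not 8 characters long)
dates <- c("19951009", "0000", "20010203")

good <- nchar(dates) == 8   # TRUE FALSE TRUE
sum(!good)                  # 1 entry rejected
dates[good]                 # "19951009" "20010203"
```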

#Convert the date columns

ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d")
ufo$DateReported <- as.Date(ufo$DateReported, format = "%Y%m%d")

  

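#The input is a string; this function cleans the sighting location
#strsplit splits the string and throws an error on input it cannot handle; tryCatch catches the error and returns missing values
#gsub removes the leading space from each piece (by replacing it with an empty string)
#The if statement checks for more than one comma and returns missing values in that case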

get.location <- function(l){
  split.location <- tryCatch(strsplit(l, ",")[[1]], error = function(e) return(c(NA, NA)))
  clean.location <- gsub("^ ", "", split.location)
  if(length(clean.location) > 2){
    return(c(NA, NA))
  }
  else{
    return(clean.location)
  }
}
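
To see how the function behaves, the sketch below repeats its definition so the block runs on its own, then applies it to a few made-up inputs.

```r
get.location <- function(l) {
  split.location <- tryCatch(strsplit(l, ",")[[1]],
                             error = function(e) return(c(NA, NA)))
  clean.location <- gsub("^ ", "", split.location)
  if (length(clean.location) > 2) {
    return(c(NA, NA))
  } else {
    return(clean.location)
  }
}

get.location("Iowa City, IA")       # "Iowa City" "IA"
get.location("Paris, Ile, France")  # NA NA: more than one comma
get.location(42)                    # NA NA: strsplit fails on non-character input
```

A string with no comma at all comes back as a single element, so such rows only turn into NA later, when the state is checked against the list of valid abbreviations.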

  

#lapply (list-apply) applies a function to each element of a vector and returns a list

city.state <- lapply(ufo$Location, get.location)

  

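#Convert the list into a matrix
#do.call executes a function call with the elements of a list as its arguments
#transform adds two new columns to ufo; tolower converts upper case to lower case to make the format consistent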

location.matrix <- do.call(rbind, city.state)
ufo <- transform(ufo, USCity = location.matrix[, 1], USState = tolower(location.matrix[, 2]), 
                 stringsAsFactors = FALSE)
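
A small sketch of the list-to-matrix step, using a made-up list shaped like the output of lapply above:

```r
# Made-up city/state pairs, one character vector per record
pairs <- list(c("Iowa City", "IA"), c("Milwaukee", "WI"), c(NA, NA))

# do.call(rbind, pairs) is the same as rbind(pairs[[1]], pairs[[2]], pairs[[3]])
m <- do.call(rbind, pairs)
dim(m)             # 3 2
tolower(m[, 2])    # "ia" "wi" NA
```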

  

#Identify non-U.S. locations and set them to NA

us.states <- c("ak", "al", "ar", "az", "ca", "co", "ct", "de", "fl", "ga", "hi", "ia", "id", 
               "il", "in", "ks", "ky", "la", "ma", "md", "me", "mi", "mn", "mo", "ms", "mt", 
               "nc", "nd", "ne", "nh", "nj", "nm", "nv", "ny", "oh", "ok", "or", "pa", "ri", 
               "sc", "sd", "tn", "tx", "ut", "va", "vt", "wa", "wi", "wv", "wy")
ufo$USState <- us.states[match(ufo$USState, us.states)]
ufo$USCity[is.na(ufo$USState)] <- NA
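
The match idiom can be sketched on a shortened, made-up whitelist. match returns the position of each value in the whitelist, or NA when the value is absent, so indexing the whitelist with the result keeps valid entries and blanks out everything else.

```r
valid <- c("ia", "tx", "wi")         # shortened whitelist for illustration
states <- c("ia", "on", "tx", NA)    # "on" is not in the whitelist

idx <- match(states, valid)          # 1 NA 2 NA
valid[idx]                           # "ia" NA "tx" NA
```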

  

#Keep only the records located in the United States

ufo.us <- subset(ufo, !is.na(USState))

  

#Analyze the time dimension:
#Preprocessing: summarize the date range

summary(ufo.us$DateOccurred)
quick.hist <- ggplot(ufo.us, aes(x = DateOccurred)) + geom_histogram() + scale_x_date(date_breaks = "50 years")
print(quick.hist)

  

 

#Take the data from 1990 onward and plot it

ufo.us <- subset(ufo.us, DateOccurred >= as.Date("1990-01-01"))
quick.hist.new <- ggplot(ufo.us, aes(x = DateOccurred)) + geom_histogram() + scale_x_date(date_breaks = "50 years")
print(quick.hist.new)

  

 

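#Count the sightings in each year-month
#Convert the dates to month granularity and build a data frame of sightings per state and month
#Generate a monthly sequence covering every month in the range and combine it with the states into a data frame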

ufo.us$YearMonth <- strftime(ufo.us$DateOccurred, format = "%Y-%m")
sightings.counts <- ddply(ufo.us, .(USState, YearMonth), nrow)
date.range <- seq.Date(from = as.Date(min(ufo.us$DateOccurred)), 
                       to = as.Date(max(ufo.us$DateOccurred)), by = "month")
date.strings <- strftime(date.range, "%Y-%m")
states.dates <- lapply(us.states, function(s) cbind(s, date.strings))
states.dates <- data.frame(do.call(rbind, states.dates), stringsAsFactors = FALSE)
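
The state-by-month grid can be sketched at a small scale. The date range and the two states below are made up for illustration.

```r
# Four first-of-month dates; the end date need not fall on a month boundary
months <- seq.Date(from = as.Date("1990-01-01"), to = as.Date("1990-04-15"),
                   by = "month")
month.strings <- strftime(months, "%Y-%m")   # "1990-01" ... "1990-04"

toy.states <- c("ia", "wi")
# cbind recycles the single state across all months
grid <- lapply(toy.states, function(s) cbind(s, month.strings))
grid <- data.frame(do.call(rbind, grid), stringsAsFactors = FALSE)

nrow(grid)     # 8 rows: 2 states x 4 months
names(grid)    # "s" "month.strings"
```

The column names come from the variable names passed to cbind, which is why the merge that follows refers to columns named after those variables.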

  

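#Merge the two data frames with merge, which joins them on matching columns; by.x and by.y name the key columns in each data frame
#Setting all to TRUE keeps unmatched rows and fills them with NA
#Then tidy up all.sightings: set missing values to 0 and convert the variable types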

all.sightings <- merge(states.dates, sightings.counts, 
                       by.x = c("s", "date.strings"), 
                       by.y = c("USState", "YearMonth"), all = TRUE)
names(all.sightings) <- c("State", "YearMonth", "Sightings")
all.sightings$Sightings[is.na(all.sightings$Sightings)] <- 0
all.sightings$YearMonth <- as.Date(rep(date.range, length(us.states)))
all.sightings$State <- as.factor(toupper(all.sightings$State))
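
A toy version of the merge, with made-up counts, shows how all = TRUE keeps the grid rows that have no matching count and how those NA values are then set to 0. The count column is named V1 here because that is the name ddply gives to the result of nrow.

```r
grid <- data.frame(s = c("ia", "ia", "wi", "wi"),
                   ym = c("1990-01", "1990-02", "1990-01", "1990-02"),
                   stringsAsFactors = FALSE)
counts <- data.frame(USState = c("ia", "wi"),
                     YearMonth = c("1990-02", "1990-01"),
                     V1 = c(3L, 5L), stringsAsFactors = FALSE)

merged <- merge(grid, counts, by.x = c("s", "ym"),
                by.y = c("USState", "YearMonth"), all = TRUE)
merged$V1[is.na(merged$V1)] <- 0
merged$V1    # 0 3 5 0, rows sorted by state and then month
```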

  

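#Analyze the data
#geom_line draws a line chart; facet_wrap creates a grid of panels, one per level of the categorical variable State
#theme_bw sets a black-and-white background theme
#scale_color_manual defines the value for the string "darkblue" used in the geom_line call: it maps the label "darkblue" to the actual color "darkblue"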

state.plot <- ggplot(all.sightings, aes(x = YearMonth, y = Sightings)) + 
  geom_line(aes(color = "darkblue")) + 
  facet_wrap(~State, nrow = 10, ncol = 5) + 
  theme_bw() + 
  scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") + 
  xlab("Time") + 
  ylab("Number of Sightings") + 
  ggtitle("Number of UFO sightings by Month-Year and U.S. State (1990-2010)")
print(state.plot)

  

 

Reposted from: https://www.cnblogs.com/gyjerry/p/5562002.html

