A**.
Love this book!
If you love data and work with a lot of difficult data (and weirdly enjoy this sort of work), you will love this book. The essay format makes it an entertaining read. This is not a 'how to' book. It takes you down the thought process of how others (very smart others) have approached various difficult data situations and the aha moments that helped them unravel the mess. And it's therapeutic… because this work is a lot of trial-and-error… mostly error… and it's comforting to know I'm not alone in this. Sometimes it veers off into 'deep thoughts' on tangent subjects, but so do I, so I kind of enjoy that. It feels like you're hanging out with your favorite data nerd friend and swapping war stories over a bottle of wine. I can't talk with any normal people about the data work I do… within 5 minutes, all eyes are glazed over or people are trying to secretly escape the room. 😆
S**Y
Taming Bad Data
A great concept for a book. In this day and age, as we increasingly engage with things we call datasets, take on challenges to make sense of big data, and engage with one another around stuff we call data, here is a series of lessons for dealing with data ... Taking a very case-oriented approach, the collection of articles in this edited volume looks at the problems we run into, either overtly or unawares, when working with data. How many of us have run into the character-encoding challenge, received data in semi-structured form and needed to transform it quickly and efficiently into something more usable, or had to find a way to identify potential bias or collection errors? Well, that's what the Bad Data Handbook is all about.

Editor Q. Ethan McCallum has assembled an impressive array of contributors who present articles on determining data quality and detecting potential flaws, fixing data errors to make the data usable for your specific purpose, and using the most up-to-date techniques and methods available today to tame data and effectively interrogate it for analytical purposes. The premise of this book is data not fit for purpose ... or at least not the purpose you might have in mind for it, and in that respect we will call it bad data. The various chapters look at running 'sniff tests' on the data to see whether it is sound for the purposes you might put it to. How do we find outliers? Can we spot gaps? All through the use of some handy automated routines. The second chapter turns to techniques for transforming data that was formatted for human consumption into something machine-readable. Subsequently, the authors explore ways to examine the data models that defined the collection and processing procedures, which may or may not render data unfit for purpose.

The collection of articles in this book is decidedly valuable, and the solutions proposed are code-based.
The routines for dealing with the data ultimately involve making the data suit your needs. They are Python-based, so they are about as approachable as possible for users who may be less familiar or accustomed to using code to deal with data problems. I was particularly impressed by the inclusion of a section on working with various text-encoding formats and applying techniques to remedy the situations that render the data 'bad'. The series of quick exercises in this section is particularly apt.

The book's general presentation is to identify a specific problem, explain its significance, and then provide hands-on examples of how a user can approach a solution. The transition to techniques that look at data on a broader basis, such as using sentiment analysis and Natural Language Processing to sniff out whether online reviews are genuine, addresses real-world problems with online information, more than with data itself.

This is an intriguing book. It looks at the down-and-dirty manipulation and munging of data, then takes higher-level looks at how we might mistake information for solid data. In all cases it applies good techniques, suggesting how one can use sound statistical reasoning, interrogate the data model, or delve into code-based manipulation in pursuit of more truthful data. Given the book's broad coverage, it is harder to determine who it is directly aimed at. I believe that selective reading of it could inform general practitioners in the digital humanities and in emerging areas of study increasingly engaging with data in new ways.
It brings to light many lessons of experience that are simply invaluable and would normally be developed only through hands-on tinkering and discovery, often well into larger projects. It also appeals to data scientists more broadly, who benefit for similar reasons, but also from the wealth of hands-on techniques provided that refine and empower standard practice. In any case, I do feel that as a collection of articles it can be a very helpful reference source, with individual sections consulted as needed; by no means is this a linearly designed volume. It is, however, a very valuable contribution to a field that is gaining mass popular engagement.
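The automated 'sniff test' routines described above might look something like this minimal Python sketch, which scans a column for gaps (missing values) and outliers. The function name, sample data, and the median/MAD threshold are all illustrative assumptions, not taken from the book:

```python
# A hypothetical "sniff test": flag missing entries and values that sit
# far from the median, using the median absolute deviation (MAD), which
# is more robust to extreme values than a mean/stdev z-score.
from statistics import median

def sniff_test(values, k=5.0):
    """Return the indices of missing entries and any values whose
    distance from the median exceeds k times the MAD."""
    present = [v for v in values if v is not None]
    gaps = [i for i, v in enumerate(values) if v is None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    outliers = [v for v in present if mad and abs(v - med) / mad > k]
    return {"gaps": gaps, "outliers": outliers}

report = sniff_test([10.2, 9.8, None, 10.1, 57.0, 10.0, None, 9.9])
print(report["gaps"])      # → [2, 6]
print(report["outliers"])  # → [57.0]
```

A check like this is cheap to run after every transformation step, which is exactly the kind of repeated automated testing the reviewers recommend.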
E**L
Excellent real-world information
There is a tremendous amount of information here that provides best practices for dealing with bad data. Sometimes it seems like bad data is everywhere. This book provides excellent information for both identifying and correcting bad data. Several of the techniques I've already used at work. There is a lot of value packed into this slim volume.
M**O
This is a misleading title
For someone who is not a high-end programmer, it is a total waste; it simply cannot be understood.
P**K
Real-world anecdotes and lessons-learned
TL;DR summary of the review: awesome book. If you work with real-world datasets, or you work with people who do, you owe it to yourself to read this book. I wish it had been around 8 years earlier when I started working with large-scale social sciences census data. All of the fun, and all of the pain, of dealing with government data and social sciences data is particularly true for census information.

Much of the book could be summed up as noting that less-than-perfect data is still very useful, but you need to understand how the data is bad. Is it random? What kinds of bias are introduced, if any? What impact will that have on your conclusions? Go get your hands dirty with the data itself: look at a few hundred records in a text editor to see what you've got. You'll want to test the data all through your analysis, to ensure that you can identify both where you're hitting issues and where you're introducing issues yourself, and you'll be happier if you can automate these tests so that you can run them often without creating a burden for yourself. Prefer simple tools and portable file formats; in particular, Excel is not your friend. The book discusses a number of different case studies and anecdotes for dealing with data that has problems of one flavor or another. The authors have been there before, and you can learn from their experience.

Discussions of social sciences survey data and its inherent imperfections and messy metadata definitely rang true with my experiences dealing with census data, as did the chapter on the lowly, undervalued flat file as a data structure.

I'll summarize three takeaway messages that resonated with my own experience:

1. It's generally easy to do some basic analysis of your data to look for problems, gaps, inconsistencies, and unusual distributions, and doing so will give you insight into what you're dealing with.
Going through your actual data file, rather than trusting the metadata and documentation, is the only way to really know what sort of issues are lying in wait.

2. There's lots of interesting data that's structured for human consumption rather than machine-driven analysis. Restructuring it into a format that's more amenable to machine analysis can be tedious, but it's also automatable. Rather than converting a huge list of documents by hand, write some code to restructure it. This notion is explored in chapter 2, where the code is in the R stats language. R is a good fit for two-dimensional data such as tables, whereas the base Unix tools (perl, sed, awk) tend to be line-oriented. That said, there's nothing here that can't be done in awk too. Don't shy away from writing code to transform data into something useful, and expect that to be an iterative process.

3. Oftentimes, "plain text" files are anything but. You can find "plain text" files that are ASCII, or UTF-8, or ISO-8859, or CP-1252, all of which will look the same until you start to run into non-English characters. I've seen this in dealing with internationally sourced data, or even US data that includes Puerto Rico. The authors provide some guidance about how to deal with this in chapter 4, but more importantly, they discuss the fact that it's a surprisingly and frustratingly complex problem that you need to be aware of. Another issue is that when looking at data generated from a web app, you may find text that's been encoded or escaped to avoid SQL injection or cross-site scripting attacks. These are web-app best practices, and it's generally easy to get the text back to plain form once you know what you're looking at.
The author gives code samples in Python, which has strong library support for text transformation, but the main point is to see how to identify these kinds of problems with your input data.

My only negative is that, as a collection of individual essays, the writing style and tone tend to be all over the map. All in all, this is a book that I enjoyed reading, and one I have recommended to other software developers starting to work with data scientists.
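The encoding confusion in point 3 above can be probed with a small sketch like the following: try a few likely codecs in order until one decodes the bytes without error. The function name and codec list are illustrative assumptions, not the book's code:

```python
# A hypothetical first-pass encoding check: attempt candidate codecs in
# order of strictness. ASCII rejects any byte >= 0x80, UTF-8 rejects
# invalid multi-byte sequences, while cp1252 and iso-8859-1 accept
# almost anything, so the strict codecs must come first.
def guess_decode(raw, codecs=("ascii", "utf-8", "cp1252", "iso-8859-1")):
    """Return (codec, text) for the first codec that decodes raw cleanly."""
    for codec in codecs:
        try:
            return codec, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate codecs fit")

codec, text = guess_decode("café".encode("utf-8"))
print(codec)  # → utf-8 (ASCII fails on the two-byte sequence for é)
```

Note that this only finds *a* codec that decodes without error, not necessarily the right one; cp1252 would also happily decode the bytes above, just into mojibake, which is why ordering and eyeballing the output still matter.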
K**様
I'd like to offer some Japanese examples
This book explains how to cleanse data that has become unreusable through misuse of Excel and the like, complete with a wealth of recipes for making corrections. The web really is full of data that would become useful with just a little more effort. However!!! Against the bizarre "digital data" produced by Japan's government agencies, even this book's recipes fall short. Won't someone write a recipe for that?